METHODOLOGY
The proposed methodology combines Optical Character Recognition (OCR) and a Large Language Model (LLM) to create an application that generates Python code from flowchart images. Flowchart images uploaded through the application's user interface, which is built with Gradio, are processed with the Python library EasyOCR to extract their textual content. EasyOCR uses deep learning models such as VGG and ResNet for feature extraction, Long Short-Term Memory (LSTM) networks to capture the sequential context of the extracted features, and the Connectionist Temporal Classification (CTC) algorithm to decode the labeled sequences into text. A query is constructed from the OCR-extracted text and provided as a prompt to the Llama 2 Chat LLM (70-billion-parameter version), accessed through an API on the Replicate platform. The output returned by the LLM, which contains the Python code and a brief explanation of it, is stored in a Python string object and returned to the user interface as the result for the uploaded flowchart. The UI also provides functionality to copy the displayed Python code so that the user can test the generated code.
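A minimal sketch of how these pieces could be wired together is shown below, assuming a Gradio image input, EasyOCR for text extraction, and the Replicate Python client; the model slug meta/llama-2-70b-chat and the single-field interface are assumptions, and a REPLICATE_API_TOKEN environment variable must be set for the API call to succeed.

import easyocr
import gradio as gr
import replicate

reader = easyocr.Reader(["en"])  # load the English recognition model once

def flowchart_to_code(image):
    # The Gradio Image component delivers the upload as a NumPy array
    results = reader.readtext(image)  # list of (bbox, text, confidence)
    # Concatenate the recognized text into a "#"-prefixed prompt ending in
    # "def" (the application builds this with a create_query helper,
    # described in a later subsection)
    prompt = "# " + " ".join(text for _bbox, text, _conf in results) + " def"
    # Stream the completion from the hosted Llama 2 Chat (70B) model;
    # the model slug is an assumption and should be verified on Replicate
    chunks = replicate.run("meta/llama-2-70b-chat", input={"prompt": prompt})
    return "".join(chunks)

demo = gr.Interface(fn=flowchart_to_code, inputs=gr.Image(), outputs="text")
demo.launch()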
Text Extraction using Optical Character Recognition
Optical Character Recognition (OCR) is the process of extracting machine-readable text from images containing textual content. It is used to interpret printed or handwritten text from sources such as scanned documents or captured images. OCR operates on the principles of pattern recognition and feature extraction. First, the input image is preprocessed to enhance its quality, reducing noise with Gaussian filters and morphological operations such as erosion and dilation. The processed image is then segmented into individual characters, isolating them from the background with techniques such as text-line extraction using the Hough transform and word extraction using connected-component analysis. Statistical features, such as the density and direction of the foreground pixels, are extracted using zoning, projection histograms, and similar methods; structural features, such as cross points, strokes, loops, and horizontal curves, are also extracted. Finally, pattern-matching algorithms interpret the extracted features to generate readable text output.
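EasyOCR applies such preprocessing internally, but the Gaussian filtering and morphological operations described above can be sketched with OpenCV as follows; the kernel sizes and the Otsu binarization step are illustrative choices rather than the library's exact pipeline.

import cv2
import numpy as np

def preprocess(gray_image):
    # Reduce noise with a Gaussian filter
    blurred = cv2.GaussianBlur(gray_image, (5, 5), 0)
    # Binarize so that text strokes become foreground (white) pixels
    _, binary = cv2.threshold(blurred, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Erosion removes small speckles; dilation restores stroke thickness
    kernel = np.ones((3, 3), np.uint8)
    eroded = cv2.erode(binary, kernel, iterations=1)
    return cv2.dilate(eroded, kernel, iterations=1)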
The EasyOCR library is used to extract the text from the flowchart components in the uploaded image. The uploaded flowchart image is first converted into an array with OpenCV. This array is passed as an argument to the readtext method of EasyOCR's Reader class, which returns, for every word detected in the image, the extracted text, the coordinates of the top-left and bottom-right corners of its bounding box, and a confidence score. All of the extracted data is stored in a variable. Using OpenCV drawing methods, the bounding boxes are drawn on the uploaded image and the extracted text is placed at the appropriate positions, as shown in Fig 1. The annotated flowchart image, with its bounding boxes and extracted text, is then displayed in a window using Matplotlib.
Fig 1. Extraction of text from flowchart components in the image using OCR
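A minimal sketch of this extraction and annotation step is given below; the file name flowchart.png is a placeholder for the uploaded image.

import cv2
import easyocr
import matplotlib.pyplot as plt

image = cv2.imread("flowchart.png")  # convert the upload into a NumPy array
reader = easyocr.Reader(["en"])
results = reader.readtext(image)  # (bbox, text, confidence) per detected word

for bbox, text, conf in results:
    top_left = tuple(map(int, bbox[0]))      # bbox lists the four corners
    bottom_right = tuple(map(int, bbox[2]))
    cv2.rectangle(image, top_left, bottom_right, (0, 255, 0), 2)
    cv2.putText(image, text, (top_left[0], top_left[1] - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 0, 0), 2)

plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))  # OpenCV arrays are BGR
plt.axis("off")
plt.show()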
Creating Structured Query for Large Language Model (LLM)
A custom Python function named create_query was written to generate a structured query for the Llama 2 LLM from the extracted text. The function takes two parameters. The first, img_text, holds the textual content and other details extracted from the flowchart image: it is an iterable of tuples, each containing the bounding-box coordinates, the extracted text, and the confidence score for that text. The second is a Python string containing "#". The function uses a for loop to iterate over all elements of img_text, extracts only the text from each tuple, and appends it to the string; the string "def" is then attached at the end. The final structured query string is returned by the function. This query string, containing the OCR-extracted text, can be provided to an LLM as user input, and the LLM can generate the Python code for the extracted text, as shown in Fig 2.
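A minimal sketch of create_query consistent with this description is shown below; the "#" prefix and the trailing "def" follow the text above, while the exact prompt wording used in the application may differ.

def create_query(img_text, query="#"):
    # img_text: iterable of (bounding_box, text, confidence) tuples,
    # as returned by EasyOCR's readtext method
    for _bbox, text, _conf in img_text:
        query += " " + text  # keep only the recognized text
    query += " def"  # cue the model to complete a Python function definition
    return query

For example, create_query(reader.readtext(image), "#") concatenates the flowchart text into a single comment-style line ending in "def", which steers the model toward completing a Python function definition.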