
Optical Character Recognition (OCR) Using R

Last Updated : 19 Jan, 2024

OCR transforms images of text into machine-readable formats, with applications ranging from receipts to license plates. In this tutorial, we will learn to perform Optical Character Recognition in the R programming language using the tesseract and magick libraries.

Optical Character Recognition

OCR stands for Optical Character Recognition. It is the procedure that transforms an image of text into a text format that computers can read. OCR scans the image and extracts the text from it, which we can then store in a string variable. OCR is used to read receipts, cheques, and license plates, and in numerous other applications.

The libraries used will be:

  • tesseract: Provides a neural-net (LSTM) based OCR engine that is used for text recognition.
  • magick: Used for image processing in R. We use it to load and display images, and tesseract relies on it for reading image formats.

The tesseract package provides R bindings to Tesseract, a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable, allowing the detection algorithms to be tuned for the best possible results.

Syntax

To perform Optical Character Recognition, we simply use the ocr() method and pass the file.

text <- ocr(pngfile)
cat(text)

The ocr() method takes the PNG file and extracts the text using its pre-trained model.
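Beyond the defaults, ocr() also accepts an engine argument for configuring the language model. A minimal sketch, assuming the file name "image.png" as a placeholder path:

```r
library(tesseract)

# create an English engine explicitly (downloads language data if needed)
eng <- tesseract("eng")

# run OCR with that engine; "image.png" is a placeholder path
text <- ocr("image.png", engine = eng)
cat(text)
```

Other language packs (e.g. "fra", "deu") can be installed with tesseract_download() and passed the same way.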

Example 1: Reading text from an Image

Step 1: Install and load the libraries:

R
install.packages('tesseract')
install.packages('magick')
library(tesseract)
library(magick)


Step 2: Load an image from a URL or file storage.

R
# Reading the image (replace the placeholder with your own path or URL)
img <- image_read("path/to/image.png")
 
# Display
print(img)


Output:

The loaded image (the GeeksforGeeks logo) is displayed in the plotting window.

Step 3: Apply the OCR method on it.

R
# OCR
text <- ocr(img)
 
# extracted text
print(text)


Output:

[1] "GeeksforGeeks\nA computer science portal for geeks\n"

Example 2: Extracting text from a PDF

Here we need to convert the PDF into PNG images and then perform OCR on them. The syntax is as follows:

pngfile <- pdftools::pdf_convert('https://www.africau.edu/images/default/sample.pdf', dpi = 600)

Here is the full code:

R
library(tesseract)
library(magick)
 
# fetching text from pdf
pngfile <- pdftools::pdf_convert('https://www.africau.edu/images/default/sample.pdf', dpi = 600)
text <- ocr(pngfile)
cat(text)


Output:

Converting page 1 to sample_1.png... done!
Converting page 2 to sample_2.png... done!
This is a small demonstration .pdf file -
just for use in the Virtual Mechanics tutorials. More text. And more
text. And more text. And more text. And more text.
And more text. And more text. And more text. And more text. And more
text. And more text. Boring, zzzzz. And more text. And more text. And
more text. And more text. And more text. And more text. And more text.
And more text. And more text.
And more text. And more text. And more text. And more text. And more
text. And more text. And more text. Even more. Continued on page 2 ...
 Simple PDF File 2
...continued from page 1. Yet more text. And more text. And more text.
And more text. And more text. And more text. And more text. And more
text. Oh, how boring typing this stuff. But not as boring as watching
paint dry. And more text. And more text. And more text. And more text.
Boring. More, a little more text. The end, and just as well.
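Since pdf_convert() produces one PNG per page, ocr() returns a character vector with one element per page. A small sketch, assuming the page images already exist on disk, that joins the pages and saves the result:

```r
library(tesseract)

# page images produced by pdf_convert(); placeholder file names
pngfile <- c("sample_1.png", "sample_2.png")

# ocr() returns one string per page
text <- ocr(pngfile)

# join the pages and write them to a text file
writeLines(paste(text, collapse = "\n\n"), "sample.txt")
```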

Text Localization in OCR

Now we will learn to get the position of text and prepare a bounding box around it.

To get the bounding box, we can run the ocr_data() method on the image.

bound_box = ocr_data(img)

Step 1: Load the libraries

R
install.packages(c("png", "tesseract", "magick", "boundingbox", "grid", "magrittr", "ggplot2"))
 
library(png)
library(tesseract)
library(magick)
library(boundingbox)
library(grid)
library(magrittr)
library(ggplot2)


Step 2: Load the image and generate the bounding-box data. The ocr_data() method takes an image and returns, for each recognized word, a confidence score and a bounding box as a comma-separated string of (x1, y1, x2, y2) coordinates, which we extract in a later step. The result is stored in the bound_box variable.

R
# load the image (replace the placeholder with your own path or URL)
img <- image_read("path/to/image.png")
 
# getting words and bounding boxes
bound_box = ocr_data(img)


Step 3: Convert the coordinates from character to numeric by splitting each bbox string on commas and saving the parts as xmin, ymin, xmax and ymax respectively.

R
# convert the result into a data frame
bound_box = as.data.frame(bound_box)
# split each bbox string on commas
bound_box$bbox <- strsplit(bound_box$bbox, ",")
bound_box$xmin <- sapply(bound_box$bbox, function(x) as.numeric(x[1]))
bound_box$ymin <- sapply(bound_box$bbox, function(x) as.numeric(x[2]))
bound_box$xmax <- sapply(bound_box$bbox, function(x) as.numeric(x[3]))
bound_box$ymax <- sapply(bound_box$bbox, function(x) as.numeric(x[4]))


Output:

           word confidence            bbox
1 GeeksforGeeks   92.04797     5,15,661,96
2             A   96.76034   48,124,71,150
3      computer   96.31223  82,126,237,158
4       science   96.52452 248,123,362,150
5        portal   96.56268 376,122,466,158
6           for   96.14149 480,122,524,150
7         geeks   96.14149 536,122,626,158
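The comma-splitting logic in Step 3 can be checked on a single bbox string, independent of the OCR step:

```r
# a bbox string as returned by ocr_data(), e.g. for "GeeksforGeeks"
bbox <- "5,15,661,96"

# split on commas and convert the pieces to numbers
coords <- as.numeric(strsplit(bbox, ",")[[1]])
xmin <- coords[1]; ymin <- coords[2]
xmax <- coords[3]; ymax <- coords[4]

print(c(xmin, ymin, xmax, ymax))  # [1]   5  15 661  96
```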

Step 4: Plot the image

R
# Plot image with bounding boxes
ggplot() +
  annotation_custom(rasterGrob(img)) +
  geom_rect(data = bound_box, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax), color = "red", fill = NA) +
  geom_text(data = bound_box, aes(x = (xmin+xmax)/2, y = ymax+10, label = word), color = "red", size = 3) +
  theme_void()+
  scale_y_reverse()


Output:

The image is displayed with red bounding boxes and word labels drawn around each recognized word.

Advantages of OCR

  1. Search PDFs or images by their text content easily.
  2. Digitize paper records.
  3. Convert handwritten or printed text in images into strings.

Disadvantages of OCR

  1. Not all text is converted correctly; results are error-prone depending on the quality of the image.
  2. Handwritten images return poor results due to the wide variety of people's handwriting.
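Some of these quality issues can be mitigated by preprocessing the image with magick before running OCR. A minimal sketch, where "scan.png" is a placeholder path:

```r
library(magick)
library(tesseract)

# load the image ("scan.png" is a placeholder path)
img <- image_read("scan.png")

# grayscale conversion and upscaling often improve Tesseract's accuracy
img <- image_convert(img, type = "Grayscale")
img <- image_resize(img, "2000x")

text <- ocr(img)
cat(text)
</imports>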

Conclusion

In conclusion, Optical Character Recognition in R opens avenues for text extraction from diverse sources. Tesseract and Magick libraries facilitate seamless integration, enabling tasks such as reading images and converting PDFs. While powerful, OCR’s effectiveness depends on image quality, with potential challenges in handwritten text recognition.


