Tesseract OCR with Java (with Examples)

In this article, we will learn how to perform OCR in Java using the Tesseract engine via the Tess4J API.

What is Tesseract OCR?
Tesseract OCR is an optical character recognition engine developed by HP Laboratories in 1985 and open-sourced in 2005. Since 2006, its development has been sponsored by Google. Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box", so it can also be used to build scanning software for many different languages. The latest version is Tesseract 4. It adds a new neural-net (LSTM) based OCR engine that is focused on line recognition, but it still supports the legacy Tesseract OCR engine, which works by recognizing character patterns.

How does OCR work?



Generally, OCR works as follows:

  1. Pre-process the image data: for example, convert it to grayscale, smooth it, de-skew it, and filter out noise.
  2. Detect lines, words, and characters.
  3. Produce a ranked list of candidate characters based on the trained data set. (This is what the setDatapath() method is for: it sets the path of the trained data.)
  4. Post-process the recognized characters: choose the best candidates based on the confidence scores from the previous step and on language data. Language data includes a dictionary, grammar rules, etc.
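The pre-processing in step 1 can be sketched in plain Java. The snippet below is a minimal, illustrative example (the class name and the synthetic image are our own, not part of Tess4J) that converts an image to grayscale by redrawing it into a TYPE_BYTE_GRAY buffer:

```java
import java.awt.image.BufferedImage;

public class GrayscaleDemo {

    // Pre-processing step 1: redraw the source image into a
    // TYPE_BYTE_GRAY buffer; AWT performs the color conversion
    static BufferedImage toGray(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
            src.getWidth(), src.getHeight(),
            BufferedImage.TYPE_BYTE_GRAY);
        gray.getGraphics().drawImage(src, 0, 0, null);
        return gray;
    }

    public static void main(String[] args) {
        // a tiny synthetic image stands in for a real scan here
        BufferedImage src = new BufferedImage(
            4, 4, BufferedImage.TYPE_INT_RGB);
        src.setRGB(2, 2, 0xFFFFFF); // one white pixel
        BufferedImage gray = toGray(src);
        System.out.println(gray.getType()
            == BufferedImage.TYPE_BYTE_GRAY); // prints true
    }
}
```

Tess4J can consume the resulting BufferedImage directly, since doOCR() also accepts a BufferedImage.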

Advantages

The advantages of OCR are numerous, but notably:

  • It increases the efficiency and effectiveness of office work.
  • The ability to instantly search through content is immensely useful, especially in an office that deals with high-volume scanning or a large document inflow.
  • OCR is quick, keeping the document's content intact while saving time.
  • Workflow improves, since employees no longer have to spend time on manual transcription and can work more quickly and efficiently.

Disadvantages

  • OCR is limited to the languages it has trained data for.
  • A lot of effort is required to produce trained data for different languages and to integrate it.
  • Extra work is also needed on image processing, since it is the most essential factor for OCR performance.
  • Even after all this work, no OCR engine offers 100% accuracy; unrecognized characters must still be resolved by machine-learning techniques (for example, from neighbouring characters) or corrected manually.

How to use Tesseract OCR

  1. The first step is to download the Tess4J API from the link.
  2. Extract the files from the downloaded archive.
  3. Open your IDE and create a new project.
  4. Link the JAR file with your project. Refer to this link.
  5. The JAR can be found under the path "..\Tess4J-3.4.8-src\Tess4J\dist".

Now you have linked the JAR to your project and are ready to use the Tesseract engine.
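If your project uses Maven, an alternative to manually linking the JAR is to declare Tess4J as a dependency; it is published on Maven Central under the net.sourceforge.tess4j group. The version below mirrors the 3.4.8 archive mentioned above and is only an example; check Maven Central for the current release:

```xml
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <!-- example version, matching the 3.4.8 archive used in this article -->
    <version>3.4.8</version>
</dependency>
```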

Performing OCR on clear images

Now that you have linked the JAR file, we can get started with the coding part. The following code reads an image file, performs OCR on it, and displays the recognized text on the console.


import java.io.File;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class Test {
    public static void main(String[] args)
    {
        Tesseract tesseract = new Tesseract();
        try {

            // the path of your tessdata folder
            // inside the extracted archive
            tesseract.setDatapath("D:/Tess4J/tessdata");

            // the path of your image file
            String text
                = tesseract.doOCR(new File("image.jpg"));

            System.out.print(text);
        }
        catch (TesseractException e) {
            e.printStackTrace();
        }
    }
}



Input:


Output:

05221859


Performing OCR on unclear images

Note that the image used above is actually very clear and already grayscale, but this is not the case most of the time. In most cases we get a noisy image, and therefore a very noisy output. To deal with this, we need to perform some processing on the image first, known as image processing.

Tesseract works best when there is a very clean segmentation of the foreground text from the background. In practice, it can be extremely challenging to guarantee good segmentation, and there are a variety of reasons you might not get good-quality output from Tesseract when the image has a noisy background. Noise removal is part of image processing, so we need to know how an image should be processed.

You can refer to this article for a detailed understanding of how to improve the accuracy. To implement this in Java, we will build a small threshold-based heuristic that samples the RGB content of the image, converts it to grayscale, and also applies a zoom (enlargement) to the image.

The example below shows how an image can be grayscaled based on its RGB content: very dark images are made brighter and clearer, while whitish images are scaled to a slightly darker contrast so that the text becomes visible.


import java.awt.Graphics2D;
import net.sourceforge.tess4j.*;
import java.awt.Image;
import java.awt.image.*;
import java.io.*;
  
import javax.imageio.ImageIO;
  
public class ScannedImage {
  
    public static void
    processImg(BufferedImage ipimage,
               float scaleFactor,
               float offset)
        throws IOException, TesseractException
    {
        // Making an empty image buffer
        // to store image later
        // ipimage is an image buffer
        // of input image
        BufferedImage opimage
            = new BufferedImage(1050,
                                1024,
                                ipimage.getType());
  
        // creating a 2D platform
        // on the buffer image
        // for drawing the new image
        Graphics2D graphic
            = opimage.createGraphics();
  
        // drawing new image starting from 0 0
        // of size 1050 x 1024 (zoomed images)
        // null is the ImageObserver class object
        graphic.drawImage(ipimage, 0, 0,
                          1050, 1024, null);
        graphic.dispose();
  
        // RescaleOp object for rescaling
        // pixel intensities (grayscaling)
        RescaleOp rescale
            = new RescaleOp(scaleFactor, offset, null);

        // performing the scaling
        // and writing to a .png file
        BufferedImage fopimage
            = rescale.filter(opimage, null);
        ImageIO
            .write(fopimage,
                   "png",
                   new File("D:\\Tess4J\\Testing and learning\\output.png"));
  
        // Instantiating the Tesseract class
        // which is used to perform OCR
        Tesseract it = new Tesseract();
  
        it.setDatapath("D:\\Program Files\\Workspace\\Tess4J");
  
        // doing OCR on the image
        // and storing result in string str
        String str = it.doOCR(fopimage);
        System.out.println(str);
    }
  
    public static void main(String[] args) throws Exception
    {
        File f
            = new File(
                "D:\\Tess4J\\Testing and learning\\Final Learning Results\\input.jpg");
  
        BufferedImage ipimage = ImageIO.read(f);
  
        // sampling the RGB value of the pixel
        // at the center of the image
        int d
            = ipimage.getRGB(ipimage.getWidth() / 2,
                             ipimage.getHeight() / 2);
  
        // comparing the values
        // and setting new scaling values
        // that are later on used by RescaleOP
        if (d >= -1.4211511E7 && d < -7254228) {
            processImg(ipimage, 3f, -10f);
        }
        else if (d >= -7254228 && d < -2171170) {
            processImg(ipimage, 1.455f, -47f);
        }
        else if (d >= -2171170 && d < -1907998) {
            processImg(ipimage, 1.35f, -10f);
        }
        else if (d >= -1907998 && d < -257) {
            processImg(ipimage, 1.19f, 0.5f);
        }
        else if (d >= -257 && d < -1) {
            processImg(ipimage, 1f, 0.5f);
        }
        else if (d >= -1 && d < 2) {
            processImg(ipimage, 1f, 0.35f);
        }
    }
}



Input:

input.png

Output:

output.png




