Tesseract OCR with Java with Examples
In this article, we will learn how to work with Tesseract OCR in Java using the Tesseract API.
What is Tesseract OCR?
Tesseract is an optical character recognition (OCR) engine. It was originally developed at HP Labs starting in 1985 and was open sourced in 2005; since 2006 its development has been sponsored by Google. Tesseract supports Unicode (UTF-8) and can recognize more than 100 languages “out of the box”, so it can be used to build scanning software for many different languages. Tesseract 4 added a new neural network (LSTM) based OCR engine that is focused on line recognition, while still supporting the legacy Tesseract engine, which works by recognizing character patterns.
How does OCR work?
Generally OCR works as follows:
- Pre-process image data, for example: convert to gray scale, smooth, de-skew, filter.
- Detect lines, words and characters.
- Produce a ranked list of candidate characters based on the trained data set. (In Tess4J, the setDatapath() method sets the path to the trained data.)
- Post-process the recognized characters: choose the best characters based on the confidence from the previous step and on language data. Language data includes a dictionary, grammar rules, etc.
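The last two steps can be illustrated with a toy sketch. This is not how Tesseract is implemented internally: the `Candidate` class, the confidence values, the dictionary and the fixed dictionary bonus are all made up for illustration. It only shows the idea of combining a classifier's confidence with language data to pick the final result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CandidateSelection {

    /** Hypothetical recognition result: a candidate word plus the classifier's confidence. */
    static final class Candidate {
        final String text;
        final double confidence;

        Candidate(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    /**
     * Returns the candidate with the highest score. Words found in the
     * dictionary receive a fixed bonus (an arbitrary value for this sketch).
     */
    static String chooseBest(List<Candidate> candidates, Set<String> dictionary) {
        final double dictionaryBonus = 0.2;
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Candidate c : candidates) {
            double score = c.confidence;
            if (dictionary.contains(c.text.toLowerCase())) {
                score += dictionaryBonus;
            }
            if (score > bestScore) {
                bestScore = score;
                best = c.text;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("hello", "world"));
        List<Candidate> candidates = new ArrayList<>();
        candidates.add(new Candidate("he11o", 0.55)); // digits misread as letters
        candidates.add(new Candidate("hello", 0.50));
        candidates.add(new Candidate("hallo", 0.30));
        // The dictionary bonus lifts "hello" (0.70) above "he11o" (0.55).
        System.out.println(chooseBest(candidates, dictionary)); // prints "hello"
    }
}
```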
The advantages of OCR are numerous, namely:
- It increases the efficiency and effectiveness of office work.
- The ability to instantly search through content is immensely useful, especially in an office setting that has to deal with high-volume scanning or a high document inflow.
- OCR is quick, and it keeps the document’s content intact while saving time.
- Workflow improves, since employees no longer have to waste time on manual labour and can work more quickly and efficiently.
OCR also has some disadvantages:
- OCR is limited to language recognition.
- A lot of effort is required to create trained data for different languages and to put it to use.
- Extra work on image processing is also needed, as image quality is the factor that matters most for the performance of OCR.
- Even after all this work, no OCR can offer 100% accuracy, and after OCR the unrecognized characters still have to be determined using machine learning methods or corrected manually.
How to use Tesseract OCR
- Download the Tess4J API.
- Extract the files from the downloaded archive.
- Open your IDE and create a new project.
- Link the jar file with your project.
- The jar file is located under “..\Tess4J-3.4.8-src\Tess4J\dist”.
You have now linked the jar to your project and are ready to use the Tesseract engine.
Performing OCR on clear images
Now that you have linked the jar file, we can get started with the coding part. The goal is to read an image file, perform OCR on it and display the recognized text on the console.
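A minimal sketch of such a program is shown here, assuming the Tess4J jar is on the classpath. The tessdata folder path and the image file name are placeholders, not values from this article; replace them with the locations on your machine.

```java
import java.io.File;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class SimpleOcr {
    public static void main(String[] args) {
        Tesseract tesseract = new Tesseract();

        // Folder containing the *.traineddata files (placeholder path).
        tesseract.setDatapath("path/to/tessdata");
        // Language of the text in the image; "eng" selects English.
        tesseract.setLanguage("eng");

        try {
            // Run OCR on the image file (placeholder name) and print the result.
            String text = tesseract.doOCR(new File("image.png"));
            System.out.println(text);
        } catch (TesseractException e) {
            System.err.println("OCR failed: " + e.getMessage());
        }
    }
}
```

The recognized text depends on the image and on the trained data in use, so no sample output is given.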
Performing OCR on unclear images
Note that the image selected above is actually very clear and grayscaled, but this is not what happens most of the time. In most cases we get a noisy image and thus a very noisy output. To deal with this, we need to process the image before running OCR, a step called image processing.
Tesseract works best when there is a very clean segmentation of the foreground text from the background. In practice, it can be extremely challenging to guarantee good segmentation. There are a variety of reasons you might not get good quality output from Tesseract, and noise in the image background is one of them. Noise removal is part of image processing, and to do it we need to know how an image should be processed.
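One simple way to separate foreground from background is to binarize the image, that is, to turn every pixel either black or white. The sketch below uses plain Java (`java.awt.image.BufferedImage`) and takes the mean brightness of the image as the threshold. The class name `Binarizer` and the choice of threshold are assumptions made for illustration; a global mean is a crude rule, and adaptive approaches such as Otsu's method usually cope better with uneven lighting.

```java
import java.awt.image.BufferedImage;

public class Binarizer {

    /** Returns a black-and-white copy of the image, thresholded at its mean brightness. */
    static BufferedImage binarize(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        int[] lum = new int[w * h];
        long sum = 0;

        // First pass: compute the luminance of every pixel and the overall sum.
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int l = (r * 299 + g * 587 + b * 114) / 1000; // weighted gray value
                lum[y * w + x] = l;
                sum += l;
            }
        }
        int threshold = (int) (sum / ((long) w * h));

        // Second pass: pixels brighter than the threshold become white, the rest black.
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                out.setRGB(x, y, lum[y * w + x] > threshold ? 0xFFFFFF : 0x000000);
            }
        }
        return out;
    }
}
```

An image loaded with `ImageIO.read(...)` can be passed through `binarize` and the result handed to Tess4J's `doOCR(BufferedImage)` overload.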
To improve the accuracy in Java, we will build a small routine that scans the RGB content of the image, converts it to grayscale and also zooms the image.
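The zooming step can be sketched with standard Java 2D calls. The class name `ImageZoom` and the choice of bicubic interpolation are assumptions of this example; the zoom factor that works best depends on the size of the text in the source image.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class ImageZoom {

    /** Returns a copy of the image scaled by the given factor (must be greater than 0). */
    static BufferedImage zoom(BufferedImage src, double factor) {
        int w = (int) Math.round(src.getWidth() * factor);
        int h = (int) Math.round(src.getHeight() * factor);

        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        // Bicubic interpolation keeps character edges smoother than the default.
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return out;
    }
}
```

For example, `ImageZoom.zoom(image, 2.0)` doubles both the width and the height of the image.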
The image can be grayscaled based on its RGB content. If an image is very dark, it is made brighter and clearer, and if it is whitish, it is scaled to a slightly darker contrast so that the text becomes visible.
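A rough sketch of this idea follows. It is a simplified illustration rather than a tuned implementation: the class name `GrayscaleNormalizer`, the target brightness of 128 and the limits on the scale factor are assumptions chosen for the example. The image is first converted to gray, then `RescaleOp` brightens dark images and darkens very light ones.

```java
import java.awt.image.BufferedImage;
import java.awt.image.RescaleOp;

public class GrayscaleNormalizer {

    // Brightness the output is steered towards; mid-gray is an assumption of this sketch.
    private static final double TARGET_MEAN = 128.0;

    /** Returns a gray copy of the image (R = G = B = average of the source channels). */
    static BufferedImage toGrayscale(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int avg = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
                gray.setRGB(x, y, (avg << 16) | (avg << 8) | avg);
            }
        }
        return gray;
    }

    /** Mean brightness (0 to 255) of an image produced by toGrayscale. */
    static double meanBrightness(BufferedImage gray) {
        long sum = 0;
        for (int y = 0; y < gray.getHeight(); y++) {
            for (int x = 0; x < gray.getWidth(); x++) {
                sum += gray.getRGB(x, y) & 0xFF;
            }
        }
        return (double) sum / ((long) gray.getWidth() * gray.getHeight());
    }

    /** Grayscales the image, then brightens dark images and darkens light ones. */
    static BufferedImage normalize(BufferedImage src) {
        BufferedImage gray = toGrayscale(src);
        double mean = meanBrightness(gray);
        // Scale factor that moves the mean towards the target, kept within sane limits.
        float scale = (float) (TARGET_MEAN / Math.max(mean, 1.0));
        scale = Math.max(0.5f, Math.min(scale, 3.0f));
        return new RescaleOp(scale, 0f, null).filter(gray, null);
    }
}
```

The normalized image can then be zoomed and passed to the OCR engine in place of the original.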