Offline OCR and Language Translation
This article describes a procedure for performing offline Optical Character Recognition (OCR) and translation of documents supplied as scanned images (JPG, PNG, etc.) or PDFs.
The assumption is that these documents contain sensitive or personal information, which rules out online OCR and translation services because the content would be shared with outside parties.
The approach also assumes that OCR and translation are performed on Windows 10.
Before starting, you will need to download and install the following:
- Microsoft .NET Framework 4.8
- Microsoft Visual C++ 2019 Redistributable Package
- VietOCR: An OCR GUI that depends on the Microsoft Visual C++ and .NET packages. The GUI utilises the Tesseract open source OCR engine and supports every language that Tesseract supports
- Ghostscript: Allows for support of PDF documents. This is an optional component and only required if you plan on performing OCR on PDFs
- Bluestacks: An Android emulator required for running the Google Translate app in offline mode
Note: If installing Ghostscript, ensure that the PATH environment variable is updated to include the path to the Ghostscript installation directory.
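A quick way to confirm the PATH change took effect is to look the executable up from a script. The following is a minimal sketch, not part of VietOCR or Ghostscript: the function name `find_ghostscript` is hypothetical, and the executable names it tries (`gswin64c`, `gswin32c`, `gs`) are assumed to match your install.

```python
import shutil

# Executable names assumed for typical Ghostscript installs:
# gswin64c / gswin32c on Windows, gs on Linux and macOS.
GHOSTSCRIPT_NAMES = ("gswin64c", "gswin32c", "gs")


def find_ghostscript(path=None):
    """Return the full path of the first Ghostscript executable found, else None.

    `path` is an optional search path string; when omitted, the current
    PATH environment variable is searched.
    """
    for name in GHOSTSCRIPT_NAMES:
        location = shutil.which(name, path=path)
        if location is not None:
            return location
    return None


if __name__ == "__main__":
    found = find_ghostscript()
    if found:
        print(f"Ghostscript found at: {found}")
    else:
        print("Ghostscript not found on PATH; check the PATH entry and reopen the terminal.")
```

If the script reports that nothing was found right after editing PATH, open a new terminal first, since running processes keep the old environment.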
- Download the corresponding language pack(s) from within VietOCR. The pack should match the language of the document that OCR will be run against
- Run VietOCR against the source document to convert it into UTF-8 text
- Install and set up Google Translate on Bluestacks
- Copy the converted text file over to Bluestacks
- Translate using Google Translate in offline mode
Sample Image to be Used for OCR/Translation
To demonstrate the process of OCR and translation from Chinese to English, we will use a sample “Chinese Resident Identity Card” (embedded below). The image was sourced from Wikipedia.
Download the Source Language Pack through VietOCR
The source image contains Chinese glyphs, so we need to download the Chinese language packs.
- Open VietOCR
- Go to “Settings”
- Select “Download Language Data”
- Select “Chinese (simplified)” and “Chinese (traditional)”
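To confirm that both packs were downloaded, you can check for the corresponding Tesseract data files on disk. The sketch below is illustrative only. `chi_sim` and `chi_tra` are Tesseract's language codes for Simplified and Traditional Chinese. The helper name `missing_language_packs` and the folder path in the usage section are hypothetical, and it is an assumption that your VietOCR install keeps its `.traineddata` files in a single folder.

```python
from pathlib import Path

# Tesseract language codes for Simplified and Traditional Chinese.
REQUIRED_LANGUAGES = ("chi_sim", "chi_tra")


def missing_language_packs(tessdata_dir, languages=REQUIRED_LANGUAGES):
    """Return the language codes whose .traineddata file is absent."""
    folder = Path(tessdata_dir)
    return [
        code for code in languages
        if not (folder / f"{code}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # Hypothetical location; point this at the data folder of your install.
    tessdata = Path("C:/VietOCR/tessdata")
    missing = missing_language_packs(tessdata)
    print("All packs present" if not missing else f"Missing: {missing}")
```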
Download the sample identity card image mentioned earlier in this article and save it locally.
To perform OCR using VietOCR:
- Open the input source file from within VietOCR
- Select “Chinese (simplified)” from the “OCR Language” drop down menu
- Click on the toolbar icon that displays a magnifying glass along with the text “OCR”
- Once the conversion is complete, save the output as a UTF8 text file (for this example, it has been saved locally as
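Before moving on, it can be worth checking that the saved file really is UTF-8 and contains Chinese characters, since a wrong encoding will show up as garbage in the translation step. This is a minimal sketch under stated assumptions: the function name `summarize_ocr_text` is hypothetical, and only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is counted, so rarer characters outside that block are ignored.

```python
from pathlib import Path


def summarize_ocr_text(path):
    """Decode the file as UTF-8 and count CJK ideographs.

    Raises UnicodeDecodeError if the file is not valid UTF-8.
    """
    # "utf-8-sig" also accepts files that start with a byte order mark.
    text = Path(path).read_text(encoding="utf-8-sig")
    # Count characters in the basic CJK Unified Ideographs block only.
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return {"characters": len(text), "cjk_characters": cjk}
```

Calling `summarize_ocr_text` on the saved output (the file referred to as ocr_version.txt later in this article) should return a non-zero `cjk_characters` count. A count of zero, or a `UnicodeDecodeError`, suggests the file was saved with a different encoding or the wrong OCR language was selected.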
Install and Set Up Google Translate on Bluestacks
Perform the following from within Bluestacks:
- Click on the Google Play Store icon located on the Home Screen
- This will prompt you to enter login details for your Google account
- After logging in, access the Play Store to search for and install “Google Translate”
Open up Google Translate and choose the following settings:
- Select “English” for “Your Primary Language”
- Select “Chinese” for “Language you translate most often”
- Tick the checkbox “Translate offline”
- Finally, click on “Done” and wait for the language packs to be downloaded
Copy the Converted Text File (ocr_version.txt) to Bluestacks
- Open Bluestacks Media Manager, as shown below
- Select “Import From Windows”
- Select the text file containing OCR text (ocr_version.txt)
- Wait for the file to import. It will appear in Media Manager once import is complete
- Click on the file to open and choose “Just Once” to open with the default viewer (Open with HTML Viewer)
- The file should load and appear with the correct format
- Position your mouse cursor over a character within the file and long-press the left mouse button. You will then be able to choose “Select All”
- You will now have the option to select “Translate”, which will send the text to Google Translate and open up the app
- To open up the full translation, follow the steps shown below
- You should now be able to see a neatly formatted version of the original text along with the corresponding translation
Notes on VietOCR Accuracy and Features
The following notes are based on excerpts from the VietOCR tech guide.
- VietOCR contains an implementation of a postprocessing algorithm to improve accuracy by applying corrections to common errors encountered in the OCR process
- Scanned images should have a resolution of 200 to 400 DPI, in monochrome (black and white) or grayscale
- Scanning at higher resolutions will not necessarily result in better recognition accuracy
- The typical settings for scanning are 300 DPI and 1 bpp (bit per pixel) black and white or 8 bpp grayscale, saved in uncompressed TIFF or PNG format
- Batch processing is supported, which makes room for programmatic automation
- Hunspell Spellcheck is supported and can be applied by downloading the corresponding dictionary files (.aff, .dic) to the dict folder of VietOCR
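The resolution guidance above can be checked before running OCR. The sketch below reads the optional pHYs chunk that PNG files use to record pixel density and converts it to DPI. It is an illustration only, not something VietOCR provides: the function names `png_dpi` and `in_recommended_range` are hypothetical, only PNG is handled (TIFF stores resolution differently), and many PNG files carry no pHYs chunk at all, in which case the result is `None`.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
METRES_PER_INCH = 0.0254


def png_dpi(data):
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk, or None if unavailable."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    offset = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte length, 4-byte type, data, 4-byte CRC.
    while offset + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[offset:offset + 8])
        body = data[offset + 8:offset + 8 + length]
        if chunk_type == b"pHYs" and len(body) == 9:
            x_ppu, y_ppu, unit = struct.unpack(">IIB", body)
            if unit != 1:  # 1 means pixels per metre; 0 means unit unknown
                return None
            return (round(x_ppu * METRES_PER_INCH), round(y_ppu * METRES_PER_INCH))
        if chunk_type == b"IDAT":  # pHYs, when present, precedes the image data
            return None
        offset += 12 + length
    return None


def in_recommended_range(dpi, low=200, high=400):
    """True when both axes fall inside the DPI range given in the notes."""
    return dpi is not None and all(low <= value <= high for value in dpi)
```

For example, `in_recommended_range(png_dpi(Path("scan.png").read_bytes()))` (with `pathlib.Path` imported and `scan.png` standing in for your own file) tells you whether a scan falls inside the 200 to 400 DPI range.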
The procedure described in this article may be cumbersome for some; however, the focus was on performing offline OCR and translation without the use of paid software.
VietOCR is one of many OCR packages. This link provides a list of alternatives.
From my own experience, finding readily available offline translation packages is quite difficult. Given the accuracy and continuous development of existing apps such as Google Translate, it made sense to build on a product that is continually being enhanced.
If you are after alternatives to Google Translate, some options are Microsoft Translator and Yandex Translate. Both support an offline translation mode.
As of now, these translation apps do not have equivalent versions for Windows, hence the need to install the Bluestacks Android emulator.
Originally published at http://github.com.