Offline OCR and Language Translation

This article describes a procedure for performing offline Optical Character Recognition (OCR) and translation of documents that arrive as a scanned image (JPG, PNG, etc.) or a PDF.

The assumption is that these documents contain sensitive or personal information, which rules out online OCR and translation services because using them would mean sharing that information with outside parties.

The approach also assumes that OCR and translation are performed on a Windows 10 platform.

Pre-requisite Software

  • Microsoft .NET Framework 4.8
  • Microsoft Visual C++ 2019 Redistributable Package
  • VietOCR: An OCR GUI which has a dependency on the Microsoft C++ and .NET packages. The GUI utilises the Tesseract Open Source OCR Engine. VietOCR supports all languages that are supported by Tesseract
  • Ghostscript: Allows for support of PDF documents. This is an optional component and only required if you plan on performing OCR on PDFs
  • Bluestacks: An Android emulator required for running Google Translate App in offline mode

Note: If installing Ghostscript, ensure that the PATH environment variable is updated to include the path to the Ghostscript installation directory.
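One way to confirm the PATH change took effect is to look the executable up from a fresh terminal. The snippet below is a minimal sketch, not part of VietOCR or Ghostscript: the helper name find_ghostscript is made up for illustration, and the executable names (gswin64c and gswin32c on Windows, gs elsewhere) are assumed to match your Ghostscript build.

```python
import shutil

# Console executable names assumed for Ghostscript builds:
# 64-bit Windows, 32-bit Windows, then the Unix-style name.
GHOSTSCRIPT_NAMES = ("gswin64c", "gswin32c", "gs")


def find_ghostscript(names=GHOSTSCRIPT_NAMES):
    """Return the full path of the first matching executable on PATH, else None."""
    for name in names:
        path = shutil.which(name)  # searches the directories listed in PATH
        if path:
            return path
    return None


if __name__ == "__main__":
    found = find_ghostscript()
    if found:
        print(f"Ghostscript found at: {found}")
    else:
        print("Ghostscript not found on PATH; add the folder that contains "
              "the executable and open a new terminal.")
```

If the lookup fails straight after editing PATH, open a new terminal before retrying, since running programs keep the old value.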

Process overview

  • Run VietOCR against the source document to convert it into UTF-8 text
  • Install and set up Google Translate on Bluestacks
  • Copy the converted text file over to Bluestacks
  • Translate using Google Translate in offline mode

Sample Image to be Used for OCR/Translation

Download the Source Language Pack through VietOCR

  • Open VietOCR
  • Go to “Settings”
  • Select “Download Language Data”
  • Select “Chinese (simplified)” and “Chinese (traditional)”

Perform OCR

To perform OCR using VietOCR:

  • Open the input source file from within VietOCR
  • Select “Chinese (simplified)” from the “OCR Language” drop down menu
  • Click on the Toolbar icon which displays a magnifying glass along with the text “OCR”
  • Once the conversion is complete, save the output as a UTF-8 text file (for this example, it has been saved locally as C:\VietOCR\output\ocr_version.txt)
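Before moving the file over to Bluestacks, it can be worth confirming that the saved output really is UTF-8 and contains Chinese characters. The following is a minimal sketch under stated assumptions: the helper name check_utf8 is hypothetical, and counting characters in the CJK Unified Ideographs range (U+4E00 to U+9FFF) is only a rough sanity check, not a measure of OCR quality.

```python
from pathlib import Path


def check_utf8(path):
    """Return (is_valid_utf8, cjk_char_count) for the file at `path`."""
    data = Path(path).read_bytes()
    try:
        # "utf-8-sig" decodes UTF-8 and also tolerates a leading byte order mark.
        text = data.decode("utf-8-sig")
    except UnicodeDecodeError:
        return False, 0
    # Count characters in the CJK Unified Ideographs block (U+4E00..U+9FFF).
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return True, cjk


if __name__ == "__main__":
    # Path taken from the example above; adjust it to wherever you saved the file.
    target = Path(r"C:\VietOCR\output\ocr_version.txt")
    if target.exists():
        print(check_utf8(target))
```

A result of False suggests the file was saved in a different encoding, in which case the text is likely to appear garbled once imported.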

Install and Set Up Google Translate on Bluestacks

  • Click on the Google Play Store icon located on the Home Screen
  • This will prompt you to enter login details for your Google account
  • After logging in, access the Play Store to search for and install “Google Translate”

Open up Google Translate and choose the following settings:

  • Select “English” for “Your Primary Language”
  • Select “Chinese” for “Language you translate most often”
  • Tick the checkbox “Translate offline”
  • Finally click on “Done” and wait for the language packs to be downloaded

Copy the Converted Text File (ocr_version.txt) to Bluestacks

  • Select “Import From Windows”
  • Select the text file containing OCR text (ocr_version.txt)
  • Wait for the file to import. It will appear in Media Manager once import is complete
  • Click on the file to open and choose “Just Once” to open with the default viewer (Open with HTML Viewer)
  • The file should load and appear with the correct format
  • Position your mouse cursor over a character within the file and long-press the left mouse button. You will then be able to choose “Select All”
  • You will now have the option to select “Translate”, which will send the text to Google Translate and open up the app
  • To open up the full translation, perform the steps as per below
  • You should now be able to see a neatly formatted version of the original text along with the corresponding translation

Notes on VietOCR Accuracy and Features

OCR Accuracy

  • Scan images at a resolution of at least 200 DPI, and up to 400 DPI, in monochrome (black and white) or grayscale
  • Scanning at higher resolutions will not necessarily result in better recognition accuracy
  • The typical scan settings are 300 DPI at 1 bpp (bit per pixel) black and white or 8 bpp grayscale, saved as uncompressed TIFF or PNG
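For PNG scans, the recorded resolution can be read without extra libraries from the optional pHYs chunk, which stores pixels per metre. The sketch below assumes the scanner wrote that chunk; many tools omit it, in which case the function returns None. The name png_dpi is made up for illustration, and chunk CRCs are deliberately not verified.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dpi(data):
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk, or None if unavailable."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"pHYs" and len(body) == 9:
            x_ppu, y_ppu, unit = struct.unpack(">IIB", body)
            if unit != 1:  # 1 = pixels per metre; 0 = aspect ratio only
                return None
            # One inch is 0.0254 metres.
            return round(x_ppu * 0.0254), round(y_ppu * 0.0254)
        if ctype in (b"IDAT", b"IEND"):
            break  # pHYs, when present, comes before the image data
        pos += 12 + length  # skip length, type, data and CRC
    return None
```

Called as png_dpi(Path("scan.png").read_bytes()), where scan.png is a placeholder file name, it gives a quick way to check that a scan falls within the 200 to 400 DPI range before running OCR.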

Features

  • Hunspell Spellcheck is supported and can be applied by downloading the corresponding dictionary files (.aff, .dic) to the dict folder of VietOCR
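Since a Hunspell dictionary consists of a matching .aff and .dic pair, a quick check for incomplete pairs can save some troubleshooting. This is a minimal sketch: the helper name dictionary_status is hypothetical, and the location of the dict folder depends on where VietOCR is installed on your machine.

```python
from pathlib import Path


def dictionary_status(dict_dir):
    """Map each dictionary base name to True if both .aff and .dic files exist."""
    folder = Path(dict_dir)
    aff = {p.stem for p in folder.glob("*.aff")}
    dic = {p.stem for p in folder.glob("*.dic")}
    # A name that appears in only one of the sets is an incomplete pair.
    return {name: (name in aff and name in dic) for name in sorted(aff | dic)}
```

Any entry mapped to False is missing one of its two files and is unlikely to load as a dictionary.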

Closing Notes

VietOCR is one of many OCR packages. This link provides a list of alternatives.

From my own experience, finding a readily available offline translation package is quite difficult. Given the accuracy and continuous development of existing apps (such as Google Translate), it made sense to build on a product that is continually being enhanced.

If you are after alternatives to Google Translate, then some options are Microsoft Translator and Yandex Translate. Both support an offline translation mode.

As of now, these translation apps do not have equivalent versions for Windows, hence the need to install the Bluestacks Android emulator.

Originally published at http://github.com.

Learner. Interests include Cloud and Devops technologies.
