OCR and OER – update
- Access to Knowledge
Subhashish Panigrahi
25 September 2015
To read the blog post published by Open Education Working Group, see here.
Subhashish Panigrahi (@subhapa) is an educator, author, blogger, Wikimedian, language activist and free knowledge evangelist based in Bengaluru (often called Bangalore), India. After working for a while at the Wikimedia Foundation’s India Program he is currently at the Centre for Internet and Society‘s Access To Knowledge program. He works primarily in building partnership with universities, language research and GLAM (Gallery, Library, Archive and Museums) organizations for bringing more scholarly and encyclopedic content under free licenses, designs outreach programs for South Asian language Wikipedia/Wikimedia projects and communities. He wears many other hats: Editor for Global Voices Odia, Community Moderator of Opensource.com, and Ambassador for India in OpenGLAM Local. Subhashish is the author of a piece “Rising Voices: Indigenous language Digital Activism” in the book Digital Activism in Asia Reader.
Google’s OCR and its use by Wikimedians in South Asia
Some time back on the OCR project support network, Google had announced that the Google drive could be used for Optical Character Recognition (OCR). The software now works for over 248 world languages (including all the major South Asian languages). Though the exact pattern of development of the software is not clear, some of the Wikimedians reported that there is improvement over time in the recognition of their native languages Malayalam and Tamil. The recent encounter has been with a simple, easy to to use and robust software that can detect most languages with over 90% accuracy.
The OCR technology extracts text from images, scans of printed text, and even handwriting to some extent, which means that the text can be extracted pretty much from any old book, manuscript, or image. This certainly brings hope to most Indian languages as there is a lot to digitize. Most of the major Indian languages have plenty of non-digitized literature and the existing OCR systems are not as good as Google when so many languages are concerned as a whole.
Google’s OCR engine is probably using aspects of Tesseract, an OCR engine released as free software, or OCRopus, a free document analysis and optical character recognition (OCR) system that is primarily used in Google Books. Developed as a community project during 1995-2006 and later taken over by Google, Tesseract is considered one of the most accurate OCR engines and works for over 60 languages. The source code is available on GitHub.
The OCR project support page offers additional details on preserving character formatting for things like bold and italics after OCR in the output text.
When processing your document, we attempt to preserve basic text formatting such as bold and italic text, font size and type, and line breaks. However, detecting these elements is difficult and we may not always succeed. Other text formatting and structuring elements such as bulleted and numbered lists, tables, text columns, and footnotes or endnotes are likely to get lost.
The user-end interaction of the OCR software currently is rather simple. The user has to upload an image of the scan in any image format (.jpg, .png, .gif, etc.) or PDF to the Google Drive. Upon completion of the uploading, opening the file in Google Drive shows both the image and the converted text in the same document.
One of the most popular free and open digitization platforms, Wikisource currently hosts hundreds or thousands of free books which are either out of copyright or under Creative Commons licenses (CC-by or CC-by-SA) allowing users to digitize.
While OCR works quite well for Latin based languages, many other scripts do not get OCRed perfectly. So, the Wikisourcers (Wikisource contributors) often have to type the text.
Thus the new Google OCR might be useful both for the Wikisource community and many others who are in the mission of digitizing old text and archiving them.
The image below shows a screen from a tutorial to convert text in the Odia language from a scanned image using Google’s OCR.