Important New Developments in Arabographic Optical Character Recognition (OCR)

    Author(s):
    Benjamin Kiessling, Matthew Thomas Miller, Maxim G. Romanov, Sarah Bowen Savant
    Date:
    2017
    Subject(s):
    Arabic language, Arabic literature, Islam--Study and teaching, Persian language, Persian literature
    Item Type:
    Article
    Tag(s):
    medieval Arabic literature, Persian Literature, Persian Studies, OCR, Arabic, Islamic studies
    Permanent URL:
    http://dx.doi.org/10.17613/M6TZ4R
    Abstract:
    The Open Islamicate Texts Initiative (OpenITI) team, building on the foundational open-source OCR work of the Leipzig University (LU) Alexander von Humboldt Chair for Digital Humanities, has achieved Optical Character Recognition (OCR) accuracy rates for printed classical Arabic-script texts in the high nineties. These numbers are based on our tests of seven different Arabic-script texts of varying quality and typefaces, totaling over 7,000 lines (~400 pages, 87,000 words; see Table 1 for full details). These accuracy rates not only represent a distinct improvement over the actual accuracy rates of the various proprietary OCR options for printed classical Arabic-script texts but, equally important, are produced using open-source OCR software called Kraken (developed by Benjamin Kiessling, LU), thus enabling us to make this Arabic-script OCR technology freely available to the broader Islamicate, Persian, and Arabic Studies communities in the near future. In the process we also generated over 7,000 lines of “gold standard” (double-checked) training data that can be used by others for Arabic-script OCR training and testing purposes.
    Metadata:
    Published as:
    Journal article
    Status:
    Published
    Last Updated:
    6 years ago
    License:
    All Rights Reserved

    Downloads

    Item Name: pdf uw-25-savant-et-al.pdf