GemBox.Pdf

    Optical character recognition (OCR)

    Optical character recognition (OCR) is the process of converting images of text into machine-encoded text. GemBox.Pdf supports OCR through the GemBox.Pdf.Ocr.dll assembly.

    Language data

    These tables contain quick links for downloading the trained language data that GemBox.Pdf.Ocr needs to recognize languages other than English.

    You can also download a zip of all files, or individual files, from the official Tesseract data repository. As an alternative, you can use the tessdata_best repository, which contains data trained for the highest accuracy at the cost of lower speed, or the tessdata_fast repository, which contains data that is faster but less accurate.
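
    As a rough illustration of fetching individual files, the sketch below builds a download URL for one language data file. The repository names come from the paragraph above; the GitHub URL layout, the `main` branch, the `<code>.traineddata` file naming, and the example code `deu` (German) are assumptions, and `traineddata_url` is a hypothetical helper, not part of GemBox.Pdf.Ocr.

```python
# Repository names come from the text above; the URL layout and the
# "main" branch are assumptions about how GitHub serves these files.
REPOSITORIES = ("tessdata", "tessdata_best", "tessdata_fast")


def traineddata_url(language_code, repository="tessdata"):
    """Build the assumed download URL for one trained language data file."""
    if repository not in REPOSITORIES:
        raise ValueError(f"unknown repository: {repository}")
    return (
        "https://github.com/tesseract-ocr/"
        f"{repository}/raw/main/{language_code}.traineddata"
    )


# Example: the standard and the fast variant of the same language.
print(traineddata_url("deu"))
print(traineddata_url("deu", "tessdata_fast"))
```

    The helper only builds the address; check the URL in a browser before relying on it in a build script.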

    Languages

    Language                            Language Data
    Afrikaans                           Download
    Amharic                             Download
    Arabic                              Download
    Assamese                            Download
    Azerbaijani                         Download
    Azerbaijani - Cyrillic              Download
    Belarusian                          Download
    Bengali                             Download
    Tibetan                             Download
    Bosnian                             Download
    Breton                              Download
    Bulgarian                           Download
    Catalan; Valencian                  Download
    Cebuano                             Download
    Czech                               Download
    Chinese - Simplified                Download
    Chinese - Traditional               Download
    Cherokee                            Download
    Corsican                            Download
    Welsh                               Download
    Danish                              Download
    German                              Download
    Dzongkha                            Download
    Greek, Modern (1453-)               Download
    English                             Download
    English, Middle (1100-1500)         Download
    Esperanto                           Download
    Math / equation detection module    Download
    Estonian                            Download
    Basque                              Download
    Faroese                             Download
    Persian                             Download
    Filipino (old - Tagalog)            Download
    Finnish                             Download
    French                              Download
    German - Fraktur                    Download
    French, Middle (ca.1400-1600)       Download
    Western Frisian                     Download
    Scottish Gaelic                     Download
    Irish                               Download
    Galician                            Download
    Greek, Ancient (to 1453)            Download
    Gujarati                            Download
    Haitian; Haitian Creole             Download
    Hebrew                              Download
    Hindi                               Download
    Croatian                            Download
    Hungarian                           Download
    Armenian                            Download
    Inuktitut                           Download
    Indonesian                          Download
    Icelandic                           Download
    Italian                             Download
    Italian - Old                       Download
    Javanese                            Download
    Japanese                            Download
    Kannada                             Download
    Georgian                            Download
    Georgian - Old                      Download
    Kazakh                              Download
    Central Khmer                       Download
    Kirghiz; Kyrgyz                     Download
    Kurmanji (Kurdish - Latin Script)   Download
    Korean                              Download
    Korean (vertical)                   Download
    Lao                                 Download
    Latin                               Download
    Latvian                             Download
    Lithuanian                          Download
    Luxembourgish                       Download
    Malayalam                           Download
    Marathi                             Download
    Macedonian                          Download
    Maltese                             Download
    Mongolian                           Download
    Maori                               Download
    Malay                               Download
    Burmese                             Download
    Nepali                              Download
    Dutch; Flemish                      Download
    Norwegian                           Download
    Occitan (post 1500)                 Download
    Oriya                               Download
    Panjabi; Punjabi                    Download
    Polish                              Download
    Portuguese                          Download
    Pushto; Pashto                      Download
    Quechua                             Download
    Romanian; Moldavian; Moldovan       Download
    Russian                             Download
    Sanskrit                            Download
    Sinhala; Sinhalese                  Download
    Slovak                              Download
    Slovenian                           Download
    Sindhi                              Download
    Spanish; Castilian                  Download
    Spanish; Castilian - Old            Download
    Albanian                            Download
    Serbian                             Download
    Serbian - Latin                     Download
    Sundanese                           Download
    Swahili                             Download
    Swedish                             Download
    Syriac                              Download
    Tamil                               Download
    Tatar                               Download
    Telugu                              Download
    Tajik                               Download
    Thai                                Download
    Tigrinya                            Download
    Tonga                               Download
    Turkish                             Download
    Uighur; Uyghur                      Download
    Ukrainian                           Download
    Urdu                                Download
    Uzbek                               Download
    Uzbek - Cyrillic                    Download
    Vietnamese                          Download
    Yiddish                             Download
    Yoruba                              Download

    Scripts

    Script                              Script Data
    Arabic                              Download
    Armenian                            Download
    Bengali                             Download
    Canadian Aboriginal                 Download
    Cherokee                            Download
    Cyrillic                            Download
    Devanagari                          Download
    Ethiopic                            Download
    Fraktur                             Download
    Georgian                            Download
    Greek                               Download
    Gujarati                            Download
    Gurmukhi                            Download
    Han simplified                      Download
    Han simplified vertical             Download
    Han traditional                     Download
    Han traditional vertical            Download
    Hangul                              Download
    Hangul vertical                     Download
    Hebrew                              Download
    Japanese                            Download
    Japanese vertical                   Download
    Kannada                             Download
    Khmer                               Download
    Lao                                 Download
    Latin                               Download
    Malayalam                           Download
    Myanmar                             Download
    Oriya (Odia)                        Download
    Sinhala                             Download
    Syriac                              Download
    Tamil                               Download
    Telugu                              Download
    Thaana                              Download
    Thai                                Download
    Tibetan                             Download
    Vietnamese                          Download

    Troubleshooting

    GemBox.Pdf.Ocr uses the Tesseract engine under the hood. When the engine fails, the failure is usually one of two types: Error 1 or Error 2.

    Error 1

    This error occurs when the Tesseract engine fails during initialization.

    Common reasons are:

    • The language data path does not exist or doesn't hold language data files for the requested language.
    • The language data was built for a different version of Tesseract. This should not happen when you use the Tesseract DLL and language data provided by GemBox.
    • The language data path contains non-ASCII characters.
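
    The first and third reasons above can be checked before running OCR. The following is a minimal sketch of such a preflight check, written as a standalone script rather than as part of GemBox.Pdf.Ocr. The function name `check_language_data` is hypothetical, and the `<language>.traineddata` file naming is an assumption based on how Tesseract language data is usually distributed. A version mismatch (the second reason) cannot be detected this way.

```python
from pathlib import Path


def check_language_data(tessdata_dir, languages):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    path = Path(tessdata_dir)

    # The language data path must not contain non-ASCII characters.
    if not str(path).isascii():
        problems.append(f"path contains non-ASCII characters: {path}")

    # The language data path must exist.
    if not path.is_dir():
        problems.append(f"directory does not exist: {path}")
        return problems

    # Each requested language needs its own data file (assumed naming).
    for lang in languages:
        if not (path / f"{lang}.traineddata").is_file():
            problems.append(f"missing language data file: {lang}.traineddata")
    return problems
```

    For example, `check_language_data("tessdata", ["eng", "deu"])` lists every issue it can detect for a folder named `tessdata`, so you can fix them all in one pass instead of hitting Error 1 repeatedly.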

    Error 2

    This error occurs when GemBox.Pdf.Ocr fails to load the native Tesseract and Leptonica libraries. Based on the CPU architecture of the executing process, the loading routine tries to locate the matching version of each DLL, which should reside in the x86 or x64 folder under your bin folder.

    Common reasons for failure are:

    • The Visual Studio x86 & x64 Runtime is not installed.
    • The x86 and x64 versions of Leptonica and Tesseract were not copied to their respective folders in the bin directory.
    • The project is running on an unsupported architecture (e.g. ARM).
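
    The folder layout described above can be verified with a small script. The sketch below mirrors the description (an x86 or x64 folder under the bin directory, chosen by architecture); it is not the library's actual loading routine. The function names are hypothetical, and the native file names are passed in by the caller because this page does not list them.

```python
import platform
import struct
from pathlib import Path


def native_folder_name(machine, pointer_bits):
    """Return "x86" or "x64", or None for an unsupported (ARM) architecture."""
    if machine.lower().startswith(("arm", "aarch")):
        return None
    return "x64" if pointer_bits == 64 else "x86"


def missing_native_files(bin_dir, file_names, machine=None, pointer_bits=None):
    """List the expected native files that are absent from the architecture folder."""
    # Default to the architecture and bitness of the running Python process.
    if machine is None:
        machine = platform.machine()
    if pointer_bits is None:
        pointer_bits = struct.calcsize("P") * 8

    folder = native_folder_name(machine, pointer_bits)
    if folder is None:
        raise RuntimeError(f"unsupported architecture: {machine}")

    target = Path(bin_dir) / folder
    return [name for name in file_names if not (target / name).is_file()]
```

    Pass `pointer_bits` explicitly (32 or 64) when your application's process bitness differs from that of the Python interpreter running the check, since the defaults describe the interpreter, not your application.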

    Further diagnosis

    Even though the Tesseract engine returns only a success or fail response, it writes much more information about why the operation failed to the standard output, which you can use to diagnose the error. GemBox.Pdf.Ocr also writes some information to the Tesseract trace source, which may be helpful.

    You can use the following diagnostics configuration:

    <system.diagnostics>
    	<sources>
    		<source name="Tesseract" switchValue="Verbose">
    			<listeners>
    				<clear />
    				<add name="console" />
    				<!-- Uncomment to log to a file
    				<add name="file" />
    				-->
    			</listeners>
    		</source>
    	</sources>
    	<sharedListeners>
    		<add name="console" type="System.Diagnostics.ConsoleTraceListener" />
    
    		<!-- Uncomment to log to a file
    		<add name="file"
    		   type="System.Diagnostics.TextWriterTraceListener"
    		   initializeData="c:\log\tesseract.log" />
    		-->
    	</sharedListeners>
    </system.diagnostics>
    

    © GemBox Ltd. — All rights reserved.