GlobalCapture uses the ABBYY Optical Character Recognition (OCR) engine to read and classify documents. ABBYY is a flexible, highly configurable OCR engine with configuration options that can be tuned to get the best possible results. This is a brief overview of which configuration files are used and what the parameters in those files do.
GlobalCapture Settings
The engine uses two primary configuration files, TextOCR.cfg and FullPageOCR.cfg, which are stored in several locations. TextOCR.cfg is used to extract data from a page, whereas FullPageOCR.cfg uses the same parameters to create text-searchable PDF files. For GlobalCapture, both files are in new locations, but the parameters are the same as in previous versions.
Template Designer
- TextOCR.cfg: YOURGETSMARTDRIVE:\inetpub\wwwroot\Square9Viewer
GlobalCapture Engine
- TextOCR.cfg: YOURGETSMARTDRIVE:\GetSmart\CaptureServices\GlobalCapture_1
- FullPageOCR.cfg: YOURGETSMARTDRIVE:\GetSmart\CaptureServices\GlobalCapture_1
Please Be Advised
If the TextOCR.cfg files differ between the Template Designer and the capture engine, the data a Template reads in the Designer will differ from what the engine returns.
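One way to confirm that the two TextOCR.cfg copies are in sync is to compare their hashes. The following is a minimal sketch, not a Square 9 tool: the helper names are hypothetical, and the `C:` drive letter is an assumption standing in for YOURGETSMARTDRIVE.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def configs_match(designer_cfg: Path, engine_cfg: Path) -> bool:
    """True when both TextOCR.cfg copies are byte-for-byte identical."""
    return file_digest(designer_cfg) == file_digest(engine_cfg)


if __name__ == "__main__":
    # Assumed drive letter; substitute your own GetSmart drive.
    designer = Path(r"C:\inetpub\wwwroot\Square9Viewer\TextOCR.cfg")
    engine = Path(r"C:\GetSmart\CaptureServices\GlobalCapture_1\TextOCR.cfg")
    if designer.exists() and engine.exists():
        print("match" if configs_match(designer, engine) else "MISMATCH")
    else:
        print("One or both TextOCR.cfg files were not found.")
```

A byte-for-byte comparison is deliberately strict: it also flags whitespace-only differences, which is usually acceptable when the goal is to keep both copies identical.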
Full Page OCR and Zonal Settings
We use the ABBYY engine for Full Page and Text-based OCR. ABBYY is an extremely configurable engine with many settings that affect how data is read. These settings are stored in three configuration files within the GetSmart directory:
- FullPageOCR.cfg – TextPDF/Full Page OCR configuration settings.
  - These settings are used when converting a document to a text-searchable PDF or other electronic formats.
- TextOCR.cfg
  - These settings are used when extracting data using Zonal OCR.
- FullPageBaseSettings.cfg
  - These settings contain a profile of commonly used settings present in version 4.1, which are customized further by FullPageOCR.cfg.
Please Be Advised
Changes to the default settings of these configuration files are not supported and are made at your own risk. Generally, changes to these files should be made by, or on the advice of, a Square 9 technician.
Configuration Objects
Both files, TextOCR.cfg and FullPageOCR.cfg, have similar configurations when the GlobalSearch desktop client is installed. Each configuration file consists of one or more objects, and each object has a number of properties that can be defined. The objects found in FullPageOCR.cfg and TextOCR.cfg are as follows:
- PDFExportParams
- PagePreprocessingParams
- PageAnalysisParams
- ObjectsExtractionParams
- RecognizerParams
- DocumentStructureDetectionParams
Please Be Advised
These changes are global; changing these settings affects all Zonal OCR and TextPDF activities.
FullPageBaseSettings.cfg and ZonalBaseSettings.cfg Settings
FullPageBaseSettings.cfg contains a single string that defines the profile loaded when TextPDF creation is run. Its values are outlined below:
- DocumentConversion_Accuracy
  - Suitable for converting documents to an editable format (Ex: RTF, DOCX). Settings have been optimized for accuracy.
- DocumentConversion_Speed
  - Suitable for converting documents to an editable format. Settings have been optimized for speed.
- DocumentArchiving_Accuracy
  - Suitable for creating an electronic archive (PDF, PDF/A, etc.). Settings have been optimized for accuracy. Skew correction is not performed.
- TextExtraction_Accuracy
  - Used to extract zonal field data, optimized for accuracy.
- TextExtraction_Speed
  - Used to extract zonal field data, optimized for speed at a loss of accuracy.
- FieldLevelRecognition
  - For recognizing short text fragments.
- HighCompressedImageOnlyPDF
  - For creating highly compressed PDF files in which the entire document is saved as pictures.
- BusinessCardsProcessing
  - For recognizing business cards.
- EngineeringDrawingsProcessing
  - Optimizes the OCR engine for recognizing technical drawings with text oriented in different directions.
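Because the file holds only a profile name, a typo in it is easy to miss. The sketch below checks a candidate string against the profiles listed above. It assumes the file contains nothing but the name plus optional surrounding whitespace; `read_profile` and `KNOWN_PROFILES` are hypothetical names, not part of any Square 9 or ABBYY API.

```python
# Profile names as listed in this article.
KNOWN_PROFILES = {
    "DocumentConversion_Accuracy",
    "DocumentConversion_Speed",
    "DocumentArchiving_Accuracy",
    "TextExtraction_Accuracy",
    "TextExtraction_Speed",
    "FieldLevelRecognition",
    "HighCompressedImageOnlyPDF",
    "BusinessCardsProcessing",
    "EngineeringDrawingsProcessing",
}


def read_profile(file_text: str) -> str:
    """Return the profile name from the file text, or raise if it is unknown."""
    name = file_text.strip()
    if name not in KNOWN_PROFILES:
        raise ValueError(f"Unknown profile name: {name!r}")
    return name


if __name__ == "__main__":
    print(read_profile("DocumentArchiving_Accuracy\n"))
```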
FullPageOCR.cfg and TextOCR.cfg Settings
Default Contents
As of version 4.1, the default contents of FullPageOCR.cfg and TextOCR.cfg are as follows. These files can be found in C:\GetSmart. By changing these two files, you can affect how Zonal Templates read document data.
PDFExportParams
This object defines how PDFs are exported after undergoing TextPDF Creation.
Function | Description | Value |
---|---|---|
PDFAComplianceMode | PDFs will be exported adhering to the defined standard. | PCM_None, PCM_Pdfa_1b, PCM_Pdf_1b |
Colority | Defines if PDFs are exported as Color or Grayscale. | PCM_KeepColority, PCM_ForceToGrey |
TextExportMode | Defines whether the exported PDF contains the page image, the recognized text, or both. | PEM_ImageOnText, PEM_ImageOnly, PEM_TextOnly |
PagePreprocessingParams
Function | Description | Value |
---|---|---|
CorrectOrientation | Attempt to auto rotate the image. | Boolean |
PrepareImageMode
Function | Description | Value |
---|---|---|
Rotation | Specifies the rotation angle to apply to the image during preparation. | RT_NoRotation, RT_Clockwise, RT_Counterclockwise, RT_Upsidedown |
CorrectSkew | Tells the OCR engine to correct skew during image preparation. | Boolean |
CorrectSkewMode | Specifies the mode of skew correction. | Do Not Alter |
InvertImage | Tells the OCR engine to invert the colors of the prepared image. | Boolean |
MirrorImage | Tells the OCR engine to mirror the prepared image around its vertical axis. | Boolean |
EnhanceLocalContrast | Specifies whether the local contrast of the image should be increased. | Boolean |
DiscardColorImage | Tells the OCR engine to leave only the black-and-white plane in the prepared image. | Boolean |
UseFastBinarization | Tells the OCR engine to use fast image binarization algorithms. | Boolean |
PageAnalysisParams
Function | Description | Value |
---|---|---|
ProhibitModelAnalysis | When true, model analysis is skipped. During model analysis, typical page layout variants are tried and the best variant is selected. | Boolean |
DetectPictures | Pictures are detected as part of analysis. | Boolean |
DetectSeparators | Separators are detected during analysis. | Boolean |
ObjectsExtractionParams
Function | Description | Value |
---|---|---|
FastObjectsExtraction | Extraction speed may increase but quality may deteriorate. | Boolean |
RemoveTexture | Background noise is removed from the image used for recognition. The original image is not altered. | Boolean |
RecognizerParams
Function | Description | Value |
---|---|---|
FastMode | Data will be extracted more rapidly at the cost of accuracy. | Boolean |
LowResolutionMode | Useful when recognizing faxes, small print, low-resolution images, or poor print quality. | Boolean |
BalancedMode | Data will be extracted more accurately but at the cost of speed. | Boolean |
OneLinePerBlock | The OCR engine will presume the extracted text contains no more than one line. | Boolean |
OneWordPerBlock | The OCR engine will presume the extracted text contains no more than one word. | Boolean |
CaseRecognitionMode | Specifies how letter case is handled during recognition. | |
TextTypes | Defines the style of the text to be extracted. | See TextType Value Table |
TextLanguages | Specifies one or more recognition languages. Helpful for accented character recognition (Ex. TextLanguage=English,French). | See TextLanguage Value Table |
If neither FastMode nor BalancedMode is used, FullMode is used by default. Text is extracted with greater accuracy, but extraction may be significantly slower.
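The fallback rule above can be sketched as a small function. This illustrates the rule as stated here and is not ABBYY's implementation. The function name is hypothetical, and raising an error when both flags are true is an assumption, since this article does not say which flag takes precedence in that case.

```python
from typing import Mapping


def effective_mode(recognizer_params: Mapping[str, str]) -> str:
    """Return the recognition mode implied by the [RecognizerParams] values."""

    def is_true(key: str) -> bool:
        # A missing key is treated as false (assumption).
        return recognizer_params.get(key, "false").strip().lower() == "true"

    fast = is_true("FastMode")
    balanced = is_true("BalancedMode")
    if fast and balanced:
        # Precedence is not documented in this article, so refuse to guess.
        raise ValueError("FastMode and BalancedMode are both true")
    if fast:
        return "FastMode"
    if balanced:
        return "BalancedMode"
    return "FullMode"  # neither flag set: most accurate, slowest


if __name__ == "__main__":
    print(effective_mode({"FastMode": "false"}))
```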
DocumentStructureDetectionParams
Function | Description | Value |
---|---|---|
ClassifySeparators | Additional properties of separators, such as their type, are detected. GlobalSearch LAN does not need this information, and the value should be set to false. | Boolean |
DetectFootnotes | Footnotes are detected during document synthesis. GlobalSearch LAN does not require this, and for quicker extraction this value should be set to false. | Boolean |
DetectTableOfContents | Tables of contents are detected during document synthesis. GlobalSearch LAN does not require this, and for quicker extraction this value should be set to false. | Boolean |
The default values for these parameters are TRUE. GlobalSearch does not require these parameters, and for quicker extraction these values should be set to FALSE.
Text Parameters
TextLanguage Value Table
Bulgarian | French | Portuguese (Brazilian) |
Chinese simplified | German | Russian |
Chinese traditional | Greek | Slovak |
Czech | Hungarian | Spanish |
Danish | Italian | Swedish |
Dutch | Japanese | Turkish |
English | Korean | Ukrainian |
Estonian | Polish | Vietnamese |
TextType Value Table
Text Type | Description | Value |
---|---|---|
TT_Normal | Common typographic text. | 1 |
TT_Typewriter | Tells the OCR engine to presume the text was generated on a typewriter. | 2 |
TT_Matrix | Tells the OCR engine to presume the text was generated on a matrix printer. | 4 |
TT_Index | Corresponds to a special set of characters including only digits written in ZIP code style. | 8 |
TT_OCR_A | A special font designed for Optical Character Recognition. Largely used by banks, credit card companies, and other financial institutions. | 32 |
TT_OCR_B | Corresponds to a special font designed for Optical Character Recognition. | 64 |
TT_MICR_E13B | Corresponds to a special MICR font (E-13B). | 128 |
TT_MICR_CMC7 | Tells the OCR engine to presume it is reading a special MICR font (CMC-7). | 256 |
TT_Gothic | Tells the OCR engine to presume the text is printed in the Gothic type. Only the “Fraktur” font is supported. | 512 |
TT_Receipt | Corresponds to a special font commonly used in thermal printed receipts. | 1024 |
- You can select multiple text types by adding the values together.
- Both TextLanguage and TextType are added to the RecognizerParams object.
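Adding the values works because each text type is a distinct power of two, so any sum maps back to exactly one combination. For example, TT_Normal (1) plus TT_Typewriter (2) gives 3. The sketch below illustrates the arithmetic only. The helper names are hypothetical, and the member names are written as valid Python identifiers (TT_OCR_B, TT_MICR_E13B), which is an assumption about the intended spellings.

```python
from enum import IntFlag


class TextType(IntFlag):
    """Text type values from the TextType Value Table."""
    TT_Normal = 1
    TT_Typewriter = 2
    TT_Matrix = 4
    TT_Index = 8
    TT_OCR_A = 32
    TT_OCR_B = 64
    TT_MICR_E13B = 128  # E-13B MICR font
    TT_MICR_CMC7 = 256
    TT_Gothic = 512
    TT_Receipt = 1024


def combine(names: list[str]) -> int:
    """Add together the values of the named text types."""
    value = TextType(0)
    for name in names:
        value |= TextType[name]  # raises KeyError for an unknown name
    return int(value)


def split(value: int) -> list[str]:
    """List the text type names contained in a combined value."""
    return [t.name for t in TextType if t & TextType(value)]


if __name__ == "__main__":
    print(combine(["TT_Normal", "TT_Typewriter"]))  # prints 3
    print(split(96))  # prints ['TT_OCR_A', 'TT_OCR_B']
```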
ABBYY Configuration
You can find the full ABBYY configuration manual HERE
Square 9 OCR Settings & Tested High Performance Settings
If your TextOCR or FullPageOCR configuration files are lost or corrupted, you can construct new ones using the original settings directly below. Square 9 also offers a higher-performance set of OCR parameters, listed after the original settings. These settings increase OCR accuracy but also increase processing time, so they should not be used on heavily loaded or slower servers. Any modification made to one configuration file should be made to all configuration files to maintain parity and consistency.
Original OCR Settings
    [PDFExportParams]
    PDFAComplianceMode=PCM_Pdfa_1b
    TextExportMode=PEM_ImageOnText
    Colority=PCM_KeepColority
    [PagePreprocessingParams]
    CorrectOrientation = true
    [PrepareImageMode]
    CorrectSkew = false
    [PageAnalysisParams]
    ProhibitModelAnalysis=true
    [ObjectsExtractionParams]
    FastObjectsExtraction=true
    [RecognizerParams]
    FastMode=true
    [DocumentStructureDetectionParams]
    ClassifySeparators=false
    DetectFootnotes=false
    DetectTableOfContents=false
High-Performance Aggressive OCR Settings (Square 9 Tested)
    [PDFExportParams]
    PDFAComplianceMode=PCM_Pdfa_1b
    TextExportMode=PEM_ImageOnText
    Colority=PCM_KeepColority
    [PagePreprocessingParams]
    CorrectOrientation = false
    [PrepareImageMode]
    CorrectSkew = true
    [PageAnalysisParams]
    ProhibitModelAnalysis=false
    EnableTextExtractionMode=true
    [ObjectsExtractionParams]
    FastObjectsExtraction=false
    EnableAggressiveTextExtraction=true
    [RecognizerParams]
    FastMode=false
    [DocumentStructureDetectionParams]
    ClassifySeparators=false
    DetectFootnotes=false
    DetectTableOfContents=false
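To see exactly which parameters differ between two settings profiles, or between two copies of a configuration file, the files can be parsed and compared. The sketch below uses Python's standard `configparser` and assumes the .cfg files are INI-compatible with one setting per line. `parse_cfg` and `diff_cfg` are hypothetical helper names, and the two sample strings are short excerpts of the settings above rather than complete files.

```python
import configparser

ORIGINAL = """
[PagePreprocessingParams]
CorrectOrientation = true
[ObjectsExtractionParams]
FastObjectsExtraction=true
[RecognizerParams]
FastMode=true
"""

AGGRESSIVE = """
[PagePreprocessingParams]
CorrectOrientation = false
[ObjectsExtractionParams]
FastObjectsExtraction=false
EnableAggressiveTextExtraction=true
[RecognizerParams]
FastMode=false
"""


def parse_cfg(text: str) -> dict:
    """Parse INI-style text into {section: {key: value}}."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the original case of parameter names
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


def diff_cfg(a: dict, b: dict) -> dict:
    """Return {(section, key): (value_in_a, value_in_b)} for differing settings."""
    changes = {}
    for section in sorted(set(a) | set(b)):
        in_a, in_b = a.get(section, {}), b.get(section, {})
        for key in sorted(set(in_a) | set(in_b)):
            if in_a.get(key) != in_b.get(key):
                changes[(section, key)] = (in_a.get(key), in_b.get(key))
    return changes


if __name__ == "__main__":
    for (section, key), (old, new) in diff_cfg(
        parse_cfg(ORIGINAL), parse_cfg(AGGRESSIVE)
    ).items():
        print(f"[{section}] {key}: {old} -> {new}")
```

A value of `None` on either side means the parameter is present in only one of the two files. The same comparison can be run on the Template Designer and engine copies of TextOCR.cfg to check parity.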