GlobalCapture uses the ABBYY Optical Character Recognition (OCR) engine to read and classify documents. ABBYY is a flexible, highly configurable OCR engine with options that can be tuned to get the best possible results. This is a brief overview of which configuration files are used and what the parameters in those files do.

GlobalCapture Settings

The engine uses two primary configuration files, TextOCR.cfg and FullPageOCR.cfg, each stored in several locations. TextOCR.cfg is used to extract data from a page, whereas FullPageOCR.cfg uses the same parameters to create text-searchable PDF files. For GlobalCapture, both files are in new locations, but the parameters are the same as in previous versions.

...

Info

Please Be Advised

If the TextOCR.cfg files differ between the Template Designer and the GlobalCapture engine, the template will read differently in the designer from what the engine returns.

Changes to the TextOCR.cfg for Template Designer are not reflected on existing samples, only on new ones. Upload a new sample to your template for the changes to take effect.
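One way to check whether the two TextOCR.cfg copies have drifted apart is to diff them. The following is a minimal sketch using only the Python standard library. The function name `diff_config_files` is hypothetical, and the paths in the usage comment are placeholders, since the real locations depend on your installation.

```python
import difflib
from pathlib import Path


def diff_config_files(designer_cfg: Path, engine_cfg: Path) -> list[str]:
    """Return unified-diff lines between two config files (empty list if identical)."""
    # errors="replace" keeps the comparison going even if a file has odd bytes.
    designer_lines = designer_cfg.read_text(encoding="utf-8", errors="replace").splitlines()
    engine_lines = engine_cfg.read_text(encoding="utf-8", errors="replace").splitlines()
    return list(
        difflib.unified_diff(
            designer_lines,
            engine_lines,
            fromfile=str(designer_cfg),
            tofile=str(engine_cfg),
            lineterm="",
        )
    )


# Usage (placeholder paths, replace with the locations on your system):
#   changes = diff_config_files(Path(r"<designer folder>\TextOCR.cfg"),
#                               Path(r"<engine folder>\TextOCR.cfg"))
#   print("\n".join(changes) if changes else "TextOCR.cfg files match.")
```

An empty result means the designer and the engine are reading with the same settings.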


Full Page OCR and Zonal Settings

We use the ABBYY engine for Full Page and Text-based OCR. ABBYY is an extremely configurable engine with many settings that affect how data is read. These configuration files are stored in three locations within the GetSmart directory:

...

Info

Please Be Advised

Changes to the default settings of these configuration files have not been tested by Square 9 and are made at your own risk. Any change to your OCR config files requires you to upload new sample documents and run scanning tests to see whether OCR results improve. You can ask a Square 9 technician for advice, but you are encouraged to test on your own, because testing new OCR configurations can take many hours. Keep in mind that changes to these files are global and will affect all incoming documents.
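Because these edits are untested and global, it can help to keep a copy of each file before changing it. The sketch below is one possible approach, not a Square 9 procedure. The helper name `backup_config` and the `.bak` naming scheme are assumptions made for illustration.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_config(cfg_path: Path) -> Path:
    """Copy cfg_path next to itself with a timestamp suffix and return the copy's path."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    # Example result: TextOCR.cfg.20240101-120000.bak (naming scheme is an assumption)
    backup_path = cfg_path.with_name(f"{cfg_path.name}.{stamp}.bak")
    shutil.copy2(cfg_path, backup_path)  # copy2 also preserves file timestamps
    return backup_path
```

Calling `backup_config(Path(...))` on each config file before an edit leaves a dated copy you can restore if the new settings read worse.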


Configuration Objects

Both files, TextOCR.cfg and FullPageOCR.cfg, have similar configurations when the GlobalSearch desktop client is installed. Each configuration file consists of one or more objects, and each object has a number of properties that can be defined. The objects found in the FullPageOCR.cfg and TextOCR.cfg files are as follows:

...

Info

Please Be Advised

These changes are global. Changing these settings will affect all zonal OCR and text PDF activities.
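To illustrate the object and property structure described above, here is a small sketch that reads a config into a dictionary of objects. It assumes the files use an INI-style layout (`[ObjectName]` headers followed by `Property = value` lines). That layout and the sample text are assumptions, not copied from a real file, so check them against your own TextOCR.cfg before relying on this.

```python
import configparser

# Hypothetical excerpt: the INI-style layout and these values are assumptions.
SAMPLE_CFG = """\
[PagePreprocessingParams]
CorrectOrientation = true

[RecognizerParams]
FastMode = false
BalancedMode = true
"""


def load_objects(cfg_text: str) -> dict[str, dict[str, str]]:
    """Parse INI-style text into {object name: {property name: raw value}}."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep property names case-sensitive
    parser.read_string(cfg_text)
    return {section: dict(parser[section]) for section in parser.sections()}


for object_name, properties in load_objects(SAMPLE_CFG).items():
    print(object_name, properties)
```

Listing the parsed objects this way makes it easier to see at a glance which properties a given file actually sets.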


FullPageBaseSettings.cfg and ZonalBaseSettings.cfg Settings

FullPageBaseSettings.cfg contains a single string that defines the profile loaded when TextPDF creation is run. Its values are outlined below:

...

FullPageOCR.cfg and TextOCR.cfg Settings

Default Contents

As of version 4.1, the default contents of the FullPageOCR.cfg and TextOCR.cfg are as follows. These files can be found in C:\GetSmart. By changing these two files, you can affect how the Zonal Templates read document data.

PDFExportsParams

This object defines how PDFs are exported after undergoing TextPDF Creation.

| Function | Description | Value |
| --- | --- | --- |
| PDFAComplianceMode | PDFs will be exported adhering to the defined standard. | PCM_None, PCM_Pdfa_1b, PCM_Pdf_1b |
| Colority | Defines whether PDFs are exported as color or grayscale. | PCM_KeepColority, PCM_ForceToGrey |
| TextExport | Defines how the page image and the recognized text are combined in the exported PDF. | PEM_ImageOnText, PEM_ImageOnly, PEM_TextOnly |
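Since each PDFExportsParams property accepts only a fixed set of values, a quick check before saving can catch typos. The sketch below uses the values listed in this article. The function name `find_invalid_pdf_export_values` is hypothetical, and whether the engine treats values as case-sensitive is not stated here, so the exact-match comparison is an assumption.

```python
# Allowed values as listed for PDFExportsParams in this article.
PDF_EXPORT_ALLOWED = {
    "PDFAComplianceMode": {"PCM_None", "PCM_Pdfa_1b", "PCM_Pdf_1b"},
    "Colority": {"PCM_KeepColority", "PCM_ForceToGrey"},
    "TextExport": {"PEM_ImageOnText", "PEM_ImageOnly", "PEM_TextOnly"},
}


def find_invalid_pdf_export_values(settings: dict[str, str]) -> list[str]:
    """Return one message per property that is unknown or has an unlisted value."""
    problems = []
    for name, value in settings.items():
        allowed = PDF_EXPORT_ALLOWED.get(name)
        if allowed is None:
            problems.append(f"{name}: not a property listed for PDFExportsParams")
        elif value not in allowed:  # exact, case-sensitive match (assumption)
            problems.append(f"{name}: '{value}' is not one of {sorted(allowed)}")
    return problems
```

An empty list means every supplied property and value appears in the table above.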


PagePreprocessingParams


| Function | Description | Value |
| --- | --- | --- |
| CorrectOrientation | Attempt to auto-rotate the image. | Boolean |


PrepareImageMode


| Function | Description | Value |
| --- | --- | --- |
| Rotation | Specifies the rotation angle to apply to the image during preparation. | RT_NoRotation, RT_Clockwise, RT_Counterclockwise, RT_Upsidedown |
| CorrectSkew | Tells the OCR engine to correct skew during image preparation. | Boolean |
| CorrectSkewMode | Specifies the mode of skew correction. | Do Not Alter |
| InvertImage | Tells the OCR engine to invert the colors of the prepared image. | Boolean |
| MirrorImage | Tells Square 9’s OCR engine to mirror the prepared image around its vertical axis. | Boolean |
| EnhanceLocalContrast | Specifies whether the local contrast of the image should be increased. | Boolean |
| DiscardColorImage | Tells the OCR engine to leave only the black-and-white plane in the prepared image. | Boolean |
| UseFastBinarization | The OCR engine will use algorithms for fast image binarization. | Boolean |



PageAnalysisParams


| Function | Description | Value |
| --- | --- | --- |
| ProhibitModelAnalysis | When false, typical variants of page layout are tried during page analysis and the best variant is selected. When true, this model analysis is skipped. | Boolean |
| DetectPictures | Pictures are detected as part of analysis. | Boolean |
| DetectSeparators | Separators are detected during analysis. | Boolean |


ObjectsExtractionParams


| Function | Description | Value |
| --- | --- | --- |
| FastObjectsExtraction | Extraction speed may increase but quality may deteriorate. | Boolean |
| RemoveTexture | Background noise is removed from the image used for recognition. The original image is not altered. | Boolean |


RecognizerParams

| Function | Description |
| --- | --- |
| PerformRecognition | Not present by default. If set to false, OCR extraction is not performed. Set it to false when you need other recognition, such as barcode recognition, but not OCR. |

RecognizerParams


| Function | Description | Value |
| --- | --- | --- |
| FastMode | Data will be extracted more rapidly at the cost of accuracy. | Boolean |
| LowResolutionMode | Useful when recognizing faxes, small print, low-resolution images, or poor print quality. | Boolean |
| BalancedMode | Data will be extracted more accurately but at the cost of speed. | Boolean |
| OneLinePerBlock | The OCR engine will presume the extracted text contains no more than one line. | Boolean |
| OneWordPerBlock | The OCR engine will presume the extracted text contains no more than one word. | Boolean |
| CaseRecognitionMode | Specifies the letter case to assume during recognition. | |
| TextTypes | Defines the style of the text to be extracted. | See TextType Value table |
| TextLanguages | One or more ABBYY recognition languages. Helpful for accented character recognition. (Ex. TextLanguage=English,French) | See TextLanguage Value table |

...

Info

If neither FastMode nor BalancedMode is used, FullMode is used by default. Text is extracted with greater accuracy, but extraction may be significantly slower.
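The fallback rule in the note above can be written out as a small decision function. This is only an illustration of the rule, not engine code. The function name is hypothetical, and because this article does not say what happens when both flags are enabled, the sketch rejects that combination rather than guess.

```python
def effective_recognition_mode(fast_mode: bool = False, balanced_mode: bool = False) -> str:
    """Return the recognition mode implied by the FastMode and BalancedMode flags."""
    if fast_mode and balanced_mode:
        # Behavior with both flags set is not documented here (assumption: treat as an error).
        raise ValueError("FastMode and BalancedMode are both enabled")
    if fast_mode:
        return "FastMode"
    if balanced_mode:
        return "BalancedMode"
    return "FullMode"  # default when neither flag is used


print(effective_recognition_mode())  # neither flag set
```

With neither flag set, the function falls through to FullMode, which matches the default described above.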


DocumentStructureDetectionParams


| Function | Description | Value |
| --- | --- | --- |
| ClassifySeparators | Additional properties of separators, such as their type, are detected. GlobalSearch LAN does not need this information, and the value should be set to false. | Boolean |
| DetectFootnotes | Footnotes are detected during document synthesis. GlobalSearch LAN does not require this, and for quicker extraction the value should be set to false. | Boolean |
| DetectTableOfContents | Tables of contents are detected during document synthesis. GlobalSearch LAN does not require this, and for quicker extraction the value should be set to false. | Boolean |

...

Info

The default values for these parameters are TRUE. GlobalSearch does not require these parameters, so for the quickest extraction set them to FALSE.
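As a sketch of the change recommended above, the function below sets the three DocumentStructureDetectionParams properties to false in a config held as text. It assumes an INI-style layout and that the value is spelled `false`. Both are assumptions to verify against your own files, and the function name is hypothetical.

```python
import configparser
import io

SECTION = "DocumentStructureDetectionParams"
DETECTION_FLAGS = ("ClassifySeparators", "DetectFootnotes", "DetectTableOfContents")


def disable_structure_detection(cfg_text: str) -> str:
    """Return cfg_text with the three structure-detection flags set to false."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep property names case-sensitive
    parser.read_string(cfg_text)
    if not parser.has_section(SECTION):
        parser.add_section(SECTION)
    for flag in DETECTION_FLAGS:
        parser.set(SECTION, flag, "false")  # value spelling is an assumption
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()
```

Note that `configparser` drops comments and normalizes spacing when it writes, so review the output before replacing a real file, and remember that the change applies to all incoming documents.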

...

Text Parameters

TextLanguage Value Table

| Bulgarian | French | Portuguese (Brazilian) |
| Chinese simplified | German | Russian |
| Chinese traditional | Greek | Slovak |
| Czech | Hungarian | Spanish |
| Danish | Italian | Swedish |
| Dutch | Japanese | Turkish |
| English | Korean | Ukrainian |
| Estonian | Polish | Vietnamese |

...