DigitWiki

Image processing
Improving the quality of scanned images can serve two different purposes:

 * enhance the visual appearance of images when viewed by humans,
 * enhance the quality for post-processing steps such as OCR and layout analysis.

Depending on the use case, different tools or settings have to be applied to optimize the image processing result for a particular purpose or material.

Common software tools used to enhance the visual appearance of images perform deskewing, contrast enhancement and border adjustment. The overall goal is to transform scanned images so that text is sharp and readable, illustrations are clear and the background is white. Borders and page sizes are usually also normalised across a set of pages to improve the viewing experience. These requirements can lead to different processing parameters being applied to different regions of an image: letters need to be rendered with very high contrast, whereas images or photos require much less.
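As an illustration of one such enhancement step, the following Python sketch performs a simple percentile-based contrast stretch using only NumPy. The percentile cut-offs are assumed values for the example, not settings taken from any particular tool:

```python
import numpy as np

def stretch_contrast(gray, low_pct=2, high_pct=98):
    """Linearly stretch grayscale values so the low/high percentiles map
    to 0 and 255, pushing the paper background toward white and the ink
    toward black. Values outside the percentile range are clipped."""
    lo, hi = np.percentile(gray, [low_pct, high_pct])
    if hi <= lo:              # degenerate (near-uniform) image: leave as-is
        return gray.copy()
    stretched = (gray.astype(np.float64) - lo) / (hi - lo) * 255.0
    return np.clip(stretched, 0, 255).astype(np.uint8)
```

In practice the cut-offs would be tuned per collection, since aggressive stretching can wash out faint illustrations while improving text contrast.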

Image enhancement for post-processing purposes usually involves tools for deskewing, noise removal and binarisation. The optimal parameters for such tools, however, depend heavily on the intended use case. For example, if the goal is to improve OCR results, the best parameters may vary between OCR engines, and they may also have to be adjusted for different data sets or even for individual pages within a given data set.
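Binarisation, one of the steps mentioned above, can be sketched with Otsu's method, which picks the threshold that maximises the between-class variance of the grayscale histogram. This is a minimal NumPy sketch for illustration, not the implementation used by any particular OCR pipeline:

```python
import numpy as np

def otsu_threshold(gray):
    """Find Otsu's threshold for an 8-bit grayscale image by maximising
    the between-class variance over all 256 candidate thresholds."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    total = hist.sum()
    weights = np.cumsum(hist)                  # pixel count at or below t
    means = np.cumsum(hist * np.arange(256))   # cumulative intensity sum
    mean_total = means[-1] / total
    w0 = weights / total                       # class-0 (dark) probability
    w1 = 1.0 - w0                              # class-1 (light) probability
    mu0 = np.divide(means / total, w0, out=np.zeros(256), where=w0 > 0)
    mu1 = np.divide(mean_total - means / total, w1,
                    out=np.zeros(256), where=w1 > 0)
    sigma_b = w0 * w1 * (mu0 - mu1) ** 2       # between-class variance
    return int(np.argmax(sigma_b))

def binarise(gray):
    """Boolean ink mask: True where the pixel is at or below the threshold."""
    return gray <= otsu_threshold(gray)
```

A global threshold like this works well on evenly lit scans; degraded historical pages often need adaptive (local) thresholding instead, which is one reason parameters have to be tuned per data set.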

Layout analysis
In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires text zones to be segmented from non-textual ones and arranged in their correct reading order. The detection and labeling of the different zones (or blocks) embedded in a document as text body, illustrations, math symbols and tables is called geometric layout analysis. Text zones also play different logical roles inside the document (titles, captions, footnotes, etc.), and this kind of semantic labeling is the scope of logical layout analysis.
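A very simple form of geometric layout analysis can be sketched with a projection profile: summing the ink pixels in each row and splitting the page at empty rows yields candidate horizontal zones (text lines or blocks). This is a minimal illustration, assuming an already binarised input image:

```python
import numpy as np

def text_line_bands(ink_mask):
    """Segment a binary page image (True = ink) into horizontal bands using
    a row projection profile: each run of consecutive rows containing ink
    becomes one candidate zone, returned as a half-open (top, bottom) pair."""
    profile = ink_mask.sum(axis=1)     # ink pixels per row
    inked = profile > 0
    bands, start = [], None
    for y, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = y                  # band begins
        elif not has_ink and start is not None:
            bands.append((start, y))   # band ends at first empty row
            start = None
    if start is not None:              # band runs to the bottom edge
        bands.append((start, len(inked)))
    return bands
```

Real layout analysis systems combine such projections with connected-component analysis and trained classifiers to separate text from illustrations and tables; the sketch only shows the underlying idea.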

Text recognition
OCR (optical character recognition) is the automatic transcription of the text represented on an image into machine-readable text. In this section we provide training materials for both commercial and open-source tools.
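As a toy illustration of the idea only (production OCR engines rely on trained statistical or neural models, not this), a binarised glyph can be classified by comparing it against stored character templates. The 3x3 glyphs below are invented for the example:

```python
import numpy as np

# Toy templates: 3x3 glyphs for 'I' and 'O' (purely illustrative shapes).
TEMPLATES = {
    "I": np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]], dtype=float),
    "O": np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float),
}

def recognise_glyph(glyph):
    """Classify one binarised glyph as the character whose template
    differs from it in the fewest pixels (nearest-template matching)."""
    return min(TEMPLATES, key=lambda c: np.abs(TEMPLATES[c] - glyph).sum())
```

Because the match is by pixel distance, the classifier tolerates small amounts of noise, which hints at why image enhancement upstream directly affects recognition accuracy.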

Post-correction
OCR produces its best results from well-printed, modern documents. Historical documents, however, exhibit a range of characteristics that can reduce recognition accuracy: poor paper quality, poor typesetting, damage or degradation of the original paper source, and text skew or warping due to age or humidity. In addition, content-holding institutions tend to have legacy data: text-based digitised material that was not originally created with OCR in mind.

This sort of material will produce unsatisfactory OCR accuracy and render digital material only partially discoverable and usable at best. IMPACT has therefore created a number of tools and modules that allow institutions and their users to correct and validate OCR text either prior to publication or afterwards (by means of crowdsourcing).
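A very simple form of automatic post-correction can be sketched as dictionary-based fuzzy matching: each OCR token is replaced by the closest lexicon entry if the match is close enough. The lexicon and similarity cut-off below are illustrative assumptions, not values from any IMPACT module:

```python
from difflib import get_close_matches

# Illustrative lexicon; a real system would use a large (historical) dictionary.
LEXICON = ["historical", "documents", "quality", "paper", "original"]

def correct_token(token, cutoff=0.8):
    """Replace an OCR token with its closest lexicon entry if the
    similarity ratio reaches the cutoff; otherwise keep it unchanged."""
    matches = get_close_matches(token.lower(), LEXICON, n=1, cutoff=cutoff)
    return matches[0] if matches else token
```

The cut-off controls the usual trade-off: a low value corrects more OCR errors but also risks "correcting" genuine historical spellings, which is why interactive validation or crowdsourcing remains important.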

Text processing
The purpose of tools in this area is to make digitised text more accessible to users and researchers by applying linguistic resources and language technology.
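One example of such a technology is spelling normalisation for historical text, which maps period orthography onto modern forms so that search and indexing work across both. The rules below (long s, v/u interchange) are illustrative assumptions, not taken from any specific IMPACT resource:

```python
# Illustrative rewriting rules for early-modern orthography (assumed examples).
RULES = [
    ("\u017f", "s"),   # long s (ſ) -> s
    ("vn", "un"),      # initial v for u, as in "vnseen"
]

def normalise(text):
    """Apply simple character-level rewriting rules so historical
    spellings match their modern equivalents for search and indexing."""
    for old, new in RULES:
        text = text.replace(old, new)
    return text
```

Real normalisation resources are considerably richer (context-sensitive rules, attested variant lexica), but the principle of mapping variants to a canonical modern form is the same.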