nervousdata


Notes about digitization of physical documents with a focus on Optical Character Recognition

Google-Fingers and ScanOps · Overlooking Full-Text · Early Character Recognition Techniques · Common Confusion · Machine-Usable Fonts · State-Of-The-Art 2020s · Artworks with OCR

Google-Fingers and ScanOps

In August 2010 Google announced that, by its count, 129,864,880 books existed in the entire world. The company also said it would scan all of them. In Mountain View, California (USA), a building in the Googleplex has been equipped as a workplace for ScanOps, the company’s name for the employees who do the digitization work.

More than ten years later, at least 25,000,000 books are stored somewhere at the company in a digital format. There is an index on the Internet you can search; often you get a preview and “snippets” appear. But only a fraction of the book collection is accessible as full text, for legal reasons, the company says. The large mass digitization project came to a halt, and it seems Google is no longer hiring ScanOps. But the scanning that has been done, hastily, is on the record. The traces of humans holding, positioning and turning book pages will most likely always stay on the record. Around 2010 the Google-fingers, in hot pink, became a symbol for the labor involved in the process of transforming three-dimensional material (library books) into something that suits the technorationalist ideology of access; “the idea that the presence of resources, made fundamentally discoverable through an uncomplicated search interface, constitutes access, full stop.” [1]

Very blurry picture. Dark blue in the background. On the left some pink and brown shapes. Scan of an old book page, blurred. On the bottom a hand wearing pink caps on two fingers.
Page 2 from Susanne von Bandemer, Neue vermischte Gedichte, Berlin 1802
Page 27 from Des Ritters Carl von Linné vollständiges Natursystem, Nürnberg: Gabriel N. Raspe, 1773-1775. Via Aliza Elkin, Hand Job Zine No. 9 [2]

Links: Google-Fingers and ScanOps

Overlooking Full-Text

To become searchable, indexable and n-grammable through whatever interface, printed or handwritten documents have to go through different stages of translation and transcoding. In Optical Character Recognition, writing is handled mainly on the basis of its visual qualities. The perception of figures and the recognition of these figures as part of an alphabet is, in humans, just the transient initial phase of what is called reading: pre-reading. Cognitive psychology shows that, with increasing experience, this logographic phase becomes only semiconscious, tantamount to overlooking. But how much, and what exactly, is a machine allowed or supposed to overlook?

The materiality of paper and ink: texture of the fiber, show-through of double-sided texts, dark background, heavy or light print, curved baselines. The typography: unusual typefaces, strange layouts, very small or large print, underlining, italics and spacing. The symbols: punctuation (commas, periods, quotation marks, diacritical marks), ligatures. Depending on the color and texture of the paper, it is not always easy for the computer to separate the print from the background. And if strokes are very thick, the machine sees symbols running into each other (‘rn’ becoming ‘m’); if they are very thin, it sees symbols breaking apart (‘h’ becoming ‘li’).

In the digital realm, the physical text appears “overfull”. I aim for the overfull-text, some kind of noise literature in the computer-age. To preserve and to sculpt the smudge.

Early Character Recognition Techniques

Main process: The source, typically symbols imprinted on paper, is positioned and illuminated. The source pattern becomes “input” through optical scanning. The input pattern is superimposed on a reference pattern. A match between input and reference pattern triggers an impulse.

Mask Matching

An input character is matched with a template, stencil or mask. The reference patterns are negatives of the characters; they are, for example, stenciled on a rotating disk. An input character is projected onto the disk. Each time it fully matches a mask, the light reaching a photo-cell is extinguished, and the photo-cell turns this impulse into signals for further processing. The principle was first described around 1930 for application in statistical machines and in aid devices for blind readers. [3]

Three squares. First square with the letter A (in black), second with the letter B (in white) overlapping A and third square with B as a stencil on top of A.
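A minimal sketch of the idea in Python, with invented 5x5 binary patterns (1 = ink) standing in for the optical masks; nothing here corresponds to a real device:

```python
# Invented 5x5 reference masks, one per character (1 = ink).
MASKS = {
    "T": [[1,1,1,1,1],
          [0,0,1,0,0],
          [0,0,1,0,0],
          [0,0,1,0,0],
          [0,0,1,0,0]],
    "L": [[1,0,0,0,0],
          [1,0,0,0,0],
          [1,0,0,0,0],
          [1,0,0,0,0],
          [1,1,1,1,1]],
}

def match_mask(pattern):
    """Superimpose the input on every mask and return the best match.
    Zero mismatching cells is a full match, the situation in which the
    optical device would extinguish the light on the photo-cell."""
    def mismatches(mask):
        return sum(p != m for prow, mrow in zip(pattern, mask)
                          for p, m in zip(prow, mrow))
    best = min(MASKS, key=lambda c: mismatches(MASKS[c]))
    return best, mismatches(MASKS[best])

print(match_mask(MASKS["T"]))   # ('T', 0)
```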

Peephole Matching

Instead of the full character, selected sub-areas of the reference pattern are used as a mask for matching.

One square
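The same toy setup can be reduced to a few probe cells; the peephole positions and patterns below are invented for illustration:

```python
# Peephole matching: compare only selected cells per reference character.
# (row, col): expected value at that peephole.
PEEPHOLES = {
    "T": {(0, 0): 1, (0, 4): 1, (2, 2): 1, (4, 0): 0},
    "L": {(0, 0): 1, (0, 4): 0, (4, 4): 1, (2, 2): 0},
}

def match_peepholes(grid):
    """Return the first character whose peepholes all agree with the input."""
    for char, probes in PEEPHOLES.items():
        if all(grid[r][c] == v for (r, c), v in probes.items()):
            return char
    return None

T = [[1,1,1,1,1],
     [0,0,1,0,0],
     [0,0,1,0,0],
     [0,0,1,0,0],
     [0,0,1,0,0]]

print(match_peepholes(T))   # 'T'
```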

Coordinate Matching

Characters are described in a coordinate grid. The encoding results in a string or sequence of binary symbols. Each bit position represents a cell on the grid. The pattern of a character is quantized, sometimes expressed in a probability matrix. Hardware in the 1960s: A reading matrix consisting of photo-cells that activate a shift register. [4]

Two squares
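A sketch of the encoding step, assuming an invented 3x3 grid; each cell becomes one bit position, and inputs are compared by counting differing bits:

```python
def encode(grid):
    """Flatten a binary grid row by row into a bit string."""
    return "".join(str(cell) for row in grid for cell in row)

# Hypothetical reference patterns on a 3x3 grid (1 = ink).
REFERENCES = {
    "L": encode([[1, 0, 0],
                 [1, 0, 0],
                 [1, 1, 1]]),
    "T": encode([[1, 1, 1],
                 [0, 1, 0],
                 [0, 1, 0]]),
}

def classify(grid):
    """Pick the reference whose bit string differs from the quantized
    input in the fewest positions (Hamming distance)."""
    code = encode(grid)
    return min(REFERENCES,
               key=lambda c: sum(a != b for a, b in zip(code, REFERENCES[c])))

print(classify([[1, 0, 0], [1, 0, 0], [1, 1, 1]]))   # 'L'
```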

Waveform Matching

A character is segmented into vertical sub-areas. Sampling of the input at discrete intervals is necessary. When scanned by a machine, different characters produce dissimilar waveforms as a function of time. The principle was first implemented in magnetic ink character recognition (MICR) devices for banking purposes in the mid-1950s. A similar technique was the basis for the optophone, which introduced a tonal code for letters and numbers. [5]

One square.
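A rough sketch of the idea: reading invented 5x5 patterns column by column yields a one-dimensional “waveform” that differs between characters:

```python
def waveform(grid):
    """Amount of ink per vertical slice, read left to right,
    like a signal unfolding over time."""
    return [sum(row[c] for row in grid) for c in range(len(grid[0]))]

# Invented 5x5 patterns for E and F (1 = ink).
E = [[1,1,1,1,1],
     [1,0,0,0,0],
     [1,1,1,1,0],
     [1,0,0,0,0],
     [1,1,1,1,1]]

F = [[1,1,1,1,1],
     [1,0,0,0,0],
     [1,1,1,1,0],
     [1,0,0,0,0],
     [1,0,0,0,0]]

print(waveform(E))   # [5, 3, 3, 3, 2]
print(waveform(F))   # [5, 2, 2, 2, 1] -- a different waveform tells the two apart
```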

Vector Crossing

A vector space consisting of radial areas is constructed. Conductors are used for ‘sensing’ an intersection of an input with the reference space. The technique was mainly developed for the recognition of handwritten numeric digits (Stylator, Bell Labs, late 1950s). There were also proposals for applying it in television fashion.

One square.
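A sketch with invented 5x5 digit patterns: counting how often a scan line crosses a stroke gives a small signature per character:

```python
def crossings(cells):
    """Number of 0->1 transitions along a sequence of cells (stroke crossings)."""
    return sum(1 for a, b in zip([0] + cells, cells) if a == 0 and b == 1)

def signature(grid):
    """Crossing counts along the middle row and the middle column."""
    mid = len(grid) // 2
    return (crossings(grid[mid]), crossings([row[mid] for row in grid]))

# Invented 5x5 patterns for the digits 0 and 1 (1 = ink).
ZERO = [[0,1,1,1,0],
        [1,0,0,0,1],
        [1,0,0,0,1],
        [1,0,0,0,1],
        [0,1,1,1,0]]

ONE  = [[0,0,1,0,0],
        [0,1,1,0,0],
        [0,0,1,0,0],
        [0,0,1,0,0],
        [0,1,1,1,0]]

print(signature(ZERO))  # (2, 2): the middle scan lines cross two strokes each
print(signature(ONE))   # (1, 1): only one stroke is crossed
```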

Feature analysis

Defining: What is the machine-idea of a certain letter or number? What are its features? One example is character stroke analysis: how many strokes there are, their positions, and their relations to other strokes. A plurality of mask arrays is necessary.

Three squares.
Illustrations adapted from: Mary Elizabeth Stevens, Automatic Character Recognition. A State-of-the-Art Report, Washington: U.S. Dept. of Commerce, Office of Technical Services, 1961, fig. 13, p. 35
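A sketch of the feature idea, with invented patterns and a deliberately crude stroke heuristic (counting near-full rows and columns); real systems used many more features and mask arrays:

```python
def stroke_counts(grid):
    """Count near-full rows and columns as a crude stand-in for
    horizontal and vertical strokes."""
    n = len(grid)
    h = sum(1 for row in grid if sum(row) >= n - 1)
    v = sum(1 for c in range(n) if sum(row[c] for row in grid) >= n - 1)
    return (h, v)

# Invented feature table; note that L and T share the same crude features
# and would need further features (stroke positions, relations) to be separated.
FEATURES = {"L": (1, 1), "T": (1, 1), "H": (1, 2), "I": (0, 1)}

def candidates(grid):
    counts = stroke_counts(grid)
    return [char for char, feat in FEATURES.items() if feat == counts]

H = [[1,0,0,0,1],
     [1,0,0,0,1],
     [1,1,1,1,1],
     [1,0,0,0,1],
     [1,0,0,0,1]]

print(stroke_counts(H))  # (1, 2): one horizontal and two vertical strokes
print(candidates(H))     # ['H']
```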

Links: Early Reading Machines

Common Confusion

Without a given context, going through a text letter by letter, some characters are often mistaken for one another, in particular when diacritical marks or ambiguous characters are present. Scripts with many ligatures and overlaps, like Arabic and Devanagari, need an overall “segmentation-free” approach. [6]

A ‘Buch’ (English ‘book’) is commonly mistaken for a ‘Bucli’. Of course, you will find hundreds of Buclis in German books with a Google Books search. “Dafleihe Bucli kloss deutsch”, “dass das betreffende Bucli an sicli unsittliclie und unelirbare Dinge entlalte”.

Digital poetry is Nuttekaktersm, says German poet Dagmara Kraus. “Deutsche Classiker des Nuttekaktersm” became the title of a retro-digitized work by Franz Pfeiffer, a work about the literary classics of the Middle Ages. ‘Mittelalters’ mistaken for ‘Nuttekaktersm’ is not a common confusion, but it has been influential enough that physical reprints of the work now bear the overridden title. [7]

Machine-Usable Fonts

Illustration showing how the letters A and B are constructed inside a grid.
Figure adapted from: Mary Elizabeth Stevens, Automatic Character Recognition. A State-of-the-Art Report, Washington: U.S. Dept. of Commerce, Office of Technical Services, 1961, fig. 8, p. 29

Humans are able to identify letters and numbers in thousands of type styles. But machines need to be separately trained on certain sets of styles and their specific stroke widths. Companies designed their own type styles for automatic recognition by a machine, with characters internally coded or bit-mapped. The most common way to achieve this encoding was to construct every character in a 5x7 (or 5x9) grid, with uniform width, minimal serifs and strong right edges. A single type font was standardized by the U.S. Bureau of Standards in 1968: OCR-A. Some years later in Europe, Adrian Frutiger designed OCR-B (applying a smaller grid), which became an ISO standard in 1973.
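A sketch of the 5x7 encoding idea with a generic dot-matrix ‘A’ (not the actual OCR-A glyph, just an illustration of how a grid character can be stored and rendered):

```python
# A generic letter A on a 5x7 grid; each row is five bits, so one byte per row.
A = [
    "01110",
    "10001",
    "10001",
    "11111",
    "10001",
    "10001",
    "10001",
]

encoded = bytes(int(row, 2) for row in A)   # the whole character fits in 7 bytes
print(list(encoded))                        # [14, 17, 17, 31, 17, 17, 17]

for row in A:                               # render the bitmap
    print(row.replace("1", "█").replace("0", " "))
```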

Numbers 1 2 3 4 5 6 7 8 9 and 0 typed in eight different fonts.
Fonts for machine reading by Farrington, IBM, NCR, RCA, Remington Rand, Burroughs and General Electric. Figure from: Optical recognition—the breakthrough is here, in: Datamation, Vol. 7, 03/1961, p. 23

State-Of-The-Art 2020s

In the 2020s, Optical Character Recognition systems leave fewer ambiguities. They detect and classify noise and are able to normalize curved baselines. With the use of lexica the system will choose a common letter n-gram (‘ing’) over an unusual one (‘lng’). And so-called whitelists give the option to filter or recognize only a defined list of characters. An all-in-one solution: the ScanTent, for scanning on the go with a smartphone.
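A hedged example of a whitelist with pytesseract, a Python wrapper for Tesseract; the input file name is hypothetical, and whitelist behavior depends on the Tesseract version and engine mode:

```python
import pytesseract
from PIL import Image

# Hypothetical scan; restrict recognition to digits and separators so that
# stray marks cannot be read as letters.
img = Image.open("receipt.png")
text = pytesseract.image_to_string(
    img,
    config="-c tessedit_char_whitelist=0123456789.,",
)
print(text)
```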

Many engines use neural networks (DNN, CNN or LSTM) to apply OCR to material with specific characteristics; kraken, for example, is optimized for historical and non-Latin scripts. Calamari, which is based on kraken and OCRopy, takes line images instead of segmented individual characters and preprocesses the images to a standardized height. Since version 4, Tesseract can run in Legacy (character patterns) or LSTM (line recognition) mode, and with a JavaScript library it can run on a website.
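A sketch of switching Tesseract’s engine mode from Python via the --oem flag (0 = legacy character patterns, 1 = LSTM line recognition); the file name is hypothetical, and legacy mode only works with traineddata that still contains the legacy model:

```python
import pytesseract
from PIL import Image

img = Image.open("page.png")                                  # hypothetical scan
legacy = pytesseract.image_to_string(img, config="--oem 0")   # legacy engine
lstm = pytesseract.image_to_string(img, config="--oem 1")     # LSTM engine
```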

Artworks with OCR


03/2023