Abstract by Bruce Stoutenburg
Census Record Auto-Indexing with Optical Character Recognition
Optical Character Recognition (OCR) is a field of artificial intelligence that digitizes written text using machine learning, including deep learning. OCR models can be trained through supervised learning, which requires human-provided labels or corrections, or through unsupervised learning, in which algorithms identify patterns in unlabeled data.
We combine the two methods to accelerate the digitization of census records. In the first stage, we apply unsupervised cluster analysis to fields such as age or birth month, where handwriting images can be categorized into a finite set of possible values. The resulting predictions can then be corrected by human indexers, who need no specialized training. The corrected predictions serve as labels for training supervised models, which can be used for auto-indexing and continuously improved as more user-generated labels become available.
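The two-stage workflow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the method used in this work: the handwriting images are replaced by synthetic 2-D feature points, the clustering step is a plain k-means with farthest-point initialization, the human indexer is simulated from known ground truth, and the supervised model is a simple nearest-centroid classifier. The names `VALUES`, `make_samples`, `cluster`, and `predict` are hypothetical.

```python
import random
from collections import Counter, defaultdict

# Hypothetical stand-in for handwriting features: each field value
# (here, three birth months) is a blob of 2-D points around a center.
VALUES = {"Jan": (0.0, 0.0), "Feb": (10.0, 0.0), "Mar": (0.0, 10.0)}

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def make_samples(n_per_value, rng):
    """Generate shuffled (feature_point, true_value) pairs."""
    samples = []
    for value, (cx, cy) in VALUES.items():
        for _ in range(n_per_value):
            point = (cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
            samples.append((point, value))
    rng.shuffle(samples)
    return samples

def cluster(points, k, iters=10):
    """Stage 1: k-means with farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = mean(members)
    return assign

def predict(model, point):
    """Stage 2 inference: return the label of the nearest class centroid."""
    return min(model, key=lambda label: dist2(point, model[label]))

rng = random.Random(0)
train = make_samples(30, rng)
points = [p for p, _ in train]
truth = [v for _, v in train]

# Stage 1: group the unlabeled samples into as many clusters as field values.
assign = cluster(points, k=len(VALUES))

# Simulated human indexer (assumption): names each cluster by majority
# ground truth, standing in for a person reviewing and correcting predictions.
cluster_label = {}
for c in set(assign):
    votes = Counter(t for t, a in zip(truth, assign) if a == c)
    cluster_label[c] = votes.most_common(1)[0][0]
labels = [cluster_label[a] for a in assign]

# Stage 2: train a supervised nearest-centroid model on the corrected labels.
by_label = defaultdict(list)
for p, label in zip(points, labels):
    by_label[label].append(p)
model = {label: mean(ps) for label, ps in by_label.items()}

# Auto-index held-out samples with the trained model.
held_out = make_samples(10, rng)
accuracy = sum(predict(model, p) == v for p, v in held_out) / len(held_out)
print(f"auto-indexing accuracy on held-out samples: {accuracy:.2f}")
```

In a real setting the synthetic points would be replaced by features extracted from the handwriting images, and the supervised model would be retrained as further corrected labels arrive.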