BYU

Abstract by Bruce Stoutenburg

Personal Information


Presenter's Name

Bruce Stoutenburg

Co-Presenters

Isaac Riley

Degree Level

Undergraduate

Co-Authors

None

Abstract Information


Department

Computer Science

Faculty Advisor

Mark Clement

Title

Census Record Auto-Indexing with Optical Character Recognition

Abstract

Optical Character Recognition (OCR) is a field of artificial intelligence that digitizes written text using machine learning, including deep learning. OCR models can be trained through supervised learning, which requires human-provided labels or corrections, or through unsupervised learning, in which algorithms identify patterns in unlabeled data.

We use a mixture of the two methods to accelerate the process of digitizing census records. In the first stage, we apply unsupervised cluster analysis to fields such as age or birth month, which take a finite set of possible values into which handwriting images can be categorized. Once the predictions have been generated, they can be corrected by human indexers (who need not have any specialized training). The corrected predictions then serve as labels for training supervised models, which can be used in auto-indexing and continuously improved as more user-generated labels become available.
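
The three stages described above (cluster, correct, train) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the project: the 2-D feature vectors stand in for features extracted from handwriting images, k-means stands in for the unspecified cluster analysis, a nearest-centroid classifier stands in for the supervised model, and the names `cluster_names` and `corrections` are hypothetical stand-ins for indexer input.

```python
from collections import defaultdict

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: dist2(point, centroids[c]))

def kmeans(points, k, iters=10):
    """Plain k-means with deterministic farthest-first initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    for _ in range(iters):
        assign = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = mean(members)
    return [nearest(p, centroids) for p in points]

def train(points, labels):
    """Supervised stand-in: one centroid per human-confirmed label."""
    groups = defaultdict(list)
    for p, y in zip(points, labels):
        groups[y].append(p)
    return {y: mean(ps) for y, ps in groups.items()}

def classify(model, point):
    """Predict the label whose centroid is closest to the point."""
    return min(model, key=lambda y: dist2(point, model[y]))

# Hypothetical features for seven "birth month" field images.
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),   # written "Jan"
            (5.0, 5.1), (5.2, 4.9), (4.8, 5.0),   # written "Feb"
            (2.0, 2.2)]                           # sloppy "Feb"

# Stage 1: unsupervised clustering into a finite set of groups.
clusters = kmeans(features, k=2)

# Stage 2: an indexer names each cluster, then fixes individual mistakes.
cluster_names = {0: "Jan", 1: "Feb"}              # hypothetical indexer input
predictions = [cluster_names[c] for c in clusters]
corrections = {6: "Feb"}                          # hypothetical correction
labels = [corrections.get(i, p) for i, p in enumerate(predictions)]

# Stage 3: corrected predictions become training labels.
model = train(features, labels)
print(classify(model, (3.0, 3.0)))
```

In this toy data the sloppy image is first grouped with the wrong month, and the single correction shifts the trained centroid enough to change how nearby images are classified. Retraining as more corrections arrive corresponds to calling `train` again on the enlarged label set.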