BYU

Abstract by Lance Haderlie

Personal Information


Presenter's Name

Lance Haderlie

Degree Level

Undergraduate

Abstract Information


Department

Computer Science

Faculty Advisor

Scott Woodfield

Title

Big Data Complexities in Family History Work

Abstract

Over the past year I have been trying to determine the best way to index, search, and analyze an enormous dataset. We have records for 900 million persons from the FamilySearch database, which we need to be able to iterate over and extract statistics from. I tried MongoDB, SQLite, and MySQL, and eventually settled on PostgreSQL as my database of choice, as none of the others were efficient enough to function as needed. MongoDB has excellent lookup times for a single PID, but iterating through the full dataset would have taken 3 days (not counting any indexing, comparisons, or other calculations); SQLite was not built to handle such large datasets; and MySQL did not have simple interface options. PostgreSQL is fast and efficient, and even with 600 GB of data it can retrieve 500,000 random samples for analysis within seconds. We went from transferring 500 records in 45 minutes to 30 million in 3 hours. We will be using this database to create probabilities for merge analysis going forward.
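
To put the transfer figures above on a common scale, the short sketch below converts each one to records per second and takes the ratio. It assumes both figures describe the same kind of transfer operation, which the abstract does not state explicitly; the function and variable names are illustrative only.

```python
def records_per_second(records: int, minutes: float) -> float:
    """Average throughput for a transfer of `records` rows lasting `minutes`."""
    return records / (minutes * 60)


# Figures quoted in the abstract.
before = records_per_second(500, 45)             # 500 records in 45 minutes
after = records_per_second(30_000_000, 3 * 60)   # 30 million records in 3 hours

# Assumption: the two measurements are directly comparable.
speedup = after / before

print(f"before:  {before:.3f} records/s")
print(f"after:   {after:.0f} records/s")
print(f"speedup: {speedup:,.0f}x")
```

Under that assumption the change works out to roughly 0.19 records per second before and about 2,800 records per second after, an improvement of about 15,000 times.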