Abstract by Joseph Clark
Automatic Semantic Type Detection Through Natural Language Processing
Correctly identifying the semantic types of data is essential in automated machine learning (AutoML) for building robust machine learning models. Manual profiling is undesirable within AutoML and can be expensive and inaccurate. Most existing profiling tools rely on regular expression matching and lookup tables, while the most recent state-of-the-art profiling techniques are beginning to use deep learning. We explore natural language processing methods, including word and sentence embeddings, to perform semantic profiling. We evaluate our methodology on a collection of hundreds of datasets, comprising tens of thousands of columns, whose types were annotated by MIT Lincoln Labs. Our results show that each technique has its own advantages and disadvantages, which suggests directions for future research.