Focused training sets to reduce noise in NER feature models

Amber McKenzie

Feature and context aggregation play a large role in current NER systems, allowing significant opportunities for research into op-timizing these features to cater to different domains. This work strives to reduce the noise introduced into aggregated features from dis-parate and generic training data in order to al-low for contextual features that more closely model the entities in the target data. The pro-posed approach trains models based on only a part of the training set that is more similar to the target domain. To this end, models are trained for an existing NER system using the top documents from the training set that are similar to the target document in order to demonstrate that this technique can be applied to improve any pre-built NER system. Initial results show an improvement over the Illinois tagger with a weighted average F1 score of 91.67 compared to the University of Illinois NE tagger’s score of 91.32. This research serves as a proof-of-concept for future planned work to cluster the training documents to produce a number of more focused models from a given training set, thereby reducing noise and extracting a more representative feature set.

Back to Papers Accepted