Anomaly Detection in Textual Data
Text data is generated from a variety of sources such as social networking sites, newspapers, magazines, and journals. This data contains a large amount of undiscovered information that can be mined using text mining techniques such as clustering and classification. Text mining involves the analysis of unstructured text data to detect both normal trends or patterns and anomalies (such as novel information). Text mining for normal trends and patterns has been extensively researched, but the same is not true for anomalies in text.
Anomaly detection in text involves analyzing the text with the aim of discovering aberrant information, that is, information that does not conform to the general trend in the text under consideration. Most existing techniques rely on a statistical analysis that treats the text as nothing more than a collection of words in order to detect novel information. They ignore the semantic relationships between words, which leaves a number of anomalies undetected. Introducing context information overcomes this drawback.
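As a rough illustration of the purely statistical view described above, the following minimal sketch scores each document by how little its bag of words overlaps with the rest of the collection. The function names, the cosine-based score, and the toy documents are all assumptions made for this example; they are not taken from any particular published technique.

```python
import math
from collections import Counter

def bag_of_words(text):
    # Treat the text as an unordered collection of lowercase tokens.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def anomaly_scores(docs):
    # Score = 1 - similarity between a document and all other documents pooled.
    bags = [bag_of_words(d) for d in docs]
    total = Counter()
    for b in bags:
        total.update(b)
    # Counter subtraction yields the leave-one-out counts for each document.
    return [1.0 - cosine(b, total - b) for b in bags]

# Hypothetical toy collection: three finance sentences and one off-topic one.
docs = [
    "the stock market rose today",
    "the stock market fell today",
    "investors watched the stock market",
    "bake the cake with flour",
]
scores = anomaly_scores(docs)
most_anomalous = scores.index(max(scores))

# A paraphrase that shares no surface words with the collection.
paraphrase_score = anomaly_scores(docs + ["equities climbed this morning"])[-1]
```

The off-topic sentence receives the highest score, as expected. The paraphrase, however, receives the maximum possible score even though it is on topic, because word counts alone cannot tell that "equities" and "stock" are related. This is the kind of error that context information is meant to address.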
Context-based anomaly detection in text reduces the number of false positives, increases confidence in the anomalies that are detected, and also finds anomalies that statistical approaches leave undiscovered. The existing context-based approaches introduce semantic information from an external source in the post-processing step of a statistical technique. Introducing context information at the pre-processing level remains open for future work in this area.
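The post-processing idea can be sketched as a two-step pipeline: a statistical step proposes candidate anomalous terms, and a second step discards candidates that an external semantic resource links to terms already in the collection. In the sketch below, the `RELATED_TERMS` dictionary is a hand-written, hypothetical stand-in for such an external source, and both function names are assumptions made for this example.

```python
from collections import Counter

# Hypothetical stand-in for an external semantic resource such as a thesaurus.
RELATED_TERMS = {
    "equities": {"stock", "shares"},
    "climbed": {"rose", "increased"},
}

def rare_terms(corpus, doc, max_doc_freq=0):
    # Statistical step: terms of `doc` found in at most `max_doc_freq` corpus documents.
    doc_freq = Counter()
    for d in corpus:
        doc_freq.update(set(d.lower().split()))
    tokens = dict.fromkeys(doc.lower().split())  # unique tokens, original order
    return [t for t in tokens if doc_freq[t] <= max_doc_freq]

def filter_with_context(candidates, corpus, related=RELATED_TERMS):
    # Post-processing step: drop candidates that have a related term in the corpus.
    vocabulary = {t for d in corpus for t in d.lower().split()}
    return [t for t in candidates if not (related.get(t, set()) & vocabulary)]

corpus = [
    "the stock market rose today",
    "the stock market fell today",
]
new_doc = "the flour market climbed today"

candidates = rare_terms(corpus, new_doc)            # statistical candidates
anomalies = filter_with_context(candidates, corpus)  # after semantic filtering
```

Here the statistical step flags both "flour" and "climbed" because neither occurs in the collection, while the semantic filter removes "climbed" because the assumed resource relates it to "rose". Only "flour" remains, which shows how context can reduce false positives.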