Text Mining and Unsupervised Machine Learning

Unstructured Text Data Mining & Topic Modeling

Business Intelligence, Entity Recognition and LDA Topic Modeling

Sarit Maitra

Image by author

Text analytics is a proven way to extract relevant information and knowledge hidden in unstructured content. Extracting business intelligence and characterizing the content of a large set of unstructured data is a common problem in real-life data mining use cases. Applied effectively to a corpus, text analytics helps surface the patterns and trends buried in unstructured data.

Here, we will experiment with news articles, focusing on the named entities they mention, using the Natural Language Toolkit (NLTK), a widely used Python library for NLP. We will then use the Latent Dirichlet Allocation (LDA) algorithm for topic modeling. LDA is a generative probabilistic topic model: a statistical algorithm that analyzes the words in the original text documents to uncover the thematic structure of both the corpus and the individual documents themselves.
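
To make the word "generative" concrete, here is a minimal sketch of the process LDA assumes produced each document: draw topic proportions for the document, then for every word position pick a topic and draw a word from it. The toy vocabulary, the two topics and the `ALPHA` value are made up for illustration, and nothing is fitted here. It only simulates the assumed process using the Python standard library.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """Generate one document following the generative process LDA assumes."""
    # 1. Draw the document's topic proportions: theta ~ Dirichlet(alpha).
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to theta.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Pick a word from that topic's distribution over the vocabulary.
        vocab = list(topics[k])
        weights = [topics[k][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words


# Hypothetical topics: each one is a probability distribution over a toy vocabulary.
TOPICS = [
    {"market": 0.5, "shares": 0.3, "bank": 0.2},
    {"election": 0.5, "minister": 0.3, "vote": 0.2},
]
ALPHA = [0.5, 0.5]  # assumed Dirichlet prior; small values favor one dominant topic

rng = random.Random(0)
theta, words = generate_document(TOPICS, ALPHA, n_words=8, rng=rng)
print("topic proportions:", theta)
print("generated words:", words)
```

Fitting LDA runs this story in reverse: given only the observed words, it infers topic proportions and topic-word distributions that plausibly generated them.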

A topic in a set of text documents can be modeled as a probability distribution over the vocabulary; a model that describes documents in terms of such topics is called a topic model.
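
As a small illustration of that definition, the sketch below represents two topics as probability distributions over a four-word vocabulary and mixes them into the word probabilities of a single document. The topic names, the probabilities and the 75/25 mix are invented for the example, not learned from any corpus.

```python
# Hypothetical toy topics: each maps every vocabulary word to a probability.
topics = {
    "finance": {"bank": 0.4, "market": 0.4, "vote": 0.1, "minister": 0.1},
    "politics": {"bank": 0.1, "market": 0.1, "vote": 0.4, "minister": 0.4},
}

# Each topic must be a valid probability distribution over the vocabulary.
for name, dist in topics.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, name


def top_words(dist, n=2):
    """Return the n most probable words of a topic."""
    return sorted(dist, key=dist.get, reverse=True)[:n]


def word_probabilities(topic_mix, topics):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    vocab = next(iter(topics.values())).keys()
    return {w: sum(topic_mix[t] * topics[t][w] for t in topics) for w in vocab}


# Assumed document: 75% about finance, 25% about politics.
doc_mix = {"finance": 0.75, "politics": 0.25}
doc_word_probs = word_probabilities(doc_mix, topics)

print(top_words(topics["finance"]))
print(doc_word_probs)
```

Reading off the most probable words of each distribution is how topics are usually interpreted, and the mixing step shows why a document's word probabilities still sum to one.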
