text mining and unsupervised machine learning
Unstructured Text Data Mining & Topic Modeling
Business Intelligence, Entity Recognition and LDA Topic Modeling
Text analytics is a proven way to extract relevant information and knowledge hidden in unstructured content. Extracting business intelligence from a large set of unstructured data, and characterizing its content, is a common problem in real-life data mining use cases. Applied effectively to a corpus, text analytics surfaces patterns and trends that would otherwise stay buried in the raw text.
Here, we will experiment with news articles, focusing on the named entities they mention, using the Natural Language Toolkit (NLTK), a widely used Python library for NLP. We will then use the Latent Dirichlet Allocation (LDA) algorithm for topic modeling. LDA is a generative probabilistic topic model: a statistical algorithm that analyzes the words in the original text documents to uncover the thematic structure of both the corpus and the individual documents themselves.
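To make the word "generative" concrete, here is a minimal sketch of the story LDA assumes about how a document comes to exist: draw a topic mixture for the document, then, for each word position, pick a topic from that mixture and a word from that topic. It uses only the Python standard library and does not fit a model. The two topics, their word probabilities, the `alpha` values and the helper names (`sample_dirichlet`, `generate_document`) are all hypothetical, chosen for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one toy document following the generative story LDA assumes."""
    # 1. Draw the document's topic mixture (theta) from a Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    topic_names = list(topics)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to theta.
        topic = rng.choices(topic_names, weights=theta, k=1)[0]
        # 3. Pick a word from that topic's distribution over the vocabulary.
        vocab = list(topics[topic])
        weights = list(topics[topic].values())
        words.append(rng.choices(vocab, weights=weights, k=1)[0])
    return theta, words


# Hypothetical topics: each maps words to probabilities that sum to 1.
TOPICS = {
    "politics": {"election": 0.5, "senate": 0.3, "market": 0.2},
    "finance": {"market": 0.5, "stocks": 0.4, "election": 0.1},
}

rng = random.Random(0)  # fixed seed so the toy run is repeatable
theta, words = generate_document(TOPICS, alpha=[0.5, 0.5], length=8, rng=rng)
print("topic mixture:", [round(p, 3) for p in theta])
print("document:", " ".join(words))
```

Fitting LDA runs this story in reverse: given only the observed words, it infers the topic mixtures and the per-topic word distributions that most plausibly produced them.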
Each topic in a set of text documents can be modeled as a probability distribution over the vocabulary; a model built from such topics is called a topic model.
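Written out, the definition above looks roughly as follows. The notation is an assumption introduced here for illustration: $V$ is the vocabulary size, $K$ the number of topics, $\phi_k$ the word distribution of topic $k$, and $\theta_d$ the topic mixture of document $d$.

```latex
\[
\phi_k = (\phi_{k,1}, \dots, \phi_{k,V}), \qquad
\phi_{k,w} \ge 0, \qquad
\sum_{w=1}^{V} \phi_{k,w} = 1
\]
\[
p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\, \phi_{k,w}, \qquad
\theta_{d,k} \ge 0, \qquad
\sum_{k=1}^{K} \theta_{d,k} = 1
\]
```

The first line says a topic assigns a probability to every word in the vocabulary. The second says the chance of seeing word $w$ in document $d$ is a blend of the topics, weighted by how much of each topic the document contains.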