September 25, 2012

Text Mining


very good speaker!

Roughly 80% of all data is unstructured text (figure quoted in the talk)

Types of Sentiment

  • Polarity: Positive or negative
  • Continuous: -10 to +10
  • Categorical: Happy, sad, angry, frustrated


  • Bag of Words
  • English


  • Semantic Summation
  • Machine learning classification
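
As a rough sketch of the summation approach, assuming it means adding up per-word scores from a sentiment lexicon: the snippet below treats the text as a bag of words, sums the scores, and collapses the continuous total into a polarity. The lexicon, its scores, and the function names are all made up for illustration and are not from the talk.

```javascript
// Hypothetical mini-lexicon: words and scores are invented for illustration.
const LEXICON = new Map([
  ["good", 3], ["great", 4], ["sweet", 2],
  ["bad", -3], ["awful", -4], ["sad", -2],
]);

// Bag of words: ignore order, sum the score of every known word.
// Unknown words contribute 0.
function summationScore(text, lexicon = LEXICON) {
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  return words.reduce((total, w) => total + (lexicon.get(w) ?? 0), 0);
}

// Collapse the continuous score into a polarity label.
function polarity(score) {
  if (score > 0) return "positive";
  if (score < 0) return "negative";
  return "neutral";
}

console.log(polarity(summationScore("A great talk, not bad at all")));
```

Note that a plain sum ignores negation ("not bad" still counts "bad" as negative), which is one reason the machine learning route is attractive.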


  • Unrepeat (the speaker's term for collapsing elongated words like "sweeeeeet" back to "sweet")

  • Entity Extraction (pull nouns/entities out of the text so they do not skew the results. For example, the phrase "fight against evil" can make a positive description sound bad)

  • Tokenization

    • Problem: Abbreviations
    • Problem: Emoticons
  • Spell correct

  • Word stopping (removing stop words such as "the", "is", "of")

  • Lemmatization

    • Reduce word to its root form
    • Result is a word
    • Running => Run
  • Stemming

    • Reduce word to a stem
    • May not be a word
    • Carry => Carri (Porter stemmer output)
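
A minimal sketch of a few of the cleanup steps above (unrepeat, tokenization, word stopping), chained into one pipeline. The stop-word list, the emoticon pattern, and all function names are assumptions made for illustration; entity extraction, spell correction, abbreviations, and stemming are left out because they need real dictionaries or libraries.

```javascript
// Tiny illustrative stop-word list; real lists are much longer.
const STOP_WORDS = new Set(["a", "an", "the", "is", "was", "it", "this", "of", "and", "to"]);

// "Unrepeat": collapse any character repeated 3+ times down to 2,
// so "sweeeeeet" becomes "sweet". Cases like "sooo" -> "soo" would
// still need the spell-correct step afterwards.
function unrepeat(text) {
  return text.replace(/(.)\1{2,}/g, "$1$1");
}

// Tokenize: try a few common emoticons first so they survive as single
// tokens, then fall back to runs of letters and apostrophes.
function tokenize(text) {
  const pattern = /[:;=][-']?[()DPp]|[a-z']+/gi;
  const tokens = text.match(pattern) || [];
  // Lowercase words only; leave emoticons such as ":-D" untouched.
  return tokens.map((t) => (/^[a-z']+$/i.test(t) ? t.toLowerCase() : t));
}

// Word stopping: drop tokens that carry little sentiment.
function removeStopWords(tokens) {
  return tokens.filter((t) => !STOP_WORDS.has(t));
}

function preprocess(text) {
  return removeStopWords(tokenize(unrepeat(text)));
}

console.log(preprocess("This is sweeeeeet :)"));
```

The order matters here: unrepeat runs before tokenization so that the elongated word is normalized before any dictionary or stop-word lookup sees it.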


TODO: Get definitions of this.

Naive Bayes Classifier

TODO: Show javascript example


The algorithm has to be trained on labeled data before it can classify new text
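
Here is one way a Naive Bayes classifier could look in JavaScript. This is a minimal sketch, not the speaker's code: it assumes a multinomial model with add-one (Laplace) smoothing and a very simple tokenizer, and the class name, method names, and training sentences are all invented for illustration.

```javascript
// Minimal multinomial Naive Bayes with add-one (Laplace) smoothing.
class NaiveBayes {
  constructor() {
    this.docCounts = new Map();  // label -> number of training documents
    this.wordCounts = new Map(); // label -> Map(word -> count)
    this.totalWords = new Map(); // label -> total words seen for that label
    this.vocabulary = new Set(); // every distinct word seen in training
    this.totalDocs = 0;
  }

  static tokenize(text) {
    return text.toLowerCase().match(/[a-z']+/g) || [];
  }

  train(text, label) {
    if (!this.wordCounts.has(label)) {
      this.wordCounts.set(label, new Map());
      this.docCounts.set(label, 0);
      this.totalWords.set(label, 0);
    }
    this.docCounts.set(label, this.docCounts.get(label) + 1);
    this.totalDocs += 1;
    const counts = this.wordCounts.get(label);
    for (const word of NaiveBayes.tokenize(text)) {
      counts.set(word, (counts.get(word) || 0) + 1);
      this.totalWords.set(label, this.totalWords.get(label) + 1);
      this.vocabulary.add(word);
    }
  }

  // Returns the most likely label, or null if nothing has been trained.
  classify(text) {
    const words = NaiveBayes.tokenize(text);
    let best = null;
    let bestScore = -Infinity;
    for (const [label, docs] of this.docCounts) {
      // log P(label) + sum over words of log P(word | label)
      let score = Math.log(docs / this.totalDocs);
      const counts = this.wordCounts.get(label);
      const denom = this.totalWords.get(label) + this.vocabulary.size;
      for (const word of words) {
        score += Math.log(((counts.get(word) || 0) + 1) / denom);
      }
      if (score > bestScore) {
        bestScore = score;
        best = label;
      }
    }
    return best;
  }
}

// Toy training data, invented for this example.
const clf = new NaiveBayes();
clf.train("I love this great talk", "positive");
clf.train("what a sweet happy day", "positive");
clf.train("I hate this awful talk", "negative");
clf.train("what a sad angry day", "negative");
console.log(clf.classify("a great happy talk"));
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a word that was never seen under one label from zeroing out that label entirely.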

Getting training data

  • Use emoticons to pre-classify text


  • Against 70k+ records from
  • Humans only agree 70-80% of the time so don’t expect perfection
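
The emoticon trick for getting training data can be sketched roughly like this: texts containing only happy emoticons are labeled positive, texts containing only sad ones are labeled negative, and everything else is skipped. The emoticon lists and function names are assumptions for illustration, not what the speaker used.

```javascript
// Illustrative emoticon sets; far from exhaustive.
const POSITIVE_EMOTICONS = [":)", ":-)", ":D", ";)"];
const NEGATIVE_EMOTICONS = [":(", ":-(", ":'("];

// Returns "positive", "negative", or null when the text has no emoticon
// or has conflicting ones (such texts are not used for training).
function labelByEmoticon(text) {
  const hasPos = POSITIVE_EMOTICONS.some((e) => text.includes(e));
  const hasNeg = NEGATIVE_EMOTICONS.some((e) => text.includes(e));
  if (hasPos === hasNeg) return null;
  return hasPos ? "positive" : "negative";
}

// Remove the emoticons so a classifier learns from the words,
// not from the very signal that produced the label.
function stripEmoticons(text) {
  let out = text;
  for (const e of [...POSITIVE_EMOTICONS, ...NEGATIVE_EMOTICONS]) {
    out = out.split(e).join("");
  }
  return out.replace(/\s+/g, " ").trim();
}

function buildTrainingSet(texts) {
  const examples = [];
  for (const text of texts) {
    const label = labelByEmoticon(text);
    if (label !== null) examples.push({ text: stripEmoticons(text), label });
  }
  return examples;
}

console.log(buildTrainingSet(["Great talk :)", "No emoticon here", "Train was late :("]));
```

Labels produced this way are noisy (sarcasm, for one), so the human agreement ceiling noted above is a sensible upper bound on what to expect.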