September 25, 2012

Text Mining

Note

Very good speaker!

Speaker's claim: about 80% of all data is unstructured text

Types of Sentiment

  • Polarity: Positive or negative
  • Continuous: a score on a scale such as -10 to +10
  • Categorical: Happy, sad, angry, frustrated

Assumptions

  • Bag of Words
  • English
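
A minimal sketch of what the bag-of-words assumption means in practice, assuming simple whitespace tokenization (the function name bag_of_words is made up for illustration). Only word counts are kept, so two sentences with the same words in a different order look identical:

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for text, ignoring word order."""
    return Counter(text.lower().split())

a = bag_of_words("the movie was good not bad")
b = bag_of_words("the movie was bad not good")
print(a == b)  # True: word order is discarded
```

This is the main cost of the assumption: negation and phrasing are lost unless something else puts them back.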

Techniques

  • Semantic Summation
  • Machine learning classification
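
A rough sketch of the summation technique, assuming it means adding up per-word sentiment scores from a lexicon (that reading is an assumption, not something the notes spell out). The lexicon below is hypothetical and its scores are made up:

```python
# Hypothetical lexicon: the words and scores are invented for illustration.
LEXICON = {"love": 3, "great": 2, "good": 1, "bad": -1, "awful": -2, "hate": -3}

def summation_score(text):
    """Sum the lexicon scores of the words in text; unknown words count as 0."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

def polarity(text):
    """Map the summed score onto positive / negative / neutral."""
    score = summation_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(summation_score("I love this great phone"))  # 5
print(polarity("awful bad service"))               # negative
```

The summed score gives the continuous type of sentiment, and its sign gives the polarity.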

Pre-processing

  • Unrepeat (the speaker's term for collapsing stretched-out words like "sweeeeeet")

  • Entity Extraction (pull nouns/entities out of the text so they do not skew the results. For example, an entity such as "Fight against evil" can make a positive description sound negative)

  • Tokenization

    • Problem: Abbreviations
    • Problem: Emoticons
  • Spell correction (see http://norvig.com/spell-correct.html)

  • Stop-word removal

  • Lemmatization

    • Reduce a word to its root (dictionary) form
    • Result is a word
    • Running => Run
  • Stemming

    • Reduce word to a stem
    • May not be a word
    • Carry => Carri
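
Three of the steps above (unrepeat, tokenization, and stop-word removal) can be sketched with the standard library alone. Everything here is an assumption made for illustration: the function names, the tiny stop-word list, and the emoticon pattern, which only covers a handful of common forms.

```python
import re

# Illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "was", "this", "that", "of", "to"}

# Matches a few common emoticons first, then runs of letters/apostrophes.
TOKEN_RE = re.compile(r"[:;=][-']?[)(DPp]|[A-Za-z']+")

def unrepeat(text):
    """Collapse runs of 3+ identical characters down to 2."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

def tokenize(text):
    """Split text into word and emoticon tokens; lowercase words only."""
    tokens = TOKEN_RE.findall(text)
    return [t if t[0] in ":;=" else t.lower() for t in tokens]

def preprocess(text):
    """Unrepeat, tokenize, then drop stop words."""
    return [t for t in tokenize(unrepeat(text)) if t not in STOP_WORDS]

print(preprocess("This is sweeeeeet :)"))  # ['sweet', ':)']
```

Collapsing to two characters keeps words like "sweet" and "good" intact but turns "soooo" into "soo", which is one reason a spell-correction step is useful afterwards. Matching emoticons before words is what keeps ":)" from being thrown away as punctuation.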

N-Grams

TODO: Get definitions of this.
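
For reference, an n-gram is a contiguous sequence of n tokens: unigrams are single words, bigrams are pairs of adjacent words, and so on. A minimal sketch (the helper name ngrams is made up here):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is not good".split()
print(ngrams(tokens, 2))
# [('this', 'is'), ('is', 'not'), ('not', 'good')]
```

Bigrams such as ('not', 'good') keep some of the local word order that a plain bag of words throws away.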

Naive Bayes Classifier

TODO: Show JavaScript example

Note

This algorithm needs labeled training data before it can classify anything
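
Until a JavaScript version exists, here is a small sketch in Python of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. It is not the speaker's code; the class name, method names, and the four training sentences are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()                       # every word seen in training

    def train(self, tokens, label):
        """Record one labeled document."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def classify(self, tokens):
        """Return the label with the highest log-probability."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior: how common this label is in the training data.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                # Add-one smoothing so unseen words do not zero out the product.
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.train("i love this phone".split(), "positive")
nb.train("what a great day".split(), "positive")
nb.train("i hate this phone".split(), "negative")
nb.train("what an awful day".split(), "negative")
print(nb.classify("love this great day".split()))  # positive
```

Working in log space avoids multiplying many small probabilities together, which would underflow on longer texts.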

Getting training data

  • sentiment140.com
  • Use emoticons to pre-classify text
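
The emoticon idea can be sketched as follows: treat the emoticon as a noisy label, then strip it from the text. The emoticon sets and the function name are assumptions for illustration, and texts with no emoticon or with conflicting emoticons are skipped.

```python
# Illustrative emoticon sets; real collections are larger.
POSITIVE = {":)", ":-)", ":D", "=)"}
NEGATIVE = {":(", ":-(", "=("}

def label_by_emoticon(text):
    """Return (cleaned_text, label), or None if emoticons are absent or conflicting."""
    tokens = text.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:
        return None  # no signal, or mixed signals
    label = "positive" if has_pos else "negative"
    cleaned = " ".join(t for t in tokens if t not in POSITIVE | NEGATIVE)
    return cleaned, label

print(label_by_emoticon("great game today :)"))
# ('great game today', 'positive')
```

Removing the emoticon from the cleaned text means a classifier trained on it has to learn from the words rather than memorize the emoticon.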

Test!

  • Against 70k+ records from sentiment140.com
  • Humans only agree 70-80% of the time, so don’t expect perfection
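
A minimal sketch of the testing step, assuming the test set is a list of (text, label) pairs and the classifier is any function from text to label (both are assumptions, as is the helper name accuracy):

```python
def accuracy(classify, labeled_examples):
    """Fraction of (text, label) pairs for which classify(text) returns label."""
    examples = list(labeled_examples)
    correct = sum(1 for text, label in examples if classify(text) == label)
    return correct / len(examples)

# Toy check with a classifier that always answers "positive".
test_set = [("love it", "positive"), ("hate it", "negative")]
print(accuracy(lambda text: "positive", test_set))  # 0.5
```

Given the human agreement rate above, an accuracy in the 70-80% range on held-out data is already close to what people manage among themselves.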