September 25, 2012¶
Event: http://js.la/
Hosted by http://www.crosscamp.us/
Sponsors
- HTML5 Dev Conf
- EdgeCast
Text Mining¶
Speaker: Danny Tran
- Works for Walt Disney
- https://twitter.com/digi_danny
- https://github.com/digidanny
- http://www.dannytran.me/
Note
very good speaker!
80% of all data is unstructured text¶
Sentiment analysis is how you do mining
Understanding huge amounts of Twitter data
Wanted to figure out how to investigate this via node.js
But could have used python:
- http://pypi.python.org/pypi/nltk/2.0.3
- Which would have been many times faster than node.js if used with numpy
Types of Sentiment¶
- Polarity: Positive or negative
- Continuous: -10 to +10
- Categorical: Happy, sad, angry, frustrated
Assumptions¶
- Bag of Words
- English
Techniques¶
- Semantic Summation
- Machine learning classification
Pre-processing¶
Unrepeat (His description of long terms like sweeeeeet)
Entity Extraction (Pulling out nouns/entities from text have to not skew the results. For example: Fight against evil can make a positive description sound bad)
Tokenization
- Problem: Abbreviations
- Problem: Emoticons
Spell correct http://norvig.com/spell-correct.html
Word stopping
Limitization
- Reduce word to it’s root form
- Result is a word
- Running => Run
Stemming
- Reduce word to a stem
- May not be a word
- Carry => Cari
N-Grams¶
TODO: Get definitions of this.
Naive Bayes Classifer¶
TODO: Show javascript example
Note
the data coming needs to train this algorithm
Getting training data¶
- sentiment140.com
- Use emoticons to pre-classify text
Test!¶
- Against 70k+ records from sentiment140.com
- Humans only agree 70-80% of the time so don’t expect perfection