September 25, 2012¶

Event: http://js.la/
- http://jobs.js.la
Hosted by http://www.crosscamp.us/
- https://twitter.com/crosscampusla
Sponsors
- HTML5 Dev Conf
- EdgeCast

Text Mining¶

Speaker: Danny Tran
- Works for Walt Disney
- https://twitter.com/digi_danny
- https://github.com/digidanny
- http://www.dannytran.me/

Note

very good speaker!

80% of all data is unstructured text¶

Sentiment analysis is how you do mining
Understanding huge amounts of Twitter data
Wanted to figure out how to investigate this via node.js
- https://github.com/NaturalNode/natural
But could have used python:
- http://pypi.python.org/pypi/nltk/2.0.3
- Which would have been many times faster than node.js if used with numpy

Types of Sentiment¶

Polarity: Positive or negative
Continuous: -10 to +10
Categorical: Happy, sad, angry, frustrated

Assumptions¶

Bag of Words
English

Techniques¶

Semantic Summation
Machine learning classification

Pre-processing¶

Unrepeat (His description of long terms like sweeeeeet)
Entity Extraction (Pulling out nouns/entities from text have to not skew the results. For example: Fight against evil can make a positive description sound bad)
Tokenization
- Problem: Abbreviations
- Problem: Emoticons
Spell correct http://norvig.com/spell-correct.html
Word stopping
Limitization
- Reduce word to it’s root form
- Result is a word
- Running => Run
Stemming
- Reduce word to a stem
- May not be a word
- Carry => Cari

N-Grams¶

TODO: Get definitions of this.

Naive Bayes Classifer¶

TODO: Show javascript example

Note

the data coming needs to train this algorithm

Getting training data¶

sentiment140.com
Use emoticons to pre-classify text

Test!¶

Against 70k+ records from sentiment140.com
Humans only agree 70-80% of the time so don’t expect perfection

Read the Docs v: latest

Versions: latest

Downloads: pdf; htmlzip; epub

On Read the Docs: Project Home; Builds

Free document hosting provided by Read the Docs.