

Statistical analysis applied to language

The Secret Life of Pronouns: What Our Words Say About Us - James W. Pennebaker

Great book! The author is a psychology professor who has pioneered the use of computer-based word counting programs. If you have enough text from an individual, you can dump it into a computer program that counts up all the words and generates a statistically valid profile of the author. After analyzing enough of these profiles, Prof. Pennebaker has uncovered some interesting patterns. It turns out that the frequency of pronouns, as well as the choice of pronouns, can be correlated with the relative social status of the author and the intended reader, and often also reflects the degree to which the author is trying to be frank and forthright, or obfuscating and deceptive. Likewise, the frequency of articles like "a" and "the" bears a relationship to how analytic a thinker the writer is.
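To make the word-counting idea concrete, here is a minimal Python sketch of the general approach: tally how often words from a category show up, as a percentage of all words in the text. The word lists and the `category_rates` function are my own illustrative assumptions, not the actual L.I.W.C. dictionaries or code, which are far larger and more carefully built.

```python
import re
from collections import Counter

# Illustrative word lists only. These are assumptions for the sketch,
# not the real L.I.W.C. category dictionaries.
PRONOUNS = {
    "i", "me", "my", "mine", "we", "us", "our", "you", "your",
    "he", "him", "his", "she", "her", "they", "them", "their", "it", "its",
}
ARTICLES = {"a", "an", "the"}


def category_rates(text):
    """Return pronoun and article usage as a percentage of all words."""
    # Lowercase the text and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {"pronouns": 0.0, "articles": 0.0}
    counts = Counter(words)
    pronoun_hits = sum(counts[w] for w in PRONOUNS)
    article_hits = sum(counts[w] for w in ARTICLES)
    return {
        "pronouns": 100.0 * pronoun_hits / total,
        "articles": 100.0 * article_hits / total,
    }


if __name__ == "__main__":
    sample = "I think the cat saw me and a dog"
    print(category_rates(sample))
```

A real profile would need a lot more text than one sentence, and many more categories, before the percentages mean anything.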


The most fun part of the book is the analysis of historical texts and the ways the computer program (called L.I.W.C., for Linguistic Inquiry and Word Count, or affectionately "Luke" in Pennebaker's lab) has been used to establish authorship of texts. The good Professor fed Luke the entire catalogue of Beatles lyrics, and can now convincingly demonstrate "artistic synergy", showing that songs McCartney and Lennon collaborated on rate far higher on objective criteria for creativity, expressiveness, and novel word arrangements.


Other fun projects:  


1. After extensive analysis of Australian explorer Henry Hellyer's personal journal, Luke can say with a high degree of certainty that the man's mysterious death was very likely a suicide... his pronoun usage profile over the last year of his life shows a steady trend of worsening mental illness.


2. Luke reveals with high confidence that Alexander Hamilton was the true author of a series of anonymous essays urging ratification of the U.S. Constitution (the Federalist Papers), which were published under the pen name "Publius".


3. Twins separated at birth have strikingly similar L.I.W.C. profiles. (You just knew that was coming, didn't you?)


4. Data from the blogs and emails of thousands of people, covering 2 months before and 2 months after 9/11, show L.I.W.C. profile changes across the entire population: trauma, depression, and a gradual return towards baseline. It's no surprise... we didn't need analysis to show that 9/11 was traumatic, but it is a striking illustration of the validity of L.I.W.C. as a tool.


5. Analysis of George W. Bush's speeches, interviews, press conferences, etc., during the run-up to the March 2003 invasion of Iraq. In all these communications, Bush expressed to the public a high degree of certainty that pre-emptive war was necessary to prevent Saddam Hussein from developing or using weapons of mass destruction. L.I.W.C. analysis, however, tells a very different story... his pronoun profile betrays misgivings and indirectness. I'm hoping the L.I.W.C. analysis can be entered as evidence at the Nuremberg-type trials I'm still hoping for.


6. Analysis of other things, which aren't as powerful as pronoun analysis but are also fun to think about: some people's use of punctuation is so idiosyncratic that it can almost be used as a fingerprint to identify their authorship; the ways people's L.I.W.C. profiles predictably change with age; how L.I.W.C. analysis of college admission essays can predict who will do well in college, but cannot predict who will do well in life after college; how L.I.W.C. analysis might suggest ways to help keep released prisoners from returning to lives of crime.