Thursday, July 28, 2011

There's no hiding from ngram analysis

It's not often that a corpus linguistics paper makes it to Slashdot. A notable exception today came in the form of Burger et al.'s paper on Discriminating Gender on Twitter. Using surprisingly simple techniques, they present data suggesting that an algorithm can determine a tweeter's gender more reliably than human judges.
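For readers unfamiliar with the term, here is a rough idea of what ngram analysis involves. This is a minimal sketch that assumes character ngrams are counted as features; the function name `char_ngrams` and its parameters are mine for illustration, not taken from the paper, and a real system would feed counts like these into a classifier.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count every character ngram of length n_min..n_max in text.

    Hypothetical helper for illustration only; not the paper's feature set.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text, one character at a time.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Bigram counts for a short made-up string.
features = char_ngrams("abab", n_min=2, n_max=2)
```

Each tweet then becomes a bag of such counts, and it is the relative frequency of particular ngrams across many tweets that a classifier can pick up on.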

As far as I can see, their measure of reliability is the percentage of correctly attributed tweets over the data set as a whole. It would be interesting to see more analysis of the kinds of tweets that the system can't judge reliably but human judges can (and vice versa). To what extent do typical tweets fall into polarised categories of "readily categorisable" vs "highly ambiguous", or is there more gradience? It will also be interesting to see what other demographic data can be gleaned using similar techniques.
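To make the distinction concrete, here is a small sketch of the two views: a single accuracy figure over the whole data set, and the same figure broken down by how confident the classifier was. It assumes each prediction comes with a confidence score, which I have not verified against the paper. The function names, the band edges and the toy records are all invented for illustration and are not real results.

```python
from bisect import bisect_right
from collections import defaultdict


def accuracy(records):
    """Fraction of records whose predicted label matches the true label."""
    return sum(1 for true, pred, _ in records if true == pred) / len(records)


def accuracy_by_confidence(records, edges=(0.6, 0.8)):
    """Split records into confidence bands at `edges`.

    Returns {band_index: (accuracy, count)}; band 0 is the least confident.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true, pred, conf in records:
        band = bisect_right(edges, conf)
        totals[band] += 1
        hits[band] += int(true == pred)
    return {band: (hits[band] / totals[band], totals[band]) for band in sorted(totals)}


# Made-up (true label, predicted label, confidence) triples, not real data.
toy = [
    ("F", "F", 0.95), ("M", "M", 0.90),
    ("F", "M", 0.55), ("M", "M", 0.52),
    ("F", "F", 0.70), ("M", "F", 0.65),
]

overall = accuracy(toy)
by_band = accuracy_by_confidence(toy)
```

If accuracy were near perfect in the top band and near chance in the bottom one, that would point to polarised categories; a smooth rise across bands would suggest gradience.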

1 comment:

cody_ray said...

Are there many papers published about the hashtag?