Tyler Schnoebelen, who recently completed his PhD in linguistics at Stanford University, told the NWAV crowd how he and his colleagues plumbed Twitter to create a corpus of more than 9 million tweets from English speakers in the United States. There’s no gender checkbox on Twitter, but by looking at the distribution of given names in census data, they were able to assign genders with a high degree of accuracy. (You’re not likely to find many men going by “Annette” or women named “Eugene.”)
They then looked at which bits of tweeted language skewed male and female. In line with previous research on gender and discourse, women were found to use more pronouns, emotion terms (like “sad,” “love,” and “glad”), and abbreviations associated with online discourse (like “lol” and “omg”). Women also rate highly on the use of emoticons and “backchannel sounds” (like “ah,” “hmmm,” “ugh,” and “grr”).
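For readers curious how name-based labeling of this kind might work, here is a minimal sketch in Python. It is not the researchers' code. The name counts are invented for illustration (they are not census figures), and the function name `assign_gender` and the 95 percent cutoff are assumptions; the article does not say what threshold, if any, the team used.

```python
# Toy counts of people with each given name, split by recorded sex.
# These numbers are invented for illustration; they are NOT census figures.
NAME_COUNTS = {
    "annette": {"F": 990, "M": 10},
    "eugene": {"F": 5, "M": 995},
    "jordan": {"F": 400, "M": 600},
}


def assign_gender(first_name, counts=NAME_COUNTS, threshold=0.95):
    """Return "F" or "M" if the name skews strongly one way, else None.

    The threshold is a hypothetical cutoff chosen for this sketch.
    """
    entry = counts.get(first_name.strip().lower())
    if entry is None:
        return None  # name is not in the table
    total = entry["F"] + entry["M"]
    if total == 0:
        return None
    share_f = entry["F"] / total
    if share_f >= threshold:
        return "F"
    if 1 - share_f >= threshold:
        return "M"
    return None  # too evenly split to label with confidence
```

With these toy counts, "Annette" and "Eugene" each get a label, while a more evenly split name like "Jordan" is left unlabeled rather than guessed at.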
Men, on the other hand, have higher frequencies of standard dictionary words, numbers, proper nouns (especially the names of sports teams), and taboo words. Simply by looking at these different rates of word usage, Schnoebelen and his colleagues, David Bamman of Carnegie Mellon University and Jacob Eisenstein of Georgia Tech, can predict the gender of an author on Twitter with 88 percent accuracy.
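To give a rough sense of how word-usage rates alone can drive a prediction, the sketch below trains a tiny Naive Bayes classifier on word counts. The article does not say which model the researchers used, so Naive Bayes is a stand-in chosen for simplicity. The example tweets are invented, and the function names `train` and `predict` are hypothetical.

```python
import math
from collections import Counter


def train(labeled_texts):
    """Count word occurrences and number of texts for each label."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in labeled_texts:
        word_counts.setdefault(label, Counter()).update(text.lower().split())
        doc_counts[label] += 1
    return word_counts, doc_counts


def predict(text, word_counts, doc_counts):
    """Return the label with the highest Naive Bayes log score."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Start from the log prior for this label.
        score = math.log(doc_counts[label] / total_docs)
        # Add-one smoothing keeps unseen words from zeroing out a label.
        denom = sum(counts.values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # skip words never seen in training
                score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy tweets, loosely echoing the word categories described above.
TOY_TWEETS = [
    ("F", "omg lol so glad love this"),
    ("F", "ugh hmmm sad lol"),
    ("M", "lakers won 3 games"),
    ("M", "yankees lost 2 of 5"),
]
WORD_COUNTS, DOC_COUNTS = train(TOY_TWEETS)
```

Calling `predict("lol so sad", WORD_COUNTS, DOC_COUNTS)` scores the text against each label's word rates and returns the better match. A toy set this small says nothing about real accuracy; the 88 percent figure comes from the researchers' corpus, not from anything resembling this sketch.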
But Schnoebelen, Bamman, and Eisenstein didn’t stop there, even if such a high level of accuracy in pinpointing gender would be good enough for, say, L’Oréal. They wanted to go beyond the standard binary stereotypes of “Men Are from Mars, Women Are from Venus” to understand how “male” and “female” linguistic markers actually work in the world, at least online.