Twitter n-gram corpus with demographic metadata

Publication

Authors: Amaç Herdağdelen (OA)

Year: 2013

Journal: Language Resources and Evaluation

Abstract

Social media is a natural laboratory for linguistic and sociological research. On micro-blogging platforms such as Twitter, people share hundreds of millions of short messages about their lives and experiences every day. These messages, coupled with metadata about their authors, provide an opportunity to understand a wide variety of phenomena ranging from political polarization to geographic and demographic lexical variation. The lack of publicly available micro-blogging datasets has been a hindrance to replicable research. In this paper, I introduce the Rovereto Twitter n-gram corpus, a publicly available n-gram dataset of Twitter messages in which the n-grams are tagged with the gender of the author and the time of posting. I compare this dataset to a more traditional web-based corpus and present a case study that shows the potential of combining an n-gram corpus with demographic metadata.

Language: English

Stimuli type: tweets
