Constructing specialised corpora through analysing domain representativeness of websites

Publication

Authors: Wilson Wong, Wei Liu, Mohammed Bennamoun

Year: 2011

Journal: Language Resources and Evaluation

Abstract

The role of the Web for text corpus construction is becoming increasingly significant. However, the contribution of the Web is largely confined to building a general virtual corpus or low quality specialised corpora. In this paper, we introduce a new technique called SPARTAN for constructing specialised corpora from the Web by systematically analysing website contents. Our evaluations show that the corpora constructed using our technique are independent of the search engines employed. In particular, SPARTAN-derived corpora outperform all corpora based on existing techniques for the task of term recognition.

Cited by: No citing papers are currently in WordNorms.

References

Using internet search engines to estimate word frequency · 2002

A probabilistic framework for automatic term recognition · Intelligent Data Analysis · 2009

Determination of Unithood and Termhood for Term Recognition · IGI Global eBooks · 2009

Handbook of Research on Text and Web Mining Technologies · IGI Global eBooks · 2009

Constructing Web Corpora through Topical Web Partitioning for Term Recognition · Lecture notes in computer science · 2008

A Lightweight and Efficient Tool for Cleaning Web Pages · 2008

StupidOS: A high-precision approach to boilerplate removal · 2007

Corpus Linguistics and the Web · 2007

Googleology is Bad Science · Computational Linguistics · 2007

Tree-Traversing Ant Algorithm for term clustering based on featureless similarities · Data Mining and Knowledge Discovery · 2007

The Google Similarity Distance · IEEE Transactions on Knowledge and Data Engineering · 2007

Domain classification of technical terms using the Web · Systems and Computers in Japan · 2007

WebCorp: an integrated system for web text search · 2007

Building general- and special-purpose corpora by Web crawling · Archivio istituzionale della ricerca (Alma Mater Studiorum Università di Bologna) · 2006

Corpus-Based Language Studies: An Advanced Resource Book · 2006

WebBootCaT. Instant Domain-Specific Corpora to Support Human Translators · 2006

Wacky! Working papers on the Web as Corpus · Institutional Research Information System (Università degli Studi di Trento) · 2006

Web crawling ethics revisited: Cost, privacy, and denial of service · Journal of the American Society for Information Science and Technology · 2006

Web Text Corpus for Natural Language Processing · 2006

Corpora from the Web · 2005

Automatic summarization of voicemail messages using lexical and prosodic features · ACM Transactions on Speech and Language Processing · 2005

Web-based models for natural language processing · ACM Transactions on Speech and Language Processing · 2005

Randomized algorithms and NLP · 2005

A study of using search engine page hits as a proxy for n-gram frequencies · 2005

Extracting knowledge from the World Wide Web · Proceedings of the National Academy of Sciences · 2004

Lexicology and corpus linguistics: an introduction · 2004

BootCaT: Bootstrapping Corpora and Terms from the Web · 2004

Introduction to the Special Issue on the Web as Corpus · Computational Linguistics · 2003

The Web as a Parallel Corpus · Computational Linguistics · 2003

Optimization Models of Sound Systems Using Genetic Algorithms · Computational Linguistics · 2003

GENIA corpus—a semantically annotated corpus for bio-textmining · Bioinformatics · 2003

A large-scale study of the evolution of web pages · 2003

Using the web to overcome data sparseness · 2002

Zipf's law and the Internet · 2002

A Contrastive Approach to Term Extraction · Cineca Institutional Research Information System (Tor Vergata University) · 2001

Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL · Lecture notes in computer science · 2001

A Methodology for Sampling the World Wide Web · Journal of Library Administration · 2001

Data mining and knowledge discovery: making sense out of data · IEEE Expert · 1996
