A 700M+ Arabic corpus: KACST Arabic corpus design and construction

Publication

Authors: Abdulmohsen O. Al-Thubaity

Year: 2015

Journal: Language Resources and Evaluation

Abstract

Compared with English, Arabic is a poorly-resourced language within the field of corpus linguistics. A lack of sufficient data and research has negatively affected Arabic corpus-based researchers and natural language processing practitioners. Although a number of Arabic corpora have been developed in recent years, the overall situation has improved little. The aim of this paper is twofold. First, it reviews 14 Arabic corpora categorized by their designated purpose, target language, mode of text, size, text date, location, text type/medium, text domain, representativeness, and balance. The review also describes the availability of the reviewed corpora, the presence of tokenization, lemmatization and tagging, and whether there are any tools available to search and explore them. Second, it introduces the King Abdulaziz City for Science and Technology (KACST) Arabic corpus, which was designed and created to overcome the limitations of existing Arabic corpora. The KACST Arabic corpus is a large and diverse Arabic corpus with clearly defined design criteria. It is carefully sampled, and its contents are classified based on time, region, medium, domain, and topic, and it can be searched and explored using these classifications. The KACST Arabic corpus comprises more than 700 million words from the pre-Islamic era to the present day (a period covering more than 1,500 years), collected from 10 diverse mediums. Each text has been further classified more specifically into domains and topics. The KACST Arabic corpus is freely available to explore on the Internet (http://www.kacstac.org.sa) using a variety of tools.

Language: Arabic

Stimuli count: 700,000,000

Studying the history of the Arabic language: language technology and a large-scale historical corpus · 2019

Exploring and exploiting a historical corpus for Arabic · 2016

The design of a corpus of Contemporary Arabic · 2006

Supervised collaboration for syntactic annotation of Quranic Arabic · 2013

Journal of Digital Information Management · Journal of Digital Information Management · 2022

Arabic Learner Corpus v1: A New Resource for Arabic Language Research · White Rose Research Online (University of Leeds, The University of Sheffield, University of York) · 2013

Comparative evaluation of text classification techniques using a large diverse Arabic dataset · Language Resources and Evaluation · 2013

New Language Resources for Arabic: Corpus Containing More Than Two Million Words and a Corpus Processing Tool · 2013

KALIMAT a multipurpose Arabic corpus · Lancaster EPrints (Lancaster University) · 2013

Evaluation of Topic Identification Methods on Arabic Corpora · HAL (Le Centre pour la Communication Scientifique Directe) · 2011

Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking · 2008

Corpus Linguistics: A Short Introduction · 2007

Corpus linguistics: investigating language structure and use · 2006

aConCorde: Towards an open-source, extendable concordancer for Arabic · Corpora · 2006

Comparison of Topic Identification methods for Arabic Language · HAL (Le Centre pour la Communication Scientifique Directe) · 2005

Review: Corpus Linguistics. Investigating Language Structure and Use. Cambridge Approaches, by Douglas Biber, Susan Conrad and Randi Reppen. · The Prague Bulletin of Mathematical Linguistics · 2001

Corpus Linguistics: Investigating Language Structure and Use · TESOL Quarterly · 1998

Corpus, Concordance, Collocation · Modern Language Journal · 1994

Representativeness in Corpus Design · Literary and Linguistic Computing · 1993

Corpus Design Criteria · Literary and Linguistic Computing · 1992
