By all these lovely tokens... Merging conflicting tokenizations

Publication

Authors: Christian Chiarcos, Julia Ritz, Manfred Stede

Year: 2012

Journal: Language Resources and Evaluation

Abstract

Given the contemporary trend toward modular NLP architectures and multiple annotation frameworks, the existence of concurrent tokenizations of the same text represents a pervasive problem in everyday NLP practice and poses a non-trivial theoretical problem for the integration of linguistic annotations and their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format, and discusses the consequences from a corpus-linguistic perspective.
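
To illustrate the kind of problem the abstract refers to, the following is a minimal sketch of one way to reconcile two tokenizations of the same text. It is not the paper's standoff XML format or its actual procedure. It assumes that each tokenization is given as a list of character-offset spans, and the names `merge_tokenizations` and `align` are hypothetical. The idea is to split the text at every boundary that occurs in either tokenization, so that each original token can be described as a sequence of shared minimal segments.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def merge_tokenizations(tok_a: List[Span], tok_b: List[Span]) -> List[Span]:
    """Return the minimal segments induced by all token boundaries.

    Hypothetical helper: a segment is kept only if it lies inside at
    least one token, so untokenized gaps (e.g. whitespace) are dropped.
    """
    all_tokens = tok_a + tok_b
    boundaries = sorted({offset for span in all_tokens for offset in span})
    segments = []
    for start, end in zip(boundaries, boundaries[1:]):
        if any(ts <= start and end <= te for ts, te in all_tokens):
            segments.append((start, end))
    return segments


def align(tokens: List[Span], segments: List[Span]) -> List[List[int]]:
    """For each token, list the indices of the segments it covers."""
    return [
        [i for i, (s, e) in enumerate(segments) if ts <= s and e <= te]
        for ts, te in tokens
    ]


text = "don't stop"
tok_a = [(0, 5), (6, 10)]          # "don't", "stop"
tok_b = [(0, 2), (2, 5), (6, 10)]  # "do", "n't", "stop"

segments = merge_tokenizations(tok_a, tok_b)
pieces = [text[s:e] for s, e in segments]
```

In this example, `pieces` is `['do', "n't", 'stop']`: the first token of `tok_a` maps to the first two segments, while every token of `tok_b` maps to exactly one segment. Annotations attached to either tokenization can then be expressed over the shared segments without altering the text itself, which is the general motivation for standoff representations.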