In this paper we describe the automatic annotation of Early Modern English (EME) corpora with word-class and dependency syntax. Unadapted taggers and parsers perform considerably worse on historical data than on contemporary data. Spelling variation and the resulting tagging errors are at the core of the problem, which we address by adapting the automatic tools.
To handle spelling variants, we train VARD (Baron et al. 2009). To improve tagging performance, we train a part-of-speech tagger. For the syntactic annotation, we use a dependency parser (Schneider 2008). Among other differences, EME displays freer word order than Present-Day English (PDE), as illustrated in 1): fronting (at their leaving Rochel), freer adverb placement (was not then arrived), difficult long-distance dependencies (inherent subject of departing, reference of the relative pronoun which), and even subject-verb inversion (came in a Vessel).
1) Last night came in a Vessel of forty Tuns departing from Rochel fifteen days since which reports that at their leaving Rochel the Duke of Beaufort was not then arrived (ZEN 1671CUI00004)
To improve parser performance, we adapt the hand-written grammar of the parser and integrate additional statistical models. We use the ZEN corpus (Lehmann et al. 2006) and the ARCHER corpus to evaluate the performance of each tool and of our adaptations.
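The idea of normalising spelling before tagging, and of scoring each tool against a gold standard, can be sketched as follows. This is only an illustrative Python sketch and not the VARD implementation or our evaluation code: the variant table, the function names normalise and accuracy, and the sample spellings are all hypothetical, and a tool such as VARD is trained on data rather than relying on a fixed list.

```python
# Illustrative sketch only: a toy lookup-based normaliser and an accuracy
# measure. The variant table below is hypothetical, not taken from VARD
# or from the ZEN corpus.
VARIANTS = {
    "vessell": "vessel",
    "tunns": "tuns",
    "arriv'd": "arrived",
    "reportes": "reports",
}


def normalise(tokens, variants=VARIANTS):
    """Return (original, normalised) pairs for a list of tokens.

    Known historical spellings are replaced by their modern form; the
    original token is kept so that annotation can be mapped back to it.
    """
    pairs = []
    for tok in tokens:
        modern = variants.get(tok.lower())
        if modern is None:
            # Unknown to the table: pass the token through unchanged.
            pairs.append((tok, tok))
            continue
        if tok[0].isupper():
            # Preserve initial capitalisation of the original token.
            modern = modern.capitalize()
        pairs.append((tok, modern))
    return pairs


def accuracy(predicted, gold):
    """Share of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


if __name__ == "__main__":
    sentence = ["Last", "night", "came", "in", "a", "Vessell"]
    print(normalise(sentence))
    print(accuracy(["NN", "VB"], ["NN", "NN"]))
```

Keeping the original token next to its normalised form means that tags and dependency relations assigned to the normalised text can still be reported against the historical spelling.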
Baron, Alistair, Paul Rayson, and Dawn Archer. 2009. “Automatic Standardization of Spelling for Historical Text Mining”. In Proceedings of Digital Humanities 2009, University of Maryland, USA.
Lehmann, Hans Martin, Caren auf dem Keller, and Bernhard Ruef. 2006. “ZEN Corpus 1.0”. In Roberta Facchinetti and Matti Rissanen (eds.), Corpus-based Studies of Diachronic English. 135-155. (Linguistic Insights 31). Bern: Peter Lang.
Schneider, Gerold. 2008. Hybrid Long-Distance Functional Dependency Parsing. Doctoral Thesis. Institute of Computational Linguistics, University of Zurich.