Rollenwechsel-English, Version 2 (aka “RW-Eng v2”) Corpus

Overview

RW-Eng v2 is an extension of the original RW-Eng corpus (Sayeed et al., 2018), consisting of 78M sentences from 2.3M documents, coming from the British National Corpus (BNC) and ukWaC. The corpus contains XML-formatted information layers of tokenization, syntactic parses, and semantic role labeling (SRL) tagger output. The latest version (v2) contains additional layers: outputs from more modern, neural-era taggers. These are:

·       Lemmas by a newer morphological analyser: Morfette v0.4.4 (Chrupala et al., 2008; Chrupala, 2011), precompiled by Djamé Seddah, who included a transformed xtag lexicon (Seddah et al., 2013).

·       Syntactic parses by a newer parser: spaCy 2.0.13 (Honnibal and Johnson, 2015; Honnibal and Montani, 2017). We forced spaCy to use our tokenization rather than its own.

·       Semantic frames by a newer PropBank-based SRL tagger: LSGN (He et al., 2018), an end-to-end BiLSTM-based SRL tagger using ELMo embeddings (Peters et al., 2018). It achieves an F1 score of 86% on the CoNLL05 WSJ test set, compared to SENNA’s 75%. Note that our SRL taggers do not rely on syntactic parses.

·       Argument head: For each semantic frame, we aligned the spaCy parses to each argument span in order to find the syntactic head of the span, using a heuristic similar to the one used in the original RW-Eng with the older MaltParser and SENNA.

We align each token in the argument span across all layers (surface word-form, Morfette lemma, spaCy lemma and entity (NER) tag, etc.).
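
The exact head-finding heuristic is not spelled out on this page. As a rough illustration only, the sketch below shows one common way to pick the syntactic head of an argument span from a dependency parse: choose the token whose own head lies outside the span (or that is the sentence root), and break ties by taking the token closest to the root. The function names `span_head` and `depth`, the head-index encoding (the root points to itself, as in spaCy), and the tie-breaking rule are all assumptions made for this example, not the corpus authors' implementation.

```python
from typing import List, Optional


def depth(heads: List[int], i: int) -> int:
    """Number of dependency arcs between token i and the sentence root."""
    d = 0
    # The bound on d guards against malformed (cyclic) head arrays.
    while heads[i] != i and d < len(heads):
        i = heads[i]
        d += 1
    return d


def span_head(heads: List[int], start: int, end: int) -> Optional[int]:
    """Return the index of the head token of the span [start, end).

    heads[i] is the index of token i's syntactic head; the root is its
    own head (an assumed encoding, matching spaCy's convention).
    """
    # Candidates: tokens attached to something outside the span, or the root.
    candidates = [
        i for i in range(start, end)
        if heads[i] == i or not (start <= heads[i] < end)
    ]
    if not candidates:
        return None  # empty span
    # Assumed tie-break: prefer the shallowest token, then the leftmost.
    return min(candidates, key=lambda i: (depth(heads, i), i))


# Toy parse of "The old dog chased a cat" (hand-written, not corpus data).
tokens = ["The", "old", "dog", "chased", "a", "cat"]
heads = [2, 2, 3, 3, 5, 3]

subject_head = span_head(heads, 0, 3)  # span "The old dog"
object_head = span_head(heads, 4, 6)   # span "a cat"
print(tokens[subject_head], tokens[object_head])
```

Once a head index is known, the same index can be used to look up that token's entry in every other layer (word-form, lemma, NER tag), provided all layers share one tokenization.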

License

If you use this corpus, please cite us (see details below).

The corpus contains documents coming from the British National Corpus (BNC) and ukWaC. Therefore, it is distributed under the same license as ukWaC: CC BY-NC-SA 4.0. In summary (which is not a substitute for the license itself): attribute our work; share and adapt it as you like, just not for commercial use; and don’t sue us for anything. The BNC no longer requires a license.

Download

The full corpus is stored in about 3,500 gzipped XML files. Here are a few example files. If you wish to download the full corpus, try here or contact the authors.
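
Since the files are gzipped XML, they can be streamed without unpacking them to disk first. The sketch below is a minimal, schema-agnostic way to inspect one file using only the Python standard library: it counts how often each element tag occurs. The element names in the corpus are not listed on this page, so nothing here assumes them; the function name `count_tags` and the file name in the usage comment are hypothetical, and the sketch assumes each file is a single well-formed XML document.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(path: str) -> Counter:
    """Stream a gzipped XML file and count occurrences of each element tag."""
    counts: Counter = Counter()
    with gzip.open(path, "rb") as fh:
        # iterparse yields each element once it has been fully read ("end").
        for _event, elem in ET.iterparse(fh, events=("end",)):
            counts[elem.tag] += 1
            elem.clear()  # drop the element's contents to keep memory use low
    return counts


# Hypothetical usage; replace the path with a downloaded corpus file:
# print(count_tags("example_file.xml.gz").most_common(10))
```

Listing the most frequent tags in one of the example files is a quick way to see which layers a file contains before writing a full reader.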

Publications

Yuval Marton, Asad Sayeed (2021). Thematic fit bits: Annotation quality and quantity for event participant representation. http://arxiv.org/abs/2105.06097

BibTeX:

@misc{marton-sayeed-2021-RW-eng-v2,
      title={Thematic fit bits: Annotation quality and quantity for event participant representation},
      author={Yuval Marton and Asad Sayeed},
      year={2021},
      eprint={2105.06097},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}