Under the publications & talks tab you can find presentations and papers on GrETEL demonstrations, as well as linguistic case studies that use the tools provided on this website. If you have used GrETEL in your own research and would like to share your presentations or publications on this website, you can contact us at email@example.com.
If you use GrETEL for your research, please cite the relevant paper(s) in your publications:
Do not forget to include a reference to the treebank(s) you have been using as well!
GrETEL currently provides access to three Dutch treebanks: Lassy Small, CGN, and SoNaR.
What? Lassy Small was the first corpus to be supported in GrETEL. It is a one-million-word corpus of written data. All of its annotations have been manually checked and verified.
Cite Gertjan van Noord, Gosse Bouma, Frank Van Eynde, Daniël de Kok, Jelmer van der Linde, Ineke Schuurman, Erik Tjong Kim Sang, and Vincent Vandeghinste. (2013). "Large Scale Syntactic Annotation of Written Dutch: Lassy." In: Peter Spyns and Jan Odijk (eds), Essential Speech and Language Technology for Dutch. Results by the STEVIN programme, pp. 147-164. Springer.
Manual syntactic annotation Gertjan van Noord, Ineke Schuurman, and Gosse Bouma. (2011). "Lassy Syntactische Annotatie"
Manual word labels Frank Van Eynde (2005). "Part of Speech Tagging en Lemmatisering van het D-Coi Corpus" (slightly extended version of CGN tag set)
What? The Corpus Gesproken Nederlands, or CGN for short, is a corpus of ten million words of transcribed Dutch speech. About one million words are enriched with syntactic annotations; this part of the corpus is also known as the CGN Treebank. All of the provided annotations have been manually checked and verified. Technical detail: To include the CGN Treebank in GrETEL, the official version (Tiger-XML) has been converted to the Alpino-XML format by Gertjan van Noord. Note also that for Lassy the Alpino parser was used as a first annotation step, whereas for the CGN Treebank it was not. Although the annotations of the two treebanks are largely the same, there are some differences. If you use the example-based search mode of GrETEL to query CGN, the parse of your input construction may therefore be annotated slightly differently from similar constructions in the treebank. This problem has not yet been solved.
Cite Ton van der Wouden, Heleen Hoekstra, Michael Moortgat, Bram Renmans, and Ineke Schuurman. (2002). "Syntactic Analysis in the Spoken Dutch Corpus (CGN)." In: Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC-2002), pp. 768-773. Las Palmas.
Manual syntactic annotation Heleen Hoekstra, Michael Moortgat, Bram Renmans, Machteld Schouppe, Ineke Schuurman, and Ton van der Wouden. (2003). "CGN Syntactische Annotatie"
Manual word labels Frank Van Eynde (2004). "Part of Speech Tagging en Lemmatisering van het Corpus Gesproken Nederlands" [English version]
What? The SoNaR Treebank differs from Lassy Small and the CGN Treebank in that it is much larger. SoNaR-500 is a corpus of written data in 25 components, amounting to 500 million words. It has been parsed with the Alpino parser, resulting in the SoNaR Treebank, which in turn is part of the Lassy Large treebank. Because of its size, the syntactic annotations have not been manually verified.
Cite Nelleke Oostdijk, Martin Reynaert, Véronique Hoste, and Ineke Schuurman. (2013). "The Construction of a 500-Million-Word Reference Corpus of Contemporary Written Dutch." In: Peter Spyns and Jan Odijk (eds), Essential Speech and Language Technology for Dutch. Results by the STEVIN programme, pp. 219-247. Springer.