DOCUMENTATION






The publications & talks tab lists presentations and papers on GrETEL demonstrations, as well as linguistic case studies that use the tools provided on this website. If you have used GrETEL in your own research and would like to share your presentations or publications on this website, contact us at gretel@ccl.kuleuven.be.

PLEASE CITE!


If you use GrETEL for your research, please cite the relevant paper(s) in your publications:


General GrETEL paper

Liesbeth Augustinus, Vincent Vandeghinste, Ineke Schuurman and Frank Van Eynde (accepted). "GrETEL. A tool for example-based treebank mining." In: Jan Odijk and Arjan van Hessen (eds.) CLARIN in the Low Countries. Ubiquity Press.

GrETEL for Dutch (initial version)

Liesbeth Augustinus, Vincent Vandeghinste, and Frank Van Eynde. (2012). "Example-Based Treebank Querying." In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC-2012). Istanbul, Turkey. pp. 3161-3167.

GrETEL for Afrikaans

Liesbeth Augustinus, Peter Dirix, Daniel van Niekerk, Ineke Schuurman, Vincent Vandeghinste, Frank Van Eynde, and Gerhard van Huyssteen (2016). "AfriBooms: An Online Treebank for Afrikaans." In: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia. pp. 677-682.

Poly-GrETEL

Liesbeth Augustinus, Vincent Vandeghinste, and Tom Vanallemeersch (2016). "Poly-GrETEL: Cross-Lingual Example-based Querying of Syntactic Constructions." In: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia. pp. 3549-3554.

Do not forget to include a reference to the treebank(s) you have been using as well!

TREEBANKS IN GrETEL

Dutch

GrETEL currently provides access to three Dutch treebanks: Lassy Small, CGN, and SoNaR.


Lassy Small

What? Lassy Small was the first corpus to be supported in GrETEL. It is a one-million-word corpus of written data. All of its annotations have been manually checked and verified.


Cite Gertjan van Noord, Gosse Bouma, Frank Van Eynde, Daniël de Kok, Jelmer van der Linde, Ineke Schuurman, Erik Tjong Kim Sang, and Vincent Vandeghinste. (2013). "Large Scale Syntactic Annotation of Written Dutch: Lassy." In: Peter Spyns and Jan Odijk (eds), Essential Speech and Language Technology for Dutch. Results by the STEVIN programme, pp. 147-164. Springer.


Manual syntactic annotation Gertjan van Noord, Ineke Schuurman, and Gosse Bouma. (2011). "Lassy Syntactische Annotatie"

Manual word labels Frank Van Eynde (2005). "Part of Speech Tagging en Lemmatisering van het D-Coi Corpus" (slightly extended version of CGN tag set)

Linguistic annotations in Lassy Small (+ frequencies)

Lassy project website


CGN Treebank

What? The Corpus Gesproken Nederlands, or CGN for short, is a corpus of ten million words that consists of transcribed Dutch speech. About one million words are enriched with syntactic annotations; this part of the corpus is known as the CGN Treebank. All the provided annotations have been manually checked and verified.

Technical detail: To include the CGN Treebank in GrETEL, the official version (Tiger-XML) has been converted to Alpino-XML format by Gertjan van Noord. Note also that the Alpino parser was used as a first annotation step for Lassy, but not for the CGN Treebank. Although the annotations of both treebanks are largely the same, there are some differences. As a result, when you query CGN in the example-based search mode of GrETEL, the parse of your input construction may be annotated slightly differently from similar constructions in the treebank. This issue has not yet been resolved.
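To make the annotation mismatch described above concrete, here is a minimal sketch in Python. It is not GrETEL's own code. The two XML fragments are hand-written toy trees in the general style of Alpino-XML (nested node elements with rel, cat, pt, and word attributes), and the difference between them (the determiner labelled det in one and mod in the other) is invented purely for illustration; it is not a documented difference between Lassy and CGN. The function name count_det_in_np is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy tree A: hand-written, Alpino-XML-like. Not taken from any real treebank.
TREE_A = """
<alpino_ds>
  <node rel="top" cat="top">
    <node rel="su" cat="np">
      <node rel="det" pt="lid" word="de"/>
      <node rel="hd" pt="n" word="kat"/>
    </node>
  </node>
</alpino_ds>
"""

# Toy tree B: same words, but the determiner carries a different
# relation label. This difference is an assumption made for the example.
TREE_B = """
<alpino_ds>
  <node rel="top" cat="top">
    <node rel="su" cat="np">
      <node rel="mod" pt="lid" word="de"/>
      <node rel="hd" pt="n" word="kat"/>
    </node>
  </node>
</alpino_ds>
"""


def count_det_in_np(xml_text):
    """Count determiner nodes that are direct children of an NP node."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a limited XPath subset, which is enough here.
    return len(root.findall(".//node[@cat='np']/node[@rel='det']"))


hits_a = count_det_in_np(TREE_A)
hits_b = count_det_in_np(TREE_B)
print(hits_a, hits_b)
```

A query built from a parse that follows the conventions of tree A finds the construction in tree A but misses the same words in tree B, which is the kind of effect small annotation differences can have on example-based search.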


Cite Ton van der Wouden, Heleen Hoekstra, Michael Moortgat, Bram Renmans, and Ineke Schuurman. (2002). "Syntactic Analysis in the Spoken Dutch Corpus (CGN)." In: Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC-2002), pp. 768-773. Las Palmas.


Manual syntactic annotation Heleen Hoekstra, Michael Moortgat, Bram Renmans, Machteld Schouppe, Ineke Schuurman, and Ton van der Wouden (2003). "CGN Syntactische Annotatie"

Manual word labels Frank Van Eynde (2004). "Part of Speech Tagging en Lemmatisering van het Corpus Gesproken Nederlands" [English version]

Linguistic annotations in CGN (+ frequencies)

CGN project website


SoNaR Treebank

What? The SoNaR Treebank differs from Lassy Small and the CGN Treebank in that it is much larger. SoNaR-500 is a corpus that consists of 25 components of written data, amounting to 500 million words. It has been parsed with the Alpino parser, resulting in the SoNaR Treebank, which in turn is part of the Lassy Large treebank. Because of its size, the syntactic annotations have not been manually verified.


Cite Nelleke Oostdijk, Martin Reynaert, Véronique Hoste, and Ineke Schuurman. (2013). "The Construction of a 500-Million-Word Reference Corpus of Contemporary Written Dutch." In: Peter Spyns and Jan Odijk (eds), Essential Speech and Language Technology for Dutch. Results by the STEVIN programme, pp. 219-247. Springer.

SoNaR project website