This page gathers all sorts of documentation on GrETEL, such as tutorials, related tools, and frequently asked questions. If you have any more questions, you can always consult the GrETEL project website or you can contact us.

Please cite the following paper if you are using GrETEL for your research.

Example-Based Treebank Querying

Liesbeth Augustinus, Vincent Vandeghinste, and Frank Van Eynde (2012). "Example-Based Treebank Querying". In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC-2012). Istanbul, Turkey. pp. 3161-3167.

Frequently Asked Questions

Why is the output limited to 500 sentences?

GrETEL is free for students and academic research, but the corpora that are accessible via GrETEL are not meant for distribution. In other words, we do not have the rights to give out a corpus as a whole. If a user searched for a structure consisting only of a cat="top" node, which matches every sentence, they could download the whole corpus, which is not the intention of this project. If you would like to obtain the raw corpus data (for academic or commercial use), you should contact the INT.

For whom is GrETEL intended?

GrETEL is designed as a corpus query tool, which means that it is useful for anyone who is interested in searching through the Lassy Small, CGN, or SoNaR treebanks. The tool is especially useful if you want to look for specific linguistic patterns in those corpora.

Where can I find more information about the corpora available in GrETEL?

GrETEL currently provides access to three corpora: Lassy Small, the CGN treebank, and the SoNaR treebank. More information on these corpora is provided on GrETEL's project page.

  • Lassy Small was the first corpus to be supported in GrETEL. It is a one-million-word treebank that consists of written data. All of its annotations have been manually checked and verified.
  • CGN treebank is a treebank of one million words that consists of transcribed Dutch speech. All the provided annotations have been manually checked and verified. CGN stands for "Corpus Gesproken Nederlands" (Spoken Dutch Corpus). The CGN treebank is a syntactically enriched part of the 10-million word CGN corpus.
  • SoNaR treebank is the parsed version of the 500-million word SoNaR-500 corpus. It is a corpus that consists of 25 components of written data. Because of its size, the syntactic annotations have not been manually verified.

How can I contact you?

This website and this tool were developed at the Centre for Computational Linguistics (CCL). If you have any suggestions, questions, or general feedback, you are welcome to give us a ring or send us an email. You can find contact information on CCL's website or in the footer of this website.

Why does the XPath generated for SoNaR have only one leading slash, while the XPath for Lassy Small and CGN has two?

It has to do with how XPath works on the one hand, and how we optimised the SoNaR database on the other. An XPath pattern that begins with a double slash is matched against all descendants of the current node (or of the implied root), whereas a pattern that begins with a single slash is matched only against its direct children. How that difference is relevant for SoNaR is described in the paper cited below.

Download 'Making Large Treebanks Searchable. The SoNaR case.'

Vincent Vandeghinste and Liesbeth Augustinus. (2014). "Making Large Treebanks Searchable. The SoNaR case". In: Marc Kupietz, Hanno Biber, Harald Lüngen, Piotr Bański, Evelyn Breiteneder, Karlheinz Mörth, Andreas Witt & Jani Takhsha (eds.), Proceedings of the LREC2014 2nd workshop on Challenges in the management of large corpora (CMLC-2). Reykjavik, Iceland. pp. 15-20.
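
The difference between the two kinds of search described in this answer can be illustrated with a small sketch. It does not reproduce the SoNaR optimisation itself, which is described in the paper; it only shows how a descendant search and a child search behave on the same tree. The example sentence is a hypothetical, heavily simplified tree in the style of Alpino XML. Python's built-in ElementTree module supports only a subset of XPath and expects relative paths, so ".//" and "./" stand in here for a leading "//" and "/".

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified tree in the style of Alpino XML;
# real treebank entries carry many more attributes.
SENTENCE = """
<alpino_ds>
  <node cat="top">
    <node cat="smain">
      <node rel="su" pt="n" word="Jan"/>
      <node rel="hd" pt="ww" word="slaapt"/>
    </node>
  </node>
</alpino_ds>
"""

root = ET.fromstring(SENTENCE)

# Descendant search (analogous to a leading "//"): nodes at any depth.
all_nodes = root.findall(".//node")
smain_anywhere = root.findall(".//node[@cat='smain']")

# Child search (analogous to a leading "/"): direct children only.
child_nodes = root.findall("./node")
smain_as_child = root.findall("./node[@cat='smain']")

print(len(all_nodes), len(smain_anywhere))    # 4 1
print(len(child_nodes), len(smain_as_child))  # 1 0
```

The same pattern finds the smain node when every descendant is searched, but finds nothing when only direct children are considered, because smain is nested one level below the top node.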

What is new in version 3?

In addition to an overall design update, major changes include a more intuitive query builder in the example-based search mode and a visualizer for syntax trees that is compatible with all modern browsers. Moreover, the results are presented as soon as they are found, so you can browse the matching sentences before the treebank search is completed. Furthermore, it is now possible to query the 500-million word SoNaR treebank in much the same way as the two one-million word treebanks, CGN and Lassy Small.