The project

Evidence of authentic language use is fundamental for language learning. One way to develop authentic language learning materials is to use examples from corpora, i.e., large collections of texts produced in natural contexts and stored in electronic form. However, such corpora may include sensitive content or offensive language, and they may also exhibit structural problems. Although their use is unquestionably authentic, we recommend that these corpora be carefully reviewed before being applied in education so that inappropriate content is flagged, leaving the decision about whether to use particular examples to teachers and developers of didactic materials, according to their needs and contexts of use. In other words, from our perspective, pedagogical corpora should be labelled for potentially problematic content rather than cleansed of it. To streamline the verification of sentences for the creation of problem-labelled pedagogical corpora, we decided to ask the crowd for help. It was in this context that the Crowdsourcing Corpus Filtering for Pedagogical Purposes project was created.

The Crowdsourcing Corpus Filtering for Pedagogical Purposes project aims to create pedagogical corpora of Dutch, Estonian, Slovene and Portuguese through the application of crowdsourcing techniques. These pedagogical corpora can be used to develop auxiliary language learning resources, such as the Sketch Engine for Language Learning (SKELL; Baisa & Suchomel, 2014), dictionaries and teaching materials, and, within Natural Language Processing, to create datasets for training machine learning algorithms for the compilation of larger pedagogical corpora.

In phase 1, we carried out an experiment on the use of crowdsourcing for corpus filtering, in which we asked the crowd to identify sentences that would be offensive in a pedagogical context. The experiment was implemented on the Pybossa platform.
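
For reference, the sketch below shows one possible way such a filtering task could be published on Pybossa using the pybossa-client (pbclient) Python library; the endpoint, API key, project short name, example sentences and task fields are placeholders for illustration only, not the project's actual configuration.

    import pbclient

    # Point the client at a Pybossa server (placeholder values).
    pbclient.set('endpoint', 'https://example-pybossa-server.org')
    pbclient.set('api_key', 'YOUR-API-KEY')

    # Look up the crowdsourcing project by a hypothetical short name.
    project = pbclient.find_project(short_name='corpus-filtering')[0]

    # Candidate sentences extracted from a corpus (illustrative examples only).
    sentences = [
        'An example sentence extracted from the corpus.',
        'Another candidate sentence to be checked by the crowd.',
    ]

    # Create one task per sentence; each task asks the crowd whether the
    # sentence is problematic for pedagogical use.
    for sentence in sentences:
        task_info = {
            'sentence': sentence,
            'question': 'Is this sentence problematic for use in language teaching?',
        }
        pbclient.create_task(project.id, task_info)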

In phase 2, we are developing the Crowdsourcing for Language Learning game – CrowLL. CrowLL is a multilevel, multilingual, platform-responsive game in which players identify problematic sentences, classify them, and indicate the problematic excerpts. The types of problems to be labelled are: vulgar language, offensive language, sensitive content, grammar/spelling problems, and incomprehensibility/lack of context. Data preparation for the game involved the manual annotation of automatically extracted sentences in all four languages. This task was funded by the CLARIN Resource Families Project Funding, and the annotated corpora are available in the PORTULAN CLARIN repository.
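
To illustrate the kind of record produced during data preparation, the sketch below shows one possible way to represent an annotated sentence in Python; the class, field names and example values are hypothetical and do not reflect the actual format of the released corpora.

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import List, Tuple

    class ProblemType(Enum):
        """Problem types labelled in the CrowLL data preparation."""
        VULGAR = 'vulgar'
        OFFENSIVE = 'offensive'
        SENSITIVE_CONTENT = 'sensitive content'
        GRAMMAR_SPELLING = 'grammar/spelling problems'
        INCOMPREHENSIBLE = 'incomprehensible/lack of context'

    @dataclass
    class AnnotatedSentence:
        """One sentence with its problem labels and flagged excerpts."""
        sentence: str
        language: str                                    # e.g. 'nl', 'et', 'sl', 'pt'
        problems: List[ProblemType] = field(default_factory=list)
        excerpts: List[Tuple[int, int]] = field(default_factory=list)  # half-open character offsets

        @property
        def is_problematic(self) -> bool:
            return bool(self.problems)

    # Illustrative record: a Portuguese sentence labelled as containing
    # sensitive content, with the offending excerpt given as character offsets.
    example = AnnotatedSentence(
        sentence='Exemplo de frase com conteúdo sensível.',
        language='pt',
        problems=[ProblemType.SENSITIVE_CONTENT],
        excerpts=[(21, 38)],
    )
    print(example.is_problematic)        # True
    print(example.sentence[21:38])       # 'conteúdo sensível'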

Team members

Tanara Zingano Kuhn

Project leader

Centre for the Studies of General and Applied Linguistics at University of Coimbra (CELGA-ILTEC)
Brazil/Portugal

Ana Luís

Centre for the Studies of General and Applied Linguistics at University of Coimbra (CELGA-ILTEC)/Faculty of Arts and Humanities at University of Coimbra
Portugal

Carole Tiberius

Dutch Language Institute
Netherlands

Iztok Kosem

Centre for Language Resources and Technologies at the University of Ljubljana (CJVT UL)
Slovenia

Kristina Koppel

Institute of the Estonian Language
Estonia

Rina Zviel-Girshin

Ruppin Academic Center
Israel

Špela Arhar-Holdt

Centre for Language Resources and Technologies at the University of Ljubljana (CJVT UL)
Slovenia

Andressa Rodrigues Gomide

Former member

Centre for the Studies of General and Applied Linguistics at University of Coimbra (CELGA-ILTEC)
Brazil/Portugal

Branislava Šandrih Todorović

Former member

University of Belgrade, Faculty of Philology
Serbia

Danka Jokić

Former member

Serbia

Peter Dekker

Former member

Dutch Language Institute & AI Lab, Vrije Universiteit Brussel
Netherlands/Belgium

Ranka Stanković

Former member

Serbia

Tanneke Schoonheim

Former member

Dutch Language Institute
Netherlands