Lauren Fonteyn (Meertens Instituut) is the guest speaker at this ACLC seminar. The title of the talk is "Methods of semi-automatic data annotation with contextualized word embeddings".

Event details of ACLC Seminar | Lauren Fonteyn
9 June 2023
16:15-17:30
P.C. Hoofthuis

Methods of semi-automatic data annotation with contextualized word embeddings

In corpus linguistics, the collection and annotation of data commonly involve a relatively balanced combination of computer-aided and manual labour. It is still common practice, for instance, to first retrieve data representing a particular linguistic phenomenon from an electronic corpus (e.g. by means of a concordancer tool or query script) and subsequently manually categorize the collected examples into different groups (e.g. animate/inanimate; literal/figurative; agent/patient/instrument/...). However, as the range of research questions that linguists aim to address by means of corpus data has expanded in complexity, there is a growing need for larger data samples, which is difficult to meet when we continue to approach data annotation manually. As such, it has become an important practical challenge in corpus linguistics to determine how data annotation practices can evolve along with the needs of researchers.

In this talk, I suggest one way of approaching corpus data annotation (semi-)automatically by relying on Large Language Models (LLMs). More specifically, I will present a number of example case studies to highlight how an LLM like BERT can be employed to annotate corpus data. The presentation focusses on the BERT-based models MacBERTh (Manjavacas & Fonteyn 2022a) and GysBERT (Manjavacas & Fonteyn 2022b), which – unlike the vast majority of available models – have been pre-trained to process historical English and Dutch respectively (date range: 1500-1950).

What makes the approach I will discuss appealing is that it is fully customizable to the researcher’s needs. Of course, some corpora have been enriched with part-of-speech tags, or, more exceptionally, syntactic parsing and semantic tagging. Yet, not only are high-quality parsed (historical) corpora quite rare and limited in size, the extent to which pre-set tags map onto the categories a researcher is interested in may also vary. The procedure presented in my case studies, then, offers a means of automatically classifying morphosyntactic structures in large, unparsed (and/or untagged) corpora following a custom annotation scheme. As such, the procedure can help to scale up the data set for (historical) corpus studies where a small portion of the data has been manually annotated, or to replicate a data annotation scheme adopted in prior work and apply it to new data.

About the ACLC seminar series

The ACLC seminar series is a two-weekly lecture series organized by the ACLC, the Amsterdam Center for Language and Communication.

P.C. Hoofthuis

Room 1.15

Spuistraat 134
1012 VB Amsterdam