Project 1

Researcher: Malin Patel

Supervisors: Prof. Dr. Stephanie Evert and Prof. Dr. Thomas Herbst

Corpus Evidence for Delineating Constructions

(Third Party Funds Group – Sub project)

Abstract:

Construction Grammar (CxG) and many other usage-based approaches agree that language consists of pre-fabricated form-meaning pairings of varying sizes (e.g. Goldberg 1995, Hunston & Francis 2000, Sinclair & Mauranen 2006, Wray 2008), which are called constructions in CxG. In contrast to approaches that understand language as a probabilistic system, such as lexical priming theory (Hoey 2005) or the EC-Model (Schmid 2020), constructions are usually conceptualised as discrete symbolic units or the “nodes of a symbolic network” (Diessel 2019: 249), possibly emerging from the generalisation of associational patterns or clusters of memory traces (e.g. Goldberg 2019). Prior research has typically focused on extensive linguistic analysis and discussion of a relatively small set of specific constructions (such as the English ditransitive or the let alone construction). Such studies have not been able to establish clear-cut criteria and diagnostics for determining at scale, i.e. with broad coverage, which form-meaning pairings should be considered as constructions and which elements (lexical items, restricted or open slots, and grammatical features) should be included in a given construction. While it is evident in a usage-based approach that there can be no dichotomic distinction of constructions vs. non-constructions and that “constructionhood” is a matter of degree, binary decisions on an inventory of constructions still have to be made for the purposes of linguistic analysis and the systematic compilation of a broad-coverage reference constructicon.

First efforts to build such a reference constructicon have been started for different languages, including English (Perek & Patten 2019) and German (Ziem et al. 2019). They build on existing lexical resources such as FrameNet (Perek & Patten 2019) and/or manual in-depth analysis of selected constructions (Ziem et al. 2019). Automatic identification of constructions has only been attempted by a small number of exploratory studies, based on word n-grams (Shibuya & Jensen 2015), hybrid n-grams of words and POS tags (Forsberg et al. 2014), or a combination of dependency-based co-occurrence with distributional clustering (Martí et al. 2019). All three studies focus on extracting and ranking construction candidates for manual inspection, but do not discuss identifying criteria or generate additional quantitative evidence for human annotators. Gries (2003) carries out a small feasibility study on finding prototypical instances of a given construction, but does not address the issue of construction identification.

This project explores how and to what extent quantitative data from large corpora can contribute to the task of delineating constructions, i.e. help researchers to assess the degree of “constructionhood” of a candidate construction (CxCand), develop systematic defining criteria for this assessment, and lay the groundwork for (semi-)automatic identification of constructions at scale. The project combines computational big data analysis of English and German corpora with constructicographic work (Lyngfelt et al. 2018), extending the collo-profile approach proposed by Herbst & Uhrig (2019: 177ff) for argument structure constructions. It addresses three central research questions:

Q1: Does quantitative evidence from large corpora improve the manual identification of constructions and the development of defining criteria?

Q2: What statistical measures are suitable as an operationalisation of such quantitative data, providing a basis for computing an index of “constructionhood” and for the automatic identification of constructions?

Q3: Can context-sensitive neural word and phrase embeddings be used as a corpus-based approximation of construction meaning?

The project starts by extracting large databases of CxCand from English and German Web corpora of more than 10 billion words, based on pre-defined syntactic patterns such as verb argument structure. The extraction relies on an existing HPC infrastructure for parsing large corpora at FAU. Widely-used criteria for determining “constructionhood” such as productivity, compositionality / idiomaticity and schematicity / lexical specificity (Ziem et al. 2019: 69f) are operationalised in terms of corpus frequency, productivity of slots, statistical association between lexical elements, morpho-syntactic preferences, context entropy, etc. They are computed from the CxCand database using state-of-the-art measures from methodological research carried out at FAU, which provide the basis for answering Q2. Following Herbst & Uhrig (2019), the meaning aspect of a CxCand is initially approximated by the collo-profiles of its open slots. A thorough constructicographic analysis of different sets of CxCand sheds light on Q1 (whether constructions can clearly be identified) and Q2 (which quantitative measures are most useful for this purpose). These sets include well-studied examples of constructions from the literature (used for validation of the approach), sets based on a syntactic pattern (such as mono-transitive verb argument structure), and sets based on a lexical item (in particular various prepositions, in collaboration with project #9). The most challenging and open-ended aspect of the project explores the use of context-sensitive word and phrase embeddings (e.g. Devlin et al. 2019) to operationalise the semantics of a CxCand, following the distributional hypothesis (Harris 1954) and recent proposals for a distributional CxG (DisCxG: Rambelli et al. 2019). If successful, i.e., if there is a positive answer to Q3, not only the form of a construction but also its meaning can be studied based on corpus evidence.
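To make the idea of "statistical association between lexical elements" concrete, the following minimal sketch ranks the fillers of one open slot by a signed log-likelihood score (G2) computed from a 2x2 contingency table. This is an illustration only: the abstract does not say which association measure the project uses, so the choice of G2 is an assumption, and the function names `g2` and `collo_profile` as well as all counts are invented toy values, not project data.

```python
import math


def g2(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2x2 contingency table of observed counts."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22          # row totals
    c1, c2 = o11 + o21, o12 + o22          # column totals
    observed = [o11, o12, o21, o22]
    expected = [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]
    # Cells with an observed count of 0 contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def collo_profile(slot_counts, corpus_counts, corpus_size):
    """Rank slot fillers by signed G2 (positive = attracted to the slot).

    slot_counts:   frequency of each lemma inside the slot (hypothetical data)
    corpus_counts: frequency of each lemma in the whole corpus
    corpus_size:   total number of comparable tokens in the corpus
    """
    slot_total = sum(slot_counts.values())
    rows = []
    for lemma, o11 in slot_counts.items():
        o12 = slot_total - o11               # other lemmas in the slot
        o21 = corpus_counts[lemma] - o11     # this lemma outside the slot
        o22 = corpus_size - o11 - o12 - o21  # everything else
        expected = slot_total * corpus_counts[lemma] / corpus_size
        score = g2(o11, o12, o21, o22)
        if o11 < expected:                   # fewer hits than expected: repelled
            score = -score
        rows.append((lemma, o11, score))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Toy counts for the verb slot of a ditransitive-like CxCand (invented numbers).
slot_counts = {"give": 40, "send": 15, "make": 5}
corpus_counts = {"give": 400, "send": 150, "make": 2000}
profile = collo_profile(slot_counts, corpus_counts, corpus_size=100_000)

for lemma, freq, score in profile:
    print(f"{lemma:<6} f={freq:<3} G2={score:.1f}")
```

With these toy counts, a frequent but unspecific verb such as "make" ranks below rarer verbs that occur disproportionately often in the slot, which is the kind of evidence a collo-profile is meant to surface.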

Research questions Q1 and Q2 directly address GRQ CON1 (How do we identify constructions? Can they be seen as discrete units?) and GRQ CON2 (To what extent is constructional knowledge determined by collo-profiles? How can we measure the lexical specificity vs. productivity of construction slots?). An important part of the constructicographic analysis is to delineate a CxCand from related constructions, such as a generalisation of the CxCand or an overlapping combination of two constructions. In this way, the project also addresses GRQ NET1 (How can computational methods help reveal the network character of constructional space?).
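One way to approach the CON2 question of measuring lexical specificity vs. productivity of a slot is sketched below. It is a toy illustration under stated assumptions, not the project's actual method: the function name `slot_profile` and the filler lists are hypothetical, and the two measures shown (the hapax-based potential productivity of Baayen, i.e. hapax legomena divided by tokens, and the Shannon entropy of the filler distribution) are just two common candidates among the many measures the project might consider.

```python
import math
from collections import Counter


def slot_profile(fillers):
    """Summarise the filler distribution of one open slot.

    fillers: list of lemmas observed in the slot, one entry per corpus hit.
    """
    counts = Counter(fillers)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    # Shannon entropy in bits: low values indicate a lexically specific slot.
    entropy = -sum((c / n_tokens) * math.log2(c / n_tokens) for c in counts.values())
    return {
        "tokens": n_tokens,
        "types": n_types,
        "potential_productivity": hapaxes / n_tokens,  # hapaxes / tokens
        "entropy_bits": entropy,
    }


# Hypothetical slot dominated by one verb (lexically specific) ...
specific = slot_profile(["give"] * 4 + ["send"] * 2 + ["offer", "hand"])
# ... versus a hypothetical slot in which every filler occurs once (open).
open_slot = slot_profile(["bake", "knit", "carve", "paint", "draw", "sing", "pour", "toss"])

print(specific)
print(open_slot)
```

On real data, both measures depend strongly on sample size, so values for different CxCand are only comparable if they are computed on samples of similar size or otherwise normalised.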

The project will contribute a substantial number of entries to the RCnn, combining constructicographic descriptions with rich quantitative evidence. A suitable representation format for these entries will be developed in close collaboration with the PDR. The CxCand database constitutes a valuable resource for other projects working on English or German constructions; an extension to other languages is envisaged for the second phase of the RTG.