In this post I look at the family of collexeme analysis methods originated by Gries and Stefanowitsch. Since their scripts rely heavily on Base R and on vectors, there is a hurdle to clear if you are used to the rectangles of tidy data. I first give an overview of what the method tries to do, then show the hurdle in detail, followed by the steps needed to compute the desired variables and statistical tests (association measures). Basically, you need to convert from pure tidy data to a tidy contingency format.
Introduction
So in order to finalize my database on Chinese ideophones, creatively entitled CHIDEOD, I decided to work through Stefan Gries’s Quantitative corpus linguistics with R (2016; 2nd edition; companion website here).
That, together with Natalia Levshina’s How to do linguistics with R (2015; 1st edition; companion website here), which I worked through last June, has given me a lot of inspiration to tackle a number of issues I have been struggling with, or at least thinking about without really knowing how to tackle them — I guess that counts as struggling with them. By the way, it has been locally semi-confirmed that Levshina will be a keynote speaker at our CLDC conference in May, so you can already prepare those abstracts if you wish to present here in Taipei.
Anyway, one of the more intriguing families of methods to investigate the relation between constructions and the words that fill their slots is the family of collexeme analysis.
What is collexeme analysis?
As Gries (2019) discusses, the methodology of collexeme analysis is an extension of the notions of a) collocation (which words really belong together, e.g. watch TV and watch a movie will have a stronger collocational bond than watch a powerpoint presentation, although the latter has become a weekly activity), and b) colligation (what constructions belong together, e.g. watch will typically be followed by a noun, although verb phrases like watch him play also occur).
So the basis of this method is a contingency table (or cross-table; or kruistabel if you love Dutch; lièlián biǎo 列聯表 in Chinese (apparently)). Basically, something like this, that should look somewhat familiar:
|  | Element 2 | Not element 2 / other elements | Sum |
|---|---|---|---|
| Element 1 | a | b | a + b |
| Not element 1 / other elements | c | d | c + d |
| Sum | a + c | b + d | a + b + c + d |
These letters (a, b, c, d) play an important role in collexeme analyses. So stay tuned for that.
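In R, such a table is simply a 2-by-2 matrix. Here is a minimal sketch with hypothetical counts standing in for a, b, c, and d; addmargins() fills in the row and column sums:

```r
# a 2-by-2 contingency table in Base R, with hypothetical counts standing in for a, b, c, d
cont.table <- matrix(c(10,  200,     # a, b
                       30, 5000),    # c, d
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("element 1", "not element 1"),
                                     c("element 2", "not element 2")))
addmargins(cont.table)   # adds the row and column sums (a + b, a + c, ...)
```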
Approach 1: Collexeme analysis
So what Stefanowitsch & Gries (2003) wanted to investigate was the [N waiting to happen] construction, e.g. an accident waiting to happen, a disaster waiting to happen etc. In this collexeme analysis you are quantifying to what degree the words found in a corpus occur in that construction: are they attracted or repulsed, and by how much?
In this scenario, element 1 is e.g. an accident, and element 2 the construction waiting to happen.
|  | waiting to happen | not waiting to happen | Sum |
|---|---|---|---|
| an accident | a | b | a + b |
| not an accident | c | d | c + d |
| Sum | a + c | b + d | a + b + c + d |
Approach 2: Distinctive collexeme analysis
A year later, Gries & Stefanowitsch (2004) extended the methodology to quantify to what degree the words prefer to appear in one of two constructions. For example, the ditransitive give him a call vs. the prepositional give a call to him. Here the table changes into:
|  | ditransitive | prepositional | Sum |
|---|---|---|---|
| give | a | b | a + b |
| not give | c | d | c + d |
| Sum | a + c | b + d | a + b + c + d |
Approach 3: Co-varying collexeme analysis
The third seminal paper, published in the same year (Gries & Stefanowitsch 2004; ISBN: 9781575864648), sought to quantify the attraction / repulsion between two different slots in a construction, e.g. trick … into buying, force … into accepting, etc. For this, the table is adapted to:
|  | accepting | not accepting | Sum |
|---|---|---|---|
| force | a | b | a + b |
| not force | c | d | c + d |
| Sum | a + c | b + d | a + b + c + d |
Okay, I get that table stuff, what next?
Let’s say you were able to get the frequencies from a corpus – I should probably mention that, *cough* *cough*, not everybody agrees with this kind of contingency table; the overview paper by Gries (2019) has plenty of interesting references and exciting rebuttals – and you have this table, possibly for lots of elements and/or constructions.
We’ll use an example here that I calculated with Gries’s (2016) book mentioned above. One of the case studies looks at verbs that co-occur with the modal verb must, e.g. must accept, must agree, must confess, vs. how much these verbs occur with other modal verbs, e.g. should accept, should agree, etc. So in this case, element 1 was a verb (confess) and element 2 was must; not-element-1 was all the other verbs, and not-element-2 was all the other modal verbs.
The respective table, at an abstract level, looks like this:
|  | must | not must | Sum |
|---|---|---|---|
| confess | a | b | a + b |
| not confess | c | d | c + d |
| Sum | a + c | b + d | a + b + c + d |
I got these frequencies from the BNC (as Gries shows in his book, but I used my tidyverse skills™ to get them, so if they are slightly off, it is because I did not follow his script completely):
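The resulting 2 × 2 table for confess is built off-screen here; the following is a minimal sketch of how you could reconstruct it as a Base R matrix, using the cell counts that appear in the output further down, while also defining the letters a, b, c, and d that the snippets below rely on:

```r
# the four cells for confess vs. must, taken from the counts shown later in this post
a <- 11      # confess + must
b <- 1       # confess + other modal
c <- 1990    # other verb + must
d <- 62114   # other verb + other modal

must.table <- matrix(c(a, b, c, d), nrow = 2, byrow = TRUE,
                     dimnames = list(verb  = c("confess", "notconfess"),
                                     modal = c("must", "notmust")))
must.table
```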
The association measure (read: statistical test) that Gries & Stefanowitsch, as well as others, have used most is the Fisher-Yates exact test, and more precisely the negative \(log10\) of its \(p\)-value. Underlying this (see the link in this paragraph) are calculations using those letter cells (a, b, c, d). Luckily we don’t need to do that manually because R has a function for it – fisher.test():
fisher.test(must.table)
Fisher's Exact Test for Count Data
data: must.table
p-value = 3.106e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
49.90447 13241.49567
sample estimates:
odds ratio
344.6593
So as you can see, the \(p\)-value is \(3.1056007 \times 10^{-16}\). If we take the negative \(log10\) of this, it becomes 15.5078544, which gives us the result we expected.
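Spelled out in code, this is just pulling the \(p\)-value out of the fisher.test() object and taking the negative \(log10\):

```r
# pull the p-value out of the test object and take the negative log10
p.fisher <- fisher.test(must.table)$p.value
-log10(p.fisher)
#> [1] 15.50785
```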
An easier way of computing the Fisher-Yates exact test for this table is by using the letter math and the pv.Fisher.collostr() function provided by Levshina’s Rling package:
Rling::pv.Fisher.collostr(a, b, c, d)
[1] 3.105601e-16
Other tests that have been proposed are the so-called Reliance and Attraction (cf. Schmid & Küchenhoff 2013 a.o.). Reliance is the relative frequency of a verb (confess) with must relative to all uses of that verb; Attraction is the relative frequency of a verb (confess) with must relative to all uses of must.
\[ Reliance = \frac{a}{a+b}\]
\[ Attraction = \frac{a}{a+c} \]
attraction <- a / (a + c) * 100
reliance   <- a / (a + b) * 100
The Attraction of confess and must is 0.5497251 and the Reliance is 91.6666667. This high Reliance means that whenever confess occurs in the corpus after a modal it relies on must to occur. Its Attraction, however, has a much lower value: must does not necessarily occur with confess, in fact, it occurs with lots of other verbs as well!
The third group of tests is actually correlated to Attraction and Reliance: respectively \(\Delta P_{word \to construction}\) and \(\Delta P_{construction \to word}\) (cf. Ellis & Ferreira-Junior a.o.). They are also known as tests of cue validity and are calculated as follows:
dP.cueCx   <- a / (a + c) - b / (b + d)
dP.cueVerb <- a / (a + b) - c / (c + d)
So the \(\Delta P_{word \to construction}\) of must confess is 0.0054812 and the \(\Delta P_{construction \to word}\) of must confess is 0.8856234, so these numbers look a lot like those of Attraction and Reliance.
Anyway, I think you get the drift: if you have the contingency table, you choose an association measure (there are more than these three sets) and analyze the results.
Tidy collostructions
So if that’s so clear, why am I writing this post? For me, the main difficulty with the approach is that Gries and Levshina love writing things in Base R (Gries even more than Levshina). For someone who really started to appreciate R only after the tidyverse became more widely available, there is a dissonance with the way they go about things.
One of the biggest differences is the obsessive compulsion of the tidyverse to think in rectangles, a.k.a. dataframes or tibbles, rather than the vectors (letters) that Gries and Levshina love using, especially in their letter mathematics.
As a consequence of this “rectangling” and the tidy format (long and skinny), rather than the contingency format (2 × 2), it is challenging to compute even a basic chi-squared test.
So in Base R, a chi-squared test is easy to compute, but then hard to continue working with, because you get this thick text block and you can’t really access its contents (though there are solutions, like broom::tidy()):
chisq.test(must.table)
Warning in chisq.test(must.table): Chi-squared approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity correction
data: must.table
X-squared = 282.63, df = 1, p-value < 2.2e-16
In a tidy format, you typically have to jump through a lot of hoops to either get to use the same function or use one of the newer alternative functions:
# this function will give you bad output
# so you can't just simply do this
chisq.test(must.tibble$must, must.tibble$notmust)
Warning in chisq.test(must.tibble$must, must.tibble$notmust): Chi-squared
approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity correction
data: must.tibble$must and must.tibble$notmust
X-squared = 0, df = 1, p-value = 1
Hoop 1: make it really tidy
must.tibble %>%
  pivot_longer(cols = c(must, notmust),
               names_to = "modal",
               values_to = "n") %>%   # now it is really tidy
  kable()
| verb | modal | n |
|---|---|---|
| confess | must | 11 |
| confess | notmust | 1 |
| notconfess | must | 1990 |
| notconfess | notmust | 62114 |
Hoop 2: uncount
must.tibble %>%
  pivot_longer(cols = c(must, notmust),
               names_to = "modal",
               values_to = "n") %>%   # now it is really tidy
  uncount(n)                          # uncount
# A tibble: 64,116 × 2
verb modal
<chr> <chr>
1 confess must
2 confess must
3 confess must
4 confess must
5 confess must
6 confess must
7 confess must
8 confess must
9 confess must
10 confess must
# ℹ 64,106 more rows
Hoop 3: convert to table
must.tibble %>%
  pivot_longer(cols = c(must, notmust),
               names_to = "modal",
               values_to = "n") %>%   # now it is really tidy
  uncount(n) %>%                      # uncount
  table()                             # turn into a Base R table
modal
verb must notmust
confess 11 1
notconfess 1990 62114
Hoop 4: chisquare
must.tibble %>%
  pivot_longer(cols = c(must, notmust),
               names_to = "modal",
               values_to = "n") %>%   # now it is really tidy
  uncount(n) %>%                      # uncount
  table() %>%                         # turn into a Base R table
  chisq.test()
Warning in chisq.test(.): Chi-squared approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity correction
data: .
X-squared = 282.63, df = 1, p-value < 2.2e-16
Hoop 5: tidy with broom
must.tibble %>%
  pivot_longer(cols = c(must, notmust),
               names_to = "modal",
               values_to = "n") %>%   # now it is really tidy
  uncount(n) %>%                      # uncount
  table() %>%                         # turn into a Base R table
  chisq.test() %>%
  broom::tidy() %>%
  kable()
Warning in chisq.test(.): Chi-squared approximation may be incorrect
| statistic | p.value | parameter | method |
|---|---|---|---|
| 282.6321 | 0 | 1 | Pearson’s Chi-squared test with Yates’ continuity correction |
I know there are some tidyverse-friendly functions like infer::chisq_test, but they seem to lack things like access to the expected values (chisq.test()$exp). So this awkward hoop jumping is annoying but gets the job done (for now?).
From tidy to contingency
Anyway, while for most summary statistics a tidy format (see hoop 1) is the easiest to work with, I don’t think it’s very intuitive for the association measures paradigm.
So in this working example of the must + V_inf_ construction, what I did was get all the occurrences of modal verb + infinitive from the BNC corpus. The second step was to add a column with dplyr::case_when() identifying whether the modal verb was must or another modal verb.
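That labelling step might look something like the sketch below; the input object name modal.verb.pairs is hypothetical, and only the columns modal and verb (shown in the output further down) are assumed:

```r
library(dplyr)

# hypothetical labelling step: modal.verb.pairs is a stand-in name for a tibble
# with one row per modal + infinitive hit and the columns `modal` and `verb`
df.must <- modal.verb.pairs %>%
  mutate(mod.type = case_when(modal == "must" ~ "MUST",
                              TRUE            ~ "OTHER"))
```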
Let’s look at this data, shall we:
df.must
# A tibble: 64,116 × 3
modal verb mod.type
<chr> <chr> <chr>
1 can find OTHER
2 should stop OTHER
3 should recognise OTHER
4 will cost OTHER
5 might think OTHER
6 can sink OTHER
7 can change OTHER
8 must help MUST
9 ll help OTHER
10 ll take OTHER
# ℹ 64,106 more rows
As you can see there are 64,116 rows, and the table is just a count away from being tidy:
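That count is a single dplyr::count() call; here is a sketch (the object name tidy.df.must is my own, chosen to match the spread.tidy.df.must that appears further down):

```r
library(dplyr)

# count each mod.type / verb combination; tidy.df.must is my own name for the result
tidy.df.must <- df.must %>%
  count(mod.type, verb, sort = TRUE)

tidy.df.must
```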
# A tibble: 2,522 × 3
mod.type verb n
<chr> <chr> <int>
1 OTHER get 3750
2 OTHER go 3421
3 OTHER see 2409
4 OTHER take 2056
5 OTHER like 1778
6 OTHER say 1647
7 OTHER come 1460
8 OTHER make 1383
9 OTHER give 1245
10 OTHER put 1162
# ℹ 2,512 more rows
Much ‘prettier’, but what next? This is actually the mental step I struggled with the most: the dissonance between tidy and contingency. However, since we coded mod.type with a binary value (“OTHER” or “MUST”), it is actually trivial to tidyr::spread() or tidyr::pivot_wider() it into a “tidy contingency table”:
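A sketch of that pivoting step, again assuming the object names from the sketches above (spread.tidy.df.must matches the name used in the code further down):

```r
library(tidyr)

# spread the binary mod.type into two count columns: one row per verb
spread.tidy.df.must <- tidy.df.must %>%
  pivot_wider(names_from = mod.type, values_from = n)

spread.tidy.df.must
```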
# A tibble: 2,113 × 3
verb OTHER MUST
<chr> <int> <int>
1 get 3750 102
2 go 3421 94
3 see 2409 14
4 take 2056 63
5 like 1778 3
6 say 1647 105
7 come 1460 41
8 make 1383 62
9 give 1245 22
10 put 1162 18
# ℹ 2,103 more rows
Hey, this looks a lot like those schematic tables we had at the beginning! But now the real challenge is to turn these numbers into the letters a, b, c, and d, so we can perform our letter math.
After changing all numeric values to doubles (instead of integers) and changing all NAs to 0, we can get the letters through simple arithmetic from our original schematic table, e.g. if we know a and a + c, then c = (a + c) - a, etc.
|  | Element 2 | Not element 2 / other elements | Sum |
|---|---|---|---|
| Element 1 | a | b | a + b |
| Not element 1 / other elements | c | d | c + d |
| Sum | a + c | b + d | a + b + c + d |
spread.tidy.df.must %>%
  mutate_if(is.numeric, ~ as.double(.x)) %>%       # we'll want doubles instead of integers
  mutate_if(is.numeric, ~ replace_na(.x, 0)) %>%   # NAs should be 0
  mutate(a    = MUST,
         ac   = sum(MUST),
         c    = ac - a,
         ab   = MUST + OTHER,
         b    = ab - a,
         abcd = sum(MUST, OTHER),
         d    = abcd - a - b - c,
         aExp = (a + b) * (a + c) / (a + b + c + d)) -> abcd.must

abcd.must
# A tibble: 2,113 × 11
verb OTHER MUST a ac c ab b abcd d aExp
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 get 3750 102 102 2001 1899 3852 3750 64116 58365 120.
2 go 3421 94 94 2001 1907 3515 3421 64116 58694 110.
3 see 2409 14 14 2001 1987 2423 2409 64116 59706 75.6
4 take 2056 63 63 2001 1938 2119 2056 64116 60059 66.1
5 like 1778 3 3 2001 1998 1781 1778 64116 60337 55.6
6 say 1647 105 105 2001 1896 1752 1647 64116 60468 54.7
7 come 1460 41 41 2001 1960 1501 1460 64116 60655 46.8
8 make 1383 62 62 2001 1939 1445 1383 64116 60732 45.1
9 give 1245 22 22 2001 1979 1267 1245 64116 60870 39.5
10 put 1162 18 18 2001 1983 1180 1162 64116 60953 36.8
# ℹ 2,103 more rows
So now we can easily perform all of the association measures we want, and rank them accordingly.
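As a sketch of that last step, assuming the abcd.must columns from above and the Rling function shown earlier, you could compute a few measures per row and sort by one common choice, the negative \(log10\) of the Fisher-Yates \(p\)-value (collostruction strength):

```r
library(dplyr)

# rank verbs by collostruction strength; Reliance and Attraction added for comparison
abcd.must %>%
  rowwise() %>%
  mutate(p.fisher      = Rling::pv.Fisher.collostr(a, b, c, d),
         coll.strength = -log10(p.fisher),
         reliance      = a / (a + b),
         attraction    = a / (a + c)) %>%
  ungroup() %>%
  arrange(desc(coll.strength))
```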
It is perfectly possible to compute association measures starting from a tidy dataframe. First you need to spread it out by a binary variable to make it a tidy contingency table. Then you can identify a, b, c, and d. Next you choose your preferred statistical test, rank the verbs, and try to interpret the findings.
I should probably also mention that Gries has been advocating not to use just one association measure, but three or more. With just two, such as Attraction and Reliance, you can plot them against each other.