As part of our road to detecting metaphors we got stuck on a simple problem: compound nouns.

If you take the sentence:

series of immigration policy changes

Series modifies changes in reference to immigration policy, which is a compound noun.

“Series of changes” is not what we would consider metaphorical usage, but our detector would label “series of immigration” as potentially metaphorical, given its strangeness. Identifying compound nouns, and identifying which specific word is being modified (and is thus the “target” concept in the metaphor), is critical to improving performance.

But, we realized we didn’t want to throw this extra information out. Enter collocations:

a sequence of words or terms that co-occur more often than would be expected by chance

Using the same corpus that we’ve been using (which contains news articles, social media posts, and TV transcripts), we calculated the most prominent collocations containing “immigrant”, “immigration”, and “migration”.

prefix suffix prefix frequency suffix frequency co-occurrence
chain immigration 16271 24547 8548
undocumented immigrant 23145 127600 13645
migration crisis 24547 30606 1980
illegal immigrant 61791 127600 18759
comprehensive immigration 9550 261664 4930
migration visa 24547 23667 1041
immigration custom 261664 16135 6669
unaccompanied immigrant 7945 127600 1547
immigration reform 261664 42233 16839
testify immigrant 13444 127600 2288

In graphical form, here’s what the information looks like:

The blue lines indicate the word is a prefix to the key source words (e.g. “chain migration”, “unaccompanied immigrant"), and the green lines indicate it is a suffix (e.g. “immigration policy”, “illegal reform”.)  

What stands out to us is how little overlap there is in the collocations that overlap between the three words (except “illegal”, which is highly related to all three). This is especially surprising between “migration” and “immigration” which are both abstract nouns.