Trying to solve the task of calculating word cooccurrence relative frequencies fast, I have created an interesting data structure, which also allows to calculate counts for the first word in the pair to check; and it creates word prefix tree for the text processing, which can be used for further text analysis.
The source code is available on GitHub: github.com/Maxime2/cooccurrences
When you execute make command you should see the following output:
cc -O3 -funsigned-char cooccur.c -o cooccur -lm Example 1 ./cooccur a.txt 2 < a.in | tee a.out Checking pair d e Count:3 cocount:3 Relative frequency: 1.00 Checking pair a b Count:3 cocount:1 Relative frequency: 0.33 Example 2 ./cooccur b.txt 3 < b.in | tee b.out Checking pair a penny Count:3 cocount:3 Relative frequency: 1.00 Checking pair penny earned Count:4 cocount:1 Relative frequency: 0.25
The cooccur program takes two arguments: the filename of a text file to process and the window of words size to calculate relative frequencies within it. Then the program takes pairs of words from its standard input, one pair per line, to calculate count of appearance of the first word in the text processed and the cooccurrence count for the pair in that text. If the second word appears more than once in the window, only one appearance is counted.
Examples were taken here: