Data structure for word relative cooccurence frequencies, counts and prefix tree

Trying to solve the task of calculating word cooccurrence relative frequencies fast, I have created an interesting data structure, which also allows to calculate counts for the first word in the pair to check; and it creates word prefix tree for the text processing, which can be used for further text analysis.

The source code is available on GitHub: github.com/Maxime2/cooccurrences

When you execute make command you should see the following output:


cc -O3 -funsigned-char cooccur.c -o cooccur -lm

Example 1
./cooccur a.txt 2 < a.in | tee a.out

Checking pair d e
Count:3  cocount:3
Relative frequency: 1.00

Checking pair a b
Count:3  cocount:1
Relative frequency: 0.33


Example 2
./cooccur b.txt 3 < b.in | tee b.out

Checking pair a penny
Count:3  cocount:3
Relative frequency: 1.00

Checking pair penny earned
Count:4  cocount:1
Relative frequency: 0.25

The cooccur program takes two arguments: the filename of a text file to process and the window of words size to calculate relative frequencies within it. Then the program takes pairs of words from its standard input, one pair per line, to calculate count of appearance of the first word in the text processed and the cooccurrence count for the pair in that text. If the second word appears more than once in the window, only one appearance is counted.

Examples were taken here:

Data structure for word relative cooccurence frequencies, counts and prefix tree

Related Posts:

Leave a Reply