Relevance calculation in DataparkSearch

During indexing, DataparkSearch divides every document into sections. A section is any part of a document; for an HTML document, for example, this may be the TITLE tag or the META Description tag.

In addition to sections, several document factors are also taken into account when calculating relevance: the average distance between query words, the number of query word occurrences, the position of the first occurrence of a query word, and the difference between the distribution of query word counts and the uniform distribution.

During searching, DataparkSearch compares every document found against an "ideal" document. The "ideal" document has query words in every defined section and also has predefined values for the additional factors.

A full method of relevance calculation.

Let x be the weighted sum of all sections. The weights for these sections are defined by the wf parameter (see Section 8.1.3). Let y be the weighted sum of the differences between the values of the additional factors of the document found and the corresponding values of the "ideal" document. And let xy be the weighted sum of the sections where at least one query word has been found. Then the relevance of a document found is calculated as: 0.5 * ( x + xy ) / ( x + y ).
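To make the formula concrete, here is a minimal Python sketch of the full method. It is not DataparkSearch code: the function name, the dictionary layout, and the use of absolute differences for the additional factors are assumptions made only for illustration.

```python
def full_relevance(section_weights, matched_sections, factor_diffs, factor_weights):
    """Illustrative only: 0.5 * (x + xy) / (x + y).

    section_weights  -- hypothetical mapping: section name -> weight (as set via wf)
    matched_sections -- names of sections containing at least one query word
    factor_diffs     -- hypothetical mapping: factor name -> (found value - ideal value)
    factor_weights   -- hypothetical mapping: factor name -> weight of that factor
    """
    # x: weighted sum over all defined sections
    x = sum(section_weights.values())
    # xy: weighted sum over the sections where a query word was found
    xy = sum(w for name, w in section_weights.items() if name in matched_sections)
    # y: weighted sum of differences from the "ideal" document
    # (taking the absolute value is an assumption of this sketch)
    y = sum(factor_weights[name] * abs(diff) for name, diff in factor_diffs.items())
    if x + y == 0:
        return 0.0  # no sections defined and no factor differences
    return 0.5 * (x + xy) / (x + y)


weights = {"body": 1, "title": 2, "meta": 1}
print(full_relevance(weights, {"title"}, {}, {}))
print(full_relevance(weights, {"title"}, {"distance": 2.0}, {"distance": 0.5}))
```

In this sketch a document that matches in every section and has no factor differences scores 1.0, and any growth in y pushes the score down.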

A fast method of relevance calculation.

Let x be the number of bits used in the weighted values of all defined sections. Let y be the weighted sum of the differences between the additional factors of the document found and the corresponding values of the "ideal" document. And let xy be the number of bits in which the weighted section values of the "ideal" document differ from the weighted section values of the document found. Then the relevance of a document is calculated as: ( x - xy ) / ( x + y ).
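One possible reading of the fast method, sketched in Python. The text does not say how section weights are packed into bits, so this sketch assumes that the weighted section values are packed into a single integer, that x is the number of set bits in the "ideal" value, and that the found document's value only ever has bits that are also set in the "ideal" one. The function names are hypothetical.

```python
def popcount(n):
    """Number of set bits in a non-negative integer."""
    return bin(n).count("1")


def fast_relevance(ideal_bits, found_bits, y):
    """Illustrative only: (x - xy) / (x + y).

    ideal_bits -- packed weighted section values of the "ideal" document (assumed layout)
    found_bits -- packed weighted section values of the document found
    y          -- weighted sum of additional-factor differences, computed elsewhere
    """
    x = popcount(ideal_bits)                 # bits used by the "ideal" document
    xy = popcount(ideal_bits ^ found_bits)   # bits where the two values differ
    if x + y == 0:
        return 0.0
    return (x - xy) / (x + y)


# Two hypothetical 4-bit sections: title weight 0111, body weight 0011.
ideal = 0b0111_0011
found = 0b0000_0011  # query words found in the body only
print(fast_relevance(ideal, found, 0))
print(fast_relevance(ideal, found, 5))
```

Under these assumptions the comparison reduces to one XOR and two bit counts, which is presumably what makes this variant cheaper than summing weights section by section.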

//PS: This is a DataparkSearch documentation update; it will appear in the next release.
