Some time ago, DataparkSearch was evaluated with the «Chinese Word Segmentation Evaluation Toolkit». The results were not excellent, but they were slightly better than those of S-MSRSeg from Microsoft Research Asia (see the comparison table at the bottom of the evaluation toolkit page).

The same algorithm is also used in DataparkSearch to segment phrases in traditional Korean and Thai writing.

This is a fast strncpy implementation à la Duff's device:

void *dps_strncpy(void *dst0, const void *src0, size_t length) {
    if (length) {
        /* Number of passes through the 8-way unrolled loop. */
        register size_t n = (length + 7) / 8;
        register char *dst = dst0;
        register const char *src = src0;  /* keep the const qualifier of src0 */
        /* Jump into the loop body so the length % 8 leftover bytes go first. */
        switch (length % 8) {
        case 0: do { if (!(*dst++ = *src++)) break;  /* each break leaves the loop after copying NUL */
        case 7:      if (!(*dst++ = *src++)) break;
        case 6:      if (!(*dst++ = *src++)) break;
        case 5:      if (!(*dst++ = *src++)) break;
        case 4:      if (!(*dst++ = *src++)) break;
        case 3:      if (!(*dst++ = *src++)) break;
        case 2:      if (!(*dst++ = *src++)) break;
        case 1:      if (!(*dst++ = *src++)) break;
                } while (--n > 0);
        }
    }
    return dst0;
}

N.B.: the code is under the GPL.
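For comparison, here is a minimal sketch of the same copy semantics written as a plain one-byte-per-iteration loop. The name plain_strncpy is hypothetical and is not part of DataparkSearch. It makes the behaviour of the unrolled routine easier to see: at most length bytes are copied, and copying stops right after the terminating NUL.

```c
#include <stddef.h>

/* Hypothetical reference version (not DataparkSearch code): same copy
   semantics as the unrolled routine, one byte per iteration. */
void *plain_strncpy(void *dst0, const void *src0, size_t length) {
    char *dst = dst0;
    const char *src = src0;
    while (length-- > 0) {
        /* Copy one byte; stop as soon as the NUL terminator has been copied. */
        if ((*dst++ = *src++) == '\0') break;
    }
    return dst0;
}
```

Note that, unlike the standard strncpy, neither version fills the rest of the destination with zero bytes when the source is shorter than length.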

DataparkSearch 4.43 has been released. Changes since the previous release:

  • "ProvideReferer yes/no" command has been added.
    Use it to provide the Referer request header for HTTP and HTTPS connections.
  • Support has been added for the cp775 charset (Baltic Rim DOS codepage).
  • ISO 639-2 codes and the most widely used language aliases have been added to the charset and language guesser.
  • The default value of the &ps= CGI parameter has been changed to 10.
  • Incorrect processing of round brackets has been fixed for non-boolean search modes.
  • MaxDepth command has been added. Use it to limit the directory depth of a URL.
  • ReplaceVar command now accepts the variable value in BrowserCharset. To add a variable in LocalCharset, use the ReplaceVarLcs command.
  • A possible trap has been fixed when the Store/NoStore command is used.
  • Alias command has been fixed in the search.htm template.
  • SEASentences and SEASentenceMinLength commands have been added.
  • The semantics of the indexer's -r switch have been reverted. The seeding algorithm has also been changed.
  • The Ultra relevance mode has been modified.
  • MaxSiteLevel command has been added. (See blog entry).
  • CrawlDelay command has been added. Use it to specify the default pause, in seconds, between consecutive fetches from the same server.
  • The Neo PopRank can now be calculated using several indexer threads (e.g. "indexer -TRN4").

The MaxDepth command has been added. It limits the maximum directory depth of URLs to be indexed.

Other changes since the dpsearch-4.43-17092006 snapshot: the Alias command has been fixed in the search.htm template; a possible trap has been fixed when the Store/NoStore command is used; the ReplaceVar command now accepts the variable's value in BrowserCharset.

The SEASentences and SEASentenceMinLength commands have been added.

The SEASentenceMinLength command specifies the minimum length of a sentence to be used in summary construction with the SEA. Default value: 32.

The SEASentences command specifies the maximum number of sentences, each at least as long as the value set by SEASentenceMinLength, that are used to construct the summary with the SEA. Default value: 64. Because the cost of computing a summary with the SEA grows nonlinearly with the number of sentences (this affects indexing only), you may adjust this value to reach the desired indexing performance.
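As an illustration, a configuration fragment along the following lines would keep the default minimum sentence length and halve the number of sentences considered, trading summary coverage for indexing speed. The value 32 for SEASentences is only an example, and placing these commands in indexer.conf is an assumption based on the note that they affect indexing.

```
SEASentenceMinLength 32
SEASentences 32
```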

The Summary Extraction Algorithm (SEA) was added in DataparkSearch 4.35 (December 2005). This automatic summary construction algorithm is based on the ideas of Rada Mihalcea described in the paper: Rada Mihalcea and Paul Tarau, An Algorithm for Language Independent Single and Multiple Document Summarization, in Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), Korea, October 2005.

Differences in DataparkSearch's SEA:

  • Initial weights for graph edges are calculated as a measure of similarity between the 3-gram distributions of the corresponding sentences.
  • All graph vertices start with the same initial value (1 / (number of sentences + 1) in the current implementation).
  • The Neo PopRank algorithm is used as the ranking algorithm to iterate the values assigned to the vertices.
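To make the first point more concrete, here is a rough sketch of one way to compare the 3-gram distributions of two sentences. The text above does not say which similarity measure DataparkSearch uses, so the measure chosen here (sum of per-bucket minima over sum of per-bucket maxima), the byte-level 3-grams, the hashed histogram, and the names trigram_hist, trigram_similarity and NBUCKETS are all assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

#define NBUCKETS 4096  /* hypothetical histogram size; hash collisions are possible */

/* Count the byte 3-grams of s, hashing each one into one of NBUCKETS buckets. */
static void trigram_hist(const char *s, unsigned hist[NBUCKETS]) {
    size_t len = strlen(s);
    memset(hist, 0, NBUCKETS * sizeof hist[0]);
    for (size_t i = 0; i + 3 <= len; i++) {
        unsigned h = ((unsigned char)s[i] * 31u + (unsigned char)s[i + 1]) * 31u
                     + (unsigned char)s[i + 2];
        hist[h % NBUCKETS]++;
    }
}

/* Assumed measure: sum of bucket minima divided by sum of bucket maxima.
   Returns 1.0 for identical histograms and 0.0 for disjoint ones.
   Uses static buffers, so it is not reentrant. */
double trigram_similarity(const char *a, const char *b) {
    static unsigned ha[NBUCKETS], hb[NBUCKETS];
    double sum_min = 0.0, sum_max = 0.0;
    trigram_hist(a, ha);
    trigram_hist(b, hb);
    for (size_t i = 0; i < NBUCKETS; i++) {
        sum_min += ha[i] < hb[i] ? ha[i] : hb[i];
        sum_max += ha[i] > hb[i] ? ha[i] : hb[i];
    }
    if (sum_max == 0.0) return 0.0;  /* both sentences shorter than 3 bytes */
    return sum_min / sum_max;
}
```

In a sketch like this, the returned value would serve as the initial weight of the edge between the two sentences' vertices.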

To enable the SEA in DataparkSearch, you only need to define a section in your sections.conf file:

Section sea 29 1024

After indexing a document collection with this section defined, you can use the $(sea) meta-variable in your template to show a summary for each search result.

Some limitations of the current implementation: a page must have four or more sentences longer than 32 characters; only the first 64 sentences of a page (if available) are used to construct the summary.

The MaxSiteLevel command has been added. Use it to specify the maximum domain name level used for site_id calculation. Default value: 2. One exception: domains of three letters or fewer at level 2 count as domain names at level 1. For example:

  • domain.ext -- level 2
  • www.domain.ext -- level 3
  • domain.com.ext -- level 2
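The level rule can be sketched as a small function. This is one reading of the rule that reproduces the three examples above, not the actual DataparkSearch code. The name domain_level is hypothetical, and how the real implementation treats a two-label host with a short first label is not stated here, so the sketch simply applies the exception literally.

```c
#include <stddef.h>

/* Hypothetical helper: count the dot-separated labels of host, then apply
   the exception that a level-2 label of three letters or fewer counts as
   level 1, which lowers the level of the whole name by one. */
int domain_level(const char *host) {
    int labels = 1;
    const char *last = NULL, *prev = NULL;  /* last and second-to-last dots */
    for (const char *p = host; *p; p++) {
        if (*p == '.') { labels++; prev = last; last = p; }
    }
    if (last != NULL) {
        /* The level-2 label sits between the second-to-last and last dots. */
        const char *start = prev ? prev + 1 : host;
        if ((size_t)(last - start) <= 3) labels--;
    }
    return labels;
}
```

With the default MaxSiteLevel of 2, www.domain.ext (level 3) would then share a site_id with domain.ext.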

The Ultra relevance mode has been modified.

The CrawlDelay command has been added. Use it to specify the default pause, in seconds, between consecutive fetches from the same server. This is similar to the crawl-delay command used in robots.txt, but it can be specified in the indexer.conf file on a per-server basis. If no crawl-delay value is specified in robots.txt, the value of CrawlDelay is used. If crawl-delay is specified in robots.txt, the maximum of CrawlDelay and crawl-delay is used as the interval between consecutive fetches.
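The selection rule described above can be written down in a few lines. This is only an illustrative sketch: the function name effective_crawl_delay and the convention of passing a negative value when robots.txt has no crawl-delay are assumptions, not part of DataparkSearch.

```c
/* Hypothetical helper: pick the pause, in seconds, between fetches from one server.
   conf_delay   - value of the CrawlDelay command from indexer.conf
   robots_delay - crawl-delay from robots.txt, or a negative number if it
                  is absent (a convention of this sketch only) */
double effective_crawl_delay(double conf_delay, double robots_delay) {
    if (robots_delay < 0.0) return conf_delay;  /* robots.txt gave no value */
    /* Both values are present: the larger pause is used. */
    return robots_delay > conf_delay ? robots_delay : conf_delay;
}
```

For example, with CrawlDelay set to 2 and a robots.txt crawl-delay of 5, the sketch yields a 5-second pause; with CrawlDelay set to 10 it yields 10 seconds.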

The Neo PopRank can now be calculated using several indexer threads (e.g. "indexer -TRN4"). This speeds up the Neo PopRank calculation if several simultaneous connections to the SQL server are allowed.