DataparkSearch 4.44 has been released. Changes since the previous release:

  • The calculation of the Neo PopRank has been modified for better performance.
  • Possible infinite recursion has been fixed in the processing of acronyms and abbreviations.
  • ResegmentChinese, ResegmentJapanese, ResegmentKorean and ResegmentThai commands were added.
  • A possible trap in the XML parser has been fixed.
  • Smart phrase segmenting has been implemented for search queries when no language is explicitly specified.
  • Charset and language guessing has been improved for cases where the server reply headers and the meta tags provide contradictory data.
  • Unicode data has been updated to version 5.0.0.
  • Query tracking has been rewritten for searchd. This feature no longer requires the message queue interface.
  • Stricter preconditions are now imposed on automatic updates of language maps.
  • Template loading has been fixed for Apache internal redirects.
  • Words not listed in the spell data are now checked only against the data for the language specified in the language limit or as the search template language.
  • Support for the Tajik KOI8-T charset has been added.
  • Search speed has been improved for the searchd:// DBAddr scheme.
  • searchd has been rewritten to use a prefork model.
  • Hanging searchd children have been fixed.
  • Support for multiline HTTP headers has been added.
  • A possible trap has been fixed for builds compiled without pthreads support.

People have dreamed of flying like birds since the most ancient times… And at the beginning of the 20th century man finally took off, though not quite as dreamed: inside “iron birds”, a.k.a. airplanes.

For about two centuries (maybe more), people have dreamed of a time machine. The end of the 20th century gave hope that this dream can be realized too: the virtual cyber-worlds.

At present, a fight between Google and Microsoft over a virtual model of the Earth is emerging. Yes, for now everything is focused on searching for information about objects on the surface at the current moment in time. But with a virtual Earth and indexed historical archives (yes, there is still a lot of digging to do here), it becomes possible to reconstruct in detail fragments of the Earth's surface in the past: famous battlefields, for example, or the propagation of the tsunami wave in Asia in December 2004, or the only voyage of the "Titanic", etc. In principle, it is possible to "rewind" the whole history of the Earth, back to the moment of origin of the Sun.

OFF-TOPIC: With a model of the virtual Earth, it is also possible to make a "real" Civilization game played online over the "real" Earth.

//DataparkSearch Engine tool

To be precise, those aren't top query words; those are top reply words. The top lists were built in two stages. First, an automatic summary was created for every indexed page (usually the summary consists of the three most common sentences from the page). Second, for every word, the total number of summaries in which the word occurred was counted, and the list of all words was sorted in decreasing order of occurrences.
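The second stage can be sketched roughly as follows. This is only an illustration in Python, not the code used to build the lists: the function name `top_reply_words`, the regular-expression tokenizer, the alphabetical tie-break and the sample summaries are all assumptions, and the summaries are taken as already produced by the first stage.

```python
from collections import Counter
import re


def top_reply_words(summaries, limit=10):
    """Rank words by the number of summaries in which they occur."""
    counts = Counter()
    for summary in summaries:
        # A word is counted once per summary, however often it repeats there.
        words = set(re.findall(r"\w+", summary.lower()))
        counts.update(words)
    # Decreasing order of occurrences; ties broken alphabetically (assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Hypothetical summaries, one per indexed page.
summaries = [
    "Search engine for web sites. Search works.",
    "Open source search engine.",
    "Web crawler and indexer.",
]
print(top_reply_words(summaries, limit=3))
```

Counting each word once per summary, rather than once per occurrence, is what makes the result a list of words that appear in the most replies.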

Happy New Year to all the World!

Percentage of hits from users geographically located in Russia that came from the three major search engines in Russia in 2006:

Month            Yandex              Rambler          Google
January 2006     60.6+0.6+0.1=61.3   21.7             6.3+0.3=6.6
February 2006    61.5+0.8+0.0=62.3   20.9             6.3+0.3=6.6
March 2006       61.4+0.9+0.1=62.4   20.9             6.4+0.3=6.7
April 2006       60.3+0.9+0.1=61.3   21.6+0.0=21.6    6.6+0.3=6.9
May 2006         60.6+1.0+0.1=61.7   21.7+0.1=21.8    6.6+0.3=6.9
June 2006        60.4+1.0+0.1=61.5   21.2+0.1=21.3    7.1+0.3=7.4
July 2006        59.9+1.1+0.1=61.1   21.2+0.0=21.2    7.8+0.3=8.1
August 2006      60.2+1.0+0.1=61.3   20.8+0.1=20.9    7.8+0.3=8.1
September 2006   60.2+1.0+0.1=61.3   21.0+0.1=21.1    8.1+0.3=8.4
October 2006     60.6+1.0+0.1=61.7   20.3+0.1=20.4    8.3+0.3=8.6
November 2006    60.0+1.0+0.1=61.1   20.3+0.1=20.4    8.8+0.3=9.1
December 2006    59.5+0.6+0.1=60.2   20.3+0.1=20.4    9.4+0.4=9.8

The first addend is the share of hits from the main search, the second addend is the share of hits from image search, and the third addend is the share of hits from blog search.
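To read a cell of the table mechanically, one can split it into its addends and compare their sum with the stated total. The helper below is a small sketch for that purpose; the name `check_cell` and the rounding tolerance are assumptions made for the illustration.

```python
def check_cell(cell, tolerance=0.05):
    """Split 'a+b+c=total' into its addends and verify the stated total."""
    left, _, right = cell.partition("=")
    addends = [float(part) for part in left.split("+")]
    # A cell without '=' holds a single figure that is its own total.
    total = float(right) if right else addends[0]
    # Compare with a tolerance because the figures are rounded to one decimal.
    return addends, total, abs(sum(addends) - total) <= tolerance


# Yandex, January 2006: main search + image search + blog search.
addends, total, consistent = check_cell("60.6+0.6+0.1=61.3")
print(addends, total, consistent)
```

Cells with a single figure, such as Rambler's 21.7 for January 2006, pass through unchanged as their own total.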

Following the fashion of making forecasts for the next year, I would like to venture a guess: if not in 2007, then soon after, there will be a new standard for a company web site. Instead of the commonly used home page with a hierarchical menu, a Google-like interface will appear: the home page will contain the company's “visiting card” plus a search box covering all the information on the site. By the way, Google has already released the Google Apps for Domain package to create your own “google.com”. It remains to integrate it with Google Appliance or Google Mini, and it would be highly desirable to revive Google Answers in a local variant (a sort of Google Answers Mini). Then the prototype of a next-generation CMS will be ready.

Big companies have long been able to offer a potential client far more information about their goods and services than can be conveniently arranged in hierarchical menus so that a user grasps the structure on the first visit. So everything will move to the "intuitively clear" interface of a search box.

Yes, of course, a search box is already present on almost every site, but it is usually tucked away in the upper right corner of the screen, and often only nominally, because the search features of many CMSes are rather scanty. This will be reversed: the search box will be the focus of attention and the largest object on the page, while the menu becomes an auxiliary tool that varies depending on what the user searches for, helping the user navigate the search results more quickly or refine queries in one or two clicks.

The color of a point on the NZhole map corresponds to the Popularity Rank value of a web page, and the map's points are ordered from left to right and from top to bottom in ascending order of hop count (the number of "mouse clicks" from a start page). Pages explicitly listed in the search engine config file receive a hops value of 0; pages submitted for indexing via a web form or picked up from an internet directory receive a hops value of 1. Every other page, when inserted into the search engine database, receives a hops value one greater than that of the page where the link to it was found. With this ordering, the smoothed map looks like this:

If the pages are ordered first by the number of inbound links and then by hop count, the map looks like this:

Ordering by the number of outbound links, and then by hop count:

Ordering first by the difference between the number of inbound and the number of outbound links:

Ordering by the difference between the number of outbound and the number of inbound links:

On these maps one can notice that the Popularity Rank is usually higher for pages whose number of inbound links is relatively high and also exceeds the number of outbound links. Conversely, if a page has more outbound links than inbound ones, its popularity rating will usually be lower.

Thus, it looks like the PopRank used in DataparkSearch is more robust against link spam than Google's PageRank.
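The hop-count rule and the five orderings used for the maps can be sketched as follows. This is an illustrative Python sketch, not DataparkSearch code: the function names, the tiny link graph, the breadth-first discovery order, the ascending sort direction for the link-based orderings and the alphabetical tie-break are all assumptions.

```python
from collections import deque


def assign_hops(links, configured, submitted=()):
    """Hops: 0 for configured pages, 1 for submitted ones, and the
    linking page's hops + 1 for every page discovered through a link."""
    hops = {page: 0 for page in configured}
    for page in submitted:
        hops.setdefault(page, 1)
    # Visit hop-0 pages before hop-1 pages so the queue stays ordered by hops.
    queue = deque(sorted(hops, key=hops.get))
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in hops:
                hops[target] = hops[page] + 1
                queue.append(target)
    return hops


def link_counts(links):
    """Count inbound and outbound links for every page in the graph."""
    inbound, outbound = {}, {}
    for page, targets in links.items():
        outbound[page] = len(targets)
        inbound.setdefault(page, 0)
        for target in targets:
            inbound[target] = inbound.get(target, 0) + 1
            outbound.setdefault(target, 0)
    return inbound, outbound


def orderings(links, hops):
    """Return the page order for each of the five map layouts."""
    inbound, outbound = link_counts(links)
    n_in = lambda p: inbound.get(p, 0)
    n_out = lambda p: outbound.get(p, 0)
    pages = sorted(hops)  # alphabetical tie-break (assumption)
    keys = {
        "hops": lambda p: hops[p],
        "inbound_then_hops": lambda p: (n_in(p), hops[p]),
        "outbound_then_hops": lambda p: (n_out(p), hops[p]),
        "inbound_minus_outbound": lambda p: n_in(p) - n_out(p),
        "outbound_minus_inbound": lambda p: n_out(p) - n_in(p),
    }
    return {name: sorted(pages, key=key) for name, key in keys.items()}


# Hypothetical site: "home" is in the config file, "faq" came via a web form.
links = {
    "home": ["about", "docs"],
    "docs": ["home", "api", "faq"],
    "about": ["home"],
    "faq": ["docs"],
}
hops = assign_hops(links, configured=["home"], submitted=["faq"])
result = orderings(links, hops)
print(hops)
print(result["inbound_then_hops"])
```

Each resulting list gives the order in which points would be laid out from left to right and from top to bottom; coloring them by Popularity Rank is left out of the sketch.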

This article in Russian: Немного когнитивности.