
Below is an example of accessing DataparkSearch Engine results from its searchd daemon in Python, using a RESTful client and JSON.

No additional packages need to be installed on Ubuntu Linux if you already have the Python interpreter installed.

This example uses the RESTful API provided by the searchd daemon of DataparkSearch Engine and a search result template that produces JSON output. You can find the template at doc/samples/json.htm inside the DataparkSearch distribution package.

The script prints a list of page titles with their URLs, followed by the total number of documents found in the database for the given query, the time the query took to execute, and the range of document numbers shown in the list.


#!/usr/bin/python

import json
import urllib
import urllib2

url = 'http://inet-sochi.ru:7003/'

params = {
    # The category of the results, 09 - for Australian sites; this is specific to the inet-sochi.ru installation
    'c': '09',
    # number of results per page, i.e. how many results will be returned
    'ps': 10,
    # result page number, starting with 0
    'np': 0,
    # synonyms use flag, 1 - to use, 0 - don't
    'sy': 0,
    # word forms use flag, 1 - to use, 0 - don't (search for words in query exactly)
    'sp': 1,
    # search mode, can be 'near', 'all', 'any'
    'm': 'near',
    # results grouping by site flag, 'yes' - to group, 'no' - don't
    'GroupBySite': 'no',
    # search result template
    'tmplt': 'json2.htm',
    # search result ordering, 'I' - importance, 'R' - relevance, 'P' - PopRank, 'D' - date; use lower case letters for descending order
    's': 'IRPD',
    # search query; urllib.urlencode() below URL-escapes it
    'q': 'careers'
}

data = urllib.urlencode(params)

full_url = url + '?' + data

result = json.load(urllib2.urlopen(full_url))

rD = result['responseData']

for res in rD['results']:
    print res['title']
    print ' => ' + res['url']
    print

print ' ** Total ' + rD['found'] + ' documents found in ' + rD['time'] + ' sec.'
print ' Displaying documents ' + rD['first'] + '-' + rD['last'] + '.'

The source code of this example is available on GitHub: github.com/Maxime2/dpsearch-python. Feel free to send pull requests with your own samples of using DataparkSearch Engine in Python.
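The script above targets Python 2 (urllib2 and the print statement). For readers on Python 3, here is a minimal sketch of the same request using only the standard library. It is not taken from the dpsearch-python repository. The function names build_url, format_results and search are made up for this illustration, and the response fields are assumed to be the same ones the script above reads.

```python
#!/usr/bin/env python3
# A Python 3 sketch of the example above; helper names are hypothetical.

import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = 'http://inet-sochi.ru:7003/'


def build_url(query, page=0, per_page=10, base_url=BASE_URL):
    """Build the searchd request URL with the same parameters as above."""
    params = {
        'c': '09',            # category, specific to the inet-sochi.ru installation
        'ps': per_page,       # results per page
        'np': page,           # page number, starting with 0
        'sy': 0,              # synonyms flag
        'sp': 1,              # word forms flag
        'm': 'near',          # search mode
        'GroupBySite': 'no',  # grouping by site
        'tmplt': 'json2.htm', # JSON search result template
        's': 'IRPD',          # result ordering
        'q': query,           # urlencode() escapes the query
    }
    return base_url + '?' + urlencode(params)


def format_results(result):
    """Turn a parsed JSON reply into the text the original script prints."""
    rd = result['responseData']
    lines = []
    for res in rd['results']:
        lines.append(res['title'])
        lines.append(' => ' + res['url'])
        lines.append('')
    lines.append(' ** Total {} documents found in {} sec.'.format(rd['found'], rd['time']))
    lines.append(' Displaying documents {}-{}.'.format(rd['first'], rd['last']))
    return '\n'.join(lines)


def search(query, **kwargs):
    """Send the request to searchd and return the parsed JSON reply."""
    with urlopen(build_url(query, **kwargs)) as response:
        return json.loads(response.read().decode('utf-8'))
```

Calling print(format_results(search('careers'))) would run the query against the live server. Keeping URL building and formatting in separate functions makes it possible to check both without network access.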


Below is an example of accessing DataparkSearch Engine results from its searchd daemon in PHP, using a RESTful client and JSON.

The httpful PHP library is used as the REST client library. Simply download the httpful.phar file into the directory where you run this sample. On Linux you can do this with the command:


wget -c http://phphttpclient.com/httpful.phar

You also need the curl and json PHP extensions installed on your system. On Ubuntu Linux you can install them using the following command:


sudo apt-get install php5-curl php5-json

This example uses the RESTful API provided by the searchd daemon of DataparkSearch Engine and a search result template that produces JSON output. You can find the template at doc/samples/json.htm inside the DataparkSearch distribution package.

The script prints a list of page titles with their URLs, followed by the total number of documents found in the database for the given query, the time the query took to execute, and the range of document numbers shown in the list.


<?php
include('./httpful.phar');

// The host with searchd running
$host = 'http://inet-sochi.ru:7003/';

// The category of the results, 09 - for Australian sites; this is specific to the inet-sochi.ru installation
$_c = '09';
// number of results per page, i.e. how many results will be returned
$_ps = 10;
// result page number, starting with 0
$_np = 0;
// synonyms use flag, 1 - to use, 0 - don't
$_sy = 0;
// word forms use flag, 1 - to use, 0 - don't (search for words in query exactly)
$_sp = 1;
// search mode, can be 'near', 'all', 'any'
$_m = 'near';
// results grouping by site flag, 'yes' - to group, 'no' - don't
$_GroupBySite = 'no';
// search result template 
$_tmplt = 'json2.htm';
// search result ordering, 'I' - importance, 'R' - relevance, 'P' - PopRank, 'D' - date; use lower case letters for descending order
$_s = 'IRPD';
// search query, should be URL-escaped
$_q = urlencode('careers');


$url = $host . '?c=' . $_c 
    . '&ps=' . $_ps 
    . '&np=' . $_np 
    . '&sy=' . $_sy 
    . '&sp=' . $_sp 
    . '&m=' . $_m 
    . '&GroupBySite=' . $_GroupBySite 
    . '&tmplt=' . $_tmplt 
    . '&s=' . $_s 
    . '&q=' . $_q 
    ;

$response = \Httpful\Request::get($url)
    ->send();

$result = $response->body->responseData;

foreach ($result->results as $res) {
    echo "{$res->title} => {$res->url}\n";
}

echo " ** Total {$result->found} documents found in {$result->time} sec.\n";
echo " Displaying documents {$result->first}-{$result->last}.\n";


Below is an example of accessing DataparkSearch Engine results from its searchd daemon in Ruby, using a RESTful client and JSON.

First of all, you need to have the Ruby interpreter installed on your system. On Ubuntu 13.10 you can install it with the following command:


sudo apt-get install ruby1.9.1-full

Then you need to install rest-client and json packages with the following command:


sudo gem install rest-client json

This example uses the RESTful API provided by the searchd daemon of DataparkSearch Engine and a search result template that produces JSON output. You can find the template at doc/samples/json.htm inside the DataparkSearch distribution package.

The script prints a list of page titles with their URLs, followed by the total number of documents found in the database for the given query, the time the query took to execute, and the range of document numbers shown in the list.


#!/usr/bin/ruby

require 'cgi'
require 'rest_client'
require 'json'

# The category of the results, 09 - for Australian sites; this is specific to the inet-sochi.ru installation
_c = '09'
# number of results per page, i.e. how many results will be returned
_ps = 10
# result page number, starting with 0
_np = 0
# synonyms use flag, 1 - to use, 0 - don't
_sy = 0
# word forms use flag, 1 - to use, 0 - don't (search for words in query exactly)
_sp = 1
# search mode, can be 'near', 'all', 'any'
_m = 'near'
# results grouping by site flag, 'yes' - to group, 'no' - don't
_GroupBySite = 'no'
# search result template 
_tmplt = 'json2.htm'
# search result ordering, 'I' - importance, 'R' - relevance, 'P' - PopRank, 'D' - date; use lower case letters for descending order
_s = 'IRPD'
# search query; RestClient URL-escapes the :params values itself, so escaping here would double-escape it
_q = 'careers'

response = RestClient.get('http://inet-sochi.ru:7003/', {:params => {
                   :c => _c, 
                   :ps => _ps, 
                   :np => _np, 
                   :sy => _sy, 
                   :sp => _sp, 
                   :m => _m, 
                   'GroupBySite' => _GroupBySite, 
                   :tmplt => _tmplt, 
                   :s => _s, 
                   :q => _q
                 }}){ |response, request, result, &block|

  case response.code
  when 200
#    p "It worked !"
    response
  when 423
    raise SomeCustomExceptionIfYouWant
  else
    response.return!(request, result, &block)
  end
}

result = JSON.parse(response)

result['responseData']['results'].each { |pos|
  print "#{pos['title']}\n => #{pos['url']}\n\n"
}

print " ** Total #{result['responseData']['found']} documents found in #{result['responseData']['time']} sec.\n"
print " Displaying documents #{result['responseData']['first']}-#{result['responseData']['last']}.\n"

A new snapshot of DataparkSearch Engine 4.54 is available: dpsearch-4.54-2013-11-07.tar.bz2 on Google Code.

The changes are:

  • Added BETWEENRES section for search result template
  • Added HTTP status and headers to the RESTful reply
  • Improved performance for Neo PopRank calculation with huge number of links
  • GuesserUseMeta is now enabled by default
  • Fixed checking for aspell; added checking for qsort_r; switched to GNU checking for gethostbyname_r
  • Fixed crash in sitemap processing
  • Added checking if a href matches a server with nofollow

Ubuntu/Debian and RPM packages are available in the Download section on Google Code.

I've just discovered that DataparkSearch Engine has been compared with other open source search tools for crawling and indexing free music in Volume 18, Issue 1 of the Journal of Telecommunications (2013); see it on Scribd.

Some points missed in the article:

  • DataparkSearch has a RESTful interface, though it's available only in the upcoming 4.54 version (you can try it in the latest snapshot).
  • Using DataparkSearch templates, you can get the search results in practically any text-based format you need to process. In addition to the standard HTML and XML/RSS, it could be, e.g., JSON or a JavaScript callback for Yahoo's YUI library.
  • In addition to SQL-based storage for the index, DataparkSearch has its own storage, which is similar to today's NoSQL stores.

With these features, it's quite easy to integrate DataparkSearch with any other application or framework.

A new snapshot of DataparkSearch Engine 4.54 is available: dpsearch-4.54-2013-09-15.tar.bz2

The changes are:

  • Added 2 implementations of heapsort: one from the FreeBSD project and a Bottom-Up version. The configure script selects whichever of these and the system's heapsort, if present, works fastest.
  • Fixed the implementation of the multithreaded version of quicksort for search results.
  • Added the --enable-sort switch for configure to select the sorting method for search results (heap or quick). heapsort is the default for compatibility with previous versions. quicksort is a multithreaded (parallel) version of quicksort that is expected to work faster on average, but it degrades in the worst-case scenario, as any quicksort does.
  • Fixed compilation with Apache 2.4.x
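
The worst-case caveat for quicksort mentioned in the list above can be illustrated with a toy Python sketch. This is not the DataparkSearch implementation (which is written in C and multithreaded). It is a deliberately naive quicksort that always picks the first element as the pivot, and the function name quicksort_comparisons is made up for this illustration. On input that is already sorted, every partition is as unbalanced as possible, so the comparison count grows to n(n-1)/2.

```python
def quicksort_comparisons(items):
    """Sort a copy of items with a naive first-element-pivot quicksort.

    Returns (sorted_list, number_of_comparisons_against_a_pivot).
    """
    count = 0

    def sort(seq):
        nonlocal count
        if len(seq) <= 1:
            return seq
        pivot, rest = seq[0], seq[1:]
        smaller, larger = [], []
        for x in rest:
            count += 1  # one comparison of x against the pivot
            if x < pivot:
                smaller.append(x)
            else:
                larger.append(x)
        return sort(smaller) + [pivot] + sort(larger)

    return sort(list(items)), count


# Already sorted input: each pivot is the minimum of its sublist,
# so 200 items cost 199 + 198 + ... + 1 = 19900 comparisons.
ordered, worst = quicksort_comparisons(range(200))

# The same number of distinct items in a scrambled order needs fewer.
scrambled, fewer = quicksort_comparisons([(i * 37) % 211 for i in range(200)])
```

Heapsort has no such input-dependent blow-up, which is one reason to keep it as the default.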


I was unable to find a general-purpose implementation of bottom-up heapsort, so I wrote one myself, with a small modification.

You can find the source code on GitHub: github.com/Maxime2/heapsort

Bottom-up heapsort is a variant of heapsort with a new reheap procedure. This sequential sorting algorithm beats quicksort on average if n > 2400, and a clever version of quicksort (the median-of-3 modification) if n > 16000.

The bottom-up heapsort algorithm is described in Ingo Wegener, "BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average, QUICKSORT (if n is not very small)", Theoretical Computer Science 118 (1993), pp. 81-98, Elsevier.

The modification I've made saves (n-2)/2 swaps and (n-2)/2 comparisons for n > 3. It is based on the idea of a delayed reheap after moving the root to its place, from D. Levendeas and C. Zaroliagis, "Heapsort using Multiple Heaps", in Proc. 2nd Panhellenic Student Conference on Informatics (EUREKA), 2008, pp. 93-104.
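
To give a feel for the reheap procedure, here is a short Python sketch of plain bottom-up heapsort. It is an illustration only, not the C code from the repository, and it does not include the delayed-reheap modification described above. The function names are chosen for this example. The idea is to first walk down from the root to a leaf along the larger children, without comparing against the element being sifted, and then climb back up to find where that element belongs.

```python
def leaf_search(a, i, end):
    """Follow the larger child from index i down to a leaf of the heap a[0:end]."""
    j = i
    while 2 * j + 2 < end:              # while j has two children
        if a[2 * j + 2] > a[2 * j + 1]:
            j = 2 * j + 2
        else:
            j = 2 * j + 1
    if 2 * j + 1 < end:                 # a single (left) child at the last level
        j = 2 * j + 1
    return j


def sift_down(a, i, end):
    """Bottom-up reheap: restore the max-heap property for the subtree at i."""
    j = leaf_search(a, i, end)
    while a[i] > a[j]:                  # climb up to the position for a[i]
        j = (j - 1) // 2
    x = a[j]
    a[j] = a[i]
    while j > i:                        # shift the path elements one level up
        j = (j - 1) // 2
        x, a[j] = a[j], x


def bottom_up_heapsort(a):
    """Sort the list a in place in ascending order and return it."""
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):  # build the max-heap
        sift_down(a, i, n)
    for end in range(n - 1, 0, -1):      # move the maximum to its final place
        a[0], a[end] = a[end], a[0]
        sift_down(a, 0, end)
    return a
```

For example, bottom_up_heapsort([5, 1, 4, 2, 3]) returns [1, 2, 3, 4, 5]. Most elements sifted from the root end up near the bottom of the heap, so searching for the leaf first and climbing back a few levels usually costs fewer comparisons than the classic top-down sift.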