This version of iCrawl uses Apache Accumulo to store the crawled data. At the moment you need to install and manage Accumulo separately; this will hopefully change in future versions.

Installing Accumulo

  1. Download and install Hadoop 1.2.x. Refer to the Hadoop documentation.

    curl -O https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
    tar xzvf hadoop-1.2.1.tar.gz
    cd hadoop-1.2.1
    $EDITOR conf/*-site.xml
    bin/hadoop namenode -format
  2. Download and install Zookeeper 3.4.6

    curl -O https://archive.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
    tar xzvf zookeeper-3.4.6.tar.gz
    cd zookeeper-3.4.6
    cp conf/zoo{_sample,}.cfg
    vim conf/zoo.cfg
    bin/zkServer.sh start
  3. Download and install Accumulo 1.5.2

    curl -O https://archive.apache.org/dist/accumulo/1.5.2/accumulo-1.5.2-bin.tar.gz
    tar xzvf accumulo-1.5.2-bin.tar.gz
    cd accumulo-1.5.2
    # specify instance=icrawl and password=password
    bin/accumulo init
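
Accumulo needs HDFS and ZooKeeper to be up and listening before `bin/accumulo init` (and later the crawler) can talk to them. A minimal helper for checking that, sketched in bash under the assumption of the default ports (HDFS on 9000, ZooKeeper on 2181); the start commands in the trailing comments are the stock scripts shipped with each distribution:

```shell
#!/bin/bash
# Poll a TCP port until something is listening on it or the timeout
# (in seconds) expires. Uses bash's built-in /dev/tcp redirection,
# so no extra tools like nc are required.
wait_for_port() {
  local host=$1 port=$2 timeout=${3:-30} i
  for ((i = 0; i < timeout; i++)); do
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0            # port is accepting connections
    fi
    sleep 1
  done
  return 1                # gave up waiting
}

# Typical usage after starting each service (default ports assumed):
#   hadoop-1.2.1/bin/start-dfs.sh         && wait_for_port localhost 9000
#   zookeeper-3.4.6/bin/zkServer.sh start && wait_for_port localhost 2181
#   accumulo-1.5.2/bin/start-all.sh
```

If a port never comes up, check the corresponding service's logs before proceeding; `accumulo init` will fail with opaque errors when HDFS or ZooKeeper are unreachable.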

Create API accounts

To enable all functions of the iCrawl system, you need to provide a Bing and a Twitter API key. The Bing key is used in the crawl wizard to provide Web search results. The Twitter key is used for the integrated crawling and the crawl wizard.

Creating a Bing API key

Look up the Bing Search API in the Azure Marketplace and register for the free tier. Afterwards you can create a new key on your Azure account page.

Copy the file conf/ to conf/ and insert the value of the key you just created into that file as below.


Creating a Twitter API key

To create a Twitter API key, you need to register a new Twitter application on the Twitter Apps site. Provide a callback URL of the form http://$your_server_name:8080/twitter/callback. The Access level can be set to read only.

Afterwards click on the created app and go to the Keys and Access tokens tab. Create an access token by clicking on the corresponding link. Copy the file conf/ to conf/ and insert the values for consumer secret and access token.

After you have started iCrawl, you can also add additional keys through the Settings link.

From a released version

Extract the downloaded file, cd into the created folder and run


The system should start up now and be available after a short while at localhost:8080.

From an SVN checkout

Download Sencha CMD and unpack it to any directory.

Set an environment variable with the Sencha Cmd path:

export SENCHA_CMD=/home/user/Sencha/Cmd/
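
Sencha Cmd is invoked through the `sencha` binary inside that directory; if the build cannot locate it, adding the directory to `PATH` as well is a common fix. A sketch, assuming the install path from the export above:

```shell
# Assumed install location (from the export above); adjust to where
# you actually unpacked Sencha Cmd.
export SENCHA_CMD=/home/user/Sencha/Cmd/
# Also expose the directory on PATH so the `sencha` binary can be
# invoked directly by the build.
export PATH="$SENCHA_CMD:$PATH"
```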

Open a shell in the directory containing the checked-out code. Use

mvn package -Pdistro

to build the distribution package (in folder dist/target/icrawl-$VERSION). This may take a while, as it has to download and compile Nutch and HBase.

cd into the created folder

cd dist/target/icrawl-$VERSION/icrawl-$VERSION

and start the system using


This will start two separate processes, the Services Manager and the Crawl Manager. The Services Manager is responsible for starting and shutting down the Crawl Manager, as well as HBase, HSQLDB (an embedded database) and the Web interface. The Crawl Manager runs the actual crawls by calling Nutch.

After iCrawl has started, you can access the user interface at localhost:8080. Additionally, you can access the message bus user interface.

In the logs folder you can find the output of the crawl manager (crawlmanager.{out,err}) and HBase (hbase-*).

To shut down iCrawl, stop the Services Manager, either by pressing Ctrl-C in the console or by killing the process. The process id of the Services Manager is printed in the log output. If at all possible, do not force-kill the processes (e.g. with kill -9 or by pressing the stop button in Eclipse): in that case HBase might not be shut down correctly, which can cause data corruption.
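
The graceful shutdown described above can be sketched as a small helper: send SIGTERM to the Services Manager PID (taken from the log output) and wait for the process to exit, instead of reaching for kill -9:

```shell
#!/bin/bash
# Stop a process gracefully: SIGTERM first, then wait up to 30 seconds
# for it to exit. Deliberately never escalates to SIGKILL, because
# force-killing can leave HBase shut down uncleanly and corrupt data.
graceful_stop() {
  local pid=$1
  kill -TERM "$pid" 2>/dev/null || return 0     # already gone
  for _ in $(seq 1 30); do
    kill -0 "$pid" 2>/dev/null || return 0      # exited cleanly
    sleep 1
  done
  echo "process $pid still running after 30s" >&2
  return 1
}
```

For example, `graceful_stop 12345`, where 12345 is the PID printed in the Services Manager log output.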

