Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr, adding web-specific components such as a crawler, a link-graph database, and parsing support, handled by Apache Tika, for HTML and an array of other document formats. Nutch can run on a single machine, but it gains much of its strength from running on a Hadoop cluster.
This Apache Nutch training course covers installation, configuration and writing custom resources.
Prior knowledge of the following technologies is needed to attend this Apache Nutch workshop:
- Java/J2EE, databases
- IDE, Ant build tool
- Hadoop