[Ilugc] Intranet search engine
girishvenkatachalam at gmail.com
Sat Oct 23 06:13:28 IST 2010
This sure is quite interesting.
But I bet it is not going to be easy.
I have heard about Lucene. Since it is Java ....
But I hear it is available in Python as well. Even Perl.
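
For anyone wondering what an indexer like Lucene does at its core, here is a rough sketch of an inverted index in plain Python. This is not Lucene's or Solr's API, only the basic idea of mapping words to the pages that contain them. The function names and the example pages and URLs are made up for illustration, and a real engine adds ranking, stemming, crawling and on-disk storage on top of this.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs of pages containing every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    # Start from the pages matching the first word, then narrow down.
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical intranet pages, made up for illustration only.
pages = {
    "http://intranet.example/hr": "Leave policy and holiday list",
    "http://intranet.example/it": "VPN setup and password policy",
}
index = build_index(pages)
print(search(index, "policy"))
print(search(index, "password policy"))
```

With 1 million pages an in-memory dictionary like this will not be enough, which is where something like Lucene or Solr comes in.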
On Fri, Oct 22, 2010 at 11:57 PM, Gourav Shah <gs at initcron.org> wrote:
>> Did you check Apache Solr?
> Exactly. Solr is a very powerful open source indexer. It is a
> subproject of Apache Lucene and uses the Lucene libraries for indexing. It is
> well supported by the community. You could use the Tika content extraction
> framework to index not only HTML but also many other rich document formats
> such as doc, ppt, xls, rtf, pdf, and even tar.gz, bzip and zip archives.
> Initcron Labs has designed an appliance for Solr named Blaze. Check it
> out at http://www.initcron.org/blaze .
> There is also another Lucene-based project called Nutch, which provides
> web-specific features such as a crawler, HTML parser, link graph database
> etc. You can also integrate Solr and Nutch to build a solution.
> Here are a few useful links
> Solr: http://lucene.apache.org/solr/
> Tika + Solr :
> Nutch: http://nutch.apache.org/about.html
> Solr + Nutch: http://wiki.apache.org/nutch/RunningNutchAndSolr
> Lucene: http://lucene.apache.org/java/docs/index.html
> If you are looking for assistance/consulting to implement a Solr-based
> solution, contact me off the list.
>> > Dear luggies
>> > I am planning to have a search engine similar to Google for my intranet
>> > (actually it spans all of India, with about 2000 intranet sites). I have
>> > about 500-600 GB of data and about 1 million pages. I found
>> > htdig (htdig.org) and mnogosearch (mnogosearch.org) to be suitable.
girish at gayatri-hitech.com