Resources

1. Doug Cutting weblog

2. wiki

3. A collection of articles

Basics

1. Lucene home

2. Download

3. API

Readings in IR

1. Porter Stemming Algorithm

2. Indexing formats

3. Querying formats

Nutch

Nutch wiki

Satya - Sunday, May 29, 2005 3:17:03 PM

A brief history

1997 - Doug Cutting started Lucene
2000 - Moved to SourceForge
2004 - Widely accepted

Satya - Sunday, May 29, 2005 3:21:01 PM

What is nutch?

Nutch is an open source search engine. Doug Cutting is the primary developer here as well.

Satya - Sunday, May 29, 2005 3:35:33 PM

What is simpy?

Simpy is a social bookmarking service created by Otis, co-author of Lucene in Action.

Go to the site

Satya - Wednesday, June 01, 2005 5:11:30 PM

General idea of indexing

The following objects are used to accomplish indexing:

IndexWriter
Analyzer
Document
Field

You prime an index writer with a directory path and an analyzer. The index writer writes the index files into that directory, and it uses the analyzer to process each document and pick out the significant items (terms) to index.

A document (unlike an HTML or Word document) is a collection of fields, where each field can contain any amount of text. You hand the document to the index writer to index it, and you can add any number of documents to the same index writer.

Depending on its type (keyword, unindexed, unstored, text), a field's value may or may not be stored in the index, and may or may not be indexed for searching.
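To make the roles above concrete, here is a toy in-memory model in plain Java. It is only a sketch of the idea, not Lucene's implementation: the names ToyIndexWriter, ToyField and analyze are made up for illustration, and the field is reduced to two switches (stored, indexed). It ignores the third distinction, whether a field is tokenized.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy in-memory index writer; hypothetical, for illustration only. */
class ToyIndexWriter {

    /** A named piece of text plus the two switches modelled here. */
    static final class ToyField {
        final String name;
        final String value;
        final boolean stored;   // keep the original value for display
        final boolean indexed;  // make the value searchable

        ToyField(String name, String value, boolean stored, boolean indexed) {
            this.name = name;
            this.value = value;
            this.stored = stored;
            this.indexed = indexed;
        }
    }

    // "field:term" -> numbers of the documents that contain the term
    final Map<String, Set<Integer>> postings = new TreeMap<>();
    // one entry per document: the values of its stored fields
    final List<Map<String, String>> storedFields = new ArrayList<>();

    /** Stand-in for an analyzer: lower-case, split on non-alphanumerics. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Adds a document (a collection of fields) and returns its number. */
    int addDocument(List<ToyField> doc) {
        int docNumber = storedFields.size();
        Map<String, String> stored = new HashMap<>();
        for (ToyField f : doc) {
            if (f.stored) {
                stored.put(f.name, f.value);
            }
            if (f.indexed) {
                for (String term : analyze(f.value)) {
                    postings.computeIfAbsent(f.name + ":" + term,
                            k -> new TreeSet<>()).add(docNumber);
                }
            }
        }
        storedFields.add(stored);
        return docNumber;
    }
}
```

In this sketch a field that is indexed but not stored can be searched but its text cannot be shown back, and a field that is stored but not indexed can be shown but not searched.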

Satya - Wednesday, June 01, 2005 6:10:51 PM

General idea of searching

The counterpart of the IndexWriter is an IndexSearcher, which takes the directory path as its input. Once constructed, it is ready to run a query. The query is obtained from your input string by a QueryParser, and the query parser in turn takes an analyzer as its input.

In summary, the relevant objects are:

IndexSearcher
Query
QueryParser
Analyzer
Hits

Hits is a collection of lazily loaded documents returned by the search.

Satya - Wednesday, June 01, 2005 6:16:05 PM

What is the role of fields in searching

While searching, the QueryParser takes a field name as one of its arguments, to be used as the default field. What is the general role of fields in searching? What happens if you don't specify a field name for a search query?
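As far as the query syntax goes, a clause written as field:term is searched in that field, and a bare term falls back to the default field given to the parser. The plain Java sketch below imitates only that one rule. ToyQueryParser and its parse method are hypothetical names, and the real QueryParser handles much more (phrases, operators, analysis).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy stand-in for a query parser; hypothetical, for illustration only. */
class ToyQueryParser {

    /** Resolves each whitespace-separated clause to "field:term". */
    static List<String> parse(String query, String defaultField) {
        List<String> clauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty query string
            }
            int colon = token.indexOf(':');
            // A "field:" prefix names the field; otherwise use the default.
            String field = colon > 0 ? token.substring(0, colon) : defaultField;
            String term = colon > 0 ? token.substring(colon + 1) : token;
            clauses.add(field + ":" + term.toLowerCase(Locale.ROOT));
        }
        return clauses;
    }
}
```

For example, parse("title:superstring relativity", "content") resolves the first clause against the title field and the second against the content field.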

Satya - Wednesday, June 01, 2005 6:25:51 PM

Understanding queries

Erik Hatcher's article at java.net

Satya - Thursday, June 02, 2005 7:40:39 AM

You can do the following with Lucene

A quote -

"Google could index the articles but we wouldn't be able to show results based on questions such as, 'show me all the articles by Professor Henry that deal with relativity and have superstring in their title.'"

- by Thomas Paul at Java Ranch

Satya - Thursday, June 02, 2005 7:50:10 AM

Elaborating on the "fields"

Read the following article by Thomas Paul to get a brief introduction to fields and how to use them.

Satya - Thursday, June 02, 2005 7:59:03 AM

Parallels to database indexing

A Lucene index is a data store that is similar to a table. You can search the index as you would search a table. Documents are inserted ("added", to be precise) into the table as rows. The documents may or may not all have the same fields (columns). Each row (document) has fields that are indexed and fields that are not, as in a database.

Satya - Thursday, June 02, 2005 8:06:11 AM

How is the web crawled

A nice article by Keld Hansen

He says

"The first step is to find out how to "crawl the web". That is: request a page using the HTTP protocol, receive the page, extract the text in the page, and harvest the links in the page. Then repeat this process for every link found."

The interesting conclusion then is, if your page is not available as a link on an already public site, then it is hidden from the crawlers.
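The loop Keld Hansen describes, and the conclusion about hidden pages, can be sketched in plain Java over an in-memory link graph. This is a toy under stated assumptions: ToyCrawler is a made-up name, the map of links stands in for real HTTP requests and HTML parsing, and the page names are invented.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy crawler over an in-memory link graph; hypothetical, for illustration only. */
class ToyCrawler {

    /** Breadth-first crawl: visit the seeds, then every link found, once each. */
    static Set<String> crawl(Map<String, List<String>> linksByPage, List<String> seeds) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>(seeds);
        while (!frontier.isEmpty()) {
            String page = frontier.poll();
            if (!visited.add(page)) {
                continue; // already visited
            }
            // A real crawler would request the page over HTTP here,
            // extract its text, and harvest its links.
            for (String link : linksByPage.getOrDefault(page, Collections.emptyList())) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

A page that links out but has no incoming link from a reachable page never enters the frontier, so the crawl never sees it.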

Satya - Thursday, June 02, 2005 8:28:28 AM

Information retrieval and web mining

Information retrieval and web mining: A Stanford lecture

Professors - Prabhakar Raghavan, Hinrich Schütze

Teaching assistants - Wang Lam, Mahati Mahabhashyam

Satya - Thursday, June 02, 2005 9:02:44 AM

Look for some articles on "relevance"

So far, the search has been for a handful of keywords that are known to the user. Look for strategies where, given a whole document's worth of information, you find similar documents that are already in the database.

This is probably being done already by players such as Google. I wonder whether their desktop toolkit has this built in.

What about Lucene? Look through the literature or their newsgroup for this subject. See what Pramod came up with from the book.

See whether any of the researchers at Stanford have information on this.
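One textbook way to make "similar documents" concrete is to treat each document as a vector of term counts and compare two vectors by the cosine of the angle between them. The plain Java sketch below shows only that measure. ToySimilarity is a made-up name, and nothing here is claimed to be what Lucene or Google actually do.

```java
import java.util.Map;

/** Toy cosine similarity over term counts; hypothetical, for illustration only. */
class ToySimilarity {

    /** Returns a value in [0, 1]: 0 for no shared terms, 1 for the same direction. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int count = e.getValue();
            normA += (double) count * count;
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) count * other; // term shared by both documents
            }
        }
        for (int count : b.values()) {
            normB += (double) count * count;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // an empty document is similar to nothing
        }
        return dot / Math.sqrt(normA * normB);
    }
}
```

Given a new document, one could compute this score against every stored document and keep the highest ones, although doing that efficiently over a large index is a separate problem.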

Satya - Thursday, June 02, 2005 10:21:09 AM

Some ideas/notes on information retrieval from Mahathi Mahabhashyam

See the powerpoint

Satya - Friday, June 03, 2005 8:29:51 AM

A strategy for indexing: A case study - Dion Almaer

Read this article for a possible strategy

Satya - Friday, June 03, 2005 9:25:10 AM

Glossary of IR terms

Glossary

This is very useful because most of the ideas in IR are listed here along with what they are called in the literature. This will allow us to search for those ideas on Google.

For example, see the following extract:

Content-Based Filtering: The process of filtering by extracting features from the text of documents to determine the documents' relevance. Also called "cognitive filtering".

Satya - Friday, June 03, 2005 9:30:46 AM

Another interesting read from IR

Short history of Information Retrieval and Libraries

Satya - Friday, June 03, 2005 9:36:32 AM

Content based filtering - Oard and Marchionini

Read the article

See whether the article presents strategies

Satya - Friday, June 03, 2005 9:38:58 AM

Finally, a close enough query for Google

Search Google for: content based filtering in lucene

Satya - Friday, June 03, 2005 9:55:30 AM

Go through the Lucene FAQ page on jGuru

jGuru FAQ

See whether I can find out about content based filtering here

Satya - Saturday, June 04, 2005 11:45:53 AM

Follow up on Lucene content filtering at jGuru

My post at jGuru on content filtering

Satya - Monday, June 06, 2005 4:07:53 PM

Some discussion on finding similar documents

Extracted from the August 2003 archives

Satya - Monday, June 06, 2005 4:09:18 PM

Some sample code for similarities

Sample code

Satya - Monday, June 06, 2005 4:18:37 PM

Lucene message archives

Archives

Satya - Monday, June 06, 2005 5:38:49 PM

Lucene 1.4.3 release notes

1.4.3 release notes

Satya - Tuesday, June 07, 2005 9:13:00 AM

A weblog: Bayesian Nets, Latent Semantics, Despamming and other speculations

Visit this 2003 web log of Tim Oren

Satya - Monday, June 27, 2005 7:49:07 PM

Search the news group

Search the Lucene mailing list at OpenSubscriber

Somehow I couldn't do this from the Lucene home page at Apache.

Satya - Monday, June 27, 2005 8:11:57 PM

Example of a term frequency vector

{content: 0/1, 02/1, 03/1, 04/1, 05/1, 1/4, 10/4, 12/1, 14/1, 2/1, 2.0/1, 
2005/8, 22/1, 24/5, 26/1, 27/5, 28/1, 33/2, 34/2, 36/1, 5/1, access/1, 
accessed/1, agent/5, agentdao/1, akc/1, already/1, am/4, append/1, 
architectural/1, architecture/3, author/3, b/1, back/1, 
between/1, blogs/1, blue/2, ccp/9, central/1, channel/1, class/1, 
 classic/1, clearcase/1, column/1, content/1, create/1, cross/1, 
 current/2, cvs/1, cvsroot/1, data/4, default/1, delivery/1, 
 develop/1, directory/1, display/1, doc/2, docs/3, embedded/1, 
 enquiry/1, essentially/1, excel/1, feedback/1, fileupload/1, 
 florida/1, folder/1, format/1, framework/1, friday/4, from/1, 
 functionality/2, general/1, generic/2, get/1, go/1, google/1, 
 have/1, home/2, host/1, how/3, i/1, idea/1, information/1, 
 initial/1, interface/5, june/8, knowledge/2, library/2, links/1, look/1, 
 main/1, manage/3, managers/1, manipulate/1, mapping/1, masterpage/1, model/1, 
 monday/4, mq/1, much/1, my/2, needs/1, new/4, next/1, object/1, obtained/1,
other/1, page/1, paging/3, parent/1, password/2, path/1, piece/1, plans/1, 
pm/4, pmfweb/1, port/1, portal/5, print/1, products/1, project/1, 
prototype/1, pserver/1, public/1, purpose/1, put/1, r2/1, r3/1, r3/saa/1, 
r4/2, rating/2, read/4, records/2, release/3, releases/1, repository/1, 
request/1, requirements/2, requires/1, response/1, returning/1, review/1, 
sales/1, satya/8, schedules/1, search/1, see/4, senior/1, service/1, set/1, 
shield/1, siebel/5, site/1, sorting/1, specs/1, staff/1, standard/1, 
strategic/1, sufficient/2, summary/1, support/1, sync/1, test/1, text/1, 
through/3, together/1, ui/1, urls/3, validate/1, via/1, 
vision/1, web/2, welcome/1, what/4, windows/1, work/5, xml/8}
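To show how a listing of this shape comes about, here is a plain Java sketch that counts terms in a piece of text and prints them as term/count pairs in sorted order. It is an imitation under assumptions: ToyTermVector is a made-up name, and its tokenizer (lower-case, split on non-alphanumerics) is cruder than the analyzer that produced the listing above, which kept tokens such as 2.0 and r3/saa intact.

```java
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

/** Toy term frequency vector; hypothetical, for illustration only. */
class ToyTermVector {

    /** Counts how often each term occurs; TreeMap keeps the terms sorted. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    /** Renders the counts as {field: term/count, term/count, ...}. */
    static String format(String field, Map<String, Integer> freqs) {
        StringJoiner joiner = new StringJoiner(", ", "{" + field + ": ", "}");
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            joiner.add(e.getKey() + "/" + e.getValue());
        }
        return joiner.toString();
    }
}
```

Calling format("content", termFrequencies("Lucene indexes text; Lucene searches text.")) yields {content: indexes/1, lucene/2, searches/1, text/2}.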

Satya - Monday, June 27, 2005 8:13:28 PM

What on earth is a docnum?

In Lucene, the IndexReader can give you this term frequency vector if you know the document number. To get the document number for the nth hit, you need to do

int docnumber = hits.id(n);

It looks like the id is the document number.

Satya - Monday, June 27, 2005 8:20:10 PM

Lucene sample code

Take a look at some sample code that helped in generating the above

Satya - Saturday, July 02, 2005 1:57:39 PM

Consider the following url

here it is

annonymous - Saturday, July 02, 2005 2:02:51 PM

Here is its term frequency vector

{content: 1/1, 1356/8, 2/1, 216.187.231.34/3, 216.187.231.34/akc/2, 3/1, 8080/1, 
8080/akc/1, about/1, above/5, absolute/2, access/1, account/1, additional/1, 
address/3, adress/1, advantage/2, advantageous/1, akc/9, aliases/1, all/1, also/2, any/3, 
application/2, application1/1, application2/1, applications/1, approached/1, argument/4, 
arguments/5, article/1, aspect/1, aspire/1, associate/1, assumes/1, available/2, background/1, 
based/2, because/2, belongs/1, both/1, browser/6, called/5, came/1, can/9, care/1, case/2, change/3, 
class/1, client/4, clients/1, comma/1, completely/1, consider/1, create/1, creating/2, deal/1,
 decide/1, declare/1, definition/1, deliver/2, delivered/2, dependent/1, devlivering/1,
 different/1, discussed/1, display/3, displayed/1, displaynotempurl/7, displayservlet/10, divided/1, 
 do/2, document/3, doesn't/2, don't/2, done/1, downloaded/1, dyanmic/1, earlier/1, either/1, equivalent/1, 
 especially/1, etc/2, ever/1, every/1, example/2, existing/1, explanation/1, explicitly/1, far/1, file/2, 
 filename/2, first/1, focuses/1, follow/2, following/3, follows/1, from/5, ftp/1, fully/1, 
 further/2, gets/1, given/1, google/1, guess/1, handed/1, hari/7, has/2, have/5, hiding/1, host/4,
 host/application1/servlet/1, host/application2/servlet/1, house/1, how/3, html/1, http/8, i/2, id/1, 
 identified/1, identifier/6, identifies/3, identifying/1, inside/1, instruction/1, internal/1, 
 invocation/1, ip/2, its/1, java/6, just/1, keep/1, key/1, know/2, known/2, knows/1, komatineni/7, 
 let/2, lik/1, like/2, limit/1, linking/1, links/4, list/1, located/2, logic/2, logical/1, long/2, 
 look/1, lookup/1, machine/7, mail/1, maintain/1, maintains/1, make/1, mappings/1, master/2, may/1, 
 me/2, meaningful/1, means/2, methods/1, much/1, myservlet/2, name/5, names/3, need/1, needs/3, 
 new/3, next/1, notebook/1, notice/2, now/2, nuances/1, number/8, one/1, only/1, ordinary/1, 
 other/1, over/1, owner/1, owneruserid/1, page/12, pages/4, paint/1, pairs/1, parent/1, part/8, 
 particular/1, parts/1, path/2, people/1, points/1, port/10, ports/2, possible/1, practical/1, 
 prefix/2, primarily/1, process/1, properties/1, protcol/1, protocol/8, protocols/2, purpose/1, 
 really/2, refinement/1, relative/11, removed/1, report/1, reportid/1, request/1, reside/1, resource/2, 
 responsible/1, rest/2, returns/1, revisit/1, rewrite/1, rewritten/3, same/4, scheme/1, second/1, see/2, 
 sense/1, separate/1, separated/1, separator/1, server/11, servers/4, service/1, servlet/19, servlets/2, 
 several/1, short/1, shortening/1, side/1, similar/1, single/1, so/6, some/2, something/1, specific/1, 
 specified/2, specify/2, start/2, starts/1, static/1, stays/1, string/3, structure/1, sub/1, summary/2, 
 table/1, taking/1, tell/1, tells/1, them/1, think/1, through/2, two/1, type/1, understanding/2,
 universal/1, up/1, uri/3, url/36, urls/7, use/1, user/2, uses/1, using/4, usually/3, value/1, 
 very/1, waiting/1, way/2, web/31, webapp/1, webpage/2, webserver/10, webservers/2, well/1, 
 what/8, when/5, where/2, which/1, while/1, won't/1, you/12, your/3}

Satya - Saturday, July 02, 2005 3:11:34 PM

A sorting api

Sorting arrays in Java: API

Satya - Monday, July 18, 2005 6:34:54 PM

Working with boolean queries: sample code


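   import java.util.Iterator;
   import java.util.List;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.search.BooleanQuery;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.search.TermQuery;

   //Builds a query that matches documents whose "content" field
   //contains any of the given words
   public static Query getRelevanceQuerySimple(List wordList)
   {
      //Constructing a boolean query
      BooleanQuery bq = new BooleanQuery();

      //Setup reused query parameters:
      //each clause is optional (neither required nor prohibited)
      boolean bNotRequired = false;
      boolean bNotProhibited = false;

      Iterator wordItr = wordList.iterator();
      while(wordItr.hasNext())
      {
         String word = (String)wordItr.next();
         //Setup a term query on the "content" field
         TermQuery tq = new TermQuery(new Term("content",word));
         //Add it as an optional clause
         bq.add(tq,bNotRequired,bNotProhibited);
      }
      return bq;
   }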
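What the two boolean flags mean for matching can be shown with a toy model in plain Java. This is a sketch of the matching rule only, with no scoring. ToyBooleanQuery is a made-up name, and rejecting a clause that is both required and prohibited is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy boolean query over a set of document terms; hypothetical, for illustration only. */
class ToyBooleanQuery {
    private final List<String> required = new ArrayList<>();
    private final List<String> prohibited = new ArrayList<>();
    private final List<String> optional = new ArrayList<>();

    /** Adds a term clause with the same two flags as the sample code above. */
    void add(String term, boolean isRequired, boolean isProhibited) {
        if (isRequired && isProhibited) {
            throw new IllegalArgumentException("clause cannot be both required and prohibited");
        }
        if (isRequired) {
            required.add(term);
        } else if (isProhibited) {
            prohibited.add(term);
        } else {
            optional.add(term);
        }
    }

    /** Decides whether a document, given as its set of terms, matches. */
    boolean matches(Set<String> docTerms) {
        for (String term : prohibited) {
            if (docTerms.contains(term)) {
                return false; // a prohibited term rules the document out
            }
        }
        for (String term : required) {
            if (!docTerms.contains(term)) {
                return false; // every required term must be present
            }
        }
        if (!required.isEmpty()) {
            return true; // optional terms no longer decide the match
        }
        for (String term : optional) {
            if (docTerms.contains(term)) {
                return true; // with no required terms, one optional hit is enough
            }
        }
        return false;
    }
}
```

With every clause added as (false, false), as in getRelevanceQuerySimple, a document matches when it contains at least one of the words. In a real search the documents matching more of the words would generally be ranked higher.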

Satya - Tuesday, July 19, 2005 9:22:50 AM

A brief overview of Lucene's querying capabilities

A brief overview of Lucene's querying capabilities

You will see here a short introduction to many of the features of Lucene. It is a good read before creating your own handcrafted queries.

Satya - Tuesday, July 19, 2005 9:26:38 AM

See what a multiterm query and fuzzy query can do

Can these be used for relevancy search? Check the mailing list. Check the book.

Satya - Tuesday, July 19, 2005 9:44:38 AM

Finding similar documents

Look at the sandbox code

Contents

MoreLikeThis.java
SimilarityQueries.java

These seem to have been written by Doug.

Satya - Saturday, August 13, 2005 12:51:28 PM

How to enable lucene for storing term frequency vectors

When the index is built, you need to do something special if you want to keep the term frequency vector for a document.

When you add an indexed text field to the document, there is a boolean argument that you need to set to true. Example:

Field.Text(x,y,true);

See the API for the Field.Text method