Most people are familiar with the traditional keyword search methodology used by today’s popular search engines.
In the traditional keyword search approach, a document collection is scanned with an accountant’s mentality: either a document contains the typed keyword or it doesn’t. There is no middle ground.
The result set is created by looking through each document for the typed keywords and phrases, ignoring any documents that don’t contain them, and ordering the remaining documents with some ranking algorithm. Each document that contains the keyword stands alone in judgment before the search algorithm. There is no interdependence of any kind between documents; each is evaluated solely on its own contents.
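The all-or-nothing behaviour described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the function names, the sample documents, and the ranking rule (counting keyword occurrences) are assumptions standing in for "some ranking algorithm".

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def keyword_search(documents, query):
    """Return indices of documents that contain every query keyword.

    Ranking by total keyword occurrences is an assumed, deliberately
    simple stand-in for a real ranking algorithm.
    """
    terms = tokenize(query)
    scored = []
    for index, text in enumerate(documents):
        tokens = tokenize(text)
        # Accountant mentality: the document either has the keywords or not.
        if all(term in tokens for term in terms):
            score = sum(tokens.count(term) for term in terms)
            scored.append((score, index))
    # Highest score first; ties are broken by original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored]

# Hypothetical mini-collection used only for this illustration.
documents = [
    "This laptop review compares one laptop with another laptop.",
    "A lightweight notebook computer for travel.",
    "Choosing a laptop for students.",
]

print(keyword_search(documents, "laptop"))
```

Each document is judged purely on its own tokens. The second document is about the same kind of machine, yet it never appears in the results for "laptop" because it does not contain that exact word.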
Traditional keyword search approaches of this type are popular, but they are far from delivering the desired results, as anyone who has used a Web search engine can vouch. One important aspect of the problem is relevancy: on average, 50% of the information retrieved will be irrelevant.
The primary reason relevant information is missed is that there are surprisingly many different ways to describe an idea or concept. If a document author uses one word and a searcher uses another, relevant material will be missed. For example, a simple query about “laptop” computers will ignore articles about “portable”, “lightweight”, “notebook”, “palmtop” or “ThinkPad” computers. Searchers and authors alike find it very difficult to anticipate the many ways in which the same idea might be described.
To overcome these limitations of traditional keyword search, a new concept-based retrieval method is thought to be the answer. This method overcomes many of the problems in today’s popular word-based retrieval systems. It is called Latent Semantic Indexing (LSI).
LSI is fully automatic and widely applicable, and has been shown to be 30% more effective at finding and ranking relevant items than comparable word-matching methods.
LSI adds an important step to the document indexing process of word-based retrieval systems. In addition to recording which keywords a document contains, it examines the document collection as a whole to determine which other documents contain some of those same words. It then assigns similarity values to the words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant.
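The intuition that shared words imply semantic closeness can be illustrated with a plain word-overlap measure. The sketch below is not LSI itself, which goes on to factor the whole term-document matrix. It only shows the "many words in common" idea, using cosine similarity between word-count vectors. The helper names and the three sample documents are assumptions made for the example.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors (0 = nothing shared)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical documents used only for this illustration.
docs = [
    "laptop battery screen",
    "notebook battery screen",
    "river fishing boat",
]
counts = [word_counts(text) for text in docs]

close = cosine_similarity(counts[0], counts[1])    # two of three words shared
distant = cosine_similarity(counts[0], counts[2])  # no words shared
print(f"laptop vs notebook document: {close:.3f}")
print(f"laptop vs river document:    {distant:.3f}")
```

The first two documents share "battery" and "screen", so they score as close even though one says "laptop" and the other "notebook". The third shares nothing with them and scores as distant.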
When a user searches an LSI-indexed database, the search engine looks at the similarity values it has calculated for every content word and returns the documents that it thinks best fit the typed query. Because two documents may be semantically very close even if they do not share a particular keyword, LSI does not require an exact match to return useful results. Where a plain keyword search will fail if there is no exact match, LSI will often return relevant documents that don’t contain the typed keyword at all.
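A minimal sketch of such a query is given below. The text above does not spell out the mathematics, so this example assumes the usual formulation of LSI: a truncated singular value decomposition (SVD) of the term-document count matrix, computed here with NumPy. The four-document corpus, the choice of k = 2 concepts, and the name lsi_scores are all illustrative assumptions.

```python
import numpy as np

# Hypothetical mini-collection; only the first document contains "laptop".
documents = [
    "laptop battery screen",
    "notebook battery screen",
    "river fishing boat",
    "fishing boat",
]

# Term-document count matrix: one row per term, one column per document.
vocabulary = sorted({word for text in documents for word in text.split()})
row_of = {term: row for row, term in enumerate(vocabulary)}
counts = np.zeros((len(vocabulary), len(documents)))
for column, text in enumerate(documents):
    for word in text.split():
        counts[row_of[word], column] += 1

# Truncated SVD: keep only the k strongest "concepts" (k = 2 is an assumption).
k = 2
U, singular_values, Vt = np.linalg.svd(counts, full_matrices=False)
U_k = U[:, :k]
# Coordinates of each document in concept space, one row per document.
document_vectors = (np.diag(singular_values[:k]) @ Vt[:k, :]).T

def lsi_scores(query):
    """Cosine similarity between the query and every document in concept space."""
    query_counts = np.zeros(len(vocabulary))
    for word in query.lower().split():
        if word in row_of:  # words outside the vocabulary are ignored
            query_counts[row_of[word]] += 1
    # Project the query into concept space as if it were a tiny document.
    query_vector = U_k.T @ query_counts
    norms = np.linalg.norm(document_vectors, axis=1) * np.linalg.norm(query_vector)
    safe_norms = np.where(norms == 0, 1.0, norms)  # avoid division by zero
    return (document_vectors @ query_vector) / safe_norms

scores = lsi_scores("laptop")
for text, score in zip(documents, scores):
    print(f"{score:6.3f}  {text}")
```

In this toy collection the "notebook" document scores about as high as the "laptop" document for the query "laptop", although it never contains that word, because both documents load on the same concept through "battery" and "screen". The two fishing documents score near zero.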
This simple method of recognizing associations between keywords reflects, more or less, how a human being might classify a document collection after scanning its content. Although the LSI algorithm doesn’t understand anything about what the words mean (it is only human-written code), the patterns it recognizes bring it closer to exhibiting artificial intelligence.