SEO Goodies - June 29, 2005 6:36 pm

Most people are familiar with the traditional keyword search used by today’s popular search engines.

In the traditional keyword approach, a document collection is scanned with an accountant’s mentality: either a document contains the typed keyword or it doesn’t. There is no middle ground.

The result set is built by looking through each document for the typed keywords and phrases, discarding any documents that don’t contain them, and ordering the rest with some ranking algorithm. Each document that contains the keyword stands alone in judgment before the search algorithm - there is no interdependence of any kind between documents, which are evaluated solely on their contents.
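To make the "accountant mentality" concrete, here is a minimal sketch of that kind of search. The function name keyword_search, the sample documents, and the ranking rule (counting occurrences of the query words) are all assumptions made for illustration; real engines use far more elaborate ranking.

```python
from collections import Counter

def keyword_search(docs, query):
    """Return ids of documents containing every query word, best match first."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        # Accountant mentality: a document either has the keyword or it doesn't.
        if all(counts[t] > 0 for t in terms):
            # Stand-in ranking (an assumption): total occurrences of the query words.
            scored.append((sum(counts[t] for t in terms), doc_id))
    # Highest score first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

docs = {
    "a": "cheap laptop deals and laptop reviews",
    "b": "lightweight notebook computers for travel",
    "c": "choosing a laptop bag",
}
print(keyword_search(docs, "laptop"))  # ['a', 'c']
```

Document "b" is about the same kind of machine, but because it never uses the word "laptop" it is dropped without a second look.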

Traditional keyword search is popular, but it is far from delivering the desired results, as anyone who has used a Web search engine can vouch. One important aspect of the problem is relevancy: on average, 50% of the information retrieved will be irrelevant.

The primary reason relevant information gets missed is that there are surprisingly many ways to describe an idea or concept. If a document’s author uses one word and a searcher another, relevant material will be missed. A simple query about “laptop” computers, for example, will ignore articles about “portable” or “lightweight” or “notebook” or “palmtop” or “ThinkPad” computers. Searchers and authors alike find it very difficult to anticipate the many ways in which the same idea might be described.

To overcome these limitations, a concept-based retrieval method is thought to be the answer. It addresses many of the problems in today’s popular word-based retrieval systems. The method is called Latent Semantic Indexing (LSI).

LSI is fully automatic and widely applicable, and has been reported to be 30% more effective at finding and ranking relevant items than comparable word-matching methods.

It adds an important step to the document indexing process of word-based retrieval systems. In addition to recording which keywords a document contains, it looks at the document collection as a whole to determine which other documents contain some of those same words. It then assigns similarity values to the words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant.

When a user searches an LSI-indexed database, the search engine looks at the similarity values it has calculated for every content word, and returns the documents that it thinks best fit the typed query. Because two documents may be semantically very close even if they do not share a particular keyword, LSI does not require an exact match to return useful results. Where a plain keyword search will fail if there is no exact match, LSI will often return relevant documents that don’t contain the typed keyword at all.
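The effect can be sketched in a few lines of Python. This is not real LSI, which derives its similarity values from a singular value decomposition of the term-document matrix. The sketch only imitates the behaviour described above, using plain cosine similarity between word-count vectors to pull in documents that share many words with an exact match. The names concept_search and threshold, and the sample documents, are made up for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn a document into a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def concept_search(docs, query, threshold=0.3):
    """Exact keyword matches, plus documents that share many words with them."""
    vectors = {doc_id: vectorize(text) for doc_id, text in docs.items()}
    terms = query.lower().split()
    exact = [d for d, v in vectors.items() if all(t in v for t in terms)]
    scores = {d: 1.0 for d in exact}
    for d, v in vectors.items():
        if d in scores:
            continue
        # A document with no exact match still qualifies if it is close
        # enough to some document that does match.
        best = max((cosine(v, vectors[e]) for e in exact), default=0.0)
        if best >= threshold:
            scores[d] = best
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    "a": "laptop computers with long battery life",
    "b": "notebook computers with long battery life",
    "c": "recipes for apple pie",
}
print(concept_search(docs, "laptop"))  # ['a', 'b']
```

Document "b" never mentions "laptop", yet it comes back because it shares most of its words with "a". The unrelated recipe page stays out.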

This simple method of recognizing associations between keywords reflects more or less how a human being might classify a document collection after scanning its content. Although the LSI algorithm doesn’t understand anything about what the words mean (it is, after all, just human-written code), the patterns it recognizes bring it closer to exhibiting artificial intelligence.

Google's Corner - June 18, 2005 12:44 pm

Those of you who are obsessed with finding blogs, or who are always on the lookout for every bit of information about people working at Google or Yahoo!, will find the list below very informative and priceless. :)

I’ve tried to track down a few of the many, many blogs and personal websites run by the lucky souls working at Google and Yahoo!.

Below is a list of blogs and personal websites maintained by employees of Google and Yahoo!. They are the brains behind the endless list of services that Google and Yahoo! keep rolling out in their quest to stay ahead in the content search race.

Google’s List

1) http://www.shellen.com (My personal favorite!)
Jason Shellen (he was working at Pyra Labs, makers of Blogger and Blog*Spot, when Google bought the company)
Google/Blogger
Program Manager

2) http://www.blogger.com/profile/1958068
Doe Mountain (Personal website www.dmountain.com)
Google

3) http://douweosinga.com/blog
Douwe Osinga
Google’s European Engineering Office
Search Engineer

4) www.kimbalina.com
Kimbalina
Google/Blogger

5) http://webcom.com/haahr
Paul Haahr
Google
Software Engineer

6) www.bizstone.com
Biz Stone
Google/Blogger

Yahoo!

1) http://jeremy.zawodny.com/blog/
Jeremy Zawodny
Yahoo
Engineer - Yahoo Search

2) http://homepage.mac.com/naveenjamal/blog/
Naveen Jamal
Yahoo

3) http://www.dronamraju.com/journal/index.html
Ravi Dronamraju
Yahoo

4) http://www.unitedheroes.net/blogs/jr/
JR Conlin
Yahoo

5) http://www.radwin.org/michael/blog/
Michael J. Radwin
Yahoo
Software Engineer

6) http://eric.burke.name/
Eric Burke
Yahoo

Feel free to add to this list. Thanks in advance.

I’ll be adding more as and when I find ’em. :)

Latest News - June 4, 2005 1:39 pm

Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine’s results.

Google recently registered TrustRank as a trademark.

A technical report titled Combating Web Spam with TrustRank, from Stanford University, delves deeper into what exactly TrustRank is. The report was written by Zoltán Gyöngyi and Hector Garcia-Molina of Stanford University and Jan Pedersen of Yahoo! Inc.

The technical report is also available in PDF format.

According to the report, to combat Web spam (hyperlinked pages on the World Wide Web that are created with the intention of misleading search engines), the TrustRank algorithm evaluates every page against a set of seed pages.

Seed pages are good pages, hand-picked by human experts, that are believed to be reputable, meaning they are not spam. Once reputable seed pages are identified, they are used as a basis for discovering other good pages on the WWW.

In selecting seed pages, preference is given to pages from which many other pages can be reached. In other words, seed pages are selected based on their outbound links.

The report calls this inverse PageRank: where PageRank scores a page by its inbound links, inverse PageRank runs the same computation with every link reversed, so a page scores well for its outbound links.
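Here is a rough sketch of how seed candidates could be picked this way: compute an ordinary PageRank, but on the link graph with every link reversed, so that pages with good outbound reach float to the top. The function names, the damping value of 0.85, the iteration count, and the tiny sample graph are all assumptions for illustration. In the report the top candidates are then checked by a human before being trusted as seeds.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank

def reverse(links):
    """Flip every link so that outbound links become inbound links."""
    reversed_links = {}
    for source, targets in links.items():
        reversed_links.setdefault(source, [])
        for t in targets:
            reversed_links.setdefault(t, []).append(source)
    return reversed_links

def pick_seed_candidates(links, k):
    """Top-k pages by inverse PageRank (PageRank on the reversed graph)."""
    scores = pagerank(reverse(links))
    return sorted(scores, key=lambda p: (-scores[p], p))[:k]

links = {"hub": ["a", "b", "c"], "a": [], "b": [], "c": ["a"]}
print(pick_seed_candidates(links, 1))  # ['hub']
```

On the same graph, ordinary PageRank would favour "a", the page with the most inbound links, while the inverse version favours "hub", the page from which everything else can be reached.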

Because the pages returned for a search would be evaluated against the seed pages, the results should be much more relevant and contain far fewer spam pages.
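One way to picture "evaluating pages against seed pages" is as trust flowing outward from the seeds along links, in the manner of a PageRank that always restarts from the seed set. The sketch below follows that idea under stated assumptions: the function name trust_scores, the damping value, the iteration count, and the sample graph are all invented for illustration, and the real algorithm has details this leaves out.

```python
def trust_scores(links, seeds, damping=0.85, iterations=50):
    """Spread trust from hand-picked seed pages along outgoing links."""
    pages = set(links) | {t for ts in links.values() for t in ts} | set(seeds)
    # All starting trust sits on the seed pages, split evenly.
    start = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(start)
    for _ in range(iterations):
        # Each round, a fixed fraction of trust is re-injected at the seeds.
        new = {p: (1.0 - damping) * start[p] for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its trust equally among the pages it links to.
                share = damping * trust[p] / len(targets)
                for t in targets:
                    new[t] += share
            # Pages with no outgoing links simply pass nothing on.
        trust = new
    return trust

links = {
    "seed": ["good1", "good2"],
    "good1": ["good2"],
    "spam1": ["spam2"],
    "spam2": ["spam1", "good1"],
}
scores = trust_scores(links, seeds={"seed"})
for page in sorted(scores, key=lambda p: (-scores[p], p)):
    print(page, round(scores[page], 4))
```

The spam pages link to each other and even to a good page, but since no trusted page links to them, no trust ever reaches them and their score stays at zero.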

Search engine experts are interpreting this as an important move by Google to combat Web Spam and in turn improve the search results.

Latest News - June 3, 2005 7:37 am

I recently read an article on ZDNet comparing the top 9 search engines. It also includes a detailed review of each of them.

For those who don’t have time to read all the details, here is a one-page comparison chart: Beyond Google and Yahoo!

To read the individual search engine reviews, click below:

A9
AltaVista
AOL Search
Ask Jeeves
Google
LookSmart
Lycos
MSN Search
Yahoo Search

Though Google and Yahoo are still the leaders in terms of overall search experience, we found that almost every search engine has something unique to offer its visitors.

For instance AOL Search offers real-time search suggestions while you type in your query; Ask Jeeves on the other hand offers a cool thumbnail preview of Web pages on its search results; and LookSmart has a unique periodical search feature, plus a one-of-a-kind page archiving tool.

The ratings, out of 10, are: Google (8), Yahoo (7.7), A9 (6.7), AOL Search (6.3), Ask Jeeves (6.3), MSN Search (6), AltaVista (5.7), Lycos (5.3), LookSmart (5).

Note that AOL Search and Ask Jeeves are tied for fourth place.

The comparison chart brings out interesting information that would otherwise be hard to spot.