A user provides one or more words as input to an IR system. Given the huge number of words in any language, not all words are equally useful in answering a user query. To reduce the work done during retrieval, the user-provided words are screened and refined before matching. This process is called preprocessing, and it consists of the following activities.
Tokenisation
Tokenisation identifies the different elements in the user query and separates them into tokens. Tokens are the simplest identifiable language elements, such as a word or a delimiter. This is done because the tokens thus generated are used for matching against the index terms in the data source.
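As a rough illustration of the idea, the sketch below splits a query into word tokens and delimiter tokens using a regular expression. The function name tokenise and the choice of pattern are assumptions made for this example; real IR systems often use more elaborate, language-aware tokenisers.

```python
import re

# Hypothetical helper: a token is either a run of word characters
# or a single non-space delimiter such as "?" or ",".
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenise(query):
    """Split a query string into a list of word and delimiter tokens."""
    return TOKEN_PATTERN.findall(query)

print(tokenise(" who is the president of India?"))
# ['who', 'is', 'the', 'president', 'of', 'India', '?']
```

Whitespace is discarded here, while punctuation is kept as separate tokens so that a later step can decide whether to drop it.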
Stop word removal
Stop words are words that appear in a query but are expected to add little value to the retrieval process. For instance, in a query such as "who is the president of India?", the most important words are "president" and "India". The other words, such as "who", "is", "the" and "of", do not significantly help retrieval, so they are removed from the list of words passed on for retrieval. These words are called stop words, and the process of eliminating them from the search is called "stop word removal".
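A minimal sketch of stop word removal, applied to the example query, might look as follows. The STOP_WORDS set is a small hypothetical list chosen for this illustration; practical systems typically rely on a much longer, curated list for the language in question.

```python
# Hypothetical, deliberately short stop word list for illustration only.
STOP_WORDS = {"who", "is", "the", "of", "a", "an", "in", "to", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words (comparison ignores case)."""
    return [token for token in tokens if token.lower() not in stop_words]

query_tokens = ["who", "is", "the", "president", "of", "India"]
print(remove_stop_words(query_tokens))
# ['president', 'India']
```

The comparison is done on the lower-cased token so that "The" and "the" are treated alike, while the surviving tokens keep their original spelling.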
Stemming
Another category of words takes different forms that essentially mean the same thing. Using all the forms of such terms in retrieval does not significantly improve search results, yet their presence adds to the workload. To reduce this work, all such words are reduced to their root form. For instance, swimmer, swimming and swim are reduced to one word: swim. This process is called stemming.
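The swim example can be reproduced with a toy suffix-stripping routine like the one below. This is only a sketch under stated assumptions: the suffix list, the minimum stem length and the name simple_stem are all hypothetical, and it is not the Porter algorithm or any other published stemmer. Production stemmers apply far larger rule sets and may treat some of these word forms differently.

```python
# Hypothetical, deliberately tiny suffix list; checked in this order.
SUFFIXES = ("ing", "er", "s")

def simple_stem(word, min_stem_len=3):
    """Strip one known suffix and undo consonant doubling, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            # Collapse a doubled final consonant, e.g. "swimm" -> "swim".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    # No suffix matched, so the word is treated as already being a root.
    return word

print([simple_stem(w) for w in ["swimmer", "swimming", "swim"]])
# ['swim', 'swim', 'swim']
```

Because all three forms map to the same string, only one term needs to be matched against the index instead of three.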
