example, ‘data*’ would retrieve data, database, datapoint, dataset, and so on).
Providing support for wildcard searches in IR systems involves preprocessing over-
head and is not considered worth the cost by many Web search engines today.
Retrieval models do not directly provide support for this query type.
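To make the preprocessing cost concrete, the following is a minimal sketch of how a trailing wildcard such as ‘data*’ could be answered, assuming the system keeps its vocabulary as a sorted list. The function name prefix_matches and the sample vocabulary are hypothetical and chosen only for illustration; the sketch handles trailing wildcards only, and a production system would likely use a different structure.

```python
from bisect import bisect_left

def prefix_matches(sorted_vocab, prefix):
    """Return every term in sorted_vocab that starts with prefix.

    Assumes sorted_vocab is sorted in ascending order; building and
    maintaining that sorted list is the preprocessing overhead.
    """
    # Binary search for the first term that is >= prefix.
    start = bisect_left(sorted_vocab, prefix)
    matches = []
    # All terms sharing the prefix are contiguous in sorted order.
    for term in sorted_vocab[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

# Hypothetical vocabulary for illustration.
vocab = sorted(["data", "database", "datapoint", "dataset", "date", "index", "query"])
print(prefix_matches(vocab, "data"))
# prints ['data', 'database', 'datapoint', 'dataset']
```

Each matched term would then be looked up in the index as if the user had typed it, which is one reason the query can become expensive when a short prefix matches many terms.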
27.3.6 Natural Language Queries
There are a few natural language search engines that aim to understand the struc-
ture and meaning of queries written in natural language text, generally as a question
or narrative. This is an active area of research that employs techniques like shallow
semantic parsing of text, or query reformulations based on natural language under-
standing. The system tries to formulate answers for such queries from retrieved
results. Some search systems are starting to offer natural language interfaces
that answer specific types of questions, such as definition and factoid questions,
which ask for definitions of technical terms or common facts that can be
retrieved from specialized databases. Such questions are usually easier to answer
because strong linguistic patterns give clues to specific types of sentences,
for example, ‘defined as’ or ‘refers to’. Semantic models can provide support
for this query type.
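The role of linguistic patterns in definition questions can be illustrated with a small sketch. It assumes the retrieved text has already been split into sentences and uses only the two cue phrases named above; the names DEFINITION_CUES and definition_sentences are hypothetical, and a real system would rely on a far richer set of patterns.

```python
import re

# Cue phrases taken from the text; any further cues would be assumptions.
DEFINITION_CUES = re.compile(r"\b(defined as|refers to)\b", re.IGNORECASE)

def definition_sentences(sentences):
    """Return the sentences that contain a definition cue phrase."""
    return [s for s in sentences if DEFINITION_CUES.search(s)]

# Hypothetical retrieved sentences for illustration.
sentences = [
    "A stopword is defined as a very common word.",
    "The index was rebuilt last night.",
    "Stemming refers to reducing a word to its root.",
]
for s in definition_sentences(sentences):
    print(s)
```

Only the first and third sentences are printed, because they are the only ones containing a cue phrase.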
27.4 Text Preprocessing
In this section we review the commonly used text preprocessing techniques that are
part of the text processing task in Figure 27.1.
27.4.1 Stopword Removal
Stopwords are very commonly used words in a language that play a major role in
the formation of a sentence but which seldom contribute to the meaning of that
sentence. Words that are expected to occur in 80 percent or more of the documents
in a collection are typically referred to as stopwords, and they are potentially
useless for retrieval. Because these words are so common and serve a mainly
grammatical function, they contribute little to the relevance of a document for
a query. Examples
include words such as the, of, to, a, and, in, said, for, that, was, on, he, is, with, at, by,
and it. These words are listed here in decreasing order of frequency of occurrence
in a large corpus of documents called AP89.¹⁷ The first six of these words account
for 20 percent of all words in the listing, and the most frequent 50 words account for
40 percent of all text.
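The 80 percent criterion above can be sketched as a document-frequency computation. This is a minimal illustration, assuming documents are already tokenized into lowercase words; the function names find_stopwords and remove_stopwords, the threshold parameter, and the tiny sample collection are hypothetical. In practice many systems use a fixed, hand-curated stopword list instead.

```python
from collections import Counter

def find_stopwords(documents, threshold=0.8):
    """Return terms that occur in at least `threshold` of the documents.

    documents: list of token lists, one list per document.
    """
    doc_freq = Counter()
    for tokens in documents:
        # Count each term once per document (document frequency).
        doc_freq.update(set(tokens))
    n_docs = len(documents)
    return {term for term, df in doc_freq.items() if df / n_docs >= threshold}

def remove_stopwords(tokens, stopwords):
    """Drop stopwords from a token list; the same filter applies to queries."""
    return [t for t in tokens if t not in stopwords]

# Hypothetical five-document collection for illustration.
docs = [
    "the index of the database".split(),
    "the query was sent to the server".split(),
    "the size of the index".split(),
    "a query on the database".split(),
    "the server is in the rack".split(),
]
stopwords = find_stopwords(docs)
print(stopwords)
# prints {'the'}
print(remove_stopwords("the size of the index".split(), stopwords))
# prints ['size', 'of', 'index']
```

In this toy collection only the occurs in at least 80 percent of the documents; of appears in two of the five documents and is therefore kept, which shows how sensitive the criterion is to the size and nature of the collection.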
Removal of stopwords from a document must be performed before indexing.
Articles, prepositions, conjunctions, and some pronouns are generally classified as
stopwords. Queries must also be preprocessed for stopword removal before the
actual retrieval process. Removal of stopwords results in elimination of possible
spurious indexes, thereby reducing the size of an index structure by about 40
¹⁷ For details, see Croft et al. (2009), pages 75–90.