Latent Semantic Indexing is used by search engines for information retrieval purposes.
The underlying technique is known as latent semantic analysis (LSA), a method from natural language processing. Search engines use it to pick out the most relevant results for a particular keyword, taking into account synonymy (different words with the same meaning) and polysemy (one word with several meanings). Let us have a look at the main steps of LSA.
1. The whole collection of text is transformed into a two-dimensional matrix. The terms are represented by columns and the documents (which can be as small as a single sentence) are represented by rows. This type of matrix is known as an occurrence matrix or document-term matrix.
2. The weight of a particular term or keyword is defined by how often it occurs in a document and in the collection (corpus) as a whole. A term that occurs frequently in a document but rarely in the rest of the corpus is more important than a term that occurs constantly throughout the corpus. This type of weighting is derived using the tf-idf (term frequency-inverse document frequency) method.
3. The original matrix is then decomposed, typically by singular value decomposition (SVD), into term-concept and document-concept matrices. This way we can compare terms and documents indirectly, using concepts as the mediator. The concept space opens the door to another dimension of data mining: documents can be compared in conceptual space, which supports tasks such as document clustering and classification. Cross-language comparisons become possible, and relations arising from synonymy and polysemy are revealed.
4. The original occurrence matrix is often too large and noisy, which makes appropriate results hard to find. To reduce these problems, rank lowering is performed, also known as low-rank approximation. During approximation some dimensions are merged together to lower the rank. This merging also makes the calculations easier.
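The four steps above can be sketched in a few lines of Python with NumPy. This is only a minimal illustration, not how any search engine actually implements LSA. The toy corpus, the variable names, the plain log(N/df) form of idf, and the choice of k = 2 concepts are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy corpus; each string is treated as one document.
docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "banana fruit smoothie",
    "fruit banana bread",
]

# Step 1: document-term matrix (rows = documents, columns = terms).
vocab = sorted({w for d in docs for w in d.split()})
index = {term: j for j, term in enumerate(vocab)}
counts = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for w in d.split():
        counts[i, index[w]] += 1

# Step 2: tf-idf weighting (one common variant: tf * log(N / df)).
df = (counts > 0).sum(axis=0)       # number of documents containing each term
idf = np.log(len(docs) / df)
tfidf = counts * idf

# Step 3: SVD splits the matrix into document-concept and term-concept parts.
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)

# Step 4: rank lowering, keeping only the k strongest concepts.
k = 2
doc_concepts = U[:, :k] * s[:k]     # one row per document
term_concepts = Vt[:k, :].T         # one row per term
low_rank = doc_concepts @ Vt[:k, :] # rank-k approximation of tfidf

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare documents and terms in concept space.
print(cosine(doc_concepts[0], doc_concepts[1]))  # two car-related documents
print(cosine(doc_concepts[0], doc_concepts[3]))  # car vs. fruit document
print(cosine(term_concepts[index["car"]], term_concepts[index["automobile"]]))
```

On this toy corpus the two car-related documents end up close together in concept space and far from the fruit documents, and the terms "car" and "automobile" are pulled closer together than they are in the raw tf-idf matrix. With real data the number of concepts k has to be tuned, and a value that is too small merges distinctions that matter.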
Recently, Google produced some erratic results, reportedly because of this approximation. When synonymous and polysemous terms were merged to lessen the computational burden, the results became erratic because some important words were eliminated. This happens because the mathematical formulae adopted by the Google algorithm do not always match the way natural language actually behaves.