Annotated suffix tree as a way of text representation for information retrieval in text collections

Dmitry Frolov

Dmitry S. Frolov - Post-graduate student, Department of Data Analysis and Artificial Intelligence, National Research University Higher School of Economics.
Address: 20, Myasnitskaya Street, Moscow, 101000, Russian Federation.
E-mail: dfrolov@hse.ru, dmitsf@gmail.com

A method for information retrieval based on annotated suffix trees (AST) is presented. The method is based on a string-to-document relevance score calculated using AST as well as fragment reverse indexing for improving performance. We developed a search engine based on the method. This engine is compared with some other popular text aggregating techniques: probabilistic latent semantic indexing (PLSA) and latent Dirichlet allocation (LDA).
We used real data for computation experiments: an online store’s xml-catalogs and collections of web pages (both in Russian) and a real user’s queries from the Yandex.Wordstat service. As quality metrics, we used point quality estimations and graphical representations. Our AST-based method generally leads to results that are similar to those obtained by the other methods. However, in the case of inaccurate queries, AST-based results are superior. The speed of the AST-based method is slightly worse than the speed of the PLSA/LDA-based methods. Due to the observed correlation between the average query performing time and the string lengths at the AST construction phase, one can improve the performance of the algorithm by dividing the texts into smaller fragments at the preprocessing stage. However, the quality of search may suffer if the fragments are too short. Therefore, the applicability of annotated suffix tree techniques for text retrieval problems is demonstrated. Moreover, the AST-based method has significant advantages in the case of fuzzy search.

Dmitry Frolov1 1 National Research University Higher School of Economics, 20 Myasnitskaya Str., Moscow, 101000, Russian Federation

Annotated suffix tree as a way of text representation for information retrieval in text collections

Dmitry Frolov¹
¹ National Research University Higher School of Economics, 20 Myasnitskaya Str., Moscow, 101000, Russian Federation