Method of annotated suffix tree for scoring the extent of presence of a string in text

Boris Mirkin; E. Chernyak; O. Chugunova

Boris Mirkin – Professor, Department of Data Analysis and Artificial Intelligence, School of Applied Mathematics and Information Science, Faculty of Business Informatics, National Research University Higher School of Economics.
Address: 20, Myasnitskaya str., Moscow, 101000, Russian Federation.
E-mail: bmirkin@hse.ru

Ekaterina Chernyak – Student of “Mathematical modeling” MSc Program, School of Applied Mathematics and Information Science, Faculty of Business Informatics, National Research University Higher School of Economics.
Address: 20, Myasnitskaya str., Moscow, 101000, Russian Federation.
E-mail: ktr.che@gmail.com

Olga Chugunova – Student of “Mathematical modeling” MSc Program, School of Applied Mathematics and Information Science, Faculty of Business Informatics, National Research University Higher School of Economics.
Address: 20, Myasnitskaya str., Moscow, 101000, Russian Federation.
E-mail: olya.chug@gmail.com

There are two main approaches to the analysis of unstructured text: the first relies on natural language models, and the second on statistical characteristics of text segments viewed as character strings. The advantage of the second approach is that it is not tied to any particular language, its grammar, or its semantics. An important tool within this approach is an aggregated representation of text as a suffix tree annotated with the occurrence frequencies of text segments. This tool has been used successfully for clustering and other text analysis tasks.

The goal of this article is to modify the method so as to speed up and streamline its computations and to apply it in other areas of semantic text analysis.

The article considers two types of problems in text analysis: (a) the relation between a body of texts and a set of key phrases (word groups), and (b) the relation between a body of texts and a taxonomy of the application domain.

Both problems are analyzed using a so-called PS-table, a key-phrase-to-publication matrix built over a set of publications (texts) and key phrases specified by an expert. Each entry of the PS-table characterizes the mutual relevance of a text and a key phrase. The entries are computed from the suffix tree as standardized aggregates of average conditional character probabilities.
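The following minimal Python sketch illustrates how such a table could be computed. It is not the authors' implementation: the names `Node`, `build_ast`, `score`, and `ps_table` are hypothetical, and the scoring rule (average conditional character probability along each matched suffix, then averaged over all suffixes of the key phrase) is one common formulation assumed here, since the abstract does not spell out the formula. The standardization step is omitted, and the tree is built naively in quadratic time, so the sketch suits short texts only.

```python
class Node:
    """One character in the tree, annotated with an occurrence count."""

    def __init__(self):
        self.children = {}
        self.freq = 0


def build_ast(fragments):
    """Insert every suffix of every fragment, counting node visits.

    Assumption: a naive construction; the root count equals the
    total number of inserted suffixes.
    """
    root = Node()
    for frag in fragments:
        for start in range(len(frag)):
            node = root
            node.freq += 1
            for ch in frag[start:]:
                node = node.children.setdefault(ch, Node())
                node.freq += 1
    return root


def score(root, phrase):
    """Assumed scoring rule: for each suffix of the phrase, follow the
    longest matching path from the root, average freq(child)/freq(parent)
    along it, then average these values over all suffixes."""
    if not phrase:
        return 0.0
    total = 0.0
    for start in range(len(phrase)):
        node, probs = root, []
        for ch in phrase[start:]:
            child = node.children.get(ch)
            if child is None:
                break
            probs.append(child.freq / node.freq)
            node = child
        if probs:
            total += sum(probs) / len(probs)
    return total / len(phrase)


def ps_table(texts, phrases):
    """Rows are key phrases, columns are texts (unstandardized scores)."""
    trees = [build_ast([t.lower()]) for t in texts]
    return [[score(tree, p.lower()) for tree in trees] for p in phrases]


if __name__ == "__main__":
    # Toy data invented for illustration only.
    texts = ["bank loan rates rise", "football match tonight"]
    for row in ps_table(texts, ["loan", "match"]):
        print([round(x, 3) for x in row])
```

In this sketch a key phrase scores higher against a text that contains it, or long parts of it, than against a text that shares only isolated characters with it. Because matching is done character by character, no language-specific preprocessing is needed.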

To address the first problem, PS-tables were used for two purposes: to analyze the structure of the set of texts and to analyze the set of key phrases. For the first purpose, a modified conceptual clustering method was used, which produces a meaningful and easily interpretable taxonomy tree of publications described in terms of key phrases. For the second purpose, a graph of associations between key phrases was built, which gives a generalized description of the whole set of publications. The approach is illustrated with a series of newspaper publications and key phrases characterizing business processes in Russia after the economic downturn of 2008.

To address the second problem, the authors developed a method for refining and extending a taxonomy based on an analysis of the structure and texts of Russian Wikipedia articles. The method is illustrated on a taxonomy of the mathematical field "probability theory and mathematical statistics".

Boris Mirkin¹,², E. Chernyak, O. Chugunova
¹ National Research University Higher School of Economics, 20 Myasnitskaya Str., Moscow, 101000, Russian Federation
² National Research University Higher School of Economics, 30 Sormovskoye Highway, Nizhny Novgorod, 603014, Russian Federation