Detecting semantic duplicates in short news items

T2 -
IS -
KW - short text corpora
KW - text clustering
KW - near-duplicates
KW - semantic vector space
KW - neural network
AB - Sergei A. Fomin - Operator, Laboratory of Research Center, Civil Defense Academy EMERCOM of Russia. Address: Novogorsk district, Khimki, Moscow region, 141435, Russian Federation. E-mail: sergio-dna@yandex.ru
Roman L. Belousov - Research Associate, Research Center, Civil Defense Academy EMERCOM of Russia. Address: Novogorsk district, Khimki, Moscow region, 141435, Russian Federation. E-mail: romabel-87@mail.ru

In this paper, we examine the task of detecting text messages that carry similar meaning or relate to the same event. A notable feature of the task is that the messages are short, about 40 words per message on average. To solve it, we design an algorithm based on the vector space model, in which every text is mapped to a point in a high-dimensional space. Texts are converted to vectors using the TF-IDF measure. Even for small corpora of about 800 messages, the dimension of the vector space can exceed 2,000 components, and on average it is about 8,500 components. To reduce the dimensionality, we use principal component analysis, which lets us keep about 3 percent of the original components. In this reduced vector space, we use agglomerative hierarchical clustering in accordance with the Lance-Williams algorithm. Clusters are merged using the closest linkage criterion. We stop merging clusters when the distance between the two nearest clusters exceeds a threshold value r that is given to the algorithm as a parameter. We conduct an experiment on a dataset of 135,000 news messages parsed from news aggregator feeds.
During the experiment, we build a regression model for the algorithm parameter r that predicts a value of this parameter giving good clustering results. The designed algorithm scores high on quality metrics, indicating a sufficient ability both to classify a pair of messages as duplicates or not and to find whole groups of duplicate messages.
AU - Sergei Fomin
AU - Roman Belousov
UR - https://bijournal.hse.ru/en/2017--2 (40)/208606327.html
PY - 2017
SP - 47-56
VL -
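
The pipeline described in the abstract (TF-IDF vectors, principal component reduction, threshold-stopped agglomerative clustering) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the smoothed IDF formula, the Euclidean distance, and the reading of "closest linkage" as single (nearest-neighbor) linkage are all assumptions made for illustration. Under that reading, merging until the nearest clusters are more than r apart yields the same groups as linking every pair of points at distance at most r, so the sketch uses that shortcut rather than the Lance-Williams update itself.

```python
import math
import re
from collections import Counter

import numpy as np


def tfidf_matrix(texts):
    """Return a dense (n_texts, n_terms) TF-IDF matrix with unit-length rows."""
    docs = [re.findall(r"\w+", t.lower()) for t in texts]
    vocab = sorted({w for d in docs for w in d})
    index = {w: j for j, w in enumerate(vocab)}
    doc_freq = Counter(w for d in docs for w in set(d))
    n = len(docs)
    X = np.zeros((n, len(vocab)))
    for i, d in enumerate(docs):
        for w, count in Counter(d).items():
            # Smoothed IDF: an assumption, the abstract gives no exact formula.
            idf = math.log((1 + n) / (1 + doc_freq[w])) + 1
            X[i, index[w]] = (count / len(d)) * idf
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1, norms)


def pca_reduce(X, n_components):
    """Project rows of X onto the leading principal components (via SVD)."""
    centered = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, len(S))
    return U[:, :k] * S[:k]


def threshold_single_linkage(Z, r):
    """Group rows of Z: single-linkage merging stopped at distance r.

    Equivalent to connected components of the graph that links every
    pair of points whose Euclidean distance is at most r.
    """
    n = len(Z)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i, j] <= r:
                parent[find(i)] = find(j)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]


def find_duplicate_groups(texts, r, n_components):
    """Hypothetical end-to-end helper: one cluster label per input text."""
    Z = pca_reduce(tfidf_matrix(texts), n_components)
    return threshold_single_linkage(Z, r)
```

Calling `find_duplicate_groups(messages, r=0.5, n_components=50)` on a list of strings returns one integer label per message, and messages sharing a label are treated as duplicates. The values 0.5 and 50 are placeholders only. The abstract reports keeping about 3 percent of the components and predicting r with a regression model, neither of which is reproduced here.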