A three-factor diagnostic assessment of the quality of machine-generated abstracts of scientific texts in Russian

Publication

Authors: Vadim Nikolaevich Gokarev OA

Year: 2026

Journal: Программные системы и вычислительные методы (Software Systems and Computational Methods)

DOI: 10.7256/2454-0714.2026.1.78734

Abstract

The subject of the study is a procedure for automatic diagnostic assessment of the quality of machine-generated abstracts of Russian-language scientific texts, based on a joint analysis of three factors: lexical similarity, semantic proximity, and degree of compression. The object of the study is machine-generated abstracts of Russian-language scientific articles produced by extractive (TextRank, LexRank, Lingvo) and abstractive (mT5, mBART, ruT5, T5) summarization models. The work addresses the low informativeness of the standard one-dimensional evaluation metrics, ROUGE and BERTScore, which reduce the multidimensional concept of quality to a single scalar value and do not reveal why a generated abstract is of low quality: mechanical copying of fragments of the source, semantic loss during paraphrasing, an inadequate degree of compression, or other generation defects. The study is motivated by the growing volume of Russian-language scientific information that requires automatic processing and by the need for tools that provide not only a quantitative score but also an interpretable diagnosis of the error types of summarization models. The proposed approach is based on z-normalization of the ROUGE-L and BLEURT metrics relative to the statistical distribution of author-written abstracts, threshold classification into seven diagnostic categories, and calculation of an integral quality metric with a Gaussian penalty for anomalous deviation.
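The z-normalization step can be illustrated with a minimal sketch. It assumes that a metric value for a generated abstract is expressed in standard deviations from the mean of the same metric computed on author-written abstracts. The function name `z_normalize` and the reference scores are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def z_normalize(value, reference_scores):
    """Return how many standard deviations `value` lies from the mean
    of `reference_scores` (scores of author-written abstracts)."""
    mu = mean(reference_scores)
    sigma = stdev(reference_scores)  # sample standard deviation
    if sigma == 0:
        raise ValueError("reference scores have zero variance")
    return (value - mu) / sigma


# Illustrative ROUGE-L scores of author-written abstracts (made-up numbers).
reference_rouge_l = [0.20, 0.30, 0.40]

# A generated abstract with ROUGE-L = 0.50 sits well above the reference norm,
# which under this scheme would hint at unusually heavy lexical overlap.
z_lex = z_normalize(0.50, reference_rouge_l)
print(round(z_lex, 2))
```

Calibrating against author-written abstracts means that a score is judged by how unusual it is for a human abstract, not by its raw magnitude.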
The scientific novelty of the work lies in the development of a three-factor diagnostic assessment system which, unlike existing approaches, calibrates scores against the statistical norm of author-written abstracts, uses a pair of metrics that carry largely independent information (the correlation between ROUGE-L Precision and BLEURT is r = 0.14), and assigns each abstract an interpretable diagnosis from seven categories: excessive copying, semantic incompleteness, insufficient compression, excessive compression, low lexical similarity, ambiguous pattern, and target zone. Experimental testing on a corpus of 480 Russian-language scientific articles from eight subject areas, with seven summarization models, confirmed the discriminating ability of the proposed approach: extractive models are systematically diagnosed with “copying” and “insufficient compression”, multilingual abstractive models show a significant proportion of “semantic incompleteness”, and Russian-language models show the most balanced profile. The proposed integral metric Q, with weighted components and a Gaussian penalty, makes it possible to rank summarization systems across multiple aspects of quality and is consistent with expert judgments about how well balanced the models are.
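The threshold classification and the integral metric can be sketched as follows. The abstract does not give the thresholds, rule order, weights, or exact form of the Gaussian penalty, so everything here is an assumption made for illustration: the threshold of 1.0, the rule priority, the weights, the formula Q = Σ wᵢ·exp(−zᵢ²/2), and the convention that `z_len` is the z-score of the summary-to-source length ratio (positive means longer than typical).

```python
import math

# Hypothetical threshold in standard deviations; not taken from the study.
THRESHOLD = 1.0

# Hypothetical component weights (lexical, semantic, compression).
WEIGHTS = (0.4, 0.4, 0.2)


def diagnose(z_lex, z_sem, z_len, t=THRESHOLD):
    """Map three z-scores to one of the seven diagnostic categories.

    The rule order below is an assumed priority, not the published one.
    """
    if z_len > t:
        return "insufficient compression"
    if z_len < -t:
        return "excessive compression"
    if z_lex > t:
        return "excessive copying"
    if z_sem < -t:
        return "semantic incompleteness"
    if z_lex < -t:
        return "low lexical similarity"
    if z_sem <= t:
        return "target zone"  # all three factors are within the norm
    return "ambiguous pattern"  # deviation not covered by the rules above


def quality(z_lex, z_sem, z_len, weights=WEIGHTS):
    """Assumed form of Q: each component is penalized by a Gaussian
    of its deviation from the norm, then the components are weighted."""
    penalties = [math.exp(-(z ** 2) / 2) for z in (z_lex, z_sem, z_len)]
    return sum(w * p for w, p in zip(weights, penalties))


print(diagnose(2.0, 0.0, 0.0), round(quality(2.0, 0.0, 0.0), 3))
```

Under these assumptions an abstract that matches the norm on all three factors gets Q close to 1, and Q falls smoothly as any factor drifts away from the norm in either direction.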