Topic segmentation methods comparison on computer science texts

Sokol, Volodymyr Yevhenovych; Krykun, Vitalii Oleksandrovich; Bilova, Mariia Oleksiivna; Perepelytsya, Ivan Dmytrovich; Pustovarov, Volodymyr Volodymyrovich

doi:https://doi.org/10.20998/2079-0023.2021.02.10

dc.contributor.author	Sokol, Volodymyr Yevhenovych	en
dc.contributor.author	Krykun, Vitalii Oleksandrovich	en
dc.contributor.author	Bilova, Mariia Oleksiivna	en
dc.contributor.author	Perepelytsya, Ivan Dmytrovich	en
dc.contributor.author	Pustovarov, Volodymyr Volodymyrovich	en
dc.date.accessioned	2022-02-03T08:50:43Z
dc.date.available	2022-02-03T08:50:43Z
dc.date.issued	2021
dc.description.abstract	The demand for information systems that simplify and accelerate work has greatly increased amid the rapid informatization of society and all its branches. This has prompted the emergence of more and more companies that develop software products and information systems in general. To systematize, process, and use their knowledge, these companies rely on knowledge management systems. One of the main tasks of IT companies is the continuous training of personnel, which requires exporting content from the company's knowledge management system to the learning management system. The main goal of the research is to choose an algorithm that solves the problem of marking up the text of articles similar to those used in knowledge management systems of IT companies. To achieve this goal, it is necessary to compare various topic segmentation methods on a dataset of computer science texts. Inspec is one such dataset, normally used for keyword extraction; in this research it has been adapted to the structure of the datasets used for the topic segmentation problem. The TextTiling and TextSeg methods were compared using some well-known data science metrics and specific metrics that relate to the topic segmentation problem. A new generalized metric was also introduced to compare results for the topic segmentation problem. All software implementations of the algorithms were written in the Python programming language and represent a set of interrelated functions. The results show the advantages of the TextSeg method over TextTiling on both classical data science metrics and special metrics developed for the topic segmentation task. From all the metrics, including the introduced one, it can be concluded that the TextSeg algorithm performs better than the TextTiling algorithm on the adapted Inspec test dataset.	en
dc.description.abstract	Попит на створення інформаційних систем, що спрощують і прискорюють роботу, значно зріс в умовах стрімкої інформатизації суспільства та всіх сфер діяльності. Це пов’язано з появою все більшої кількості компаній, що займаються розробкою програмних продуктів та інформаційних систем в цілому. З метою забезпечення систематизації, обробки та використання цих знань використовуються системи управління знаннями. Одним з головних завдань IT-компаній є постійне навчання персоналу. Для цього потрібно експортувати контент із системи управління знаннями компанії в систему управління навчанням. Основною метою дослідження є вибір алгоритму, який дозволяє вирішити задачу розмітки тексту статей, близьких до тих, що використовуються в системах управління знаннями ІТ-компаній. Для досягнення цієї мети необхідно порівняти різні методи сегментації тем на наборі даних з текстами з комп’ютерних наук. Inspec є одним із таких наборів даних, які використовуються для виділення ключових слів, і у цьому дослідженні він був адаптований до структури наборів даних, які використовуються для проблеми сегментації тем. Методи TextTiling і TextSeg були використані для порівняння деяких добре відомих показників науки про дані та конкретних показників, які стосуються проблеми сегментації тем. Також була введена нова узагальнена метрика для порівняння результатів для задачі сегментації тем. Усі програмні реалізації алгоритмів написані мовою програмування Python і представляють собою набір взаємопов’язаних функцій. Отримано результати, що демонструють переваги методу TextSeg у порівнянні з TextTiling з використанням класичних метрик науки про дані та спеціальних метрик, розроблених для завдання сегментації тем. З усіх метрик, включаючи введену, можна зробити висновок, що алгоритм TextSeg працює краще, ніж алгоритм TextTiling на адаптованому наборі тестових даних Inspec.	uk
dc.identifier.citation	Topic segmentation methods comparison on computer science texts / V. Y. Sokol [et al.] // Вісник Національного технічного університету "ХПІ". Сер. : Системний аналіз, управління та інформаційні технології = Bulletin of the National Technical University "KhPI". Ser. : System analysis, control and information technology : зб. наук. пр. – Харків : НТУ "ХПІ", 2021. – № 2 (6). – С. 59-66.	en
dc.identifier.doi	https://doi.org/10.20998/2079-0023.2021.02.10
dc.identifier.orcid	https://orcid.org/0000-0002-4689-3356
dc.identifier.orcid	https://orcid.org/0000-0003-2576-1001
dc.identifier.orcid	https://orcid.org/0000-0001-7002-4698
dc.identifier.orcid	https://orcid.org/0000-0001-7683-8780
dc.identifier.orcid	https://orcid.org/0000-0003-3944-5771
dc.identifier.uri	https://repository.kpi.kharkov.ua/handle/KhPI-Press/55922
dc.language.iso	en
dc.publisher	Національний технічний університет "Харківський політехнічний інститут"	uk
dc.subject	TextTiling	en
dc.subject	TextSeg	en
dc.subject	Inspec	en
dc.subject	IT Companies	en
dc.subject	IT-компанії	uk
dc.title	Topic segmentation methods comparison on computer science texts	en
dc.title.alternative	Порівняння методів сегментації тем за текстами з комп'ютерних наук	uk
dc.type	Article	en

Files

Name: visnyk_KhPI_2021_2_SAUI_Sokol_Topic.pdf
Size: 1.06 MB
Format: Adobe Portable Document Format

Collections

Bulletin No. 02. System analysis, control and information technology
Department of Software Engineering and Management Intelligent Technologies named after A. V. Dabagyan