Method for determining the semantic similarity of arbitrary length texts using the Transformers models

Olizarenko, Serhii; Radchenko, Viacheslav

Files

AIS_2021_5_2_Olizarenko_Method.pdf (1.03 MB)

Date

2021

Authors

Olizarenko, Serhii

Radchenko, Viacheslav

ORCID

https://orcid.org/0000-0002-7762-6541
https://orcid.org/0000-0002-2505-1969

DOI

doi.org/10.20998/2522-9052.2021.2.18

Publisher

National Technical University "Kharkiv Polytechnic Institute"

Abstract

The paper presents the development of a method for determining the semantic similarity of arbitrary-length texts from their vector representations. The vector representations are obtained with a multilingual Transformers model, and the task of determining the semantic similarity of arbitrary-length texts is treated as a classification problem over pairs of text sequences using a Transformers model. A comparative analysis was performed to select the most suitable Transformers model for this class of problems. The main stages of the method are fine-tuning the Transformers model within the pretrained model's second task (sentence prediction), and selecting and implementing a summarization method for text sequences longer than 512 (1024) tokens, so that semantic similarity can be determined for texts of arbitrary length.
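
The two stages named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: `tokenize`, `lead_summary`, and `overlap_score` are hypothetical placeholders standing in for a subword tokenizer, the chosen summarization method, and the fine-tuned pair classifier, and the three-special-token packing is an assumption borrowed from BERT-style models. Only the control flow (shorten each text to fit the input limit, then score the pair) follows the description above.

```python
from typing import Callable, List

MAX_LEN = 512        # input limit named in the abstract (1024 for some models)
SPECIAL_TOKENS = 3   # assumption: BERT-style packing, [CLS] a [SEP] b [SEP]


def tokenize(text: str) -> List[str]:
    """Placeholder tokenizer: whitespace split instead of subword tokens."""
    return text.split()


def lead_summary(text: str, budget: int) -> str:
    """Placeholder summarizer: keep leading sentences that fit in `budget` tokens."""
    kept: List[str] = []
    used = 0
    for sentence in text.split(". "):
        n = len(tokenize(sentence))
        if used + n > budget:
            break
        kept.append(sentence)
        used += n
    if not kept:
        # Even the first sentence is too long: fall back to hard truncation.
        return " ".join(tokenize(text)[:budget])
    return ". ".join(kept)


def fit_to_budget(text: str, budget: int,
                  summarize: Callable[[str, int], str] = lead_summary) -> str:
    """Return the text unchanged if it fits, otherwise a shortened version."""
    if len(tokenize(text)) <= budget:
        return text
    return summarize(text, budget)


def overlap_score(a: str, b: str) -> float:
    """Placeholder for the fine-tuned pair classifier: Jaccard word overlap."""
    sa, sb = set(tokenize(a.lower())), set(tokenize(b.lower()))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def semantic_similarity(text_a: str, text_b: str,
                        classify_pair: Callable[[str, str], float] = overlap_score,
                        max_len: int = MAX_LEN) -> float:
    """Shorten both texts so the packed pair fits in `max_len`, then score it."""
    budget = (max_len - SPECIAL_TOKENS) // 2  # equal share for each text
    a = fit_to_budget(text_a, budget)
    b = fit_to_budget(text_b, budget)
    return classify_pair(a, b)
```

In a real pipeline the `classify_pair` argument would wrap the fine-tuned Transformers model, and `summarize` would be whichever summarization method is selected; splitting the budget equally between the two texts is only one possible design choice.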

Keywords

vector representation, fine-tuning

Bibliographic description

Olizarenko S. Method for determining the semantic similarity of arbitrary length texts using the Transformers models / Serhii Olizarenko, Viacheslav Radchenko // Сучасні інформаційні системи = Advanced Information Systems. – 2021. – Vol. 5, No. 2. – P. 126-130.

URI

https://repository.kpi.kharkov.ua/handle/KhPI-Press/53778

Collections

Department of "Computer Engineering and Programming"
