Кафедра "Інтелектуальні комп'ютерні системи"
Постійне посилання колекціїhttps://repository.kpi.kharkov.ua/handle/KhPI-Press/2423
Офіційний сайт кафедри http://web.kpi.kharkov.ua/iks
Кафедра "Інтелектуальні комп’ютерні системи" заснована 12 лютого 2007 року на базі спеціальності "Прикладна лінгвістика".
У 2009 році на базі кафедри спільно з Українським мовно-інформаційним фондом НАН України було створено Науково-дослідний центр інтелектуальних систем і комп’ютерної лінгвістики.
Кафедра входить до складу Навчально-наукового інституту соціально-гуманітарних технологій Національного технічного університету "Харківський політехнічний інститут".
У складі науково-педагогічного колективу кафедри працюють: 2 доктора технічних наук, 5 кандидатів філологічних наук, 4 кандидата технічних наук, 1 кандидат філософських наук; 2 співробітника мають звання професора, 3 – доцента.
Document: The aligned Kazakh-Russian parallel corpus focused on the criminal theme (2019). Khairova, N. F.; Kolesnyk, Anastasiia; Mamyrbayev, Orken; Mukhsina, Kuralay.
Nowadays, the development of high-quality parallel aligned text corpora is one of the most relevant and advanced directions of modern linguistics. Special emphasis is placed on creating parallel multilingual corpora for low-resourced languages, such as Kazakh. In the study, we explored texts from four Kazakh bilingual news websites and created, on their basis, a parallel Kazakh-Russian corpus of texts focused on the criminal subject. In order to align the corpus, we used a set of lexical correspondences and the POS-tagging values of both languages. 60% of our corpus sentences are automatically aligned correctly. Finally, we analyzed the factors affecting the percentage of errors.

Document: Applying VSM to Identify the Criminal Meaning of Texts (2020). Khairova, N. F.; Kolesnyk, Anastasiia; Mamyrbayev, Orken; Petrasova, S. V.
Generally, to determine whether a text belongs to a specific theme or domain, we can use text classification approaches. However, the task becomes more complicated when there is no training corpus in which the set of classes and the set of documents belonging to these classes are predetermined. We suggest using the semantic similarity of texts to determine their belonging to a specific domain. Our training corpus includes news articles containing criminal information. In order to determine whether the theme of an input document is close to the theme of the training corpus, we propose to calculate the cosine similarity between the documents of the corpus and the input document. We have empirically established the average value of the cosine similarity coefficient at which a document can be attributed to the highly specialized documents containing criminal information. We evaluate our approach on a test corpus of articles from the news sites of Kharkiv. The F-measure of classifying documents with criminal information reaches 96%.

Document: Automatic Extraction of Synonymous Collocation Pairs from a Text Corpus (Polskie Towarzystwo Informatyczne, Poland, 2018). Khairova, N. F.; Petrasova, S. V.; Lewoniewski, Włodzimierz; Mamyrbayev, Orken; Mukhsina, Kuralay.
Automatic extraction of synonymous collocation pairs from text corpora is a challenging NLP task. In order to search for collocations of similar meaning in English texts, we use logical-algebraic equations. These equations combine the grammatical and semantic characteristics of the words of substantive, attributive and verbal collocation types. With the Stanford POS tagger and the Stanford Universal Dependencies parser, we identify the grammatical characteristics of words. We exploit WordNet synsets to pick synonymous words of collocations. The potential synonymous word combinations found are checked for compliance with the grammatical and semantic characteristics of the proposed logical-linguistic equations. Our dataset includes more than half a million Wikipedia articles from a few portals. The experiment shows that the more frequently synonymous collocations occur in texts, the more related the topics of those texts might be. The precision of the synonymous collocation search in our experiment is close to the results of other, similar studies.
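The WordNet step of the collocation search above lends itself to a compact illustration. The sketch below is not the papers' implementation: it only generates candidate synonymous counterparts for an attributive (adjective + noun) collocation from WordNet synsets via NLTK, the candidate stage that precedes the logical-linguistic check; the helper names and the example collocation are hypothetical.

```python
# A minimal sketch, assuming NLTK with the WordNet corpus installed
# (nltk.download("wordnet")). It proposes candidate synonymous
# collocations for an adjective + noun pair; the grammatical/semantic
# equation check from the papers is not modeled here.
from itertools import product

from nltk.corpus import wordnet as wn


def synonyms(word: str, pos) -> set[str]:
    """Collect single-word lemmas from all WordNet synsets of `word`."""
    lemmas = set()
    for synset in wn.synsets(word, pos=pos):
        for lemma in synset.lemma_names():
            if "_" not in lemma and lemma != word:
                lemmas.add(lemma)
    return lemmas


def candidate_pairs(adjective: str, noun: str) -> list[tuple[str, str]]:
    """Cross synonyms of both collocates to get candidate collocations."""
    adjs = synonyms(adjective, wn.ADJ) | {adjective}
    nouns = synonyms(noun, wn.NOUN) | {noun}
    return [(a, n) for a, n in product(adjs, nouns) if (a, n) != (adjective, noun)]


if __name__ == "__main__":
    # Hypothetical input collocation; real candidates would still have to
    # pass the corpus-frequency and equation-based checks.
    for pair in candidate_pairs("big", "problem")[:10]:
        print(pair)
```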
Document: Automatic Identification of Collocation Similarity (Institute of Electrical and Electronics Engineers, 2015). Petrasova, S. V.; Khairova, N. F.
This paper proposes a logical and linguistic model for the automatic identification of collocation similarity. The method of component analysis is proposed to determine the semantic equivalence between collocates. The set of semantic and grammatical characteristics of collocates is identified by means of the algebra of predicates in order to formalize collocation similarity.

Document: Building the Semantic Similarity Model for Social Network Data Streams (Institute of Electrical and Electronics Engineers, 2018). Petrasova, S. V.; Khairova, N. F.; Lewoniewski, Włodzimierz.
This paper proposes a model for searching for similar collocations in English texts in order to determine semantically connected text fragments for the analysis of social network data streams. The logical-linguistic model uses the semantic and grammatical features of words to obtain a sequence of semantically related text fragments from different actors of a social network. In order to implement the model, we leverage the Universal Dependencies parser and the Natural Language Toolkit with the lexical database WordNet. Based on the Blog Authorship Corpus, the experiment achieves over 0.92 precision.

Document: Detecting Collocations Similarity via Logical-Linguistic Model (Association for Computational Linguistics, USA, 2019). Khairova, N. F.; Petrasova, S. V.; Mamyrbayev, Orken; Mukhsina, Kuralay.
Semantic similarity between collocations, along with word similarity, is one of the main issues of NLP. In particular, it might be addressed to facilitate automatic thesaurus generation. In the paper, we consider a logical-linguistic model that allows defining the relation of semantic similarity of collocations via logical-algebraic equations. We provide the model for English, Ukrainian and Russian text corpora. The implementation for each language differs slightly in the equations of the finite predicates algebra and in the linguistic resources used. As a dataset for our experiment, we use 5801 sentence pairs of the Microsoft Research Paraphrase Corpus for English and more than 1000 texts of scientific papers for Russian and Ukrainian.

Document: Evaluating effectiveness of linguistic technologies of knowledge identification in text collections (ITHEA, Poland, 2014). Khairova, N. F.; Shepelyov, G.; Petrasova, S. V.
The paper analyzes the possibility of using integral coefficients of recall and precision to evaluate the effectiveness of linguistic technologies of knowledge identification in texts. The approach is based on the method of test collections, which is used for experimental validation of the obtained effectiveness coefficients, and on methods of mathematical statistics. The problem of maximizing the reliability of sample results when they are propagated to the general population of the tested text collection is studied. The paper considers a method for determining the confidence interval for the attribute proportion, based on Wilson's formula, and a method for determining the required size of the relevant sample under a specified relative error and confidence probability.
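Wilson's formula mentioned in the last entry is standard enough to sketch. The snippet below is an illustration rather than the paper's code: only the formula itself is taken from standard statistics, and the sample counts are invented.

```python
# A minimal sketch of the Wilson score interval for a sample proportion,
# e.g. the share of correctly identified knowledge items in a sample.
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson confidence interval for a proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half


if __name__ == "__main__":
    # Invented example: 170 items correctly identified in a sample of 200.
    low, high = wilson_interval(170, 200)
    print(f"proportion 0.850, 95% CI: [{low:.3f}, {high:.3f}]")
```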
Document: The Influence of Various Text Characteristics on the Readability and Content Informativeness (2019). Khairova, N. F.; Kolesnyk, Anastasiia; Mamyrbayev, Orken; Mukhsina, Kuralay.
Currently, businesses increasingly use various external big data sources to extract and integrate information into their own enterprise information systems in order to make correct economic decisions, understand customer needs, and predict risks. A necessary condition for obtaining useful knowledge from big data is analyzing high-quality data and using quality textual data. In the study, we focus on the influence of readability and some particular features of texts written for a global audience on the assessment of text quality. In order to estimate the influence of different linguistic and statistical factors on text readability, we reviewed five different text corpora. Two of them contain texts from Wikipedia, the third contains texts from Simple Wikipedia, and the last two include scientific and educational texts. We show the linguistic and statistical features of a text that have the greatest influence on text quality for business corporations. Finally, we propose some directions towards automatically predicting the readability of texts on the Web.
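To make the kind of statistical readability factors discussed above concrete, here is a small sketch, not the study's feature set: the Flesch Reading Ease formula is standard, but the vowel-group syllable heuristic and the example text are assumptions.

```python
# A minimal sketch of surface readability features for plain English text.
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as groups of consecutive vowels (rough heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability_features(text: str) -> dict[str, float]:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # words per sentence
    spw = syllables / len(words)        # syllables per word
    return {
        "words_per_sentence": wps,
        "syllables_per_word": spw,
        # Flesch Reading Ease: higher scores mean easier text.
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
    }


if __name__ == "__main__":
    sample = "The cat sat on the mat. It was a sunny day."
    print(readability_features(sample))
```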
Document: The logic and linguistic model for automatic extraction of collocation similarity (University of Engineering and Economics, Poland, 2015). Khairova, N. F.; Petrasova, S. V.; Gautam, Ajit Pratap Singh.
The article discusses the process of automatic identification of collocation similarity. Semantic analysis is one of the most advanced as well as one of the most difficult NLP tasks. The main problem of semantic processing is the determination of the polysemy and synonymy of linguistic units. In addition, the task becomes more complicated in the case of word collocations. The paper suggests a logical and linguistic model for automatically determining the semantic similarity between collocations in the Ukrainian and English languages. The proposed model formalizes the semantic equivalence of collocations by means of the semantic and grammatical characteristics of collocates. The basic idea of this approach is that the morphological, syntactic and semantic characteristics of lexical units have to be taken into account for the identification of collocation similarity. The basic mathematical means of our model are logical-algebraic equations of the finite predicates algebra. The model examines verb-noun and noun-adjective collocations in Ukrainian and English, which consist of words belonging to the main parts of speech. The model allows extracting semantically equivalent collocations from semi-structured and non-structured texts. Implementing the model makes it possible to automatically recognize semantically equivalent collocations, which increases the effectiveness of natural language processing tasks such as information extraction, ontology generation, sentiment analysis and others.

Document: Logical-linguistic model for multilingual Open Information Extraction (2020). Khairova, N. F.; Mamyrbayev, Orken; Mukhsina, Kuralay; Kolesnyk, Anastasiia.
Open Information Extraction (OIE) is a modern strategy for extracting triplets of facts from Web-document collections. However, most current OIE approaches are based on NLP techniques such as POS tagging and dependency parsing, the tools for which are not available for all languages. In this paper, we suggest a logical-linguistic model whose basic mathematical means are logical-algebraic equations of the finite predicates algebra. These equations allow expressing the semantic role of a participant of a fact triplet (Subject-Predicate-Object) through the relations of the grammatical characteristics of the words in a sentence. We propose a model that extracts an unlimited, domain-independent number of facts from sentences of different languages. The use of our model allows extracting facts from unstructured texts without requiring a pre-specified vocabulary, by identifying relations in phrases and their associated arguments in arbitrary sentences of the English, Kazakh and Russian languages. We evaluate our approach on corpora of the three languages built from English and Kazakh bilingual news websites. We achieve a precision of fact extraction of over 87% for the English corpus, over 82% for the Russian corpus and 71% for the Kazakh corpus.

Document: The Logical-Linguistic Model of Fact Extraction from English Texts (2016). Khairova, N. F.; Petrasova, S. V.; Gautam, Ajit Pratap Singh.
In this paper we suggest a logical-linguistic model that allows extracting required facts from English sentences. We consider a fact in the form of a triplet, Subject > Predicate > Object, with the Predicate representing relations and the Object and Subject pointing out two entities. The logical-linguistic model is based on the use of the grammatical and semantic features of words in English sentences. The basic mathematical characteristic of our model is logical-algebraic equations of the finite predicates algebra. The model was successfully implemented in a system that extracts and identifies facts from the Web content of semi-structured and non-structured English texts.

Document: Method "Mean – Risk" for Comparing Poly-Interval Objects in Intelligent Systems (2019). Shepelev, Gennady; Khairova, N. F.; Kochueva, Zoia.
Problems of comparing poly-interval alternatives under risk in the framework of intelligent computer systems are considered. The problems are common in economics, engineering and other domains. The "mean-risk" approach was chosen as a tool for comparison. A method for calculating both main indicators of the "mean-risk" approach, the mean and the semideviation, for the case of poly-interval alternatives is proposed. The method permits calculating the mentioned indicators for interval alternatives represented as fuzzy objects and as generalized interval estimates.

Document: Methods of comparing interval objects in intelligent computer systems (2017). Shepelev, Gennady; Khairova, N. F.
Problems of representing expert knowledge by means of the generalized interval estimates approach and of using methods for comparing interval alternatives in the framework of intelligent computer systems are considered. The problems are common in economics, engineering and other domains. The necessity of a multi-criteria approach to the comparison problem, one that takes into account both preference criteria and risk criteria, is shown. It is proposed to use a multi-step approach to decision-making concerning the choice of preferable interval alternatives. It is based on the consistent use of different comparison methods: new collective risk estimating techniques, the "mean-risk" approach (for interval-probability situations) and the Savage method (for full-uncertainty situations).
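The two "mean-risk" entries above rank alternatives by a mean indicator and a downside-risk indicator (the semideviation). The sketch below is a hedged illustration only: it estimates both indicators for an interval alternative under a uniform-sampling assumption, which is not the papers' treatment of fuzzy objects or generalized interval estimates; the example alternatives are invented.

```python
# A minimal sketch of "mean-risk" indicators for interval alternatives.
# Assumption (not from the papers): each interval is approximated by
# uniform sampling, and risk is the lower semideviation below the mean.
import random


def mean_and_semideviation(low: float, high: float, n: int = 10_000) -> tuple[float, float]:
    """Estimate mean and lower semideviation of a uniform interval estimate."""
    xs = [random.uniform(low, high) for _ in range(n)]
    mean = sum(xs) / n
    downside = [(mean - x) ** 2 for x in xs if x < mean]
    semidev = (sum(downside) / n) ** 0.5
    return mean, semidev


if __name__ == "__main__":
    # Hypothetical alternatives: prefer a higher mean and a lower risk.
    for name, (lo, hi) in {"A": (2.0, 8.0), "B": (4.0, 6.0)}.items():
        m, s = mean_and_semideviation(lo, hi)
        print(f"{name}: mean={m:.2f}, semideviation={s:.2f}")
```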
Document: Open Information Extraction as Additional Source for Kazakh Ontology Generation (2020). Khairova, N. F.; Petrasova, S. V.; Mamyrbayev, Orken; Mukhsina, Kuralay.
Nowadays, structured information obtained from unstructured texts and Web content can be applied as an additional source of knowledge for creating ontologies. In order to extract information from a text and represent it in the RDF-triplet format, we suggest using the Open Information Extraction model. We then consider the adaptation of the model to fact extraction from unstructured texts in the Kazakh language. In our approach, we identify the lexical units that name the participants of an action (the Subject and the Object) and the semantic relations between them, based on the characteristics of the words in a sentence. The model provides the semantic functions of the action participants via logical-linguistic equations that express the relations of the grammatical and semantic characteristics of the words in a Kazakh sentence. Using the tag names and some syntactic characteristics of the words in Kazakh sentences as the values of the predicate variables in the corresponding equations allows us to extract the Subjects, Objects and Predicates of facts from Web-content texts. The experimental research dataset includes texts extracted from Kazakh bilingual news websites. The experiment shows that we can achieve a precision of fact extraction of over 71% for the Kazakh corpus.

Document: Semantic Similarity Identification for Short Text Fragments (2019). Chuiko, Viktoriia; Khairova, N. F.
The paper reviews the existing methods for semantic similarity identification, such as methods based on the distance between concepts and methods based on lexical intersection. We propose a method for measuring the semantic similarity of short text fragments, i.e. two sentences. We also created a corpus of mass-media texts containing articles of Kharkiv news sorted by source and date. We then annotated the texts, defining the semantic similarity of sentences manually. In this way, we created a learning corpus for our future system.

Document: Similar Text Fragments Extraction for Identifying Common Wikipedia Communities (MDPI AG, Switzerland, 2018). Petrasova, S. V.; Khairova, N. F.; Lewoniewski, Włodzimierz; Mamyrbayev, Orken; Mukhsina, Kuralay.
Similar text fragments extraction from weakly formalized data is a task of natural language processing and intelligent data analysis, used for solving the problem of automatic identification of connected knowledge fields. In order to search for such common communities in Wikipedia, we propose to use, as an additional stage, a logical-algebraic model for similar collocations extraction. With the Stanford Part-Of-Speech tagger and the Stanford Universal Dependencies parser, we identify the grammatical characteristics of collocation words. With WordNet synsets, we choose their synonyms. Our dataset includes Wikipedia articles from different portals and projects. The experimental results show the frequencies of synonymous text fragments in Wikipedia articles that form common information spaces. The number of highly frequent synonymous collocations can provide an indication of the key common up-to-date Wikipedia communities.

Document: Use of Linguistic Criteria for Estimating of Wikipedia Articles Quality (2017). Kolesnyk, Anastasiia; Khairova, N. F.

Document: Using a Technology for Identification of Semantically Connected Text Elements to Determine a Common Information Space (Springer, 2017). Petrasova, S. V.; Khairova, N. F.
A technology is proposed that makes it possible to determine the common information space of actors of social networks by identifying the semantic equivalence of collocations in texts. The technology includes a model for the formal description of the semantic and grammatical characteristics of collocates, the identification of collocations, and the determination of a semantic equivalence predicate for two-word collocations.
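Several entries in this collection (the multilingual OIE model, the fact-extraction model for English, and the Kazakh ontology work above) return facts as Subject-Predicate-Object triplets. The sketch below is a hypothetical English-only baseline built on a spaCy dependency parse; it shows the shape of the extracted output but does not implement the authors' finite-predicates-algebra equations.

```python
# A minimal, hypothetical SVO-triplet baseline for English using spaCy.
# Requires the en_core_web_sm model (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")


def extract_triplets(text: str) -> list[tuple[str, str, str]]:
    """Read (Subject, Predicate, Object) off nominal-subject/object arcs."""
    triplets = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [t for t in token.lefts if t.dep_ in ("nsubj", "nsubjpass")]
                objects = [t for t in token.rights if t.dep_ in ("dobj", "obj")]
                for s in subjects:
                    for o in objects:
                        triplets.append((s.text, token.lemma_, o.text))
    return triplets


if __name__ == "__main__":
    print(extract_triplets("The police arrested the suspect. The court issued a warrant."))
    # e.g. [('police', 'arrest', 'suspect'), ('court', 'issue', 'warrant')]
```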