Method and model for processing textual information on a trained transformer for a knowledge base
Publisher
Видавництво Львівської політехніки
Lviv Politechnic Publishing House
Abstract
An unordered knowledge base is formed from various sets of non-standardized documents. In a decision support system, timely access to information from the knowledge base is key. The article describes a model of an information-retrieval system for working with a set of knowledge presented in PDF format, one of the main formats in military-specialized knowledge bases. The model is developed on a trained transformer with support for cross-language translation, which together forms a method for processing textual information.
Forming a knowledge base is traditionally a complicated problem. Many kinds of objects can be used to build one, and they may differ in structure, format, data representation and language, so simply merging them is neither effective nor suitable. In the general case the result is an unordered knowledge base: it contains uncategorized documents in different formats, which calls for special approaches to the recognition, systematization and subsequent processing of textual information. This is why automating all stages of processing is complicated. It is a natural restriction for decision support systems, especially in military and other applications where time is the key factor and quick, exact access to the knowledge base is required. In the paper we analyze these restrictions and the conditions for forming a knowledge base. We show that the ontology of a knowledge base, in both the general and specific cases, includes the following operations: data collection, data regularization, knowledge extraction, data conversion to a matrix representation, language processing, tokenization, generation of output for a request, and machine learning for optimizing the information-retrieval system. We propose a model of an information-retrieval system for a knowledge base of widely used PDF documents. The model is built on an open trained transformer and the LlamaIndex framework to reduce the time demands of the information-retrieval system. We also include language processing models that translate the specific textual information from Ukrainian into English and back. The result is a method and a model for processing textual information from PDF documents in Ukrainian that could be effective in any decision support system.
The method covers reading, tokenization, translation, analysis and generation of retrieval output for data in Ukrainian. The model proved simple and stable and produced exact estimates, but it also has disadvantages, among them long installation/compilation time and few languages supported by default. The results encourage us to continue the research and to collect a set of statistics for a more thorough evaluation of the model.
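The translate, retrieve, translate-back flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a bag-of-words cosine ranking stands in for the trained transformer and the LlamaIndex index, PDF reading is skipped (chunks are given as plain strings), and the names `answer`, `toy_translate`, `UK_EN` and `EN_UK` are hypothetical. In a real system the two translator callables would wrap actual translation models.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Translator = Callable[[str], str]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer(query_uk: str, chunks_uk: List[str],
           uk_to_en: Translator, en_to_uk: Translator) -> str:
    """Translate query and chunks to English, pick the closest chunk,
    and translate that chunk back to Ukrainian."""
    chunks_en = [uk_to_en(c) for c in chunks_uk]
    query_vec = Counter(tokenize(uk_to_en(query_uk)))
    scores = [cosine(query_vec, Counter(tokenize(c))) for c in chunks_en]
    best = max(range(len(chunks_en)), key=scores.__getitem__)
    return en_to_uk(chunks_en[best])


# Hypothetical word-level lexicon; an assumption standing in for a
# real Ukrainian-English translation model.
UK_EN: Dict[str, str] = {
    "звіт": "report", "про": "about", "постачання": "supply",
    "пального": "fuel", "карта": "map", "місцевості": "terrain",
}
EN_UK: Dict[str, str] = {en: uk for uk, en in UK_EN.items()}


def toy_translate(lexicon: Dict[str, str]) -> Translator:
    """Build a word-by-word translator; unknown words pass through."""
    def translate(text: str) -> str:
        return " ".join(lexicon.get(tok, tok) for tok in tokenize(text))
    return translate


chunks = ["звіт про постачання пального", "карта місцевості"]
print(answer("постачання пального", chunks,
             toy_translate(UK_EN), toy_translate(EN_UK)))
```

Passing the translators in as callables keeps the retrieval step independent of any particular translation model, so either side can be replaced without touching the other.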
Keywords
information processing system, decision support system, method for language and text processing on a trained transformer, machine learning, database ontology, knowledge sets, deep learning in data-processing systems, information-retrieval system, method for processing textual information, ontology of knowledge base, extraction of knowledge
Citation
Lytvyn V. Method and model for processing textual information on a trained transformer for a knowledge base / Vasyl Lytvyn, Volodymyr Tymchuk // Visnyk of Lviv Polytechnic National University. Series: Information Systems and Networks. Lviv: Lviv Polytechnic Publishing House, 2023. No. 14. P. 210–224.