An analysis system for extracting information from textual data using an artificial neural network

Линник, Роман Олександрович; Lynnyk, Roman Oleksandrovych


dc.contributor.advisor	Демків, Любомир Ігорович
dc.contributor.affiliation	Lviv Polytechnic National University
dc.contributor.author	Линник, Роман Олександрович
dc.contributor.author	Lynnyk, Roman Oleksandrovych
dc.coverage.placename	Lviv
dc.date.accessioned	2025-02-26T13:00:45Z
dc.date.created	2022
dc.date.issued	2022
dc.description.abstract	Natural language processing grows in popularity every year. One of the core tasks of the field, and an active research topic, is extracting valuable information from text documents. The topic is highly relevant because the volume of information keeps growing while the time available to process it by hand does not, so the field attracts more attention each year and is used at many leading IT companies. With so much information, it is extremely hard for people to keep up with everything they need to know, especially those who must read large amounts of text to get the gist. Extracting information from textual data is one solution: it lets a reader get to what matters without spending too much time on it, saving time and energy and reducing stress [1].
The purpose of the study is to model and develop a system that extracts short, meaningful information from a large body of text data so that the context of a work can be understood quickly. The object of the study is the process of analyzing and extracting valuable textual data from large data sets and formulating short theses that describe the content. The subject of the study is the methods and principles of information extraction from a large set of textual data. The result of the study is a system that extracts data from a large set of textual data using a recurrent neural network and natural language processing, helping a reader understand the content of a document without reading it in full. The main task of extracting information from a large set of textual data is to help a person process large streams of data without spending much time and effort. Searching for specific data often requires rereading a long document, which is time-consuming; in other cases a large file arrives at work and has to be processed quickly. The system is meant to simplify and speed up work with large volumes of data.
Extracting the main content of a text is the problem of creating a short, accurate, and fluent summary of a long text document. Automatic text summarization methods are needed to cope with the ever-growing amount of textual data available on the Internet, both to help find relevant information and to consume it faster. There are two main approaches to obtaining a summary from text data: • extraction of the sentences that carry the greatest content weight; • generation of new sentences based on the processed information. This master's qualification thesis considers the second approach. The technique creates entirely new phrases that convey the meaning of the input text. The main idea is to place strong emphasis on form, producing a grammatical summary, which requires advanced language modeling techniques [2].
The neural network is built on the Encoder-Decoder (also called sequence-to-sequence) architecture, first introduced by researchers at Google in 2014. The model maps an input sequence to an output sequence whose length may differ from the input's, and consists of three main parts: an encoder, an intermediate encoder vector, and a decoder. The encoder is a stack of recurrent units (LSTM or GRU cells for better performance), each of which takes a single element of the input sequence, collects information about that element, and propagates it forward. The intermediate vector produced by the encoder encapsulates information about all input elements to help the decoder make accurate predictions. Finally, the decoder computes each output from the hidden state at the current time step and the corresponding weights [3].
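The encoder, intermediate vector, and decoder described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes a plain tanh recurrent cell in place of LSTM or GRU, random untrained weights, integer token ids, and greedy decoding. All function and parameter names (`init_params`, `encode`, `decode`, `start_token`, `end_token`) are hypothetical and chosen only for illustration.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weights; the values are arbitrary, not trained."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def one_hot(index, size):
    """Encode a token id as a one-hot vector."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def rnn_step(w_in, w_hid, x, h):
    """Plain tanh recurrent cell (assumed stand-in for LSTM/GRU)."""
    a = matvec(w_in, x)
    b = matvec(w_hid, h)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(vocab_size, hidden_size, seed=0):
    """Create untrained weights for the encoder and the decoder."""
    rng = random.Random(seed)
    return {
        "enc_in": make_matrix(hidden_size, vocab_size, rng),
        "enc_hid": make_matrix(hidden_size, hidden_size, rng),
        "dec_in": make_matrix(hidden_size, vocab_size, rng),
        "dec_hid": make_matrix(hidden_size, hidden_size, rng),
        "dec_out": make_matrix(vocab_size, hidden_size, rng),
    }


def encode(tokens, params, vocab_size, hidden_size):
    """Read the input one element at a time and propagate the state forward."""
    h = [0.0] * hidden_size
    for token in tokens:
        x = one_hot(token, vocab_size)
        h = rnn_step(params["enc_in"], params["enc_hid"], x, h)
    return h  # the intermediate (context) vector


def decode(context, params, vocab_size, start_token, end_token, max_len):
    """Greedily emit tokens, starting from the encoder's context vector."""
    h = context
    token = start_token
    output = []
    for _ in range(max_len):
        x = one_hot(token, vocab_size)
        h = rnn_step(params["dec_in"], params["dec_hid"], x, h)
        # Output = hidden state at this step times the output weights.
        probs = softmax(matvec(params["dec_out"], h))
        token = max(range(vocab_size), key=probs.__getitem__)
        if token == end_token:
            break
        output.append(token)
    return output


# Usage: a 5-token input is compressed into a context vector, then decoded
# into at most 3 output tokens, so input and output lengths may differ.
params = init_params(vocab_size=8, hidden_size=4)
context = encode([3, 1, 4, 1, 5], params, vocab_size=8, hidden_size=4)
summary_ids = decode(context, params, vocab_size=8,
                     start_token=0, end_token=1, max_len=3)
```

Because the weights are untrained, the decoded ids carry no meaning; the sketch only shows how the context vector has a fixed size regardless of input length and how each output is computed from the current hidden state.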
dc.format.pages	103
dc.identifier.citation	Lynnyk R. O. An analysis system for extracting information from textual data using an artificial neural network: qualification thesis for the master's degree in the specialty "8.124.00.03 Data Analysis (Data Science)" / Roman Oleksandrovych Lynnyk. Lviv, 2022. 103 p.
dc.identifier.uri	https://ena.lpnu.ua/handle/ntb/63275
dc.language.iso	uk
dc.publisher	Lviv Polytechnic National University
dc.relation.references	Filippova K., Alfonseca E., Colmenares C. A., Kaiser L., Vinyals O. Sentence compression by deletion with LSTMs. 2017. 42 p.
dc.relation.references	Rush A. M., Chopra S., Weston J. A neural attention model for abstractive sentence summarization. 2015. 32 p.
dc.relation.references	Chopra S., Auli M., Rush A. M. Abstractive sentence summarization with attentive recurrent neural networks. 2016. 53 p.
dc.rights.holder	© Lviv Polytechnic National University, 2022
dc.rights.holder	© Линник, Роман Олександрович, 2022
dc.subject	8.124.00.03
dc.subject	large set of text data
dc.subject	LSTM
dc.subject	encoder
dc.subject	decoder
dc.subject	text summarization
dc.subject	relevant information
dc.title	Система аналізу видобування інформації з текстових даних за допомогою штучної нейронної мережі
dc.title.alternative	An analysis system for extracting information from textual data using an artificial neural network
dc.type	Students_diploma

Files

Original bundle

Name: 2022_81240003_Lynnyk_Roman_Oleksandrovych_148757.pdf
Size: 3.37 MB
Format: Adobe Portable Document Format

Collections

Master's theses