Improvement of text data storage methods

Lytvyn, Vasyl; Kalancha, Artem; Uhryn, Dmytro; Talakh, Mariia

dc.contributor.affiliation	Lviv Polytechnic National University
dc.contributor.affiliation	Yuriy Fedkovych Chernivtsi National University
dc.contributor.author	Lytvyn, Vasyl
dc.contributor.author	Kalancha, Artem
dc.contributor.author	Uhryn, Dmytro
dc.contributor.author	Talakh, Mariia
dc.coverage.placename	Lviv
dc.date.accessioned	2025-10-28T08:44:30Z
dc.date.issued	2024
dc.date.submitted	2025
dc.description.abstract	This study analyzes the qualitative characteristics of messages in the Telegram messenger, which serve as raw data for subsequent analysis of textual content. It reviews the parameters of these messages, such as their format, size, presence of noise, and processing speed. The main goal of the article is to model an optimal approach to storing a large volume of data before the text analysis stage. The literature on the topic is analyzed in detail. The article examines the main advantages and disadvantages of existing data preprocessing algorithms, as well as problems related to data cleanliness and their impact on potential research results. The software experiments evaluate how data preprocessing affects the size of the data saved for further use and the speed of input data generation. Two of the proposed methods are highlighted: saving cleaned tokens in string format, and saving word codes in string format together with a word-code dictionary. Using them makes it possible to distribute the tasks of the text analysis system effectively over the course of the day.
dc.format.pages	102-114
dc.identifier.citation	Improvement of text data storage methods / Vasyl Lytvyn, Artem Kalancha, Dmytro Uhryn, Mariia Talakh // Bulletin of Lviv Polytechnic National University. Series: Information Systems and Networks. Lviv: Lviv Polytechnic Publishing House, 2024. No. 15. P. 102–114.
dc.identifier.uri	https://ena.lpnu.ua/handle/ntb/115395
dc.language.iso	uk
dc.publisher	Lviv Polytechnic National University
dc.relation.uri	https://doi.org/10.23939/sisn2024.15.102
dc.subject	text analysis; text preprocessing; database; encoding
dc.subject.udc	004
dc.title	Improvement of text data storage methods
dc.title.alternative	Improvement of text data storage methods
dc.type	Article
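
The two storage methods highlighted in the abstract (cleaned tokens stored as a string, and word codes stored as a string together with a word-code dictionary) can be illustrated with a minimal Python sketch. This record does not include the authors' implementation, so everything below is an assumption made for illustration: the cleaning rule (lowercase, keep only runs of letters), the space separator, and the function names `clean_tokens`, `to_token_string`, `to_code_string`, and `decode` are all hypothetical.

```python
from __future__ import annotations

import re

# Assumed cleaning rule (not from the article): keep runs of Unicode letters,
# which drops digits, punctuation, and emoji.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def clean_tokens(message: str) -> list[str]:
    """Lowercase the message and return its letter-only tokens."""
    return TOKEN_RE.findall(message.lower())


def to_token_string(message: str) -> str:
    """Method 1 (sketch): store cleaned tokens as one space-separated string."""
    return " ".join(clean_tokens(message))


def to_code_string(message: str, vocab: dict[str, int]) -> str:
    """Method 2 (sketch): store integer word codes as a string.

    `vocab` is the word-code dictionary; unseen words receive the next
    free code, so the dictionary grows in place as messages are processed.
    """
    codes = []
    for token in clean_tokens(message):
        code = vocab.setdefault(token, len(vocab))
        codes.append(str(code))
    return " ".join(codes)


def decode(code_string: str, vocab: dict[str, int]) -> list[str]:
    """Recover the tokens from a code string using the word-code dictionary."""
    inverse = {code: word for word, code in vocab.items()}
    return [inverse[int(c)] for c in code_string.split()]


if __name__ == "__main__":
    vocab: dict[str, int] = {}
    raw = "Hello, hello world! 42"
    print(to_token_string(raw))
    encoded = to_code_string(raw, vocab)
    print(encoded, vocab)
    print(decode(encoded, vocab))
```

In this sketch the second method trades a shared dictionary for shorter per-message strings when words repeat often; whether that reduces total storage for a real message corpus is exactly the kind of question the article's experiments address, and the sketch makes no claim about the outcome.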

Collections

Bulletin of Lviv Polytechnic National University. Information Systems and Networks. 2024. Issue 15