Text normalization during pre-corpus preparation: experience of application

Кульчицький, Ігор; Kulchytskyy, Ihor

dc.citation.epage	58
dc.citation.issue	7
dc.citation.journalTitle	Вісник Національного університету "Львівська політехніка". Інформаційні системи та мережі
dc.citation.spage	51
dc.contributor.affiliation	Національний університет “Львівська політехніка”
dc.contributor.affiliation	Lviv Polytechnic National University
dc.contributor.author	Кульчицький, Ігор
dc.contributor.author	Kulchytskyy, Ihor
dc.coverage.placename	Львів
dc.coverage.placename	Lviv
dc.date.accessioned	2021-02-11T14:12:50Z
dc.date.available	2021-02-11T14:12:50Z
dc.date.created	2020-02-24
dc.date.issued	2020-02-24
dc.description.abstract	Узагальнено досвід унормування текстів перед внесенням їх у корпус творів Наддністрянської України, створення якого розпочато на кафедрі прикладної лінгвістики Львівської політехніки. Йдеться про тексти художнього стилю. Під унормуванням розуміємо сукупність інформаційних процедур, що роблять текст придатним до внесення його в корпус: приведення всіх текстів до однієї кодової таблиці, перевірку їх на пунктуаційну коректність (однакові за смислом сутності мають бути позначені одним знаком), усунення зайвих символів (наприклад, порожні абзаци, декілька пробілів поспіль і т. ін.), уніфікацію засобів та способів форматування тощо. Як програмне середовище унормування запропоновано редактор MS Word, а для створення додаткового програмного інструментарію – мову програмування Python. Процес унормування текстів містить такі етапи: унормування кодування, унормування графіки, коректура тексту, технічне унормування пунктуації. Для кожного етапу подано його характеристику, вказано проблеми, які виникають при його реалізації та запропоновано шляхи їх подолання. Зроблено висновки.
dc.description.abstract	The article summarizes the experience of normalizing texts before they are added to the corpus of literary works of Naddnistrian Ukraine, which the Department of Applied Linguistics of Lviv Polytechnic National University has begun to build. The texts in question are works of fiction. Normalization here means a set of information procedures that make a text suitable for inclusion in the corpus: converting all texts to a single code table, checking punctuation for consistency (entities with the same meaning must be marked with the same character), removing superfluous characters (for example, empty paragraphs or several consecutive spaces), unifying formatting tools and methods, and so on. The MS Word editor is proposed as the software environment for normalization, and the Python programming language for building additional tools. The normalization process consists of the following stages: normalization of encoding, normalization of graphics, proofreading of the text, and technical normalization of punctuation. For each stage the article gives its characteristics, points out the problems that arise during its implementation, and suggests ways to overcome them. Conclusions are drawn.
dc.format.extent	51-58
dc.format.pages	8
dc.identifier.citation	Кульчицький І. Унормовування тексту при докорпусному опрацюванні: досвід застосування / Ігор Кульчицький // Вісник Національного університету "Львівська політехніка". Інформаційні системи та мережі. — Львів : Видавництво Львівської політехніки, 2020. — № 7. — С. 51–58.
dc.identifier.citationen	Kulchytskyy I. Text normalization during pre-corpus preparation: experience of application / Ihor Kulchytskyy // Visnyk Natsionalnoho universytetu "Lvivska politekhnika". Informatsiini systemy ta merezhi. — Lviv : Lviv Polytechnic Publishing House, 2020. — No 7. — P. 51–58.
dc.identifier.uri	https://ena.lpnu.ua/handle/ntb/56141
dc.language.iso	uk
dc.publisher	Видавництво Львівської політехніки
dc.publisher	Lviv Polytechnic Publishing House
dc.relation.ispartof	Вісник Національного університету "Львівська політехніка". Інформаційні системи та мережі, 7, 2020
dc.rights.holder	© Національний університет “Львівська політехніка”, 2020
dc.rights.holder	© Кульчицький І., 2020
dc.subject	корпус текстів
dc.subject	унормування
dc.subject	кодові таблиці
dc.subject	графіка тексту
dc.subject	коректура тексту
dc.subject	пунктуація
dc.subject	corpus of texts
dc.subject	normalization
dc.subject	code tables
dc.subject	text graphics
dc.subject	text correction
dc.subject	punctuation
dc.subject.udc	004.415.3
dc.title	Унормовування тексту при докорпусному опрацюванні: досвід застосування
dc.title.alternative	Text normalization during pre-corpus preparation: experience of application
dc.type	Article
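
The abstract names Python as the language for the additional tooling and lists the kinds of clean-up involved: a single code table, one character per punctuation meaning, no empty paragraphs, no runs of spaces. The following is only a minimal sketch of what such a step might look like and is not taken from the article. The function names, the list of candidate encodings, and the character mapping are all assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical mapping: which variants collapse to which character is an
# editorial decision; these pairs are illustrative assumptions only.
PUNCT_TABLE = str.maketrans({
    "`": "'",        # grave accent used as an apostrophe
    "\u2019": "'",   # right single quotation mark
    "\u02bc": "'",   # modifier letter apostrophe
    "\u2015": "\u2014",  # horizontal bar -> em dash
})


def decode_to_unicode(raw, encodings=("utf-8", "cp1251")):
    """Try candidate encodings in order and return the first clean decode.

    This is a simple heuristic, not reliable encoding detection.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit the input")


def normalize_text(text):
    """Unify punctuation variants, collapse spaces, drop empty paragraphs."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(PUNCT_TABLE)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = []
    for line in text.split("\n"):
        # Treat tabs and no-break spaces as ordinary spaces, then collapse runs.
        line = re.sub(r"[ \t\u00a0]+", " ", line).strip()
        if line:  # skip paragraphs that are empty after trimming
            paragraphs.append(line)
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = "Слово\u00a0  про   м`яч\r\n\r\n \r\nДругий  абзац".encode("cp1251")
    print(normalize_text(decode_to_unicode(sample)))
```

Trying UTF-8 first and falling back to a single-byte Cyrillic code page is a common pragmatic order, since text in a legacy code page rarely forms valid UTF-8. A real pipeline would still need manual checks of the kind the abstract assigns to the proofreading stage.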

Collections

Вісник Національного університету "Львівська політехніка". Інформаційні системи та мережі. – 2020. – Випуск 7