Creation of a multilingual aligned corpus with Ukrainian as the target language and its exploitation

Grabar, Natalia; Hamon, Thierry

dc.citation.conference	Computational linguistics and intelligent systems (COLINS 2017)
dc.contributor.affiliation	CNRS UMR 8163 STL, Université Lille 3, 59653 Villeneuve d'Ascq, France	uk_UA
dc.contributor.affiliation	LIMSI-CNRS, Orsay, Université Paris 13, Sorbonne Paris Cité, France	uk_UA
dc.contributor.author	Grabar, Natalia
dc.contributor.author	Hamon, Thierry
dc.coverage.country	UA	uk_UA
dc.coverage.placename	Kharkiv	uk_UA
dc.date.accessioned	2018-02-22T11:30:55Z
dc.date.available	2018-02-22T11:30:55Z
dc.date.issued	2017
dc.description.abstract	The creation of linguistic resources (such as corpora, lexica or terminologies) occupies an important place in research areas related to linguistics, Natural Language Processing, computer science, psycholinguistics, etc. In this paper, we describe a multilingual corpus in which Ukrainian is the target language, while the source languages are Polish, French and English. The corpus contains literary texts and a small subset of texts from the medical domain. In total, the corpus is composed of 62 literary texts and 129 medical texts. It contains over 1 million words in the target Ukrainian language, and at least as many in the source languages taken together. This is a directional corpus aligned at the sentence level. After describing the corpus, we introduce some possible uses and first results. We then conclude and indicate some directions for future work. The corpus presented in this work is available for research purposes: http://natalia.grabar.free.fr/resources.php.	uk_UA
dc.format.pages	10-19
dc.identifier.citation	Grabar N. Creation of a multilingual aligned corpus with Ukrainian as the target language and its exploitation / Natalia Grabar, Thierry Hamon // Computational linguistics and intelligent systems (COLINS 2017) : proceedings of the 1st International conference, Kharkiv, Ukraine, 21 April 2017 / National Technical University «KhPI», Lviv Polytechnic National University. – Kharkiv, 2017. – P. 10–19. – Bibliography: 40 titles.	uk_UA
dc.identifier.uri	https://ena.lpnu.ua/handle/ntb/39454
dc.language.iso	en	uk_UA
dc.publisher	National Technical University «KhPI»	uk_UA
dc.title	Creation of a multilingual aligned corpus with Ukrainian as the target language and its exploitation	uk_UA
dc.type	Conference Abstract	uk_UA

Files

Original bundle

Name: 004-010-019.pdf
Size: 305.13 KB
Format: Adobe Portable Document Format

Collections

Computational linguistics and intelligent systems. – 2017