Computational linguistics and intelligent systems

Permanent URI for this community: https://ena.lpnu.ua/handle/ntb/39447

  • WikiWars-UA: Ukrainian corpus annotated with temporal expressions
    (Lviv Polytechnic Publishing House, 2019-04-18) Grabar, Natalia; Hamon, Thierry; CNRS, Univ. Lille, UMR 8163 - STL - Savoirs Textes Langage, F-59000 Lille, France; LIMSI, CNRS, Université Paris-Saclay, F-91405 Orsay, France; Université Paris 13, Sorbonne Paris Cité, F-93430 Villetaneuse, France
    Reliability of tools and reproducibility of results are important features of modern Natural Language Processing (NLP) tools and methods. Scientific research is increasingly criticized for the lack of reproducibility of its results. A first step towards reproducibility is the availability of freely usable tools and corpora. In our work, we are interested in the automatic processing of unstructured documents for the extraction of temporal information. Our main objective is to create a reference corpus in Ukrainian annotated with temporal information related to dates (absolute and relative), periods, times, etc., together with their normalization. The approach relies on the adaptation of an existing application, the automatic pre-annotation of the WikiWars corpus in Ukrainian, and its manual correction. The reference corpus makes it possible to reliably evaluate the current version of the automatic temporal annotator and to prepare future work on these topics.
  • Unsupervised acquisition of morphological resources for Ukrainian
    (National Technical University «KhPI», 2017) Hamon, Thierry; Grabar, Natalia; LIMSI-CNRS, Orsay, Université Paris 13, Sorbonne Paris Cité, France; CNRS UMR 8163 STL, Université Lille 3, 59653 Villeneuve d'Ascq, France
    The availability of morphological resources is an important and recurrent need because they enable the development of NLP tools and applications for a given language. Such resources provide basic information that these tools need in order to perform more sophisticated processing (information retrieval, morphosyntactic tagging, etc.). We propose to acquire morphological resources for the Ukrainian language. The proposed method exploits corpora to extract words that are morphologically related to each other. The method has two versions: without and with processing of prefixes. The association strength between these words indicates the probability that they share a morphological and semantic relation. We use three corpora (literary, medical and general-language) and evaluate the results obtained. Depending on the corpus, precision varies between 67% and 86%. The results from the different corpora are also compared, which shows that there is little redundancy between the corpora. The currently available resource contains 3,315 fully validated pairs of words.
  • Creation of a multilingual aligned corpus with Ukrainian as the target language and its exploitation
    (National Technical University «KhPI», 2017) Grabar, Natalia; Hamon, Thierry; CNRS UMR 8163 STL, Université Lille 3, 59653 Villeneuve d'Ascq, France; LIMSI-CNRS, Orsay, Université Paris 13, Sorbonne Paris Cité, France
    The creation of linguistic resources (such as corpora, lexica or terminologies) occupies an important place in research areas related to linguistics, Natural Language Processing, Computer Science, psycholinguistics, etc. In this paper, we describe a multilingual corpus in which Ukrainian is the target language, while the source languages are Polish, French and English. The corpus contains literary texts and a small subset of texts from the medical domain. In total, the corpus is composed of 62 literary texts and 129 medical texts. It contains over 1 million words in the target Ukrainian language, and at least as many in the source languages taken together. This is a directional corpus aligned at the sentence level. After describing the corpus, we introduce some possible uses and first results. We then conclude and indicate some directions for future work. The corpus presented in this work is available for research purposes: http://natalia.grabar.free.fr/resources.php.