E teze

General Metadata

Title

Композитне псеудограматике засноване на паралелним језичким моделима српског језика

Creator

Škorić, Mihailo, 1992- CONOR: 60642057

2022

License

Attribution-NonCommercial-NoDerivs 3.0 Serbia (CC BY-NC-ND 3.0)

License description

You permit only downloading and distribution of the work, provided the author's name is properly credited, with no changes to the work and no right to commercial use of the work. This is the most restrictive CC license. Basic description of the license: http://creativecommons.org/licenses/by-nc-nd/3.0/rs/deed.sr_LATN. Full text of the license: http://creativecommons.org/licenses/by-nc-nd/3.0/rs/legalcode.sr-Latn

Language

Serbian

Cobiss-ID

121486089

Academic metadata

Theses Type

Doctoral dissertation

description

Defense date: 06.06.2023.

Other responsibilities

mentor

Stanković, Ranka, 1964-

CONOR: 13635431

mentor

Tomašević, Jelena, 1979-

CONOR: 12939623

committee member

Devedžić, Vladan, 1959-

CONOR: 12522855

committee member

Utvić, Miloš, 1976-

CONOR: 17300071

committee member

Stankov, Dragan, 1963-

CONOR: 57809673

Academic Expertise

Multidisciplinary and interdisciplinary scientific fields

University

Univerzitet u Beogradu

Other Theses Metadata

Alternative title

Composite pseudogrammars based on parallel language models of Serbian

Publisher

[М. Шкорић]

Format

164 pp.

description

Интелигентни системи - Обрада природног језика, Рачунарска лингвистика / Intelligent systems - Natural language processing, Computational linguistics

Abstract (sr)

Циљ овог рада је да предочи предности коришћења композитних интелигентних система заснованих на паралелним архитектурама, а пре свега предност композитних псеудограматика заснованих на паралелним језичким моделима у обради, генерисању и евалуацији природног језика, и то поготово српског. У њему је најпре дат кратак увод у теорију формалних језика, предочене су различите врсте граматика и дат је преглед радова из области креирања њихових апроксимација. Описани су појмови псеудограматика и језичких модела и приказан је њихов историјски развој, са највећим акцентом на тренутно стање и најактуалније методе моделовања језика и језичке моделе. Уведена је проблематика евалуације квалитета текста, и описане су различите методе полу-аутоматске и аутоматске евалуације. У другом делу рада описана су два експеримента која су имала за циљ да утврде методологију креирања композитних система за потребе моделовања српског језика, при чему су описани начини креирања различитих репрезентација докумената и различити начини комбиновања излаза самосталних система у обради природног језика. Паралелни системи су том приликом успешно тестирани на задацима обележавања врста речи и утврђивања ауторства кроз моделовања мини-језика, где су остварили значајно боље резултате од самосталних метода. Коначно, описан је процес обучавања серије генеративних предобучених трансформера над различитим репрезентацијама корпуса српског језика и креирања композитних псеудограматика заснованих на тим моделима и различитим методама комбиновања. Развијени системи су евалуирани на задацима оцењивања квалитета текста, те проналажења и исправљања грешака. Приказани резултати издвојили су наслагани обучени класификатор као оптимални метод комбиновања језичких модела у јединствену псеудограматику.

Abstract (en)

The aim of this thesis is to present the advantages of composite intelligent systems based on parallel architectures and, above all, the advantages of composite pseudogrammars based on parallel language models in the processing, generation, and evaluation of natural language, especially Serbian. First, a brief introduction to the theory of formal languages is given, distinct types of grammars are described, and an overview of work on creating their approximations is presented. The concepts of pseudogrammars and language models are described together with their historical development, with emphasis on the current state of the art, the most recent language modelling methods, and language models. The issue of text quality evaluation is introduced, and various methods of semi-automatic and automatic evaluation are described. The second part of the thesis describes two experiments that aimed to establish a methodology for creating composite systems for modelling the Serbian language, including ways of creating different document representations and of combining the outputs of independent natural language processing systems. In these experiments, parallel systems were successfully tested on the tasks of part-of-speech tagging and authorship attribution through mini-language modelling, where they achieved significantly better results than standalone methods. Finally, the thesis describes the training of a series of generative pretrained transformers on different representations of a corpus of the Serbian language and the creation of composite pseudogrammars based on those models and on different combination methods. The developed systems were evaluated on the tasks of text quality assessment and of finding and correcting errors in text. The presented results singled out the stacked trained classifier as the optimal method for combining language models into a single pseudogrammar.

Authors Key words

моделирање језика, језички модели, композитне структуре, машинско учење, српски језик, анализа текста, генерисање текста, аутоматска евалуација

Authors Key words

language modeling, language models, composite structures, machine learning, Serbian language, text analysis, text generation, automatic evaluation

Classification

811.163.41'322.2:004.85(043.3) 811.163.41'322'367(043.3) 811.163.41'322.2:004.41(043.3)

Type

Text
