Title
Constructing and analysing an error-tagged learner corpus of Persian
Creator
Safari, Saeed, 1976-
Copyright date
2017
Language
English
Cobiss-ID
Theses Type
Doctoral dissertation
description
Defence date: 29.12.2017.
Other responsibilities
mentor
Miličević-Petrović, Maja
committee member
Marković, Ljiljana, 1953-
committee member
Filipović, Jelena, 1966-
committee member
Shraei, Reza Morad.
Academic Expertise
Social sciences and humanities
University
Univerzitet u Beogradu
Faculty
Filološki fakultet
Alternative title
Izrada i analiza anotiranog korpusa persijskog jezika kao stranog ; Формирование и анализ аннотированного корпуса персидскогo языка
Publisher
[S. G. Safari]
Format
XVII, 154 leaves
description
Linguistics - Corpus linguistics, Second Language Acquisition, applied linguistics
Abstract (en)
Linguistic corpora constitute reliable sources and empirical means for analyzing
linguistic data. They are also widely used in the fields of Second/Foreign Language
Acquisition and Foreign Language Teaching research, where the most commonly used type
is the learner corpus.
The present thesis, based on a methodological approach for building a learner
corpus, is generally in line with the domain of error analysis and the field of Learner
Corpus Research. The thesis describes the process of constructing and developing an error-tagged
Persian learner corpus, called the Salam Farsi Learner Corpus (SFLC), as well as
an analysis of linguistic errors based on a collection of written texts produced by Serbian
learners of the Persian language. Three major stages, namely, constructing the corpus,
proposing a system of error annotation and developing tools and software, were followed,
and the practical phases such as the systematic collection of data and metadata, defining the
corpus design criteria, creating the error tagsets and developing the corpus interface,
software and specific tools are described. The SFLC software is equipped with four main
tools in order to function as an error-tagged learner corpus and provide the statistical
reports. These tools include a tool for submitting data and metadata into the corpus
database, a computer-aided error editor to facilitate error tagging, filters and search, and
data statistics tools which show various statistical data related to the corpus.
Based on the SFLC statistical reports, the frequency and distribution of errors in the
whole corpus and the comparison of error distributions across different proficiency levels
are discussed. The corpus statistics show that the most frequent errors made by the Serbian
learners of the Persian language are initially to be found in the domain of orthography,
while later on the most frequent errors lie in the domains of lexis and syntax. Word Order is
marked as the most frequent error type in the corpus as a whole. As for the distribution of
errors across specific proficiency levels, the results show that the total number of errors
drops from level A2 to level C1, while errors in syntax increase, due to the use of more
complex syntactic structures at higher proficiency levels. The SFLC not only provides
authentic data gathered from learners at different proficiency levels, but also statistics
regarding error tags and metadata. Research into Persian as a second/foreign language thus
can clearly benefit from the SFLC as a resource.
Abstract (sr)
Linguistic corpora are an important source of, and means of analysing, empirical
linguistic data. They are widely used, among other areas, in research on second/foreign
language acquisition and in language teaching, where the importance of learner corpora
deserves particular emphasis. This dissertation describes the construction of one such
corpus, a learner corpus of Persian called the Salam Farsi Learner Corpus (SFLC). The corpus
is built from texts written by learners whose native language is Serbian while they were
attending Persian language courses. In addition to the texts being converted to digital
format, the errors the learners made in writing are annotated in the corpus.
The three main stages in building the corpus were its design and digitisation, the
proposal of an error annotation system, and the development of tools for building and
searching the corpus. All three stages are described in detail in the dissertation.
Specifically, attention is given to practical steps such as the collection of data and
metadata, as well as to conceptual tasks such as defining the corpus design criteria,
compiling the error tags, and conceiving the corpus interface, software and tools. The SFLC
software relies on four main tools, which enable the entry of data and metadata into the
corpus database, error tagging, the retrieval and search of documents (by surface word forms
or by errors), and the generation of statistical reports on errors.
Based on the statistical reports the SFLC produces, the dissertation also carries out an
error analysis: it examines the frequency and distribution of errors in the corpus as a
whole and at the individual levels of proficiency in Persian. The results of this
corpus-based analysis show that learners whose native language is Serbian most often make
errors in the domain of orthography at lower levels of proficiency in Persian, while later
errors are more often found in the domains of lexis and syntax. Word order errors are marked
as the most frequent error type overall in the entire corpus. The total number of errors
decreases as learners move from level A2 to level C1. In syntax, however, the number of
errors rises, owing to the use of more complex syntactic structures at higher levels.
The SFLC not only provides authentic data collected from learners at different proficiency
levels, but also offers statistics on the tagged errors and other corpus parameters. It is
therefore concluded that the corpus can be of great benefit to research on and the teaching
of Persian as a second/foreign language.
Authors Key words
Learner Corpus, Error Analysis, Second Language Acquisition,
Teaching Persian as a Foreign Language
Authors Key words
Learner corpus, error analysis, second language acquisition, teaching Persian as a foreign language
Classification
811.222.1'243(043.3), 811.222.1'322(043.3)
Type
Text