Title
Data mining on protein sequences: n-gram analysis of ordered and disordered protein regions
Creator
Alshafah, Samira A., 1978- 21712231
Copyright date
2018
Select license
Attribution 3.0 Serbia (CC BY 3.0)
License description
You permit copying, distribution, and public communication of the work, and of adaptations, provided the author's name is credited in the manner specified by the author or the licensor, even for commercial purposes. This is the most permissive of all the licenses. Basic description of the license: http://creativecommons.org/licenses/by/3.0/rs/deed.sr_LATN Full text of the license: http://creativecommons.org/licenses/by/3.0/rs/legalcode.sr-Latn
Language
English
Cobiss-ID
Theses Type
Doctoral dissertation
description
Defense date: 18.02.2019.
Other responsibilities
mentor
Mitić, Nenad, 1959- 12528743
committee member
Malkov, Saša, 1970- 12803687
committee member
Beljanski, Miloš. 13703783
Academic Expertise
Natural sciences and mathematics
Academic Title
-
University
Univerzitet u Beogradu
Faculty
Matematički fakultet
Alternative title
n-gramska analiza uređenih i neuređenih regiona proteina
Publisher
[S. A. Alshafah]
Format
VI, 111 leaves
description
Computer Science - Data Mining / Računarstvo - Istraživanje podataka
Abstract (en)
Proteins with intrinsically disordered regions are involved in a large number of key cell processes, including signaling, transcription, and chromatin remodeling functions. On the other hand, such proteins have been observed in people suffering from neurological and cardiovascular diseases, as well as various malignancies. Experimentally determining disordered regions in proteins is very expensive and time-consuming. As a consequence, various computer programs for predicting the positions of disordered regions in proteins have been developed and are constantly being improved.
This thesis presents a new method for determining amino acid sequences that characterize ordered and disordered regions. The research material comprises 4076 viruses with more than 190000 proteins. The proposed method is based on establishing a correspondence between the characteristics of n-grams (including both repeats and palindromic sequences) and their membership in ordered or disordered protein regions. The positions of ordered and disordered regions are predicted using three different predictors.
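As an illustration of the idea described above, the following is a minimal Python sketch of how n-grams might be collected separately for ordered and disordered regions, and how repeats and palindromes might be flagged. It is not the thesis implementation. The per-residue mask of `O`/`D` labels, the rule that an n-gram must lie entirely inside one region type, and the function names `ngrams_by_region`, `is_palindrome`, and `repeats` are all assumptions made for this example.

```python
from collections import Counter


def ngrams_by_region(sequence, mask, n):
    """Count n-grams that lie entirely inside ordered ('O') or disordered ('D') residues.

    `mask` is an assumed per-residue labeling of the same length as `sequence`,
    e.g. as might be derived from a disorder predictor's output.
    """
    if len(sequence) != len(mask):
        raise ValueError("sequence and mask must have the same length")
    counts = {"O": Counter(), "D": Counter()}
    for i in range(len(sequence) - n + 1):
        label = mask[i]
        # Keep the window only if every residue in it carries the same known label.
        if label in counts and mask[i:i + n] == label * n:
            counts[label][sequence[i:i + n]] += 1
    return counts


def is_palindrome(ngram):
    """An n-gram is palindromic if it reads the same in both directions."""
    return ngram == ngram[::-1]


def repeats(counter):
    """Return the n-grams that occur more than once (assumed meaning of 'repeat')."""
    return {gram: c for gram, c in counter.items() if c > 1}


# Toy data, invented for illustration only.
sequence = "AKAKAKSSGSSG"
mask = "OOOOOODDDDDD"
counts = ngrams_by_region(sequence, mask, 3)
disordered_palindromes = sorted(g for g in counts["D"] if is_palindrome(g))
```

Windows that straddle a boundary between region types are dropped in this sketch; a different boundary rule would change the counts.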
The features of the repetitive strings used in the research include mole fractions, fractional differences, and z-values. In addition, two data mining techniques, association rule mining and classification, were applied to both repeats and palindromes. The results obtained by all techniques show a high level of agreement for sequences shorter than 6, and the level of agreement rises to its maximum as the length of the sequences increases. The high reliability of the results obtained by the data mining techniques shows that there are n-grams, both repeating sequences and palindromes, that uniquely characterize the disordered or ordered regions of proteins. The results were verified by comparing them with results based on n-grams from the DisProt database, which contains the positions of experimentally verified disordered regions of proteins. The results can be used both for fast localization of disordered and ordered regions in proteins and for further improving existing programs for their prediction.
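The abstract does not define the features it names, so the following Python sketch rests on assumed definitions: the mole fraction of an n-gram is taken as its count divided by the total n-gram count in a region type, and the fractional difference as (f_disordered - f_ordered) / f_ordered. The function names and the toy counts are hypothetical, and z-values are left out because their definition cannot be inferred from this record.

```python
from collections import Counter


def mole_fractions(counts):
    """Assumed definition: share of each n-gram among all n-grams counted in a region type."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.items()}


def fractional_difference(f_disordered, f_ordered):
    """Assumed definition: relative enrichment in disordered versus ordered regions.

    Returns None when the n-gram never occurs in ordered regions,
    since the ratio is undefined in that case.
    """
    if f_ordered == 0:
        return None
    return (f_disordered - f_ordered) / f_ordered


# Toy counts, invented for illustration only.
disordered = Counter({"SSG": 3, "SGS": 1})
ordered = Counter({"SSG": 1, "AKA": 3})

f_dis = mole_fractions(disordered)
f_ord = mole_fractions(ordered)
ssg_diff = fractional_difference(f_dis.get("SSG", 0.0), f_ord.get("SSG", 0.0))
sgs_diff = fractional_difference(f_dis.get("SGS", 0.0), f_ord.get("SGS", 0.0))
```

Under these assumptions a positive value indicates an n-gram that is relatively more frequent in disordered regions, and a negative value one that is more frequent in ordered regions.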
Abstract (sr)
Proteini koji imaju neuređene regione učestvuju u velikom broju ćelijskih procesa kao što su prenos signala, transkripcija i remodelovanje funkcija hromatina. Sa druge strane, pojava takvih proteina je uočena kod osoba koje boluju od neuroloških i kardiovaskularnih bolesti, kao i različitih oblika maligniteta. Eksperimentalno određivanje neuređenih regiona proteina je vrlo skup i spor proces. Zbog toga su razvijeni i stalno se usavršavaju različiti računarski programi za predviđanje pozicija neuređenih regiona u proteinu.
U radu je prikazana nova metoda za određivanje niski amino kiselina koje karakterišu neuređene i uređene regione proteina. Materijal nad kojim je vršeno istraživanje obuhvata 4076 virusa sa preko 190000 proteina. Metoda je zasnovana na ispitivanju osobina n-grama (koji obuhvataju ponavljajuće i palindromske niske) i njihove pripadnosti uređenim i neuređenim regionima proteina. Pozicije neuređenih/uređenih regiona u proteinima su određene korišćenjem tri programa za predviđanje. Osobine ponavljajućih niski koje su korišćene u istraživanju uključuju molske frakcije, frakcijske razlike i z-vrednost. Takođe, na ponavljajuće niske kao i na palindromske niske primenjene su određivanje pravila pridruživanja i klasifikacija, kao tehnike istraživanja podataka. Rezultati dobijeni svim tehnikama pokazuju visok nivo saglasnosti, za niske dužine manje od 6, dok nivo saglasnosti rezultata raste sve do maksimalnog sa porastom dužine niski. Visoka pouzdanost rezultata dobijenih tehnikama istraživanja podataka, pokazuje da postoje n-grami, kako ponavljajuće sekvence tako i palindromi, koji jednoznačno karakterišu neuređene/uređene regione proteina. Dobijeni rezultati su provereni upoređivanjem sa rezultatima zasnovanim na n-gramima iz DisProt baze koja sadrži pozicije eksperimentalno verifikovanih neuređenih regiona proteina, i mogu da budu korišćeni kako za brzu lokalizaciju neuređenih/uređenih regiona u proteinima tako i za dalje poboljšanje postojećih programa za njihovo predviđanje.
Authors Key words
n-gram, istraživanje podataka, uređeni/neuređeni regioni, pravila pridruživanja, proteini
Authors Key words
n-gram, data mining, ordered/disordered regions, association rules, proteins
Type
Text