Development and evaluation of an Urdu treebank (CLE-UTB) and a statistical parser

Bibliographic Details
Published in: Language resources and evaluation. - Springer Netherlands, 2005. - 55 (2020), no. 2, 18 July, pages 287-326
Main Author: Ehsan, Toqeer (Author)
Other Authors: Hussain, Sarmad (Author)
Format: Electronic article
Language: English
Published: 2020
ISSN: 1574-0218
External Sources: subscription required
LEADER 01000naa a22002652 4500
001 OLC2125662396
003 DE-627
005 20230505122911.0
007 cr uuu---uuuuu
008 230505s2020 xx |||||o 00| ||eng c
024 7 |a 10.1007/s10579-020-09492-7  |2 doi 
035 |a (DE-627)OLC2125662396 
035 |a (DE-He213)s10579-020-09492-7-e 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
082 0 4 |a 004  |a 100  |q VZ 
084 |a PHILOS  |q DE-12  |2 fid 
100 1 |a Ehsan, Toqeer  |e verfasserin  |0 (orcid)0000-0002-6724-6705  |4 aut 
245 1 0 |a Development and evaluation of an Urdu treebank (CLE-UTB) and a statistical parser 
264 1 |c 2020 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a © Springer Nature B.V. 2020. corrected publication 2020 
520 |a Abstract A number of natural language processing tools for Urdu language processing have been developed in the past few years for word segmentation, part of speech tagging, chunking, named entity recognition and parsing. Corpora, especially treebanks, are essential data resources for language processing. This work presents the development and evaluation of an Urdu treebank, the CLE-UTB, and a statistical parser. The treebank has been annotated with phrase structure annotation. Part of speech tagging has been performed semi-automatically by using an existing tagger, and incorrect tags were corrected manually by annotators. The syntactic annotation has been performed in the Penn Treebank style to mark phrases. The annotation scheme also adds functional labels for grammatical roles. Currently, the treebank contains 7854 annotated sentences and 148,575 tokens. Completeness and correctness of the syntactic labels have been checked automatically after manual annotation. To ensure the annotation consistency of the resource, a grammar-based evaluation and an automatic consistency checking tool have been used to detect linguistically implausible constituents. The inter-annotator agreement is greater than 90%. We have developed a bidirectional long short-term memory (BiLSTM) based parser and a POS tagger which have been trained on the final version of the treebank. We have improved our results by training the word embeddings on a large Urdu text corpus. Our parser produced an F-score of 88.1% and the POS tagger performed with an accuracy of 96.3%. 
650 4 |a Urdu 
650 4 |a Treebank 
650 4 |a Phrase structure 
650 4 |a Evaluation 
650 4 |a Consistency 
650 4 |a Parser 
700 1 |a Hussain, Sarmad  |4 aut 
773 0 8 |i Enthalten in  |t Language resources and evaluation  |d Springer Netherlands, 2005  |g 55(2020), 2 vom: 18. Juli, Seite 287-326  |w (DE-627)493206647  |w (DE-600)2195235-8  |w (DE-576)121193284  |x 1574-0218  |7 nnns 
773 1 8 |g volume:55  |g year:2020  |g number:2  |g day:18  |g month:07  |g pages:287-326 
856 4 0 |u https://dx.doi.org/10.1007/s10579-020-09492-7  |z lizenzpflichtig  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_OLC 
912 |a FID-PHILOS 
912 |a SSG-OPC-BBI 
912 |a GBV_ILN_11 
912 |a GBV_ILN_20 
912 |a GBV_ILN_22 
912 |a GBV_ILN_23 
912 |a GBV_ILN_24 
912 |a GBV_ILN_31 
912 |a GBV_ILN_32 
912 |a GBV_ILN_39 
912 |a GBV_ILN_40 
912 |a GBV_ILN_60 
912 |a GBV_ILN_62 
912 |a GBV_ILN_63 
912 |a GBV_ILN_65 
912 |a GBV_ILN_69 
912 |a GBV_ILN_70 
912 |a GBV_ILN_73 
912 |a GBV_ILN_74 
912 |a GBV_ILN_90 
912 |a GBV_ILN_95 
912 |a GBV_ILN_100 
912 |a GBV_ILN_101 
912 |a GBV_ILN_105 
912 |a GBV_ILN_110 
912 |a GBV_ILN_120 
912 |a GBV_ILN_138 
912 |a GBV_ILN_150 
912 |a GBV_ILN_151 
912 |a GBV_ILN_152 
912 |a GBV_ILN_161 
912 |a GBV_ILN_170 
912 |a GBV_ILN_171 
912 |a GBV_ILN_187 
912 |a GBV_ILN_213 
912 |a GBV_ILN_224 
912 |a GBV_ILN_230 
912 |a GBV_ILN_250 
912 |a GBV_ILN_281 
912 |a GBV_ILN_285 
912 |a GBV_ILN_293 
912 |a GBV_ILN_370 
912 |a GBV_ILN_374 
912 |a GBV_ILN_602 
912 |a GBV_ILN_636 
912 |a GBV_ILN_702 
912 |a GBV_ILN_2001 
912 |a GBV_ILN_2003 
912 |a GBV_ILN_2004 
912 |a GBV_ILN_2005 
912 |a GBV_ILN_2006 
912 |a GBV_ILN_2007 
912 |a GBV_ILN_2008 
912 |a GBV_ILN_2009 
912 |a GBV_ILN_2010 
912 |a GBV_ILN_2011 
912 |a GBV_ILN_2014 
912 |a GBV_ILN_2015 
912 |a GBV_ILN_2018 
912 |a GBV_ILN_2020 
912 |a GBV_ILN_2021 
912 |a GBV_ILN_2025 
912 |a GBV_ILN_2026 
912 |a GBV_ILN_2027 
912 |a GBV_ILN_2031 
912 |a GBV_ILN_2034 
912 |a GBV_ILN_2037 
912 |a GBV_ILN_2038 
912 |a GBV_ILN_2039 
912 |a GBV_ILN_2044 
912 |a GBV_ILN_2048 
912 |a GBV_ILN_2049 
912 |a GBV_ILN_2055 
912 |a GBV_ILN_2056 
912 |a GBV_ILN_2057 
912 |a GBV_ILN_2059 
912 |a GBV_ILN_2061 
912 |a GBV_ILN_2064 
912 |a GBV_ILN_2065 
912 |a GBV_ILN_2068 
912 |a GBV_ILN_2088 
912 |a GBV_ILN_2093 
912 |a GBV_ILN_2106 
912 |a GBV_ILN_2107 
912 |a GBV_ILN_2108 
912 |a GBV_ILN_2110 
912 |a GBV_ILN_2111 
912 |a GBV_ILN_2112 
912 |a GBV_ILN_2113 
912 |a GBV_ILN_2118 
912 |a GBV_ILN_2119 
912 |a GBV_ILN_2129 
912 |a GBV_ILN_2131 
912 |a GBV_ILN_2143 
912 |a GBV_ILN_2144 
912 |a GBV_ILN_2147 
912 |a GBV_ILN_2148 
912 |a GBV_ILN_2152 
912 |a GBV_ILN_2153 
912 |a GBV_ILN_2188 
912 |a GBV_ILN_2190 
912 |a GBV_ILN_2232 
912 |a GBV_ILN_2336 
912 |a GBV_ILN_2446 
912 |a GBV_ILN_2470 
912 |a GBV_ILN_2474 
912 |a GBV_ILN_2507 
912 |a GBV_ILN_2522 
912 |a GBV_ILN_2548 
912 |a GBV_ILN_2937 
912 |a GBV_ILN_2949 
912 |a GBV_ILN_2950 
912 |a GBV_ILN_4012 
912 |a GBV_ILN_4035 
912 |a GBV_ILN_4037 
912 |a GBV_ILN_4046 
912 |a GBV_ILN_4112 
912 |a GBV_ILN_4125 
912 |a GBV_ILN_4126 
912 |a GBV_ILN_4242 
912 |a GBV_ILN_4246 
912 |a GBV_ILN_4249 
912 |a GBV_ILN_4251 
912 |a GBV_ILN_4305 
912 |a GBV_ILN_4306 
912 |a GBV_ILN_4307 
912 |a GBV_ILN_4313 
912 |a GBV_ILN_4322 
912 |a GBV_ILN_4323 
912 |a GBV_ILN_4324 
912 |a GBV_ILN_4325 
912 |a GBV_ILN_4326 
912 |a GBV_ILN_4328 
912 |a GBV_ILN_4333 
912 |a GBV_ILN_4334 
912 |a GBV_ILN_4335 
912 |a GBV_ILN_4336 
912 |a GBV_ILN_4338 
912 |a GBV_ILN_4346 
912 |a GBV_ILN_4393 
912 |a GBV_ILN_4700 
951 |a AR 
952 |d 55  |j 2020  |e 2  |b 18  |c 07  |h 287-326