FipsRomanian: Towards a Romanian Version of the Fips Syntactic Parser

Violeta Seretan, Eric Wehrli, Luka Nerima, Gabriela Soare
LATL – Language Technology Laboratory
{violeta.seretan, eric.wehrli, luka.nerima, gabriela.

Romanian language

[Map: Europe - Romance languages]

Vocabulary
• Latin origin (fundamental vocabulary)
• Slavic origin
• Neologisms: French, Italian, …
• Loanwords: Turkish, Greek, Hungarian, Albanian, ...

Orthography
• Phonemic; Latin alphabet (since 1859)
• Diacritics: ă/ə, â/ɨ, î/ɨ; cedilla: ş/ʃ, ţ/ʦ

Morphology
• Case system inherited from Latin: nominative-accusative, genitive-dative, vocative
• Three grammatical genders: masculine, feminine, neuter
• Rich inflection of determiners, nouns, adjectives, and verbs, e.g., about 35 forms for a verb
• The definite article is enclitic, i.e., suffixed to nouns and adjectives:
  – casă/house – casa/house-the
  – mare/big – marea/big-the

Syntax
• VSO language, relatively free word order

Sample text
Prezentul regulament intră în vigoare în a douăzecea zi de la publicarea în Jurnalul Oficial al Uniunii Europene.
This Regulation shall enter into force on the twentieth day following that of its publication in the Official Journal of the European Union.
http://wt.jrc.it/lt/Acquis/

Extending Fips to Romanian: two main tasks

Lexicon construction
• List of headwords (DEX, 1998)
• Morphological generation: given a base word form, generates all its forms according to the appropriate inflection paradigm
• Manual and semi-automatic insertion
• Manual insertion for verbs (specific information: subcategorization, selectional features, thematic function, …)
• Current status:
  – simple entries: 60 K lexemes / 380 K words (10 K proper nouns)
  – complex entries: multi-word expressions (compounds and collocations): de jur împrejurul “around”; problemă – a se pune “problem – to arise”

Grammar implementation
• Specifications (Soare, 2005)
• Customisation of the FipsRomanian grammar for standard operations (syntactic transformations: relativization, interrogation, passivization, ...)
• Similarities and differences. Examples:
  – clitic system
  Similarities:
  – wh-fronting
• Attachment rules: constraints on the main parser operation, Merge, which combines two adjacent structures into a larger structure
• Current status: about 100 rules specified; nearly half implemented and tested

Fips: a multilingual parsing architecture (Wehrli, 2007)

Underlying theory
• Generative Grammar (Chomsky, 1995)
• Simpler Syntax (Culicover and Jackendoff, 2005)
• Lexical Functional Grammar (Bresnan, 2001)

Output
• Rich sentence representation:
  – constituent structure
  – predicate-argument table
  – co-indexation chains
  – intra-sentential pronoun resolution
[Figure: sample parse tree produced by Fips, with labels: subject, predicate, direct object]

Implementation
• Left-to-right, bottom-up tabular parsing algorithm, relying on detailed lexical information
• Language-independent core + language-specific implementation
• Component Pascal, OOP paradigm, BlackBox IDE
• Supported languages: French, English, German, Spanish, Italian, Greek; others in progress

FipsRomanian: Sample results
[Figures: screen captures]

Preliminary results

Parsing experiment
• Data: journalistic texts, 1.05 M words
• Average sentence length: 26.9 tokens
• 16.2% full parses (FipsFrench, FipsEnglish: about 80%)
• Average partial parse length: 5.3 tokens
• Unknown words: 6.5% (of which 39.2% proper nouns)
• Satisfactory lexical coverage
• Grammatical coverage needs to be improved (work in progress!)
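The morphological generation step listed under Lexicon construction (a base form plus an inflection paradigm yields the inflected forms) can be illustrated with a minimal sketch. It is not the FipsRomanian generator: the `Paradigm` class, the paradigm names `fem_a` and `fem_e`, and the feature labels are hypothetical, and only the two singular pairs quoted on the poster (casă/casa, mare/marea) are covered.

```python
from dataclasses import dataclass


@dataclass
class Paradigm:
    """Hypothetical inflection paradigm: strip a suffix, then append endings."""
    name: str
    strip: str      # suffix removed from the base form to obtain the stem
    endings: dict   # feature label -> ending appended to the stem


# Toy paradigms (assumed names), limited to the singular forms shown on the poster.
PARADIGMS = {
    # casă -> casa: the final -ă is replaced by the enclitic article -a
    "fem_a": Paradigm("fem_a", "ă", {"sg.indef": "ă", "sg.def": "a"}),
    # mare -> marea: the enclitic article -a is added after the final -e
    "fem_e": Paradigm("fem_e", "e", {"sg.indef": "e", "sg.def": "ea"}),
}


def generate_forms(base: str, paradigm_name: str) -> dict:
    """Return every form the named paradigm defines for the given base form."""
    paradigm = PARADIGMS[paradigm_name]
    if not base.endswith(paradigm.strip):
        raise ValueError(f"{base!r} does not fit paradigm {paradigm.name!r}")
    stem = base[: len(base) - len(paradigm.strip)]
    return {feature: stem + ending for feature, ending in paradigm.endings.items()}


if __name__ == "__main__":
    print(generate_forms("casă", "fem_a"))
    print(generate_forms("mare", "fem_e"))
```

A real paradigm would also list plural and genitive-dative endings; the table-driven design simply makes each new paradigm a data entry rather than new code.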
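The Merge operation described under Grammar implementation combines two adjacent structures into a larger one, subject to attachment rules. The sketch below shows that idea only; the `Constituent` class, the category labels, and the two rules in `ATTACHMENT_RULES` are assumptions made for illustration and do not reproduce the Fips rule format or its Component Pascal implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Constituent:
    category: str          # syntactic category label, e.g. "DP" or "VP"
    start: int             # index of the first token covered
    end: int               # index one past the last token covered
    children: tuple = ()   # sub-structures combined by merge()


# Hypothetical attachment rules: (left category, right category) -> result category.
ATTACHMENT_RULES = {
    ("V", "DP"): "VP",    # a verb takes a DP complement on its right
    ("DP", "VP"): "TP",   # a subject DP attaches to the left of a VP
}


def merge(left: Constituent, right: Constituent) -> Optional[Constituent]:
    """Combine two adjacent structures if an attachment rule licenses it."""
    if left.end != right.start:
        return None  # the two structures are not adjacent
    result_category = ATTACHMENT_RULES.get((left.category, right.category))
    if result_category is None:
        return None  # no rule allows this attachment
    return Constituent(result_category, left.start, right.end, (left, right))


if __name__ == "__main__":
    subject = Constituent("DP", 0, 1)
    verb = Constituent("V", 1, 2)
    obj = Constituent("DP", 2, 4)
    verb_phrase = merge(verb, obj)
    print(merge(subject, verb_phrase))
```

In a tabular parser, a function of this kind would be tried on each pair of adjacent chart entries as the input is read left to right, with every successful result stored back in the chart.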
[Screen capture captions: parsing output, POS-tagging output, Fips interface, Lexicon interface]

Task-based evaluation
• Collocation extraction from parsed data (Seretan, 2008)
• Collocations are half-idioms (idioms of encoding, but not of decoding)
• Used by the parser and by the in-house rule-based machine translation system
• Precision for the top 2000 results: 30.3% (precision for French data: 65.9%, top 500 results)
[Table: sample collocations extracted]

Related work & Useful resources
• Data-driven dependency parser for Romanian based on MaltParser, which learns dependencies from manual annotations (Călăcean and Nivre, 2009). Problem: reduced treebank size and grammatical coverage (simple structures, no subordination, average sentence length of only 9 words).
• Sketch Engine for Romanian: shallow parsing (POS patterns), http://www.sketchengine.co.uk/
• Dependency treebank construction, work in progress at the University of Iaşi, Romania
• Text processing web services, RACAI – Research Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania, http://www.racai.ro/webservices/TextProcessing.aspx
• A repository of tools for Romanian: ConsILR – Consortium for the Romanian Language: Resources & Tools, research groups from Iaşi, Bucharest and Chişinău, http://consilr.info.uaic.ro/

Faculté des Lettres, Département de Linguistique

References
• Bresnan, J. 2001. Lexical Functional Syntax. Blackwell, Oxford.
• Chomsky, N. 1995. The Minimalist Program. MIT Press, Cambridge, Mass.
• Călăcean, M. and J. Nivre. 2009. A data-driven dependency parser for Romanian. In Proceedings of the 7th International Workshop on Treebanks and Linguistic Theories (TLT 7), pages 65–76, Groningen, Holland.
• 1998. DEX – Dicţionarul explicativ al limbii române. Academia Română, Bucharest.
• Seretan, V. 2008. Collocation extraction based on syntactic parsing. Ph.D. thesis, University of Geneva.
• Soare, G. 2005. Romanian syntax. Technical report, University of Geneva.
• Wehrli, E. 2007. Fips, a “deep” linguistic multilingual parser. In ACL 2007 Workshop on Deep Linguistic Processing, pages 120–127, Prague, Czech Republic.