Spanish FrameNet Project
Autonomous University of Barcelona
Marc Ortega

Spanish FrameNet Project
• Spanish FrameNet is a research project sponsored by the Department of Education of Spain (Grant No. TSI2005-01200) from December 2005 to December 2006.
• A new grant proposal has been submitted to the Spanish Department of Education for the period 2007-2009.
• SFN is developed at the Autonomous University of Barcelona (Spain) and the International Computer Science Institute (Berkeley, CA) in cooperation with the FrameNet Project.
• PI: Carlos Subirats; System Analyst: Marc Ortega; two linguists.

SFN Goals
• The Spanish FrameNet Project is creating an online lexical resource for Spanish, based on frame semantics and supported by corpus evidence.
• SFN will be available to the public by July 2007.
• SFN will contain at least about 1,000 lexical items (verbs, predicative nouns, adjectives, adverbs, prepositions, and entities) representative of a wide range of semantic domains.
• The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses.

Frame Semantics
• Spanish FrameNet (SFN) uses FrameNet (FN) frames, modifying them where needed to fit Spanish.
• Some SFN frames are the same as the English FN frames (with Spanish examples).
• Some SFN frames keep the English FN name but differ from the FN frame (slightly different definition, different FEs, or different core sets).
• To adapt FN to Spanish, some FN frames are not used and some new frames were defined; the new frames use the same format as FN frames. Examples:
  - Cause_to_halt
  - Change_emotional_state
  - Collapse
  - Inventing
  - Motion_backwards, Motion_interruption, Motion_manner, Motion_medium, Motion_up_downwards
  - Return
  - Social_interaction
  - Think_up

Current Project Status
• Frames defined: 92
• Lexical units: 624
  - Annotated: 413
  - Subcorporated: 130
  - Created but without subcorporation: 23

Spanish FrameNet Corpus and Tools
• Spanish FrameNet uses a 350-million-word corpus.
  - It includes both European and New World Spanish (40% and 60%, respectively).
  - The SFN Corpus was built by the SFN research team, because no large public-domain Spanish corpora are available.
• The SFN Corpus is lemmatized and tagged with a set of in-house tools.
• FNDesktop
• Web Reports
• Sato Tool

The SFN tagging and chunking system
• The SFN Corpus is tagged and lemmatized using:
  - An electronic dictionary of Spanish with 600,000 forms, expanded from a dictionary of 93,000 lemmas:
      66,000 single-word lexical units, like unir (unite), inmoralidad (immorality), allí (there), etc.
      26,000 multi-word lexical units (MWLU), like muerte cerebral (brain death), etc., which are automatically expanded into 55,000 inflected MWLU forms.
  - A corpus tagger that converts plain text to deterministic finite state automata (DFSAs)
  - 2,000 finite state transducers (FSTs) for multi-word verbs
  - Transducers for the heads of verbal phrases (compound verbal tenses)

The SFN tagging and chunking system
• The POS tagging process produces two corpus formats:
  - Automata Corpus
  - IMS-CWB (Institut für Maschinelle Sprachverarbeitung Corpus Workbench)

Automata Corpus
• Lexical tagging (part-of-speech, lemma)
• Very efficient process rates
• Allows efficient word disambiguation
• Allows extended lexical tagging using automata transduction
  - Compound verbal forms tagging
  - Multi-word verb recognition
• Word ambiguities are represented in deterministic finite state automata (DFSAs) as different possible transitions between two consecutive states
• Human access is almost impossible
[Figure: DFSA of the sentence Al habérselo propuesto a tiempo]
[Figure: FST for compound verb form tagging]
[Figure: Transduced DFSA of the sentence Al habérselo propuesto a tiempo]
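The idea that word ambiguities become alternative transitions between two consecutive states can be illustrated with a small sketch. It is not the SFN tagger: the mini-lexicon, the tag names, and the functions `sentence_automaton` and `count_paths` are all hypothetical, chosen only to show the data structure.

```python
from collections import defaultdict

# Hypothetical mini-lexicon: surface form -> possible (lemma, POS) readings.
# The tag names are illustrative and are not the SFN tagset.
LEXICON = {
    "a": [("a", "PREP")],
    "tiempo": [("tiempo", "N")],
    "propuesto": [("proponer", "V:PP"), ("propuesto", "A")],
}


def sentence_automaton(tokens, lexicon):
    """Build a linear automaton: state i sits before token i.

    Each reading of token i becomes one transition from state i to
    state i + 1, labelled (form, lemma, pos). Labels leaving a state are
    distinct, so the automaton is deterministic over those labels.
    """
    transitions = defaultdict(list)
    for state, form in enumerate(tokens):
        readings = lexicon.get(form.lower(), [(form, "UNKNOWN")])
        for lemma, pos in readings:
            transitions[state].append((form, lemma, pos, state + 1))
    return dict(transitions)


def count_paths(transitions, n_tokens):
    """Number of complete analyses = product of readings per token."""
    total = 1
    for state in range(n_tokens):
        total *= len(transitions[state])
    return total


tokens = ["propuesto", "a", "tiempo"]
dfsa = sentence_automaton(tokens, LEXICON)
for state in sorted(dfsa):
    print(state, dfsa[state])
print("analyses:", count_paths(dfsa, len(tokens)))
```

In this sketch the ambiguous form keeps both readings side by side, which is compact for machines but hard to read for people; disambiguation amounts to deleting transitions until one path remains.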

CWB Corpus
• Lexical tagging (part-of-speech, lemma)
• Text DFSAs are disambiguated and converted to XML format
• Unambiguous corpus
• Allows human access to corpus contents
• Allows human corpus search
• Corpus contents are encoded and indexed for efficient corpus search
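As a rough illustration of the conversion step, the sketch below writes disambiguated tokens in the verticalized layout that the CWB encoder commonly takes as input: one token per line with tab-separated attributes, and XML tags on their own lines. Whether SFN uses exactly this layout is an assumption; the function name `to_vertical`, the attribute order (form, POS, lemma), and the tag names are hypothetical.

```python
from xml.sax.saxutils import escape


def to_vertical(sentences):
    """Render sentences as verticalized text with XML structure tags.

    `sentences` is a list of sentences; each sentence is a list of
    (form, pos, lemma) tuples with exactly one reading per token.
    """
    lines = ["<text>"]
    for sentence in sentences:
        lines.append("<s>")
        for form, pos, lemma in sentence:
            # Escape &, <, > so tokens cannot be mistaken for XML tags.
            lines.append("\t".join(escape(field) for field in (form, pos, lemma)))
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


example = [[("a", "PREP", "a"), ("tiempo", "N", "tiempo")]]
vrt = to_vertical(example)
print(vrt, end="")
```

Because every token carries a single reading, the output can be read and searched by people, in contrast to the automata corpus.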

Multi-word verb recognition
[Figure: DFSA of the sentence Le hacían siempre el vacío en la empresa before the transduction]
[Figure: Subsequential FST that detects the multi-word verb hacer el vacío]
[Figure: Output DFSA of the sentence after the intersection and transduction]
• Inflectional morphological properties are kept
• The adverb siempre is detected between the core verb and the idiom
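The behaviour described on this slide can be approximated with a hand-written scan over an already disambiguated token list. This is a simplified stand-in, not a subsequential FST and not the SFN transducer; the function `tag_hacer_el_vacio`, the `mwv` field, and the tag names are hypothetical.

```python
MWV_LEMMA = "hacer el vacío"


def tag_hacer_el_vacio(tokens):
    """Mark occurrences of the multi-word verb 'hacer el vacío'.

    `tokens` is a list of (form, lemma, pos) tuples. The verb keeps its
    own form and POS (so inflection is preserved); adverbs between the
    verb and the idiom are skipped and left unmarked.
    """
    out = [{"form": f, "lemma": l, "pos": p, "mwv": None} for f, l, p in tokens]
    i = 0
    while i < len(out):
        if out[i]["lemma"] == "hacer" and out[i]["pos"].startswith("V"):
            j = i + 1
            while j < len(out) and out[j]["pos"] == "ADV":
                j += 1  # optional adverbs between core verb and idiom
            if [t["form"].lower() for t in out[j:j + 2]] == ["el", "vacío"]:
                for k in (i, j, j + 1):
                    out[k]["mwv"] = MWV_LEMMA
                i = j + 2
                continue
        i += 1
    return out


sentence = [
    ("Le", "le", "PRON"), ("hacían", "hacer", "V:IMPERF"),
    ("siempre", "siempre", "ADV"), ("el", "el", "DET"),
    ("vacío", "vacío", "N"), ("en", "en", "PREP"),
    ("la", "el", "DET"), ("empresa", "empresa", "N"),
]
tagged = tag_hacer_el_vacio(sentence)
for token in tagged:
    print(token["form"], token["pos"], token["mwv"])
```

A real transducer would operate on the ambiguous sentence automaton rather than on a flat list, which is why the slide speaks of intersection followed by transduction.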

Subcorporation Process
• The internal tools GramCreator and XQS are used to create the subcorporation grammars.

    # Request: solicitud # N-de-GN-de # <PALABRA>* = 4 { } <%NPRED%> ( <APRED> + <PALABRA>* ) <de.PREP> ( (<PRON> + ( ( <E> + <PREDET> ) ( <E> + <DET> + <APOS> ) ( <E> + <APRED> + <VPRED:PP> ) )) <N> + (<NPROP> ( <E> + <NPROP> )) ) <de.PREP>

Solicitud grammar example: the syntactic structure N-de-GN-de is detected.

Subcorporation Process
• Each grammar (a regular expression) is converted to a finite state transducer (FST).
• The corpus of each LU is transduced with the set of grammar FSTs to produce a set of subcorpora.
• Transduction is very fast (100 transductions per second).
• The set of subcorpora is converted to XML and imported into FNDesktop.
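The sorting of sentences into subcorpora can be sketched as pattern matching over lemma and POS sequences. The sketch uses Python's `re` module, which is a backtracking matcher rather than a compiled FST, and the pattern is a heavily simplified stand-in for the N-de-GN-de grammar, not the SFN grammar itself. The names `GRAMMARS`, `encode`, and `subcorporate`, and the tags, are hypothetical; the encoding assumes lemmas contain no spaces.

```python
import re

# Simplified stand-in: target noun, "de", optional determiner, noun, "de".
GRAMMARS = {
    "N-de-GN-de": re.compile(
        r"(?:^| )solicitud\.N de\.PREP (?:\S+\.DET )?\S+\.N de\.PREP(?= |$)"
    ),
}


def encode(sentence):
    """Flatten (form, lemma, pos) tokens into a 'lemma.POS lemma.POS' string."""
    return " ".join(f"{lemma}.{pos}" for _form, lemma, pos in sentence)


def subcorporate(sentences, grammars=GRAMMARS):
    """Assign each sentence to every grammar it matches, else to 'unmatched'."""
    subcorpora = {name: [] for name in grammars}
    subcorpora["unmatched"] = []
    for sentence in sentences:
        encoded = encode(sentence)
        matched = False
        for name, pattern in grammars.items():
            if pattern.search(encoded):
                subcorpora[name].append(sentence)
                matched = True
        if not matched:
            subcorpora["unmatched"].append(sentence)
    return subcorpora


s1 = [("la", "el", "DET"), ("solicitud", "solicitud", "N"), ("de", "de", "PREP"),
      ("asilo", "asilo", "N"), ("de", "de", "PREP"), ("los", "el", "DET"),
      ("refugiados", "refugiado", "N")]
s2 = [("la", "el", "DET"), ("solicitud", "solicitud", "N"),
      ("fue", "ser", "V"), ("rechazada", "rechazado", "A")]
subcorpora = subcorporate([s1, s2])
print({name: len(items) for name, items in subcorpora.items()})
```

Each key of the resulting dictionary corresponds to one subcorpus; in the process described above, those sets are what would be exported to XML for annotation.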

Subcorporation Process
[Figure: N-de-GN-de structure detection]

Annotation Tool
• SFN uses the FN annotation tool (FNDesktop) to add semantic annotation to the LU subcorpora.
• The FNClassifier has been adapted to Spanish: it has new rules adapted to the Spanish tags and to Spanish local syntactic contexts.

Annotation search tools (Web Reports)

Annotation search tools (Sato Tool)