
Symbolic and statistical Analyses of meta-data using the “Semana” platform: a bundle of tools for KDD research
Georges Sauvet (CNRS, Toulouse)
Centre de Recherche et d’Etude de l’Art Préhistorique
UMR 5608: Travaux et Recherches Archéologiques sur les Cultures, les Espaces et les Sociétés
CASK Sorbonne 2008, Paris, June 13th

SEMANA and Data Mining
Stages of the KDD process (after B. Wüthrich, 1998): data warehouse, sampling, data coding, KDD techniques (Rough Set, FCA, statistical analysis, etc.), interpretation
“SEMANA”: a bundle of tools aimed at making these tasks easier

Architecture of the SEMANA platform
A software bundle written in Transcript®, the programming language of Revolution®
Standalone applications for Macintosh and Windows
• Dynamic DB Builder: data sheets, data coding, data storage
• Attribute Editor: discretization, logical scaling, …
• Tree Builder Assistant: aid to code structuring
• Tables (various formats)
• “Multi-valued tables”: Rough Set Theory, Decision Logic: upper approx., lower approx., reducts, core, discriminating power (Pawlak); minimal rules, attribute strength (Bolc, Cytowski and Stacewicz)
• “One-valued tables”: Formal Concept Analysis: Galois lattice, “central concepts” (Wille, Ganter)
• Statistical tools: Correlation Matrix, Correspondence Factor Analysis, Hierarchical Classifications (Benzecri)

Working with the SEMANA platform
SEMANA is twofold:
1) Tools for intelligent database design => Dynamic DB Builder
• providing statistical information about the use of attribute-value pairs (AV)
• suggesting iterative restructuring of AV
2) Tools for KDD research: integration of RST, FCA and statistical data analyses
Three illustrations:
• “Ten-ta-to”: the proximal deictic adjectives in Polish
• The category of Aspect in Polish
• Representations of women in Palaeolithic Art

Case 1: the Proximal Deictic Adjectives in Polish

The proximal deictic adjectives in Polish
In Polish School Grammar, adjective declension amalgamates three “morphological categories”:
Case = {Nominative, Accusative, Genitive, Dative, Instrumental, Locative}
Number = {singular, plural}
Gender = up to 7 gender classes have been proposed in Polish Linguistics (cf. SALONI, Z. 1976):
Singular:
1. feminine
2. neuter
3. animal masculine (“animal” corresponds to the feature “animate” in descriptions of other European languages)
4. non animal masculine
Plural:
1. personal masculine (“personal” corresponds to the feature “human”)
2. non personal masculine
3. “pluralia tantum” (defective nouns with no singular form)

The proximal deictic adjectives in Polish
The root of these adjectives is a single phoneme t-.
13 forms are used: ten, ta, to, te, tym, tymi, tych, te*, temu, tej, tego, ta*, ci
Examples (Nominative case only):
Gender      Singular      Plural        English translation
Masculine   ten dom       te domy       this/these house(s)
            ten pies      te psy        this/these dog(s)
            ten pan       ci panowie    this/these sir(s)
Feminine    ta deska      te deski      this/these board(s)
            ta gęś        te gęsi       this/these goose/geese
            ta pani       te panie      this/these lady/ladies
Neuter      to pióro      te pióra      this/these feather(s)
            to kurczę     te kurczęta   this/these chicken(s)
            to dziecko    te dzieci     this/these child/children
            ...

The proximal deictic adjectives in Polish
In order to elucidate the problem of Gender in Polish noun morphology, H. and A. Wlodarczyk have built a database of usages of the proximal deictic adjectives.
As the 7 “sub-genders” of Polish School Grammars correspond neither to any known semantic or ontological categories nor to any known grammatical sub-gender in other languages, they proposed to split the “sub-genders” of the Gender attribute into three attributes:
gender = {feminine, neuter, masculine}
animacy = {animate, inanimate}
humanity = {human, non_human}

TENTATO: database, first version
[Screenshot: each entry shows a morpheme, a sample, and the attribute-value pairs (features) chosen for that entry]

TENTATO: database, first version
An AV Table is automatically collected:
Objects = 108
Distinct objects = 108
Duplicates = 0
Duplicate ratio = 0
Attributes = 5 (with resp. 6, 2, 3, 2, 2 values)
NB: in this calculation, non-used attributes (*) have been replaced by a null value ('n.Att')
=========================
Theoretical Number of Combinations = 144
Apparent Saturation Index: 75%
=========================
The following pairs of attributes could be merged:
[hum|ina] Confidence index = 99.9%
[hum|nhu] Confidence index = 99.9%
[ina|nhu] Confidence index = 99.9%
=========================
STATISTICAL USE OF AV
Attr   Value    occur
Ani    anim     72
Ani    inanim   36
Case   A        18
Case   D        18
Case   G        18
Case   I        18
Case   L        18
Case   N        18
Gnd    fem      36
Gnd    masc     36
Gnd    neu      36
Hum    hum      36
Hum    nhum     72
Nb     plur     54
Nb     sing     54
=========================
Non-Attested Pairs of Values = 1
ina, hum, 2, 4
-------------------------
Assuming that all non-attested pairs are impossible:
Maximum number of combinations = 108
Corrected Saturation Index: 100%
-------------------------
Comments on the output:
The program suggests the possibility of merging the attributes listed above.
The program indicates that the pair {inanimate-human} does not exist (for obvious reasons).
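The figures in this report can be recomputed by hand. The sketch below is not the SEMANA code (which is written in Transcript); it is a minimal Python illustration that assumes the database is the full cross of 6 cases, 3 genders, 2 numbers and 3 animacy classes, which is consistent with the counts on the slide. The function name `saturation_report` and the synthetic `objects` list are hypothetical.

```python
from itertools import combinations, product
from math import prod


def saturation_report(objects):
    """objects: list of dicts mapping attribute name -> value."""
    attrs = sorted(objects[0])
    values = {a: sorted({o[a] for o in objects}) for a in attrs}
    distinct = {tuple(o[a] for a in attrs) for o in objects}
    # Theoretical number of combinations: product of the value counts.
    theoretical = prod(len(values[a]) for a in attrs)

    # Pairs of values (from two different attributes) never seen together.
    non_attested = []
    for a, b in combinations(attrs, 2):
        seen = {(o[a], o[b]) for o in objects}
        for va, vb in product(values[a], values[b]):
            if (va, vb) not in seen:
                non_attested.append(((a, va), (b, vb)))

    # Combinations that remain if every non-attested pair is impossible.
    possible = 0
    for combo in product(*(values[a] for a in attrs)):
        chosen = set(zip(attrs, combo))
        if not any(p in chosen and q in chosen for p, q in non_attested):
            possible += 1

    return {
        "distinct": len(distinct),
        "theoretical": theoretical,
        "apparent_saturation": len(distinct) / theoretical,
        "non_attested": non_attested,
        "max_combinations": possible,
        "corrected_saturation": len(distinct) / possible,
    }


# Synthetic stand-in for TENTATO version 1 (an assumption, not the real DB):
# {inanimate, human} is the only animacy/humanity pair left out.
ANIMACY_CLASSES = [("anim", "hum"), ("anim", "nhum"), ("inanim", "nhum")]
objects = [
    {"Case": c, "Gnd": g, "Nb": n, "Ani": ani, "Hum": hum}
    for c, g, n, (ani, hum) in product(
        "ADGILN", ["fem", "masc", "neu"], ["sing", "plur"], ANIMACY_CLASSES
    )
]
report = saturation_report(objects)
print(report["theoretical"], report["apparent_saturation"])
print(report["non_attested"], report["corrected_saturation"])
```

Under these assumptions the apparent saturation is 108/144 and the only non-attested pair is {inanim, hum}, as on the slide.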

TENTATO (Version 1): Formal Concept Analysis
[Figures: complete lattice and simplified lattice of TENTATO Version 1]
Test of dependence
Total dependence:
ina => nhu (36/36): inanimate depends on non human
hum => an (36/36): human depends on animate
High probability (>90%): none
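A dependence of the form x => y (n/m) can be read as: of the m objects carrying feature x, n also carry feature y. A minimal sketch of such a test follows; it is an illustration only, not the procedure implemented in SEMANA, and both the function name `dependencies` and the synthetic one-valued data (full cross of the slide's categories) are assumptions.

```python
from itertools import permutations, product


def dependencies(objects, min_ratio=1.0):
    """Return (x, y, both, support) for every pair where x => y holds
    in at least min_ratio of the objects that carry feature x."""
    features = sorted(set().union(*objects))
    found = []
    for x, y in permutations(features, 2):
        support = sum(1 for o in objects if x in o)
        both = sum(1 for o in objects if x in o and y in o)
        if support and both / support >= min_ratio:
            found.append((x, y, both, support))
    return found


# Synthetic stand-in for the one-valued TENTATO-1 table (an assumption).
objects = [
    frozenset((case, gender, number) + animacy)
    for case, gender, number, animacy in product(
        ("nom", "acc", "gen", "dat", "ins", "loc"),
        ("fem", "mas", "neu"),
        ("sin", "plu"),
        (("an", "hum"), ("an", "nhu"), ("ina", "nhu")),
    )
]

for x, y, both, support in dependencies(objects):
    print(f"{x} => {y} ({both}/{support})")
```

Lowering `min_ratio` to 0.9 lists the high-probability dependencies as well; on this synthetic data it adds nothing to the two total dependencies.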

TENTATO: second version
In a second trial, the attributes ANIMACY ({ANI}=[animate|inanimate]) and HUMANITY ({HUM}=[human|nhuman]) are merged into a three-valued attribute: {ANY}=[nhuman|inanimate|human]
Objects = 108
Distinct objects = 108
Duplicates = 0
Duplicate ratio = 0
Attributes = 4 (with resp. 3, 6, 3, 2 values)
NB: in this calculation, non-used attributes (*) have been replaced by a null value ('n.Att')
=========================
Theoretical Number of Combinations = 108
Apparent Saturation Index: 100%
=========================
No attributes could be merged
=========================
STATISTICAL USE OF AV
Attr   Value          occur
ANY    human          36
ANY    inanimate      36
ANY    nhuman         36
CAS    accusative     18
CAS    dative         18
CAS    genitive       18
CAS    instrumental   18
CAS    locative       18
CAS    nominative     18
GND    feminine       36
GND    masculine      36
GND    neuter         36
NBR    plural         54
NBR    singular       54
=========================
Non-Attested Pairs of Values = 0
-------------------------
Assuming that all non-attested pairs are impossible:
Maximum number of combinations = 108
Corrected Saturation Index: 100%
-------------------------
No attribute merging is possible; all pairs of values are attested.

TENTATO: Formal Concept Analysis
TENTATO Version 1 [figures: complete lattice and simplified lattice]
Inanimate depends on non human; human depends on animate.
Test of dependence => Total dependence: ina => nhu (36/36), hum => an (36/36); high probability (>90%): none
TENTATO Version 2 [figure: complete lattice]
All the attributes are at the same level: no hierarchy.
Test of dependence => Total dependence: none

TENTATO-2: Rough Set Theory and “Minimal Rules”
A procedure derived from Rough Set Theory allows us to calculate the “minimal rules” (i.e. the values of the attributes which condition the morpheme to be used):
r1 (9): CASdat, NBRplu --> tym
r2 (3): CASins, GNDmas, NBRsin --> tym
r3 (3): CASins, GNDneu, NBRsin --> tym
r4 (3): CASloc, GNDmas, NBRsin --> tym
r5 (3): CASloc, GNDneu, NBRsin --> tym
r20 (1): CASacc, ANYina, GNDmas, NBRsin --> ten
r21 (3): CASnom, GNDmas, NBRsin --> ten
r6 (9): CASins, NBRplu --> tymi
r24 (3): CASdat, GNDfem, NBRsin --> tej
r25 (3): CASgen, GNDfem, NBRsin --> tej
r26 (3): CASloc, GNDfem, NBRsin --> tej
r7 (1): CASacc, ANYhum, GNDmas, NBRplu --> tych
r8 (9): CASgen, NBRplu --> tych
r9 (9): CASloc, NBRplu --> tych
r10 (3): CASacc, GNDneu, NBRsin --> to
r11 (3): CASnom, GNDneu, NBRsin --> to
r12 (3): CASacc, ANYina, NBRplu --> te
r13 (3): CASacc, ANYnhu, NBRplu --> te
r14 (3): CASacc, GNDfem, NBRplu --> te
r15 (3): CASacc, GNDneu, NBRplu --> te
r16 (3): CASnom, ANYina, NBRplu --> te
r17 (3): CASnom, ANYnhu, NBRplu --> te
r18 (3): CASnom, GNDfem, NBRplu --> te
r19 (3): CASnom, GNDneu, NBRplu --> te
r22 (3): CASdat, GNDmas, NBRsin --> temu
r23 (3): CASdat, GNDneu, NBRsin --> temu
r27 (1): CASacc, ANYhum, GNDmas, NBRsin --> tego
r28 (1): CASacc, ANYnhu, GNDmas, NBRsin --> tego
r29 (3): CASgen, GNDmas, NBRsin --> tego
r30 (3): CASgen, GNDneu, NBRsin --> tego
r31 (3): CASacc, GNDfem, NBRsin --> te*
r32 (3): CASnom, GNDfem, NBRsin --> ta
r33 (3): CASins, GNDfem, NBRsin --> ta*
r34 (1): CASnom, ANYhum, GNDmas, NBRplu --> ci
The 108 distinct objects of the DB can be described by only 34 morphological rules.
Note that CAS and NBR are required in every rule, GND in 26/34 and ANY in only 9/34.
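The rule list can be used directly as a lookup: a rule fires when all its conditions are among the features of the context. The sketch below only applies the 34 rules copied from the slide; it does not derive them (the Rough Set procedure itself is not shown), and the helper names `parse_rules`, `morphemes_for` and `usage` are hypothetical.

```python
from collections import Counter

RULES_TEXT = """
r1 (9): CASdat, NBRplu --> tym
r2 (3): CASins, GNDmas, NBRsin --> tym
r3 (3): CASins, GNDneu, NBRsin --> tym
r4 (3): CASloc, GNDmas, NBRsin --> tym
r5 (3): CASloc, GNDneu, NBRsin --> tym
r6 (9): CASins, NBRplu --> tymi
r7 (1): CASacc, ANYhum, GNDmas, NBRplu --> tych
r8 (9): CASgen, NBRplu --> tych
r9 (9): CASloc, NBRplu --> tych
r10 (3): CASacc, GNDneu, NBRsin --> to
r11 (3): CASnom, GNDneu, NBRsin --> to
r12 (3): CASacc, ANYina, NBRplu --> te
r13 (3): CASacc, ANYnhu, NBRplu --> te
r14 (3): CASacc, GNDfem, NBRplu --> te
r15 (3): CASacc, GNDneu, NBRplu --> te
r16 (3): CASnom, ANYina, NBRplu --> te
r17 (3): CASnom, ANYnhu, NBRplu --> te
r18 (3): CASnom, GNDfem, NBRplu --> te
r19 (3): CASnom, GNDneu, NBRplu --> te
r20 (1): CASacc, ANYina, GNDmas, NBRsin --> ten
r21 (3): CASnom, GNDmas, NBRsin --> ten
r22 (3): CASdat, GNDmas, NBRsin --> temu
r23 (3): CASdat, GNDneu, NBRsin --> temu
r24 (3): CASdat, GNDfem, NBRsin --> tej
r25 (3): CASgen, GNDfem, NBRsin --> tej
r26 (3): CASloc, GNDfem, NBRsin --> tej
r27 (1): CASacc, ANYhum, GNDmas, NBRsin --> tego
r28 (1): CASacc, ANYnhu, GNDmas, NBRsin --> tego
r29 (3): CASgen, GNDmas, NBRsin --> tego
r30 (3): CASgen, GNDneu, NBRsin --> tego
r31 (3): CASacc, GNDfem, NBRsin --> te*
r32 (3): CASnom, GNDfem, NBRsin --> ta
r33 (3): CASins, GNDfem, NBRsin --> ta*
r34 (1): CASnom, ANYhum, GNDmas, NBRplu --> ci
"""


def parse_rules(text):
    """Turn 'rN (k): cond, cond --> morpheme' lines into tuples."""
    rules = []
    for line in text.strip().splitlines():
        head, morpheme = line.split("-->")
        name, conditions = head.split(":")
        conds = frozenset(c.strip() for c in conditions.split(","))
        rules.append((name.split()[0], conds, morpheme.strip()))
    return rules


RULES = parse_rules(RULES_TEXT)


def morphemes_for(features):
    """Morphemes of every rule whose conditions are all in `features`."""
    features = set(features)
    return {m for _, conds, m in RULES if conds <= features}


# How many rules mention each attribute (first three letters of a condition).
usage = Counter(c[:3] for _, conds, _ in RULES for c in conds)

print(morphemes_for({"CASnom", "ANYhum", "GNDmas", "NBRplu"}))
print(dict(usage))
```

Counting the attribute prefixes this way gives the same usage figures as the note on the slide (CAS and NBR in all 34 rules, GND in 26, ANY in 9).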

TENTATO-2: Statistical analysis
The Multi-valued Table is unfolded into a One-valued Table…
…and the One-valued Table is transformed into a Burt’s Table.
A Burt’s Table is a square symmetrical table giving the number of co-occurrences of each pair of attribute values.
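The two transformations can be sketched in a few lines. This is an illustration under simple assumptions (each object is a dict of attribute to value), not the Stat-3 implementation; `unfold`, `burt` and the four toy rows are hypothetical.

```python
def unfold(rows):
    """Multi-valued table -> one-valued (binary) table.

    rows: list of dicts attribute -> value.
    Returns the list of (attribute, value) columns and the 0/1 matrix."""
    columns = sorted({(a, v) for row in rows for a, v in row.items()})
    matrix = [[1 if row.get(a) == v else 0 for a, v in columns] for row in rows]
    return columns, matrix


def burt(matrix):
    """Burt's table: B[i][j] = number of objects having both value i and value j."""
    n = len(matrix[0])
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]


# Toy multi-valued table (hypothetical data, two attributes only).
rows = [
    {"GND": "fem", "NBR": "sin"},
    {"GND": "fem", "NBR": "plu"},
    {"GND": "mas", "NBR": "sin"},
    {"GND": "neu", "NBR": "sin"},
]
columns, one_valued = unfold(rows)
table = burt(one_valued)
for (attr, value), line in zip(columns, table):
    print(f"{attr}={value}", line)
```

The diagonal of the result holds the frequency of each value, and the off-diagonal cells hold the co-occurrence counts, which is why the table is square and symmetrical.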

TENTATO-2: Correspondence Factor Analysis (CFA)
[Figure: cloud of points in (x, y, z) space with its axes of inertia F1, F2, F3]
Numbers in the Table are considered as coordinates of points in an N-dimensional space.
CFA calculates the axes of inertia of the cloud of points (F1, F2, F3, …) and displays projections in the planes [F1, F2], [F1, F3], etc.
CFA is implemented in “Semana”.
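To make “axes of inertia” concrete, the sketch below computes the total inertia of a small table and the inertia carried by its first axis. It follows the textbook formulation of correspondence analysis (standardized residuals, then the leading eigenvalue found by power iteration) and is not the Semana or Stat-3 code; the function name, the toy table and the choice of power iteration are assumptions, and the table must have no empty row or column.

```python
from math import sqrt


def first_axis_inertia(table, iterations=500):
    """Return (total inertia, inertia of the first axis) of a count table."""
    n = sum(sum(row) for row in table)
    row_mass = [sum(row) / n for row in table]
    col_mass = [sum(col) / n for col in zip(*table)]
    rows, cols = range(len(row_mass)), range(len(col_mass))

    # Standardized residuals: departure from independence, in chi-square metric.
    S = [[(table[i][j] / n - row_mass[i] * col_mass[j])
          / sqrt(row_mass[i] * col_mass[j]) for j in cols] for i in rows]
    total = sum(x * x for row in S for x in row)

    # Power iteration on S^T S to approximate its largest eigenvalue.
    v = [float(j + 1) for j in cols]
    for _ in range(iterations):
        u = [sum(S[i][j] * v[j] for j in cols) for i in rows]
        w = [sum(S[i][j] * u[i] for i in rows) for j in cols]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:  # table is exactly independent: no inertia at all
            break
        v = [x / norm for x in w]
    u = [sum(S[i][j] * v[j] for j in cols) for i in rows]
    first = sum(x * x for x in u)
    return total, first


# Toy contingency table (hypothetical counts).
table = [[20, 5, 5], [5, 20, 5], [5, 5, 20]]
total, first = first_axis_inertia(table)
print(f"axis 1 carries {100 * first / total:.1f}% of the inertia")
```

The percentage printed here is the same kind of figure as the “% inertia” reported for each axis by a CFA program.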

TENTATO-2: Correspondence Factor Analysis (CFA)
[Table: output by “Stat-3”, with the following columns for each object J]
• Coordinate of object J on factor 1
• Contribution of factor 1 to the description of object J
• Contribution of object J to the definition of factor 1
Note that the quality of the description of attribute “animacy” is very poor: these elements make no contribution to the first 4 factors.
Note that the number (singular/plural) makes the highest contribution to axis 1.

TENTATO-2: CFA representation in plane [1, 2]
[Figure: projection on plane [1, 2], output by “Stat-3”]
Morphemes are widely spread over plane [1, 2].
Axis 1 separates NUMBER => singular vs plural
Axis 2 separates syntactic relators (CASE) => {nom, acc} vs {gen, loc, dat, ins}
ANIMACY & GENDER are not differentiated on axes 1 and 2.

TENTATO-2: Axis 1 separates quantifiers
[Figure: plane [1, 2], singular vs plural along axis 1, output by “Stat-3”]
Morphemes strictly associated with plural => ci, te, tych, tymi
Morphemes strictly associated with singular => ta, to, ten, te*, tego, tej, temu, ta*
One exception: tym may be either singular or plural

TENTATO-2: Axis 2 separates syntactic relators
[Figure: plane [1, 2], {nom, acc} vs {gen, loc, dat, ins} along axis 2, output by “Stat-3”]
Morphemes strictly associated with genitive, locative, dative and/or instrumental => tej, tych, temu, tymi, ta*, tym
Morphemes strictly associated with nominative and/or accusative => ta, to, ten, te*, ci, te
One exception: tego may be either accusative or genitive

TENTATO-2: Axis 3 separates {gen, loc} vs {ins}
[Figure: plane [1, 3] with the case values nom, acc, dat, loc, gen, ins, output by “Stat-3”]
Morphemes tymi, ta* strictly associated with instrumental
Morphemes tych, tego, tej strictly associated with genitive or locative
One exception: tym may be either instrumental or locative

TENTATO-2: Axis 4 separates gender {fem} vs {mas, neu}
[Figure: plane [1, 4] with the gender values fem, mas, neu, output by “Stat-3”]
Morphemes ta*, tej, ta strictly associated with feminine
Morphemes tego, ten, temu, ci strictly associated with masculine or neuter
One exception: tym may be associated with any gender
Note that the attribute [ANIMACY]={human, nhuman, inanimate} is still not differentiated on axis 4.

TENTATO-2: Animacy appears only on axis 9!
[Figure: plane [1, 9] with the values hum, nhu, ina, output by “Stat-3”]
Morpheme ci strictly associated with human

TENTATO-2: CFA and “Minimal Rules” (RST)
Axis (% inertia)    Attribute (use in minimal rules)    Opposition on the axis
Axis 1 (13.05%)     NUMBER (34/34 rules)                singular vs plural
Axis 2 (12.81%)     CASE (34/34 rules)                  nom, acc vs gen, loc, dat, ins
Axis 3 (11.27%)     CASE (34/34 rules)                  gen, loc (dat) vs ins
Axis 4 (10.0%)      GENDER (26/34 rules)                feminine vs masculine
Axis 9 (4.35%)      ANIMACY (9/34 rules)                human vs nhum, ina
The relative strength of the attributes is revealed both by their contribution to the axes of inertia in Factor Analysis and by their weight in Minimal Rules.

Case 2: the category of Aspect in Polish

A Database built with “Dynamic DB-Builder”
[Screenshot: a classical data sheet to fill in for each specimen]
The grammatical form of each specimen is used as index.
Attributes and values are chosen from a list…
… and the resulting AVs appear in a field.


A test of consistency
Each specimen is characterized by a set of AV and by its grammatical form (used as index).
It may be written as a rule: if {given set of AV} then index
This allows index inconsistencies to be detected (a test of consistency is provided in Semana).
[Screenshot: output of the test] 9 different forms applying to exactly the same situation?
This is a warning to the expert: the AV probably do not properly describe the different aspectual situations!
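The idea behind such a test is simple: if two specimens have exactly the same set of AV but different indexes, the rule “if {set of AV} then index” is contradicted. A minimal sketch follows; it is not Semana's own test, and the function name, the form labels and the AV strings are hypothetical.

```python
from collections import defaultdict


def find_inconsistencies(specimens):
    """specimens: iterable of (index, set of AV).

    Returns the AV sets that are attached to more than one index."""
    by_description = defaultdict(set)
    for index, av_set in specimens:
        by_description[frozenset(av_set)].add(index)
    return {av: forms for av, forms in by_description.items() if len(forms) > 1}


# Hypothetical specimens: form_A and form_B share exactly the same AV.
specimens = [
    ("form_A", {"VAL=perfective", "MOD=seq"}),
    ("form_B", {"VAL=perfective", "MOD=seq"}),
    ("form_C", {"VAL=imperfective", "MOD=par"}),
]
conflicts = find_inconsistencies(specimens)
for av_set, forms in conflicts.items():
    print(sorted(av_set), "->", sorted(forms))
```

Each reported AV set is a place where the expert either mis-indexed a specimen or needs a further attribute to tell the forms apart.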

Polish Aspect using Dynamic DB Builder
All specimens are automatically collected in a contingency table… and statistics are reported.
In this initial version, there were more than 2 million theoretical combinations, and 9 pairs of attributes could be merged!

Polish Aspect using Dynamic DB Builder
Improvements by “trial and error”
DB version      Distinct objects   Number of attributes   Number of theor. combin.   Number of “merging attributes”
HW-Aspect-V1    61                 12                     2,064,384                  9
HW-Aspect-V2    60                 11                     1,032,192                  9
HW-Aspect-V3    77                 11                     829,000                    6
HW-Aspect-V4    79                 9                      408,240                    1
HW-Aspect-V5    79                 8                      136,080                    1
HW-Aspect-V6    69                 8                      45,360                     1
HW-Aspect-V7    74                 8                      61,440                     0
HW-Aspect-V8    78                 7                      58,320                     0

From Dynamic DB Builder to STAT-3
The multi-valued table is transformed into a one-valued table for STAT analyses.

Polish Aspect: Correspondence Factor Analysis
[Figure: projection on the plane of axis 1 and axis 2]
Factor Analysis of the contingency table shows a clear Guttman effect (i.e. a sequential order of the attributes).

Polish Aspect: Correspondence Factor Analysis
Ascending Hierarchical Classification shows two well-defined classes.

Polish Aspect: Correspondence Factor Analysis
[Figure: classification with the imperfective class labelled]
A clear partition into two classes according to the attribute [VAL] = {perfective | imperfective}

Polish Aspect: Correspondence Factor Analysis
The Guttman effect shows that attributes are sequentially ordered:
attribute MCMP (morph. comp.): pip > pp > pi > ii
attribute MOD: parallel > sequential > trans > resume > stop > interrupt > keep > Off.And.On

Polish Aspect: Correspondence Factor Analysis
[Figure: distribution of features along the perfective-to-imperfective path (% association with imperfective), for the values of the attributes VAL, MCMP, CRE, MOD, ANA, TYP and ITS]
Figure annotation (on one group of features): all these features imperatively require perfective.

Case 3 : Images of the Woman in Palaeolithic Art

Images of the Woman in Palaeolithic Art
Customized DB-builder: for each figure, AV are selected with ‘check box’ buttons
Raphaëlle Bourrillon, Ph.D., Univ. Toulouse-Le Mirail

Images of the Woman in Palaeolithic Art
CFA and HAC show three classes of representations:
• Realist and slim
• Schematic / abstract
• Realist and fatty

Detailed study of the schematic women representations
CFA and HAC split the schematic / abstract feminine figures into five sub-classes.

Detailed study of the schematic women representations
Formal Concept Analysis

SEMANA: a bundle of tools for KDD research at hand in a single box
FROM PREPROCESSING …
Building / editing DB
- Structuring of AV
- Statistics
- AV editing (merging, splitting, etc.)
- Editing / conversion of tables in various formats
… TO MINING
Complementary KDD procedures (RST, FCA...)
… with special emphasis on the powerful tools of statistical data analysis (CFA, HAC)
with applications in many domains (within and outside Linguistics!)