
Social Network Inspired Models of NLP and Language Evolution Monojit Choudhury (Microsoft Research India) Animesh Mukherjee (IIT Kharagpur) Niloy Ganguly (IIT Kharagpur)
What is a Social Network?
- Nodes: social entities (people, organizations, etc.)
- Edges: interactions/relationships between entities (friendship, collaboration, sex)
Courtesy: http://blogs.clickz.com
Social Network Inspired Computing
- Society and the nature of human interaction form a complex system
- Complex networks: a generic tool to model complex systems
  - There is a growing body of work on Complex Network Theory (CNT)
  - Applied to a variety of fields: social, biological, physical and cognitive sciences, engineering and technology
- Language is a complex system
Objective of this Tutorial
- To show that SNIC (Social Network Inspired Computing) is an emerging and promising technique
- To apply it to model natural languages
  - NLP, quantitative linguistics, language evolution, historical linguistics, language acquisition
- To familiarize you with the tools and techniques of SNIC
- To compare it with other standard approaches to NLP
Outline of the Tutorial
- Part I: Background
  - Introduction [25 min]
  - Network Analysis Techniques [25 min]
  - Network Synthesis Techniques [25 min]
- Break [3:20 pm – 3:40 pm]
- Part II: Case Studies
  - Self-organization of Sound Systems [20 min]
  - Modeling the Lexicon [20 min]
  - Unsupervised Labeling (Syntax & Semantics) [20 min]
- Conclusion and Discussion [20 min]
Complex System
- Non-trivial properties and patterns emerging from the interaction of a large number of simple entities
- Self-organization: the process through which these patterns evolve without any external intervention or central control
- Emergent property (or emergent behavior): the pattern that emerges due to self-organization
The best example from nature: a termite "cathedral" mound produced by a termite colony
Emergence of a networked life: Atom → Molecule → Cell → Tissue → Organs → Organisms → Communities
Language – a complex system
- Language: a medium for communication through an arbitrary set of symbols
- Constantly evolving
- An outcome of self-organization at many levels
  - Neurons
  - Speakers and listeners
  - Phonemes, morphemes, words, ...
- The 80-20 rule holds at every level of structure
Three Views of a System
- Macroscopy: may not give a complete picture or explanation of what goes on
- Microscopy: may be too difficult to analyze, or to simulate the macroscopic behavior from
- Mesoscopy: a useful tradeoff between the two
Language as a physical system
- Microscopic: a collection of utterances by individual speakers
- Mesoscopic: interactions between phonemes, syllables, words, phrases
- Macroscopic: a set of grammar rules with a lexicon
Syntactic Network of Words
[Figure: a network of words – nodes such as "color", "sky", "weight", "light", "blue", "blood", "heavy", "red", connected by weighted edges (e.g., 1, 20, 100)]
Complex Network Theory
- A handy toolbox for modeling mesoscopy
- A marriage of graph theory and statistics
- "Complex" because:
  - Non-trivial topology
  - Difficult to specify completely
  - Usually large (in terms of nodes and edges)
- Provides insight into the nature and evolution of the system being modeled
Internet
Genetic interaction network
9/11 Terrorist Network
Social network analysis is a mathematical methodology for connecting the dots -- using science to fight terrorism. Connecting multiple pairs of dots soon reveals an emergent network of organization.
CNT Examples: Road and Airline Networks
What Questions Can Be Asked
- Do these networks display some symmetry?
- Are these networks the creation of intelligent agents, or have they emerged?
- How have these networks emerged? What are the underlying simple rules leading to their complex formation?
Bi-directional Approach
- Analysis of real-world networks
  - Global topological properties
  - Community structure
  - Node-level properties
- Synthesis of the network by means of some simple rules
  - Preferential attachment models
  - Small-world models
  - ...
Applications of CNT in Linguistics – I
- Quantitative linguistics
  - Invariance and typology (Zipf's law, syntactic dependencies)
- Natural language processing
  - Unsupervised methods for text labeling (POS tagging, NER, WSD, etc.)
  - Textual similarity (automatic evaluation, document clustering)
  - Evolutionary models (NER, multi-document summarization)
Applications of CNT in Linguistics – II
- Language evolution
  - How did sound systems evolve?
  - Development of syntax
- Language change
  - Innovation diffusion over social networks
  - Language as an evolving network
- Language acquisition
  - Phonological acquisition
  - Evolution of the mental lexicon of the child
Linguistic Networks

Name               | Nodes         | Edges                                   | Why?
PhoNet             | Phonemes      | Co-occurrence likelihood in languages   | Evolution of sound systems
WordNet            | Words         | Ontological relations                   | Host of NLP applications
Syntactic Network  | Words         | Similarity between syntactic contexts   | POS tagging
Semantic Network   | Words, names  | Semantic relations                      | IR, parsing, NER, WSD
Mental Lexicon     | Words         | Phonetic similarity, semantic relations | Cognitive modeling, spell checking
Tree-banks         | Words         | Syntactic dependency links              | Evolution of syntax
Word Co-occurrence | Words         | Co-occurrence                           | IR, WSD, LSA, ...
Summarizing
- SNIC and CNT are emerging techniques for modeling complex systems at the mesoscopic level
- Applied to physics, biology, sociology, economics, logistics, ...
- Language is an ideal application domain for SNIC
- SNIC models in NLP, quantitative linguistics, language change, evolution and acquisition
Topological Characterization of Networks
Types of Networks and Representations
- Unipartite: binary/weighted, undirected/directed
- Bipartite: binary/weighted, undirected/directed

Representations (example: the triangle on nodes a, b, c):
1. Adjacency matrix:
     a b c
   a 0 1 1
   b 1 0 1
   c 1 1 0
2. Adjacency list:
   a: {b, c}
   b: {a, c}
   c: {a, b}
Properties of the Adjacency Matrix
- A = {a_ij}, where i and j are nodes and a_ij = 1 if there is an edge between i and j
- A² = A·A; its entries count the paths of length 2 between any two nodes: (A²)_ij = Σ_k a_ik a_kj
- In general, (Aⁿ)_ij counts the paths of length n from i to j
- Trace(A) = Σ_i a_ii
- Exercise: how is the trace of A³ related to the number of triangles in the network?
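A minimal sketch (mine, not from the slides) that checks the exercise numerically: trace(A³) counts closed walks of length 3, and each triangle contributes six of them (3 starting points × 2 directions).

```python
import numpy as np

# Adjacency matrix of the triangle graph on nodes a, b, c from the previous slide
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])

A2 = A @ A                      # (A^2)[i, j] = number of length-2 paths from i to j
A3 = A2 @ A                     # (A^3)[i, i] = number of closed walks of length 3 at i
triangles = np.trace(A3) // 6   # each triangle is counted 6 times
print(triangles)                # -> 1
```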
Characterization of Complex Networks
- They have a non-trivial topological structure
- Properties:
  - Heavy tail in the degree distribution (non-negligible probability mass towards the tail; more than in the case of an exponential distribution)
  - High clustering coefficient
  - Centrality properties
  - Social roles and equivalence
  - Assortativity
  - Community structure
  - Random graphs and small average path length
  - Preferential attachment
  - Small-world properties
Degree Distribution (DD)
- Let p_k be the fraction of vertices in the network that have degree k
- The plot of p_k versus k is the degree distribution of the network
- For most real-world networks these distributions are right-skewed, with a long right tail of values far above the mean: p_k varies as k^(-α)
- Due to noisy and insufficient data, the definition is sometimes slightly modified
  - The cumulative degree distribution is plotted instead: the probability that the degree of a node is greater than or equal to k
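A small sketch (assumed toy data, not from the slides) computing both the plain and the cumulative degree distribution from an adjacency list:

```python
from collections import Counter

adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}

n = len(adj)
degree_count = Counter(len(neigh) for neigh in adj.values())

p = {k: c / n for k, c in degree_count.items()}               # p_k
P = {k: sum(c for k2, c in degree_count.items() if k2 >= k) / n
     for k in degree_count}                                    # cumulative P(deg >= k)
print(p)   # {3: 0.25, 2: 0.5, 1: 0.25}
print(P)   # {3: 0.25, 2: 0.75, 1: 1.0}
```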
A Few Examples
[Figure: degree distributions of several real-world networks following a power law, P_k ~ k^(-α)]
Friend of Friends
- Consider the following scenario:
  - Sourish and Ravi are friends
  - Sourish and Shaunak are friends
  - Are Shaunak and Ravi friends?
- If so, this property is known as transitivity
Measuring Transitivity: Clustering Coefficient
- The clustering coefficient of a vertex v is the ratio of the number of connections among the neighbors of v to the number of possible connections between those neighbors
- A high clustering coefficient means my friends know each other with high probability: a typical property of social networks
Mathematically ...
- The clustering coefficient of a vertex i with n neighbors is
  C_i = (# of links between the n neighbors) / (n(n-1)/2)
- The clustering coefficient of the whole network is the average
  C = (1/N) Σ_i C_i
- Alternatively,
  C = 3 × (# of triangles in the network) / (# of connected triples in the network)
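A minimal sketch (assuming a simple undirected graph stored as adjacency sets) of the per-vertex coefficient C_i and its network average:

```python
def clustering(adj):
    C = {}
    for v, neigh in adj.items():
        n = len(neigh)
        if n < 2:
            C[v] = 0.0
            continue
        # count links among the neighbors of v
        links = sum(1 for u in neigh for w in neigh
                    if u < w and w in adj[u])
        C[v] = links / (n * (n - 1) / 2)
    return C

adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
C = clustering(adj)
avg_C = sum(C.values()) / len(C)
print(C, avg_C)   # C_a = C_b = 1.0, C_c = 1/3, C_d = 0.0
```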
Centrality
- Centrality measures are commonly described as indices of the four Ps: prestige, prominence, importance, and power
  - Degree: count of immediate neighbors
  - Betweenness: identifies nodes that form a bridge between two regions of the network
    C_B(v) = Σ_{s≠v≠t} σ_st(v) / σ_st
    where σ_st is the total number of shortest paths between s and t, and σ_st(v) is the number of those shortest paths that pass through v
Eigenvector Centrality – Bonacich (1972)
- It is not just how many people know me that counts towards my popularity (or power), but how many people know the people who know me: this is recursive!
- In the context of HIV transmission: a person x with one sex partner is less prone to the disease than a person y with multiple partners
  - But imagine what happens if the partner of x has multiple partners
  - This is the basic idea of eigenvector centrality
Definition
- Eigenvector centrality is defined by the principal eigenvector of the adjacency matrix
- An eigenvector of a symmetric matrix A = {a_ij} is any vector e such that A e = λ e, i.e., λ e_i = Σ_j a_ij e_j, where λ is a constant and e_i is the centrality of node i
- What does this imply? The centrality of a node is proportional to the centrality of the nodes it is connected to (recursively)
- Practical example: Google PageRank
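A sketch (assuming a connected undirected graph) of eigenvector centrality by power iteration: repeatedly apply A and renormalize until the vector converges to the principal eigenvector.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

e = np.ones(A.shape[0])
for _ in range(100):
    e_next = A @ e                    # lambda * e_i = sum_j a_ij e_j
    e_next /= np.linalg.norm(e_next)  # renormalize each step
    if np.allclose(e, e_next, atol=1e-10):
        break
    e = e_next
print(e)  # node 2 (the best-connected node) gets the highest centrality
```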
Node Equivalence
- Social roles: nodes (actors) in a social network that have similar patterns of relations (ties) with other nodes
- Three different ways to find equivalence classes:
  - Structural equivalence
  - Automorphic equivalence
  - Regular equivalence
Structural Equivalence
- Two nodes are said to be exactly structurally equivalent if they have the same relationships to all other nodes
- Computation: let A be the adjacency matrix. Compute the Euclidean distance (or Pearson correlation) between the pair of rows/columns representing the neighbor profiles of two nodes (say i and j). This value shows how structurally similar i and j are.
Automorphic Equivalence
- The idea of automorphic equivalence is that sets of actors can be equivalent by being embedded in local structures that have the same patterns of ties: "parallel" structures
[Figure: an organizational tree with a boss at the top, managers of three different stores below, and workers at the bottom. Swapping managers B and D together with all their neighbors leaves the distances among all the actors in the graph exactly identical.]
- Path vector of node i: how many nodes are at distance 1, 2, ... from node i
- Amount of equivalence: distance between path vectors
Regular Equivalence
- Two nodes are said to be regularly equivalent if they have the same profile of ties with members of other sets of actors that are also regularly equivalent
[Figure: an example with three regularly equivalent classes, each annotated with its profile of ties, e.g., "1 tie with Class II", "no tie with Class III", "1/2 tie(s) with Class III"]
Assortativity (Homophily)
- The rich go with the rich (selective linking)
  - A famous actor (e.g., Shah Rukh Khan) would prefer to pair up with another famous actor (e.g., Rani Mukherjee) in a movie rather than a newcomer to the film industry
[Figure: an assortative scale-free network vs. a disassortative scale-free network]
Measures of Assortativity
- ANND (average nearest-neighbor degree)
  - Find the average degree of the neighbors of each node i with degree k
  - Find the Pearson correlation (r) between the degree of i and the average degree of its neighbors
For further reference see the supplementary material
Community structure: a group of vertices that have a high density of edges within them and a low density of edges between groups
Examples:
- Friendship networks of children
- Citation networks: research interests
- World Wide Web: subject matter of pages
- Metabolic networks: functional units
- Linguistic networks: similar linguistic categories
Some Examples
[Figures: community structure in a network of political books; community structure in a social network of students at an American high school]
Community Identification Algorithms
- Hierarchical clustering
- Girvan-Newman
- Radicchi et al.
- Chinese Whispers
- Spectral bisection
See (Newman 2004) for a comprehensive survey (reference in the supplementary material)
Girvan-Newman Algorithm
A divisive method:
1. Calculate the betweenness of all edges in the network
2. Remove the edge with the highest betweenness
3. Recalculate betweenness for all edges affected by the removal
4. Repeat from step 2 until no edges remain
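A sketch (assuming the networkx library is available) of one level of the algorithm: remove highest-betweenness edges until the graph first splits, yielding the top-level communities.

```python
import networkx as nx

G = nx.barbell_graph(5, 0)   # two 5-cliques joined by a single bridge edge

while nx.number_connected_components(G) == 1:
    betweenness = nx.edge_betweenness_centrality(G)   # step 1
    edge = max(betweenness, key=betweenness.get)      # step 2
    G.remove_edge(*edge)                              # steps 3-4: recompute, repeat

print(list(nx.connected_components(G)))  # -> the two cliques as communities
```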
Evolution of Networks / Processes on Networks
Random Graphs & Small Average Path Length
Q: What do we mean by a 'random graph'?
A: The Erdős-Rényi random graph model: for every pair of nodes, draw an edge between them with equal probability p.
Degrees of separation in a random graph:
- N nodes
- z neighbors per node on average: z = p(N-1)
- The degree distribution is Poisson
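A small sketch (not from the slides) of the Erdős-Rényi model: each of the N(N-1)/2 possible edges is drawn independently with probability p, and the empirical mean degree lands close to z = p(N-1).

```python
import random

def erdos_renyi(N, p, seed=0):
    rng = random.Random(seed)
    return [(i, j) for i in range(N) for j in range(i + 1, N)
            if rng.random() < p]

N, p = 1000, 0.005
edges = erdos_renyi(N, p)
z = 2 * len(edges) / N        # empirical mean degree
print(z, p * (N - 1))         # close to the expected z = p(N-1)
```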
Degree Distributions
Degree distributions for various networks (a) World-Wide Web (b) Coauthorship networks: computer science, high energy physics, condensed matter physics, astrophysics (c) Power grid of the western United States and Canada (d) Social network of 43 Mormons in Utah
How do Power-law DDs Arise?
Barabási-Albert Model of Preferential Attachment (the rich get richer):
1. Growth: starting with a small number of nodes (m₀), at every timestep we add a new node with m (≤ m₀) edges connected to the nodes already present in the system
2. Preferential attachment: the probability Π that the new node will be connected to node i depends on the connectivity k_i of that node: Π(k_i) = k_i / Σ_j k_j
A.-L. Barabási, R. Albert, Science 286, 509 (1999)
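A sketch (mine, not the authors') of the BA process. The trick: keep one list entry per edge endpoint, so sampling uniformly from that list is equivalent to sampling nodes with probability proportional to degree.

```python
import random

def barabasi_albert(n, m, seed=0):
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))            # start from m initial nodes
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:          # m distinct degree-biased targets
            chosen.add(rng.choice(targets))
        for t in chosen:
            edges.append((new, t))
            targets += [new, t]         # each endpoint gains one "degree token"
    return edges

edges = barabasi_albert(10000, 2)
# the degree distribution of the resulting graph follows p_k ~ k^-3
```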
Mean Field Theory
The World is Small!
- "Registration fees for IJCNLP 2008 are being waived for all participants – collect your refund from the registration counter"
- How long do you think the above information would take to spread among yourselves?
- Experiments say it would spread very fast: within 6 hops from the initiator it would reach everyone
- This is Milgram's famous "six degrees of separation"
The Small World Effect
Even in very large social networks, the average distance between nodes is usually quite short.
Milgram's small-world experiment:
- Target individual in Boston
- Initial senders in Omaha, Nebraska
- Each sender was asked to forward a packet to a friend who was closer to the target
- Friends were asked to do the same
Result: an average of 'six degrees' of separation.
S. Milgram, The small world problem, Psychology Today, 2 (1967), pp. 60-67.
Measures of Small-Worldness
- Low average geodesic path length
- High clustering coefficient
- Geodesic path: the shortest path through the network from one vertex to another
- Mean path length: ℓ = (2 / n(n+1)) Σ_{i≥j} d_ij, where d_ij is the geodesic distance from vertex i to vertex j
- Most networks observed in the real world have ℓ ≤ 6:
  Film actors: 3.48 | Company directors: 4.60 | Emails: 4.95 | Internet: 3.33 | Electronic circuits: 4.34
Clustering
- C = probability that two of a node's neighbors are themselves connected
- In a random graph: C_rand ~ 1/N (if the average degree is held constant)
Watts-Strogatz 'Small World' Model
Watts and Strogatz introduced this simple model to show that networks can have both short path lengths and high clustering.
D. J. Watts and S. H. Strogatz, Collective dynamics of "small-world" networks, Nature, 393 (1998), pp. 440-442.
Small-World Model
- Used for modeling network transitivity
- Many networks assume some kind of geographical proximity
- Small-world model:
  - Start with a low-dimensional regular lattice
  - Rewire: add/remove edges to create shortcuts that join remote parts of the lattice; for each edge, with probability p, move the other end to a random vertex
- Rewiring allows interpolation between a regular lattice and a random graph
Small-World Model (properties)
- Regular lattice (p = 0):
  - Clustering coefficient C = (3k-3)/(4k-2) ≈ 3/4
  - Mean distance ~ L/4k
- Almost random graph (p = 1):
  - Clustering coefficient C ≈ 2k/L
  - Mean distance ~ log L / log k
- No power-law degree distribution
[Figure: degree distribution as a function of the rewiring probability p]
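A sketch (not from the slides) of Watts-Strogatz rewiring on a ring lattice of L nodes, each linked to its k nearest neighbors on either side; each edge is rewired with probability p.

```python
import random

def watts_strogatz(L, k, p, seed=0):
    rng = random.Random(seed)
    edges = set()
    for i in range(L):
        for j in range(1, k + 1):
            u, v = i, (i + j) % L           # regular ring-lattice edge
            if rng.random() < p:            # rewire with probability p
                v = rng.randrange(L)
                while v == u or (min(u, v), max(u, v)) in edges:
                    v = rng.randrange(L)    # avoid self-loops and duplicates
            edges.add((min(u, v), max(u, v)))
    return edges

edges = watts_strogatz(L=1000, k=5, p=0.01)
# small p: clustering stays near 3/4 while the mean distance drops sharply
```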
Resilience of Networks
- We consider the resilience of the network to the removal of its vertices (site percolation) or edges (bond percolation)
- As vertices (or edges) are removed from the network, the average path length increases
- Ultimately, the giant component disintegrates
- Networks vary in their level of resilience to vertex (or edge) removal
Stability Metric: Percolation Threshold
[Figure: starting from a single connected component, removing a fraction f of the nodes leaves the giant component intact; beyond a critical fraction f_c the entire graph breaks into small fragments]
Therefore f_c = 1 - q_c becomes the percolation threshold.
Ordinary Percolation on Lattices
Fill in each link (bond percolation) or site (site percolation) with probability p and ask questions about the sizes of the connected components.
Percolation in Poisson and Scale-free Networks
[Figures: percolation behavior of an exponential (Poisson) network vs. a scale-free network]
CASE STUDY I: Self-Organization of the Sound Inventories
Human Communication
- Human beings and many other living organisms produce sound signals
- Unlike other organisms, humans can concatenate these sounds to produce new messages: language
- Language is one of the primary causes/effects of human intelligence
Human Speech Sounds
- Human speech sounds are called phonemes: the smallest units of a language
- Phonemes are characterized by certain distinctive features, such as:
  I. Place of articulation
  II. Manner of articulation
  III. Phonation
[Figure: Mermelstein's model of the vocal tract]
Types of Phonemes
[Figure: consonants (e.g., /t/, /p/, /k/), vowels (e.g., /i/, /a/, /u/) and diphthongs (e.g., /ai/)]
Choice of Phonemes
- How does a language choose a set of phonemes to build its sound inventory?
- Is the process arbitrary? Certainly not!
- What are the forces affecting this choice?
Forces of Choice
A linguistic system – how does it look?
- Speaker: desires "ease of articulation"
- Listener/learner: desires "perceptual contrast" / "ease of learnability"
The forces shaping the choice are opposing, hence there has to be a non-trivial solution.
Vowels: A (Partially) Solved Mystery
- Languages choose vowels based on maximal perceptual contrast
- For instance, if a language has three vowels, then in more than 95% of cases they are /a/, /i/ and /u/
[Figure: the vowel triangle /i/–/a/–/u/, with each pair maximally distinct]
Consonants: A Jigsaw Puzzle
- Research: from 1929 to date
- No single satisfactory explanation of the organization of consonant inventories
  - The set of features that characterize consonants is much larger than that for vowels
  - No single force is sufficient to explain this organization
  - Rather, a complex interplay of forces shapes these inventories
Principle of Occurrence
- PlaNet, the "Phoneme-Language Network"
  - A bipartite network N = (V_L, V_C, E)
  - V_L: nodes representing the languages of the world
  - V_C: nodes representing consonants
  - E: set of edges that run between V_L and V_C
- There is an edge e ∈ E between two nodes v_l ∈ V_L and v_c ∈ V_C if the consonant c occurs in the language l
[Figure: the structure of PlaNet – language nodes L1-L4 linked to consonant nodes such as /ŋ/, /m/, /d/, /θ/, /s/, /p/]
Choudhury et al. 2006, ACL; Mukherjee et al. 2007, Int. Jnl. of Modern Physics C
Construction of PlaNet
- Data source: UCLA Phonological Segment Inventory Database (UPSID)
- Number of nodes in V_L: 317
- Number of nodes in V_C: 541
- Number of edges in E: 7022
Degree Distribution of PlaNet
- The DD of the language nodes follows a β-distribution with α = 7.06 and β = 47.64:
  p_k = [Γ(54.70) / (Γ(7.06) Γ(47.64))] k^{6.06} (1-k)^{46.64}
  (k_min = 5, k_max = 173, k_avg = 21)
- The DD of the consonant nodes, i.e., the distribution of consonants over languages, follows a power law with an exponential cut-off:
  P_k ∝ k^{-0.71}
[Figures: degree distribution of the language nodes vs. inventory size; cumulative distribution of consonants over languages on a log-log scale, showing the power law and the exponential cut-off]
Synthesis of PlaNet
- Non-linear preferential attachment
- Iteratively construct the language inventories, given their inventory sizes
- The probability that an inventory picks consonant C_i:
  Pr(C_i) = (d_i^α + ε) / Σ_{x ∈ V*} (d_x^α + ε)
  where d_i is the current degree of consonant node C_i and the sum runs over the candidate consonant nodes V*
[Figure: the state of the synthesized network after steps 3 and 4]
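A sketch of this synthesis (mine, with toy inventory sizes; α and ε set to the values reported on the next slide): each language fills its inventory by degree-biased sampling without replacement.

```python
import random

def synthesize(inventory_sizes, n_consonants, alpha=1.44, eps=0.5, seed=0):
    rng = random.Random(seed)
    degree = [0] * n_consonants
    inventories = []
    for size in inventory_sizes:
        chosen = set()
        for _ in range(size):
            # candidates: consonants not yet in this inventory
            candidates = [c for c in range(n_consonants) if c not in chosen]
            # Pr(C_i) = (d_i^alpha + eps) / sum_x (d_x^alpha + eps)
            weights = [degree[c] ** alpha + eps for c in candidates]
            c = rng.choices(candidates, weights=weights)[0]
            chosen.add(c)
            degree[c] += 1
        inventories.append(chosen)
    return inventories

inventories = synthesize([20, 25, 30, 22], n_consonants=50)
```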
Simulation Result
[Figure: cumulative degree distribution P_k vs. degree k for PlaNet_syn and PlaNet_rand on a log-log scale]
The parameters α and ε are 1.44 and 0.5 respectively. The results are averaged over 100 runs.
Principle of Co-occurrence
- Consonants tend to co-occur in groups or communities
- These groups tend to be organized around a few distinctive features (based on manner of articulation, place of articulation and phonation): the principle of feature economy
- Example: if a language has the voiced plosives /b/ (bilabial) and /d/ (dental) in its inventory, then it will also tend to have their voiceless counterparts /p/ and /t/
How to Capture these Co-occurrences?
- PhoNet, the "Phoneme Network"
  - A weighted network N = (V_C, E)
  - V_C: nodes representing consonants
  - E: set of edges that run between the nodes in V_C
- There is an edge e ∈ E between two nodes v_c1, v_c2 ∈ V_C if the consonants c1 and c2 co-occur in a language. The number of languages in which c1 and c2 co-occur defines the edge weight of e; the number of languages in which c1 occurs defines the node weight of v_c1.
[Figure: a fragment of PhoNet around /k/, with weighted edges to /k'/, /kw/ and /d'/]
Construction of PhoNet
- Data source: UPSID
- Number of nodes in V_C: 541
- Number of edges: 34012
[Figure: a visualization of PhoNet]
Community Structures in PhoNet
- Radicchi et al.'s algorithm (for unweighted networks) counts the number of triangles that an edge is a part of; inter-community edges have a low count, so remove them
- Modification for a weighted network like PhoNet:
  - Look for triangles where the weights on the edges are comparable
  - If they are comparable, then the group of consonants co-occurs highly; otherwise it does not
  - Measure a strength S for each edge (u, v) in PhoNet:
    S = w_uv / sqrt( Σ_{i ∈ V_C−{u,v}} (w_ui − w_vi)² ) if the denominator > 0, else S = ∞
  - Remove edges with S less than a threshold η
Community Formation
[Figure: a toy weighted network with the edge strengths S computed for it; thresholding at η > 1 splits it into communities]
For different values of η we get different sets of communities.
Consonant Societies!
[Figure: consonant communities obtained at η = 0.35, 0.60, 0.72 and 1.25]
That the communities are good can be shown quantitatively by measuring their feature entropy.
Problems to Ponder On ...
- Physical significance of preferential attachment:
  - Functional forces
  - Historical/evolutionary processes
- Labeled synthesis of PlaNet and PhoNet
- Language diversity vs. preferential attachment
CASE STUDY II: Modeling the Mental Lexicon
Mental Lexicon (ML) – Basics
- The ML refers to the repository of word forms that resides in the human brain
- Two questions:
  - How are words stored in long-term memory, i.e., what is the organization of the ML?
  - How are words retrieved from the ML (lexical access)?
- The two questions are highly inter-related: to predict the organization one can investigate how words are retrieved, and vice versa
Different Possible Ways of Organization
- Un-organized (a bag full of words), or
- Organized:
  - By sound (phonological similarity)
    - E.g., start the same: banana, bear, bean, ...
    - End the same: look, took, book, ...
    - By the number of phonological segments they share
  - By meaning (semantic similarity): banana, apple, pear, orange, ...
  - By the age at which the word is acquired
  - By frequency of usage
  - By POS
  - Orthographically
The Hierarchical Model of the ML
- Proposed by Collins and Quillian in 1969
  - Concepts are organized in a hierarchy
  - Taxonomic and attributive relations are represented
[Figure: Animal (reproduces, eats) with children Mammal (warm-blooded, has mammary glands) and Fish (cold-blooded, has gills)]
- Cognitive economy: put each attribute at the highest appropriate level – e.g., 'reproduces' applies to the whole animal kingdom
Hierarchical Model
- According to the principle of cognitive economy:
  - "Animals eat" < "mammals eat" < "humans eat"
  - However, "a shark is a fish" = "a salmon is a fish"
  - What do < and = mean here?
    < : less time to judge; = : equal time to judge
Spreading Activation Model of the ML
- Not a hierarchical structure but a web of inter-connected nodes (first proposed by Collins and Loftus in 1975)
- The distance between nodes is determined by structural characteristics of the word forms (by sound, by meaning, by age, by ...)
- Combining the above two: a plethora of complex networks
Phonological Neighborhood Networks
- Vitevitch (2004)
- Gruenenfelder & Pisoni (2005)
- Kapatsinski (2006)
"Sound Similarity Relations in the Mental Lexicon: Modeling the Lexicon as a Complex Network"
Network Definition
- Nodes: words
- Edges: an edge is drawn from node A to node B if at least 2/3 of the segments that occur in the word represented by A also occur in the word represented by B
  - I.e., if the word represented by A is 6 segments long, then one can derive all its neighbors (B) from it by at most two phoneme changes (insertions, deletions or substitutions)
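A rough sketch of this neighbor test (my reading of the metric, not the authors' code): treat words as phoneme-segment tuples and link A → B when B is within floor(len(A)/3) edit operations of A.

```python
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance over segments
    d = [[max(i, j) if i == 0 or j == 0 else 0
          for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1,                           # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return d[len(a)][len(b)]

def neighbors(base, lexicon):
    limit = len(base) // 3
    return [w for w in lexicon
            if w != base and edit_distance(base, w) <= limit]

lexicon = [("k", "ae", "t"), ("b", "ae", "t"), ("k", "ae", "t", "s"),
           ("d", "o", "g")]
print(neighbors(("k", "ae", "t"), lexicon))  # cat -> bat, cats
```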
Network Construction
- Database: the Hoosier Mental Lexicon (Nusbaum et al., 1984)
  - Phonologically transcribed words; the network is built using the metric defined earlier
  - Nodes with no links correspond to hermit words, i.e., words that have no neighbors
- Random networks (Erdős-Rényi) are built for comparison
- The network is directed: a long word can have a short word as a neighbor, but not vice versa
  - A link is drawn only if the duration of the difference in the word pair ≤ (duration of the word)/3 (the factor 1/3 is experimentally derived; see the paper for further information)
Neighborhood Density
- The node whose neighbors are searched is called the base word
- The neighborhood density of a base word is expressed as the out-degree of the node representing it
- It is an estimate of the number of words activated by the base word when it is presented (spreading activation)
  - Something like semantic priming (but at the phonological level)
Results of the Network Analysis
- Small-world properties?
- High clustering, but also long average path length: like a small-world network the lexicon has densely connected neighborhoods, but a link between two nodes of different neighborhoods is harder to find than in small-world networks
Visualization – A Disconnected Graph with a Giant Component (GC)
- The GC is elongated: some nodes have really long chains of intermediates, and hence the mean path length is long
Low-Degree Nodes are Important!
- Removal of the low-degree nodes renders the network almost disconnected
- A bottleneck forms between longer (more than 7 segments) and shorter words
  - This bottleneck consists of the '-tion'-final words: coalition, passion, nation, fixation/fission. They form short-cuts between the high-degree nodes (i.e., they are low-degree stars that connect mega-neighborhoods)
Removal of Nodes with Degree ≤ 40
[Figure: the network splits into 2-4-segment words and 8-10-segment words]
Removal of low-degree nodes disconnects the network, as opposed to the removal of hubs like "pastor" (degree = 112).
Why Low Connectivity Between Neighborhoods?
- Spreading activation should not inhibit neighbors of the stimulus' neighbors that are non-neighbors of the stimulus itself (and are therefore not similar to the stimulus)
- A low mean path helps complete traversal of a network, e.g., in general-purpose search
- Search in the lexicon, however, does not need to traverse links between distant nodes; rather, it involves activation of a structured neighborhood sharing a single sub-lexical chunk that could be acoustically related during word recognition (Marslen-Wilson, 1990)
Degree Distribution (DD)
- Exponential rather than power-law
[Figure: degree distributions for the entire lexicon, 5-7-segment words and 8-10-segment words]
Other Works (see the supplementary material for references)
- Vitevitch (2005)
  - Similar to the above work, but builds networks whose nodes differ by just one segment
- Choudhury et al. (2007)
  - Builds weighted networks of Hindi, Bengali and English words based on orthographic proximity (nodes: words; edge weights: orthographic edit distance) – SpellNet
  - Thresholds the networks (at θ = 1, 3, 5) to make them binary
  - Also obtains exponential DDs
  - Observes that the occurrence of real-word errors in a language is proportional to the average weighted degree of that language's SpellNet
Other Works
- Sigman et al. (2002)
  - Analyzes the English WordNet
  - All semantic relationships are scale-invariant
  - Inclusion of polysemy makes the network small-world
- Ferrer i Cancho et al. (2000, 2001)
  - Word co-occurrence (within a sentence) based definitions of the lexicon
  - Lexicon = kernel lexicon + peripheral lexicon
  - Finds a 2-regime DD: one regime comprises words in the kernel lexicon, the other words in the peripheral lexicon
  - Finds that these networks are small-world
Some Unsolved Mysteries – You Can Give Them a Try
- What can be a model for the evolution of the ML?
- How is the ML acquired by a child learner?
- Is there a single optimal structure for the ML, or is it organized on multiple criteria (i.e., a combination of the different networks)?
- Towards a single framework for studying the ML!
CASE STUDY III: Syntax – Unsupervised POS Tagging
Labeling of Text
- Lexical category (POS tags)
- Syntactic category (phrases, chunks)
- Semantic role (agent, theme, ...)
- Sense
- Domain-dependent labeling (genes, proteins, ...)
How do we define the set of labels? How do we (learn to) predict them automatically?
"Nothing makes sense, unless in context"
- Distribution-based definition of:
  - Lexical category
  - Sense (meaning)
- Example contexts: "The X is ...", "If you X then I shall ...", "... looking at the star"
General Approach
- Represent the context of a word (token)
- Define some notion of similarity between contexts
- Cluster the contexts of the tokens
- Read off the label of each token from its cluster
[Figure: tokens w1-w4 grouped into clusters {w1, w3} and {w2, w4}]
Issues
- How to define the context?
- How to define similarity?
- How to cluster?
- How to evaluate?
Unsupervised Parts-of-Speech Tagging Employing Efficient Graph Clustering
Chris Biemann, COLING-ACL 2006
Stages
- Input: raw text corpus
- Identify feature words and define a graph over the high- and medium-frequency words (~10000)
- Cluster the graph to identify the word classes
- For low-frequency words, use context similarity
- Use the lexicon of word classes to tag the same text, then learn a Viterbi tagger
Feature Words
- Estimate the unigram frequencies
- Feature words: the 200 most frequent words
Feature Vector
Example sentence: "From the familiar to the exotic, the collection is a delight"
[Table: each target word is represented by indicators of the 200 feature words (fw1 = "the", fw2 = "to", ..., fw200) appearing at the context positions p-2, p-1, p+1, p+2]
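A toy sketch of such context vectors (assumed details: a window of positions -2, -1, +1, +2 and a 0/1 indicator encoding; the feature-word list here is a stand-in for the real top 200).

```python
def feature_vector(tokens, i, feature_words):
    fw_index = {w: k for k, w in enumerate(feature_words)}
    positions = [-2, -1, 1, 2]
    vec = [0] * (len(feature_words) * len(positions))
    for p, offset in enumerate(positions):
        j = i + offset
        if 0 <= j < len(tokens) and tokens[j] in fw_index:
            vec[p * len(feature_words) + fw_index[tokens[j]]] = 1
    return vec

sentence = "from the familiar to the exotic the collection is a delight".split()
feature_words = ["the", "to", "is", "from"]        # stand-ins for the top 200
print(feature_vector(sentence, 2, feature_words))  # context of "familiar"
```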
Syntactic Network of Words
[Figure: the word graph again – nodes such as "color", "sky", "weight", "light", "blue", "blood", "heavy", "red"; edge weights are context similarities, e.g., 1 - cos(red, blue)]
The Chinese Whispers Algorithm
[Figure: animation over the word graph – each node starts in its own class and repeatedly adopts the strongest class among its neighbors (edge weights such as 0.9, 0.8, 0.7, -0.5) until the classes converge]
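A compact sketch of Chinese Whispers (mine, not Biemann's exact code): every node keeps adopting the class with the highest total edge weight among its neighbors, so classes emerge without fixing their number in advance.

```python
import random

def chinese_whispers(nodes, edges, iterations=20, seed=0):
    rng = random.Random(seed)
    label = {v: v for v in nodes}          # each node starts in its own class
    neigh = {v: [] for v in nodes}
    for u, v, w in edges:
        neigh[u].append((v, w))
        neigh[v].append((u, w))
    for _ in range(iterations):
        order = nodes[:]
        rng.shuffle(order)                 # random update order each sweep
        for v in order:
            if not neigh[v]:
                continue
            scores = {}
            for u, w in neigh[v]:
                scores[label[u]] = scores.get(label[u], 0.0) + w
            label[v] = max(scores, key=scores.get)  # adopt the strongest class
    return label

nodes = ["sky", "blood", "color", "blue", "red", "light", "heavy", "weight"]
edges = [("sky", "blue", 0.9), ("blood", "red", 0.9), ("blue", "red", 0.8),
         ("color", "blue", 0.7), ("color", "red", 0.7),
         ("light", "heavy", 0.9), ("weight", "heavy", 0.8)]
print(chinese_whispers(nodes, edges))  # two classes: noun-like and adjective-like
```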
Medium- and Low-Frequency Words
- Neighboring (window 4) co-occurrences, ranked by log-likelihood and thresholded by θ
- Two words are connected iff they share at least 4 neighbors

Language | Nodes  | Edges
English  | 52857  | 691241
Finnish  | 85627  | 702349
German   | 137951 | 1493571
Construction of the Lexicon
- Each word is assigned a unique tag based on the word class it belongs to
  - Class 1: sky, color, blood, weight
  - Class 2: red, blue, light, heavy
- Ambiguous words:
  - High- and medium-frequency words that formed singleton clusters
  - Their possible tags are those of the neighboring clusters
Training and Evaluation
- Unsupervised training of a trigram HMM using the clusters and the lexicon
- Evaluation:
  - Tag a text for which a gold standard is available
  - Estimate the conditional entropy H(T|C) and the related perplexity 2^H(T|C)
- Final results:
  - English: 2.05 (619/345), Finnish: 3.22 (625/466), German: 1.79 (781/440)
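A sketch of the evaluation measure (standard definition of conditional entropy, not the paper's script): H(T|C) over (cluster, gold tag) pairs, with perplexity 2^H(T|C).

```python
import math
from collections import Counter

def conditional_entropy(pairs):           # pairs: one (cluster, gold_tag) per token
    n = len(pairs)
    joint = Counter(pairs)                # counts of (c, t)
    cluster = Counter(c for c, _ in pairs)
    # H(T|C) = -sum_{c,t} p(c,t) * log2 p(t|c)
    return -sum((cnt / n) * math.log2(cnt / cluster[c])
                for (c, _), cnt in joint.items())

pairs = [(1, "NN"), (1, "NN"), (1, "JJ"), (2, "VB"), (2, "VB"), (3, "RB")]
H = conditional_entropy(pairs)
print(H, 2 ** H)                          # entropy and perplexity 2^H(T|C)
```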
Word Sense Disambiguation
- Véronis, J. 2004. HyperLex: lexical cartography for information retrieval. Computer Speech & Language 18(3): 223-252.
- Let the word to be disambiguated be "light"
- Select a subcorpus of paragraphs that have at least one occurrence of "light"
- Construct the word co-occurrence graph
HyperLex
Example contexts:
- "A beam of white light is dispersed into its component colors by its passage through a prism."
- "Energy efficient light fixtures including solar lights, night lights, energy star lighting, ceiling lighting, wall lighting, lamps."
- "What enables us to see the light and experience such wonderful shades of colors during the course of our everyday lives?"
[Figure: the co-occurrence graph of "light" – one region with beam, prism, dispersed, white, colors, shades; another with energy, efficient, fixtures, lamps]
Hub Detection and MST
[Figure: hubs such as "beam" and "lamps" detected in the co-occurrence graph of "light", and the minimum spanning tree rooted at them separating the senses]
Example context: "White fluorescent lights consume less energy than incandescent lamps."
Other Related Works
- Solan, Z., Horn, D., Ruppin, E. and Edelman, S. 2005. Unsupervised learning of natural languages. PNAS, 102(33): 11629-11634.
- Ferrer i Cancho, R. 2007. Why do syntactic links not cross? Europhysics Letters.
- Also applied to: IR, summarization, sentiment detection and categorization, script evaluation, author detection, ...
Discussions & Conclusions
- What we learnt
- Advantages of SNIC in NLP
- Comparison to standard techniques
- Open problems
- Concluding remarks and Q&A
What We Learnt
- What SNIC and complex networks are
- Analytical tools for SNIC
- Applications to human languages
- Three case studies:

Case | Area               | Perspective                                       | Technique
I    | Sound systems      | Language evolution and change                     | Synthesis models
II   | Lexicon            | Psycholinguistic modeling and linguistic typology | Topology and search
III  | Syntax & semantics | Applications to NLP                               | Clustering
What We Saw
- Language features complex structure at every level of organization
- Linguistic networks have non-trivial properties: scale-free and small-world
- Therefore, language, and engineering systems involving language, should be studied within the framework of complex systems, especially CNT
Advantages of SNIC
- Fully unsupervised techniques:
  - No labeled data required: a good answer to resource scarcity
  - The problem of evaluation is circumvented by semi-supervised techniques
- Ease of computation:
  - Simple and scalable
  - Distributed and parallel computable
- Holistic treatment:
  - Connects to language evolution and psycholinguistic theories
Comparison to Standard Techniques
- Rule-based vs. statistical NLP
- Graphical models
  - Generative models in machine learning
  - HMMs, CRFs, Bayesian belief networks
[Figure: a graphical model over tags such as JJ, NN, RB, VF]
Graphical Models vs. Complex Networks

Graphical model:
- Principled: based on Bayesian theory
- Structure is assumed and parameters are learnt
- Focus: decoding and parameter estimation
- Data-driven and computationally intensive
- The generative process is easy to visualize, but there is no visualization of the data

Complex network:
- Heuristic, but with underlying principles of linear algebra
- Structure is discovered and studied
- Focus: topology and evolutionary dynamics
- Unsupervised and computationally easy
- Easy visualization of the data
Language Modeling
- A network of words as a model of language vs. n-gram models
- Hierarchical, hypergraph-based models
- Smoothing through holistic analysis of the network topology
Jedynak, B. and Karakos, D. 2007. Unigram Language Models using Diffusion Smoothing over Graphs. Proc. of TextGraphs-2.
Open Problems
- Universals and variables of linguistic networks
- Superimposition of networks: phonetic, syntactic, semantic
- Which clustering algorithm for which topology?
- Metrics for network comparison – important for language modeling
- Unsupervised dependency parsing using networks
- Mining translation equivalents
Resources
- Conferences: TextGraphs, Sunbelt, EvoLang, ECCS
- Journals: PRE, Physica A, IJMPC, EPL, PRL, PNAS, QL, ACS, Complexity, Social Networks
- Tools: Pajek, C#UNG, http://www.insna.org/INSNA/soft_inf.html
- Online resources: bibliographies, courses on CNT
Contact
- Monojit Choudhury
  monojitc@microsoft.com
  http://www.cel.iitkgp.ernet.in/~monojit/
- Animesh Mukherjee
  animeshm@cse.iitkgp.ernet.in
  http://www.cel.iitkgp.ernet.in/~animesh/
- Niloy Ganguly
  niloy@cse.iitkgp.ernet.in
  http://www.facweb.iitkgp.ernet.in/~niloy/
Thank you!!