Probabilistic Graphs: Efficient Natural (Spoken) Language Processing
Bob Carpenter
The Standard Clichés
• Moore’s Cliché
  – Exponential growth in computing power and memory will continue to open up new possibilities
• The Internet Cliché
  – With the advent and growth of the world-wide web, an ever-increasing amount of information must be managed
October 1999
More Standard Clichés
• The Convergence Cliché
  – Data, voice and video networking will be integrated over a universal network that:
    • includes land lines and wireless;
    • includes broadband and narrowband;
    • is likely to be implemented over IP (internet protocol)
• The Interface Cliché
  – The three forces above (growth in computing power, information online, and networking) will both enable and require new interfaces
  – Speech will become as common as graphics
Application Requirements
• Robustness
  – acoustic and linguistic variation
  – disfluencies and noise
• Scalability
  – from embedded devices to palmtops to clients to servers
  – across tasks from simple to complex
  – system-initiative form-filling to mixed-initiative dialogue
• Portability
  – simple adaptation to new tasks and new domains
  – preferably automated as much as possible
The Big Question
• How do humans handle unrestricted language so effortlessly in real time?
• Unfortunately, the “classical” linguistic assumptions and methodology completely ignore this issue
• This is a dangerous strategy for processing natural spoken language
My Favorite Experiments (I)
• Head-Mounted Eye Tracking
  – Mike Tanenhaus et al. (Univ. Rochester)
  – Example instruction: “Pick up the yellow plate”
  – Eyes track semantic resolution, with ~200 ms tracking time
• Clearly shows human understanding is online
My Favorite Experiments (II)
• Garden Paths and Context Sensitivity
  – Crain & Steedman (U. Connecticut & U. Edinburgh)
  – if the noun denotation is not a singleton in context, postmodification is much more likely
• Garden Paths are Frequency and Agreement Sensitive
  – Tanenhaus et al.
  – The horse raced past the barn fell. (raced is likely past tense)
  – The horses brought into the barn fell. (brought is likely a participle, and bringing is a less likely activity for horses)
Conclusion: Function & Evolution
• Humans aggressively prune in real time
  – This is an existence proof: there must be enough info to do so; we just need to find it
  – All linguistic information is brought in at 200 ms
  – Other pruning strategies have no such existence proof
• Speakers are cooperative in their use of language
  – Especially with spoken language, which is very different from written language due to real-time requirements
• (Co-?)Evolution of language and speakers to optimize these requirements
Stats: Explanation or Stopgap?
• The Common View
  – Statistics are some kind of approximation of underlying factors requiring further explanation.
• Steve Abney’s Analogy (AT&T Labs)
  – Statistical Queueing Theory
  – Consider traffic flows through a toll gate on a highway.
  – Underlying factors are diverse, and explain the actions of each driver, their cars, possible causes of flat tires, drunk drivers, etc.
  – Statistics is more insightful [“explanatory”] in this case as it captures emergent generalizations
  – It is a reductionist error to insist on a low-level account
Algebraic vs. Statistical
• False Dichotomy
  – Statistical systems have an algebraic basis, even if trivial
• Best-performing statistical systems have the best linguistic conditioning
  – Holds for phonology/phonetics and morphology/syntax
  – Most “explanatory” in traditional sense
  – Statistical estimators less significant than conditioning
• In “other” sciences, statistics used for exploratory data analysis
  – trendier: data mining; trendiest: information harvesting
• Emergent statistical generalizations can be algebraic
The Speech Recognition Problem
• The Recognition Problem
  – Find the most likely sequence w of “words” given the sequence of acoustic observation vectors a
  – Use Bayes’ law to create a generative model
  – argmax_w P(w|a) = argmax_w P(a|w) P(w) / P(a) = argmax_w P(a|w) P(w), since P(a) does not depend on w
• Language Model: P(w) [usually n-grams, discrete]
• Acoustic Model: P(a|w) [usually HMMs, continuous density]
• Challenge 1: beat trigram language models
• Challenge 2: extend this paradigm to NLP
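The decision rule above can be illustrated with a minimal Python sketch. It assumes a recognizer has already produced a handful of candidate word sequences with an acoustic log score log P(a|w) and a language-model log score log P(w); the function name `best_hypothesis`, the dictionary keys, and all numbers are hypothetical and made up for illustration.

```python
# Toy illustration of argmax_w P(a|w) P(w), computed in log space.
# All scores are made-up numbers, not output from any real recognizer.

def best_hypothesis(hypotheses):
    """Return the hypothesis maximizing log P(a|w) + log P(w)."""
    return max(hypotheses, key=lambda h: h["acoustic_logp"] + h["lm_logp"])

hypotheses = [
    {"words": "flights from Boston today", "acoustic_logp": -12.0, "lm_logp": -8.0},
    {"words": "lights for Boston to pay", "acoustic_logp": -11.5, "lm_logp": -14.0},
    {"words": "flights from Austin today", "acoustic_logp": -13.0, "lm_logp": -8.5},
    {"words": "flights for Boston to pay", "acoustic_logp": -12.5, "lm_logp": -13.0},
]

best = best_hypothesis(hypotheses)
print(best["words"])
```

P(a) is never computed: it is the same for every candidate, so dropping it does not change which candidate wins.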
N-best and Word Graphs
• Speech recognizers can return n-best histories
  1. flights from Boston today
  2. lights for Boston to pay
  3. flights from Austin today
  4. flights for Boston to pay
• Or a packed word graph of histories
  – sum of path log probs equals acoustic/language log prob
  [Figure: word graph with edges labeled flights, lights, from, for, Boston, Austin, today, to, pay]
• Path closest to utterance in dense graphs is much better than first-best on average [density: 1: 24%; 5: 15%; 180: 11%]
Probabilistic Graph Processing
• The architecture we’re exploring in the context of spoken dialogue systems involves:
  – Speech recognizers that produce a probabilistic word graph output, with scores given by acoustic probabilities
  – A tagger that transforms a word graph into a word/tag graph with scores given by joint probabilities
  – A parser that transforms a word/tag graph into a syntactic graph (as in CKY parsing) with scores given by the grammar
• Allows each module to rescore the previous module’s output
• Long Term: Apply this architecture to speech act detection, dialogue act selection, and generation
Probabilistic Graph Tagger
• In: probabilistic word graph
  – P(As|Ws) : conditional acoustic likelihoods [or confidences]
• Out: probabilistic word/tag graph
  – P(Ws, Ts) : joint word/tag likelihoods [ignores acoustics]
  – P(As, Ws, Ts) : joint acoustic/word/tag likelihoods [but…]
• General history-based implementation [in Java]
  – next tag/word probability a function of specified history
  – operates purely left to right on forward pass
  – backwards prune to edges within a beam / on n-best path
  – able to output hypotheses online
  – optional backwards confidence rescoring [not P(As, Ws, Ts)]
  – need node for each active history class for proper model
Backwards: Rescore & Minimize
• Joint Out (edge scores): A : 1/2, B : 1/8, C : 1/4, D : 1/8, E : 1/16
• All Paths:
  1. A, C, D : 1/64
  2. A, C, E : 1/128
  3. B, C, D : 1/256
  4. B, C, E : 1/512
• Edge gets sum of all path scores that go through it
• Normalize by total: (1/64 + 1/128 + 1/256 + 1/512)
• Backward (rescored edges): A : 4/5, B : 1/5, C : 1, D : 2/3, E : 1/3
• Note: outputs sum to 1 after backward pass
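The rescoring on this slide can be checked with a short forward-backward computation. The sketch below is in Python (not the Java implementation the talk mentions) and assumes the graph shape implied by the path list: A and B leave the start node, C follows, then D or E reach the end node. The node numbering and the function name `edge_posteriors` are hypothetical.

```python
from fractions import Fraction as F

# Edges as (source node, target node, label, joint score), using the slide's scores.
# Assumed shape: 0 -(A|B)-> 1 -(C)-> 2 -(D|E)-> 3, node ids in topological order.
edges = [
    (0, 1, "A", F(1, 2)), (0, 1, "B", F(1, 8)),
    (1, 2, "C", F(1, 4)),
    (2, 3, "D", F(1, 8)), (2, 3, "E", F(1, 16)),
]

def edge_posteriors(edges, start, end):
    """For each edge: (sum of scores of paths through it) / (sum over all paths)."""
    nodes = {n for s, t, _, _ in edges for n in (s, t)}
    forward = {n: F(0) for n in nodes}   # total score of paths from start to n
    backward = {n: F(0) for n in nodes}  # total score of paths from n to end
    forward[start] = F(1)
    backward[end] = F(1)
    for s, t, _, w in sorted(edges, key=lambda e: e[0]):
        forward[t] += forward[s] * w
    for s, t, _, w in sorted(edges, key=lambda e: e[1], reverse=True):
        backward[s] += w * backward[t]
    total = forward[end]
    return {label: forward[s] * w * backward[t] / total for s, t, label, w in edges}

posteriors = edge_posteriors(edges, 0, 3)
for label, p in sorted(posteriors.items()):
    print(label, p)
```

Exact fractions are used so that the results can be compared directly with the values on the slide; a practical implementation would more likely work with log scores.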
Tagger Probability Model
• Exact Probabilities:
  – P(As, Ws, Ts) = P(Ws, Ts) * P(As|Ws, Ts)
  – P(Ws, Ts) = P(Ts) * P(Ws|Ts) [“top-down”]
• Approximations:
  – Two Tag History: tag trigram
    • P(Ts) ~ PRODUCT_n P(T_n | T_{n-2}, T_{n-1})
  – Words Depend only on Tags: HMM
    • P(Ws|Ts) ~ PRODUCT_n P(W_n | T_n)
  – Pronunciation Independent of Tag: use standard acoustics
    • P(As|Ws, Ts) ~ P(As|Ws)
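Putting the three approximations together gives log P(As, Ws, Ts) ~ log P(As|Ws) + SUM_n [log P(T_n | T_{n-2}, T_{n-1}) + log P(W_n | T_n)]. The Python sketch below computes that sum for one word/tag sequence. The table layout, the `<s>` padding symbol, the function name `joint_log_prob`, and every probability are assumptions made up for illustration; there is no smoothing, so unseen events raise a `KeyError`.

```python
import math

def joint_log_prob(words, tags, tag_trigram, emission, acoustic_logp=0.0):
    """log P(As|Ws) + sum_n [log P(T_n|T_{n-2},T_{n-1}) + log P(W_n|T_n)]."""
    history = ("<s>", "<s>")  # assumed padding for the first two tag positions
    total = acoustic_logp
    for word, tag in zip(words, tags):
        total += math.log(tag_trigram[history + (tag,)])  # tag trigram term
        total += math.log(emission[(word, tag)])          # word-given-tag term
        history = (history[1], tag)
    return total

# Made-up toy probabilities, not estimated from any corpus.
tag_trigram = {
    ("<s>", "<s>", "NNS"): 0.2,
    ("<s>", "NNS", "VBD"): 0.3,
    ("NNS", "VBD", "RB"): 0.1,
}
emission = {("prices", "NNS"): 0.01, ("rose", "VBD"): 0.02, ("sharply", "RB"): 0.05}

score = joint_log_prob(["prices", "rose", "sharply"], ["NNS", "VBD", "RB"],
                       tag_trigram, emission)
print(score)
```

Passing a nonzero `acoustic_logp` adds the recognizer's log P(As|Ws) for the word path, which is what the third approximation allows.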
Prices rose sharply today
0. -35.612683136497516 : NNS/prices VBD/rose RB/sharply NN/today
   (0, 2; NNS/prices) (2, 10; VBD/rose) (10, 14; RB/sharply) (14, 15; NN/today)
1. -37.035496392922575 : NNS/prices VBD/rose RB/sharply NNP/today
   (0, 2; NNS/prices) (2, 10; VBD/rose) (10, 14; RB/sharply) (14, 15; NNP/today)
2. -40.439580756197934 : NNS/prices VBP/rose RB/sharply NN/today
   (0, 2; NNS/prices) (2, 9; VBP/rose) (9, 11; RB/sharply) (11, 15; NN/today)
3. -41.86239401262299 : NNS/prices VBP/rose RB/sharply NNP/today
   (0, 2; NNS/prices) (2, 9; VBP/rose) (9, 11; RB/sharply) (11, 15; NNP/today)
4. -43.45450487625557 : NN/prices VBD/rose RB/sharply NN/today
   (0, 1; NN/prices) (1, 6; VBD/rose) (6, 14; RB/sharply) (14, 15; NN/today)
5. -44.87731813268063 : NN/prices VBD/rose RB/sharply NNP/today
   (0, 1; NN/prices) (1, 6; VBD/rose) (6, 14; RB/sharply) (14, 15; NNP/today)
6. -45.70597331609037 : NNS/prices NN/rose RB/sharply NN/today
   (0, 2; NNS/prices) (2, 8; NN/rose) (8, 13; RB/sharply) (13, 15; NN/today)
7. -45.81027979248346 : NNS/prices NNP/rose RB/sharply NN/today
   (0, 2; NNS/prices) (2, 7; NNP/rose) (7, 12; RB/sharply) (12, 15; NN/today)
8. ……
Prices rose sharply after hours
• 15-best as a word/tag graph + minimization
[Figure: word/tag graph with edges labeled prices: NN, prices: NNS, rose: VBD, rose: VBP, rose: NN, rose: NNP, sharply: RB, after: RB, after: IN, hours: NNS]
Prices rose sharply after hours
• 15-best as a word/tag graph + minimization + collapsing
[Figure: collapsed word/tag graph with edges labeled prices: NN, prices: NNS, rose: VBD, rose: VBP, rose: NN, rose: NNP, sharply: RB, after: RB, after: IN, hours: NNS]
Weighted Minimize (isn’t easy)
• Can push probabilities back through graph
• Ratio of scores must be equivalent for sound minimization (difference of log scores)
• Before: B: w followed by A: x; C: z followed by A: y
• After: B: w+(x-y) and C: z, both followed by a shared A: y
• Assume x > y; operation preserves sum of paths:
  – B, A : w+x
  – C, A : z+y
Weighted Minimize is Problematic
• Can’t minimize if ratio is not the same:
  – B: w followed by B: x_1 or A: y_1
  – C: z followed by B: x_2 or A: y_2
• To push, must have amount to push: (x_1 - x_2) = (y_1 - y_2) [e^x_1 / e^x_2 = e^y_1 / e^y_2]
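The pushing condition from the two weighted-minimization slides can be sketched in Python as follows. The sketch assumes each candidate node is represented by a dictionary from outgoing edge label to log score, and that matching labels lead to the same destination; the function name `push_delta` and all numeric scores are hypothetical.

```python
import math

def push_delta(out1, out2, tol=1e-9):
    """Return d such that out1[l] - out2[l] == d for every label l, else None.

    If such a d exists, adding d to the edges entering node 1 lets the two
    nodes share out2's outgoing scores, so they can be merged.
    """
    if out1.keys() != out2.keys():
        return None
    deltas = [out1[label] - out2[label] for label in out1]
    if not deltas:
        return 0.0
    if all(math.isclose(d, deltas[0], abs_tol=tol) for d in deltas):
        return deltas[0]
    return None

# First slide: B: w then A: x, and C: z then A: y (made-up log scores, x > y).
w, x, y, z = -1.0, -2.0, -3.0, -0.5
d = push_delta({"A": x}, {"A": y})      # amount to push back: x - y
pushed_w = w + d                        # B: w + (x - y)
print(pushed_w + y == w + x)            # path B, A keeps its score
print(z + y)                            # path C, A is untouched

# Second slide: merging needs x_1 - x_2 == y_1 - y_2.
print(push_delta({"B": -1.0, "A": -2.0}, {"B": -1.5, "A": -2.5}))  # same difference
print(push_delta({"B": -1.0, "A": -2.0}, {"B": -1.5, "A": -4.0}))  # differences disagree
```

Working with differences of log scores is the same as requiring equal ratios of probabilities, which is the bracketed condition on the slide.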
How to Collect n Best in O(nk)
• Do a forward pass through graph, saving:
  – best total path score at each node
  – backpointers to all previous nodes, with scores
• This is done during tagging (linear in max length k)
• Algorithm:
  – add first-best and second-best final path to priority queue
  – k times, repeat:
    • follow backpointer of best path on queue to beginning & save next best (if any) at each node on queue
• Can do same for all paths within beam epsilon
• Result is deterministic; minimize before parsing
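A closely related way to collect n-best paths from saved forward scores and backpointers is a backward best-first search with a priority queue. The Python sketch below is that variant, not necessarily the exact procedure on the slide: it assumes an acyclic graph whose integer node ids are in topological order, log scores on the edges, and a hypothetical function name `n_best`.

```python
import heapq

def n_best(edges, start, end, n):
    """Return up to n (log score, labels) pairs, best first.

    edges: (src, dst, label, log_score) tuples of a DAG with topologically
    ordered integer node ids (an assumption of this sketch).
    """
    incoming = {}  # backpointers: node -> list of (previous node, label, score)
    for src, dst, label, score in edges:
        incoming.setdefault(dst, []).append((src, label, score))
    best = {start: 0.0}  # forward pass: best total path score at each node
    for dst in sorted(incoming):
        scores = [best[src] + s for src, _, s in incoming[dst] if src in best]
        if scores:
            best[dst] = max(scores)
    if end not in best:
        return []
    # Queue entries: (-(best prefix + suffix), suffix score, node, suffix labels).
    queue = [(-best[end], 0.0, end, ())]
    results = []
    while queue and len(results) < n:
        neg_total, suffix, node, labels = heapq.heappop(queue)
        if node == start:
            results.append((-neg_total, list(labels)))
            continue
        for src, label, score in incoming.get(node, []):
            if src in best:
                new_suffix = suffix + score
                heapq.heappush(
                    queue,
                    (-(best[src] + new_suffix), new_suffix, src, (label,) + labels),
                )
    return results

# Made-up log scores on a small graph: 0 -(A|B)-> 1 -(C)-> 2 -(D|E)-> 3.
toy_edges = [(0, 1, "A", -1.0), (0, 1, "B", -3.0), (1, 2, "C", -2.0),
             (2, 3, "D", -3.0), (2, 3, "E", -4.0)]
top3 = n_best(toy_edges, 0, 3, 3)
for score, labels in top3:
    print(score, labels)
```

Because the forward scores give the exact best completion for every partial path, complete paths come off the queue in order of total score, so the search can stop after n of them.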
Collins’ Head/Dependency Parser
• Michael Collins (AT&T), 1998 UPenn Ph.D. thesis
• Generative model of tree probabilities: P(Tree)
• Parses WSJ with ~90% constituent precision/recall
  – Best performance for a single parser, but Henderson’s Johns Hopkins thesis beat it by blending with other parsers (Charniak & Ratnaparkhi)
• Formal “language” induced from simple smoothing of treebank is trivial: ~Word* (Charniak)
• Collins’ parser runs in real time
  – Collins’ naïve C implementation
  – Parses 100% of test set
Collins’ Grammar Model
• Similar to GPSG + CG (aka HPSG) model
  – Subcat frames: adjuncts / complements distinguished
  – Generalized Coordination
  – Unbounded Dependencies via slash
  – Punctuation
  – Distance metric codes word order (canonical & not)
• Probabilities conditioned top-down
• 12,000 word vocabulary (>= 5 occs in treebank)
  – backs off to a word’s tag
  – approximates unknown words from words with < 5 occs
• Induces “feature information” statistically
Collins’ Statistics (Simplified)
• Choose Start Symbol, Head Tag, & Head Word
  – P(RootCat, HeadTag, HeadWord)
• Project Daughter and Left/Right Subcat Frames
  – P(DaughterCat | MotherCat, HeadTag, HeadWord)
  – P(SubCat | MotherCat, DaughterCat, HeadTag, HeadWord)
• Attach Modifier (Comp/Adjunct & Left/Right)
  – P(ModifierCat, ModifierTag, ModifierWord | SubCat, MotherCat, DaughterCat, HeadTag, HeadWord, Distance)
Complexity and Efficiency
• Collins’ wide coverage linguistic grammar generates millions of readings for simple strings
• But Collins’ parser runs faster than real time on unseen sentences of arbitrary length
• How?
• Punchline: Time-Synchronous Beam Search Reduces Time to Linear
• Tighter estimates with more features and more complex grammars ran faster and more accurately
  – Beam allows tradeoff of accuracy (search error) and speed
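To make the linear-time claim concrete, here is a minimal Python sketch of a time-synchronous beam search, shown on tagging rather than parsing to keep it short. It is not Collins’ parser: the function name `beam_search`, the parameters `beam` and `max_hyps`, the toy lexicon, and the scoring function are all assumptions made up for illustration. With at most `max_hyps` hypotheses kept per position, the work per word is bounded, so total time grows linearly with sentence length.

```python
def beam_search(words, tags_for, step_logp, beam=2.5, max_hyps=10):
    """Left-to-right search keeping hypotheses within `beam` log units of the best.

    tags_for(word) must return at least one candidate tag (assumed).
    Returns the best (log score, tag tuple) that survives pruning.
    """
    hypotheses = [(0.0, ())]
    for word in words:  # all hypotheses advance one word at a time
        extended = [
            (score + step_logp(tags, tag, word), tags + (tag,))
            for score, tags in hypotheses
            for tag in tags_for(word)
        ]
        best = max(score for score, _ in extended)
        kept = [h for h in extended if h[0] >= best - beam]  # beam pruning
        hypotheses = sorted(kept, reverse=True)[:max_hyps]   # hard cap
    return max(hypotheses)

# Made-up log scores; the repeated-tag penalty is an arbitrary stand-in for a tag model.
LEXICON = {
    "prices": {"NNS": -1.0, "NN": -4.0},
    "rose": {"VBD": -1.0, "NN": -3.0},
    "sharply": {"RB": -0.5},
}

def toy_step_logp(history, tag, word):
    penalty = -2.0 if history and history[-1] == tag else 0.0
    return LEXICON[word][tag] + penalty

best_score, best_tags = beam_search(
    ["prices", "rose", "sharply"], lambda w: LEXICON[w], toy_step_logp
)
print(best_score, best_tags)
```

Narrowing `beam` or lowering `max_hyps` makes the search faster but raises the chance of pruning the hypothesis that would have scored best overall, which is the accuracy/speed tradeoff named on the slide.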
Completeness & Dialogue
• Collins’ parser is not complete in the usual sense
• But neither are humans (e.g. garden paths)
• Syntactic features alone don’t determine structure
  – Humans can’t parse without context, semantics, etc.
  – Even phone or phoneme detection is very challenging, especially in a noisy environment
  – Top-down expectations and knowledge of likely bottom-up combinations prune the vast search space online
  – Question is how to combine it with other factors
• Next steps: semantics, pragmatics & dialogue
Conclusions
• Need ranking of hypotheses for applications
• Beam can reduce processing time to linear
  – need good statistics to do this
• More linguistic features are better for stat models
  – can induce the relevant ones and weights from data
  – linguistic rules emerge from these generalizations
• Using acoustic / word / tag / syntax graphs allows the propagation of uncertainty
  – ideal is totally online (model is compatible with this)
  – approximation allows simpler modules to do first pruning