- Number of slides: 82
Computers versus Common Sense
• CYC: Software that partially understands
• The impact of that on the AIM dream
Dr. Douglas B. Lenat, 7718 Wood Hollow Drive, Austin, TX 78731
Email: Lenat@cyc.com  Phone: 512-342-4001
May 4, 2009
ELIZA (DOCTOR), 1965, Joe Weizenbaum, MIT
• Carl Rogers advocated reflection
  Patient: “I swear a lot.”
  “How do you feel about the fact that you swear a lot?”
• Counted a lot on anthropomorphism
• Counted on patient-to-doctor respect/awe/…
  Patient: “My dog’s mother died recently.”
  “Tell me more about your mother.”
Veneer of intelligence, not intelligence
ALICE (40 years post-Eliza)
• JUDGE 11 [9:26:0] I WILL ASK YOU SOME SIMPLE QUESTIONS AND GIVE SIMPLE COMMANDS
• PROGRAM [9:26:0] Do you think your plan will succeed?
• JUDGE 11 [9:26:16] WHAT IS THE COLOR OF A BLUE CAR
• PROGRAM [9:26:16] WHAT COLOR IS a blue car
• JUDGE 11 [9:26:35] WHERE IS SUE’S NOSE WHEN SUE IS IN HER HOUSE
• PROGRAM [9:26:35] Where it belongs. Try searching the Web.
May 2009
2009: Software is still Brittle
“Is the Space Needle taller than the Eiffel Tower?”
“How old was Martin van Buren when he was elected President of the U.S.?”
Natural Language Understanding requires having lots of knowledge
1. The pen is in the box.
   The box is in the pen.
2. The police watched the demonstrators because they feared violence.
   The police watched the demonstrators because they advocated violence.
3. Mary and Sue are sisters.
   Mary and Sue are mothers.
4. Every American has a mother.
   Every American has a president.
5. John saw his brother skiing on TV. The fool didn’t have a coat on!
   John saw his brother skiing on TV. The fool didn’t recognize him!
7. “…include all the re-do CABG procedures utilizing ITA and SVG in 1991”.
   “And” usually does mean “and”. But in this query, “and” really must mean “or”. Medical knowledge, not grammar, disambiguates this: a single CABG will not have both an ITA and an SVG.
8. “…that the tumor cells are stopping dividing or dying…”
   Do they mean “stopping dividing or stopping dying”? Of course not, but in 16 of 30 randomly selected syntactically similar constructions from www.clinicaltrials.gov, the coordination (i.e., the wider scope of the modifier, in this case the word “stopping”) was the intended meaning. In each case, only one choice “makes sense” (is consistent with medical knowledge and common sense).
9. “Adult patients who underwent MAZE III with or without Mitral Valve Repair or Replacements.”
   Is the second half of that query just a waste of space? Discourse pragmatics says no, the physician must have had some reason for saying that. Medical knowledge provides a plausible interpretation: “Adult patients who underwent MAZE III with no concomitant procedures other than Mitral Valve Repair or Replacements”
Okay, so let’s tell the computer the same sorts of things that human beings know about cars, colors, heights, movies, time, driving to a place, etc.: all the other stuff that everybody knows.
The basic idea: Get the computer to understand, not just store, information. Then it can reason to answer your queries.
MicrowaveOven is a type of Kitchen-Appliance.
Dishwasher is a type of Kitchen-Appliance.
You can’t use X if it alorxes Y but lacks any Y.
Rthagide-disjaks is a type of Kitchen-Appliance.
Gracinimumples is a type of Kitchen-Appliance.
Rthagide-disjaks alorxes Vorawnistz.
Gracinimumples alorxes Vorawnistz and Buzqa is a Thwarn and supplied through Epluns.
Eventually, after writing millions of these rules, the system knows as much about pipes, liquids, water, electricity, microwave ovens, dishwashers, cars, colors, movies, heights, etc. as you and I do.
• Ultimately, there is just 1 interpretation of that model, and it corresponds to the real world.
• Long before that, incrementally, the system gains competence and trustworthiness.
Cyc is…
Millions of facts, rules of thumb, etc. that capture human common sense about our everyday world
- The typical bird has 1 beak, 1 heart, lots of feathers, …
- Hearts are internal organs; feathers are external protrusions
- Most vehicles are steered by an awake, sane, adult, … human
- Tangible objects can’t be in 2 (disjoint) places at once
- Badly injuring a child is much worse than killing a dog
- Causes temporally precede (i.e., start before) their effects
- A stabbing requires 2 cotemporal and proximate actors
- etc.
Cyc is…
Millions of facts, rules of thumb, etc. that capture human common sense about our everyday world
- Each of these represented in formal logic
- Info. about a set of hundreds of thousands of terms
- Language-independent
[Diagram of terms and the words that denote them: Penitentiary, WritingPen, BirdFeather, Authoring, …; EnglishWord-Pen, EnglishWord-Plume, FrenchWord-Plume, ChineseWordForWritingPen]
Cyc is…
Millions of facts, rules of thumb, etc. that capture human common sense about our everyday world
- Each of these represented in formal logic
- Info. about a set of hundreds of thousands of terms
• An inference engine that draws the same sorts of inferences from those assertions that people would.
• Interfaces so the system can communicate with people, databases, spreadsheets, websites, etc.
What Needs to be Shared?
• bits/bytes/streams/network…
• alphabet, special characters, …
• words, morphological variants, …
• syntactic meta-level markups (HTML)
• semantic meta-level markups (SGML, XML)
• Sem. Web content (logical representation of doc/page/...)
• context (common sense, recent utterances, and n dimensions of metadata: time, space, level of granularity, the source’s purpose, etc.)
How formalized knowledge helps search
• Query: “Someone smiling”
• Caption: “A man helping his daughter take her first step”
Find information by inference (+ KB):
• When you become happy, you smile.
• You become happy when someone you love accomplishes a milestone.
• Taking one’s first step is a milestone.
• Parents love their children.
  (ForAll ?P (ForAll ?C (implies (and (isa ?P Person) (children ?P ?C)) (loves ?P ?C))))
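The chain from caption to query can be sketched as a tiny forward-chaining loop. This is only an illustration, not Cyc's inference engine: the tuple encoding, the constants `Man1` and `Daughter1`, and the predicate names other than `isa`, `children`, and `loves` are assumptions made for this sketch.

```python
def forward_chain(facts):
    """Apply the four slide rules repeatedly until nothing new is derived."""
    known = set(facts)
    while True:
        derived = set()
        for fact in known:
            # Parents love their children (the CycL rule on the slide).
            if fact[0] == "children" and ("isa", fact[1], "Person") in known:
                derived.add(("loves", fact[1], fact[2]))
            # Taking one's first step is a milestone.
            elif fact[0] == "takesFirstStep":
                derived.add(("accomplishesMilestone", fact[1]))
            # You become happy when someone you love accomplishes a milestone.
            elif fact[0] == "loves" and ("accomplishesMilestone", fact[2]) in known:
                derived.add(("happy", fact[1]))
            # When you become happy, you smile.
            elif fact[0] == "happy":
                derived.add(("smiling", fact[1]))
        if derived <= known:
            return known
        known |= derived


# Hypothetical encoding of the caption "A man helping his daughter take her first step".
caption_facts = {
    ("isa", "Man1", "Person"),
    ("children", "Man1", "Daughter1"),
    ("takesFirstStep", "Daughter1"),
}
closure = forward_chain(caption_facts)
matches_query = any(fact[0] == "smiling" for fact in closure)
```

The caption never mentions smiling; the match for the query “Someone smiling” exists only in the derived facts, which is the point of the slide.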
How formalized knowledge helps search
• Query: “Show me pictures of strong and adventurous people”
• Find information by inference (+ KB)
• Caption: “A man climbing a rock face”
How formalized knowledge helps search
• Query: “Government buildings damaged in terrorist events in Beirut between 1990 and 2001”
• Find information by inference (+ KB)
• Text document: “1993 pipe bombing of France’s embassy in Lebanon.”
How can our programs be intelligent, not merely have the veneer of it?
• ANSWER: By having a large corpus of knowledge, spanning the gamut from specific domain-dependent knowledge all the way up to general common sense.
• The computer needs to be able to apply the knowledge, not just store some English gloss
  – Represent it formally (predicate calculus), and apply logic
  – Represent it numerically, and apply mathematics/statistics
• And after all that: Be compelling to the human deciding
One Good Explanation is worth 20 points of IQ
• Magic tricks
  – “How do they do that?!” “How was I ever fooled by that?!”
• Efficacy of punishment vs reward
  – “Punishment is more effective, and the statistics back me up”
• Clinical decision-making (by doctors and by patients)
  – “Because 0.814” versus “Because <plausible causal rationale>”
• Organ donation in European countries:
  – Why is it so often 15%/85% or 85%/15%? [Answer: Because when you apply for a driver’s license in some countries, you have to check a box to “opt in”; in others, you have to check a box to “opt out”; and in the U.S. and most European countries at least, 85% of the people don’t know what they should do, even though it’s an emotional, serious choice, and end up just leaving it unchecked.]
• And after all that: Be compelling to the human deciding
Reflection Framing Effect
Philadelphia is preparing for a Legionnaires’ Disease outbreak expected to kill 600 people today. Two alternative programs to combat the disease have been proposed. The consequences of each program are as follows:
• If Program A is adopted, 200 people will be saved. (72%)
• If Program B is adopted, there is a 1/3 chance that all 600 will be saved, and a 2/3 chance that no lives will be saved. (28%)
Equivalently (A = A’, B = B’):
• If Program A’ is adopted, 400 people will die. (22%)
• If Program B’ is adopted, there is a 2/3 chance that 600 will die, and a 1/3 chance that no one will die. (78%)
For more information, see: Kahneman, D. and Tversky, A. (1984). Choices, values, and frames. American Psychologist, 39, 341-350.
Conjunction Fallacy
A health survey was conducted in a representative sample of adult males in Chicago of all ages and occupations. Mr. F was included in the sample. He was selected by random chance from the list of participants. Please rank the following statements in terms of which is most likely to be true of Mr. F. (1 = most likely to be true, 6 = least likely)
____ Mr. F smokes more than 1 cigarette per day on average.
____ Mr. F has had one or more heart attacks. [A]
____ Mr. F had a flu shot this year.
____ Mr. F eats red meat at least once per week.
____ Mr. F has had one or more heart attacks and he is over 55 years old. [A and B]
____ Mr. F never flosses his teeth.
58% rated “A and B” more likely than A.
For more information, see: Tversky, A. and Kahneman, D. (1983). Extensional vs. intuitive reasoning: The conjunction fallacy in probability judgment. Psych. Rev. 90, 293-315.
Why there is a need for meta-logical elements (rationale and POV) to convince decision-makers • Early hominids: pre-rational decision-makers • Later hominids: usually rational • Even later hominids: almost always rational
A 67-year-old woman suffering from ICM with elevated bilirubin, history of diabetes, body mass index of 39.5, NYHA function class III, mitral valve regurgitation grade (MVRG) of 2+, and no aortic valve regurgitation (AVR) is assigned to CABG surgery. RF+Cyc is consulted and the RF (random forest statistical reasoning) component, having been trained on a large database, identifies CABG alone as the most likely treatment option, citing an odds ratio of 2.6 over the next most favorable treatment, CABG+MVA. As rationale, the Cyc (AI) component observes that the low MVRG is atypical of MVA, which is a surgical procedure typically reserved for patients with severe mitral regurgitation, and thus the simpler CABG procedure is preferred.
However, an intraoperative transesophageal echocardiogram (TEE) suggests MVRG is 3+. Based on this, the surgical team overrides the initial diagnosis without consultation, opting instead for CABG+MVA. The patient dies 3 days later from complications due to surgery.
In this setting, RF+Cyc, if consulted, could have alerted the heart team to additional data that might have swayed their decision, thus potentially saving a life. RF+Cyc would have noted that while an MVRG of 3+ is consistent with CABG+MVA, the odds favoring CABG only marginally decrease from 2.6:1 to 1.7:1 when MVRG is upstaged for this patient from 2+ to 3+, and that surgery under CABG alone offers a 20% increase in median survival compared to CABG+MVA. RF+Cyc could further argue that intraoperative MVRG can falsely appear to be upstaged due to altered hemodynamics in anesthetized patients. A Cyc-assisted semantic search of the recent literature reveals that transthoracic echocardiograms (TTE) more reliably reflect the degree of mitral regurgitation than TEE. That (+ co-morbidities) argues for just CABG.
4 Pitfalls of Semantic Technology
• Ignorance-based: A small theory size (#terms, instances, rules)
• Static KB (massively tuned, optimized, cached ahead of time)
• Simple assertions (SAT constraints; propositional calculus; Horn clause logic; Description Logic; first order logic)
• 1 global context (no contradictions, tiny domain, simplified world)
Applying Cyc
• Cyc is a power source, not a single application. Like oil, electricity, telephony, computers, … Cyc can spawn and sustain a knowledge utility industry.
• It can cost-effectively underlie almost all apps. (Provide a common-sense layer to reduce brittleness when faced with unexpected inputs/situations)
• To apply Cyc, we extend its ontology, its KB, and possibly its suite of specialized reasoning modules
The Analyst’s Knowledge Base (AKB)
[Architecture diagram; recoverable labels:]
• Users: CT Analyst (“Were there any attacks on targets of symbolic value to Muslims since 1987 on a Christian holy day?”); Domain Experts (“What sequences of events could lead to the destruction of Hoover Dam?”)
• Components: Query Formulator; Explanation Generator; Scenario Generator; Reasoning Modules; Cycorp tools for ontology building, browsing, editing, and fact/rule entry; others’/GOTS analysis and collaboration components
• Knowledge: General Knowledge Base; Terrorism Knowledge Base (AKB); OWL & relational DB “projection” of the AKB
• Interface to data repositories: HUMINT messages, SIGINT data, message content, INS border crossings, geopolitical data, global terrain data, HID observations, weather data, travel records, credit card records, satellite intel, military intel, output of COTS text extraction systems
A more recent example
“What major US cities are particularly vulnerable to an anthrax attack?”
The answer is logically implied by data dispersed through several sources: USGS GNIS DB; AMVA KB; RAND R; UN FAO DB; DTRA CATS DB
“What major US cities are particularly vulnerable to an anthrax attack?”
“major US city”
  – ?C is a U.S. city with >1M population
“particularly vulnerable to an anthrax attack”
  – the current ambient temperature at ?C is above freezing, and
  – ?C has more than 100 people for each hospital bed, and
  – the number of anthrax host animals near ?C exceeds 100k
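Read as a conjunction of conditions on ?C, the query can be sketched as a plain filter. The `City` record, its field names, and the three sample cities are made up for illustration; in the talk the values are pulled from several separate databases rather than held in one record.

```python
from dataclasses import dataclass


@dataclass
class City:
    # All fields are hypothetical stand-ins for values drawn from different sources.
    name: str
    population: int
    ambient_temp_c: float
    hospital_beds: int
    nearby_host_animals: int


def is_major_us_city(city: City) -> bool:
    # "major US city": more than 1M population
    return city.population > 1_000_000


def is_particularly_vulnerable(city: City) -> bool:
    return (
        city.ambient_temp_c > 0.0                          # above freezing
        and city.population / city.hospital_beds > 100     # >100 people per hospital bed
        and city.nearby_host_animals > 100_000             # >100k anthrax host animals nearby
    )


def answer(cities):
    return [c.name for c in cities if is_major_us_city(c) and is_particularly_vulnerable(c)]


# Made-up data, not taken from any of the databases named in the talk.
sample = [
    City("CityA", 1_500_000, 12.0, 10_000, 250_000),
    City("CityB", 2_000_000, -3.0, 10_000, 250_000),
    City("CityC", 400_000, 15.0, 1_000, 500_000),
]
vulnerable = answer(sample)
```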
USGS GNIS DB: The Geographic Names Information System (GNIS) DB maintained by the US Geological Survey (USGS).

state | name              | type  | county     | state_fips | primary_lat | primary_long | elevation | population | status
------+-------------------+-------+------------+------------+-------------+--------------+-----------+------------+-------------------
TX    | Dallas            | ppl   | Dallas     | 48         | 32.78333    | -96.8        | 463       | 1022830    | BGN 1978 1959
MN    | Hennepin County   | civil | Hennepin   | 27         | 45.01667    | -93.45       | 0         | 1032431    |
CA    | Sacramento County | civil | Sacramento | 6          | 38.46667    | -121.31667   | 0         | 1041219    |
AZ    | Phoenix           | ppl   | Maricopa   | 4          | 33.44833    | -112.07333   | 1072      | 1048949    | BGN 1931 1900 1897
So how do we explain to our system that:
• row 1 of that table is “about” the city of Dallas, TX
• the population field of that table contains the number of inhabitants of the city that row is “about”
• here is exactly how to access tuples of that database
• that access will be fast, accurate, recent, complete
• the population field of that table contains the number of inhabitants of the city that row is “about”
We provide the field encodings and decodings, some of which correspond to explicit fields like population, two-letter state codes, etc.:
  (fieldDecoding Usgs-Gnis-LS ?x (TheFieldCalled "population") (numberOfInhabitants (TheReferentOfTheRow Usgs-Gnis) ?x))
• how to access tuples of that database
We provide all the information needed for a JDBC connection script. We assert, in the context (MappingMtFn Usgs-KS), all of these:
  (passwordForSKS Usgs-KS "geografy")
  (portNumberForSKS Usgs-KS 4032)
  (serverOfSKS Usgs-KS "sksi.cyc.com")
  (sqlProgramForSKS Usgs-KS PostgreSQL)
  (structuredKnowledgeSourceName Usgs-KS "usgs")
  (subProtocolForSKS Usgs-KS "postgresql")
  (userNameForSKS "sksi")
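As a rough illustration of why these assertions suffice for a connection script, the sketch below assembles a PostgreSQL-style JDBC URL from them. The dictionary encoding and the function name are hypothetical, and treating the structured knowledge source name "usgs" as the database name is an assumption; the slide does not say how Cyc actually combines the values.

```python
# Hypothetical dictionary view of the assertions made in (MappingMtFn Usgs-KS).
sks_assertions = {
    "serverOfSKS": "sksi.cyc.com",
    "portNumberForSKS": 4032,
    "subProtocolForSKS": "postgresql",
    "structuredKnowledgeSourceName": "usgs",  # assumed here to double as the database name
    "userNameForSKS": "sksi",
}


def jdbc_url(assertions: dict) -> str:
    """Build a URL of the form jdbc:<subprotocol>://<host>:<port>/<database>."""
    return "jdbc:{sub}://{host}:{port}/{db}".format(
        sub=assertions["subProtocolForSKS"],
        host=assertions["serverOfSKS"],
        port=assertions["portNumberForSKS"],
        db=assertions["structuredKnowledgeSourceName"],
    )


url = jdbc_url(sks_assertions)
```

The user name and password would be passed to the driver separately rather than embedded in the URL.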
• that access will be fast, accurate, recent, complete
We provide meta-level assertions about the database, about each table of the database, about the completeness etc. of various kinds of data in the DB, etc. We assert, in the context (MappingMtFn Usgs-KS):
  (schemaCompleteExtentKnownForValueTypeInArg Usgs-Gnis-LS USCity numberOfInhabitants 1)
• that access will be fast, accurate, recent, complete
We provide meta-level assertions about the database, about each table of the database, about the completeness etc. of various kinds of data in the DB, etc. We assert, in the context (MappingMtFn Usgs-KS):
  (resultSetCardinality Usgs-Gnis-PS
    (TheSet (PhysicalFieldFn Usgs-Gnis-PS "state"))
    TheEmptySet 60.0)
  (resultSetCardinality Usgs-Gnis-PS
    (TheSet (PhysicalFieldFn Usgs-Gnis-PS "primary_long")
            (PhysicalFieldFn Usgs-Gnis-PS "primary_lat")
            (PhysicalFieldFn Usgs-Gnis-PS "name"))
    (TheSet (PhysicalFieldFn Usgs-Gnis-PS "county")
            (PhysicalFieldFn Usgs-Gnis-PS "state"))
    530.36)
“What major US cities are particularly vulnerable to an anthrax attack?”
“major US city”
  – U.S. city with >1M population
“particularly vulnerable to an anthrax attack”
  – the current ambient temperature at ?C is above freezing, and
  – ?C has more than 100 people for each hospital bed, and
  – the number of anthrax host animals near ?C exceeds 100k
Cyc knows that pullets are chickens, so don’t add those two numbers together!
Even simple queries often require 1-4 reasoning steps
“In what countries bordering Pakistan are there members of the ANVC?”
Each answer that CAE finds for this generally involves a 1-4-step (not 0-step) argument (reasoning chain). E.g., for the answer “India”, the justification is:
• According to the web site ‘Inside Terrorism’, the ANVC’s headquarters has been in Garo Hills, India from the beginning of January, 1996 through today.
• If an organization’s HQ is in place x, then there are members of that organization in place x.
• If someone is in place x, they are in every super-region of x.
• India borders Pakistan.
Don’t include Prior & Tacit Knowledge
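The four-step justification can be mimicked with a few lookup tables and two rules. This is a toy sketch rather than CAE itself: the table names, the string constants, and the wording of the generated justification steps are assumptions for illustration.

```python
# Hypothetical fact tables standing in for KB assertions.
HEADQUARTERS = {"ANVC": "GaroHills"}
PART_OF = {"GaroHills": "India"}      # region -> immediate super-region
BORDERS = {("India", "Pakistan")}
COUNTRIES = {"India", "Pakistan"}


def super_regions(place):
    """Every region that contains `place`, nearest first."""
    regions = []
    while place in PART_OF:
        place = PART_OF[place]
        regions.append(place)
    return regions


def borders(a, b):
    return (a, b) in BORDERS or (b, a) in BORDERS


def countries_with_members_bordering(org, neighbor):
    """Map each answer country to the chain of steps that justifies it."""
    answers = {}
    hq = HEADQUARTERS.get(org)
    if hq is None:
        return answers
    steps = [
        f"{org}'s headquarters is in {hq}",
        f"so there are members of {org} in {hq}",   # HQ rule
    ]
    for region in super_regions(hq):
        # Super-region rule: anyone in a place is in every region containing it.
        steps = steps + [f"{hq} is in {region}, so those members are in {region}"]
        if region in COUNTRIES and borders(region, neighbor):
            answers[region] = steps + [f"{region} borders {neighbor}"]
    return answers


result = countries_with_members_bordering("ANVC", "Pakistan")
```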
The Cyc Knowledge Base
Cyc contains: 15,000 Predicates; 500,000 Concepts; 5,200,000 Assertions
(These numbers are not a good way to really get a handle on the Cyc KB.)
Represented in: • First Order Logic • Higher Order Logic • Context Logic • Micro-theories
[Ontology map, from general to specific; recoverable labels: Thing; Intangible Thing; Individual; Sets; Relations; Logic; Math; Temporal Thing; Spatial Thing; Partially Tangible Thing; Time; Space; Paths; Spatial Paths; Borders; Geometry; Events; Scripts; Agents; Physical Objects; Living Things; Materials; Parts; Statics; Artifacts; Actors; Actions; Movement; State Change; Dynamics; Plans; Goals; Physical Agents; Organization; Ecology; Natural Geography; Political Geography; Weather; Earth & Solar System; Life Forms; Plants; Animals; Human Beings; Human Anatomy & Physiology; Emotion; Perception; Belief; Human Behavior & Actions; Human Artifacts; Products; Devices; Conceptual Works; Vehicles; Buildings; Weapons; Mechanical & Electrical Devices; Software; Literature; Works of Art; Language; Social Relations, Culture; Social Behavior; Social Activities; Human Activities; Organizational Actions; Organizational Plans; Agent Organizations; Types of Organizations; Human Organizations; Business & Commerce; Purchasing; Shopping; Politics; Warfare; Sports; Recreation; Entertainment; Transportation & Logistics; Travel; Communication; Everyday Living; Nations; Governments; Geo-Politics; Professions; Occupations; Law; Business, Military Organizations; General Knowledge about Various Domains; Specific data, facts, and observations]
The Cyc Knowledge Base
Cyc contains: 15,000 Predicates; 500,000 Concepts; 5,200,000 Assertions
(These numbers are not a good way to really get a handle on the Cyc KB.)
“Is any seagull also a moose?” If Cyc knows 10,000 kinds of animals, it should be able to answer 100,000,000 queries like that.
• Option 1: Add those 100M assertions to the KB
• Option 2: Add 50M disjointWith assertions instead
• Option 3: Add about 10k Linnaean taxonomy assertions to the KB, plus one extra assertion:
    (isa BiologicalTaxon SiblingDisjointCollectionType)
  If taxons A and B are not explicitly known (via those 10k assertions) to be in a subset/superset relationship, then assume that they are disjoint.
A few hundred such SiblingDisjoint assertions take the place of over 6 billion disjointness ones… which in turn take the place of 100 trillion ones like this: (not (isa Cher Moose))
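Option 3 can be illustrated with a minimal taxonomy: store only the parent links and derive disjointness by default whenever neither taxon is an ancestor of the other. The five-taxon tree below is a made-up stand-in for the roughly 10k Linnaean assertions, and the function names are assumptions for this sketch.

```python
# Toy taxonomy (child -> parent); a stand-in for the ~10k Linnaean assertions.
PARENT = {
    "Bird": "Animal",
    "Mammal": "Animal",
    "Seagull": "Bird",
    "Moose": "Mammal",
}


def self_and_ancestors(taxon):
    seen = {taxon}
    while taxon in PARENT:
        taxon = PARENT[taxon]
        seen.add(taxon)
    return seen


def disjoint(a, b):
    """Sibling-disjoint default: taxa are disjoint unless one subsumes the other."""
    return a not in self_and_ancestors(b) and b not in self_and_ancestors(a)


# The arithmetic behind the slide's options, for 10,000 kinds of animals.
kinds = 10_000
ordered_pair_queries = kinds * kinds            # "Is any X also a Y?" for every X and Y
unordered_pairs = kinds * (kinds - 1) // 2      # symmetric disjointWith assertions
```

With the rule in place, `disjoint("Seagull", "Moose")` holds without any assertion that mentions both taxa, which is what lets 10k taxonomy links replace tens of millions of explicit disjointness assertions.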
There is no one correct monolithic ontology.
E.g., Cyc’s 5M axioms are divided into thousands of contexts by: granularity, topic, culture, geospatial place, time, ...
There is a correct monolithic reasoning mechanism, but it is so deadly slow that we never call on it unless we have to.
E.g., the Cyc inference engine is a community of 1000 “agents” that attack every problem and, recursively, every subproblem (subgoal). One of these 1000 is a general theorem prover; the others have special-purpose data structures/algorithms to handle the most important, most common cases, very fast.
What factors argue
Building Cyc qua Engineering Task
[Chart: rate of learning (vertical axis) versus amount known (horizontal axis), extending toward the frontier of human knowledge; time markers 1984, 2004, today. Curves: “codify & enter each piece of knowledge, by hand”; “learning via natural language”; “learning by discovery”. CYC so far: 900 person-years, 23 real-time years, $90 million.]
Temporal Relations
37 Relations Between Temporal Things, including: temporalBoundsIntersect, temporalBoundsContain, temporallyIntersects, temporalBoundsIdentical, startsAfterStartingOf, startsDuring, endsAfterEndingOf, overlapsStart, startingDate, startingPoint, temporallyContains, simultaneousWith, temporallyCooriginating, after
Temporal Relations
• “Ariel Sharon was in Jerusalem during 2005 with granularity calendar-week”
• “Condoleezza Rice made a ten-day trip to Jerusalem in February of 2005”
Therefore: Both of them were in Jerusalem during February 2005
Lessons Learned
• Rather than struggling to reason in natural language sentences, use logic as the representation language.
• Most knowledge is default; reason by argumentation
• Rather than striving in vain for a single fast inference engine, use a suite of 1000+ heuristic modules that each handles a class of commonly-occurring problems very fast. [EL HL split]
• Some of these HL modules act as tacticians (meta-reasoners) to guide the reasoning; a few are strategists (meta-reasoners)
• Bridging the knowledge gap: do the “intermediate theories.”
• Probabilities / certainty factors are useful (risk: overdependence)
• Rather than striving in vain for a monolithic consistent KB, divide the KB up into many locally-consistent contexts
Each assertion should be situated in a context: in a region of context-space
Dimensions:
• Anthropacity
• Time
• GeoLocation
• TypeOfPlace
• TypeOfTime
• Culture
• Sophistication/Security
• Topic
• Granularity
• Modality/Disposition/Epistemology
• Argument-Preference
• Justification
Progress:
• We identified 12 dimensions of mt-space
• We developed a vocabulary of predicates and terms to describe points and regions along each of those 12 dimensions; and
• We have been situating assertions more and more precisely, and we have been working out calculi for inferring contexts
  – E.g., if P is true in C1, and P=>Q is true in C2, in what context C3 can Q be validly concluded?
Mathematical Factoring of Context-space Dimensions
• UnitedStatesIn1985 Context: Ronald Reagan is president. There are at least 900,000 doctors.
• PennsylvaniaIn1985 Context: Dick Thornburgh is governor.
• LehighCountyInFebruary1985 Context: Dick Thornburgh is governor and Ronald Reagan is president. Dick Thornburgh is governor and there are at least 900,000 doctors.
[Diagram annotation, partly garbled in extraction: “This inference depends on the respective granularities of time, space, and the contexts.”]
Time Indices and Granularities
Doug is talking, at 14:00-15:00, on 4 May 2009.
Therefore Cyc should infer (as a default): Doug is talking, at 14:42-14:47, on 4 May 2009.
But should remain noncommittal about: Doug is talking, at 14:42:09, on 4 May 2009.
Time Indices and Granularities
Doug is talking, at 14:00 to 15:00, on 4 May 2009, with temporal granularity 1 calendar minute.
• P = “Doug is talking.”
• Granularity = CalendarMinutes
• t = that interval
• t’ = a continuous 15-min. sub-interval
[Timeline diagram running from Past to Future, marked in minutes, with t’ inside t]
So:
• Talking during each 15-minute interval? Yes
• Talking during each 2-second interval? Unknown
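One conservative way to read the granularity rule: an assertion made over t with granularity g can be carried down to a sub-interval only if that sub-interval contains at least one whole calendar unit of size g; otherwise the answer stays unknown. The sketch below encodes that reading. It is an interpretation for illustration, not Cyc's actual temporal calculus, and the function name and the use of `None` for "unknown" are assumptions.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def holds_on_subinterval(t_start, t_end, sub_start, sub_end, granule=timedelta(minutes=1)):
    """Return True if the assertion can be inferred for the sub-interval, else None (unknown)."""
    if sub_start < t_start or sub_end > t_end:
        return None  # outside the asserted interval: nothing follows
    # Find the first calendar-aligned granule boundary at or after sub_start.
    offset = (sub_start - EPOCH) % granule
    first_boundary = sub_start if offset == timedelta(0) else sub_start + (granule - offset)
    # The sub-interval must contain one complete granule.
    return True if first_boundary + granule <= sub_end else None


t_start = datetime(2009, 5, 4, 14, 0)
t_end = datetime(2009, 5, 4, 15, 0)

five_minutes = holds_on_subinterval(
    t_start, t_end, datetime(2009, 5, 4, 14, 42), datetime(2009, 5, 4, 14, 47))
two_seconds = holds_on_subinterval(
    t_start, t_end, datetime(2009, 5, 4, 14, 42, 9), datetime(2009, 5, 4, 14, 42, 11))
```

Under this reading the 14:42-14:47 question comes out true while the 2-second question stays unknown, matching the two slides.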
Relations Between an Event and its Participants
performedBy, causes-Event, objectPlaced, objectOfStateChange, outputsCreated, inputsDestroyed, assistingAgent, beneficiary, fromLocation, toLocation, deviceUsed, driverActor, damages, vehicle, providerOfMotiveForce, transportees
Over 400 more.
“In” in Our Geospatial Ontology
• We started in 1984 with just one binary predicate, “in”.
• in(X, Y) means the inner object X is spatially located in the region defined by the outer object Y.
• If I just tell you in(X, Y), and you aren’t told what X and Y are, then you (and Cyc) can’t answer questions like these:
  – From the outside of Y, can I see any part of X?
  – If I turn Y over and shake it, will X fall out?
  – Is there room to put more things in Y?
  – Is X actually a part of Y?
• Such failures led to our introducing new, more precise, more specialized versions of “in”. By now there are over 75 such predicates, organized in a graphical taxonomy.
Propositional Attitudes
Relations Between Agents and Propositions
• goals
• intends
• desires
• hopes
• expects
• believes
• opinesThat
• knowsThat
• remembersThat
• perceivesThat
• seesThat
• fearsThat
Most of these are modal; assertions using them go beyond 1st-order logic
Handcrafted Cyc KB
Cyc contains: 15,000 Predicates; 500,000 Concepts; 5,200,000 Assertions
Represented in: • First Order Logic • Higher Order Logic • Context Logic • Microtheories
The pump has been primed. Use it as an inductive bias to power more automatic knowledge acquisition.
[Ontology map repeated from the earlier “The Cyc Knowledge Base” slide, running from Thing at the top through Real World Domain Knowledge down to Specific cases, facts, details, …]
Automated Knowledge Acquisition (AKA) by Shallow Fishing
Query: (foundingDate AbuSayyaf ?X)
Search strings:
• Abu Sayyaf was founded in ___
• Al Harakat Islamiya, established in ___
• ASG was established in ___
Found text: “Abu Sayyaf was founded in the early 1990s”
Parse: (foundingDate AbuSayyaf (EarlyPartFn (DecadeFn 199)))
Automated Knowledge Acquisition (AKA) by Shallow Fishing
Query: (height EiffelTower ?x)
Search strings:
• The height of the Eiffel Tower is ___
• The Eiffel Tower is ___ tall
Found text:
• “The height of the Eiffel Tower is 36 feet”
• “The height of the Eiffel Tower is 984 feet”
Parse:
• (height EiffelTower (Foot 36))
• (height EiffelTower (Foot 984))
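The fishing step can be approximated with templates that contain a blank, turned into regular expressions and run over retrieved snippets. This is a simplified sketch, not Cycorp's parser: the function names are hypothetical, the pattern only recognizes whole numbers followed by "feet", and the third snippet is invented to exercise the second template.

```python
import re

TEMPLATES = [
    "The height of the Eiffel Tower is ___",
    "The Eiffel Tower is ___ tall",
]


def template_to_regex(template):
    """Replace the ___ blank with a capture group for '<number> feet'."""
    before, after = template.split("___")
    return re.compile(
        re.escape(before) + r"(\d[\d,]*)\s+feet" + re.escape(after),
        re.IGNORECASE,
    )


def fish(templates, snippets):
    """Return candidate CycL assertions, in order of discovery, without duplicates."""
    patterns = [template_to_regex(t) for t in templates]
    found = []
    for snippet in snippets:
        for pattern in patterns:
            match = pattern.search(snippet)
            if match:
                feet = int(match.group(1).replace(",", ""))
                assertion = f"(height EiffelTower (Foot {feet}))"
                if assertion not in found:
                    found.append(assertion)
    return found


snippets = [
    "The height of the Eiffel Tower is 36 feet",
    "The height of the Eiffel Tower is 984 feet",
    "The Eiffel Tower is 984 feet tall.",   # invented snippet for the second template
]
candidates = fish(TEMPLATES, snippets)
```

As on the slide, fishing returns conflicting candidates (36 feet and 984 feet); choosing between them is left to later reasoning over what the KB already knows.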
WWW. CYC. COM
CYC Recent/Future AKB Directions
• Make it comprehensive (13% → 100%); apply it to other domains
• Make it easier for SMEs to enter/vet/modify info.
• Improve the automatic acquisition (parsing / fishing from unstructured texts; SKSI to structured sources, incl. SPARQL)
• Make it easier for end users to pose questions:
  – Automatically select (a small superset of) the relevant fragments
  – Use semantic constraints (argIsa, disjointness, domain knowledge…) to combine the relevant fragments into a meaningful logical query
• Make justifications more terse and more compelling
• Speed up inference (in general; and for AKB entry and AKB query-answering)
• Graceful degradation [½-way between QA & Google], falling back on Semantic Search of auto-tagged documents (tagged with Cyc terms)
Developing a Cyc App.
• Extend Cyc’s KB
  – Augment its ontology
  – New assertions involving those new terms
• New Heuristic Level modules
  – Identify the need(s) for them
  – Design, build, and debug them
• New interface modules
  – For manual entry; for SKSI mapping; for end users
  – Domain-specific interfaces (e.g., sketching military unit movements; drawing chemical formulae; etc.)
OpenCyc: Open Source release of [most of] the Cyc Ontology + Simple Relns. + Inference Engine
ResearchCyc: Almost all of Cyc (free for R&D purposes)
The Ontology: Pre-existing general medical knowledge framework
Prior to the CCF project, Cyc’s KB had 184 specializations of MedicalCareEvent:
Tonsillectomy, GumSurgery, Ablation, SurgicalTreatment, Ligation, TransplantSurgery, CoronaryArteryBypassGraft, HeartTransplantSurgery, Biopsy-SurgicalProcedure, GeneralSurgery, TrephiningSomeone, Prostatectomy, MajorSurgery, RoboticSurgery, OpenHeartSurgery, OutpatientSurgery, RootCanalSurgery, InpatientSurgery, VaccinationEvent, LiposuctionSurgery, BoosterVaccinationEvent, RemovalOfUniqueBodyPart, AnthraxMilitaryVaccinationScript, Appendectomy, MedicalTesting, …
The Ontology: Pre-existing general medical knowledge framework
Prior to the CCF project, Cyc’s KB had 350+ specializations of AilmentCondition:
AttentionDeficitDisorder, Atherosclerosis, Glaucoma, SpinalStenosis, MultiplePersonalityDisorder, SleepDeprivation, Ache, Adenomyosis, Scabies, AilmentCondition, Migraine, AmyotrophicLateralSclerosis, Hemorrhaging-TheCondition, Scoliosis, Hypoglycemia, Jaundice, ParasiticAilment, TemproMandibularJointSyndrome, BacillaryAngiomatosis, AcetylcholinePoisoning, Cryptosporidiosis, Rickettsiosis, CadmiumPoisoning, EpidemicTyphus-NAmerica, CarbonMonoxidePoisoning, ArthropodInfestation, FoodborneBotulism, ExternalArthropodInfestation, InhalationalBotulism, InternalArthropodInfestation, WoundBotulism, InfantBotulism, Trichinosis, Schistosomiasis, Endometriosis, Neuralgia, Ascariasis, Sciatica, Diverticulitis, Gout, BladderFlukeInfestation, MacularDegeneration, …
The Ontology: Pre-existing general medical knowledge framework
Prior to the CCF project, Cyc’s KB had 200+ specializations of Bacterium:
StreptococcusPneumoniae, Asteroplasma-Genus, StreptococcusPyogenes, Acholeplasmatales-Order, Bacillaceae-Family, Acholeplasmataceae-Family, Acholeplasma-Genus, Bacillus-Genus, Phytoplasma-Genus, BacillusCereus-Species, Eperythrozoon-Genus, Monotrichous, Mycoplasmatales-Order, Bacterium-Monotrichous, Mycoplasmataceae-Family, Peritrichous, Mycoplasma-Genus, Bacterium-Peritrichous, MycoplasmaPneumoniae-Species, Amphitrichous, Spirillales-Order, Bacterium-Amphitrichous, Vibrionaceae-Family, Tenericutes-Division, Vibrio-Genus, Mollicutes-Class, VibrioCholerae-Species, Anaeroplasmataceae-Family, …
The Ontology: hundreds of pre-existing relevant relationships • General role predicates: objectActedOn, eventOccursAt, dateOfEvent, objectPlaced, objectRemoved, deviceUsed, … • Medical domain-specific relations: infectionCausedByOrganism, infectingPathogen, patientTreated, deviceTypeTreatsConditionType, causeOfDeathTypeOfType, formOfDisease, ailmentTypeAffects, ailmentEpidemicType, ailmentAcquiredBy, ailmentTypicallyAcquiredBy, indicatedDrug, mortalityRiskForCondition, survivalRate, riskOfInfectionFromTypeToType, …
The Ontology: Methodology • Establish bridging (translation) rules: define rules that allow users to associate patients, dates, locations, etc. with the various events – e.g., define patientTreated as a relationship between a medical event and a patient. • Define rules that allow users to easily express complicated logical conditions – e.g., the defining rules for PrimarySurgery, isolatedProcedureOfType, concomitantProcedures, etc. • Define concise vocabulary for constructions that are complicated or difficult to express – e.g., “aortic valve replacement” is represented as a single non-atomic term. This allows the user to specify this very common procedure with a single fragment instead of three distinct fragments in the CCF ontology (which in turn came about because there is no explicit functional term composition construct in the CCF
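A minimal sketch of what a bridging rule amounts to, assuming the source data arrives as flat rows. The predicate names patientTreated and dateOfEvent come from the slides; the row layout, the column names, and the (event, date) argument order for dateOfEvent are illustrative assumptions, not the CCF schema.

```python
from typing import Dict, Iterator, Tuple

# An assertion is (predicate, arg1, arg2), mirroring a binary CycL sentence.
Assertion = Tuple[str, str, str]


def bridge_row(row: Dict[str, str]) -> Iterator[Assertion]:
    """Translate one hypothetical database row into ontology-level assertions."""
    event = row["event_id"]
    # patientTreated relates a medical event to a patient (event first, as in the slides).
    yield ("patientTreated", event, row["patient_id"])
    # Only assert a date when the row actually has one.
    if row.get("date"):
        yield ("dateOfEvent", event, row["date"])


# Hypothetical rows; the identifiers are placeholders.
rows = [
    {"event_id": "evt-1", "patient_id": "pat-7", "date": "2003-05-01"},
    {"event_id": "evt-2", "patient_id": "pat-7", "date": ""},
]
assertions = [a for row in rows for a in bridge_row(row)]
```

Once rows are lifted into assertions like these, a query can talk about patients and dates without knowing how the underlying tables are laid out.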
Typical Query for outcomes study. The examples in this presentation were short, simple “Medical English” queries; the ones we focused on while building the application, and the ones being asked now that it is actually in use at CCF, are much larger, e.g.: IDENTIFY PATIENT POPULATION: • FIND all native aortic valve replacements performed at CCF between January 1, 2000 and December 31, 2004 with a pre-operative diagnosis, as determined by echocardiogram, of moderately severe or severe aortic stenosis and moderate to severe left ventricular impairment. • INCLUDE operations in which concomitant primary CABG or concomitant mitral or tricuspid valve repair was performed. • EXCLUDE all patients with any prior valve repair or replacement; or with concomitant pulmonary valve repair; or with concomitant mitral, tricuspid, or pulmonary valve replacement; or with aortic regurgitation greater than moderate degree.
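To make the FIND / INCLUDE / EXCLUDE structure concrete, here is a rough sketch of the query as a filter over operation records. The Operation record and its field names are hypothetical. The sketch reads INCLUDE as "these concomitant procedures are permitted" and leaves the diagnosis criteria (stenosis, ventricular impairment, regurgitation) out for brevity; both are simplifying assumptions rather than the system's actual interpretation.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet


@dataclass(frozen=True)
class Operation:
    """Hypothetical flattened record of one operation."""
    patient_id: str
    procedure: str
    performed_on: date
    concomitant: FrozenSet[str] = frozenset()
    prior_valve_surgery: bool = False


# Concomitant procedures named in the EXCLUDE clause.
EXCLUDED_CONCOMITANT = frozenset({
    "pulmonary valve repair",
    "mitral valve replacement",
    "tricuspid valve replacement",
    "pulmonary valve replacement",
})


def in_population(op: Operation) -> bool:
    """Apply the FIND clause, then the EXCLUDE clause."""
    if op.procedure != "native aortic valve replacement":
        return False
    if not (date(2000, 1, 1) <= op.performed_on <= date(2004, 12, 31)):
        return False
    if op.prior_valve_surgery:
        return False
    # Any overlap with the excluded set disqualifies the operation.
    if op.concomitant & EXCLUDED_CONCOMITANT:
        return False
    return True
```

Even in this stripped-down form the query needs a date range, a procedure type, a history check, and a set of concomitant procedures, which is why such requests are hard to phrase against a raw database.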
Researchers and clinicians sometimes ask the same queries: “Are there cases in the last decade where patients had pericardial aortic valves inserted in the reverse position, to serve as mitral valve replacements, and how often in such cases did endocarditis or tricuspid valve infection develop, and how long after the procedure?”
Applying – i.e., Using – Cyc • Get a large set of use-cases (CCF task: the last 900 queries) • Arrange them into maximally mutually-dissimilar classes (buckets) • Manually represent a couple from each of those buckets – reveals most of the necessary new predicates (+ interfaces) • Now go through each of the use-cases, trolling for new domain-specific terms to add to the ontology – can be done manually, but we are beginning to rely more on semi-automatic methods where the system itself helps with that process – as appropriate, lexify the terms and/or align them to existing standards • Run exemplars from each bucket (i.e., to completion) – tracer bullets to reveal necessary new rules and reasoning modules (+ interfaces) • Replace the largest bucket by 2-4 specializations, then recur (i.e., repeat the preceding 3 steps, and this one, again) until there is no new gain
Applying – i.e., Using – Cyc • Test the system on previously-unseen use-cases (or at least ones which were not among those previously selected from their bucket) • Have users try to use the system, and watch them (their results, of course, but also, to the extent possible, their time-feature trajectory) – Which features did they rarely or never use (to good effect)? – Which features did they make heavy use of? – Independent of this, ask them for their feedback and suggestions – Try to identify classes of users, which will translate into classes of documentation and training materials/regimes/interface specifics • All along, identify what elements of the ontology (if any) are proprietary, and assimilate everything else into future versions of OpenCyc and ResearchCyc
(implies
  (and
    (cCFhasLeftAtriumDiameter ?EVT ?D)
    (greaterThan ?D ((Centi Meter) 3.8))
    (patientTreated ?EVT ?PAT)
    (patientSex ?PAT FemaleHuman)
    (rdf-type ?EVT ?TYPE)
    (genls ?TYPE CCF-Evaluation))
  (isa ?EVT EvaluationThatIndicatesLeftAtrialEnlargement))
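Read procedurally, the rule above says: an evaluation of a female patient whose left atrium diameter exceeds 3.8 cm, and whose type is a specialization of CCF-Evaluation, indicates left atrial enlargement. The sketch below restates that check in Python; the Evaluation record, the toy GENLS table, and the type name CCF-Echocardiogram are assumptions made for illustration and are not taken from the Cyc KB.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Evaluation:
    """Hypothetical flattened view of one ?EVT and its patient."""
    event_type: str                 # the ?TYPE bound by rdf-type
    left_atrium_diameter_cm: float  # the ?D bound by cCFhasLeftAtriumDiameter
    patient_sex: str                # the value bound by patientSex


# Toy taxonomy, child -> parent. "CCF-Echocardiogram" is an assumed placeholder.
GENLS: Dict[str, str] = {"CCF-Echocardiogram": "CCF-Evaluation"}


def is_spec_of(term: str, ancestor: str) -> bool:
    """genls is reflexive and transitive: walk up until the ancestor or the root."""
    current: Optional[str] = term
    while current is not None:
        if current == ancestor:
            return True
        current = GENLS.get(current)
    return False


def indicates_left_atrial_enlargement(ev: Evaluation) -> bool:
    """All antecedents of the rule must hold for the conclusion to follow."""
    return (
        ev.left_atrium_diameter_cm > 3.8
        and ev.patient_sex == "FemaleHuman"
        and is_spec_of(ev.event_type, "CCF-Evaluation")
    )
```

The Python version hard-codes one direction of use; the CycL rule is declarative, so the inference engine can also use it to explain why an event was classified this way.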
1784 pieces of pre-existing (prior to this project) Cyc KB knowledge were used while handling a typical query. E.g.:
Inferred disjointness constraint: (disjointWith PericardialWindow-SurgicalProcedure MedicalPatient)
Justification [we are “counting” each of these assertions in the total]:
(genls PericardialWindow-SurgicalProcedure PericardialProcedure-Surgical) in UniversalVocabularyMt
(genls PericardialProcedure-Surgical CardiacProcedure-Surgical) in UniversalVocabularyMt
(genls CardiacProcedure-Surgical SurgicalProcedure) in UniversalVocabularyMt
(genls SurgicalProcedure MedicalCareEvent) in BaseKB
(genls MedicalCareEvent PhysicalSituation) in BaseKB
(genls PhysicalSituation Situation-Localized) in UniversalVocabularyMt
(genls Situation-Localized Situation) in UniversalVocabularyMt
(disjointWith SpatialThing-NonSituational Situation) in BaseKB
(genls EnduringThing-Localized SpatialThing-NonSituational) in UniversalVocabularyMt
(genls Agent-NonGeographical EnduringThing-Localized) in UniversalVocabularyMt
(genls EmbodiedAgent Agent-NonGeographical) in UniversalVocabularyMt
(genls PerceptualAgent-Embodied EmbodiedAgent) in UniversalVocabularyMt
(genls Animal PerceptualAgent-Embodied) in UniversalVocabularyMt
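The justification can be read as two upward genls chains that meet at one asserted disjointWith. A small sketch of that inference follows, using the assertions listed in the justification. The slide's list stops at Animal, so the edge from MedicalPatient to Animal below is an assumed placeholder for whatever chain the KB actually contains, and microtheories are ignored.

```python
from typing import Dict, List, Set, Tuple

# genls edges, child -> parents, as read from the justification.
GENLS: Dict[str, List[str]] = {
    "PericardialWindow-SurgicalProcedure": ["PericardialProcedure-Surgical"],
    "PericardialProcedure-Surgical": ["CardiacProcedure-Surgical"],
    "CardiacProcedure-Surgical": ["SurgicalProcedure"],
    "SurgicalProcedure": ["MedicalCareEvent"],
    "MedicalCareEvent": ["PhysicalSituation"],
    "PhysicalSituation": ["Situation-Localized"],
    "Situation-Localized": ["Situation"],
    "EnduringThing-Localized": ["SpatialThing-NonSituational"],
    "Agent-NonGeographical": ["EnduringThing-Localized"],
    "EmbodiedAgent": ["Agent-NonGeographical"],
    "PerceptualAgent-Embodied": ["EmbodiedAgent"],
    "Animal": ["PerceptualAgent-Embodied"],
    # ASSUMPTION: not shown on the slide; stands in for the missing links.
    "MedicalPatient": ["Animal"],
}

# The single directly asserted disjointness in the justification.
DISJOINT: List[Tuple[str, str]] = [("SpatialThing-NonSituational", "Situation")]


def ancestors(term: str) -> Set[str]:
    """All generalizations of term, including term itself."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in GENLS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_disjoint(a: str, b: str) -> bool:
    """Two collections are disjoint if some pair of their ancestors is asserted disjoint."""
    up_a, up_b = ancestors(a), ancestors(b)
    return any(
        (x in up_a and y in up_b) or (x in up_b and y in up_a)
        for x, y in DISJOINT
    )
```

With these edges, `inferred_disjoint("PericardialWindow-SurgicalProcedure", "MedicalPatient")` holds, which is the constraint the slide reports: a surgical procedure can never be a patient.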
Ideas for NLM Grand Challenges • Comprehensive Ontology of Medicine – Ties to terminological standards (SNOMED, ICD, …), lexical ones (WordNet), conceptual ones (Cyc) – Knowledge about/involving the concepts • English-to-English “translation” – Using the above ontology of medicine, plus models of discourse, models of classes of users (by age, occupation, etc.), and models of individual users (built up over time and stored HIPAA-securely) – Translate articles, web pages, medicine bottle labels, etc. into comprehensible form for that user – Contextualized for time, source, level of detail, … – Sample sub-project: multicultural English-to-English translation – In some cases this means literally writing more text (expanding its length) or paring it down (eliminating what the user already knows) – In less clear cases (where the user might or might not already know some piece of information), the best way to expand the original text might be to add footnotes containing the borderline information, and the best way to pare it down might be to relegate borderline material to footnote form – The translations needn’t just be static; they can sync with the user’s calendars, cell phones, computers, etc., to provide reminders, proactively send them relevant news articles or new warnings, and so on • Automated Clinical/Biomedical Discovery – Hypothesis formation, experiment design, data gathering, analysis, new terms & hypotheses