ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY

ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY
Knowledge Discovery
Marko Grobelnik, Dunja Mladenic
J. Stefan Institute, Slovenia

Contents
• Knowledge Discovery
• Large Scale Topic Ontology population
• Extraction of Semantic Networks from Text
• Active Learning for efficient use of human interventions
• Methods Addressing Different Aspects of Ontology Construction
• Final Remarks

Why is Knowledge Discovery appropriate for Semantic Web?
• Idea: let a computer search for knowledge, while humans give only broad directions about where and how to search
• Knowledge discovery (KD) can be defined as a research area with several subfields: Machine Learning, Data Mining and Databases (Mitchell, 1997; Fayyad et al., 1996; Witten and Frank, 1999; Hand et al., 2001)
• KD techniques are mainly about discovering structure in the data; they can serve as one of the key mechanisms for structuring knowledge into an ontological structure that is then used in the knowledge management process
• Data and the corresponding semantic structures change over time; a sub-field of KD called “stream mining” deals with these kinds of problems
• Semantic Web is ultimately concerned with real-life data on the web, which grows exponentially; scalability is one of the central issues in KD

Machine Learning view to Ontology Generation

Knowledge Discovery Techniques
• Knowledge discovery technologies can be used to support different phases and scenarios for ontology generation
• Observations:
1. Completely automatic construction of ontologies is in general not possible for:
  • theoretical reasons (e.g., information bottleneck) and
  • practical reasons (e.g., the soft nature of the knowledge being conceptualized).
2. Human interventions are necessary but costly in terms of resources
  • …therefore the technology should help in efficient utilization of human interventions.
3. Document databases are the most common data type conceptualized in the form of ontologies

What is Ontology?
• In most ML contexts we can refer to an ontology as a graph/network structure consisting of:
  – a set of concepts (vertices in a graph)
    • …each concept Ci is described by a membership function ci(x)
  – a set of relations connecting concepts (directed edges in a graph)
    • …each relation Ri is described by a membership function ri(Ci, Cj)
  – a set of instances (data records assigned to concepts or relations)
    • …each instance Ii is described by a set of features Fi,j

Example: we have 7 concepts (C1…C7) and 3 relations (R1…R3)
• …each concept and relation is populated by a number of instances (data records)
[Figure: graph of concepts C1…C7 connected by directed edges labelled R1, R2 and R3]

Ontology Definition
• Ontology is defined as a tuple with 5 sets of objects:
  – Ontology = <Classes, Relations, Instances, Class-Definitions, Relation-Definitions>
  – …in short: O = <C, R, I, CD, RD>
• …where
  – Classes: set of labels Ci
  – Relations: set of labels Ri
  – Instances: set of instance feature vectors Ii
  – Class-Definitions: set of class membership functions CDi
  – Relation-Definitions: set of relation membership functions RDi
• …the idea is to describe “ontology learning tasks” in the above terms
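The five-component definition above can be written down as a small data structure. This is only a minimal sketch: the class name `Ontology`, the `members` helper, the 0.5 threshold and the toy "Science" membership function are illustrative assumptions, not part of the original slides.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

FeatureVector = Dict[str, float]  # assumed sparse representation of an instance


@dataclass
class Ontology:
    """Container for the five sets: classes, relations, instances, CD, RD."""
    classes: List[str] = field(default_factory=list)      # class labels
    relations: List[str] = field(default_factory=list)    # relation labels
    instances: List[FeatureVector] = field(default_factory=list)
    # class label -> membership function over one instance
    class_defs: Dict[str, Callable[[FeatureVector], float]] = field(default_factory=dict)
    # relation label -> membership function over a pair of class labels
    relation_defs: Dict[str, Callable[[str, str], float]] = field(default_factory=dict)

    def members(self, label: str, threshold: float = 0.5) -> List[FeatureVector]:
        """Instances whose membership value for `label` reaches the threshold."""
        cd = self.class_defs[label]
        return [x for x in self.instances if cd(x) >= threshold]


# Toy example with a hand-written membership function (hypothetical data).
onto = Ontology(
    classes=["Science"],
    instances=[{"telescope": 1.0}, {"recipe": 1.0}],
    class_defs={"Science": lambda x: 1.0 if "telescope" in x else 0.0},
)
science_docs = onto.members("Science")
```

In these terms, an ontology learning task amounts to filling in whichever of the five fields are missing from the ones that are given.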

Ontology Learning
• Ontology learning is a set of tasks based on the previous ontology definition
  – We define ontology learning tasks in terms of mappings between ontology components, where some of the components are given, some are missing, and we want to induce the missing ones
• Some typical scenarios:
  – Inducing classes/Clustering of instances:
    • C, CD = f(I)
  – Ontology population:
    • CD, RD = f(C, R, I)
  – Ontology generation:
    • C, R, CD, RD = f(I) (hardest task)

Representational language
• When learning the function f, we need to select a language for representing the membership function f
  – Examples:
    • Linear functions (Support-Vector-Machines, …)
    • Propositional logic (decision trees, rules, …)
    • First order logic (Inductive Logic Programming)
• …by selecting different representation languages we decide about
  – …the power of the descriptions
  – …the complexity of computation

Ontology Quality
• For the same set of instances I we can have multiple ontologies OI
• We need a function q for measuring the quality of a given ontology OI
  – …function q returns a numerical value
  – …the best ontology is the one with the highest quality
• Possible evaluation measures:
  – (1) analysis of statistical properties of structured data,
  – (2) agreement with the properties derived from manually built ontologies,
  – (3) optimization of the efficiency of the user's behaviour when using an ontology,
  – (4) using background knowledge, and
  – (5) building hybrid measures (combination of various approaches).

Search for “optimal” Ontology
• Given a set of instances I, we develop a series of ontologies
  – O1, O2, O3, …
  – …where we have a set of transformation operators (refinement operators) going from Oi to Oi+1
  – A good search procedure would select transformations that lead efficiently towards the highest quality q(Oi)
  – …this formulation is in line with “machine learning with structured output”
  – …we could use a human in the loop by using active learning techniques
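One simple way to read the search formulation above is as greedy hill climbing over refinement operators. The sketch below is an assumption-laden toy, not the authors' procedure: an "ontology" is just a partition of numeric instances into clusters, the only refinement operator merges two clusters, and `quality` is a made-up q that penalises spread and cluster count.

```python
from itertools import combinations


def merges(partition):
    """Refinement operator (assumed): every partition obtained by merging two clusters."""
    for i, j in combinations(range(len(partition)), 2):
        merged = tuple(sorted(partition[i] + partition[j]))
        rest = [c for k, c in enumerate(partition) if k not in (i, j)]
        yield tuple(sorted(rest + [merged]))


def quality(partition, penalty=2.0):
    """Hypothetical q: small within-cluster spread and few clusters are better."""
    spread = sum(max(c) - min(c) for c in partition)
    return -spread - penalty * len(partition)


def greedy_search(start, refine, q, max_steps=100):
    """Move from O_i to the best refinement O_{i+1} while q keeps improving."""
    current, best_q = start, q(start)
    for _ in range(max_steps):
        candidates = list(refine(current))
        if not candidates:
            break
        nxt = max(candidates, key=q)
        if q(nxt) <= best_q:
            break  # no refinement improves the quality: local optimum
        current, best_q = nxt, q(nxt)
    return current


start = ((1,), (2,), (10,), (11,))  # every instance in its own cluster
best = greedy_search(start, merges, quality)
```

A human in the loop could replace or veto the `max(candidates, key=q)` choice at each step, which is where active learning would plug in.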

Large Scale Topic Ontology population

Text categorization into large topic ontology
• Categorization of documents into a large topic ontology is one of the problems in text mining:
  – …needs to be scalable
    • …e.g. being able to handle DMoz’s 600K categories and 4M docs.
  – …needs to be accurate
    • …having accuracy on the level of inter-human agreement (60-80%)
  – …needs to be robust
    • …taking into account the nature of web pages (typically mixed quality content and often high quality context)

Approaches for handling hierarchy of categories
• There are several topic ontologies (taxonomies) of textual documents:
  – Yahoo, DMoz, Medline, …
• Different people use different approaches:
  – …series of hierarchically organized classifiers
  – …set of independent classifiers just for leaves
  – …set of independent classifiers for all nodes

Yahoo! topic ontology (taxonomy)
• human constructed hierarchy of Web documents
• exists in several languages
• easy to access and regularly updated
• captures most of the Web topics
• English version includes over 2M pages categorized into 50,000 categories
• contains about 250 MB of HTML files

Document to categorize: CFP for CoNLL-2000

Some predicted categories

System architecture
[Figure: processing pipeline]
• Training: Web → labeled documents (from Yahoo! hierarchy) → feature construction (vectors of n-grams) → subproblem definition → feature selection → classifier construction → document classifier
• Classification: unlabeled document → document classifier → category (label)
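The feature construction step ("vectors of n-grams") could look roughly like the sketch below. It assumes word n-grams of length 1 to `max_n` with raw counts and whitespace tokenisation; the slides do not specify these details, so treat every choice here as illustrative.

```python
from collections import Counter


def ngram_features(text, max_n=2):
    """Sparse vector of word n-gram counts for n = 1..max_n (assumed scheme)."""
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


vec = ngram_features("machine learning for machine translation")
```

Feature selection would then keep only a subset of these n-grams before a classifier is built for each subproblem.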

Content categories
• For each content category, generate a separate classifier that predicts the probability that a new document belongs to its category

Summary of experimental results on Yahoo!

DMoz / ODP is the largest topic ontology on the web: 4M sites, 68K editors, 600K concepts

Categorization into DMoz
1. On input we take DMoz RDF taxonomy data
  – …from http://rdf.dmoz.org/
  – …we preprocess it into an efficient binary structure
2. …next, we build a classification model consisting of models for individual categories
  – We take the hierarchical nature into account
3. Using the classification model we classify new documents into the taxonomy
4. On output we get, for a given document text and URL:
  – Set of most relevant categories from DMoz
  – Set of most relevant keywords calculated from DMoz category names (segments from the path names)

What is used for learning?
• Currently the system uses hierarchical nearest neighbor
• …in the past we experimented with Naïve Bayes for the Yahoo taxonomy (http://kt.ijs.si/Dunja/yplanet.html)
  – …heavy feature selection was needed
• …we plan to experiment with Support Vector Machine (SVM) algorithms
  – …we plan to use this for the ACM KDD Cup 2005 Challenge
• Scalability is a problem for learning and classification when dealing with 600K classes and 4M documents
• The approaches still need to be properly evaluated
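The slides do not say how the hierarchical nearest neighbor classifier is implemented. One plausible reading, sketched here purely as an assumption, is a top-down descent in which each category is represented by a centroid and the document follows the most similar child at every level. The taxonomy, centroids and cosine similarity below are all hypothetical toy choices.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical toy taxonomy: category path -> child category paths.
TAXONOMY = {
    "Top": ["Top/Science", "Top/Sports"],
    "Top/Science": ["Top/Science/Astronomy", "Top/Science/Biology"],
}
# Hypothetical centroids (term weights) for every non-root category.
CENTROIDS = {
    "Top/Science": {"telescope": 1.0, "galaxy": 1.0, "cell": 1.0},
    "Top/Sports": {"football": 1.0, "goal": 1.0},
    "Top/Science/Astronomy": {"telescope": 2.0, "galaxy": 2.0, "star": 1.0},
    "Top/Science/Biology": {"cell": 2.0, "gene": 1.0},
}


def classify(doc, root="Top"):
    """Descend the hierarchy, choosing the nearest child centroid at each level."""
    path = root
    while TAXONOMY.get(path):
        path = max(TAXONOMY[path], key=lambda c: cosine(doc, CENTROIDS[c]))
    return path


doc = Counter("hubble telescope images of a galaxy".split())
category = classify(doc)
```

Descending the tree compares the document with only the children of one node per level, which is one reason a hierarchical scheme can stay fast with 600K classes.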

Performance issues
• Preprocessing of DMoz (from RDF to classification model) takes approx. 1h
• For classification into the whole DMoz we need Win64 with at least 6 GB memory
  – …subsets of DMoz run on Win32 with 2 GB
• Classification into DMoz is fast
  – …~20 document classifications per second
  – …e.g. the whole Wikipedia was classified into DMoz in several hours

Demos
• Demo software for classification into http://dmoz.org/Science/ available at http://agava.ijs.si/~marko/DMozClassifyDemo.zip (~40 MB)
  – …includes AVI file with demo movie
  – …demo runs at http://alchemist.ijs.si:11111/
• Demo for classification into the whole DMoz (all 600K classes) runs at http://alchemist.ijs.si:22222/

Example classification of URL of a web page
[Figure: classification of the Hubble telescope web page, showing the predicted keywords and categories]

Example classification of URL + text of a web page

Extracting Semantic Graph from text

Summarization with semantic graph (Leskovec, Grobelnik, Milic-Frayling 2005)
• Idea: extract a semantic network from text documents and identify the relevant parts of the semantic network to represent the summary
• The “semantic graph” representation is used for the summarization task (DUC Challenge)
• The main research result is the finding that the topology of the extracted semantic graph helps in determining the importance of the content triples (which Subject-Predicate-Object triple is relevant)
• …joint collaboration with Microsoft Research, Cambridge

Approach Description
• Approach:
  – Learn a machine learning model for selecting sentences
  – Use information about the semantic structure of the document (concepts and relations among concepts)
• Results are promising: achieved 70% recall and 25% precision on extracted Subject-Predicate-Object triples on DUC (Document Understanding Conference) data

Summarization
[Figure: summarization pipeline]
• Original document → creation of semantic network → semantic net of Subj-Pred-Obj triples (document)
• Automatic summarization by selecting relevant triples (mapping between graphs learned with ML methods) → semantic net of Subj-Pred-Obj triples (summary)
• Nat. Lang. Generation → automatically built document summary (not done by us)
• Manual summarization → human built document summary
• 70% recall, 40% precision of selected triples according to human generated summaries

Original document: Cracks Appear in U.N. Trade Embargo Against Iraq.

Cracks appeared Tuesday in the U.N. trade embargo against Iraq as Saddam Hussein sought to circumvent the economic noose around his country. Japan, meanwhile, announced it would increase its aid to countries hardest hit by enforcing the sanctions. Hoping to defuse criticism that it is not doing its share to oppose Baghdad, Japan said up to $2 billion in aid may be sent to nations most affected by the U.N. embargo on Iraq. President Bush on Tuesday night promised a joint session of Congress and a nationwide radio and television audience that "Saddam Hussein will fail" to make his conquest of Kuwait permanent. "America must stand up to aggression, and we will," said Bush, who added that the U.S. military may remain in the Saudi Arabian desert indefinitely. "I cannot predict just how long it will take to convince Iraq to withdraw from Kuwait," Bush said. More than 150,000 U.S. troops have been sent to the Persian Gulf region to deter a possible Iraqi invasion of Saudi Arabia. Bush's aides said the president would follow his address to Congress with a televised message for the Iraqi people, declaring the world is united against their government's invasion of Kuwait. Saddam had offered Bush time on Iraqi TV. The Philippines and Namibia, the first of the developing nations to respond to an offer Monday by Saddam of free oil - in exchange for sending their own tankers to get it - said no to the Iraqi leader. Saddam's offer was seen as a none-too-subtle attempt to bypass the U.N. embargo, in effect since four days after Iraq's Aug. 2 invasion of Kuwait, by getting poor countries to dock their tankers in Iraq. But according to a State Department survey, Cuba and Romania have struck oil deals with Iraq and companies elsewhere are trying to continue trade with Baghdad, all in defiance of U.N. sanctions. Romania denies the allegation.

The report, made available to The Associated Press, said some Eastern European countries also are trying to maintain their military sales to Iraq. A well-informed source in Tehran told The Associated Press that Iran has agreed to an Iraqi request to exchange food and medicine for up to 200,000 barrels of refined oil a day and cash payments. There was no official comment from Tehran or Baghdad on the reported food-for-oil deal. But the source, who requested anonymity, said the deal was struck during Iraqi Foreign Minister Tariq Aziz's visit Sunday to Tehran, the first by a senior Iraqi official since the 1980-88 gulf war. After the visit, the two countries announced they would resume diplomatic relations. Well-informed oil industry sources in the region, contacted by The AP, said that although Iran is a major oil exporter itself, it currently has to import about 150,000 barrels of refined oil a day for domestic use because of damages to refineries in the gulf war. Along similar lines, ABC News reported that following Aziz's visit, Iraq is apparently prepared to give Iran all the oil it wants to make up for the damage Iraq inflicted on Iran during their conflict. Secretary of State James A. Baker III, meanwhile, met in Moscow with Soviet Foreign Minister Eduard Shevardnadze, two days after the U.S.-Soviet summit that produced a joint demand that Iraq withdraw from Kuwait. During the summit, Bush encouraged Mikhail Gorbachev to withdraw 190 Soviet military specialists from Iraq, where they remain to fulfill contracts. Shevardnadze told the Soviet parliament Tuesday the specialists had not reneged on those contracts for fear it would jeopardize the 5,800 Soviet citizens in Iraq. In his speech, Bush said his heart went out to the families of the hundreds of Americans held hostage by Iraq, but he declared, "Our policy cannot change, and it will not change. America and the world will not be blackmailed." The president added: "Vital issues of principle are at stake. Saddam Hussein is literally trying to wipe a country off the face of the Earth."

In other developments:
- A U.S. diplomat in Baghdad said Tuesday up to 800 Americans and Britons will fly out of Iraqi-occupied Kuwait this week, most of them women and children leaving their husbands behind. Saddam has said he is keeping foreign men as human shields against attack. On Monday, a planeload of 164 Westerners arrived in Baltimore from Iraq. Evacuees spoke of food shortages in Kuwait, nighttime gunfire and Iraqi roundups of young people suspected of involvement in the resistance. "There is no law and order," said Thuraya, 19, who would not give her last name. "A soldier can rape a father's daughter in front of him and he can't do anything about it."
- The State Department said Iraq had told U.S. officials that American males residing in Iraq and Kuwait who were born in Arab countries will be allowed to leave. Iraq generally has not let American males leave. It was not known how many men the Iraqi move could affect.
- A Pentagon spokesman said "some increase in military activity" had been detected inside Iraq near its borders with Turkey and Syria. He said there was little indication hostilities are imminent.

Defense Secretary Dick Cheney said the cost of the U.S. military buildup in the Middle East was rising above the $1 billion-a-month estimate generally used by government officials. He said the total cost - if no shooting war breaks out - could total $15 billion in the next fiscal year beginning Oct. 1. Cheney promised disgruntled lawmakers "a significant increase" in help from Arab nations and other U.S. allies for Operation Desert Shield. Japan, which has been accused of responding too slowly to the crisis in the gulf, said Tuesday it may give $2 billion to Egypt, Jordan and Turkey, hit hardest by the U.N. prohibition on trade with Iraq. "The pressure from abroad is getting so strong," said Hiroyasu Horio, an official with the Ministry of International Trade and Industry. Local news reports said the aid would be extended through the World Bank and International Monetary Fund, and $600 million would be sent as early as mid-September. On Friday, Treasury Secretary Nicholas Brady visited Tokyo on a world tour seeking $10.5 billion to help Egypt, Jordan and Turkey. Japan has already promised a $1 billion aid package for multinational peacekeeping forces in Saudi Arabia, including food, water, vehicles and prefabricated housing for non-military uses. But critics in the United States have said Japan should do more because its economy depends heavily on oil from the Middle East. Japan imports 99 percent of its oil. Japan's constitution bans the use of force in settling international disputes and Japanese law restricts the military to Japanese territory, except for ceremonial occasions.

On Monday, Saddam offered developing nations free oil if they would send their tankers to pick it up. The first two countries to respond Tuesday - the Philippines and Namibia - said no. Manila said it had already fulfilled its oil requirements, and Namibia said it would not "sell its sovereignty" for Iraqi oil. Venezuelan President Carlos Andres Perez dismissed Saddam's offer of free oil as a "propaganda ploy." Venezuela, an OPEC member, has led a drive among oil-producing nations to boost production to make up for the shortfall caused by the loss of Iraqi and Kuwaiti oil from the world market. Their oil makes up 20 percent of the world's oil reserves. Only Saudi Arabia has higher reserves. But according to the State Department, Cuba, which faces an oil deficit because of reduced Soviet deliveries, has received a shipment of Iraqi petroleum since U.N. sanctions were imposed five weeks ago. And Romania, it said, expects to receive oil indirectly from Iraq. Romania's ambassador to the United States, Virgil Constantinescu, denied that claim Tuesday, calling it "absolutely false and without foundation."

Human built document summary:
Cracks appeared in the U.N. trade embargo against Iraq. The State Department reports that Cuba and Romania have struck oil deals with Iraq as others attempt to trade with Baghdad in defiance of the sanctions. Iran has agreed to exchange food and medicine for Iraqi oil. Saddam has offered developing nations free oil if they send their tankers to pick it up. Thus far, none has accepted. Japan, accused of responding too slowly to the Gulf crisis, has promised $2 billion in aid to countries hit hardest by the Iraqi trade embargo. President Bush has promised that Saddam's aggression will not succeed.

Detailed Summarization Procedure
• Linguistic analysis of the text
  – Deep parsing of sentences
• Refinement of the text parse
  – Named-entity consolidation: determine that ’George Bush’ = ‘U.S. president’
  – Anaphora resolution: link pronouns with named entities
    • Tom Sawyer went to town. He met a friend. Tom was happy. …
    • → Tom Sawyer went to town. He [Tom Sawyer] met a friend. Tom [Tom Sawyer] was happy. …
• Extract Subject–Predicate–Object triples
  – Tom go town
  – Tom meet friend
  – Tom is happy
• Compose a graph from triples
• Describe each triple with a set of features for learning
• Learn a model to classify triples into the summary
• Generate a summary graph
• Use summary graph to generate textual document summary

Named entities consolidation
• Consolidating different surface forms that refer to the same entity
  – only for names of people, places, companies, etc.
• Example:
  – Hillary Rodham Clinton, Hillary Rodham, Mrs. Clinton → Hillary Clinton
• Heuristic based on the overlap in the surface forms of name variants
• Accuracy on a subset of the data set: ~90%.
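The slide only says that the heuristic relies on overlap between surface forms. A minimal sketch of such a heuristic follows, under stated assumptions: names are grouped when they share at least one token after dropping a small hypothetical list of titles, and the longest surface form is used as the canonical name (the slide's own canonical form, "Hillary Clinton", is not derived this way).

```python
TITLES = {"mr.", "mrs.", "ms.", "dr."}  # hypothetical stop list of titles


def name_tokens(name):
    """Lower-cased tokens of a surface form, without titles."""
    return {t for t in name.lower().split() if t not in TITLES}


def consolidate(names):
    """Greedily group surface forms that share at least one token."""
    clusters = []  # list of (token set, list of surface forms)
    for name in names:
        toks = name_tokens(name)
        for cluster_tokens, forms in clusters:
            if cluster_tokens & toks:
                cluster_tokens.update(toks)
                forms.append(name)
                break
        else:
            clusters.append((set(toks), [name]))
    # Assumption: the longest surface form stands in as the canonical name.
    return {max(forms, key=len): forms for _, forms in clusters}


groups = consolidate(
    ["Hillary Rodham Clinton", "Hillary Rodham", "Mrs. Clinton", "Saddam Hussein"]
)
```

A pure token-overlap rule would also merge different people who share a surname, which is one reason a heuristic of this kind can stay below perfect accuracy.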

Pronominal anaphora resolution
• Link pronouns with their references
  – Mary likes Paul. She went to buy him a present.
  – → Mary likes Paul. She [Mary] went to buy him [Paul] a present.
• Method:
  – restrict to 5 pronouns: he, they, I, she, who.
  – from the pronoun, traverse the text searching for candidate references and assign a score
  – the score is based on the distance from the pronoun and semantic information
  – assume that pronouns refer only to named entities found in the document
• Problem:
  – One passenger in King's car said they had been drinking liquor.
• Average accuracy on 1,500 hand-labeled pronouns: 81.2%
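The scoring idea in the method above (distance from the pronoun plus semantic information) can be illustrated with a toy resolver. Everything concrete here is an assumption: the `Entity` record, the gender attribute standing in for "semantic information", the inverse-distance score and the backward-only search are illustrative, not the scoring function actually used.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    position: int  # token index of the mention in the document
    gender: str    # "f", "m" or "n"; hypothetical semantic attribute


# Hypothetical mapping from pronoun to the gender it requires.
PRONOUN_GENDER = {"she": "f", "her": "f", "he": "m", "him": "m"}


def resolve(pronoun, pronoun_pos, entities):
    """Pick the closest preceding named entity compatible with the pronoun."""
    wanted = PRONOUN_GENDER.get(pronoun.lower())
    best, best_score = None, 0.0
    for e in entities:
        if e.position >= pronoun_pos:
            continue  # only search backwards from the pronoun
        if wanted is not None and e.gender != wanted:
            continue  # semantic filter
        score = 1.0 / (pronoun_pos - e.position)  # closer candidates score higher
        if score > best_score:
            best, best_score = e.name, score
    return best


# "Mary likes Paul . She went to buy him a present ."
entities = [Entity("Mary", 0, "f"), Entity("Paul", 2, "m")]
she = resolve("She", 4, entities)
him = resolve("him", 8, entities)
```

In the "King's car" problem sentence, a resolver of this kind would look for a named entity even though "they" refers to the unnamed passengers, which illustrates the cost of assuming that pronouns refer only to named entities.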

Anaphora resolution evaluation

Pronoun   Frequency   Frequency [%]   Accuracy [%]
He        681         45.22           86.9
They      244         16.20           67.2
It        204         13.55
I         64          4.25            82.8
You       50          3.32
We        44          2.92
That      44          2.92
What      27          1.79
She       24          1.59            62.5
This      22          1.46
Who       11          0.73            63.6
…
Total     1506        100             81.2

Accuracy on the 5 selected pronouns: 81.2% (55.2% if counting all pronouns)
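The two headline figures are consistent with each other, as a quick calculation shows. The sketch assumes that the five selected pronouns are he, they, I, she and who, and that pronouns outside this set are counted as unresolved; the counts and the 81.2% figure are taken from the table.

```python
# Occurrence counts of the five pronouns assumed to be the selected ones.
counts = {"he": 681, "they": 244, "I": 64, "she": 24, "who": 11}
total_pronouns = 1506        # all hand-labeled pronouns in the table
accuracy_on_selected = 0.812

selected = sum(counts.values())            # pronouns the method attempts
correct = accuracy_on_selected * selected  # estimated correctly resolved pronouns
overall = correct / total_pronouns         # unattempted pronouns count as errors

print(f"selected pronouns: {selected}")
print(f"accuracy over all pronouns: {overall:.1%}")
```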

Extracting triples
• Enhanced parse tree is traversed to identify Subject–Predicate–Object triples
• Example: “Conservatives embraced the nomination while liberals were cautious or hostile”
• Resulting triples:
  – conservative embrace nomination
  – liberal is cautious
  – liberal is hostile

Training of summarization model
• Model ranks Subject-Predicate-Object triples according to their importance
[Figure: document semantic network → summary semantic network]

Composing a graph
• The graph consists of nodes, referred to as concepts, which can be subjects or objects, and edges, which are predicates and capture relations among concepts.
• Use WordNet to identify and merge synonym nodes, as they correspond to the same concepts.
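Composing the graph from triples can be sketched as building a labelled adjacency list. The `SYNONYMS` dictionary is a hypothetical stand-in for the WordNet lookup (no real WordNet API is used here), and the fourth triple is invented to show two synonym nodes being merged.

```python
from collections import defaultdict

# Hypothetical stand-in for a WordNet synonym lookup: word -> canonical concept.
SYNONYMS = {"pal": "friend"}


def build_graph(triples):
    """Map each concept to its outgoing (predicate, object concept) edges."""
    graph = defaultdict(list)
    for subj, pred, obj in triples:
        subj = SYNONYMS.get(subj, subj)  # merge synonym nodes
        obj = SYNONYMS.get(obj, obj)
        graph[subj].append((pred, obj))
        graph[obj]  # make sure the object also exists as a node
    return dict(graph)


triples = [
    ("Tom", "go", "town"),
    ("Tom", "meet", "friend"),
    ("Tom", "is", "happy"),
    ("pal", "live", "town"),  # invented triple; "pal" merges into "friend"
]
graph = build_graph(triples)
```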

Feature construction
• Each triple is described for learning by the following attributes:
  – Positional information
    • Of the sentence from which the triple was derived, relative to the document text
    • Of the triple, relative to the beginning of the sentence
  – Linguistic attributes of the nodes in the triple (NLP):
    • 18 syntactic attributes
    • 100 semantic attributes
  – 14 graph attributes: PageRank, In/Out Degree, reachable neighbours, etc.
• The resulting dataset has a total of 466 attributes, with on average 72 non-zero attributes per triple.
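Two of the graph attributes named above, in/out degree and PageRank, can be computed from the triple graph as in the sketch below. It assumes a plain power-iteration PageRank with damping 0.85 and uniform redistribution of rank from dangling nodes; the slides do not state which variant or parameters were used, and the edge list is a toy example.

```python
from collections import Counter


def degrees(edges):
    """In-degree and out-degree of every node in a directed edge list."""
    indeg, outdeg = Counter(), Counter()
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    return indeg, outdeg


def pagerank(edges, damping=0.85, iters=50):
    """Power-iteration PageRank; dangling nodes spread their rank uniformly."""
    nodes = sorted({n for e in edges for n in e})
    out = {n: [d for s, d in edges if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        dangling = sum(rank[n] for n in nodes if not out[n])
        new = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        for n in nodes:
            for d in out[n]:
                new[d] += damping * rank[n] / len(out[n])
        rank = new
    return rank


# Subject -> object edges of a toy semantic graph (predicates omitted).
edges = [("Tom", "town"), ("Tom", "friend"), ("friend", "town"), ("Mary", "town")]
indeg, outdeg = degrees(edges)
rank = pagerank(edges)
```

Per-node values such as these could then be attached to a triple through its subject and object nodes.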

Experiments
• Machine learning with Linear SVM to classify triples into relevant or not relevant for the summary
  – Positive examples are triples from the sentences which were marked as summary sentences by experts
  – Negative examples are all other triples
• Data:
  – 147 documents from DUC 2002 for which we had extracted summaries.
• Evaluation:
  – Report microaveraged values of precision, recall and F1 for the extracted triples using 10-fold cross validation.
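Microaveraging pools the per-fold counts before computing the ratios, rather than averaging per-fold scores. The sketch below shows the computation on made-up counts; the fold data is hypothetical and the experiment's actual counts are not given in the slides.

```python
def micro_prf(folds):
    """Microaveraged precision, recall and F1 from per-fold TP/FP/FN counts."""
    tp = sum(f["tp"] for f in folds)
    fp = sum(f["fp"] for f in folds)
    fn = sum(f["fn"] for f in folds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical counts for two folds (a real run would have ten).
folds = [
    {"tp": 30, "fp": 70, "fn": 20},
    {"tp": 10, "fp": 10, "fn": 20},
]
precision, recall, f1 = micro_prf(folds)
```

With these counts the pooled precision is 40/120 and the pooled recall is 40/80, so larger folds weigh more than they would under a plain average of per-fold scores.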

Performance for various attribute sets (values in %)

                                   Training Set                Test Set
Attribute set                      Precision  Recall  F1       Precision  Recall  F1
Sentence Position + Terms          65.87      92.48   76.94    28.87      37.08   32.46
only Position (triple + sentence)  31.21      52.49   39.15    31.05      52.58   39.05
only Graph                         27.78      57.46   37.46    27.25      56.90   36.85
only Linguistic                    29.77      61.79   40.18    22.29      47.52   30.29
Position + Linguistic              31.16      67.00   42.54    28.67      62.57   39.33
Position + Graph                   33.51      63.85   43.95    32.71      63.02   43.07
Position + Graph + Linguistic      35.82      72.69   47.99    31.41      64.88   42.33

Performance for various attribute sets
• Baseline performance (sentence position + selected terms from the sentence), F1=32.46, is lower than in any of the other runs, except for ‘only linguistic’ attributes (F1=30.29).
• The ‘only linguistic’ run includes only generic syntactic and semantic labels, which are not expected to be good discriminators on their own.

Performance for various attribute sets
• Adding generic linguistic attributes reduces precision but consistently increases recall
  – Position of triples and sentences: P=31.05
  – Adding linguistic attributes: P=28.67

Performance for various attribute sets
• Information about the graph structure helps
  – Position of triples and sentences: F1=39.05
  – Adding structure information: F1=43.07

Insights
• We determine the median and quartiles of the ranks across 10 runs.
• Most highly ranked features in SVM normal:

Attribute                                       1st quartile  Median  3rd quartile
Object – Authority weight                       1             1       2
Object – size of weakly connected component     2             2.5     3
Object – degree of a node                       2             3       3
Object – is name of a country                   4             5       5
Subject – size of weakly connected component    6             7       9
Subject – degree of a node                      6             10.5    12
Object – PageRank weight                        6             11      12
Object – is name of a geographical location     8             13      16
Subject – Authority weight                      13            18.5    23
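The rank statistics in the table can be obtained from the per-run ranks of each feature as sketched below. The ten ranks are invented for illustration, and the quartile convention (inclusive interpolation) is an assumption, since the slides do not say which one was used.

```python
import statistics

# Hypothetical rank of one feature in the SVM normal in each of 10 runs.
ranks = [2, 2, 2, 2, 2, 3, 3, 3, 3, 3]

# Quartile cut points with inclusive interpolation (assumed convention).
q1, median, q3 = statistics.quantiles(ranks, n=4, method="inclusive")
print(q1, median, q3)
```

A narrow interquartile range means that a feature was ranked consistently across the runs.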

Example of summarization Cracks Appear in U. N. Trade Embargo Against Iraq. Human written summary Cracks appeared Tuesday in the U. N. trade embargo against Iraq as Saddam Hussein sought to circumvent the economic noose around his country. Japan, meanwhile, announced it would increase its aid to countries hardest hit by enforcing the sanctions. Hoping to defuse criticism that it is not doing its share to oppose Baghdad, Japan said up to $2 billion in aid may be sent to nations most affected by the U. N. embargo on Iraq. President Bush on Tuesday night promised a joint session of Congress and a nationwide radio and television audience that ``Saddam Hussein will fail'' to make his conquest of Kuwait permanent. ``America must stand up to aggression, and we will, '' said Bush, who added that the U. S. military may remain in the Saudi Arabian desert indefinitely. ``I cannot predict just how long it will take to convince Iraq to withdraw from Kuwait, '' Bush said. More than 150, 000 U. S. troops have been sent to the Persian Gulf region to deter a possible Iraqi invasion of Saudi Arabia. Bush's aides said the president would follow his address to Congress with a televised message for the Iraqi people, declaring the world is united against their government's invasion of Kuwait. Saddam had offered Bush time on Iraqi TV. The Philippines and Namibia, the first of the developing nations to respond to an offer Monday by Saddam of free oil _ in exchange for sending their own tankers to get it _ said no to the Iraqi leader. Saddam's offer was seen as a none-too-subtle attempt to bypass the U. N. embargo, in effect since four days after Iraq's Aug. 2 invasion of Kuwait, by getting poor countries to dock their tankers in Iraq. But according to a State Department survey, Cuba and Romania have struck oil deals with Iraq and companies elsewhere are trying to continue trade with Baghdad, all in defiance of U. N. sanctions. Romania denies the allegation. 
The report, made available to The Associated Press, said some Eastern European countries also are trying to maintain their military sales to Iraq. A well-informed source in Tehran told The Associated Press that Iran has agreed to an Iraqi request to exchange food and medicine for up to 200, 000 barrels of refined oil a day and cash payments. There was no official comment from Tehran or Baghdad on the reported food-for-oil deal. But the source, who requested anonymity, said the deal was struck during Iraqi Foreign Minister Tariq Aziz's visit Sunday to Tehran, the first by a senior Iraqi official since the 1980 -88 gulf war. After the visit, the two countries announced they would resume diplomatic relations. Well-informed oil industry sources in the region, contacted by The AP, said that although Iran is a major oil exporter itself, it currently has to import about 150, 000 barrels of refined oil a day for domestic use because of damages to refineries in the gulf war. Along similar lines, ABC News reported that following Aziz's visit, Iraq is apparently prepared to give Iran all the oil it wants to make up for the damage Iraq inflicted on Iran during their conflict. Secretary of State James A. Baker III, meanwhile, met in Moscow with Soviet Foreign Minister Eduard Shevardnadze, two days after the U. S. -Soviet summit that produced a joint demand that Iraq withdraw from Kuwait. During the summit, Bush encouraged Mikhail Gorbachev to withdraw 190 Soviet military specialists from Iraq, where they remain to fulfill contracts. Shevardnadze told the Soviet parliament Tuesday the specialists had not reneged on those contracts for fear it would jeopardize the 5, 800 Soviet citizens in Iraq. In his speech, Bush said his heart went out to the families of the hundreds of Americans held hostage by Iraq, but he declared, ``Our policy cannot change, and it will not change. America and the world will not be blackmailed. 
'' The president added: ``Vital issues of principle are at stake. Saddam Hussein is literally trying to wipe a country off the face of the Earth. '' In other developments: _A U. S. diplomat in Baghdad said Tuesday up to 800 Americans and Britons will fly out of Iraqi-occupied Kuwait this week, most of them women and children leaving their husbands behind. Saddam has said he is keeping foreign men as human shields against attack. On Monday, a planeload of 164 Westerners arrived in Baltimore from Iraq. Evacuees spoke of food shortages in Kuwait, nighttime gunfire and Iraqi roundups of young people suspected of involvement in the resistance. ``There is no law and order, '' said Thuraya, 19, who would not give her last name. ``A soldier can rape a father's daughter in front of him and he can't do anything about it. '' _The State Department said Iraq had told U. S. officials that American males residing in Iraq and Kuwait who were born in Arab countries will be allowed to leave. Iraq generally has not let American males leave. It was not known how many men the Iraqi move could affect. _A Pentagon spokesman said ``some increase in military activity'' had been detected inside Iraq near its borders with Turkey and Syria. He said there was little indication hostilities are imminent. Defense Secretary Dick Cheney said the cost of the U. S. military buildup in the Middle East was rising above the $1 billion-a-month estimate generally used by government officials. He said the total cost _ if no shooting war breaks out _ could total $15 billion in the next fiscal year beginning Oct. 1. Cheney promised disgruntled lawmakers ``a significant increase'' in help from Arab nations and other U. S. allies for Operation Desert Shield. Japan, which has been accused of responding too slowly to the crisis in the gulf, said Tuesday it may give $2 billion to Egypt, Jordan and Turkey, hit hardest by the U. N. prohibition on trade with Iraq. 
``The pressure from abroad is getting so strong,'' said Hiroyasu Horio, an official with the Ministry of International Trade and Industry. Local news reports said the aid would be extended through the World Bank and International Monetary Fund, and $600 million would be sent as early as mid-September. On Friday, Treasury Secretary Nicholas Brady visited Tokyo on a world tour seeking $10.5 billion to help Egypt, Jordan and Turkey. Japan has already promised a $1 billion aid package for multinational peacekeeping forces in Saudi Arabia, including food, water, vehicles and prefabricated housing for non-military uses. But critics in the United States have said Japan should do more because its economy depends heavily on oil from the Middle East. Japan imports 99 percent of its oil. Japan's constitution bans the use of force in settling international disputes and Japanese law restricts the military to Japanese territory, except for ceremonial occasions. On Monday, Saddam offered developing nations free oil if they would send their tankers to pick it up. The first two countries to respond Tuesday _ the Philippines and Namibia _ said no. Manila said it had already fulfilled its oil requirements, and Namibia said it would not ``sell its sovereignty'' for Iraqi oil. Venezuelan President Carlos Andres Perez dismissed Saddam's offer of free oil as a ``propaganda ploy.'' Venezuela, an OPEC member, has led a drive among oil-producing nations to boost production to make up for the shortfall caused by the loss of Iraqi and Kuwaiti oil from the world market. Their oil makes up 20 percent of the world's oil reserves. Only Saudi Arabia has higher reserves. But according to the State Department, Cuba, which faces an oil deficit because of reduced Soviet deliveries, has received a shipment of Iraqi petroleum since U.N. sanctions were imposed five weeks ago. And Romania, it said, expects to receive oil indirectly from Iraq.

Cracks appeared in the U.N. trade embargo against Iraq. The State Department reports that Cuba and Romania have struck oil deals with Iraq as others attempt to trade with Baghdad in defiance of the sanctions. Iran has agreed to exchange food and medicine for Iraqi oil. Saddam has offered developing nations free oil if they send their tankers to pick it up. Thus far, none has accepted. Japan, accused of responding too slowly to the Gulf crisis, has promised $2 billion in aid to countries hit hardest by the Iraqi trade embargo. President Bush has promised that Saddam's aggression will not succeed.
7800 chars, 1300 words

Full document semantic graph

Automatically generated summary graph

Findings on summarization with semantic graphs
• Experiments show that attributes characterizing the document semantic graph improve the selection of triples for summarization
  – These results need to be verified on additional data sets
  – A comparison with additional summarization methods is needed
  – Various strategies for extracting and generating summaries from the extracted triples remain to be explored
• No combination of the examined features led to a good separation of positive and negative triples in the feature space
  – This leaves an opportunity for further investigation and improvement

Active Learning / Dealing with unlabeled data

The idea of Active Learning
• The idea of Active Learning is that a student who asks smart questions arrives at the required model of knowledge faster than one who asks random questions
• The goal is to use Active Learning algorithms for the semi-automatic
  – construction of models for labeling data and
  – ontology learning

Quick Intro to Active Learning
• We use these methods whenever hand-labeled data are rare or expensive to obtain
• Interactive method
• Requests labeling of “interesting” objects only
• Much less human work is needed for the same result than with labeling arbitrary examples
[Figure: teacher and students. The passive student receives data & labels; the active student sends a query and receives a label. Plot: performance vs. number of questions, comparing the active student asking smart questions with the passive student asking random questions.]

Algorithms tested
• Uncertainty sampling (efficient) – select the example closest to the decision hyperplane (or the one with classification probability closest to P = 0.5) (Tong & Koller 2000, Stanford)
• Maximum margin ratio change – select the example with the largest predicted impact on the margin size if selected (Tong & Koller 2000, Stanford)
• Monte Carlo Estimation of Error Reduction – select the example that reinforces our current beliefs (Roy & McCallum 2001, CMU)
• Random sampling as baseline
Experimental evaluation (using the F1-measure) of the four listed approaches, shown on three categories from the Reuters-2000 dataset
  – average over 10 random samples of 5000 training (out of 500k) and 10k testing (out of 300k) examples
  – the last two methods are rather time consuming, so we ran them only for the first 50 unlabeled examples
  – experiments show that active learning is especially useful for unbalanced data
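The uncertainty sampling rule can be illustrated with a minimal sketch. It assumes a linear classifier given by hypothetical `weights` and `bias` values and a tiny made-up pool; none of these names or numbers come from the experiments on the slide.

```python
def decision_value(weights, bias, x):
    """Signed score of a linear classifier; its magnitude grows with the distance to the hyperplane."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def most_uncertain(weights, bias, pool):
    """Return the index of the pool example closest to the decision hyperplane."""
    return min(range(len(pool)), key=lambda i: abs(decision_value(weights, bias, pool[i])))


# Hypothetical model: the decision boundary is the line x0 - x1 = 0.
weights, bias = [1.0, -1.0], 0.0
# Hypothetical unlabeled pool.
pool = [[3.0, 0.5], [1.0, 0.9], [0.2, 2.0]]

query_index = most_uncertain(weights, bias, pool)
print(query_index, pool[query_index])
```

The example with the smallest absolute score is the one whose label the model is least sure about, so it is the one sent to the human for labeling.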

Category with balanced class distribution, having 47% of positive examples: limited advantage over random sampling

Category with fairly unbalanced class distribution, having 20% of positive examples: best performance with Uncertainty and MarginRatio; Uncertainty is simpler and much more efficient

Category with very unbalanced class distribution, having 2.7% of positive examples: Uncertainty seems to outperform MarginRatio

Illustration of Active learning
• starting with one labeled example from each class (red and blue)
• select one example for labeling (green circle)
• request its label and re-generate the model using the extended labeled data
Illustration of a linear SVM model using
• arbitrary selection of unlabeled examples (random)
• active learning, selecting the most uncertain examples (closest to the decision hyperplane)
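The select, label, and re-train loop can be sketched as follows. This is only an illustration under stated assumptions: a nearest-centroid model stands in for the linear SVM on the slide, and the `oracle` function, seed examples, and pool are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a list of equally long vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def train(labeled):
    """Nearest-centroid model (a stand-in for the linear SVM, not the slide's method)."""
    pos = centroid([x for x, y in labeled if y == 1])
    neg = centroid([x for x, y in labeled if y == -1])
    return pos, neg


def score(model, x):
    """Positive when x is closer to the positive centroid; near zero means uncertain."""
    pos, neg = model
    d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
    return d_neg - d_pos


def active_learning(seed, pool, oracle, rounds):
    """Repeatedly query the most uncertain pool example and re-train on the extended data."""
    labeled, pool = list(seed), list(pool)
    for _ in range(min(rounds, len(pool))):
        model = train(labeled)
        query = min(pool, key=lambda x: abs(score(model, x)))
        pool.remove(query)
        labeled.append((query, oracle(query)))  # ask the human (here: a function) for the label
    return train(labeled), labeled


# Hypothetical setup: the true class is +1 when the first coordinate exceeds 1.0.
oracle = lambda x: 1 if x[0] > 1.0 else -1
seed = [([0.0, 0.0], -1), ([4.0, 0.0], 1)]  # one labeled example from each class
pool = [[0.5, 0.0], [1.8, 0.0], [2.6, 0.0], [3.5, 0.0]]

model, labeled = active_learning(seed, pool, oracle, rounds=2)
print(labeled[2:])
```

Each round spends one human label on the example nearest the current boundary, which is where a label moves the model the most.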

Uncertainty sampling of unlabeled example

Methods Addressing Different Aspects of Ontology Construction

Methods addressing different aspects of ontology construction
• Collecting data – focused crawling with Google and DMoz in the loop
• Dealing with different natural languages – map the documents into a language-independent semantic space
• Going directly from the data – semi-automatic creation of an ontology directly from the data under predefined conditions/scenarios
• Annotation of text

Focused Crawler
• A focused crawler that finds, in a relatively short time, web pages related to a given web page
  – The solution uses the DMoz topic ontology to get content context, and Google to get web linkage context
    • …the main idea is to browse the web graph as a bi-directional graph using the “link:” query in Google
  – Algorithm:
    • For an efficient initial set of candidate pages we use Google and DMoz
    • From the initial set, pages are crawled in breadth-first fashion
    • …priority in the crawler queue is given to more similar pages
    • …after some stopping condition is met, the crawler returns the list of candidate web pages
• Usage: serves as a technique for collecting data for the next stages of data processing, such as building and populating ontologies for the Semantic Web and improved knowledge access
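The crawl loop with a similarity-ordered queue can be sketched on a small in-memory graph. Everything here is an assumption for illustration: the `links` and `text` dictionaries replace real web access, Jaccard word overlap replaces whatever similarity the actual crawler uses, and the stopping condition is a simple page budget.

```python
import heapq


def similarity(a, b):
    """Jaccard overlap between two word lists (a stand-in for a real content similarity)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def focused_crawl(seed, initial, links, text, max_pages):
    """Expand pages from the initial candidates, most similar to the seed first."""
    queue = [(-similarity(text[seed], text[url]), url) for url in initial]
    heapq.heapify(queue)  # min-heap on negated similarity = most similar page first
    seen = set(initial) | {seed}
    result = []
    while queue and len(result) < max_pages:  # stopping condition: page budget
        neg_sim, url = heapq.heappop(queue)
        result.append((url, -neg_sim))
        for nxt in links.get(url, []):
            if nxt not in seen and nxt in text:
                seen.add(nxt)
                heapq.heappush(queue, (-similarity(text[seed], text[nxt]), nxt))
    return result


# Hypothetical pages (word lists) and outgoing links.
text = {
    "seed": ["telecom", "broadband", "phone"],
    "a": ["telecom", "phone", "mobile"],
    "b": ["cooking", "recipes"],
    "c": ["telecom", "broadband", "phone", "cable"],
    "d": ["gardening"],
}
links = {"a": ["c"], "b": ["d"], "c": []}

pages = focused_crawl("seed", ["a", "b"], links, text, max_pages=3)
print(pages)
```

Page `c` is discovered only through `a`, yet it is visited before `b` because the queue favors pages that look more like the seed.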

Example Focused Crawl
• Focused crawl for the BT home page (http://www.bt.com):
  1. www.bt.co.uk/ - BT
  2. www.yell.com/ucs/HomePageAction.do - UK's local search engine
  3. www.att.com/ - AT&T: The World's Networking Company
  4. www.cisco.com/ - Cisco Systems, Inc
  5. www.microsoft.com/ - Microsoft Corporation
  6. www.bbc.co.uk/ - BBC
  7. www.hp.com/ - HP United States
  8. www.ntl.com/ - Broadband cable internet access
  9. www.telekom.de/ - Deutsche Telekom
  10. www.epsrc.ac.uk/ - EPSRC
  11. www.com/ - Cable & Wireless
  12. www.royalmail.com/ - Royal Mail
  13. www.ericsson.com/ - Ericsson
  14. www.bp.com/home.do?categoryId=1 - BP Global
  15. www.telewest.co.uk/ - Telewest Broadband PLC
  16. www.verizon.com/ - Verizon
  17. www.nokia.com/ - Nokia
  18. www.bt.com/at_home.jsp - BT.com At Home
  19. www.ibm.com/ - IBM United States
  20. www.sbc.com/gen/landing-pages?pid=3308 - SBC Communications Inc.
  21. www.francetelecom.com/ - France Telecom
  22. www.mci.com/ - MCI Home
  23. www.siemens.com/ - Siemens AG
  24. www.motorola.com/ - Motorola
  25. www.vodafone.co.uk/ - Vodafone UK

Language-independent document representation
• From aligned corpora we learn mappings from documents into a “language-independent representation” using the “Kernel Canonical Correlation Analysis” (KCCA) method
• …such a representation could be used for multilingual classification, multilingual IR, …
• On-going work on learning mappings between all European languages using the CELEX corpus of European legislation in 21 languages

Two views of the same data – find the direction with maximal correlation
[Figure: View 1 vs. View 2, correlation = 0.17]
[Figure: View 1 vs. View 2, correlation = 0.44]
[Figure: View 1 vs. View 2, correlation = 0.97]
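The search for maximally correlated directions can be illustrated with a toy sketch. It is a brute-force, linear, two-dimensional stand-in and not KCCA itself: each view is projected onto a direction given by an angle, and a grid of angle pairs is scanned for the highest absolute Pearson correlation. The data and the `steps` parameter are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def project(view, angle):
    """Project 2-D points onto the unit direction (cos(angle), sin(angle))."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * a + s * b for a, b in view]


def best_directions(view1, view2, steps=36):
    """Grid search over direction pairs; returns (|correlation|, angle1, angle2)."""
    angles = [k * math.pi / steps for k in range(steps)]
    best = (0.0, 0.0, 0.0)
    for a1 in angles:
        p1 = project(view1, a1)
        for a2 in angles:
            corr = abs(pearson(p1, project(view2, a2)))
            if corr > best[0]:
                best = (corr, a1, a2)
    return best


# Hypothetical data: a shared signal sits in coordinate 0 of view 1 and coordinate 1 of view 2.
shared = [1.0, 2.0, 3.0, 4.0, 5.0]
view1 = list(zip(shared, [2.0, -1.0, 0.5, -2.0, 1.0]))
view2 = list(zip([1.0, -1.0, 2.0, 0.0, -2.0], [2.0 * s for s in shared]))

corr, a1, a2 = best_directions(view1, view2)
print(round(corr, 3), round(math.degrees(a1)), round(math.degrees(a2)))
```

Poorly chosen direction pairs give low correlations, while the pair that isolates the shared signal in both views gives a correlation close to 1, which mirrors the progression of values on the slides.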

Correlated directions found with KCCA when applied to financial news articles
ZENTRALBANK VERLUST LOSS BP BP EINKOMMEN INCOME MILLIARDE CENTRAL FIRMA COMPANY DOLLAR VIERTEL QUARTER ZAHLUNG WAGE GESCHICHTEN STORIES VOLLE PAYMENT MILLION GEWERKSCHAFT NEGOTIATIONS SAGT SAYS VERHANDLUNGSRUNDE BORSEN EXCHANGES UNION

Modelling directly from the data – getting semantic classes with LSI
CELLS SERVICES CELLS CONTENT GENE GRID STEM_CELLS MEDIA CANCER USER STEM GRID GENOMIC MOBILIZATION VACCINES MULTIMEDIA MOLECULAR CONTENT WEB DIGITAL ENERGY SECURITY WEB ROBOT OPTICS ROBOT WEB_SERVICES LEARNING WASTE EMBEDDED SEMANTIC COGNITIVE FUEL BIOMETRICS CONTENT HUMAN NUCLEAR VECTOR MEDIA INTERACTIVE

Visualization of 6 FP IST project (English)

Modeling relationships between companies from the news

Annotation of text
• Annotation based on examples
• Annotation using clustering
• Annotation based on thesaurus

Annotate text based on examples
Problem: annotation of text by assigning predefined labels to text fragments
Given: examples of annotated text fragments
• learn annotation rules from already annotated documents (.xml, ...) – similar to learning IE
• learn to classify sentences into semantic roles

Annotate text using clustering
Problem: annotation of text by finding labels and assigning them to text fragments
Given: text to annotate
• split documents into sentences, represent each sentence as a word vector
• cluster the sentences and label them by the most characteristic words from the sentences – e.g., using local frequency of words, clustering with SOM and using the neural network weights of words

Annotate text based on thesaurus
Problem: annotation of text by finding labels and assigning them to text fragments
Given: text to annotate, thesaurus
• a) apply NLP to the text to find noun groups and map them onto concepts of a (medical) thesaurus
• b) split the document into sentences, cluster them, and map the clusters onto concepts of a general thesaurus (WordNet)
The concepts are used as semantic labels (XML tags) for annotating documents
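Variant a) can be sketched in a few lines. This is a simplified illustration: a plain dictionary lookup replaces the NLP noun-group detection, and the `THESAURUS` entries and concept names are hypothetical rather than taken from any real medical thesaurus.

```python
import re

# Hypothetical thesaurus: lowercase term -> concept name used as the XML tag.
THESAURUS = {
    "heart attack": "MyocardialInfarction",
    "aspirin": "Analgesic",
    "blood pressure": "VitalSign",
}


def annotate(text, thesaurus):
    """Wrap every thesaurus term found in the text in a tag named after its concept."""
    terms = sorted(thesaurus, key=len, reverse=True)  # prefer longer terms when several match
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE)

    def tag(match):
        concept = thesaurus[match.group(0).lower()]
        return f"<{concept}>{match.group(0)}</{concept}>"

    return pattern.sub(tag, text)


annotated = annotate("Aspirin lowers the risk of heart attack.", THESAURUS)
print(annotated)
```

A real system would first normalize the noun groups (inflection, word order) before the lookup; the sketch only shows how matched concepts turn into semantic labels around text fragments.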

Ontology evaluation directions
• Analysis of information-theoretic properties of structured data instances
• Measure of the agreement with the characteristics derived from manually built ontologies
• Optimization of the efficiency of the user's behaviour when using an ontology (e.g., minimizing the number of user clicks)

Ontology Learning Challenge
• Academic challenge on DMoz data (Science part) for 3 tasks:
  1. Taxonomy Population
     • Given a taxonomy with documents, the task is to classify new documents into taxonomic categories
  2. Naming Categories
     • Given taxonomic categories with documents, the task is to (semi)automatically propose names for the categories
  3. Constructing Taxonomy from Documents
     • Given a set of documents, the task is to (semi)automatically propose a taxonomic structure
• The goal is to model human skills when dealing with large amounts of data
• Data:
  – DMoz/Science (10k concepts, 100k instances)
  – Tourist ontology (from KU) (70 concepts, ~1000 instances)
• The challenge will be funded through the “PASCAL Network of Excellence” European project (http://www.pascal-network.org/)
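The Taxonomy Population task can be illustrated with a minimal baseline. This is only a sketch under assumptions: a bag-of-words centroid classifier with cosine similarity is one simple option, not the method prescribed by the challenge, and the category names and documents below are invented.

```python
import math
from collections import Counter


def bag(text):
    """Bag-of-words vector of a whitespace-tokenized, lowercased text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(doc, taxonomy):
    """Assign a new document to the category whose summed word vector is most similar."""
    centroids = {cat: sum((bag(d) for d in docs), Counter()) for cat, docs in taxonomy.items()}
    vector = bag(doc)
    return max(centroids, key=lambda cat: cosine(vector, centroids[cat]))


# Hypothetical two-category taxonomy with a few documents each.
taxonomy = {
    "Science/Physics": ["quantum particle energy", "energy of a particle field"],
    "Science/Biology": ["cell gene protein", "gene expression in the cell"],
}

category = classify("protein folding in the cell", taxonomy)
print(category)
```

With a hierarchical taxonomy the same idea can be applied level by level, which keeps the number of comparisons small for large structures such as DMoz.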

Ideas / Future plans (1)
• DMoz categories as standard web meta-data dictionary
  – …the idea is to use DMoz categories/keywords as a standardized dictionary for meta-data labeling of general Web pages
  – …because of the dynamic and adaptive nature of DMoz categorization (reflecting all major topics on the web), this could be interesting as a baseline for “semantic web” style annotation
  – …e.g., it could be deployed as a tool for (semi)automatic generation of tags for web pages

Ideas / Future plans (2)
• DMoz classifier as an annotation tool
  – …the idea is to use the DMoz-classifier tool for meta-data (keyword) generation
  – …some other popular databases (e.g., Wikipedia) could have automatically generated DMoz categories attached
  – …could be accessible as a web service (e.g., SOAP interface)

Ideas / Future plans (3)
• DMoz Visualizer
  – …the idea is to create a tool for visualization and browsing through the DMoz structure
  – …browsing tools could combine other public and commercial sources (such as Wikipedia, Google, Amazon, eBay, …)
  – …could appear as, e.g., a web-browser toolbar

Ideas / Future plans (4)
• Analysis of DMoz Dynamics
  – The future research plan is to model the dynamics of the DMoz taxonomy based on data from the DMoz Archive (http://rdf.dmoz.org/rdf/archive/)
  – …the idea is to model the decision process of when and how the editors decide to split the category nodes
  – …currently the repository includes 120 snapshots of DMoz from year 2000 on

Ideas / Future plans (5)
• Focused crawling for DMoz
  – …the idea is to use a focused crawler for proposing new web sites for particular categories (as an editorial tool)
  – …at JSI we developed a focused crawler for fast and efficient crawling for focused content; it can be further extended
    • …to use Google and DMoz in the loop
    • …to use user hints (positive & negative examples of content pages)
    • …based on the Corpus-Builder project at CMU – http://www-2.cs.cmu.edu/afs/cs/project/theo-4/textlearning/www/corpusbuilder/

Ideas / Future plans (6)
• Classification of non-English documents
  – …we use string kernels to avoid problems with morphology
    • …paper submitted to ECML/PKDD 2005 (Fortuna & Mladenic) on classification into major Slovenian and Croatian taxonomies
  – …we plan to use Canonical Correlation Analysis (CCA) for efficient identification of similar content written in different languages
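Why string kernels sidestep morphology can be shown with a small example. The slide does not say which string kernel is used, so this sketch assumes the p-spectrum kernel (counting shared character p-grams) as one common choice; the function names and the parameter `p` are illustrative.

```python
import math
from collections import Counter


def spectrum(text, p):
    """Counts of all character p-grams in the text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def spectrum_kernel(a, b, p=3):
    """p-spectrum kernel: inner product of the two p-gram count vectors."""
    sa, sb = spectrum(a, p), spectrum(b, p)
    return sum(count * sb[gram] for gram, count in sa.items())


def normalized_kernel(a, b, p=3):
    """Kernel value scaled to [0, 1] so that strings of different length are comparable."""
    denom = math.sqrt(spectrum_kernel(a, a, p) * spectrum_kernel(b, b, p))
    return spectrum_kernel(a, b, p) / denom if denom else 0.0


# Two inflected forms of the Slovenian word for "book" share most of their 3-grams.
same_word = normalized_kernel("knjiga", "knjige")
other_word = normalized_kernel("knjiga", "drevo")
print(same_word, other_word)
```

Because the comparison works on character substrings rather than whole tokens, inflected forms stay close to each other without any stemming or lemmatization step.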

Text-Garden software library (in development over the last 5 years)

Text-Garden data
• Set of C++ classes for “industrial strength” text mining problem solving
• Currently organized in ~50 command line utilities covering
  – Machine learning/Data mining on text
  – Web related functionality
  – Profiling, Visualization, …
• Currently works on Windows, to be ported to Linux

Text Garden – Architecture of clustering, visualization, classification

Text Garden Web site www.textmining.net