2969d15e095f1563bc9b902af63b4b20.ppt
- Number of slides: 28
Hypertext categorization: Is there anything beyond words and links?
Rayid Ghani
IR Project Presentation, 12/12/00
www.cs.cmu.edu/~rayid/talks
How is hypertext different?
- Link information
- Diverse authorship
- Short text: topic not obvious from the text
- Structure / position within the web graph
- Author-supplied features (meta-tags)
- Bold, italics, headings, etc.
Motivation
- imdb.com: Movie: Titanic; Rating: ****; Review: …
- NYT Movie Reviews: Movie: Titanic; Rating: ****; Review: …
- Titanic: Cast: …; Director: …; Duration:
It’s also different because:
- LOTS of external sources of information
  - Movie reviews for movie recommendation
  - Reviews of websites for website classification
Hypertext classification work
- Chakrabarti et al., SIGMOD 98
- Oh et al., SIGIR 00
- Slattery & Mitchell, ICML 00
Classification using external information sources
- Basu, Hirsh & Cohen, AAAI 98
- Cohen, ICML 2000
- Text classification using IE:
  - Riloff
Proposed idea:
- Classify [web pages / collections of web pages] using:
  - Text/information on linked pages
  - Text/information from external sources that may or may not be linked
Dataset
- A collection of up to 50 web pages from 4285 companies (as used in Ghani et al. 2000)
- Two types of classification (labels obtained from www.hoovers.com)
  - Coarse-grained classification: 28 classes
  - Fine-grained classification: 255 classes
Using hyperlinks
- Assumption: pages that link to each other are similar (?)
- Method:
  1. Classify each company independently (ignoring hyperlinks)
  2. Use words from all the companies that link to or are linked to by the company
  3. Same as 2, but treat the words from the links as coming from a separate vocabulary
  4. Same as 2, but only use the companies that are in the same class as the current one
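
Methods 2 and 3 differ only in how the neighbours' words enter the bag of words. A minimal Python sketch of that difference follows; the function name and the "LINKED:" prefix are hypothetical conventions chosen for illustration, not taken from the slides.

```python
from collections import Counter


def augment_with_links(own_words, linked_words, separate_vocab=False):
    """Build one company's bag of words, including words from link neighbours.

    separate_vocab=False: neighbour words share the page vocabulary (method 2).
    separate_vocab=True: neighbour words get a "LINKED:" prefix (hypothetical
    convention) so the classifier keeps separate counts for them (method 3).
    """
    bag = Counter(own_words)
    for word in linked_words:
        key = "LINKED:" + word if separate_vocab else word
        bag[key] += 1
    return bag


own = ["retail", "store", "discount"]
linked = ["airline", "travel", "store"]
shared = augment_with_links(own, linked)                      # method 2
split = augment_with_links(own, linked, separate_vocab=True)  # method 3
```

With a shared vocabulary the neighbour's "store" reinforces the company's own count; with a separate vocabulary it is counted as a distinct feature.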
Classifier used
- Naïve Bayes (multinomial model) with Laplace smoothing
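
For readers who want to see the classifier concretely, here is a minimal pure-Python sketch of multinomial Naïve Bayes with Laplace (add-one) smoothing. It is an illustration under assumptions, not the implementation used for the reported results: the class name, the toy documents, and the choice to skip words unseen in training are all assumptions.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word occurrences per class
        self.vocab = set()
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
            self.vocab.update(words)
        self.total_docs = len(labels)
        return self

    def log_scores(self, words):
        """Return log P(c) + sum of log P(w|c) for every class c."""
        scores = {}
        vocab_size = len(self.vocab)
        for c, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[c].values())
            score = math.log(n_docs / self.total_docs)
            for w in words:
                if w not in self.vocab:
                    continue  # assumption: ignore words never seen in training
                # Laplace smoothing: add 1 to the count, |V| to the total.
                score += math.log((self.word_counts[c][w] + 1)
                                  / (total_words + vocab_size))
            scores[c] = score
        return scores

    def predict(self, words):
        scores = self.log_scores(words)
        return max(scores, key=scores.get)


# Toy data (hypothetical), only to show the call pattern.
nb = MultinomialNB().fit(
    [["software", "windows"], ["software", "server"], ["airline", "flight"]],
    ["software", "software", "travel"],
)
```

Calling `nb.predict(["flight"])` picks the class with the highest smoothed log score; working in log space avoids underflow on long documents.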
Results using hyperlinks
# of classes Page Only 28 55.5% Page + All linked pages (separate vocab) 40.1% 49.2% 255 32.5% 18.9% Page + linked pages of the same class 57.5% 33.2%
Results shown above are averages of 5-fold cross-validation.
Assumptions violated?
- Companies DO NOT link to competitors; they link to partners who are involved in DIFFERENT sectors
- Wal-Mart doesn’t link to Kmart or Sears; it links to Southwest (travel partner)
- Are companies linked to by the same company similar?
  - Egreetings links to online stores that sell gifts
Classification using IE
- External source of information: www.hoovers.com (a collection of facts about various companies)
Extracted features
From hoovers.com:
- Location: City, State, Country
- Type: Public, Private
- Links to Company:
- Mentions Company:
- Company-in-same-state
- Company-reciprocally-competes
Extracted from the web pages:
- Locations-of-company
- Technical-assistance-relationships
- Research-relationships
- Expertise-relationships
- Export-relationships
- Manufacture-relationships
- Supply-relationships
(MICROSOFT-CORPORATION *NOVALUE*
  (COMPANY-RECIPROCALLY-COMPETES (SUN-MICROSYSTEMS-INC. 3COM-CORPORATION ADOBE-SYSTEMS-INCORPORATED BE-INCORPORATED BROADVISION-INC. COMPUTER-ASSOCIATES-INTERNATIONAL-INC. COREL-CORPORATION THE-LEARNING-COMPANY-INC. LIBERATE-TECHNOLOGIES MACROMEDIA-INC. MYWAY.COM RED-HAT-INC. THE-SANTA-CRUZ-OPERATION-INC. SYBASE-INC. SYMBIAN-LTD.))
  (COMPANY-RECIPROCALLY-LINKS (MACROMEDIA-INC.))
  (COMPANY-LINKS-TO-SAME-COMPANY-AS (ABN-AMRO-HOLDING-N.V. ACG-HOLDINGS-INC.))
  (COMPANY-LINKED-TO-BY-COMPANY (ABN-AMRO-HOLDING-N.V.))
  (COMPANY-RECIPROCALLY-MENTIONS (RM-PLC))
  (COMPANY-MENTIONS-SAME-COMPANY-AS (NETSMART-TECHNOLOGIES-INC. NATIONAL-PRESTO-INDUSTRIES-INC.))
  (COMPANY-IN-SAME-STATE (N2H2-INC. NEORX-CORPORATION MPM-TECHNOLOGIES-INC. MUZAK-LIMITED-LIABILITY-COMPANY MULTIPLEZONES-INTERNATIONAL-INC.))
  (HAS-HEADQUARTERS-IN-STATE "WA")
  (COMPANY-IN-SAME-CITY (METAWAVE-COMMUNICATIONS-CORPORATION MEDTRONIC-PHYSIO-CONTROL-INC. DATA-I/O-CORPORATION))
  (HAS-HEADQUARTERS-IN-CITY "Redmond")
  (COMPANY-HAS-COMMON-PEOPLE-WITH (BILL-&-MELINDA-GATES-FOUNDATION))
  (COMPETITOR (SYMBIAN-LTD. SYMANTEC-CORPORATION SYBASE-INC. THE-SANTA-CRUZ-OPERATION-INC. RED-HAT-INC. MYWAY.COM MCI-WORLDCOM-INC. MACROMEDIA-INC. LIBERATE-TECHNOLOGIES THE-LEARNING-COMPANY-INC. COREL-CORPORATION COMPUTER-ASSOCIATES-INTERNATIONAL-INC. COMPAQ-COMPUTER-CORPORATION CENDANT-CORPORATION BROADVISION-INC. BE-INCORPORATED ADOBE-SYSTEMS-INCORPORATED 3COM-CORPORATION SUN-MICROSYSTEMS-INC.))
  (EMPLOYEES ((1990 5635) (1991 8226) (1992 11542) (1993 14430) (1994 15257) (1995 17801) (1996 20561) (1997 22232) (1998 27055) (1999 31396)))
  (NET_PROFIT ((1990 23.6) (1991 25.1) (1992 25.7) (1993 25.4) (1994 24.7) (1995 24.5) (1996 25.3) (1997 30.4) (1998 31.0) (1999 39.4)))
  (NET_INCOME ((1990 279.2) (1991 462.7) (1992 708.1) (1993 953.0) (1994 1146.0) (1995 1453.0) (1996 2195.0) (1997 3454.0) (1998 4490.0) (1999 7785.0)))
  (REVENUE ((1990 1183.4) (1991 1843.4) (1992 2758.7) (1993 3753.0) (1994 4649.0) (1995 5937.0) (1996 8671.0) (1997 11358.0) (1998 14484.0) (1999 19747.0)))
  (HOOVERS_AUDITORS "Deloitte & Touche LLP ")
  (DOES-SOMETHING-WITH ("Adjustments" "Consumer, commerce & other" "Windows platforms" "Productivity & developer applications"))
  (ADDRESS "1 Microsoft Way, Redmond, WA 98052-6399")
  (HOOVERS_INDUSTRY "Development Tools, Operating Systems & Utility Software")
  (HOOVERS_SECTOR "Computer Software & Services")
  (HAS-RELATIONSHIP-TYPE (MANUFACTURE-RELATIONSHIP SUPPLY-RELATIONSHIP EXPERTISE-RELATIONSHIP TECHNICAL-ASSISTANCE-RELATIONSHIP RESEARCH-RELATIONSHIP SELL-RELATIONSHIP))
  (COMPANY-LINKS-TO-COMPANY (COMPAQ-COMPUTER-CORPORATION MACROMEDIA-INC. MICROSOFT-CORPORATION))
  (COMPANY-MENTIONS-COMPANY (ADVANTA-CORP. SOFTWARE-AG IGI-INC. DDI-CORPORATION NCR-CORPORATION NAM-CORPORATION INCO-LIMITED DISC-INC. SK-CORP. KI RLI-CORP. TARGET-CORPORATION CREDIT-SUISSE-FIRST-BOSTON LEAR-CORPORATION IDT-CORPORATION DANA-CORPORATION MICROSOFT-CORPORATION RM-PLC))
  (OFFICERS-DIRECTORS-OF-COMPANY ("Laura Jennings" "Chris Williams" "Orlando Ayala Lozano" "Joachim Kempin" "William H. Neukom" "John Connors" "Craig Mundie" "Jon DeVaan" "Brad Chase" "Michel Lacombe" "Bernard P. Vergnes" "Jeffrey S. Raikes" "James E. Allchin" "Paul A. Maritz" "Richard Belluzzo" "Bob Muglia" "Frank M. "Pete" Higgens" "Robert J. Herbold" "Steven A. Ballmer" "William H. Gates III"))
  (LOCATIONS-OF-COMPANY (MIDDLE-EAST ARGENTINA AUSTRALIA AUSTRIA BRAZIL CHILE CHINA COLOMBIA CROATIA CZECH-REPUBLIC DENMARK ECUADOR FINLAND FRANCE GERMANY GREECE HONG-KONG HUNGARY INDIA IRELAND ISRAEL ITALY JAPAN MALAYSIA MEXICO NEW-ZEALAND NORWAY OMAN PERU POLAND PORTUGAL PUERTO-RICO ROMANIA RUSSIA SINGAPORE SLOVAKIA SOUTH-AFRICA SPAIN SWEDEN SWITZERLAND THAILAND TURKEY UNITED-KINGDOM UNITED-STATES URUGUAY VENEZUELA CANADA)))
How do we get external information?
- Query for pages that link to each of the web pages in your training and test sets (transductive learning)
- Take the intersection (with some relaxation)
- Crawl the intersection pages and use IE (manual wrappers or wrapper induction using ML, Muslea et al.)
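
The slide does not define "with some relaxation". One plausible reading, sketched below in Python as an assumption, is to keep any source page that links to at least a given fraction of the target pages rather than to all of them. The function name, the `min_fraction` parameter, and the example URLs are hypothetical.

```python
from collections import Counter


def relaxed_intersection(inlinks, min_fraction=0.5):
    """Return sources that link to at least min_fraction of the targets.

    inlinks maps each target page to the set of pages linking to it.
    min_fraction=1.0 gives the strict intersection; lower values relax it
    (assumed interpretation of "with some relaxation").
    """
    counts = Counter()
    for sources in inlinks.values():
        counts.update(set(sources))  # count each source once per target
    needed = min_fraction * len(inlinks)
    return {src for src, n in counts.items() if n >= needed}


# Hypothetical backlink query results for three company pages.
inlinks = {
    "a.example": {"directory.example", "portal.example"},
    "b.example": {"directory.example"},
    "c.example": {"directory.example", "portal.example", "blog.example"},
}
strict = relaxed_intersection(inlinks, min_fraction=1.0)
relaxed = relaxed_intersection(inlinks, min_fraction=0.5)
```

Sources that survive the threshold are the candidates to crawl and run IE over, since they describe many of the entities being classified.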
Results using only extracted information
- Task: learn a decision tree to classify companies using the extracted features.
- Implementation used: C5.0, with several variations on the pruning parameter and using IG and IG ratio as splitting criteria.
- Features used: decision trees impose restrictions on the type of features that can be used. Set-valued features are not easily used; instead, just the cardinality of the set is used as the feature.
- Results: 18% accuracy (28 classes), 7% accuracy (255 classes). Not great, but better than baseline.
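
To make the feature encoding and the two splitting criteria concrete, here is a small Python sketch. It reduces a set-valued feature to its cardinality and computes information gain and gain ratio. As a simplification it treats each distinct cardinality as a discrete value, whereas a tool like C5.0 would choose numeric thresholds; the toy data and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def info_gain(values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain normalised by the entropy of the split itself."""
    split_info = entropy(values)
    return info_gain(values, labels) / split_info if split_info > 0 else 0.0


# Hypothetical set-valued feature (e.g. competitors) reduced to cardinality.
competitors = [{"SUN", "ADOBE", "COREL"}, {"ADOBE"}, set(), set()]
labels = ["software", "software", "retail", "retail"]
cardinality = [len(s) for s in competitors]  # [3, 1, 0, 0]
```

Gain ratio penalises features that fragment the data into many small groups, which matters here because raw cardinalities can take many distinct values.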
BUT…
- There is still hope
- The features are only sometimes accurate but could still provide information to “correct” NB predictions
- How should we combine the text on the web page with these extracted features?
Combining text with IE
- Throw in the words with the features and learn a decision tree
- Throw in the extracted features and use NB
- Summarize the words into the Naïve Bayes prediction and throw it into the decision tree as ONE additional feature
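
The third option amounts to stacking: the text classifier's output becomes one more column in the decision tree's input. A minimal Python sketch follows; the function name, the `nb_prediction` key, and the stand-in predictor are hypothetical, and any callable that maps words to a class label would do.

```python
def stack_nb_prediction(extracted, page_words, nb_predict):
    """Return a feature row: extracted features plus the NB class as one feature.

    extracted  -- dict of features from IE (e.g. set cardinalities)
    page_words -- list of words from the company's web pages
    nb_predict -- callable mapping a word list to a predicted class label
    """
    row = dict(extracted)                       # copy, leave the input untouched
    row["nb_prediction"] = nb_predict(page_words)
    return row


# Stand-in for a trained Naive Bayes classifier (illustration only).
def fake_nb(words):
    return "Computer Software" if "windows" in words else "Retail"


row = stack_nb_prediction({"num_competitors": 3}, ["windows", "server"], fake_nb)
```

When building training rows this way, the NB predictions would typically come from held-out folds, so that the decision tree does not learn from predictions made on data NB was trained on.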
Results
Sample Rules
Are the extracted features useless?
- Probably NOT
- Need a better representation and learner
- Need a relational learner (FOIL) to find regularities like:
  - If NB predicts Computer Hardware AND NB confidence < 0.9 AND company mentions companies in Computer Software AND NOT company links to company in Computer Software, then Class = Computer Software
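
Written out as code, the example regularity is a single conjunction over the NB output and two relational facts. The Python sketch below hard-codes that one rule for illustration; the function and argument names are hypothetical, and a relational learner would induce such rules from data rather than have them written by hand.

```python
def apply_example_rule(nb_class, nb_confidence, mentioned_sectors, linked_sectors):
    """Apply the example correction rule; otherwise keep the NB prediction.

    mentioned_sectors -- sectors of the companies this company mentions
    linked_sectors    -- sectors of the companies this company links to
    """
    if (nb_class == "Computer Hardware"
            and nb_confidence < 0.9
            and "Computer Software" in mentioned_sectors
            and "Computer Software" not in linked_sectors):
        return "Computer Software"
    return nb_class


corrected = apply_example_rule(
    "Computer Hardware", 0.7, {"Computer Software", "Retail"}, {"Airlines"})
unchanged = apply_example_rule(
    "Computer Hardware", 0.95, {"Computer Software"}, set())
```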
Are links useless?
- Maybe, but again, probably not
- Links need to be used intelligently/selectively
What next?
- A better representation?
- Relational learning
Classification
- Classify test instance d by:
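
The formula itself is not part of this slide's text. Assuming it is the decision rule of the multinomial Naïve Bayes classifier with Laplace smoothing named earlier in the deck (an assumption, not confirmed by the slide), the standard form is given below. Here $w_i$ are the words of $d$, $N(w, c)$ is the count of word $w$ in training documents of class $c$, and $V$ is the vocabulary; these symbols are introduced only for this sketch.

```latex
\hat{c}(d) = \arg\max_{c} \; P(c) \prod_{i=1}^{|d|} P(w_i \mid c),
\qquad
P(w \mid c) = \frac{1 + N(w, c)}{|V| + \sum_{w' \in V} N(w', c)}
```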
FOIL
Quinlan & Cameron-Jones (1993)
- Learns relational rules like:
  target_page(A) :- has_research(A), link(A, B), has_publications(B).
- For each test example:
  - Pick the matching rule with the best training-set performance p.
  - Predict positive with confidence p.
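
The prediction step described above can be sketched in a few lines of Python. This covers only rule application, not FOIL's rule induction. The page representation, the performance values, and the choice to return `None` when no rule matches are assumptions made for illustration.

```python
def rule_research_links_to_publications(page, pages):
    """target_page(A) :- has_research(A), link(A, B), has_publications(B)."""
    return page["has_research"] and any(
        pages[b]["has_publications"] for b in page["links"] if b in pages)


def rule_has_research(page, pages):
    """A second, weaker hypothetical rule: target_page(A) :- has_research(A)."""
    return page["has_research"]


def predict_with_rules(page, pages, rules):
    """Return confidence p of the best matching rule, or None if none match."""
    matching = [p for rule, p in rules if rule(page, pages)]
    return max(matching) if matching else None


# Hypothetical pages and training-set performance values p.
pages = {
    "a": {"has_research": True, "has_publications": False, "links": ["b"]},
    "b": {"has_research": False, "has_publications": True, "links": []},
}
rules = [(rule_research_links_to_publications, 0.8), (rule_has_research, 0.6)]
conf_a = predict_with_rules(pages["a"], pages, rules)
conf_b = predict_with_rules(pages["b"], pages, rules)
```

Page "a" matches both rules, so the higher training-set performance is used as the confidence; page "b" matches neither.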