0335959e59bf6cb5eedab104f2c13642.ppt
- Number of slides: 33
Ancestry.com Data: DEG Results, Observations, & Future Directions
Acknowledgements
Extraction Ontologies
• EOs:
  – Conceptual models for some domain
  – Data frames for each concept
    • Instance recognizers & IO converters
    • Operations with recognizers and instance recognizers for parameters
  – Used for information extraction, free-form query processing, information integration, semantic-web applications, …
• But, for the Ancestry.com project, only instance recognition
  – Regular-expression recognition rules
  – Dictionary-based recognition rules
  – Combinations
  – Heuristic rules
    • Disambiguation within context
    • Specialized case sensitivity
EO Rules – Names
• Dictionaries
  – First & last names
    • From 1990 census (http://www.census.gov/genealogy/names/)
    • Minus stopwords (http://www.lextek.com/manuals/onix/stopwords2.html)
  – Titles dictionary: {Mr, Mrs, Miss, Dr, Rev, …}
• Simple regular expressions
• Dictionary & regular-expression combinations
EO Rules – Putting it all together
• Baseline: dictionary
  – Two or more consecutive dictionary words
  – Require capitalization
• Extraction Ontology (EO)
  – Regular expressions
  – Exclusions: schools, addresses, …
• Patterns (various)
  – Based on document structure
  – Not generally applicable elsewhere
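The dictionary baseline on this slide (two or more consecutive dictionary words, capitalization required) can be sketched roughly as follows. This is only an illustration: the tiny word sets and the function name `baseline_names` are assumptions made for the example, standing in for the census-derived dictionaries, and the tokenization is deliberately naive.

```python
import re

# Hypothetical stand-ins for the census-derived first- and last-name dictionaries.
FIRST_NAMES = {"william", "giles", "stanton", "charles"}
LAST_NAMES = {"blake", "somerby", "whitmore"}
NAME_WORDS = FIRST_NAMES | LAST_NAMES


def baseline_names(text):
    """Return runs of two or more consecutive capitalized dictionary words."""
    names, run = [], []
    for token in re.findall(r"[A-Za-z]+", text):
        if token[0].isupper() and token.lower() in NAME_WORDS:
            run.append(token)
        else:
            if len(run) >= 2:
                names.append(" ".join(run))
            run = []
    if len(run) >= 2:  # flush a run that reaches the end of the text
        names.append(" ".join(run))
    return names


print(baseline_names("the son of Giles Blake of Little Baddow Essex"))
print(baseline_names("by request of Stanton Blake Esq made extended researches"))
```

The second call returns "Stanton Blake" without the trailing "Esq", which is the kind of partial match the Example 1 slides count as a false positive.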
Extraction Ontology (EO)
• \b({Title}\s+){0,1}{First}\s+([A-Z]\s+){0,1}{Last}\b
  – John Doe
  – Mr John Q Doe
• \b{Title}\s+{Last}\b
  – Mr Doe
• \b{Title}( [A-Z][A-Za-z]*){1,3}\b
  – Mr Anythingcapitalized Upto Threewords
• \b({Last})(\s+{Title})?(\s+{First}|\s+[A-Z]){1,2}\b
  – Doe John
  – Doe Miss Jane
  – Doe Anything Capitalized
• Also exclusions
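One way to read these rules is as ordinary regular expressions in which {Title}, {First}, and {Last} are macros that expand to dictionary alternations. The sketch below assumes that reading, and assumes the rules carry the usual regex backslashes (\b, \s). The `LEXICON` contents and the helper `compile_rule` are hypothetical, not part of the EO engine.

```python
import re

# Hypothetical mini-dictionaries standing in for the real name and title lists.
LEXICON = {
    "Title": ["Mr", "Mrs", "Miss", "Dr", "Rev"],
    "First": ["John", "Jane"],
    "Last": ["Doe"],
}


def compile_rule(rule):
    """Expand {Title}/{First}/{Last} into alternations, then compile the regex."""
    def expand(match):
        # Longest words first so "Mrs" is tried before "Mr".
        words = sorted(LEXICON[match.group(1)], key=len, reverse=True)
        return "(?:" + "|".join(re.escape(w) for w in words) + ")"
    # Only the three macro names are expanded; quantifiers like {0,1} are untouched.
    return re.compile(re.sub(r"\{(Title|First|Last)\}", expand, rule))


FULL_NAME = compile_rule(r"\b({Title}\s+){0,1}{First}\s+([A-Z]\s+){0,1}{Last}\b")
TITLE_LAST = compile_rule(r"\b{Title}\s+{Last}\b")

print(FULL_NAME.search("met Mr John Q Doe today").group(0))
print(TITLE_LAST.search("then Mr Doe left").group(0))
```

With real dictionaries the alternations become very large, so an actual engine would likely look tokens up in a set rather than compile them into one pattern.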
Numbers
• Using training data examples
• Should use dev-test numbers
EO Results – Example 1
THE BLAKE FAMILY IN ENGLAND IN a Genealogical History of William Blake of Dorchester published in 1857 appears the statement that the emigrant to New England was the son of Giles Blake of Little Baddow Essex and the record of several generations of the family is given The sub stance of this record is trustworthy as being a copy from Morant ' s History of Essex but the statement that the Dorchester settler was of this family was unwarranted by any evidence Subsequently the late H G Somerby Esq by request of Stanton Blake Esq made extended researches in England to determine the origin of the American family He finally located it at Over Stcwey Somerset and the results of his investigations were published in 1881 by W H Whitmore Esq in A Record of the Blakes of Somersetshire The evidences upon which Mr Somerby based his conclusions - were first the record of a. N baptism in 1594 at Over Stowey of a William Blake … …
EO Results – Example 1
Method | Correct | Precision | Recall | F1
Dictionary | 5 / 13 | 55.56% | 38.46% | 45.45%
Correct matches: William Blake; Giles Blake; William Blake; Mr Blake; William Blake
Missed entities: H G Somerby Esq; Stanton Blake Esq; W H Whitmore Esq; Mr Somerby; Rev Charles M Blake; William Blake Esq; William Arthur Jones Esq A M; Edward J Blake Esq
False positives: Stanton Blake; Rev Charles; William Arthur Jones; William Blake
Results measured on set-aside TRAINING data
EO Results – Example 1
Method | Correct | Precision | Recall | F1
Dictionary | 5 / 13 | 55.56% | 38.46% | 45.45%
(Adjusted) | 8 / 13 | 88.89% | 61.54% | 72.73%
Correct matches: William Blake; Giles Blake; William Blake; Mr Blake; William Blake
Missed entities: H G Somerby Esq; Stanton Blake Esq; W H Whitmore Esq; Mr Somerby; Rev Charles M Blake; William Blake Esq; William Arthur Jones Esq A M; Edward J Blake Esq
False positives: Stanton Blake; Rev Charles; William Arthur Jones; William Blake
Results measured on set-aside TRAINING data
EO Results – Example 1
Method | Correct | Precision | Recall | F1
Dictionary | 5 / 13 | 55.56% | 38.46% | 45.45%
(Adjusted) | 8 / 13 | 88.89% | 61.54% | 72.73%
EO | 7 / 13 | 63.64% | 53.85% | 58.33%
Correct matches: William Blake; Giles Blake; Mr Somerby; William Blake; Rev Charles M Blake; Mr Blake; William Blake
Missed entities: H G Somerby Esq; Stanton Blake Esq; W H Whitmore Esq; William Blake Esq; William Arthur Jones Esq A M; Edward J Blake Esq
False positives: Stanton Blake; William Arthur; William Blake; Edward J Blake
Results measured on set-aside TRAINING data
EO Results – Example 1
Method | Correct | Precision | Recall | F1
Dictionary | 5 / 13 | 55.56% | 38.46% | 45.45%
(Adjusted) | 8 / 13 | 88.89% | 61.54% | 72.73%
EO | 7 / 13 | 63.64% | 53.85% | 58.33%
(Adjusted) | 10 / 13 | 90.91% | 76.92% | 83.33%
Correct matches: William Blake; Giles Blake; Mr Somerby; William Blake; Rev Charles M Blake; Mr Blake; William Blake
Missed entities: H G Somerby Esq; Stanton Blake Esq; W H Whitmore Esq; William Blake Esq; William Arthur Jones Esq A M; Edward J Blake Esq
False positives: Stanton Blake; William Arthur; William Blake; Edward J Blake
Results measured on set-aside TRAINING data
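The unadjusted figures in the Example 1 tables are consistent with the standard definitions of precision, recall, and F1. The slides do not spell these out, so the sketch below assumes them: precision is correct matches over all matches returned, recall is correct matches over all entities in the sample, and F1 is their harmonic mean. The function name `prf` is illustrative.

```python
def prf(correct, false_positives, total_entities):
    """Return (precision, recall, F1) as fractions between 0 and 1."""
    returned = correct + false_positives
    precision = correct / returned if returned else 0.0
    recall = correct / total_entities if total_entities else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Dictionary baseline on Example 1: 5 correct, 4 false positives, 13 entities.
p, r, f = prf(5, 4, 13)
print(f"{100 * p:.2f}% {100 * r:.2f}% {100 * f:.2f}%")  # 55.56% 38.46% 45.45%

# EO on Example 1: 7 correct, 4 false positives, 13 entities.
p, r, f = prf(7, 4, 13)
print(f"{100 * p:.2f}% {100 * r:.2f}% {100 * f:.2f}%")  # 63.64% 53.85% 58.33%
```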
EO Results – Example 2
Birmingham City Directory
ms. M & ffi , 113 2 lst Street. Wholesale and Retail dealers in all grade * of Cook and Heating stoves. Also the celebra ted Jno , Vann , Wm. Kiser and Char * ter Oak. Bangea. 1 S 4 K. L. POLK & CO. ' S Coker Henry W , dry goods and grocer , Washington ave bet Hiekman and Maple , Irondale Coke Newton , lab Aliee Furnace Coker Wesley T , elk W R Coker , res Washington ave cor Hiekman , Irondale Coker Wm R , dry good. and grocer , Washington ave , cor Hiekman , Irondale C ' ola Carlo , lunch house. 19 20 th s , res same Colby Mips Alice , bds 280 ! ) 2 d ave Coldirr. U see also { {. ' aldircll Coldwell K A , bds 22 : 50 4 th ave Coldwell Wm , …
Example 2 - Dictionary
Method | Correct | Precision | Recall | F1
Dictionary | 31 / 78 | 51.67% | 39.74% | 44.93%
Correct matches: Coke Newton; Coldwell Wm; Cole Alexander; Cole Burt; Cole Charles; Wm Hood; Cole Frank; Cole James; Cole John; … and others …
Missed entities: Coker Henry W; Coker Wesley T; Coker Wm R; C ' ola Carlo; Colby Mips Alice; Coldwell K A; Cole Artolphus; … and others…
False positives: Coker Henry Furnace Coker Wesley Coker Wm Rolling Mill Peter Zins Birmingham Rolling Mill Cleveland Cole Charles Cole Franklin Cole James Cole John … and others …
Results measured on set-aside TRAINING data
Example 2 - EO
Method | Correct | Precision | Recall | F1
Dictionary | 31 / 78 | 51.67% | 39.74% | 44.93%
EO | 47 / 78 | 72.31% | 60.26% | 65.73%
Exclusion patterns: no effect
Correct matches: Coker Henry W; Coke Newton; Coker Wesley T; Coker Wm R; Coldwell K A; Coldwell Wm; Cole Alexander; Cole Burt; Cole Charles; Wm Hood; … and others …
Missed entities: Coker C ' ola Carlo Colby Mips Alice Cole Artolphus Cole Cradford Cole Charles H ] ) orter II Herxfcld Cole P W … and others…
False positives: Peter Zins Cleveland Cole Charles Furnace Cole P Peter Zins Cole Samuel I Alice Furnace Bains Cole Wm Cole Win Bessemer Cole Wm Davis Coleman Bettie … and others …
Results measured on set-aside TRAINING data
Can we do better? What if we knew the record boundaries? Manually add boundaries… Use the knowledge that a record usually starts with a name… This will measure the value of boundary information.
Example 2 – Segmented
Birmingham City Directory
ms. M & ffi , 113 2 lst Street. Wholesale and Retail dealers in all grade * of Cook and Heating stoves. Also the celebra ted Jno , Vann , Wm. Kiser and Char * ter Oak. Bangea. 1 S 4 K. L. POLK & CO. ' S
## Coker Henry W , dry goods and grocer , Washington ave bet Hiekman and Maple , Irondale
## Coke Newton , lab Aliee Furnace
## Coker Wesley T , elk W R
## Coker , res Washington ave cor Hiekman , Irondale
## Coker Wm R , dry good. and grocer , Washington ave , cor Hiekman , Irondale
## C ' ola Carlo , lunch house. 19 20 th s , res same
## Colby Mips Alice , bds 280 ! ) 2 d ave
## Coldirr. U see also { {. ' aldircll
## Coldwell K A , bds 22 : 50 4 th ave
## Coldwell Wm , …
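Once record boundaries are marked with ##, the idea that "a record usually starts with a name" can be illustrated with a few lines of Python. The sketch assumes, purely for illustration, that the name runs up to the first comma of each record. The slides do not say how the end of the name was actually determined, and `leading_names` is a hypothetical helper.

```python
def leading_names(segmented_text, marker="##"):
    """Treat the text before the first comma of each marked record as the name."""
    names = []
    # Everything before the first marker is page header noise, so skip it.
    for record in segmented_text.split(marker)[1:]:
        candidate = record.split(",", 1)[0].strip()
        if candidate:
            names.append(candidate)
    return names


sample = (
    "K. L. POLK & CO. ' S "
    "## Coker Henry W , dry goods and grocer , Washington ave "
    "## Coke Newton , lab Aliee Furnace "
    "## Coldwell K A , bds 22 : 50 4 th ave"
)
print(leading_names(sample))
```

A record with no comma, such as the "see also" cross-reference, would be returned whole by this rule, which is in line with that entry showing up among the false positives for the segmented runs.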
Example 2 – Segmented (Ideal)
Method | Correct | Precision | Recall | F1
Dictionary | 31 / 78 | 51.67% | 39.74% | 44.93%
EO | 47 / 78 | 72.31% | 60.26% | 65.73%
Segmented* | 65 / 78 | 95.59% | 83.33% | 89.04%
Correct matches: Coker Henry W; Coke Newton; Coker Wesley T; Coker Wm R; C ' ola Carlo; Colby Mips Alice; Coldwell K A; … and others …
Missed entities: Wm Hood ] ) orter II Herxfcld ( t ' Bains j > orter Tompson Francis i. V Cheiiovveth Cole Wm C Cole J L Davis … and others…
False positives: Coldirr. U see also { { Cole Wm C ( Reamer Cole & Co )
Results measured on set-aside TRAINING data
Example 2 – Segmented + EO (Ideal)
Method | Correct | Precision | Recall | F1
Dictionary | 31 / 78 | 51.67% | 39.74% | 44.93%
EO | 47 / 78 | 72.31% | 60.26% | 65.73%
Segmented* | 65 / 78 | 95.59% | 83.33% | 89.04%
Seg.*+EO | 66 / 78 | 86.84% | 84.62% | 85.71%
Correct matches: Coker Henry W; Coke Newton; Coker Wesley T; Coker Wm R; C ' ola Carlo; Colby Mips Alice; Coldwell K A; … and others …
Missed entities: ] ) orter II Herxfcld ( t ' Bains j > orter Tompson Francis i. V Cheiiovveth Cole Wm C Cole J L Davis E K Fulton … and others…
False positives: Coldirr. U see also { { Cole Wm C ( Reamer Cole & Co ) Peter Zins Alice Furnace Whilden A Alice Furnace Loo M
Results measured on set-aside TRAINING data
EO Results – Example 3
4 Index to the Precinct Re. Sisters of Tuolumne County i Index to the Precinct Registers of Tuolumne County 5 Name co Address No Name C Address No Name ICo Address No Name Address No Gore Asihford Gray lohn A G alt John Goss An lrew Gerken Ilerman 11 Getchell Everett G Gallup Will Seneca Gr ndl W illiam Ilarten J osep IIenderson Robert Harper Edwin F Hall … …
EO Results – Example 3
Method | Correct | Precision | Recall | F1
Dictionary | 4 / 72 | 19.05% | 5.56% | 8.60%
Correct matches: Kelly Martin; Leslie Christopher Fred; Madison Charles Albert; Marconi Frank Frederick
Missed entities: Gore Asihford; Gray lohn A; G alt John; Goss An lrew; Gerken Ilerman 11; Getchell Everett G; Gallup Will Seneca; Gr ndl W illiam; Ilarten J osep; IIenderson Robert; … and others…
False positives: John Goss Getchell Everett Robert Harper Edwin Robert Barkley Thomas Michael Thomas Douglas Frank Hayes William George Jordan Joh Kelly Patrick Rufus Clifton Kurr … and others …
Results measured on set-aside TRAINING data
EO Results – Example 3
Method | Correct | Precision | Recall | F1
Dictionary | 4 / 72 | 19.05% | 5.56% | 8.60%
EO | 10 / 72 | 41.67% | 13.89% | 20.83%
Correct matches: Getchell Everett G; Harper Edwin F; Hayes William George; Kelly Patrick V; Kelly Martin; Kessler Peter Frederick; Lang Charles Lewis; Leslie Christopher Fred; Madison Charles Albert; Marconi Frank Frederick
Missed entities: Gore Asihford; Gray lohn A; G alt John; Goss An lrew; Gerken Ilerman 11; Gallup Will Seneca; Gr ndl W illiam; Ilarten J osep; IIenderson Robert; … and others…
False positives: John Goss Robert Barkley Thomas Michael Thomas Douglas George P Harp Jordan Joh Rufus Clifton Forrest Lumsden Paul B Christian F Morrison Robert David V Meiser Frederick S John Felix
Results measured on set-aside TRAINING data
EO Results – Example 3
Method | Correct | Precision | Recall | F1
Dictionary | 4 / 72 | 19.05% | 5.56% | 8.60%
EO | 10 / 72 | 41.67% | 13.89% | 20.83%
New lines* | 62 / 72 | 73.81% | 86.11% | 79.49%
Pattern 1: Alphabetical data up to new lines, all columns
Correct matches: Gore Asihford; Gray lohn A; G alt John; Goss An lrew; Getchell Everett G; Gallup Will Seneca; Gr ndl W illiam; Ilarten J osep; … and others …
Missed entities: Gerken Ilerman 11 HIarlp' r C harles Frank IIughes Jolhn. 1 liughes Charles. 1 Haill Ed ward. 1 Jordan Joh n. Alfred 1 honalp I Lind say Alexalnder … and others …
False positives: Name co Address No Name C Address No Name ICo Address No Name Address No Gerken Ilerman HIarlp IIughes Jolhn liughes Charles Haill Ed ward Jordan Joh n. Alfred Bi ak Flat … and others …
Results measured on set-aside TRAINING data
EO Results – Example 3
Method | Correct | Precision | Recall | F1
Dictionary | 4 / 72 | 19.05% | 5.56% | 8.60%
EO | 10 / 72 | 41.67% | 13.89% | 20.83%
New lines 1 | 62 / 72 | 73.81% | 86.11% | 79.49%
New lines 2 | | 87.32% | | 86.71%
Pattern 2: Alphabetical data up to new lines, selected columns
Correct matches: Gore Asihford; Gray lohn A; G alt John; Goss An lrew; Getchell Everett G; Gallup Will Seneca; Gr ndl W illiam; Ilarten J osep; IIenderson Robert; … and others …
Missed entities: Gerken Ilerman 11 HIarlp' r C harles Frank IIughes Jolhn. 1 liughes Charles. 1 Haill Ed ward. 1 Jordan Joh n. Alfred 1 honalp I Lind say Alexalnder … and others …
False positives: Gerken Ilerman HIarlp IIughes Jolhn liughes Charles Haill Ed ward Jordan Joh n. Alfred Lind say Alexalnder i C urpliy Pa. I Mocalrtoe lhester lioujalin i
Results measured on set-aside TRAINING data
EO Results – Example 3
Method | Correct | Precision | Recall | F1
Dictionary | 4 / 72 | 19.05% | 5.56% | 8.60%
EO | 10 / 72 | 41.67% | 13.89% | 20.83%
New lines 1 | 62 / 72 | 73.81% | 86.11% | 79.49%
New lines 2 | | 87.32% | | 86.71%
New lines 3 | 45 / 72 | 84.91% | 62.5% | 72.00%
Pattern 3: All data up to new line or digit, selected columns
Correct matches: Gore Asihford G alt John Getchell Everett G Gr ndl W illiam IIenderson Robert Hall Charles. Ps rry Ilartoen Thomas Douglas Hayes William George … and others …
Missed entities: Gray lohn A; Goss An lrew; Gerken Ilerman 11; Gallup Will Seneca; Ilarten J osep; Harper Edwin F; liarlan Robert Barkley; … and others …
False positives: Gerken Ilerman HIarlp'r C harles Frank liughes Charles. Haill Ed ward. Jordan Joh n. Alfred Lind say Alexalnder i C urpliy Pa. I'trick Mocalrtoe lhester lioujalin i
Results measured on set-aside TRAINING data
Context Exclusion
• Name not part of something else
  – Address: Hollis Long Island N Y
  – Schools: Trenton State Normal School
  – Companies: K. L. Polk & Co.
• Note: It might be interesting to recognize these names too, but mark them as being part of something else.
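A rough sketch of context exclusion: drop a candidate name when it contains, or is closely followed by, a word that signals an address, school, or company. The cue-word list, the two-word lookahead window, and the function name `exclude_in_context` are all assumptions chosen for illustration; the slides name only the categories, not the actual exclusion rules.

```python
# Hypothetical cue words for addresses, schools, and companies.
CUE_WORDS = {"School", "College", "Ave", "Island", "Co"}


def exclude_in_context(text, candidates, window=2):
    """Keep only candidates that are not part of an address, school, or company."""
    kept = []
    for name in candidates:
        start = text.find(name)  # first occurrence only; enough for a sketch
        if start < 0:
            continue
        after = text[start + len(name):].split()[:window]
        following = [word.strip(".,") for word in after]
        inside = any(word.strip(".,") in CUE_WORDS for word in name.split())
        if inside or any(word in CUE_WORDS for word in following):
            continue
        kept.append(name)
    return kept


text = ("Trenton State Normal School , Hollis Long Island N Y , "
        "Martha C Bunnell , K. L. Polk & Co.")
print(exclude_in_context(
    text, ["Trenton State", "Hollis Long", "Martha C Bunnell", "K. L. Polk"]))
```

Following the note on the slide, a variant could return the excluded names tagged with the cue that triggered the exclusion instead of discarding them.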
EO Results – Example 4
I GERTRUDE SMITH (Mrs William E Haines deceased) Married shortly after graduation Died at age of 22 Was musician and taught piano lessons 1898 HOBART L BENEDICT Millburn Essex County N J Graduated from Rutgers 1902 and from New York Law School in 1904 with degrees of B Sc M Sc and LL B Married April 9 1907 to Martha C Bunnell One daughter Elizabeth Benedict Counsellor at law with offices in Elizabeth and Millburn MARTHA BUNNELL (Mrs Hobart L Benedict) Millburn Essex County N J Married to Hobart L Benedict on date above 1899 CORA SMITH (Mrs Louis Slingerland) 557 Third St South St Peters- burg Florida Married … …
Example 4 - Dictionary
Method | Correct | Precision | Recall | F1
Dictionary | 6 | 27.27% | 28.57% | 27.91%
Correct matches: Elizabeth Benedict; MARTHA BUNNELL; Mrs Louis Slingerland; Mrs Harry Engel; Mrs Leslie Ward
Missed entities: GERTRUDE SMITH; Mrs William E Haines; HOBART L BENEDICT; Martha C Bunnell; Mrs Hobart L Benedict; CORA SMITH; Mr Slingerland; JENNIE HAINES; STELLA ILLSLEY; WALTER BOSCHEN; GEORGE Mc. QUAIDE; MARGARET HAINES; ABBY HEADLEY; CLARENCE GRIGGS
False positives: Mrs William York Law School Mrs Hobart High School Mr Slingerland Ave Union School Trenton Looker School Union Town Hollis Long Island Morris Ave Union Battin High School Rutgers College Ave Union Trenton State School Roselle
Results measured on TRAINING data
Example 4 - Tuned
Method | Correct | Precision | Recall | F1
Dictionary | 6 | 27.27% | 28.57% | 27.91%
Tuned EO | 18 | 94.74% | 85.71% | 90.00%
Without exclusion | 18 | 85.71% | 85.71% | 85.71%
Correct matches: GERTRUDE SMITH; Mrs William E Haines; Martha C Bunnell; Elizabeth Benedict; MARTHA BUNNELL; Mrs Hobart L Benedict; CORA SMITH; Mrs Louis Slingerland; Mr Slingerland; JENNIE HAINES; STELLA ILLSLEY; Mrs Harry Engel; GEORGE Mc. QUAIDE; MARGARET HAINES; ABBY HEADLEY; Mrs Leslie Ward; CLARENCE GRIGGS
Missed entities: HOBART L BENEDICT; Hobart L Benedict; WALTER BOSCHEN
False positives: Hollis Long
Additional false positives without exclusion:
• Morris Ave
• Trenton State
Results measured on TRAINING data
Example 4 – Broadened
Method | Correct | Precision | Recall | F1
Dictionary | 6 | 27.27% | 28.57% | 27.91%
Tuned EO | 18 | 94.74% | 85.71% | 90.00%
Broadened | 12 | 57.14% | 57.14% | 57.14%
Correct matches: Mrs William E Haines; HOBART L BENEDICT; Martha C Bunnell; Elizabeth Benedict; MARTHA BUNNELL; Mrs Hobart L Benedict; Mrs Louis Slingerland; Mrs Harry Engel; Mrs Leslie Ward
Missed entities: GERTRUDE SMITH; CORA SMITH; JENNIE HAINES; STELLA ILLSLEY; WALTER BOSCHEN; GEORGE Mc. QUAIDE; MARGARET HAINES; ABBY HEADLEY; CLARENCE GRIGGS
False positives, candidates for exclusion (place names): Hollis Long Island N Y; Union N J; Springfield N J; Union N J; Elizabeth N J; Newark N J; Union N J; Newark N J
Results measured on TRAINING data
Example 4 – Patterns
Method | Correct | Precision | Recall | F1
Dictionary | 6 | 27.27% | 28.57% | 27.91%
Tuned EO | 18 | 94.74% | 85.71% | 90.00%
Broadened | 12 | 57.14% | 57.14% | 57.14%
Pattern* | 13 | 100.00% | 61.90% | 76.47%
Patterns:
• ALL CAPS, multiple words of multiple letters
• Initial capitals inside of parentheses
Correct matches: GERTRUDE SMITH; MARTHA BUNNELL; Mrs Hobart L Benedict; CORA SMITH; Mrs Louis Slingerland; JENNIE HAINES; STELLA ILLSLEY; Mrs Harry Engel; WALTER BOSCHEN; MARGARET HAINES; ABBY HEADLEY; Mrs Leslie Ward; CLARENCE GRIGGS
Missed entities: Mrs William E Haines; HOBART L BENEDICT; Martha C Bunnell; Elizabeth Benedict; Hobart L Benedict; Louis Slingerland; Mr Slingerland; GEORGE Mc. QUAIDE
False positives: None
Results measured on TRAINING data
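The two document patterns named on this slide can be approximated with a pair of regular expressions. The exact expressions below are an assumption made for illustration, not the rules used to produce the table: the first matches two or more consecutive all-caps words of at least two letters each, and the second matches a parenthesized span made up only of capitalized words.

```python
import re

# ALL CAPS, multiple words of multiple letters (single initials break the run).
ALL_CAPS = re.compile(r"\b[A-Z]{2,}(?:\s+[A-Z]{2,})+\b")
# Initial capitals inside of parentheses (every word must start with a capital).
IN_PARENS = re.compile(r"\(([A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)+)\)")


def pattern_names(text):
    """Return names found by either pattern, in document order."""
    spans = [(m.start(), m.group(0)) for m in ALL_CAPS.finditer(text)]
    spans += [(m.start(1), m.group(1)) for m in IN_PARENS.finditer(text)]
    return [name for _, name in sorted(spans)]


sample = ("I GERTRUDE SMITH (Mrs William E Haines deceased) Married shortly "
          "after graduation 1898 HOBART L BENEDICT Millburn Essex County N J "
          "MARTHA BUNNELL (Mrs Hobart L Benedict) Millburn")
print(pattern_names(sample))
```

On this sample the sketch skips "HOBART L BENEDICT" (single-letter initial) and "Mrs William E Haines" (the parentheses also contain the lowercase word "deceased"), both of which appear under missed entities on the slide.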
Extraction Ontologies – circa May 09
• May 09 version of the EO engine
  – OCR problems reduce potential accuracy
  – Can tune rules, potentially even for OCR, but …
• Summer 09 observations
  – Document patterns can help
  – But must apply judiciously
    • Discover Pattern (possibly more than one simultaneously)
    • Discover Extent of Pattern
Future Work
• Test-set trials for EO
  – Code backend conversion to work with Thomas’s evaluator
  – Tune expressions, as needed, on training set, …
  – Run trials
• Revise EO engine for patterns
  – Pattern Discovery
    • Extract with dictionary
    • Test for patterns
      – Internal patterns such as all caps or name order
      – External patterns such as within parens, bounded by …
  – Pattern-Extent Discovery
    • Pattern sequence begin and end
    • Multiple patterns within the same extent