c8d78f41a3c850c6038015bf6b033ed3.ppt
- Количество слайдов: 25
Automatically Extracting Structure and Data from Business Reports Stephen W. Liddle School of Accountancy and Information Systems Marriott School of Management Brigham Young University Co-Authors: Douglas M. Campbell, Chad Crawford
Overview What are business reports and why do we care about them? Extracting structure and data n n Field types Line types Page headers/footers Inferring recursive group structure Experimental results 11/3/99 CIKM’ 99, Copyright © 1999, Stephen W. Liddle 2
Business Reports Business reports are used to disseminate information pertinent to business operations n Financial, inventory, production They are the result of periodic data processing n n 11/3/99 Daily, weekly, monthly, etc. COBOL, 4 GL’s, report writers CIKM’ 99, Copyright © 1999, Stephen W. Liddle 3
Business Reports RUN 05/21/99 12: 34: 56 CUST NBR 10355 9657 9541 11225 L A R G E CD NBR 006 008 009 011 00551 9994 9992 9991 9989 C D R E P O R T N A M E ACCR: 04/26/99 POST: 05/21/99 RATE MATURITY OFC . 06005. 05990. 06250 03/07/99 04/22/99 05/16/99 MS MS BALANCE JASON MASON CONSTRUCTION INC FANNY M RYEBERG JOHN SMITH JR BARNEY FIFE 100, 000. 00 300, 000. 56 1, 100, 000. 00 105, 529. 23 * * * TOTAL LARGE CD * * * PAGE 001 1, 605, 529. 79 CLL 3605 12 BANK 012 CUSTOMER NUMBER OFF CUSTOMER NAME LDG BRANCH 0402 ORG DATE MAT DATE MY TOWN BANK-CHARGE OFF BK EXCEPTION REPORT - NUMBER 12 PART. SOLD - DETAIL LIST INTEREST RATE ORG AMOUNT LOAN BALANCE LOAN CODE PART. RATE 1234567 9001 81196 09 -25 -92 TUBBS, BILLY 63 DEMAND 14. 50000 60, 825. 23 50, 860. 19 OA 14. 50000 1234569 9002 09644 03 -22 -93 JONES, MARIA 66 DEMAND 12. 75000 30, 079. 41 29, 817. 50 OA 05/19/98 12. 75000 PAGE INTEREST PARTICIPANT OUR OWING PART BALANCE OUR PART. LISTED WITH 9999900—MY TOWN BANK ---------- PART-TOTALS BRANCH TOTALS 11/3/99 44, 428. 59 62, 814. 48 14, 954 - 30, 079. 41 1, 261 - 20, 386. 38 NUMBER COUNT CIKM’ 99, Copyright © 1999, Stephen W. Liddle 64, 814. 97 2 16, 216. 2096, 893. 89 . 00 1 . 00 4
Business Reports M Y T O W N S T A T E TRANS/DATE: 05/21/99 B A N K RUN/DATE: 05/20/98 CASH CONTROL - TIME: 00: 34 TELLER CASH LISTING CASH 4 PAGE NO. 1 OFFICE NUMBER 001. . . . . . . . . TELLER. AMOUNT TYPE BATCH &. . NUMBER. SEQUENCE. . . . . . . . . 001. 15. 00 OUT 214 -23767. 22. 00 OUT 214 -23632. 41. 17 OUT 214 -23651. 45. 00 OUT 214 -23726. . . 190. 00 OUT 214 -23752. 200. 00 OUT 215 -18550. 300. 00 OUT 215 -18579. 300. 00 OUT 214 -23735. . . 400. 00 OUT 214 -23754. 400. 00 OUT 214 -23810. 400. 00 OUT 215 -18548. 500. 00 OUT 215 -18600. . . 1, 000. 00 OUT 214 -23764. 4, 138. 85 OUT 215 -05631. 10, 000. 00 OUT 214 -23780. 20. 00 IN 214 -23686. . . 60. 00 IN 214 -23670. 100. 00 IN 214 -23711. 110. 25 IN 214 -23720. 140. 00 IN 214 -23763. . . 160. 00 IN 214 -23679. 200. 00 IN 214 -23643. 200. 00 IN 214 -23647. 200. 00 IN 214 -23696. . . 211. 00 IN 214 -23751. 280. 00 IN 214 -23655. 300. 00 IN 214 -23739. 300. 00 IN 214 -23770. . . 340. 48 IN 214 -23777. 400. 00 IN 214 -23740. 400. 00 IN 214 -23732. 400. 00 IN 215 -18575. . . 420. 00 IN 214 -23813. 700. 00 IN 214 -23734. 1, 000. 00 IN 214 -23779. 1, 003. 86 IN 214 -23888. . . 1, 240. 85 IN 214 -23718. 1, 506. 00 IN 214 -23742. 1, 806. 00 IN 214 -23692. 6, 000. 00 IN 214 -23688. . . . . . . . . CASH OUT TOTAL 17, 952. 02 * CASH IN TOTAL 17, 498. 44 * NET CASH TOTAL 453. 58 -*. . . . . . . . . 002. 6. 00 OUT 214 -27788. 15. 00 OUT 214 -27821. 25. 00 OUT 214 -28073. 40. 00 OUT 214 -27836. . . 50. 00 OUT 214 -27798. 200. 00 OUT 214 -27790. 200. 00 OUT 215 -18551. 250. 00 OUT 214 -27819. . . 400. 00 OUT 215 -18547. 1, 000. 00 OUT 215 -18545. 1, 000. 00 OUT 214 -27668. 1, 080. 00 OUT 214 -27658. . . 4, 000. 00 OUT 214 -27675. 4, 000. 00 OUT 214 -27662. 3, 545. 42 OUT 215 -05659. 45. 00 IN 214 -27753. . . 50. 00 IN 214 -27810. 60. 00 IN 214 -27807. 95. 00 IN 214 -27725. 265. 00 IN 214 -27723. . . 305. 00 IN 214 -27759. 330. 00 IN 214 -27755. 400. 00 IN 214 -27797. 400. 00 IN 214 -27818. . . 408. 73 IN 214 -27667. 419. 03 IN 214 -27832. 560. 00 IN 214 -27805. 600. 00 IN 214 -27811. . . 850. 00 IN 214 -27650. 1, 000. 00 IN 214 -27640. 1, 821. 85 IN 214 -27678. 4, 200. 00 IN 214 -27785. . . 3, 480. 32 IN 214 -27695. . . . . . . . . CASH OUT TOTAL 15, 811. 42 * CASH IN TOTAL 15, 289. 93 * NET CASH TOTAL 521. 49 -*. . . . . . . . . 003. 10. 00 OUT 214 -18486. 20. 00 OUT 214 -18462. 20. 00 OUT 215 -18640. 00 OUT 214 -27483. . . 50. 00 OUT 214 -18296. 50. 00 OUT 214 -18301. 55. 00 OUT 214 -27456. 120. 77 OUT 214 -18465. . . 137. 00 OUT 214 -27486. 342. 54 OUT 214 -18489. 700. 00 OUT 214 -27490. 1, 255. 00 OUT 214 -18449. . . 1, 705. 59 OUT 215 -18642. 1, 765. 34 OUT 215 -18649. 1, 884. 92 OUT 215 -18629. 15, 000. 00 OUT 214 -27882. . 15 IN 214 -18417. 10. 00 IN 214 -18429. 18. 62 IN 214 -27395. 29. 00 IN 214 -25207. . . 50. 00 IN 214 -27842. 50. 00 IN 214 -18393. 68. 28 IN 214 -27399. 100. 00 IN 214 -27474. . . . . . . . . 11/3/99 CIKM’ 99, Copyright © 1999, Stephen W. Liddle 5
Business Reports BCRCL 10 TO-YOUR TOWN BANK AMOUNT SEQUENCE 123. 45 21409648 679. 00 21409664 170. 00 21409667 38. 00 21409714 5. 00 21409742 13. 75 21409743 849. 77 21409778 211. 43 21409829 291. 58 21409914 15. 63 21409936 50. 00 21409985 1, 500. 00 21410053 20. 39 21410062 257. 61 21410065 7, 467. 35 21410082 692. 90 21410083 927. 98 21410084 25. 00 21410126 7. 50 21410141 7. 50 21410145 60. 00 21410152 27. 50 21410154 20. 00 21410163 7. 50 21410164 7. 00 21410337 75. 00 21410338 15. 00 21410341 45. 00 21410360 258. 00 21410390 100. 00 21410551 333. 33 21410552 852. 34 21410581 110. 00 21410642 50. 00 21410657 449. 60 21410673 112. 80 21410709 61. 24 21410715 11/3/99 PACKAGE DETAIL AMOUNT 81. 56 34. 05 27. 68 528. 94 274. 00 383. 53 511. 46 276. 22 62. 00 35. 00 33. 00 18. 40 19. 25 10. 02 63. 00 4, 751. 00 45. 69 125. 00 75. 00 19. 93 60. 00 59. 00 35. 00 592. 00 64. 06 49. 28 68. 93 56. 64 163. 32 160. 50 5, 000. 00 417. 34 129. 18 SEQUENCE 21410732 21410744 21410750 21410772 21410780 21410793 21410816 21410860 21410883 21410888 21410889 21410890 21410892 21410894 21410905 21410906 21410909 21410919 21410921 21410922 21410923 21410948 21410954 21411252 21411257 21411268 21411269 21411327 21411347 21411362 21411365 21411380 21411383 21411390 21411398 21411401 FROM-MY TOWN STATE BANK AMOUNT SEQUENCE 23. 00 21411405 100. 00 21411408 40. 00 21411409 100. 00 21411416 75. 00 21411420 65. 00 21411423 40. 00 21411430 40. 00 21411431 40. 00 21411438 854. 00 21411491 86. 34 21411515 1, 000. 00 21411546 277. 05 21411579 351. 00 21411614 42. 14 21411678 61. 61 21411682 432. 16 21411737 69. 00 21411765 64. 00 21411768 64. 00 21411769 438. 00 21411791 44. 15 21411801 40. 00 21411815 258. 00 21412121 516. 00 21412153 450. 00 21412164 35. 00 21412183 65. 00 21412193 15. 00 21412195 60. 00 21412196 30. 00 21412220 148. 66 21412269 29. 95 21412300 29. 95 21412301 29. 95 21412310 164. 25 21412326 05/21/99 AMOUNT. 58 71. 00 150. 00 100. 00 68. 00 25. 00 56. 60 71. 50 432. 24 11. 00 258. 00 483. 10 260. 00 115. 00 100. 00 132. 14 30. 00 15. 80 42. 65 22. 75 10. 00 62. 00 70. 00 40. 00 200. 00 69. 09 42. 53 47. 32 17. 08 100. 00 200. 00 10. 00 200. 00 81. 79 25. 29 SEQUENCE 21412327 21412329 21412337 21412340 21412344 21412351 21412377 21412378 21412423 21412426 21412446 21412474 21412475 21412498 21412538 21412543 21412557 21412623 21412646 21412653 21412688 21412699 21412899 21412901 21412903 21412922 21412923 21412924 21412925 21412934 21412935 21412936 21412937 21412944 21412946 JOB-001 SP-P 2414 AMOUNT SEQUENCE 25. 32 21412947 115. 47 21412950 25. 00 21412951 18. 60 21412952 15. 00 21412957 232. 44 21412962 200. 00 21412991 100. 00 21412992 28. 00 21412995 8. 22 21413006 18. 00 21413007 79. 90 21413008 75. 00 21413013 6. 00 21413017 29. 86 21413020 61. 95 21413022 28. 84 21413024 35. 15 21413028 25. 00 21413041 40. 00 21413042 50. 00 21413043 200. 00 21413044 39. 31 21413050 11. 89 21413051 414. 00 21413192 20. 00 21413199 87. 35 21413200 80. 00 21413229 16. 00 21413231 85. 00 21413336 215. 00 21413361 312. 50 21413468 500. 00 21413469 100. 00 21413490 20. 00 21413503 18. 00 21413506 4. 83 21413520 CIKM’ 99, Copyright © 1999, Stephen W. Liddle PAGE 1 DVC-01 AMOUNT 50. 00 136. 80 21. 40 25. 00 50. 00 26. 40 10. 00 25. 00 103. 00 70. 00 799. 63 116. 24 103. 09 1, 000. 00 246. 00 40. 00 35. 36 65. 00 21. 00 30. 00 24. 99 PKT 13 SEQUENCE 21413527 21413531 21413548 21413576 21413580 21413582 21413583 21413596 21413601 21413603 21413639 21413653 21413709 21413713 21413725 21413730 21413752 21413814 21413815 21413817 21413818 21413819 21413820 21413821 YOUR TOWN BANK ( ) FIRST 123. 45 LAST 24. 99 SEP# 00000 47, 452. 45 209 6
Business-Report Structure Pages n n n Rows/columns Page headers/footers Group headers/footers Assumptions n n n 11/3/99 ASCII format (EBCDIC easy to translate) Page boundaries known We can basically ignore blank lines CIKM’ 99, Copyright © 1999, Stephen W. Liddle 7
Type I Reports A type I report exhibits repeated structure only along the row (vertical) dimension RUN 05/21/99 12: 34: 56 CUST NBR CD NBR 006 008 009 011 10355 9657 9541 11225 9994 9992 9991 9989 00551 L A R G E C D N A M E R E P O R T POST: 05/21/99 RATE MATURITY OFC . 06005. 05990. 06250 03/07/99 04/22/99 05/16/99 MS MS BALANCE ACCR: 04/26/99 100, 000. 00 300, 000. 56 1, 100, 000. 00 105, 529. 23 * * * TOTAL LARGE CD * * * 11/3/99 JASON MASON CONSTRUCTION INC FANNY M RYEBERG JOHN SMITH JR BARNEY FIFE PAGE 001 1, 605, 529. 79 CIKM’ 99, Copyright © 1999, Stephen W. Liddle 8
Type II Reports A type II report exhibits repeated structure within rows (along the horizontal dimension) BCRCL 10 PACKAGE DETAIL TO-YOUR TOWN BANK AMOUNT SEQUENCE 123. 45 21409648 679. 00 21409664. . . 7. 50 21410164 7. 00 21410337 333. 33 21410552 852. 34 21410581 110. 00 21410642 50. 00 21410657 449. 60 21410673 112. 80 21410709 61. 24 21410715 AMOUNT SEQUENCE 81. 56 21410732 34. 05 21410744 AMOUNT SEQUENCE 19. 93 21410954 60. 00 21411252 123. 45 21409648 68. 93 21411362 56. 64 21411365 679. 00 21409664 11/3/99 163. 32 160. 50 5, 000. 00 417. 34 129. 18 21411380 21411383 21411390 21411398 21411401 FROM-MY TOWN STATE BANK AMOUNT SEQUENCE 23. 00 21411405 100. 00 21411408 05/21/99 AMOUNT SEQUENCE. 58 21412327 71. 00 21412329 AMOUNT SEQUENCE 21411815 70. 00 21412121 40. 00 81. 56 21410732 21412196 17. 08 21412220 100. 00 34. 05 21410744 40. 00 258. 00 60. 00 30. 00 148. 66 29. 95 164. 25 21412269 21412300 21412301 21412310 21412326 200. 00 10. 00 200. 00 81. 79 25. 29 21412699 21412899 21412925 21412934 21412935 21412936 21412937 21412944 21412946 CIKM’ 99, Copyright © 1999, Stephen W. Liddle JOB-001 SP-P 2414 AMOUNT SEQUENCE 25. 32 21412947 115. 47 21412950 11. 89 414. 00 215. 00 312. 50 500. 00 100. 00 20. 00 18. 00 4. 83 21413051 21413192 21413361 21413468 21413469 21413490 21413503 21413506 21413520 PAGE 1 DVC-01 PKT 13 AMOUNT SEQUENCE 50. 00 21413527 136. 80 21413531 24. 99 21413821 YOUR TOWN BANK ( ) FIRST 123. 45 LAST 24. 99 SEP# 00000 47, 452. 45 209 9
Structure Extraction Process Business Report Field Description Lattice Extract Fields Infer Line Types Infer Page Headers/Footers Infer Recursive Group Structure 11/3/99 CIKM’ 99, Copyright © 1999, Stephen W. Liddle Report Structure Decomposition 10
Data Extraction Process Business Report Structure Decomposition Extract Data Populated Database 11/3/99 CIKM’ 99, Copyright © 1999, Stephen W. Liddle 11
Delimitations This study examines only type I reports (i. e. a line in a report pertains to one record) We focus on report structure extraction 11/3/99 CIKM’ 99, Copyright © 1999, Stephen W. Liddle 12
Algorithm 1: Extract Fields Use field extraction lattice to identify basic fields in each line of the report Represent lattice with a total ordering E of regular expressions Any String Time Date HMS HM Julian DMY MDY … 11/3/99 Phone Number ID Code w/AC no AC … Number Page Number … Percent Negative Currency … CIKM’ 99, Copyright © 1999, Stephen W. Liddle 13
Field Extraction Extract fields to form line type vector General Number String Currency 006 9994 10355 General Number JASON MASON CONSTRUCTION INC String bd+(, ddd)*. ? d*(? =(D|$)) ([^ ]+([ ][^ ]+)*) d*(, ddd)*. dd(? =(D|$)) 100, 000. 00 . 06005 03/07/99 Currency Fraction DMY Line Type Vector: General Number (1, 3, "006") General Number (5, 4, "9994") General Number (13, 5, "10355") String (21, 28, "JASON MASON CONSTRUCTION INC") Currency (54, 10, "100, 000. 00") Fraction (68, 6, ". 06005") Day. Month. Year (77, 8, "03/07/99") 11/3/99 CIKM’ 99, Copyright © 1999, Stephen W. Liddle 14
Algorithm 2: Infer Line Types Cluster line types by similarity to form the set B of basic line types for R Use line distances: n First-order distance w Based on character comparison w Identical strings have distance 0 n Second-order distance w Based on field types w Uses field-description lattice for distance 11/3/99 CIKM’ 99, Copyright © 1999, Stephen W. Liddle 15
Infer Basic Line Types Generalize line types when distance is small 006 9994 10355 JASON MASON CONSTRUCTION INC General Number String Right-Aligned Numbers 008 9992 9657 General Number Left-Aligned Strings FANNY M RYEBERG String 100, 000. 00 . 06005 03/07/99 Currency Fraction DMY Similar Date Fields Fully Aligned 300, 000. 56 . 05990 04/22/99 MS Currency Fraction MDY String Optional String Field 11/3/99 CIKM’ 99, Copyright © 1999, Stephen W. Liddle 16
Algorithm 3: Infer Page Headers/Footers Separate report detail from page headers and footers A line is considered detail if n n It repeats in report two or more times in immediate succession, or It repeats more than twice on one page Find the maximal page prefix/suffix of nondetail lines Remove page headers/footers and blank lines 11/3/99 CIKM’ 99, Copyright © 1999, Stephen W. Liddle 17
Page Headers/Footers RUN 05/21/99 12: 34: 56 CUST NBR CD NBR 006 008 009 011 10355 9657 9541 11225 00551 9994 9992 9991 9989 L A R G E C D N A M E R E P O R T ACCR: 04/26/99 POST: 05/21/99 RATE MATURITY OFC . 06005. 05990. 06250 03/07/99 04/22/99 05/16/99 MS MS BALANCE JASON MASON CONSTRUCTION INC FANNY M RYEBERG JOHN SMITH JR BARNEY FIFE 100, 000. 00 300, 000. 56 1, 100, 000. 00 105, 529. 23 PAGE 001 Page Header RUN 05/21/99 CUST NBR 12: 34: 56 CD NBR 00551 L A R G E C D N A M E R E P O R T ACCR: 04/26/99 POST: 05/21/99 RATE MATURITY OFC . 06005. 05990. 06250 06/13/99 06/25/99 06/30/99 MS BALANCE PAGE 002 Page Detail 013 015 016 2349 1012 2344 4389 12334 11221 2899 8983 MOMS FINE COUNTRY COOKING BAKERY AT THE TOWN SQUARE JILL JENKINS JEAN LUC PICARD * * * TOTAL LARGE CD * * * 11/3/99 100, 000. 00 300, 000. 00 75, 000. 00 100, 000. 00 MS 2, 180, 529. 79 CIKM’ 99, Copyright © 1999, Stephen W. Liddle 18
Algorithm 4: Infer Group kw) Structure (uv a a b 006 008 009 011 9994 9992 9991 9989 10355 9657 9541 11225 JASON MASON CONSTRUCTION INC FANNY M RYEBERG JOHN SMITH JR BARNEY FIFE * * * TOTAL LARGE CD * * * = aaaab = a 4 b 11/3/99 100, 000. 00 300, 000. 56 1, 100, 000. 00 105, 529. 23 1, 605, 529. 79 . 06005. 05990. 06250 03/07/99 04/22/99 05/16/99 MS MS u = , v = a , w = b CIKM’ 99, Copyright © 1999, Stephen W. Liddle 19
Inferring Group Structure a b c d e f 9999900—MY TOWN BANK ----------1234567 9001 81196 09 -25 -92 TUBBS, BILLY 63 DEMAND 1234569 9002 09644 03 -22 -93 JONES, MARIA 66 DEMAND PART-TOTALS BRANCH TOTALS 14. 50000 60, 825. 23 12. 75000 30, 079. 41 50, 860. 19 44, 428. 59 OA 14. 50000 62, 814. 48 29, 817. 50 20, 386. 38 OA 12. 75000 30, 079. 41 NUMBER 64, 814. 97 COUNT 2 96, 893. 89 NUMBER. 00 COUNT. 00 14, 9541, 26116, 216. 20. 00 = abcdcdefef u = ab , v = cd , w = substitute g (= ab(cd)+) for uvkw = gefef u = g , v = ef , w = substitute h (= g(ef)+) for uvkw =h regular expression: = g(ef)+ = ab(cd)+(ef)+ 11/3/99 CIKM’ 99, Copyright © 1999, Stephen W. Liddle 20
Experiment The initial field description lattice and empirical constants were constructed using several inputs: n n Our own experience Dozens to hundreds of actual business reports generated by different firms We tuned our process to handle dozens of reports well 11/3/99 CIKM’ 99, Copyright © 1999, Stephen W. Liddle 21
Experimental Results After tuning, we ran against a real set of 76 reports not previously seen n 11/3/99 7 reports were not type I 7 reports were too short to be meaningful Of 62 remaining reports, our process was successful with 40, unsuccessful with 22. CIKM’ 99, Copyright © 1999, Stephen W. Liddle 22
Interpretation of Results Opportunities for improvement: n n 11/3/99 Tuning field-description lattice Optional fields in a line type Optional lines in a uvkw group Clustering algorithm CIKM’ 99, Copyright © 1999, Stephen W. Liddle 23
Future Work Improved handling of complex type I reports Type II reports Long-term goal: n n 11/3/99 Inverted index/compressed representation of business reports SQL front-end with data mining support CIKM’ 99, Copyright © 1999, Stephen W. Liddle 24
Data Extraction Group For papers/tools produced by the BYU Data Extraction Group see: www. deg. byu. edu 11/3/99 CIKM’ 99, Copyright © 1999, Stephen W. Liddle 25
c8d78f41a3c850c6038015bf6b033ed3.ppt