- Number of slides: 56
It’s not the documents; it’s the DATA! Tom Johnson Managing Director Inst. for Analytic Journalism Santa Fe, New Mexico USA tom@jtjohnson.com 1
It’s not the documents, it’s the DATA! Presentation at “2011 Open Government Academy” March 26, 2011 Presented by the New Mexico Foundation for Open Government, New Mexico Press Association and New Mexico Broadcasters Association This PowerPoint deck and Tipsheet posted at: http://johnson-fog.notlong.com Licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. 2
Important point 1: Nothing is as important, and as valuable, as a good theory! 3
Theory of Journalistic Process Data In Analysis Info Out • Data = that which, upon Analysis, yields Information. “Data” has many forms. • Analysis = Examination of data and facts to uncover and understand cause-effect and contextual relationships and patterns, thus providing basis for problem solving and decision making. • Information = that which aids in making decisions 4
Important point 2: The document is not the data. 5
Bertillon system: Public Records DB Early public records • Intricate data collection • Potential for error in data entry • Potential for error in filing • No machine retrieval or analysis • Even today, OCR would be impossible
Bertillon system: Public Records DB By 1910… • Indexing system has improved • Typewriters instead of pen • Better haircuts But still … • Null fields • Subject to data entry errors; lost or misfiled cards/data • Limited large-scale analysis resources
Bertillon system: Public Records DB • Early “hard drives,” data retrieval and data analysis of public records
Bertillon system: Public Records DB • A public record, but one of limited usage • A DOCUMENT, but no efficient, productive, insightful way to FIND the data • A DOCUMENT, but no efficient, productive, insightful way to EXTRACT the data • Sorta like a PDF
Traditional Data In Analysis Info Out • Notes • Text • Numeric • Images • Maps • How? Who? 10
Digital Age Data In • Notes • Text • Numeric • Images • Charts/Graphs • Maps • Audio • Video • Atoms Bits • How? Who? Analysis Info Out • New data is ubiquitous, shareable, scalable. • Retrieval, copying and storage costs are trivial • Can be validated and explored by individuals and applications 11
Digital Age Data In • Notes • Text • Numeric • Images • Charts/Graphs • Maps • Audio • Video • Atoms Bits • How? Who? Analysis Info Out • All data today requires NEW tools for ANALYSIS and STORYTELLING • Statutes are usually adequate; the CULTURES are the challenge. 12
Important point 3: The document is not the data. Without analysis, the data are not the story. 13
Four stories • Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail • Craig Harris: “Arizona pension systems a soaring burden” • Waite: water, developers, land use = disappearing wetlands • UK: Investigate Your MP’s Expenses “We have 458,832 pages of documents. 27,731 of you have reviewed 223,475 of them. Only 235,357 to go” MP’s expense claims on Google spreadsheet 14
Journalism and GIS • Steve Doig [Miami Herald] 1992 Hurricane Andrew + damage reports + building inspection = jail terms 15
Doig: Hurricane Andrew 16
Four stories • Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail • Craig Harris: “Arizona pension systems a soaring burden” 17
Analysis with real data Search Sort DB info 18
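A minimal Python sketch of the "search" and "sort" steps this slide names, using an in-memory SQLite table. The table name, columns and pension rows are invented for illustration; they are not figures from the Arizona pension story.

```python
import sqlite3

# Invented rows for illustration only; not data from the Arizona pension story.
ROWS = [
    ("Retiree A", "Police", 52000),
    ("Retiree B", "Fire", 131000),
    ("Retiree C", "Police", 98000),
    ("Retiree D", "Teachers", 41000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pensions (name TEXT, system TEXT, annual_benefit INTEGER)"
)
conn.executemany("INSERT INTO pensions VALUES (?, ?, ?)", ROWS)

# Search: keep one retirement system. Sort: largest benefit first.
police = conn.execute(
    "SELECT name, annual_benefit FROM pensions "
    "WHERE system = ? ORDER BY annual_benefit DESC",
    ("Police",),
).fetchall()
print(police)
```

Once records sit in a database rather than in documents, each new question is one more query instead of another pass through the paper.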
Four stories • Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail • Craig Harris: “Arizona pension systems a soaring burden” • Waite: water, developers, land use = “Vanishing Wetlands” 19
Vanishing Wetlands 20
Four stories • Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail • Craig Harris: “Arizona pension systems a soaring burden” • Waite: water, developers, land use = disappearing wetlands • UK: Investigate Your MP’s Expenses “We have 458,832 pages of documents. 27,731 of you have reviewed 223,475 of them. Only 235,357 to go” MP’s expense claims on Google spreadsheet • EFF Seeks Cooperating FOIA Reviewers 21
UK MP’s expenses Solid search tools These are PDFs, POST-search 22
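The crowdsourced review counts quoted on the "Four stories" slide can be checked with a few lines of arithmetic. This sketch uses only the three figures from the quote; the variable names are my own.

```python
# Figures as quoted on the "Four stories" slide.
TOTAL_PAGES = 458_832
REVIEWED_PAGES = 223_475
REVIEWERS = 27_731

# Pages still waiting for a volunteer reviewer.
remaining = TOTAL_PAGES - REVIEWED_PAGES

# Share of the document pile already reviewed, in percent.
share_reviewed = REVIEWED_PAGES / TOTAL_PAGES * 100

# Average number of pages each volunteer has reviewed so far.
pages_per_reviewer = REVIEWED_PAGES / REVIEWERS

print(remaining, round(share_reviewed, 1), round(pages_per_reviewer, 1))
```

The remainder matches the "to go" figure in the quote, and the average works out to only a handful of pages per volunteer.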
Major questions? As participants in a liberal democracy… • How do we get the necessary data? • And from where? • And in appropriate forms? 23
Files, Transparency, Ease of Analysis Easier Challenging 24
Files, Transparency, Ease of Analysis 25
Data In: Objectives/Requirements • Move data from “out there” to analytic site/tools • Looking for connections; patterns 26
Data In: Objectives/Requirements • Seeking fine-grained data, NOT aggregations • Seek data in original form (i.e., NO PDFs) • Get data in lowest common denominator format: - Comma-delimited files in ASCII or Text • Who collected the data? Why? How? • Who proofed/edited the data? Why? How? • If from a database, first ask for “record layout” or “code sheet” or “schema” • Definitions of variables or fields. Constant or ??? 27
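A minimal Python sketch of why the slide asks for comma-delimited text plus a record layout: with both in hand, a few lines of standard-library code turn the file into analyzable records. The sample rows, field names and code sheet below are invented for illustration.

```python
import csv
import io

# Hypothetical comma-delimited export; rows and field names are invented.
RAW_CSV = """permit_id,insp_code,year_built,damage_pct
1001,A,1978,12
1002,B,1989,85
1003,B,1991,90
"""

# Hypothetical "code sheet" (record layout) explaining the coded field.
CODE_SHEET = {
    "insp_code": {"A": "Inspector Adams", "B": "Inspector Baker"},
}


def load_records(text, code_sheet):
    """Parse CSV text and decode coded fields using the code sheet."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        for field, codes in code_sheet.items():
            # Unknown codes are left as-is so gaps in the layout stay visible.
            row[field] = codes.get(row[field], row[field])
        # Convert numeric fields from text so they can be sorted and summed.
        row["year_built"] = int(row["year_built"])
        row["damage_pct"] = int(row["damage_pct"])
        records.append(row)
    return records


records = load_records(RAW_CSV, CODE_SHEET)
print(records[1]["insp_code"], records[1]["damage_pct"])
```

Without the code sheet, a value such as `B` in `insp_code` is unreadable, which is why the slide says to ask for the layout first.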
Data In: “Typical” problems with gov sites Barriers to data = barriers to analysis • NO site search capability; no site map • Failure to use open-standard HTML; using closed-standard Adobe Flash/Shockwave environment. • Page formats/layouts not consistent; too many drill-downs instead of search-driven generators • Jiggly roll-overs; too much effort spent on bling • Impossible to download or scrape data for analysis • Information available only in Adobe PDF files; notoriously unfriendly to data analysis. 28
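To illustrate the contrast the slide draws, here is a minimal Python sketch of scraping a table from open-standard HTML with only the standard library. The sample markup and the `TableScraper` class are invented for illustration; a Flash application or an image-only PDF offers no comparable structure to read.

```python
from html.parser import HTMLParser

# Invented sample page fragment; a real agency page would be fetched first.
SAMPLE_HTML = """
<table>
  <tr><th>Agency</th><th>Spending</th></tr>
  <tr><td>Roads</td><td>1200</td></tr>
  <tr><td>Parks</td><td>300</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()


scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
header, *data_rows = scraper.rows
print(header, data_rows)
```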
Search! Good NM sites Español Feedback! 29
NM Legis. Bill Finder Download bill in TWO formats Could be better: no way to find what bills were introduced by X legislator 30
Data In: Challenges • New site in New Mexico: www.sunshineportalnm.com • “Beta,” but a facade for taxpayers; a secondary tax because of its minimal utility; torture for journos 31
Data In: Challenges in SunshinePort • Comprehensive Annual Financial Reports • Possible to machine download, but laborious to format for analysis • Investment Holdings reports are far worse • They are poor-quality static image files, not machine-readable. • Tabular data roughly formatted; makes conversion for analysis an arduous, if not impossible, task. 32
Bottom line on SunshinePortalNM.com “This is not even a web page, it’s a Flash application, so there’s not going to be much sunlight escaping from this portal.” “If the State of New Mexico takes the position that through this site it is discharging all of its disclosure obligations with respect to these particular records, open government is in trouble there.” 33
Bottom line on SunshinePortalNM.com “A perfect example of creating the appearance of transparency without actually being transparent.” 34
Good data sites – Gov and NGO • Data.gov [A beta site] www.data.gov/ • Metrics www.data.gov/metric • DataSF - http://datasf.org/ a clearinghouse of datasets available from the City & County of San Francisco • San Francisco Enterprise GIS Program http://gispub02.sfgov.org/data.asp • Maplight.com – an example of how citizens can use data. Nonprofit, nonpartisan research organization, provides citizens and journalists the transparency tools to shine a light on the influence of money on politics. • Prize-winning gov’t agency web sites: http://www.centerdigitalgov.com/survey/88/2010 35
Common aspects? • All have up-front search capabilities • All are written in “data-accessible” code • All data can be downloaded with “relative” ease • Some have various languages available • ALL are run by GOVERNMENT; no commercial sites 36
Challenge for Watchdogs? Failure on the part of planners/bureaucrats to simply… • Give The People THEIR Data… • In The Most Basic, Original, Straightforward Form… • And Let Them Figure Out What Should Be Done With It! • The governor agrees 37
Tomorrow? Why not? Public Access to Original Data Impact 38
It’s not the documents, it’s the DATA! Gracias a todos Tom Johnson Managing Director Inst. for Analytic Journalism Santa Fe, New Mexico USA tom@jtjohnson.com 39
It’s not the documents, it’s the DATA! Presentation at “2011 Open Government Academy” March 26, 2011 Presented by the New Mexico Foundation for Open Government, New Mexico Press Association and New Mexico Broadcasters Association This PowerPoint deck and Tipsheet posted at: http://johnson-fog.notlong.com 40
FOI history • The world’s first freedom of information legislation was adopted by the Swedish parliament in 1766. This publication includes the English translation of this ordinance on freedom of writing and the press. • The Enlightenment thinker and politician Anders Chydenius (1729-1803), from the Finnish city of Kokkola, played a crucial role in creating the new law. • As Professor Juha Manninen describes in his article, the key achievements of the 1766 Act were the abolishment of political censorship and the gaining of public access to government documents. Although the innovation was suspended from 1772-1809, the principle of publicity has since remained central in the Nordic countries. • http://www.scribd.com/doc/5885744/The-Worlds-First-Freedom-of-Information. Act-Sweden. Finland-1766 41
Early police database: incomplete data Source: Jay, Ricky. “Grifters, Bunco Artists & Flimflammen.” Wired, Feb. 2011, p. 88. http://rickyjay.com/ 42
NM HB 406 • “…information contained in information systems databases created or maintained by or on behalf of a public body … shall be subject to disclosure to any person requesting the information in the format requested. • “The information shall be provided in the most effective and efficient manner available to the custodian, as defined in the Inspection of Public Records Act. • B. The custodian may charge a reasonable fee for production of the information requested. The fee shall not exceed the cost of the materials and reasonable charges for the personnel required to retrieve and provide the information. But what if it wasn’t New Mexico state employees directly at fault? 43
Analytic Tools Text • ThemeRiver - http://infoviz.pnl.gov/research_themeriver.stm 44
“Analytic tools” also for story-telling • Spreadsheets: • Tables, charts, infographics • Database programs • Charts, graphs, data tables • Stats programs (SPSS or SAS or R) • Generate graphics • Social network analytic graphics • GIS 45
FOIA b(3) Exemptions Original: http://www.propublica.org/article/foia-exemptions-sunshine-law 46
Content Analysis 47
Content analysis of legis party text 48
“Data In” questions Data In Analysis Info Out Data In: • Notes • Text • Numeric • Images • Charts/Graphs • Maps • Audio • Video Questions: #1 – Keep a logbook (Try using Notesync.com) • Qualitative and/or Quantitative? • Objective: strive to get the data in the most fine-grained and original form. • Online data is rarely complete or totally accurate • Where is the data? In what format? I-o-P? Original digital file type(s)? 51
“Data In” questions Data In Analysis Info Out Data In: • Notes • Text • Numeric • Images • Charts/Graphs • Maps • Audio • Video Questions: #1 – Keep a logbook (Try using Notesync.com) • Who created the data? Why? How? Legal catalysts for creation? If so, what do they say? • Have definitions and collection process changed? • Who could review and edit the data? What was/is the vetting process to ensure accuracy? • Who has analyzed the data? For what purpose and with what methods? 52
Data In Analysis Info Out 53
“Analysis” phase Data In Analysis Info Out Data In: • Notes • Text • Numeric • Images • Charts/Graphs • Maps • Audio • Video • Atoms Bits • How? Who? Analysis: • What are we looking for? How can we be surprised? • Previous/parallel investigations? (Start with IRE site stories and tipsheets) • Context, i.e. past environment(s) and changes? Trends past and future? • Quantitative and Qualitative methods? • Data cleaning tools? 54
“Analysis” phase Data In Analysis Info Out Data In: • Notes • Text • Numeric • Images • Charts/Graphs • Maps • Audio • Video • Atoms Bits • How? Who? Analysis: • Measurement of phenomena • Strength of relationships • Change • Estimating • Counting • Statistical • Geostatistical • Social Network Analysis • Forensic accounting • Who’s your rabbi? 55
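Three of the measurements this slide lists (counting, change, and strength of relationships) can be sketched in a few lines of standard-library Python. All figures below are invented for illustration, and Pearson's r is my choice of one common measure of relationship strength; the slide does not name a specific statistic.

```python
from collections import Counter
from math import sqrt

# Invented figures for illustration only.
complaints_by_agency = ["Roads", "Water", "Roads", "Parks", "Roads", "Water"]
budget = [10.0, 12.0, 15.0, 19.0]   # hypothetical, $ millions per year
overtime = [1.0, 1.3, 1.4, 2.0]     # hypothetical, $ millions per year


def percent_change(old, new):
    """Change from old to new, as a percentage of old."""
    return (new - old) / old * 100.0


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


counts = Counter(complaints_by_agency)            # counting
growth = percent_change(budget[0], budget[-1])    # change
r = pearson_r(budget, overtime)                   # strength of relationship
print(counts.most_common(1), round(growth, 1), round(r, 2))
```

A high r here says only that the two invented series move together; it says nothing about cause, which is where the reporting starts.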
Data In Analysis Info Out Data In: • Notes • Text • Numeric • Images • Charts/Graphs • Maps • Audio • Video • Atoms Bits • How? Who? Analysis: • What are we looking for? How can we be surprised? • Source • Definition • Context • Estimating • Counting • Statistical • Geostatistical • Social Network Analysis • Forensic accounting 56
Data In Analysis Info Out Data In: • Notes • Text • Numeric • Images • Charts/Graphs • Maps • Audio • Video • Atoms Bits • How? Analysis: • What are we looking for? How can we be surprised? • Source • Definition • Context • Estimating • Counting • Statistical • Geostatistical • Social Network Analysis • Forensic accounting Info Out: • Broadcast • Web • Audio • Video • Text • Data visualization • Maps • Dynamic databases • Archives 57
Theory of Journalistic Process Data In • • • Interviews Text docs Clips Pictures Infographics Copyright © J. T. Johnson Analysis Info Out This is a headline DATELINE -- And the traditional text story starts here and goes on and on. 58


