- Number of slides: 15
DATA SCIENCE IN EDUCATION AND FOR DISCOVERY
Kirk D. Borne
School of Physics, Astronomy, & Computational Sciences
George Mason University
kborne@gmu.edu
http://classweb.gmu.edu/kborne/
Abstract
I will discuss the rise of data science as a new academic and research discipline. Data-intensive opportunities are growing significantly across the spectrum of academic, government, and business enterprises. In order to respond to this data-driven digital transformation, it is imperative to train the next-generation workforce in the data-science skill areas. Among these skills are knowledge discovery and information extraction from massive data collections. I will describe some of the techniques that we are applying both in research (for scientific discovery) and in the classroom (to engage students in inquiry-driven evidence-based learning). Specific examples of surprise detection in big data will be presented.
Ever since humans began to explore the world…
… humans have asked questions and have collected evidence (data) to help answer those questions.
Astronomy: the world’s second oldest profession!
Now, the Data Flood is everywhere
• Huge quantities of data are being generated, collected, and stored within all business, government, research, and personal domains.
• Two significant challenges of this Data Flood will be addressed:
  • Training the next-generation workforce to manage and expertly use these data
    • “The Rise of the Data Scientist”
  • Discovering the knowledge and surprises that are hidden within the data
    • Transforming our repositories from a data representation to a knowledge representation
• So how do we address these challenges?
• First, we must face it – i.e., the students that we train as well as knowledge workers (those who extract knowledge from data and information) must recognize the need and face the challenge …
Visualize This: A sea of Data (sea of CDs)
This is the CD Sea in Kilmington, England (600,000 CDs ~ 300 TB)
More Data is Different
• The message should be clear: “more data is not simply more data, but more data is different.”
• Numerous federal agencies (and others, of course) have addressed this, including the August 9, 2010 announcement from the OMB and White House OSTP:
  • Big Data is a national challenge and a national priority, along with healthcare and national security.
  • See http://www.aip.org/fyi (#87)
• International initiative by the CODATA organization to address this challenge: ADMIRE = Advanced Data Methods and Information technologies for Research and Education
• Many U.S. national study groups in the sciences have issued reports on the urgency of establishing both research and educational programs to face the Big Data challenges.
• Each of these reports has issued a call to action …
Data Science: A National Imperative
1. National Academies report: Bits of Power: Issues in Global Access to Scientific Data (1997), downloaded from http://www.nap.edu/catalog.php?record_id=5504
2. NSF (National Science Foundation) report: Knowledge Lost in Information: Research Directions for Digital Libraries (2003), downloaded from http://www.sis.pitt.edu/~dlwkshop/report.pdf
3. NSF report: Cyberinfrastructure for Environmental Research and Education (2003), downloaded from http://www.ncar.ucar.edu/cyberreport.pdf
4. NSB (National Science Board) report: Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century (2005), downloaded from http://www.nsf.gov/nsb/documents/2005/LLDDC_report.pdf
5. NSF report with the Computing Research Association: Cyberinfrastructure for Education and Learning for the Future: A Vision and Research Agenda (2005), downloaded from http://www.cra.org/reports/cyberinfrastructure.pdf
6. NSF Atkins Report: Revolutionizing Science & Engineering Through Cyberinfrastructure: Report of the NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure (2005), downloaded from http://www.nsf.gov/od/oci/reports/atkins.pdf
7. NSF report: The Role of Academic Libraries in the Digital Data Universe (2006), downloaded from http://www.arl.org/bm~doc/digdatarpt.pdf
8. National Research Council, National Academies Press report: Learning to Think Spatially (2006), downloaded from http://www.nap.edu/catalog.php?record_id=11019
9. NSF report: Cyberinfrastructure Vision for 21st Century Discovery (2007), downloaded from http://www.nsf.gov/od/oci/ci_v5.pdf
10. JISC/NSF Workshop report on Data-Driven Science & Repositories (2007), http://www.sis.pitt.edu/~repwkshop/NSFJISC-report.pdf
11. DOE report: Visualization and Knowledge Discovery: Report from the DOE/ASCR Workshop on Visual Analysis and Data Exploration at Extreme Scale (2007), downloaded from http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/DOE-Visualization-Report-2007.pdf
12. DOE report: Mathematics for Analysis of Petascale Data Workshop Report (2008), downloaded from http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/PetascaleDataWorkshopReport.pdf
13. NSTC Interagency Working Group on Digital Data report: Harnessing the Power of Digital Data for Science and Society (2009), downloaded from http://www.nitrd.gov/about/Harnessing_Power_Web.pdf
Data Science Education: Two Perspectives
• Informatics in Education – working with data in all learning settings
  • Informatics (Data Science) enables transparent reuse and analysis of data in inquiry-based classroom learning.
  • Learning is enhanced when students work with real data and information (especially online data) that are related to the topic (any topic) being studied.
  • http://serc.carleton.edu/usingdata/ (“Using Data in the Classroom”)
  • Example: CSI The Cosmos
• An Education in Informatics – students are specifically trained:
  • … to access large distributed data repositories
  • … to conduct meaningful inquiries into the data
  • … to mine, visualize, and analyze the data
  • … to make objective data-driven inferences, discoveries, and decisions
• Data Science programs now exist at several universities (GMU, Caltech, RPI, Michigan, Cornell, U. Illinois, and more)
  • http://cds.gmu.edu/ (Computational & Data Sciences @ GMU)
Data Science Education Goal
• Primary Goal: to increase students’ understanding of the role that data and information play across all disciplines, and to increase students’ ability to use the technologies and methodologies associated with data acquisition, management, search, mining, analysis, and visualization.
• Secondary goals:
  • To increase students’ abilities to use databases for inquiry
  • To increase students’ abilities to acquire, process, and explore data with the use of a computer
  • To increase students’ confidence and comfort in using data to address real-world problems (in their chosen scientific discipline, or in any endeavor)
  • To increase students’ awareness of ethical issues pertaining to data and information, including privacy, ownership, proper attribution, misuse and abuse of statistics and graphs, data falsification, and objective reasoning from data
  • To demonstrate and to share the joy of discovery from data
Knowledge Discovery from Data: Many names
• Data Mining
• Machine Learning (ML)
• Exploratory Data Analysis (EDA)
• Intelligent Data Analysis (IDA)
• Data Analytics
• Predictive Analytics
• Discovery Informatics
• On-Line Analytical Processing (OLAP)
• Business Intelligence (BI)
• Business Analytics
• Customer Relationship Management (CRM)
• Target Marketing
• Cross-Selling
• Market Basket Analysis
• Credit Scoring
• Case-Based Reasoning (CBR)
• Connecting the Dots
• Intrusion Detection Systems (IDS)
• Recommendation / Personalization Systems!
Data-driven Discovery (Unsupervised Learning)
• Class Discovery – Clustering
  • Distinguish different classes of behavior or different types of objects
  • Find new classes of behavior or new types of objects
  • Describe a large data collection by a small number of condensed representations
• Principal Component Analysis – Dimension Reduction
  • Find the dominant features among all of the data attributes
  • Enables low-dimensional descriptions of events and behaviors, while revealing correlations and dependencies among parameters
  • Addresses the Curse of Dimensionality
• Outlier Detection – Surprise / Anomaly / Novelty Discovery
  • Find objects and events that are outside the bounds of our expectations
  • These could be garbage (erroneous measurements) or true discoveries
  • Used for data quality assurance and/or for discovery of new / rare / interesting data items
• Link Analysis – Association Analysis – Network Analysis
  • Identify connections between different events (or objects)
  • Find unusual (improbable) co-occurring combinations of data attribute values
  • Find data items that have far fewer than “6 degrees of separation”
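Illustration: the outlier-detection idea above ("find objects and events that are outside the bounds of our expectations") can be sketched in a few lines of Python. This is only one simple approach, not necessarily the method used in the talk: it scores each value by a modified z-score built from the median and the median absolute deviation (MAD). The function name `mad_outliers`, the 3.5 cutoff, and the sample readings are all assumptions made for the example.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds threshold.

    The 3.5 cutoff is a commonly quoted rule of thumb, used here as an assumption.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median, so this score cannot rank anything.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the bulk of the data is roughly normal.
    return [
        (i, v)
        for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical measurements: nine ordinary readings and one surprise.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 42.0, 10.1, 9.7, 10.0]
surprises = mad_outliers(readings)
print(surprises)
```

A flagged value is only a candidate: as the slide notes, it may be garbage (an erroneous measurement) or a true discovery, and telling the two apart takes follow-up.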
Addressing the D2K (Data-to-Knowledge) Challenge
• Complete end-to-end application of Informatics:
  • Data management, metadata management, data search, information extraction, data mining, knowledge discovery
  • All steps are necessary – skilled workforce needed to take data to knowledge
  • Applies to any discipline (not just science)
Characterize First, then Classify
• The Scientific Method does not begin with “hypothesis formulation.”
• Neither should any reasoning process jump to conclusions.
• We should teach by example: follow an evidence-based “forensics” approach.
• “Big Data” provide an excellent framework and environment for this.
• Including Data Science in our education programs as well as in our own business practice should lead to informed, objective, data-driven decision-making.
• Isn’t this what we expect from all of our citizens?
• Example from scientific method:
  • Step 1: Data Collection – observe, describe, characterize
  • Step 2: Hypothesis Formulation – classify, diagnose, predict
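Illustration: one possible reading of the two steps above, as a minimal Python sketch. Step 1 characterizes the collected observations by summarizing each observed group with its mean feature vector. Step 2 then classifies a new observation by the nearest summary. The nearest-centroid rule, the function names, the labels "compact" and "extended", and the numbers are all hypothetical choices for the example, not taken from the talk.

```python
from statistics import mean


def characterize(observations):
    """Step 1: describe each observed group by its mean feature vector."""
    by_label = {}
    for features, label in observations:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def classify(features, centroids):
    """Step 2: pick the label whose centroid is nearest (squared Euclidean)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: dist2(features, centroids[label]))


# Hypothetical observations: (feature vector, label recorded by the observer).
observed = [
    ((1.0, 1.2), "compact"),
    ((0.8, 1.0), "compact"),
    ((1.2, 0.8), "compact"),
    ((4.0, 4.2), "extended"),
    ((4.4, 3.8), "extended"),
    ((3.6, 4.0), "extended"),
]

centroids = characterize(observed)            # characterize first ...
prediction = classify((3.9, 4.1), centroids)  # ... then classify
print(prediction)
```

The classifier here is built entirely from the characterization step, so nothing is predicted before the evidence has been described.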
Summary
• All enterprises are being inundated with data.
• The knowledge discovery potential from these data is enormous.
• Now is the time to implement data-oriented methodologies (Informatics / Data Science) into the enterprise.
• This is especially important in training and degree programs – training the next-generation workforce to use data for knowledge discovery and decision support.
• We have before us a grand opportunity to establish dialogue and information-sharing across diverse data-intensive research and application communities.
• DATA SUMMIT 2011 has been a fantastic realization of that opportunity.