- Number of slides: 34
NASA AISRP Mtg
NASA Ames, Moffett Field, CA
April 4 – 6, 2005
Distributed Data Mining Research in the NASA Intelligent Systems Program
Kirk D. Borne
School of Computational Sciences, George Mason University
Fairfax, Virginia
kborne@gmu.edu
4/5/05

Outline
• What is ISP?
• Short Descriptions
• Space Science Project
• Earth Science Project

What is (was) ISP?
• NASA Code R / Ames ISP (Intelligent Systems Program) had 2 components:
  – Open competition (low TRL, Technology Readiness Level = pure research)
  – NASA Center-to-Center Mission Infusion Tasks (mid-TRL = development and application)
• Mission Infusion Teams were tasked with taking the low-TRL technology to mid-TRL by infusing the technology into a mission, project, or existing program.

Funded ISP Mission Infusion Projects
• Two funded projects (both are now completed):
  – Distributed Data Mining in the NVO*
    • PI: K. Borne, GMU
    • Co-I: Cynthia Cheung, NASA
    • Collaborator: Hillol Kargupta, UMBC
    • Space Science infusion task
  – Automated Wildfire Detection (and Prediction) through Artificial Neural Networks
    • PI: Jerry Miller, NASA
    • Co-I: K. Borne, GMU
    • Collaborators: NOAA-NESDIS staff
    • Earth Science infusion task
  – Projects were funded by the IDU (Intelligent Data Understanding) component of ISP
*NVO = National Virtual Observatory

AISRP Projects – still more data mining
• Machine Learning and Data Mining for Automatic Detection and Interpretation of Solar Events
  – Develop an automatic system for CME detection, tracking, characterization, and source region location
  – Discover specific associations between solar events (CME) and Earth events (Space Weather)
  – PI: Art Poland, GMU
  – Co-Is: Jie Zhang, K. Borne, Harry Wechsler
• Novel Approaches to Semi-supervised Data Exploration
  – Develop an efficient and effective automated system for astronomical object classification, with an emphasis on star-galaxy discrimination and morphological galaxy classification
  – PI: David Bazell, Eureka
  – Co-Is: K. Borne (GMU), David Miller (Penn St.)

Short Descriptions of ISP Projects
• Distributed Data Mining (DDM) in the NVO
  – Search for examples of interacting/colliding/merging galaxies across multiple distributed databases
  – Apply Distributed Learning
  – Apply Distributed Classification
  – Use DDM algorithms being developed by UMBC group
  – Apply algorithms within NVO data environment
• Automated Wildfire Detection (and Prediction) through Artificial Neural Networks (ANN)
  – Identify all wildfires in Earth-observing satellite images
  – Train ANN to mimic human analysts’ classifications
  – Apply ANN to new data (from 3 remote-sensing satellites: GOES, AVHRR, MODIS)
  – Extend NOAA fire product from USA to the whole Earth

Searching, retrieving, mining, integrating, and analyzing geographically distributed data repositories is one of the major challenges in data mining today.

Why so many telescopes and databases?
… Because …
Many great astronomical discoveries have come from inter-comparisons of various wavelengths:
- Quasars
- Gamma-ray bursts
- Ultraluminous IR galaxies
- X-ray black-hole binaries
- Radio galaxies
- ...

Distributed Data Mining – 2 perspectives:
  – robust statistical analysis of “typical” events
  – automated search for “rare” events
Figure: The clustering of data clouds (dc#) within a multidimensional parameter space (p#). Such a mapping can be used to search for and identify clusters, voids, outliers, one-of-a-kind objects, relationships, and associations among arbitrary parameters in a database (or among various parameters in geographically distributed databases).
Credit: S. G. Djorgovski

Space Science Project: Distributed Data Mining in the NVO
  – A case study to find colliding, interacting, and merging galaxies among the IR-luminous galaxy population
  – Examine several distributed databases (HST, 2MASS, Sloan, FIRST, IRAS)
  – Solve a particular science problem (an NVO science scenario)
Figure label: NASA Information Power Grid (IPG)

NVO Science Cases & Drivers (from Aspen 2001 NVO Workshop)
• Solar System: NEOs, Long-Period Comets, TNOs, Killer Asteroids!!!
• The Digital Galaxy: Find star streams and populations, relics of past/present assembly phase. Identify components of disk, thick disk, bulge, halo, arms, ??
• The Low-Surface Brightness Universe: spatial filtering, multi-wavelength searches, intersection of the image and catalog domains
• Panchromatic Census of AGN (Active Galactic Nuclei): Complete sample of the AGN zoo, their emission mechanisms, and their environments
• Precision Cosmology & Large-Scale Structure: **Hierarchical Assembly History of Galaxies and Structure**, Cosmological Parameters, Dark Matter and Galaxy Biasing as f(z)
• Precision science of any kind that depends on very large sample sizes
• "Survey Science Deluxe"
• Search for rare and exotic objects (e.g., high-z QSOs, high-z SNe, L/T dwarfs)
• Serendipity: Explore new domains of parameter space (e.g., time domain, or "color-color space" of all kinds)
**This is the scientific goal of the ISP-funded project described here.

Colliding and Merging Galaxies: Building Blocks of the Universe

Ultra-Luminous Infrared Galaxies (ULIRGs) and other IR-Luminous Galaxies (LIRGs): nearly 100% are involved in collisions and mergers
ULIRGs: the most luminous galaxies in the Universe

Merger Tree - Galaxy Merger Family History
Figure: merger tree with a time axis labeled Past and Present.
The goal of this study is to identify collision and merger remnant candidates at increasing redshift, in order to measure the galaxy hierarchical mass assembly rate as a function of cosmic epoch.

Distributed Data Mining in the NVO
  – 1. Identify classes of galaxies among several large photometric catalogs (e.g., 2MASS, Sloan DSS, FIRST, NVSS): the galaxy class is either normal or IR-luminous (the latter being indicative of collision/merger activity)
  – 2. Identify all known examples of ULIRGs:
    • linked to Starburst Galaxies, Gamma-Ray Bursts, Quasars, Hierarchical Galaxy Assembly, etc.
  – 3. Learn new properties of ULIRGs (e.g., through Association Rule Mining) by examining multiple distributed databases.
  – 4. Build a classifier from these rules.
  – 5. Find new cases of ULIRGs in the distributed databases.
  – 6. Results will contribute to understanding of many classes of astronomical phenomena.
  – 7. Techniques will be applicable to NVO, LWS, other VxOs, JWST science program, ..., and E/PO projects (e.g., mining Kepler mission catalog by students; or VO@Home)

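Step 3 mentions Association Rule Mining. A minimal sketch of how the support and confidence of one candidate rule could be scored is given here. The attribute names and the five records are made up for illustration (they are not from the project's catalogs), and `rule_stats` is a hypothetical helper, not part of the UMBC DDM algorithms.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    Each record is a set of attribute flags for one object.
    """
    n = len(records)
    if n == 0:
        raise ValueError("no records to score")
    # Records in which every antecedent attribute is present
    with_antecedent = [r for r in records if antecedent <= r]
    # Of those, records in which every consequent attribute is also present
    with_both = [r for r in with_antecedent if consequent <= r]
    support = len(with_both) / n
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Toy, made-up records: one set of flags per galaxy (illustration only)
records = [
    {"ir_luminous", "radio_source", "disturbed_morphology"},
    {"ir_luminous", "disturbed_morphology"},
    {"ir_luminous"},
    {"radio_source"},
    {"disturbed_morphology"},
]

support, confidence = rule_stats(
    records, {"ir_luminous"}, {"disturbed_morphology"}
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Rules that score well on both measures would be the natural candidates to feed into the classifier of step 4.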
An example of clustering in a 3-dimensional color-color parameter space using data from two different (distributed) astronomical databases. In this case, the 3 colors are pairings of 2MASS near-IR and Sloan optical magnitudes.
Plot provided by H. Kargupta (UMBC)

Science Result
• Successfully completed one true proof-of-concept science case within a small subset of the IRAS, HST, and FIRST databases.
• We re-discovered exactly the type of object that we are hoping to find automatically with our data mining tools:
  – We found a very distant hyper-luminous infrared galaxy, one of the brightest galaxies in the known Universe.
• This particular galaxy was previously known (catalogued), but we re-discovered it serendipitously.
• References:
  – K. Borne, "Distributed Data Mining in the National Virtual Observatory", SPIE Data Mining & Knowledge Discovery V, vol. 5098, p. 211 (2003).
  – K. Borne, "A National Virtual Observatory (NVO) Science Case: Properties of Very Luminous IR Galaxies (VLIRGs)", in "The Emergence of Cosmic Structure", p. 307 (2003).
Figure: IRAS F12509+3122, Redshift = 0.780

Additional application areas of ISP-funded NVO data mining project
• Application of XML to distributed data mining:
  – ADQL (Astronomical Data Query Language)
  – XMLA (XML for Analysis)
  – PMML (Predictive Modeling Markup Language)
• Application of different data mining techniques:
  – Bayes classification
  – Neural nets
  – Decision trees
  – Association rule mining
  – Genetic Algorithms for rapid data modeling
  – Supervised and Unsupervised Learning algorithms for robust classification
• Application of Beowulfs to parallel high-performance data mining
• Application to new mission data sets: GALEX, Spitzer, WISE, JWST, LWS, Sensor Webs, Constellations (distributed Sciencecraft)

NASA Intelligent Systems (IS) Project
Intelligent Data Understanding (IDU)
Earth Science Project
Automated Wildfire Detection Through Artificial Neural Networks
Jerry Miller (P.I.), NASA, GSFC
Dr. Kirk Borne (Co-I), GMU
Dr. Brian Thomas, University of Maryland
Dr. Zhenping Huang, University of Maryland
Yuechen Chi, GMU
Donna McNamara, NOAA-NESDIS, Camp Springs, MD
George Serafino, NOAA-NESDIS, Camp Springs, MD

NOAA’S HAZARD MAPPING SYSTEM
NOAA’s Hazard Mapping System (HMS) is an interactive processing system that allows trained satellite analysts to manually integrate data from 3 automated fire detection algorithms corresponding to the GOES, AVHRR, and MODIS sensors. The result is a quality-controlled fire product in graphic (Fig 1), ASCII (Table 1), and GIS formats for the continental US.
Figure – Hazard Mapping System (HMS) Graphic Fire Product for day 5/19/2003

OVERALL TASK OBJECTIVES
To mimic the NOAA-NESDIS Fire Analysts’ subjective decision-making and fire detection algorithms with a Neural Network in order to:
• remove subjectivity in results
• improve automation & consistency
• allow NESDIS to expand coverage globally

Hazard Mapping System (HMS) ASCII Fire Product

OLD FORMAT
Lon, Lat
-80.531, 25.351
-81.461, 29.072
-83.388, 30.360
-95.004, 30.949
-93.579, 30.459
-108.264, 27.116
-108.195, 28.151
-108.551, 28.413
-108.574, 28.441
-105.987, 26.549
-106.328, 26.291
-106.762, 26.152
-106.488, 26.006
-106.516, 25.828

NEW FORMAT (as of May 16, 2003)
Lon, Lat, Time, Satellite, Method of Detection
Lon column: -80.597, -79.648, -81.048, -83.037, -85.767, -84.465, -84.481, -84.521, -84.557, -84.561, -89.433, -89.750
Lat, Time, Satellite, Method of Detection columns:
22.932, 1830, MODIS AQUA, MODIS
34.913, 1829, MODIS, ANALYSIS
33.195, 1829, MODIS, ANALYSIS
36.219, 1829, MODIS, ANALYSIS
49.517, 1805, AVHRR NOAA-16, FIMMA
48.926, 2130, GOES-WEST, ABBA
48.888, 2230, GOES-WEST, ABBA
48.864, 2030, GOES-WEST, ABBA
48.891, 1835, MODIS AQUA, MODIS
48.881, 1655, MODIS TERRA, MODIS
48.881, 1835, MODIS AQUA, MODIS
36.827, 1700, MODIS TERRA, MODIS
36.198, 1845, GOES, ANALYSIS

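A minimal sketch of how records in the new five-field format could be read into typed objects. The names `FireDetection` and `parse_hms_new_format` are hypothetical. The two sample records are illustrative only: they pair the first longitudes with the first remaining fields from the slide, and that pairing is an assumption.

```python
import csv
from dataclasses import dataclass
from io import StringIO


@dataclass
class FireDetection:
    lon: float       # longitude in degrees
    lat: float       # latitude in degrees
    time: str        # HHMM, kept as written in the product
    satellite: str   # e.g. "MODIS AQUA", "GOES-WEST"
    method: str      # e.g. "MODIS", "FIMMA", "ABBA", "ANALYSIS"


def parse_hms_new_format(text):
    """Parse 'Lon, Lat, Time, Satellite, Method of Detection' records."""
    detections = []
    reader = csv.reader(StringIO(text), skipinitialspace=True)
    for row in reader:
        fields = [f.strip() for f in row]
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        try:
            lon, lat = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # skip the header line or non-numeric coordinates
        detections.append(
            FireDetection(lon, lat, fields[2], fields[3], fields[4])
        )
    return detections


# Illustrative records only: the Lon pairing with the other fields is assumed.
sample = """Lon, Lat, Time, Satellite, Method of Detection
-80.597, 22.932, 1830, MODIS AQUA, MODIS
-79.648, 34.913, 1829, MODIS, ANALYSIS
"""

detections = parse_hms_new_format(sample)
print(len(detections), detections[0].satellite)
```

Keeping the time as a string avoids guessing at a time zone or date, neither of which the record itself carries.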
GOES CH 2 (3.78 - 4.03 μm) – Northern Florida Fire
2003: Day 126, -82.10 Deg West Longitude, 30.49 Deg North Latitude
File: florida_ch2.png

NOAA-NESDIS FIRE DETECTION SYSTEM
Figure: data-flow diagram. Three sensor streams feed the fire analysts and the Hazard Mapping System (HMS), which produces the daily NOAA fire product (automated algorithms and manual additions). The diagram also marks a NASA tap-off point.
• GOES East/West Imager (NOAA S/C, 5 channels, GVAR format, 10-bit): MCIDAS (COTS); WF-ABBA fire detection on CH’s 1, 2, 4 (0.62, 3.9, 10.7 μm)
• AVHRR on NOAA 14-17 (5 channels, HRPT format, 10-bit): TERASCAN (COTS), geo-correction; FIMMA fire detection on CH’s 2, 3b, 4, 5 (0.91, 3.7, 10.8, 12 μm)
• MODIS on Terra/Aqua (NASA S/C, 36 channels, HDF format, 12-bit): ENVI, bow-tie effect removal, MCIDAS (COTS); MODIS MOD14 fire product on CH’s 2, 22, 31 (0.86, 3.9, 11 μm)
• Imagery for the analysts (8-bit words, LCC): GOES CH’s 1, 2, 4 (0.62, 3.9, 10.7 μm); AVHRR CH’s 1, 2, 3b (0.63, 0.91, 3.7 μm); MODIS CH’s 1, 2, 22 (0.66, 0.86, 3.96 μm)
WF-ABBA = Wildfire Automated Biomass Burning Algorithm
FIMMA = Fire Identification, Mapping and Monitoring Algorithm
LCC = Lambert Conformal Conic Projection
MCIDAS = Man Computer Interactive Data Access System

SIMPLIFIED DATA EXTRACTION PROCEDURE
Figure: flowchart with the following boxes:
• Daily HMS ASCII Fire Product
• DATA: GOES (96 files/day), AVHRR (25 files/day), MODIS (14 files/day)
• Geographic Coords (lat/lon)
• ENVI Function Call, Image Ref’s
• Conversion to Image Coords (row/col)
• Spectral Data
• Filter Out Bad Data Points
• Neural Network Training Set

DECISION REGIONS AND BOUNDARIES FOR HIGHLY IDEAL SCATTER PLOT CLUSTERING PATTERNS
Figure: two scatter-plot panels with axes X1 and X2.
• Single Fire Signature
• Multiple Fire Signatures: Surface Fire, Crown Fire, Ground Fire
• Background

Scatter Plot of Background-Subtracted GOES CH 1 vs. CH 2
Fire (lower) and non-fire (upper) separation of clusters
2003: June 2, Northern Florida
File: scatter_fires12.png
(GOES CH 1, CH 2, CH 4 are input to neural network)

Scatter Plot of Background-Subtracted GOES CH 2 vs. CH 4
Fire (left) and non-fire (right) separation of clusters
2003: June 2, Northern Florida
File: scatter_fires22.png
(GOES CH 1, CH 2, CH 4 are input to neural network)

Neural Network Configuration
Figure: three-layer network joined by connections (weights).
• Input Layer 0: Band A inputs 1 - 49, Band B inputs 50 - 98, Band C inputs 99 - 147
• Hidden Layer 1
• Output Layer 2: output classification

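A minimal sketch of a forward pass through a network of this shape: 147 inputs (49 per band), one hidden layer, one output. The hidden-layer size, the sigmoid activation, and the random weights are assumptions made for illustration; the slide does not specify them, and this is not the project's trained network.

```python
import math
import random

N_PER_BAND = 49                    # inputs per band, as on the slide
N_BANDS = 3                        # Band A, Band B, Band C
N_INPUTS = N_PER_BAND * N_BANDS    # 147 inputs in total
N_HIDDEN = 10                      # assumption: hidden size is not given


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_weights(n_out, n_in, rng):
    # One row per output unit: n_in weights followed by one bias term
    return [
        [rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
        for _ in range(n_out)
    ]


def layer(weights, inputs):
    # Weighted sum plus bias for each unit, passed through the sigmoid
    return [
        sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
        for row in weights
    ]


def forward(band_a, band_b, band_c, w_hidden, w_out):
    """Return a fire score in (0, 1) for one 147-value input vector."""
    x = list(band_a) + list(band_b) + list(band_c)
    if len(x) != N_INPUTS:
        raise ValueError(f"expected {N_INPUTS} inputs, got {len(x)}")
    hidden = layer(w_hidden, x)
    return layer(w_out, hidden)[0]


rng = random.Random(0)             # fixed seed so the sketch is repeatable
w_hidden = init_weights(N_HIDDEN, N_INPUTS, rng)
w_out = init_weights(1, N_HIDDEN, rng)

# Placeholder band values, not real radiances
score = forward([0.5] * 49, [0.2] * 49, [0.1] * 49, w_hidden, w_out)
print(0.0 < score < 1.0)
```

In practice the weights would come from training against the analysts' classifications, and the score would be thresholded to give the fire / non-fire output.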
Typical Error Matrix (for MODIS instrument)

                                  TRAINING DATA
Neural Network Classification     Fire         NonFire      Totals
Fire                              2834 (TP)    173 (FP)     3007
NonFire                           318 (FN)     3103 (TN)    3421
Totals                            3152         3276         6428

TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative

Typical Measures of Accuracy
• Overall Accuracy = (TP+TN)/(TP+TN+FP+FN)
• Producer’s Accuracy (fire) = TP/(TP+FN)
• Producer’s Accuracy (nonfire) = TN/(FP+TN)
• User’s Accuracy (fire) = TP/(TP+FP)
• User’s Accuracy (nonfire) = TN/(TN+FN)

Accuracy of our NN Classification
• Overall Accuracy = 92.4%
• Producer’s Accuracy (fire) = 89.9%
• Producer’s Accuracy (nonfire) = 94.7%
• User’s Accuracy (fire) = 94.2%
• User’s Accuracy (nonfire) = 90.7%

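As a check, the five measures can be recomputed from the counts in the MODIS error matrix on the previous slide. The function and dictionary key names below are illustrative, not from the project code.

```python
# Counts from the MODIS error matrix on the previous slide
TP, FP, FN, TN = 2834, 173, 318, 3103


def accuracy_measures(tp, fp, fn, tn):
    """Return the five accuracy measures as fractions between 0 and 1."""
    return {
        "overall": (tp + tn) / (tp + tn + fp + fn),
        "producer_fire": tp / (tp + fn),
        "producer_nonfire": tn / (fp + tn),
        "user_fire": tp / (tp + fp),
        "user_nonfire": tn / (tn + fn),
    }


measures = accuracy_measures(TP, FP, FN, TN)
for name, value in measures.items():
    print(f"{name}: {100 * value:.1f}%")
```

Rounded to one decimal place, these reproduce the quoted 92.4%, 89.9%, 94.7%, 94.2%, and 90.7%.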
Summary

Summary – NVO Data Mining Applications
Data Mining Resource Guide for Space Science: http://nvo.gsfc.nasa.gov/nvo_datamining.html
Sample Data Mining Applications within the NVO:
· Discover data stored in geographically distributed heterogeneous systems.
· Search huge databases for trends and correlations in high-dimensional parameter spaces: identify new properties or new classes of objects.
· Search for rare, one-of-a-kind, and exotic objects in huge databases.
· Identify temporal variations in objects from millions or billions of observations.
· Identify moving objects in huge survey catalogs and image databases.
· Identify parameter glitches / anomalies / deviations either in static databases (e.g., archives) or in dynamic data (e.g., science / telemetry / engineering data streams from remote satellites).
· Find clusters, nearest neighbors, outliers, and/or zones of avoidance in the distribution of astrophysical objects or other observables in arbitrary parameter spaces.
· Serendipitously explore the huge databases that will be part of the NVO, through access to distributed, autonomous, federated, heterogeneous, multi-wavelength, multi-mission astrophysics data archives.
http://www.us-vo.org/

Addressing NASA Exploration Challenges through Intelligent Data Understanding
Source: Human & Robotic Technology, Program Formulation Plan, 15 May 2004
• Autonomy: “making systems more intelligent”
• Robotic Networks: “enabling networks of cooperating robotic systems”
• Data-Rich Virtual Presence: “local and remote, both real-time and asynchronous virtual presence to enable effective science and robust operations (including tele-presence, tele-science, tele-supervision)”


