ab1473315d4530cd65d53711e8bb295d.ppt
- Number of slides: 27
ChatterGrabber: Using Social Media to Discover and Characterize Public Health Threats
By James Schlitt, Elizabeth Musser, Dr. Bryan Lewis, and Dr. Stephen Eubank
Introduction
Social media surveillance is a valuable tool for epidemiological research:
– Pros: Cheap, consistent, and easy-to-parse data source
– Cons: Specificity varies by condition; signal matching is needed
The Twitter Norovirus Study
• The Virginia Department of Health requested a tool, developed under MIDAS funding, to track Norovirus and Gastrointestinal Illness (GI) outbreaks within Montgomery County, VA, with the following capabilities:
– Automated surveillance of social media
– No special skills required to use
– Forward compatible for GIS applications
• Twitter was well suited to GI outbreak surveillance due to the short duration of infection and Tweet-worthy symptoms, for example:
@ATweeter20 My hubs did some vomiting w/ his flu. Had stuff messing w/ his tummy - high fever & snot was bad. Get better!
• The study was challenged by low population density and a high degree of linguistic confounding
ChatterGrabber Introduction
ChatterGrabber: A social media data miner, developed in Python and based on the Twitter search method
– Specialized hunters pull from Google Docs Interface (GDI) for simplified partner access
– Natural Language Processing (NLP) classifiers filter content
– Classifiers can be automatically optimized on the cluster
– CSV data, charts, maps, word clouds, and animations are sent nightly to subscribers
– Tracks shared links, hashtags, and images
– Can be linked with EpiDash to provide an online dashboard for field epidemiologists
General Execution
• Pull list of condition phrases & config from Google spreadsheet
• Partition conditions into {x} queries
• Search radius > 35 miles?
– Yes: Generate {y} coordinate sets via covering algorithm, then prepare search with |x|*|y| queries
– No: Prepare search with |x| queries
• Run Twitter search, from last tweet ID recorded for location and query pair
• Filter results by phrases, classifiers, and location; sleep
• Has a new day begun?
– Yes: Store data, send subscribers report and config link
– No: Keep searching
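The query-planning step of this flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ChatterGrabber's own code: the function names, the OR-joined query format, and the `max_chars` length limit are all hypothetical.

```python
from itertools import product


def partition_conditions(conditions, max_chars=100):
    """Greedily pack condition phrases into OR-joined queries.

    max_chars is an assumed per-query length limit, not a documented value.
    """
    queries, current = [], []
    for phrase in conditions:
        # Quote multi-word phrases so they are searched as a unit.
        term = f'"{phrase}"' if " " in phrase else phrase
        candidate = " OR ".join(current + [term])
        if current and len(candidate) > max_chars:
            queries.append(" OR ".join(current))
            current = [term]
        else:
            current.append(term)
    if current:
        queries.append(" OR ".join(current))
    return queries


def plan_searches(conditions, centers, max_chars=100):
    """Pair every query with every search center: |x| * |y| searches."""
    queries = partition_conditions(conditions, max_chars)
    return list(product(queries, centers))


conditions = ["vomiting", "stomach flu", "norovirus"]
centers = [(37.2, -80.4), (37.5, -79.9), (36.9, -81.0)]  # hypothetical (lat, lon) pairs
searches = plan_searches(conditions, centers, max_chars=30)
```

With a single center (a search radius of 35 miles or less), the same function yields |x| searches, matching the "No" branch of the flow.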
ChatterGrabber GDI Interface Example
ChatterGrabber Search Methods
Pure Query Search:
• User defines conditions, qualifiers, & exclusions
• Program searches by conditions and keeps a tweet if its text has at least 1 qualifier and no exclusions
• Simple and easy to set up, but vulnerable to hyperbole and the complexities of phrasing
NLP Search:
• Classifiers created from manually classified output from pure query search
• Optimal classifier configurations selected via genetic programming
• Classifier discards a percentage of tweets that don't fit desired categories
• Powerful, but requires longer setup, a good training set, and compute time
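The pure query search rule (keep a tweet with at least one qualifier and no exclusions) is small enough to show directly. The sketch below assumes case-insensitive substring matching; the function name and the matching strategy are illustrative assumptions rather than ChatterGrabber's actual implementation.

```python
def keep_tweet(text, qualifiers, exclusions):
    """Return True if text has at least one qualifier and no exclusions.

    Matching is case-insensitive substring matching (an assumption).
    """
    lowered = text.lower()
    if any(term.lower() in lowered for term in exclusions):
        return False
    return any(term.lower() in lowered for term in qualifiers)


qualifiers = ["vomiting", "sick"]
exclusions = ["sick of"]

kept = keep_tweet("I feel sick, been vomiting all night", qualifiers, exclusions)
dropped = keep_tweet("So sick of this homework", qualifiers, exclusions)
```

The second tweet illustrates the weakness named on the slide: figurative phrasing has to be anticipated one exclusion at a time.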
Classifier Genetic Programming
Distributed algorithm used to optimize the classification mode, n-gram degree set, and feature frequency limits for a given training set

NLPFile          NLPMode  NLPnGrams  NLPFreqLimit  SVMMode  SVMNumber
#MendelsPy optimized 8/20/2014 classifier, switched on #Gen: 438 Best: 70.88 Seed: 592 Running: 998 Time: Wed Aug 20 11:59:39 2014
lymeScores2.csv  svm      12357      15422         number   1000

NLPFile          NLPMode  NLPnGrams  NLPFreqLimit  SVMMode  SVMNumber
#MendelsPy optimized 8/22/2014 classifier, switched on #Gen: 455 Best: 72.03 Seed: 295 Running: 996 Time: Thu Aug 21 19:42:41 2014
lymeScores2.csv  svm      1_2_7      2_2_2         number   960

NLPFile          NLPMode  NLPnGrams  NLPFreqLimit  SVMMode  SVMNumber
#MendelsPy optimized 9/05/2014 classifier, switched on #Gen: 649 Best: 76.30 Seed: 981 Running: 1000 Time: Thu Sep 4 06:55:37 2014
lymeScores2.csv  svm      1_2_7      3_3_5         number   960
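To make the idea of evolving a classifier configuration concrete, here is a minimal single-process sketch. It is a plain genetic algorithm with elitism, which may differ from the distributed optimizer used by the authors; the mode names, value ranges, and all function names are assumptions, and the fitness function is supplied by the caller (in practice it would be something like cross-validated accuracy on the training set).

```python
import random

MODES = ("svm", "naive_bayes", "max_ent")  # hypothetical mode names
DEGREES = tuple(range(1, 8))               # candidate n-gram degrees
LIMITS = tuple(range(1, 6))                # candidate frequency limits


def random_config(rng):
    """Draw a random configuration: mode, n-gram degrees, per-degree limits."""
    degrees = tuple(sorted(rng.sample(DEGREES, rng.randint(1, 3))))
    limits = tuple(rng.choice(LIMITS) for _ in degrees)
    return {"mode": rng.choice(MODES), "ngrams": degrees, "freq_limits": limits}


def mutate(config, rng):
    """Re-draw either the mode or the n-gram settings."""
    fresh = random_config(rng)
    child = dict(config)
    if rng.random() < 0.5:
        child["mode"] = fresh["mode"]
    else:
        child["ngrams"] = fresh["ngrams"]
        child["freq_limits"] = fresh["freq_limits"]
    return child


def evolve(fitness, generations=30, population_size=20, seed=0):
    """Keep the fitter half each generation and refill it with offspring."""
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: population_size // 2]  # elitism: survivors unchanged
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            # Crossover: mode from one parent, n-gram settings from the other.
            child = {"mode": a["mode"], "ngrams": b["ngrams"],
                     "freq_limits": b["freq_limits"]}
            children.append(mutate(child, rng) if rng.random() < 0.3 else child)
        population = parents + children
    return max(population, key=fitness)
```

Because the fitter half of each generation survives unchanged, the best score can never decrease from one generation to the next, and fixing the seed makes a run repeatable.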
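Tweet Linguistic Classification
• Tweet passed for classification
• Using NLP mode?
– Yes: Extract features from Tweet, then classify Tweet by features. Is Tweet classification sought?
   Yes: Store Tweet data and derived data
   No: Go to "Keeping non-hits?"
– No: Does Tweet contain an exclusion?
   Yes: Go to "Keeping non-hits?"
   No: Does Tweet contain a qualifier? If yes, store Tweet data and derived data; if no, go to "Keeping non-hits?"
• Keeping non-hits?
– Yes: Store Tweet data and derived data
– No: Discard Tweet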
n-gram Classification

Gastrointestinal Illness
n-gram               Found  Class  Weight
{feel, sick}         True   2:1    150:1
{sick}               True   2:1    40:1
{feel}               True   2:1    30:1
{sick, right}        True   2:3    28:1
{@user}              True   1:2    27:1
{sick}               False  1:2    27:1

Firearm Violence
n-gram               Found  Class  Weight
{victim}             True   2:0    148:1
{suspect}            True   2:0    127:1
{fatal, shooting}    True   2:0    124:1
{another, shooting}  True   3:0    96:1
{police}             True   2:0    88:1
{info}               True   2:0    84:1
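The "Found" column above records whether a given n-gram is present in a tweet. A minimal sketch of that feature extraction follows, assuming contiguous word n-grams and a simple regular-expression tokenizer; the slide's set notation may instead denote unordered word combinations, and the function names here are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens, keeping @mentions and #hashtags."""
    return re.findall(r"[@#]?\w+", text.lower())


def ngrams(tokens, degree):
    """Return the set of contiguous n-grams of the given degree."""
    return {tuple(tokens[i:i + degree]) for i in range(len(tokens) - degree + 1)}


def extract_features(text, degrees, vocabulary):
    """Map each vocabulary n-gram to True/False: was it found in the text?"""
    tokens = tokenize(text)
    present = set()
    for degree in degrees:
        present |= ngrams(tokens, degree)
    return {gram: gram in present for gram in vocabulary}


vocabulary = [("feel", "sick"), ("sick",), ("victim",)]
features = extract_features("I feel sick right now", (1, 2), vocabulary)
```

A dictionary of this shape (feature name mapped to a found/not-found value) is the kind of input that dictionary-based classifiers such as those in NLTK accept.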
NLP Classifier Demonstration
VBI Student Classifiers In Use

Classifier          Categories Detected v. Irrelevant
Norovirus           Infectious illness and infectious GI illness
Firearm Violence    First hand accounts and media related discussion
Tick Bite Zoonosis  Strong suspicion of tick bites and Lyme Disease discussion
Vaccine Sentiment   Pro-vaccine discussion, anti-vaccine discussion, and general awareness
Ebola               Sentiments, observations, and social unrest
EMS Emergencies     Infrastructure, natural disasters, medical, trauma, automobile, fire, violence, general crime
ChatterGrabber Geographic Methods
• Takes lat/lon box or lists of locations (cities, states, or countries)
• Large lat/lon boxes filled via covering algorithm
• Results are geocoded via GoogleMapsV3 API:
– If coordinates present: finds street address
– If name given: finds coordinates and address
– If coordinates are outside of lat/lon box: discards tweet
• All geo queries are cached and shared between experiments to reduce API utilization
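One simple way to fill a lat/lon box with circular search areas is a grid whose cells each fit inside a circle of the search radius. The sketch below is an assumption about how such a covering could work, not the authors' algorithm: it uses a flat-earth approximation of about 69 miles per degree of latitude, and every name in it is hypothetical.

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # rough approximation


def cover_box(lat_min, lat_max, lon_min, lon_max, radius_miles=35.0):
    """Return (lat, lon, radius) search circles whose union covers the box."""
    # A square of side r * sqrt(2) fits exactly inside a circle of radius r.
    step_miles = radius_miles * math.sqrt(2)
    lat_step = step_miles / MILES_PER_DEGREE_LAT
    # Degrees of longitude are widest (in miles) nearest the equator, so
    # size the longitude step there to stay conservative across the box.
    if lat_min <= 0.0 <= lat_max:
        ref_lat = 0.0
    else:
        ref_lat = min(abs(lat_min), abs(lat_max))
    lon_step = step_miles / (MILES_PER_DEGREE_LAT * math.cos(math.radians(ref_lat)))
    n_rows = max(1, math.ceil((lat_max - lat_min) / lat_step))
    n_cols = max(1, math.ceil((lon_max - lon_min) / lon_step))
    cell_lat = (lat_max - lat_min) / n_rows
    cell_lon = (lon_max - lon_min) / n_cols
    return [
        (lat_min + (i + 0.5) * cell_lat, lon_min + (j + 0.5) * cell_lon, radius_miles)
        for i in range(n_rows)
        for j in range(n_cols)
    ]


def in_box(lat, lon, lat_min, lat_max, lon_min, lon_max):
    """True if a geocoded point falls inside the lat/lon box."""
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


state_box = (36.5, 39.5, -83.7, -75.2)   # illustrative box, roughly Virginia
circles = cover_box(*state_box)
small = cover_box(37.1, 37.3, -80.6, -80.2)
```

Circles at the edge of the grid extend past the box, which is why a separate `in_box` check on the geocoded coordinates is still needed to discard out-of-area tweets.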
Virginia Norovirus Results
Virginia Norovirus Results
• Found and geolocated 4,000-8,000 suspected Norovirus tweets per day across the US during peak Norovirus season
• Estimated 86% accuracy from a cross-validated, 2,000-tweet training set
National Firearm Violence Results
Results Compared
Africa Ebola Results
Africa Ebola Corruption Sentiments
Resources Available
• ChatterGrabber workshop:
– 4 hour, zero to hero workshop on running ChatterGrabber
– https://drive.google.com/drive/folders/0B8VhE4yQ6s1FfnpHeEZaTHk0bU9IZjl6QmNlZmxVMUJmd3RIZGthbUl3NFRnNGZfekVhQk0
• ChatterGrabber VM:
– Developed in parallel to the workshop activities
– Contains a functioning ChatterGrabber instance with Ebola tweet data for analysis and visualization
– Requires a Twitter dev account and Gmail account to pull and send live reports
• ChatterGrabber GitHub:
– https://github.com/jschlitt84/ChatterGrabber
• Labeled tweet sets
New Capabilities
• Historic Reports:
– ChatterGrabber can now send historic email reports
– Reports will filter prior data through the current configuration
– Can be used to explore how well new settings may have caught previous outbreaks
• Spatiotemporal cluster detection:
– Identifies and maps clusters linked in space and time
– Integration work in progress
Limitations
• Results exceed the geographic and temporal resolution of common surveillance systems, complicating verification
• No true denominator: ChatterGrabber only collects queried hits
• Not all desired information is available in social media, and some may be incomplete or falsified
• ChatterGrabber is just an information gathering method; external analysis and review are needed for validity
• Twitter users will differ from the population at large
Conclusions
● ChatterGrabber provides an easy-to-use social media surveillance tool
– Natural Language Processing speeds illness identification
– Geographic region directed searching allows complete coverage of any user-defined jurisdiction
– Data & visualization reports provide rapid, shareable, qualitative assessments
Licenses
ChatterGrabber and the ChatterGrabber Virtual Machine are released under the GNU General Public License for sharing with attribution for academic work within the realm of public health.
Copyright (C) 2015 James Schlitt
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses>
Acknowledgements
Many thanks to my advisors Dr. Bryan Lewis and Dr. Stephen Eubank for their guidance and involvement in every facet of this project, to Harshal Hayatnagarkar and Elizabeth Musser for their work on the EpiDashboard System, to P. Alexander Telionis, Meredith Wilson, Cedric Mubikayi Kabasele, and Gizem Korkmaz for their work in planning, training, and analyzing classifiers, to Gaurav Tuli, Sherif Abdelhamid, Farzaneh Tabataba, and Harshal Hayatnagarkar for their work on the NDSSL #HackEbola Twitter Team, to Daniel Chen for his assistance in preparing this workshop, and to our partners for their faith and patience in our debugging efforts.
ChatterGrabber was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number 5U01GM070694-11 and the Defense Threat Reduction Agency under DTRA CNIMS Contract HDTRA1-11-D-0016-0001.
Questions? Thank you!
jschlitt@vbi.vt.edu