Welcome to IST 380! Data Science Programming
We don't have strong enough words to describe this class. - US News and Course Report
When the course was over, I knew it was a good thing. - New York Times Review of Courses
We give this course two thumbs! - Ebert and Roeper
(an advocate of concrete computing, and HMC's mascot)

About myself
Who: Zach Dodds
Where: Harvey Mudd College
What: Research includes robotics and computer vision
When: Mondays 7-10 pm here in ACB 119
Contact information: dodds@cs.hmc.edu, 909-607-0867
Office hours: Friday mornings, 9-11 am, or set up a time... HMC Beckman B111

TMI? fan of low-tech games, fan of low-level AI

IST 380 ~ the big picture: What is it? Why me?

IST 380 ~ the big picture: What is it? Data Science Venn Diagram. Hmmm… where am I on this diagram?

Data?!
• Neighbor's name
• A place they consider home
• Are they working at a company now? Where?
• How many U.S. states have they visited?
• Their favorite unhealthy food…?
• Do they have any "Data Science" background? (statistics, machine learning, CS)

state reminders…

Data!
• Neighbor's name: Zachary Dodds
• A place they consider home: Pittsburgh, PA
• Are they working at a company now? Where? Harvey Mudd
• How many U.S. states have they visited? 44
• Their favorite unhealthy food…? M&Ms
• Do they have any "Data Science" background? (statistics, machine learning, CS) mostly CS for me…

Data!
• Neighbor's name: Zachary Dodds
• A place they consider home: Pittsburgh, PA
• Are they working at a company now? Where? Harvey Mudd
• How many U.S. states have they visited? 44
• Their favorite unhealthy food…? M&Ms
• Do they have any "Data Science" background? (statistics, machine learning, CS) mostly CS for me…
This class is truly seminar-style: I'm here, as you are, in order to gain insights into this very new field…
Be sure to set up your login + profile for the submission site…

Data Science concerns: Is "Data Science" important or just trendy?

Data Science concerns: Hmmm…

The companies are expanding as fast as the data!

There's certainly a lot of it! Data, data everywhere…
Data produced each year (logarithmic scale):
2002: 5 EB
2006: 161 EB
2009: 800 EB
2011: 1.8 ZB
2015: 8.0 ZB
Scale markers: 1 TB = 1000 GB, 1 Petabyte == 1000 TB, 120 PB, 1 Exabyte, 1 Zettabyte
Human brain's capacity: 14 PB
100 years of HD video + audio: 60 PB
References
(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm
(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf
(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
(2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
(life in video) 60 PB: in 4320p resolution, extrapolated from 16 MB for 1:21 of 640x480 video (w/sound); almost certainly a gross overestimate, as sleep can be compressed significantly!
(brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
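As a rough worked example of the scale on this slide, the sketch below converts the quoted figures to bytes using the decimal units given there (1 TB = 1000 GB, 1 PB = 1000 TB) and compares the 2002 and 2015 estimates. The variable names are made up for illustration, and the steady-growth multiplier is only a back-of-the-envelope assumption, not something the cited reports claim.

```r
# Decimal byte units, as on the slide: each step up is a factor of 1000
GB <- 1e9
TB <- 1000 * GB
PB <- 1000 * TB
EB <- 1000 * PB
ZB <- 1000 * EB

# Yearly data-production estimates quoted on the slide (hypothetical variable name)
produced <- c("2002" = 5 * EB, "2006" = 161 * EB, "2009" = 800 * EB,
              "2011" = 1.8 * ZB, "2015" = 8 * ZB)

# Overall growth factor between the 2002 and 2015 estimates
growth <- produced[["2015"]] / produced[["2002"]]
growth

# Assumption: if growth were steady, this would be the implied yearly multiplier
years <- 2015 - 2002
annual_factor <- growth^(1 / years)
annual_factor

# A log10 scale, like the slide's axis, makes the yearly values comparable
log10(produced)
```

The overall factor works out to 1600, which is why the slide needs a logarithmic axis to show all five years at once.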

I'd call it data, not information. (Diagram levels: wisdom, knowledge, information, data)

Big Data? I agree with this…

Make data easier to use ~ by using it! It may be true that Data Science isn't a science, but that doesn't mean it's not useful!

IST 380 ~ the big picture
What? Data Science Programming
Why? Data Rules
All of our insights (large and small, permanent and ephemeral, natural and artificial) come about through the integration of lots of data. Data Science simply recognizes that the rules and skills behind those insights are widely applicable…

A few examples… Make3D, Andrew Ng ~ Computers and Thought award, 2009. How is this being done? And how do we succeed? … Data Science is at the heart of computer science

A few examples… Learning to Powerslide: Stanford's Autonomous Vehicles project (Thrun et al.) … Data Science is at the heart of computer science

A few examples… Learning ground from obstacles: "my summer was finding that red line" … Data Science is at the heart of computer science

A few examples… Learning ground from obstacles: classification, segmentation

Insights beyond science

Marketing

Visualization Motivation

Recommender Systems: predicting movie ratings

Netflix Prize
(I don't know this guy) Bob Bell, winner of the "Netflix prize"
Some films are difficult to predict…
Napoleon Dynamite = 1.22
Batman Begins = 0.75
Finding Nemo = ??
Lord of the Rings = ??

Netflix Prize
(I don't know this guy) Bob Bell, winner of the "Netflix prize"
Some films are difficult to predict… and others are easier!
Napoleon Dynamite = 1.22
Batman Begins = 0.75
Finding Nemo = 0.67
Lord of the Rings = 0.42

Why IST 380?
Specific skills:
R statistical environment (and the S programming language)
Experience with several statistical analyses (descriptive statistics)
Experience with predictive statistics (modeling) and machine learning algorithms
Broad background:
Final project ~ open-ended with datasets of your choice
You'll be confident and capable with whatever datasets you encounter in the future, on your own or as part of a team.

About IST 380…

Details
Web page: http://www.cs.hmc.edu/~dodds/IST380
Assignments, online text, necessary files, and lecture slides are linked
First week's assignment: Getting started with R
Textbook: An Introduction to Data Science, freely available online at jsresearch.net/groups/teachdatascience/ (and many online resources…)
Programming: R, www.r-project.org/
Grab both of these now…

Homepage: Go to the course page and grab R and the text from these two links… http://www.cs.hmc.edu/~dodds/IST380/

Homework
Assignments: ~2-5 problems/week, ~100 points, extra credit often
Due Tuesday of the following week by 11:59 pm.
Assignment 1 due Tuesday, February 5. 1 week + 1 day…

Homework
Assignments: ~2-5 problems/week, ~100 points, extra credit often
Due Tuesday of the following week by 11:59 pm.
Assignment 1 due Tuesday, February 5.
Working on programs: On your own or in groups of 2. Divide the work at the keyboard evenly!
Submitting programs: at the submission website
Today's Lab: install software, ensure accounts are working, try out R (the first HW is officially due on 2/5)

Outline (weeks are approximate!)
Weeks 1-5, "Data Science": using R, descriptive statistics, predictive statistics, probability distributions, statistical modeling
Weeks 6-10, "Machine Learning": support vector machines (SVMs), nearest neighbors (NN), random forests, k-means algorithm
Weeks 11-15: Final Project
No breaks?!

Grading
Grades: based on points percentage
~800 points for assignments
~400 points for the final project
if score >= 0.95: grade = "A"
if score >= 0.90: grade = "A-"
if score >= 0.86: grade = "B+"
see the course syllabus for the full list...
Final project
• the last ~4 weeks will work towards a larger, final project
• there will be a short design phase and a short final presentation
• choose your own problem to study (I'll have some suggestions, too.)
• I'd encourage you to connect R and our Data Science techniques to other datasets or projects that you use/need/like, etc.
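The grade cutoffs on the Grading slide can be sketched as a small R function. This is only an illustration: the function name `letter_grade` and the default total of 1200 points (the approximate 800 + 400) are assumptions, and only the three cutoffs shown on the slide are encoded; everything below B+ defers to the syllabus.

```r
# Hypothetical helper: map a point total to the letter grades listed on the slide.
# Assumption: total = 1200 comes from "~800 points" + "~400 points"; both are approximate.
letter_grade <- function(points, total = 1200) {
  score <- points / total          # points percentage, as a fraction
  if (score >= 0.95) {
    "A"
  } else if (score >= 0.90) {
    "A-"
  } else if (score >= 0.86) {
    "B+"
  } else {
    "see syllabus"                 # the remaining cutoffs are not on the slide
  }
}

letter_grade(1150)   # 1150/1200 is about 0.958
letter_grade(1040)   # 1040/1200 is about 0.867
```

Chaining with `else if` matters here: checking the highest cutoff first ensures that a score of 0.96 is reported as "A" rather than falling through to a lower grade.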

Academic Honesty
This course operates under CGU's (and all of Claremont Schools') Academic Honesty policies…
• Your work must be your own. This must be true for the whole team, if you're working in a pair.
• Consulting with others (beyond your team members or me) is encouraged, but it has to be limited to discussion and debugging of problems. Sharing of written, electronic, or verbal solutions/files/code is a violation of CGU's academic honesty policy.
• A reasonable guideline: Work is your own if you could delete all of it and recreate it yourself.

Thoughts?

Getting to know… R

Getting to know… R
http://lang-index.sourceforge.net/#categ
R is the programmer's toolkit for statistics; SAS, Stata, SPSS are preferred by those in business intelligence

Getting to know… R: Free… and very well supported online…

Getting to know… R: R is responsive, up-to-date, and flexible: Data Science vs. Statistics

Getting to know… R
1) Find the IST 380 course webpage: www.cs.hmc.edu/~dodds/IST380/
2) Download and install R
3) Run R and try some basic commands at the prompt:
6 * 7
rnorm(10)
x <- 380
Try it!
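The three commands from step 3 are shown below with comments describing what each does. The `set.seed()` call and the name `draws` are additions for illustration (an assumption, so that the random numbers are repeatable); they are not part of the slide.

```r
# Arithmetic at the prompt: R evaluates the expression and prints the result
6 * 7            # 42

# rnorm(10) draws ten random values from a standard normal distribution
set.seed(380)    # assumption: fixing the seed makes the draws repeatable
draws <- rnorm(10)
draws
length(draws)    # 10

# <- is R's assignment operator; typing the name prints the stored value
x <- 380
x
```

Without `set.seed()`, each call to `rnorm(10)` prints ten different numbers, which is expected.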

Getting started!
1) Open Matloff's Why R? notes
2) Skip ahead to page 7, the "5 minute example session"
3) Try out the commands in section 2.2 to get started…
4) When you finish, save your session and submit it!
This is problem 1 this week

Saving your session
1) Create a folder named hw1, perhaps on your desktop
2) Use Save to file… (Windows) or Save as… (Mac) to save your current console session into hw1
3) Name that file pr1.txt
4) From your operating system, open up that file in order to confirm it contains your whole session!
This is problem 1 this week

Submitting your work
1) Zip up hw1 into hw1.zip
2) From the course webpage, click on the submission site link.
3) Choose a submission site login name & let me know!
4) Once your account is made, log in, change your password to something you know, and submit hw1.zip
5) You can submit again; all copies are saved…
You've completed Problem 1!
Troubles? Email me! This webserver can be spacey -- I should know!

Reflection: Assignment? Creating a vector? Printing? Average and standard deviation? Comments?
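For reference, the four operations named in the reflection can be sketched in a few lines. The vector name `scores` and its values are made up for illustration and do not come from Matloff's example session.

```r
# Assignment and vector creation: c() combines values into one vector
scores <- c(1, 2, 3, 4, 5)   # hypothetical data

# Printing: either call print() or just type the name at the prompt
print(scores)

# Average and (sample) standard deviation
mean(scores)   # 3
sd(scores)     # sqrt(2.5), about 1.58
```

Note that `sd()` computes the sample standard deviation (dividing by n - 1), which is why the result here is the square root of 2.5 rather than of 2.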

R types: You can use mode() to view the type of a variable.
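A quick sketch of `mode()` on a few common values; the example values are chosen here for illustration.

```r
# mode() reports the basic storage type of a value or variable
mode(380)          # "numeric"
mode("IST")        # "character"
mode(TRUE)         # "logical"

x <- c(1.5, 2.5, 3.5)
mode(x)            # "numeric": a vector has the mode of its elements
```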

Where's the big data?
c ~ concatenate
the colon : also creates vectors
Vectors are R lists of a single type of element
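The two ways of building vectors mentioned above, `c()` and the colon, can be tried as follows. The variable names and values are illustrative assumptions.

```r
# c() concatenates values into a vector
states <- c(44, 12, 30)

# The colon builds a vector of consecutive integers
one_to_ten <- 1:10

# c() also joins existing vectors end to end
combined <- c(states, one_to_ten)
length(combined)       # 13

# Every element shares a single type, so mixed input is coerced:
mixed <- c(1, "two", 3)
mode(mixed)            # "character": the numbers became strings
```

The last lines show what "a single type of element" means in practice: R silently converts the numbers to strings instead of reporting an error.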

Analyzing vectors: try these… Square brackets [] can "subset" (or "slice") vectors

Analyzing vectors: you can use a boolean vector to subset another vector. Square brackets [] can "subset" (or "slice") vectors
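A short sketch of both kinds of subsetting, by position and by boolean vector. The vector `v` and the cutoff of 25 are made-up values for illustration.

```r
v <- c(10, 20, 30, 40, 50)   # hypothetical data

# Subsetting by position (R indexes from 1)
v[2]        # 20
v[2:4]      # 20 30 40
v[-1]       # everything except the first element: 20 30 40 50

# A comparison produces a boolean (logical) vector of the same length...
keep <- v > 25
keep        # FALSE FALSE TRUE TRUE TRUE

# ...which can then subset the original vector
v[keep]     # 30 40 50
v[v > 25]   # same thing in one step
```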

NA: R uses NA to represent data that is "not available". The function is.na() tests for NA. What is going on here? This uses subsetting to remove NA values!
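The slide's own example is not preserved in this transcript, so the following is a stand-in sketch of the same idea: using `is.na()` inside square brackets to drop missing values. The vector `visited` and its values are assumptions.

```r
visited <- c(44, NA, 12, NA, 30)   # hypothetical data with two missing entries

# is.na() returns TRUE wherever a value is missing
is.na(visited)                     # FALSE TRUE FALSE TRUE FALSE

# ! negates the test, so this keeps only the values that are present
clean <- visited[!is.na(visited)]
clean                              # 44 12 30

# Why it matters: NA propagates through most computations
mean(visited)                      # NA
mean(clean)                        # about 28.67
mean(visited, na.rm = TRUE)        # same result without subsetting first
```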

Data frames: R's fundamental data structures are data frames. The next tutorial will introduce them…

Irises… (pictured: virginica, setosa) data() yields many built-in data files. This is iris

Subsetting iris data: df[rows, cols]. As with vectors, you can "subset" data frames.
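A few examples of the `df[rows, cols]` pattern on the built-in `iris` data frame. The particular rows and columns selected are arbitrary choices for illustration.

```r
data(iris)                 # load the built-in iris data frame

dim(iris)                  # 150 rows, 5 columns

# df[rows, cols]: leaving one side blank means "all of them"
iris[1:5, ]                                # first five rows, every column
iris[1:3, c("Sepal.Length", "Species")]    # three rows, two named columns

# As with vectors, a boolean vector can pick out rows
setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)               # 50
mean(setosa$Sepal.Length)  # average sepal length for that species
```

The comma is easy to forget: `iris[1:5, ]` selects rows, while `iris[1:5]` without a comma selects columns instead.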

Lab…
The 2nd part of each class meeting is dedicated to lab work. I welcome you to stay for the lab, but it is not required.
Today's lab: Work through Santorico and Shin's Tutorial for the R Statistical Package and submit the console sessions as pr2_1.txt, and pr2_1.txt. This is a nice reinforcement of vectors, an introduction to data frames, and a look at the graphics that R supports.

Homework
Problem 3: Challenge exercises in R. These will reinforce the "subsetting" and data-analysis introduction from pr2's tutorial.
Problem 4: Introduction to Data Science, early chapters. This is a fuller background on R and the field of data science.
(submit your console session for both of these…)

Lab!

CS vs. IS and IT? greater integration, system-wide issues; smaller details, machine specifics. www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf

CS vs. IS and IT? Where will IS go?

CS vs. IS and IT?

IT? Where will IT go?

IT?

The bigger picture
Weeks 10-12: Objects
Week 10: classes vs. objects
Week 11: methods and data
Week 12: inheritance
Weeks 13-15: Final Projects
Week 13: final projects
Week 14: final projects
Week 15: final exam

Data!
• Neighbor's name: Zachary Dodds
• A place they consider home: Pittsburgh, PA
• Are they working at a company now? Where? Harvey Mudd
• How many U.S. states have they visited? 44
• Their favorite unhealthy food…? M&Ms
• Do they have any "Data Science" (statistics, machine learning, CS) background? mostly CS for me…
This class is truly seminar-style: we're developing expertise in this field together.
Be sure to set up your login + profile for the submission site…