Anton Boyko - Big Data.pptx
- Number of slides: 16
Big Data Anton Boyko
Agenda • What is Big Data? • Why Big Data? • How to Big Data?
What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Gigabytes Terabytes Petabytes …
Data growth • Velocity: 4.3 • Volume: 10x • Variety: 85% • Big Data
How to process Big Data? Traditional way Appropriate way
Move data to compute
Move compute to data • Fast storage vs. fast CPU and fast networking • Linear scalability
Map/Reduce workflow: File system → Mappers (find matches) → Reducers (combine matches) → DFS temp → Mappers (invert keys and values) → Reducer (combine results) → File system
Map/Reduce – how it works
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using Microsoft.Hadoop.MapReduce;

public class NamespaceMapper : MapperBase
{
    // Override the map method.
    public override void Map(string inputLine, MapperContext context)
    {
        // Match "using <namespace>;" directives in the input line.
        var reg = new Regex(@"(using)\s[A-Za-z0-9_\.]*;");
        var matches = reg.Matches(inputLine);
        foreach (Match match in matches)
        {
            // Just emit the namespaces.
            context.EmitKeyValue(match.Value, "1");
        }
    }
}

public class NamespaceReducer : ReducerCombinerBase
{
    // Accepts each key and counts the occurrences.
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // Write back the key with its count.
        context.EmitKeyValue(key, values.Count().ToString());
    }
}
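The two-stage workflow (find matches, combine matches, invert keys and values, combine results) can be tried out in memory without a Hadoop cluster. The sketch below is an illustration only: it uses plain LINQ rather than the Hadoop SDK, the names `LocalMapReduceSketch`, `MapFindMatches`, `ShuffleAndReduce`, and `Run` are hypothetical, and the second stage (grouping namespaces by their count) is an assumption about what "combine results" means on the workflow slide.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LocalMapReduceSketch
{
    // Matches a "using <namespace>;" directive, as on the slide.
    private static readonly Regex UsingDirective =
        new Regex(@"using\s[A-Za-z0-9_.]*;");

    // Stage 1 map: emit (directive, "1") for every match in a line.
    public static IEnumerable<KeyValuePair<string, string>> MapFindMatches(string inputLine)
    {
        foreach (Match match in UsingDirective.Matches(inputLine))
        {
            yield return new KeyValuePair<string, string>(match.Value, "1");
        }
    }

    // Shuffle and reduce: group the emitted pairs by key,
    // then apply the reducer once per key.
    public static IEnumerable<KeyValuePair<string, string>> ShuffleAndReduce(
        IEnumerable<KeyValuePair<string, string>> pairs,
        Func<string, IEnumerable<string>, string> reduce)
    {
        return pairs
            .GroupBy(p => p.Key, p => p.Value)
            .Select(g => new KeyValuePair<string, string>(g.Key, reduce(g.Key, g)));
    }

    public static IEnumerable<KeyValuePair<string, string>> Run(IEnumerable<string> lines)
    {
        // Stage 1: find matches, then count occurrences per directive.
        var counts = ShuffleAndReduce(
            lines.SelectMany(MapFindMatches),
            (key, values) => values.Count().ToString());

        // Stage 2 (assumed): invert keys and values so the count becomes
        // the key, then combine all directives that share a count.
        var inverted = counts.Select(
            p => new KeyValuePair<string, string>(p.Value, p.Key));
        return ShuffleAndReduce(
            inverted,
            (key, values) => string.Join(", ", values.OrderBy(v => v, StringComparer.Ordinal)));
    }

    public static void Main()
    {
        var lines = new[] { "using System;", "using System.Linq;", "using System;" };
        foreach (var pair in Run(lines))
        {
            Console.WriteLine(pair.Key + "\t" + pair.Value);
        }
    }
}
```

Each output row pairs an occurrence count with the namespaces seen that many times. On a real cluster the grouping step is done by the framework between the map and reduce phases; here `GroupBy` stands in for it.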
Traditional RDBMS vs. Map/Reduce
RDBMS: • Terabytes of data • Static schema • Interactive and batch access • Nonlinear scaling
Map/Reduce: • Exabytes of data (or more) • Dynamic schema • Batch access only • Linear scaling
Hadoop – implementation of Map/Reduce engine
Hadoop ecosystem
Offering • ODBC for Excel • PowerPivot • Windows Server or Windows Azure • C#, JavaScript
Demo
Pricing
Head Node: • Single extra large instance (8 CPU, 14 GB) • $0.32 per hour • $238 per month
Compute Node: • One or more large instances (4 CPU, 7 GB) • $0.16 per hour • $119 per month
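As a rough illustration of how these rates add up for a whole cluster, the sketch below computes the hourly and monthly cost of one head node plus a number of compute nodes. The class and method names are hypothetical, and the 744-hour month (24 × 31) is an assumption, chosen because it reproduces the monthly figures on the slide ($0.32 × 744 ≈ $238, $0.16 × 744 ≈ $119).

```csharp
using System;

public static class ClusterCostSketch
{
    // Hourly rates taken from the pricing slide.
    public const decimal HeadNodeHourly = 0.32m;
    public const decimal ComputeNodeHourly = 0.16m;

    // One head node plus the given number of compute nodes.
    public static decimal HourlyCost(int computeNodes)
    {
        if (computeNodes < 1)
        {
            throw new ArgumentOutOfRangeException(
                nameof(computeNodes), "At least one compute node is required.");
        }
        return HeadNodeHourly + computeNodes * ComputeNodeHourly;
    }

    // Assumes a 31-day month (744 hours) unless told otherwise.
    public static decimal MonthlyCost(int computeNodes, int hoursPerMonth = 24 * 31)
    {
        return HourlyCost(computeNodes) * hoursPerMonth;
    }

    public static void Main()
    {
        // Example: a head node with four compute nodes.
        Console.WriteLine(MonthlyCost(4));
    }
}
```

Because the cost grows by a fixed amount per compute node, the bill scales linearly with cluster size, in line with the linear scaling the deck attributes to Map/Reduce.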
Questions? Anton Boyko boyko.ant@live.com