Next Generation of Apache Hadoop MapReduce
Owen O'Malley
oom@yahoo-inc.com
@owen_omalley

What is Hadoop?
§ A framework for storing and processing big data on lots of commodity machines.
  - Up to 4,000 machines in a cluster
  - Up to 20 PB in a cluster
§ Open Source Apache project
§ High reliability done in software
  - Automated failover for data and computation
§ Implemented in Java
§ Primary data analysis platform at Yahoo!
  - 40,000+ machines running Hadoop

What is Hadoop?
§ HDFS – Distributed File System
  - Combines the cluster's local storage into a single namespace
  - All data is replicated to multiple machines
  - Provides locality information to clients
§ MapReduce
  - Batch computation framework
  - Tasks re-executed on failure
  - User code wrapped around a distributed sort
  - Optimizes for data locality of input
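
The deck stays at the bullet level, so here is a minimal word-count job against Hadoop's org.apache.hadoop.mapreduce API to make "user code wrapped around a distributed sort" concrete. It is a sketch, not part of the presentation: the class names and the input/output paths passed on the command line are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every word in the input split assigned to this task.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce: the framework's distributed sort has already grouped all counts for a word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory (example)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (example)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework takes care of splitting the input by locality, re-executing failed tasks, and sorting the map output by key before the reduce phase; only the map and reduce functions above are user code.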

Case Study: Yahoo! Front Page
Personalized for each visitor. Result: twice the engagement.
§ Recommended links: +79% clicks vs. randomly selected
§ News Interests: +160% clicks vs. one size fits all
§ Top Searches: +43% clicks vs. editor selected

Hadoop MapReduce Today
§ JobTracker
  - Manages cluster resources and job scheduling
§ TaskTracker
  - Per-node agent
  - Manages tasks

Current Limitations
§ Scalability
  - Maximum cluster size – 4,000 nodes
  - Maximum concurrent tasks – 40,000
  - Coarse synchronization in JobTracker
§ Single point of failure
  - Failure kills all queued and running jobs
  - Jobs need to be re-submitted by users
§ Restart is very tricky due to complex state
§ Hard partition of resources into map and reduce slots

Current Limitations
§ Lacks support for alternate paradigms
  - Iterative applications implemented using MapReduce are 10x slower
  - Users use MapReduce to run arbitrary code
  - Examples: K-Means, PageRank
§ Lack of wire-compatible protocols
  - Client and cluster must be of the same version
  - Applications and workflows cannot migrate to different clusters

MapReduce Requirements for 2011
§ Reliability
§ Availability
§ Scalability
  - Clusters of 6,000 machines
  - Each machine with 16 cores, 48 GB RAM, 24 TB of disk
  - 100,000 concurrent tasks
  - 10,000 concurrent jobs
§ Wire Compatibility
§ Agility & Evolution – ability for customers to control upgrades to the grid software stack

MapReduce – Design Focus
§ Split up the two major functions of the JobTracker
  - Cluster resource management
  - Application life-cycle management
§ MapReduce becomes a user-land library

Architecture

Architecture
§ Resource Manager
  - Global resource scheduler
  - Hierarchical queues
§ Node Manager
  - Per-machine agent
  - Manages the life-cycle of containers
  - Container resource monitoring
§ Application Master
  - Per-application
  - Manages application scheduling and task execution
  - e.g. MapReduce Application Master
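
None of this appears as code in the slides, but a rough sketch of how a client hands a new application to the Resource Manager, written against the YARN client API as it later shipped in Hadoop 2.x, shows how the pieces fit together. The queue name, launcher script, and resource sizes are made-up examples.

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class SubmitApplication {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new Configuration());
    yarn.start();

    // Ask the Resource Manager for a new application id and submission context.
    ApplicationSubmissionContext ctx =
        yarn.createApplication().getApplicationSubmissionContext();
    ctx.setApplicationName("example-app");
    ctx.setQueue("default"); // hierarchical queue name (illustrative)

    // Describe the container that will run this application's Application Master.
    ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
        null, null,
        Collections.singletonList("/path/to/launch_app_master.sh"), // hypothetical launcher
        null, null, null);
    ctx.setAMContainerSpec(amContainer);
    ctx.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vcore for the AM

    // The Resource Manager schedules the AM container on some Node Manager;
    // the AM then manages the application's own tasks.
    ApplicationId appId = yarn.submitApplication(ctx);
    System.out.println("Submitted " + appId);
  }
}
```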

Improvements vis-à-vis current MapReduce
§ Scalability
  - Application life-cycle management is very expensive
  - Partition resource management and application life-cycle management
  - Application management is distributed
  - Hardware trends - currently run clusters of 4,000 machines
    • 6,000 2012 machines > 12,000 2009 machines
    • <8 cores, 16 GB, 4 TB> vs. <16+ cores, 48/96 GB, 24 TB>

Improvements vis-à-vis current MapReduce
§ Availability
  - Application Master
    • Optional failover via application-specific checkpoint
    • MapReduce applications pick up where they left off
  - Resource Manager
    • No single point of failure - failover via ZooKeeper
    • Application Masters are restarted automatically

Improvements vis-à-vis current MapReduce
§ Wire Compatibility
  - Protocols are wire-compatible
  - Old clients can talk to new servers
  - Evolution toward rolling upgrades

Improvements vis-à-vis current MapReduce
§ Innovation and Agility
  - MapReduce now becomes a user-land library
  - Multiple versions of MapReduce can run in the same cluster (a la Apache Pig)
    • Faster deployment cycles for improvements
  - Customers upgrade MapReduce versions on their schedule
  - Users can use customized MapReduce versions without affecting everyone! (see the sketch below)
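
A hedged illustration of the "user-land library" idea: later Hadoop 2.x releases let a job ship its own MapReduce runtime as a tarball via the mapreduce.application.framework.path and mapreduce.application.classpath properties. The HDFS path and classpath entries below are placeholders, not values from the deck.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PerJobFramework {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the job at a MapReduce runtime tarball staged in HDFS (path is illustrative).
    conf.set("mapreduce.application.framework.path",
        "hdfs:///frameworks/hadoop-mapreduce-2.x.tar.gz#mrframework");
    // Resolve MapReduce classes from the unpacked tarball rather than the node's install.
    conf.set("mapreduce.application.classpath",
        "$PWD/mrframework/share/hadoop/mapreduce/*,"
      + "$PWD/mrframework/share/hadoop/mapreduce/lib/*");
    Job job = Job.getInstance(conf, "job-with-bundled-mapreduce");
    // ... configure mapper/reducer/input/output as usual, then submit the job ...
  }
}
```

With this mechanism, two jobs on the same cluster can run two different MapReduce versions, which is the agility point the slide is making.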

Improvements vis-à-vis current MapReduce
§ Utilization
  - Generic resource model
    • Memory
    • CPU
    • Disk b/w
    • Network b/w
  - Remove fixed partition of map and reduce slots
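
As a sketch of the generic resource model, the following shows an Application Master asking the Resource Manager for a container by size rather than by slot type, using the AMRMClient API as it later shipped in Hadoop 2.x (where the model covers memory and virtual cores; disk and network bandwidth remained future work). The sizes and priority are arbitrary examples.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ResourceRequestSketch {
  public static void main(String[] args) throws Exception {
    // Runs inside an Application Master process.
    AMRMClient<ContainerRequest> amrmClient = AMRMClient.createAMRMClient();
    amrmClient.init(new Configuration());
    amrmClient.start();
    amrmClient.registerApplicationMaster("", 0, "");

    // Ask for a container sized in generic resources (memory in MB, virtual cores)
    // instead of a fixed "map slot" or "reduce slot".
    Resource capability = Resource.newInstance(2048, 2);
    Priority priority = Priority.newInstance(0);
    amrmClient.addContainerRequest(
        new ContainerRequest(capability, null, null, priority));

    // The Resource Manager grants containers on subsequent allocate() heartbeats.
    amrmClient.allocate(0.0f);
  }
}
```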

Improvements vis-à-vis current MapReduce
§ Support for programming paradigms other than MapReduce
  - MPI
  - Master-Worker
  - Machine Learning and Iterative processing
  - Enabled by paradigm-specific Application Masters
  - All can run on the same Hadoop cluster

Summary
§ Takes Hadoop to the next level
  - Scale-out even further
  - High availability
  - Cluster utilization
  - Support for paradigms other than MapReduce

Questions?
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/