Hadoop Just the Basics for Big Data Rookies (presentation)

Slides and text of this presentation
Slide 1
Hadoop Just the Basics for Big Data Rookies
Adam Shook
ashook@gopivotal.com


Slide 2
Agenda
Hadoop Overview
HDFS Architecture
Hadoop MapReduce
Hadoop Ecosystem
MapReduce Primer
Buckle up!

Slide 3
Hadoop Overview

Slide 4
Hadoop Core
Open-source Apache project, out of Yahoo! in 2006
Distributed fault-tolerant data storage and batch processing
Provides linear scalability on commodity hardware
Adopted by many: Amazon, AOL, eBay, Facebook, Foursquare, Google, IBM, Netflix, Twitter, Yahoo!, and many, many more

Slide 5
Why?
Bottom line: flexible, scalable, inexpensive

Slide 6
Overview
Great at:
Reliable storage for multi-petabyte data sets
Batch queries and analytics
Complex hierarchical data structures with changing schemas, unstructured and structured data
Not so great at:
Changes to files (can't do it…)
Low-latency responses
Analyst usability (less of a concern now thanks to higher-level languages)

Slide 7
Data Structure
Bytes! No more ETL necessary
Store data now, process later: structure on read
Built-in support for common data types and formats
Extendable and flexible

Slide 8
Versioning
Versions 0.20.x, 0.21.x, 0.22.x, 1.x.x
Two main MR packages:
org.apache.hadoop.mapred (deprecated)
org.apache.hadoop.mapreduce (the new hotness)
Version 2.x.x, alpha'd in May 2012:
NameNode HA
YARN – next-generation MapReduce
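The split is visible directly in a job's imports; a minimal new-API skeleton for contrast (the class name is a placeholder of mine):

// New API (org.apache.hadoop.mapreduce): Mapper is a class you extend,
// with a setup()/map()/cleanup()/run() lifecycle. The deprecated
// org.apache.hadoop.mapred package keeps a parallel set of types in
// which Mapper is an interface you implement.
import org.apache.hadoop.mapreduce.Mapper;

public class NewApiMapper extends Mapper<Object, Object, Object, Object> {
  // Inherits the identity map() by default
}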

Slide 9
HDFS Architecture

Slide 10
HDFS Overview
Hierarchical UNIX-like file system for data storage… sort of
Splits large files into blocks
Distributes and replicates blocks across nodes
Two key services: a master NameNode and many DataNodes
Plus a Checkpoint Node (Secondary NameNode)
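As an illustration (not from the deck), a minimal sketch of the client's view through Hadoop's Java FileSystem API; the NameNode address and path are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
    FileSystem fs = FileSystem.get(conf);

    // Write: the NameNode allocates blocks, DataNodes store the bytes
    Path path = new Path("/user/demo/hello.txt");
    try (FSDataOutputStream out = fs.create(path)) {
      out.writeBytes("hello, hdfs\n");
    }

    // Read: the client asks the NameNode for block locations,
    // then streams the data directly from DataNodes
    try (FSDataInputStream in = fs.open(path)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}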

Slide 11
NameNode
Single master service for HDFS
Single point of failure (HDFS 1.x)
Stores file-to-block-to-location mappings in the namespace
All transactions are logged to disk
On startup, the NameNode reads the namespace image and replays the logs

Slide 12
Checkpoint Node (Secondary NN)
Performs checkpoints of the NameNode's namespace and logs
Not a hot backup!
Loads the namespace, replays log transactions to modify it, and saves it as a new checkpoint

Slide 13
DataNode
Stores blocks on local disk
Sends frequent heartbeats to the NameNode
Sends block reports to the NameNode
Clients connect to DataNodes for I/O

Slide 14
How HDFS Works - Writes

Slide 15
How HDFS Works - Writes

Slide 16
How HDFS Works - Reads

Slide 17
How HDFS Works - Failure

Slide 18
Block Replication
Default of three replicas
Rack-aware system:
One block on the same rack
One block on the same rack, different host
One block on another rack
Automatic re-copy by the NameNode, as needed
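A hedged sketch of controlling the replication factor from a client (the path is hypothetical; dfs.replication is the real cluster-default property):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The cluster default comes from dfs.replication (3 unless overridden);
    // a client can request a different factor for the files it creates.
    conf.set("dfs.replication", "2");
    FileSystem fs = FileSystem.get(conf);

    // Replication can also be changed on an existing file; the NameNode
    // copies or deletes block replicas in the background to match.
    fs.setReplication(new Path("/user/demo/hello.txt"), (short) 3);
  }
}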

Slide 19
HDFS 2.0 Features
NameNode High-Availability (HA):
Two redundant NameNodes in active/passive configuration
Manual or automated failover
NameNode Federation:
Multiple independent NameNodes using the same collection of DataNodes

Slide 20
Hadoop MapReduce

Slide 21
Hadoop MapReduce 1.x
Moves the code to the data
JobTracker: master service to monitor jobs
TaskTrackers: multiple services to run tasks, on the same physical machines as DataNodes
A job contains many tasks; a task contains one or more task attempts

Slide 22
JobTracker
Monitors job and task progress
Issues task attempts to TaskTrackers
Retries failed task attempts (four failed attempts = one failed job)
Schedules jobs in FIFO order (or via the Fair Scheduler)
Single point of failure for MapReduce

Slide 23
TaskTrackers
Run on the same nodes as the DataNode service
Send heartbeats and task reports to the JobTracker
Configurable number of map and reduce slots
Run map and reduce task attempts – in a separate JVM!

Slide 24
Exploiting Data Locality
The JobTracker schedules a task on a TaskTracker that is local to the block (3 options!)
If those TaskTrackers are busy, it selects a TaskTracker on the same rack (many options!)
If still busy, it chooses an available TaskTracker at random (rare!)

Slide 25
How MapReduce Works

Slide 26
How MapReduce Works - Failure

Slide 27
YARN
Abstract framework for distributed application development
Splits the functionality of the JobTracker into two components: ResourceManager and ApplicationMaster
The TaskTracker becomes the NodeManager
Containers instead of map and reduce slots
Configurable amount of memory per NodeManager

Slide 28
MapReduce 2.x on YARN
The MapReduce API has not changed, but a rebuild is required to upgrade from 1.x to 2.x
An ApplicationMaster launches and monitors each job via YARN
MapReduce History Server to store… history

Slide 29
Hadoop Ecosystem

Slide 30
Hadoop Ecosystem
Core technologies: Hadoop Distributed File System and Hadoop MapReduce
Many other tools… which I will be describing… now

Slide 31
Moving Data
Sqoop: moves data between RDBMSs and HDFS – say, migrating MySQL tables to HDFS
Flume: streams event data from sources to sinks – say, weblogs from multiple servers into HDFS
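For instance, the MySQL migration above might look like this with Sqoop (host, database, user, and table names are made up):

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl \
  --table orders \
  --target-dir /data/sales/orders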

Slide 32
Flume Architecture

Slide 33
Higher-Level APIs
Pig: a data-flow language – aptly named PigLatin – that generates one or more MapReduce jobs against data stored locally or in HDFS
Hive: a data-warehousing solution that lets users write SQL-like queries, which generate a series of MapReduce jobs against data stored in HDFS

Slide 34
Pig Word Count

A = LOAD '$input';
B = FOREACH A GENERATE FLATTEN(TOKENIZE($0)) AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group AS word, COUNT(B);
STORE D INTO '$output';
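To run it, something like the following would substitute the two parameters (the script and path names here are hypothetical): pig -param input=/data/books -param output=/data/wordcounts wordcount.pig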

Slide 35
Key/Value Stores
HBase and Accumulo: implementations of Google's Bigtable for HDFS
Provide random, real-time access to big data
Support updates and deletes of key/value pairs
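As an illustration of the random read/write access, a minimal sketch against the HBase client API of that era (the users table and info column family are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");

    // Random write: update a single cell by row key
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Adam"));
    table.put(put);

    // Random read: fetch the row back in real time
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

    table.close();
  }
}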

Slide 36
HBase Architecture

Slide 37
Data Structure
Avro: data serialization system designed for the Hadoop ecosystem; schemas are expressed as JSON
Parquet: compressed, efficient columnar storage for Hadoop and other systems
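For example, a hypothetical Avro record schema in its JSON form:

{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}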

Slide 38
Scalable Machine Learning
Mahout: library for scalable machine learning, written in Java
Very robust examples!
Classification, clustering, pattern mining, collaborative filtering, and much more

Slide 39
Workflow Management
Oozie: scheduling system for Hadoop jobs
Support for: Java MapReduce, streaming MapReduce, Pig, Hive, Sqoop, DistCp, and any ol' Java or shell script program

Slide 40
Real-time Stream Processing
Storm: open-source project that streams data from a source, called a spout, to a series of execution agents called bolts
Scalable and fault-tolerant, with guaranteed processing of data
Benchmarks of over a million tuples processed per second per node

Slide 41
Distributed Application Coordination
ZooKeeper: an effort to develop and maintain an open-source server which enables highly reliable distributed coordination
Designed to be simple, replicated, ordered, and fast
Provides configuration management, distributed synchronization, and group services for applications

Slide 42
ZooKeeper Architecture

Slide 43
Hadoop Streaming
Write MapReduce mappers and reducers using stdin and stdout
Execute on the command line using the Hadoop Streaming JAR:

hadoop jar hadoop-streaming.jar -input input -output outputdir \
  -mapper org.apache.hadoop.mapreduce.Mapper -reducer /bin/wc

Slide 44
SQL on Hadoop
Apache Drill
Cloudera Impala
Hive Stinger
Pivotal HAWQ
MPP execution of SQL queries against HDFS data

Slide 45
HAWQ Architecture

Slide 46
That's a lot of projects
I am likely missing several (sorry, guys!)
Each cropped up to solve a limitation of Hadoop Core
Know your ecosystem; pick the right tool for the right job

Slide 47
Sample Architecture

Slide 48
MapReduce Primer

Slide 49
MapReduce Paradigm
Data processing system with two key phases:
Map – perform a map function on input key/value pairs to generate intermediate key/value pairs
Reduce – perform a reduce function on intermediate key/value groups to generate output key/value pairs
Groups are created by sorting the map output
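As an illustration (my example data, not from the deck): mapping the line "to be or not to be" in a word count emits (to,1) (be,1) (or,1) (not,1) (to,1) (be,1); the shuffle sorts and groups these into (be,[1,1]), (not,[1]), (or,[1]), (to,[1,1]); and the reduce sums each group to give (be,2), (not,1), (or,1), (to,2).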

Slide 50

Slide 51
Hadoop MapReduce Components
Map phase: InputFormat, RecordReader, Mapper, Combiner, Partitioner

Slide 52
Writable Interfaces

public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

public interface WritableComparable<T>
    extends Writable, Comparable<T> {
}
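As a hedged illustration of implementing the interface (the PageView type and its fields are hypothetical):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A custom value type; Hadoop moves it between map and reduce using
// write()/readFields() instead of Java serialization.
public class PageView implements Writable {
  private long timestamp;
  private int statusCode;

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(timestamp);
    out.writeInt(statusCode);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Fields must be read back in exactly the order they were written
    timestamp = in.readLong();
    statusCode = in.readInt();
  }
}

Keys additionally implement WritableComparable so the framework can sort them during the shuffle.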

Slide 53
InputFormat

public abstract class InputFormat<K, V> {
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;
  public abstract RecordReader<K, V> createRecordReader(
      InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException;
}

Slide 54
RecordReader

public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {
  public abstract void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException;
  public abstract boolean nextKeyValue() throws IOException, InterruptedException;
  public abstract KEYIN getCurrentKey() throws IOException, InterruptedException;
  public abstract VALUEIN getCurrentValue() throws IOException, InterruptedException;
  public abstract float getProgress() throws IOException, InterruptedException;
  public abstract void close() throws IOException;
}

Slide 55
Mapper

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  protected void setup(Context context) { /* NOTHING */ }

  protected void cleanup(Context context) { /* NOTHING */ }

  protected void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue())
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    cleanup(context);
  }
}

Slide 56
Partitioner

public abstract class Partitioner<KEY, VALUE> {
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}

The default HashPartitioner uses the key's hashCode() % numPartitions
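A hedged sketch of a custom partitioner (the first-letter routing scheme is my example, not from the deck):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes words to reducers by first letter, so each reducer receives
// a contiguous alphabetical range of keys.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0; // empty keys all land on the first reducer
    }
    // char values are non-negative, so the modulo stays in [0, numPartitions)
    char first = Character.toLowerCase(key.toString().charAt(0));
    return first % numPartitions;
  }
}

It would be enabled on a job with job.setPartitionerClass(FirstLetterPartitioner.class).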

Slide 57
Reducer

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  protected void setup(Context context) { /* NOTHING */ }

  protected void cleanup(Context context) { /* NOTHING */ }

  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
      throws IOException, InterruptedException {
    for (VALUEIN value : values)
      context.write((KEYOUT) key, (VALUEOUT) value);
  }

  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKey())
      reduce(context.getCurrentKey(), context.getValues(), context);
    cleanup(context);
  }
}

Slide 58
OutputFormat

public abstract class OutputFormat<K, V> {
  public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
      throws IOException, InterruptedException;
  public abstract void checkOutputSpecs(JobContext context)
      throws IOException, InterruptedException;
  public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException, InterruptedException;
}

Slide 59
RecordWriter

public abstract class RecordWriter<K, V> {
  public abstract void write(K key, V value)
      throws IOException, InterruptedException;
  public abstract void close(TaskAttemptContext context)
      throws IOException, InterruptedException;
}

Slide 60
Word Count Example

Slide 61
Problem
Count the number of times each word is used in a body of text
Uses TextInputFormat and TextOutputFormat

Slide 62
Mapper Code

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable ONE = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, ONE);
    }
  }
}

Slide 63
Shuffle and Sort

Slide 64
Reducer Code

public class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable outvalue = new IntWritable();
  private int sum = 0;

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    outvalue.set(sum);
    context.write(key, outvalue);
  }
}
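The deck shows the mapper and reducer but not the driver that wires them together; a minimal sketch, assuming the Hadoop 2.x Job API and the two classes above (input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordMapper.class);
    // The combiner reuses the reducer to pre-sum counts on the map side
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}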

Slide 65
So what’s so hard about it?

Slide 66
So what's so hard about it?
MapReduce is a limitation: an entirely different way of thinking
Simple processing operations such as joins are not so easy when expressed in MapReduce
Proper implementation is not so easy
Lots of configuration and implementation details for optimal performance: number of reduce tasks, data skew, JVM size, garbage collection

Slide 67
So what does this mean for you?
Hadoop is written primarily in Java
Components are extendable and configurable
Custom I/O through Input and Output Formats: parse custom data formats, read and write using external systems
Higher-level tools enable rapid development of big data analysis

Slide 68
Resources, Wrap-up, etc.
http://hadoop.apache.org
Very supportive community
Strata + Hadoop World, Oct. 28th–30th in Manhattan
Plenty of resources available to learn more: blogs, email lists, books
Shameless plug – MapReduce Design Patterns

Slide 69
Getting Started
Pivotal HD Single-Node VM and Community Edition: http://gopivotal.com/pivotal-products/data/pivotal-hd
For the brave and bold – roll your own! http://hadoop.apache.org/docs/current

Slide 70
Acknowledgements
Apache Hadoop, the Hadoop elephant logo, HDFS, Accumulo, Avro, Drill, Flume, HBase, Hive, Mahout, Oozie, Pig, Sqoop, YARN, and ZooKeeper are trademarks of the Apache Software Foundation
Cloudera Impala is a trademark of Cloudera
Parquet is copyright Twitter, Cloudera, and other contributors
Storm is licensed under the Eclipse Public License

Slide 71
Learn More. Stay Connected.
Talk to us on Twitter: @springcentral
Find session replays on YouTube: spring.io/video

