失眠网 > 海量数据挖掘MMDS week1: MapReduce

海量数据挖掘MMDS week1: MapReduce

时间：2022-01-28 09:19:11

/pipisorry/article/details/48443533

海量数据挖掘Mining Massive Datasets(MMDs) -Jure Leskoveccourses学习笔记之MapReduce

{A programming system for easily implementing parallel algorithms on commodity clusters.}

Distributed File Systems分布式文件系统DFS

why we need Map-Reduce in the first place?

ClassicalDataMining:Look at the disk in addition to looking at CPU and memory.So the data's on disk,you can only bring in a portion of the data into memory at a time.And you can process it in batches, and write back results to disk.

但是如果有类似google如此巨大的数据集上面的方面一部分一部分的读时间上还是不能忍受。

MachineLearning,Statistics:Now the obvious thing that you think of is that it can split the data into chunks.And you can have multiple disks and CPUs.you stripe the data across multiple disks.And you can read it, and process it in parallel in multiple CPUs.

This is the fundamental idea of cluster computing.

cluster computing集群计算之架构

Note:a higher bandwidth switch can do two to ten gigabits between racks.So so we have 16 to 64 nodes in a rack.And then you, you rack up multiple racks,and you get a data center.So this is the standard classical architecture that has emerged over the last few years.

cluster computing challenges集群计算的挑战

挑战1：节点失效

Note:Persistence means that once you store the data,you're guaranteed you can read it again.

So kind of need an infrastructure(下部构造) that can hide these kinds of node failures and let the computation go to go to completion even if nodes fail.

挑战2：网络带宽瓶颈

Note: a complex computation might need to move a lot of data, and that can slow the computation down.So you need a framework that you know,doesn't move data around so much while it's doing computation.

挑战3：分布式编程实现困难

Map-Reduce解决DFS的问题

map-reduce可以解决以上3个问题

redundant storage infrastructure冗余存储下部构造

Note:

1. distributed file system stores data across a cluster, but stores each piece of data multiple times.这就解决了挑战1.

2.the data is very rarely updated in place, it's very, very often updated through appends.

Note:

1. 一个文件分成多个chunks，并冗余存储在不同chunk servers中（一般是重复3份）。

2. the machines themselves are called chunk servers in this context.

3.replicas of a chunk are never on the same chunk server.

chunk servers &compute servers块服务器的计算服务器

Bringcomputationtodata:chunk servers also act as compute servers.And when, whenever your computation has to access data.That computation is actually scheduled on the chunk server that actually contains the data.This way you avoid moving data to where the computation needs to run.这就解决了挑战2.

分布式文件系统三大组件

Note:

1. To keep at least one replica in a entirely different rack if possible: We do that because the most common scenario is that a single node can fail.But it's also possible that the switch on a rack can fail, and when the switch on a rack fails,the entire rack becomes inaccessible.

2.when a client,or an algorithm that needs to access the data tries to access a file it goes through the client library.。。。And once that's done the client is directly connected to the chunk servers.Where it can access the data without going through the master nodes.So the data access actually happens in peer-to-peer fashion without going through the master node.

皮皮Blog

The MapReduce Computational Model计算模型

warm up task热身任务

Task: word count词计数

Note:You can just build a Hash Table.build the index by word.The first time you see a word you initialize,you add an entry to the Hash Table,with that word, and set the count to 1.And every subsequent time you see the word,increment the count by one.And you make a single sweep through the file.

Note:

1. Well you can try to write some kind of complicated code but I like to use Unix file system commands to do this.

2. here the the command words is a little script that goes through doc.txt,outputs the words in it one per line.And once you sort it all of the all occurrences of the same word come together.And when you do uniq dash c,it takes a run of the occurrence of the same word.And then just counts the occurrences of the same word.So the output of this is going to be word count pairs.

3. unix自定义命令words(doc_name): goes through file doc_name and then output one word to a line

unix文档编辑命令uniq: 功能说明：检查及删除文本文件中重复出现的行列。

语法：uniq [-cdu][-f<栏位>][-s<字符位置>][-w<字符位置>][--help][--version][输入文件][输出文件]

参数： -c或--count 在每列旁边显示该行重复出现的次数。...

4. 这个命令的实现是类似MapReduce的。

MapReduce overview概述

MapReduce的两大步骤

Note:

1. we wrote a script called words,that output one word to a line,is called a Map function in in, in, in MapReduce.The Map function scans the input file record-at-a-time.

2. each record pulls out something that you care about.In this case, it was words.And the things that you output are called keys.

3. grouped all the keys with the same value together.

4. the outline of this computation actually stays the same for any MapReduce computation.What changes is it that it change the Map function,the Reduce function, to the fit the problem that you're actually solving.

MapReduce:The Map step映射步

Apply the Map function to each input record, and create intermediate key-value pairs:

Note:

1. The intermediate key-value pairs need not have the same key as input key value-pairs.And there could be multiple of them.And the values although they look the same here, they both say v,the values could be different as well.

2. the Map function produced multiple intermediate key-value pairs.So there can be zero, one, or multiple intermediate key-value pairs,for each input key-value pair.

MapReduce:The Reduce step归约步

group intermediate key-value pairs by key:

Note: this is done by sorting by key and then by grouping together the value of.And these are all different values,although use the same same symbol v here.

MapReduce过程总结

MapReduce使用实例1：词计数

Note:

1. this whole example doesn't run on a single node.The data is actually distributed across multiple input nodes.The data's actually divided here into multiple nodes(here is 4).Now the Map tasks are going to be run on each of these four different nodes.And the the outputs of those Map tasks will therefore be produced on four different nodes.

2. The system then copies the Map outputs onto a single node.And once the data has flowed to the single node,it can then sort it by key and then do the final reduce step.In practice you use multiple Reduce nodes as well, you can tell the system to use a certain number of Reduce nodes.And it makes sure, that for any given key ,all instances, regardless of which Map node they start out from,always end up at the same Reduce node. And this is done by using a hash function.the system uses a hash function that hashes each Map key and determines a single Reduce node to shift that tuple two.And this ensures that all tuples with the same key,end up with the same Reduce node.

3. The final result is perfectly fine because you're dealing with a distributed file system,which know, knows that your file is spread across three nodes of the system.So you can still access it as a single file in your client.And the system knows to access the data from those three three independent nodes.

4. all this magic in the MapReduce magic is implemented to use,as far as possible, only sequential scans of disk as opposed to a random access is.how the Map function is applied on the input file record by record.How the sorting is done and so on.you can actually implement,all of this by using only sequential reads of disk, and never using random accesses of disk.sequential reads are much much more efficient than random accesses to disk.And the whole MapReduce system is built around doing only sequential reads of files and never random accesses.

词计数实现伪代码

MapReduce使用实例2：主机Host大小

Note: The problem, is for each host(not for each URL) we want to find the total number of bytes.there can be multiple many URLs with the same host name.

皮皮Blog

Scheduling and Data Flow调度和数据流

{go under the hood of a Map-Reduce system and understand how it actually works}

MapReduce diagram图解

Note: MapReduce过程图的解说：

In the Map step,you take a Big a document which is divided into chunks and you run a Map process on each chunk.And the map process go through each record in that chunk and outputs an intermediate key value pairs for each vector in that chunk.

In the group by step, you group by key.you bring together all the values the same key.

And in the reduce step.You apply a reducer to each intermediate key value pair set.And you create a final output.

MapReduce在分布式系统中的实际工作图解

{The previous schematic was how MapReduce worked in a, in a centralized(集中式) system}

Note:

1. In a distributed system, you have multiple nodes and map and reduced tasks are running in pattern on multiple nodes.And producing it to be intermediate key value pairs on each of those nodes.once the intermediate key value pairs are produced, the underlying system the Map-Reduce system uses a partitioning function which is just a hash function,to each intermediate key value.And the hash function will tell the Map-Reduce system which reduce node to send that key value pair to.

2. once the reduce task has received input from all from all the map tasks.its first job is to sort it's input, and group it together by key.

3. once that is done, the reduce task then works the reduce function which is provided by the programmer on each each such group and creates the final output.

MapReduce的环境

Note:

1. The programmer provides two functions, Map and Reduce, and specifies the input file.

2. Scheduling:Figuring out where the map tasks run,where the reduce tasks run, and so on.

MapReduce Scheduling and Data Flow调度和数据流

Note:

1. close: means that,the input data is a file divided into chunks.And there are replicas of the chunks on different chunk servers.The Map-Reduce system try to schedule each map task on a chunk server that holds a copy of the corresponding chunk.So, there's no actual copy.A data copy associated with the map step of the Map-Reduce program.

2. There's some overhead to storing data in the DFS.there are multiple replicas of the data that need to be made.And network shuffling involved in storing new data in the distributed file system.So, whenever possible, intermediate results are actually stored in the local file system of the Map and Reduced workers.

coordination : Master Node主节点的调和功能

Note:

1. When a map task completes,it sends the the master the location and sizes of it's the R intermediate files that it creates.why, R intermediate files? There's one intermediate file that's created for each reducer.Because the data, the output of the mapper has to be shipped to each of the reducers,depending on the key value.So, whenever a map task completes,it stores the R intermediate files on it's local file system,and it let's the master know the names of those files.The master pushes this information to the reducers.

2. Once the reducers know that all the mappers map tasks are completed,then they copy the intermediate file from each of the map tasks.And then they can proceed with their work.

Worker节点失效处理

Map worker失效处理

Note:

1. If a map worker fails, then all the map tasks that were scheduled on that map worker may have failed.

2. If a map worker fails,then the node fails.Then all intermediate output created by all the map tasks that have ran on that worker are lost.

Reduce worker失效处理

Note: only in-progress tasks need to be set to idle: The output of the reduced worker is a final output,it's written to the distribute file system.And not to the local file system of the reduced worker.The output is not lost even if the reduce worker fails.

Master worker失效处理

Note: the client can then do something like restarting the map reduce task.The master is typically not applicated in the Map-Reduce system.

map和reduce任务数目设定的基本原则

Note:

1. If there is one map task per node in the cluster and during processing the node fails,then that map task needs to be rescheduled on another node in the cluster when it becomes available.Now since all the other nodes are processing,one of those nodes has to complete before this map task can be scheduled on that node and so the entire computation is slowed down.

2. So, instead of one map task on a given node, there are many small map tasks on a given node, and if that node fails, then those map tasks can then be spread across all the available nodes and so the entire task will complete much faster.

3. R is usually even smaller than the total number of nodes in the system.The the output file is,is spread across spread across R node where R the number of reducers.And it's usually convenient to have the output spread across a small number of nodes rather than across a large number of nodes.

皮皮Blog

Combiners And Partition Functions合成器和分区函数

{looked at how Map-Reduce model's actually implemented}

改良：Combiner合成器

Note:

1. 就是说在map阶段就聚合相同key的pairs，其中聚合函数（合成器）就类似于Reduce函数。这样paris传递到Reducer的网络开销就减少了。

2. The programmer provides a function combine. Instead of a whole bunch of tuples with the key k being shipped off to a reducer.Just a single tuple with key k and v2 is shipped off to the reducer now, usually the combiner is the same function as the reducer.

应用combiner的例子：词计数

Combiner和Reducer函数要满足的性质

Note: Reduce函数必须是可交换和可结合的，如sum函数。And because sum satisfies both these properties sum can be used as a combiner as well as a reducer.

2. Average函数就不满足这个性质，不能作为combiner函数。

但是Average可以这样实现：combiner函数返回的是(sum, count)，而Reduce函数再一起计算Average

It's sometimes possible to turn a function that's not commutative or associative.Break it down into functions that are communicative or associative like sum and count and still use a combiner trick to save some foot traffic.

不能应用combiner函数的例子：求中位数函数

It turns out and it can be proved mathematically that there is no way to split the median completely into a bunch of commutative and associative computations.So you can't actually use the combiner trick, You just have to ship all the values to the reducer and compute the median at the reducer.

改良：partition function划分函数

先回忆MapReduce的下部构造：the map reduced infrastructure uses a hash function on each key in the intermediate key value set,and this hash function decides which reduced node that key gets shipped to.

用户可以自定义这个hash函数

Note: 这个自定义试图将同一个host下的url送到同一个reduce节点上，而不是每一个不同url都对应一个reduce节点。

MapReduce的实现

Note:

1. Google first implemented a file system called the Google File System which is a distributed file system that provides stable storage on top of its cluster.And then implemented the MapReduce framework on top of the Google File System.

2. Hadoop is an open-source project that's a reimplementation of Google's MapReduce.[Download]

3. many use cases of Hadoop involve doing SQL-like manipulations on data.And so there are open-source implementations called Hive and Pig that provide SQL-like abstractions of top of the Hadoop and MapReduce layer, so that you don't have to rewrite those as map and deduce functions.

MapReduce in cloud 云计算中的MapReduce

Pointers and Further Reading

Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters

/papers/mapreduce.html

Sanjay Ghemawat, Howard Gobioff, and Shun--.Tak Leung: The Google File System

/papers/gfs.html

Hadoop Wiki Introduction /lucene-hadoop/

Getting Started /lucene-hadoop/GegngStartedWithHadoop

Map/Reduce Overview /lucene-hadoop/HadoopMapReduce /lucene-hadoop/ HadoopMapRedClasses

Eclipse Environment /lucene-hadoop/EclipseEnvironment

Javadoc /hadoop/docs/api/

ReleasesfromApachedownloadmirrors /dyn/closer.cgi/lucene/hadoop/