MapReduce
What is MapReduce?
MapReduce is a widely used technique for processing data in parallel across clusters of commodity machines. In essence, it is a data processing model that follows a parallel approach rather than processing data serially, which makes it much faster and saves time.
Data processing that consumes a lot of memory would ideally run on a single, very large machine, but that is not always possible in the physical world. Even the largest data centers in the early 2000s, despite having huge machine setups, still faced the question of how data should be split across machines so it could be computed. To address this, a programming model named MapReduce was introduced. MapReduce is simply a way of giving structure to a computation so that it can easily be run on a large number of machines. The importance of this organization cannot be stressed enough: it makes the whole job a lot easier. The model forces what you are trying to do into three main stages: mapping, shuffling, and reducing.
The most famous application of MapReduce to an everyday problem is counting how many times each word occurs in a document. The beauty of the MapReduce framework is that the same program keeps working as the input grows, scaling from a handful of documents on one machine to billions of documents spread across thousands of machines.
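To make the word-count idea concrete, here is a minimal sketch of the two functions in Python. The names map_words and reduce_counts and their signatures are made up for illustration; they are not the API of any particular framework, which would express the same logic through its own interfaces.

```python
def map_words(document_id, text):
    """Map phase: emit an intermediate (word, 1) pair for every word in the document."""
    for word in text.lower().split():
        yield (word, 1)

def reduce_counts(word, counts):
    """Reduce phase: sum the counts emitted for a single word across all mappers."""
    yield (word, sum(counts))
```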
How MapReduce works:
The MapReduce algorithm works in the steps defined below (a small sketch of this flow follows the list):
- The input data is split and loaded into the mappers.
- The mappers turn their share of the input into intermediate key -> value records.
- The intermediate records are shuffled: records with the same key are routed to the same reducer.
- The shuffled data, sorted by key, is fed to the reducers, which aggregate the values for each key.
- The outputs from the reducers are written out as the final result (in Hadoop, typically to HDFS).
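To illustrate the order of these steps, the sketch below wires the hypothetical map_words and reduce_counts functions from the word-count example above through a tiny in-memory driver. A real framework would run the map and reduce calls on many machines and move the intermediate records over the network; this only shows the sequence of steps.

```python
from collections import defaultdict

# Hypothetical input, split across two "mappers".
documents = {"doc1": "the quick brown fox", "doc2": "the lazy dog and the fox"}

# Steps 1-2: run the map phase over every input split.
intermediate = []
for doc_id, text in documents.items():
    intermediate.extend(map_words(doc_id, text))

# Step 3: shuffle - group every intermediate value under its key.
grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)

# Step 4: sort the keys and run the reduce phase on each group.
final_output = {}
for key in sorted(grouped):
    for word, total in reduce_counts(key, grouped[key]):
        final_output[word] = total

# Step 5: write out the result (here we simply print it).
print(final_output)
# {'and': 1, 'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```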
The three main components of the MapReduce algorithm are as follows:
- Mapper: The first part of the MapReduce algorithm turns the input into key -> value pairs according to the data provided. These maps form the set of intermediate records. The mappers read and process their portions of the data in parallel, and the output produced is known as the intermediate record (think of the index cards in the example later on). Since Apache Hadoop handles data as key -> value pairs, the mapper takes key -> value pairs as input and generates intermediate key -> value records as output.
- Shuffling: The second phase of the MapReduce algorithm is shuffling and sorting. Shuffling refers to the movement of intermediate records from the mappers across to the reducers: the intermediate output of every map task is exchanged and forwarded so that all records sharing a key end up at the same reducer, where they are sorted by key (a short partitioning sketch follows this list).
- Reducer: The third phase, the reducer, as the name suggests reduces the values that share a key down to a smaller set of values. All values for a given key are presented to a single reducer, which works through one key's list of values at a time and aggregates them. The final result is produced by collecting the individual reducer outputs together.
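One detail worth illustrating is how the shuffle decides which reducer receives which key. A common scheme, which Hadoop's default HashPartitioner also follows, is to hash the key modulo the number of reducers. The sketch below is a simplified, framework-independent version of that idea; the record values are invented for illustration.

```python
def partition(key, num_reducers):
    """Route a key to a reducer by hashing it, so the same key always lands on the same reducer."""
    return hash(key) % num_reducers

num_reducers = 4
intermediate = [("fox", 1), ("the", 1), ("fox", 1), ("dog", 1)]

# Shuffle: every intermediate record is routed to a reducer bucket by its key.
buckets = {r: [] for r in range(num_reducers)}
for key, value in intermediate:
    buckets[partition(key, num_reducers)].append((key, value))

# Each reducer then sorts its own bucket by key before reducing.
for reducer_id, records in buckets.items():
    print(reducer_id, sorted(records))
```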
Advantages of using it over traditional methods:
MapReduce has numerous advantages, but the following are the most prominent.
- Scalability: Hadoop MapReduce is extremely scalable, largely because of the way it distributes and stores data. Beyond that, the servers running MapReduce work in parallel, so they are much faster and can serve a large number of clients at once.
- Flexibility: Data stored and processed with MapReduce can easily be reused by other applications built around the same system.
- Security and Authentication: Because the data is divided into chunks rather than kept on a single server, it is easier to keep it safe from malicious activity.
- Cost-effectiveness: MapReduce saves a lot of start-up cost, since only a limited amount of memory and commodity hardware is required to do the processing.
Real World Working Example of MapReduce:
Suppose a company wants to calculate its total sales per region. In the traditional world, such a problem would be solved with a hash table of key -> value pairs, where the key is the city or region name and the value is the number of sales successfully shipped to that area. Now suppose the company has a huge data set, possibly one terabyte (TB) of data. The processing alone would take an ample amount of time, even on the most powerful processors we have today. Besides processing speed, memory is also limited and may run out. Given the amount of data, the company would therefore use the MapReduce technique for this problem.
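For contrast, the traditional single-machine approach described above might look like the short sketch below: one pass over all the records, accumulating totals in a hash table. This is fine until the data no longer fits in one machine's memory or the single loop simply takes too long. The records and amounts here are invented for illustration.

```python
def total_sales_serial(sales_records):
    """Serial approach: a single hash table keyed by region, filled by one pass over all records."""
    totals = {}
    for region, amount in sales_records:
        totals[region] = totals.get(region, 0) + amount
    return totals

# With a terabyte of records, this single loop on one machine becomes the bottleneck.
print(total_sales_serial([("Austin", 120.0), ("Boston", 75.5), ("Austin", 60.0)]))
# {'Austin': 180.0, 'Boston': 75.5}
```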
The whole MapReduce job is designed around two phases, map and reduce (a sketch of this example follows below).

Mapper: The company decides to split the data into chunks rather than handing the whole set to one person. This division is done by month, giving 12 chunks. Each mapper gets one month's data, and all 12 mappers work in parallel at the same time on fairly small fractions of the data. Each mapper reads the area names and the number of sales associated with them, writes each onto an index card, and piles the cards up by area name, so every mapper ends up with a separate pile of cards per area.

Reducer: Once these piles are created, the reducers take them over, each reducer receiving the piles for the areas it is responsible for. For example, with 4 reducers, each reducer gets three areas. The reducers now hold larger piles; each reducer adds up the amounts on all the cards in its piles and obtains the total sales for its three areas. To make this split easier, the data is sorted into alphabetical order, which lets MapReduce divide the data and process it quickly.
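As a rough sketch of this scenario, the example below simulates monthly mappers that emit (region, amount) pairs and a handful of reducers that total the sales for the regions assigned to them. The region names, amounts, and the simple round-robin assignment of regions to reducers are invented for illustration; on a real cluster these functions would run in parallel on separate machines.

```python
from collections import defaultdict

def map_month(month, sales_records):
    """Map phase: one mapper per month emits a (region, amount) pair for every sale."""
    for region, amount in sales_records:
        yield (region, amount)

def reduce_region(region, amounts):
    """Reduce phase: total all sale amounts recorded for one region."""
    yield (region, sum(amounts))

# Hypothetical input: monthly chunks of (region, sale amount) records.
monthly_chunks = {
    "Jan": [("Austin", 120.0), ("Boston", 75.5)],
    "Feb": [("Austin", 60.0), ("Chicago", 200.0)],
    # ... remaining months omitted for brevity
}

# Map: each month's chunk is processed independently (in parallel on a real cluster).
grouped = defaultdict(list)
for month, records in monthly_chunks.items():
    for region, amount in map_month(month, records):
        grouped[region].append(amount)       # shuffle: pile up amounts per region

# Reduce: 4 reducers, each responsible for a share of the (alphabetically sorted) regions.
num_reducers = 4
totals = {}
for i, region in enumerate(sorted(grouped)):
    reducer_id = i % num_reducers            # simple assignment of regions to reducers
    for key, total in reduce_region(region, grouped[region]):
        totals[key] = total

print(totals)   # {'Austin': 180.0, 'Boston': 75.5, 'Chicago': 200.0}
```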