Hadoop provides a software framework for distributed storage and processing of big data. It’s a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation.
The base Hadoop framework is composed of the following modules:
- Hadoop Common – This contains libraries and utilities needed by other Hadoop modules.
- Hadoop Distributed File System (HDFS) – A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
- Hadoop YARN – A platform responsible for managing computing resources in clusters and scheduling users’ applications on them. YARN was introduced with Hadoop 2.0 in 2012.
- Hadoop MapReduce – An implementation of the MapReduce programming model for large-scale data processing.
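To illustrate the MapReduce model named above, here is a minimal single-machine sketch in Python. The function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative, not Hadoop APIs; in a real cluster, Hadoop runs the map and reduce phases in parallel across many nodes and performs the shuffle over the network.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key (Hadoop does this between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data needs big clusters", "hadoop processes big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

The same three-phase structure underlies the classic Hadoop WordCount job; the framework’s value is in distributing these phases, not in the logic itself, which stays this simple.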
Beyond these core modules, many other components in the Hadoop ecosystem support the processing of Big Data. Together, these components solve most of the storage and processing-speed problems in the big data world.
For example, it took about a decade to process the data of the Human Genome Project. With the help of Hadoop, it is now possible to process a dataset of that magnitude in roughly a week.
Hadoop is used for:
- Search – Yahoo, Amazon, Zvents
- Log processing – Facebook, Yahoo
- Data Warehouse – Facebook, AOL
- Video and Image Analysis – New York Times, Eyealike
Hadoop is not recommended for low-latency data access or for workloads that require frequent in-place modification of data, since HDFS follows a write-once, read-many model. It is also better suited to a small number of large files than to many small files, because every file adds metadata overhead on the NameNode.
From a job perspective, Hadoop is the most popular and in-demand Big Data tool. Anyone working in the data science field, or planning to, needs to understand Hadoop.
Want to know more about Hadoop? Tonex offers Hadoop Training, a 3-day course that covers Hadoop basics and applications.