What is Hadoop? - Hadoop Tutorial

From the course: Learning Hadoop

- [Narrator] To get us started, we need to look at the basic definition of the Hadoop ecosystem. Initially, it consisted primarily of two scalable components. The first was around data storage: HDFS, the Hadoop Distributed File System, which, as we'll see in the modernizations, could be cloud storage such as buckets. The second was a processing API called MapReduce. Now, with the expansion and growth of the Hadoop ecosystem, it's gone far beyond those initial two pieces, and we'll be looking at some of the most popular libraries, such as Apache Spark, Apache Hive, Pig, and so on.

You can see in the diagram here that we have two layers. In the compute layer, the MapReduce layer, we have a master node and a slave, which has really been updated to be called a worker node now, and it's usually more than one worker node. Each of these has processes, or daemons, on it: job trackers and task trackers. And then we have a storage layer that has nodes as well: a name node and data nodes. So Hadoop allows for distribution of both processing and storage, which results in the ability to process massive amounts of data in a cost-effective way.

Hadoop really shines in scalability and partitioning. Commodity hardware can often be used for data storage, so it gets around the limits of expensive relational databases or data warehouses when you're dealing with massive amounts of data, and that same commodity hardware can be used for distributed processing. So it was really a game changer in terms of the amount of data that could be processed in an economical way.

Initially, the business areas where Hadoop was applied were using it to analyze what's called behavioral data, as compared to transactional data. Transactional data is usually data around finance that has to have transactional consistency; with Hadoop, we instead have eventual consistency. Some examples that I've worked with are risk modeling, customer churn analysis, and recommendation engines, and there are many, many more. It really enabled analysis of additional types of data. Hadoop has been around for a good number of years already, and as we've added data collection devices and methods, it really marries the massive amounts of data we collect with our need to analyze it.
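To make the storage layer a little more concrete, here is a minimal sketch that uses Hadoop's Java FileSystem client API to copy a local file into HDFS and list the result. The name node URI (hdfs://namenode-host:8020) and the /user/demo paths are placeholder assumptions, not values from the course; the name node records the file's metadata while the blocks themselves land on the data nodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS points at the name node; this URI is a placeholder.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into the cluster. The name node tracks the
        // metadata; the data blocks are distributed across the data nodes.
        fs.copyFromLocalFile(new Path("input.txt"),
                             new Path("/user/demo/input.txt"));

        // List the directory to confirm the write.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
          System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
      }
    }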

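And to see how the MapReduce layer splits work between a parallel map phase and an aggregating reduce phase, here is the classic word-count job written against the Hadoop MapReduce Java API. It follows the standard example from the Hadoop documentation; the class name and input/output paths are illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: runs in parallel on each input split, emitting (word, 1).
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer: sums the counts for each word across all map outputs.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

A job like this is typically packaged into a jar and submitted to the cluster with something like: hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output (jar name and paths are placeholders). The cluster then schedules map tasks on worker nodes close to where the data blocks are stored, which is exactly the marriage of distributed storage and distributed processing described above.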