From the course: Big Data Analytics with Hadoop and Apache Spark
Storing intermediate results
- [Instructor] As we have seen in the previous execution plan examples, every time an action is performed, Spark goes all the way back to the data source and reads the data again. This happens even if the data was read before and other actions were already performed on it. While this works fine for automated jobs, it is a problem during interactive analytics: every time a new action is executed in the interactive shell, Spark goes back to the source. It is better to cache intermediate results so we can resume analytics from those results without starting all over. Spark can cache in memory, on disk, or both. The cache method uses the default storage level, which is memory only for RDDs and memory and disk for data frames. The persist method lets us choose the storage level: memory, disk, or both. In this example, we cache the coalesced sales data frame on disk using the persist method. Spark does lazy evaluation, so we need to execute an action to trigger caching. We will compare the execution plans before and after intermediate caching. First, we do a…