This course is designed as an entry point for developers who need to create applications to analyze Big Data stored in Apache Hadoop using Spark. Topics include:
- An overview of the Hortonworks Data Platform (HDP), including HDFS and YARN
- Using Spark Core APIs for interactive data exploration
- Spark SQL and DataFrame operations
- Spark Streaming and DStream operations
- Data visualization, reporting, and collaboration
- Performance monitoring and tuning
- Building and deploying Spark applications
- Introduction to the Spark Machine Learning Library

Labs can be performed using either Python or Scala.
Before taking this course, students should be familiar with programming principles and have previous experience in software development using either Python or Scala. Previous experience with data streaming, SQL, and HDP is also helpful, but not required.
4 Days/Lecture & Lab
This course is designed for software engineers who want to develop in-memory applications for time-sensitive and highly iterative workloads in an Enterprise HDP environment.
- Use common HDFS commands
- Use a REPL to program in Spark
- Use Zeppelin to program in Spark
- Perform RDD transformations and actions
- Perform Pair RDD transformations and actions
- Utilize Spark SQL
- Perform stateless transformations using Spark Streaming
- Perform window-based transformations
- Use Zeppelin for data visualization and reporting
- Monitor applications using Spark History Server
- Cache and persist data
- Configure checkpointing, broadcast variables, and executors
- Build and submit a Spark application to YARN
- Run Spark MLlib applications