HDP Analyst Data Science

PT15183
Training Summary
This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. It covers tools and programming languages (Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, scikit-learn), the Natural Language Toolkit (NLTK), and Spark MLlib.
Prerequisites
Before taking this course, students must have experience with at least one programming or scripting language, knowledge of statistics and/or mathematics, and a basic understanding of big data and Hadoop principles. Students new to Hadoop are encouraged to attend the HDP Overview: Apache Hadoop Essentials course.
Duration
3 Days/Lecture & Lab
Audience
This course is designed for architects, software developers, analysts, and data scientists who need to apply data science and machine learning on Hadoop.
Course Topics
  • Setting Up a Development Environment
  • Block Storage
  • Using HDFS Commands
  • MapReduce
  • Using Apache Mahout for Machine Learning
  • Apache Pig
  • Getting Started with Apache Pig
  • Exploring Data with Pig
  • Using the IPython Notebook
  • The NumPy Package
  • The pandas Library
  • Data Analysis with Python
  • Interpolating Data Points
  • Defining a Pig UDF in Python
  • Streaming Python with Pig
  • Classification with Scikit-Learn
  • Computing K-Nearest Neighbor
  • Generating a K-Means Clustering
  • POS Tagging Using a Decision Tree
  • Using NLTK for Natural Language Processing
  • Classifying Text Using Naive Bayes
  • Using Spark Transformations and Actions
  • Using Spark MLlib
  • Creating a Spam Classifier with MLlib
