Course Details
Course Overview
A Data Scientist combines statistical and machine learning techniques with Python programming to analyze and interpret complex data.
This course will establish your expertise in data science and analytics techniques using Python. With this Python for Data Science Course, you’ll learn the essential concepts of Python programming and gain deep knowledge in data analytics, machine learning, data visualization, web scraping and natural language processing. Python is a required skill for many data science positions, so jumpstart your career with this interactive, hands-on course.

Learn to visualize real data with matplotlib's functions and get to know new data structures such as the dictionary and the pandas DataFrame. After covering key concepts such as Boolean logic, control flow and loops in Python, you'll be ready to blend together everything you've learned to solve a case study using hacker statistics.
Course Schedule
Target Audience
Analytics professionals who want to work with Python
Software professionals looking to get into the field of analytics
IT professionals interested in pursuing a career in analytics
Graduates looking to build a career in analytics and data science
Experienced professionals who would like to harness data science in their fields
Anyone with a genuine interest in the field of data science
Course Objectives
The Data Science with Python course will furnish you with in-depth knowledge of the various libraries and packages required to perform data analysis, data visualization, web scraping, machine learning and natural language processing using Python.  
Python has surpassed Java as the top language used to introduce US students to programming and computer science, and 46 percent of data science jobs list Python as a required skill.
Course Prerequisites
Python basics 
Course Outline
Day 1
Python refresher [2 hrs]
The Python interpreter
Python Data Types
Data and type introspection basics
Control structures
Functions
Classes
Errors and exceptions
Regular expressions

Data Analytics
Data Ecosystem in Python [2 hrs]
SciPy
NumPy
Pandas
Matplotlib
IPython
Jupyter

NumPy [2 hrs]
C-python integration
C data in python
Numpy arrays
Dtypes
Shape
Reshape
Numpy array operations/operators
Numpy ‘mapped’ functions
Run-time comparison with python lists, etc
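The NumPy topics above can be illustrated with a minimal sketch. The array sizes and variable names are arbitrary choices made for illustration, and the timing figures will vary by machine.

```python
import timeit

import numpy as np

# Build a 1-D array and inspect its dtype and shape.
a = np.arange(12, dtype=np.int64)
print(a.dtype, a.shape)

# reshape gives the same data a new shape (3 rows, 4 columns).
m = a.reshape(3, 4)

# Operators apply element-wise; no explicit Python loop is needed.
doubled = m * 2
col_sums = m.sum(axis=0)

# "Mapped" (universal) functions also work element-wise.
roots = np.sqrt(m)

# Rough run-time comparison with a plain Python list.
n = 100_000
py_list = list(range(n))
np_arr = np.arange(n)
t_list = timeit.timeit(lambda: [x * 2 for x in py_list], number=20)
t_numpy = timeit.timeit(lambda: np_arr * 2, number=20)
print(f"list: {t_list:.4f}s  numpy: {t_numpy:.4f}s")
```

On most machines the NumPy version is expected to be considerably faster, because the loop runs in compiled code rather than in the interpreter.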

Day 2
Pandas [2.5 hr]
Tabular data
DataFrames
Series
Indexes
Importing data: read_csv(), read_excel(), read_json(), Avro files
Exporting data: to_csv, etc
DataFrame row filtering operations
DataFrame str functions
Setting Indexes, multi-index dataframes
Sorting: sort_values(), sort_index()
groupby()
Stack, unstack
Pivot() and pivot_table()
Time series
Re-sampling
interpolation
Side-by-side comparisons
Integration with matplotlib
Key plotting attributes and args
Handling NaNs: fillna(), etc
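A few of the pandas operations listed above, sketched on a small made-up table. The column names and values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# A small, invented sales table with one missing value.
df = pd.DataFrame({
    "city": ["Pune", "Pune", "Delhi", "Delhi", "Delhi"],
    "year": [2020, 2021, 2020, 2021, 2021],
    "sales": [10.0, np.nan, 7.0, 12.0, 3.0],
})

# Handle missing values before aggregating.
filled = df.fillna({"sales": 0.0})

# Row filtering with a Boolean mask.
delhi = filled[filled["city"] == "Delhi"]

# Sorting by column values, largest first.
ranked = filled.sort_values("sales", ascending=False)

# Split-apply-combine with groupby.
totals = filled.groupby("city")["sales"].sum()

# pivot_table aggregates duplicate index/column pairs
# (here: the two Delhi/2021 rows are summed).
table = filled.pivot_table(
    index="city", columns="year", values="sales", aggfunc="sum"
)
print(table)
```

Unlike pivot(), pivot_table() accepts duplicate index/column combinations because it applies an aggregation function to them.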

Data Visualization : Matplotlib, pyplot [2.5 hr]
Types of charts
Line Plot
Scatter Plot
Bar Charts
Histograms
Pie charts
Box plots
Candlestick plot for financial data
Chart attributes: axes, grid, legends, title
Colors, gradients
Multiple plots and figures
Axes of plots
Numpy integration with pyplot
Pandas integration with pyplot
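As a rough illustration of multiple plots, chart attributes and NumPy integration, the sketch below draws a line plot and a histogram side by side. The data is synthetic, and the "Agg" backend is an assumption made so the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; assumption for headless runs

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

# One figure holding two axes, side by side.
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Line plot with title, axis label, grid and legend.
ax_line.plot(x, np.sin(x), label="sin(x)")
ax_line.set_title("Line plot")
ax_line.set_xlabel("x")
ax_line.grid(True)
ax_line.legend()

# Histogram of synthetic, normally distributed values.
rng = np.random.default_rng(0)
ax_hist.hist(rng.normal(size=500), bins=20)
ax_hist.set_title("Histogram")

fig.tight_layout()
```

Calling fig.savefig() with a file name would write the figure to disk; in a Jupyter notebook the figure is usually displayed inline instead.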

Scipy [1.5 hrs]
Non-standard data-types and scipy
Scipy and numpy ndarray
Scipy.stats
scipy.interpolate
Statistics concepts
Central Tendency
Spread
Mean, median, mode
Quartiles
Rolling averages
Interpolation
Distributions
Curve Fitting
Root Mean Squares
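The statistics concepts above can be sketched with NumPy and SciPy as follows. The sample values are invented, and a straight line is assumed as the model for the curve-fitting step.

```python
import numpy as np
from scipy import interpolate, optimize, stats

data = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0])

# Central tendency and spread.
mean = data.mean()
median = np.median(data)
q1, q3 = np.percentile(data, [25, 75])
summary = stats.describe(data)  # n, min/max, mean, variance, skewness, kurtosis

# Root mean square of the sample.
rms = np.sqrt(np.mean(data ** 2))

# Linear interpolation between known points.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, 4.0, 6.0])
f = interpolate.interp1d(x, y)
y_mid = float(f(1.5))


# Curve fitting: estimate slope and intercept of an assumed linear model.
def line(x, slope, intercept):
    return slope * x + intercept


params, _ = optimize.curve_fit(line, x, y)
print(mean, median, q1, q3, y_mid, params)
```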

Day 3
Scipy.weave [1.5 hrs]
C/C++ integration: weave
weave.inline()
weave.blitz()
SWIG
weave.ext_tools()
C code as python strings
Blitz_type_factories
Scalar_spec
Weave parser and translate_symbols()
Benchmarking
Machine Learning
Intro & Setup [.5 hrs]
Unsupervised and supervised learning
Scikit
scikit-learn (sklearn)
High-level patterns in the classes and APIs
fit()
transform()
predict()
score()
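The fit/transform/predict/score pattern might look like this in practice. The iris dataset, the scaler and the k-nearest-neighbours classifier are assumptions chosen for illustration, not necessarily what the course uses.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Transformers: fit() learns parameters, transform() applies them.
scaler = StandardScaler()
scaler.fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Estimators: fit() trains, predict() labels new data, score() evaluates.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train_s, y_train)
predictions = clf.predict(X_test_s)
accuracy = clf.score(X_test_s, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

The scaler is fitted on the training split only, so no information from the test split leaks into preprocessing.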

Classification [2 hrs]
Introduction to the idea of observation-based learning
Distances and similarities
k Nearest Neighbours (kNN) for classification
Regression with kNN & SVM 
Focus on Support Vector Machine (SVM) kernels and their use
Regression [1 hr]
Linear Regression
Regularization of Generalized Linear Models
Logistic Regression
Methods of threshold determination and performance measures for classification score models

Unsupervised learning [2 hrs]
Need for dimensionality reduction
Principal Component Analysis (PCA)
Difference between PCAs and Latent Factors
Factor Analysis
Hierarchical, K-means & DBSCAN Clustering, Gaussian Mixture Models
SVD
Clustering Use Cases
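A minimal sketch of dimensionality reduction followed by clustering, assuming the iris dataset and three clusters purely for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)  # labels are ignored: unsupervised setting
X_scaled = StandardScaler().fit_transform(X)

# Project the four original features onto two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

# Cluster the reduced data with K-means.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_2d)
print(f"variance explained by 2 components: {explained:.2f}")
```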
Day 4
Tree Models [2 hrs]
Introduction to decision trees
Tuning tree size with cross validation
Introduction to bagging algorithm
Random Forests
Grid search and randomized grid search
ExtraTrees (Extremely Randomised Trees)
Partial dependence plots
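Tuning tree size with cross-validation and comparing against a random forest could be sketched as below. The dataset, the depth grid and the number of trees are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grid search: try each max_depth with 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4, 5]},
    cv=5,
)
search.fit(X, y)
best_depth = search.best_params_["max_depth"]

# A bagged ensemble of trees, scored the same way.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest_scores = cross_val_score(forest, X, y, cv=5)

print("best depth:", best_depth)
print("tree CV accuracy:", round(search.best_score_, 3))
print("forest CV accuracy:", round(forest_scores.mean(), 3))
```

RandomizedSearchCV from the same module samples a fixed number of parameter combinations instead of trying them all, which tends to help when the grid is large.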

Intro to Boosting Algorithms [1.5 hrs]
Ensemble Learning
Concept of weak learners
Introduction to boosting algorithms
Adaptive Boosting

Natural Language Processing
Tokenization [1 hr]
Regular Expressions with re module
re.search() and re.findall()
re.split()
Nltk.tokenize
word_tokenize()
sent_tokenize()
non-ASCII tokenization
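The regular-expression functions above can be tried with the standard library alone. The sample sentence is invented, and the patterns are deliberately naive; nltk's word_tokenize() and sent_tokenize() handle many cases (abbreviations, contractions) that these do not.

```python
import re

text = "Rao paid $42 for 3 books. Was it worth it? Yes!"

# re.search returns the first match object (or None).
first_number = re.search(r"\d+", text)

# re.findall returns every non-overlapping match as a list of strings.
numbers = re.findall(r"\d+", text)
words = re.findall(r"\w+", text)

# re.split cuts the string wherever the pattern matches; here, on
# whitespace that follows sentence-ending punctuation.
sentences = re.split(r"(?<=[.?!])\s+", text)

print(first_number.group(), numbers, sentences)
```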

Topic Identification [1 hr]
Word counting
Introducing corpora
Gensim
Bag-of-words
Introducing TF-IDF
TF-IDF with Gensim
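To make the bag-of-words and TF-IDF ideas concrete, here is a hand-rolled sketch using only the standard library. It is not the Gensim API: the three documents are invented, tokenization is a plain split(), and the weighting uses one common textbook formula (term frequency times the natural log of inverse document frequency). Libraries such as Gensim apply their own weighting and normalisation, so their numbers will differ.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Bag-of-words: one term-count mapping per document.
tokenized = [doc.split() for doc in docs]
bags = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents does each term appear?
n_docs = len(docs)
doc_freq = Counter(term for bag in bags for term in bag)


def tf_idf(term, bag):
    """Term frequency in this document times inverse document frequency."""
    tf = bag[term] / sum(bag.values())
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf


# Weights for the first document; rarer terms score higher.
weights = {term: tf_idf(term, bags[0]) for term in bags[0]}
top_term = max(weights, key=weights.get)
print(top_term, weights)
```

Although "the" is the most frequent word in the first document, it also appears in another document, so its weight ends up lower than that of "cat" or "mat".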

Day 5
Named Entity Recognition [2 hr]
NER with nltk
Stanford Library with NLTK NER
SpaCy
SpaCy vs nltk
SpaCy NER categories
polyglot: multilingual NER
Exercise: French and Spanish NER

NLTK for classification [1.5 hr]
Feature extraction
Train and test sets
CountVectorizer
TfidfVectorizer
Exercise: fake news detector

Web Scraping [2 hrs]
BeautifulSoup module
Bs4 module
prettify()
HTML tags overview: <head>, <body>, <h1>, <a href>, <title>, <p>, ...
Tag properties
DOM
Object attributes
.title
.p
.parent
.children
.name
.contents
.strings
Dict based lookup
tag['id']
Multi-valued attributes
.find()
.find_all()
.get()
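Several of the BeautifulSoup attributes and methods listed above, sketched on an inline HTML snippet. The markup and the example.com URL are invented for illustration, and Python's built-in "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>Course page</title></head>
  <body>
    <h1 id="top" class="big bold">Data Science</h1>
    <p>First paragraph with a <a href="https://example.com/a">link</a>.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute-style navigation returns the first matching tag.
title_text = soup.title.string
first_p = soup.p
parent_name = first_p.parent.name

# Dict-style lookup reads a tag's attributes.
h1 = soup.find("h1")
h1_id = h1["id"]
h1_classes = h1["class"]  # multi-valued attribute: returned as a list

# find_all returns every match; get() returns None if the attribute is missing.
paragraphs = soup.find_all("p")
href = soup.find("a").get("href")

print(title_text, parent_name, h1_id, h1_classes, len(paragraphs), href)
```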
Day 6 (Optional)
Distributed Applications: Hadoop & Spark

Working knowledge of the following is assumed:
Hadoop Architecture, HDFS, Map-Reduce
PySpark sub-modules: sql, streaming, ml, MLlib
RDDs, DataFrames, Datasets

Pydoop [1.5 hrs]
pydoop.hdfs API
Mappers, reducers and combiners
Pipes 
Record readers and writers
Partitioners and Combiners
Pydoop command line
Simulator API

Spark Data Processing Use Cases [1.5 hrs]
Graph Processing and Analysis 
pyspark.ml and pyspark.mllib
Example: k-means
Spark Applications over Hadoop [2 hrs]
Spark Applications vs. Spark Shell
Importing modules on executor nodes
Complex dependencies: native code in eggs
Heterogeneous cluster complexities and solutions
--py-files and addPyFile()
Virtualenvs
ClusterSSH & ParallelSSH
Anaconda cluster
Preview: Spark SQL [1 hr]
Spark SQL and the SQL Context
Creating DataFrames
Transforming and Querying DataFrames 
Saving, restoring DataFrames
De-brief [1 hr]
Suggested approaches for digging deeper
Avoiding confusions
Conquering complexity with isolation
Future references
Summary, wrap-up, Q&A [1 hr]