David's Blog

The ramblings of a frustrated Python Data Scientist ...

IBM Data Science Professional Certificate


Introduction

A little over a year ago I completed the IBM Data Science Professional Certificate on Coursera:

This is just a summary / overview of the course if anyone is interested. Most of the content of this post is a summary of the details available at the link above.

And for anyone interested, here is a link to the Capstone Project I submitted to complete this course:

Why this course was created

Data is collected in every aspect of our existence. The true transformative impact of data is realizable only when we can mine and act upon the insights contained within the data. Thus it is no surprise to see phrases such as “data is the new oil” (Economist).

A variety of data-related professions are relevant: Data Scientist, Data Engineer, Data Analyst, Database Developer, Business Intelligence (BI) Analyst, etc. The most prominent of these is Data Scientist. It has been called “the sexiest job of the 21st century” by the Harvard Business Review, and Glassdoor calls it the “best job in America”.

In a recent report, IBM projected that “by 2020 the number of positions for data and analytics talent in the United States will increase by 364,000 openings, to 2,720,000”. The global demand is even higher.

What is being offered

It consists of 9 courses that are intended to arm you with the latest job-ready skills and techniques in Data Science.

The courses cover a variety of data science topics including: open source tools and libraries, methodologies, Python, databases and SQL, data visualization, data analysis, and machine learning. You will practice hands-on in the IBM Cloud (at no additional cost) using real data science tools and real-world data sets.

How is it different

This professional certificate has a strong emphasis on applied learning. Except for the first course, all other courses include a series of hands-on labs and are performed in the IBM Cloud (without any cost to you).

Throughout this Professional Certificate you are exposed to a series of tools, libraries, cloud services, datasets, algorithms, assignments and projects that will provide you with practical skills with applicability to real jobs that employers value, including:

  • Tools: Jupyter / JupyterLab, Zeppelin notebooks, RStudio, Watson Studio, Db2 database
  • Libraries: Pandas, NumPy, Matplotlib, Seaborn, Folium, ipython-sql, Scikit-learn, SciPy, etc.
  • Projects: random album generator, predict housing prices, best classifier model, battle of neighbourhoods

Who this is for

Data Science is for everyone – not just those with a Master’s or Ph.D. Anyone can become a Data Scientist, whether or not you currently have computer science or programming skills. It is suitable for those entering the workforce as well as for existing professionals looking to upskill / re-skill themselves and get ahead in their careers.

A Data Scientist is someone who can find the right data, prepare it, analyse and visualize data using a variety of tools and algorithms, build data experiments and models, run these experiments, learn from them, adjust and re-iterate as needed, and eventually be able to tell the story hidden within data so it can be acted upon – either by a human or a machine.

Cost and duration

The courses in this certificate are offered online for self-learning and available for “audit” at no cost. “Auditing” a course gives you the ability to access all lectures, readings, labs, and non-graded assignments at no charge. If you want to learn and develop skills, you can audit all the courses for free (do this first to get as much as possible at no cost).

The graded quizzes, assignments, and verified certificates are only available with a low-cost monthly subscription (it was just $39 USD per month for a limited time).

The certificate requires completion of 9 courses. Each course typically contains 3-6 modules with an average effort of 2 to 4 hours per module. If learning part-time (e.g. 1 module per week), it would take 6 to 12 months to complete the entire certificate. If learning full-time (e.g. 1 module per day) the certificate can be completed in 2 to 3 months.

What follows is a brief overview of each course and its modules.


What is Data Science

Defining Data Science and What Data Scientists Do

In this module, you will go over the course syllabus to learn what will be taught in this course. Also, you will hear from data science professionals to learn what data science is, what data scientists do, and what tools and algorithms data scientists use on a daily basis. Finally, you will be required to complete a reading assignment to learn why data science is considered the sexiest job in the 21st century.

Data Science Topics

In this module, you will hear from Norman White, the Faculty Director of the Stern Centre for Research Computing at New York University, as he talks about data science, the skills required to pursue a career in this field, and his advice for those looking to start one. Finally, you will be required to complete reading assignments to learn about the process of mining a given dataset and about regression analysis.

Data Science in Business

In this module, you will learn about what companies need to do in order to start with data science. You will also learn about some of the qualities that differentiate data scientists from other professionals. In addition, you will learn about analytics and what important role data scientists play in this process, and about story-telling and the importance of an effective final deliverable. Finally, you will be required to apply what you learned about data science by answering open-ended questions.


Tools for Data Science

Introducing Cognitive Class Labs

You will get an overview of the various data science tools available to you, hosted on Cognitive Class Labs. You will create an account and start exploring some of the features.

RStudio IDE

You will learn about a popular data science tool used by R programmers. You'll learn about the user interface and how to use its various features.

Jupyter Notebooks

You will learn about a popular data science tool, Jupyter Notebooks, their features, and why they are so popular among data scientists today.

IBM Watson Studio

You will learn about an enterprise-ready data science platform by IBM, called Watson Studio (formerly known as Data Science Experience). You'll learn about some of its features and capabilities that data scientists use in industry.


Data Science Methodology

From Problem to Approach and From Requirements to Collection

In this module, you will learn why we are interested in data science, what a methodology is, and why data scientists need one. You will also learn about the data science methodology and its flowchart, starting with its first two stages, namely Business Understanding and Analytic Approach. Finally, through a lab session, you will learn how to complete the Business Understanding and Analytic Approach stages, as well as the Data Requirements and Data Collection stages, for any data science problem.

From Understanding to Preparation and From Modelling to Evaluation

In this module, you will learn what it means to understand data, and to prepare or clean data. You will also learn about the purpose of data modelling and some characteristics of the modelling process. Finally, through a lab session, you will learn how to complete the Data Understanding and the Data Preparation stages as well as the Modelling and the Model Evaluation stages pertaining to any data science problem.

From Deployment to Feedback

In this module, you will learn about what happens when a model is deployed and why model feedback is important. Also, by completing a peer-reviewed assignment, you will demonstrate your understanding of the data science methodology by applying it to a problem that you define.


Python for Data Science

Python Basics

  • Types
  • Expressions and Variables
  • String Operations

Python Data Structures

  • Lists and Tuples
  • Dictionaries
  • Sets
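
As a rough illustration of the four structures listed above, here is a minimal sketch in plain Python. The ratings are made-up illustrative values and are not taken from the course labs.

```python
# Illustrative values only; the ratings are invented for this sketch.
ratings = [10.0, 8.5, 9.0, 8.5]            # list: ordered and mutable
album = ("Thriller", 1982)                  # tuple: ordered and immutable
release_years = {"Thriller": 1982, "Back in Black": 1980}  # dict: key -> value

ratings.append(7.5)                         # lists can grow in place
unique_ratings = set(ratings)               # sets drop duplicate values

title, year = album                         # tuple unpacking
release_years["The Dark Side of the Moon"] = 1973  # add a new key

# Album titles ordered from oldest to newest release year.
titles_by_year = sorted(release_years, key=release_years.get)
print(len(ratings), len(unique_ratings), titles_by_year)
```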

Python Programming Fundamentals

  • Conditions and Branching
  • Loops
  • Functions
  • Objects and Classes

Working with Data in Python

  • Reading Files with Open
  • Writing Files with Open
  • Pandas
  • One Dimensional NumPy
  • Two Dimensional NumPy

Project

  • Fake Album Cover Game

Databases and SQL for Data Science

Introduction to Databases and Basic SQL

You will be introduced to databases. You will create a database instance on the cloud. You will learn some of the basic SQL statements. You will also write and practice basic SQL hands-on on a live database.

Advanced SQL

In this module you will learn how to use string patterns and ranges to search data, how to sort and group data in result sets, and how to work with multiple tables in a relational database using join operations.
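
The query features mentioned here can be sketched with a small example. The course itself uses Db2 on the IBM Cloud. This sketch swaps in Python's built-in sqlite3 module with an in-memory database so it runs anywhere, and the tables and rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the course's Db2 instance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        emp_id INTEGER PRIMARY KEY, name TEXT, salary REAL, dept_id INTEGER
    );
    INSERT INTO departments VALUES (1, 'Research'), (2, 'Sales');
    INSERT INTO employees VALUES
        (1, 'Alice', 72000, 1),
        (2, 'Albert', 58000, 2),
        (3, 'Bob', 65000, 1),
        (4, 'Carol', 49000, 2);
""")

# String pattern (LIKE) plus a range (BETWEEN), sorted by salary.
pattern_rows = conn.execute("""
    SELECT name, salary FROM employees
    WHERE name LIKE 'Al%' AND salary BETWEEN 50000 AND 80000
    ORDER BY salary DESC
""").fetchall()

# Join the two tables, then group the result by department.
dept_rows = conn.execute("""
    SELECT d.name, COUNT(*) AS headcount, AVG(e.salary) AS avg_salary
    FROM employees AS e
    JOIN departments AS d ON e.dept_id = d.dept_id
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()

conn.close()
print(pattern_rows)
print(dept_rows)
```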

Accessing Databases using Python

After completing the lessons in this week, you will be able to explain the basic concepts of using Python to connect to databases, and then create tables, load data, query data using SQL, and analyse data using Python.
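
The connect, create, load, query, and analyse workflow can be sketched as follows. This is a minimal example under stated assumptions: it uses the standard-library sqlite3 module rather than the Db2 connection used in the course, and the `schools` table and its rows are invented for illustration.

```python
import sqlite3
import statistics

# Hypothetical rows: (school name, area, safety score); None is a missing score.
rows = [
    ("School A", "North", 71),
    ("School B", "North", 55),
    ("School C", "South", 62),
    ("School D", "South", None),
]

# 1. Connect (an in-memory database, so nothing is written to disk).
conn = sqlite3.connect(":memory:")

# 2. Create a table.
conn.execute("CREATE TABLE schools (name TEXT, area TEXT, safety_score INTEGER)")

# 3. Load data; the ? placeholders keep the values separate from the SQL text.
conn.executemany("INSERT INTO schools VALUES (?, ?, ?)", rows)
conn.commit()

# 4. Query with SQL, skipping rows where the score is missing.
cursor = conn.execute(
    "SELECT safety_score FROM schools WHERE safety_score IS NOT NULL ORDER BY name"
)
scores = [score for (score,) in cursor]
conn.close()

# 5. Analyse the result in Python.
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
print(scores, round(mean_score, 2), median_score)
```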

Course Assignment

As a hands-on Data Science assignment, you will be working with multiple real world datasets for the city of Chicago. You will be asked questions that will help you understand the data just like a data scientist would.


Data Analysis with Python

Importing Datasets

  • Understanding the Data
  • Python Packages for Data Science
  • Importing and Exporting Data in Python
  • Getting Started Analysing Data in Python
  • Importing Datasets

Data Wrangling

  • Dealing with Missing Values in Python
  • Data Formatting in Python
  • Data Normalization in Python
  • Turning Categorical Variables into Quantitative Variables in Python
  • Data Wrangling
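
Three of the wrangling steps listed above (missing values, normalization, and turning a categorical variable into numeric columns) can be sketched in a few lines. The course does this with pandas. The sketch below deliberately uses plain Python so each step is visible, and the car records are hypothetical.

```python
from statistics import mean

# Hypothetical car records; None marks a missing value.
cars = [
    {"fuel": "gas", "horsepower": 110},
    {"fuel": "diesel", "horsepower": None},
    {"fuel": "gas", "horsepower": 150},
    {"fuel": "diesel", "horsepower": 70},
]

# 1. Missing values: replace None with the mean of the known values.
known = [c["horsepower"] for c in cars if c["horsepower"] is not None]
fill_value = mean(known)
for c in cars:
    if c["horsepower"] is None:
        c["horsepower"] = fill_value

# 2. Normalization: min-max scaling maps each value into the range [0, 1].
lowest = min(c["horsepower"] for c in cars)
highest = max(c["horsepower"] for c in cars)
for c in cars:
    c["horsepower_scaled"] = (c["horsepower"] - lowest) / (highest - lowest)

# 3. Categorical to quantitative: one 0/1 indicator column per category.
categories = sorted({c["fuel"] for c in cars})
for c in cars:
    for category in categories:
        c[f"fuel_{category}"] = int(c["fuel"] == category)

scaled = [c["horsepower_scaled"] for c in cars]
print(scaled)
```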

Exploratory Data Analysis

  • Descriptive Statistics
  • Group By in Python
  • Correlation
  • Correlation - Statistics
  • Exploratory Data Analysis
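
To make the topics above a little more concrete, here is a minimal sketch of descriptive statistics, a group-by, and the Pearson correlation coefficient. It uses only the standard library (the course uses pandas for this), and the observations are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

# Hypothetical (drive type, engine size, price) observations.
data = [
    ("fwd", 1.6, 14000),
    ("fwd", 2.0, 17000),
    ("rwd", 3.0, 26000),
    ("rwd", 3.6, 31000),
]
sizes = [size for _, size, _ in data]
prices = [price for _, _, price in data]

# Descriptive statistics for one column.
summary = {
    "count": len(prices),
    "mean": mean(prices),
    "std": stdev(prices),   # sample standard deviation
    "min": min(prices),
    "max": max(prices),
}

# Group by drive type, then average the price within each group.
groups = defaultdict(list)
for drive, _, price in data:
    groups[drive].append(price)
mean_price_by_drive = {drive: mean(values) for drive, values in groups.items()}


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both spreads."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread


r = pearson(sizes, prices)
print(summary["mean"], mean_price_by_drive, round(r, 3))
```

A value of `r` close to 1 suggests a strong positive linear relationship between engine size and price in this toy data. It says nothing about causation.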

Model Development

  • Linear Regression and Multiple Linear Regression
  • Model Evaluation using Visualization
  • Polynomial Regression and Pipelines
  • Measures for In-Sample Evaluation
  • Model Development
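
As an illustration of simple linear regression and one in-sample measure (R-squared), here is a minimal sketch using the closed-form least-squares solution. The course uses scikit-learn for this. The helper names `fit_line` and `r_squared` and the data are assumptions made for the example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    my = mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: engine size (litres) and price (thousands).
engine_size = [1.0, 2.0, 3.0, 4.0]
price = [11.0, 13.0, 17.0, 19.0]

intercept, slope = fit_line(engine_size, price)
score = r_squared(engine_size, price, intercept, slope)
print(round(intercept, 3), round(slope, 3), round(score, 3))
```

Because R-squared is computed here on the same data used for fitting, it is an in-sample measure and tends to be optimistic about performance on new data.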

Model Evaluation

  • Model Evaluation
  • Overfitting, Underfitting and Model Selection
  • Ridge Regression
  • Quiz: Model Refinement

Data Visualization with Python

Introduction to Data Visualization Tools

In this module, you will learn about data visualization and some of the best practices to keep in mind when creating plots and visuals. You will also learn about the history and the architecture of Matplotlib and learn about basic plotting with Matplotlib. In addition, you will learn about the dataset on immigration to Canada, which will be used extensively throughout the course. Finally, you will briefly learn how to read CSV files into a pandas data frame, how to process and manipulate the data in the data frame, and how to generate line plots using Matplotlib.

Basic and Specialized Visualization Tools

In this module, you will learn about area plots, histograms, bar charts, pie charts, box plots, scatter plots, and bubble plots, and how to create each of them with Matplotlib.

Advanced Visualizations and Geospatial Data

In this module, you will learn about advanced visualization tools such as waffle charts and word clouds and how to create them. You will also learn about seaborn, which is another visualization library, and how to use it to generate attractive regression plots. In addition, you will learn about Folium, which is another visualization library, designed especially for visualizing geospatial data. Finally, you will learn how to use Folium to create maps of different regions of the world and how to superimpose markers on top of a map, and how to create choropleth maps.


Machine Learning with Python

Introduction to Machine Learning

In this week, you will learn about applications of Machine Learning in different fields such as health care, banking, telecommunication, and so on. You’ll get a general overview of Machine Learning topics such as supervised vs unsupervised learning, and when each type of algorithm is used. You will also come to understand the advantages of using Python libraries for implementing Machine Learning models.

Regression

In this week, you will get a brief introduction to regression. You will learn about linear, non-linear, simple, and multiple regression and their applications, and you will apply these methods to two different datasets in the lab. You will also learn how to evaluate your regression model and calculate its accuracy.

Classification

In this week, you will learn about classification techniques. You will practice with different classification algorithms, such as KNN, Decision Trees, Logistic Regression, and SVM. You will also learn about the pros and cons of each method, and about different classification accuracy metrics.
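
To give a feel for one of these algorithms, here is a minimal sketch of k-nearest neighbours (KNN) with a simple accuracy check. The course uses scikit-learn for this. The `knn_predict` helper and the data points below are assumptions made for illustration.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest training points.

    train is a list of (features, label) pairs; distance is Euclidean.
    """
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training data with two classes.
train = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((2.0, 1.0), "low"),
    ((6.0, 6.0), "high"),
    ((7.0, 5.5), "high"),
    ((6.5, 7.0), "high"),
]

# Held-out points used to measure accuracy (fraction predicted correctly).
test = [((1.8, 1.2), "low"), ((6.8, 6.2), "high")]
correct = sum(knn_predict(train, features) == label for features, label in test)
accuracy = correct / len(test)
print(accuracy)
```

An odd `k` avoids tied votes when there are two classes. In practice `k` is usually chosen by comparing accuracy on held-out data across several values.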

Clustering

In this section, you will learn about different clustering approaches. You will learn how to use clustering for customer segmentation, for grouping similar vehicles, and for clustering weather stations. You will come to understand 3 main types of clustering: Partition-based Clustering, Hierarchical Clustering, and Density-based Clustering.

Recommender Systems

In this module, you will learn about recommender systems. First, you will be introduced to the main idea behind recommendation engines, and then you will learn about the two main types of recommendation engines, namely content-based and collaborative filtering.

Final Project

In this module, you will do a project based on what you have learned so far. You will submit a report of your project for peer evaluation.


Applied Data Science Capstone

Introduction

In this module, you will learn about the scope of this capstone course and the context of the project that you will be working on. You will learn about different location data providers and what location data is normally composed of. Finally, you will be required to submit a link to a new repository on your GitHub account dedicated to this course.

Foursquare API

In this module, you will learn in detail about Foursquare, which is the location data provider we will be using in this course, and its API. Essentially, you will learn how to create a Foursquare developer account, and use your credentials to search for nearby venues of a specific type, explore a particular venue, and search for trending venues around a location.

Neighbourhood Segmentation and Clustering

In this module, you will learn about k-means clustering, which is a form of unsupervised learning. Then you will use clustering and the Foursquare API to segment and cluster the neighbourhoods in the city of New York. Furthermore, you will learn how to scrape a website and parse HTML code using the Python package Beautiful Soup, and convert data into a pandas data frame.
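
The idea behind k-means can be sketched in plain Python: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until the assignments stop changing. The course uses scikit-learn on features derived from Foursquare venue data. The `k_means` helper and the two-dimensional points below are assumptions made for illustration.

```python
import random
from math import dist


def k_means(points, k, iterations=100, seed=0):
    """Basic k-means (Lloyd's algorithm) on a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # start from k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:        # no change means convergence
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:                 # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical feature vectors for six neighbourhoods.
neighbourhoods = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
    (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
]
labels, centroids = k_means(neighbourhoods, k=2)
print(labels, sorted(centroids))
```

The cluster numbers themselves are arbitrary, so only the grouping is meaningful. The result can also depend on the starting centroids, which is why libraries typically run several initialisations.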

The Battle of Neighbourhoods

In this module, you will start working on the capstone project. You will clearly define a problem and discuss the data that you will be using to solve the problem.

You will then carry out all the remaining work to complete your capstone project. You will submit a report of your project for peer evaluation.

Applied Data Science Capstone Part 1 [Week 1]

Clearly define a problem or an idea of your choice that requires you to leverage the Foursquare location data to solve or execute it. Remember that data science problems always target an audience and are meant to help a group of stakeholders solve a problem, so make sure that you explicitly describe your audience and why they would care about your problem.

This submission will eventually become the Introduction / Business Problem section in your final report. So I recommend that you push the report (containing only your Introduction / Business Problem section for now) to your GitHub repository and submit a link to it.

Applied Data Science Capstone Part 2 [Week 1]

Describe the data that you will be using to solve the problem or execute your idea. Remember that you will need to use the Foursquare location data to solve the problem or execute your idea. You can absolutely use other datasets in combination with the Foursquare location data. So make sure that you provide adequate explanation and discussion, with examples, of the data that you will be using, even if it is only Foursquare location data.

This submission will eventually become the Data section in your final report. So I recommend that you push the report (now including your Data section) to your GitHub repository and submit a link to it.

Applied Data Science Capstone Part 3 [Week 2]

In this week, you will continue working on your capstone project. Please remember by the end of this week, you will need to submit the following:

  • A full report consisting of all of the following components:
      • Introduction section where you discuss the business problem and who would be interested in this project.
      • Data section where you describe the data that will be used to solve the problem and the source of the data.
      • Methodology section, which represents the main component of the report, where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, and which machine learning techniques were used and why.
      • Results section where you discuss the results.
      • Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.
      • Conclusion section where you conclude the report.
  • A link to the Notebook, pushed to your GitHub repository, showing your code.
  • Your choice of a presentation or blogpost.

If you've gotten this far, thank you for taking the time to read this post.