
Foundations of Data Science

Introduction to Data Science and Its Applications

Data science is the discipline of extracting meaningful insights from data by combining statistics, programming, and domain expertise. It powers many of the services and tools we use every day—from recommendation engines on streaming platforms to real-time fraud detection in banking. Governments, businesses, and non-profits alike depend on data science to make evidence-based decisions and improve efficiency.

Applications of data science span nearly every field:

  • Healthcare: Predicting disease risks, optimizing treatment effectiveness, and analyzing healthcare costs.
  • Finance: Credit scoring, algorithmic trading, and fraud detection.
  • Retail: Personalized recommendations and demand forecasting.
  • Transportation: Route optimization and autonomous vehicle navigation.
  • Environmental Science: Climate modeling and resource management.

First Lab: Exploring Sample Datasets

To begin, we will explore a real-world dataset. Suppose a friend complains that healthcare costs in the country where you live are unreasonably high compared to other countries. Can you provide evidence to prove or disprove this claim? A simple but practical task is to generate a CSV file of per-capita annual healthcare costs for 2022 across the 100 largest countries. You will then use Python to:

  1. Load the data into a Pandas DataFrame.
  2. Compute summary statistics such as mean, median, and standard deviation.
  3. Create visualizations such as bar charts and scatter plots.
  4. Ask ChatGPT (or another LLM) to interpret the results and suggest insights.

World Bank Per Capita Healthcare Costs

Here is a sample of what this data looks like:

Country Name,Country Code,health_exp_pc_ppp_2022
Africa Eastern and Southern,AFE,228
Afghanistan,AFG,383
Africa Western and Central,AFW,201
Angola,AGO,217
Albania,ALB,1186
Andorra,AND,5136
Arab World,ARB,776
United Arab Emirates,ARE,3814
Argentina,ARG,2664
Armenia,ARM,1824
Antigua and Barbuda,ATG,1436

You can download this sample data here: Worldwide Healthcare Costs Per Capita for 2022

Note that this list from the World Bank contains not only countries but also regions such as "Africa Western and Central". The third column is in US dollars.
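Here is a minimal sketch of steps 1–3, assuming the sample above is saved as health_costs_2022.csv (a hypothetical file name; the column name comes from the sample header):

import pandas as pd
import matplotlib.pyplot as plt

# 1. Load the data into a Pandas DataFrame
df = pd.read_csv("health_costs_2022.csv")

# 2. Summary statistics for the cost column
col = "health_exp_pc_ppp_2022"
print("mean:", df[col].mean())
print("median:", df[col].median())
print("std dev:", df[col].std())

# 3. Bar chart of the ten highest per-capita costs
# (regions like "Arab World" are still mixed in; filtering them
# out is part of the data cleaning you will do in this lab)
top10 = df.nlargest(10, col)
top10.plot.bar(x="Country Name", y=col, legend=False)
plt.ylabel("Per-capita health expenditure (US$)")
plt.tight_layout()
plt.show()

For step 4, you can paste the printed statistics into ChatGPT (or another LLM) and ask it to interpret the results and suggest insights.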

This exercise introduces data cleaning, exploration, and visualization, which form the foundation of every data science project.

Suggested MicroSim: Exploring Data Points (students add/remove points on a scatter plot to see how the distribution changes).

Why Python for Data Science?

Python is the most widely used programming language for data science. It is popular because of:

  • A rich ecosystem of libraries (NumPy, Pandas, scikit-learn, Matplotlib, PyTorch).
  • Readable, beginner-friendly syntax.
  • Strong community support and open-source resources.

Over the past 15 years, Python has steadily risen to become the dominant language in data science. Other languages such as R, Java, and Julia are used in specific contexts, but Python’s versatility has made it the industry standard.

Here is an interactive time-series chart showing the change in popularity of different languages used in data science. Hover over any year to see each language's share of data science usage that year.

Key Insights from Chart:

  1. Python's Rise: Python showed steady growth from 2010-2018, then accelerated dramatically after 2018 due to the AI/ML boom, reaching a 25%+ market share by 2025.
  2. R's Stability: R maintained consistent popularity (3-5%) throughout the period, remaining strong in academic and statistical research domains.
  3. SQL's Persistence: SQL showed steady growth and remained essential for data manipulation, reaching ~8% by 2025.
  4. Java's Decline: Java's popularity in data science decreased from ~20% to ~7% as Python gained dominance in ML/AI applications.
  5. JavaScript's Growth: JavaScript emerged as a data visualization tool, growing from ~2% to ~6% by 2025.

Note

This chart was generated by Generative AI (Claude Sonnet 4.0) using the Chart.js JavaScript library. You can view the Data Science Programming Language Trends MicroSim to learn more.

Understanding the Data Science Workflow

Every data science project follows a structured workflow. The interactive infographic above lets you explore the six steps in a typical data science workflow; hover over each step to view its text description below the step.

  1. Define the problem – Clarify what question is being answered.
  2. Collect data – Gather raw data from reliable sources.
  3. Clean and preprocess data – Handle missing values, errors, and inconsistencies.
  4. Explore and visualize – Use plots and descriptive statistics to understand patterns.
  5. Model and analyze – Build predictive or explanatory models and evaluate them with metrics for accuracy and generalizability.
  6. Deploy and communicate results – Share findings with stakeholders.

This workflow is iterative. A failed model often sends us back to collect new data or engineer better features.
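As a rough sketch of how steps 2–5 might look in code (the file countries.csv and its columns gdp_pc and health_exp_pc are hypothetical, chosen only to illustrate the flow):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# 2. Collect: load raw data
df = pd.read_csv("countries.csv")

# 3. Clean: drop rows with missing values in the columns we need
df = df.dropna(subset=["gdp_pc", "health_exp_pc"])

# 4. Explore: descriptive statistics
print(df[["gdp_pc", "health_exp_pc"]].describe())

# 5. Model and evaluate: does GDP per capita predict health spending?
X, y = df[["gdp_pc"]], df["health_exp_pc"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))

A weak R^2 here would send us back to step 2 to collect richer features, which is exactly the iteration described above.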

Four Types of Data Representations

In this course we will look at many ways to represent both raw data and connected knowledge. Here is an interactive illustration of the four ways we represent different types of information; a short Python sketch of all four follows the list below.

These types are:

  1. Images - Visual data represented as pixel arrays with RGB color values. Each pixel contains red, green, and blue color components. Common in computer vision, medical imaging, satellite imagery, and photo recognition. Neural networks like CNNs are specifically designed to process this type of spatial data structure.
  2. Sequences - Ordered data where position and timing matter critically. Examples include time series data, natural language text, DNA sequences, audio signals, and stock prices. RNNs, LSTMs, and Transformers are designed to capture temporal dependencies and patterns in sequential data.
  3. Tabular - Structured data organized in rows and columns, similar to spreadsheets or databases. Each row represents an observation and each column represents a feature or variable. This is the most common data type in traditional machine learning, handled well by algorithms like random forests, SVM, and gradient boosting. We can use Python data frames to manipulate tabular data.
  4. Graph - Network data representing relationships and connections between entities. Nodes represent objects (people, websites, molecules) while edges represent relationships (friendships, links, bonds). Used in social network analysis, recommendation systems, knowledge graphs, and molecular modeling. Graph Neural Networks (GNNs) are specialized for this data type.
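As a rough illustration (all values are made up), here is how each of the four types might appear in Python:

import numpy as np
import pandas as pd

# Image: a 64x64 grid of pixels, each with RGB values (all black here)
image = np.zeros((64, 64, 3), dtype=np.uint8)

# Sequence: ordered tokens where position matters
sentence = ["the", "model", "learns", "patterns"]

# Tabular: rows are observations, columns are features
table = pd.DataFrame({"age": [34, 29], "income": [52000, 61000]})

# Graph: nodes and their connections as an adjacency list
graph = {"Alice": ["Bob"], "Bob": ["Alice", "Carol"], "Carol": ["Bob"]}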

Future chapters will focus on different ways we represent this information and how our models vary based on the type of data we are working with.

Basic Atomic Data Types and Structures in Python

Before analyzing data, students must understand how Python stores fundamental atomic entities such as numbers and strings. In this course we will use the following core Python atomic data types:

  • Integers (whole numbers, e.g., 42)
  • Floats (decimal numbers, e.g., 3.14)
  • Strings (text, e.g., "data science")
  • Booleans (True or False)

Core data structures include:

  • Lists – ordered, mutable collections (e.g., [1,2,3])
  • Tuples – ordered, immutable collections (e.g., (1,2,3))
  • Dictionaries – key-value pairs (e.g., {"name": "Alice", "age": 20})
  • Sets – unordered, unique elements (e.g., {1,2,3})
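A quick sketch showing these atomic types and data structures in action (the values are illustrative):

# Atomic types
count = 42                 # int: whole number
pi = 3.14                  # float: decimal number
topic = "data science"     # str: text
ready = True               # bool: True or False

# Core data structures
scores = [88, 92, 79]                    # list: ordered, mutable
point = (3, 4)                           # tuple: ordered, immutable
student = {"name": "Alice", "age": 20}   # dict: key-value pairs
tags = {1, 2, 3}                         # set: unordered, unique

scores.append(95)          # lists can grow and change
print(student["name"])     # dict lookup by key -> "Alice"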

Later in the course, we will rely heavily on NumPy arrays and Pandas DataFrames, which are optimized for data manipulation.


MicroSim – Data Science Workflow Infographic

Students can explore an interactive infographic where clicking each stage of the workflow reveals its purpose, key tools, and example questions.


Data Science Roles and Career Paths

Data science is a team effort, involving many specialized roles:

  • Data Scientist: Builds models, interprets results, and communicates insights.
  • Data Engineer: Designs pipelines and storage systems for reliable data access.
  • Machine Learning Engineer: Deploys and optimizes models in production systems.
  • Business Analyst: Translates data insights into actionable strategies.
  • Ethics & Compliance Specialist: Ensures fairness, transparency, and privacy in projects.

These roles often overlap, and many entry-level positions expect a blend of programming, statistics, and communication skills.


Ethics and Best Practices in Data Science

Data science has great potential, but also significant risks. Poorly designed or biased models can reinforce inequalities or cause harm. To practice ethical data science, we must:

  • Protect privacy: Respect data ownership and confidentiality.
  • Avoid bias: Check datasets and models for fairness across subgroups.
  • Be transparent: Document methods and assumptions clearly.
  • Ensure reproducibility: Use version control and pipelines so results can be verified.
  • Balance efficiency and responsibility: Consider environmental and social impacts.

✅ This completes the Foundations of Data Science chapter, preparing students for Week 1 of the course.