Module 5: Data Science & Analysis

This module introduces Python's core libraries for data manipulation and visualization: NumPy, Pandas, and Matplotlib.

5.1 NumPy Basics

NumPy is the fundamental package for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays.

import numpy as np

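# Creating arrays
a = np.array([1, 2, 3])   # 1-D array built from a Python list
b = np.zeros((2, 2))      # 2x2 array filled with 0.0 (float64 by default)

# Vectorized operations: the multiplication applies to every element at once
c = a * 2  # array([2, 4, 6])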
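
To make "tools for working with these arrays" more concrete, the following is a minimal sketch of a few common operations on a 2-D array: inspecting its shape, broadcasting a row across it, and reducing along an axis. The variable names and sample values are illustrative assumptions, not part of the module's original material.

```python
import numpy as np

# A 2-D array with 2 rows and 3 columns (illustrative values)
m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.shape)  # (2, 3)
print(m.dtype)  # an integer dtype; the exact width depends on the platform

# Broadcasting: the 1-D row is added to every row of m
row = np.array([10, 20, 30])
shifted = m + row
print(shifted)  # [[11 22 33]
                #  [14 25 36]]

# Reductions along an axis
col_sums = m.sum(axis=0)    # sum down each column -> [5 7 9]
row_means = m.mean(axis=1)  # mean across each row -> [2. 5.]
print(col_sums, row_means)
```

Operations like these avoid explicit Python loops, which is where most of NumPy's speed advantage comes from.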

5.2 Pandas for Data Manipulation

Pandas provides fast, easy-to-use data structures and data analysis tools. Its primary data structure is the DataFrame, a two-dimensional labeled table whose columns can each hold a different data type.

import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
})

# Filtering: keep only the rows where Age is greater than 18
adults = df[df['Age'] > 18]
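
The filter above works by building a boolean mask and using it to select rows. The sketch below spells that out and shows how conditions might be combined. The extra row ('Carol', 17) is a hypothetical addition so that the filter actually excludes something.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],  # 'Carol' is a hypothetical extra row
    'Age': [25, 30, 17],
})

# Step 1: the comparison yields a boolean Series, one value per row
is_adult = df['Age'] > 18
print(is_adult.tolist())  # [True, True, False]

# Step 2: indexing with the mask keeps only the rows marked True
adults = df[is_adult]

# Conditions are combined with & (and) or | (or); each needs its own parentheses
in_twenties = df[(df['Age'] >= 20) & (df['Age'] < 30)]

# Sorting and column selection can be chained
names_by_age = df.sort_values('Age', ascending=False)['Name'].tolist()
print(names_by_age)  # ['Bob', 'Alice', 'Carol']
```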

5.3 Data Visualization with Matplotlib

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

plt.plot(x, y)
plt.title("Simple Plot")
plt.show()
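
Matplotlib also offers an object-oriented interface, which tends to scale better once a figure needs axis labels, legends, or several subplots. The following is a sketch of the same plot in that style. The "Agg" backend and the output filename are assumptions chosen so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

# plt.subplots() returns a Figure and a single Axes to draw on
fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y values")
ax.set_title("Simple Plot")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Write the figure to disk instead of showing it (hypothetical filename)
fig.savefig("simple_plot.png")
```

In an interactive session, omitting the `matplotlib.use("Agg")` line and calling `plt.show()` would display the figure in a window instead.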

5.4 Exploratory Data Analysis (EDA)

EDA is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
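
A first pass at EDA often starts with a handful of Pandas calls before any plotting. The sketch below runs on a small made-up dataset (the `city` and `temp_c` columns and all values are hypothetical) and shows one way to summarize structure, missing values, and distributions.

```python
import pandas as pd

# Hypothetical sample data; None becomes NaN in the numeric column
df = pd.DataFrame({
    'city': ['Paris', 'Lima', 'Paris', 'Oslo', 'Lima', 'Paris'],
    'temp_c': [21.0, 18.5, None, 9.0, 19.5, 23.0],
})

# Structure: size, column types, and a peek at the first rows
print(df.shape)   # (6, 2)
print(df.dtypes)
print(df.head())

# Summary statistics for a numeric column (NaN values are ignored)
summary = df['temp_c'].describe()
print(summary)

# Number of missing values per column
missing = df.isna().sum()
print(missing)

# Frequency of each category
counts = df['city'].value_counts()
print(counts)

# Mean of a numeric column within each category
mean_by_city = df.groupby('city')['temp_c'].mean()
print(mean_by_city)
```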

🎯 Practical Exercise

Load a CSV file using Pandas, clean missing values, and create a simple bar chart visualizing the distribution of a categorical column.
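
One possible approach to the exercise is sketched below. It is not the only valid solution. To stay self-contained it reads CSV text from an in-memory string rather than a file, and the column names, data values, cleaning choices (filling with 'Unknown' and the median), and output filename are all assumptions made for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for a CSV file on disk (hypothetical data).
# With a real file, pass its path to pd.read_csv instead.
csv_text = """name,department,salary
Alice,Engineering,100
Bob,Sales,
Carol,Engineering,120
Dan,,90
Eve,Sales,80
"""
df = pd.read_csv(io.StringIO(csv_text))

# Clean missing values: label missing categories, fill missing numbers
# with the column median (one of several reasonable strategies)
df['department'] = df['department'].fillna('Unknown')
df['salary'] = df['salary'].fillna(df['salary'].median())

# Count how often each category appears
counts = df['department'].value_counts()

# Bar chart of the category distribution
fig, ax = plt.subplots()
ax.bar(counts.index, counts.values)
ax.set_title("Rows per department")
ax.set_xlabel("department")
ax.set_ylabel("count")
fig.savefig("department_counts.png")  # hypothetical output filename
```

Dropping incomplete rows with `df.dropna()` is an alternative to filling. Which is appropriate depends on how much data is missing and why.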