Python for Data Analysis

Difficulty: B.Tech
Read Time: ~15 min

Introduction to Data Analysis with Python

Python has become one of the most widely used languages for data analysis, largely because of its ecosystem of data-centric libraries such as Pandas, NumPy, and Matplotlib.

To see how this connects to visualization, check out our Tableau Visualization Module.

Working with Pandas

Pandas provides high-performance, easy-to-use data structures. The two primary ones are Series (a one-dimensional labeled array) and DataFrame (a two-dimensional labeled table in which each column is a Series).
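
As a minimal sketch of the two structures, the snippet below builds a small Series and a small DataFrame by hand. The region names and revenue figures are made up for illustration and do not come from any real dataset.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
revenue = pd.Series(
    [120.0, 80.5, 200.0],
    index=["north", "south", "east"],
    name="Revenue",
)

# A DataFrame is a two-dimensional table; each column is a Series.
sales = pd.DataFrame({
    "Region": ["north", "south", "east"],
    "Revenue": [120.0, 80.5, 200.0],
})

print(revenue.shape)            # one dimension: (rows,)
print(sales.shape)              # two dimensions: (rows, columns)
print(type(sales["Revenue"]))   # selecting one column yields a Series
```

Selecting a single column with square brackets returns a Series, which is why most column-level operations in the rest of this article are Series methods.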

Loading Data

The most common operation is loading a CSV file into a DataFrame.

python

import pandas as pd
import numpy as np

# Load data into a DataFrame
df = pd.read_csv('sales_data.csv')

# Display the first 5 rows
print(df.head())
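
To try the loading step without a file on disk, one option is to pass read_csv a file-like object. The sketch below assumes a hypothetical three-row stand-in for sales_data.csv; the column names Order_Date, Region, and Revenue are illustrative guesses, not the actual file layout.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of sales_data.csv.
csv_text = """Order_Date,Region,Revenue
2024-01-05,North,120.0
2024-01-06,South,
2024-01-07,East,200.0
"""

# read_csv accepts a path or any file-like object.
# parse_dates asks pandas to convert the listed columns to datetimes.
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Order_Date"])

print(sample_df.dtypes)
print(sample_df.head())
```

The empty Revenue field in the second row is read as NaN, which is the kind of missing value the cleaning steps are meant to handle.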

Data Cleaning and Preprocessing

Real-world data is messy. You will frequently encounter missing values (NaN) or incorrect data types.

python

# Check for missing values
print(df.isnull().sum())

# Drop rows with missing values
df_cleaned = df.dropna()

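# Alternatively, fill missing values with the mean (for numerical columns).
# Assign the result back to the column: calling fillna(..., inplace=True)
# on a selected column is deprecated in recent pandas and may not update df.
df['Revenue'] = df['Revenue'].fillna(df['Revenue'].mean())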
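
The section above also mentions incorrect data types. A minimal sketch of one common case follows, assuming a numeric column that was read as text; the DataFrame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data: Revenue arrived as text, with one unparseable entry.
raw = pd.DataFrame({
    "Region": ["North", "South", "East"],
    "Revenue": ["120.0", "n/a", "200"],
})

# errors="coerce" turns values that cannot be parsed into NaN
# instead of raising an exception.
raw["Revenue"] = pd.to_numeric(raw["Revenue"], errors="coerce")

# Fill the resulting gap with the column mean, assigning the result back.
raw["Revenue"] = raw["Revenue"].fillna(raw["Revenue"].mean())

print(raw.dtypes)
print(raw)
```

Coercing first and filling second keeps the two decisions separate: which values count as invalid, and what to substitute for them.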

Advanced Grouping (Split-Apply-Combine)

The groupby method in Pandas implements the split-apply-combine pattern: it splits the rows into groups by key, applies an aggregation function to each group, and combines the per-group results into a new DataFrame.

python

# Group by region: total revenue, average margin, and the number of
# non-null Order_Count entries ('count' counts rows, it does not sum values)
summary = df.groupby('Region').agg({
    'Revenue': 'sum',
    'Margin': 'mean',
    'Order_Count': 'count'
}).reset_index()

# Sort the summary by Revenue in descending order
summary = summary.sort_values(by='Revenue', ascending=False)
print(summary)
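
To make the three steps concrete, here is a small self-contained sketch on hand-made data. The orders table is hypothetical, and the output column names (total_revenue, avg_margin, n_orders) are chosen for illustration. It uses named aggregation, an alternative to the dictionary form that lets each output column be named explicitly.

```python
import pandas as pd

# Hypothetical order-level data.
orders = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "South"],
    "Revenue": [100.0, 50.0, 30.0, 20.0, 10.0],
    "Margin": [0.2, 0.4, 0.1, 0.2, 0.3],
})

# Split by Region, apply one aggregation per output column, combine.
# Named aggregation syntax: output_column=(input_column, function).
by_region = (
    orders.groupby("Region")
    .agg(
        total_revenue=("Revenue", "sum"),
        avg_margin=("Margin", "mean"),
        n_orders=("Revenue", "count"),
    )
    .reset_index()
    .sort_values(by="total_revenue", ascending=False)
)

print(by_region)
```

With this data, North has two orders and South has three, so the result has one row per region, sorted with the higher-revenue region first.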

Time Complexity in Pandas

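Unlike native Python lists, which hold references to individual Python objects, Pandas stores column data in contiguous, typed arrays (via NumPy). Iterating over a DataFrame with iterrows() is discouraged: the loop is still O(N), but every step constructs a new Series for the row, so the per-row overhead is large. Vectorized operations are also O(N), but the loop runs in compiled code, so they are typically far faster. Prefer them wherever the computation can be expressed on whole columns.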
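
The difference can be observed with a rough timing sketch like the one below. The Price and Quantity columns and the row count are arbitrary choices for illustration, and the measured times will vary by machine and pandas version, so treat the numbers as indicative only.

```python
import time
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 rows of random prices and quantities.
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "Price": rng.uniform(1.0, 100.0, size=10_000),
    "Quantity": rng.integers(1, 10, size=10_000),
})

# Row-by-row loop: iterrows() builds a Series for every row.
start = time.perf_counter()
loop_total = [row["Price"] * row["Quantity"] for _, row in frame.iterrows()]
loop_seconds = time.perf_counter() - start

# Vectorized: one expression over whole columns.
start = time.perf_counter()
vector_total = frame["Price"] * frame["Quantity"]
vector_seconds = time.perf_counter() - start

print(f"iterrows:   {loop_seconds:.4f} s")
print(f"vectorized: {vector_seconds:.4f} s")
```

Both approaches compute the same values; only the place where the loop runs differs.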