Python for Data Analysis
Introduction to Data Analysis with Python
Python has become one of the most widely used languages for data analysis, largely because of its ecosystem of data-centric libraries such as Pandas, NumPy, and Matplotlib.
To see how this connects to visualization, check out our Tableau Visualization Module.
Working with Pandas
Pandas provides high-performance, easy-to-use data structures. The two primary data structures are Series (1-dimensional) and DataFrame (2-dimensional).
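To make the two structures concrete, here is a minimal sketch. The variable names (revenue, df_demo) and the values are made up for illustration and do not come from the sales data used later.

```python
import pandas as pd

# A Series is a 1-dimensional labeled array (toy values, for illustration only).
revenue = pd.Series([120.0, 95.5, 210.0],
                    index=["North", "South", "West"],
                    name="Revenue")

# A DataFrame is a 2-dimensional table; each column is itself a Series.
df_demo = pd.DataFrame({
    "Region": ["North", "South", "West"],
    "Revenue": [120.0, 95.5, 210.0],
})

print(revenue.ndim, df_demo.ndim)          # number of dimensions of each structure
print(type(df_demo["Revenue"]).__name__)   # selecting one column yields a Series
```

Selecting a single column from a DataFrame returns a Series, which is why most column-level operations in the rest of this module look the same for both structures.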
Loading Data
The most common operation is loading a CSV file into a DataFrame.
import pandas as pd
import numpy as np
# Load data into a DataFrame
df = pd.read_csv('sales_data.csv')
# Display the first 5 rows
print(df.head())
Data Cleaning and Preprocessing
Real-world data is messy. You will frequently encounter missing values (NaN) or incorrect data types.
# Check for missing values
print(df.isnull().sum())
# Drop rows with missing values
df_cleaned = df.dropna()
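# Alternatively, fill missing values with the column mean (numerical columns only).
# Assigning the result back avoids a chained in-place call, which recent pandas versions warn about.
df['Revenue'] = df['Revenue'].fillna(df['Revenue'].mean())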
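The text above also mentions incorrect data types, which the snippet does not cover. The following is a minimal sketch of one common approach, assuming a hypothetical frame named raw in which numbers and dates arrived as strings. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical raw data: numbers and dates arrive as strings, one value is malformed.
raw = pd.DataFrame({
    "Revenue": ["100.5", "200", "n/a"],
    "Order_Date": ["2024-01-05", "2024-02-10", "2024-03-15"],
})

# errors="coerce" turns unparseable entries into NaN instead of raising an error.
raw["Revenue"] = pd.to_numeric(raw["Revenue"], errors="coerce")

# Parse the date strings into datetime values.
raw["Order_Date"] = pd.to_datetime(raw["Order_Date"])

print(raw.dtypes)
print(raw["Revenue"].isna().sum())  # how many entries could not be parsed
```

Coercing first and then inspecting the resulting NaN count is one way to see how much of a column was malformed before deciding whether to drop or fill those rows.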
Advanced Grouping (Split-Apply-Combine)
The groupby method in Pandas implements the split-apply-combine pattern: it splits the data into groups by key, applies a function (such as an aggregation) to each group, and combines the results into a single object.
# Group by region and calculate total revenue, average margin,
# and the number of non-null Order_Count entries per region
summary = df.groupby('Region').agg({
    'Revenue': 'sum',
    'Margin': 'mean',
    'Order_Count': 'count'
}).reset_index()
# Sort the summary by Revenue in descending order
summary = summary.sort_values(by='Revenue', ascending=False)
print(summary)
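To show what the three steps mean, the sketch below performs the split and apply steps by hand on a small invented frame (toy) and compares the result with a single groupby call. It is an illustration of the pattern, not a description of how Pandas implements it internally.

```python
import pandas as pd

# Toy data, invented for this illustration.
toy = pd.DataFrame({
    "Region": ["East", "West", "East", "West"],
    "Revenue": [100, 250, 50, 150],
})

# Split: iterating over a groupby yields one (key, sub-DataFrame) pair per region.
# Apply: sum the Revenue column of each sub-DataFrame.
manual = {region: int(group["Revenue"].sum())
          for region, group in toy.groupby("Region")}

# Combine: calling sum() on the grouped column does all three steps at once.
combined = toy.groupby("Region")["Revenue"].sum()

print(manual)
print(combined.to_dict())
```

Both approaches yield the same per-region totals; the single-call form is shorter and lets Pandas run the aggregation in optimized code.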
Time Complexity in Pandas
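Unlike native Python lists, Pandas by default stores numeric column data in contiguous, typed arrays (via NumPy). Iterating over a DataFrame with iterrows() is discouraged: it is still O(N), like a vectorized operation, but every step runs in the Python interpreter and constructs a new Series for the row, so the constant factor is very large. Prefer vectorized operations wherever the computation allows it; they run the loop in compiled code and are typically far faster.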
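The difference can be seen in a small side-by-side sketch. The frame below is synthetic (random values generated only for this comparison), and the "profit" calculation is a hypothetical example rather than part of the sales dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data, generated only for this comparison.
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "Revenue": rng.uniform(100, 1000, size=1_000),
    "Margin": rng.uniform(0.05, 0.4, size=1_000),
})

# Row by row: iterrows() builds a new Series object for every row.
profit_loop = [row["Revenue"] * row["Margin"] for _, row in frame.iterrows()]

# Vectorized: one expression evaluated over whole columns in compiled code.
profit_vec = frame["Revenue"] * frame["Margin"]

# Both approaches compute the same values.
print(np.allclose(profit_loop, profit_vec.to_numpy()))
```

To measure the gap on your own machine, wrap each approach in timeit; the exact speedup depends on the data size and the pandas version, so no figure is quoted here.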