
pandas for Backend Developers: Logs, Excel/ERP, ETL, and Reports — Why I Didn't Use It Earlier

Константин Потапов
30 min

Complete guide to using pandas in backend development: log analysis, Excel/ERP integration, ETL pipelines, data validation, and report generation. Practical use cases with code and real-world stories.

I used to analyze logs with grep, awk, and hacky Python scripts. Then I tried pandas — and realized I was wasting hours where I could spend minutes.

Many Python developers think pandas is exclusively for data science. But in practice, pandas solves dozens of backend development tasks: from log analysis to report generation. Let's explore why every Python developer should know pandas and how to apply it in real projects.

For whom: Python developers (junior to mid-level) who want to work effectively with data in backend applications.

What you'll learn:

  • 📊 Understand where pandas is more effective than SQL/ORM
  • 🔍 Learn to analyze logs and metrics
  • 📁 Automate data analysis from Excel/ERP systems
  • ⚙️ Build ETL pipelines for data processing
  • ✅ Set up data validation and cleaning
  • 📄 Generate business reports with margin analysis


Why Do Backend Developers Need pandas?

Typical Situation

# Task: find top-10 endpoints by response time for the last week
# Option 1: Pure Python (15+ lines with loops, dict comprehensions)
# Option 2: SQL query + ORM (need to write migration for analytics table)
# Option 3: pandas (3 lines of code)
 
import pandas as pd
 
df = pd.read_csv('api_logs.csv')
top_slow = df.nlargest(10, 'response_time')[['endpoint', 'response_time']]
print(top_slow)

pandas is a tool for data analysis and transformation. Reach for it when:

  • You need to quickly analyze data without writing SQL queries
  • Data comes from different sources (CSV, JSON, Excel, databases)
  • Complex grouping, aggregation, or transformation is required
  • You need to generate a report for business

Important: pandas doesn't replace SQL/ORM for production database operations. It's a tool for analysis, ETL, and report generation.

Real Case: How I Learned to Value Simple Solutions

Story from Practice

I consulted a retail chain owner on deploying his analytics system. Honestly — I arrived with preconceptions and left with an important lesson.

What I expected to see:

  • Another "hacky" implementation
  • Poorly thought-out architecture
  • Code that "urgently needs rewriting"

What I actually saw:

  • Windows server with Django
  • Excel files with exports from 1C (Russian ERP system)
  • Python scripts with pandas for analyzing sales, inventory, margins
  • Simple result visualization

My first reaction (wrong):

"Why this approach when there's Power BI, Tableau, modern stack with PostgreSQL?"

What I Learned from the Businessman

1. Value of speed over perfection

His system gave answers in 5 minutes. "Proper" BI systems required weeks to set up dashboards. While competitors waited for reports, he was already adjusting purchases.

2. Flexibility matters more than architecture

Changing margin calculation logic — 20 minutes in Jupyter Notebook. In corporate BI — ticket to analytics department, week of waiting.

3. Results matter more than code

The code wasn't perfect. But it answered the question "Which products are unprofitable, and what should we do with them?" Every day. Without failures.

4. Tool knowledge solves problems

The owner wasn't a programmer, but mastered pandas enough to quickly analyze data. This gave him a competitive advantage that money can't buy.

What I Understood

Developer's mistake: thinking about technologies, not the task. Businessman's approach: solve the task as quickly and efficiently as possible.

pandas + Excel isn't a "hack", but a conscious choice of tool for the task:

  • No dependency on analysts
  • Fast hypothesis iteration
  • Low entry barrier for colleagues
  • Results here and now

Conclusion for Developers

If you want to create real value — learn to solve business problems, not just write "correct code".

pandas is your ally in this:

  • Fast data analysis without infrastructure setup
  • Testing hypotheses in minutes, not days
  • Understandable code that others can maintain

Main lesson: A simple solution that works is better than a perfect solution that will never be ready.

Conclusion: pandas isn't just a library for data science. It's a tool for business decision-making, and knowing it makes you a more valuable specialist.

When You DON'T Need pandas

Before diving into examples, it's important to understand pandas' applicability boundaries. This will save you time and avoid production problems.

❌ DON'T use pandas for:

1. Transactional operations in production

# ❌ BAD: writing to DB via pandas
df.to_sql('users', engine, if_exists='append')
# Problems: slow, no transaction support, complex error handling
 
# ✅ GOOD: use ORM or bulk insert
session.bulk_insert_mappings(User, records)

2. Working with data > 1GB in memory

# ❌ BAD: loading 10GB file
df = pd.read_csv('huge_file.csv')  # OOM (Out of Memory)
 
# ✅ GOOD: use chunked reading, Dask, or DuckDB
for chunk in pd.read_csv('huge_file.csv', chunksize=10000):
    process_chunk(chunk)

3. Real-time data processing

# ❌ BAD: pandas in FastAPI endpoint for each request
@app.get("/stats")
def get_stats():
    df = pd.read_sql("SELECT * FROM orders", engine)  # Slow!
    return df.groupby('product').sum().to_dict()
 
# ✅ GOOD: pre-calculate aggregates or use SQL/Redis
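
A minimal sketch of the "good" variant, assuming the same hypothetical orders table: push the aggregation into SQL and cache the small result in memory, so the endpoint never loads raw rows into pandas. The cache helper and the 60-second TTL are illustrative choices, not part of the original example.

import time

from fastapi import FastAPI
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:password@localhost:5432/mydb')
app = FastAPI()

_cache = {'data': None, 'expires': 0.0}

@app.get("/stats")
def get_stats():
    now = time.time()
    if _cache['data'] is None or now > _cache['expires']:
        # Let the database do the heavy lifting; only the tiny result crosses the wire
        query = text("SELECT product, SUM(total) AS revenue FROM orders GROUP BY product")
        with engine.connect() as conn:
            rows = conn.execute(query).mappings().all()
        _cache['data'] = {row['product']: float(row['revenue']) for row in rows}
        _cache['expires'] = now + 60  # refresh at most once a minute
    return _cache['data']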

4. Data requiring ACID guarantees

# ❌ BAD: modifying data via pandas with DB save
df = pd.read_sql("SELECT * FROM accounts", engine)
df.loc[df['id'] == 123, 'balance'] -= 100  # No transaction!
df.to_sql('accounts', engine, if_exists='replace')  # Data loss on error, no audit
 
# ✅ GOOD: use SQL transactions with operation logging
with session.begin():
    # Lock account for update
    account = session.query(Account).filter_by(id=123).with_for_update().first()
    account.balance -= 100
 
    # Record transaction for audit
    transaction = Transaction(
        account_id=123,
        amount=-100,
        type='withdrawal',
        timestamp=datetime.now()
    )
    session.add(transaction)

5. Simple DB operations that SQL solves

# ❌ BAD: loading everything into pandas for simple filtering
df = pd.read_sql("SELECT * FROM users", engine)
active_users = df[df['is_active'] == True]
 
# ✅ GOOD: use SQL
df = pd.read_sql("SELECT * FROM users WHERE is_active = TRUE", engine)

✅ Use pandas for:

  1. Ad-hoc data analysis — quick one-time investigations
    • Example: "Find top-10 products for yesterday" — wrote 5 lines in Jupyter, got the answer (see the sketch after this list)
  2. ETL pipelines — loading from different sources, transformation, export
  3. Report generation — Excel, PDF, charts for business
  4. Data validation — checking CSV before importing to DB
  5. Prototyping — quick hypothesis testing on data
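
For instance, the "top-10 products for yesterday" question from item 1 fits in a few lines (a sketch: the order_items table, its columns, and the existing SQLAlchemy engine are assumptions):

import pandas as pd
from datetime import date, timedelta

# engine is an existing SQLAlchemy engine
df = pd.read_sql("SELECT product_id, quantity, created_at FROM order_items", engine)
df['created_at'] = pd.to_datetime(df['created_at'])

yesterday = date.today() - timedelta(days=1)
top10 = (
    df[df['created_at'].dt.date == yesterday]
    .groupby('product_id')['quantity']
    .sum()
    .nlargest(10)
)
print(top10)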

Golden rule: If data fits in memory (< 50% RAM) and doesn't require ACID — pandas is suitable. If data > 1GB or transactions needed — use SQL/Dask/Spark.
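
If you're unsure whether you're within that budget, a couple of lines give a quick answer (a sketch; psutil is an extra dependency assumed here, not required by pandas):

import pandas as pd
import psutil

df = pd.read_csv('data.csv')
df_mb = df.memory_usage(deep=True).sum() / 1e6          # actual footprint of the loaded frame
available_mb = psutil.virtual_memory().available / 1e6   # free RAM right now
print(f"DataFrame: {df_mb:.0f} MB, available RAM: {available_mb:.0f} MB")

if df_mb > available_mb * 0.5:
    print("Over the 50% budget: consider chunked reading, Polars, or DuckDB")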

pandas Basics: DataFrame and Series

DataFrame — data table

DataFrame is a two-dimensional table with typed columns, analogous to a spreadsheet (Excel) or a SQL table.

import pandas as pd
 
# Creating DataFrame from dict
data = {
    'user_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['Moscow', 'SPb', 'Moscow', 'Kazan']
}
 
df = pd.DataFrame(data)
print(df)
 
#    user_id     name  age     city
# 0        1    Alice   25   Moscow
# 1        2      Bob   30      SPb
# 2        3  Charlie   35   Moscow
# 3        4    Diana   28    Kazan

Series — one column (data vector)

Series is a one-dimensional array with an index. Each DataFrame column is a Series.

# Getting a column
ages = df['age']
print(type(ages))  # <class 'pandas.core.series.Series'>
 
# Operations on Series
print(ages.mean())     # 29.5 (average)
print(ages.max())      # 35
print(ages.std())      # 4.2 (standard deviation)

Basic Operations

# Filtering
moscow_users = df[df['city'] == 'Moscow']
 
# Sorting
sorted_by_age = df.sort_values('age', ascending=False)
 
# Selecting columns
names_and_ages = df[['name', 'age']]
 
# Adding column
df['is_adult'] = df['age'] >= 18
 
# Grouping and aggregation
city_stats = df.groupby('city')['age'].agg(['mean', 'count'])

Common mistake: Confusing .loc[] (access by label) and .iloc[] (access by position). Use .loc[] for label-based selection and boolean filtering, .iloc[] for positional access.
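
A quick way to internalize the difference is a DataFrame with a non-default index (illustrative example):

import pandas as pd

df = pd.DataFrame(
    {'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]},
    index=[10, 20, 30]
)

print(df.loc[10, 'name'])               # 'Alice': row whose index LABEL is 10
print(df.iloc[0, 0])                    # 'Alice': first row, first column by POSITION
print(df.loc[df['age'] >= 30, 'name'])  # boolean filtering: Bob, Charlie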

Use-case 1: Analyzing Logs and Metrics

Task: find problematic endpoints

You have an access.log with a million lines and need to find:

  • Top-10 slowest endpoints
  • Endpoints with high percentage of 5xx errors
  • Response time distribution by hours

Log structure:

2025-12-12 10:15:23 | GET /api/users | 200 | 45ms
2025-12-12 10:15:24 | POST /api/orders | 201 | 120ms
2025-12-12 10:15:25 | GET /api/products | 500 | 2300ms

Solution with pandas

import pandas as pd
from datetime import datetime
 
# 1. Reading logs (assume they're in CSV)
df = pd.read_csv(
    'access.log',
    sep='|',
    names=['timestamp', 'method_path', 'status', 'response_time'],
    parse_dates=['timestamp']  # Automatic date parsing
)
 
# 2. Data cleaning (remove whitespace!)
df['response_time'] = df['response_time'].str.strip().str.replace('ms', '').astype(int)
df[['method', 'endpoint']] = df['method_path'].str.strip().str.split(' ', expand=True)
df['status'] = df['status'].astype(int)
# .str.strip() is critical — logs often contain spaces around separators
 
# 3. Top-10 slowest endpoints
top_slow = (
    df.groupby('endpoint')['response_time']
    .agg(['mean', 'max', 'count'])
    .sort_values('mean', ascending=False)
    .head(10)
)
 
print("Top-10 slowest endpoints:")
print(top_slow)
 
# 4. Endpoints with high percentage of 5xx errors (vectorized!)
error_rate = (
    df.assign(is_5xx=df['status'] >= 500)
    .groupby('endpoint')['is_5xx']
    .mean()
    .mul(100)
    .sort_values(ascending=False)
    .head(10)
)
 
print("\nEndpoints with high % of errors:")
print(error_rate)
 
# 5. Response time distribution by hours
df['hour'] = df['timestamp'].dt.hour
hourly_stats = df.groupby('hour')['response_time'].agg(['mean', 'median', 'count'])
 
print("\nDistribution by hours:")
print(hourly_stats)

Visualization (optional)

import matplotlib.pyplot as plt
 
# Plot average response time by hours
hourly_stats['mean'].plot(kind='line', title='Average Response Time by Hour')
plt.xlabel('Hour of Day')
plt.ylabel('Average Time (ms)')
plt.savefig('response_time_by_hour.png')

Result: Instead of writing complex SQL queries or Python loops, you got a complete analysis in 20 lines of code.

Use-case 2: Analyzing Data from Excel/ERP

Task: sales analysis from ERP export

Typical situation: accounting works in an ERP (1C, SAP, etc.), data is exported to Excel, and the business wants to understand:

  • Which products sell best?
  • What's the margin by categories?
  • Where are we losing money?

Source data: Excel file sales.xlsx with sheets "Sales", "Inventory", "Costs"

import pandas as pd
 
# 1. Loading data from Excel (different sheets)
sales = pd.read_excel('sales.xlsx', sheet_name='Sales')
stock = pd.read_excel('sales.xlsx', sheet_name='Inventory')
costs = pd.read_excel('sales.xlsx', sheet_name='Costs')
 
# 2. Cleaning column names (ERPs love spaces and special characters)
sales.columns = sales.columns.str.strip()
stock.columns = stock.columns.str.strip()
costs.columns = costs.columns.str.strip()
 
# 3. Merging data
# Adding costs to sales
df = sales.merge(
    costs[['SKU', 'Cost']],
    on='SKU',
    how='left'
)
 
# 4. Margin calculation (with division by zero protection)
df['Margin'] = df['Sale Price'] - df['Cost']
df['Margin %'] = (
    df['Margin'] / df['Sale Price'].replace(0, float('nan')) * 100
).round(2)
 
# 5. Analysis: top products by margin
top_margin = (
    df.groupby(['Category', 'Product Name'])
    .agg({
        'Quantity': 'sum',
        'Margin': 'sum',
        'Margin %': 'mean'
    })
    .sort_values('Margin', ascending=False)
    .head(20)
)
 
print("Top-20 products by margin:")
print(top_margin)
 
# 6. Analysis: unprofitable products
unprofitable = df[df['Margin'] < 0].groupby('Category').agg({
    'Product Name': 'count',
    'Margin': 'sum'
})
 
print("\nUnprofitable products by categories:")
print(unprofitable)
 
# 7. Pivot table for management (requires "Date" column in source data!)
pivot = df.pivot_table(
    values=['Quantity', 'Margin'],
    index='Category',
    columns=pd.to_datetime(df['Date']).dt.month,
    aggfunc='sum',
    fill_value=0
)
 
print("\nSales and margin by months:")
print(pivot)
 
# 8. Export results to new Excel
with pd.ExcelWriter('analysis_results.xlsx') as writer:
    top_margin.to_excel(writer, sheet_name='Top Products')
    unprofitable.to_excel(writer, sheet_name='Unprofitable')
    pivot.to_excel(writer, sheet_name='Monthly Dynamics')
 
print("\n✅ Analysis complete: analysis_results.xlsx")

Adding Automation (production-ready version)

Often such analysis needs to be done regularly. Let's create a function with error handling and logging:

import logging
from pathlib import Path
from datetime import datetime
from typing import Optional
import pandas as pd
 
# Logging setup
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
 
class SalesAnalysisError(Exception):
    """Error during sales analysis."""
    pass
 
def analyze_erp_sales(
    file_path: str,
    output_dir: str = 'reports',
    required_sheets: Optional[list] = None
) -> str:
    """
    Automatic sales analysis from ERP export with error handling.
 
    Args:
        file_path: Path to Excel file with ERP export
        output_dir: Directory for saving report
        required_sheets: List of required sheets (default: Sales, Costs)
 
    Returns:
        Path to created report
 
    Raises:
        SalesAnalysisError: On validation or processing errors
    """
    if required_sheets is None:
        required_sheets = ['Sales', 'Costs']
 
    try:
        logger.info(f"Starting analysis of file: {file_path}")
 
        # 1. File existence check
        if not Path(file_path).exists():
            raise SalesAnalysisError(f"File not found: {file_path}")
 
        # 2. Create reports directory
        Path(output_dir).mkdir(parents=True, exist_ok=True)
 
        # 3. Load data with validation
        try:
            excel_file = pd.ExcelFile(file_path)
            available_sheets = excel_file.sheet_names
 
            # Check required sheets presence
            missing_sheets = set(required_sheets) - set(available_sheets)
            if missing_sheets:
                raise SalesAnalysisError(
                    f"Missing required sheets: {missing_sheets}. "
                    f"Available sheets: {available_sheets}"
                )
 
            sales = pd.read_excel(excel_file, sheet_name='Sales')
            costs = pd.read_excel(excel_file, sheet_name='Costs')
 
        except ValueError as e:
            raise SalesAnalysisError(f"Error reading Excel file: {e}")
 
        # 4. Data cleaning: normalize column names before validating them
        # (ERPs love spaces around headers, so strip first)
        sales.columns = sales.columns.str.strip()
        costs.columns = costs.columns.str.strip()

        # 5. Data structure validation
        required_sales_cols = ['SKU', 'Quantity', 'Sale Price', 'Category', 'Product Name']
        required_costs_cols = ['SKU', 'Cost']

        missing_sales_cols = set(required_sales_cols) - set(sales.columns)
        missing_costs_cols = set(required_costs_cols) - set(costs.columns)

        if missing_sales_cols:
            raise SalesAnalysisError(f"'Sales' sheet missing columns: {missing_sales_cols}")
        if missing_costs_cols:
            raise SalesAnalysisError(f"'Costs' sheet missing columns: {missing_costs_cols}")

        logger.info(f"Loaded {len(sales)} sales records, {len(costs)} cost records")

        # Remove empty rows
        sales = sales.dropna(subset=['SKU'])
        costs = costs.dropna(subset=['SKU'])
 
        # 6. Merging and margin calculation
        df = sales.merge(
            costs[['SKU', 'Cost']],
            on='SKU',
            how='left'
        )
 
        # Check for products without costs
        missing_costs = df['Cost'].isna().sum()
        if missing_costs > 0:
            logger.warning(f"Found {missing_costs} products without costs")
            df['Cost'] = df['Cost'].fillna(0)
 
        df['Margin'] = df['Sale Price'] - df['Cost']
        # Division by zero protection
        df['Margin %'] = (
            df['Margin'] / df['Sale Price'].replace(0, float('nan')) * 100
        ).round(2)
 
        # 7. Analysis
        top_margin = df.groupby(['Category', 'Product Name']).agg({
            'Quantity': 'sum',
            'Margin': 'sum',
            'Margin %': 'mean'
        }).sort_values('Margin', ascending=False).head(20)
 
        unprofitable = df[df['Margin'] < 0].groupby('Category').agg({
            'Product Name': 'count',
            'Margin': 'sum'
        })
 
        # 8. Report export
        timestamp = datetime.now().strftime('%Y%m%d_%H%M')
        output_file = Path(output_dir) / f"sales_analysis_{timestamp}.xlsx"
 
        with pd.ExcelWriter(output_file, engine='openpyxl') as writer:
            top_margin.to_excel(writer, sheet_name='Top Products')
 
            if not unprofitable.empty:
                unprofitable.to_excel(writer, sheet_name='Unprofitable')
            else:
                logger.info("No unprofitable products found")
 
            # Summary statistics
            summary = pd.DataFrame({
                'Metric': [
                    'Total Sales',
                    'Total Margin',
                    'Average Margin %',
                    'Unprofitable Products'
                ],
                'Value': [
                    f"{df['Quantity'].sum():,.0f} units",
                    f"${df['Margin'].sum():,.2f}",
                    f"{df['Margin %'].mean():.2f}%",
                    len(df[df['Margin'] < 0])
                ]
            })
            summary.to_excel(writer, sheet_name='Summary', index=False)
 
        logger.info(f"✅ Report successfully created: {output_file}")
        return str(output_file)
 
    except SalesAnalysisError:
        raise
    except Exception as e:
        logger.error(f"Unexpected error during analysis: {e}", exc_info=True)
        raise SalesAnalysisError(f"Data processing error: {e}")
 
# Usage with error handling
try:
    report = analyze_erp_sales('sales.xlsx')
    print(f"✅ Report ready: {report}")
except SalesAnalysisError as e:
    print(f"❌ Error: {e}")
    # Send notification to admin
except Exception as e:
    print(f"❌ Critical error: {e}")
    # Log to Sentry/other monitoring system

Production-ready code: Now the function checks input data, handles errors, logs progress, and properly reports problems. Such code can be used in automated processes.

Practical tip: Such scripts can be wrapped in a simple web interface (Flask/FastAPI) — the director uploads an Excel file and gets a report back. In two hours of work you'll create a tool that saves hours of analyst time every week.
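
A minimal sketch of such a wrapper (FastAPI and python-multipart assumed; it reuses analyze_erp_sales() and SalesAnalysisError from the example above, and the route name is arbitrary):

import shutil
import tempfile

from fastapi import FastAPI, HTTPException, UploadFile
from fastapi.responses import FileResponse

app = FastAPI()

@app.post("/analyze")
def analyze(file: UploadFile):
    # Save the uploaded Excel to a temporary file
    with tempfile.NamedTemporaryFile(suffix=".xlsx", delete=False) as tmp:
        shutil.copyfileobj(file.file, tmp)
        tmp_path = tmp.name

    try:
        report_path = analyze_erp_sales(tmp_path)
    except SalesAnalysisError as e:
        raise HTTPException(status_code=400, detail=str(e))

    # Return the generated report as a download
    return FileResponse(report_path, filename="sales_analysis.xlsx")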

Use-case 3: ETL and Data Pipelines

Task: data synchronization between systems

Often needed:

  • Load data from one DB
  • Transform it (rename columns, change format)
  • Enrich with data from external API
  • Save to another DB or send to another service

Example: exporting users from PostgreSQL to Excel for marketing

import pandas as pd
from sqlalchemy import create_engine
 
# 1. Database connection
engine = create_engine('postgresql://user:password@localhost:5432/mydb')
 
# 2. Data loading
query = """
    SELECT
        u.id,
        u.email,
        u.created_at,
        u.city,
        COUNT(o.id) as orders_count,
        SUM(o.total) as total_spent
    FROM users u
    LEFT JOIN orders o ON u.id = o.user_id
    WHERE u.created_at >= '2025-01-01'
    GROUP BY u.id, u.email, u.created_at, u.city
"""
 
df = pd.read_sql(query, engine)
 
# 3. Data transformation
df['created_at'] = pd.to_datetime(df['created_at'])
df['registration_month'] = df['created_at'].dt.to_period('M')
 
# User categorization (vectorized approach)
import numpy as np
 
# ✅ Using np.select instead of apply() — 10-100x faster!
conditions = [
    df['total_spent'] > 10000,      # Condition for VIP
    df['orders_count'] > 5,          # Condition for Active
    df['orders_count'] > 0           # Condition for Buyer
]
choices = ['VIP', 'Active', 'Buyer']
df['category'] = np.select(conditions, choices, default='Registered')
 
# Why is this faster?
# apply() calls Python function for EACH row (row by row)
# np.select() performs operation over entire array at once in C/NumPy
# On 100k rows: apply() ~2 sec, np.select() ~20 ms
 
# 4. Adding data from external source (e.g., city -> region)
city_to_region = {
    'Moscow': 'Central',
    'SPb': 'Northwest',
    'Kazan': 'Volga'
}
df['region'] = df['city'].map(city_to_region)
 
# 5. Export to Excel with formatting
with pd.ExcelWriter('users_report.xlsx', engine='openpyxl') as writer:
    # Main report
    df.to_excel(writer, sheet_name='Users', index=False)
 
    # Summary table by categories
    summary = df.groupby('category').agg({
        'id': 'count',
        'total_spent': 'sum'
    }).rename(columns={'id': 'users_count'})
 
    summary.to_excel(writer, sheet_name='Summary')

Pipeline for Daily Processing

import pandas as pd
from pathlib import Path
from datetime import datetime, timedelta
 
def daily_etl_pipeline(date: datetime):
    """
    Daily ETL: loads data for a day, processes, saves.
    """
    # 1. Data loading
    raw_data_path = f'raw_data/{date.strftime("%Y-%m-%d")}.csv'
    df = pd.read_csv(raw_data_path)
 
    # 2. Validation
    required_columns = ['user_id', 'event_type', 'timestamp']
    missing = set(required_columns) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
 
    # 3. Cleaning
    df = df.dropna(subset=['user_id'])  # Remove rows without user_id
    df = df.drop_duplicates()           # Remove duplicates
 
    # 4. Type conversion
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['user_id'] = df['user_id'].astype(int)
 
    # 5. Filter anomalies
    df = df[df['timestamp'].dt.date == date.date()]  # Only data for needed day
 
    # 6. Aggregation
    daily_stats = df.groupby('user_id').agg({
        'event_type': 'count',
        'timestamp': ['min', 'max']
    }).reset_index()
 
    # 7. Save result
    output_path = f'processed_data/{date.strftime("%Y-%m-%d")}_stats.parquet'
    daily_stats.to_parquet(output_path, compression='snappy')
 
    return daily_stats
 
# Run pipeline
yesterday = datetime.now() - timedelta(days=1)
stats = daily_etl_pipeline(yesterday)
print(f"Processed {len(stats)} users")

Tip: Use Parquet instead of CSV for storing intermediate results — it's 5-10x faster and takes less space.

Use-case 4: Data Validation and Cleaning

Task: check CSV before importing to DB

User uploads CSV file with clients. Need to:

  • Check for required fields
  • Validate emails and phones
  • Remove duplicates
  • Bring data to unified format

import pandas as pd
import re
 
def validate_and_clean_csv(file_path: str) -> pd.DataFrame:
    """
    Validation and cleaning of CSV with clients.
    """
 
    # 1. Loading
    df = pd.read_csv(file_path)
 
    # 2. Check required columns
    required = ['name', 'email', 'phone']
    missing = set(required) - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
 
    # 3. Remove empty rows
    df = df.dropna(subset=required)
 
    # 4. Email validation (vectorized!)
    email_pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    df['email_valid'] = df['email'].str.match(email_pattern, na=False)
 
    invalid_emails = df[~df['email_valid']]
    if len(invalid_emails) > 0:
        print(f"⚠️ Found {len(invalid_emails)} invalid emails:")
        print(invalid_emails[['name', 'email']].head())
 
    df = df[df['email_valid']].drop('email_valid', axis=1)
 
    # 5. Phone cleaning (bring to unified format)
    def clean_phone(phone):
        # Remove everything except digits
        digits = re.sub(r'\D', '', str(phone))
        # Add country code if needed
        if digits.startswith('8') and len(digits) == 11:
            digits = '1' + digits[1:]
        elif digits.startswith('9') and len(digits) == 10:
            digits = '1' + digits
        return f"+{digits}"
 
    df['phone'] = df['phone'].apply(clean_phone)
 
    # 6. Remove duplicates by email
    duplicates = df[df.duplicated(subset=['email'], keep=False)]
    if len(duplicates) > 0:
        print(f"⚠️ Found {len(duplicates)} duplicate emails")
        print(duplicates[['name', 'email']])
 
    df = df.drop_duplicates(subset=['email'], keep='first')
 
    # 7. Bring name to title case
    df['name'] = df['name'].str.strip().str.title()
 
    return df
 
# Usage
try:
    clean_df = validate_and_clean_csv('clients.csv')
    clean_df.to_csv('clients_clean.csv', index=False)
    print(f"✅ Successfully cleaned {len(clean_df)} records")
except Exception as e:
    print(f"❌ Validation failed: {e}")

Advanced Validation with Report

def create_validation_report(df: pd.DataFrame) -> dict:
    """
    Creates detailed data quality report.
    """
 
    report = {
        'total_rows': len(df),
        'missing_values': df.isnull().sum().to_dict(),
        'duplicates': {
            'email': df['email'].duplicated().sum(),
            'phone': df['phone'].duplicated().sum()
        },
        'data_types': df.dtypes.astype(str).to_dict(),
        'unique_values': {
            col: df[col].nunique()
            for col in df.columns
        }
    }
 
    # Statistics for numeric columns
    numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
    if len(numeric_cols) > 0:
        report['numeric_stats'] = df[numeric_cols].describe().to_dict()
 
    # Top-10 most frequent values for categorical columns
    categorical_cols = df.select_dtypes(include=['object']).columns
    report['top_values'] = {
        col: df[col].value_counts().head(10).to_dict()
        for col in categorical_cols
    }
 
    return report
 
# Generate report
report = create_validation_report(df)
print(f"Total rows: {report['total_rows']}")
print(f"Missing values: {report['missing_values']}")
print(f"Email duplicates: {report['duplicates']['email']}")

Use-case 5: Report Generation

Task: monthly business report in Excel

Business wants to receive Excel file with multiple sheets:

  • Overall statistics for the month
  • Top-10 clients
  • Sales dynamics by days
  • Breakdown by cities

import pandas as pd
from sqlalchemy import create_engine, text
from datetime import datetime
 
def generate_monthly_report(year: int, month: int):
    """
    Generates monthly business report.
    """
 
    engine = create_engine('postgresql://user:password@localhost:5432/mydb')
 
    # 1. Load data for month (safe parameterized query)
    from dateutil.relativedelta import relativedelta
 
    # Create period boundaries
    start_date = datetime(year, month, 1)
    end_date = start_date + relativedelta(months=1)
 
    # ✅ Parameterized query (SQL injection protection)
    query = """
        SELECT
            o.id as order_id,
            o.created_at,
            o.total,
            o.status,
            u.id as user_id,
            u.email,
            u.city
        FROM orders o
        JOIN users u ON o.user_id = u.id
        WHERE o.created_at >= :start_date AND o.created_at < :end_date
    """
 
    df = pd.read_sql(text(query), engine, params={'start_date': start_date, 'end_date': end_date})  # text() enables the :named parameters
    df['created_at'] = pd.to_datetime(df['created_at'])
    df['date'] = df['created_at'].dt.date
 
    # 2. Create Excel with multiple sheets
    output_file = f'monthly_report_{year}_{month:02d}.xlsx'
 
    with pd.ExcelWriter(output_file, engine='openpyxl') as writer:
        # Sheet 1: Overall statistics
        summary = pd.DataFrame({
            'Metric': [
                'Total Orders',
                'Sales Sum',
                'Average Check',
                'Unique Customers',
                'Completed Orders',
                'Cancelled Orders'
            ],
            'Value': [
                len(df),
                f"${df['total'].sum():,.2f}",
                f"${df['total'].mean():,.2f}",
                df['user_id'].nunique(),
                len(df[df['status'] == 'completed']),
                len(df[df['status'] == 'cancelled'])
            ]
        })
        summary.to_excel(writer, sheet_name='Overall Statistics', index=False)
 
        # Sheet 2: Top-10 clients
        top_clients = (
            df.groupby(['user_id', 'email'])
            .agg({
                'order_id': 'count',
                'total': 'sum'
            })
            .rename(columns={
                'order_id': 'Order Count',
                'total': 'Total Spent'
            })
            .sort_values('Total Spent', ascending=False)
            .head(10)
            .reset_index()
        )
        top_clients.to_excel(writer, sheet_name='Top-10 Clients', index=False)
 
        # Sheet 3: Daily dynamics
        daily_sales = (
            df.groupby('date')
            .agg({
                'order_id': 'count',
                'total': 'sum'
            })
            .rename(columns={
                'order_id': 'Orders',
                'total': 'Revenue'
            })
            .reset_index()
        )
        daily_sales.to_excel(writer, sheet_name='Daily Dynamics', index=False)
 
        # Sheet 4: Breakdown by cities
        city_stats = (
            df.groupby('city')
            .agg({
                'order_id': 'count',
                'total': ['sum', 'mean'],
                'user_id': 'nunique'
            })
            .round(2)
        )
        city_stats.columns = ['Orders', 'Revenue', 'Avg Check', 'Customers']
        city_stats.to_excel(writer, sheet_name='By Cities')
 
    print(f"✅ Report saved: {output_file}")
    return output_file
 
# Generate report
report_file = generate_monthly_report(2025, 12)

Excel Formatting with openpyxl

from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment
 
def format_excel_report(file_path: str):
    """
    Adds formatting to Excel report.
    """
 
    wb = load_workbook(file_path)
 
    # Formatting for all sheets
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
 
        # Headers: bold font, gray background
        header_fill = PatternFill(start_color="CCCCCC", end_color="CCCCCC", fill_type="solid")
        header_font = Font(bold=True)
 
        for cell in ws[1]:  # First row (headers)
            cell.fill = header_fill
            cell.font = header_font
            cell.alignment = Alignment(horizontal='center')
 
        # Auto-width columns
        for column in ws.columns:
            max_length = 0
            column_letter = column[0].column_letter
            for cell in column:
                try:
                    if len(str(cell.value)) > max_length:
                        max_length = len(str(cell.value))
                except:
                    pass
            adjusted_width = min(max_length + 2, 50)
            ws.column_dimensions[column_letter].width = adjusted_width
 
    wb.save(file_path)
    print(f"✅ Formatting applied: {file_path}")
 
# Apply formatting
format_excel_report(report_file)

pandas vs SQL/ORM: when to use what?

Use pandas when:

Ad-hoc data analysis

# Quick check: how many users registered in a week?
df = pd.read_sql("SELECT * FROM users WHERE created_at >= NOW() - INTERVAL '7 days'", engine)
print(df.groupby(df['created_at'].dt.date).size())

Data from different sources

# Combining data from DB and CSV
users_db = pd.read_sql("SELECT * FROM users", engine)
purchases_csv = pd.read_csv('external_purchases.csv')
merged = users_db.merge(purchases_csv, on='user_id')

Complex transformations

# Pivot table, aggregation by multiple dimensions
pivot = df.pivot_table(
    values='revenue',
    index='date',
    columns=['city', 'category'],
    aggfunc=['sum', 'mean']
)

Report generation (Excel, PDF)

df.to_excel('report.xlsx', sheet_name='Sales')

Use SQL/ORM when:

Production DB operations

# Writing to DB in production
# ❌ df.to_sql('users', engine)  # Slow for large data
# ✅ Use bulk_insert_mappings or COPY
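
A sketch of the bulk-insert path with SQLAlchemy Core (the users table, its columns, and the connection string are assumptions):

import pandas as pd
from sqlalchemy import MetaData, Table, create_engine, insert

engine = create_engine('postgresql://user:password@localhost:5432/mydb')
users = Table('users', MetaData(), autoload_with=engine)  # reflect the existing table

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'email': ['a@example.com', 'b@example.com']})
records = df.to_dict(orient='records')  # DataFrame -> list of dicts

with engine.begin() as conn:            # single transaction, batched INSERT
    conn.execute(insert(users), records)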

Transactional operations

# ACID required, rollback, constraints
with session.begin():
    session.add(user)
    session.add(order)

Working with large data (> 1GB)

# Processing 10 million rows — use SQL + indexes
# pandas will load everything into memory and may crash

Hybrid Approach (best practice)

# Use SQL for filtering, pandas for analysis
query = """
    SELECT * FROM orders
    WHERE created_at >= '2025-01-01'
    AND status = 'completed'
    LIMIT 100000
"""
df = pd.read_sql(query, engine)  # SQL: filtering
top_products = df.groupby('product_id')['quantity'].sum().nlargest(10)  # pandas: aggregation

pandas Alternatives: when to choose them?

pandas is the de facto standard, but there are modern alternatives with better performance. Here's a comparison:

| Criterion | pandas | Polars | DuckDB | When to choose |
|---|---|---|---|---|
| Speed (small data <100 MB) | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Polars for speed |
| Speed (large data 1-10 GB) | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Polars/DuckDB |
| Memory usage | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Polars more efficient |
| Ecosystem and libraries | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | pandas for compatibility |
| Entry barrier | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | pandas/DuckDB easier |
| SQL-like syntax | ✗ | ✗ | ✅ | DuckDB for SQL fans |
| Lazy evaluation | ✗ | ✅ | ✅ | Polars/DuckDB |
| Team support | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | pandas is standard |

Practical Examples

pandas (classic):

import pandas as pd
 
df = pd.read_csv('data.csv')
result = (
    df.groupby('category')['sales']
    .sum()
    .sort_values(ascending=False)
    .head(10)
)

Polars (speed):

import polars as pl
 
# 5-10x faster than pandas!
df = pl.read_csv('data.csv')
result = (
    df.lazy()  # Lazy execution
    .group_by('category')
    .agg(pl.col('sales').sum())
    .sort('sales', descending=True)
    .limit(10)
    .collect()  # Execution
)

DuckDB (SQL over data):

import duckdb
 
# SQL syntax, works with pandas DataFrame
df = pd.read_csv('data.csv')
result = duckdb.sql("""
    SELECT category, SUM(sales) as total_sales
    FROM df
    GROUP BY category
    ORDER BY total_sales DESC
    LIMIT 10
""").to_df()

When to migrate from pandas?

Stay with pandas if:

  • Working with data < 1GB
  • Need compatibility with libraries (scikit-learn, matplotlib)
  • Team knows pandas

Migrate to Polars if:

  • Need maximum processing speed
  • Working with data 1-50GB
  • Ready for less mature ecosystem

Use DuckDB if:

  • Love SQL and want to use it for analysis
  • Need to work with data > 10GB (out-of-memory processing)
  • Want embedded analytical DB without server

Practical tip: Start with pandas. When you hit performance limits, try Polars or DuckDB. Both interoperate smoothly with pandas (Polars converts to and from pandas DataFrames, DuckDB can query them directly), so migration can be incremental.
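
The interop really is a one-liner in each direction, which is what keeps the migration incremental (sketch; polars and pyarrow assumed installed):

import pandas as pd
import polars as pl

pdf = pd.read_csv('data.csv')
pldf = pl.from_pandas(pdf)                                     # pandas -> Polars for the heavy work
result = pldf.group_by('category').agg(pl.col('sales').sum())
back = result.to_pandas()                                      # Polars -> pandas for matplotlib, scikit-learn, etc.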

Performance and Optimization

1. File Format Choice: benchmarks

Data storage format choice is critical for performance. Here are approximate measurements on a 1GB file (5 million rows, 10 columns):

Important: Numbers depend on SSD/HDD, CPU, data types, and compression level. Use as a guide, not absolute values.

| Format | Read | Write | Disk Size | Compression | Compatibility | When to use |
|---|---|---|---|---|---|---|
| CSV | 12.3 sec | 8.7 sec | 1.0 GB | 1.0x | ⭐⭐⭐⭐⭐ | Data exchange, human readability |
| Parquet (snappy) | 1.4 sec | 2.1 sec | 240 MB | 4.2x | ⭐⭐⭐⭐ | Recommended for intermediate data |
| Parquet (gzip) | 2.8 sec | 15.3 sec | 150 MB | 6.8x | ⭐⭐⭐⭐ | Long-term storage, archives |
| Feather | 0.9 sec | 1.7 sec | 680 MB | 1.5x | ⭐⭐ | Exchange between Python/R |
| HDF5 | 3.2 sec | 4.5 sec | 580 MB | 1.7x | ⭐⭐⭐ | Complex hierarchical data |
| Pickle | 1.1 sec | 0.8 sec | 950 MB | 1.1x | ⭐ | Internal use only |

Recommendations:

# ✅ Parquet (snappy) — golden mean for most tasks
df.to_parquet('data.parquet', compression='snappy', index=False)
df = pd.read_parquet('data.parquet')
 
# ✅ CSV — only for exchange with external systems
df.to_csv('export.csv', index=False)
 
# ❌ Pickle — DON'T use for long-term storage
# Reasons: not guaranteed compatible across pandas versions; unpickling untrusted files can execute arbitrary code

Practical tip: Use Parquet with compression='snappy' for all intermediate results. Gain: 8x faster reading + 4x less space.

2. Use Data Types Efficiently

# ❌ Bad: everything loads as object (strings)
df = pd.read_csv('large_file.csv')
 
# ✅ Good: specify types explicitly
df = pd.read_csv(
    'large_file.csv',
    dtype={
        'user_id': 'int32',  # int32 instead of int64 if values < 2 billion
        'category': 'category',  # Saves memory for repeating strings
        'price': 'float32'
    },
    parse_dates=['created_at']
)
 
# Check memory usage
print(df.memory_usage(deep=True))
 
# Memory saving example
# int64 → int32: 50% memory saved (8 bytes → 4 bytes)
# object → category: 90% saved for repeating values

3. Chunked Reading for Large Files

# ❌ Bad: loads entire file into memory
df = pd.read_csv('huge_file.csv')  # 10GB file — OOM
 
# ✅ Good: process in chunks
chunks = []
for chunk in pd.read_csv('huge_file.csv', chunksize=10000):
    # Process chunk
    filtered = chunk[chunk['status'] == 'active']
    chunks.append(filtered)
 
df = pd.concat(chunks, ignore_index=True)

4. Vectorized operations instead of apply: why is this critical?

What happens under the hood?

❌ apply() with lambda — Python row by row:

# Slow: each row — Python function call
df['price_with_tax'] = df.apply(lambda row: row['price'] * 1.2, axis=1)
 
# What happens:
# for row in df.iterrows():  # Python loop
#     result = lambda(row)    # Python function call
#     # GIL blocks parallelism
#     # Interpreter overhead on each iteration

✅ Vectorized operations — NumPy/C:

# Fast: operation over entire array at once
df['price_with_tax'] = df['price'] * 1.2
 
# What happens:
# Operation delegated to NumPy (written in C)
# Uses SIMD CPU instructions (parallel processing)
# No Python interpreter overhead
# Works 10-100x faster

Practical Comparison

import pandas as pd
import numpy as np
import time
 
# Create test DataFrame (100k rows)
df = pd.DataFrame({
    'price': np.random.randint(100, 10000, 100000),
    'quantity': np.random.randint(1, 100, 100000)
})
 
# Test 1: apply() with lambda
start = time.time()
df['total_apply'] = df.apply(lambda row: row['price'] * row['quantity'], axis=1)
print(f"apply(): {time.time() - start:.3f} sec")  # ~1.2 sec
 
# Test 2: Vectorized operation
start = time.time()
df['total_vector'] = df['price'] * df['quantity']
print(f"vectorized: {time.time() - start:.3f} sec")  # ~0.003 sec
 
# Difference: 400x faster! ⚡

When is apply() acceptable?

# ✅ Acceptable: complex logic that can't be vectorized
def complex_transformation(row):
    # Logic with external API calls, regex, etc.
    if row['status'] == 'pending' and row['amount'] > threshold:
        return external_api_call(row['id'])
    return None
 
df['result'] = df.apply(complex_transformation, axis=1)
 
# But even here it's better:
# 1. First filter needed rows
# 2. Apply apply() only to them
filtered = df[(df['status'] == 'pending') & (df['amount'] > threshold)].copy()  # .copy() avoids SettingWithCopyWarning
filtered['result'] = filtered.apply(complex_transformation, axis=1)

Alternatives to apply() for conditional logic

# ❌ Slow: apply for categorization
df['category'] = df.apply(
    lambda row: 'expensive' if row['price'] > 1000 else 'cheap',
    axis=1
)
 
# ✅ Fast: np.where (for 2 options)
df['category'] = np.where(df['price'] > 1000, 'expensive', 'cheap')
 
# ✅ Fast: np.select (for 3+ options)
conditions = [
    df['price'] > 5000,
    df['price'] > 1000,
    df['price'] > 500
]
choices = ['premium', 'expensive', 'medium']
df['category'] = np.select(conditions, choices, default='cheap')
 
# ✅ Fast: pd.cut (for ranges)
df['price_range'] = pd.cut(
    df['price'],
    bins=[0, 500, 1000, 5000, float('inf')],
    labels=['cheap', 'medium', 'expensive', 'premium']
)

Golden rule: If you see apply() in code — always ask yourself: "Can this be vectorized?". In 90% of cases — yes, and it will give 10-100x performance gain.

Common Mistakes and How to Avoid Them

Even experienced developers step on the same rakes when working with pandas. Let's look at the most critical mistakes.

1. SettingWithCopyWarning — most insidious error

# ❌ ERROR: modifying potential copy
subset = df[df['sales'] > 100]
subset['discount'] = 0.1  # ⚠️ SettingWithCopyWarning!
# Changes may not apply or apply to original
 
# ✅ CORRECT: explicit intent
# Method 1: modify original DataFrame
df.loc[df['sales'] > 100, 'discount'] = 0.1
 
# Method 2: create explicit copy
subset = df[df['sales'] > 100].copy()
subset['discount'] = 0.1

Why does this happen? pandas can't determine if you want to work with a copy or original. Use .loc[] to modify original or .copy() for explicit copy.

2. Ignoring Categorical Data

# ❌ BAD: 1GB memory for repeating strings
df = pd.read_csv('data.csv')
print(df['country'].memory_usage(deep=True))  # 1GB
 
# ✅ GOOD: 100MB memory with category
df['country'] = df['country'].astype('category')
print(df['country'].memory_usage(deep=True))  # 100MB
 
# Savings: 10x for columns with small number of unique values

When to use category (see the helper sketch after this list):

  • Column has < 50% unique values
  • String data (statuses, categories, cities)
  • Repeating values
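
One way to apply these rules systematically is a small helper (a sketch, not a pandas built-in):

import pandas as pd

def downcast_strings(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Convert low-cardinality string columns to 'category' to save memory."""
    df = df.copy()
    for col in df.select_dtypes(include='object').columns:
        if df[col].nunique(dropna=True) / max(len(df), 1) < max_unique_ratio:
            df[col] = df[col].astype('category')
    return df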

3. Reading Entire File Into Memory

# ❌ BAD: crashes on large files
df = pd.read_csv('huge_10gb.csv')  # OutOfMemoryError
 
# ✅ GOOD: process in chunks
result = []
for chunk in pd.read_csv('huge_10gb.csv', chunksize=50000):
    # Filter and process only needed
    filtered = chunk[chunk['status'] == 'active']
    result.append(filtered)
 
df = pd.concat(result, ignore_index=True)

4. Not Specifying dtype When Reading

# ❌ SLOW: pandas guesses types for each column
df = pd.read_csv('data.csv')  # 10 seconds
 
# ✅ FAST: explicitly specify types
dtypes = {
    'id': 'int32',
    'price': 'float32',
    'category': 'category',
    'created_at': 'str'  # Parse separately
}
df = pd.read_csv(
    'data.csv',
    dtype=dtypes,
    parse_dates=['created_at']
)  # 2 seconds
 
# Gain: 5x speed + less memory

5. Using iterrows() for Large Data

# ❌ NIGHTMARE: 1000 rows = 5 seconds
total = 0
for index, row in df.iterrows():
    total += row['price'] * row['quantity']
 
# ✅ EXCELLENT: 1000 rows = 0.001 second
total = (df['price'] * df['quantity']).sum()
 
# Difference: 5000x faster!

Rule: If you're using iterrows() — you're doing something wrong. There's almost always a vectorized solution.

6. Mixing Business Logic and Data Processing

# ❌ BAD: everything in one function
def process_sales(df):
    # 50 lines of unreadable code
    df['margin'] = df['price'] - df['cost']
    df['category'] = df.apply(lambda x: categorize(x), axis=1)
    df['bonus'] = df.apply(lambda x: calculate_bonus(x), axis=1)
    # ... 40 more lines
    return df
 
# ✅ GOOD: separation of concerns
import numpy as np

class SalesProcessor:
    @staticmethod
    def calculate_margin(df):
        """Margin calculation."""
        return df['price'] - df['cost']
 
    @staticmethod
    def categorize_products(df):
        """Product categorization."""
        conditions = [df['margin'] > 100, df['margin'] > 50]
        choices = ['Premium', 'Standard']
        return np.select(conditions, choices, default='Budget')
 
    @staticmethod
    def calculate_bonuses(df):
        """Bonus calculation."""
        return np.where(df['category'] == 'Premium', df['margin'] * 0.1, 0)
 
    def process(self, df):
        """Main processing method."""
        df = df.copy()
        df['margin'] = self.calculate_margin(df)
        df['category'] = self.categorize_products(df)
        df['bonus'] = self.calculate_bonuses(df)
        return df

Practical tip: Each function should do one thing well. This simplifies testing, debugging, and code reuse.

Checklist Before Commit

Check your pandas code for these errors:

  • No SettingWithCopyWarning
  • Categorical types used where needed
  • Large files read in chunks or with explicit dtypes
  • No iterrows() or apply() where vectorization is possible
  • Business logic separated from data processing
  • Error handling for file operations

Type Hints for pandas Code

Type hints make code more understandable and help IDEs find errors. Here are main patterns for pandas:

Basic Annotations

import pandas as pd
from typing import List, Dict, Optional
 
def load_sales_data(file_path: str) -> pd.DataFrame:
    """Loads sales data from CSV."""
 
    return pd.read_csv(file_path)
 
def filter_by_category(
    df: pd.DataFrame,
    categories: List[str]
) -> pd.DataFrame:
    """Filters data by categories."""
 
    return df[df['category'].isin(categories)]
 
def calculate_metrics(df: pd.DataFrame) -> Dict[str, float]:
    """Calculates metrics."""
 
    return {
        'total_sales': float(df['sales'].sum()),
        'avg_price': float(df['price'].mean()),
        'max_quantity': int(df['quantity'].max())
    }

Working with Series

def get_column(df: pd.DataFrame, column: str) -> pd.Series:
    """Returns column as Series."""
 
    return df[column]
 
def process_series(series: pd.Series) -> pd.Series:
    """Processes Series."""
 
    return series.fillna(0).astype(int)

Optional for Possible Errors

def find_product(
    df: pd.DataFrame,
    product_id: int
) -> Optional[pd.Series]:
    """Finds product by ID."""
 
    result = df[df['id'] == product_id]
    if result.empty:
        return None
    return result.iloc[0]

Modern Approach with pandas-stubs

For advanced typing install pandas-stubs:

pip install pandas-stubs

This provides:

  • pandas method autocomplete in IDE
  • Type checking with mypy
  • Better code documentation

from pandas import DataFrame, Series
from typing import Literal
 
def aggregate_data(
    df: DataFrame,
    column: str,
    method: Literal['sum', 'mean', 'count']
) -> float:
    """Aggregates data with type checking."""
 
    if method == 'sum':
        return float(df[column].sum())
    elif method == 'mean':
        return float(df[column].mean())
    else:
        return float(df[column].count())

Practical tip: Start by annotating input and output parameters of functions. This already gives 80% of type hints benefits with minimal effort.

Conclusion

pandas is a must-have tool for Python developers, even if you don't do data science. Main advantages:

  • Development speed: 3 lines instead of 30 lines of pure Python
  • Flexibility: works with any data format (CSV, JSON, Excel, SQL, Parquet)
  • Powerful operations: grouping, aggregation, join, pivot in 1 line of code
  • Integration: easily connects to databases, APIs, files

What to do next:

Right now (5 minutes)

  1. Install pandas: pip install pandas openpyxl sqlalchemy
  2. Take your access.log → copy code from Use-case 1 → get top slow endpoints

This week (1 hour)

  1. Take Excel export from ERP → use Use-case 2 → generate margin report
  2. Replace one apply() with vectorized operation in your code → measure difference

Next month

  1. Create production-ready ETL with error handling and logging (Use-case 3)
  2. Add type hints to your pandas functions
  3. Try Polars or DuckDB on a task where pandas is slow

Main thing: Don't postpone! Take first use case from article, apply to your data right now. In 15 minutes you'll see result that would take hours before.

pandas Implementation Checklist for Your Project

Step 1: Installation

  • pip install pandas openpyxl sqlalchemy — install dependencies
  • Add to requirements.txt or pyproject.toml

Step 2: First Use Case

  • Choose one use-case from article (logs/Excel/validation)
  • Run code on your data
  • Save result in Parquet instead of CSV

Step 3: Production Readiness

  • Add error handling (try/except) to pandas code
  • Add progress logging (logging instead of print)
  • Specify explicit dtypes when reading large files
  • Use chunked reading for files > 1GB

Step 4: Code Quality

  • Replace at least one apply() with vectorized operation
  • Add type hints (pd.DataFrame, pd.Series)
  • Write a minimal test that checks the result schema (see the sketch after this checklist)

Step 5: Automation

  • Wrap script into function with parameters
  • Add to cron/Airflow/CI for regular runs
  • Set up alerts for processing errors

Useful Resources