Analyzing Real Discharge Data¶
Time required: 40-50 minutes
Prerequisites: Module 4a (data downloaded)
What you’ll do: Load, analyze, and visualize real hydrological data
What We’ll Build¶
In this module, you’ll analyze the discharge data you downloaded in Module 4a. We’ll:
Load discharge data from the CSV file
Check data quality (missing values, date ranges)
Calculate hydrological statistics (mean, percentiles, monthly values)
Identify high and low flow events
Create professional hydrograph plots
Generate monthly distribution analysis
Build a flow duration curve (FDC)
This is real water modelling work—the kind of analysis you might do for a project or report.
The Data¶
We’ll use the CAMELS streamflow data you downloaded in Module 4a. The CSV file is in your data/ folder and contains:
Daily discharge values in m³/s
Station metadata (ID and name)
Date information for time series analysis
Step 1: Check Your Data¶
Before we start the analysis, let’s verify you have the data from Module 4a.
Your project should have:
A data/ folder with a CSV file like camels_01013500_discharge.csv

The file contains the columns: date, discharge_m3s, discharge_mm_day, station_id, station_name
Let’s check what files we have:
# Import the path helper we created in Module 4a
# This is saved in src/python_for_water_modellers/paths.py for reuse
from python_for_water_modellers import get_data_path
DATA_PATH = get_data_path()
# List available data files
print(f"Data path: {DATA_PATH}")
print(f"Data path exists: {DATA_PATH.exists()}")
if DATA_PATH.exists():
    data_files = [f.name for f in DATA_PATH.glob('*.csv')]
    print("CSV files in data/ folder:")
    for f in data_files:
        print(f" {f}")

You should see your downloaded CAMELS file listed. Note the filename—you’ll need it in the next step.
Step 2: Load and Explore Your Data¶
Let’s start by loading the data and seeing what we have:
# CONFIGURATION - Update with your filename from Step 1
STATION_ID = '01013500' # Change to your station ID
INPUT_FILE = DATA_PATH / f'camels_{STATION_ID}_discharge.csv'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
print("Loading discharge data...")
print()
# Load the CSV file
df = pd.read_csv(INPUT_FILE)
# Display basic info
print(f"File: {INPUT_FILE}")
print(f"Rows: {len(df)}")
print(f"Columns: {list(df.columns)}")
print()
print("First few rows:")
print(df.head())
print()
print("Data types:")
print(df.dtypes)

What to check:

Do you see the expected columns (date, discharge_m3s, etc.)?
Is the date column recognized as text (object) for now? We’ll convert it next.
Does the data look reasonable?
If everything looks good, let’s continue with the full analysis!
Step 3: Complete Discharge Analysis¶
Now let’s perform a complete discharge analysis. We’ll break it into logical sections that you can run sequentially.
Part 1: Prepare the Data for Analysis¶
Convert dates and set up the time series:
# Configuration
DATE_COLUMN = 'date'
DISCHARGE_COLUMN = 'discharge_m3s'
print("Preparing data for analysis...")
print()
# Convert date column to datetime format
df[DATE_COLUMN] = pd.to_datetime(df[DATE_COLUMN])
# Set date as the index for time-series operations
df = df.set_index(DATE_COLUMN)
# Get station name from the data
station_name = df['station_name'].iloc[0] if 'station_name' in df.columns else "Unknown Station"
# Report what we have
print(f"Station: {station_name}")
print(f"Period: {df.index.min().strftime('%Y-%m-%d')} to {df.index.max().strftime('%Y-%m-%d')}")
print(f"Total records: {len(df)} days")
# Check for missing data
missing_count = df[DISCHARGE_COLUMN].isna().sum()
if missing_count > 0:
    missing_pct = missing_count / len(df) * 100
    print(f"Missing data: {missing_count} days ({missing_pct:.1f}%)")
else:
    print("No missing data")
print()
print("✓ Data prepared and ready for analysis")

💡 Why set the date as index? This enables powerful time-series features like automatic date formatting on plots, easy resampling (daily → monthly), and simple date range selection: df['2022-06':'2022-08']
Part 2: Calculate Flow Statistics¶
Now let’s calculate key hydrological statistics:
# =============================================================================
# CALCULATE FLOW STATISTICS
# =============================================================================
print("Calculating statistics...")
# Get discharge series, excluding missing values
Q = df[DISCHARGE_COLUMN].dropna()
# Calculate key statistics
# Note: Q10/Q90 use exceedance probability convention
# Q10 = flow exceeded 10% of time (high flow)
# Q90 = flow exceeded 90% of time (low flow)
stats = {
'Mean discharge (m³/s)': Q.mean(),
'Median discharge (m³/s)': Q.median(),
'Standard deviation (m³/s)': Q.std(),
'Minimum (m³/s)': Q.min(),
'Maximum (m³/s)': Q.max(),
'Q90 - Low flow threshold (m³/s)': Q.quantile(0.10), # Exceeded 90% of time
'Q10 - High flow threshold (m³/s)': Q.quantile(0.90), # Exceeded 10% of time
}
print()
print("FLOW STATISTICS")
print("-" * 45)
for name, value in stats.items():
    print(f" {name:<35} {value:>8.2f}")
print()

What are Q10 and Q90?
These are flow thresholds commonly used in hydrology, named by exceedance probability:
Q10: Flow exceeded only 10% of the time (high flow indicator)
Q90: Flow exceeded 90% of the time (low flow indicator)
This tutorial uses the exceedance probability convention throughout, which is standard in water resources engineering.
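One subtlety worth a sketch: pandas’ quantile() uses non-exceedance probability, so Q10 in the exceedance convention comes from quantile(0.90), not quantile(0.10). A toy example (values are illustrative):

```python
import pandas as pd

# Ten toy discharge values
Q = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

# Exceedance convention: Q10 is a HIGH-flow threshold, Q90 a LOW-flow one
q10 = Q.quantile(0.90)  # flow exceeded ~10% of the time
q90 = Q.quantile(0.10)  # flow exceeded ~90% of the time

print(round(q10, 2))  # 9.1
print(round(q90, 2))  # 1.9
```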
Part 3: Identify Extreme Events¶
Find high and low flow periods:
# =============================================================================
# IDENTIFY EXTREME EVENTS
# =============================================================================
# Define thresholds using percentiles
threshold_high = Q.quantile(0.90)
threshold_low = Q.quantile(0.10)
# Find days exceeding thresholds
high_flow_days = df[df[DISCHARGE_COLUMN] > threshold_high]
low_flow_days = df[df[DISCHARGE_COLUMN] < threshold_low]
print("EXTREME EVENTS")
print("-" * 45)
# High flow summary
print(f" High flow events (Q > {threshold_high:.1f} m³/s):")
print(f" {len(high_flow_days)} days identified")
if len(high_flow_days) > 0:
    max_day = Q.idxmax()
    max_value = Q.max()
    print(f" Maximum: {max_value:.1f} m³/s on {max_day.strftime('%Y-%m-%d')}")
print()
# Low flow summary
print(f" Low flow events (Q < {threshold_low:.1f} m³/s):")
print(f" {len(low_flow_days)} days identified")
if len(low_flow_days) > 0:
    min_day = Q.idxmin()
    min_value = Q.min()
    print(f" Minimum: {min_value:.1f} m³/s on {min_day.strftime('%Y-%m-%d')}")
print()

Part 4: Monthly Summary¶
Calculate monthly statistics to see seasonal patterns:
# =============================================================================
# MONTHLY SUMMARY
# =============================================================================
# Resample to monthly and calculate statistics
# 'ME' means month end frequency
monthly = df[DISCHARGE_COLUMN].resample('ME').agg(['mean', 'min', 'max', 'std'])
monthly.columns = ['Mean', 'Min', 'Max', 'Std']
# Format the index for display
monthly.index = monthly.index.strftime('%Y-%m')
print("MONTHLY SUMMARY (m³/s)")
print("-" * 55)
print(monthly.round(1).to_string())
print()

Part 5: Create Visualizations¶
Finally, let’s create professional plots:
# =============================================================================
# CREATE VISUALIZATIONS
# =============================================================================
print("Creating visualizations...")
# Create a figure with 2 subplots stacked vertically
fig, axes = plt.subplots(2, 1, figsize=(12, 8))
# --- Plot 1: Hydrograph ---
ax1 = axes[0]
# Plot the discharge time series
ax1.plot(df.index, df[DISCHARGE_COLUMN], 'b-', linewidth=0.8, label='Daily discharge')
# Add mean line
mean_q = stats['Mean discharge (m³/s)']
ax1.axhline(y=mean_q, color='red', linestyle='--', linewidth=1.5,
label=f'Mean ({mean_q:.1f} m³/s)')
# Add shading under the curve
ax1.fill_between(df.index, 0, df[DISCHARGE_COLUMN], alpha=0.3)
# Mark high flow events
ax1.scatter(high_flow_days.index, high_flow_days[DISCHARGE_COLUMN],
color='red', s=20, zorder=5, label='High flow events')
# Labels and formatting
ax1.set_ylabel('Discharge (m³/s)')
ax1.set_title(f'{station_name} - Daily Discharge Hydrograph')
ax1.legend(loc='upper right')
ax1.grid(True, alpha=0.3)
# --- Plot 2: Monthly Box Plot ---
ax2 = axes[1]
# Prepare data for box plot (group by month)
df_temp = df.copy()
df_temp['month'] = df_temp.index.month
# Create box plot data for each month
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
box_data = [df_temp[df_temp['month'] == m][DISCHARGE_COLUMN].dropna().values
for m in range(1, 13)]
# Create the box plot (use tick_labels instead of deprecated labels)
bp = ax2.boxplot(box_data, tick_labels=month_names, patch_artist=True)
# Color the boxes
for patch in bp['boxes']:
    patch.set_facecolor('lightblue')
# Labels and formatting
ax2.set_ylabel('Discharge (m³/s)')
ax2.set_xlabel('Month')
ax2.set_title('Monthly Discharge Distribution')
ax2.grid(True, alpha=0.3, axis='y')
# Adjust layout to prevent overlap
plt.tight_layout()
# Display the plot
plt.show()
print()
print("=" * 60)
print("Analysis complete!")
print("=" * 60)Understanding the Code¶
Let’s review the key Python concepts used:
| Code | What It Does |
|---|---|
| pd.read_csv(file) | Loads a CSV file into a DataFrame |
| pd.to_datetime(column) | Converts text to date format |
| df.set_index(column) | Sets a column as row labels |
| series.mean(), .max(), .min() | Calculate statistics |
| series.quantile(0.90) | Find the 90th percentile |
| df.resample('ME') | Group data by month |
| plt.subplots(2, 1) | Create a figure with 2 plots |
| ax.plot(x, y) | Draw a line plot |
| ax.boxplot(data) | Create a box plot |
Key Design Patterns¶
1. Configuration at the top
INPUT_FILE = 'data.csv'
DISCHARGE_COLUMN = 'discharge_m3s'

Makes adapting the script easy—change settings in one place.
2. Handling missing data
Q = df[DISCHARGE_COLUMN].dropna()

Always check for and handle missing values before calculations.
3. Clear section headers
# =============================================================================
# CALCULATE STATISTICS
# =============================================================================

Makes the code easy to navigate and understand.
Adapting for Your Own Data¶
To use this script with your own discharge data:
1. Check Your Data Format¶
Your CSV should have at least:
A date column (any reasonable format)
A discharge column (numeric values)
2. Update the Configuration¶
INPUT_FILE = 'your_data.csv'
DATE_COLUMN = 'your_date_column_name'
DISCHARGE_COLUMN = 'your_discharge_column_name'
STATION_NAME = 'Your Station Name'

3. Handle Different Date Formats¶
If your dates are in a specific format (e.g., European DD.MM.YYYY):
df[DATE_COLUMN] = pd.to_datetime(df[DATE_COLUMN], format='%d.%m.%Y')

Common format codes:
%Y-%m-%d → 2024-01-15
%d.%m.%Y → 15.01.2024
%d/%m/%Y → 15/01/2024
%m/%d/%Y → 01/15/2024
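If a few rows contain malformed dates, pd.to_datetime raises an error by default; passing errors='coerce' turns bad entries into NaT so you can inspect them instead. A sketch with made-up values:

```python
import pandas as pd

# One deliberately bad entry among European-format dates (illustrative)
raw = pd.Series(['15.01.2024', '16.01.2024', 'not-a-date'])
parsed = pd.to_datetime(raw, format='%d.%m.%Y', errors='coerce')

# The unparseable entry becomes NaT (not-a-time) rather than raising
print(parsed.isna().sum())  # 1
```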
Common Issues with Real Data¶
Real hydrological data often has issues. Here’s how to handle them:
Missing Values¶
# Check how much is missing
missing_pct = df[DISCHARGE_COLUMN].isna().sum() / len(df) * 100
print(f"Missing: {missing_pct:.1f}%")
# Option 1: Drop missing values (what we did)
Q = df[DISCHARGE_COLUMN].dropna()
# Option 2: Interpolate small gaps
df[DISCHARGE_COLUMN] = df[DISCHARGE_COLUMN].interpolate(method='time')

Outliers¶
# Flag potential outliers (values beyond 3 standard deviations)
mean_q = df[DISCHARGE_COLUMN].mean()
std_q = df[DISCHARGE_COLUMN].std()
outliers = df[abs(df[DISCHARGE_COLUMN] - mean_q) > 3 * std_q]
print(f"Potential outliers: {len(outliers)} days")Negative Values¶
# Check for invalid negative discharge
negative = df[df[DISCHARGE_COLUMN] < 0]
if len(negative) > 0:
    print(f"Warning: {len(negative)} negative values found!")

Step 4: Flow Duration Curve¶
A flow duration curve (FDC) is one of the most valuable tools in hydrology. It shows the percentage of time a given discharge is equaled or exceeded.
Why Flow Duration Curves Matter¶
| Application | What the FDC Tells You |
|---|---|
| Water supply | How often can we extract X m³/s? |
| Hydropower | What flow is available 90% of the time? |
| Environmental flows | What’s the natural low-flow regime? |
| Flood risk | How often do extreme flows occur? |
| Catchment comparison | Is this catchment flashy or stable? |
Reading the Curve¶
Steep slope = Flashy catchment (quick response to rainfall, high variability)
Gentle slope = Baseflow-dominated (groundwater fed, stable flows)
Q50 = Median flow (exceeded 50% of the time)
Q90 = Low flow indicator (exceeded 90% of the time)
Q10 = High flow indicator (exceeded only 10% of the time)
# =============================================================================
# CREATE FLOW DURATION CURVE
# =============================================================================
print("Creating Flow Duration Curve...")
# Get discharge values (excluding NaN)
Q = df[DISCHARGE_COLUMN].dropna()
# Sort discharge values from highest to lowest
sorted_q = np.sort(Q)[::-1]
# Calculate exceedance probability for each value
# Exceedance probability = rank / (n + 1) * 100
n = len(sorted_q)
exceedance_prob = np.arange(1, n + 1) / (n + 1) * 100
# Create the plot
fig, ax = plt.subplots(figsize=(10, 6))
# Plot the flow duration curve
ax.plot(exceedance_prob, sorted_q, 'b-', linewidth=1.5)
# Add reference lines for key percentiles
q10 = np.percentile(Q, 90) # Flow exceeded 10% of time
q50 = np.percentile(Q, 50) # Median flow
q90 = np.percentile(Q, 10) # Flow exceeded 90% of time
ax.axhline(y=q10, color='orange', linestyle='--', alpha=0.7, label=f'Q10 = {q10:.1f} m³/s')
ax.axhline(y=q50, color='green', linestyle='--', alpha=0.7, label=f'Q50 = {q50:.1f} m³/s')
ax.axhline(y=q90, color='red', linestyle='--', alpha=0.7, label=f'Q90 = {q90:.1f} m³/s')
# Add vertical reference lines
ax.axvline(x=10, color='orange', linestyle=':', alpha=0.5)
ax.axvline(x=50, color='green', linestyle=':', alpha=0.5)
ax.axvline(x=90, color='red', linestyle=':', alpha=0.5)
# Labels and formatting
ax.set_xlabel('Exceedance Probability (%)', fontsize=12)
ax.set_ylabel('Discharge (m³/s)', fontsize=12)
ax.set_title(f'{station_name} - Flow Duration Curve', fontsize=14)
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)
# Use log scale for y-axis (common for FDCs)
ax.set_yscale('log')
ax.set_xlim(0, 100)
plt.tight_layout()
plt.show()
# Print key values
print()
print("KEY FLOW DURATION VALUES")
print("-" * 45)
print(f" Q10 (exceeded 10% of time): {q10:>8.1f} m³/s (high flow)")
print(f" Q50 (exceeded 50% of time): {q50:>8.1f} m³/s (median)")
print(f" Q90 (exceeded 90% of time): {q90:>8.1f} m³/s (low flow)")
print()
print(f" Flow range ratio (Q10/Q90): {q10/q90:.1f}")
print(" (Higher ratio = more variable/flashy catchment)")Understanding the Code:
Sort descending: np.sort(Q)[::-1] puts the highest flows first
Calculate exceedance: each value’s rank divided by (n + 1) gives its exceedance probability
Log scale: Y-axis uses logarithmic scale to show low flows clearly
Q10/Q90 ratio: A simple metric for flow variability (values >10 indicate flashy catchments)
This tutorial uses the exceedance probability convention: Q10 = flow exceeded 10% of time (high flow), Q90 = flow exceeded 90% of time (low flow).
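To keep the percentile flip in one place, you could wrap the convention in a small helper (a sketch; the function name q_exceed is my own, not a pandas API):

```python
import pandas as pd

def q_exceed(series: pd.Series, exceed_pct: float) -> float:
    """Flow exceeded `exceed_pct` percent of the time (exceedance convention)."""
    return float(series.dropna().quantile(1 - exceed_pct / 100))

# Toy series: 1 to 100 m³/s (illustrative)
Q = pd.Series(range(1, 101), dtype=float)

q10 = q_exceed(Q, 10)  # high-flow threshold, via quantile(0.90)
q90 = q_exceed(Q, 90)  # low-flow threshold, via quantile(0.10)

print(round(q10 / q90, 2))  # flow range ratio
```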
Ideas for Further Extension¶
Now that you have a complete analysis workflow, here are ideas to extend it:
Add More Statistics¶
Baseflow index: Ratio of baseflow to total flow
Flashiness index: How quickly flow changes from day to day
7-day low flow (7Q10): Important for water quality permits
Annual maxima series: For flood frequency analysis
Improve Visualizations¶
Add a rolling mean (e.g., 30-day) to smooth the hydrograph
Create annual hydrographs overlaid for comparison
Add precipitation data as a secondary y-axis
Export plots with your organization’s branding
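The rolling-mean idea takes only a couple of extra lines on the hydrograph; a sketch with synthetic data (the 30-day window is a common but arbitrary choice):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; in a notebook just call plt.show()
import matplotlib.pyplot as plt

# Synthetic daily discharge (illustrative values)
idx = pd.date_range('2021-01-01', '2022-12-31', freq='D')
q = pd.Series(np.random.default_rng(1).gamma(2.0, 5.0, len(idx)), index=idx)

# min_periods=1 avoids a NaN gap at the start of the record
smoothed = q.rolling(window=30, min_periods=1).mean()

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(q.index, q, 'b-', linewidth=0.6, alpha=0.5, label='Daily')
ax.plot(smoothed.index, smoothed, 'r-', linewidth=1.5, label='30-day rolling mean')
ax.set_ylabel('Discharge (m³/s)')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```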
Automate for Multiple Stations¶
Loop through all CSV files in your data folder
Generate a summary table comparing stations
Create a multi-panel figure with all stations
Connect to Your Workflow¶
Export to Excel: df.to_excel('results.xlsx') for reports
Save statistics to JSON: for automated reporting pipelines
Create a reusable function: Package your analysis as a module
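Saving the statistics from Part 2 as JSON takes only a few lines with the standard library (a sketch; the filename and the stand-in dictionary are illustrative):

```python
import json

# Stand-in for the `stats` dictionary computed in Part 2
stats = {'Mean discharge (m³/s)': 12.4, 'Maximum (m³/s)': 150.2}

with open('discharge_stats.json', 'w', encoding='utf-8') as f:
    json.dump(stats, f, indent=2, ensure_ascii=False)

# Round-trip check: read it back
with open('discharge_stats.json', encoding='utf-8') as f:
    loaded = json.load(f)

print(loaded == stats)  # True
```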
Example: Multi-Station Loop¶
from pathlib import Path
# Process all CAMELS files
for csv_file in DATA_PATH.glob('camels_*.csv'):
    df = pd.read_csv(csv_file)
    # ... run your analysis ...
    print(f"Processed: {csv_file.name}")

Troubleshooting¶
“FileNotFoundError: camels_XXXXX_discharge.csv”¶
Make sure you completed Module 4a and downloaded the data
Check you’re in the correct project folder
Verify the file exists in the data/ folder
Update STATION_ID to match your downloaded file
“KeyError: ‘discharge_m3s’”¶
The column name doesn’t match your data
Check your CSV headers: print(df.columns)
Update DISCHARGE_COLUMN to match
Plot looks empty or wrong¶
Check that dates were parsed correctly: print(df.index)
Verify the data range: print(df[DISCHARGE_COLUMN].describe())
Look for all-NaN data in specific periods
“SettingWithCopyWarning”¶
This warning appears when pandas can’t tell whether you’re modifying a view or a copy of a DataFrame. For this workflow it can be ignored—it doesn’t affect your results.
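If you want to silence it properly, take an explicit .copy() before adding columns to a filtered slice, as the boxplot code in Part 5 already does. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'discharge_m3s': [1.0, 2.0, 3.0, 4.0]},
                  index=pd.date_range('2022-01-01', periods=4, freq='D'))

# .copy() makes it explicit that `subset` is independent of `df`
subset = df[df['discharge_m3s'] > 1.5].copy()
subset['month'] = subset.index.month  # safe: we own this copy

print(len(subset))  # 3
```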
Summary¶
In this module, you:
✅ Loaded and prepared time series data with pandas
✅ Calculated key hydrological statistics
✅ Identified extreme flow events
✅ Created professional visualizations (hydrograph + monthly boxplots)
✅ Built a flow duration curve with key percentiles
✅ Learned how to adapt the script for your own data
What You’ve Built¶
You now have a reusable discharge analysis workflow:
Input: CSV with discharge data
↓
Process: Python script
↓
Output: Statistics + Hydrograph + Monthly plots + Flow Duration Curve

This same pattern applies to most water modelling tasks—load data, process it, visualize results.
Next up: Explore additional resources and next steps → Module 5: Resources & Next Steps