
Module 4b: Your First Water Modelling Script

Analyzing Real Discharge Data

Time required: 40-50 minutes
Prerequisites: Module 4a (data downloaded)
What you’ll do: Load, analyze, and visualize real hydrological data

What We’ll Build

In this module, you’ll analyze the discharge data you downloaded in Module 4a. We’ll:

  1. Load discharge data from the CSV file

  2. Check data quality (missing values, date ranges)

  3. Calculate hydrological statistics (mean, percentiles, monthly values)

  4. Identify high and low flow events

  5. Create professional hydrograph plots

  6. Generate monthly distribution analysis

  7. Build a flow duration curve (FDC)

This is real water modelling work—the kind of analysis you might do for a project or report.

The Data

We’ll use the CAMELS streamflow data you downloaded in Module 4a. The CSV file is in your data/ folder and contains:

  • Daily discharge values in m³/s

  • Station metadata (ID and name)

  • Date information for time series analysis

Step 1: Check Your Data

Before we start the analysis, let’s verify you have the data from Module 4a.

Your project should have:

  • A data/ folder with a CSV file like camels_01013500_discharge.csv

  • The file contains columns: date, discharge_m3s, discharge_mm_day, station_id, station_name

Let’s check what files we have:

# Import the path helper we created in Module 4a
# This is saved in src/python_for_water_modellers/paths.py for reuse
from python_for_water_modellers import get_data_path

DATA_PATH = get_data_path()

# List available data files
print(f"Data path: {DATA_PATH}")
print(f"Data path exists: {DATA_PATH.exists()}")
if DATA_PATH.exists():
    data_files = [f.name for f in DATA_PATH.glob('*.csv')]
    print("CSV files in data/ folder:")
    for f in data_files:
        print(f"  {f}")

You should see your downloaded CAMELS file listed. Note the filename—you’ll need it in the next step.

Step 2: Load and Explore Your Data

Let’s start by loading the data and seeing what we have:

# CONFIGURATION - Update with your filename from Step 1
STATION_ID = '01013500'  # Change to your station ID
INPUT_FILE = DATA_PATH / f'camels_{STATION_ID}_discharge.csv'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

print("Loading discharge data...")
print()

# Load the CSV file
df = pd.read_csv(INPUT_FILE)

# Display basic info
print(f"File: {INPUT_FILE}")
print(f"Rows: {len(df)}")
print(f"Columns: {list(df.columns)}")
print()
print("First few rows:")
print(df.head())
print()
print("Data types:")
print(df.dtypes)

What to check:

  • Do you see the expected columns (date, discharge_m3s, etc.)?

  • The date column will appear as text (object) for now—we’ll convert it to datetime next.

  • Does the data look reasonable?

If everything looks good, let’s continue with the full analysis!

Step 3: Complete Discharge Analysis

Now let’s perform a complete discharge analysis. We’ll break it into logical sections that you can run sequentially.

Part 1: Prepare the Data for Analysis

Convert dates and set up the time series:

# Configuration
DATE_COLUMN = 'date'
DISCHARGE_COLUMN = 'discharge_m3s'

print("Preparing data for analysis...")
print()

# Convert date column to datetime format
df[DATE_COLUMN] = pd.to_datetime(df[DATE_COLUMN])

# Set date as the index for time-series operations
df = df.set_index(DATE_COLUMN)

# Get station name from the data
station_name = df['station_name'].iloc[0] if 'station_name' in df.columns else "Unknown Station"

# Report what we have
print(f"Station: {station_name}")
print(f"Period: {df.index.min().strftime('%Y-%m-%d')} to {df.index.max().strftime('%Y-%m-%d')}")
print(f"Total records: {len(df)} days")

# Check for missing data
missing_count = df[DISCHARGE_COLUMN].isna().sum()
if missing_count > 0:
    missing_pct = missing_count / len(df) * 100
    print(f"Missing data: {missing_count} days ({missing_pct:.1f}%)")
else:
    print("No missing data")
print()
print("✓ Data prepared and ready for analysis")

💡 Why set the date as index? This enables powerful time-series features like automatic date formatting on plots, easy resampling (daily → monthly), and simple date range selection: df['2022-06':'2022-08']

Part 2: Calculate Flow Statistics

Now let’s calculate key hydrological statistics:

# =============================================================================
# CALCULATE FLOW STATISTICS
# =============================================================================

print("Calculating statistics...")

# Get discharge series, excluding missing values
Q = df[DISCHARGE_COLUMN].dropna()

# Calculate key statistics
# Note: Q10/Q90 use exceedance probability convention
# Q10 = flow exceeded 10% of time (high flow)
# Q90 = flow exceeded 90% of time (low flow)
stats = {
    'Mean discharge (m³/s)': Q.mean(),
    'Median discharge (m³/s)': Q.median(),
    'Standard deviation (m³/s)': Q.std(),
    'Minimum (m³/s)': Q.min(),
    'Maximum (m³/s)': Q.max(),
    'Q90 - Low flow threshold (m³/s)': Q.quantile(0.10),   # Exceeded 90% of time
    'Q10 - High flow threshold (m³/s)': Q.quantile(0.90),  # Exceeded 10% of time
}

print()
print("FLOW STATISTICS")
print("-" * 45)
for name, value in stats.items():
    print(f"  {name:<35} {value:>8.2f}")
print()

What are Q10 and Q90?

These are flow thresholds commonly used in hydrology, named by exceedance probability:

  • Q10: Flow exceeded only 10% of the time (high flow indicator)

  • Q90: Flow exceeded 90% of the time (low flow indicator)

This tutorial uses the exceedance probability convention throughout, which is standard in water resources engineering.
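One consequence of this convention is easy to miss: Q10 corresponds to the 90th percentile of the data, and Q90 to the 10th. A quick check with synthetic values makes the mapping concrete:

```python
import pandas as pd
import numpy as np

# Values 1..100 m³/s, so percentiles are easy to verify by eye
Q = pd.Series(np.arange(1.0, 101.0))

q10 = Q.quantile(0.90)  # exceeded 10% of the time -> high flow
q90 = Q.quantile(0.10)  # exceeded 90% of the time -> low flow

print(f"Q10 = {q10:.1f} m³/s")  # 90.1
print(f"Q90 = {q90:.1f} m³/s")  # 10.9
```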

Part 3: Identify Extreme Events

Find high and low flow periods:

# =============================================================================
# IDENTIFY EXTREME EVENTS
# =============================================================================

# Define thresholds using percentiles
threshold_high = Q.quantile(0.90)
threshold_low = Q.quantile(0.10)

# Find days exceeding thresholds
high_flow_days = df[df[DISCHARGE_COLUMN] > threshold_high]
low_flow_days = df[df[DISCHARGE_COLUMN] < threshold_low]

print("EXTREME EVENTS")
print("-" * 45)

# High flow summary
print(f"  High flow events (Q > {threshold_high:.1f} m³/s):")
print(f"    {len(high_flow_days)} days identified")
if len(high_flow_days) > 0:
    max_day = Q.idxmax()
    max_value = Q.max()
    print(f"    Maximum: {max_value:.1f} m³/s on {max_day.strftime('%Y-%m-%d')}")

print()

# Low flow summary
print(f"  Low flow events (Q < {threshold_low:.1f} m³/s):")
print(f"    {len(low_flow_days)} days identified")
if len(low_flow_days) > 0:
    min_day = Q.idxmin()
    min_value = Q.min()
    print(f"    Minimum: {min_value:.1f} m³/s on {min_day.strftime('%Y-%m-%d')}")
print()

Part 4: Monthly Summary

Calculate monthly statistics to see seasonal patterns:

# =============================================================================
# MONTHLY SUMMARY
# =============================================================================

# Resample to monthly and calculate statistics
# 'ME' means month end frequency
monthly = df[DISCHARGE_COLUMN].resample('ME').agg(['mean', 'min', 'max', 'std'])
monthly.columns = ['Mean', 'Min', 'Max', 'Std']

# Format the index for display
monthly.index = monthly.index.strftime('%Y-%m')

print("MONTHLY SUMMARY (m³/s)")
print("-" * 55)
print(monthly.round(1).to_string())
print()

Part 5: Create Visualizations

Finally, let’s create professional plots:

# =============================================================================
# CREATE VISUALIZATIONS
# =============================================================================

print("Creating visualizations...")

# Create a figure with 2 subplots stacked vertically
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# --- Plot 1: Hydrograph ---
ax1 = axes[0]

# Plot the discharge time series
ax1.plot(df.index, df[DISCHARGE_COLUMN], 'b-', linewidth=0.8, label='Daily discharge')

# Add mean line
mean_q = stats['Mean discharge (m³/s)']
ax1.axhline(y=mean_q, color='red', linestyle='--', linewidth=1.5,
            label=f'Mean ({mean_q:.1f} m³/s)')

# Add shading under the curve
ax1.fill_between(df.index, 0, df[DISCHARGE_COLUMN], alpha=0.3)

# Mark high flow events
ax1.scatter(high_flow_days.index, high_flow_days[DISCHARGE_COLUMN],
            color='red', s=20, zorder=5, label='High flow events')

# Labels and formatting
ax1.set_ylabel('Discharge (m³/s)')
ax1.set_title(f'{station_name} - Daily Discharge Hydrograph')
ax1.legend(loc='upper right')
ax1.grid(True, alpha=0.3)

# --- Plot 2: Monthly Box Plot ---
ax2 = axes[1]

# Prepare data for box plot (group by month)
df_temp = df.copy()
df_temp['month'] = df_temp.index.month

# Create box plot data for each month
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
box_data = [df_temp[df_temp['month'] == m][DISCHARGE_COLUMN].dropna().values
            for m in range(1, 13)]

# Create the box plot (use tick_labels instead of deprecated labels)
bp = ax2.boxplot(box_data, tick_labels=month_names, patch_artist=True)

# Color the boxes
for patch in bp['boxes']:
    patch.set_facecolor('lightblue')

# Labels and formatting
ax2.set_ylabel('Discharge (m³/s)')
ax2.set_xlabel('Month')
ax2.set_title('Monthly Discharge Distribution')
ax2.grid(True, alpha=0.3, axis='y')

# Adjust layout to prevent overlap
plt.tight_layout()

# Display the plot
plt.show()

print()
print("=" * 60)
print("Analysis complete!")
print("=" * 60)

Understanding the Code

Let’s review the key Python concepts used:

Code                               What It Does
pd.read_csv(file)                  Loads a CSV file into a DataFrame
pd.to_datetime(column)             Converts text to date format
df.set_index(column)               Sets a column as row labels
series.mean(), .max(), .min()      Calculate statistics
series.quantile(0.90)              Finds the 90th percentile
df.resample('ME')                  Groups data into monthly bins
plt.subplots(2, 1)                 Creates a figure with 2 plots
ax.plot(x, y)                      Draws a line plot
ax.boxplot(data)                   Creates a box plot

Key Design Patterns

1. Configuration at the top

INPUT_FILE = 'data.csv'
DISCHARGE_COLUMN = 'discharge_m3s'

Makes adapting the script easy—change settings in one place.

2. Handling missing data

Q = df[DISCHARGE_COLUMN].dropna()

Always check for and handle missing values before calculations.

3. Clear section headers

# =============================================================================
# CALCULATE STATISTICS
# =============================================================================

Makes the code easy to navigate and understand.

Adapting for Your Own Data

To use this script with your own discharge data:

1. Check Your Data Format

Your CSV should have at least:

  • A date column (any reasonable format)

  • A discharge column (numeric values)

2. Update the Configuration

INPUT_FILE = 'your_data.csv'
DATE_COLUMN = 'your_date_column_name'
DISCHARGE_COLUMN = 'your_discharge_column_name'
STATION_NAME = 'Your Station Name'

3. Handle Different Date Formats

If your dates are in a specific format (e.g., European DD.MM.YYYY):

df[DATE_COLUMN] = pd.to_datetime(df[DATE_COLUMN], format='%d.%m.%Y')

Common format codes:

  • %Y-%m-%d → 2024-01-15

  • %d.%m.%Y → 15.01.2024

  • %d/%m/%Y → 15/01/2024

  • %m/%d/%Y → 01/15/2024
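These format codes all describe the same calendar date; a short sketch shows two of them parsing to identical timestamps:

```python
import pandas as pd

# Same calendar date written in two common formats
iso = pd.to_datetime('2024-01-15', format='%Y-%m-%d')
european = pd.to_datetime('15.01.2024', format='%d.%m.%Y')

print(iso == european)  # True: both parse to 2024-01-15
```

If a format string doesn’t match the data, pandas raises an error rather than guessing—usually what you want for quality control.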

Common Issues with Real Data

Real hydrological data often has issues. Here’s how to handle them:

Missing Values

# Check how much is missing
missing_pct = df[DISCHARGE_COLUMN].isna().sum() / len(df) * 100
print(f"Missing: {missing_pct:.1f}%")

# Option 1: Drop missing values (what we did)
Q = df[DISCHARGE_COLUMN].dropna()

# Option 2: Interpolate small gaps
df[DISCHARGE_COLUMN] = df[DISCHARGE_COLUMN].interpolate(method='time')

Outliers

# Flag potential outliers (values beyond 3 standard deviations)
mean_q = df[DISCHARGE_COLUMN].mean()
std_q = df[DISCHARGE_COLUMN].std()
outliers = df[abs(df[DISCHARGE_COLUMN] - mean_q) > 3 * std_q]
print(f"Potential outliers: {len(outliers)} days")

Negative Values

# Check for invalid negative discharge
negative = df[df[DISCHARGE_COLUMN] < 0]
if len(negative) > 0:
    print(f"Warning: {len(negative)} negative values found!")

Step 4: Flow Duration Curve

A flow duration curve (FDC) is one of the most valuable tools in hydrology. It shows the percentage of time a given discharge is equaled or exceeded.

Why Flow Duration Curves Matter

Application              What the FDC Tells You
Water supply             How often can we extract X m³/s?
Hydropower               What flow is available 90% of the time?
Environmental flows      What’s the natural low-flow regime?
Flood risk               How often do extreme flows occur?
Catchment comparison     Is this catchment flashy or stable?

Reading the Curve

  • Steep slope = Flashy catchment (quick response to rainfall, high variability)

  • Gentle slope = Baseflow-dominated (groundwater fed, stable flows)

  • Q50 = Median flow (exceeded 50% of the time)

  • Q90 = Low flow indicator (exceeded 90% of the time)

  • Q10 = High flow indicator (exceeded only 10% of the time)

# =============================================================================
# CREATE FLOW DURATION CURVE
# =============================================================================

print("Creating Flow Duration Curve...")

# Get discharge values (excluding NaN)
Q = df[DISCHARGE_COLUMN].dropna()

# Sort discharge values from highest to lowest
sorted_q = np.sort(Q)[::-1]

# Calculate exceedance probability for each value
# Exceedance probability = rank / (n + 1) * 100
n = len(sorted_q)
exceedance_prob = np.arange(1, n + 1) / (n + 1) * 100

# Create the plot
fig, ax = plt.subplots(figsize=(10, 6))

# Plot the flow duration curve
ax.plot(exceedance_prob, sorted_q, 'b-', linewidth=1.5)

# Add reference lines for key percentiles
q10 = np.percentile(Q, 90)  # Flow exceeded 10% of time
q50 = np.percentile(Q, 50)  # Median flow
q90 = np.percentile(Q, 10)  # Flow exceeded 90% of time

ax.axhline(y=q10, color='orange', linestyle='--', alpha=0.7, label=f'Q10 = {q10:.1f} m³/s')
ax.axhline(y=q50, color='green', linestyle='--', alpha=0.7, label=f'Q50 = {q50:.1f} m³/s')
ax.axhline(y=q90, color='red', linestyle='--', alpha=0.7, label=f'Q90 = {q90:.1f} m³/s')

# Add vertical reference lines
ax.axvline(x=10, color='orange', linestyle=':', alpha=0.5)
ax.axvline(x=50, color='green', linestyle=':', alpha=0.5)
ax.axvline(x=90, color='red', linestyle=':', alpha=0.5)

# Labels and formatting
ax.set_xlabel('Exceedance Probability (%)', fontsize=12)
ax.set_ylabel('Discharge (m³/s)', fontsize=12)
ax.set_title(f'{station_name} - Flow Duration Curve', fontsize=14)
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Use log scale for y-axis (common for FDCs)
ax.set_yscale('log')
ax.set_xlim(0, 100)

plt.tight_layout()
plt.show()

# Print key values
print()
print("KEY FLOW DURATION VALUES")
print("-" * 45)
print(f"  Q10 (exceeded 10% of time):  {q10:>8.1f} m³/s  (high flow)")
print(f"  Q50 (exceeded 50% of time):  {q50:>8.1f} m³/s  (median)")
print(f"  Q90 (exceeded 90% of time):  {q90:>8.1f} m³/s  (low flow)")
print()
print(f"  Flow range ratio (Q10/Q90): {q10/q90:.1f}")
print("  (Higher ratio = more variable/flashy catchment)")

Understanding the Code:

  1. Sort descending: np.sort(Q)[::-1] - Highest flows first

  2. Calculate exceedance: Each value’s rank divided by total count gives probability

  3. Log scale: Y-axis uses logarithmic scale to show low flows clearly

  4. Q10/Q90 ratio: A simple metric for flow variability (values >10 indicate flashy catchments)

This tutorial uses the exceedance probability convention: Q10 = flow exceeded 10% of time (high flow), Q90 = flow exceeded 90% of time (low flow).


Ideas for Further Extension

Now that you have a complete analysis workflow, here are ideas to extend it:

Add More Statistics

  • Baseflow index: Ratio of baseflow to total flow

  • Flashiness index: How quickly flow changes from day to day

  • 7-day low flow (7Q10): Important for water quality permits

  • Annual maxima series: For flood frequency analysis
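Two of these are straightforward to compute from the daily series. Below is a sketch with synthetic data: the 7-day low flow (the minimum of the 7-day rolling mean; the full 7Q10 statistic additionally requires frequency analysis over annual minima) and the Richards–Baker flashiness index (sum of day-to-day flow changes divided by total flow):

```python
import pandas as pd
import numpy as np

# Synthetic daily discharge for one year
dates = pd.date_range('2022-01-01', periods=365, freq='D')
rng = np.random.default_rng(42)
Q = pd.Series(5.0 + rng.gamma(2.0, 2.0, 365), index=dates)

# 7-day low flow: minimum of the 7-day rolling mean
q7_low = Q.rolling(7).mean().min()
print(f"7-day low flow: {q7_low:.2f} m³/s")

# Richards-Baker flashiness index: sum of absolute day-to-day
# changes divided by total flow (higher = flashier catchment)
rb_index = Q.diff().abs().sum() / Q.sum()
print(f"Flashiness index: {rb_index:.3f}")
```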

Improve Visualizations

  • Add a rolling mean (e.g., 30-day) to smooth the hydrograph

  • Create annual hydrographs overlaid for comparison

  • Add precipitation data as a secondary y-axis

  • Export plots with your organization’s branding
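As an example of the first idea, a 30-day rolling mean can be overlaid on the daily hydrograph with one extra line. This sketch uses a synthetic series standing in for df[DISCHARGE_COLUMN]:

```python
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

# Synthetic daily series standing in for the tutorial data
dates = pd.date_range('2022-01-01', periods=365, freq='D')
rng = np.random.default_rng(0)
q = pd.Series(5.0 + rng.gamma(2.0, 2.0, 365), index=dates)

# 30-day centred rolling mean smooths out daily noise
q_smooth = q.rolling(30, center=True).mean()

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(q.index, q, color='lightblue', linewidth=0.8, label='Daily')
ax.plot(q_smooth.index, q_smooth, color='navy', linewidth=2,
        label='30-day rolling mean')
ax.set_ylabel('Discharge (m³/s)')
ax.legend()
plt.tight_layout()
plt.show()
```

With center=True the smoothed curve stays aligned with the events it summarizes, at the cost of NaN values at both ends of the record.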

Automate for Multiple Stations

  • Loop through all CSV files in your data folder

  • Generate a summary table comparing stations

  • Create a multi-panel figure with all stations

Connect to Your Workflow

  • Export to Excel: df.to_excel('results.xlsx') for reports

  • Save statistics to JSON: For automated reporting pipelines

  • Create a reusable function: Package your analysis as a module
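The last two ideas combine naturally: wrap the statistics in a small function that returns plain floats, then dump the result to JSON. A sketch (the function name and keys are illustrative, not from the tutorial):

```python
import json
import pandas as pd
import numpy as np

def summarise_discharge(Q: pd.Series) -> dict:
    """Return key flow statistics as plain floats (JSON-serialisable)."""
    return {
        'mean_m3s': float(Q.mean()),
        'median_m3s': float(Q.median()),
        'q10_m3s': float(Q.quantile(0.90)),  # exceeded 10% of time
        'q90_m3s': float(Q.quantile(0.10)),  # exceeded 90% of time
    }

# Example with synthetic data (values 1..100 m³/s)
Q = pd.Series(np.arange(1.0, 101.0))
stats = summarise_discharge(Q)
with open('flow_stats.json', 'w') as f:
    json.dump(stats, f, indent=2)
print(json.dumps(stats, indent=2))
```

Converting to float matters: NumPy scalar types are not JSON-serialisable by default.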

Example: Multi-Station Loop

from pathlib import Path

# Process all CAMELS files
for csv_file in DATA_PATH.glob('camels_*.csv'):
    df = pd.read_csv(csv_file)
    # ... run your analysis ...
    print(f"Processed: {csv_file.name}")

Troubleshooting

“FileNotFoundError: camels_XXXXX_discharge.csv”

  • Make sure you completed Module 4a and downloaded the data

  • Check you’re in the correct project folder

  • Verify the file exists in the data/ folder

  • Update STATION_ID to match your downloaded file

“KeyError: 'discharge_m3s'”

  • The column name doesn’t match your data

  • Check your CSV headers: print(df.columns)

  • Update DISCHARGE_COLUMN to match

Plot looks empty or wrong

  • Check that dates were parsed correctly: print(df.index)

  • Verify data range: print(df[DISCHARGE_COLUMN].describe())

  • Look for all-NaN data in specific periods

“SettingWithCopyWarning”

This warning can be ignored for now—it doesn’t affect your results. It’s about pandas’ internal data handling.
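If you do want to silence it, the usual cause is assigning to a filtered slice of a DataFrame; taking an explicit copy first avoids the ambiguity (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({'q': [1.0, 2.0, -1.0, 4.0]})

# May trigger SettingWithCopyWarning: assigning to a chained slice
# subset = df[df['q'] > 0]
# subset['q_scaled'] = subset['q'] * 2

# Safe: take an explicit copy before modifying
subset = df[df['q'] > 0].copy()
subset['q_scaled'] = subset['q'] * 2
print(subset)
```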

Summary

In this module, you:

✅ Loaded and prepared time series data with pandas
✅ Calculated key hydrological statistics
✅ Identified extreme flow events
✅ Created professional visualizations (hydrograph + monthly boxplots)
✅ Built a flow duration curve with key percentiles
✅ Learned how to adapt the script for your own data

What You’ve Built

You now have a reusable discharge analysis workflow:

Input: CSV with discharge data
    ↓
Process: Python script
    ↓
Output: Statistics + Hydrograph + Monthly plots + Flow Duration Curve

This same pattern applies to most water modelling tasks—load data, process it, visualize results.


Next up: Explore additional resources and next steps → Module 5: Resources & Next Steps