Finding and Accessing Open Streamflow Data¶
Time required: 25-30 minutes
Prerequisites: Module 3a (Python basics), Module 3b (AI assistance)
What you’ll learn: How to find and download hydrological data programmatically
The Data Challenge¶
Ask any hydrologist about their biggest challenge, and you’ll often hear: “Getting the data.”
Before you can analyze discharge patterns, calibrate a model, or validate flood predictions, you need data. And getting that data is often harder than the analysis itself.
Why Is It So Difficult?¶
| Challenge | Example |
|---|---|
| Different formats | CSV, Excel, XML, proprietary formats, PDF tables |
| Different sources | National agencies, research institutions, utilities |
| Access restrictions | Some data requires registration, payment, or formal requests |
| Quality variations | Gaps, errors, different QA procedures |
| Different conventions | Date formats, time zones, units (m³/s vs. L/s vs. cfs) |
| Documentation gaps | Missing metadata about station location, catchment area |
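Unit mismatches in particular are easy to handle in code once you know the conversion factors. A minimal sketch (the constant and helper names are our own, not from any library; the cfs factor follows from 1 ft = 0.3048 m exactly):

```python
# Converting between common discharge units
CFS_TO_M3S = 0.3048 ** 3          # 1 cfs = 0.028316846592 m³/s

def cfs_to_m3s(q_cfs):
    """Cubic feet per second -> cubic metres per second."""
    return q_cfs * CFS_TO_M3S

def m3s_to_ls(q_m3s):
    """Cubic metres per second -> litres per second."""
    return q_m3s * 1000.0

print(f"100 cfs = {cfs_to_m3s(100):.3f} m³/s")    # ≈ 2.832 m³/s
print(f"2.832 m³/s = {m3s_to_ls(2.832):.0f} L/s")
```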
The Good News¶
The hydrological community has been working to make data more accessible. Several large-scale, open datasets now exist that provide:
Standardized formats
Consistent quality control
Rich metadata (catchment attributes)
Easy programmatic access
In this module, we’ll explore these resources and learn how to access them with Python.
Open Hydrological Data Sources¶
Global and Continental Datasets¶
Caravan 🌍
A global collection of meteorological and hydrological data for large-sample hydrology. Combines data from multiple CAMELS datasets worldwide.
6,830+ catchments globally
Daily streamflow + meteorological forcing
65+ catchment attributes
Citation: Kratzert et al. (2023)
GRDC (Global Runoff Data Centre)
The world’s largest archive of river discharge data.
10,000+ stations worldwide
Long historical records (some > 100 years)
Requires registration for access
GloFAS (Global Flood Awareness System)
Real-time and historical flood forecasting data from ECMWF/Copernicus.
The CAMELS Family¶
CAMELS (Catchment Attributes and MEteorology for Large-sample Studies) is a family of datasets designed for hydrological research. Each regional dataset follows similar standards:
| Dataset | Region | Catchments | Reference |
|---|---|---|---|
| CAMELS | USA | 671 | Newman et al. (2015) |
| CAMELS-CL | Chile | 516 | Alvarez-Garreton et al. (2018) |
| CAMELS-BR | Brazil | 897 | Chagas et al. (2020) |
| CAMELS-GB | Great Britain | 671 | Coxon et al. (2020) |
| CAMELS-AUS | Australia | 222 | Fowler et al. (2021) |
| CAMELS-CH | Switzerland | 331 | Höge et al. (2023) |
| LamaH-CE | Central Europe | 859 | Klingler et al. (2021) |
Why CAMELS matters:
These datasets include not just streamflow, but also catchment characteristics (area, elevation, geology, land use) and meteorological forcing. This makes them ideal for testing models, machine learning, and comparative hydrology.
Practical: Accessing CAMELS Data with Python¶
We’ll use PyGeoHydro, a Python package that provides easy access to various hydrological datasets, including CAMELS-US.
Step 1: Install PyGeoHydro¶
In your project folder, add the package:
uv add pygeohydro

This may take a moment as it installs PyGeoHydro and its dependencies.
Step 2: Explore Available Stations¶
Run the following code to see what’s available in the CAMELS dataset. You can run it directly here in the notebook, or save it as explore_camels.py to run from the terminal with uv run explore_camels.py.
"""Explore CAMELS dataset stations."""
import pygeohydro as gh
print("Loading CAMELS dataset information...")
print()
# Get CAMELS dataset (returns tuple: attributes DataFrame, streamflow Dataset)
camels_attrs, camels_qobs = gh.get_camels()
# Show available information
print(f"Number of stations: {len(camels_attrs)}")
print()
# Show some columns available
print("Available attributes (first 20):")
for i, col in enumerate(camels_attrs.columns[:20]):
    print(f" {col}")
print(f" ... and {len(camels_attrs.columns) - 20} more")
print()
# Show sample of station names
print("Sample stations:")
print(camels_attrs[['gauge_name', 'q_mean']].head(10))

You should see:
671 stations available
Many catchment attributes (area, precipitation, geology, etc.)
Station names and basic information
First run: The first time you run this, PyGeoHydro will download the CAMELS dataset (~140 MB). This is cached locally for future use.
Step 3: Select a Station¶
Let’s find a station with interesting characteristics. The code below filters for medium-sized catchments with reasonable runoff:
"""Select a CAMELS station based on criteria."""
import pygeohydro as gh
# Get CAMELS data
camels_attrs, camels_qobs = gh.get_camels()
print("Finding stations with specific characteristics...")
print()
# Filter for medium-sized catchments with good data
# area_gages2: catchment area in km²
# q_mean: mean daily discharge in mm/day
selected = camels_attrs[
    (camels_attrs['area_gages2'] > 100) &    # At least 100 km²
    (camels_attrs['area_gages2'] < 1000) &   # Not too large
    (camels_attrs['q_mean'] > 1.0)           # Reasonable runoff
].copy()
print(f"Stations matching criteria: {len(selected)}")
print()
# Show top 10 by mean discharge
top_stations = selected.nlargest(10, 'q_mean')
print("Top 10 stations by mean runoff:")
print()
for idx, row in top_stations.iterrows():
    print(f" Station ID: {idx}")
    print(f" Name: {row['gauge_name']}")
    print(f" Area: {row['area_gages2']:.0f} km²")
    print(f" Mean runoff: {row['q_mean']:.2f} mm/day")
    print()

Note down a station ID that looks interesting—we’ll download its data next.
Step 4: Download Streamflow Data¶
Now let’s download actual streamflow data for a selected station. Change STATION_ID to the station you chose, or use the default example:
Note: PyGeoHydro returns the streamflow data as an xarray Dataset. xarray is a Python library for working with multi-dimensional data (like gridded climate data or time series across multiple stations). Think of it as a more powerful version of pandas that can handle dimensions beyond rows and columns. We’ll convert it to a simpler pandas format for our analysis.
"""Download streamflow data from CAMELS."""
import pygeohydro as gh
import pandas as pd
# Configuration - change this to your chosen station
# Note: We use station 01013500 (Fish River) as our example throughout these tutorials.
# While it doesn't match the filter criteria from Step 3 (it's larger than 1000 km²),
# it's a well-documented station with good data quality, making it ideal for learning.
STATION_ID = "01013500" # Example: Fish River near Fort Kent, Maine
print(f"Downloading data for station: {STATION_ID}")
print()
# Get CAMELS data (attributes and streamflow)
camels_attrs, camels_qobs = gh.get_camels()
if STATION_ID not in camels_attrs.index:
    print(f"Error: Station {STATION_ID} not found in CAMELS dataset.")
    print("Check the station ID and try again.")
else:
    station_info = camels_attrs.loc[STATION_ID]
    area_km2 = station_info['area_gages2']
    print(f"Station: {station_info['gauge_name']}")
    print(f"Area: {area_km2:.0f} km²")
    print(f"Mean elevation: {station_info['elev_mean']:.0f} m")
    print()
    # Extract streamflow for this station from the xarray Dataset
    print("Extracting streamflow data...")
    # Get discharge data - raw CAMELS data is in cfs (cubic feet per second)
    Q_cfs = camels_qobs['discharge'].sel(station_id=STATION_ID)
    # Convert to pandas Series
    Q_cfs = Q_cfs.to_series()
    # Convert cfs to m³/s (1 cfs = 0.0283168 m³/s)
    Q_m3s = Q_cfs * 0.0283168
    # Also calculate mm/day for reference
    Q_mm_day = Q_m3s * 86400 / (area_km2 * 1e6) * 1000
    print(f"Retrieved {len(Q_m3s)} daily values")
    print(f"Period: {Q_m3s.index.min().strftime('%Y-%m-%d')} to {Q_m3s.index.max().strftime('%Y-%m-%d')}")
    print(f"Missing values: {Q_m3s.isna().sum()}")
    print()
    print("Basic statistics (m³/s):")
    print(f" Mean: {Q_m3s.mean():.2f}")
    print(f" Min: {Q_m3s.min():.2f}")
    print(f" Max: {Q_m3s.max():.2f}")
    print()

📁 Understanding File Paths¶
Before we save our data, let’s understand how Python finds files on your computer.
What is a File Path?¶
A path tells Python where to find (or save) a file. There are two types:
Absolute Path (Full address from the root of your computer)
'/Users/yourname/Documents/project/data/my_data.csv' # macOS/Linux
'C:\\Users\\yourname\\Documents\\project\\data\\my_data.csv'  # Windows

Relative Path (Address relative to your current location)

'../data/my_data.csv'  # Go up one folder, then into data/

Our Project Structure¶
python-for-water-modellers/ ← Project root
├── data/ ← Data folder
│ └── camels_01013500_discharge.csv
├── tutorials/ ← We are HERE (this notebook)
│ ├── 04a_getting_data.ipynb
│ └── 04b_discharge_analysis.ipynb
└── src/python_for_water_modellers/ ← Reusable utilities
    └── paths.py ← We save our helper here!

When running locally in VS Code, the notebook runs from tutorials/, so we use ../data/ to go up one level.
When running in Binder (cloud), the working directory is the project root, so we use data/ directly.
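The difference between relative and absolute paths is easy to see directly with Python's pathlib (the folder names below are illustrative):

```python
from pathlib import Path

# A relative path is interpreted from the current working directory,
# so the same string points to different files depending on where you run
rel = Path('../data') / 'my_data.csv'
print(f"Relative:  {rel}")
print(f"Resolved:  {rel.resolve()}")            # expanded to a full absolute path
print(f"Absolute?  {rel.is_absolute()}")        # False

# An absolute path is unambiguous regardless of working directory
abs_path = Path.cwd() / 'data' / 'my_data.csv'
print(f"Absolute?  {abs_path.is_absolute()}")   # True
```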
Smart Path Resolution¶
The code below automatically detects which environment you’re in and uses the correct path.
Reusable Code: We’ve saved this function in src/python_for_water_modellers/paths.py so you can import it in other notebooks with from python_for_water_modellers import get_data_path. Here we show the full implementation for learning purposes:
import os
from pathlib import Path
def get_data_path():
    """
    Get the correct path to the data folder.

    Works both locally (VS Code) and in Binder.
    """
    # Check if running in Binder (cloud environment)
    if 'BINDER_REQUEST' in os.environ or 'BINDER_LAUNCH_HOST' in os.environ:
        # Binder: data is at ~/data/ (repo is cloned to home directory)
        data_path = Path.home() / 'data'
    elif Path('../data').exists():
        # Local: notebook is in tutorials/, data is at ../data/
        data_path = Path('../data')
    elif Path('data').exists():
        # Alternative: already at repo root
        data_path = Path('data')
    else:
        # Fallback
        data_path = Path('../data')
    return data_path
# Get the data path for this environment
DATA_PATH = get_data_path()
print(f"Data path: {DATA_PATH}")
print(f"Data path exists: {DATA_PATH.exists()}")

# Save the data using the detected path
output_df = pd.DataFrame({
    'date': Q_m3s.index,
    'discharge_m3s': Q_m3s.values,
    'discharge_mm_day': Q_mm_day.values
})
output_df['station_id'] = STATION_ID
output_df['station_name'] = station_info['gauge_name']
# Use the DATA_PATH we determined above
output_file = DATA_PATH / f'camels_{STATION_ID}_discharge.csv'
output_df.to_csv(output_file, index=False)
print(f"Data saved to: {output_file}")
print()
print("You can now use this file in Module 4b for analysis!")

Step 5: Quick Visualization¶
Let’s make a quick plot to verify the data looks reasonable:
"""Quick visualization of downloaded streamflow data."""
import pandas as pd
import matplotlib.pyplot as plt
# Load the data we just downloaded using DATA_PATH
INPUT_FILE = DATA_PATH / f'camels_{STATION_ID}_discharge.csv'
print(f"Loading: {INPUT_FILE}")
df = pd.read_csv(INPUT_FILE)
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
station_name = df['station_name'].iloc[0]
# Create a simple plot
fig, ax = plt.subplots(figsize=(12, 5))
# Plot 3 years of data for clarity
df_subset = df['2010':'2012']
ax.plot(df_subset.index, df_subset['discharge_m3s'], 'b-', linewidth=0.5)
ax.fill_between(df_subset.index, 0, df_subset['discharge_m3s'], alpha=0.3)
ax.set_xlabel('Date')
ax.set_ylabel('Discharge (m³/s)')
ax.set_title(f'{station_name}\nDaily Discharge 2010-2012')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Plot displayed above!")

What you should see:
A hydrograph showing the seasonal pattern of streamflow
Higher flows in spring (snowmelt) for mountain catchments
Lower flows in late summer/autumn
Individual flood peaks from storm events
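If you want to check the seasonal pattern numerically rather than by eye, a monthly regime curve (mean discharge per calendar month) makes it explicit. A minimal sketch using synthetic data so it runs on its own; with your downloaded data you would group the discharge_m3s column the same way:

```python
import numpy as np
import pandas as pd

# Synthetic daily discharge with a snowmelt-like spring peak,
# standing in for the series downloaded in Step 4
dates = pd.date_range('2010-01-01', '2012-12-31', freq='D')
doy = dates.dayofyear
q = 5 + 20 * np.exp(-((doy - 120) ** 2) / (2 * 30.0 ** 2))  # peak near day 120
df = pd.DataFrame({'discharge_m3s': q}, index=dates)

# Regime curve: mean discharge for each calendar month
regime = df['discharge_m3s'].groupby(df.index.month).mean()
print(regime.round(1))
print(f"Wettest month: {regime.idxmax()}")  # April (4) for this series
```

A strong spring maximum and a late-summer minimum in this curve correspond to the snowmelt signal described above.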
For Swiss-Specific Work: CAMELS-CH¶
If you’re working specifically with Swiss catchments, you’ll want to use CAMELS-CH.
What’s in CAMELS-CH?¶
331 catchments across Switzerland
Daily streamflow data (1981-2020)
211 catchment attributes (topography, climate, geology, land cover, hydrology)
Comprehensive metadata
How to Access¶
CAMELS-CH is available from Zenodo:
Visit: Höge et al. (2023)
Download the dataset files
The data comes as CSV files that you can load with pandas
Example Loading Code¶
import pandas as pd
# After downloading and extracting CAMELS-CH
# Load catchment attributes
attributes = pd.read_csv('CAMELS_CH_catchment_attributes.csv')
# Load streamflow for a specific station
streamflow = pd.read_csv('CAMELS_CH_streamflow_daily.csv')
# Filter for your station of interest
station_id = '2004' # Example: Aare at Ringgenberg
my_data = streamflow[streamflow['station_id'] == station_id]

Citation¶
When using CAMELS-CH in publications, please cite:
Höge, M., Kauzlaric, M., Siber, R., Schönenberger, U., Horton, P., Schwanbeck, J., Floriancic, M. G., Viviroli, D., Wilhelm, S., Sikorska-Senoner, A. E., Addor, N., Brunner, M., Pool, S., Zappa, M., & Fenicia, F. (2023). CAMELS-CH: Hydro-meteorological time series and landscape attributes for 331 catchments in hydrologic Switzerland. Earth System Science Data, 15, 5755–5784.
Data Quality Considerations¶
Whenever you work with hydrological data—whether from CAMELS, national databases, or local sources—always check:
1. Missing Values¶
# Count missing values
missing = df['discharge'].isna().sum()
print(f"Missing: {missing} days ({missing/len(df)*100:.1f}%)")
# Where are they?
missing_periods = df[df['discharge'].isna()]

2. Suspicious Values¶
# Check for zeros (often indicate sensor issues)
zeros = (df['discharge'] == 0).sum()
# Check for negative values (physically impossible)
negatives = (df['discharge'] < 0).sum()
# Check for extreme spikes (potential errors)
mean_q = df['discharge'].mean()
std_q = df['discharge'].std()
extremes = df[df['discharge'] > mean_q + 5*std_q]

3. Time Series Continuity¶
# Check for gaps in the time series
date_diff = df.index.to_series().diff()
gaps = date_diff[date_diff > pd.Timedelta(days=1)]
print(f"Found {len(gaps)} gaps in the time series")

4. Physical Plausibility¶
Is the catchment area correct?
Do peak flows coincide with precipitation events?
Is the seasonal pattern reasonable for the climate?
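The first three checks above can be bundled into a single reusable function. A minimal sketch (quality_report and its 5-sigma spike threshold are our own choices, not a standard API), demonstrated on deliberately flawed synthetic data:

```python
import numpy as np
import pandas as pd

def quality_report(df, col='discharge', spike_sigma=5):
    """Summarize common quality issues in a daily discharge series."""
    q = df[col]
    date_diff = df.index.to_series().diff()
    return {
        'missing': int(q.isna().sum()),
        'zeros': int((q == 0).sum()),
        'negatives': int((q < 0).sum()),
        'spikes': int((q > q.mean() + spike_sigma * q.std()).sum()),
        'gaps': int((date_diff > pd.Timedelta(days=1)).sum()),
    }

# Demo: one NaN, one zero, one negative, one extreme spike,
# and one missing day in the index
vals = [3.0] * 30
vals[2], vals[5], vals[8], vals[20] = np.nan, 0.0, -1.0, 400.0
idx = pd.date_range('2020-01-01', periods=31, freq='D').delete(10)
report = quality_report(pd.DataFrame({'discharge': vals}, index=idx))
print(report)  # {'missing': 1, 'zeros': 1, 'negatives': 1, 'spikes': 1, 'gaps': 1}
```

The physical plausibility questions still need a human: compare the numbers against the station metadata and local precipitation records.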
Summary¶
In this module, you learned:
✅ Why getting hydrological data is challenging
✅ Key open data sources (CAMELS family, GRDC, Caravan)
✅ How to access CAMELS data programmatically with PyGeoHydro
✅ How to explore, select, and download station data
✅ How to save data for offline analysis
✅ Where to find CAMELS-CH for Swiss-specific work
✅ What to check for data quality
Key Takeaways¶
Open datasets exist — You don’t always need to process raw data from scratch
Python makes access easy — A few lines of code can replace hours of manual downloads
Always check quality — Even curated datasets may have issues
Save locally — Once you have good data, save it for reproducibility
What You Created¶
By running the code cells in this notebook, you:
Explored available CAMELS stations
Downloaded streamflow data for your chosen station
Created a CSV file:
camels_XXXXX_discharge.csv in the data/ folder
Generated a preview plot of the hydrograph
Next: Use your downloaded data for a complete analysis → Module 4b: Your First Water Modelling Script
- Höge, M., Kauzlaric, M., Siber, R., Schönenberger, U., Horton, P., Schwanbeck, J., Floriancic, M. G., Viviroli, D., Wilhelm, S., Sikorska-Senoner, A. E., Addor, N., Brunner, M., Pool, S., Zappa, M., & Fenicia, F. (2023). Catchment attributes and hydro-meteorological time series for large-sample studies across hydrologic Switzerland (CAMELS-CH). Zenodo. https://doi.org/10.5281/zenodo.7784633