Prediction Methodology

How NYC Water Check estimates bacteria levels after rain

This page documents the current model, the historical backtests behind it, and the rainfall-source validation work used to decide whether local gauge-derived fields or Open-Meteo should drive the prediction pipeline.

The goal here is not marketing copy. This is a working methods note for a public-facing prediction system built on real historical shoreline bacteria and rainfall data.

Dataset snapshot

Historical paired samples: 14,812
Sites in model set: 155
Best test RMSE: 0.7231
Best threshold accuracy: 73.5%

Overview

Current prediction logic

The production estimate starts with each site's own typical bacteria level, then adjusts that baseline using rainfall features learned from the historical record. It is a pooled citywide model rather than a fully custom coefficient set per site, because fully site-specific fits overfit the data.

Outcome being modeled

The model predicts Enterococcus concentration on the log scale. Each site's median historical log level acts as the baseline, and the rain model predicts the residual above or below that baseline.
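
As a rough illustration of the baseline-plus-residual setup described above, the sketch below computes a per-site median log level and the residual for a single reading. The base-10 logarithm, the function names, and the toy readings are assumptions made for the example; the page does not state which log base the production model uses or how non-detects are handled.

```python
import math
from collections import defaultdict
from statistics import median


def site_baselines(samples):
    """Map each site to the median of its log10 Enterococcus readings.

    samples: iterable of (site, concentration) pairs with concentration > 0.
    log10 is an assumption; the source only says "log scale".
    """
    logs_by_site = defaultdict(list)
    for site, concentration in samples:
        logs_by_site[site].append(math.log10(concentration))
    return {site: median(values) for site, values in logs_by_site.items()}


def residual(site, concentration, baselines):
    """Log-scale difference between one reading and the site's baseline."""
    return math.log10(concentration) - baselines[site]


# Toy readings, not real data.
samples = [("A", 10.0), ("A", 100.0), ("A", 1000.0), ("B", 10.0)]
baselines = site_baselines(samples)
high_reading_residual = residual("A", 1000.0, baselines)
```

The rain model would then be trained to predict values like `high_reading_residual`, and a final estimate is the site baseline plus the predicted residual.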

Core rain windows

The model uses rain during the last 48 hours, the middle of the prior week, the late prior week, and weighted prior-week memory. In backtests we also evaluate longer weekly lag windows (lagWeek1 through lagWeek8).
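
One way the rain windows above could be turned into features is sketched here. The feature names come from the backtest table, but the day boundaries for each window, the exponential `decay` weighting for earlier weeks, and the use of `log1p` as the log transform are all assumptions for illustration, not the documented production definitions.

```python
import math

# Hypothetical window boundaries in days before the sample (0 = sample day).
# Each pair is a half-open slice [start, stop) into the daily rain list.
WINDOWS = {"recentRain": (0, 2), "midRain": (2, 5), "lateRain": (5, 7)}


def rain_features(daily_rain, decay=0.5):
    """Build log-transformed rain features from daily totals in inches.

    daily_rain[i] is the rainfall i days before the sample.
    """
    totals = {
        name: sum(daily_rain[start:stop])
        for name, (start, stop) in WINDOWS.items()
    }
    # Assumed form of "weighted prior-week memory": each earlier week's
    # total is down-weighted by decay ** (weeks back).
    memory = 0.0
    for week, start in enumerate(range(7, len(daily_rain), 7), start=1):
        memory += decay ** week * sum(daily_rain[start:start + 7])
    totals["priorWeeksRainMemory"] = memory
    # log1p keeps dry windows at exactly 0 and compresses large storms.
    return {name: math.log1p(total) for name, total in totals.items()}


# Toy input: 1.5 in over the last 48 hours and 2.0 in eight days ago.
daily_rain = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0,
              2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
features = rain_features(daily_rain)
```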

Model family

Ridge regression with log-transformed rain features. The log transform limits the influence of large storm days, and the ridge penalty keeps the coefficients from swinging too far on sparse patterns.
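
To make the shrinkage effect concrete, here is a minimal pure-Python ridge fit that solves the penalized normal equations directly. It is a sketch under stated assumptions: the function name, the `alpha` value, and the choice to omit an intercept (on the grounds that the target is already a residual around the site baseline) are illustrative, and a production pipeline would more likely use an established library.

```python
def ridge_fit(rows, targets, alpha=1.0):
    """Solve (X^T X + alpha * I) w = X^T y by Gaussian elimination.

    rows: list of feature vectors; targets: list of residuals.
    No intercept is fitted (an assumption of this sketch).
    """
    n = len(rows[0])
    # Build the augmented system [X^T X + alpha * I | X^T y].
    system = []
    for i in range(n):
        row = [sum(r[i] * r[j] for r in rows) for j in range(n)]
        row[i] += alpha
        row.append(sum(r[i] * y for r, y in zip(rows, targets)))
        system.append(row)
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(system[k][col]))
        system[col], system[pivot] = system[pivot], system[col]
        for k in range(col + 1, n):
            factor = system[k][col] / system[col][col]
            for j in range(col, n + 1):
                system[k][j] -= factor * system[col][j]
    # Back substitution.
    weights = [0.0] * n
    for i in reversed(range(n)):
        known = sum(system[i][j] * weights[j] for j in range(i + 1, n))
        weights[i] = (system[i][n] - known) / system[i][i]
    return weights


# Toy data: two orthogonal features.
rows = [[1.0, 0.0], [0.0, 1.0]]
targets = [2.0, 4.0]
unpenalized = ridge_fit(rows, targets, alpha=0.0)
shrunk = ridge_fit(rows, targets, alpha=1.0)
```

On this toy data the penalty halves both coefficients, which is the behavior that keeps a handful of extreme storm samples from dominating the fit.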

Backtest

Historical model performance

These backtests use chronological train/test splits, so the model is always evaluated on samples that come after its training data rather than on shuffled history. Lower RMSE and MAE are better. Threshold accuracy measures whether the model places a sample on the correct side of the 35 MPN/100 mL swim threshold.
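
The evaluation described above can be sketched as follows. The 80/20 split fraction, the dictionary layout of a sample, and the base-10 log are assumptions for the example; only the chronological ordering, the three metrics, and the threshold of 35 come from this page.

```python
import math

# Assumes log10 values and a raw-scale threshold of 35.
THRESHOLD_LOG = math.log10(35.0)


def chronological_split(samples, train_fraction=0.8):
    """Sort by date and hold out the most recent samples as the test set."""
    ordered = sorted(samples, key=lambda s: s["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def evaluate(actual_logs, predicted_logs):
    """Return RMSE, MAE, and threshold accuracy on the log scale."""
    pairs = list(zip(actual_logs, predicted_logs))
    errors = [p - a for a, p in pairs]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    # A prediction counts as correct when it falls on the same side
    # of the threshold as the observed value.
    same_side = sum(
        (a > THRESHOLD_LOG) == (p > THRESHOLD_LOG) for a, p in pairs
    )
    return {
        "rmse": rmse,
        "mae": mae,
        "threshold_accuracy": same_side / len(pairs),
    }


# Toy samples with ISO dates, which sort correctly as strings.
samples = [{"date": d} for d in
           ["2023-07-01", "2021-06-01", "2022-08-15", "2024-05-20", "2020-09-09"]]
train, test = chronological_split(samples)
metrics = evaluate([1.0, 2.0], [1.5, 1.0])
```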

| Candidate | Features | Test RMSE (log) | Test MAE (log) | Threshold accuracy |
|---|---|---|---|---|
| long-memory | recentRain, midRain, lateRain, lagWeek1, lagWeek2, lagWeek3, lagWeek4, lagWeek5, lagWeek6, lagWeek7, lagWeek8, dryWeeks, seasonSin, seasonCos | 0.7231 | 0.5556 | 73.4% |
| 7d-distributed | recentRain, midRain, lateRain | 0.7265 | 0.5571 | 73.4% |
| extended-memory | recentRain, midRain, lateRain, priorWeeksRainMemory, dryWeeks, seasonSin, seasonCos | 0.7273 | 0.5579 | 73.5% |
| 48h | recentRain | 0.7282 | 0.5594 | 73.5% |
| baseline | site baseline only | 0.8009 | 0.6167 | 69.2% |

Best numeric fit

The long-memory variant had the lowest held-out RMSE at 0.7231. That is only about 0.005 below the 48h variant (0.7282), so the extra lag features add little beyond what recent rain already captures.

Best threshold result

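The current production-style rain regression had the best threshold accuracy among the rain-only models we tested, at 73.5%. The spread across rain variants is narrow: the long-memory and 7d-distributed variants scored 73.4%.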

Main modeling lesson

Fully per-site models looked attractive but performed worse overall. Pooled models with site baselines generalize better on future samples.

Decision Metric

What happens if we optimize for the swim threshold directly?

We also tested simpler strategies that predict only the safe / unsafe threshold decision. None of the basic rain-threshold rules outperformed the current regression-style model on the held-out history.

Current production-style regression: 73.5%

Best rain-only threshold performer we tested across the held-out historical set.

48-hour threshold rule: 71.9%

Per-site rain cutoff using only the most recent 48 hours.

Wet / dry bucket rule: 68.4%

Simple bucketed rule based on a rain index rather than a regression.
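
For reference, a per-site rain cutoff like the 48-hour threshold rule above could be chosen along these lines. This is a hedged sketch: the function name, the exhaustive search over observed rain values, and the option of an infinite cutoff for sites that never exceed the threshold are assumptions, not a description of the rule actually tested.

```python
def best_cutoff(rain_48h, exceeded):
    """Pick the 48-hour rain cutoff that best separates exceedances.

    rain_48h: training rain totals for one site, in inches.
    exceeded: matching booleans, True when the sample was over threshold.
    The rule predicts "unsafe" when rain >= cutoff.
    """
    # An infinite cutoff means "never predict unsafe" for this site.
    candidates = sorted(set(rain_48h)) + [float("inf")]
    best, best_correct = None, -1
    for cutoff in candidates:
        correct = sum(
            (rain >= cutoff) == flag for rain, flag in zip(rain_48h, exceeded)
        )
        if correct > best_correct:
            best, best_correct = cutoff, correct
    return best


# Toy training data for one site.
cutoff = best_cutoff([0.0, 0.1, 0.5, 1.0], [False, False, True, True])
never_unsafe = best_cutoff([0.0, 0.1, 0.5, 1.0], [False, False, False, False])
```

The cutoff should be fitted on the training portion of the chronological split only, then scored on the held-out samples, so that it is comparable with the regression results.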

Rainfall Inputs

Which rainfall source appears more trustworthy?

We compared the stored site rain fields against the assigned NOAA daily station records and did the same for Open-Meteo historical forecast data. The validation window covers March 23, 2021 onward.

| Source | Compared rows | Correlation to NOAA | MAE | RMSE | Within 0.10 in |
|---|---|---|---|---|---|
| Stored gauge-derived rain fields | 30,469 | 0.505 | 0.138 in | 0.418 in | 76.7% |
| Open-Meteo historical forecast | 30,469 | 0.321 | 0.167 in | 0.502 in | 75.0% |

Result: the stored gauge-derived fields are materially closer to NOAA station observations overall, so the evidence does not support replacing the entire rainfall source with Open-Meteo. Site by site, the stored fields were closer to NOAA at 69 sites, versus 37 sites where Open-Meteo was closer.
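
The four comparison metrics in the table above can be computed as in this sketch. The function name and the toy series are assumptions; the 0.10 in tolerance matches the table column. It assumes the two series are already aligned row by row and that neither is constant.

```python
import math


def compare_to_reference(candidate, reference, tolerance=0.10):
    """Compare a rainfall source against reference values, in inches."""
    n = len(reference)
    diffs = [c - r for c, r in zip(candidate, reference)]
    mean_c = sum(candidate) / n
    mean_r = sum(reference) / n
    cov = sum((c - mean_c) * (r - mean_r) for c, r in zip(candidate, reference))
    var_c = sum((c - mean_c) ** 2 for c in candidate)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    return {
        # Pearson correlation; undefined if either series is constant.
        "correlation": cov / math.sqrt(var_c * var_r),
        "mae": sum(abs(d) for d in diffs) / n,
        "rmse": math.sqrt(sum(d * d for d in diffs) / n),
        "within_tolerance": sum(abs(d) <= tolerance for d in diffs) / n,
    }


# Toy series: the candidate reports exactly double the reference.
result = compare_to_reference([0.0, 1.0, 2.0], [0.0, 0.5, 1.0])
```

The toy case shows why correlation alone is not enough: a source can track the reference perfectly in shape while still being far off in magnitude, which is what MAE, RMSE, and the within-tolerance share capture.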

Outlier Review

Where the rain fields look suspect

The main risk is not that the entire stored rainfall source is bad. The risk is that certain sample-date clusters appear to contain bulk-filled or misaligned values that disagree with the assigned NOAA station records.

September 2, 2021: Flagged anomaly

All 51 samples shared `precipitationPreviousDay = 7.13 in`, which disagrees sharply with assigned NOAA stations.

June 12, 2025: Flagged anomaly

All 90 samples shared the same full rain vector, including `precipitationPreviousSat = 0.005 in`, despite station disagreement.

July 27, 2023: Flagged anomaly

Raritan sites all shared `precipitationPreviousTue = 0.28 in`, which needs manual review.

Operational conclusion: keep the stored gauge-derived rainfall as the primary source, but add targeted QA and overrides for suspicious sample-date + field combinations rather than trusting every imported row equally.
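
A simple heuristic for surfacing clusters like the ones flagged above is to look for sample dates where many rows share one identical nonzero value for a rain field. The sketch below is an assumption-laden starting point: the `sampleDate` key, the `min_samples` cutoff, and the rule itself are hypothetical, and a shared value can be legitimate when many sites map to one gauge, so hits still need a comparison against NOAA and manual review.

```python
from collections import defaultdict


def flag_shared_values(rows, field, min_samples=10):
    """Flag dates where every sample has the same nonzero value for field.

    rows: list of dicts with a "sampleDate" key (hypothetical name)
    and the rain field being checked.
    Returns (date, field, shared value, sample count) tuples.
    """
    values_by_date = defaultdict(list)
    for row in rows:
        values_by_date[row["sampleDate"]].append(row[field])
    flagged = []
    for date, values in sorted(values_by_date.items()):
        # Identical zeros are expected on dry days, so only nonzero
        # shared values are treated as suspicious.
        if len(values) >= min_samples and len(set(values)) == 1 and values[0] > 0:
            flagged.append((date, field, values[0], len(values)))
    return flagged


# Toy rows: one date with a shared value, one with varied values.
field = "precipitationPreviousDay"
rows = [{"sampleDate": "2000-01-01", field: 9.99} for _ in range(12)]
rows += [{"sampleDate": "2000-01-02", field: 0.1 * i} for i in range(12)]
suspicious = flag_shared_values(rows, field)
```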

Next Steps

What would make the prediction system stronger?

The current rain-only model is useful but not definitive. The most promising next gains are better data QA and richer predictors, not just more rain windows.

Repair obvious rain outliers

Build a blacklist or override table for suspicious sample-date rainfall values and replace them with NOAA station observations where possible.
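
An override table of this kind could be applied as in the sketch below. The row keys (`site`, `sampleDate`), the shape of the lookup tables, and every value in the usage example are hypothetical placeholders rather than real records.

```python
def apply_overrides(row, flagged, noaa_values):
    """Return a copy of row with flagged rain fields replaced.

    flagged: set of (sample_date, field) pairs marked suspicious.
    noaa_values: dict keyed by (site, sample_date, field).
    Fields with no NOAA value available are left unchanged.
    """
    fixed = dict(row)
    for field in row:
        if (row["sampleDate"], field) in flagged:
            key = (row["site"], row["sampleDate"], field)
            replacement = noaa_values.get(key)
            if replacement is not None:
                fixed[field] = replacement
    return fixed


# Placeholder values for illustration only.
row = {
    "site": "site-1",
    "sampleDate": "2000-01-01",
    "precipitationPreviousDay": 9.99,
    "precipitationPreviousTue": 0.2,
}
flagged = {("2000-01-01", "precipitationPreviousDay")}
noaa_values = {("site-1", "2000-01-01", "precipitationPreviousDay"): 0.4}
fixed = apply_overrides(row, flagged, noaa_values)
```

Returning a copy rather than editing the row in place keeps the imported value available for audit, which matters if an override later turns out to be wrong.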

Train directly for the threshold

If the product goal is a safe / unsafe call, the next model should optimize that decision directly rather than optimizing only numeric bacteria error.

Add more physical context

Tide timing, recent prior bacteria readings, and station-level weather context are the likeliest variables to improve the model beyond the current rain-only ceiling.