Prediction Methodology
This page documents the current model, the historical backtests behind it, and the rainfall-source validation work used to decide whether local gauge-derived fields or Open-Meteo should drive the prediction pipeline.
The goal here is not marketing copy. This is a working methods note for a public-facing prediction system built on real historical shoreline bacteria and rainfall data.
Overview
The production estimate starts with each site's own typical bacteria level, then adjusts that baseline using rainfall features learned from the historical record. It is a pooled citywide model rather than a fully custom coefficient set per site, because fully site-specific fits overfit the data.
The outcome being modeled is Enterococcus concentration on the log scale. The site's median historical log level acts as the baseline, and the rain model predicts the residual above or below that baseline.
The model uses rain during the last 48 hours, the middle of the prior week, the late prior week, and weighted prior-week memory. We also evaluate longer weekly lag windows.
The model is a ridge regression on log-transformed rain features. The log transform reduces instability from large storm days, and the ridge penalty keeps the coefficients from swinging too far on sparse patterns.
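As a rough illustration of the structure described above (site median baseline plus a pooled ridge fit on log-transformed rain), here is a minimal standard-library sketch. It is not the production code. The record layout (`site`, `enterococcus`), the base-10 log, the `log1p` rain transform, the lack of an intercept, and the `alpha` penalty value are all assumptions made for the example.

```python
import math
from collections import defaultdict
from statistics import median

def site_baselines(samples):
    """Median historical log10 concentration per site (log base is an assumption)."""
    by_site = defaultdict(list)
    for s in samples:
        by_site[s["site"]].append(math.log10(s["enterococcus"]))
    return {site: median(vals) for site, vals in by_site.items()}

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ridge(samples, baselines, feature_names, alpha=1.0):
    """Fit one pooled coefficient per rain feature on residuals from the site baseline."""
    X = [[math.log1p(s[f]) for f in feature_names] for s in samples]
    y = [math.log10(s["enterococcus"]) - baselines[s["site"]] for s in samples]
    k = len(feature_names)
    # Ridge normal equations: (X^T X + alpha * I) w = X^T y
    xtx = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

def predict_log(sample, baselines, feature_names, weights):
    """Site baseline plus the rain-driven residual, on the log10 scale."""
    rain_term = sum(w * math.log1p(sample[f]) for f, w in zip(feature_names, weights))
    return baselines[sample["site"]] + rain_term
```

With no rain in any window, every `log1p` term is zero and the prediction falls back to the site's own baseline, which matches the behavior described in the overview.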
Backtest
These backtests use chronological train/test splits so the model is always evaluated on future samples, not shuffled history. Lower RMSE and MAE are better. Threshold accuracy is the share of held-out samples that the model places on the correct side of the 35 MPN/100 mL swim threshold.
| Candidate | Features | Test RMSE (log) | Test MAE (log) | Threshold accuracy |
|---|---|---|---|---|
| long-memory | recentRain, midRain, lateRain, lagWeek1, lagWeek2, lagWeek3, lagWeek4, lagWeek5, lagWeek6, lagWeek7, lagWeek8, dryWeeks, seasonSin, seasonCos | 0.7231 | 0.5556 | 73.4% |
| 7d-distributed | recentRain, midRain, lateRain | 0.7265 | 0.5571 | 73.4% |
| extended-memory | recentRain, midRain, lateRain, priorWeeksRainMemory, dryWeeks, seasonSin, seasonCos | 0.7273 | 0.5579 | 73.5% |
| 48h | recentRain | 0.7282 | 0.5594 | 73.5% |
| baseline | site baseline only | 0.8009 | 0.6167 | 69.2% |
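The long-memory variant had the lowest held-out RMSE at 0.7231, but that is only a modest gain over the simpler pooled variants: about 0.005 log units better than the 48h model (0.7282), with no improvement in threshold accuracy.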
The current production-style rain regression remained the best rain-only threshold model we tested at 73.5%.
Fully per-site models looked attractive but performed worse overall. Pooled models with site baselines generalize better on future samples.
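The backtest procedure described in this section can be sketched as follows. This is an illustrative outline, not the actual evaluation script. The `date` field (assumed to be an ISO-formatted string so that sorting is chronological), the 20% hold-out fraction, and the base-10 log are assumptions.

```python
import math

THRESHOLD_MPN = 35.0  # swim threshold quoted in the text

def chronological_split(samples, test_fraction=0.2):
    """Sort by date and hold out the most recent fraction as the test set."""
    ordered = sorted(samples, key=lambda s: s["date"])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]

def evaluate(actual_log, predicted_log, threshold=THRESHOLD_MPN):
    """Return RMSE and MAE on the log scale plus threshold accuracy."""
    n = len(actual_log)
    errors = [p - a for a, p in zip(actual_log, predicted_log)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    cut = math.log10(threshold)
    # A hit means prediction and observation fall on the same side of the threshold.
    hits = sum((a > cut) == (p > cut) for a, p in zip(actual_log, predicted_log))
    return {"rmse": rmse, "mae": mae, "threshold_accuracy": hits / n}
```

Because the split is by date rather than at random, every test sample is later than every training sample, which is the property the backtests rely on.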
Decision Metric
We also tested simpler strategies that predict only the safe / unsafe threshold decision. None of the basic rain-threshold rules outperformed the current regression-style model on the held-out history.
Best rain-only threshold performer we tested across the held-out historical set.
Per-site rain cutoff using only the most recent 48 hours.
Simple bucketed rule based on a rain index rather than a regression.
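For comparison with the regression, a per-site rain cutoff rule of the kind listed above might look like the sketch below. The candidate cutoffs, the `unsafe` boolean field, and the fallback cutoff for unseen sites are hypothetical; only `recentRain` (the 48-hour rain feature) comes from the backtest table.

```python
from collections import defaultdict

def fit_site_cutoffs(samples, candidates=(0.1, 0.25, 0.5, 1.0)):
    """Pick, per site, the 48-hour rain cutoff (inches) with the best training accuracy."""
    by_site = defaultdict(list)
    for s in samples:
        by_site[s["site"]].append(s)
    cutoffs = {}
    for site, rows in by_site.items():
        def accuracy(cutoff, rows=rows):
            # Predict unsafe whenever recent rain reaches the cutoff.
            return sum((r["recentRain"] >= cutoff) == r["unsafe"] for r in rows) / len(rows)
        cutoffs[site] = max(candidates, key=accuracy)
    return cutoffs

def predict_unsafe(sample, cutoffs, default_cutoff=0.5):
    """Apply the site's cutoff, falling back to a default for unseen sites."""
    return sample["recentRain"] >= cutoffs.get(sample["site"], default_cutoff)
```

A rule like this has one free parameter per site, so sites with few samples can end up with a poorly supported cutoff. That is consistent with the finding above that pooled models generalized better.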
Rainfall Inputs
We compared the stored site rain fields against the assigned NOAA daily station records and did the same for Open-Meteo historical forecast data. The validation window covers March 23, 2021 onward.
| Source | Compared rows | Correlation to NOAA | MAE | RMSE | Within 0.10 in |
|---|---|---|---|---|---|
| Stored gauge-derived rain fields | 30,469 | 0.505 | 0.138 in | 0.418 in | 76.7% |
| Open-Meteo historical forecast | 30,469 | 0.321 | 0.167 in | 0.502 in | 75.0% |
Result: the stored gauge-derived fields are materially closer to NOAA station observations overall (correlation 0.505 versus 0.321, MAE 0.138 in versus 0.167 in), so the evidence does not support replacing the entire rainfall source with Open-Meteo. Site by site, the stored fields were the closer match at 69 sites and Open-Meteo at 37.
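The four agreement columns in the table can be computed per source along the following lines. This sketch assumes the correlation is a Pearson correlation and that each source value has already been paired with the NOAA value for the same site and day; neither detail is stated above.

```python
import math

def compare_to_reference(source, reference, tolerance_in=0.10):
    """Agreement metrics between a rain source and paired NOAA values, in inches."""
    n = len(source)
    diffs = [s - r for s, r in zip(source, reference)]
    mean_s = sum(source) / n
    mean_r = sum(reference) / n
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(source, reference))
    var_s = sum((s - mean_s) ** 2 for s in source)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    return {
        # Pearson correlation; undefined (division by zero) if either series is constant.
        "correlation": cov / math.sqrt(var_s * var_r),
        "mae": sum(abs(d) for d in diffs) / n,
        "rmse": math.sqrt(sum(d * d for d in diffs) / n),
        "within_tolerance": sum(abs(d) <= tolerance_in for d in diffs) / n,
    }
```

RMSE penalizes a few large misses more than MAE does, which may explain why both sources show an RMSE roughly three times their MAE: most rows agree closely and a minority disagree by a lot.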
Outlier Review
The main risk is not that the entire stored rainfall source is bad. The risk is that certain sample-date clusters appear to contain bulk-filled or misaligned values that disagree with the assigned NOAA station records.
All 51 samples shared `precipitationPreviousDay = 7.13 in`, which disagrees sharply with assigned NOAA stations.
All 90 samples shared the same full rain vector, including `precipitationPreviousSat = 0.005 in`, despite station disagreement.
Raritan sites all shared `precipitationPreviousTue = 0.28 in`, which needs manual review.
Operational conclusion: keep the stored gauge-derived rainfall as the primary source, but add targeted QA and overrides for suspicious sample-date + field combinations rather than trusting every imported row equally.
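One way to surface clusters like the ones above is to look for sample dates on which many rows share exactly the same nonzero value in a rain field. The sketch below is a possible screening step, not the review process actually used. The `sampleDate` key and the `min_samples` cutoff are hypothetical.

```python
from collections import Counter, defaultdict

def find_shared_values(rows, fields, min_samples=20):
    """Flag (date, field) groups where many samples share one nonzero value."""
    counts = defaultdict(Counter)
    for row in rows:
        for field in fields:
            value = row[field]
            if value > 0:  # shared zeros are expected on dry days
                counts[(row["sampleDate"], field)][value] += 1
    flagged = []
    for (date, field), counter in counts.items():
        value, n = counter.most_common(1)[0]
        if n >= min_samples:
            flagged.append({"sampleDate": date, "field": field, "value": value, "samples": n})
    return flagged
```

Identical values are not proof of an error, since nearby sites can legitimately share a gauge. Flagged groups are candidates for comparison against the assigned NOAA stations, not automatic rejections.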
Next Steps
The current rain-only model is useful but not definitive. The most promising next gains are better data QA and richer predictors, not just more rain windows.
Build a blacklist or override table for suspicious sample-date rainfall values and replace them with NOAA station observations where possible.
If the product goal is a safe / unsafe call, the next model should optimize that decision directly rather than optimizing only numeric bacteria error.
Tide timing, recent prior bacteria readings, and station-level weather context are the likeliest variables to improve the model beyond the current rain-only ceiling.
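The override table proposed in the first next step could be as simple as a lookup keyed by sample date, site, and field. The following sketch shows one possible shape. The key layout and the `sampleDate`, `site`, and `overriddenFields` names are assumptions, and the replacement values would come from the NOAA station records.

```python
def apply_overrides(rows, overrides):
    """Return copies of rows with listed rain fields replaced by override values.

    `overrides` maps (sampleDate, site, field) to the replacement value in inches.
    """
    cleaned = []
    for row in rows:
        fixed = dict(row)  # copy so the imported data stays untouched
        for field in list(row):
            key = (row["sampleDate"], row["site"], field)
            if key in overrides:
                fixed[field] = overrides[key]
                # Record which fields were changed so the fix is auditable.
                fixed.setdefault("overriddenFields", []).append(field)
        cleaned.append(fixed)
    return cleaned
```

Keeping the original rows intact and recording which fields were overridden makes it possible to rerun the backtests with and without the corrections.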