Exercise 16.4
Honest vs. Dishonest Evaluation - Data Leakage, Held-Out Test, and Cross-Validation
Deliberately evaluate a model the wrong way: fit a random forest on the full dataset and score it on that same data. Compare the (inflated) score to a proper held-out test score and a cross-validated score. Quantify the gap. Then explain, in two or three sentences, why reporting the training score would mislead an asset team deciding whether to trust the model on un-cored wells.
---
One of the most common ways to fool yourself in machine learning is to score a model on the same data it trained on. A RandomForestRegressor can nearly memorize its training set, so the "same-data" R² looks spectacular and says nothing about how the model will do on the next well. This exercise puts three evaluation protocols side by side on the OML-58 porosity problem so you can see, in numbers, how big the lie is.
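To see the memorization effect in isolation, here is a minimal sketch on toy data where the target is pure noise, so there is nothing real to learn. The data, the sizes, and the n_estimators=100 setting are assumptions chosen for illustration; they are not part of the exercise.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): features and target are
# independent random noise, so any "skill" the model shows is memorization.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_tr, y_tr)

train_r2 = r2_score(y_tr, rf.predict(X_tr))  # scored on rows the forest has seen
test_r2 = r2_score(y_te, rf.predict(X_te))   # scored on rows it has never seen

print("same-data R2:", round(train_r2, 3))
print("held-out R2: ", round(test_r2, 3))
```

On noise like this, the same-data score should come out far above the held-out score, which should sit near or below zero. The gap is entirely an artifact of scoring on rows the trees were grown on.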
The verified make_log_dataset generator from the chapter is embedded for you: do not modify it. It maps four wireline logs (GR, RHOB, NPHI, RT) to core porosity (PHI_core).
Embedded constants (already defined for you): FEATURE_COLS = ['GR','RHOB','NPHI','RT'], N_ESTIMATORS = 150, TEST_SIZE = 0.25, SPLIT_SEED = 0.
Write evaluate_three_ways(df, seed=0) that takes a logs DataFrame and returns a dict with three R² scores, computed exactly as follows. Let X = df[FEATURE_COLS].values and y = df['PHI_core'].values.
'dishonest': the leak. Fit RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed) on the full X, y, then score with r2_score(y, rf.predict(X)). These are predictions on the same rows the model trained on.
'honest_test': held-out test. Split once with train_test_split(X, y, test_size=TEST_SIZE, random_state=seed), fit a fresh RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed) on the train split, then score r2_score(y_test, rf.predict(X_test)).
'cv': 5-fold cross-validation. Return cross_val_score(RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed), X, y, cv=KFold(5, shuffle=True, random_state=0), scoring='r2').mean().
Return {'dishonest': ..., 'honest_test': ..., 'cv': ...}.
Then call it once on the default dataset and unpack the three scores into the exact variable names the tests read:
evald = evaluate_three_ways(make_log_dataset())
dishonest_r2 = evald['dishonest']
honest_test_r2 = evald['honest_test']
cv_r2 = evald['cv']

> Think about it: the same-data score comes out around 0.99, but the held-out test and the cross-validated score both land near 0.91, and they agree with each other to a couple of thousandths. That agreement is the point: two independent honest protocols tell the same story, while the dishonest one inflates the number by a wide margin. Which of these three would you put in a report to the asset team, and why would the 0.99 get you in trouble on the next well?
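One way to quantify the gap is to compare not only the R² values but also the unexplained variance, 1 - R², which is what the asset team effectively lives with. The helper below is a hypothetical sketch (quantify_gap is not part of the exercise), and the inputs are the rounded figures quoted above, not measured output.

```python
def quantify_gap(dishonest_r2, honest_r2):
    """Compare a same-data R2 against an honest (held-out or CV) R2."""
    gap = dishonest_r2 - honest_r2
    # 1 - R2 is the fraction of target variance left unexplained.
    if dishonest_r2 >= 1.0:
        # A perfect same-data fit leaves nothing to divide by.
        unexplained_ratio = float("inf")
    else:
        unexplained_ratio = (1.0 - honest_r2) / (1.0 - dishonest_r2)
    return {"r2_gap": gap, "unexplained_ratio": unexplained_ratio}

# Illustrative round numbers taken from the text, not measured output.
report = quantify_gap(dishonest_r2=0.99, honest_r2=0.91)
print(report)
```

With these round numbers the R² gap is about 0.08, and the unexplained variance is roughly nine times larger than the same-data score implies. The second figure is often the more persuasive one in a report.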
Reference solution
Try solving it yourself first. The solution below is one valid approach; yours may differ and still be correct.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split, cross_val_score, KFold
# ── Verified OML-58 log -> porosity generator (do not edit) ──────────────
def make_log_dataset(n=700, seed=42):
    """Synthetic-but-realistic logs -> core porosity for OML 58 wells.

    Rock properties (shale volume, porosity, fluid) drive the log responses,
    so recovering porosity from the logs is a real multi-log inverse problem -
    including the gas effect, where light hydrocarbon lowers both RHOB and NPHI.
    """
    rng = np.random.default_rng(seed)
    Vsh = np.clip(rng.beta(1.3, 3.2, n), 0, 1)  # shale volume fraction
    depth = rng.uniform(8000, 11000, n)  # ft
    phi = np.clip((0.33 - 0.020 * (depth - 8000) / 1000) * (1 - 0.9 * Vsh)
                  + rng.normal(0, 0.020, n), 0.02, 0.34)  # core porosity (the target)
    gas = (rng.random(n) < 0.18) & (Vsh < 0.35) & (phi > 0.14)  # gas-bearing clean sand
    rho_fl = np.where(gas, 0.35, 1.0)  # fluid density (gas vs brine)
    rho_ma = 2.65 + 0.03 * Vsh  # matrix density (sand -> shale)
    RHOB = rho_ma * (1 - phi) + rho_fl * phi + rng.normal(0, 0.035, n)  # bulk density, g/cc
    NPHI = phi + 0.32 * Vsh - np.where(gas, 0.10, 0.0) + rng.normal(0, 0.022, n)  # neutron, v/v
    GR = 22 * (1 - Vsh) + 125 * Vsh + rng.normal(0, 7, n)  # gamma ray, gAPI
    RT = np.exp(rng.normal(0, 0.30, n)) * (1.5 + np.where(gas, 30, 7)
                                           * np.clip(0.30 - phi, 0, 1)) * (1 - 0.5 * Vsh) + 0.5  # deep resistivity, ohm-m
    RT = np.clip(RT, 0.3, 400)
    return pd.DataFrame({"GR": np.round(GR, 1), "RHOB": np.round(RHOB, 3),
                         "NPHI": np.round(NPHI, 3), "RT": np.round(RT, 2),
                         "PHI_core": np.round(phi, 4)})

# ── Evaluation constants (do not edit) ───────────────────────────────────
FEATURE_COLS = ["GR", "RHOB", "NPHI", "RT"]
N_ESTIMATORS = 150
TEST_SIZE = 0.25
SPLIT_SEED = 0

def evaluate_three_ways(df, seed=0):
    """Score the porosity model three ways on the same logs.

    Returns {'dishonest', 'honest_test', 'cv'}:
      dishonest   = R2 of a forest fit on FULL X,y and scored on that SAME X,y
                    (the data leak -- inflated)
      honest_test = R2 on a held-out 25% test split (a fresh forest on the train split)
      cv          = mean R2 over a 5-fold shuffled cross-validation
    """
    X = df[FEATURE_COLS].values
    y = df["PHI_core"].values

    # 1) Dishonest -- fit on the full data, score on that SAME data (the leak).
    rf_full = RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed)
    rf_full.fit(X, y)
    dishonest = r2_score(y, rf_full.predict(X))

    # 2) Honest -- a single held-out test split; a fresh forest on the train side.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=TEST_SIZE, random_state=seed)
    rf = RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed)
    rf.fit(X_tr, y_tr)
    honest_test = r2_score(y_te, rf.predict(X_te))

    # 3) Honest -- 5-fold shuffled cross-validation, averaged.
    cv = cross_val_score(
        RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed),
        X, y, cv=KFold(5, shuffle=True, random_state=0), scoring="r2").mean()

    return {"dishonest": dishonest, "honest_test": honest_test, "cv": cv}

evald = evaluate_three_ways(make_log_dataset())
dishonest_r2 = evald["dishonest"]
honest_test_r2 = evald["honest_test"]
cv_r2 = evald["cv"]
print("dishonest (same-data) R2:", dishonest_r2)
print("honest held-out test R2:", honest_test_r2)
print("5-fold CV R2:", cv_r2)
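
The mean cross-validated score hides how much the individual folds disagree, and that spread is a rough indication of how far a single held-out score could move with a different split. The sketch below prints the per-fold scores alongside the mean. It uses scikit-learn's make_friedman1 as stand-in data (an assumption, so the snippet runs on its own) and n_estimators=100 for speed; in the exercise you would pass X and y from the log dataset instead.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data (an assumption): a synthetic regression problem from
# scikit-learn, used only so this snippet is self-contained.
X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)

# cross_val_score returns one R2 per fold; here we keep them instead of
# collapsing to the mean straight away.
fold_scores = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=0),
    X, y,
    cv=KFold(5, shuffle=True, random_state=0),
    scoring="r2",
)

print("per-fold R2:", fold_scores.round(3))
print("mean R2:", round(float(fold_scores.mean()), 3),
      "std:", round(float(fold_scores.std()), 3))
```

Reporting the mean together with the fold-to-fold spread gives the asset team a sense of the uncertainty in the estimate, which a single number cannot convey.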