Honest vs. Dishonest Evaluation - Data Leakage, Held-Out Test, and Cross-Validation

Level 3

Chapter 16: ML Fundamentals

Problem

Deliberately evaluate a model the wrong way: fit a random forest on the full dataset and score it on that same data. Compare the (inflated) score to a proper held-out test score and a cross-validated score. Quantify the gap. Then explain, in two or three sentences, why reporting the training score would mislead an asset team deciding whether to trust the model on un-cored wells.

---

The single most common way to fool yourself in machine learning is to score a model on the same data it trained on. A RandomForestRegressor can nearly memorize its training set, so the "same-data" R² looks spectacular yet says nothing about how the model will do on the next well. This exercise puts three evaluation protocols side by side on the OML-58 porosity problem so you can see, in numbers, how big the lie is.
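
The memorization effect can be seen in isolation with a minimal sketch in which the target is pure noise, so there is nothing real to learn. The data, sample sizes, and variable names below are illustrative assumptions and are not part of the exercise or the OML-58 generator.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Illustrative assumption: random features and a target that is pure noise,
# so no model can genuinely predict y_noise from X_noise.
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(400, 4))
y_noise = rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.25, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

train_r2 = r2_score(y_tr, forest.predict(X_tr))  # scored on rows the forest saw
test_r2 = r2_score(y_te, forest.predict(X_te))   # scored on rows it never saw

print(f"same-data R2: {train_r2:.2f}")
print(f"held-out R2:  {test_r2:.2f}")
```

In a sketch like this the same-data R² is typically high even though the held-out R² sits near or below zero, because the forest has reproduced the noise in its training rows rather than learned a relationship.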

The verified make_log_dataset generator from the chapter is embedded for you: do not modify it. It maps four wireline logs (GR, RHOB, NPHI, RT) to core porosity (PHI_core).

Embedded constants (already defined for you): FEATURE_COLS = ['GR','RHOB','NPHI','RT'], N_ESTIMATORS = 150, TEST_SIZE = 0.25, SPLIT_SEED = 0.

Write evaluate_three_ways(df, seed=0) that takes a logs DataFrame and returns a dict with three R² scores, computed exactly as follows. Let X = df[FEATURE_COLS].values and y = df['PHI_core'].values.

'dishonest': the leak. Fit RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed) on the full X, y, then score with r2_score(y, rf.predict(X)). The predictions are made on the same rows the forest trained on.

'honest_test': held-out test. Split once with train_test_split(X, y, test_size=TEST_SIZE, random_state=seed), fit a fresh RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed) on the train split, then score r2_score(y_test, rf.predict(X_test)).

'cv': 5-fold cross-validation. Return cross_val_score(RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed), X, y, cv=KFold(5, shuffle=True, random_state=0), scoring='r2').mean().

Return {'dishonest': ..., 'honest_test': ..., 'cv': ...}.

Then call it once on the default dataset and unpack the three scores into the exact variable names the tests read:

evald = evaluate_three_ways(make_log_dataset())
dishonest_r2   = evald['dishonest']
honest_test_r2 = evald['honest_test']
cv_r2          = evald['cv']
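
The 'cv' entry collapses five fold scores into a single mean. As a minimal sketch, using stand-in data from scikit-learn's make_regression (an assumption for illustration, not the OML-58 generator) and a smaller forest to keep it quick, the individual fold scores behind that mean can be inspected like this:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data (illustrative assumption, not the OML-58 logs).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=4, noise=10.0, random_state=0)

# cross_val_score returns one R2 per fold; the exercise keeps only the mean.
fold_scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X_demo, y_demo,
    cv=KFold(5, shuffle=True, random_state=0),
    scoring="r2",
)

print("per-fold R2:", np.round(fold_scores, 3))
print("mean R2:", fold_scores.mean(), "std:", fold_scores.std())
```

A small spread across folds is one sign that the averaged score is stable. A large spread suggests that a single held-out split could land well away from the mean by luck of the draw.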

> Think about it: the same-data score comes out around 0.99, but the held-out test and the cross-validated score both land near 0.91, and they agree with each other to a couple of thousandths. That agreement is the point: two independent honest protocols tell the same story, while the dishonest one inflates the number by a wide margin. Which of these three would you put in a report to the asset team, and why would the 0.99 get you in trouble on the next well?
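
One way to quantify the gap is to look at both the raw R² difference and the unexplained variance, 1 - R², which is what actually grows when the honest score replaces the dishonest one. The helper below is a hypothetical sketch, not part of the graded solution, and it uses the approximate scores quoted above; substitute the values you compute.

```python
def quantify_gap(dishonest_r2, honest_r2):
    """Return (R2 gap, ratio of unexplained variance). Hypothetical helper.

    Assumes dishonest_r2 < 1.0; a perfect same-data score would divide by zero.
    """
    gap = dishonest_r2 - honest_r2
    # 1 - R2 is the fraction of target variance the model leaves unexplained.
    unexplained_ratio = (1.0 - honest_r2) / (1.0 - dishonest_r2)
    return gap, unexplained_ratio


# Approximate scores quoted in the text (about 0.99 same-data, about 0.91 honest).
gap, ratio = quantify_gap(0.99, 0.91)
print(f"R2 gap: {gap:.2f}")
print(f"honest unexplained variance is about {ratio:.0f}x the same-data figure")
```

Framed this way, a gap that looks small on the R² scale corresponds to several times more unexplained porosity variance than the same-data score implies.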


Your solution

main.py

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split, cross_val_score, KFold

# ── Verified OML-58 log -> porosity generator (do not edit) ──────────────
def make_log_dataset(n=700, seed=42):
    """Synthetic-but-realistic logs -> core porosity for OML 58 wells.

    Rock properties (shale volume, porosity, fluid) drive the log responses,
    so recovering porosity from the logs is a real multi-log inverse problem -
    including the gas effect, where light hydrocarbon lowers both RHOB and NPHI.
    """
    rng = np.random.default_rng(seed)
    Vsh = np.clip(rng.beta(1.3, 3.2, n), 0, 1)                 # shale volume fraction
    depth = rng.uniform(8000, 11000, n)                        # ft
    phi = np.clip((0.33 - 0.020 * (depth - 8000) / 1000) * (1 - 0.9 * Vsh)
                  + rng.normal(0, 0.020, n), 0.02, 0.34)       # core porosity (the target)
    gas = (rng.random(n) < 0.18) & (Vsh < 0.35) & (phi > 0.14)  # gas-bearing clean sand
    rho_fl = np.where(gas, 0.35, 1.0)                          # fluid density (gas vs brine)
    rho_ma = 2.65 + 0.03 * Vsh                                 # matrix density (sand -> shale)
    RHOB = rho_ma * (1 - phi) + rho_fl * phi + rng.normal(0, 0.035, n)   # bulk density, g/cc
    NPHI = phi + 0.32 * Vsh - np.where(gas, 0.10, 0.0) + rng.normal(0, 0.022, n)  # neutron, v/v
    GR = 22 * (1 - Vsh) + 125 * Vsh + rng.normal(0, 7, n)      # gamma ray, gAPI
    RT = np.exp(rng.normal(0, 0.30, n)) * (1.5 + np.where(gas, 30, 7)
         * np.clip(0.30 - phi, 0, 1)) * (1 - 0.5 * Vsh) + 0.5  # deep resistivity, ohm-m
    RT = np.clip(RT, 0.3, 400)
    return pd.DataFrame({"GR": np.round(GR, 1), "RHOB": np.round(RHOB, 3),
                         "NPHI": np.round(NPHI, 3), "RT": np.round(RT, 2),
                         "PHI_core": np.round(phi, 4)})

# ── Evaluation constants (do not edit) ───────────────────────────────────
FEATURE_COLS = ["GR", "RHOB", "NPHI", "RT"]
N_ESTIMATORS = 150
TEST_SIZE = 0.25
SPLIT_SEED = 0

def evaluate_three_ways(df, seed=0):
    """Score the porosity model three ways on the same logs.

    Returns {'dishonest', 'honest_test', 'cv'}:
      dishonest   = R2 of a forest fit on FULL X,y and scored on that SAME X,y
                    (the data leak -- inflated)
      honest_test = R2 on a held-out 25% test split (a fresh forest on the train split)
      cv          = mean R2 over a 5-fold shuffled cross-validation
    """
    X = df[FEATURE_COLS].values
    y = df["PHI_core"].values
    # TODO 1 (dishonest -- the leak):
    #   rf_full = RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed)
    #   rf_full.fit(X, y)
    #   dishonest = r2_score(y, rf_full.predict(X))      # scored on the SAME rows
    # TODO 2 (honest held-out test):
    #   X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=TEST_SIZE, random_state=seed)
    #   rf = RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed)
    #   rf.fit(X_tr, y_tr)
    #   honest_test = r2_score(y_te, rf.predict(X_te))
    # TODO 3 (5-fold cross-validation):
    #   cv = cross_val_score(RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed),
    #                        X, y, cv=KFold(5, shuffle=True, random_state=0), scoring="r2").mean()
    # TODO: return {"dishonest": dishonest, "honest_test": honest_test, "cv": cv}
    return None

# TODO: evald = evaluate_three_ways(make_log_dataset())
evald = None
dishonest_r2 = None
honest_test_r2 = None
cv_r2 = None

print("dishonest (same-data) R2:", dishonest_r2)
print("honest held-out test R2:", honest_test_r2)
print("5-fold CV R2:", cv_r2)

Reference solution

Try solving it yourself first. The solution below is one valid approach; yours may differ and still be correct.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split, cross_val_score, KFold


# ── Verified OML-58 log -> porosity generator (do not edit) ──────────────
def make_log_dataset(n=700, seed=42):
    """Synthetic-but-realistic logs -> core porosity for OML 58 wells.

    Rock properties (shale volume, porosity, fluid) drive the log responses,
    so recovering porosity from the logs is a real multi-log inverse problem -
    including the gas effect, where light hydrocarbon lowers both RHOB and NPHI.
    """
    rng = np.random.default_rng(seed)
    Vsh = np.clip(rng.beta(1.3, 3.2, n), 0, 1)                 # shale volume fraction
    depth = rng.uniform(8000, 11000, n)                        # ft
    phi = np.clip((0.33 - 0.020 * (depth - 8000) / 1000) * (1 - 0.9 * Vsh)
                  + rng.normal(0, 0.020, n), 0.02, 0.34)       # core porosity (the target)
    gas = (rng.random(n) < 0.18) & (Vsh < 0.35) & (phi > 0.14)  # gas-bearing clean sand
    rho_fl = np.where(gas, 0.35, 1.0)                          # fluid density (gas vs brine)
    rho_ma = 2.65 + 0.03 * Vsh                                 # matrix density (sand -> shale)
    RHOB = rho_ma * (1 - phi) + rho_fl * phi + rng.normal(0, 0.035, n)   # bulk density, g/cc
    NPHI = phi + 0.32 * Vsh - np.where(gas, 0.10, 0.0) + rng.normal(0, 0.022, n)  # neutron, v/v
    GR = 22 * (1 - Vsh) + 125 * Vsh + rng.normal(0, 7, n)      # gamma ray, gAPI
    RT = np.exp(rng.normal(0, 0.30, n)) * (1.5 + np.where(gas, 30, 7)
         * np.clip(0.30 - phi, 0, 1)) * (1 - 0.5 * Vsh) + 0.5  # deep resistivity, ohm-m
    RT = np.clip(RT, 0.3, 400)
    return pd.DataFrame({"GR": np.round(GR, 1), "RHOB": np.round(RHOB, 3),
                         "NPHI": np.round(NPHI, 3), "RT": np.round(RT, 2),
                         "PHI_core": np.round(phi, 4)})


# ── Evaluation constants (do not edit) ───────────────────────────────────
FEATURE_COLS = ["GR", "RHOB", "NPHI", "RT"]
N_ESTIMATORS = 150
TEST_SIZE = 0.25
SPLIT_SEED = 0


def evaluate_three_ways(df, seed=0):
    """Score the porosity model three ways on the same logs.

    Returns {'dishonest', 'honest_test', 'cv'}:
      dishonest   = R2 of a forest fit on FULL X,y and scored on that SAME X,y
                    (the data leak -- inflated)
      honest_test = R2 on a held-out 25% test split (a fresh forest on the train split)
      cv          = mean R2 over a 5-fold shuffled cross-validation
    """
    X = df[FEATURE_COLS].values
    y = df["PHI_core"].values

    # 1) Dishonest -- fit on the full data, score on that SAME data (the leak).
    rf_full = RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed)
    rf_full.fit(X, y)
    dishonest = r2_score(y, rf_full.predict(X))

    # 2) Honest -- a single held-out test split; a fresh forest on the train side.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=TEST_SIZE, random_state=seed)
    rf = RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed)
    rf.fit(X_tr, y_tr)
    honest_test = r2_score(y_te, rf.predict(X_te))

    # 3) Honest -- 5-fold shuffled cross-validation, averaged.
    cv = cross_val_score(
        RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed),
        X, y, cv=KFold(5, shuffle=True, random_state=0), scoring="r2").mean()

    return {"dishonest": dishonest, "honest_test": honest_test, "cv": cv}


evald = evaluate_three_ways(make_log_dataset())
dishonest_r2 = evald["dishonest"]
honest_test_r2 = evald["honest_test"]
cv_r2 = evald["cv"]

print("dishonest (same-data) R2:", dishonest_r2)
print("honest held-out test R2:", honest_test_r2)
print("5-fold CV R2:", cv_r2)
