Exercise 16.3
The Overfitting Curve - Tree Depth and the Bias-Variance Sweet Spot
For a single DecisionTreeRegressor, sweep max_depth from 1 to 20. For each depth, record both the training R² and the 5-fold cross-validated R². Plot both curves against depth on one figure. Identify the depth where the model stops underfitting and starts overfitting (the "sweet spot" where the test curve peaks), and explain what each region of the plot means.
---
The verified make_log_dataset generator from this chapter is embedded for you (do not modify it.) You also get the constants FEATURE_COLS = ['GR', 'RHOB', 'NPHI', 'RT'] and DEPTHS = list(range(1, 21)) (the max_depth values to sweep, 1 through 20).
Your task: write overfitting_curve(df, depths=DEPTHS) that returns a dict:
- Build
X = df[FEATURE_COLS].valuesandy = df['PHI_core'].values. - For each
dindepths:
- Fit
DecisionTreeRegressor(max_depth=d, random_state=0)on the full
X, y and record its training R² (r2_score(y, tree.predict(X))).
- 5-fold cross-validate a fresh tree of the same depth and record the mean:
cross_val_score(DecisionTreeRegressor(max_depth=d, random_state=0), X, y, cv=KFold(5, shuffle=True, random_state=0), scoring="r2").mean().
- Find
best_depth, the depth with the highest cross-validated R²
(list(depths)[int(np.argmax(cv_r2))]).
- Return
{'train_r2': train_r2, 'cv_r2': cv_r2, 'best_depth': best_depth}.
Then call it once and expose three module-level values:
curve = overfitting_curve(make_log_dataset())
train_curve = curve['train_r2']
cv_curve = curve['cv_r2']
best_depth = curve['best_depth']> Think about it: the training R² climbs steadily toward 1.0 as the tree is > allowed to grow: a depth-20 tree can memorise nearly every training point. The > cross-validated R² tells the honest story: it rises, peaks around max_depth = > 6 (R² ≈ 0.86), then falls as the deeper trees start fitting noise. The gap > between the two curves is the variance the model is wasting on memorisation. > Why is the most accurate tree on the training data not the one you would ship?
Stuck? Reveal hints one at a time — they progress from nudge to near-solution.
visibilityReveal reference solutionexpand_more
Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import r2_score
# ── Verified data generator (do not edit) ────────────────────────────────
def make_log_dataset(n=700, seed=42):
"""Synthetic-but-realistic logs -> core porosity for OML 58 wells."""
rng = np.random.default_rng(seed)
Vsh = np.clip(rng.beta(1.3, 3.2, n), 0, 1) # shale volume fraction
depth = rng.uniform(8000, 11000, n) # ft
phi = np.clip((0.33 - 0.020 * (depth - 8000) / 1000) * (1 - 0.9 * Vsh)
+ rng.normal(0, 0.020, n), 0.02, 0.34) # core porosity (the target)
gas = (rng.random(n) < 0.18) & (Vsh < 0.35) & (phi > 0.14) # gas-bearing clean sand
rho_fl = np.where(gas, 0.35, 1.0) # fluid density (gas vs brine)
rho_ma = 2.65 + 0.03 * Vsh # matrix density (sand -> shale)
RHOB = rho_ma * (1 - phi) + rho_fl * phi + rng.normal(0, 0.035, n) # bulk density, g/cc
NPHI = phi + 0.32 * Vsh - np.where(gas, 0.10, 0.0) + rng.normal(0, 0.022, n) # neutron, v/v
GR = 22 * (1 - Vsh) + 125 * Vsh + rng.normal(0, 7, n) # gamma ray, gAPI
RT = np.exp(rng.normal(0, 0.30, n)) * (1.5 + np.where(gas, 30, 7)
* np.clip(0.30 - phi, 0, 1)) * (1 - 0.5 * Vsh) + 0.5 # deep resistivity, ohm-m
RT = np.clip(RT, 0.3, 400)
return pd.DataFrame({"GR": np.round(GR, 1), "RHOB": np.round(RHOB, 3),
"NPHI": np.round(NPHI, 3), "RT": np.round(RT, 2),
"PHI_core": np.round(phi, 4)})
# ── Constants (do not edit) ──────────────────────────────────────────────
FEATURE_COLS = ['GR', 'RHOB', 'NPHI', 'RT']
DEPTHS = list(range(1, 21)) # max_depth sweep, 1..20
def overfitting_curve(df, depths=DEPTHS):
"""Sweep DecisionTreeRegressor max_depth and record train vs. cross-val R2.
For each depth: fit a tree on the FULL data and score it on that same data
(the training R2), and 5-fold shuffled cross-validate it (the honest R2).
Returns {'train_r2': [...], 'cv_r2': [...], 'best_depth': int}, where
best_depth is the depth with the highest cross-validated R2.
"""
X = df[FEATURE_COLS].values
y = df['PHI_core'].values
train_r2, cv_r2 = [], []
for d in depths:
tree = DecisionTreeRegressor(max_depth=d, random_state=0)
tree.fit(X, y)
train_r2.append(r2_score(y, tree.predict(X)))
cv = cross_val_score(DecisionTreeRegressor(max_depth=d, random_state=0),
X, y, cv=KFold(5, shuffle=True, random_state=0),
scoring="r2").mean()
cv_r2.append(cv)
best_depth = list(depths)[int(np.argmax(cv_r2))]
return {'train_r2': train_r2, 'cv_r2': cv_r2, 'best_depth': best_depth}
curve = overfitting_curve(make_log_dataset())
train_curve = curve['train_r2']
cv_curve = curve['cv_r2']
best_depth = curve['best_depth']
print("train R2 (depth 1..20):", [round(v, 3) for v in train_curve])
print("cv R2 (depth 1..20):", [round(v, 3) for v in cv_curve])
print("best (cv-max) depth :", best_depth)
lockCopying code is a Full Access feature.