The Overfitting Curve - Tree Depth and the Bias-Variance Sweet Spot

Level 2

Chapter 16: ML Fundamentals

descriptionProblem

For a single DecisionTreeRegressor, sweep max_depth from 1 to 20. For each depth, record both the training R² and the 5-fold cross-validated R². Plot both curves against depth on one figure. Identify the depth where the model stops underfitting and starts overfitting (the "sweet spot" where the test curve peaks), and explain what each region of the plot means.

---

The verified make_log_dataset generator from this chapter is embedded for you (do not modify it.) You also get the constants FEATURE_COLS = ['GR', 'RHOB', 'NPHI', 'RT'] and DEPTHS = list(range(1, 21)) (the max_depth values to sweep, 1 through 20).

Your task: write overfitting_curve(df, depths=DEPTHS) that returns a dict:

Build X = df[FEATURE_COLS].values and y = df['PHI_core'].values.
For each d in depths:

Fit DecisionTreeRegressor(max_depth=d, random_state=0) on the full

X, y and record its training R² (r2_score(y, tree.predict(X))).

5-fold cross-validate a fresh tree of the same depth and record the mean:

cross_val_score(DecisionTreeRegressor(max_depth=d, random_state=0), X, y, cv=KFold(5, shuffle=True, random_state=0), scoring="r2").mean().

Find best_depth, the depth with the highest cross-validated R²

(list(depths)[int(np.argmax(cv_r2))]).

Return {'train_r2': train_r2, 'cv_r2': cv_r2, 'best_depth': best_depth}.

Then call it once and expose three module-level values:

curve = overfitting_curve(make_log_dataset())
train_curve = curve['train_r2']
cv_curve = curve['cv_r2']
best_depth = curve['best_depth']

> Think about it: the training R² climbs steadily toward 1.0 as the tree is > allowed to grow: a depth-20 tree can memorise nearly every training point. The > cross-validated R² tells the honest story: it rises, peaks around max_depth = > 6 (R² ≈ 0.86), then falls as the deeper trees start fitting noise. The gap > between the two curves is the variance the model is wasting on memorisation. > Why is the most accurate tree on the training data not the one you would ship?

lightbulbHints (0/3)

Stuck? Reveal hints one at a time — they progress from nudge to near-solution.

codeYour solution

main.py

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import r2_score

# ── Verified data generator (do not edit) ────────────────────────────────
def make_log_dataset(n=700, seed=42):
    """Synthetic-but-realistic logs -> core porosity for OML 58 wells."""
    rng = np.random.default_rng(seed)
    Vsh = np.clip(rng.beta(1.3, 3.2, n), 0, 1)                 # shale volume fraction
    depth = rng.uniform(8000, 11000, n)                        # ft
    phi = np.clip((0.33 - 0.020 * (depth - 8000) / 1000) * (1 - 0.9 * Vsh)
                  + rng.normal(0, 0.020, n), 0.02, 0.34)       # core porosity (the target)
    gas = (rng.random(n) < 0.18) & (Vsh < 0.35) & (phi > 0.14)  # gas-bearing clean sand
    rho_fl = np.where(gas, 0.35, 1.0)                          # fluid density (gas vs brine)
    rho_ma = 2.65 + 0.03 * Vsh                                 # matrix density (sand -> shale)
    RHOB = rho_ma * (1 - phi) + rho_fl * phi + rng.normal(0, 0.035, n)   # bulk density, g/cc
    NPHI = phi + 0.32 * Vsh - np.where(gas, 0.10, 0.0) + rng.normal(0, 0.022, n)  # neutron, v/v
    GR = 22 * (1 - Vsh) + 125 * Vsh + rng.normal(0, 7, n)      # gamma ray, gAPI
    RT = np.exp(rng.normal(0, 0.30, n)) * (1.5 + np.where(gas, 30, 7)
         * np.clip(0.30 - phi, 0, 1)) * (1 - 0.5 * Vsh) + 0.5  # deep resistivity, ohm-m
    RT = np.clip(RT, 0.3, 400)
    return pd.DataFrame({"GR": np.round(GR, 1), "RHOB": np.round(RHOB, 3),
                         "NPHI": np.round(NPHI, 3), "RT": np.round(RT, 2),
                         "PHI_core": np.round(phi, 4)})

# ── Constants (do not edit) ──────────────────────────────────────────────
FEATURE_COLS = ['GR', 'RHOB', 'NPHI', 'RT']
DEPTHS = list(range(1, 21))     # max_depth sweep, 1..20

def overfitting_curve(df, depths=DEPTHS):
    """Sweep DecisionTreeRegressor max_depth and record train vs. cross-val R2.

For each depth: fit a tree on the FULL data and score it on that same data
    (the training R2), and 5-fold shuffled cross-validate it (the honest R2).
    Returns {'train_r2': [...], 'cv_r2': [...], 'best_depth': int}, where
    best_depth is the depth with the highest cross-validated R2.
    """
    X = df[FEATURE_COLS].values
    y = df['PHI_core'].values
    train_r2, cv_r2 = [], []
    for d in depths:
        pass
        # TODO: tree = DecisionTreeRegressor(max_depth=d, random_state=0)
        # TODO: tree.fit(X, y)
        # TODO: train_r2.append(r2_score(y, tree.predict(X)))          # training score
        # TODO: cv = cross_val_score(DecisionTreeRegressor(max_depth=d, random_state=0),
        # TODO:        X, y, cv=KFold(5, shuffle=True, random_state=0), scoring="r2").mean()
        # TODO: cv_r2.append(cv)
    # TODO: best_depth = list(depths)[int(np.argmax(cv_r2))]
    # TODO: return {'train_r2': train_r2, 'cv_r2': cv_r2, 'best_depth': best_depth}
    return None

curve = overfitting_curve(make_log_dataset())
train_curve = None if curve is None else curve['train_r2']
cv_curve = None if curve is None else curve['cv_r2']
best_depth = None if curve is None else curve['best_depth']

print("train curve:", train_curve)
print("cv curve   :", cv_curve)
print("best depth :", best_depth)

visibilityReveal reference solutionexpand_more

Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import r2_score


# ── Verified data generator (do not edit) ────────────────────────────────
def make_log_dataset(n=700, seed=42):
    """Synthetic-but-realistic logs -> core porosity for OML 58 wells."""
    rng = np.random.default_rng(seed)
    Vsh = np.clip(rng.beta(1.3, 3.2, n), 0, 1)                 # shale volume fraction
    depth = rng.uniform(8000, 11000, n)                        # ft
    phi = np.clip((0.33 - 0.020 * (depth - 8000) / 1000) * (1 - 0.9 * Vsh)
                  + rng.normal(0, 0.020, n), 0.02, 0.34)       # core porosity (the target)
    gas = (rng.random(n) < 0.18) & (Vsh < 0.35) & (phi > 0.14)  # gas-bearing clean sand
    rho_fl = np.where(gas, 0.35, 1.0)                          # fluid density (gas vs brine)
    rho_ma = 2.65 + 0.03 * Vsh                                 # matrix density (sand -> shale)
    RHOB = rho_ma * (1 - phi) + rho_fl * phi + rng.normal(0, 0.035, n)   # bulk density, g/cc
    NPHI = phi + 0.32 * Vsh - np.where(gas, 0.10, 0.0) + rng.normal(0, 0.022, n)  # neutron, v/v
    GR = 22 * (1 - Vsh) + 125 * Vsh + rng.normal(0, 7, n)      # gamma ray, gAPI
    RT = np.exp(rng.normal(0, 0.30, n)) * (1.5 + np.where(gas, 30, 7)
         * np.clip(0.30 - phi, 0, 1)) * (1 - 0.5 * Vsh) + 0.5  # deep resistivity, ohm-m
    RT = np.clip(RT, 0.3, 400)
    return pd.DataFrame({"GR": np.round(GR, 1), "RHOB": np.round(RHOB, 3),
                         "NPHI": np.round(NPHI, 3), "RT": np.round(RT, 2),
                         "PHI_core": np.round(phi, 4)})


# ── Constants (do not edit) ──────────────────────────────────────────────
FEATURE_COLS = ['GR', 'RHOB', 'NPHI', 'RT']
DEPTHS = list(range(1, 21))     # max_depth sweep, 1..20


def overfitting_curve(df, depths=DEPTHS):
    """Sweep DecisionTreeRegressor max_depth and record train vs. cross-val R2.

    For each depth: fit a tree on the FULL data and score it on that same data
    (the training R2), and 5-fold shuffled cross-validate it (the honest R2).
    Returns {'train_r2': [...], 'cv_r2': [...], 'best_depth': int}, where
    best_depth is the depth with the highest cross-validated R2.
    """
    X = df[FEATURE_COLS].values
    y = df['PHI_core'].values
    train_r2, cv_r2 = [], []
    for d in depths:
        tree = DecisionTreeRegressor(max_depth=d, random_state=0)
        tree.fit(X, y)
        train_r2.append(r2_score(y, tree.predict(X)))
        cv = cross_val_score(DecisionTreeRegressor(max_depth=d, random_state=0),
                             X, y, cv=KFold(5, shuffle=True, random_state=0),
                             scoring="r2").mean()
        cv_r2.append(cv)
    best_depth = list(depths)[int(np.argmax(cv_r2))]
    return {'train_r2': train_r2, 'cv_r2': cv_r2, 'best_depth': best_depth}


curve = overfitting_curve(make_log_dataset())
train_curve = curve['train_r2']
cv_curve = curve['cv_r2']
best_depth = curve['best_depth']

print("train R2 (depth 1..20):", [round(v, 3) for v in train_curve])
print("cv R2    (depth 1..20):", [round(v, 3) for v in cv_curve])
print("best (cv-max) depth   :", best_depth)

lockCopying code is a Full Access feature.

arrow_back

16.2 Feature Engineering Bake-Off - Does It Help the Linear Model or the Forest?

16.4 Honest vs. Dishonest Evaluation - Data Leakage, Held-Out Test, and Cross-Validation

arrow_forward