Feature Engineering Bake-Off - Does It Help the Linear Model or the Forest?

Level 2

Chapter 16: ML Fundamentals

descriptionProblem

Starting from the four raw logs, engineer at least three new features (e.g. density porosity, neutron–density separation, log resistivity, a gamma-ray shale index). Using 5-fold cross-validation, compare: (a) raw logs only, (b) engineered features only, (c) raw + engineered. Report mean ± std R² for each. Does feature engineering help a random forest as much as it helps a linear model? Explain why or why not.

---

We'll run the bake-off as a head-to-head between a linear model and a random forest, on the same five-fold cross-validation, to settle the question the book asks: does feature engineering help the forest as much as it helps the linear model?

The verified make_log_dataset generator is embedded for you under a "do not edit" banner. Do not modify it. The constants are also fixed:

RAW_COLS = ["GR", "RHOB", "NPHI", "RT"]
ENG_COLS = ["DPHI", "logRT", "N_D_sep", "RHOB_NPHI"]
N_ESTIMATORS = 150

Every score uses the same splitter: KFold(5, shuffle=True, random_state=0).

### Task 1: add_engineered_features(df)

Return a copy of df with exactly four new columns, each a piece of petrophysics a linear model cannot invent on its own:

column	formula	meaning
`DPHI`	`(2.65 - RHOB) / (2.65 - 1.0)`	sandstone density porosity
`logRT`	`np.log10(RT)`	resistivity, linearised
`N_D_sep`	`NPHI - DPHI`	neutron–density separation (gas/shale)
`RHOB_NPHI`	`(RHOB - RHOB.mean()) * (NPHI - NPHI.mean())`	centered gas-effect interaction product

RHOB_NPHI is the one a linear model genuinely cannot form by itself: it is the product of two logs, the only way a straight-line model can read density and neutron jointly to catch the gas effect (where the two logs move in opposite directions).

### Task 2: bakeoff(df)

Return a dict with four 5-fold mean CV R² scores: {"lin_raw", "lin_aug", "rf_raw", "rf_aug"}.

lin_* uses make_pipeline(StandardScaler(), LinearRegression()).
rf_* uses RandomForestRegressor(n_estimators=150, random_state=0).
*_raw scores RAW_COLS; *_aug scores RAW_COLS + ENG_COLS.
the target is df["PHI_core"].
each score is

cross_val_score(pipe, X, y, cv=KFold(5, shuffle=True, random_state=0), scoring="r2").mean().

### Output variables the tests read

scores      = bakeoff(add_engineered_features(make_log_dataset()))
lin_raw_r2  = scores["lin_raw"]
lin_aug_r2  = scores["lin_aug"]
rf_raw_r2   = scores["rf_raw"]
rf_aug_r2   = scores["rf_aug"]

> Think about it: the linear model gets a real lift from the engineered > columns (especially the RHOB*NPHI product), but the forest barely moves; > it already reads the raw logs jointly through its splits, so handing it > pre-computed interactions tells it little it did not already know. That gap, > engineering lifts the linear model more than the forest, is the lesson this > exercise is built to prove.

lightbulbHints (0/3)

Stuck? Reveal hints one at a time — they progress from nudge to near-solution.

codeYour solution

main.py

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

# ── Verified log-to-porosity generator (do not edit) ─────────────────────
def make_log_dataset(n=700, seed=42):
    """Synthetic-but-realistic logs -> core porosity for OML 58 wells.

Rock properties (shale volume, porosity, fluid) drive the log responses,
    so recovering porosity from the logs is a real multi-log inverse problem -
    including the gas effect, where light hydrocarbon lowers both RHOB and NPHI.
    """
    rng = np.random.default_rng(seed)
    Vsh = np.clip(rng.beta(1.3, 3.2, n), 0, 1)                 # shale volume fraction
    depth = rng.uniform(8000, 11000, n)                        # ft
    phi = np.clip((0.33 - 0.020 * (depth - 8000) / 1000) * (1 - 0.9 * Vsh)
                  + rng.normal(0, 0.020, n), 0.02, 0.34)       # core porosity (the target)
    gas = (rng.random(n) < 0.18) & (Vsh < 0.35) & (phi > 0.14)  # gas-bearing clean sand
    rho_fl = np.where(gas, 0.35, 1.0)                          # fluid density (gas vs brine)
    rho_ma = 2.65 + 0.03 * Vsh                                 # matrix density (sand -> shale)
    RHOB = rho_ma * (1 - phi) + rho_fl * phi + rng.normal(0, 0.035, n)   # bulk density, g/cc
    NPHI = phi + 0.32 * Vsh - np.where(gas, 0.10, 0.0) + rng.normal(0, 0.022, n)  # neutron, v/v
    GR = 22 * (1 - Vsh) + 125 * Vsh + rng.normal(0, 7, n)      # gamma ray, gAPI
    RT = np.exp(rng.normal(0, 0.30, n)) * (1.5 + np.where(gas, 30, 7)
         * np.clip(0.30 - phi, 0, 1)) * (1 - 0.5 * Vsh) + 0.5  # deep resistivity, ohm-m
    RT = np.clip(RT, 0.3, 400)
    return pd.DataFrame({"GR": np.round(GR, 1), "RHOB": np.round(RHOB, 3),
                         "NPHI": np.round(NPHI, 3), "RT": np.round(RT, 2),
                         "PHI_core": np.round(phi, 4)})

# ── Bake-off constants (do not edit) ─────────────────────────────────────
RAW_COLS = ["GR", "RHOB", "NPHI", "RT"]
ENG_COLS = ["DPHI", "logRT", "N_D_sep", "RHOB_NPHI"]
N_ESTIMATORS = 150

def add_engineered_features(df):
    """Return a COPY of df with exactly four engineered columns:
       DPHI, logRT, N_D_sep, RHOB_NPHI (see prompt for formulas).
    """
    # TODO: out = df.copy()
    # TODO: out["DPHI"] = (2.65 - out["RHOB"]) / (2.65 - 1.0)
    # TODO: out["logRT"] = np.log10(out["RT"])
    # TODO: out["N_D_sep"] = out["NPHI"] - out["DPHI"]
    # TODO: out["RHOB_NPHI"] = (out["RHOB"] - out["RHOB"].mean()) * (out["NPHI"] - out["NPHI"].mean())
    # TODO: return out
    return None

def bakeoff(df):
    """Return {'lin_raw','lin_aug','rf_raw','rf_aug'} of 5-fold mean CV R2.

lin_* uses make_pipeline(StandardScaler(), LinearRegression());
    rf_*  uses RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=0);
    *_raw scores RAW_COLS, *_aug scores RAW_COLS + ENG_COLS;
    splitter is always KFold(5, shuffle=True, random_state=0).
    """
    # TODO: y = df["PHI_core"].values
    # TODO: kf = KFold(5, shuffle=True, random_state=0)
    # TODO: def cv(pipe, cols):
    # TODO:     return cross_val_score(pipe, df[cols].values, y, cv=kf, scoring="r2").mean()
    # TODO: lin = lambda: make_pipeline(StandardScaler(), LinearRegression())
    # TODO: rf  = lambda: RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=0)
    # TODO: return {"lin_raw": cv(lin(), RAW_COLS),
    # TODO:         "lin_aug": cv(lin(), RAW_COLS + ENG_COLS),
    # TODO:         "rf_raw":  cv(rf(),  RAW_COLS),
    # TODO:         "rf_aug":  cv(rf(),  RAW_COLS + ENG_COLS)}
    return None

# TODO: scores = bakeoff(add_engineered_features(make_log_dataset()))
scores = None
lin_raw_r2 = None
lin_aug_r2 = None
rf_raw_r2 = None
rf_aug_r2 = None

print("scores:", scores)

visibilityReveal reference solutionexpand_more

Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline


# ── Verified log-to-porosity generator (do not edit) ─────────────────────
def make_log_dataset(n=700, seed=42):
    """Synthetic-but-realistic logs -> core porosity for OML 58 wells.

    Rock properties (shale volume, porosity, fluid) drive the log responses,
    so recovering porosity from the logs is a real multi-log inverse problem -
    including the gas effect, where light hydrocarbon lowers both RHOB and NPHI.
    """
    rng = np.random.default_rng(seed)
    Vsh = np.clip(rng.beta(1.3, 3.2, n), 0, 1)                 # shale volume fraction
    depth = rng.uniform(8000, 11000, n)                        # ft
    phi = np.clip((0.33 - 0.020 * (depth - 8000) / 1000) * (1 - 0.9 * Vsh)
                  + rng.normal(0, 0.020, n), 0.02, 0.34)       # core porosity (the target)
    gas = (rng.random(n) < 0.18) & (Vsh < 0.35) & (phi > 0.14)  # gas-bearing clean sand
    rho_fl = np.where(gas, 0.35, 1.0)                          # fluid density (gas vs brine)
    rho_ma = 2.65 + 0.03 * Vsh                                 # matrix density (sand -> shale)
    RHOB = rho_ma * (1 - phi) + rho_fl * phi + rng.normal(0, 0.035, n)   # bulk density, g/cc
    NPHI = phi + 0.32 * Vsh - np.where(gas, 0.10, 0.0) + rng.normal(0, 0.022, n)  # neutron, v/v
    GR = 22 * (1 - Vsh) + 125 * Vsh + rng.normal(0, 7, n)      # gamma ray, gAPI
    RT = np.exp(rng.normal(0, 0.30, n)) * (1.5 + np.where(gas, 30, 7)
         * np.clip(0.30 - phi, 0, 1)) * (1 - 0.5 * Vsh) + 0.5  # deep resistivity, ohm-m
    RT = np.clip(RT, 0.3, 400)
    return pd.DataFrame({"GR": np.round(GR, 1), "RHOB": np.round(RHOB, 3),
                         "NPHI": np.round(NPHI, 3), "RT": np.round(RT, 2),
                         "PHI_core": np.round(phi, 4)})


# ── Bake-off constants (do not edit) ─────────────────────────────────────
RAW_COLS = ["GR", "RHOB", "NPHI", "RT"]
ENG_COLS = ["DPHI", "logRT", "N_D_sep", "RHOB_NPHI"]
N_ESTIMATORS = 150


def add_engineered_features(df):
    """Return a COPY of df with exactly four engineered columns:
       DPHI, logRT, N_D_sep, RHOB_NPHI (see prompt for formulas).
    """
    out = df.copy()
    out["DPHI"] = (2.65 - out["RHOB"]) / (2.65 - 1.0)          # sandstone density porosity
    out["logRT"] = np.log10(out["RT"])                        # resistivity, linearised
    out["N_D_sep"] = out["NPHI"] - out["DPHI"]               # neutron-density separation
    out["RHOB_NPHI"] = (out["RHOB"] - out["RHOB"].mean()) * \
                       (out["NPHI"] - out["NPHI"].mean())     # centered gas-effect interaction
    return out


def bakeoff(df):
    """Return {'lin_raw','lin_aug','rf_raw','rf_aug'} of 5-fold mean CV R2.

    lin_* uses make_pipeline(StandardScaler(), LinearRegression());
    rf_*  uses RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=0);
    *_raw scores RAW_COLS, *_aug scores RAW_COLS + ENG_COLS;
    splitter is always KFold(5, shuffle=True, random_state=0).
    """
    y = df["PHI_core"].values
    kf = KFold(5, shuffle=True, random_state=0)

    def cv(pipe, cols):
        return cross_val_score(pipe, df[cols].values, y, cv=kf, scoring="r2").mean()

    lin = lambda: make_pipeline(StandardScaler(), LinearRegression())
    rf = lambda: RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=0)

    return {"lin_raw": cv(lin(), RAW_COLS),
            "lin_aug": cv(lin(), RAW_COLS + ENG_COLS),
            "rf_raw":  cv(rf(),  RAW_COLS),
            "rf_aug":  cv(rf(),  RAW_COLS + ENG_COLS)}


scores = bakeoff(add_engineered_features(make_log_dataset()))
lin_raw_r2 = scores["lin_raw"]
lin_aug_r2 = scores["lin_aug"]
rf_raw_r2 = scores["rf_raw"]
rf_aug_r2 = scores["rf_aug"]

print("lin_raw R2:", round(lin_raw_r2, 4), "  lin_aug R2:", round(lin_aug_r2, 4))
print("rf_raw  R2:", round(rf_raw_r2, 4), "  rf_aug  R2:", round(rf_aug_r2, 4))
print("linear lift:", round(lin_aug_r2 - lin_raw_r2, 4),
      "  forest lift:", round(rf_aug_r2 - rf_raw_r2, 4))

lockCopying code is a Full Access feature.

arrow_back

16.1 Predict Water Saturation - Random Forest & Archie's Resistivity

16.3 The Overfitting Curve - Tree Depth and the Bias-Variance Sweet Spot

arrow_forward