Exercise 16.2
Feature Engineering Bake-Off - Does It Help the Linear Model or the Forest?
Starting from the four raw logs, engineer at least three new features (e.g. density porosity, neutron–density separation, log resistivity, a gamma-ray shale index). Using 5-fold cross-validation, compare: (a) raw logs only, (b) engineered features only, (c) raw + engineered. Report mean ± std R² for each. Does feature engineering help a random forest as much as it helps a linear model? Explain why or why not.
---
We'll run the bake-off as a head-to-head between a linear model and a random forest, on the same five-fold cross-validation, to settle the question the book asks: does feature engineering help the forest as much as it helps the linear model?
The verified make_log_dataset generator is embedded for you under a "do not edit" banner. Do not modify it. The constants are also fixed:
RAW_COLS = ["GR", "RHOB", "NPHI", "RT"]
ENG_COLS = ["DPHI", "logRT", "N_D_sep", "RHOB_NPHI"]
N_ESTIMATORS = 150Every score uses the same splitter: KFold(5, shuffle=True, random_state=0).
### Task 1: add_engineered_features(df)
Return a copy of df with exactly four new columns, each a piece of petrophysics a linear model cannot invent on its own:
| column | formula | meaning |
|---|---|---|
DPHI | (2.65 - RHOB) / (2.65 - 1.0) | sandstone density porosity |
logRT | np.log10(RT) | resistivity, linearised |
N_D_sep | NPHI - DPHI | neutron–density separation (gas/shale) |
RHOB_NPHI | (RHOB - RHOB.mean()) * (NPHI - NPHI.mean()) | centered gas-effect interaction product |
RHOB_NPHI is the one a linear model genuinely cannot form by itself: it is the product of two logs, the only way a straight-line model can read density and neutron jointly to catch the gas effect (where the two logs move in opposite directions).
### Task 2: bakeoff(df)
Return a dict with four 5-fold mean CV R² scores: {"lin_raw", "lin_aug", "rf_raw", "rf_aug"}.
lin_*usesmake_pipeline(StandardScaler(), LinearRegression()).rf_*usesRandomForestRegressor(n_estimators=150, random_state=0).*_rawscoresRAW_COLS;*_augscoresRAW_COLS + ENG_COLS.- the target is
df["PHI_core"]. - each score is
cross_val_score(pipe, X, y, cv=KFold(5, shuffle=True, random_state=0), scoring="r2").mean().
### Output variables the tests read
scores = bakeoff(add_engineered_features(make_log_dataset()))
lin_raw_r2 = scores["lin_raw"]
lin_aug_r2 = scores["lin_aug"]
rf_raw_r2 = scores["rf_raw"]
rf_aug_r2 = scores["rf_aug"]> Think about it: the linear model gets a real lift from the engineered > columns (especially the RHOB*NPHI product), but the forest barely moves; > it already reads the raw logs jointly through its splits, so handing it > pre-computed interactions tells it little it did not already know. That gap, > engineering lifts the linear model more than the forest, is the lesson this > exercise is built to prove.
Stuck? Reveal hints one at a time — they progress from nudge to near-solution.
visibilityReveal reference solutionexpand_more
Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
# ── Verified log-to-porosity generator (do not edit) ─────────────────────
def make_log_dataset(n=700, seed=42):
"""Synthetic-but-realistic logs -> core porosity for OML 58 wells.
Rock properties (shale volume, porosity, fluid) drive the log responses,
so recovering porosity from the logs is a real multi-log inverse problem -
including the gas effect, where light hydrocarbon lowers both RHOB and NPHI.
"""
rng = np.random.default_rng(seed)
Vsh = np.clip(rng.beta(1.3, 3.2, n), 0, 1) # shale volume fraction
depth = rng.uniform(8000, 11000, n) # ft
phi = np.clip((0.33 - 0.020 * (depth - 8000) / 1000) * (1 - 0.9 * Vsh)
+ rng.normal(0, 0.020, n), 0.02, 0.34) # core porosity (the target)
gas = (rng.random(n) < 0.18) & (Vsh < 0.35) & (phi > 0.14) # gas-bearing clean sand
rho_fl = np.where(gas, 0.35, 1.0) # fluid density (gas vs brine)
rho_ma = 2.65 + 0.03 * Vsh # matrix density (sand -> shale)
RHOB = rho_ma * (1 - phi) + rho_fl * phi + rng.normal(0, 0.035, n) # bulk density, g/cc
NPHI = phi + 0.32 * Vsh - np.where(gas, 0.10, 0.0) + rng.normal(0, 0.022, n) # neutron, v/v
GR = 22 * (1 - Vsh) + 125 * Vsh + rng.normal(0, 7, n) # gamma ray, gAPI
RT = np.exp(rng.normal(0, 0.30, n)) * (1.5 + np.where(gas, 30, 7)
* np.clip(0.30 - phi, 0, 1)) * (1 - 0.5 * Vsh) + 0.5 # deep resistivity, ohm-m
RT = np.clip(RT, 0.3, 400)
return pd.DataFrame({"GR": np.round(GR, 1), "RHOB": np.round(RHOB, 3),
"NPHI": np.round(NPHI, 3), "RT": np.round(RT, 2),
"PHI_core": np.round(phi, 4)})
# ── Bake-off constants (do not edit) ─────────────────────────────────────
RAW_COLS = ["GR", "RHOB", "NPHI", "RT"]
ENG_COLS = ["DPHI", "logRT", "N_D_sep", "RHOB_NPHI"]
N_ESTIMATORS = 150
def add_engineered_features(df):
"""Return a COPY of df with exactly four engineered columns:
DPHI, logRT, N_D_sep, RHOB_NPHI (see prompt for formulas).
"""
out = df.copy()
out["DPHI"] = (2.65 - out["RHOB"]) / (2.65 - 1.0) # sandstone density porosity
out["logRT"] = np.log10(out["RT"]) # resistivity, linearised
out["N_D_sep"] = out["NPHI"] - out["DPHI"] # neutron-density separation
out["RHOB_NPHI"] = (out["RHOB"] - out["RHOB"].mean()) * \
(out["NPHI"] - out["NPHI"].mean()) # centered gas-effect interaction
return out
def bakeoff(df):
"""Return {'lin_raw','lin_aug','rf_raw','rf_aug'} of 5-fold mean CV R2.
lin_* uses make_pipeline(StandardScaler(), LinearRegression());
rf_* uses RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=0);
*_raw scores RAW_COLS, *_aug scores RAW_COLS + ENG_COLS;
splitter is always KFold(5, shuffle=True, random_state=0).
"""
y = df["PHI_core"].values
kf = KFold(5, shuffle=True, random_state=0)
def cv(pipe, cols):
return cross_val_score(pipe, df[cols].values, y, cv=kf, scoring="r2").mean()
lin = lambda: make_pipeline(StandardScaler(), LinearRegression())
rf = lambda: RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=0)
return {"lin_raw": cv(lin(), RAW_COLS),
"lin_aug": cv(lin(), RAW_COLS + ENG_COLS),
"rf_raw": cv(rf(), RAW_COLS),
"rf_aug": cv(rf(), RAW_COLS + ENG_COLS)}
scores = bakeoff(add_engineered_features(make_log_dataset()))
lin_raw_r2 = scores["lin_raw"]
lin_aug_r2 = scores["lin_aug"]
rf_raw_r2 = scores["rf_raw"]
rf_aug_r2 = scores["rf_aug"]
print("lin_raw R2:", round(lin_raw_r2, 4), " lin_aug R2:", round(lin_aug_r2, 4))
print("rf_raw R2:", round(rf_raw_r2, 4), " rf_aug R2:", round(rf_aug_r2, 4))
print("linear lift:", round(lin_aug_r2 - lin_raw_r2, 4),
" forest lift:", round(rf_aug_r2 - rf_raw_r2, 4))
lockCopying code is a Full Access feature.