Exercise 14.6
ROP Model with Domain Features - Engineering Beats Raw Channels
Build an ROP prediction model that includes engineered features: MSE, depth-normalized torque (torque/depth), specific energy ratio, and bit run footage. Compare the model's accuracy with and without these features.
---
The full 5,000-record OML-58 drilling log is embedded for you, generated VERBATIM from the chapter (np.random.seed(42)), along with the verified compute_mse function. Do not modify the data generator, compute_mse, or the constants. build only the two functions below on top of them.
The chapter trains its ROP model on five raw channels and validates it with a depth-based split: train on everything above SPLIT_DEPTH = 7000 ft, then test on the deeper, unseen hard formation. That is honest extrapolation into rock the model never saw, so the test R² is strongly negative. And that is the point. Your job is to show that a few engineered domain features make the extrapolation less bad (higher R², lower MAE) without leaking the target.
Constants (already embedded): BIT_DIAMETER = 8.5, SPLIT_DEPTH = 7000, BASE_FEATURES = ['WOB (klbf)', 'RPM', 'Torque (ft-lbs)', 'SPP (psi)', 'Depth (ft)'], TARGET = 'ROP (ft/hr)'.
Your tasks:
- Write
add_domain_features(df, db=BIT_DIAMETER). Return a copy ofdf
with three new columns (do not mutate the caller's DataFrame):
'Torque/Depth'=Torque (ft-lbs)/Depth (ft)(depth-normalized torque).'WOB*RPM'=WOB (klbf)*RPM, a mechanical-power proxy.'Torque/WOB'=Torque (ft-lbs)/ (WOB (klbf)* 1000.0), specific-torque
ratio (WOB converted klbf → lbf, matching the compute_mse convention). None of these touch ROP (ft/hr), so there is no target leakage. (compute_mse does use ROP, so it is embedded only to demonstrate the verified function; it is deliberately not a model feature.)
- Write
train_rop_model(df, features, split_depth=SPLIT_DEPTH, random_state=42).
Use a depth-based split (Depth (ft) < split_depth is train, >= is test), fit GradientBoostingRegressor(n_estimators=200, max_depth=5, learning_rate=0.1, random_state=random_state) on df[features] → df[TARGET], predict the test set, and return the dict {'test_r2': <float>, 'test_mae': <float>}.
Then build these exact output variables:
df_feat = add_domain_features(drilling_data)AUG_FEATURES = BASE_FEATURES + ['Torque/Depth', 'WOB*RPM', 'Torque/WOB']result_base = train_rop_model(df_feat, BASE_FEATURES)result_aug = train_rop_model(df_feat, AUG_FEATURES)r2_base = result_base['test_r2'](a float)r2_aug = result_aug['test_r2'](a float)
> Think about it: both R² values come out negative here. The test set is a > formation the model never trained on, so it cannot do better than guessing the > training mean. The win is relative: the engineered features should make > r2_aug higher (less negative) and the augmented MAE lower than the base > model. Why would a depth-normalized torque ratio help a tree extrapolate into > deeper rock when the raw torque channel alone does not?
Stuck? Reveal hints one at a time — they progress from nudge to near-solution.
visibilityReveal reference solutionexpand_more
Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
# ── Verified OML-58 drilling-data generator (do not edit) ────────────────
# Copied VERBATIM from chapters/14-drilling-analytics.qmd (np.random.seed(42),
# 5000 records). This is the realistic well the ROP model trains on.
np.random.seed(42)
n_points = 5000
# Depth increases with variable ROP
depth_start = 500
rop_base = np.piecewise(
np.linspace(0, 1, n_points),
[np.linspace(0, 1, n_points) < 0.3,
(np.linspace(0, 1, n_points) >= 0.3) & (np.linspace(0, 1, n_points) < 0.6),
np.linspace(0, 1, n_points) >= 0.6],
[lambda x: 120 - 80*x, # soft formation, fast drilling
lambda x: 45 + 10*np.sin(20*x), # medium formation, variable
lambda x: 25 - 10*(x-0.6)] # hard formation, slow
)
rop = rop_base + np.random.normal(0, 5, n_points)
rop = np.clip(rop, 5, 200)
# Cumulative depth from ROP (assuming 2-minute measurement intervals)
time_interval_hr = 2 / 60 # 2 minutes in hours
depth = depth_start + np.cumsum(rop * time_interval_hr)
# Drilling parameters correlated with formation
wob = np.where(depth < 4000, 15 + np.random.normal(0, 2, n_points),
np.where(depth < 7000, 25 + np.random.normal(0, 3, n_points),
35 + np.random.normal(0, 3, n_points)))
wob = np.clip(wob, 5, 50)
rpm = np.where(depth < 4000, 140 + np.random.normal(0, 10, n_points),
np.where(depth < 7000, 120 + np.random.normal(0, 8, n_points),
100 + np.random.normal(0, 8, n_points)))
rpm = np.clip(rpm, 40, 180)
# Bit torque from rock-cutting friction: T ~ mu_t * WOB * (Db/12), with WOB in
# lbf and Db in inches. A torque coefficient ~0.125 on an 8.5-in bit gives a
# realistic few-thousand ft-lbf -- the scale the Teale MSE term actually expects.
bit_diameter_in = 8.5
torque = 0.125 * (wob * 1000.0) * (bit_diameter_in / 12.0) + np.random.normal(0, 250, n_points)
torque = np.clip(torque, 800, 30000)
spp = 2000 + 0.15 * depth + np.random.normal(0, 100, n_points)
# Inject a stick-slip zone (6000-6500 ft). The true surface signature is
# OSCILLATING TORQUE at a STEADY surface RPM -- the driller holds RPM constant
# while the bit periodically sticks and releases -- so we modulate torque, not
# RPM. This is what the CV-on-torque detector below is built to catch.
stick_slip_mask = (depth > 6000) & (depth < 6500)
torque[stick_slip_mask] = (
torque[stick_slip_mask] + 3500 * np.sin(np.linspace(0, 60, stick_slip_mask.sum()))
)
drilling_data = pd.DataFrame({
"Depth (ft)": np.round(depth, 1),
"ROP (ft/hr)": np.round(rop, 1),
"WOB (klbf)": np.round(wob, 1),
"RPM": np.round(rpm, 0).astype(int),
"Torque (ft-lbs)": np.round(torque, 0).astype(int),
"SPP (psi)": np.round(spp, 0).astype(int),
})
# ── Verified Mechanical Specific Energy (Teale) function (do not edit) ───
# Copied VERBATIM from chapters/14-drilling-analytics.qmd. Embedded to engineer
# a diagnostic MSE column; ROP-derived MSE is NOT used as a model feature
# (that would be target leakage).
def compute_mse(torque_ftlbs, rpm, wob_lbf, rop_fthr, bit_diameter_in):
"""
Calculate Mechanical Specific Energy.
Parameters
----------
torque_ftlbs : array-like
Torque (ft-lbs).
rpm : array-like
Rotary speed (rev/min).
wob_lbf : array-like
Weight on bit (lbf). Note: input in lbf, not klbf.
rop_fthr : array-like
Rate of penetration (ft/hr).
bit_diameter_in : float
Bit diameter (inches).
Returns
-------
numpy.ndarray
MSE values (psi).
"""
torque = np.asarray(torque_ftlbs, dtype=float)
r = np.asarray(rpm, dtype=float)
w = np.asarray(wob_lbf, dtype=float)
rop = np.asarray(rop_fthr, dtype=float)
db = bit_diameter_in
# Avoid division by zero
rop_safe = np.where(rop > 0, rop, 1e-6)
rotary_term = 480 * torque * r / (db**2 * rop_safe)
wob_term = 4 * w / (np.pi * db**2)
return rotary_term + wob_term
# ── OML-58 model constants (do not edit) ─────────────────────────────────
BIT_DIAMETER = 8.5 # bit diameter, inches
SPLIT_DEPTH = 7000 # depth-based train/test split, ft
BASE_FEATURES = ["WOB (klbf)", "RPM", "Torque (ft-lbs)", "SPP (psi)", "Depth (ft)"]
TARGET = "ROP (ft/hr)"
def add_domain_features(df, db=BIT_DIAMETER):
"""Return a COPY of df with three engineered, leakage-free drilling features.
New columns (none derived from ROP):
'Torque/Depth' = Torque / Depth depth-normalized torque
'WOB*RPM' = WOB (klbf) * RPM mechanical-power proxy
'Torque/WOB' = Torque / (WOB_klbf*1000) specific-torque ratio (lbf)
"""
out = df.copy()
out["Torque/Depth"] = out["Torque (ft-lbs)"] / out["Depth (ft)"]
out["WOB*RPM"] = out["WOB (klbf)"] * out["RPM"]
out["Torque/WOB"] = out["Torque (ft-lbs)"] / (out["WOB (klbf)"] * 1000.0)
return out
def train_rop_model(df, features, split_depth=SPLIT_DEPTH, random_state=42):
"""Train a gradient-boosted ROP model with the chapter's depth-based split.
Train on Depth < split_depth, test on Depth >= split_depth (unseen deeper
formation). Returns {'test_r2', 'test_mae'} on the held-out test set.
"""
train_mask = (df["Depth (ft)"] < split_depth).values
test_mask = (df["Depth (ft)"] >= split_depth).values
X = df[features].values
y = df[TARGET].values
X_train, X_test = X[train_mask], X[test_mask]
y_train, y_test = y[train_mask], y[test_mask]
model = GradientBoostingRegressor(
n_estimators=200,
max_depth=5,
learning_rate=0.1,
random_state=random_state,
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
return {
"test_r2": float(r2_score(y_test, y_pred)),
"test_mae": float(mean_absolute_error(y_test, y_pred)),
}
# ── Build the graded outputs ─────────────────────────────────────────────
df_feat = add_domain_features(drilling_data)
AUG_FEATURES = BASE_FEATURES + ["Torque/Depth", "WOB*RPM", "Torque/WOB"]
result_base = train_rop_model(df_feat, BASE_FEATURES)
result_aug = train_rop_model(df_feat, AUG_FEATURES)
r2_base = result_base["test_r2"]
r2_aug = result_aug["test_r2"]
print("Base model : R2 = %.4f MAE = %.3f ft/hr" % (r2_base, result_base["test_mae"]))
print("Augmented : R2 = %.4f MAE = %.3f ft/hr" % (r2_aug, result_aug["test_mae"]))
print("Domain features lift R2 by %.4f and cut MAE by %.3f ft/hr"
% (r2_aug - r2_base, result_base["test_mae"] - result_aug["test_mae"]))
lockCopying code is a Full Access feature.