Exerciseschevron_rightChapter 14chevron_right14.6
fitness_center

Exercise 14.6

ROP Model with Domain Features - Engineering Beats Raw Channels

Level 3
Chapter 14: Drilling Analytics
descriptionProblem

Build an ROP prediction model that includes engineered features: MSE, depth-normalized torque (torque/depth), specific energy ratio, and bit run footage. Compare the model's accuracy with and without these features.

---

The full 5,000-record OML-58 drilling log is embedded for you, generated VERBATIM from the chapter (np.random.seed(42)), along with the verified compute_mse function. Do not modify the data generator, compute_mse, or the constants. build only the two functions below on top of them.

The chapter trains its ROP model on five raw channels and validates it with a depth-based split: train on everything above SPLIT_DEPTH = 7000 ft, then test on the deeper, unseen hard formation. That is honest extrapolation into rock the model never saw, so the test R² is strongly negative. And that is the point. Your job is to show that a few engineered domain features make the extrapolation less bad (higher R², lower MAE) without leaking the target.

Constants (already embedded): BIT_DIAMETER = 8.5, SPLIT_DEPTH = 7000, BASE_FEATURES = ['WOB (klbf)', 'RPM', 'Torque (ft-lbs)', 'SPP (psi)', 'Depth (ft)'], TARGET = 'ROP (ft/hr)'.

Your tasks:

  1. Write add_domain_features(df, db=BIT_DIAMETER). Return a copy of df

with three new columns (do not mutate the caller's DataFrame):

  • 'Torque/Depth' = Torque (ft-lbs) / Depth (ft) (depth-normalized torque).
  • 'WOB*RPM' = WOB (klbf) * RPM, a mechanical-power proxy.
  • 'Torque/WOB' = Torque (ft-lbs) / (WOB (klbf) * 1000.0), specific-torque

ratio (WOB converted klbf → lbf, matching the compute_mse convention). None of these touch ROP (ft/hr), so there is no target leakage. (compute_mse does use ROP, so it is embedded only to demonstrate the verified function; it is deliberately not a model feature.)

  1. Write train_rop_model(df, features, split_depth=SPLIT_DEPTH, random_state=42).

Use a depth-based split (Depth (ft) < split_depth is train, >= is test), fit GradientBoostingRegressor(n_estimators=200, max_depth=5, learning_rate=0.1, random_state=random_state) on df[features]df[TARGET], predict the test set, and return the dict {'test_r2': <float>, 'test_mae': <float>}.

Then build these exact output variables:

  • df_feat = add_domain_features(drilling_data)
  • AUG_FEATURES = BASE_FEATURES + ['Torque/Depth', 'WOB*RPM', 'Torque/WOB']
  • result_base = train_rop_model(df_feat, BASE_FEATURES)
  • result_aug = train_rop_model(df_feat, AUG_FEATURES)
  • r2_base = result_base['test_r2'] (a float)
  • r2_aug = result_aug['test_r2'] (a float)

> Think about it: both R² values come out negative here. The test set is a > formation the model never trained on, so it cannot do better than guessing the > training mean. The win is relative: the engineered features should make > r2_aug higher (less negative) and the augmented MAE lower than the base > model. Why would a depth-normalized torque ratio help a tree extrapolate into > deeper rock when the raw torque channel alone does not?

lightbulbHints (0/3)

Stuck? Reveal hints one at a time — they progress from nudge to near-solution.

codeYour solution
main.py
visibilityReveal reference solutionexpand_more

Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score


# ── Verified OML-58 drilling-data generator (do not edit) ────────────────
# Copied VERBATIM from chapters/14-drilling-analytics.qmd (np.random.seed(42),
# 5000 records). This is the realistic well the ROP model trains on.
np.random.seed(42)
n_points = 5000

# Depth increases with variable ROP
depth_start = 500
rop_base = np.piecewise(
    np.linspace(0, 1, n_points),
    [np.linspace(0, 1, n_points) < 0.3,
     (np.linspace(0, 1, n_points) >= 0.3) & (np.linspace(0, 1, n_points) < 0.6),
     np.linspace(0, 1, n_points) >= 0.6],
    [lambda x: 120 - 80*x,        # soft formation, fast drilling
     lambda x: 45 + 10*np.sin(20*x),  # medium formation, variable
     lambda x: 25 - 10*(x-0.6)]    # hard formation, slow
)
rop = rop_base + np.random.normal(0, 5, n_points)
rop = np.clip(rop, 5, 200)

# Cumulative depth from ROP (assuming 2-minute measurement intervals)
time_interval_hr = 2 / 60  # 2 minutes in hours
depth = depth_start + np.cumsum(rop * time_interval_hr)

# Drilling parameters correlated with formation
wob = np.where(depth < 4000, 15 + np.random.normal(0, 2, n_points),
       np.where(depth < 7000, 25 + np.random.normal(0, 3, n_points),
                35 + np.random.normal(0, 3, n_points)))
wob = np.clip(wob, 5, 50)

rpm = np.where(depth < 4000, 140 + np.random.normal(0, 10, n_points),
       np.where(depth < 7000, 120 + np.random.normal(0, 8, n_points),
                100 + np.random.normal(0, 8, n_points)))
rpm = np.clip(rpm, 40, 180)

# Bit torque from rock-cutting friction: T ~ mu_t * WOB * (Db/12), with WOB in
# lbf and Db in inches. A torque coefficient ~0.125 on an 8.5-in bit gives a
# realistic few-thousand ft-lbf -- the scale the Teale MSE term actually expects.
bit_diameter_in = 8.5
torque = 0.125 * (wob * 1000.0) * (bit_diameter_in / 12.0) + np.random.normal(0, 250, n_points)
torque = np.clip(torque, 800, 30000)

spp = 2000 + 0.15 * depth + np.random.normal(0, 100, n_points)

# Inject a stick-slip zone (6000-6500 ft). The true surface signature is
# OSCILLATING TORQUE at a STEADY surface RPM -- the driller holds RPM constant
# while the bit periodically sticks and releases -- so we modulate torque, not
# RPM. This is what the CV-on-torque detector below is built to catch.
stick_slip_mask = (depth > 6000) & (depth < 6500)
torque[stick_slip_mask] = (
    torque[stick_slip_mask] + 3500 * np.sin(np.linspace(0, 60, stick_slip_mask.sum()))
)

drilling_data = pd.DataFrame({
    "Depth (ft)": np.round(depth, 1),
    "ROP (ft/hr)": np.round(rop, 1),
    "WOB (klbf)": np.round(wob, 1),
    "RPM": np.round(rpm, 0).astype(int),
    "Torque (ft-lbs)": np.round(torque, 0).astype(int),
    "SPP (psi)": np.round(spp, 0).astype(int),
})


# ── Verified Mechanical Specific Energy (Teale) function (do not edit) ───
# Copied VERBATIM from chapters/14-drilling-analytics.qmd. Embedded to engineer
# a diagnostic MSE column; ROP-derived MSE is NOT used as a model feature
# (that would be target leakage).
def compute_mse(torque_ftlbs, rpm, wob_lbf, rop_fthr, bit_diameter_in):
    """
    Calculate Mechanical Specific Energy.

    Parameters
    ----------
    torque_ftlbs : array-like
        Torque (ft-lbs).
    rpm : array-like
        Rotary speed (rev/min).
    wob_lbf : array-like
        Weight on bit (lbf). Note: input in lbf, not klbf.
    rop_fthr : array-like
        Rate of penetration (ft/hr).
    bit_diameter_in : float
        Bit diameter (inches).

    Returns
    -------
    numpy.ndarray
        MSE values (psi).
    """
    torque = np.asarray(torque_ftlbs, dtype=float)
    r = np.asarray(rpm, dtype=float)
    w = np.asarray(wob_lbf, dtype=float)
    rop = np.asarray(rop_fthr, dtype=float)
    db = bit_diameter_in

    # Avoid division by zero
    rop_safe = np.where(rop > 0, rop, 1e-6)

    rotary_term = 480 * torque * r / (db**2 * rop_safe)
    wob_term = 4 * w / (np.pi * db**2)

    return rotary_term + wob_term


# ── OML-58 model constants (do not edit) ─────────────────────────────────
BIT_DIAMETER = 8.5            # bit diameter, inches
SPLIT_DEPTH = 7000           # depth-based train/test split, ft
BASE_FEATURES = ["WOB (klbf)", "RPM", "Torque (ft-lbs)", "SPP (psi)", "Depth (ft)"]
TARGET = "ROP (ft/hr)"


def add_domain_features(df, db=BIT_DIAMETER):
    """Return a COPY of df with three engineered, leakage-free drilling features.

    New columns (none derived from ROP):
      'Torque/Depth' = Torque / Depth            depth-normalized torque
      'WOB*RPM'      = WOB (klbf) * RPM           mechanical-power proxy
      'Torque/WOB'   = Torque / (WOB_klbf*1000)   specific-torque ratio (lbf)
    """
    out = df.copy()
    out["Torque/Depth"] = out["Torque (ft-lbs)"] / out["Depth (ft)"]
    out["WOB*RPM"] = out["WOB (klbf)"] * out["RPM"]
    out["Torque/WOB"] = out["Torque (ft-lbs)"] / (out["WOB (klbf)"] * 1000.0)
    return out


def train_rop_model(df, features, split_depth=SPLIT_DEPTH, random_state=42):
    """Train a gradient-boosted ROP model with the chapter's depth-based split.

    Train on Depth < split_depth, test on Depth >= split_depth (unseen deeper
    formation). Returns {'test_r2', 'test_mae'} on the held-out test set.
    """
    train_mask = (df["Depth (ft)"] < split_depth).values
    test_mask = (df["Depth (ft)"] >= split_depth).values

    X = df[features].values
    y = df[TARGET].values
    X_train, X_test = X[train_mask], X[test_mask]
    y_train, y_test = y[train_mask], y[test_mask]

    model = GradientBoostingRegressor(
        n_estimators=200,
        max_depth=5,
        learning_rate=0.1,
        random_state=random_state,
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    return {
        "test_r2": float(r2_score(y_test, y_pred)),
        "test_mae": float(mean_absolute_error(y_test, y_pred)),
    }


# ── Build the graded outputs ─────────────────────────────────────────────
df_feat = add_domain_features(drilling_data)
AUG_FEATURES = BASE_FEATURES + ["Torque/Depth", "WOB*RPM", "Torque/WOB"]

result_base = train_rop_model(df_feat, BASE_FEATURES)
result_aug = train_rop_model(df_feat, AUG_FEATURES)
r2_base = result_base["test_r2"]
r2_aug = result_aug["test_r2"]

print("Base model :  R2 = %.4f   MAE = %.3f ft/hr" % (r2_base, result_base["test_mae"]))
print("Augmented  :  R2 = %.4f   MAE = %.3f ft/hr" % (r2_aug, result_aug["test_mae"]))
print("Domain features lift R2 by %.4f and cut MAE by %.3f ft/hr"
      % (r2_aug - r2_base, result_base["test_mae"] - result_aug["test_mae"]))

lockCopying code is a Full Access feature.