Does More Data Beat a Better Threshold?

Level 2

Chapter 17: Supervised Learning

descriptionProblem

Cut the training set to 25% of its rows and refit the forest; then restore the full set. Report gas-sand recall at the default 0.5 threshold for both. Does four times the data help the rare class as much as moving the threshold did? What does your answer imply about where to spend effort when a pay facies is being missed: more wells, or better calibration?

---

The gas sand is only ~7% of the section, and at the default 0.5 probability cut a random forest catches barely half of it. There are two levers you could pull to find more gas: gather more labelled training feet (expensive, that means more core and more interpretation) or lower the decision threshold so a foot needs less model confidence before you flag it (free; it is one number). This exercise makes you measure both on the same OML-58 test set and see which one moves the rare-class recall further.

The verified make_facies_dataset generator and FACIES/CENTROIDS constants are embedded for you. Do not modify them or re-derive the rock physics. The data is already loaded and split exactly as in the chapter:

rock = make_facies_dataset()
X = rock[['GR', 'RHOB', 'NPHI', 'RT']].values
y = rock['facies'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

The constants you need are also defined: GAS = 2 (the gas-sand label), SUBSET_SEED = 42, DEFAULT_THR = 0.5, and LOW_THR = 0.2.

Your tasks:

Write gas_recall(model, X_test, y_test, threshold) -> float. Take the model's

probability for the gas class, proba = model.predict_proba(X_test)[:, GAS], flag every test foot with proba >= threshold, and return the recall of the gas class at that cut: the fraction of true gas feet that were flagged:

``python flagged = proba >= threshold is_gas = (y_test == GAS) return float((flagged & is_gas).sum() / is_gas.sum()) ``

Write fit_forest(X_train, y_train) -> RandomForestClassifier that returns a

trained forest: RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train).

Carve out a fixed-seed quarter of the training rows:

``python rng = np.random.default_rng(SUBSET_SEED) idx = rng.choice(len(X_train), size=len(X_train) // 4, replace=False) X_q, y_q = X_train[idx], y_train[idx] ``

Fit both forests and assign these EXACT output variable names:

```python forest_full = fit_forest(X_train, y_train) forest_quarter = fit_forest(X_q, y_q)

recall_full_default = gas_recall(forest_full, X_test, y_test, DEFAULT_THR) recall_quarter_default = gas_recall(forest_quarter, X_test, y_test, DEFAULT_THR) recall_full_low = gas_recall(forest_full, X_test, y_test, LOW_THR) threshold_lift = recall_full_low - recall_full_default # gain from a lower bar data_lift = recall_full_default - recall_quarter_default # gain from 4x data ```

> Think about it: threshold_lift is what a free knob bought you; data_lift > is what four times as much training data bought you. Which is bigger here? Lowering > the bar from 0.5 to 0.2 lifts gas recall from ~0.56 to ~0.72, while shrinking the > training set to a quarter only drops it to ~0.44, so on this rare class, better > calibration beats 4x more data. Why does the threshold move recall in larger > steps than the data does, and what does that buy you (and cost you) in false alarms?

lightbulbHints (0/3)

Stuck? Reveal hints one at a time — they progress from nudge to near-solution.

codeYour solution

main.py

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# ── Verified OML-58 facies generator (do not edit) ───────────────────────
FACIES = {0: "Shale", 1: "Brine Sand", 2: "Gas Sand", 3: "Siltstone"}

# Each facies' centre in log space (GR, RHOB, NPHI, ln RT) and its scatter (the *_sd
# terms). Resistivity is centred on a LOG scale because it ranges over decades. Defined
# once and shared by the training-set and new-well generators below, which guarantees the
# rock physics is identical in training and inference.
CENTROIDS = {0: (108, 2.51, 0.33, np.log(3.5),  16, 0.080, 0.045, 0.40),
             1: (46,  2.28, 0.27, np.log(9.0),   18, 0.075, 0.045, 0.45),
             2: (43,  2.21, 0.21, np.log(15.0),  18, 0.085, 0.050, 0.55),
             3: (74,  2.40, 0.29, np.log(7.0),   21, 0.090, 0.050, 0.48)}

def make_facies_dataset(n=1500, seed=11):
    """Self-contained labelled facies set for OML 58: four logs -> rock type.

Each facies has a centroid in log space (GR, RHOB, NPHI, log-RT) and realistic
    scatter. The gas sand deliberately overlaps the brine sandstone - same low GR
    and high porosity - separated only by its lower density and neutron and higher
    resistivity. That overlap, plus its rarity (7% of the section), is what makes it
    the hard class, exactly as in the field.
    """
    rng = np.random.default_rng(seed)
    facies = rng.choice([0, 1, 2, 3], size=n, p=[0.42, 0.33, 0.07, 0.18])  # imbalanced
    GR = np.zeros(n); RHOB = np.zeros(n); NPHI = np.zeros(n); RT = np.zeros(n)
    for f, (gr, rhob, nphi, lnrt, grs, rhs, nps, rts) in CENTROIDS.items():
        m = facies == f; k = int(m.sum())
        GR[m]   = gr + rng.normal(0, grs, k)
        RHOB[m] = rhob + rng.normal(0, rhs, k)
        NPHI[m] = np.clip(nphi + rng.normal(0, nps, k), 0, 0.55)
        RT[m]   = np.clip(np.exp(lnrt + rng.normal(0, rts, k)), 0.3, 600)
    return pd.DataFrame({"GR": np.round(GR, 1), "RHOB": np.round(RHOB, 3),
                         "NPHI": np.round(NPHI, 3), "RT": np.round(RT, 2),
                         "facies": facies})

# ── Constants and the verified split (do not edit) ───────────────────────
GAS = 2            # the gas-sand facies label (the rare class we hunt for)
SUBSET_SEED = 42   # fixed seed for the "quarter of the training data" draw
DEFAULT_THR = 0.5  # the default decision threshold
LOW_THR = 0.2      # a lowered, recall-favouring threshold

rock = make_facies_dataset()
X = rock[['GR', 'RHOB', 'NPHI', 'RT']].values
y = rock['facies'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# ── Your work starts here ────────────────────────────────────────────────
def gas_recall(model, X_test, y_test, threshold):
    """Recall of the gas class at an arbitrary probability cut.

proba = model.predict_proba(X_test)[:, GAS]; flag where proba >= threshold;
    return (flagged gas feet) / (all true gas feet).
    """
    # TODO: proba = model.predict_proba(X_test)[:, GAS]
    # TODO: flagged = proba >= threshold
    # TODO: is_gas = (y_test == GAS)
    # TODO: return float((flagged & is_gas).sum() / is_gas.sum())
    pass

def fit_forest(X_train, y_train):
    """Train and return a 200-tree random forest (fixed random_state)."""
    # TODO: return RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    pass

# Carve out a fixed-seed quarter of the TRAINING rows.
# TODO: rng = np.random.default_rng(SUBSET_SEED)
# TODO: idx = rng.choice(len(X_train), size=len(X_train) // 4, replace=False)
# TODO: X_q, y_q = X_train[idx], y_train[idx]

# Fit both forests and measure the three recalls.
# TODO: forest_full = fit_forest(X_train, y_train)
# TODO: forest_quarter = fit_forest(X_q, y_q)
forest_full = None
forest_quarter = None

recall_full_default = None
recall_quarter_default = None
recall_full_low = None
threshold_lift = None
data_lift = None

print("recall (full data, thr=0.5)   :", recall_full_default)
print("recall (quarter data, thr=0.5):", recall_quarter_default)
print("recall (full data, thr=0.2)   :", recall_full_low)
print("threshold_lift (free knob)    :", threshold_lift)
print("data_lift (4x training data)  :", data_lift)

visibilityReveal reference solutionexpand_more

Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier


# ── Verified OML-58 facies generator (do not edit) ───────────────────────
FACIES = {0: "Shale", 1: "Brine Sand", 2: "Gas Sand", 3: "Siltstone"}

# Each facies' centre in log space (GR, RHOB, NPHI, ln RT) and its scatter (the *_sd
# terms). Resistivity is centred on a LOG scale because it ranges over decades. Defined
# once and shared by the training-set and new-well generators below, which guarantees the
# rock physics is identical in training and inference.
CENTROIDS = {0: (108, 2.51, 0.33, np.log(3.5),  16, 0.080, 0.045, 0.40),
             1: (46,  2.28, 0.27, np.log(9.0),   18, 0.075, 0.045, 0.45),
             2: (43,  2.21, 0.21, np.log(15.0),  18, 0.085, 0.050, 0.55),
             3: (74,  2.40, 0.29, np.log(7.0),   21, 0.090, 0.050, 0.48)}

def make_facies_dataset(n=1500, seed=11):
    """Self-contained labelled facies set for OML 58: four logs -> rock type.

    Each facies has a centroid in log space (GR, RHOB, NPHI, log-RT) and realistic
    scatter. The gas sand deliberately overlaps the brine sandstone - same low GR
    and high porosity - separated only by its lower density and neutron and higher
    resistivity. That overlap, plus its rarity (7% of the section), is what makes it
    the hard class, exactly as in the field.
    """
    rng = np.random.default_rng(seed)
    facies = rng.choice([0, 1, 2, 3], size=n, p=[0.42, 0.33, 0.07, 0.18])  # imbalanced
    GR = np.zeros(n); RHOB = np.zeros(n); NPHI = np.zeros(n); RT = np.zeros(n)
    for f, (gr, rhob, nphi, lnrt, grs, rhs, nps, rts) in CENTROIDS.items():
        m = facies == f; k = int(m.sum())
        GR[m]   = gr + rng.normal(0, grs, k)
        RHOB[m] = rhob + rng.normal(0, rhs, k)
        NPHI[m] = np.clip(nphi + rng.normal(0, nps, k), 0, 0.55)
        RT[m]   = np.clip(np.exp(lnrt + rng.normal(0, rts, k)), 0.3, 600)
    return pd.DataFrame({"GR": np.round(GR, 1), "RHOB": np.round(RHOB, 3),
                         "NPHI": np.round(NPHI, 3), "RT": np.round(RT, 2),
                         "facies": facies})


# ── Constants and the verified split (do not edit) ───────────────────────
GAS = 2            # the gas-sand facies label (the rare class we hunt for)
SUBSET_SEED = 42   # fixed seed for the "quarter of the training data" draw
DEFAULT_THR = 0.5  # the default decision threshold
LOW_THR = 0.2      # a lowered, recall-favouring threshold

rock = make_facies_dataset()
X = rock[['GR', 'RHOB', 'NPHI', 'RT']].values
y = rock['facies'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)


# ── Your work starts here ────────────────────────────────────────────────
def gas_recall(model, X_test, y_test, threshold):
    """Recall of the gas class at an arbitrary probability cut.

    proba = model.predict_proba(X_test)[:, GAS]; flag where proba >= threshold;
    return (flagged gas feet) / (all true gas feet).
    """
    proba = model.predict_proba(X_test)[:, GAS]
    flagged = proba >= threshold
    is_gas = (y_test == GAS)
    return float((flagged & is_gas).sum() / is_gas.sum())


def fit_forest(X_train, y_train):
    """Train and return a 200-tree random forest (fixed random_state)."""
    return RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)


# Carve out a fixed-seed quarter of the TRAINING rows.
rng = np.random.default_rng(SUBSET_SEED)
idx = rng.choice(len(X_train), size=len(X_train) // 4, replace=False)
X_q, y_q = X_train[idx], y_train[idx]

# Fit both forests and measure the three recalls.
forest_full = fit_forest(X_train, y_train)
forest_quarter = fit_forest(X_q, y_q)

recall_full_default = gas_recall(forest_full, X_test, y_test, DEFAULT_THR)
recall_quarter_default = gas_recall(forest_quarter, X_test, y_test, DEFAULT_THR)
recall_full_low = gas_recall(forest_full, X_test, y_test, LOW_THR)
threshold_lift = recall_full_low - recall_full_default     # gain from a lower bar
data_lift = recall_full_default - recall_quarter_default    # gain from 4x data

print("recall (full data, thr=0.5)   :", recall_full_default)
print("recall (quarter data, thr=0.5):", recall_quarter_default)
print("recall (full data, thr=0.2)   :", recall_full_low)
print("threshold_lift (free knob)    :", threshold_lift)
print("data_lift (4x training data)  :", data_lift)

lockCopying code is a Full Access feature.

arrow_back

17.2 Reading a Confusion Matrix - The Most-Confused Facies Pair

17.4 A Fifth Facies - Why a Distinct Rare Class Is Easy

arrow_forward