Exercise 17.4
A Fifth Facies - Why a Distinct Rare Class Is Easy
Add a "Coal" class to make_facies_dataset (very low density ~1.6 g/cc, high resistivity, low gamma ray) at 3% abundance. Retrain and evaluate. Is coal easy or hard to classify, and why? Compare its recall to the gas sand's, and explain what makes a rare class easy (a distinct log signature) versus hard (overlap with a common neighbour).
---
The fifth facies has been added for you. A verified generator make_facies_with_coal(n=1600, seed=11, coal_p=0.03) is embedded; it is the chapter's four-facies set plus a rare Coal class (~3%) whose log signature is deliberately distinct: very low bulk density (~1.5 g/cc), low gamma ray, high resistivity. Do not modify the embedded generator or CENTROIDS. The constants are FEATURE_COLS = ['GR', 'RHOB', 'NPHI', 'RT'], N_ESTIMATORS = 150, GAS = 2, and COAL = 4.
Your task: write evaluate_facies(df, seed=0) that returns a dict:
- Build
X = df[FEATURE_COLS].valuesandy = df['facies'].values. - Split with `train_test_split(X, y, test_size=0.25, random_state=seed,
stratify=y)` (stratifying keeps the rare classes in both halves).
- Fit
RandomForestClassifier(n_estimators=N_ESTIMATORS, random_state=seed). - Predict on the test set and compute per-class recall for the two rare
facies:
coal_recall = recall_score(y_te, pred, labels=[COAL], average="macro")gas_recall = recall_score(y_te, pred, labels=[GAS], average="macro")
- Return
{'coal_recall': coal_recall, 'gas_recall': gas_recall}.
Then call it once and expose two module-level scalars:
res = evaluate_facies(make_facies_with_coal())
coal_recall = res['coal_recall']
gas_recall = res['gas_recall']> Think about it: coal is about as rare as the gas sand, yet the model catches > nearly all of it (recall ≈ 1.0) while still missing about half the gas > (recall ≈ 0.5). Rarity alone is not what makes a class hard; overlap is. > Coal sits alone in log space, so even a handful of examples pins it down; the > gas sand hides inside the brine-sand cloud, so no amount of rebalancing > conjures a boundary that is not there. What does that tell you about when to > chase more data versus better features for a class you keep missing?
Stuck? Reveal hints one at a time — they progress from nudge to near-solution.
visibilityReveal reference solutionexpand_more
Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
# ── Verified facies generator with a 5th class, Coal (do not edit) ───────
# Facies 0-3 are copied VERBATIM from chapter 17; Coal is the new fifth class:
# very low density (~1.5 g/cc), low gamma ray, high resistivity -- a distinct,
# unmistakable log signature, unlike the gas sand that overlaps the brine sand.
FACIES = {0: "Shale", 1: "Brine Sand", 2: "Gas Sand", 3: "Siltstone", 4: "Coal"}
CENTROIDS = {
0: (108, 2.51, 0.33, np.log(3.5), 16, 0.080, 0.045, 0.40),
1: (46, 2.28, 0.27, np.log(9.0), 18, 0.075, 0.045, 0.45),
2: (43, 2.21, 0.21, np.log(15.0), 18, 0.085, 0.050, 0.55),
3: (74, 2.40, 0.29, np.log(7.0), 21, 0.090, 0.050, 0.48),
4: (25, 1.50, 0.50, np.log(120.0), 12, 0.100, 0.050, 0.50), # Coal (new, rare)
}
def make_facies_with_coal(n=1600, seed=11, coal_p=0.03):
"""The chapter-17 four-facies set plus a rare Coal class (default ~3%)."""
rng = np.random.default_rng(seed)
p = np.array([0.41, 0.32, 0.07, 0.17, coal_p])
p = p / p.sum()
facies = rng.choice([0, 1, 2, 3, 4], size=n, p=p)
GR = np.zeros(n); RHOB = np.zeros(n); NPHI = np.zeros(n); RT = np.zeros(n)
for f, (gr, rhob, nphi, lnrt, grs, rhs, nps, rts) in CENTROIDS.items():
m = facies == f; k = int(m.sum())
GR[m] = gr + rng.normal(0, grs, k)
RHOB[m] = rhob + rng.normal(0, rhs, k)
NPHI[m] = np.clip(nphi + rng.normal(0, nps, k), 0, 0.6)
RT[m] = np.clip(np.exp(lnrt + rng.normal(0, rts, k)), 0.3, 2000)
return pd.DataFrame({"GR": GR, "RHOB": RHOB, "NPHI": NPHI, "RT": RT, "facies": facies})
# ── Constants (do not edit) ──────────────────────────────────────────────
FEATURE_COLS = ['GR', 'RHOB', 'NPHI', 'RT']
N_ESTIMATORS = 150
GAS = 2
COAL = 4
def evaluate_facies(df, seed=0):
"""Train a random forest on the 5-facies set and return per-class recall for
the two RARE facies: Coal (a distinct signature) and Gas Sand (overlaps brine).
Returns {'coal_recall': float, 'gas_recall': float}."""
X = df[FEATURE_COLS].values
y = df['facies'].values
X_tr, X_te, y_tr, y_te = train_test_split(
X, y, test_size=0.25, random_state=seed, stratify=y)
rf = RandomForestClassifier(n_estimators=N_ESTIMATORS, random_state=seed)
rf.fit(X_tr, y_tr)
pred = rf.predict(X_te)
coal_recall = recall_score(y_te, pred, labels=[COAL], average="macro")
gas_recall = recall_score(y_te, pred, labels=[GAS], average="macro")
return {'coal_recall': coal_recall, 'gas_recall': gas_recall}
res = evaluate_facies(make_facies_with_coal())
coal_recall = res['coal_recall']
gas_recall = res['gas_recall']
print("Coal recall :", coal_recall)
print("Gas Sand recall :", gas_recall)
print("coal beats gas? :", coal_recall > gas_recall)
lockCopying code is a Full Access feature.