A Fifth Facies - Why a Distinct Rare Class Is Easy

Level 2

Chapter 17: Supervised Learning

descriptionProblem

Add a "Coal" class to make_facies_dataset (very low density ~1.6 g/cc, high resistivity, low gamma ray) at 3% abundance. Retrain and evaluate. Is coal easy or hard to classify, and why? Compare its recall to the gas sand's, and explain what makes a rare class easy (a distinct log signature) versus hard (overlap with a common neighbour).

---

The fifth facies has been added for you. A verified generator make_facies_with_coal(n=1600, seed=11, coal_p=0.03) is embedded; it is the chapter's four-facies set plus a rare Coal class (~3%) whose log signature is deliberately distinct: very low bulk density (~1.5 g/cc), low gamma ray, high resistivity. Do not modify the embedded generator or CENTROIDS. The constants are FEATURE_COLS = ['GR', 'RHOB', 'NPHI', 'RT'], N_ESTIMATORS = 150, GAS = 2, and COAL = 4.

Your task: write evaluate_facies(df, seed=0) that returns a dict:

Build X = df[FEATURE_COLS].values and y = df['facies'].values.
Split with `train_test_split(X, y, test_size=0.25, random_state=seed,

stratify=y)` (stratifying keeps the rare classes in both halves).

Fit RandomForestClassifier(n_estimators=N_ESTIMATORS, random_state=seed).
Predict on the test set and compute per-class recall for the two rare

facies:

coal_recall = recall_score(y_te, pred, labels=[COAL], average="macro")
gas_recall = recall_score(y_te, pred, labels=[GAS], average="macro")

Return {'coal_recall': coal_recall, 'gas_recall': gas_recall}.

Then call it once and expose two module-level scalars:

res = evaluate_facies(make_facies_with_coal())
coal_recall = res['coal_recall']
gas_recall = res['gas_recall']

> Think about it: coal is about as rare as the gas sand, yet the model catches > nearly all of it (recall ≈ 1.0) while still missing about half the gas > (recall ≈ 0.5). Rarity alone is not what makes a class hard; overlap is. > Coal sits alone in log space, so even a handful of examples pins it down; the > gas sand hides inside the brine-sand cloud, so no amount of rebalancing > conjures a boundary that is not there. What does that tell you about when to > chase more data versus better features for a class you keep missing?

lightbulbHints (0/3)

Stuck? Reveal hints one at a time — they progress from nudge to near-solution.

codeYour solution

main.py

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

# ── Verified facies generator with a 5th class, Coal (do not edit) ───────
# Facies 0-3 are copied VERBATIM from chapter 17; Coal is the new fifth class:
# very low density (~1.5 g/cc), low gamma ray, high resistivity -- a distinct,
# unmistakable log signature, unlike the gas sand that overlaps the brine sand.
FACIES = {0: "Shale", 1: "Brine Sand", 2: "Gas Sand", 3: "Siltstone", 4: "Coal"}
CENTROIDS = {
    0: (108, 2.51, 0.33, np.log(3.5),   16, 0.080, 0.045, 0.40),
    1: (46,  2.28, 0.27, np.log(9.0),    18, 0.075, 0.045, 0.45),
    2: (43,  2.21, 0.21, np.log(15.0),   18, 0.085, 0.050, 0.55),
    3: (74,  2.40, 0.29, np.log(7.0),    21, 0.090, 0.050, 0.48),
    4: (25,  1.50, 0.50, np.log(120.0),  12, 0.100, 0.050, 0.50),   # Coal (new, rare)
}

def make_facies_with_coal(n=1600, seed=11, coal_p=0.03):
    """The chapter-17 four-facies set plus a rare Coal class (default ~3%)."""
    rng = np.random.default_rng(seed)
    p = np.array([0.41, 0.32, 0.07, 0.17, coal_p])
    p = p / p.sum()
    facies = rng.choice([0, 1, 2, 3, 4], size=n, p=p)
    GR = np.zeros(n); RHOB = np.zeros(n); NPHI = np.zeros(n); RT = np.zeros(n)
    for f, (gr, rhob, nphi, lnrt, grs, rhs, nps, rts) in CENTROIDS.items():
        m = facies == f; k = int(m.sum())
        GR[m]   = gr + rng.normal(0, grs, k)
        RHOB[m] = rhob + rng.normal(0, rhs, k)
        NPHI[m] = np.clip(nphi + rng.normal(0, nps, k), 0, 0.6)
        RT[m]   = np.clip(np.exp(lnrt + rng.normal(0, rts, k)), 0.3, 2000)
    return pd.DataFrame({"GR": GR, "RHOB": RHOB, "NPHI": NPHI, "RT": RT, "facies": facies})

# ── Constants (do not edit) ──────────────────────────────────────────────
FEATURE_COLS = ['GR', 'RHOB', 'NPHI', 'RT']
N_ESTIMATORS = 150
GAS = 2
COAL = 4

def evaluate_facies(df, seed=0):
    """Train a random forest on the 5-facies set and return per-class recall for
    the two RARE facies: Coal (a distinct signature) and Gas Sand (overlaps brine).
    Returns {'coal_recall': float, 'gas_recall': float}."""
    # TODO: X = df[FEATURE_COLS].values
    # TODO: y = df['facies'].values
    # TODO: X_tr, X_te, y_tr, y_te = train_test_split(
    # TODO:     X, y, test_size=0.25, random_state=seed, stratify=y)
    # TODO: rf = RandomForestClassifier(n_estimators=N_ESTIMATORS, random_state=seed)
    # TODO: rf.fit(X_tr, y_tr)
    # TODO: pred = rf.predict(X_te)
    # TODO: coal_recall = recall_score(y_te, pred, labels=[COAL], average="macro")
    # TODO: gas_recall = recall_score(y_te, pred, labels=[GAS], average="macro")
    # TODO: return {'coal_recall': coal_recall, 'gas_recall': gas_recall}
    return None

res = evaluate_facies(make_facies_with_coal())
coal_recall = None if res is None else res['coal_recall']
gas_recall = None if res is None else res['gas_recall']

print("Coal recall     :", coal_recall)
print("Gas Sand recall :", gas_recall)

visibilityReveal reference solutionexpand_more

Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score


# ── Verified facies generator with a 5th class, Coal (do not edit) ───────
# Facies 0-3 are copied VERBATIM from chapter 17; Coal is the new fifth class:
# very low density (~1.5 g/cc), low gamma ray, high resistivity -- a distinct,
# unmistakable log signature, unlike the gas sand that overlaps the brine sand.
FACIES = {0: "Shale", 1: "Brine Sand", 2: "Gas Sand", 3: "Siltstone", 4: "Coal"}
CENTROIDS = {
    0: (108, 2.51, 0.33, np.log(3.5),   16, 0.080, 0.045, 0.40),
    1: (46,  2.28, 0.27, np.log(9.0),    18, 0.075, 0.045, 0.45),
    2: (43,  2.21, 0.21, np.log(15.0),   18, 0.085, 0.050, 0.55),
    3: (74,  2.40, 0.29, np.log(7.0),    21, 0.090, 0.050, 0.48),
    4: (25,  1.50, 0.50, np.log(120.0),  12, 0.100, 0.050, 0.50),   # Coal (new, rare)
}


def make_facies_with_coal(n=1600, seed=11, coal_p=0.03):
    """The chapter-17 four-facies set plus a rare Coal class (default ~3%)."""
    rng = np.random.default_rng(seed)
    p = np.array([0.41, 0.32, 0.07, 0.17, coal_p])
    p = p / p.sum()
    facies = rng.choice([0, 1, 2, 3, 4], size=n, p=p)
    GR = np.zeros(n); RHOB = np.zeros(n); NPHI = np.zeros(n); RT = np.zeros(n)
    for f, (gr, rhob, nphi, lnrt, grs, rhs, nps, rts) in CENTROIDS.items():
        m = facies == f; k = int(m.sum())
        GR[m]   = gr + rng.normal(0, grs, k)
        RHOB[m] = rhob + rng.normal(0, rhs, k)
        NPHI[m] = np.clip(nphi + rng.normal(0, nps, k), 0, 0.6)
        RT[m]   = np.clip(np.exp(lnrt + rng.normal(0, rts, k)), 0.3, 2000)
    return pd.DataFrame({"GR": GR, "RHOB": RHOB, "NPHI": NPHI, "RT": RT, "facies": facies})


# ── Constants (do not edit) ──────────────────────────────────────────────
FEATURE_COLS = ['GR', 'RHOB', 'NPHI', 'RT']
N_ESTIMATORS = 150
GAS = 2
COAL = 4


def evaluate_facies(df, seed=0):
    """Train a random forest on the 5-facies set and return per-class recall for
    the two RARE facies: Coal (a distinct signature) and Gas Sand (overlaps brine).
    Returns {'coal_recall': float, 'gas_recall': float}."""
    X = df[FEATURE_COLS].values
    y = df['facies'].values
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)
    rf = RandomForestClassifier(n_estimators=N_ESTIMATORS, random_state=seed)
    rf.fit(X_tr, y_tr)
    pred = rf.predict(X_te)
    coal_recall = recall_score(y_te, pred, labels=[COAL], average="macro")
    gas_recall = recall_score(y_te, pred, labels=[GAS], average="macro")
    return {'coal_recall': coal_recall, 'gas_recall': gas_recall}


res = evaluate_facies(make_facies_with_coal())
coal_recall = res['coal_recall']
gas_recall = res['gas_recall']

print("Coal recall     :", coal_recall)
print("Gas Sand recall :", gas_recall)
print("coal beats gas? :", coal_recall > gas_recall)

lockCopying code is a Full Access feature.

arrow_back

17.3 Does More Data Beat a Better Threshold?

18.1 Force the Gas Sand to Separate - A Physics Feature for Clustering

arrow_forward