
Exercise 18.3

Tuning the QC Sensitivity - Recall/Precision vs Contamination

Level 2
Chapter 18: Unsupervised Learning
Problem

Sweep the isolation forest's contamination from 0.02 to 0.15 on the well with injected faults, and plot recall and precision against it. At what setting do you catch every real fault, and what does that cost you in false alarms? If a flagged foot triggers a costly manual review, where would you set it, and how does that decision mirror the classification threshold of Chapter 17?

---

The isolation forest's contamination is the one judgement call in automated log QC: it is your estimate of the fraction of readings that are bad, and it sets how aggressively the forest flags. Turn it up and you catch more of the real faults, but you also start flagging good feet, and every false alarm is a wasted manual review. Turn it down and the false alarms fall away, but real bad readings slip through into your petrophysical calculations. This is the same recall/precision trade-off you tuned with the classification threshold in Chapter 17, wearing different clothes.
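To see why contamination behaves like a threshold, here is a minimal sketch on synthetic anomaly scores rather than the well data. It assumes, as scikit-learn's IsolationForest documentation describes, that contamination places the cut-off on the anomaly scores so that roughly that fraction of the training rows is flagged. The `scores` array and the `flag_lowest` helper are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)   # hypothetical anomaly scores; lower = more anomalous

def flag_lowest(scores, contamination):
    """Flag the rows whose score falls below the contamination-quantile."""
    cutoff = np.quantile(scores, contamination)
    return scores < cutoff

gentle = flag_lowest(scores, 0.02)
aggressive = flag_lowest(scores, 0.15)

# How many rows each setting flags
print(gentle.sum(), aggressive.sum())
# Every row flagged at the gentle setting is still flagged at the aggressive one
print(bool(np.all(aggressive[gentle])))
```

Because the scores do not change between settings, raising contamination can only add flags. That is why recall can only hold or rise as you turn it up, while precision is free to fall.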

The verified make_well_with_bad_data generator is embedded for you under a "do not edit" banner; it builds a clean depth-ordered log run with realistic tool/sensor failures injected at known depths, so you have ground truth to score against. Do not modify it.

Your task: sweep contamination over a grid and, for each setting, measure how much of the known damage you catch (recall) and how clean your flags are (precision), then find the gentlest setting that still catches essentially everything.
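As a warm-up, the two metrics can be worked by hand on a toy example. The ten-sample `bad` and `flagged` masks below are hypothetical and unrelated to the generated well; they only illustrate the arithmetic.

```python
import numpy as np

# Hypothetical 10-sample log: 3 truly bad readings, 4 flagged by the detector
bad     = np.array([0, 0, 1, 1, 0, 0, 0, 1, 0, 0], dtype=bool)
flagged = np.array([0, 1, 1, 1, 0, 0, 0, 0, 1, 0], dtype=bool)

caught = (flagged & bad).sum()              # flagged AND truly bad
recall = caught / bad.sum()                 # share of real faults caught
precision = caught / max(flagged.sum(), 1)  # share of flags that are real
print(f"recall={recall:.2f}, precision={precision:.2f}")
# recall=0.67, precision=0.50
```

Two of the three real faults are caught (recall 2/3), and two of the four flags are genuine (precision 1/2). The `max(..., 1)` guard only matters when nothing is flagged at all.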

Write one function:

def qc_sensitivity(seed=3, contaminations=(0.02, 0.05, 0.08, 0.11, 0.15), target_recall=0.9):
    ...
    return recalls, precisions, smallest_safe

Exact procedure (match it so the anchors reproduce):

  1. depth, well, bad = make_well_with_bad_data(seed=seed).
  2. Standardise on this well only: "normal" is well-specific, so fit the scaler on the well's own rows, clustering RT on log10 as the chapter does:

     ```python
     Xa = well.copy(); Xa["RT"] = np.log10(Xa["RT"])
     Xa = StandardScaler().fit_transform(Xa.values)
     ```

  3. For each c in contaminations:

     ```python
     iso = IsolationForest(contamination=c, random_state=0).fit(Xa)
     flagged = iso.predict(Xa) == -1
     recall = (flagged & bad).sum() / bad.sum()
     precision = (flagged & bad).sum() / max(flagged.sum(), 1)
     ```

     Append each as a plain float, in the same order as contaminations.
  4. smallest_safe = the smallest contamination whose recall >= target_recall, or None if no setting reaches it.
  5. return (recalls, precisions, smallest_safe): recalls and precisions are lists of floats aligned with contaminations.
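The smallest_safe selection can be sketched in isolation. The helper name `smallest_reaching` and the recall values below are hypothetical, chosen only to show both the "found" and the "None" outcomes; they are not the exercise's results.

```python
def smallest_reaching(contaminations, recalls, target_recall):
    """Smallest contamination whose recall meets the target, else None."""
    safe = [c for c, r in zip(contaminations, recalls) if r >= target_recall]
    return min(safe, default=None)

# Hypothetical recalls, for illustration only
grid = (0.02, 0.05, 0.08)
print(smallest_reaching(grid, [0.2, 0.95, 1.0], 0.9))   # 0.05
print(smallest_reaching(grid, [0.2, 0.5, 0.8], 0.9))    # None
```

Taking `min` over the qualifying settings, rather than the first match, gives the same answer on an ascending grid and still works if the grid is passed in another order.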

Embedded constants (use exactly these): CONTAMS = (0.02, 0.05, 0.08, 0.11, 0.15); every IsolationForest uses random_state=0; TARGET_RECALL = 0.9.

Then call it once at module level and unpack into these exact output names:

recalls, precisions, smallest_safe_contamination = qc_sensitivity(3)

> Think about it: at the gentle end (c=0.02) precision is perfect but recall is dismal: you flag almost nothing, so you miss most of the damage. Crank contamination up and recall climbs to 1.0 (you catch every injected fault), but precision falls as good feet get caught in the net. The gentlest setting that still catches everything is the sweet spot, and if a flagged foot triggers a costly review, you might deliberately accept a little less recall to protect precision. Where would you set it, and why is that the same decision you made at the classification threshold in Chapter 17?
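One way to make the "costly review" question concrete is to price both kinds of mistake and pick the setting with the lowest total. Everything in the sketch below is an assumption for illustration: the per-setting counts, the cost figures, and the `total_cost` and `cheapest_setting` helpers are hypothetical and are not results from the exercise.

```python
# Hypothetical counts per contamination setting: (false alarms, missed faults)
outcomes = {
    0.05: (0, 6),
    0.08: (3, 1),
    0.11: (10, 0),
}

def total_cost(false_alarms, missed, review_cost, miss_cost):
    """Cost of wasted reviews plus cost of faults that slip through."""
    return false_alarms * review_cost + missed * miss_cost

def cheapest_setting(outcomes, review_cost, miss_cost):
    """Contamination setting with the lowest total cost."""
    return min(outcomes, key=lambda c: total_cost(*outcomes[c], review_cost, miss_cost))

# Cheap reviews favour the aggressive setting
print(cheapest_setting(outcomes, review_cost=1, miss_cost=20))    # 0.11
# Expensive reviews favour a gentler one, at the price of one missed fault
print(cheapest_setting(outcomes, review_cost=10, miss_cost=20))   # 0.08
```

The detector itself is identical in both cases; only the relative price of a false alarm changes where the cut-off should sit, which is the same reasoning used when moving a classification threshold.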

Reference solution

Try solving it yourself first. The solution below is one valid approach; yours may differ and still be correct.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest


# ── Verified log-QC data generator (do not edit) ─────────────────────────
def make_well_with_bad_data(seed=3):
    """A clean depth-ordered log run (a stand-in for a real LAS) with realistic
    tool/sensor failures injected, so we know the ground truth to score against."""
    rng = np.random.default_rng(seed)
    depth = 9000 + 0.5 * np.arange(240)
    Vsh = np.clip(0.25 + 0.15 * np.sin(depth / 14) + rng.normal(0, 0.05, depth.size), 0, 1)
    phi = np.clip(0.27 * (1 - 0.8 * Vsh) + rng.normal(0, 0.02, depth.size), 0.02, 0.34)
    GR = 22 * (1 - Vsh) + 125 * Vsh + rng.normal(0, 6, depth.size)
    RHOB = (2.65 + 0.03 * Vsh) * (1 - phi) + 1.0 * phi + rng.normal(0, 0.03, depth.size)
    NPHI = phi + 0.30 * Vsh + rng.normal(0, 0.02, depth.size)
    RT = np.exp(rng.normal(0, 0.3, depth.size)) * (2 + 8 * np.clip(0.30 - phi, 0, 1)) * (1 - 0.5 * Vsh) + 0.5
    df = pd.DataFrame({"GR": GR, "RHOB": RHOB, "NPHI": NPHI, "RT": np.clip(RT, 0.3, 400)})
    bad = np.zeros(depth.size, bool)
    w = (depth >= 9030) & (depth < 9035); df.loc[w, "RHOB"] = 1.55 + rng.normal(0, 0.05, w.sum()); bad |= w
    df.loc[[60, 61, 150], "RT"] = [4500, 3800, 5200.0]; bad[[60, 61, 150]] = True
    h = (depth >= 9088) & (depth < 9092); df.loc[h, "GR"] = 330 + rng.normal(0, 10, h.sum()); bad |= h
    df.loc[[200, 201], "NPHI"] = [-0.08, 0.62]; bad[[200, 201]] = True
    return depth, df, bad


# ── QC sweep constants (do not edit) ─────────────────────────────────────
CONTAMS = (0.02, 0.05, 0.08, 0.11, 0.15)   # contamination grid to sweep
TARGET_RECALL = 0.9                        # "catch essentially everything"


def qc_sensitivity(seed=3, contaminations=CONTAMS, target_recall=TARGET_RECALL):
    """Sweep the IsolationForest contamination over a grid on the well with
    injected faults; for each, report recall and precision against the known
    bad mask, and find the gentlest setting that still catches >=target_recall.

    Returns (recalls, precisions, smallest_safe):
      recalls       = list of floats (one per contamination, same order)
      precisions    = list of floats (one per contamination, same order)
      smallest_safe = the smallest contamination whose recall >= target_recall,
                      or None if none reach it.
    """
    depth, well, bad = make_well_with_bad_data(seed=seed)
    Xa = well.copy()
    Xa["RT"] = np.log10(Xa["RT"])                    # cluster RT on its log scale
    Xa = StandardScaler().fit_transform(Xa.values)   # fit on THIS well - "normal" is well-specific

    recalls, precisions = [], []
    for c in contaminations:
        iso = IsolationForest(contamination=c, random_state=0).fit(Xa)
        flagged = iso.predict(Xa) == -1
        caught = (flagged & bad).sum()
        recalls.append(float(caught / bad.sum()))
        precisions.append(float(caught / max(flagged.sum(), 1)))

    # min() rather than first match, so an unsorted grid still yields the smallest
    safe = [c for c, r in zip(contaminations, recalls) if r >= target_recall]
    smallest_safe = min(safe, default=None)
    return recalls, precisions, smallest_safe


recalls, precisions, smallest_safe_contamination = qc_sensitivity(3)

print("recalls:", [round(r, 4) for r in recalls])
print("precisions:", [round(p, 4) for p in precisions])
print("gentlest contamination reaching recall >= 0.9:", smallest_safe_contamination)
