Exerciseschevron_rightChapter 19chevron_right19.1
fitness_center

Exercise 19.1

Tune the QC Net - Precision vs Recall on a Logging Run

Level 2
Chapter 19: Real-World Projects
descriptionProblem

In Project 1, the isolation-forest contamination controls how aggressively the pipeline quarantines log samples. Sweep it from 0.02 to 0.15, and for each value report how many of the four injected faults are caught and how many good samples are wrongly quarantined. Find the lowest setting that catches every injected fault, and state its false-alarm cost. Why is "catch everything" not automatically the right operating point on a 3,000-ft logging run?

---

Project 1's QC step quarantines bad log samples with an isolation forest. Its contamination setting is the single knob that decides how aggressive the net is: raise it and you flag more samples, lower it and you flag fewer. The engineering question is not "can we catch everything?"; it is "what does catching everything cost in false alarms on a 3,000-ft logging run?"

A broken copy of well OD-007 is built for you with four known magnitude faults injected (a density washout and three single-sample spikes), so you know the ground truth. The make_well generator and the broken_well / FAULTS helpers are embedded under a do-not-edit banner: use them exactly.

Write one function:

def qc_sweep(contaminations=(0.02, 0.04, 0.06, 0.10, 0.15), seed=42):
    '''Sweep isolation-forest contamination and, for each value, report how many
    of the four injected faults are caught and how many GOOD samples are
    wrongly quarantined. Returns (sweep, lowest_c).'''

Exact procedure:

  1. new = broken_well(seed=seed); build the feature matrix exactly as the chapter

does: Xq = StandardScaler().fit_transform(np.column_stack([new.GR, new.RHOB, new.NPHI, np.log10(new.RT)])).

  1. The bad rows are bad_rows = set(sum(FAULTS.values(), [])) (a fault is caught

if any of its rows is flagged).

  1. For each c in contaminations, fit IsolationForest(contamination=c, random_state=0)

on Xq, flag the samples it scores -1, then record a dict with int values:

  • faults_caught: how many of the four faults have at least one flagged row,
  • false_alarms: flagged rows that are not in bad_rows,
  • n_flag: total flagged rows.
  1. Build sweep = {round(float(c), 3): {...}} keyed by contamination.
  2. lowest_c = the smallest contamination whose faults_caught == 4

(a plain Python float), or None if no swept value catches all four.

  1. return sweep, lowest_c.

Then call it once at module level into the exact names the tests read:

sweep, lowest_c = qc_sweep()

> Think about it: all four faults are already caught at the lowest > contamination, yet pushing the knob to 0.15 quarantines dozens of perfectly > good samples for no extra catch. On a real 3,000-ft run, every false alarm is a > sample a petrophysicist has to inspect by hand. Why is the right operating point > the lowest setting that clears the faults, not the highest you can afford?

lightbulbHints (0/3)

Stuck? Reveal hints one at a time — they progress from nudge to near-solution.

codeYour solution
main.py
visibilityReveal reference solutionexpand_more

Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler


# ── Verified Chapter 19 well generator + fault injector (do not edit) ─────
ARCHIE = dict(a=0.81, m=2.0, n=2.0, Rw=0.04)


def make_well(wid, seed, top=9000.0, n=420):
    """One depth-ordered well whose logs are consistent with facies, porosity
    and Archie water saturation -- the single rock model the pipeline inverts."""
    rng = np.random.default_rng(seed)
    depth = top + 0.5 * np.arange(n)
    edges = np.sort(rng.integers(8, n - 8, rng.integers(8, 13) - 1))
    facies = np.zeros(n, int)
    for seg, f in zip(np.split(np.arange(n), edges),
                      rng.choice([0, 1, 2, 3], len(edges) + 1, p=[0.38, 0.22, 0.20, 0.20])):
        facies[seg] = f
    Vsh = np.clip(np.where(facies == 0, 0.82, np.where(facies == 3, 0.55, 0.15)) + rng.normal(0, 0.06, n), 0, 1)
    base = np.where(facies == 0, 0.06, np.where(facies == 3, 0.14, 0.24))
    bed_q = np.zeros(n)
    for seg in np.split(np.arange(n), edges):
        bed_q[seg] = rng.uniform(-0.10, 0.05)
    phi = np.clip((base + bed_q) * (1 - 0.4 * Vsh) + rng.normal(0, 0.018, n), 0.02, 0.34)
    Sw = np.where(facies == 2, rng.uniform(0.25, 0.68, n), np.where(facies == 1, 0.88, 0.95))
    Sw = np.clip(Sw + rng.normal(0, 0.05, n), 0.12, 1.0)
    gas = (facies == 2) & (Sw < 0.45)
    GR = 18 * (1 - Vsh) + 135 * Vsh + rng.normal(0, 7, n)
    RHOB = (2.65 + 0.03 * Vsh) * (1 - phi) + np.where(gas, 0.6, 1.0) * phi + rng.normal(0, 0.035, n)
    NPHI = phi + 0.30 * Vsh - np.where(gas, 0.08, 0.0) + rng.normal(0, 0.025, n)
    a, m, nn, Rw = ARCHIE.values()
    RT = np.clip(a * Rw / (np.clip(phi, 0.03, 1) ** m * Sw ** nn) * np.exp(rng.normal(0, 0.22, n)), 0.2, 2000)
    return pd.DataFrame({"WELL": wid, "DEPTH": depth, "GR": GR, "RHOB": RHOB, "NPHI": NPHI,
                         "RT": RT, "facies": facies, "PHI_true": phi, "SW_true": Sw})


# Four KNOWN magnitude faults at fixed rows (so "caught all four" is achievable).
FAULTS = {"washout": list(range(50, 58)), "rt_spike_a": [120], "rt_spike_b": [240], "nphi_spike": [300]}


def broken_well(seed=42):
    """make_well, then inject the four known faults. Returns the broken DataFrame."""
    new = make_well("OD-007", seed=seed)
    new.loc[FAULTS["washout"], "RHOB"] = 1.60 + np.random.default_rng(1).normal(0, 0.04, len(FAULTS["washout"]))
    new.loc[FAULTS["rt_spike_a"], "RT"] = 4200.0
    new.loc[FAULTS["rt_spike_b"], "RT"] = 5000.0
    new.loc[FAULTS["nphi_spike"], "NPHI"] = 0.62
    return new
# ── end do-not-edit ──────────────────────────────────────────────────────


def qc_sweep(contaminations=(0.02, 0.04, 0.06, 0.10, 0.15), seed=42):
    """Sweep isolation-forest contamination; for each value report faults caught
    and good samples wrongly quarantined.

    Returns (sweep, lowest_c):
      sweep    = {round(c,3): {'faults_caught': int, 'false_alarms': int, 'n_flag': int}}
      lowest_c = smallest contamination whose faults_caught == 4 (float), else None
    """
    new = broken_well(seed=seed)
    Xq = StandardScaler().fit_transform(
        np.column_stack([new.GR, new.RHOB, new.NPHI, np.log10(new.RT)]))
    bad_rows = set(sum(FAULTS.values(), []))
    sweep = {}
    for c in contaminations:
        flag = IsolationForest(contamination=c, random_state=0).fit(Xq).predict(Xq) == -1
        faults_caught = sum(any(flag[r] for r in rows) for rows in FAULTS.values())
        false_alarms = int(sum(flag[i] and i not in bad_rows for i in range(len(new))))
        sweep[round(float(c), 3)] = dict(faults_caught=int(faults_caught),
                                         false_alarms=false_alarms, n_flag=int(flag.sum()))
    hits = [c for c, v in sweep.items() if v["faults_caught"] == 4]
    lowest_c = float(min(hits)) if hits else None
    return sweep, lowest_c


sweep, lowest_c = qc_sweep()

print("QC sweep:")
for c, v in sweep.items():
    print(f"  c={c}: caught {v['faults_caught']}/4  false_alarms {v['false_alarms']}  n_flag {v['n_flag']}")
print("lowest contamination catching all four faults:", lowest_c)

lockCopying code is a Full Access feature.