Force the Gas Sand to Separate - A Physics Feature for Clustering

Level 2

Chapter 18: Unsupervised Learning

descriptionProblem

The clustering could not isolate the gas sand from log overlap alone. Engineer one new feature that captures the gas signature directly (for example a density–neutron separation term, large where gas suppresses both curves) add it to the clustering inputs, recluster at k=4, and report the new ARI and the gas sand's cluster purity. Does a physics-informed feature do for clustering what the threshold did for classification in Chapter 17?

---

In the chapter, K-means on the four standardised logs could not isolate the gas sand (facies 2, only ~7% of the rock); its log signature overlaps brine sand and siltstone too much. Here you carve it out the way a petrophysicist would: with physics, not more clusters.

The verified make_facies_dataset generator and the FACIES / CENTROIDS constants from the chapter are embedded for you. Do not modify them. Write one function that clusters the logs two ways and scores both against the hidden true facies (the answer key; no model may be fit on it):

def compare_clustering(seed=11):
    """Cluster the four standardised logs with KMeans(k=4) two ways -- on the raw
    logs, and on the logs PLUS a physics density-neutron separation feature --
    and score both against the hidden true facies."""

Use these embedded constants exactly:

K = 4 clusters, N_INIT = 10, RANDOM_STATE = 0 for every KMeans.
RHO_MATRIX = 2.65, RHO_FLUID = 1.0 for the density porosity.
SEP_WEIGHT = 3.0: the separation feature is scaled up by this factor after

standardisation so its rare-but-real gas signal is not drowned out by the four bulk logs.

The exact procedure:

rock = make_facies_dataset(seed=seed); true_facies = rock['facies'].values.
logs = rock[['GR','RHOB','NPHI','RT']].copy(); then

logs['RT'] = np.log10(logs['RT']) (cluster resistivity on its log scale, exactly as the chapter does).

RAW path: X_raw = StandardScaler().fit_transform(logs.values);

km_raw = KMeans(n_clusters=K, n_init=N_INIT, random_state=RANDOM_STATE).fit(X_raw); ari_raw = adjusted_rand_score(true_facies, km_raw.labels_).

Engineered feature:

phi_d = (RHO_MATRIX - logs['RHOB'].values) / (RHO_MATRIX - RHO_FLUID); sep = phi_d - logs['NPHI'].values, the density–neutron separation, large where gas suppresses both RHOB and NPHI together.

ENGINEERED path: stack the 4 logs and sep into a 5-column matrix,

standardise it, then multiply only the last (sep) column by SEP_WEIGHT → X_eng; fit km_eng = KMeans(n_clusters=K, n_init=N_INIT, random_state=RANDOM_STATE); ari_eng = adjusted_rand_score(true_facies, km_eng.labels_).

Gas-sand cluster purity (write a small helper): for a label array, over

the 4 clusters compute (members that are true gas sand, facies==2) / (cluster size) and return the max over clusters: the purity of the cluster that best captures the gas sand. Compute gas_purity_raw from km_raw.labels_ and gas_purity_eng from km_eng.labels_.

return (ari_raw, ari_engineered, gas_purity_raw, gas_purity_engineered):

all four as plain Python floats.

Then, at module level, assign these exact output variable names:

ari_raw, ari_engineered, gas_purity_raw, gas_purity_engineered = compare_clustering(11)

> Think about it: the physics feature does for clustering what the threshold > did for classification in Chapter 17; it carves out a markedly purer gas > cluster (gas_purity_engineered > gas_purity_raw, reliably). The honest twist > you must report: this can lower the global ARI, because forcing out a rare > 7%-facies cluster trades away some of the broad shale-vs-sand structure that > dominates the variance. Both numbers come back; the real, reproducible win is > the purity lift, not the global ARI.

lightbulbHints (0/3)

Stuck? Reveal hints one at a time — they progress from nudge to near-solution.

codeYour solution

main.py

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# ── Verified Chapter 18 facies generator + constants (do not edit) ───────
FACIES = {0: "Shale", 1: "Brine Sand", 2: "Gas Sand", 3: "Siltstone"}
CENTROIDS = {0: (108, 2.51, 0.33, np.log(3.5),  16, 0.080, 0.045, 0.40),
             1: (46,  2.28, 0.27, np.log(9.0),   18, 0.075, 0.045, 0.45),
             2: (43,  2.21, 0.21, np.log(15.0),  18, 0.085, 0.050, 0.55),
             3: (74,  2.40, 0.29, np.log(7.0),   21, 0.090, 0.050, 0.48)}

def make_facies_dataset(n=1500, seed=11):
    """The Chapter 17 facies set: four logs with a hidden true rock type."""
    rng = np.random.default_rng(seed)
    facies = rng.choice([0, 1, 2, 3], size=n, p=[0.42, 0.33, 0.07, 0.18])
    GR = np.zeros(n); RHOB = np.zeros(n); NPHI = np.zeros(n); RT = np.zeros(n)
    for f, (gr, rhob, nphi, lnrt, grs, rhs, nps, rts) in CENTROIDS.items():
        m = facies == f; k = int(m.sum())
        GR[m]   = gr + rng.normal(0, grs, k)
        RHOB[m] = rhob + rng.normal(0, rhs, k)
        NPHI[m] = np.clip(nphi + rng.normal(0, nps, k), 0, 0.55)
        RT[m]   = np.clip(np.exp(lnrt + rng.normal(0, rts, k)), 0.3, 600)
    return pd.DataFrame({"GR": GR, "RHOB": RHOB, "NPHI": NPHI, "RT": RT, "facies": facies})

# ── Embedded clustering constants (use these exactly) ────────────────────
K = 4                # number of clusters
N_INIT = 10          # KMeans restarts
RANDOM_STATE = 0     # deterministic KMeans
RHO_MATRIX = 2.65    # matrix (grain) density for density porosity, g/cc
RHO_FLUID = 1.0      # fluid density for density porosity, g/cc
SEP_WEIGHT = 3.0     # up-weight the rare gas separation signal after scaling

def compare_clustering(seed=11):
    """Cluster the four standardised logs with KMeans(k=4) two ways -- on the raw
    logs, and on the logs PLUS a physics density-neutron separation feature --
    and score both against the hidden true facies."""
    # TODO 1: rock = make_facies_dataset(seed=seed); true_facies = rock['facies'].values
    # TODO 2: logs = rock[['GR','RHOB','NPHI','RT']].copy(); logs['RT'] = np.log10(logs['RT'])
    # TODO 3 (RAW):  X_raw = StandardScaler().fit_transform(logs.values)
    #                km_raw = KMeans(n_clusters=K, n_init=N_INIT, random_state=RANDOM_STATE).fit(X_raw)
    #                ari_raw = adjusted_rand_score(true_facies, km_raw.labels_)
    # TODO 4 (FEATURE): phi_d = (RHO_MATRIX - logs['RHOB'].values) / (RHO_MATRIX - RHO_FLUID)
    #                   sep = phi_d - logs['NPHI'].values
    # TODO 5 (ENG):  stack the 4 logs + sep, standardise the 5 columns, then
    #                multiply ONLY the last (sep) column by SEP_WEIGHT -> X_eng
    #                km_eng = KMeans(n_clusters=K, n_init=N_INIT, random_state=RANDOM_STATE).fit(X_eng)
    #                ari_eng = adjusted_rand_score(true_facies, km_eng.labels_)
    # TODO 6 (PURITY): for a label array, max over clusters of
    #                  (members with facies==2) / (cluster size)
    #                  -> gas_purity_raw, gas_purity_eng
    # TODO 7: return (float(ari_raw), float(ari_eng),
    #                 float(gas_purity_raw), float(gas_purity_eng))
    return None, None, None, None

ari_raw, ari_engineered, gas_purity_raw, gas_purity_engineered = compare_clustering(11)

print("raw-log ARI:        ", ari_raw)
print("engineered ARI:     ", ari_engineered)
print("gas purity (raw):   ", gas_purity_raw)
print("gas purity (eng):   ", gas_purity_engineered)

visibilityReveal reference solutionexpand_more

Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


# ── Verified Chapter 18 facies generator + constants (do not edit) ───────
FACIES = {0: "Shale", 1: "Brine Sand", 2: "Gas Sand", 3: "Siltstone"}
CENTROIDS = {0: (108, 2.51, 0.33, np.log(3.5),  16, 0.080, 0.045, 0.40),
             1: (46,  2.28, 0.27, np.log(9.0),   18, 0.075, 0.045, 0.45),
             2: (43,  2.21, 0.21, np.log(15.0),  18, 0.085, 0.050, 0.55),
             3: (74,  2.40, 0.29, np.log(7.0),   21, 0.090, 0.050, 0.48)}

def make_facies_dataset(n=1500, seed=11):
    """The Chapter 17 facies set: four logs with a hidden true rock type."""
    rng = np.random.default_rng(seed)
    facies = rng.choice([0, 1, 2, 3], size=n, p=[0.42, 0.33, 0.07, 0.18])
    GR = np.zeros(n); RHOB = np.zeros(n); NPHI = np.zeros(n); RT = np.zeros(n)
    for f, (gr, rhob, nphi, lnrt, grs, rhs, nps, rts) in CENTROIDS.items():
        m = facies == f; k = int(m.sum())
        GR[m]   = gr + rng.normal(0, grs, k)
        RHOB[m] = rhob + rng.normal(0, rhs, k)
        NPHI[m] = np.clip(nphi + rng.normal(0, nps, k), 0, 0.55)
        RT[m]   = np.clip(np.exp(lnrt + rng.normal(0, rts, k)), 0.3, 600)
    return pd.DataFrame({"GR": GR, "RHOB": RHOB, "NPHI": NPHI, "RT": RT, "facies": facies})


# ── Embedded clustering constants (use these exactly) ────────────────────
K = 4                # number of clusters
N_INIT = 10          # KMeans restarts
RANDOM_STATE = 0     # deterministic KMeans
RHO_MATRIX = 2.65    # matrix (grain) density for density porosity, g/cc
RHO_FLUID = 1.0      # fluid density for density porosity, g/cc
SEP_WEIGHT = 3.0     # up-weight the rare gas separation signal after scaling


def _gas_cluster_purity(labels, true_facies):
    """Max over the K clusters of (true gas sand, facies==2) / (cluster size)."""
    labels = np.asarray(labels)
    is_gas = np.asarray(true_facies) == 2
    best = 0.0
    for c in np.unique(labels):
        m = labels == c
        size = int(m.sum())
        if size == 0:
            continue
        best = max(best, int((is_gas & m).sum()) / size)
    return best


def compare_clustering(seed=11):
    """Cluster the four standardised logs with KMeans(k=4) two ways -- on the raw
    logs, and on the logs PLUS a physics density-neutron separation feature --
    and score both against the hidden true facies."""
    rock = make_facies_dataset(seed=seed)
    true_facies = rock["facies"].values          # answer key -- never fit on this

    logs = rock[["GR", "RHOB", "NPHI", "RT"]].copy()
    logs["RT"] = np.log10(logs["RT"])             # resistivity on its log scale

    # RAW path: cluster the four standardised logs.
    X_raw = StandardScaler().fit_transform(logs.values)
    km_raw = KMeans(n_clusters=K, n_init=N_INIT, random_state=RANDOM_STATE).fit(X_raw)
    ari_raw = adjusted_rand_score(true_facies, km_raw.labels_)

    # ENGINEERED feature: density porosity minus neutron porosity. Gas suppresses
    # both RHOB (raising phi_d) and NPHI, so sep spikes in gas sand.
    phi_d = (RHO_MATRIX - logs["RHOB"].values) / (RHO_MATRIX - RHO_FLUID)
    sep = phi_d - logs["NPHI"].values

    # ENGINEERED path: standardise the 5-column matrix, then up-weight only sep.
    stacked = np.column_stack([logs.values, sep])
    X_eng = StandardScaler().fit_transform(stacked)
    X_eng[:, -1] *= SEP_WEIGHT
    km_eng = KMeans(n_clusters=K, n_init=N_INIT, random_state=RANDOM_STATE).fit(X_eng)
    ari_eng = adjusted_rand_score(true_facies, km_eng.labels_)

    gas_purity_raw = _gas_cluster_purity(km_raw.labels_, true_facies)
    gas_purity_eng = _gas_cluster_purity(km_eng.labels_, true_facies)

    return float(ari_raw), float(ari_eng), float(gas_purity_raw), float(gas_purity_eng)


ari_raw, ari_engineered, gas_purity_raw, gas_purity_engineered = compare_clustering(11)

print("raw-log ARI:        ", round(ari_raw, 4))
print("engineered ARI:     ", round(ari_engineered, 4))
print("gas purity (raw):   ", round(gas_purity_raw, 4))
print("gas purity (eng):   ", round(gas_purity_engineered, 4))

lockCopying code is a Full Access feature.

arrow_back

17.4 A Fifth Facies - Why a Distinct Rare Class Is Easy

18.2 The Cost of Choosing k Wrong - ARI vs k for Electrofacies

arrow_forward