The Cost of Choosing k Wrong - ARI vs k for Electrofacies

Level 1

Chapter 18: Unsupervised Learning

descriptionProblem

Cluster the logs at k=2, 3, 4, and 6, and for each compute the ARI against the true facies. Plot ARI versus k. Where does agreement peak, and does it match what the silhouette score recommended? Explain to a geologist why the "best" k by silhouette is not always the most useful k for naming rock.

---

The chapter chose k with the silhouette score and the elbow plot. Here you ask the sharper, label-aware question those diagnostics cannot: how good is each k at actually recovering the rock? You sweep several k, score each clustering against the hidden true facies with the adjusted Rand index (ARI), and watch the score peak at a small k and then collapse as you over-split.

The verified make_facies_dataset generator and its FACIES / CENTROIDS constants are embedded for you under a do-not-edit banner; use them exactly as given; do not re-derive the physics or change the seeds.

Write one function:

def ari_vs_k(seed=11, ks=(2, 3, 4, 6)):
    '''Cluster the standardised logs at several k and score each against the
    hidden true facies, to show that more clusters is NOT better.'''

Embedded constants (use these exactly):

The k values to sweep are exactly KS = (2, 3, 4, 6) (the default ks).
Every KMeans uses n_init=10, random_state=0.
Cluster resistivity on the log10 scale, exactly as in the chapter.

Exact procedure:

rock = make_facies_dataset(seed=seed); true_facies = rock['facies'].values.
logs = rock[['GR','RHOB','NPHI','RT']].copy(); logs['RT'] = np.log10(logs['RT']);

X = StandardScaler().fit_transform(logs.values).

For each k in ks: km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X);

record adjusted_rand_score(true_facies, km.labels_).

Build ari_by_k = a dict {k: ari} with int keys and **plain Python

float values, and best_k = the k with the maximum** ARI (use max(ari_by_k, key=ari_by_k.get)).

return (ari_by_k, best_k).

Then, at module level, call it once and unpack into the exact names the tests read:

ari_by_k, best_k = ari_vs_k(11)

> Think about it: ARI peaks at a small k (2 or 3; shale-vs-reservoir is > the data's real signal) and falls hard as k grows to 6, even though k=6 makes > more clusters. The silhouette's favourite (k=2, from the chapter) is not > always the most useful for naming rock. We cluster at k=3 anyway to keep an > interpretable sand split. Why does adding clusters past the real structure > destroy agreement with the truth instead of improving it?

lightbulbHints (0/3)

Stuck? Reveal hints one at a time — they progress from nudge to near-solution.

codeYour solution

main.py

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# ── Verified Chapter 18 facies generator + constants (do not edit) ───────
FACIES = {0: "Shale", 1: "Brine Sand", 2: "Gas Sand", 3: "Siltstone"}
CENTROIDS = {0: (108, 2.51, 0.33, np.log(3.5),  16, 0.080, 0.045, 0.40),
             1: (46,  2.28, 0.27, np.log(9.0),   18, 0.075, 0.045, 0.45),
             2: (43,  2.21, 0.21, np.log(15.0),  18, 0.085, 0.050, 0.55),
             3: (74,  2.40, 0.29, np.log(7.0),   21, 0.090, 0.050, 0.48)}

def make_facies_dataset(n=1500, seed=11):
    """The Chapter 17 facies set: four logs with a hidden true rock type."""
    rng = np.random.default_rng(seed)
    facies = rng.choice([0, 1, 2, 3], size=n, p=[0.42, 0.33, 0.07, 0.18])
    GR = np.zeros(n); RHOB = np.zeros(n); NPHI = np.zeros(n); RT = np.zeros(n)
    for f, (gr, rhob, nphi, lnrt, grs, rhs, nps, rts) in CENTROIDS.items():
        m = facies == f; k = int(m.sum())
        GR[m]   = gr + rng.normal(0, grs, k)
        RHOB[m] = rhob + rng.normal(0, rhs, k)
        NPHI[m] = np.clip(nphi + rng.normal(0, nps, k), 0, 0.55)
        RT[m]   = np.clip(np.exp(lnrt + rng.normal(0, rts, k)), 0.3, 600)
    return pd.DataFrame({"GR": GR, "RHOB": RHOB, "NPHI": NPHI, "RT": RT, "facies": facies})

# The k values to sweep (the chapter's question: how many rock types?).
KS = (2, 3, 4, 6)

def ari_vs_k(seed=11, ks=(2, 3, 4, 6)):
    """Cluster the standardised logs at several k and score each against the
    hidden true facies, to show that more clusters is NOT better.

Returns (ari_by_k, best_k):
      ari_by_k = {k: ari} with int keys and plain-float ARI values
      best_k   = the k with the maximum ARI (max(ari_by_k, key=ari_by_k.get))
    """
    # TODO: rock = make_facies_dataset(seed=seed)
    # TODO: true_facies = rock['facies'].values
    # TODO: logs = rock[['GR', 'RHOB', 'NPHI', 'RT']].copy()
    # TODO: logs['RT'] = np.log10(logs['RT'])          # cluster RT on its log scale
    # TODO: X = StandardScaler().fit_transform(logs.values)
    # TODO: ari_by_k = {}
    # TODO: for k in ks:
    # TODO:     km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # TODO:     ari_by_k[int(k)] = float(adjusted_rand_score(true_facies, km.labels_))
    # TODO: best_k = max(ari_by_k, key=ari_by_k.get)
    # TODO: return ari_by_k, best_k
    return {}, None

# TODO: ari_by_k, best_k = ari_vs_k(11)
ari_by_k = None
best_k = None

print("ARI by k:", ari_by_k)
print("best k by ARI:", best_k)

visibilityReveal reference solutionexpand_more

Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


# ── Verified Chapter 18 facies generator + constants (do not edit) ───────
FACIES = {0: "Shale", 1: "Brine Sand", 2: "Gas Sand", 3: "Siltstone"}
CENTROIDS = {0: (108, 2.51, 0.33, np.log(3.5),  16, 0.080, 0.045, 0.40),
             1: (46,  2.28, 0.27, np.log(9.0),   18, 0.075, 0.045, 0.45),
             2: (43,  2.21, 0.21, np.log(15.0),  18, 0.085, 0.050, 0.55),
             3: (74,  2.40, 0.29, np.log(7.0),   21, 0.090, 0.050, 0.48)}


def make_facies_dataset(n=1500, seed=11):
    """The Chapter 17 facies set: four logs with a hidden true rock type."""
    rng = np.random.default_rng(seed)
    facies = rng.choice([0, 1, 2, 3], size=n, p=[0.42, 0.33, 0.07, 0.18])
    GR = np.zeros(n); RHOB = np.zeros(n); NPHI = np.zeros(n); RT = np.zeros(n)
    for f, (gr, rhob, nphi, lnrt, grs, rhs, nps, rts) in CENTROIDS.items():
        m = facies == f; k = int(m.sum())
        GR[m]   = gr + rng.normal(0, grs, k)
        RHOB[m] = rhob + rng.normal(0, rhs, k)
        NPHI[m] = np.clip(nphi + rng.normal(0, nps, k), 0, 0.55)
        RT[m]   = np.clip(np.exp(lnrt + rng.normal(0, rts, k)), 0.3, 600)
    return pd.DataFrame({"GR": GR, "RHOB": RHOB, "NPHI": NPHI, "RT": RT, "facies": facies})


# The k values to sweep (the chapter's question: how many rock types?).
KS = (2, 3, 4, 6)


def ari_vs_k(seed=11, ks=(2, 3, 4, 6)):
    """Cluster the standardised logs at several k and score each against the
    hidden true facies, to show that more clusters is NOT better.

    Returns (ari_by_k, best_k):
      ari_by_k = {k: ari} with int keys and plain-float ARI values
      best_k   = the k with the maximum ARI (max(ari_by_k, key=ari_by_k.get))
    """
    rock = make_facies_dataset(seed=seed)
    true_facies = rock["facies"].values          # revealed only to score; no model sees it
    logs = rock[["GR", "RHOB", "NPHI", "RT"]].copy()
    logs["RT"] = np.log10(logs["RT"])            # cluster resistivity on its log scale
    X = StandardScaler().fit_transform(logs.values)
    ari_by_k = {}
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        ari_by_k[int(k)] = float(adjusted_rand_score(true_facies, km.labels_))
    best_k = max(ari_by_k, key=ari_by_k.get)
    return ari_by_k, best_k


ari_by_k, best_k = ari_vs_k(11)

print("ARI by k:", {k: round(v, 4) for k, v in ari_by_k.items()})
print("best k by ARI:", best_k)

lockCopying code is a Full Access feature.

arrow_back

18.1 Force the Gas Sand to Separate - A Physics Feature for Clustering

18.3 Tuning the QC Sensitivity - Recall/Precision vs Contamination

arrow_forward