Exercise 18.2
The Cost of Choosing k Wrong - ARI vs k for Electrofacies
Cluster the logs at k=2, 3, 4, and 6, and for each compute the ARI against the true facies. Plot ARI versus k. Where does agreement peak, and does it match what the silhouette score recommended? Explain to a geologist why the "best" k by silhouette is not always the most useful k for naming rock.
---
The chapter chose k with the silhouette score and the elbow plot. Here you ask the sharper, label-aware question those diagnostics cannot: how good is each k at actually recovering the rock? You sweep several k, score each clustering against the hidden true facies with the adjusted Rand index (ARI), and watch the score peak at a small k and then collapse as you over-split.
The verified make_facies_dataset generator and its FACIES / CENTROIDS constants are embedded for you under a do-not-edit banner; use them exactly as given; do not re-derive the physics or change the seeds.
Write one function:
def ari_vs_k(seed=11, ks=(2, 3, 4, 6)):
'''Cluster the standardised logs at several k and score each against the
hidden true facies, to show that more clusters is NOT better.'''Embedded constants (use these exactly):
- The k values to sweep are exactly
KS = (2, 3, 4, 6)(the defaultks). - Every
KMeansusesn_init=10, random_state=0. - Cluster resistivity on the log10 scale, exactly as in the chapter.
Exact procedure:
rock = make_facies_dataset(seed=seed);true_facies = rock['facies'].values.logs = rock[['GR','RHOB','NPHI','RT']].copy();logs['RT'] = np.log10(logs['RT']);
X = StandardScaler().fit_transform(logs.values).
- For each
kinks:km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X);
record adjusted_rand_score(true_facies, km.labels_).
- Build
ari_by_k= a dict{k: ari}with int keys and **plain Python
float values, and best_k = the k with the maximum** ARI (use max(ari_by_k, key=ari_by_k.get)).
return (ari_by_k, best_k).
Then, at module level, call it once and unpack into the exact names the tests read:
ari_by_k, best_k = ari_vs_k(11)> Think about it: ARI peaks at a small k (2 or 3; shale-vs-reservoir is > the data's real signal) and falls hard as k grows to 6, even though k=6 makes > more clusters. The silhouette's favourite (k=2, from the chapter) is not > always the most useful for naming rock. We cluster at k=3 anyway to keep an > interpretable sand split. Why does adding clusters past the real structure > destroy agreement with the truth instead of improving it?
Stuck? Reveal hints one at a time — they progress from nudge to near-solution.
visibilityReveal reference solutionexpand_more
Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
# ── Verified Chapter 18 facies generator + constants (do not edit) ───────
FACIES = {0: "Shale", 1: "Brine Sand", 2: "Gas Sand", 3: "Siltstone"}
CENTROIDS = {0: (108, 2.51, 0.33, np.log(3.5), 16, 0.080, 0.045, 0.40),
1: (46, 2.28, 0.27, np.log(9.0), 18, 0.075, 0.045, 0.45),
2: (43, 2.21, 0.21, np.log(15.0), 18, 0.085, 0.050, 0.55),
3: (74, 2.40, 0.29, np.log(7.0), 21, 0.090, 0.050, 0.48)}
def make_facies_dataset(n=1500, seed=11):
"""The Chapter 17 facies set: four logs with a hidden true rock type."""
rng = np.random.default_rng(seed)
facies = rng.choice([0, 1, 2, 3], size=n, p=[0.42, 0.33, 0.07, 0.18])
GR = np.zeros(n); RHOB = np.zeros(n); NPHI = np.zeros(n); RT = np.zeros(n)
for f, (gr, rhob, nphi, lnrt, grs, rhs, nps, rts) in CENTROIDS.items():
m = facies == f; k = int(m.sum())
GR[m] = gr + rng.normal(0, grs, k)
RHOB[m] = rhob + rng.normal(0, rhs, k)
NPHI[m] = np.clip(nphi + rng.normal(0, nps, k), 0, 0.55)
RT[m] = np.clip(np.exp(lnrt + rng.normal(0, rts, k)), 0.3, 600)
return pd.DataFrame({"GR": GR, "RHOB": RHOB, "NPHI": NPHI, "RT": RT, "facies": facies})
# The k values to sweep (the chapter's question: how many rock types?).
KS = (2, 3, 4, 6)
def ari_vs_k(seed=11, ks=(2, 3, 4, 6)):
"""Cluster the standardised logs at several k and score each against the
hidden true facies, to show that more clusters is NOT better.
Returns (ari_by_k, best_k):
ari_by_k = {k: ari} with int keys and plain-float ARI values
best_k = the k with the maximum ARI (max(ari_by_k, key=ari_by_k.get))
"""
rock = make_facies_dataset(seed=seed)
true_facies = rock["facies"].values # revealed only to score; no model sees it
logs = rock[["GR", "RHOB", "NPHI", "RT"]].copy()
logs["RT"] = np.log10(logs["RT"]) # cluster resistivity on its log scale
X = StandardScaler().fit_transform(logs.values)
ari_by_k = {}
for k in ks:
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
ari_by_k[int(k)] = float(adjusted_rand_score(true_facies, km.labels_))
best_k = max(ari_by_k, key=ari_by_k.get)
return ari_by_k, best_k
ari_by_k, best_k = ari_vs_k(11)
print("ARI by k:", {k: round(v, 4) for k, v in ari_by_k.items()})
print("best k by ARI:", best_k)
lockCopying code is a Full Access feature.