Predict Water Saturation - Random Forest & Archie's Resistivity

Level 2

Chapter 16: ML Fundamentals

descriptionProblem

Extend make_log_dataset (or write your own generator) to include a SW_core column for water saturation, then train and evaluate a model to predict $S_w$ from the four logs. Report test R² and RMSE, and produce a cross-validated predicted-vs-actual crossplot. Which log turns out to be the most important predictor of saturation, and why? (Archie's equation links formation resistivity to water saturation, so think about which log that points to.)

---

We'll build the saturation analogue of the porosity model from this chapter. A verified generator make_sw_dataset(n=900, seed=42) is embedded for you; it is the Sw-target twin of make_log_dataset: it draws the same four wireline logs (GR, RHOB, NPHI, RT), but the target column is SW_core, water saturation in v/v. The physics that ties the logs to saturation is Archie's equation, RT = a·Rw / (φ^m · Sw^n), so deep resistivity RT carries the saturation signal, while the height above the free-water level (which sets Sw) is drawn independently of lithology on purpose. Do not modify the embedded generator or its constants.

The constants are: FEATURE_COLS = ['GR', 'RHOB', 'NPHI', 'RT'], N_ESTIMATORS = 150, TEST_SIZE = 0.25, SEED = 42.

Your task: write predict_sw(df, seed=42) that returns a dict:

Build X = df[FEATURE_COLS].values and y = df['SW_core'].values.
Split with

train_test_split(X, y, test_size=TEST_SIZE, random_state=seed).

Fit RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed)

on the four logs to predict SW_core.

Predict on the held-out test set and compute:

test_r2 = r2_score(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5
top_feature = FEATURE_COLS[int(np.argmax(rf.feature_importances_))]

Return {'test_r2': test_r2, 'rmse': rmse, 'top_feature': top_feature}.

Then call it once on the seed-42 dataset and expose three module-level scalars:

result = predict_sw(make_sw_dataset())
sw_test_r2 = result['test_r2']
sw_rmse = result['rmse']
sw_top_feature = result['top_feature']

> Think about it: the model should score around R² ≈ 0.92 with an RMSE of > about 0.066 saturation units: strong, but a touch below the porosity model > because saturation is the noisier target. The most important predictor comes > out as RT: Archie's equation ties formation resistivity directly to > water saturation, so a tree-based model that is free to weight any log will > lean hardest on resistivity. Why does the same log dominate even on a fresh > random realization of the field?

lightbulbHints (0/3)

Stuck? Reveal hints one at a time — they progress from nudge to near-solution.

codeYour solution

main.py

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

# ── Verified data generators (do not edit) ───────────────────────────────
def make_log_dataset(n=700, seed=42):
    """Synthetic-but-realistic logs -> core porosity for OML 58 wells."""
    rng = np.random.default_rng(seed)
    Vsh = np.clip(rng.beta(1.3, 3.2, n), 0, 1)                 # shale volume fraction
    depth = rng.uniform(8000, 11000, n)                        # ft
    phi = np.clip((0.33 - 0.020 * (depth - 8000) / 1000) * (1 - 0.9 * Vsh)
                  + rng.normal(0, 0.020, n), 0.02, 0.34)       # core porosity (the target)
    gas = (rng.random(n) < 0.18) & (Vsh < 0.35) & (phi > 0.14)  # gas-bearing clean sand
    rho_fl = np.where(gas, 0.35, 1.0)                          # fluid density (gas vs brine)
    rho_ma = 2.65 + 0.03 * Vsh                                 # matrix density (sand -> shale)
    RHOB = rho_ma * (1 - phi) + rho_fl * phi + rng.normal(0, 0.035, n)   # bulk density, g/cc
    NPHI = phi + 0.32 * Vsh - np.where(gas, 0.10, 0.0) + rng.normal(0, 0.022, n)  # neutron, v/v
    GR = 22 * (1 - Vsh) + 125 * Vsh + rng.normal(0, 7, n)      # gamma ray, gAPI
    RT = np.exp(rng.normal(0, 0.30, n)) * (1.5 + np.where(gas, 30, 7)
         * np.clip(0.30 - phi, 0, 1)) * (1 - 0.5 * Vsh) + 0.5  # deep resistivity, ohm-m
    RT = np.clip(RT, 0.3, 400)
    return pd.DataFrame({"GR": np.round(GR, 1), "RHOB": np.round(RHOB, 3),
                         "NPHI": np.round(NPHI, 3), "RT": np.round(RT, 2),
                         "PHI_core": np.round(phi, 4)})

def make_sw_dataset(n=900, seed=42):
    """Sw-target twin of make_log_dataset: the four logs -> core water saturation.

Saturation is set by height above the free-water level (drawn INDEPENDENTLY
    of lithology), and Archie's equation ties deep resistivity RT to Sw.
    """
    rng = np.random.default_rng(seed)
    Vsh = np.clip(rng.beta(1.3, 3.2, n), 0, 1)                 # shale volume fraction
    depth = rng.uniform(8000, 11000, n)                        # ft
    phi = np.clip((0.33 - 0.020 * (depth - 8000) / 1000) * (1 - 0.9 * Vsh)
                  + rng.normal(0, 0.020, n), 0.02, 0.34)       # porosity, v/v
    hac = rng.uniform(0, 1, n)                                 # height above free-water (norm.)
    sw = np.clip(0.20 + 0.75 * (1 - hac) ** 1.5
                 + rng.normal(0, 0.05, n), 0.15, 1.0)          # water saturation (the target)
    a, m, nn, Rw = 1.0, 2.0, 2.0, 0.06                         # Archie constants
    RT = a * Rw / (np.clip(phi, 0.02, 1) ** m * sw ** nn)      # Archie deep resistivity, ohm-m
    RT = RT * np.exp(rng.normal(0, 0.10, n))                   # log-normal measurement noise
    RT = np.clip(RT, 0.3, 2000)
    rho_ma = 2.65 + 0.03 * Vsh                                 # matrix density (sand -> shale)
    RHOB = rho_ma * (1 - phi) + 1.0 * phi + rng.normal(0, 0.035, n)  # bulk density, g/cc (brine)
    NPHI = phi + 0.32 * Vsh + rng.normal(0, 0.022, n)         # neutron, v/v
    GR = 22 * (1 - Vsh) + 125 * Vsh + rng.normal(0, 7, n)     # gamma ray, gAPI
    return pd.DataFrame({"GR": np.round(GR, 1), "RHOB": np.round(RHOB, 3),
                         "NPHI": np.round(NPHI, 3), "RT": np.round(RT, 2),
                         "SW_core": np.round(sw, 4)})

# ── Model constants (do not edit) ────────────────────────────────────────
FEATURE_COLS = ['GR', 'RHOB', 'NPHI', 'RT']
N_ESTIMATORS = 150
TEST_SIZE = 0.25
SEED = 42

def predict_sw(df, seed=42):
    """Train a random forest on the four logs to predict water saturation.

Returns {'test_r2': float, 'rmse': float, 'top_feature': str}:
      test_r2     = r2_score on the held-out test split
      rmse        = sqrt of mean_squared_error on the test split (sat. units)
      top_feature = FEATURE_COLS entry with the largest feature_importances_
    """
    # TODO: X = df[FEATURE_COLS].values
    # TODO: y = df['SW_core'].values
    # TODO: X_train, X_test, y_train, y_test = train_test_split(
    # TODO:     X, y, test_size=TEST_SIZE, random_state=seed)
    # TODO: rf = RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed)
    # TODO: rf.fit(X_train, y_train)
    # TODO: pred = rf.predict(X_test)
    # TODO: test_r2 = r2_score(y_test, pred)
    # TODO: rmse = mean_squared_error(y_test, pred) ** 0.5
    # TODO: top_feature = FEATURE_COLS[int(np.argmax(rf.feature_importances_))]
    # TODO: return {'test_r2': test_r2, 'rmse': rmse, 'top_feature': top_feature}
    return None

result = predict_sw(make_sw_dataset())
sw_test_r2 = None if result is None else result['test_r2']
sw_rmse = None if result is None else result['rmse']
sw_top_feature = None if result is None else result['top_feature']

print("Sw test R2  :", sw_test_r2)
print("Sw RMSE     :", sw_rmse)
print("top feature :", sw_top_feature)

visibilityReveal reference solutionexpand_more

Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error


# ── Verified data generators (do not edit) ───────────────────────────────
def make_log_dataset(n=700, seed=42):
    """Synthetic-but-realistic logs -> core porosity for OML 58 wells."""
    rng = np.random.default_rng(seed)
    Vsh = np.clip(rng.beta(1.3, 3.2, n), 0, 1)                 # shale volume fraction
    depth = rng.uniform(8000, 11000, n)                        # ft
    phi = np.clip((0.33 - 0.020 * (depth - 8000) / 1000) * (1 - 0.9 * Vsh)
                  + rng.normal(0, 0.020, n), 0.02, 0.34)       # core porosity (the target)
    gas = (rng.random(n) < 0.18) & (Vsh < 0.35) & (phi > 0.14)  # gas-bearing clean sand
    rho_fl = np.where(gas, 0.35, 1.0)                          # fluid density (gas vs brine)
    rho_ma = 2.65 + 0.03 * Vsh                                 # matrix density (sand -> shale)
    RHOB = rho_ma * (1 - phi) + rho_fl * phi + rng.normal(0, 0.035, n)   # bulk density, g/cc
    NPHI = phi + 0.32 * Vsh - np.where(gas, 0.10, 0.0) + rng.normal(0, 0.022, n)  # neutron, v/v
    GR = 22 * (1 - Vsh) + 125 * Vsh + rng.normal(0, 7, n)      # gamma ray, gAPI
    RT = np.exp(rng.normal(0, 0.30, n)) * (1.5 + np.where(gas, 30, 7)
         * np.clip(0.30 - phi, 0, 1)) * (1 - 0.5 * Vsh) + 0.5  # deep resistivity, ohm-m
    RT = np.clip(RT, 0.3, 400)
    return pd.DataFrame({"GR": np.round(GR, 1), "RHOB": np.round(RHOB, 3),
                         "NPHI": np.round(NPHI, 3), "RT": np.round(RT, 2),
                         "PHI_core": np.round(phi, 4)})


def make_sw_dataset(n=900, seed=42):
    """Sw-target twin of make_log_dataset: the four logs -> core water saturation.

    Saturation is set by height above the free-water level (drawn INDEPENDENTLY
    of lithology), and Archie's equation ties deep resistivity RT to Sw.
    """
    rng = np.random.default_rng(seed)
    Vsh = np.clip(rng.beta(1.3, 3.2, n), 0, 1)                 # shale volume fraction
    depth = rng.uniform(8000, 11000, n)                        # ft
    phi = np.clip((0.33 - 0.020 * (depth - 8000) / 1000) * (1 - 0.9 * Vsh)
                  + rng.normal(0, 0.020, n), 0.02, 0.34)       # porosity, v/v
    hac = rng.uniform(0, 1, n)                                 # height above free-water (norm.)
    sw = np.clip(0.20 + 0.75 * (1 - hac) ** 1.5
                 + rng.normal(0, 0.05, n), 0.15, 1.0)          # water saturation (the target)
    a, m, nn, Rw = 1.0, 2.0, 2.0, 0.06                         # Archie constants
    RT = a * Rw / (np.clip(phi, 0.02, 1) ** m * sw ** nn)      # Archie deep resistivity, ohm-m
    RT = RT * np.exp(rng.normal(0, 0.10, n))                   # log-normal measurement noise
    RT = np.clip(RT, 0.3, 2000)
    rho_ma = 2.65 + 0.03 * Vsh                                 # matrix density (sand -> shale)
    RHOB = rho_ma * (1 - phi) + 1.0 * phi + rng.normal(0, 0.035, n)  # bulk density, g/cc (brine)
    NPHI = phi + 0.32 * Vsh + rng.normal(0, 0.022, n)         # neutron, v/v
    GR = 22 * (1 - Vsh) + 125 * Vsh + rng.normal(0, 7, n)     # gamma ray, gAPI
    return pd.DataFrame({"GR": np.round(GR, 1), "RHOB": np.round(RHOB, 3),
                         "NPHI": np.round(NPHI, 3), "RT": np.round(RT, 2),
                         "SW_core": np.round(sw, 4)})


# ── Model constants (do not edit) ────────────────────────────────────────
FEATURE_COLS = ['GR', 'RHOB', 'NPHI', 'RT']
N_ESTIMATORS = 150
TEST_SIZE = 0.25
SEED = 42


def predict_sw(df, seed=42):
    """Train a random forest on the four logs to predict water saturation.

    Returns {'test_r2': float, 'rmse': float, 'top_feature': str}:
      test_r2     = r2_score on the held-out test split
      rmse        = sqrt of mean_squared_error on the test split (sat. units)
      top_feature = FEATURE_COLS entry with the largest feature_importances_
    """
    X = df[FEATURE_COLS].values
    y = df['SW_core'].values
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=TEST_SIZE, random_state=seed)
    rf = RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=seed)
    rf.fit(X_train, y_train)
    pred = rf.predict(X_test)
    test_r2 = r2_score(y_test, pred)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    top_feature = FEATURE_COLS[int(np.argmax(rf.feature_importances_))]
    return {'test_r2': test_r2, 'rmse': rmse, 'top_feature': top_feature}


result = predict_sw(make_sw_dataset())
sw_test_r2 = result['test_r2']
sw_rmse = result['rmse']
sw_top_feature = result['top_feature']

print("Sw test R2  :", sw_test_r2)
print("Sw RMSE     :", sw_rmse)
print("top feature :", sw_top_feature)

lockCopying code is a Full Access feature.

arrow_back

15.9 Gas Dehydration - Water Content & Dew-Point Spec

16.2 Feature Engineering Bake-Off - Does It Help the Linear Model or the Forest?

arrow_forward