Outlier Detection by Well

Level 2

Chapter 6: Petroleum Data Sources

descriptionProblem

A single month that's wildly off the trend is either a real event (a workover that doubled rate, a well test) or a data error (a fat-fingered entry). Either way you want it flagged for a human to look at, never silently averaged in.

The starter builds two years of monthly oil for two wells. OD-001 has one planted spike; OD-002 is clean. Write flag_outliers(df, well_col, value_col, n_sigma=3.0):

For each well, compute the mean and standard deviation of value_col.
Compute each row's z-score (value − mean) / std.
Return the rows whose |z| > n_sigma, with a z_score column added.

A well with no outliers contributes nothing. (This is the classic z-score screen; on small samples it's blind, which is why you run it across a full history, and why a flag means "go look", not "delete".)

lightbulbHints (0/3)

Stuck? Reveal hints one at a time — they progress from nudge to near-solution.

codeYour solution

main.py

visibilityReveal reference solutionexpand_more

Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.

import pandas as pd

rows = []
for i in range(24):
    rows.append({"well_id": "OD-001", "month": i + 1, "oil_bopd": 2400 - 15 * i})
    rows.append({"well_id": "OD-002", "month": i + 1, "oil_bopd": 1800 - 12 * i})
df = pd.DataFrame(rows)
df.loc[(df["well_id"] == "OD-001") & (df["month"] == 11), "oil_bopd"] = 5500


def flag_outliers(df, well_col, value_col, n_sigma=3.0):
    flagged = []
    for well, g in df.groupby(well_col):
        mean = g[value_col].mean()
        std = g[value_col].std()  # sample std (ddof=1)
        if not std or pd.isna(std):
            continue
        z = (g[value_col] - mean) / std
        hit = g[z.abs() > n_sigma].copy()
        hit["z_score"] = z[z.abs() > n_sigma]
        flagged.append(hit)
    if flagged:
        return pd.concat(flagged).reset_index(drop=True)
    return pd.DataFrame(columns=list(df.columns) + ["z_score"])


out = flag_outliers(df, "well_id", "oil_bopd")
for _, r in out.iterrows():
    print(f"{r['well_id']} month {int(r['month'])}: {r['oil_bopd']} bopd "
          f"(z = {r['z_score']:+.2f}) -- review")

lockCopying code is a Full Access feature.

arrow_back

6.6 Data Gap Analysis

6.8 LAS to DataFrame Pipeline

arrow_forward