Data Cleaning Pipeline

Level 3

Chapter 4: NumPy & Pandas

descriptionProblem

Real production data arrives messy. Sensors drop out. Manual entries contain typos. Operators sometimes report a rate that's physically impossible. A monthly production review pipeline takes that raw dump and turns it into a clean, analysable table, without ever touching the original records.

You are given a 12-row DataFrame raw covering 4 wells over 3 months, with these data-quality issues:

One negative oil rate (sensor error)
One physically impossible oil rate above 8,000 bopd (data entry typo)
Two missing oil values (sensor dropout)

Build a function clean_production(raw: pd.DataFrame) -> pd.DataFrame that returns a new DataFrame clean (does not mutate raw) and:

Replaces oil rates that are negative or > 8,000 with NaN.
Sorts by well then date.
Within each well, linearly interpolates the NaN oil values

(use groupby("well") + transform + interpolate).

Adds derived columns:

total_liquid_bpd = oil_bopd + water_bwpd
water_cut_pct = water_bwpd / total_liquid_bpd * 100 (rounded to

1 decimal)

Then call your function on raw and store the result in clean.

lightbulbHints (0/5)

Stuck? Reveal hints one at a time — they progress from nudge to near-solution.

codeYour solution

main.py

import pandas as pd
import numpy as np

raw = pd.DataFrame({
    "well":      ["OD-001", "OD-001", "OD-001",
                  "OD-002", "OD-002", "OD-002",
                  "OD-003", "OD-003", "OD-003",
                  "OD-005", "OD-005", "OD-005"],
    "date":      pd.to_datetime([
        "2026-01-01", "2026-02-01", "2026-03-01",
        "2026-01-01", "2026-02-01", "2026-03-01",
        "2026-01-01", "2026-02-01", "2026-03-01",
        "2026-01-01", "2026-02-01", "2026-03-01",
    ]),
    # OD-001: clean
    # OD-002: month 2 missing (NaN sensor dropout)
    # OD-003: month 1 has impossible 12,500; month 3 has -50 sensor error
    # OD-005: month 2 missing
    "oil_bopd":   [1842, 1790, 1735,
                   965,  np.nan, 940,
                   12500, 1200, -50,
                   2210, np.nan, 2085],
    "water_bwpd": [1100, 1180, 1250,
                   2000, 2050, 2100,
                   1700, 1850, 1950,
                    800,  860,  920],
})

def clean_production(raw):
    """Clean the production DataFrame and return a new (non-mutating) copy."""
    # TODO: 1) impossible→NaN, 2) sort, 3) interpolate per well, 4) derived columns
    pass

clean = clean_production(raw)
print(clean.head(10).to_string(index=False))

visibilityReveal reference solutionexpand_more

Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.

import pandas as pd
import numpy as np

raw = pd.DataFrame({
    "well":      ["OD-001", "OD-001", "OD-001",
                  "OD-002", "OD-002", "OD-002",
                  "OD-003", "OD-003", "OD-003",
                  "OD-005", "OD-005", "OD-005"],
    "date":      pd.to_datetime([
        "2026-01-01", "2026-02-01", "2026-03-01",
        "2026-01-01", "2026-02-01", "2026-03-01",
        "2026-01-01", "2026-02-01", "2026-03-01",
        "2026-01-01", "2026-02-01", "2026-03-01",
    ]),
    "oil_bopd":   [1842, 1790, 1735,
                   965,  np.nan, 940,
                   12500, 1200, -50,
                   2210, np.nan, 2085],
    "water_bwpd": [1100, 1180, 1250,
                   2000, 2050, 2100,
                   1700, 1850, 1950,
                    800,  860,  920],
})


def clean_production(raw):
    out = raw.copy()

    # Step 1 - physically impossible oil rates → NaN
    out.loc[out["oil_bopd"] < 0, "oil_bopd"] = np.nan
    out.loc[out["oil_bopd"] > 8000, "oil_bopd"] = np.nan

    # Step 2 - sort by (well, date) so interpolation is in time order
    out = out.sort_values(["well", "date"]).reset_index(drop=True)

    # Step 3 - linear interpolate oil_bopd within each well
    out["oil_bopd"] = out.groupby("well")["oil_bopd"].transform(
        lambda s: s.interpolate(method="linear", limit=2)
    )

    # Step 4 - derived columns
    out["total_liquid_bpd"] = out["oil_bopd"] + out["water_bwpd"]
    out["water_cut_pct"] = (out["water_bwpd"] / out["total_liquid_bpd"] * 100).round(1)

    return out


clean = clean_production(raw)
print(clean.to_string(index=False))

lockCopying code is a Full Access feature.

arrow_back

4.3 Decline Rate Calculator

4.5 Multi-Well Comparison Dashboard

arrow_forward