Exerciseschevron_rightChapter 4chevron_right4.4
fitness_center

Exercise 4.4

Data Cleaning Pipeline

Level 3
Chapter 4: NumPy & Pandas
descriptionProblem

Real production data arrives messy. Sensors drop out. Manual entries contain typos. Operators sometimes report a rate that's physically impossible. A monthly production review pipeline takes that raw dump and turns it into a clean, analysable table, without ever touching the original records.

You are given a 12-row DataFrame raw covering 4 wells over 3 months, with these data-quality issues:

  • One negative oil rate (sensor error)
  • One physically impossible oil rate above 8,000 bopd (data entry typo)
  • Two missing oil values (sensor dropout)

Build a function clean_production(raw: pd.DataFrame) -> pd.DataFrame that returns a new DataFrame clean (does not mutate raw) and:

  1. Replaces oil rates that are negative or > 8,000 with NaN.
  2. Sorts by well then date.
  3. Within each well, linearly interpolates the NaN oil values

(use groupby("well") + transform + interpolate).

  1. Adds derived columns:
  • total_liquid_bpd = oil_bopd + water_bwpd
  • water_cut_pct = water_bwpd / total_liquid_bpd * 100 (rounded to

1 decimal)

Then call your function on raw and store the result in clean.

lightbulbHints (0/5)

Stuck? Reveal hints one at a time — they progress from nudge to near-solution.

codeYour solution
main.py
visibilityReveal reference solutionexpand_more

Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.

import pandas as pd
import numpy as np

raw = pd.DataFrame({
    "well":      ["OD-001", "OD-001", "OD-001",
                  "OD-002", "OD-002", "OD-002",
                  "OD-003", "OD-003", "OD-003",
                  "OD-005", "OD-005", "OD-005"],
    "date":      pd.to_datetime([
        "2026-01-01", "2026-02-01", "2026-03-01",
        "2026-01-01", "2026-02-01", "2026-03-01",
        "2026-01-01", "2026-02-01", "2026-03-01",
        "2026-01-01", "2026-02-01", "2026-03-01",
    ]),
    "oil_bopd":   [1842, 1790, 1735,
                   965,  np.nan, 940,
                   12500, 1200, -50,
                   2210, np.nan, 2085],
    "water_bwpd": [1100, 1180, 1250,
                   2000, 2050, 2100,
                   1700, 1850, 1950,
                    800,  860,  920],
})


def clean_production(raw):
    out = raw.copy()

    # Step 1 - physically impossible oil rates → NaN
    out.loc[out["oil_bopd"] < 0, "oil_bopd"] = np.nan
    out.loc[out["oil_bopd"] > 8000, "oil_bopd"] = np.nan

    # Step 2 - sort by (well, date) so interpolation is in time order
    out = out.sort_values(["well", "date"]).reset_index(drop=True)

    # Step 3 - linear interpolate oil_bopd within each well
    out["oil_bopd"] = out.groupby("well")["oil_bopd"].transform(
        lambda s: s.interpolate(method="linear", limit=2)
    )

    # Step 4 - derived columns
    out["total_liquid_bpd"] = out["oil_bopd"] + out["water_bwpd"]
    out["water_cut_pct"] = (out["water_bwpd"] / out["total_liquid_bpd"] * 100).round(1)

    return out


clean = clean_production(raw)
print(clean.to_string(index=False))

lockCopying code is a Full Access feature.