Exercise 4.4
Data Cleaning Pipeline
Real production data arrives messy. Sensors drop out. Manual entries contain typos. Operators sometimes report a rate that's physically impossible. A monthly production review pipeline takes that raw dump and turns it into a clean, analysable table, without ever touching the original records.
You are given a 12-row DataFrame raw covering 4 wells over 3 months, with these data-quality issues:
- One negative oil rate (sensor error)
- One physically impossible oil rate above 8,000 bopd (data entry typo)
- Two missing oil values (sensor dropout)
Build a function clean_production(raw: pd.DataFrame) -> pd.DataFrame that returns a new DataFrame clean (does not mutate raw) and:
- Replaces oil rates that are negative or > 8,000 with
NaN. - Sorts by
wellthendate. - Within each well, linearly interpolates the NaN oil values
(use groupby("well") + transform + interpolate).
- Adds derived columns:
total_liquid_bpd = oil_bopd + water_bwpdwater_cut_pct = water_bwpd / total_liquid_bpd * 100(rounded to
1 decimal)
Then call your function on raw and store the result in clean.
Stuck? Reveal hints one at a time — they progress from nudge to near-solution.
visibilityReveal reference solutionexpand_more
Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.
import pandas as pd
import numpy as np
raw = pd.DataFrame({
"well": ["OD-001", "OD-001", "OD-001",
"OD-002", "OD-002", "OD-002",
"OD-003", "OD-003", "OD-003",
"OD-005", "OD-005", "OD-005"],
"date": pd.to_datetime([
"2026-01-01", "2026-02-01", "2026-03-01",
"2026-01-01", "2026-02-01", "2026-03-01",
"2026-01-01", "2026-02-01", "2026-03-01",
"2026-01-01", "2026-02-01", "2026-03-01",
]),
"oil_bopd": [1842, 1790, 1735,
965, np.nan, 940,
12500, 1200, -50,
2210, np.nan, 2085],
"water_bwpd": [1100, 1180, 1250,
2000, 2050, 2100,
1700, 1850, 1950,
800, 860, 920],
})
def clean_production(raw):
out = raw.copy()
# Step 1 - physically impossible oil rates → NaN
out.loc[out["oil_bopd"] < 0, "oil_bopd"] = np.nan
out.loc[out["oil_bopd"] > 8000, "oil_bopd"] = np.nan
# Step 2 - sort by (well, date) so interpolation is in time order
out = out.sort_values(["well", "date"]).reset_index(drop=True)
# Step 3 - linear interpolate oil_bopd within each well
out["oil_bopd"] = out.groupby("well")["oil_bopd"].transform(
lambda s: s.interpolate(method="linear", limit=2)
)
# Step 4 - derived columns
out["total_liquid_bpd"] = out["oil_bopd"] + out["water_bwpd"]
out["water_cut_pct"] = (out["water_bwpd"] / out["total_liquid_bpd"] * 100).round(1)
return out
clean = clean_production(raw)
print(clean.to_string(index=False))
lockCopying code is a Full Access feature.