Exercise 6.3
Production Data Loader
This is the function you write once and run on every production export you ever receive. Raw monthly data arrives with the same handful of problems: a digit transposed into a negative rate, the same well-month exported twice, and a meter failure leaving a blank. You clean it and you report exactly what you changed; a silent cleaner is a liability.
The starter gives records (a list of monthly dicts with well_id, date, oil_bopd, water_bwpd, gas_mscfd, days_on) carrying all three faults.
Write load_and_clean_production(records) that returns (df, summary):
- Build a DataFrame and parse
dateto datetime. - Drop duplicate
well_id+daterows (keep the first). - Replace any negative rate (oil/water/gas) with
NaN. You can't have
negative flow; flag it, don't trust it.
- Sort by well then date, and add:
water_cut=water_bwpd / (oil_bopd + water_bwpd)cum_oil_bbl= per-well cumulative oil volume = `(oil_bopd ×
days_on)` summed down the well (a rate × days is a volume).
- Return
dfand asummarydict withn_input,n_duplicates_removed,
n_negative_fixed, and n_missing (count of remaining NaN in oil+water).
Stuck? Reveal hints one at a time — they progress from nudge to near-solution.
visibilityReveal reference solutionexpand_more
Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.
import numpy as np
import pandas as pd
records = [
{"well_id": "OD-001", "date": "2023-01-01", "oil_bopd": 2400, "water_bwpd": 120, "gas_mscfd": 1900, "days_on": 31},
{"well_id": "OD-001", "date": "2023-02-01", "oil_bopd": 2350, "water_bwpd": 140, "gas_mscfd": 1880, "days_on": 28},
{"well_id": "OD-001", "date": "2023-02-01", "oil_bopd": 2350, "water_bwpd": 140, "gas_mscfd": 1880, "days_on": 28},
{"well_id": "OD-001", "date": "2023-03-01", "oil_bopd": -50, "water_bwpd": 160, "gas_mscfd": 1850, "days_on": 31},
{"well_id": "OD-001", "date": "2023-04-01", "oil_bopd": 2280, "water_bwpd": None, "gas_mscfd": 1820, "days_on": 30},
{"well_id": "OD-002", "date": "2023-01-01", "oil_bopd": 1800, "water_bwpd": 300, "gas_mscfd": 1500, "days_on": 31},
{"well_id": "OD-002", "date": "2023-02-01", "oil_bopd": 1750, "water_bwpd": 330, "gas_mscfd": 1480, "days_on": 28},
{"well_id": "OD-002", "date": "2023-03-01", "oil_bopd": 1700, "water_bwpd": 360, "gas_mscfd": 1460, "days_on": 31},
]
RATE_COLS = ["oil_bopd", "water_bwpd", "gas_mscfd"]
def load_and_clean_production(records):
df = pd.DataFrame(records)
n_input = len(df)
df["date"] = pd.to_datetime(df["date"])
# 1. Drop duplicate well-date exports.
dup_mask = df.duplicated(subset=["well_id", "date"], keep="first")
n_duplicates_removed = int(dup_mask.sum())
df = df[~dup_mask].copy()
# 2. Negative rates are impossible -> NaN.
n_negative_fixed = 0
for col in RATE_COLS:
neg = df[col] < 0
n_negative_fixed += int(neg.sum())
df.loc[neg, col] = np.nan
# 3. Derived columns.
df = df.sort_values(["well_id", "date"]).reset_index(drop=True)
df["water_cut"] = df["water_bwpd"] / (df["oil_bopd"] + df["water_bwpd"])
oil_volume = df["oil_bopd"] * df["days_on"]
df["cum_oil_bbl"] = oil_volume.groupby(df["well_id"]).cumsum()
n_missing = int(df[["oil_bopd", "water_bwpd"]].isna().sum().sum())
summary = {
"n_input": n_input,
"n_duplicates_removed": n_duplicates_removed,
"n_negative_fixed": n_negative_fixed,
"n_missing": n_missing,
}
return df, summary
df, summary = load_and_clean_production(records)
print(f"Cleaned {summary['n_input']} -> {len(df)} rows: "
f"{summary['n_duplicates_removed']} dup, {summary['n_negative_fixed']} negative, "
f"{summary['n_missing']} missing")
print(df[["well_id", "date", "oil_bopd", "water_cut", "cum_oil_bbl"]].to_string(index=False))
lockCopying code is a Full Access feature.