Production Data Loader

Level 2

Chapter 6: Petroleum Data Sources

descriptionProblem

This is the function you write once and run on every production export you ever receive. Raw monthly data arrives with the same handful of problems: a digit transposed into a negative rate, the same well-month exported twice, and a meter failure leaving a blank. You clean it and you report exactly what you changed; a silent cleaner is a liability.

The starter gives records (a list of monthly dicts with well_id, date, oil_bopd, water_bwpd, gas_mscfd, days_on) carrying all three faults.

Write load_and_clean_production(records) that returns (df, summary):

Build a DataFrame and parse date to datetime.
Drop duplicate well_id+date rows (keep the first).
Replace any negative rate (oil/water/gas) with NaN. You can't have

negative flow; flag it, don't trust it.

Sort by well then date, and add:

water_cut = water_bwpd / (oil_bopd + water_bwpd)
cum_oil_bbl = per-well cumulative oil volume = `(oil_bopd ×

days_on)` summed down the well (a rate × days is a volume).

Return df and a summary dict with n_input, n_duplicates_removed,

n_negative_fixed, and n_missing (count of remaining NaN in oil+water).

lightbulbHints (0/3)

Stuck? Reveal hints one at a time — they progress from nudge to near-solution.

codeYour solution

main.py

import numpy as np
import pandas as pd

records = [
    {"well_id": "OD-001", "date": "2023-01-01", "oil_bopd": 2400, "water_bwpd": 120,  "gas_mscfd": 1900, "days_on": 31},
    {"well_id": "OD-001", "date": "2023-02-01", "oil_bopd": 2350, "water_bwpd": 140,  "gas_mscfd": 1880, "days_on": 28},
    {"well_id": "OD-001", "date": "2023-02-01", "oil_bopd": 2350, "water_bwpd": 140,  "gas_mscfd": 1880, "days_on": 28},  # duplicate export
    {"well_id": "OD-001", "date": "2023-03-01", "oil_bopd": -50,  "water_bwpd": 160,  "gas_mscfd": 1850, "days_on": 31},  # data-entry error
    {"well_id": "OD-001", "date": "2023-04-01", "oil_bopd": 2280, "water_bwpd": None, "gas_mscfd": 1820, "days_on": 30},  # meter failure
    {"well_id": "OD-002", "date": "2023-01-01", "oil_bopd": 1800, "water_bwpd": 300,  "gas_mscfd": 1500, "days_on": 31},
    {"well_id": "OD-002", "date": "2023-02-01", "oil_bopd": 1750, "water_bwpd": 330,  "gas_mscfd": 1480, "days_on": 28},
    {"well_id": "OD-002", "date": "2023-03-01", "oil_bopd": 1700, "water_bwpd": 360,  "gas_mscfd": 1460, "days_on": 31},
]

def load_and_clean_production(records):
    """Return (cleaned_df, summary_dict)."""
    # TODO
    pass

df, summary = load_and_clean_production(records)
print(summary)
print(df.to_string(index=False))

visibilityReveal reference solutionexpand_more

Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.

import numpy as np
import pandas as pd

records = [
    {"well_id": "OD-001", "date": "2023-01-01", "oil_bopd": 2400, "water_bwpd": 120,  "gas_mscfd": 1900, "days_on": 31},
    {"well_id": "OD-001", "date": "2023-02-01", "oil_bopd": 2350, "water_bwpd": 140,  "gas_mscfd": 1880, "days_on": 28},
    {"well_id": "OD-001", "date": "2023-02-01", "oil_bopd": 2350, "water_bwpd": 140,  "gas_mscfd": 1880, "days_on": 28},
    {"well_id": "OD-001", "date": "2023-03-01", "oil_bopd": -50,  "water_bwpd": 160,  "gas_mscfd": 1850, "days_on": 31},
    {"well_id": "OD-001", "date": "2023-04-01", "oil_bopd": 2280, "water_bwpd": None, "gas_mscfd": 1820, "days_on": 30},
    {"well_id": "OD-002", "date": "2023-01-01", "oil_bopd": 1800, "water_bwpd": 300,  "gas_mscfd": 1500, "days_on": 31},
    {"well_id": "OD-002", "date": "2023-02-01", "oil_bopd": 1750, "water_bwpd": 330,  "gas_mscfd": 1480, "days_on": 28},
    {"well_id": "OD-002", "date": "2023-03-01", "oil_bopd": 1700, "water_bwpd": 360,  "gas_mscfd": 1460, "days_on": 31},
]

RATE_COLS = ["oil_bopd", "water_bwpd", "gas_mscfd"]


def load_and_clean_production(records):
    df = pd.DataFrame(records)
    n_input = len(df)
    df["date"] = pd.to_datetime(df["date"])

    # 1. Drop duplicate well-date exports.
    dup_mask = df.duplicated(subset=["well_id", "date"], keep="first")
    n_duplicates_removed = int(dup_mask.sum())
    df = df[~dup_mask].copy()

    # 2. Negative rates are impossible -> NaN.
    n_negative_fixed = 0
    for col in RATE_COLS:
        neg = df[col] < 0
        n_negative_fixed += int(neg.sum())
        df.loc[neg, col] = np.nan

    # 3. Derived columns.
    df = df.sort_values(["well_id", "date"]).reset_index(drop=True)
    df["water_cut"] = df["water_bwpd"] / (df["oil_bopd"] + df["water_bwpd"])
    oil_volume = df["oil_bopd"] * df["days_on"]
    df["cum_oil_bbl"] = oil_volume.groupby(df["well_id"]).cumsum()

    n_missing = int(df[["oil_bopd", "water_bwpd"]].isna().sum().sum())
    summary = {
        "n_input": n_input,
        "n_duplicates_removed": n_duplicates_removed,
        "n_negative_fixed": n_negative_fixed,
        "n_missing": n_missing,
    }
    return df, summary


df, summary = load_and_clean_production(records)
print(f"Cleaned {summary['n_input']} -> {len(df)} rows: "
      f"{summary['n_duplicates_removed']} dup, {summary['n_negative_fixed']} negative, "
      f"{summary['n_missing']} missing")
print(df[["well_id", "date", "oil_bopd", "water_cut", "cum_oil_bbl"]].to_string(index=False))

lockCopying code is a Full Access feature.

arrow_back

6.2 Null Value Detection

6.4 Unit Converter for Log Data

arrow_forward