Exerciseschevron_rightChapter 6chevron_right6.6
fitness_center

Exercise 6.6

Data Gap Analysis

Level 2
Chapter 6: Petroleum Data Sources
descriptionProblem

A missing month in a production series is not nothing: it's a workover, a shut-in, a facility outage, or a reporting failure, and it quietly breaks any decline fit or cumulative you run across it. So you find the gaps explicitly.

The starter gives monthly prod data where some wells are missing months. Write find_production_gaps(df, well_col, date_col) that returns a DataFrame with one row per gap and columns <well_col>, gap_start, gap_end, gap_months:

  1. For each well, build the expected monthly range from its first to its

last reported month (pd.date_range(..., freq="MS")).

  1. Find the months in that range with no record.
  2. Collapse runs of consecutive missing months into a single gap:

gap_start = first missing month, gap_end = last missing month of the run, gap_months = how many months it spans.

A well with continuous reporting contributes no rows.

lightbulbHints (0/3)

Stuck? Reveal hints one at a time — they progress from nudge to near-solution.

codeYour solution
main.py
visibilityReveal reference solutionexpand_more

Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.

import pandas as pd

prod = pd.DataFrame([
    {"well_id": "OD-001", "date": "2023-01-01", "oil_bopd": 2400},
    {"well_id": "OD-001", "date": "2023-02-01", "oil_bopd": 2350},
    {"well_id": "OD-001", "date": "2023-03-01", "oil_bopd": 2300},
    {"well_id": "OD-001", "date": "2023-04-01", "oil_bopd": 2260},
    {"well_id": "OD-001", "date": "2023-05-01", "oil_bopd": 2220},
    {"well_id": "OD-001", "date": "2023-06-01", "oil_bopd": 2180},
    {"well_id": "OD-002", "date": "2023-01-01", "oil_bopd": 1800},
    {"well_id": "OD-002", "date": "2023-02-01", "oil_bopd": 1760},
    {"well_id": "OD-002", "date": "2023-05-01", "oil_bopd": 1680},
    {"well_id": "OD-002", "date": "2023-06-01", "oil_bopd": 1650},
    {"well_id": "OD-003", "date": "2023-01-01", "oil_bopd": 3100},
    {"well_id": "OD-003", "date": "2023-02-01", "oil_bopd": 3050},
    {"well_id": "OD-003", "date": "2023-04-01", "oil_bopd": 2980},
])
prod["date"] = pd.to_datetime(prod["date"])


def _month_index(ts):
    return ts.year * 12 + (ts.month - 1)


def find_production_gaps(df, well_col, date_col):
    rows = []
    for well, g in df.groupby(well_col):
        dates = sorted(pd.to_datetime(g[date_col]).tolist())
        expected = pd.date_range(dates[0], dates[-1], freq="MS")
        present = set(dates)
        missing = [d for d in expected if d not in present]

        run = []
        for d in missing:
            if run and _month_index(d) - _month_index(run[-1]) == 1:
                run.append(d)
            else:
                if run:
                    rows.append({well_col: well, "gap_start": run[0], "gap_end": run[-1], "gap_months": len(run)})
                run = [d]
        if run:
            rows.append({well_col: well, "gap_start": run[0], "gap_end": run[-1], "gap_months": len(run)})

    return pd.DataFrame(rows, columns=[well_col, "gap_start", "gap_end", "gap_months"])


gaps = find_production_gaps(prod, "well_id", "date")
for _, r in gaps.iterrows():
    print(f"{r['well_id']}: {r['gap_start'].strftime('%Y-%m')} -> "
          f"{r['gap_end'].strftime('%Y-%m')} ({r['gap_months']} month gap)")

lockCopying code is a Full Access feature.