Exerciseschevron_rightChapter 4chevron_right4.2
fitness_center

Exercise 4.2

Production Data Loader

Level 2
Chapter 4: NumPy & Pandas
descriptionProblem

Raw production exports arrive as CSV with two recurring problems: date columns come in as plain strings, and readings go missing. A standard petroleum convention for short gaps is forward-fill: if today's reading didn't come in, assume it equals yesterday's until a new value arrives.

The starter holds the export as CSV_TEXT (read it with pd.read_csv(io.StringIO(CSV_TEXT))). Write two functions:

  1. load_production(csv_text): read the CSV, convert every column whose

name contains "date" (case-insensitive) to a real datetime, then forward-fill missing values, and return the cleaned DataFrame.

  1. quality_report(csv_text): return a dict describing the load:

n_records (row count), date_columns (list of detected date column names), missing_before (count of missing cells in the raw file), and missing_after (count remaining after the fill, should be 0).

A loader that quietly drops or mangles data is worse than one that fails; the quality report is how you prove the load did what you think it did.

lightbulbHints (0/3)

Stuck? Reveal hints one at a time — they progress from nudge to near-solution.

codeYour solution
main.py
visibilityReveal reference solutionexpand_more

Try solving it yourself first — the hints walk you through it. The solution below is one valid approach; yours may differ and still be correct.

import io
import pandas as pd

CSV_TEXT = """date,oil_bopd,water_bwpd,fwhp_psi
2024-01-01,1500,100,2200
2024-02-01,,110,2180
2024-03-01,1440,,2150
2024-04-01,1410,140,
2024-05-01,1380,150,2090
"""


def load_production(csv_text):
    df = pd.read_csv(io.StringIO(csv_text))
    for col in df.columns:
        if "date" in col.lower():
            df[col] = pd.to_datetime(df[col])
    return df.ffill()


def quality_report(csv_text):
    raw = pd.read_csv(io.StringIO(csv_text))
    clean = load_production(csv_text)
    return {
        "n_records": len(raw),
        "date_columns": [c for c in raw.columns if "date" in c.lower()],
        "missing_before": int(raw.isna().sum().sum()),
        "missing_after": int(clean.isna().sum().sum()),
    }


clean = load_production(CSV_TEXT)
print(clean)
report = quality_report(CSV_TEXT)
print(f"Loaded {report['n_records']} records, "
      f"filled {report['missing_before']} gaps, "
      f"{report['missing_after']} missing remain.")

lockCopying code is a Full Access feature.