Part II: Petroleum Data Engineering
Chapter 6
Working with Petroleum Industry Data
Why This Chapter Exists
Every calculation in this book — every decline curve, every PVT correlation, every reservoir model — starts with data. And petroleum data is unlike data in most other industries.
Well logs arrive in LAS files, a format invented in 1989 that most programmers have never seen. Production records come as monthly CSV exports from databases that were designed before Python existed. Drilling data streams in real-time from sensors thousands of feet underground, often with gaps, noise, and units that change between operators. A single field might have data in five different formats, collected by three different companies, measured in two different unit systems.
If you cannot load, parse, clean, and organize this data reliably, nothing else in this book works. The most sophisticated machine learning model is useless if it is trained on data where half the wells have missing pressure readings and the other half report in different units.
This chapter teaches you to handle petroleum data the way experienced engineers do — skeptically, systematically, and with checks at every step.
What You'll Learn
- Parse LAS files (the petroleum industry's standard well log format) using lasio
- Load and clean production data from CSV and Excel sources
- Handle the unit inconsistencies that plague real petroleum datasets
- Build quality control checks that catch physically impossible values
- Access public petroleum datasets for practice and research
- Construct reusable data loading pipelines
The Petroleum Data Landscape
Before writing any code, it helps to understand what kinds of data exist in this industry and why each one matters.
Well Log Data
When a well is drilled, logging tools are lowered into the borehole to measure the physical properties of the rock formations. These measurements — gamma ray response, electrical resistivity, bulk density, neutron porosity, and others — are recorded as continuous curves against depth. The resulting dataset is called a well log.
Well logs are the primary source of information about subsurface rock properties. They tell you whether a formation is sand or shale, whether it contains oil or water, how porous the rock is, and how easily fluids can flow through it. Without well logs, petroleum engineering would be guesswork.
The standard file format for well log data is LAS (Log ASCII Standard). It is a plain-text format with a header section containing well metadata and a data section containing the log curves as columns of numbers. The format is simple, but it has quirks — null values are typically represented as -999.25, depth can be in feet or metres, and different vendors structure the header differently.
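To make the structure concrete, here is a small hypothetical LAS file (the well name and values are invented for illustration):

```
~Version ---------------------------------------------
VERS.   2.0      : CWLS Log ASCII Standard - Version 2.0
WRAP.   NO       : One line per depth step
~Well ------------------------------------------------
STRT.FT  8000.00 : START DEPTH
STOP.FT  8002.00 : STOP DEPTH
STEP.FT     0.50 : STEP
NULL.    -999.25 : NULL VALUE
WELL.    DEMO #1 : WELL NAME
~Curve -----------------------------------------------
DEPT.FT          : Measured Depth
GR  .GAPI        : Gamma Ray
RHOB.G/C3        : Bulk Density
~ASCII -----------------------------------------------
8000.00    65.20     2.45
8000.50    72.80     2.41
8001.00  -999.25     2.39
8001.50    88.10  -999.25
8002.00    90.40     2.35
```

The NULL value declared in the ~Well section tells a reader that the -999.25 entries in the data block are missing measurements, not real readings.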
Production Data
Once a well is producing, operators record how much oil, gas, and water it produces over time. This data is typically reported monthly and includes:
- Oil rate (barrels per day or barrels per month)
- Gas rate (thousand standard cubic feet per day, Mscf/d)
- Water rate (barrels per day)
- Flowing pressures (tubing head pressure, casing pressure)
- Cumulative production (total barrels produced since the well started)
- Days on production (how many days the well actually flowed that month)
Production data drives reserve estimation, decline curve analysis, and economic evaluation. It is also routinely the messiest data in the industry. Wells shut in for maintenance, meters fail, operators change reporting conventions, and manual data entry introduces errors.
Drilling Data
During drilling operations, sensors on the rig and in the drillstring record parameters in real time: weight on bit, rotary speed, torque, rate of penetration, mud flow rate, standpipe pressure, and dozens more. This data arrives at high frequency — sometimes one reading per second — and is used to optimize drilling performance and detect problems like kicks, stuck pipe, or equipment failure.
Drilling data is typically stored in WITSML (Wellsite Information Transfer Standard Markup Language) format, though many operators export it to CSV or proprietary formats for analysis.
Reservoir and Simulation Data
Reservoir engineers work with pressure-volume-temperature (PVT) data from laboratory fluid analyses, core measurements from rock samples, and output from numerical reservoir simulators. These datasets tend to be smaller but more structured than production or drilling data.
Reading LAS Files with `lasio`
The lasio library is the standard Python tool for reading LAS files. It handles the format's quirks — header parsing, null value replacement, unit extraction — so you can focus on the data.
The header tells you everything about the well and the measurement context before you look at a single data point. This matters because the same curve mnemonic (like GR) can mean different things depending on the logging tool, the vendor, and the vintage of the data.
Now convert the log data to a pandas DataFrame for analysis:
A log with a handful of depth steps is a toy dataset. A real well log might have 18,000 rows (9,000 feet at half-foot spacing). The code works the same way regardless of size — that is the point of writing it properly from the start.
Loading Production Data
Production data most commonly arrives as CSV or Excel files exported from production databases. The structure varies by operator, but the core fields are consistent: a well identifier, a date, and rate or volume columns for oil, gas, and water.
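A minimal loading sketch with pandas. The column names here (well, date, oil_bbl, gas_mscf, water_bbl) are assumptions for illustration; real exports vary by operator:

```python
import io
import pandas as pd

# Inline sample standing in for a real export; in practice you would
# call pd.read_csv("production.csv", parse_dates=["date"]).
csv_text = """well,date,oil_bbl,gas_mscf,water_bbl
A-1,2023-01-01,3100,4500,800
A-1,2023-02-01,2950,4300,850
B-2,2023-01-01,1800,2100,400
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Always confirm dtypes first: dates should parse as datetime64,
# rates as numeric, not as strings
print(df.dtypes)
print(df.head())
```

If a rate column comes back as `object` dtype, something in the file (stray text, thousands separators, unit labels) prevented numeric parsing and needs attention before any calculation.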
Data Quality: Why It Matters More Than You Think
Raw petroleum data almost always contains problems. Sensors fail downhole. Operators transpose digits during manual entry. Wells shut in for weeks and the database records zeros (or worse, carries forward the last reading as if production continued). Different operators use different units without labeling them.
If you build a decline curve on data that includes a month where the rate was accidentally recorded as negative, your forecast is wrong. If you train a machine learning model on well logs where null values were left as -999.25 instead of being handled, the model learns that -999.25 is a real measurement and produces nonsense.
Data quality is not a preliminary step you rush through to get to the interesting work. It is the interesting work. In practice, experienced engineers spend more time cleaning and validating data than they spend on any model or calculation.
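One way to sketch a first-pass quality report in pandas (the schema and the planted errors here are invented for illustration):

```python
import io
import pandas as pd

# Sample with deliberately planted problems: a negative oil rate,
# a duplicated record, and a missing value
csv_text = """well,date,oil_bbl,water_bbl
A-1,2023-01-01,3100,800
A-1,2023-02-01,-120,850
A-1,2023-02-01,-120,850
B-2,2023-01-01,,400
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

def quality_report(df, rate_cols=("oil_bbl", "water_bbl")):
    """Summarize common data problems; run before any analysis."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing": {c: int(df[c].isna().sum()) for c in rate_cols},
        # Negative rates are physically impossible: data entry errors
        "negative": {c: int((df[c] < 0).sum()) for c in rate_cols},
    }

print(quality_report(df))
```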
This report is the first thing you should run on any new dataset. It takes seconds and saves hours of debugging later. The negative oil rate we planted in the data was caught immediately. In a real workflow, you would flag these records for review with the field operator before removing or correcting them.
Cleaning the Data
Once you know what the problems are, you fix them. The approach depends on the type of problem:
- Negative rates are physically impossible. Oil cannot flow backwards into the reservoir. These are data entry errors and should be set to NaN (not a number) and either interpolated or excluded from analysis.
- Missing values may be filled by interpolation if the gap is short (one or two months), or left as NaN if the gap is long (the well may have been shut in).
- Unit mismatches require conversion. You must know whether a rate column is in barrels per day or barrels per month before doing any calculation.
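These rules can be sketched in pandas as follows (a simplified illustration on an invented monthly series; a production workflow would also log every change for later review):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=6, freq="MS"),
    # Monthly oil volumes with a planted negative value and two gaps
    "oil_bbl_per_month": [93000.0, -500.0, np.nan, 88000.0, np.nan, 84000.0],
})

# 1. Negative rates are impossible: treat as data entry errors -> NaN
rate = df["oil_bbl_per_month"].where(df["oil_bbl_per_month"] >= 0)

# 2. Interpolate only short gaps; longer runs of NaN stay missing
#    (the well may simply have been shut in)
rate = rate.interpolate(limit=1)

# 3. Unit conversion: monthly volume -> average daily rate
df["oil_bbl_per_day"] = rate / df["date"].dt.days_in_month

print(df)
```

Note that after step 1 the negative month and the original gap form two consecutive NaNs, so `limit=1` fills only one of them; the other is flagged for review rather than silently invented.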
Public Datasets for Practice
You do not need to work at an oil company to access real petroleum data. Several public datasets are available for learning and research:
Equinor Volve Dataset — Equinor (formerly Statoil) released the complete dataset from the Volve field in the Norwegian North Sea after it was decommissioned. This includes well logs, production data, seismic data, reservoir models, and reports. It is the most comprehensive public petroleum dataset available and is used in university courses and research worldwide. Available at data.equinor.com.
North Dakota Industrial Commission (NDIC) — The state of North Dakota publishes production data for all oil and gas wells in the Bakken and other formations. This is monthly production data for thousands of wells, freely accessible. Useful for decline curve analysis practice.
UK North Sea Transition Authority (NSTA) — The UK government publishes production, well, and field data for all offshore operations on the UK Continental Shelf. Available at nstauthority.co.uk.
Kansas Geological Survey — Provides well log data in LAS format for wells across Kansas. Good for petrophysical analysis practice.
For the exercises in this book, we provide curated sample datasets in the companion repository. These are cleaned subsets of public data, sized appropriately for each chapter's calculations.
Building a Data Loading Pipeline
In practice, you will load data from the same sources repeatedly — updating production records monthly, loading new well logs as wells are drilled, pulling drilling data for each new operation. Writing the loading and cleaning logic once and packaging it into reusable functions saves time and prevents errors.
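A sketch of such a reusable loader (the column names are placeholders; adapt the cleaning steps to your own schema):

```python
import io
import pandas as pd

def load_production(source, rate_cols=("oil_bbl", "gas_mscf", "water_bbl")):
    """Load a monthly production CSV and apply standard cleaning steps.

    `source` may be a file path or any file-like object accepted by
    pandas.read_csv. Column names here are placeholders.
    """
    df = pd.read_csv(source, parse_dates=["date"])
    df = df.drop_duplicates()
    for col in rate_cols:
        # Impossible negative rates become NaN rather than silently surviving
        df[col] = df[col].where(df[col] >= 0)
    return df.sort_values(["well", "date"]).reset_index(drop=True)

# Usage with an inline sample standing in for a real file path
sample = io.StringIO(
    "well,date,oil_bbl,gas_mscf,water_bbl\n"
    "A-1,2023-02-01,2950,4300,850\n"
    "A-1,2023-01-01,3100,4500,800\n"
    "A-1,2023-01-01,3100,4500,800\n"
)
clean = load_production(sample)
print(clean)
```

Because the function is the only entry point for production data, every analysis downstream sees the same deduplicated, sorted, sanity-checked table.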
Summary
This chapter covered the foundation of all petroleum data work:
- Petroleum data comes in domain-specific formats — LAS for well logs, CSV/Excel for production, WITSML for drilling. Each has its own conventions and quirks.
- lasio is the standard Python library for reading LAS files. It handles header parsing, null value replacement, and unit extraction.
- Data quality checks are not optional. Negative rates, missing values, duplicate records, and unit mismatches are common in real petroleum datasets. Check for them systematically before any analysis.
- Cleaning follows a consistent pattern: replace impossible values with NaN, interpolate short gaps, flag long gaps for review, add derived columns (water cut, cumulative production), and standardize units.
- Public datasets — particularly the Equinor Volve dataset and US state commission data — provide realistic practice material.
- Reusable loading functions save time and prevent errors. Write them once, validate them, and use them throughout your project.
The next chapter applies these data handling skills to one of the most important analyses in petroleum engineering: interpreting well logs to determine what is in the rock and how much of it can be produced.
Exercises
LAS File Inspection
Download a LAS file from the Kansas Geological Survey or the companion repository. Using lasio, write a script that prints: The well name, field, and ...
Null Value Detection
LAS files use a null value (typically -999.25) to represent missing data. Write a function count_nulls_by_curve(las_filepath) that: Reads the LAS file...
Production Data Loader
Write a function load_and_clean_production(filepath) that: reads a CSV file of monthly production data; parses dates properly; replaces any negative rates...
Unit Converter for Log Data
Different operators report well logs in different units. Density might be in g/cc or kg/m³. Depth might be in feet or metres. Resistivity might be in ...
Multi-Well Data Merge
You have two files: a well header file (well name, field, latitude, longitude, spud date, operator) and a monthly production file (well name, date, oi...
Data Gap Analysis
Write a function find_production_gaps(df, well_col, date_col) that: groups data by well; checks for gaps in monthly reporting (months where no record ex...
Outlier Detection by Well
Statistical outliers in production data can indicate real events (a workover that boosted production) or data errors. Write a script that: For each we...
LAS to DataFrame Pipeline
Write a complete function las_to_analysis_ready(filepath) that: reads the LAS file; replaces null values with NaN; drops any curves that are more than 50%...
Multi-Well Production Analysis
Using the Volve dataset (available from data.equinor.com) or any multi-well production CSV with at least 12 months of data, write a script that: Loads...
Build Your Own Data Quality Dashboard
Using everything from this chapter, write a script that takes any production CSV file and produces a complete data quality report: Record count and da...