Part II: Petroleum Data Engineering
Chapter 6
Working with Petroleum Industry Data
Why This Chapter Exists
Every calculation in this book — every decline curve, every PVT correlation, every reservoir model — starts with data. And petroleum data is unlike data in most other industries.
Well logs arrive in LAS files, a format invented in 1989 that most programmers have never seen. Production records come as monthly CSV exports from databases that were designed before Python existed. Drilling data streams in real-time from sensors thousands of feet underground, often with gaps, noise, and units that change between operators. A single field might have data in five different formats, collected by three different companies, measured in two different unit systems.
If you cannot load, parse, clean, and organize this data reliably, nothing else in this book works. The most sophisticated machine learning model is useless if it is trained on data where half the wells have missing pressure readings and the other half report in different units.
This chapter teaches you to handle petroleum data the way experienced engineers do — skeptically, systematically, and with checks at every step.
What You'll Learn
- Parse LAS files (the petroleum industry's standard well log format) using lasio
- Load and clean production data from CSV and Excel sources
- Handle the unit inconsistencies that plague real petroleum datasets
- Build quality control checks that catch physically impossible values
- Access public petroleum datasets for practice and research
- Construct reusable data loading pipelines
The Petroleum Data Landscape
Before writing any code, it helps to understand what kinds of data exist in this industry and why each one matters.
Well Log Data
When a well is drilled, logging tools are lowered into the borehole to measure the physical properties of the rock formations. These measurements — gamma ray response, electrical resistivity, bulk density, neutron porosity, and others — are recorded as continuous curves against depth. The resulting dataset is called a well log.
Well logs are the primary source of information about subsurface rock properties. They tell you whether a formation is sand or shale, whether it contains oil or water, how porous the rock is, and how easily fluids can flow through it. Without well logs, petroleum engineering would be guesswork.
The standard file format for well log data is LAS (Log ASCII Standard). It is a plain-text format with a header section containing well metadata and a data section containing the log curves as columns of numbers. The format is simple, but it has quirks — null values are typically represented as -999.25, depth can be in feet or metres, and different vendors structure the header differently.
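To make the structure concrete, here is a small hypothetical LAS file (the well name and values are invented for illustration):

```
~Version ---------------------------------------------
VERS.   2.0      : CWLS Log ASCII Standard - Version 2.0
WRAP.   NO       : One line per depth step
~Well ------------------------------------------------
STRT.FT  8000.00 : START DEPTH
STOP.FT  8002.00 : STOP DEPTH
STEP.FT     0.50 : STEP
NULL.    -999.25 : NULL VALUE
WELL.    DEMO #1 : WELL NAME
~Curve -----------------------------------------------
DEPT.FT          : Measured Depth
GR  .GAPI        : Gamma Ray
RHOB.G/C3        : Bulk Density
~ASCII -----------------------------------------------
8000.00    65.20     2.45
8000.50    72.80     2.41
8001.00  -999.25     2.39
8001.50    88.10  -999.25
8002.00    90.40     2.35
```

The NULL value declared in the ~Well section tells a reader that the -999.25 entries in the data block are missing measurements, not real readings.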
Production Data
Once a well is producing, operators record how much oil, gas, and water it produces over time. This data is typically reported monthly and includes:
- Oil rate (barrels per day or barrels per month)
- Gas rate (thousand standard cubic feet per day, Mscf/d)
- Water rate (barrels per day)
- Flowing pressures (tubing head pressure, casing pressure)
- Cumulative production (total barrels produced since the well started)
- Days on production (how many days the well actually flowed that month)
Production data drives reserve estimation, decline curve analysis, and economic evaluation. It is also routinely the messiest data in the industry. Wells shut in for maintenance, meters fail, operators change reporting conventions, and manual data entry introduces errors.
Drilling Data
During drilling operations, sensors on the rig and in the drillstring record parameters in real time: weight on bit, rotary speed, torque, rate of penetration, mud flow rate, standpipe pressure, and dozens more. This data arrives at high frequency — sometimes one reading per second — and is used to optimize drilling performance and detect problems like kicks, stuck pipe, or equipment failure.
Drilling data is typically stored in WITSML (Wellsite Information Transfer Standard Markup Language) format, though many operators export it to CSV or proprietary formats for analysis.
Reservoir and Simulation Data
Reservoir engineers work with pressure-volume-temperature (PVT) data from laboratory fluid analyses, core measurements from rock samples, and output from numerical reservoir simulators. These datasets tend to be smaller but more structured than production or drilling data.
Reading LAS Files with `lasio`
The lasio library is the standard Python tool for reading LAS files. It handles the format's quirks — header parsing, null value replacement, unit extraction — so you can focus on the data.
The header tells you everything about the well and the measurement context before you look at a single data point. This matters because the same curve mnemonic (like GR) can mean different things depending on the logging tool, the vendor, and the vintage of the data.
Now convert the log data to a pandas DataFrame for analysis:
A log with a handful of depth steps is a toy dataset. A real well log might have 18,000 rows (9,000 feet at half-foot spacing). The code works the same way regardless of size — that is the point of writing it properly from the start.
Loading Production Data
Production data most commonly arrives as CSV or Excel files exported from production databases. The structure varies by operator, but the core fields are consistent: a well identifier, a date, and rate or volume columns for oil, gas, and water.
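A minimal loading sketch with pandas. The column names here (well, date, oil_bbl, gas_mscf, water_bbl) are assumptions for illustration; real exports vary by operator:

```python
import io
import pandas as pd

# Inline sample standing in for a real export; in practice you would
# call pd.read_csv("production.csv", parse_dates=["date"]).
csv_text = """well,date,oil_bbl,gas_mscf,water_bbl
A-1,2023-01-01,3100,4500,800
A-1,2023-02-01,2950,4300,850
B-2,2023-01-01,1800,2100,400
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Always confirm dtypes first: dates should parse as datetime64,
# rates as numeric, not as strings
print(df.dtypes)
print(df.head())
```

If a rate column comes back as `object` dtype, something in the file (stray text, thousands separators, unit labels) prevented numeric parsing and needs attention before any calculation.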
Data Quality: Why It Matters More Than You Think
Raw petroleum data almost always contains problems. Sensors fail downhole. Operators transpose digits during manual entry. Wells shut in for weeks and the database records zeros (or worse, carries forward the last reading as if production continued). Different operators use different units without labeling them.
If you build a decline curve on data that includes a month where the rate was accidentally recorded as negative, your forecast is wrong. If you train a machine learning model on well logs where null values were left as -999.25 instead of being handled, the model learns that -999.25 is a real measurement and produces nonsense.
Data quality is not a preliminary step you rush through to get to the interesting work. It is the interesting work. In practice, experienced engineers spend more time cleaning and validating data than they spend on any model or calculation.
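One way to sketch a first-pass quality report in pandas (the schema and the planted errors here are invented for illustration):

```python
import io
import pandas as pd

# Sample with deliberately planted problems: a negative oil rate,
# a duplicated record, and a missing value
csv_text = """well,date,oil_bbl,water_bbl
A-1,2023-01-01,3100,800
A-1,2023-02-01,-120,850
A-1,2023-02-01,-120,850
B-2,2023-01-01,,400
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

def quality_report(df, rate_cols=("oil_bbl", "water_bbl")):
    """Summarize common data problems; run before any analysis."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing": {c: int(df[c].isna().sum()) for c in rate_cols},
        # Negative rates are physically impossible: data entry errors
        "negative": {c: int((df[c] < 0).sum()) for c in rate_cols},
    }

print(quality_report(df))
```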
This report is the first thing you should run on any new dataset. It takes seconds and saves hours of debugging later. The negative oil rate we planted in the data was caught immediately. In a real workflow, you would flag these records for review with the field operator before removing or correcting them.
Cleaning the Data
Once you know what the problems are, you fix them. The approach depends on the type of problem:
- Negative rates are physically impossible. Oil cannot flow backwards into the reservoir. These are data entry errors and should be set to NaN (not a number) and either interpolated or excluded from analysis.
- Missing values may be filled by interpolation if the gap is short (one or two months), or left as NaN if the gap is long (the well may have been shut in).
- Unit mismatches require conversion. You must know whether a rate column is in barrels per day or barrels per month before doing any calculation.
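These rules can be sketched in pandas as follows (a simplified illustration on an invented monthly series; a production workflow would also log every change for later review):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=6, freq="MS"),
    # Monthly oil volumes with a planted negative value and two gaps
    "oil_bbl_per_month": [93000.0, -500.0, np.nan, 88000.0, np.nan, 84000.0],
})

# 1. Negative rates are impossible: treat as data entry errors -> NaN
rate = df["oil_bbl_per_month"].where(df["oil_bbl_per_month"] >= 0)

# 2. Interpolate only short gaps; longer runs of NaN stay missing
#    (the well may simply have been shut in)
rate = rate.interpolate(limit=1)

# 3. Unit conversion: monthly volume -> average daily rate
df["oil_bbl_per_day"] = rate / df["date"].dt.days_in_month

print(df)
```

Note that after step 1 the negative month and the original gap form two consecutive NaNs, so `limit=1` fills only one of them; the other is flagged for review rather than silently invented.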
Public Datasets for Practice
You do not need to work at an oil company to access real petroleum data. Several public datasets are available for learning and research:
Equinor Volve Dataset — Equinor (formerly Statoil) released the complete dataset from the Volve field in the Norwegian North Sea after it was decommissioned. This includes well logs, production data, seismic data, reservoir models, and reports. It is the most comprehensive public petroleum dataset available and is used in university courses and research worldwide. Available at data.equinor.com.
North Dakota Industrial Commission (NDIC) — The state of North Dakota publishes production data for all oil and gas wells in the Bakken and other formations. This is monthly production data for thousands of wells, freely accessible. Useful for decline curve analysis practice.
UK North Sea Transition Authority (NSTA) — The UK government publishes production, well, and field data for all offshore operations on the UK Continental Shelf. Available at nstauthority.co.uk.
Kansas Geological Survey — Provides well log data in LAS format for wells across Kansas. Good for petrophysical analysis practice.
For the exercises in this book, we provide curated sample datasets in the companion repository. These are cleaned subsets of public data, sized appropriately for each chapter's calculations.
Building a Data Loading Pipeline
In practice, you will load data from the same sources repeatedly — updating production records monthly, loading new well logs as wells are drilled, pulling drilling data for each new operation. Writing the loading and cleaning logic once and packaging it into reusable functions saves time and prevents errors.
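A sketch of such a reusable loader (the column names are placeholders; adapt the cleaning steps to your own schema):

```python
import io
import pandas as pd

def load_production(source, rate_cols=("oil_bbl", "gas_mscf", "water_bbl")):
    """Load a monthly production CSV and apply standard cleaning steps.

    `source` may be a file path or any file-like object accepted by
    pandas.read_csv. Column names here are placeholders.
    """
    df = pd.read_csv(source, parse_dates=["date"])
    df = df.drop_duplicates()
    for col in rate_cols:
        # Impossible negative rates become NaN rather than silently surviving
        df[col] = df[col].where(df[col] >= 0)
    return df.sort_values(["well", "date"]).reset_index(drop=True)

# Usage with an inline sample standing in for a real file path
sample = io.StringIO(
    "well,date,oil_bbl,gas_mscf,water_bbl\n"
    "A-1,2023-02-01,2950,4300,850\n"
    "A-1,2023-01-01,3100,4500,800\n"
    "A-1,2023-01-01,3100,4500,800\n"
)
clean = load_production(sample)
print(clean)
```

Because the function is the only entry point for production data, every analysis downstream sees the same deduplicated, sorted, sanity-checked table.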
Summary
This chapter covered the foundation of all petroleum data work:
- Petroleum data comes in domain-specific formats — LAS for well logs, CSV/Excel for production, WITSML for drilling. Each has its own conventions and quirks.
- lasio is the standard Python library for reading LAS files. It handles header parsing, null value replacement, and unit extraction.
- Data quality checks are not optional. Negative rates, missing values, duplicate records, and unit mismatches are common in real petroleum datasets. Check for them systematically before any analysis.
- Cleaning follows a consistent pattern: replace impossible values with NaN, interpolate short gaps, flag long gaps for review, add derived columns (water cut, cumulative production), and standardize units.
- Public datasets — particularly the Equinor Volve dataset and US state commission data — provide realistic practice material.
- Reusable loading functions save time and prevent errors. Write them once, validate them, and use them throughout your project.
The next chapter applies these data handling skills to one of the most important analyses in petroleum engineering: interpreting well logs to determine what is in the rock and how much of it can be produced.
Exercises
LAS File Inspection
Download a LAS file from the Kansas Geological Survey or the companion repository. Using lasio, write a script that prints: The well name, field, and ...
Null Value Detection
LAS files use a null value (typically -999.25) to represent missing data. Write a function count_nulls_by_curve(las_filepath) that: Reads the LAS file...
Production Data Loader
Write a function load_and_clean_production(filepath) that: reads a CSV file of monthly production data; parses dates properly; replaces any negative rates...
Unit Converter for Log Data
Different operators report well logs in different units. Density might be in g/cc or kg/m³. Depth might be in feet or metres. Resistivity might be in ...
Multi-Well Data Merge
You have two files: a well header file (well name, field, latitude, longitude, spud date, operator) and a monthly production file (well name, date, oi...
Data Gap Analysis
Write a function find_production_gaps(df, well_col, date_col) that: groups data by well; checks for gaps in monthly reporting (months where no record ex...
Outlier Detection by Well
Statistical outliers in production data can indicate real events (a workover that boosted production) or data errors. Write a script that: For each we...
LAS to DataFrame Pipeline
Write a complete function las_to_analysis_ready(filepath) that: reads the LAS file; replaces null values with NaN; drops any curves that are more than 50%...
Multi-Well Production Analysis
Using the Volve dataset (available from data.equinor.com) or any multi-well production CSV with at least 12 months of data, write a script that: Loads...
Build Your Own Data Quality Dashboard
Using everything from this chapter, write a script that takes any production CSV file and produces a complete data quality report: Record count and da...