Part I: Python Fundamentals

Chapter 4

NumPy and Pandas — The Engineer's Power Tools

15 min read · 10 exercises

Up to this point, every calculation we have written processes one value at a time, or iterates through a list element by element. That works for a single well, a single depth, a single month. It does not scale. A producing field generates thousands of data points per day across dozens of wells. A reservoir simulation grid can contain millions of cells. A well log records measurements every six inches over thousands of feet of section.

NumPy and Pandas are the two libraries that make large-scale petroleum data analysis practical. NumPy provides fast array operations — performing the same calculation on thousands of values simultaneously, without a loop. Pandas provides the DataFrame — a tabular data structure that handles the messy, labeled, mixed-type data that petroleum engineers actually work with.

This chapter teaches both through real petroleum data tasks: loading production data, cleaning it, computing engineering quantities across entire fields, merging datasets from different sources, and producing the summary tables and visualizations that appear in engineering reports.

What You Will Learn

  • NumPy arrays — vectorized arithmetic, performance, and why loops disappear
  • Pandas Series and DataFrames — loading, indexing, filtering, and transforming tabular data
  • Data cleaning — handling missing values, outliers, unit inconsistencies, and physically impossible values
  • Merging and aggregating — joining well headers with production data, computing field-level summaries
  • Time series — resampling, rolling averages, and trend analysis for production surveillance

NumPy — Fast Arithmetic on Arrays

Why Arrays Matter

Consider a routine task: calculating hydrostatic pressure at 100 different depths for a given mud weight. With a Python list and a loop, the formula runs once per depth, one iteration at a time. With a NumPy array, you write the formula once and it applies to every depth.

main.py

The NumPy line 0.052 * mud_weight_ppg * depths_array applies the formula to all 100,000 elements in a single expression. There is no Python-level loop. The operation is vectorized: the iteration runs in optimized, compiled C code underneath, which is why it is dramatically faster. More importantly, the code is shorter and easier to read: one line that looks like the equation instead of four lines of loop mechanics.

At 100,000 points, the speedup is typically 50–200x. At 10 million points — common in seismic data and reservoir simulation grids — the difference between "runs in a second" and "runs for five minutes" is the difference between NumPy and a Python loop.
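The loop and vectorized approaches can be compared side by side. The following is a minimal sketch rather than the chapter's own listing: the mud weight, the depth range, and the names pressures_loop and pressures_psi are assumptions chosen for illustration.

```python
import numpy as np

mud_weight_ppg = 10.5  # assumed mud weight, lb/gal
depths_array = np.linspace(0.0, 15_000.0, 100_000)  # assumed depths, ft

# Loop version: the formula is evaluated once per Python iteration.
pressures_loop = []
for depth in depths_array:
    pressures_loop.append(0.052 * mud_weight_ppg * depth)

# Vectorized version: one expression covers every depth.
pressures_psi = 0.052 * mud_weight_ppg * depths_array

# Both approaches produce the same numbers.
print(np.allclose(pressures_loop, pressures_psi))
print(f"Deepest point: {pressures_psi[-1]:,.1f} psi")
```

Wrapping each version in the standard library's timeit module is one way to measure the speedup on your own machine; the exact factor depends on hardware and array size.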

Array Operations for Petroleum Calculations

main.py

Every operation above — subtraction, division, clipping, boolean masking — is applied to the entire array at once. The line porosity = (rho_matrix - rhob) / (rho_matrix - rho_fluid) computes porosity for all 500 depth points in a single expression. The boolean mask porosity > 0.15 produces an array of True/False values that can be used to select only the reservoir interval.
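The density-porosity workflow described here can be sketched as follows. This is an illustrative example, not the original listing: the bulk density log is synthetic, and the matrix density, fluid density, and clipping bounds are assumed values.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical log: 500 depth points at half-foot spacing.
depth_ft = 8_000.0 + 0.5 * np.arange(500)
rhob = rng.uniform(2.20, 2.70, size=500)  # synthetic bulk density, g/cc

rho_matrix = 2.65  # assumed sandstone matrix density, g/cc
rho_fluid = 1.00   # assumed pore fluid density, g/cc

# Density porosity for all 500 points in one expression.
porosity = (rho_matrix - rhob) / (rho_matrix - rho_fluid)

# Clip to an assumed physical range so noisy readings cannot go negative.
porosity = np.clip(porosity, 0.0, 0.45)

# Boolean mask: True where the rock meets the reservoir cutoff.
is_reservoir = porosity > 0.15
reservoir_depths = depth_ft[is_reservoir]
reservoir_porosity = porosity[is_reservoir]

print(f"{is_reservoir.sum()} of {porosity.size} points exceed 15% porosity")
```

Indexing with the mask keeps depth_ft and porosity aligned, so the selected depths and porosities still correspond point for point.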

main.py

Linear Algebra — Reservoir Engineering Applications

NumPy's linear algebra capabilities are essential for reservoir engineering problems that involve systems of equations. A common example: solving for pressures in a multi-well system where each well influences its neighbors.

main.py

This is a preview of the discretized flow equations used in reservoir simulation. In Chapter 11, we will build a complete 1D reservoir simulator using these same principles applied to hundreds of grid blocks.
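As a small illustration of solving such a system, consider five blocks in a row, each exchanging flow with its two neighbours, with fixed pressures at both ends. This is a simplified sketch under assumed conditions (steady state, equal transmissibility between blocks, hypothetical boundary pressures), not the multi-well listing from the chapter.

```python
import numpy as np

n = 5  # number of interior blocks
p_left, p_right = 3000.0, 1000.0  # assumed boundary pressures, psi

# Steady-state balance for each block with equal transmissibilities:
#   p[i-1] - 2*p[i] + p[i+1] = 0
# This produces a tridiagonal coefficient matrix.
A = (np.diag(-2.0 * np.ones(n))
     + np.diag(np.ones(n - 1), k=1)
     + np.diag(np.ones(n - 1), k=-1))

# Known boundary pressures move to the right-hand side.
b = np.zeros(n)
b[0] = -p_left
b[-1] = -p_right

# Solve A @ pressures = b for all blocks at once.
pressures = np.linalg.solve(A, b)
print(np.round(pressures, 1))
```

With equal transmissibilities the solution is a straight line between the two boundary pressures, which gives an easy check on the result.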

Pandas — Tabular Data for Real Engineering

Loading Production Data

main.py
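A minimal sketch of loading production data with pd.read_csv follows. The column names and values are hypothetical, and the CSV text is read from memory with io.StringIO so the example runs without a file on disk; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical monthly production export (note the blank gas value).
csv_text = """well,date,oil_bopd,water_bwpd,gas_mscfd
W-01,2023-01-31,850,120,640
W-01,2023-02-28,812,131,
W-02,2023-01-31,430,610,290
W-02,2023-02-28,-5,655,281
"""

# parse_dates converts the date column to datetime64 on load.
production = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(production.dtypes)
print(production.head())
```

Parsing dates at load time means later steps such as sorting and resampling can rely on a proper datetime column.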

Inspecting and Understanding the Data

Before any analysis, you need to understand what you are working with. How many records? What types? Where are the gaps?

main.py

This report immediately reveals the data quality issues we introduced: missing values in three columns, one negative oil rate, and one impossibly high oil rate. In real field data, these problems are universal. The next section shows how to handle them.
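The kind of checks that surface these problems can be sketched as below. The small DataFrame and the 20,000 bopd plausibility ceiling are assumptions made for illustration; they are not the chapter's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical records with deliberately planted quality problems.
df = pd.DataFrame({
    "well": ["W-01", "W-01", "W-02", "W-02", "W-03", "W-03"],
    "oil_bopd": [850.0, np.nan, 430.0, -5.0, 99_999.0, 610.0],
    "water_bwpd": [120.0, 131.0, np.nan, 655.0, 300.0, 310.0],
    "gas_mscfd": [640.0, 615.0, 290.0, 281.0, np.nan, 455.0],
})

df.info()                # row count, dtypes, non-null counts
print(df.describe())     # min and max expose out-of-range values

# Count the gaps in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Flag physically impossible rates.
max_plausible_oil = 20_000.0  # assumed ceiling for a single well, bopd
n_negative_oil = int((df["oil_bopd"] < 0).sum())
n_implausible_oil = int((df["oil_bopd"] > max_plausible_oil).sum())
print(n_negative_oil, n_implausible_oil)
```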

Data Cleaning — Handling Real-World Petroleum Data

main.py
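One possible cleaning sequence is sketched below: mark impossible values as missing, then fill gaps by interpolating within each well. The data, the plausibility ceiling, and the choice of linear interpolation are all assumptions; the right treatment depends on the field and on why each value is bad.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2023-01-01", periods=4, freq="MS")

# Hypothetical raw data: one negative rate, one spike, one gap.
raw = pd.DataFrame({
    "well": ["W-01"] * 4 + ["W-02"] * 4,
    "date": list(dates) * 2,
    "oil_bopd": [850.0, -5.0, 790.0, 770.0, 430.0, 99_999.0, np.nan, 400.0],
})

max_plausible_oil = 20_000.0  # assumed ceiling, bopd

clean = raw.copy()  # keep the raw table untouched for auditing

# Step 1: treat physically impossible rates as missing.
bad = (clean["oil_bopd"] < 0) | (clean["oil_bopd"] > max_plausible_oil)
clean.loc[bad, "oil_bopd"] = np.nan
print(f"Flagged {int(bad.sum())} impossible oil rates")

# Step 2: interpolate within each well, never across wells.
clean = clean.sort_values(["well", "date"])
clean["oil_bopd"] = clean.groupby("well")["oil_bopd"].transform(
    lambda s: s.interpolate(limit_direction="both")
)
print(clean)
```

Working on a copy and printing what each step changed keeps the process documented and repeatable.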

Grouping and Aggregation — Field-Level Analysis

Individual well data becomes field-level intelligence through grouping and aggregation.

main.py
main.py
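A short sketch of both directions of aggregation, per well and per month, is given here. The table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical monthly rates for three wells.
production = pd.DataFrame({
    "well": ["W-01", "W-01", "W-02", "W-02", "W-03", "W-03"],
    "date": pd.to_datetime(["2023-01-01", "2023-02-01"] * 3),
    "oil_bopd": [850.0, 820.0, 430.0, 420.0, 610.0, 590.0],
    "water_bwpd": [120.0, 131.0, 610.0, 655.0, 300.0, 310.0],
})

# Per-well summary using named aggregation.
per_well = production.groupby("well").agg(
    mean_oil_bopd=("oil_bopd", "mean"),
    peak_oil_bopd=("oil_bopd", "max"),
    mean_water_bwpd=("water_bwpd", "mean"),
)

# Field totals per month: sum the rates across wells.
field_by_month = production.groupby("date")[["oil_bopd", "water_bwpd"]].sum()

# Field water cut from the summed rates, not an average of well water cuts.
field_by_month["water_cut"] = field_by_month["water_bwpd"] / (
    field_by_month["oil_bopd"] + field_by_month["water_bwpd"]
)

print(per_well)
print(field_by_month)
```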

Merging — Joining Well Headers with Production Data

Production data and well metadata typically live in separate tables. Merging them lets you analyze production by well type, formation, operator, or any other attribute.

main.py

The merge operation joined 96 production records with 4 header records, matching on the well column. This is equivalent to a VLOOKUP in Excel, but it works on millions of rows and does not break when you sort the data.
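A merge of this shape can be sketched as follows, with 4 wells and 24 months giving 96 production rows. The well names, formations, and the constant oil rate are placeholders invented for the example.

```python
import pandas as pd

# Hypothetical well header table: one row per well.
headers = pd.DataFrame({
    "well": ["W-01", "W-02", "W-03", "W-04"],
    "formation": ["Upper Sand", "Upper Sand", "Lower Sand", "Lower Sand"],
    "well_type": ["vertical", "horizontal", "horizontal", "vertical"],
})

# Hypothetical production table: 24 monthly rows per well.
dates = pd.date_range("2022-01-01", periods=24, freq="MS")
production = pd.DataFrame({
    "well": [w for w in headers["well"] for _ in dates],
    "date": list(dates) * len(headers),
    "oil_bopd": 500.0,  # placeholder rate
})

# Left join keeps every production row; validate raises an error
# if a well appears more than once in the header table.
merged = production.merge(
    headers, on="well", how="left", validate="many_to_one"
)

print(len(production), len(headers), len(merged))
print(merged.groupby("formation")["oil_bopd"].mean())
```

The validate argument is a cheap safeguard: a duplicated header row would otherwise silently duplicate production records and inflate every total computed afterwards.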

Time Series — Resampling and Rolling Averages

Production data arrives at different frequencies: daily from SCADA, monthly from allocation, quarterly for regulatory reporting. Resampling converts between frequencies. Rolling averages smooth out noise to reveal underlying trends.

main.py
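Resampling and rolling averages can be sketched on a synthetic daily series. The decline trend, the noise level, and the 30-day window are assumptions chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Synthetic daily oil rate: smooth decline plus measurement noise.
days = pd.date_range("2023-01-01", periods=365, freq="D")
trend = 1000.0 * np.exp(-0.001 * np.arange(365))
daily = pd.Series(
    trend + rng.normal(0.0, 25.0, size=365), index=days, name="oil_bopd"
)

# Resample: daily values to monthly averages, labelled by month start.
monthly_mean = daily.resample("MS").mean()

# Rolling average: 30-day trailing mean; the first 29 days stay NaN
# because a full window is not yet available.
rolling_30d = daily.rolling(window=30, min_periods=30).mean()

print(monthly_mean.round(1))
print(rolling_30d.dropna().head())
```

Resampling reduces the number of rows, while a rolling average keeps one value per day and smooths it.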

Building a Monthly Production Report

This is the kind of deliverable that a production engineer creates every month. Pandas makes it a repeatable, auditable process.

main.py
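A reduced version of such a report can be sketched as a single function. The function name monthly_report, the column names, and the choice of metrics (oil rate, water cut, month-over-month change) are assumptions for illustration; a real report would follow the operator's own template.

```python
import pandas as pd

# Hypothetical monthly rates for two wells over two months.
production = pd.DataFrame({
    "well": ["W-01", "W-01", "W-02", "W-02"],
    "date": pd.to_datetime(["2023-05-01", "2023-06-01"] * 2),
    "oil_bopd": [800.0, 760.0, 400.0, 410.0],
    "water_bwpd": [200.0, 240.0, 600.0, 590.0],
})


def monthly_report(df, report_month):
    """Return one row per well for the requested month (hypothetical helper)."""
    df = df.sort_values(["well", "date"]).copy()
    # Water cut as a fraction of total liquid.
    df["water_cut"] = df["water_bwpd"] / (df["oil_bopd"] + df["water_bwpd"])
    # Percentage change in oil rate versus the previous month, per well.
    df["oil_change_pct"] = df.groupby("well")["oil_bopd"].pct_change() * 100.0
    current = df[df["date"] == pd.Timestamp(report_month)]
    columns = ["oil_bopd", "water_cut", "oil_change_pct"]
    return current.set_index("well")[columns].round(3)


report = monthly_report(production, "2023-06-01")
print(report)
```

Because the logic lives in one function, next month's report is the same call with a different date, which is what makes the process repeatable and auditable.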

Summary

This chapter covered the two libraries that form the backbone of petroleum data science:

  • NumPy arrays enable vectorized arithmetic — performing calculations on thousands of values without writing loops. Density porosity, hydrostatic pressure profiles, and linear algebra for reservoir systems all benefit from array operations.
  • Pandas DataFrames handle the labeled, mixed-type tabular data that petroleum engineers actually work with: production records, well headers, and surveillance metrics.
  • Data cleaning is not optional in petroleum data. Missing values, negative rates, impossible readings, and unit inconsistencies are the norm, not the exception. Systematic cleaning with documented steps is an engineering discipline.
  • Merging joins data from different sources — production tables with well headers, log data with formation tops — enabling analysis by well type, formation, operator, or any other attribute.
  • Time series operations — rolling averages and resampling — separate measurement noise from engineering signal, enabling meaningful trend analysis and forecasting.
  • The monthly production report — a standard industry deliverable — becomes a repeatable, auditable Pandas pipeline rather than a manual spreadsheet exercise.

In the next chapter, we focus entirely on visualization: the standard plots and chart types that petroleum engineers use to communicate data, identify problems, and support decisions.

Exercises

Exercise 4.1 (Practice)

Vectorized PVT Calculations

Using NumPy, implement the Standing correlation for bubble point pressure: $P_b = 18.2 \left[ \left( \frac{R_s}{\gamma_g} \right)^{0.83} \times 10^{(0.00091\,T - 0.0125\,\mathrm{API})} - 1.4 \right]$ ...

Exercise 4.2 (Practice)

Production Data Loader

Write a function load_production(filepath) that reads a CSV file, automatically detects date columns, converts them to datetime, handles missing value...

Exercise 4.3 (Practice)

Decline Rate Calculator

For each well in the production dataset, calculate the monthly decline rate using: $D_i = \frac{q_i - q_{i+1}}{q_i \times \Delta t}$ ...

Exercise 4.4 (Practice)

Data Cleaning Pipeline

The raw production dataset contains intentional errors (negative rates, impossibly high values, missing data). Write a complete cleaning pipeline that...

Exercise 4.5 (Practice)

Multi-Well Comparison Dashboard

Create a 2×2 subplot figure for the field production dataset showing: (a) oil rate over time for all wells, (b) water cut over time for all wells, (c)...

Exercise 4.6 (Practice)

Cumulative Production and EUR Estimation

For each well, calculate cumulative oil production using cumsum(). Plot cumulative oil vs. time. Estimate a simple EUR (Estimated Ultimate Recovery) b...

Exercise 4.7 (Practice)

Allocation Reconciliation

In many fields, total production is measured at a central facility (fiscal metering), and individual well production is estimated through allocation. ...

Exercise 4.8 (Practice)

Well Ranking System

Create a well ranking system that scores each well on multiple criteria: oil rate (higher is better), water cut (lower is better), GOR trend (stable i...

Exercise 4.9 (Practice)

Pressure Survey Analysis

A pressure survey measures static bottomhole pressure (SBHP) at multiple times during a well's life. These measurements tell the reservoir engineer wh...

Exercise 4.10 (Practice)

Field Summary Dashboard

Build a complete field summary that a production manager could present in a monthly review meeting. It should include: A summary table with one row pe...
