Part I: Python Fundamentals

Chapter 4

NumPy and Pandas: The Engineer's Power Tools

Up to this point, every calculation we have written processes one value at a time, or iterates through a list element by element. That works for a single well, a single depth, a single month. It does not scale. A producing field generates thousands of data points per day across dozens of wells. A reservoir simulation grid can contain millions of cells. A well log records measurements every six inches over thousands of feet of section.

NumPy and Pandas are the two libraries that make large-scale petroleum data analysis practical. NumPy provides fast array operations, performing the same calculation on thousands of values simultaneously, without a loop. Pandas provides the DataFrame, a tabular data structure that handles the messy, labeled, mixed-type data that petroleum engineers actually work with.

What You Will Learn

NumPy arrays: vectorized arithmetic, performance, and why loops disappear
Pandas Series and DataFrames: loading, indexing, filtering, and transforming tabular data
Data cleaning: handling missing values, outliers, unit inconsistencies, and physically impossible values
Merging and aggregating: joining well headers with production data, computing field-level summaries
Time series: resampling, rolling averages, and trend analysis for production surveillance

NumPy: Fast Arithmetic on Arrays

Why Arrays Matter

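Consider a routine task: calculating hydrostatic pressure at many depths for a given mud weight. With a Python list and a loop, the interpreter evaluates the formula once per element. With a NumPy array, you write the formula once and it is applied to every element. The comparison below uses 100,000 depth points so the timing difference is easy to measure.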

import numpy as np
import time

mud_weight_ppg = 11.6
n_depths = 100_000  # 100,000 depth points - realistic for a fine-grid pressure profile

# === List approach - element by element ===
depths_list = [i * 0.1 for i in range(1, n_depths + 1)]  # 0.1 to 10,000 ft in 0.1-ft steps

start = time.perf_counter()
pressures_list = []
for d in depths_list:
    pressures_list.append(0.052 * mud_weight_ppg * d)
list_time = time.perf_counter() - start

# === NumPy approach - all at once ===
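# Build depths from integer steps: np.arange with a fractional step can
# return one element too many because of floating-point rounding
depths_array = np.arange(1, n_depths + 1) * 0.1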

start = time.perf_counter()
pressures_array = 0.052 * mud_weight_ppg * depths_array
numpy_time = time.perf_counter() - start

print(f"Depths computed:     {n_depths:,}")
print(f"List approach:       {list_time*1000:.1f} ms")
print(f"NumPy approach:      {numpy_time*1000:.3f} ms")
print(f"Speedup:             {list_time/numpy_time:.0f}x faster")
print(f"\nFirst 5 pressures:   {pressures_array[:5]}")
print(f"Last 5 pressures:    {pressures_array[-5:]}")

The NumPy line 0.052 * mud_weight_ppg * depths_array applies the formula to all 100,000 elements in one call. There is no Python-level loop. The operation is vectorized: it runs in optimized C code underneath, and the single line reads like the equation instead of four lines of loop mechanics.

At 100,000 points, the speedup is typically in the range of 50–200x, depending on hardware and NumPy version. At 10 million points (common in seismic data and reservoir simulation grids), with the same calculation repeated across many wells or timesteps, that ratio is the difference between a job that finishes in seconds and one that runs for minutes.
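The same vectorized style extends to more than one input array. The sketch below is an illustration only, with hypothetical mud weights and depths: it uses NumPy broadcasting to evaluate the hydrostatic formula for every combination of depth and mud weight without a nested loop.

```python
import numpy as np

# Hypothetical inputs: three candidate mud weights (ppg) and five depths (ft)
mud_weights_ppg = np.array([9.0, 11.6, 13.5])
depths_ft = np.array([2000.0, 4000.0, 6000.0, 8000.0, 10000.0])

# Turn depths into a column so NumPy broadcasts shape (5, 1) against (3,)
# and returns a (5, 3) grid: one row per depth, one column per mud weight.
pressure_grid_psi = 0.052 * depths_ft[:, np.newaxis] * mud_weights_ppg

print(pressure_grid_psi.shape)
print(pressure_grid_psi.round(0))
```

Each column of the result is a full pressure profile for one mud weight, which is a convenient shape for comparing candidate mud programs side by side.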

Array Operations for Petroleum Calculations

A reservoir engineer rarely computes one pressure. They compute it for every depth, every well, every timestep. A NumPy array lets a single expression run over the whole dataset at once, which is the difference between code that keeps up with the data and code that crawls through it value by value.

import numpy as np

# Well log data - 500 depth points from a density-neutron log
np.random.seed(42)
n = 500
depth = np.linspace(9000, 9500, n)

# Synthetic log values
rhob = np.where((depth > 9150) & (depth < 9350),
                2.35 + np.random.normal(0, 0.03, n),
                2.55 + np.random.normal(0, 0.03, n))

rhob = np.clip(rhob, 1.8, 3.0)  # Physical bounds for bulk density

# === Porosity from Density Log ===
# Density porosity formula: φ = (ρma - ρb) / (ρma - ρf)
# where ρma = matrix density (sandstone ≈ 2.65 g/cc)
#       ρf  = fluid density (≈ 1.0 g/cc for water)
#       ρb  = measured bulk density

rho_matrix = 2.65
rho_fluid = 1.0

porosity = (rho_matrix - rhob) / (rho_matrix - rho_fluid)
porosity = np.clip(porosity, 0, 0.45)  # Porosity cannot be negative or > 45%

print(f"=== Density Porosity Calculation ===")
print(f"Depth range:     {depth[0]:.0f} – {depth[-1]:.0f} ft")
print(f"Data points:     {n}")
print(f"Matrix density:  {rho_matrix} g/cc (sandstone)")
print(f"Fluid density:   {rho_fluid} g/cc (water)")
print()
print(f"Porosity statistics:")
print(f"  Min:   {porosity.min():.3f} ({porosity.min()*100:.1f}%)")
print(f"  Max:   {porosity.max():.3f} ({porosity.max()*100:.1f}%)")
print(f"  Mean:  {porosity.mean():.3f} ({porosity.mean()*100:.1f}%)")
print(f"  Std:   {porosity.std():.3f}")

# Find the reservoir zone (porosity > 0.15)
reservoir_mask = porosity > 0.15
reservoir_depths = depth[reservoir_mask]
print(f"\nReservoir zone (φ > 15%):")
print(f"  Top:       {reservoir_depths[0]:.0f} ft")
print(f"  Bottom:    {reservoir_depths[-1]:.0f} ft")
print(f"  Thickness: {reservoir_depths[-1] - reservoir_depths[0]:.0f} ft")
print(f"  Avg φ:     {porosity[reservoir_mask].mean():.3f} ({porosity[reservoir_mask].mean()*100:.1f}%)")

Every operation above (subtraction, division, clipping, boolean masking) is applied to the entire array at once. The line porosity = (rho_matrix - rhob) / (rho_matrix - rho_fluid) computes porosity for all 500 depth points in a single expression. The boolean mask porosity > 0.15 produces an array of True/False values that can be used to select only the reservoir interval.
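A boolean mask can also be counted, not just used for selection. The thickness printed above is the span from the first to the last sample above the cutoff, so it includes any tight streaks in between. The following sketch, on a small hypothetical log sampled every foot, shows one way to separate that gross interval from net pay by summing the mask.

```python
import numpy as np

# Hypothetical 1-ft sampled porosity log (fractions), chosen for illustration
depth_ft = np.arange(9000, 9010)  # 10 samples, 1 ft apart
porosity = np.array([0.05, 0.08, 0.16, 0.19, 0.21, 0.14, 0.18, 0.17, 0.07, 0.04])

cutoff = 0.15
pay_mask = porosity > cutoff  # True where the rock passes the cutoff

sample_spacing_ft = depth_ft[1] - depth_ft[0]

# Gross interval: first to last passing sample, including tight streaks between
pay_depths = depth_ft[pay_mask]
gross_ft = pay_depths[-1] - pay_depths[0] + sample_spacing_ft

# Net pay: True counts as 1 when summed, so this counts passing samples only
net_ft = pay_mask.sum() * sample_spacing_ft

print(f"Gross interval: {gross_ft} ft, net pay: {net_ft} ft")
print(f"Net-to-gross:   {net_ft / gross_ft:.2f}")
```

The sample at 9,005 ft falls below the cutoff, so it is part of the gross interval but not of the net pay.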

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 10), sharey=True)

# Track 1: Bulk Density
ax1.plot(rhob, depth, color="#4682B4", linewidth=0.7)
ax1.set_xlabel("Bulk Density (g/cc)", fontsize=10)
ax1.set_ylabel("Depth (ft)", fontsize=11)
ax1.set_xlim(2.0, 2.8)
ax1.set_title("RHOB", fontsize=11, fontweight='bold')
ax1.invert_yaxis()
ax1.grid(True, alpha=0.15)

# Track 2: Density Porosity
ax2.plot(porosity * 100, depth, color="#2E8B57", linewidth=0.8)
ax2.axvline(x=15, color="#CC4444", linestyle="--", linewidth=1, alpha=0.6, label="Cutoff (15%)")
ax2.fill_betweenx(depth, 0, porosity * 100, where=(porosity > 0.15),
                  alpha=0.15, color="#FFD700")
ax2.set_xlabel("Porosity (%)", fontsize=10)
ax2.set_xlim(0, 35)
ax2.set_title("Density Porosity", fontsize=11, fontweight='bold')
ax2.legend(loc="lower right", fontsize=9)
ax2.grid(True, alpha=0.15)

# Highlight reservoir zone
for ax in [ax1, ax2]:
    ax.axhspan(9150, 9350, alpha=0.04, color="#FFD700")

fig.suptitle("Oso-Deep 003 - Porosity from Density Log", fontsize=13, fontweight='bold')
fig.tight_layout()
plt.show()

Density porosity log for Oso-Deep 003. The yellow-shaded zone between 9,150 and 9,350 ft shows porosity above the 15% cutoff, indicating reservoir-quality rock. Porosity averages approximately 18% in the pay zone and drops to roughly 6% in the bounding shale intervals. The red dashed line is the net pay cutoff: rock below this porosity is too tight to produce hydrocarbons economically.

Linear Algebra: Reservoir Engineering Applications

Some reservoir problems cannot be solved one cell at a time: the pressure in each grid block depends on its neighbors, so all the unknowns must be solved together. That is a system of equations, and NumPy's linalg solver handles it in one call.

import numpy as np

# Simplified steady-state pressure calculation for 4 connected grid blocks.
# Each block's pressure depends on its neighbors and any wells producing from it.
#
# The system Ax = b represents the discretized flow equations:
# A = transmissibility matrix (how easily fluid flows between blocks)
# b = source/sink terms (wells producing or injecting)
# x = unknown pressures

# Transmissibility matrix (symmetric - flow is bidirectional)
A = np.array([
    [ 3, -1, -1,  0],
    [-1,  3,  0, -1],
    [-1,  0,  3, -1],
    [ 0, -1, -1,  3],
])

# Source terms - negative means production (fluid leaving the system)
# Block 0: injector adding 500 bbl/d equivalent
# Block 3: producer taking 500 bbl/d equivalent
b = np.array([500, 0, 0, -500])

# Solve for pressures
pressures = np.linalg.solve(A, b)

print("Grid Block Pressures (relative units):")
for i, p in enumerate(pressures):
    well_type = "Injector" if b[i] > 0 else "Producer" if b[i] < 0 else "No well"
    print(f"  Block {i}: {p:8.1f}  ({well_type})")

# Verify the solution: A @ x should equal b
residual = np.linalg.norm(A @ pressures - b)
print(f"\nResidual (should be ~0): {residual:.2e}")

This is a preview of the discretized flow equations used in reservoir simulation. In Chapter 11, we will build a complete 1D reservoir simulator using these same principles applied to hundreds of grid blocks.
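To see where such a matrix can come from, here is a minimal sketch under assumptions that differ from the example above: a hypothetical 1D chain of four blocks with unit transmissibility between neighbours and a fixed pressure at each end. Writing the mass balance for each block produces a tridiagonal system, and the solution should be a straight-line pressure profile between the two boundaries.

```python
import numpy as np

n_blocks = 4
p_left, p_right = 3000.0, 1000.0  # hypothetical fixed boundary pressures, psi

# Mass balance for block i with unit transmissibility to each neighbour:
#   (p[i-1] - p[i]) + (p[i+1] - p[i]) = 0   ->   -p[i-1] + 2*p[i] - p[i+1] = 0
A = 2.0 * np.eye(n_blocks) - np.eye(n_blocks, k=1) - np.eye(n_blocks, k=-1)

# The known boundary pressures move to the right-hand side
b = np.zeros(n_blocks)
b[0] = p_left
b[-1] = p_right

pressures = np.linalg.solve(A, b)
print(pressures)

# Check against the analytical straight-line profile between the boundaries
expected = p_left + (p_right - p_left) * np.arange(1, n_blocks + 1) / (n_blocks + 1)
print(np.allclose(pressures, expected))
```

Comparing the numerical answer with a case that has a known analytical solution is a cheap sanity check before scaling the same construction up to more blocks.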

Pandas: Tabular Data for Real Engineering

Loading Production Data

Production data shows up as a file, not a Python object, so the first move in any analysis is loading it into a shape you can work with. Pandas reads a CSV into a DataFrame, a table with named columns, in a single call.

import pandas as pd
import numpy as np

# Create a realistic multi-well production dataset
np.random.seed(123)

wells = ["OD-001", "OD-003", "OD-005", "OD-007"]
dates = pd.date_range("2025-01-01", periods=24, freq="MS")  # 24 months

records = []
for well in wells:
    # Each well has different initial rate and decline characteristics
    base_rates = {
        "OD-001": (2400, 0.04, 300),
        "OD-003": (3150, 0.06, 420),
        "OD-005": (1800, 0.03, 150),
        "OD-007": (2950, 0.05, 80),
    }
    qi, di, wi = base_rates[well]

    for i, date in enumerate(dates):
        oil = qi * np.exp(-di * i) + np.random.normal(0, qi * 0.02)
        water = wi + 40 * i + np.random.normal(0, 30)
        gas = oil * (2.1 + np.random.normal(0, 0.1))
        fwhp = 800 - 8 * i + np.random.normal(0, 15)

        records.append({
            "well": well,
            "date": date,
            "oil_bopd": max(0, round(oil, 1)),
            "water_bwpd": max(0, round(water, 1)),
            "gas_mscfd": max(0, round(gas, 1)),
            "fwhp_psi": max(50, round(fwhp, 0)),
        })

# Introduce some realistic data quality issues
records[14]["oil_bopd"] = np.nan        # Missing value - sensor outage
records[27]["oil_bopd"] = -200          # Negative - database error
records[42]["water_bwpd"] = np.nan      # Missing
records[55]["oil_bopd"] = 15000         # Impossibly high - wrong well allocation
records[70]["fwhp_psi"] = np.nan        # Missing

df = pd.DataFrame(records)
df.to_csv("field_production_24mo.csv", index=False)

print(f"Dataset: {len(df)} records × {len(df.columns)} columns")
print(f"Wells: {df['well'].nunique()}")
print(f"Date range: {df['date'].min().date()} to {df['date'].max().date()}")
print(f"\nFirst 8 rows:")
print(df.head(8).to_string(index=False))
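The block above builds the dataset and writes it to disk; reading a file back is the single call mentioned earlier. The sketch below is self-contained, so it reads from an in-memory string standing in for a file, and the three rows are hypothetical values that only mirror the column names used here. With a real file you would pass the path instead of the StringIO object.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; values are hypothetical
csv_text = """well,date,oil_bopd,water_bwpd
OD-001,2025-01-01,2391.2,310.5
OD-001,2025-02-01,,352.0
OD-003,2025-01-01,3160.4,401.7
"""

# parse_dates converts the listed columns from text to datetime64
loaded = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(loaded.dtypes)
print(loaded)
```

Without parse_dates the date column would load as plain text, and date arithmetic or resampling would fail later. The empty oil_bopd field in the second row is read as NaN.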

Inspecting and Understanding the Data

Before any analysis, you need to understand what you are working with. How many records? What types? Where are the gaps?

print("=== Data Quality Report ===\n")
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns\n")

# Missing values
print("Missing values:")
for col in df.columns:
    n_missing = df[col].isna().sum()
    if n_missing > 0:
        print(f"  {col}: {n_missing} ({n_missing/len(df)*100:.1f}%)")

# Basic statistics
print("\nNumerical summary:")
print(df.describe().round(1).to_string())

# Check for physically impossible values
print(f"\nData quality flags:")
print(f"  Negative oil rates:     {(df['oil_bopd'] < 0).sum()}")
print(f"  Oil rate > 10,000 bopd: {(df['oil_bopd'] > 10000).sum()}")
print(f"  Negative water rates:   {(df['water_bwpd'] < 0).sum()}")
print(f"  Missing FWHP readings:  {df['fwhp_psi'].isna().sum()}")

The report surfaces every issue we injected: missing values in three columns, one negative oil rate, and one impossibly high oil rate. In real field data, these problems are universal. The next section shows how to handle them.

Data Cleaning: Handling Real-World Petroleum Data

What is `NaN`, and why a `.copy()`?

NaN stands for "Not a Number", the floating-point value Pandas uses to represent a missing entry. It survives arithmetic (5 + NaN = NaN), prints as NaN, and is detected with .isna(). Use NaN, not zero or -999, for missing data: element-wise calculations then propagate the missing-ness, and Pandas reductions such as .mean() skip the gaps by default instead of silently averaging in a placeholder.

df.copy() creates a separate DataFrame so changes to clean cannot leak back into df. Plain assignment (clean = df) only adds a second name for the same object, and modifying a filtered slice without .copy() is what triggers Pandas' infamous SettingWithCopyWarning and a class of "why is the original data changing?" bugs. Treat raw data as read-only; clean into a copy; analyse the copy. The same discipline applies to NumPy arrays: use .copy() when you intend to mutate.
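Both points are easy to check on a toy example. The values below are hypothetical, and the sketch only illustrates default Pandas behaviour: how NaN behaves in arithmetic and in reductions, and how .copy() protects the raw table.

```python
import numpy as np
import pandas as pd

rates = pd.Series([1200.0, np.nan, 1100.0])

print(5 + np.nan)                 # arithmetic with NaN gives NaN
print(rates.isna().sum())         # number of missing entries
print(rates.mean())               # reductions skip NaN by default
print(rates.mean(skipna=False))   # request propagation explicitly
print(rates.fillna(-999).mean())  # a placeholder value distorts the average

# Clean into a copy so the raw readings stay untouched
raw = pd.DataFrame({"oil_bopd": [1200.0, -50.0, 1100.0]})
clean = raw.copy()
clean.loc[clean["oil_bopd"] < 0, "oil_bopd"] = np.nan

print(raw["oil_bopd"].tolist())   # raw still holds the -50.0 reading
print(clean["oil_bopd"].tolist())
```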

# Make a working copy - never modify the raw data
clean = df.copy()

# Step 1: Replace physically impossible values with NaN
# Oil rates cannot be negative and rarely exceed 10,000 bopd in this field
clean.loc[clean["oil_bopd"] < 0, "oil_bopd"] = np.nan
clean.loc[clean["oil_bopd"] > 10000, "oil_bopd"] = np.nan

print("After removing impossible values:")
print(f"  Total NaN in oil_bopd: {clean['oil_bopd'].isna().sum()}")

# Step 2: Interpolate missing values within each well
# Linear interpolation is appropriate for short gaps (1-2 months)
clean = clean.sort_values(["well", "date"])
clean["oil_bopd"] = clean.groupby("well")["oil_bopd"].transform(
    lambda x: x.interpolate(method="linear", limit=2)
)
clean["water_bwpd"] = clean.groupby("well")["water_bwpd"].transform(
    lambda x: x.interpolate(method="linear", limit=2)
)
clean["fwhp_psi"] = clean.groupby("well")["fwhp_psi"].transform(
    lambda x: x.interpolate(method="linear", limit=2)
)

print(f"  After interpolation: {clean['oil_bopd'].isna().sum()} remaining NaN")

# Step 3: Calculate derived columns
clean["total_liquid_bpd"] = clean["oil_bopd"] + clean["water_bwpd"]
clean["water_cut_pct"] = (clean["water_bwpd"] / clean["total_liquid_bpd"] * 100).round(1)
clean["gor_scf_bbl"] = (clean["gas_mscfd"] * 1000 / clean["oil_bopd"]).round(0)

print(f"\nCleaned dataset: {len(clean)} records")
print(f"New columns added: total_liquid_bpd, water_cut_pct, gor_scf_bbl")
print(f"\nSample of cleaned data:")
print(clean[clean["well"] == "OD-003"].head(6).to_string(index=False))

Grouping and Aggregation: Field-Level Analysis

A manager does not want 96 rows of per-well monthly data; they want one number per well and one for the field. Grouping and aggregation collapse the table to that.

# Well-level summary - the kind of table that appears in every monthly report
well_summary = clean.groupby("well").agg(
    avg_oil=("oil_bopd", "mean"),
    latest_oil=("oil_bopd", "last"),
    peak_oil=("oil_bopd", "max"),
    avg_water_cut=("water_cut_pct", "mean"),
    latest_water_cut=("water_cut_pct", "last"),
    avg_gor=("gor_scf_bbl", "mean"),
    avg_fwhp=("fwhp_psi", "mean"),
    months=("date", "count"),
).round(0)

print("=== Well Performance Summary ===\n")
print(well_summary.to_string())

# Field totals
field_oil = clean.groupby("date")["oil_bopd"].sum()
field_water = clean.groupby("date")["water_bwpd"].sum()
field_wc = field_water / (field_oil + field_water) * 100

print(f"\n=== Field Totals ===")
print(f"Current field oil rate:  {field_oil.iloc[-1]:,.0f} bopd")
print(f"Current field water cut: {field_wc.iloc[-1]:.1f}%")
print(f"Peak field oil rate:     {field_oil.max():,.0f} bopd")

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 5.5))

# Pivot to get one column per well
pivot = clean.pivot_table(index="date", columns="well", values="oil_bopd", aggfunc="sum")

colors = ["#2E8B57", "#4682B4", "#D4A847", "#CC6644"]
pivot.plot.area(ax=ax, stacked=True, color=colors, alpha=0.75, linewidth=0.5)

ax.set_xlabel("Date", fontsize=11)
ax.set_ylabel("Oil Rate (bopd)", fontsize=11)
ax.set_title("OML 58 - Field Oil Production by Well", fontsize=13, fontweight='bold')
ax.legend(title="Well", loc="upper right", fontsize=9)
ax.grid(axis='y', alpha=0.2)

fig.tight_layout()
plt.show()

Stacked area chart of each well's contribution to total field oil production over 24 months. Production declines across all wells; OD-003 (blue) starts highest, while OD-005 (gold) declines slowest and holds the steadiest base. The stack makes the field's dependence on a few wells plain - the read that drives production planning and decline forecasting.

Merging: Joining Well Headers with Production Data

Production data and well metadata typically live in separate tables. Merging them lets you analyze production by well type, formation, operator, or any other attribute.

The four flavours of `merge`

pd.merge(left, right, on="well", how=...) lets you choose what happens to non-matching rows:

how="inner" (default): keep only rows where the key is in both tables. Safe; loses orphaned data.
how="left": keep every row of the left table; NaN for missing matches in the right. Most common in production analysis: "every well in the production data, with whatever metadata I have."
how="right": mirror image of left.
how="outer": keep every row from both; NaN everywhere there's no match. Use when reconciling two databases.

Choose inner for safety, left for completeness. Pick deliberately; the wrong join can silently drop wells.
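The effect of each option is easiest to see on tables that deliberately do not match. In this sketch the well names and rates are hypothetical: one well has production but no header, and another has a header but no production. The indicator=True argument is a standard pd.merge option that records where each row came from.

```python
import pandas as pd

# Hypothetical tables: OD-009 has production but no header, OD-011 the reverse
production = pd.DataFrame({
    "well": ["OD-001", "OD-003", "OD-009"],
    "oil_bopd": [2100.0, 2800.0, 950.0],
})
headers = pd.DataFrame({
    "well": ["OD-001", "OD-003", "OD-011"],
    "well_type": ["Vertical", "Horizontal", "Vertical"],
})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    result = pd.merge(production, headers, on="well", how=how)
    row_counts[how] = len(result)
    print(f"{how:>5}: {len(result)} rows -> {sorted(result['well'])}")

# indicator=True adds a _merge column: left_only, right_only, or both
audit = pd.merge(production, headers, on="well", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
print(unmatched)
```

Running an outer merge with the indicator before the real join is a quick way to list the wells that an inner join would drop.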

# Well header table - static information about each well
headers = pd.DataFrame({
    "well": ["OD-001", "OD-003", "OD-005", "OD-007"],
    "well_type": ["Vertical", "Horizontal", "Vertical", "Horizontal"],
    "target_formation": ["E3000 Sand", "E3000 Sand", "D2000 Sand", "E3000 Sand"],
    "tvd_ft": [9800, 9650, 8400, 9900],
    "lateral_length_ft": [0, 4200, 0, 5100],
    "completion_date": pd.to_datetime(["2023-06-15", "2025-03-22", "2022-11-01", "2025-08-10"]),
})

# Merge production data with well headers
merged = clean.merge(headers, on="well", how="left")

# Now we can analyze by well type
by_type = merged.groupby("well_type").agg(
    well_count=("well", "nunique"),
    avg_oil=("oil_bopd", "mean"),
    avg_water_cut=("water_cut_pct", "mean"),
    avg_gor=("gor_scf_bbl", "mean"),
).round(0)

print("Performance by Well Type:\n")
print(by_type.to_string())

# Analyze by formation
by_fm = merged.groupby("target_formation").agg(
    wells=("well", "nunique"),
    total_oil=("oil_bopd", "sum"),
    avg_wc=("water_cut_pct", "mean"),
).round(0)

print(f"\nPerformance by Formation:\n")
print(by_fm.to_string())

The merge operation joined 96 production records with 4 header records, matching on the well column. This is equivalent to a VLOOKUP in Excel, but it works on millions of rows and does not break when you sort the data.

Time Series: Resampling and Rolling Averages

Production data arrives at different frequencies: daily from SCADA, monthly from allocation, quarterly for regulatory reporting. Resampling converts between frequencies. Rolling averages smooth out noise to reveal underlying trends.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Generate synthetic daily production data for one well
np.random.seed(456)
days = 365
dates_daily = pd.date_range("2025-01-01", periods=days, freq="D")

# Underlying decline + daily noise + occasional shut-ins
qi = 2800
di = 0.0018  # daily decline rate
base_rate = qi * np.exp(-di * np.arange(days))
noise = np.random.normal(0, 80, days)
daily_oil = base_rate + noise

# Simulate 5 brief shut-ins (maintenance, weather)
shutin_starts = [45, 112, 198, 267, 320]
for s in shutin_starts:
    duration = np.random.randint(1, 4)
    daily_oil[s:s+duration] = 0

daily_oil = np.maximum(daily_oil, 0)

daily_df = pd.DataFrame({
    "date": dates_daily,
    "oil_bopd": daily_oil
}).set_index("date")

# Calculate rolling average
daily_df["rolling_30d"] = daily_df["oil_bopd"].rolling(window=30, min_periods=10).mean()

# Resample to monthly averages
monthly = daily_df["oil_bopd"].resample("MS").mean()

# Plot
fig, ax = plt.subplots(figsize=(10, 5))

ax.plot(daily_df.index, daily_df["oil_bopd"], color="#CCCCCC", linewidth=0.5,
        alpha=0.7, label="Daily (raw)")
ax.plot(daily_df.index, daily_df["rolling_30d"], color="#2E8B57", linewidth=2,
        label="30-day rolling avg")
ax.scatter(monthly.index, monthly.values, color="#D4A847", zorder=5, s=40,
           edgecolors="white", linewidth=0.5, label="Monthly avg")

# Economic limit
ax.axhline(y=200, color="#CC4444", linestyle="--", linewidth=1, alpha=0.5,
           label="Economic limit (200 bopd)")

ax.set_xlabel("Date", fontsize=11)
ax.set_ylabel("Oil Rate (bopd)", fontsize=11)
ax.set_title("OD-009 - Daily Production with Rolling Average", fontsize=13, fontweight='bold')
ax.legend(loc="upper right", fontsize=9)
ax.set_ylim(0, 3200)
ax.grid(True, alpha=0.15)

fig.tight_layout()
plt.show()

print(f"Daily records:       {len(daily_df)}")
print(f"Monthly averages:    {len(monthly)}")
print(f"Initial rate:        {daily_df['oil_bopd'].iloc[:7].mean():,.0f} bopd (first week avg)")
print(f"Final rate:          {daily_df['oil_bopd'].iloc[-7:].mean():,.0f} bopd (last week avg)")
print(f"Annual decline:      {(1 - daily_df['oil_bopd'].iloc[-7:].mean() / daily_df['oil_bopd'].iloc[:7].mean()) * 100:.1f}%")

Daily vs. 30-day rolling average oil production for Well OD-009. The raw daily data (gray) shows the noise inherent in field measurements - gauge fluctuations, brief shut-ins for maintenance, allocation adjustments. The 30-day rolling average (green) reveals the underlying decline trend, which is what the reservoir engineer actually needs for forecasting. The red dashed line marks the economic limit (200 bopd for this well) - the rate below which production costs exceed revenue.
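When resampling, the choice of aggregation decides what the monthly number means. The sketch below uses a hypothetical well with a flat daily rate in each month, so the results can be checked by hand: the mean gives an average rate, while the sum gives a monthly volume that depends on the number of days.

```python
import numpy as np
import pandas as pd

# Hypothetical well: flat 100 bopd through January, 200 bopd through February
idx = pd.date_range("2025-01-01", "2025-02-28", freq="D")
rate = pd.Series(np.where(idx.month == 1, 100.0, 200.0), index=idx, name="oil_bopd")

monthly_avg_rate = rate.resample("MS").mean()  # bopd, comparable month to month
monthly_volume = rate.resample("MS").sum()     # bbl, depends on days in the month

# Trailing 3-day mean; min_periods=1 lets the first days report a value
rolling_3d = rate.rolling(window=3, min_periods=1).mean()

print(monthly_avg_rate)
print(monthly_volume)
print(rolling_3d.loc[pd.Timestamp("2025-01-30"):pd.Timestamp("2025-02-03")])
```

The rolling window is trailing, so the smoothed curve reaches the new rate only after a full window has passed the step change. The same lag applies to the 30-day average in the plot above.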

Building a Monthly Production Report

Every month, a production engineer builds this table by hand. The same steps in Pandas run on demand and leave an audit trail.

# Latest month's data
latest_month = clean["date"].max()
latest = clean[clean["date"] == latest_month].copy()

# Previous month for comparison
prev_month = latest_month - pd.DateOffset(months=1)
previous = clean[clean["date"] == prev_month].copy()

# Build report
report = latest[["well", "oil_bopd", "water_bwpd", "gas_mscfd", "water_cut_pct", "fwhp_psi"]].copy()
report = report.merge(
    previous[["well", "oil_bopd"]].rename(columns={"oil_bopd": "prev_oil"}),
    on="well", how="left"
)
report["change_pct"] = ((report["oil_bopd"] - report["prev_oil"]) / report["prev_oil"] * 100).round(1)

# Add field total row
field_total = pd.DataFrame([{
    "well": "FIELD TOTAL",
    "oil_bopd": report["oil_bopd"].sum(),
    "water_bwpd": report["water_bwpd"].sum(),
    "gas_mscfd": report["gas_mscfd"].sum(),
    "water_cut_pct": (report["water_bwpd"].sum() /
                      (report["oil_bopd"].sum() + report["water_bwpd"].sum()) * 100),
    "fwhp_psi": report["fwhp_psi"].mean(),
    "prev_oil": report["prev_oil"].sum(),
    "change_pct": ((report["oil_bopd"].sum() - report["prev_oil"].sum()) /
                    report["prev_oil"].sum() * 100),
}])

report = pd.concat([report, field_total], ignore_index=True)

print(f"=== MONTHLY PRODUCTION REPORT - {latest_month.strftime('%B %Y')} ===\n")
print(report.round(1).to_string(index=False))

Summary

This chapter covered the two libraries that form the backbone of petroleum data science:

NumPy arrays enable vectorized arithmetic, performing calculations on thousands of values without writing loops. Density porosity, hydrostatic pressure profiles, and linear algebra for reservoir systems all benefit from array operations.
Pandas DataFrames handle the labeled, mixed-type tabular data that petroleum engineers actually work with: production records, well headers, and surveillance metrics.
Data cleaning is not optional in petroleum data. Missing values, negative rates, impossible readings, and unit inconsistencies are the norm, not the exception. Systematic cleaning with documented steps is an engineering discipline.
Merging joins data from different sources (production tables with well headers, log data with formation tops), enabling analysis by well type, formation, operator, or any other attribute.
Time series operations (rolling averages and resampling) separate measurement noise from engineering signal, enabling meaningful trend analysis and forecasting.
The monthly production report, a standard industry deliverable, becomes a repeatable, auditable Pandas pipeline rather than a manual spreadsheet exercise.

In the next chapter, we focus entirely on visualization: the standard plots and chart types that petroleum engineers use to communicate data, identify problems, and support decisions.

Exercises

Exercise 4.1 (Practice): Vectorized PVT Calculations

Using NumPy, implement the Standing correlation for bubble point pressure: $P_b = 18.2 \left[ \left( \frac{R_s}{\gamma_g} \right)^{0.83} \times 10^{(0.00091 \times T - 0.0125 \times \text{API})} - 1.4 \right]$ ...