How it works
Pipeline overview
┌──────────────┐ ┌──────────────────┐ ┌──────────────┐ ┌──────────────────┐
│ DataSource │────►│ align_to_grid() │────►│ discretize() │────►│ BN inference │
│ (per node) │ │ reproject + resample │ continuous → │ │ VariableElim. │
└──────────────┘ │ to reference grid│ │ BN states │ │ batched queries │
└──────────────────┘ └──────────────┘ └──────────┬───────┘
│
┌──────────▼───────┐
│ InferenceResult │
│ (H, W, n_states) │
│ + entropy map │
└──────────────────┘
Loading a Bayesian network
geobn reads .bif files (Bayesian Interchange Format), the standard format used by
GeNIe/Netica/bnlearn. Use geobn.load():
Attaching data sources
Each root node (node with no parents) in the BN corresponds to an evidence variable.
Attach a DataSource to each one:
bn.set_input("slope_angle", geobn.WCSSource(url="https://example.com/wcs", layer="dtm"))
bn.set_input("recent_snow", geobn.ConstantSource(30.0))
At inference time, geobn inspects all registered georeferenced sources and selects the
one with the finest resolution (smallest pixel size) as the reference grid (CRS,
resolution, extent). All other sources are reprojected to this grid automatically. To
override this behaviour, call bn.set_grid(crs, resolution, extent) explicitly before
running inference.
GridSpec and alignment
The reference grid is described by a GridSpec(crs, transform, shape). Grid alignment
uses pure numpy + pyproj bilinear interpolation — no rasterio dependency.
ConstantSource is a special case: it returns a 1×1 sentinel array with crs=None,
which align_to_grid() recognises and broadcasts to the full grid shape.
Discretization
Bayesian networks operate on discrete states. Every continuous evidence source must be
mapped to BN state names via set_discretization():
bn.set_discretization(
"slope_angle",
breakpoints=[0, 25, 40, 90], # bin edges
labels=["gentle", "steep", "extreme"], # must match BN state names exactly
)
breakpoints must have len(labels) + 1 values. Pixels outside [breakpoints[0],
breakpoints[-1]] become NaN (no inference).
NaN / NoData propagation
NaN values propagate strictly: if any input pixel is NaN, that pixel is excluded from inference and all output bands for that pixel are NaN.
This means: - Pixels outside WCS coverage → NaN inputs → NaN outputs - Sea pixels in a land DEM → NaN depth → NaN output - Invalid sensor readings → NaN evidence → NaN posteriors
Inference batching
Running one pgmpy VariableElimination.query() per pixel is prohibitively slow for
large rasters, so geobn never queries per pixel. Instead, inference proceeds in two
steps: first measure how much distinct evidence the map actually contains, then
choose the cheapest strategy for that amount.
Step 1 — group pixels by observed evidence combination
After discretization, every pixel is described by one state index per evidence node
— e.g. (slope=steep, rainfall=high). geobn groups all valid pixels by these
combinations and counts how many distinct combinations actually occur on the map.
This count is the key quantity, and it is usually far smaller than the number of theoretically possible combinations:
- It can never exceed the number of valid pixels.
- Geography is spatially correlated — neighbouring pixels tend to share the same slope class, forest class, etc.
- Scalar inputs (
ConstantSource, live weather values) contribute exactly one state each, collapsing the combination count further.
A network with millions of possible combinations may still produce only a few dozen observed ones, and geobn's strategy choice is driven entirely by the observed count — the size of the theoretical state space costs nothing by itself.
Step 2 — choose a strategy
Few observed combinations (≤ 200): per-combination queries. One pgmpy
VariableElimination query runs per observed combination (a single elimination pass
shared across all query nodes). For a 500×500 grid with 3 evidence nodes (3 states
each) there are at most 27 combinations regardless of grid size — a handful of
targeted queries is the cheapest possible approach.
Many observed combinations: one joint query. Asking thousands of nearly identical questions repeats the same internal propagation work each time. Instead, geobn asks pgmpy a single bigger question: the joint distribution P(query, e₁, …, eₖ) with no evidence at all. Normalising that array along the query axis converts it into the full conditional table P(query | e₁, …, eₖ) — the answer for every combination at once:
Per-pixel results are then read from the table by numpy fancy indexing. This keeps networks with many evidence nodes tractable: 10 nodes × 3 states (59,049 combinations) resolves in well under a second, where the per-combination loop would take minutes.
Safety valve: the joint table's size is the product of all evidence state counts × the query state count. If that exceeds an in-memory bound (~80 MB), the joint strategy is skipped and the per-combination loop is used — slower, but it only ever pays for combinations that actually occur.
Summary of regimes
| Observed combos | Table fits in memory? | Strategy | Cost |
|---|---|---|---|
| ≤ 200 | (irrelevant) | per-combination queries | milliseconds |
| many | yes | single joint query + table lookup | ~constant, sub-second |
| many | no | per-combination queries | proportional to observed combos |
In all cases the per-combination results are scattered back to the original pixel positions, and evidence combinations with zero prior probability yield NaN posteriors.
Output
InferenceResult holds:
probabilities— dict mapping each query node to a(H, W, n_states)float32 arraystate_names— ordered state labels per query nodecrs,transform— spatial metadata for the output grid
From an InferenceResult you can:
result.entropy("node")— Shannon entropy map (bits), shape (H, W)result.to_geotiff("out/")— write multi-band GeoTIFFsresult.to_xarray()— return an xarray Datasetresult.show_map()— interactive Leaflet map
Real-time / repeated inference
When running inference repeatedly (e.g. updating weather forecasts while terrain stays fixed), geobn provides two optimisation tiers:
Tier 1 — bn.freeze(*nodes): marks nodes as static. The discrete index array
is computed once on the first infer() call and cached for all subsequent calls.
The pgmpy VariableElimination object is also cached.
Tier 2 — bn.precompute(query): solves all ∏ n_states combinations once (via a
single joint query per query node) and stores a numpy lookup table. Subsequent
infer() calls use O(H×W) fancy indexing — zero pgmpy queries per call. Best for
real-time dashboards with a fixed BN structure.
Call bn.clear_cache() to reset all caches if inputs change.