Filters
All charts below react to these selections
Dataset
Grid size
Implementation
Communication
I/O mode
of runs in the dataset match the filters
Scaling Analysis
Per-run metrics against core count · filter-driven
Throughput (Mcells/s)
Cell-updates per second — hardware utilization
P2P vs RMA parity
Matched pairs · points on the diagonal = identical
I/O overhead
Iteration time breakdown: no-IO · seq IO · parallel IO
Implementation comparison
Same cores, same grid — which decomposition wins?
Amdahl fit — serial fraction ƒ
Fits S(N) = 1 / (ƒ + (1−ƒ)/N) per series. Lower ƒ means a higher scaling ceiling (max speedup = 1/ƒ). Super-linear runs yield ƒ < 0: a cache effect, not something the model can fit.
Fitted ƒ per series
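As a sketch of the fit (the function name and the point-wise averaging are illustrative assumptions, not the dashboard's actual fitter; it assumes each series contains a single-core baseline run):

```python
import numpy as np

def fit_serial_fraction(cores, times):
    """Estimate the Amdahl serial fraction f for one series.

    Model: S(N) = 1 / (f + (1 - f)/N).
    Rearranged per measured point: f = (N/S - 1) / (N - 1),
    then averaged over all N > 1. Super-linear points give f < 0.
    """
    cores = np.asarray(cores, dtype=float)
    times = np.asarray(times, dtype=float)
    t1 = times[cores == 1][0]            # single-core baseline time
    s = t1 / times                       # measured speedups
    mask = cores > 1                     # N = 1 carries no information
    f_pts = (cores[mask] / s[mask] - 1) / (cores[mask] - 1)
    return float(f_pts.mean())
```

A run that beats ideal speedup (S > N) pushes the estimate below zero, which is exactly the "ƒ < 0" case flagged above.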
Pareto frontier — time vs cost
Each dot is a run · X = wall time (s) · Y = CPU-cost (cores × seconds).
Gold stars mark Pareto-optimal configs — no other run is both cheaper AND faster.
The frontier is your efficient menu.
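The frontier test itself is a simple dominance check (a minimal sketch; the dashboard's own star-marking logic may differ):

```python
def pareto_optimal(runs):
    """Return the runs not dominated in (wall_time, cpu_cost).

    A run is Pareto-optimal when no other run is at least as good on
    both axes and strictly better on one, i.e. no other run is both
    cheaper AND faster.
    runs: list of (wall_time_s, cpu_cost_core_s) tuples.
    """
    frontier = []
    for i, (t, c) in enumerate(runs):
        dominated = any(
            t2 <= t and c2 <= c and (t2 < t or c2 < c)
            for j, (t2, c2) in enumerate(runs) if j != i
        )
        if not dominated:
            frontier.append((t, c))
    return frontier
```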
Optimization wins — baseline vs optimized
Matched configs across both snapshots. Points below the diagonal = faster now.
HDF5 chunk fix: H5Pset_chunk(gridSize) — one whole-grid chunk instead of tile-sized chunks.
HDF5 chunk layout
Before: N×N tile-sized chunks → N metadata ops per collective write, many tiny I/O requests.
After: one whole-grid chunk → single collective write, MPI-IO aggregates perfectly into Lustre stripes.
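A back-of-envelope way to see the before/after difference (the grid and chunk sizes below are hypothetical examples, not taken from the sweep):

```python
import math

def chunks_touched(grid, chunk):
    """Number of chunks a full-grid write crosses when a square grid
    is stored in square chunks. Every chunk touched costs metadata
    traffic and a separate I/O request in a collective write."""
    per_dim = math.ceil(grid / chunk)
    return per_dim * per_dim
```

For a 4096x4096 grid, 512x512 tile-sized chunks mean 64 chunks per collective write; a single whole-grid chunk means exactly one.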
Parallel overhead — where does the time actually go?
For each run, the ideal time is T₁ / N. Anything above that is
communication + synchronization + OMP fork/join. Dark bars = compute, colored bars = overhead.
Overhead growing past 50% means you're about to stop scaling.
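The overhead share plotted here follows directly from that definition (a sketch; the helper name is illustrative):

```python
def overhead_fraction(t1, tn, n):
    """Share of a run's wall time that is not ideal compute.

    Ideal time at N cores is T1/N; everything above that is
    communication + synchronization + OMP fork/join overhead.
    Returns (T_N - T1/N) / T_N.
    """
    ideal = t1 / n
    return (tn - ideal) / tn
```

With T1 = 100 s on 32 cores, the ideal time is 3.125 s; a measured 6.25 s run is already at the 50% overhead threshold named above.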
MPI 2D
Hybrid 1D
Hybrid 2D
Decomposition visualizer
How is the grid carved up? Pick a run — see every MPI tile, every OMP thread slice, every halo.
Run:
Configuration
Halo geometry
Measured performance
Cache fit — why (super-)linear?
rank tile (color = rank id) ·
║ OMP thread row partitioning ·
halo zones (exaggerated for visibility — actual halo is 2 cells)
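The tile geometry the visualizer draws can be sketched as follows (a hypothetical helper; it assumes the grid divides evenly across a px × py Cartesian rank layout, which the actual decompositions may relax):

```python
def tile_bounds(rank, px, py, nx, ny):
    """Column/row extents of one MPI rank's tile in a px x py
    Cartesian decomposition of an nx x ny grid (even division
    assumed). Returns (x0, x1, y0, y1), half-open on the high end."""
    rx, ry = rank % px, rank // px    # rank's position in the grid of tiles
    w, h = nx // px, ny // py         # tile width and height in cells
    return (rx * w, (rx + 1) * w, ry * h, (ry + 1) * h)
```

Each tile then exchanges a 2-cell-deep halo with the tiles adjacent to its edges.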
Node topology — where ranks physically live
Barbora: Intel Xeon Gold 6240 (Cascade Lake) · 2 sockets × 18 cores = 36 cores/node · 2 NUMA domains · 24.75 MB L3/socket · 190 GB RAM.
Your scripts use --ntasks-per-node=32 (hybrid: 2/4 tasks × 16/8 OMP) — 4 cores per node stay idle, shown as dashed slots below.
Hover a rank to reveal its halo neighbors:
same socket (shared L3 cache — fastest)
same node, different socket (NUMA hop)
cross-node (InfiniBand, slowest)
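The three latency classes can be derived from core ids alone (a sketch assuming linear pinning with one rank per core; real pinning under --ntasks-per-node=32 leaves the 4 idle cores as gaps):

```python
CORES_PER_SOCKET = 18   # Barbora: 2 x 18-core Xeon Gold 6240 per node
CORES_PER_NODE = 36

def link_class(core_a, core_b):
    """Classify the communication path between two ranks pinned to
    global core ids, assuming hypothetical linear pinning."""
    node_a, node_b = core_a // CORES_PER_NODE, core_b // CORES_PER_NODE
    if node_a != node_b:
        return "cross-node"          # InfiniBand, slowest
    sock_a = (core_a % CORES_PER_NODE) // CORES_PER_SOCKET
    sock_b = (core_b % CORES_PER_NODE) // CORES_PER_SOCKET
    # same socket shares L3; same node across sockets pays a NUMA hop
    return "same-socket" if sock_a == sock_b else "numa-hop"
```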
Heatmaps
Grid × Cores matrices per implementation — hotspots at a glance
MPI 2D
Hybrid 1D
Hybrid 2D
Hybrid 2D vs Hybrid 1D gap
Ratio > 1 means Hybrid 2D is slower at that config
Best config per grid size
Fastest (no-IO) configuration discovered in the sweep
All runs
Click a header to sort · respects filters above
Reference plots — matplotlib output
Generated by scripts/generate_plots.py · log-log scales · course-style reference. Click any image to enlarge.
MPI 2D
Hybrid 1D
Hybrid 2D
These are the upstream course-provided plots. The interactive Plotly views above are richer — filters,
hover data, dataset switcher, cache-fit overlay.