Filters
All charts below react to these selections
Dataset
Grid size
Implementation
Communication
I/O mode
of runs in the dataset match the filters
Scaling Analysis
Per-run metrics against core count · filter-driven
Throughput (Mcells/s)
Cell-updates per second — hardware utilization
P2P vs RMA parity
Matched pairs · points on the diagonal = identical
I/O overhead
Iteration time breakdown: no-IO · seq IO · parallel IO
Implementation comparison
Same cores, same grid — which decomposition wins?
Amdahl fit — serial fraction ƒ
Fits S(N) = 1 / (ƒ + (1−ƒ)/N) per series. Lower ƒ means a higher scaling ceiling (max speedup = 1/ƒ). Super-linear runs yield ƒ < 0: a cache effect, not something the model can fit.
Fitted ƒ per series
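As a sketch of the fit (the function name and the point-wise averaging are illustrative assumptions, not the dashboard's actual fitter; it assumes each series contains a single-core baseline run):

```python
import numpy as np

def fit_serial_fraction(cores, times):
    """Estimate the Amdahl serial fraction f for one series.

    Model: S(N) = 1 / (f + (1 - f)/N).
    Rearranged per measured point: f = (N/S - 1) / (N - 1),
    then averaged over all N > 1. Super-linear points give f < 0.
    """
    cores = np.asarray(cores, dtype=float)
    times = np.asarray(times, dtype=float)
    t1 = times[cores == 1][0]            # single-core baseline time
    s = t1 / times                       # measured speedups
    mask = cores > 1                     # N = 1 carries no information
    f_pts = (cores[mask] / s[mask] - 1) / (cores[mask] - 1)
    return float(f_pts.mean())
```

A run that beats ideal speedup (S > N) pushes the estimate below zero, which is exactly the "ƒ < 0" case flagged above.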
Pareto frontier — time vs cost
Each dot is a run · X = wall time (s) · Y = CPU-cost (cores × seconds).
Gold stars mark Pareto-optimal configs — no other run is both cheaper AND faster.
The frontier is your efficient menu.
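The frontier test itself is a simple dominance check (a minimal sketch; the dashboard's own star-marking logic may differ):

```python
def pareto_optimal(runs):
    """Return the runs not dominated in (wall_time, cpu_cost).

    A run is Pareto-optimal when no other run is at least as good on
    both axes and strictly better on one, i.e. no other run is both
    cheaper AND faster.
    runs: list of (wall_time_s, cpu_cost_core_s) tuples.
    """
    frontier = []
    for i, (t, c) in enumerate(runs):
        dominated = any(
            t2 <= t and c2 <= c and (t2 < t or c2 < c)
            for j, (t2, c2) in enumerate(runs) if j != i
        )
        if not dominated:
            frontier.append((t, c))
    return frontier
```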
Optimization wins — baseline vs optimized
Matched configs across both snapshots. Points below the diagonal = faster now.
HDF5 chunk fix: H5Pset_chunk(gridSize) — one whole-grid chunk instead of tile-sized chunks.
HDF5 chunk layout
Before: N×N tile-sized chunks → N metadata ops per collective write, many tiny I/O requests.
After: one whole-grid chunk → single collective write, MPI-IO aggregates perfectly into Lustre stripes.
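A back-of-envelope way to see the before/after difference (the grid and chunk sizes below are hypothetical examples, not taken from the sweep):

```python
import math

def chunks_touched(grid, chunk):
    """Number of chunks a full-grid write crosses when a square grid
    is stored in square chunks. Every chunk touched costs metadata
    traffic and a separate I/O request in a collective write."""
    per_dim = math.ceil(grid / chunk)
    return per_dim * per_dim
```

For a 4096x4096 grid, 512x512 tile-sized chunks mean 64 chunks per collective write; a single whole-grid chunk means exactly one.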
Parallel overhead — where does the time actually go?
For each run, the ideal time is T₁ / N. Anything above that is
communication + synchronization + OMP fork/join. Dark bars = compute, colored bars = overhead.
Overhead growing past 50% means you're about to stop scaling.
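The overhead share plotted here follows directly from that definition (a sketch; the helper name is illustrative):

```python
def overhead_fraction(t1, tn, n):
    """Share of a run's wall time that is not ideal compute.

    Ideal time at N cores is T1/N; everything above that is
    communication + synchronization + OMP fork/join overhead.
    Returns (T_N - T1/N) / T_N.
    """
    ideal = t1 / n
    return (tn - ideal) / tn
```

With T1 = 100 s on 32 cores, the ideal time is 3.125 s; a measured 6.25 s run is already at the 50% overhead threshold named above.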
MPI 2D
Hybrid 1D
Hybrid 2D
Decomposition visualizer
How is the grid carved up? Pick a run — see every MPI tile, every OMP thread slice, every halo.
Run:
Configuration
Halo geometry
Measured performance
Cache fit — why (super-)linear?
rank tile (color = rank id) ·
║ OMP thread row partitioning ·
halo zones (exaggerated for visibility — actual halo is 2 cells)
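The tile geometry the visualizer draws can be sketched as follows (a hypothetical helper; it assumes the grid divides evenly across a px × py Cartesian rank layout, which the actual decompositions may relax):

```python
def tile_bounds(rank, px, py, nx, ny):
    """Column/row extents of one MPI rank's tile in a px x py
    Cartesian decomposition of an nx x ny grid (even division
    assumed). Returns (x0, x1, y0, y1), half-open on the high end."""
    rx, ry = rank % px, rank // px    # rank's position in the grid of tiles
    w, h = nx // px, ny // py         # tile width and height in cells
    return (rx * w, (rx + 1) * w, ry * h, (ry + 1) * h)
```

Each tile then exchanges a 2-cell-deep halo with the tiles adjacent to its edges.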
Node topology — where ranks physically live
Barbora: Intel Xeon Gold 6240 (Cascade Lake) · 2 sockets × 18 cores = 36 cores/node · 2 NUMA domains · 24.75 MB L3/socket · 190 GB RAM.
Your scripts use --ntasks-per-node=32 (hybrid: 2/4 tasks × 16/8 OMP) — 4 cores per node stay idle, shown as dashed slots below.
Hover a rank to reveal its halo neighbors:
same socket (shared L3 cache — fastest)
same node, different socket (NUMA hop)
cross-node (InfiniBand, slowest)
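The three latency classes can be derived from core ids alone (a sketch assuming linear pinning with one rank per core; real pinning under --ntasks-per-node=32 leaves the 4 idle cores as gaps):

```python
CORES_PER_SOCKET = 18   # Barbora: 2 x 18-core Xeon Gold 6240 per node
CORES_PER_NODE = 36

def link_class(core_a, core_b):
    """Classify the communication path between two ranks pinned to
    global core ids, assuming hypothetical linear pinning."""
    node_a, node_b = core_a // CORES_PER_NODE, core_b // CORES_PER_NODE
    if node_a != node_b:
        return "cross-node"          # InfiniBand, slowest
    sock_a = (core_a % CORES_PER_NODE) // CORES_PER_SOCKET
    sock_b = (core_b % CORES_PER_NODE) // CORES_PER_SOCKET
    # same socket shares L3; same node across sockets pays a NUMA hop
    return "same-socket" if sock_a == sock_b else "numa-hop"
```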
Heatmaps
Grid × Cores matrices per implementation — hotspots at a glance
MPI 2D
Hybrid 1D
Hybrid 2D
Hybrid 2D vs Hybrid 1D gap
Ratio > 1 means Hybrid 2D is slower at that config
Best config per grid size
Fastest (no-IO) configuration discovered in the sweep
All runs
Click a header to sort · respects filters above
Reference plots — matplotlib output
Generated by scripts/generate_plots.py · log-log scales · course-style reference. Click any image to enlarge.
MPI 2D
Hybrid 1D
Hybrid 2D
These are the upstream course-provided plots. The interactive Plotly views above are richer — filters,
hover data, dataset switcher, cache-fit overlay.