H5CPP v1.14.0: Modern C++ templates for HDF5 serial and parallel I/O
Parallel HDF5 I/O via MPI-IO — collective and independent transfer, the FAPL setup, and the often-missed fact that you don't need a parallel filesystem to use it.
A single-threaded H5Dwrite from one process can sustain roughly 1–2 GB/s on modern NVMe. That's fast, until you have hundreds of ranks producing data and a single coordinator becomes the bottleneck that wastes the rest of the cluster. There are two standard ways to scale out:
| Pattern | Setup | Trade-off |
|---|---|---|
| File-per-rank | Each rank writes its own .h5 | No coordination cost. Many files (post-process to merge). |
| Collective MPI | All ranks write to ONE shared file | Coordinated single output. Needs MPI-IO; a parallel FS matters only for performance at scale. |
HDF5's MPI integration handles the second case: every rank sees the same h5::fd_t, they coordinate behind the scenes through MPI-IO, and the result is one HDF5 container regardless of how many ranks wrote to it.
What's happening:
- … rank.
- h5::create for the file and h5::create<double> for the dataset are collective; all ranks must call them or HDF5 deadlocks.
- Each rank selects its own slab with offset and count, and the underlying H5Dwrite coordinates writes without forcing inter-rank synchronisation per element.
- When h5::fd_t destructs, it participates in the collective H5Fclose.

The default transfer mode is independent: each rank writes its slab whenever it's ready, with no inter-rank coordination. Switch to collective mode by composing a tuned dxpl_t.
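The per-rank offset and count are usually derived from the rank index. The following is a minimal sketch of one common choice, a contiguous block split of a 1-D dataset. The `Slab` struct and `slab_for_rank` helper are hypothetical names used for illustration; they are not part of H5CPP or HDF5.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type (not part of H5CPP or HDF5): the contiguous
// block of a 1-D dataset that a single rank owns.
struct Slab {
    std::size_t offset;  // index of the first element this rank writes
    std::size_t count;   // number of elements this rank writes
};

// Split `total` elements across `nranks` ranks as evenly as possible.
// The first (total % nranks) ranks take one extra element each, so the
// slabs are disjoint and together cover [0, total).
inline Slab slab_for_rank(std::size_t total, std::size_t rank, std::size_t nranks) {
    assert(nranks > 0 && rank < nranks);
    const std::size_t base   = total / nranks;
    const std::size_t extra  = total % nranks;
    const std::size_t count  = base + (rank < extra ? 1 : 0);
    const std::size_t offset = rank * base + (rank < extra ? rank : extra);
    return Slab{offset, count};
}
```

With 10 elements over 4 ranks, this gives ranks 0 and 1 three elements each and ranks 2 and 3 two elements each. Because the slabs never overlap, every rank can write its region without knowing what the others are doing.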
| Mode | Coordination | When it wins |
|---|---|---|
| Independent | None per write | Sparse writes; ranks write to disjoint regions; small batches |
| Collective | Barrier per write | Dense writes covering the full extent; large slabs; aligned chunks |
Collective writes can be 2-10x faster on dense workloads because the MPI-IO layer aggregates per-rank writes into a single large I/O request. They're slower than independent on sparse / unbalanced workloads because the barrier waits for the slowest rank.
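The cost of waiting for the slowest rank can be made concrete with a deliberately simplified model: if a collective write behaves like a barrier, each rank idles from the moment it is ready until the slowest rank arrives. The sketch below only totals that idle time; it ignores the aggregation benefit described above, and `idle_time_at_collective` is a hypothetical name rather than an H5CPP or MPI function.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified model (an assumption, not a measurement): each entry is the
// time in seconds at which one rank is ready to write. Under a barrier,
// every rank waits until the slowest rank is ready. Returns the summed
// waiting time across all ranks.
inline double idle_time_at_collective(const std::vector<double>& ready_s) {
    assert(!ready_s.empty());
    const double slowest = *std::max_element(ready_s.begin(), ready_s.end());
    double idle = 0.0;
    for (double t : ready_s) {
        idle += slowest - t;  // time this rank spends waiting
    }
    return idle;
}
```

A balanced workload yields zero idle time, while one straggler makes every other rank pay for it. That is the case where independent mode tends to come out ahead.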
You can run MPI HDF5 on a single laptop, with one rank, on a POSIX filesystem. You can also run it on multiple ranks against any filesystem the MPI implementation supports: Lustre, GPFS, BeeGFS, plain NFS, or even a local /tmp shared via symlinks.
What MPI HDF5 actually needs:
| Requirement | Required? | Notes |
|---|---|---|
| HDF5 built with MPI support (--enable-parallel) | ✔ yes | Most distro packages ship a separate libhdf5-mpi-dev |
| An MPI implementation linked into the binary | ✔ yes | OpenMPI / MPICH / Intel MPI / etc. |
| Multiple physical machines | ✘ no | Single-host, single-rank is a fine smoke test |
| Parallel filesystem (Lustre / GPFS / BeeGFS) | ✘ no | The MPI-IO driver works against any filesystem the MPI implementation can access |
| Network with RDMA / IB | ✘ no | Performance benefits at scale, but optional |
Why it works on a regular filesystem: the MPI-IO layer in your MPI runtime serialises writes through the operating system the same way POSIX pwrite does. You don't get the bandwidth amplification a true parallel FS provides, but the semantics are correct — multiple ranks writing disjoint slabs of the same file land where they should.
What this means in practice:
- Run mpirun -n 4 ./my_mpi_test against a tmpfs filesystem in the CI runner. No HPC cluster is needed to validate the code paths.
- Smoke-test with mpirun -n 1. Behaviour is identical to multi-rank apart from the lack of contention.

The per-rank performance ceiling on a non-parallel FS is roughly the filesystem's single-writer bandwidth divided by the number of ranks hitting it. That's bad for production HPC; fine for testing, prototyping, and small-cluster work.
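As a rough worked example of that ceiling, write $B_{\text{fs}}$ for the filesystem's single-writer bandwidth and $N$ for the number of ranks writing concurrently (both symbols are introduced here for illustration). Assuming the bandwidth is shared evenly:

$$
B_{\text{rank}} \approx \frac{B_{\text{fs}}}{N},
\qquad \text{e.g.} \quad
B_{\text{fs}} = 2\ \text{GB/s},\ N = 4
\;\Rightarrow\;
B_{\text{rank}} \approx 0.5\ \text{GB/s}.
$$

The 2 GB/s figure is the upper end of the NVMe estimate quoted earlier. The aggregate stays near $B_{\text{fs}}$ however many ranks are added, which is why the approach suits testing rather than production scale.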
Simpler alternative when ranks don't actually need to share a file:
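A minimal sketch of the naming step in the file-per-rank pattern, assuming each rank already knows its own index (for example from `MPI_Comm_rank`). The `rank_file_name` helper and the zero-padded naming scheme are hypothetical choices made for illustration, not something H5CPP prescribes.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: build a distinct, sortable file name for each rank,
// e.g. stem "out" and rank 3 give "out.0003.h5".
inline std::string rank_file_name(const std::string& stem, int rank) {
    assert(rank >= 0);
    char suffix[32];
    // Zero-pad to four digits so a directory listing sorts in rank order;
    // larger rank numbers simply use more digits.
    std::snprintf(suffix, sizeof suffix, ".%04d.h5", rank);
    return stem + suffix;
}
```

Each rank would then create and write its own file through the ordinary serial path, with no MPI-IO file access property involved.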
Zero coordination cost, no MPI-IO required, simpler. Post-process to merge the per-rank files if you need a single artifact.
MPI parallelism scales across ranks; the async-mode machinery (h5::async::*) scales within a rank by overlapping HDF5 I/O with compute. Compose them — an MPI rank can run an async write while working on the next batch:
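The overlap itself can be sketched without any HDF5 calls. The example below uses `std::async` from the standard library as a stand-in for the h5::async::* machinery, whose API is not shown here. `write_batch`, `compute_batch`, and `run_pipeline` are hypothetical placeholders, and the "write" only counts elements.

```cpp
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Placeholder for an HDF5 write: returns the number of elements "written".
inline std::size_t write_batch(const std::vector<double>& batch) {
    return batch.size();
}

// Placeholder for the compute step that produces the next batch.
inline std::vector<double> compute_batch(std::size_t index, std::size_t n) {
    return std::vector<double>(n, static_cast<double>(index));
}

// Pipeline with one write in flight: while batch i-1 is being written on a
// background thread, batch i is computed on the calling thread.
inline std::size_t run_pipeline(std::size_t batches, std::size_t n) {
    std::size_t written = 0;
    std::future<std::size_t> pending;  // the write currently in flight, if any
    for (std::size_t i = 0; i < batches; ++i) {
        std::vector<double> batch = compute_batch(i, n);  // overlaps with the pending write
        if (pending.valid()) {
            written += pending.get();  // wait for the previous write to finish
        }
        // Move the batch into the task so it stays alive until the write is done.
        pending = std::async(std::launch::async,
                             [b = std::move(batch)] { return write_batch(b); });
    }
    if (pending.valid()) {
        written += pending.get();  // drain the final write
    }
    return written;
}
```

Keeping at most one write in flight bounds memory to two batches per rank. On some toolchains `std::async` needs the program to be linked with thread support (for example `-pthread`).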
See Async-mode handles for the type-level machinery; the h5cpp Multithreaded Filter Pipeline — Current State (v1.12.7) report covers the FAPL-scoped worker pool and the async-mode dispatch strategy.