Performance
Environment
CPU: 11th Gen Intel Core i7-11700K @ 3.60 GHz · OS: Linux 5.15 x86_64
Build: -O3 -march=native, C++23
Baseline: boost::decimal = 100 %; numbers above 100 % are faster than boost.
CPU frequency scaling was active (powersave governor + turbo); figures are indicative, not lab-grade.
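The relative column in the tables that follow can be reproduced from the ns/op column. The sketch below shows the arithmetic, assuming the percentage is simply baseline time divided by candidate time; the function names are illustrative and not part of libdecimal or nanobench. Because the ns/op values are rounded, a recomputed percentage may differ from the printed one in the last digit.

```cpp
#include <cassert>

// Relative speed as shown in the tables: baseline time over candidate time, in percent.
// Values above 100 mean the candidate needs less time per op than the baseline.
constexpr double relative_percent(double baseline_ns, double candidate_ns) {
    return 100.0 * baseline_ns / candidate_ns;
}

// The "N x faster" figures in the prose are the same ratio without the percent scaling.
constexpr double speedup(double baseline_ns, double candidate_ns) {
    return baseline_ns / candidate_ns;
}

// Worked check against the 32-bit construct table:
// boost decimal32 takes 3 934 ns/op and float optimised takes 772 ns/op.
inline void check_float_optimised_row() {
    const double r = relative_percent(3934.0, 772.0);
    assert(r > 509.0 && r < 510.0);  // the table reports 509 %
}
```

A value below 100 % therefore means the candidate is slower than boost; for example 13.7 % corresponds to roughly 7.3× slower.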
Construct from (significand, exponent)
32-bit types

| relative | ns/op | type |
|---|---|---|
| 100.0 % | 3 934 | boost decimal32 — baseline |
| 76.5 % | 5 142 | float naive |
| 509 % | 772 | float optimised |
| 454 % | 866 | intel bid32 |
| 326 % | 1 208 | scaled uint32 |
| 75.2 % | 5 228 | bcd uint32 |
| 480 % | 820 | fixed32 checked |
| 484 % | 814 | fixed32 unchecked |

64-bit types

| relative | ns/op | type |
|---|---|---|
| 100.0 % | 981 | boost decimal64 — baseline |
| 8.9 % | 11 047 | double naive |
| 130 % | 757 | double optimised |
| 106 % | 926 | intel bid64 |
| 89.4 % | 1 096 | scaled uint64 |
| 13.7 % | 7 132 | bcd uint64 |
| 137 % | 718 | fixed64 checked |
| 158 % | 621 | fixed64 unchecked |
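Fixed-exponent types are at or near the top at both widths: fixed32 trails only float optimised at 32-bit, and fixed64 unchecked is the fastest 64-bit option. BCD construction is 1.3× slower than boost at 32-bit and 7.3× slower at 64-bit; naive double construction is the slowest path overall, at about 11× slower than boost.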
Decompose to (significand, exponent)
32-bit types

| relative | ns/op | type |
|---|---|---|
| 100.0 % | 61 368 | boost decimal32 frexp10 — baseline |
| 671 % | 9 152 | float utils::decompose |
| 855 % | 7 181 | intel bid32 bid::decompose |
| 1781 % | 3 446 | scaled uint32 as_pair |

64-bit types

| relative | ns/op | type |
|---|---|---|
| 100.0 % | 24 250 | boost decimal64 frexp10 — baseline |
| 265 % | 9 147 | double utils::decompose |
| 361 % | 6 727 | intel bid64 bid::decompose |
| 610 % | 3 974 | scaled int64 as_pair |
boost::decimal::frexp10 is the most expensive decompose path. scaled::as_pair() is a direct field read: 18× faster than boost at 32-bit and 6× faster at 64-bit. Use libdecimal types for any path that must inspect the representation (serialization, wire encoding, logging).
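To make "direct field read" concrete, here is a minimal sketch of a type that stores its significand and exponent as plain members. The struct and its layout are assumptions for illustration only; the real scaled type in libdecimal may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical stand-in for a scaled decimal: value = significand * 10^exponent.
struct scaled_sketch {
    std::int64_t significand;
    std::int32_t exponent;

    // Decomposing is two member loads: no digit counting, no division.
    constexpr std::pair<std::int64_t, std::int32_t> as_pair() const noexcept {
        return {significand, exponent};
    }
};

// Usage: 12.3456 stored as 123456 * 10^-4.
inline void as_pair_example() {
    constexpr scaled_sketch price{123456, -4};
    const auto parts = price.as_pair();
    assert(parts.first == 123456);
    assert(parts.second == -4);
}
```

A type that packs both fields into an encoded word has to unpack them first, which is one plausible reason the encoded formats are slower to decompose.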
Compare
| relative | ns/op | type |
|---|---|---|
| 100.0 % | 60 017 | boost decimal64 — baseline |
| 385 % | 15 611 | double |
| 119 % | 50 563 | intel bid64 |
| 276 % | 21 771 | fixed64 -4 |
| 294 % | 20 431 | scaled int64 |

Sort

| relative | ns/op | type |
|---|---|---|
| 100.0 % | 56 215 | boost decimal64 — baseline |
| 1240 % | 4 535 | double |
| 111 % | 50 457 | intel bid64 |
| 454 % | 12 370 | fixed64 -4 |
| 237 % | 23 721 | scaled int64 |
Boost comparison is the slowest decimal option in both tables, although intel BID64 is only 1.1–1.2× faster. fixed64 -4 compares 2.8× faster than boost and sorts 4.5× faster.
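One plausible reason a fixed-exponent type compares cheaply is that both operands always share the same exponent, so ordering reduces to an integer comparison. A runtime-exponent type has to align exponents first. The sketch below illustrates that difference with hypothetical types; it is not the libdecimal implementation, and the alignment loop does not guard against overflow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical fixed-exponent value: the exponent is part of the type.
template <int Exp>
struct fixed_cmp {
    std::int64_t significand;
};

// Same exponent on both sides: one integer comparison.
template <int Exp>
constexpr bool less(fixed_cmp<Exp> a, fixed_cmp<Exp> b) noexcept {
    return a.significand < b.significand;
}

// Hypothetical runtime-exponent value.
struct scaled_cmp {
    std::int64_t significand;
    int exponent;
};

// Exponents may differ, so scale the operand with the larger exponent down
// to the smaller one before comparing (overflow unchecked in this sketch).
constexpr bool less(scaled_cmp a, scaled_cmp b) noexcept {
    while (a.exponent > b.exponent) { a.significand *= 10; --a.exponent; }
    while (b.exponent > a.exponent) { b.significand *= 10; --b.exponent; }
    return a.significand < b.significand;
}

inline void compare_example() {
    assert(less(fixed_cmp<-4>{10000}, fixed_cmp<-4>{10001}));  // 1.0000 < 1.0001
    assert(less(scaled_cmp{15, -1}, scaled_cmp{2, 0}));        // 1.5 < 2
    assert(!less(scaled_cmp{2, 0}, scaled_cmp{15, -1}));       // 2 is not < 1.5
}
```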
Arithmetic
| relative | ns/op | type |
|---|---|---|
| 100.0 % | 190 245 | boost decimal64 — baseline |
| 2378 % | 8 002 | double |
| 373 % | 51 030 | intel bid64 |
| 14.9 % | 1 278 906 | bcd64 |
| 1886 % | 10 090 | fixed64 -4 |
| 632 % | 30 082 | scaled int64 |

| relative | ns/op | type |
|---|---|---|
| 100.0 % | 5 457 | boost decimal64 — baseline |
| 3439 % | 159 | double |
| 264 % | 2 070 | intel bid64 |
| 4.6 % | 118 188 | bcd64 |
| 654 % | 834 | fixed64 -4 |
| 530 % | 1 030 | scaled int64 |
Boost is the slowest arithmetic path apart from BCD. fixed64 -4 is 19× faster than boost for accumulation. BCD should not be used for arithmetic: it is roughly 7× to 22× slower than boost.
String parse and format
Parse

| relative | ns/op | type |
|---|---|---|
| 100.0 % | 44 106 | boost decimal64 — baseline |
| 315 % | 14 008 | intel bid64 |
| 21.5 % | 205 445 | bcd64 (digit-string + exp) |
| 94.7 % | 46 559 | stod (double reference) |

Format

| relative | ns/op | type |
|---|---|---|
| 100.0 % | 131 586 | boost decimal64 — baseline |
| 293 % | 44 877 | intel bid64 (std::format) |
| 238 % | 55 215 | bcd64 (.str()) |
| 827 % | 15 919 | scaled int64 (cast to string) |
| 66.3 % | 198 355 | double (to_string) |
Intel BID64 parses 3× faster than boost. For formatting, scaled's cast-to-string is 8× faster than boost's ostringstream <<.
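Formatting a significand/exponent pair needs no floating-point conversion: print the integer and place the decimal point. The helper below is a hypothetical sketch of that idea, not the conversion libdecimal ships, and it always prints plain notation without an exponent suffix.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Format significand * 10^exponent as plain decimal text (hypothetical helper).
inline std::string to_decimal_string(std::int64_t significand, int exponent) {
    const bool negative = significand < 0;
    // Take the magnitude in unsigned arithmetic so INT64_MIN does not overflow.
    const std::uint64_t magnitude =
        negative ? 0 - static_cast<std::uint64_t>(significand)
                 : static_cast<std::uint64_t>(significand);
    std::string digits = std::to_string(magnitude);

    if (exponent > 0) {
        // Positive exponent: append trailing zeros.
        digits.append(static_cast<std::size_t>(exponent), '0');
    } else if (exponent < 0) {
        const std::size_t frac = static_cast<std::size_t>(-exponent);
        // Left-pad so at least one digit stays in front of the decimal point.
        if (digits.size() <= frac) {
            digits = std::string(frac - digits.size() + 1, '0') + digits;
        }
        const std::size_t point = digits.size() - frac;
        digits.insert(point, ".");
    }
    return negative ? "-" + digits : digits;
}

inline void format_example() {
    assert(to_decimal_string(123456, -4) == "12.3456");
    assert(to_decimal_string(5, -4) == "0.0005");
    assert(to_decimal_string(-25, -2) == "-0.25");
}
```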
Fee calculation — notional × rate
| relative | ns/op | type |
|---|---|---|
| 100.0 % | 4 607 | boost decimal64 — baseline |
| 569 % | 810 | double |
| 124 % | 3 730 | intel bid64 |
| 176 % | 2 617 | fixed64 -4 |
| 273 % | 1 688 | scaled int64 |
Boost is comparatively competitive on multiply: intel is only 1.24× faster. Scaled is 2.7× faster.
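The notional × rate operation can be sketched for both representations. With a runtime exponent the significands multiply and the exponents add; with a fixed exponent the product must be rescaled back to the common exponent. The types and function names below are hypothetical, neither function checks for overflow, and the fixed variant truncates.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical runtime-exponent value: significand * 10^exponent.
struct scaled_fee {
    std::int64_t significand;
    int exponent;
};

// (a * 10^ea) * (b * 10^eb) = (a * b) * 10^(ea + eb). No rescale is needed.
constexpr scaled_fee multiply(scaled_fee a, scaled_fee b) noexcept {
    return {a.significand * b.significand, a.exponent + b.exponent};
}

// Hypothetical fixed-exponent multiply at exponent -4: the raw product has
// exponent -8, so divide by 10^4 to return to exponent -4 (truncating).
constexpr std::int64_t multiply_fixed4(std::int64_t a, std::int64_t b) noexcept {
    return (a * b) / 10000;
}

inline void fee_example() {
    // Notional 1 000 000.00 at a rate of 0.00025 (2.5 basis points) gives 250.
    constexpr scaled_fee fee = multiply({100000000, -2}, {25, -5});
    static_assert(fee.significand == 2500000000LL && fee.exponent == -7,
                  "2500000000 * 10^-7 == 250");

    // Notional 1000.0000 at a rate of 0.0025, both at exponent -4, gives 2.5000.
    assert(multiply_fixed4(10000000, 25) == 25000);
}
```

The fixed-exponent intermediate product carries twice the fractional digits before the rescale, so it reaches the limits of a 64-bit integer sooner than the final result would suggest.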
Wire encode / decode
Note: there is no boost type in this group, so the baseline is scaled int64 (the first runner).
Encode

| relative | ns/op | type |
|---|---|---|
| 100.0 % | 4 667 | scaled int64 (field access) — baseline |
| 116 % | 4 024 | fixed64 -4 (field + constexpr exp) |
| 64.3 % | 7 264 | intel bid64 (decompose) |

Decode

| relative | ns/op | type |
|---|---|---|
| 100.0 % | 14 047 | scaled int64 (direct ctor) — baseline |
| 310 % | 4 531 | fixed64 -4 (direct ctor) |
| 172 % | 8 168 | intel bid64 (encode from pair) |
scaled encode is a field read and near-free, but its constructor calls normalize() unconditionally, which makes decode 3× slower than fixed64. Prefer fixed64 for round-trip wire paths.
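The decode cost difference can be illustrated with two hypothetical types. The sketch assumes normalize() strips trailing decimal zeros from the significand, which is a guess at its behaviour rather than something stated here; the real libdecimal routine may do more or less work.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical runtime-exponent type whose constructor always normalizes.
struct scaled_wire {
    std::int64_t significand;
    int exponent;

    constexpr scaled_wire(std::int64_t s, int e) noexcept
        : significand(s), exponent(e) {
        normalize();
    }

private:
    // Assumed behaviour: strip trailing decimal zeros, one division per zero.
    constexpr void normalize() noexcept {
        while (significand != 0 && significand % 10 == 0) {
            significand /= 10;
            ++exponent;
        }
    }
};

// Hypothetical fixed-exponent type: decoding stores the significand and
// nothing else, because the exponent is a compile-time constant.
template <int Exp>
struct fixed_wire {
    std::int64_t significand;
    static constexpr int exponent = Exp;
};

inline void decode_example() {
    // 1.250000 arrives as (1250000, -6); the constructor reduces it to (125, -2).
    constexpr scaled_wire s(1250000, -6);
    static_assert(s.significand == 125 && s.exponent == -2, "normalized on construction");

    // 1.2500 arrives as a bare significand; nothing is recomputed.
    constexpr fixed_wire<-4> f{12500};
    static_assert(f.significand == 12500, "stored as received");
    assert(fixed_wire<-4>::exponent == -4);
}
```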
Risk limit — accumulate + clamp (10 000 values)
| relative | ns/op | type |
|---|---|---|
| 100.0 % | 219 475 | boost decimal64 — baseline |
| 1370 % | 16 017 | double |
| 279 % | 78 644 | intel bid64 |
| 612 % | 35 873 | fixed64 -4 |
| 227 % | 96 909 | scaled int64 |
Under realistic mixed-op pressure (add + branch + assign per step), boost is still the slowest decimal type.
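The add + branch + assign step can be sketched for a fixed-exponent representation, where every value shares the same exponent and the loop body is pure integer work. The function below is an illustrative assumption about the shape of the workload, not the benchmark source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Running exposure with hard limits. All values are significands at a shared
// exponent (for example -4), so no per-step rescaling is required.
inline std::int64_t accumulate_clamped(const std::vector<std::int64_t>& deltas,
                                       std::int64_t lower,
                                       std::int64_t upper) {
    std::int64_t total = 0;
    for (const std::int64_t delta : deltas) {
        total += delta;                           // add
        total = std::clamp(total, lower, upper);  // branch + assign
    }
    return total;
}

inline void risk_example() {
    // Limits of +/-10.0000; steps of +5.0000, +7.0000, -3.0000.
    // 5 -> 12 (clamped to 10) -> 7, so the result is 7.0000.
    const std::vector<std::int64_t> deltas{50000, 70000, -30000};
    assert(accumulate_clamped(deltas, -100000, 100000) == 70000);
}
```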
Summary
| type | construct | decompose | compare | accumulate | multiply | parse | format | risk |
|---|---|---|---|---|---|---|---|---|
| double | 1.3× | 2.7× | 3.9× | 23.8× | 5.7× | 0.95× | 0.66× | 13.7× |
| intel bid64 | 1.1× | 3.6× | 1.2× | 3.7× | 1.2× | 3.2× | 2.9× | 2.8× |
| fixed64 -4 | 1.6× | n/a | 2.8× | 18.9× | 1.8× | n/a | n/a | 6.1× |
| scaled int64 | 0.9× | 6.1× | 2.9× | 6.3× | 2.7× | n/a | 8.3× | 2.3× |
| bcd64 | 0.1× | n/a | n/a | 0.15× | 0.05× | 0.2× | 2.4× | n/a |
Numbers are speedup relative to boost decimal (>1× = faster than boost).
Type selection
| use case | recommended type | reason |
|---|---|---|
| Arithmetic (PnL, accumulation) | fixed64 -4 | 7–19× faster; compile-time exponent eliminates runtime normalization |
| Multiply hot path (fees, notional) | scaled int64 | 2.7× faster; no overflow risk from fixed exponent |
| Compare / sort / rank | fixed64 -4 | 2.8–4.5× faster |
| Wire encode | scaled int64 | field read, near-free |
| Wire decode | fixed64 -4 | 3× faster; avoids normalize() in ctor |
| String parse | intel bid64 | 3.2× faster than boost |
| String format | scaled int64 | 8.3× faster via cast-to-string |
| Inspect internals / serialize | scaled int64 | 6–18× faster as_pair() |
| General purpose / mixed | intel bid64 | competitive across all ops, no exponent constraint |
| Avoid for arithmetic | bcd64 | roughly 7–22× slower than boost across all arithmetic ops |
Repeating the experiment
Prerequisites
```sh
# build directory must exist and be configured with benchmarks enabled
cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -Dlibdecimal_BUILD_BENCHMARKS=ON
```
Running the harness
The script:
- Rebuilds only the changed benchmark targets (ninja -j$(nproc))
- Runs all 8 benchmarks in order: construct-from-pair, decompose, compare, arithmetic, string, ops-fee, encode-decode, mixed-ops
- Writes a timestamped log to bench-results/bench-<YYYYMMDD-HHMMSS>.txt
Changing the baseline
nanobench uses the first bench.run() call in each Bench instance as the 100 % reference when .relative(true) is set. To rebaseline on a different type, move its bench.run() block to the top of the group in the corresponding bench/*.cpp file, then rerun the harness.
For construct-from-pair.cpp, which uses a static_for<tuple_t> loop, the first type in the tuple definition is the baseline:
```cpp
using import_64_t = std::tuple<
    traits_t<boost::decimal::decimal64_t>,  // <-- baseline (first = 100%)
    bench::ieee754<double, bench::kind_t::naive>,
    ...
>;
```
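Why tuple order determines the baseline can be seen from a minimal compile-time loop over tuple element types. The helper below is a hypothetical reimplementation written for illustration; the project's own static_for may have a different signature. The tag types stand in for the real traits and only carry a name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical tag types standing in for the benchmarked traits.
struct boost_tag  { static constexpr const char* name = "boost decimal64"; };
struct double_tag { static constexpr const char* name = "double naive"; };
struct fixed_tag  { static constexpr const char* name = "fixed64"; };

template <typename Tuple, typename F, std::size_t... I>
void static_for_impl(F& f, std::index_sequence<I...>) {
    // A comma fold evaluates left to right, so element 0 is visited first.
    (f(std::tuple_element_t<I, Tuple>{}), ...);
}

// Calls f once per element type of Tuple, in declaration order.
template <typename Tuple, typename F>
void static_for_sketch(F f) {
    static_for_impl<Tuple>(f, std::make_index_sequence<std::tuple_size_v<Tuple>>{});
}

// Records the visiting order; in a benchmark each visit would be one run,
// and the first run would become the 100 % reference.
inline std::vector<std::string> run_order() {
    using types = std::tuple<boost_tag, double_tag, fixed_tag>;
    std::vector<std::string> order;
    static_for_sketch<types>([&order](auto tag) {
        order.emplace_back(decltype(tag)::name);
    });
    return order;
}

inline void order_example() {
    const auto order = run_order();
    assert(order.size() == 3);
    assert(order.front() == "boost decimal64");  // first tuple element runs first
}
```

Reordering the tuple so that another type comes first would make that type the first run, and therefore the reference.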