
Rewriting a FIX Engine in C++23: What Got Simpler (and What Didn't)

DEV Community · by Alan · April 1, 2026


I've been working on a FIX protocol engine in C++23. Header-only, about 5K lines, compiled with -O2 -march=native on Clang 18. Parses an ExecutionReport in ~246 ns on my bench rig. QuickFIX does the same message in ~730 ns.

Before anyone gets excited: single core, pinned affinity, warmed cache, synthetic input. Not production traffic. The 3x gap will shrink on real messages with variable-length fields and optional tags. I know.
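For what it's worth, the timestamp primitive behind those numbers looks roughly like this (a sketch; the wrapper name is mine, not the repo's). RDTSCP waits for prior instructions to retire before reading the cycle counter, which is why it beats plain RDTSC for microbenchmarks. x86-only.

```cpp
#include <cstdint>
#include <x86intrin.h>

// Serializing timestamp read: __rdtscp also returns the contents of
// IA32_TSC_AUX (typically a core id) through its out-parameter, which
// lets you detect core migration mid-measurement.
inline uint64_t tsc_now() {
    unsigned aux;
    return __rdtscp(&aux);
}
```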

But the code that got there was more interesting to me than the final number. Most of the gains came from replacing stuff that QuickFIX had to build by hand because C++98 didn't have the tools.

The pool that disappeared

QuickFIX has a hand-rolled object pool. About 1,000 lines of allocation logic, intrusive free lists, manual cache line alignment. Made total sense when it was written. C++98 didn't give you anything better.

Now there's std::pmr::monotonic_buffer_resource. Stack buffer, pointer bump, reset between messages:

```cpp
template <size_t N>
class MonotonicPool : public std::pmr::memory_resource {
    alignas(64) std::array<std::byte, N> buffer_{};
    std::pmr::memory_resource* upstream_;
    std::pmr::monotonic_buffer_resource resource_;

public:
    MonotonicPool() noexcept
        : upstream_{std::pmr::null_memory_resource()}
        , resource_{buffer_.data(), buffer_.size(), upstream_} {}

    void reset() noexcept { resource_.release(); }

    // do_allocate/do_deallocate just forward to resource_
};
```

Call reset() after each message. P99 went from 780 ns to 56 ns. That's 14x on the tail, and it's basically just "stop hitting the allocator."
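The pattern in isolation, with illustrative names and sizes (a 4 KB stack buffer and a two-field message standing in for the engine's real pool and parser):

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Per-message arena sketch: a stack buffer fronting a
// monotonic_buffer_resource, released between messages.
inline std::size_t parse_batch(int n_messages) {
    std::byte buf[4096];
    std::pmr::monotonic_buffer_resource pool{
        buf, sizeof(buf), std::pmr::null_memory_resource()};

    std::size_t parsed = 0;
    for (int i = 0; i < n_messages; ++i) {
        {
            // Every allocation here is a pointer bump into buf.
            std::pmr::vector<int> tag_offsets{&pool};
            tag_offsets.push_back(35);
            tag_offsets.push_back(49);
            parsed += (tag_offsets.size() == 2);
        }
        pool.release();  // O(1) reset; next message reuses the same buffer
    }
    return parsed;
}
```

Note the null upstream: if a message ever outgrows the buffer, allocation throws instead of silently hitting the heap, which is exactly the behavior you want in a latency-sensitive path.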

I also use mimalloc for per-session heaps. mi_heap_new() per session, mi_heap_destroy() on disconnect. Felt wasteful at first, like I was throwing away too much memory per session. But perf stat said otherwise so I stopped arguing.

consteval tag lookup

FIX messages are key-value pairs with integer tag numbers. Tag 35 is MsgType, tag 49 is SenderCompID, tag 55 is Symbol. QuickFIX resolves these with a switch statement, fifty-something cases.

C++23 lets you build the lookup table at compile time:

```cpp
inline constexpr int MAX_COMMON_TAG = 200;

// Element type inferred from the initializers below.
struct TagEntry {
    std::string_view name;
    bool is_header;
    bool is_required;
};

consteval std::array<TagEntry, MAX_COMMON_TAG> create_tag_table() {
    std::array<TagEntry, MAX_COMMON_TAG> table{};
    for (auto& entry : table) { entry = {"", false, false}; }
    table[1]  = {TagInfo<1>::name,  TagInfo<1>::is_header,  TagInfo<1>::is_required};
    table[8]  = {TagInfo<8>::name,  TagInfo<8>::is_header,  TagInfo<8>::is_required};
    table[35] = {TagInfo<35>::name, TagInfo<35>::is_header, TagInfo<35>::is_required};
    // ~30 more entries
    return table;
}

inline constexpr auto TAG_TABLE = create_tag_table();

[[nodiscard]] inline constexpr std::string_view tag_name(int tag_num) noexcept {
    if (tag_num >= 0 && tag_num < MAX_COMMON_TAG) [[likely]] {
        return TAG_TABLE[tag_num].name;
    }
    return "";
}
```

Array index, O(1), and the fifty-way switch is gone; the only branch left is a predictable bounds check. About 300 branches eliminated across the parser.

Field offsets use the same trick. QuickFIX stores them in a std::map, so every field access is a tree traversal. Here it's `offsets_[tag]`. Took me a while to get the constexpr initialization right for nested structs, but once it compiled it was basically free.
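A minimal sketch of the flat-array idea (names and the 200-slot size are illustrative, not the engine's actual layout):

```cpp
#include <array>
#include <cstdint>

// Flat offset table standing in for QuickFIX's std::map lookup.
// -1 marks a tag that is absent from the current message.
struct FieldIndex {
    std::array<int32_t, 200> offsets_{};

    constexpr FieldIndex() { offsets_.fill(-1); }

    constexpr void set(int tag, int32_t byte_offset) { offsets_[tag] = byte_offset; }

    // O(1) array index; no tree traversal, no hashing.
    constexpr int32_t get(int tag) const { return offsets_[tag]; }
};
```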

SIMD: the scenic route

FIX uses SOH (0x01) as the field delimiter. Scanning for it byte-by-byte is fine until your messages have 40+ fields.
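For reference, the scalar baseline looks something like this (a sketch, not the engine's actual code):

```cpp
#include <cstddef>
#include <vector>

constexpr char kSOH = '\x01';  // FIX field delimiter

// Scalar baseline: one compare-and-branch per byte.
// Fine for short messages, slow once they carry 40+ fields.
inline std::vector<std::size_t> scan_soh_scalar(const char* p, std::size_t n) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] == kSOH) out.push_back(i);
    }
    return out;
}
```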

Started with raw AVX2 intrinsics. Worked. Process 32 bytes, compare against SOH, extract positions from the bitmask:

```cpp
const __m256i soh_vec = _mm256_set1_epi8(fix::SOH);

for (size_t i = 0; i < simd_end; i += 32) {
    __m256i chunk = _mm256_loadu_si256(
        reinterpret_cast<const __m256i*>(ptr + i));
    __m256i cmp   = _mm256_cmpeq_epi8(chunk, soh_vec);
    uint32_t mask = static_cast<uint32_t>(_mm256_movemask_epi8(cmp));

    while (mask != 0) {
        int bit = __builtin_ctz(mask);            // lowest set bit
        result.push(static_cast<uint16_t>(i + bit));
        mask &= mask - 1;                         // clear it
    }
}
```

Then I realized I'd need an AVX-512 path, an SSE path, and an ARM NEON path. Four copies of the same logic with different intrinsic names. Maintaining that sounded miserable.

Tried Highway (Google's portable SIMD library). Nice API, but the build dependency was heavy for a header-only project. Compile times went up noticeably. I spent a couple hours trying to make it work as a submodule before giving up.

Ended up on xsimd. Header-only, template-based, picks the instruction set at compile time:

```cpp
template <class Arch = xsimd::default_arch>
inline SohPositions scan_soh_xsimd(std::span<const char> data) noexcept {
    using batch_t = xsimd::batch<char, Arch>;
    constexpr size_t width = batch_t::size;

    const batch_t soh_vec(static_cast<char>(fix::SOH));
    // same loop as the AVX2 version, portable across architectures
}
```

Raw AVX2 was maybe 5% faster on the same hardware. I kept both paths in the repo but default to xsimd. The portability is worth 5%.

SOH scan throughput: 3.32 GB/s. Sounds impressive until you realize that's just finding delimiters. Actual parsing is slower. But it means delimiter scanning isn't the bottleneck anymore, which is the whole point.

What didn't get simpler

Session state. FIX sessions have sequence numbers, heartbeat timers, gap fill logic, reject handling. I was hoping std::expected would clean up the error propagation and... it helped a little. Like 10% less boilerplate. The complexity is in the protocol, not the language. It's a state machine with a lot of branches and I don't think any C++ standard is going to fix that.

Message type coverage. I've got 9 types (NewOrderSingle, ExecutionReport, the session-level ones). QuickFIX covers all of them. Adding a new type isn't hard, just tedious. Field definitions, validation rules, serialization. About a day per message type if you include tests. I got to nine and just... stopped. Started working on the transport layer instead because that was more interesting. Not my proudest engineering decision.

Header-only at 5K lines. Compiles in 2.8s on Clang, 4.1s on GCC. That's fine on my machine. No idea what happens on a CI runner with 2GB of RAM. I keep saying I'll add a compiled-library option. Haven't done it.

Benchmarks

```text
$ ./bench --iterations=100000 --pin-cpu=3

ExecutionReport parse:  246 ns         (QuickFIX: 730 ns)
NewOrderSingle parse:   229 ns         (QuickFIX: 661 ns)
Field access (4):        11 ns         (QuickFIX: 31 ns)
Throughput:             4.17M msg/sec  (QuickFIX: 1.19M msg/sec)
```

Single core, RDTSCP timing, 100K iterations, synthetic messages. Not captured from a real feed. The gap will narrow on production traffic with variable-length fields and optional tags. I'm pretty confident the parser is faster, just not sure by how much once you leave the lab.

Where I am with it

Not production-ready. Parser and session layer work well enough to benchmark, but nobody should route real orders through this.

The thing that kept surprising me was how much of QuickFIX's complexity was the language, not the problem. PMR replaced a thousand-line pool. consteval eliminated a fifty-case switch. And xsimd collapsed four architecture-specific codepaths into one template. These aren't exotic features either, they just didn't exist in C++98. I don't know if this thing will ever cover all the message types QuickFIX does, but the parser core feels solid enough that I keep coming back to it on weekends.

GitHub: github.com/Lattice9AI/NexusFIX

Still figuring out: whether header-only holds past 10K lines, how much the 3x gap closes on captured traffic, and which message types actually matter beyond the obvious nine. If you've worked with FIX and have opinions on any of that, I'm interested.

Part of NexusFix, an open-source FIX protocol engine in C++23.
