
Rewriting a FIX Engine in C++23: What Got Simpler (and What Didn't)

DEV Community · by Alan · April 1, 2026


I've been working on a FIX protocol engine in C++23. Header-only, about 5K lines, compiled with -O2 -march=native on Clang 18. Parses an ExecutionReport in ~246 ns on my bench rig. QuickFIX does the same message in ~730 ns.

Before anyone gets excited: single core, pinned affinity, warmed cache, synthetic input. Not production traffic. The 3x gap will shrink on real messages with variable-length fields and optional tags. I know.
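For what it's worth, the timestamp primitive behind those numbers looks roughly like this (a sketch; the wrapper name is mine, not the repo's). RDTSCP waits for prior instructions to retire before reading the cycle counter, which is why it beats plain RDTSC for microbenchmarks. x86-only.

```cpp
#include <cstdint>
#include <x86intrin.h>

// Serializing timestamp read: __rdtscp also returns the contents of
// IA32_TSC_AUX (typically a core id) through its out-parameter, which
// lets you detect core migration mid-measurement.
inline uint64_t tsc_now() {
    unsigned aux;
    return __rdtscp(&aux);
}
```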

But the code that got there was more interesting to me than the final number. Most of the gains came from replacing stuff that QuickFIX had to build by hand because C++98 didn't have the tools.

The pool that disappeared

QuickFIX has a hand-rolled object pool. About 1,000 lines of allocation logic, intrusive free lists, manual cache line alignment. Made total sense when it was written. C++98 didn't give you anything better.

Now there's std::pmr::monotonic_buffer_resource. Stack buffer, pointer bump, reset between messages:

```cpp
template <size_t N>
class MonotonicPool : public std::pmr::memory_resource {
    alignas(64) std::array<std::byte, N> buffer_{};
    std::pmr::memory_resource* upstream_;
    std::pmr::monotonic_buffer_resource resource_;

public:
    MonotonicPool() noexcept
        : upstream_{std::pmr::null_memory_resource()}
        , resource_{buffer_.data(), buffer_.size(), upstream_} {}

    void reset() noexcept { resource_.release(); }

    // do_allocate/do_deallocate just forward to resource_
};
```

Call reset() after each message. P99 went from 780 ns to 56 ns. That's 14x on the tail, and it's basically just "stop hitting the allocator."
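The pattern in isolation, with illustrative names and sizes (a 4 KB stack buffer and a two-field message standing in for the engine's real pool and parser):

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Per-message arena sketch: a stack buffer fronting a
// monotonic_buffer_resource, released between messages.
inline std::size_t parse_batch(int n_messages) {
    std::byte buf[4096];
    std::pmr::monotonic_buffer_resource pool{
        buf, sizeof(buf), std::pmr::null_memory_resource()};

    std::size_t parsed = 0;
    for (int i = 0; i < n_messages; ++i) {
        {
            // Every allocation here is a pointer bump into buf.
            std::pmr::vector<int> tag_offsets{&pool};
            tag_offsets.push_back(35);
            tag_offsets.push_back(49);
            parsed += (tag_offsets.size() == 2);
        }
        pool.release();  // O(1) reset; next message reuses the same buffer
    }
    return parsed;
}
```

Note the null upstream: if a message ever outgrows the buffer, allocation throws instead of silently hitting the heap, which is exactly the behavior you want in a latency-sensitive path.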

I also use mimalloc for per-session heaps. mi_heap_new() per session, mi_heap_destroy() on disconnect. Felt wasteful at first, like I was throwing away too much memory per session. But perf stat said otherwise so I stopped arguing.

consteval tag lookup

FIX messages are key-value pairs with integer tag numbers. Tag 35 is MsgType, tag 49 is SenderCompID, tag 55 is Symbol. QuickFIX resolves these with a switch statement, fifty-something cases.

C++23 lets you build the lookup table at compile time:

```cpp
inline constexpr int MAX_COMMON_TAG = 200;

// Element type inferred from the initializers below.
struct TagEntry {
    std::string_view name;
    bool is_header;
    bool is_required;
};

consteval std::array<TagEntry, MAX_COMMON_TAG> create_tag_table() {
    std::array<TagEntry, MAX_COMMON_TAG> table{};
    for (auto& entry : table) { entry = {"", false, false}; }
    table[1]  = {TagInfo<1>::name,  TagInfo<1>::is_header,  TagInfo<1>::is_required};
    table[8]  = {TagInfo<8>::name,  TagInfo<8>::is_header,  TagInfo<8>::is_required};
    table[35] = {TagInfo<35>::name, TagInfo<35>::is_header, TagInfo<35>::is_required};
    // ~30 more entries
    return table;
}

inline constexpr auto TAG_TABLE = create_tag_table();

[[nodiscard]] inline constexpr std::string_view tag_name(int tag_num) noexcept {
    if (tag_num >= 0 && tag_num < MAX_COMMON_TAG) [[likely]] {
        return TAG_TABLE[tag_num].name;
    }
    return "";
}
```

Array index, O(1), and the fifty-way switch is gone; the only branch left is a predictable bounds check. About 300 branches eliminated across the parser.

Field offsets use the same trick. QuickFIX stores them in a std::map, so every field access is a tree traversal. Here it's `offsets_[tag]`. Took me a while to get the constexpr initialization right for nested structs, but once it compiled it was basically free.
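A minimal sketch of the flat-array idea (names and the 200-slot size are illustrative, not the engine's actual layout):

```cpp
#include <array>
#include <cstdint>

// Flat offset table standing in for QuickFIX's std::map lookup.
// -1 marks a tag that is absent from the current message.
struct FieldIndex {
    std::array<int32_t, 200> offsets_{};

    constexpr FieldIndex() { offsets_.fill(-1); }

    constexpr void set(int tag, int32_t byte_offset) { offsets_[tag] = byte_offset; }

    // O(1) array index; no tree traversal, no hashing.
    constexpr int32_t get(int tag) const { return offsets_[tag]; }
};
```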

SIMD: the scenic route

FIX uses SOH (0x01) as the field delimiter. Scanning for it byte-by-byte is fine until your messages have 40+ fields.
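For reference, the scalar baseline looks something like this (a sketch, not the engine's actual code):

```cpp
#include <cstddef>
#include <vector>

constexpr char kSOH = '\x01';  // FIX field delimiter

// Scalar baseline: one compare-and-branch per byte.
// Fine for short messages, slow once they carry 40+ fields.
inline std::vector<std::size_t> scan_soh_scalar(const char* p, std::size_t n) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] == kSOH) out.push_back(i);
    }
    return out;
}
```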

Started with raw AVX2 intrinsics. Worked. Process 32 bytes, compare against SOH, extract positions from the bitmask:

```cpp
const __m256i soh_vec = _mm256_set1_epi8(fix::SOH);

for (size_t i = 0; i < simd_end; i += 32) {
    __m256i chunk = _mm256_loadu_si256(
        reinterpret_cast<const __m256i*>(ptr + i));
    __m256i cmp   = _mm256_cmpeq_epi8(chunk, soh_vec);
    uint32_t mask = static_cast<uint32_t>(_mm256_movemask_epi8(cmp));

    while (mask != 0) {
        int bit = __builtin_ctz(mask);            // lowest set bit
        result.push(static_cast<uint16_t>(i + bit));
        mask &= mask - 1;                         // clear it
    }
}
```

Then I realized I'd need an AVX-512 path, an SSE path, and an ARM NEON path. Four copies of the same logic with different intrinsic names. Maintaining that sounded miserable.

Tried Highway (Google's portable SIMD library). Nice API, but the build dependency was heavy for a header-only project. Compile times went up noticeably. I spent a couple hours trying to make it work as a submodule before giving up.

Ended up on xsimd. Header-only, template-based, picks the instruction set at compile time:

```cpp
template <class Arch = xsimd::default_arch>
inline SohPositions scan_soh_xsimd(std::span<const char> data) noexcept {
    using batch_t = xsimd::batch<char, Arch>;
    constexpr size_t width = batch_t::size;

    const batch_t soh_vec(static_cast<char>(fix::SOH));
    // same loop as the AVX2 version, portable across architectures
}
```

Raw AVX2 was maybe 5% faster on the same hardware. I kept both paths in the repo but default to xsimd. The portability is worth 5%.

SOH scan throughput: 3.32 GB/s. Sounds impressive until you realize that's just finding delimiters. Actual parsing is slower. But it means delimiter scanning isn't the bottleneck anymore, which is the whole point.

What didn't get simpler

Session state. FIX sessions have sequence numbers, heartbeat timers, gap fill logic, reject handling. I was hoping std::expected would clean up the error propagation and... it helped a little. Like 10% less boilerplate. The complexity is in the protocol, not the language. It's a state machine with a lot of branches and I don't think any C++ standard is going to fix that.

Message type coverage. I've got 9 types (NewOrderSingle, ExecutionReport, the session-level ones). QuickFIX covers all of them. Adding a new type isn't hard, just tedious. Field definitions, validation rules, serialization. About a day per message type if you include tests. I got to nine and just... stopped. Started working on the transport layer instead because that was more interesting. Not my proudest engineering decision.

Header-only at 5K lines. Compiles in 2.8s on Clang, 4.1s on GCC. That's fine on my machine. No idea what happens on a CI runner with 2GB of RAM. I keep saying I'll add a compiled-library option. Haven't done it.

Benchmarks

```text
$ ./bench --iterations=100000 --pin-cpu=3

ExecutionReport parse:  246 ns         (QuickFIX: 730 ns)
NewOrderSingle parse:   229 ns         (QuickFIX: 661 ns)
Field access (4):        11 ns         (QuickFIX: 31 ns)
Throughput:             4.17M msg/sec  (QuickFIX: 1.19M msg/sec)
```

Single core, RDTSCP timing, 100K iterations, synthetic messages. Not captured from a real feed. The gap will narrow on production traffic with variable-length fields and optional tags. I'm pretty confident the parser is faster, just not sure by how much once you leave the lab.

Where I am with it

Not production-ready. Parser and session layer work well enough to benchmark, but nobody should route real orders through this.

The thing that kept surprising me was how much of QuickFIX's complexity was the language, not the problem. PMR replaced a thousand-line pool. consteval eliminated a fifty-case switch. And xsimd collapsed four architecture-specific codepaths into one template. These aren't exotic features either, they just didn't exist in C++98. I don't know if this thing will ever cover all the message types QuickFIX does, but the parser core feels solid enough that I keep coming back to it on weekends.

GitHub: github.com/Lattice9AI/NexusFIX

Still figuring out: whether header-only holds past 10K lines, how much the 3x gap closes on captured traffic, and which message types actually matter beyond the obvious nine. If you've worked with FIX and have opinions on any of that, I'm interested.

Part of NexusFix, an open-source FIX protocol engine in C++23.
