Performance tuning

Measured with the repo's benchmark suite on a single workstation:

Workload         Throughput / latency
Small payload    latency mean 556 µs, p99 1075 µs
1 MB array       7.7k msg/s, 8.0 GB/s
4 MB array       2.25k msg/s, 9.4 GB/s
1080p RGB        1422 fps, 8.8 GB/s

See Benchmarks to reproduce.

Copy-on-use

Decoded NumPy arrays alias the ZMQ frame memory; this zero-copy path is why large-payload throughput reaches memcpy bandwidth. Two rules follow:

  • Mutate the array? arr = arr.copy() first.
  • Hold the array past the callback? Copy first.
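A minimal sketch of the aliasing behavior and the copy discipline, assuming the decode path wraps the frame bytes with something like np.frombuffer (an assumption; the actual decoder may differ):

```python
import numpy as np

# Simulate a decoded frame: np.frombuffer aliases the underlying
# bytes instead of copying them (assumed to mirror the decode path).
frame = b"\x01\x02\x03\x04"
arr = np.frombuffer(frame, dtype=np.uint8)
assert not arr.flags.writeable  # aliased view over immutable bytes: read-only

# Copy before mutating or retaining past the callback:
safe = arr.copy()
safe[0] = 99                    # fine: `safe` owns its memory
assert frame[0] == 1            # original frame bytes unchanged
```

The copy costs one memcpy; skipping it on a frame you hold past the callback risks reading memory the transport has already recycled.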

Queue sizing

Per-socket HWM defaults to 10. Increase queue_size on high-rate producers whose subscribers are slow, but remember that once the HWM is reached, ZMQ drops new messages silently.
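The drop semantics can be modeled in a few lines. This is an illustrative toy (DroppingQueue is hypothetical, not part of the library or ZMQ), showing only the behavior at the high-water mark:

```python
from collections import deque

class DroppingQueue:
    """Toy model of ZMQ's PUB-side high-water mark: once the queue
    holds `hwm` messages, new sends are dropped with no error."""

    def __init__(self, hwm=10):
        self.hwm = hwm
        self.q = deque()
        self.dropped = 0

    def send(self, msg):
        if len(self.q) >= self.hwm:
            self.dropped += 1   # silent drop: caller gets no exception
            return False
        self.q.append(msg)
        return True

q = DroppingQueue(hwm=10)
results = [q.send(i) for i in range(15)]
# 10 queued, 5 silently dropped
```

Because the drop is silent, the only way to notice it in production is a counter or sequence-number gap on the subscriber side.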

When to prefer the single-blob path

Tiny messages (primitives only, < 1 KB) see no benefit from multipart framing; the inline to_bytes path is sufficient there. Note, however, that the transport always uses multipart today.

uvloop

uvloop is installed by default on Unix and noticeably reduces tail latency for high-rate small messages.
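If uvloop did not come with your install (e.g. a minimal or non-Unix environment), opting in manually looks like this; the try/except keeps the code portable to platforms where uvloop is unavailable:

```python
import asyncio

try:
    import uvloop
    # Replace the default asyncio event loop with uvloop's.
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
except ImportError:
    pass  # fall back to the stock asyncio loop

async def main():
    return "ok"

result = asyncio.run(main())  # runs on uvloop when available
```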

Sync subscriber mode

For control loops that need sub-100 µs p99 latency, see Sync subscriber mode, which bypasses asyncio and zmq.asyncio on the receive path.
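To show the shape of a receive path with no event loop, here is a plain-pyzmq sketch; the inproc endpoint and raw b"tick" framing are placeholders for illustration, not the library's actual API or wire format:

```python
import zmq

# Blocking PUB/SUB pair over inproc: no asyncio, no zmq.asyncio.
ctx = zmq.Context.instance()
pub = ctx.socket(zmq.PUB)
pub.bind("inproc://sync-demo")
sub = ctx.socket(zmq.SUB)
sub.connect("inproc://sync-demo")
sub.setsockopt(zmq.SUBSCRIBE, b"")

# PUB/SUB has a slow-joiner race, so resend until the SUB sees a message.
msg = None
for _ in range(200):
    pub.send(b"tick")
    if sub.poll(25):       # ms; synchronous poll, no coroutines
        msg = sub.recv()   # plain blocking recv
        break

sub.close()
pub.close()
```

The synchronous recv avoids the event-loop scheduling hop on every message, which is where the asyncio path pays most of its tail latency.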