One of the things that surprised me following last week's Jepsen report on Radix DLT (jepsen.io/analyses/radix-dlt-1) was seeing both blockchain/DLT people *and* the database community go "Hang on, 16 transactions per second can't be right"--and expecting wildly different figures.

In short: DLT folks seemed surprised that the figure wasn't 50 txns/sec. The DB community seemed surprised it wasn't 5,000.

This feels like a significant cultural gap in our collective expectations about what modern distributed systems should do! I'd like to help bridge that gap.

Disclaimers first: I'm not a performance expert. Jepsen, as a tool and a company, doesn't focus on performance. Jepsen workloads are designed to find safety bugs, which often means stressing things like concurrency-control mechanisms. They don't necessarily reflect "real" behavior.

Any benchmark is going to depend heavily on hardware, kernel tuning, network, request size, contention, concurrency, pipelining, compression, locality, etc. Totally normal to see a 10x difference in goodput. When RDX Works says Olympia can do 50 TPS, that's entirely reasonable!

With that said: here are throughput and latency graphs from Jepsen talking to Radix DLT running on a cluster of 5 m5.xlarge nodes (all validators) backed by EBS. No reads, just write transactions between an exponentially-distributed pool of 30 accounts. Whole dataset fits in RAM.
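
For a concrete picture of that workload shape, here's a minimal Go sketch of drawing transfer endpoints from an exponentially-distributed pool of 30 accounts. The real harness is Clojure and does much more bookkeeping; the scale factor here is made up.

```go
package main

import (
	"fmt"
	"math/rand"
)

const poolSize = 30 // accounts in the pool

// pickAccount draws an account index from an exponential distribution, so a
// few "hot" accounts see most of the transfers and the tail sees only a few.
func pickAccount(r *rand.Rand) int {
	i := int(r.ExpFloat64() * 5.0) // mean of ~5; scale factor is arbitrary
	if i >= poolSize {
		i = poolSize - 1
	}
	return i
}

func main() {
	r := rand.New(rand.NewSource(1))
	for n := 0; n < 5; n++ {
		// A real workload would skip self-transfers, track balances, etc.
		from, to := pickAccount(r), pickAccount(r)
		fmt.Printf("transfer: account %d -> account %d\n", from, to)
	}
}
```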

Here's etcd on that same cluster doing writes of transaction-sized (~250B) JSON blobs to random keys. No special tuning, just an out of the box install. No connection pipelining/pooling or batching. Just plain old connection-per-request over HTTP.
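
For comparison, the etcd side of the test boils down to small PUTs of JSON blobs. Here's a rough sketch with the official Go v3 client--the actual test used a fresh connection per request over HTTP, which this doesn't replicate, and the node address is hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://n1:2379"}, // hypothetical node address
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// A ~250-byte JSON blob standing in for a transaction record.
	value := fmt.Sprintf(`{"from": %d, "to": %d, "amount": %d, "padding": %q}`,
		rand.Intn(30), rand.Intn(30), rand.Intn(1000), strings.Repeat("x", 180))

	// Write it to a random key, as the workload did.
	key := fmt.Sprintf("txn-%d", rand.Intn(1000000))
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if _, err := cli.Put(ctx, key, value); err != nil {
		panic(err)
	}
}
```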

This is ~typical for distributed DBs.

So while I generally say that in performance terms "15 and 100 are the same number"... I dunno, there's a bit of a difference here! And this doesn't seem specific to Radix--DLTs like Ethereum, Bitcoin, & Cardano are also running in the tens of txns/sec range. What's up with that?

To wit: both of these are 5-node consensus systems. Both use that consensus system to build a totally ordered log of state transitions, and run a replicated state machine on top of that log. But etcd here is pushing 2600 instead of 12 TPS, and with median latencies ~20x lower.
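
If the "consensus log plus state machine" framing is unfamiliar, both systems boil down to something like this grossly simplified, hypothetical sketch: consensus fixes the order of entries, and every node applies them to its local state in that order.

```go
package main

import "fmt"

// An entry the consensus layer has already totally ordered and committed.
type Entry struct {
	From, To string
	Amount   int64
}

// The replicated state machine: every node applies the same committed
// entries in the same order, so every node ends up with the same balances.
type Ledger struct {
	balances map[string]int64
}

func (l *Ledger) Apply(e Entry) {
	l.balances[e.From] -= e.Amount
	l.balances[e.To] += e.Amount
}

func main() {
	log := []Entry{{"alice", "bob", 5}, {"bob", "carol", 2}}
	l := &Ledger{balances: map[string]int64{"alice": 10, "bob": 10, "carol": 10}}
	for _, e := range log {
		l.Apply(e)
	}
	fmt.Println(l.balances) // map[alice:5 bob:13 carol:12]
}
```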

Is it that DLTs are doing *byzantine* consensus? Etcd uses Raft (raft.github.io/), which is not Byzantine fault-tolerant. Takes 2 network hops plus a disk sync on a majority of nodes to commit. ~2n messages/txn. Throughput bounded by the single, totally-ordered Raft log.

Radix is based on Hotstuff (arxiv.org/abs/1803.05069), which is Byzantine fault-tolerant, three-phase consensus. ~6n (I think?) messages/txn.
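
Back-of-envelope, for these 5-node clusters, that difference in message complexity alone works out to roughly:

```latex
\text{Raft: } \sim 2n = 10 \text{ messages/txn}
\qquad
\text{Hotstuff: } \sim 6n = 30 \text{ messages/txn}
\qquad (n = 5)
```

That's maybe a 3x difference in messages, which on its own doesn't come close to explaining a ~200x gap in throughput.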

And like, Hotstuff *itself* can go fast. The paper reports c5.4xlarge clusters pushing ~120K ops/sec (1KB/op, batches of 400 ops per round).

So maybe this is more about the state machine *on top* of the consensus mechanism? Additional cryptography above the BFT consensus core? Serialization costs? General runtime slowness? Both are written in modern GCed languages: Radix in Java, etcd in Go.

As the crypto maxim goes: DYOR!

Here's a YourKit snapshot from one of those Radix nodes pushing ~12 txns/sec. Some of it's crypto (BouncyCastle), but it looks like it's burning a ton of time in BerkeleyDB IO--roughly a third of it waiting for fsync.

jepsen.io.s3.amazonaws.com/mis

Rather a *lot* of fsyncs, as it turns out. Roughly 11 calls per txn on each node, at least in this particular run.

Etcd does way more fsyncs per second (!?) but, like most DBs, it batches them. At ~2700 txns/sec, etcd gets away with only ~0.27 syncs/txn in this run.

gist.github.com/aphyr/9f8e549c

Anyway, that's where I'd start digging if I were working on Radix perf! Some of this might be limitations of BDB's API... but I'd be looking for ways to omit or batch syncs, asking where I might be seeing write amplification, trying to measure write locality/ordering, etc.
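
To make the "batch syncs" idea concrete, here's a minimal group-commit sketch in Go--not how BDB (or etcd) actually structures it, just the shape of the trick: writers queue records and wait, and a single flusher appends a whole batch under one fsync.

```go
package main

import (
	"log"
	"os"
)

type write struct {
	data []byte
	done chan error // receives once the data is durable (or failed)
}

// groupCommit drains whatever writes have queued up, appends them all,
// and pays for a single fsync on behalf of the whole batch.
func groupCommit(f *os.File, in <-chan write) {
	for w := range in {
		batch := []write{w}
		// Grab anything else already waiting, without blocking.
	drain:
		for {
			select {
			case more := <-in:
				batch = append(batch, more)
			default:
				break drain
			}
		}
		var err error
		for _, b := range batch {
			if _, werr := f.Write(b.data); werr != nil {
				err = werr
			}
		}
		if err == nil {
			err = f.Sync() // one fsync amortized across len(batch) txns
		}
		for _, b := range batch {
			b.done <- err
		}
	}
}

func main() {
	f, err := os.CreateTemp("", "wal")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())

	in := make(chan write, 1024)
	go groupCommit(f, in)

	// Simulate a burst of concurrent "transactions".
	results := make([]chan error, 10)
	for i := range results {
		results[i] = make(chan error, 1)
		in <- write{data: []byte("txn record\n"), done: results[i]}
	}
	for _, ch := range results {
		if err := <-ch; err != nil {
			log.Fatal(err)
		}
	}
}
```

The ~0.27 syncs/txn figure above works out to a few transactions sharing each sync, on average.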

Zooming out: Some of these costs can probably be optimized away in time. I suspect permissionless DLTs are always going to be at a latency and throughput disadvantage though. For starters, Lamport 2002 puts a two msg-delay lower bound on async consensus: lamport.azurewebsites.net/pubs

Paxos & Raft achieve this 2-msg-delay lower bound in the stable case. IIRC Fast Byzantine Paxos is also 2, Tendermint needs 4. Hotstuff is... 6? I think? I always struggle to read papers like this.

Chime in if I'm wrong about any of this--I haven't been in this literature in a while.

That tells us that any globe-spanning consensus system using light through fiber will have a hard latency floor of ~200 ms. Consensus latencies in local datacenters can get down around 5-10 ms. Not every application is latency-sensitive, but some require or profit from low latency!
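
Rough arithmetic behind that floor, assuming roughly antipodal participants (~20,000 km apart) and light in fiber at about two-thirds of c (~200,000 km/s):

```latex
t_{\mathrm{msg}} \approx \frac{2 \times 10^{4}\ \mathrm{km}}{2 \times 10^{5}\ \mathrm{km/s}} = 100\ \mathrm{ms}
\qquad
t_{\mathrm{commit}} \geq 2\, t_{\mathrm{msg}} \approx 200\ \mathrm{ms}
```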

Constant factors: BFT and permissionless networks tend to rely on cryptographic signatures, and those take compute, bandwidth, and storage.
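
For an order-of-magnitude feel for the signing cost, here's a tiny Go micro-benchmark using Ed25519 from the standard library. (Radix actually uses ECDSA over secp256k1 via BouncyCastle, so treat this as a ballpark only; the iteration count is arbitrary.)

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
	"time"
)

func main() {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	msg := make([]byte, 250) // roughly transaction-sized payload

	const n = 10000
	start := time.Now()
	var sig []byte
	for i := 0; i < n; i++ {
		sig = ed25519.Sign(priv, msg)
	}
	signTime := time.Since(start)

	start = time.Now()
	for i := 0; i < n; i++ {
		if !ed25519.Verify(pub, msg, sig) {
			panic("bad signature")
		}
	}
	verifyTime := time.Since(start)

	fmt.Printf("sign:   %v/op\n", signTime/n)
	fmt.Printf("verify: %v/op\n", verifyTime/n)
}
```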

I have a loose suspicion that the UTXO state machine representation for the ledger might also impose costs. Would love to hear about this.

Redundancy: Sybil resistance pushes DLTs to run the same computation on lots of nodes. Ethereum uses at least 5770 nodes to do a single (very slow) computer's computation. Radix uses ~100. DBs using regular old (i.e. permissioned) consensus usually run 3, 5, or 7 replicas.

Does this *matter*? I honestly don't know.

ACH latencies are multiple days. OTOH, some trading systems start issuing order requests while the packet with offers is still coming over the wire. Transaction value can just barely beat--or far outweigh--processing & storage costs.

This is something I kind of expected DLT & DeFi whitepapers to discuss as a matter of course: What kinds of apps would be insensitive to these costs? Which ones might find it more efficient to keep running on permissioned, centralized networks?

Curious to hear y'all's thoughts!

If I can leave you with one idea, it's:

DLTs, like any database, are empirically investigable artifacts. You can build, install, and ask one to store some data. See if it comes back like you'd expect. Even simple tests can lead to interesting & exciting results.

Try it out! ❤️

"Hang on, wasn't Radix slow with COMMIT_NO_SYNC too?"

Yup! That tells us fsync can't be the only factor. All that CPU has to be going somewhere--system time is high but variable. I'd also look at 540 kBps of inbound network traffic vs 1.9 MBps of disk writes (~3.5x): write amplification?

The thing is, none of this is even remotely close to saturating disk or network bandwidth. It's a fresh, empty cluster and request volumes are *tiny*, so like... the page cache should be able to hold most if not all of this data.

I dunno. Software is a ~rich tapestry~
