One of the things that surprised me following last week's Jepsen report on Radix DLT (https://jepsen.io/analyses/radix-dlt-1.0-beta.35.1) was seeing both blockchain/DLT people *and* the database community go "Hang on, 16 transactions per second can't be right"--while expecting wildly different figures.
Is it that DLTs are doing *Byzantine* consensus? Etcd uses Raft (https://raft.github.io/), which is not Byzantine fault-tolerant. Committing takes 2 network hops plus a disk sync on a majority of nodes: ~2n messages/txn. Throughput is bounded by the single, totally-ordered Raft log.
Radix is based on HotStuff (https://arxiv.org/abs/1803.05069), which is Byzantine fault-tolerant, three-phase consensus: ~6n (I think?) messages/txn. Back-of-envelope below.
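Here's my accounting for those message counts, assuming a leader-driven happy path with n nodes and no batching or pipelining:

$$\underbrace{(n-1)}_{\text{AppendEntries}} + \underbrace{(n-1)}_{\text{acks}} = 2(n-1) \approx 2n \quad \text{(Raft)}$$

$$3\ \text{phases} \times \big[\underbrace{(n-1)}_{\text{broadcast}} + \underbrace{(n-1)}_{\text{votes}}\big] = 6(n-1) \approx 6n \quad \text{(HotStuff)}$$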
And like, HotStuff *itself* can go fast. The paper reports c5.4xlarge clusters pushing ~120K ops/sec (1KB/op, batches of 400 ops per round)--note the amortization there: at 400 ops/batch, that's only ~300 consensus rounds/sec actually paying the per-round message bill.
As the crypto maxim goes: DYOR!
Here's a YourKit snapshot from one of those Radix nodes pushing ~12 txns/sec. Some of it's crypto (BouncyCastle), but it looks like it's burning a ton of time in BerkeleyDB IO--roughly a third of it waiting on fsync.
http://jepsen.io.s3.amazonaws.com/misc/radix-dlt/Radix-2022-02-16.snapshot
Rather a *lot* of fsyncs, as it turns out. Roughly 11 calls per txn on each node, at least in this particular run.
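If you want a feel for what 11 syncs/txn costs, here's a tiny standalone benchmark (mine, nothing to do with Radix's code) that times write+fsync pairs via FileChannel.force:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FsyncBench {
  public static void main(String[] args) throws IOException {
    // Scratch file on whatever disk you want to measure.
    Path file = Path.of("fsync-bench.dat");
    try (FileChannel ch = FileChannel.open(file,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      ByteBuffer buf = ByteBuffer.allocate(1024); // ~1 KB per write
      int iters = 1000;
      long start = System.nanoTime();
      for (int i = 0; i < iters; i++) {
        buf.rewind();
        ch.write(buf);
        ch.force(true); // fsync: flush data and metadata to the device
      }
      long usPerSync = (System.nanoTime() - start) / iters / 1000L;
      System.out.printf("~%d us per write+fsync%n", usPerSync);
      // At ~11 fsyncs/txn, a serial commit path burns ~11x that per
      // transaction on durability alone.
    }
  }
}
```

Even on NVMe those add up fast at 11 per transaction; on network-attached storage, much worse.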
Etcd does way more syncs per second (!?) but, like most DBs, it batches: at ~2700 txns/sec, etcd gets away with only ~0.27 syncs/txn in this run--call it ~730 fsyncs/sec.
https://gist.github.com/aphyr/9f8e549ce86113efd652c63e5266f604
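etcd's actual batching lives in its Go WAL code; the general trick is group commit. Here's a minimal sketch of the idea (illustrative, not etcd's implementation): concurrent commits queue up, and one log thread fsyncs once per batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

// Group-commit sketch: many committers enqueue log records; a single log
// thread drains whatever has accumulated, writes it all, and fsyncs ONCE
// for the whole batch.
class GroupCommitLog {
  private record Pending(byte[] payload, CompletableFuture<Void> done) {}
  private final LinkedBlockingQueue<Pending> queue = new LinkedBlockingQueue<>();

  // Called by many committer threads; completes when the record is durable.
  public CompletableFuture<Void> commit(byte[] payload) {
    Pending p = new Pending(payload, new CompletableFuture<>());
    queue.add(p);
    return p.done();
  }

  // Runs in one dedicated thread.
  public void logLoop() throws InterruptedException {
    List<Pending> batch = new ArrayList<>();
    while (true) {
      batch.add(queue.take()); // block until at least one record arrives
      queue.drainTo(batch);    // then grab everything else that piled up
      for (Pending p : batch) append(p.payload());
      fsync();                 // one sync covers the whole batch
      for (Pending p : batch) p.done().complete(null);
      batch.clear();
    }
  }

  private void append(byte[] payload) { /* buffer the record for the WAL */ }
  private void fsync() { /* e.g. FileChannel.force(true) on the WAL file */ }
}
```

The more concurrent load, the bigger each batch, so syncs/txn falls below 1--which is how you get figures like 0.27.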
Zooming out: some of these costs can probably be optimized away in time. I suspect permissionless DLTs are always going to be at a latency and throughput disadvantage, though. For starters, Lamport 2002 puts a two-message-delay lower bound on async consensus: https://lamport.azurewebsites.net/pubs/lower-bound.pdf
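Paraphrasing that result from memory: even in failure-free runs, a learner can't learn a proposed value in fewer than two message delays, so with one-way delay $\delta$,

$$t_{\text{learn}} \;\ge\; t_{\text{propose}} + 2\delta$$

No amount of engineering gets a geo-distributed quorum under that floor.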
"Hang on, wasn't Radix slow with COMMIT_NO_SYNC too?"
Yup! That tells us fsync can't be the only factor. All that CPU has to be going somewhere; system time is high but variable. I'd also look at 540 kB/s of inbound network traffic vs 1.9 MB/s of disk writes: ~3.5x write amplification?
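For reference, COMMIT_NO_SYNC is BerkeleyDB JE's durability policy that skips the per-commit fsync and leaves flushing to the OS. The knob looks roughly like this (a sketch, not Radix's actual configuration code):

```java
import com.sleepycat.je.Durability;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import java.io.File;

public class DurabilityDemo {
  public static void main(String[] args) {
    // COMMIT_SYNC fsyncs the log on every commit; COMMIT_NO_SYNC only
    // writes to the OS buffer cache and lets the kernel flush later,
    // trading durability for speed.
    EnvironmentConfig cfg = new EnvironmentConfig();
    cfg.setAllowCreate(true);
    cfg.setTransactional(true);
    cfg.setDurability(Durability.COMMIT_NO_SYNC); // vs. Durability.COMMIT_SYNC
    // Environment home directory must already exist.
    Environment env = new Environment(new File("/tmp/je-env"), cfg);
    env.close();
  }
}
```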
Thing is, none of this is even remotely close to saturating disk or network bandwidth. It's a fresh, empty cluster and request volumes are *tiny*, so like... the page cache should be able to hold most if not all of this data.
I dunno. Software is a ~rich tapestry~