Jepsen 0.2.7 is now available! Includes a (known-buggy) preview of lazyfs: a filesystem which can intentionally lose un-fsynced writes!
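The core idea behind lazyfs can be sketched in a few lines. `LossyFS` below is a hypothetical toy model, not the lazyfs API: buffered writes live only in a simulated page cache and vanish on a simulated crash, while fsynced writes survive.

```python
# Toy model of the failure mode lazyfs exposes: un-fsynced writes sit in a
# "page cache" and are lost on crash; fsynced writes reach durable storage.
# Illustrative sketch only -- this is not the lazyfs interface.

class LossyFS:
    def __init__(self):
        self.disk = {}    # durable state
        self.cache = {}   # dirty pages not yet flushed

    def write(self, path, data):
        self.cache[path] = data                     # buffered: not yet durable

    def fsync(self, path):
        if path in self.cache:
            self.disk[path] = self.cache.pop(path)  # flush to durable storage

    def crash(self):
        self.cache.clear()                          # lose every un-fsynced write

    def read(self, path):
        return self.cache.get(path, self.disk.get(path))

fs = LossyFS()
fs.write("a.log", "synced")
fs.fsync("a.log")
fs.write("b.log", "unsynced")
fs.crash()
assert fs.read("a.log") == "synced"   # survived: it was fsynced
assert fs.read("b.log") is None       # lost: never fsynced
```

A test harness pointed at a filesystem with this behavior can check whether a database's "committed" data actually survives power loss.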
Ayyyyyy, congratulations! 🎉
When I joined I had a very clear view of what I wanted to fix: that @jepsen_io data loss failure from 2012. :)
With the release of the Raft-based Quorum Queues we now have a queue type that provides the kind of data safety users expect from a messaging system. https://twitter.com/kjnilsson/status/1530163817367445505
Hear from @jepsen_io about the safety of our streaming data engine – what we fixed and what we didn't. Live webinar on May 25 at 10am PST.
Cheers to @redpandadata on a delightful collaboration, and congratulations on their new release. :-)
Redpanda has addressed most of these issues in the just-released 21.11.15, and the upcoming 22.1.1 fixes aborted reads and lost writes in transactions; lost/stale messages are still under investigation. A few more issues require only documentation changes to address.
I am begging the cryptocurrency community to consider alternative ways of knowing, such as "emailing someone to ask them questions instead of speculating in chat" and "submitting a handful of transactions and seeing if they show up"
They also stressed the importance of end-to-end verification of safety properties, because APIs are how exchanges and users actually interact with DLTs. This is a challenge in traditional databases as well: composition of (e.g.) serializable transactional DBs is nontrivial!
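Why composition is nontrivial can be shown with a deterministic sketch: two independent "databases", each of which runs its local transactions atomically, still let a logical transaction spanning both half-happen from a reader's perspective. All names here are illustrative.

```python
# Two independent databases, each locally atomic/serializable.
# A logical transaction T1 writes to BOTH, but the pair of local
# transactions is not atomic as a whole.

db1 = {"a": 0}
db2 = {"b": 0}

# A reader starts by reading db2...
observed_b = db2["b"]        # sees b = 0

# ...then T1 runs: two local transactions, each atomic on its own DB.
db1["a"] = 1
db2["b"] = 1

# ...then the reader finishes by reading db1.
observed_a = db1["a"]        # sees a = 1

# The reader observed {a: 1, b: 0}: a fractured state in which T1
# half-happened. No serial order of {T1, reader} over a single
# serializable database could produce this observation.
assert (observed_a, observed_b) == (1, 0)
```

Each database kept its local guarantee; the *composition* violated atomicity. End-to-end tests through the API catch exactly this class of problem.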
I'm not sure how widespread this understanding is in the DLT space (still looking for a citation for RDX Works's definition) but the researchers I've talked to were unanimous: losing committed transactions *is* a safety error, even if every validator agrees to throw away data.
Since the release I've had the chance to chat with a handful of analysts working specifically on verification of blockchain/cryptocurrency/DLT systems, and can confirm that they also use the usual distsys sense of "safety property"--namely: "something bad does not happen".
Some helpful and much-better-informed comments from @trianglesphere on tendermint/hotstuff latency, including a nicely drawn Lamport diagram.
@jepsen_io I’m pretty sure it’s 7 delays. 1 to validator, 7 to finalize, 1 from any validator back to the client. By this metric, PBFT/Tendermint is 3. Ignoring the new view, each set of arrows is a hop
Thing is that none of this is even remotely close to saturating disk or network bandwidth. It's a fresh, empty cluster and request volumes are *tiny*, so like... page cache should be able to hold most if not all of this data.
I dunno. Software is a ~rich tapestry~
"Hang on, wasn't Radix slow with COMMIT_NO_SYNC too?"
Yup! That tells us fsync can't be the only factor. All that CPU has to be going somewhere. High but variable system time. I'd also look at 540 kBps of inbound network traffic vs 1.9 MBps of disk writes: write amplification?
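The back-of-the-envelope arithmetic behind that write-amplification hunch, using the rough figures from the thread:

```python
# Rough write-amplification estimate from the observed throughput:
# ~540 kB/s of inbound network traffic vs ~1.9 MB/s of disk writes.
inbound_kBps = 540
disk_kBps = 1900

amplification = disk_kBps / inbound_kBps
# Each inbound byte turns into roughly 3.5 bytes written to disk,
# before the workload comes anywhere near saturating either resource.
assert round(amplification, 1) == 3.5
```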
If I can leave you with one idea, it's:
DLTs, like any database, are empirically investigable artifacts. You can build, install, and ask one to store some data. See if it comes back like you'd expect. Even simple tests can lead to interesting & exciting results.
Try it out! ❤️
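Even the simplest version of "store some data, see if it comes back" fits in a few lines. `Ledger` below is a hypothetical in-memory stand-in for a node's submit/query client; against a real DLT you'd swap it for calls to the node's actual API.

```python
# Minimal black-box smoke test: submit a handful of transactions and
# check that every one of them shows up. `Ledger` is a made-up stand-in
# for a real DLT client, used here so the sketch is self-contained.

import uuid

class Ledger:
    """Hypothetical in-memory submit/lookup client."""
    def __init__(self):
        self._txns = {}
    def submit(self, txn_id, payload):
        self._txns[txn_id] = payload
    def lookup(self, txn_id):
        return self._txns.get(txn_id)

ledger = Ledger()
submitted = {}

# Submit a handful of transactions...
for i in range(5):
    txn_id = str(uuid.uuid4())
    submitted[txn_id] = f"transfer-{i}"
    ledger.submit(txn_id, submitted[txn_id])

# ...then verify each one comes back as written.
missing = {t for t, p in submitted.items() if ledger.lookup(t) != p}
assert not missing, f"lost transactions: {missing}"
```

A real test would also wait for finalization and retry lookups, but even this shape of check has turned up surprises.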
This is something I kind of expected DLT & DeFi whitepapers to discuss as a matter of course: What kinds of apps would be insensitive to these costs? Which ones might find it more efficient to keep running on permissioned, centralized networks?
Curious to hear y'all's thoughts!