A bug in Jepsen: from versions 0.1.2 to 0.2.0, the counter checker docstring incorrectly claimed to handle decrements, which could cause valid histories to be reported as failures. This did not affect official Jepsen reports, but other counter tests using decrements may have been affected: https://groups.google.com/u/1/a/jepsen.io/g/announce/c/GsQ6e2e-Mcs
Love this thing where Google Cloud decides that jepsen.io has been stable for a while and it really ought to do something about that, so it kills the VM and spins up a new one to replace it only *after* it's dead, resulting in ~10 minutes of spurious downtime.
It's been doing this for ~two years, COME ON Google, y'all are supposed to be experts at rollouts. Start new nodes *before* you kill existing ones!
Did an interview with Tobias Macey talking about Jepsen's design, software verification in general, and the distributed database landscape: https://www.dataengineeringpodcast.com/jepsen-distributed-systems-testing-episode-143/
Reminder: The @jepsen_io Quarantine DB Talk featuring Kyle Kingsbury (@aphyr) is next Monday July 27 @ 4:30pm ET. Video will be live and uncut for the public over Zoom. All are welcome to join. https://db.cs.cmu.edu/events/db-seminar-spring-2020-db-group-black-box-isolation-checking-with-elle/
Hello #clojure friends
If you have some free time and you know #jepsen, this might be relevant for you
Anyway, if you have strong feelings, drop em here.
It's also, like... Jepsen is roughly 50/50 paid vs unpaid work right now. Jepsen contract rates are high, which covers research, maintenance, and writing in between. It's hard to imagine sponsors could materially shift that balance.
On the other hand, this presents a conflict-of-interest problem: so long as reports have a single sponsor (typically the vendor), it's easy to disclose and understand, but that's much trickier when there's a mix of a dozen ongoing sponsors.
Right now Redis makes a great cache, lossy message bus, and scratchpad, but you have to plan on data loss. Redis-Raft should hopefully change that by offering strict serializability, and from our testing, it looks like they're on track. Watch for GA next year!
Redis-Raft is really cool, because of the existing Redis replication strategies (Sentinel, Cluster, Enterprise, CRDT), all of them can lose updates during partitions.
There are a ton of neat bugs here, including infinite loops, total data loss on failover, servers sending responses to the wrong clients, and all kinds of crashes. None should have affected production users; Redis-Raft wasn't public until May, and GA isn't until 2021.
New Jepsen analysis: we worked with Redis Labs to evaluate Redis-Raft, a new, still-under-development approach to Redis replication, and found 21 issues, 20 of which have been fixed in recent builds. https://jepsen.io/analyses/redis-raft-1b3fbf6
I'm gonna be giving a Zoom talk on Elle for CMU's database seminar, on July 27th. I think anyone can join, if you want to listen in. :) https://db.cs.cmu.edu/events/db-seminar-spring-2020-db-group-black-box-isolation-checking-with-elle/
@jepsen_io Completely agreed! I'm aware of a few other instances of reported non-serializable behavior under PostgreSQL SSI: https://www.postgresql.org/message-id/20141021071458.2678.9080%40wrigleys.postgresql.org https://firstname.lastname@example.orgemail@example.com Repros: https://github.com/gfredericks/pg-serializability-bug
Anyway, consistency models are a mess; news at 11. 😂
So, while Berenson et al. say that snapshot isolation isn't stronger than repeatable read, PostgreSQL appears to have implicitly adopted the strict interpretation instead, and says that SI is stronger than RR. In fact, SI prohibits *every* anomaly in the strict interpretation of the ANSI SQL standard, including their (narrow) definition of phantoms!
This could still be consistent with the ANSI SQL standard's definition of repeatable read! As Berenson et al. pointed out twenty five years ago (!), the standard is ambiguous. The paper argues that there are two interpretations of the ANSI phenomena: strict, and broad. They say the broad interpretation is what ANSI *meant* to define, and goes on to define and analyze snapshot isolation in those terms. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf