Lessons Learned From Real-World NoSQL Database Migrations

1 week ago

In “Battle-Tested Tips for a Better NoSQL Migration,” I shared my top strategies for planning, executing and de-risking a NoSQL database migration. I discussed key steps like schema and data migration, data validation and important considerations such as technology switches, tooling, edge cases and the idea that you might not need to migrate all your data.

Now, let’s analyze how teams actually migrated their data: what challenges they faced, which trade-offs they made, how they proceeded and what lessons they learned. These are all real-world examples with names and identifying details obfuscated.

Streaming Bulk Load (DynamoDB to ScyllaDB)

First example: a large media streaming company that decided to move from DynamoDB to ScyllaDB to reduce costs.

One interesting aspect of this use case is that the team had an ingestion process that overwrote their entire data set daily. As a result, there was no need to forklift their data from one database to another. They could just configure their ingestion job to write to ScyllaDB in addition to DynamoDB.

As soon as the job kicked in, data was stored in both databases. Since the DynamoDB and ScyllaDB data models are so similar, that greatly simplified the process. It’s more complex when switching from a document store or a relational database to wide-column NoSQL.

As I mentioned in the previous article, a migration from one technology to another almost always requires making some changes. Even with similar databases, features and inner workings vary. Some of this team’s migration concerns were related to the way ScyllaDB handled out-of-order writes, how they would implement record versioning and the efficiency of data compression. These were all valid and interesting concerns.

The main lesson from this migration is the need to understand the differences between your source and target databases. Even databases that are quite similar in many respects, such as ScyllaDB and DynamoDB, do have differences that you need to recognize and navigate. As you explore these differences, you may eventually stumble upon room for improvement, which is exactly what happened here.

The use case in question was very susceptible to out-of-order writes. Before we explain how they addressed it, let’s cover what an out-of-order write involves.

Understanding Out-of-Order Writes

Out-of-order writes happen when newer updates arrive before older ones.

For example, assume you’re running a dual-write setup, writing to both your source and target databases at the same time. Then you plug in a migration tool (such as the ScyllaDB Migrator) to start reading data from the source database and writing it to the destination one. The Spark job reads some data from the source database, then the client writes an update to that same data. The client writes the data to the target database first and the Spark job writes it after. The Spark job might overwrite the fresher data. That’s an out-of-order write.

Martin Fowler describes it this way: “An out-of-order event is one that’s received late, sufficiently late that you’ve already processed events that should have been processed after the out-of-order event was received.”

With both Cassandra and ScyllaDB, you can handle these out-of-order writes by using the CQL (Cassandra Query Language) protocol to explicitly set timestamps on writes. In our example, the client update would include a later timestamp than the Spark write, so it would “win,” no matter which arrives last.

This capability doesn’t exist in DynamoDB.
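
To make the timestamp rule concrete, here is a minimal Python sketch of last-write-wins reconciliation. It is a toy in-memory model, not ScyllaDB code: the LastWriteWinsTable class, the key name and the timestamp values are all invented for illustration, and tie-breaking between equal timestamps is left out. In CQL itself, the client supplies the timestamp with USING TIMESTAMP on the write statement.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# CQL accepts a client-supplied write timestamp (microseconds since the epoch):
#   INSERT INTO ks.tbl (id, val) VALUES (?, ?) USING TIMESTAMP ?
# The keyspace, table and column names above are placeholders.


@dataclass
class Cell:
    value: str
    write_ts: int  # microseconds since the epoch


class LastWriteWinsTable:
    """Toy in-memory model of timestamp-based reconciliation (hypothetical)."""

    def __init__(self) -> None:
        self._cells: Dict[str, Cell] = {}

    def write(self, key: str, value: str, write_ts: int) -> None:
        current = self._cells.get(key)
        # Keep the incoming write only if it carries a newer timestamp.
        if current is None or write_ts > current.write_ts:
            self._cells[key] = Cell(value, write_ts)

    def read(self, key: str) -> Optional[str]:
        cell = self._cells.get(key)
        return cell.value if cell else None


target = LastWriteWinsTable()
# The client update (newer timestamp) reaches the target first...
target.write("user:1", "fresh value from client", write_ts=2_000)
# ...and the migrator's older copy of the row arrives afterwards.
target.write("user:1", "stale value from migrator", write_ts=1_000)
print(target.read("user:1"))  # fresh value from client
```

Because the comparison depends only on the timestamps, the result is the same whichever write arrives last, and no read is needed before writing.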

How the Team Handled Out-of-Order Writes in DynamoDB

The team was handling out-of-order writes using DynamoDB’s Condition Expressions, which are very similar to lightweight transactions in Cassandra. However, Condition Expressions in DynamoDB are much more expensive (with respect to performance as well as cost) than regular non-conditional writes.
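
The team's actual expressions aren't shown here, so the following is only a sketch of what a version-guarded conditional write could look like. It builds the keyword arguments for a boto3 Table.put_item call without contacting AWS. The attribute names pk and record_version, and both helper functions, are assumptions made for illustration.

```python
from typing import Any, Dict, Optional


def build_conditional_put(item: Dict[str, Any]) -> Dict[str, Any]:
    """Keyword arguments for a boto3 Table.put_item call that should only
    succeed when the stored record is missing or older than the incoming one.
    The attribute names pk and record_version are hypothetical."""
    return {
        "Item": item,
        "ConditionExpression": "attribute_not_exists(#pk) OR #ver < :incoming",
        "ExpressionAttributeNames": {"#pk": "pk", "#ver": "record_version"},
        "ExpressionAttributeValues": {":incoming": item["record_version"]},
    }


def condition_holds(stored: Optional[Dict[str, Any]], incoming: Dict[str, Any]) -> bool:
    """Local emulation of the condition above, for illustration only."""
    return stored is None or stored["record_version"] < incoming["record_version"]


kwargs = build_conditional_put({"pk": "user:1", "record_version": 7})
print(kwargs["ConditionExpression"])
```

With a real table, the call would be table.put_item(**kwargs), and a write that fails the condition is rejected with a ConditionalCheckFailedException instead of overwriting the newer record.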

How did this team try to circumvent out-of-order writes using ScyllaDB? Initially, they implemented a read-before-write prior to every write. This effectively caused their number of reads to spike.

After we met with them and analyzed their situation, we improved their application and database performance considerably by simply manipulating the timestamp of their writes. That’s the same approach that another customer of ours, Zillow, uses to handle out-of-order events.

Engagement Platform: TTL’d Data (ScyllaDB Self-Managed to ScyllaDB Cloud)

Next, let’s look at a migration across different flavors of the same database: a ScyllaDB to ScyllaDB migration. An engagement platform company decided to migrate from a self-managed on-premises ScyllaDB deployment to the ScyllaDB Cloud managed solution, so we helped them move data over.

No data modeling changes were needed, greatly simplifying the process. Though we initially suggested carrying out an online migration, they chose to take the offline route instead.

Why an Offline Migration?

An offline migration has some clear drawbacks: There’s a data loss window equal to the time the migration takes, and the process is rather manual. You have to snapshot each node, copy the snapshots somewhere and then load them into the target system. And if you choose not to dual-write, switching clients is a one-way move; going back would mean losing data.

We discussed those risks upfront, but the team decided that these risks wouldn’t outweigh the benefits and simplicity of doing it offline. (They expected most of their data to eventually expire via TTL, or time to live.)

Before the production migration, we tested each step to better understand the potential data loss window.

In most cases, it is also possible to turn that data loss into a merely temporary inconsistency when carrying out an offline migration. After you switch your writers, you simply repeat the migration steps from the source database (now a read-only system), thus restoring any data that wasn’t captured as part of the first snapshot.

A Typical TTL-Based Migration Flow

This team used TTL data to control their data expiration, so let’s discuss how a migration with TTL data typically works.

First, you configure the application clients to do dual-writing but keep the client reading only from the existing source of truth. Eventually, the TTL on the data in that source of truth expires. At this point, you can switch the reads to the new target database and all data should be in sync.
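
The timing of that read switch comes down to simple arithmetic. The sketch below assumes every row carries a TTL and that the longest TTL in use is known; the function name, the start date and the 30-day TTL are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def earliest_read_cutover(dual_write_start: datetime, max_ttl: timedelta) -> datetime:
    """By this time, every row written before dual-writing began has expired,
    so any row still live in the source was also written to the target."""
    return dual_write_start + max_ttl


# Hypothetical values: dual-writes enabled on Jan. 1, longest TTL is 30 days.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
cutover = earliest_read_cutover(start, timedelta(days=30))
print(cutover.isoformat())  # 2024-01-31T00:00:00+00:00
```

If some rows are written without a TTL, this shortcut no longer holds and those rows still need to be copied explicitly.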

How nan Migration Actually Played Out

In this case, the client was only reading and writing against a single existing source of truth. With the application still running, the team took an online snapshot of their data across all nodes. The resulting snapshots were transferred to the target cluster and we loaded the data using Load and Stream (a ScyllaDB extension that builds on the Cassandra nodetool refresh command).

Rather than simply loading the data on a node and discarding the data for tokens that the node is not a replica for, Load and Stream actually streams the data to the other cluster members that own it. This greatly simplifies the overall migration process.
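
The difference between the two behaviors can be illustrated with a toy Python model. This is not how ScyllaDB places data internally: the node names, the replication factor and the hash-based replicas_for function are all assumptions chosen to keep the example small.

```python
import hashlib
from typing import Dict, List

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
REPLICATION_FACTOR = 3


def replicas_for(key: str) -> List[str]:
    """Toy placement: hash the key to a starting node, then take the next RF nodes."""
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]


def plain_refresh(local_node: str, sstable_keys: List[str]) -> List[str]:
    """Keep only the keys the local node replicates; the rest are dropped."""
    return [k for k in sstable_keys if local_node in replicas_for(k)]


def load_and_stream(sstable_keys: List[str]) -> Dict[str, List[str]]:
    """Route every key to all of its replicas, wherever the file was loaded."""
    placement: Dict[str, List[str]] = {node: [] for node in NODES}
    for key in sstable_keys:
        for node in replicas_for(key):
            placement[node].append(key)
    return placement


keys = [f"user:{i}" for i in range(8)]
kept_locally = plain_refresh("node-a", keys)
streamed = load_and_stream(keys)
print(len(kept_locally), {node: len(ks) for node, ks in streamed.items()})
```

In the first case, you have to make sure each snapshot file lands on a node that owns its data. In the second, any file can be loaded on any node and every key still reaches all of its replicas.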

After the team’s Load and Stream completed, the client simply switched reads and writes over to the new source of truth.

Messaging App: Shadow Cluster (Cassandra to ScyllaDB)

Next, let’s explore how a messaging app company approached the challenge of migrating more than a trillion rows from Cassandra to ScyllaDB.

Since Cassandra and ScyllaDB are API compatible, such migrations shouldn’t require any schema or application changes. However, given the criticality of their data and consistency requirements, an online migration approach was the only feasible option. They needed zero user impact and had zero tolerance for data loss.

Using a Shadow Cluster for Online Migration

The team opted to create a “shadow cluster.” A shadow cluster is a mirror of a production cluster that has (mostly) the same data and receives the same reads and writes. They created it from disk snapshots of the nodes in the corresponding production cluster. Production traffic (both reads and writes) was mirrored to the shadow cluster via a data service that they created for this specific purpose.

With a shadow cluster, they could assess the performance impact of the new platform before they actually switched. It also allowed them to thoroughly test other aspects of the migration, such as longer-term stability and reliability.

The drawbacks? It’s fairly expensive, since it typically doubles your infrastructure costs while you’re running the shadow cluster. Having a shadow cluster also adds complexity to things like observability, instrumentation, potential code changes and so on.

Negotiating Throughput and Latency Trade-offs During Migration

One notable lesson learned from this migration: how important it is to ensure the source system’s stability during the actual data migration. Most teams just want to migrate their data as fast as possible. However, migrating as fast as possible could affect latencies, and that could be a problem when low latencies are critical to the end users’ satisfaction.

In this team’s case, the solution was to migrate the data as fast as possible, but only up to the point where it started to affect latencies on the source system.

And how many operations per second should you run to migrate? At which level of concurrency? There’s no easy answer here. Really, you have to test.
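
One way to structure such a test is a feedback loop that raises the migration rate while the source stays healthy and backs off when latency suffers. The sketch below uses an additive-increase, multiplicative-decrease rule, which is my illustration and not necessarily what this team did. The function name, the latency budget and all the numbers are assumptions.

```python
def next_rate(current_rate: int, source_p99_ms: float, p99_budget_ms: float,
              step: int = 500, backoff: float = 0.5, floor: int = 100) -> int:
    """Additive increase while the source is within its latency budget,
    multiplicative decrease as soon as it is not."""
    if source_p99_ms > p99_budget_ms:
        return max(floor, int(current_rate * backoff))
    return current_rate + step


# Hypothetical P99 readings (ms) from the source system, one per interval.
rate = 1_000  # migration operations per second
for observed_p99 in [4.0, 5.0, 6.5, 12.0, 7.0]:
    rate = next_rate(rate, observed_p99, p99_budget_ms=10.0)
print(rate)  # 1750
```

The right budget, step and backoff depend on your workload, which is why measuring against the real source system matters more than any default.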

Wrapping Up

The “best” NoSQL migration approach? As the breadth and diversity of these examples show, the answer is quite simple: it depends. A daily batch ingestion let one team skip the usual migration steps entirely. Another had to navigate TTLs and snapshot timing. And yet another team was really focused on making sure migration didn’t compromise their strict latency requirements. What worked for one team wouldn’t have worked for the next, and your specific requirements will shape your own migration path as well.

I hope these examples provided an interesting peek into the types of trade-offs and technical considerations you’ll face in your own migration. If you’re curious to learn more, I encourage you to browse the library of ScyllaDB user migration stories. For example: