What Is Data Gravity And Is It Important With Serverless?

When talking about vendor lock-in, people rarely talk about data gravity. Do you need to consider it when going serverless?

When I got started with serverless a few years ago, you could rarely have a conversation about it without hearing the phrase "vendor lock-in".

For those of you lucky enough to have never heard that phrase before, it means that by committing to a specific vendor you are tying yourself to them. You couldn't separate from them even if you wanted to (or you'd have a really difficult time doing so).

While that isn't inherently wrong about serverless, does it really matter? We used to call vendor lock-in making a decision. You decide on a vendor, use their feature set, and build your software.

But that rubs some people the wrong way, which is fine. However, you must make a decision when choosing a cloud vendor to host and build your software. Some go to great lengths to build "vendor-agnostic" solutions where they could pick up their code from AWS on Monday and have it up and running over in Azure on Tuesday.

Kudos to those who do it. It adds complexity to the solution, but they've gained the ability to move their code between service providers in case something goes horribly wrong (either from a disaster recovery standpoint or from a price hike scenario).

Or have they?

While the code might be able to move, what about all of your data? How is that going to move between vendors? You can't start fresh; you'd lose all of your customers. It needs to move along with your code.

What Exactly Is Data Gravity?

As the amount of data your application stores increases, moving it becomes more and more difficult. It works like a gravitational pull: the larger the mass of data, the stronger the pull it exerts on the applications and services built around it. Hence the name data gravity.

It's estimated that 2.5 million terabytes of data are created every day. Data is everything nowadays. The sheer volume of data across the world is astonishing. You might be surprised at the amount of data in your application as well.

In serverless applications, data gravity comes from a handful of different things:

Database records - Perhaps the most obvious source of data gravity. Whether you use DynamoDB or Aurora, records accumulate with every interaction in your app. As your consumers use your APIs, records are saved and the data gravity increases.

Document storage - In AWS, S3 holds everything from analytics data, to videos and pictures, to archives. It even allows you to host static websites from it. Over time, the number of objects in S3 can grow to petabyte, exabyte, or higher scale, leading to significant data gravity.

Logs - By default your logs are stored in CloudWatch in AWS. All the execution, error, and debug logs across your Lambda functions, Step Function workflows, and API Gateways are consolidated in a single location. With serverless apps, logs are the key to successful monitoring and troubleshooting. As your app runs, your logs will grow, resulting in yet another source of data gravity.

Events - According to serverless design principles, you should be building your applications to trigger transactions from events. Events come from many sources, like SNS, SQS, and EventBridge. In a fault-tolerant application, you should be persisting events so you can replay them in a recovery scenario. Services like EventBridge have an event archive that does just that (a sketch of one follows this list). This creates another source of data gravity you must consider.
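
As an example, a catch-all EventBridge archive is a single CloudFormation resource. This is a minimal sketch, assuming the default event bus and placeholder values for the archive name and retention period:

AllEventsArchive:
  Type: AWS::Events::Archive
  Properties:
    ArchiveName: all-events-archive   # hypothetical name
    Description: Archive of every event published to the default bus
    # ARN of the default event bus in this account and region
    SourceArn: !Sub arn:aws:events:${AWS::Region}:${AWS::AccountId}:event-bus/default
    RetentionDays: 30                 # keep archived events for 30 days; 0 means keep indefinitely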

Minimizing the Reach of Data Gravity

As you can see, data gravity has a pretty large reach in serverless applications. Critical data accumulates in so many places it feels impossible to get the data from all the nooks and crannies of your app.

To minimize the amount of effort in the event that you actually do pack up and change vendors, start by consolidating data in as few locations as possible. You can do this by aggregating your content in an S3 bucket or storing additional data in the database.

For logs, you can export them directly to S3. Since you typically don't want to keep logs around forever, you can automatically archive or delete them by configuring lifecycle rules on the bucket, scoped to the whole bucket or to specific prefixes.
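
As a rough sketch, the bucket receiving those exports could archive and then expire the logs on its own. The prefix and retention windows below are placeholders:

LogArchiveBucket:
  Type: AWS::S3::Bucket
  Properties:
    LifecycleConfiguration:
      Rules:
        - Id: ArchiveThenExpireLogs
          Status: Enabled
          Prefix: exported-logs/      # assumed prefix the log exports are written to
          Transitions:
            - StorageClass: GLACIER   # move to cold storage after 30 days
              TransitionInDays: 30
          ExpirationInDays: 365       # delete entirely after a year

The export itself is a separate step (a CloudWatch Logs export task or a subscription to a delivery stream), and the bucket policy needs to allow CloudWatch Logs to write to the bucket.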

Events are not as straightforward to consolidate. You need to take matters into your own hands when it comes to consolidating events in a location outside of EventBridge. One solution for backing up events is to configure a generic rule to capture all events, pass them to a Lambda function, and have the function write the contents of each event to DynamoDB.

BackupEventBridgeEventsFunction:
  Type: AWS::Serverless::Function
  Properties:
    CodeUri: functions/backup-eventbridge-events
    Handler: index.handler        # assumes a handler in the CodeUri folder (or set these in Globals)
    Runtime: nodejs18.x
    Policies:
      - AWSLambdaBasicExecutionRole
      - Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - dynamodb:PutItem  # only needs to write backup records
            Resource: !GetAtt EventBackupTable.Arn
    Events:
      EventFired:
        Type: EventBridgeRule
        Properties:
          Pattern:
            account:              # catch-all: every event carries the account id
              - !Ref AWS::AccountId

This matches every event originating in your AWS account and sends it to the BackupEventBridgeEvents function, which can then write the contents to DynamoDB in a way that lets them be sorted in chronological order.
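
One way to model that table (the key names and layout here are just one possible sketch) is to partition by event source and sort by the event's timestamp plus its unique id, so a query returns events in order:

EventBackupTable:
  Type: AWS::DynamoDB::Table
  Properties:
    BillingMode: PAY_PER_REQUEST    # no capacity planning for an unpredictable backup workload
    AttributeDefinitions:
      - AttributeName: pk           # e.g. the event source ("aws.s3", "my.custom.app")
        AttributeType: S
      - AttributeName: sk           # e.g. "<event time>#<event id>" so items sort chronologically
        AttributeType: S
    KeySchema:
      - AttributeName: pk
        KeyType: HASH
      - AttributeName: sk
        KeyType: RANGE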

By backing up your logs to S3 and your events to Dynamo, you now have only two locations to consider when migrating your important data.

Is It Worth It?

Since you're already practicing serverless, you're already pretty locked in to your cloud vendor. The chances of you being able to successfully switch vendors in a timely manner in the event of a disaster are pretty slim. And that is ok!

One of the major benefits you get when you go "all in" on a vendor like AWS is that all the services integrate easily with each other. You use the tools to their maximum potential because you aren't creating workarounds to avoid being locked in.

This leads to faster build times, more innovation, and ultimately cheaper cloud costs. Use the services the way they were intended to be used and you will build the best application possible.

Ultimately, I'd say "no, it isn't worth it" to implement a pick-up-and-run strategy. Now, it might be useful to back up your logs and events in case you have an actual disaster. You would need to replay the events that occurred in the system so you can rebuild your data and hit your RPO.

But in my opinion, implementing alternative measures because you don't trust your cloud vendor is not worth the time.

To the vendor-agnostic purists: there's never a scenario where moving to something else is truly zero work. If you decided to host your own database on an EC2 instance, you'd still have work to transfer it somewhere else. EBS volumes are specific to AWS. You can't pick up the volume as-is and just start using it in Azure. You'd need some sort of backup/convert/restore process to get your data successfully migrated.

Conclusion

Data gravity is real. The more data you accumulate over time, the harder it is to move to something else. That's true whether the data lives with your cloud vendor or with a third-party vendor providing a utility service.

The longer you use something, the more difficult it is to leave.

But that is a good problem to have. Your developers will build expertise with the cloud vendor. Moving to a different provider will force your engineers to relearn the nuance of how things work, which has a cost of its own.

I find the highest success comes from going all-in. Fully commit to your solution. Wavering and doubt will introduce instability into your solution. Building your entire application as a giant provider model will drive up its complexity and make it harder to maintain.

So, is data gravity a concern for serverless apps? Not for those who are doing it right.

Happy coding!