
Implementing Data Contracts: 7 Key Learnings

Foreword: A Data Engineering Movement Afoot?

We ran into Andrew, a team lead and senior engineer at GoCardless, at our London IMPACT event.

He talked about implementing data contracts and how GoCardless has eschewed what has started to solidify as the industry-standard ELT approach to the modern data warehouse. This approach typically involves complex pipelines, large data dumps, and numerous in-warehouse transformations.

GoCardless’s ETL approach focuses on treating data like an API. They codify consumers’ data needs and schemas up front in a data contract, and then deliver the data pre-modeled into the data warehouse.

It’s becoming clear a data engineering movement is afoot as we have recently covered other organizations taking a similar approach: Convoy and Vimeo.

You can read more about Convoy’s approach from my post with their Head of Product, Data Platform, Chad Sanderson, The modern data warehouse is broken.

We also have an article covering Vimeo’s approach in a case study sourced from former VP Lior Solomon’s interview with the Data Engineering Show podcast.

Is this a data engineering best practice (or, as Joe Reis would call it, back to the future)?

Is it an alternative approach fit for a certain breed of organization? (Interestingly, these three organizations share a need for near real-time data and have services that produce copious first-party event/transactional data.) Or is it a passing fad?

Time will be the judge. But the emergence of this alternative approach reveals a few truths that data leaders should note:

Data contracts could become a key piece of the data freshness and quality puzzle — as explained by my colleague Shane Murray in the video below — and much can be learned from Andrew’s experience implementing them at GoCardless, detailed here.

7 Key Learnings From Our Experience Implementing Data Contracts

A perspective from Andrew Jones, GoCardless

One of the core values at GoCardless is to ask why.

After meeting regularly with data science and business intelligence teams and hearing about their data challenges, I began to ask “why are data issues arising from service changes made upstream?”

This post will answer that question and explain how asking it led to the implementation of data contracts at our organization. I will cover some of the technical components I have discussed in a previous article series, with a deeper dive into the key learnings.

Data quality challenges upstream

What I found in my swim upstream were well-meaning engineers modifying services, unaware that something as simple as dropping a field could have major implications for dashboards (or other consumers) downstream.

Part of the challenge was that data was an afterthought, but another part was that our most critical data was coming directly from our services’ databases via change data capture (CDC).

The problem with investing in code and tooling to transform the data after it’s loaded is that when schemas change, those efforts either lose value or need to be re-engineered.

Image courtesy of Andrew Jones.

So there were two problems to solve. On the process side, data needed to become a first class citizen during upstream service updates, including proactive downstream communication. On the technology side, we needed a self-service mechanism to provision capacity and empower teams to define how the data should be ingested.

Fortunately, software engineers solved this problem long ago with the concept of APIs. These are essentially contracts with documentation and version control that allow the consumer to rely on the service without fear it will change without warning and negatively impact their efforts.

At GoCardless we believed data contracts could serve this purpose, but as with any major initiative, it was important to first gather requirements from the engineering teams (and maybe sell a bit too).

Talking to the team

We started talking to every engineering team at the company. We would explain what data contracts were, extol their benefits, and solicit feedback on the design.

We got some great, actionable feedback.

For example, most teams didn’t want to use Avro, so we decided to use JSON as the interchange format for the contracts because it’s extensible. The privacy and security team helped us build in privacy by design, particularly around data handling and categorizing which entity owned which data asset.

We also found these teams had a wide range of use cases and wanted to make sure the tooling was flexible enough. In fact, what they really wanted was autonomy.

In our previous setup, once data came into the data warehouse from the CDC, a data engineer would own that data and everything that entailed in supporting downstream services. When those teams wanted to give a service access to BigQuery or change their data, they would need to go through us.

Not only was that helpful feedback, but it became a key selling point for the data contract model. Once the contract was developed, they could select from a range of tooling, didn’t have to conform to our opinions about how the data was structured, and could really control their own destiny.

Let’s talk about how it works.

Our data contract architecture and process

Image courtesy of Andrew Jones.

The data contract process is completely self-serve.

It starts with the data team using Jsonnet to define their schemas, categorize the data, and choose their service needs. Once the JSON file is merged in GitHub, dedicated BigQuery and Pub/Sub resources are automatically deployed and populated with the requested data via a Kubernetes cluster.

For instance, here is one of our data contracts, abridged to show only two fields.

An abridged data contract courtesy of Andrew Jones.
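To give a sense of the shape of these files, here is a minimal Jsonnet sketch of what a contract along these lines might look like. The field names, categories, and output keys below are illustrative assumptions rather than our actual schema.

// Hypothetical, abridged data contract sketch. Field names, categories,
// and output keys are illustrative only, not the real GoCardless schema.
{
  name: 'payments_events',              // contract name (assumed)
  version: 1,                           // contracts are versioned, like an API
  owner: 'payments-team',               // the team generating the data (assumed)
  fields: [
    {
      name: 'payment_id',
      type: 'string',
      required: true,
      category: 'internal_identifier',  // privacy-by-design data categorization
    },
    {
      name: 'amount',
      type: 'integer',
      required: true,
      category: 'financial',
    },
  ],
  // Resources to provision on merge (assumed keys): a Pub/Sub topic for
  // asynchronous events and a BigQuery table for analytical consumers.
  outputs: {
    pubsub: { topic: 'payments-events' },
    bigquery: { dataset: 'payments', table: 'events' },
  },
}

Because Jsonnet compiles down to plain JSON, a single definition like this can drive both the schema the generating team commits to and the automated provisioning of the BigQuery and Pub/Sub resources described above.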

The process is designed to be completely automated and decentralized. This minimizes the dependencies across teams so that if one team needs to increase the workers for its service, it doesn’t impact another team’s performance.

The Internal Risk division was one of the first teams to utilize the new system. They use data across different features to identify and mitigate risk from potentially fraudulent behavior.

Previously they were leveraging data from the production database just because it was there. Now they have a much more scalable environment in BigQuery that they can manage directly.

As of this writing, we are 6 months into our initial implementation and excited by the momentum and progress. Roughly 30 different data contracts have been deployed, which are now powering about 60% of asynchronous inter-service communication events.

We have made steady inroads working with traditional data consumers such as analytics and data science teams and plan to help migrate them over to this system as we decommission our CDC pipeline at some point in the next year.

Most importantly, our data quality has started to improve, and we are seeing some great examples of consumers and generators working together to set the data requirements — something that rarely happened before.

Key learnings

So what have we learned?

Tradeoffs

No data engineering approach is free of tradeoffs. In pursuing this data contract strategy, we made a couple of deliberate decisions, including:

The future of data contracts at GoCardless

Data contracts are a work in progress at GoCardless. It is a technical and cultural change that will require commitment from multiple stakeholders.

I am proud of how the team and organization has responded to the challenge and believe we will ultimately fully migrate to this system by the middle of next year, and as a result will see upstream data incidents plummet.

While we continue to invest in these tools, we are also starting to look more at what we can do to help our consumers discover and make use of the data. Initially this will be through a data catalog, and in future we will consider building on that further to add things like data lineage, SLOs and other data quality measures.

If GoCardless appeals to you and you would like to find out more about life at GoCardless, you can find our posts on Twitter, Instagram, and LinkedIn.

Are you interested in joining GoCardless? See our jobs board here.

Interested in learning more about data quality? Talk to the experts at Monte Carlo.
