Adapting Site Reliability Engineering Principles to Data Reliability Engineering | Sandip Sahota, BMO


Written by Sandip Sahota

Published on 18 Feb 2025


Modern data systems are complex: they span vast pipelines, thousands of datasets, and a diverse array of users. With this complexity comes the inevitability of things going wrong. In a recent presentation, Sandip Sahota discussed the foundational principles of Data Reliability Engineering (DRE), adapted from Site Reliability Engineering (SRE), as a guide to managing and optimizing these intricate systems. Here’s a breakdown of these concepts:

1. Embrace Risk

Data systems are inherently unpredictable. Sahota emphasized the importance of moving away from unrealistic expectations of perfect data quality. Instead, organizations must adopt a risk management mindset, identifying what levels of risk are acceptable based on their unique business context.  

For instance, the risk tolerance of a bank will differ vastly from that of a startup. Preparing for potential disruptions is far more practical than striving for an unachievable perfection.
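One way to make "acceptable risk" concrete is to turn a reliability target into a count of failures a team is willing to absorb, which SRE practice commonly calls an error budget. The sketch below is illustrative only: the function names, the targets (99.9% and 95%), and the run counts are assumptions, not figures from the presentation.

```python
import math


def allowed_failures(target: float, total_runs: int) -> int:
    """Return how many runs may fail before the reliability target is breached.

    target is the fraction of runs that must succeed; total_runs is the
    number of pipeline runs in the period. Both are hypothetical inputs.
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    if total_runs < 0:
        raise ValueError("total_runs must not be negative")
    # Rounding first removes floating-point noise such as 50.00000000000004.
    return math.floor(round((1.0 - target) * total_runs, 6))


def budget_remaining(target: float, total_runs: int, failed_runs: int) -> int:
    """Failures still tolerable this period; negative means the target is breached."""
    return allowed_failures(target, total_runs) - failed_runs


# Hypothetical tolerances: a bank reporting pipeline vs. a startup dashboard.
bank_budget = allowed_failures(target=0.999, total_runs=1000)
startup_budget = allowed_failures(target=0.95, total_runs=1000)
```

With these assumed numbers, the bank can absorb 1 failed run per 1,000 and the startup 50. The same arithmetic applies either way; only the agreed target changes with the business context.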

2. Set Standards

When someone claims the “data is wrong,” what does that mean?  

DRE calls for clear standards and metrics through Service Level Agreements (SLAs) and Service Level Indicators (SLIs). The agreements define critical parameters, such as when data should arrive and what constitutes “quality,” and the indicators are the measurements that show whether those parameters are being met.

Monitoring systems can then track these indicators in real time, offering actionable insights and ensuring accountability. This approach replaces the guesswork and inefficiency of informal communication about data issues.
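As a rough illustration, a timeliness indicator can be computed as the share of deliveries that arrived by their agreed deadline and then compared with the agreed target. This is a minimal sketch under stated assumptions: the `Delivery` record, the function names, and the 90% target in the usage note are hypothetical, and a real monitoring system would read arrival times from pipeline metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class Delivery:
    """One dataset delivery (hypothetical record shape)."""
    expected_by: datetime  # deadline agreed in the SLA
    arrived_at: datetime   # when the data actually landed


def freshness_sli(deliveries: Sequence[Delivery]) -> float:
    """Fraction of deliveries that arrived on or before their deadline."""
    if not deliveries:
        raise ValueError("at least one delivery is required")
    on_time = sum(1 for d in deliveries if d.arrived_at <= d.expected_by)
    return on_time / len(deliveries)


def meets_target(sli: float, target: float) -> bool:
    """True when the measured indicator satisfies the agreed target."""
    return sli >= target
```

With an indicator like this, "the data is wrong" becomes a statement that can be checked: for example, 3 of 4 deliveries on time gives an SLI of 0.75, which misses an assumed 0.9 target.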

3. Reduce Toil

Repetitive tasks consume valuable time and energy. Automation is the cornerstone of reducing toil in data systems. Typical candidates include:

  • Monitoring pipelines
  • Handling API token rotations
  • Automating routine processes 

Sahota highlighted the importance of having a complete view of the data pipeline, allowing for quick detection and resolution of issues—often without human involvement.
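Token rotation is a good example of toil that can be scripted. The sketch below assumes a hypothetical `ApiToken` record and a caller-supplied `issue_token` callback standing in for whatever credential service a team actually uses; the seven-day lead time is likewise an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable


@dataclass(frozen=True)
class ApiToken:
    """A credential with a known expiry (hypothetical record shape)."""
    value: str
    expires_at: datetime


def needs_rotation(token: ApiToken, now: datetime,
                   lead_time: timedelta = timedelta(days=7)) -> bool:
    """True when the token expires within lead_time of now (or already has)."""
    return token.expires_at - now <= lead_time


def rotate_if_needed(token: ApiToken, now: datetime,
                     issue_token: Callable[[], ApiToken],
                     lead_time: timedelta = timedelta(days=7)) -> ApiToken:
    """Return a fresh token when rotation is due, otherwise the current one.

    issue_token is a placeholder for the real issuing call, which is not
    specified in the source.
    """
    if needs_rotation(token, now, lead_time):
        return issue_token()
    return token
```

Run on a schedule, a check like this replaces a calendar reminder and a manual step, and passing `now` explicitly keeps the logic easy to test.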

4. Control Releases

Changes to systems, such as updates to ETL (Extract, Transform, Load) processes or schema modifications, can ripple across an organization, potentially disrupting dashboards or machine learning models. Effective release management involves processes like:  

  • Code reviews 
  • Unit testing 
  • Thorough communication before implementation 

By controlling how and when changes are made, teams can prevent costly errors and maintain system stability.
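A release check for schema modifications might look like the following sketch, which compares the current and proposed schemas and lists the changes likely to break downstream consumers. The representation of a schema as a column-to-type mapping and the function name `breaking_changes` are assumptions made for illustration.

```python
from typing import Dict, List


def breaking_changes(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """List changes that could break consumers of the old schema.

    Removed columns and changed types are reported; added columns are
    treated as non-breaking in this simplified sketch.
    """
    problems: List[str] = []
    for column, old_type in old.items():
        if column not in new:
            problems.append(f"removed column: {column}")
        elif new[column] != old_type:
            problems.append(f"type changed: {column} {old_type} -> {new[column]}")
    return problems


# Hypothetical schemas before and after a proposed ETL change.
current = {"id": "int", "amount": "float"}
proposed = {"id": "int", "amount": "str", "note": "str"}
report = breaking_changes(current, proposed)
```

Here `report` contains a single entry flagging the type change on `amount`. A unit test that asserts the report is empty, run during code review, gives the team a chance to communicate the change before dashboards or models are affected.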

5. Maintain Simplicity

While enterprise data systems are rarely simple, unnecessary complexity is an avoidable source of risk and inefficiency. Sahota encouraged teams to actively seek opportunities to simplify their systems, whether by consolidating processes or eliminating redundant components. Simpler systems are not only easier to maintain but also more resilient to disruptions.

The Proven Foundations of DRE

These principles of DRE are not new; they are adaptations of the time-tested practices of SRE teams at organizations like Google, which manage massive systems like YouTube and Google Search. The key takeaway? While the technology and scope may differ, the fundamentals of reliability engineering remain universally applicable.

By embracing these principles, organizations can build data systems that are not only more resilient but also more aligned with the needs of their users and stakeholders. These practices empower teams to focus on innovation and strategic initiatives rather than constantly putting out fires.

 

To learn more about the newest trends in Data & AI, join us at the upcoming Big Data Summit Canada. Visit https://www.bigdatasummitcanada.com/ to register.

To learn more about the insights shared in this session, watch the full recording.
