{"id":107018,"date":"2025-07-10T01:06:41","date_gmt":"2025-07-10T01:06:41","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=107018"},"modified":"2025-06-03T01:55:55","modified_gmt":"2025-06-03T01:55:55","slug":"devops-vs-sre-bridging-the-gap-not-building-walls-part-2-putting-it-into-practice","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/devops\/devops-vs-sre-bridging-the-gap-not-building-walls-part-2-putting-it-into-practice\/","title":{"rendered":"DevOps vs. SRE: Bridging the Gap, Not Building Walls (Part 2 &#8211; Putting it into Practice)"},"content":{"rendered":"\n<p>In <a href=\"https:\/\/www.red-gate.com\/simple-talk\/devops\/devops-vs-sre-bridging-the-gap-not-building-walls-part-1\/\">Part 1 of this series<\/a>, we covered the operational overlap between DevOps and <a href=\"https:\/\/www.red-gate.com\/simple-talk\/devops\/culture\/site-reliability-engineering-vs-devops\/\">Site Reliability Engineering<\/a> (SRE). While DevOps emerged from the need for agile and automated software delivery cycles, SRE has its roots in teams doing systems engineering. SRE emphasizes stability, observability, and proactive failure management. On the surface, they might appear to serve different priorities (speed versus stability), but both aim to build resilient systems that deliver value continuously and reliably.<\/p>\n\n\n\n<p>Now, in Part 2, we shift focus from theory to practice. We will explore how organizations can harmoniously integrate both practi: the agility of DevOps and the resilience of SRE. During integration, it requires deliberate changes to culture, tooling, metrics, and collaboration patterns. We\u2019ll examine how cross-functional teams can work in tandem. Additionally, we will cover how a unified performance framework can combine DORA metrics with SLOs and error budgets.<\/p>\n\n\n\n<p>We\u2019ll also look at real-world practices that teams can adopt. 
This guide will equip you with practical steps to build software at scale without compromising reliability.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-culture-as-the-foundation\">Culture as the Foundation<\/h2>\n\n\n\n<p>Organizational culture determines how teams react under pressure, how they collaborate, and how they learn from failures. Without a shared culture rooted in trust, transparency, and ownership, even the most advanced tooling or process redesigns won\u2019t succeed. When integrating DevOps and SRE, culture is the first and most crucial frontier.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-shared-ownership-and-blameless-culture\">Shared Ownership and Blameless Culture<\/h3>\n\n\n\n<p>Both DevOps and SRE emphasize a culture of ownership. Instead of assigning blame when things go wrong, high-performing teams focus on continuous learning. This mindset is supported by practices such as blameless postmortems, which encourage open discussions about incidents without fear of retribution, even in high-stakes scenarios.<\/p>\n\n\n\n<p>That doesn\u2019t mean avoiding individual accountability. If someone accidentally deletes the production sales database and causes, for example, a $10,000-a-minute outage, the goal isn&#8217;t to gloss over what happened. It\u2019s to understand why the system allowed a single person to cause such a failure, and how to prevent it in the future. A blameless approach focuses on improving processes so that the same mistakes are not repeated.<\/p>\n\n\n\n<p>A practical step is to institute joint retrospectives that involve both development and <a href=\"https:\/\/www.red-gate.com\/simple-talk\/devops\/culture\/site-reliability-engineering-vs-devops\/\">SRE teams<\/a>. 
For example, after a production outage, a team may hold a postmortem that includes:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li> A timeline of the incident <\/li>\n\n\n\n<li> Decisions made at each point <\/li>\n\n\n\n<li> Communication gaps <\/li>\n\n\n\n<li> Remediations and follow-up tasks <\/li>\n<\/ul>\n<\/div>\n\n\n<p>Such retrospectives, when shared transparently across teams, prevent knowledge silos and foster a shared understanding of system fragility. This learning cycle improves future resilience and helps ensure that the same mistakes aren\u2019t repeated.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-practical-example\">Practical Example<\/h4>\n\n\n\n<p>One practical way to implement a blameless culture is to automate postmortems after high-severity incidents. Below is a GitHub Actions workflow (in YAML) that triggers when a GitHub issue labeled <code>sev1<\/code> is closed. It automatically creates a new postmortem issue using a Markdown template. 
For more details, see <a href=\"https:\/\/docs.github.com\/en\/actions\/using-workflows\/workflow-syntax-for-github-actions\">GitHub Actions syntax reference<\/a> and this <a href=\"https:\/\/github.com\/peter-evans\/create-issue-from-file\">GitHub repo<\/a>.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"block\" highlight=\"false\" decode=\"true\">name: Trigger Postmortem\n\non:\n  issues:\n    types: [closed]\n\njobs:\n  create-postmortem:\n    if: contains(github.event.issue.labels.*.name, 'sev1')\n    runs-on: ubuntu-latest\n    steps:\n      - name: Create Postmortem Issue\n        uses: peter-evans\/create-issue-from-file@v4\n        with:\n          title: 'Postmortem - ${{ github.event.issue.title }}'\n          content-filepath: '.github\/PULL_POSTMORTEM_TEMPLATE.md'<\/pre><\/div>\n\n\n\n<p><strong>What\u2019s in the postmortem output?<\/strong><\/p>\n\n\n\n<p>A basic postmortem Markdown template (e.g., the <code>PULL_POSTMORTEM_TEMPLATE.md<\/code> referenced in the previous code sample) may include some basic system information, to which you would add details and follow-ups after studying and discussing the issues that occurred:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"block\" highlight=\"false\" decode=\"true\">## Postmortem - Checkout API Outage\n\n**Date\/Time of Incident:** April 12, 2025 \u2013 14:32 UTC\n**Lead Investigator:** @jdoe\n\n### Summary\nThe checkout service returned 500 errors to all users due \nto a misconfigured database migration.\n\n### Impact\n- 14,000 users affected\n- 21 minutes of downtime\n- Estimated revenue loss: $8,200\n\n### Timeline\n- 14:32 \u2013 Deployment started\n- 14:34 \u2013 Error rate spike detected\n- 14:36 \u2013 Incident escalated via PagerDuty\n- 14:50 \u2013 Rollback completed\n\n### Root Cause\nA missing column in the production database schema caused the \napplication to crash when writing new 
orders.\n\n### Lessons Learned\n- Need stricter schema validation in CI\n- Migration tested in staging did not match production\n\n### Action Items\n- [ ] Add schema diff checks to pre-deploy hook\n- [ ] Update staging database snapshot weekly\n- [ ] Schedule follow-up review \u2013 @alice (Due: Apr 19)<\/pre><\/div>\n\n\n\n<p>This makes postmortems consistent, quick to generate, and easy to share across teams. You can customize the Markdown template to include other fields like severity ratings, customer comms, or Slack channel logs.<\/p>\n\n\n\n<p><strong>Explanation of the YAML Code Snippet for GitHub Actions:<\/strong><\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><code>name: Trigger Postmortem<\/code>: Names the GitHub Actions workflow so it&#8217;s easily identifiable in the Actions tab. <\/li>\n\n\n\n<li><code>on: issues: types: [closed]:<\/code> This configuration tells GitHub to run the workflow when an issue is closed, which aligns with how incidents are often tracked \u2014 each as a separate issue. <\/li>\n\n\n\n<li><code>if: contains(github.event.issue.labels.*.name, 'sev1')<\/code>: Ensures the workflow only runs if the issue being closed has a <code>sev1<\/code> label. This label typically marks high-priority incidents (e.g., a major outage), making sure postmortems are only generated for significant events. <\/li>\n\n\n\n<li><code>runs-on: ubuntu-latest<\/code>: Specifies the operating system for the GitHub Actions runner. Using <code>ubuntu-latest<\/code> ensures compatibility with most community actions and scripts. <\/li>\n\n\n\n<li><code>steps<\/code>: This section defines the actual work done by the job. In this case: <div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><code>- name: Create Postmortem Issue<\/code>: A human-readable step name for clarity. 
<\/li>\n\n\n\n<li><code>uses: <\/code><a href=\"https:\/\/github.com\/peter-evans\/create-issue-from-file\">peter-evans\/create-issue-from-file@v4<\/a>: A third-party GitHub Action used to create a new GitHub issue from a predefined Markdown file. This action automates the creation of structured postmortem documentation. <\/li>\n\n\n\n<li><code>with<\/code>: parameters: <div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><code>title<\/code>: Dynamically sets the new issue title by referencing the original issue title \u2014 helpful for tracking and traceability. <\/li>\n\n\n\n<li><code>content-filepath<\/code>: Points to the Markdown template (<code>.github\/PULL_POSTMORTEM_TEMPLATE.md<\/code>) used to create the postmortem content. This file typically includes standard fields like Impact, Timeline, Root Cause, Lessons Learned, and Action Items. <\/li>\n<\/ul>\n<\/div><\/li>\n<\/ul>\n<\/div><\/li>\n<\/ul>\n<\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"h-cross-functional-teams\">Cross-Functional Teams<\/h3>\n\n\n\n<p>Siloed teams, each working on their own piece of the system without knowledge of or concern for the other teams, lead to communication breakdowns and delayed responses. Integrating DevOps and SRE means breaking down these silos by building cross-functional teams that work together toward common goals, enhancing both agility and reliability. Effective communication between teams fosters proactive problem-solving and quick adaptation to changing system conditions.<\/p>\n\n\n\n<p>A powerful model is the <a href=\"https:\/\/www.atlassian.com\/devops\/frameworks\/sre-vs-devops\">DevOps-SRE rotation<\/a>. In this model, developers take on on-call duties (even for a few days per sprint) under the mentorship of experienced SREs. 
This approach helps developers better understand operational challenges, encourages them to write more resilient code, and ensures that everyone has a stake in the system&#8217;s stability.<\/p>\n\n\n\n<p>When developers switch roles, they gain direct experience with real-world issues that SREs face in production environments, from handling incidents to managing alert fatigue. Similarly, SREs get a chance to familiarize themselves with the design and functionality of new features, which helps in scaling and maintaining those features effectively. This reciprocal learning strengthens the bonds between development and operations, fostering empathy and creating a more resilient system overall.<\/p>\n\n\n\n<p>Another effective approach is embedding SREs within development squads for specific feature sprints. During this time, SREs can:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li> Review infrastructure implications of new features <\/li>\n\n\n\n<li> Perform capacity planning to ensure scalability and availability <\/li>\n\n\n\n<li> Identify Service Level Indicators (SLIs) and Service Level Objectives (SLOs) relevant to the new functionality <\/li>\n<\/ul>\n<\/div>\n\n\n<p>These practices ensure that operational requirements are considered early in the development process. 
It leads to smoother deployments and fewer performance-related surprises in production.<\/p>\n\n\n\n<p>To integrate developers and SREs into a cohesive on-call rotation, you can create on-call schedules using tools like <a href=\"https:\/\/www.pagerduty.com\/\">PagerDuty<\/a> or <a href=\"https:\/\/www.atlassian.com\/software\/opsgenie\">Opsgenie<\/a>.<\/p>\n\n\n\n<p><strong>Example of a Terraform snippet to provision an escalation policy for DevOps and SRE teams using PagerDuty:<\/strong><\/p>\n\n\n\n<p>Earlier, we mentioned how integrating DevOps and SRE teams can include shared on-call responsibilities, such as having developers rotate into on-call schedules under SRE mentorship. If you&#8217;re implementing that model, you\u2019d typically use something like <a href=\"https:\/\/www.pagerduty.com\/\">PagerDuty schedules<\/a> to manage alternating on-call duties automatically.<\/p>\n\n\n\n<p>However, in many real-world cases, teams prefer a fallback-style escalation, where an SRE is paged first, followed by a developer if the SRE doesn\u2019t respond in time. This model ensures that operational expertise is the first line of defense while still encouraging cross-functional awareness and shared responsibility.<\/p>\n\n\n\n<p>The following Terraform snippet creates a PagerDuty escalation policy using static user references. 
It escalates to the developer only if the primary SRE does not acknowledge the incident within 10 minutes.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"block\" highlight=\"false\" decode=\"true\">resource \"pagerduty_escalation_policy\" \"dev_sre_policy\" {\n  name = \"Dev-SRE Escalation Policy\"\n  num_loops = 2\n\n  rule {\n    escalation_delay_in_minutes = 10\n    target {\n      type = \"user_reference\"\n      id = pagerduty_user.sre1.id\n    }\n  }\n\n  rule {\n    escalation_delay_in_minutes = 10\n    target {\n      type = \"user_reference\"\n      id = pagerduty_user.dev1.id\n    }\n  }\n}<\/pre><\/div>\n\n\n\n<p><strong>Explanation of the Code Snippet:<\/strong><\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><code>resource \"pagerduty_escalation_policy\" \"dev_sre_policy\":<\/code> This line defines a new resource in Terraform for creating a PagerDuty escalation policy named <code>dev_sre_policy<\/code>. An escalation policy determines how incidents are escalated to different users if not resolved on time. <\/li>\n\n\n\n<li><code>name = \"Dev-SRE Escalation Policy\":<\/code> Specifies the name of the escalation policy. This is a human-readable identifier, so it&#8217;s easy to reference and understand its purpose. <\/li>\n\n\n\n<li><code>num_loops = 2<\/code>: This attribute defines how many times PagerDuty will loop through the escalation chain before the issue is considered unresolved. In this case, the policy will loop through two levels, ensuring that both the DevOps and SRE teams have an opportunity to handle the issue. <\/li>\n\n\n\n<li><code>rule<\/code>: This block defines the rules for escalating the incident. Each rule specifies the delay before the incident is escalated to the next user, and who the target user is for that escalation. <div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><code>escalation_delay_in_minutes = 10<\/code>: The incident will escalate to the next target user after 10 minutes if the current user does not respond. <\/li>\n\n\n\n<li><code>target { type = \"user_reference\" id = pagerduty_user.sre1.id }<\/code>: Specifies the first target for escalation, which is an SRE user (<code>sre1<\/code>). The <code>id<\/code> refers to the unique identifier of the user within PagerDuty. <\/li>\n\n\n\n<li><code>target { type = \"user_reference\" id = pagerduty_user.dev1.id }<\/code>: Similarly, if the SRE user does not acknowledge or resolve the incident, it will be escalated to the developer (<code>dev1<\/code>), using their unique ID. <\/li>\n\n\n\n<li><code>type = \"user_reference\":<\/code> Specifies the individual user as the target. This is the most common option. To use rotating schedules instead of specific users, replace the type with <code>schedule_reference<\/code> and refer to a defined on-call schedule. This is useful when teams want developers and SREs to alternate responsibilities without hardcoding users. See <a href=\"https:\/\/registry.terraform.io\/providers\/PagerDuty\/pagerduty\/latest\/docs\/resources\/escalation_policy\">PagerDuty Terraform documentation<\/a> for complete configuration options. <\/li>\n<\/ul>\n<\/div><\/li>\n<\/ul>\n<\/div>\n\n\n<p>This configuration ensures that both the DevOps and SRE teams are engaged in the incident response process, reducing the risk of alert fatigue and fostering a sense of shared responsibility. The escalation policy also ensures that there is no gap in incident resolution, maintaining system reliability and speed in resolving issues.<\/p>\n\n\n\n<p>This shared responsibility model helps both teams build empathy and technical depth by understanding each other\u2019s contexts and operational challenges. 
The integration of developers and SREs into a single, unified workflow improves collaboration and ensures a smoother path to achieving both system reliability and software agility.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-leadership-and-buy-in\">Leadership and Buy-In<\/h3>\n\n\n\n<p>Cultural transformation requires leadership to prioritize and model reliability. Leaders must treat reliability as a first-class feature, integrating it into team <a href=\"https:\/\/www.qlik.com\/us\/kpi\">KPIs<\/a> (Key Performance Indicators) and <a href=\"https:\/\/www.atlassian.com\/agile\/agile-at-scale\/okr\">OKRs<\/a> (Objectives and Key Results). This ensures that reliability is a shared responsibility across teams.<\/p>\n\n\n\n<p>For example, leaders might set the following OKRs to focus on reliability while maintaining delivery speed:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>Objective<\/strong>: Improve system reliability without slowing down delivery <div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>KR1<\/strong>: Maintain SLO (Service Level Objective) adherence &gt; 99.95% <\/li>\n\n\n\n<li><strong>KR2<\/strong>: Reduce change failure rate to &lt; 10% <\/li>\n\n\n\n<li><strong>KR3<\/strong>: Keep MTTR (Mean Time to Recovery) under 30 minutes <\/li>\n<\/ul>\n<\/div><\/li>\n<\/ul>\n<\/div>\n\n\n<p>These metrics ensure reliability is a key part of performance reviews. Leaders should actively review them in sprint reviews to reinforce the shared responsibility for reliability.<\/p>\n\n\n\n<p>Additionally, the \u201c<a href=\"https:\/\/www.thoughtworks.com\/insights\/decoder\/y\/you-build-it-you-run-it\">you build it, you run it<\/a>\u201d model works best when leadership provides support. 
Engineers should be empowered to own their work, but adequate resources and guidance are crucial to avoid burnout.<\/p>\n\n\n\n<p>Leaders should foster a culture of accountability and support by providing necessary tools, mentorship, and feedback, ensuring teams can deliver high-quality, reliable systems.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-tooling-that-unites-not-divides\">Tooling That Unites, Not Divides <\/h2>\n\n\n\n<p>A fragmented toolchain often reinforces silos. When DevOps and SRE teams work on separate platforms and dashboards, it limits visibility and creates tunnel vision. To successfully integrate these philosophies, teams must establish shared visibility, standardized workflows, and tooling choices that support both fast delivery and system reliability.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-common-tooling-stack\">Common Tooling Stack<\/h3>\n\n\n\n<p>The key to integration is aligning on a unified tooling stack across essential areas like CI\/CD, Infrastructure as Code (IaC), monitoring, and observability. 
A potential integrated stack could include:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>CI\/CD<\/strong>: <a href=\"https:\/\/github.com\/features\/actions\">GitHub Actions<\/a>, <a href=\"https:\/\/docs.gitlab.com\/ci\/\">GitLab CI\/CD<\/a>, Azure DevOps, and <a href=\"https:\/\/www.red-gate.com\/products\/flyway\/community\/\">Flyway<\/a> (for database migrations) <\/li>\n\n\n\n<li><strong>IaC<\/strong>: <a href=\"https:\/\/developer.hashicorp.com\/terraform\">Terraform<\/a> (for multi-cloud provisioning), <a href=\"https:\/\/www.pulumi.com\/\">Pulumi<\/a> (for code-native IaC), <a href=\"https:\/\/www.redhat.com\/en\/ansible-collaborative\">Ansible<\/a> (for configuration management) <\/li>\n\n\n\n<li><strong>Monitoring<\/strong>: <a href=\"https:\/\/prometheus.io\/\">Prometheus<\/a> for time-series metrics, <a href=\"https:\/\/grafana.com\/\">Grafana<\/a> for visualization and <a href=\"https:\/\/www.red-gate.com\/products\/redgate-monitor\/\">Redgate Monitor<\/a> (as an alternative for SQL Server environments) <\/li>\n\n\n\n<li><strong>Observability<\/strong>: <a href=\"https:\/\/www.datadoghq.com\/\">Datadog<\/a>, <a href=\"https:\/\/newrelic.com\/blog\/best-practices\/what-are-slos-slis-slas\">New Relic<\/a>, or <a href=\"https:\/\/opentelemetry.io\/\">OpenTelemetry<\/a> for tracing and logs <\/li>\n<\/ul>\n<\/div>\n\n\n<p>One of the most overlooked areas in DevOps pipelines is database change management. Traditional DevOps pipelines often exclude databases, treating them as manual bottlenecks. This leads to inconsistent deployment processes and makes it difficult to ensure smooth database changes. 
Redgate SQL Change Automation fills this gap by:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li> Enabling version control for SQL schema changes <\/li>\n\n\n\n<li> Running pre-deployment checks (linting, validation) <\/li>\n\n\n\n<li> Integrating with CI\/CD tools like Octopus Deploy and Azure Pipelines <\/li>\n<\/ul>\n<\/div>\n\n\n<p>This ensures that database changes follow the same process as application code, allowing both developers and SREs to contribute to safe and traceable database deployments.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-sample-pipeline-using-github-actions-and-flyway\">Sample Pipeline Using GitHub Actions and Flyway<\/h3>\n\n\n\n<p>For teams practicing database DevOps, it\u2019s essential to integrate schema migrations into the same CI\/CD pipelines that manage application deployments. Flyway, a widely adopted open-source database migration tool, enables version-controlled and testable database changes using SQL scripts or Java-based migrations.<\/p>\n\n\n\n<p>Here\u2019s a simplified example of how to automate database deployments with GitHub Actions and Flyway. 
This pipeline validates and deploys schema changes to a staging environment whenever code is pushed to the repository. It assumes the Flyway CLI is available on the runner.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"block\" highlight=\"false\" decode=\"true\">name: CI\/CD Pipeline\non: [push]\n\njobs:\n  build:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions\/checkout@v2  # Checkout the code repository\n      - name: Run Flyway Validate\n        run: |\n          flyway -url=jdbc:postgresql:\/\/localhost:5432\/mydb -user=dbuser -password=secret validate\n\n  deploy:\n    needs: build  # Deploy job runs only after a successful build\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions\/checkout@v2  # Each job runs on a fresh runner, so check out the migration scripts again\n      - name: Run Flyway Migrate\n        run: |\n          flyway -url=jdbc:postgresql:\/\/localhost:5432\/mydb -user=dbuser -password=secret migrate<\/pre><\/div>\n\n\n\n<p><strong>Explanation of the Workflow:<\/strong><\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><code>name: CI\/CD Pipeline<\/code>: Labels the GitHub Actions workflow. <\/li>\n\n\n\n<li><code>on: [push]<\/code>: Triggers the pipeline when new code is pushed. <\/li>\n\n\n\n<li><code>jobs:<\/code>: Defines the stages in the pipeline: <div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><code>build<\/code> job: Runs <code>flyway validate<\/code> to ensure all pending migrations are valid and properly formatted. <\/li>\n\n\n\n<li><code>deploy<\/code> job: Executes <code>flyway migrate<\/code> to apply the migrations to the staging database. <\/li>\n<\/ul>\n<\/div><\/li>\n<\/ul>\n<\/div>\n\n\n<p>Flyway supports a wide range of databases (PostgreSQL, SQL Server, MySQL, Oracle, etc.) and allows you to manage schema changes with plain <code>.sql<\/code> files or Java-based migrations. 
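<\/p>\n\n\n\n<p>For illustration, here is what a versioned migration script might look like. Flyway picks up files named with the <code>V&lt;version&gt;__&lt;description&gt;.sql<\/code> convention and applies them in order; the table and column names below are invented for the example:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"block\" highlight=\"false\" decode=\"true\">-- V2__add_order_total.sql\n-- A safe, additive change: a nullable column plus a supporting index.\nALTER TABLE orders ADD COLUMN total_amount NUMERIC(10, 2);\n\nCREATE INDEX idx_orders_total_amount ON orders (total_amount);<\/pre><\/div>\n\n\n\n<p>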
You can store these scripts in version control to ensure traceability and rollback support.<\/p>\n\n\n\n<p>Integrating Flyway into your CI\/CD pipeline ensures that database changes pass through the same validation and deployment controls as application code. This improves release consistency and reduces the risk of breaking changes during production deployments.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-sre-observability-meets-devops-pipelines\">SRE Observability Meets DevOps Pipelines<\/h3>\n\n\n\n<p>SREs often use Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to track system performance. However, these metrics are frequently managed outside the deployment lifecycle, which results in missed opportunities for proactive remediation. To address this gap, it&#8217;s essential to integrate reliability checks directly into the CI\/CD pipelines. This ensures that any performance issues are caught early and acted upon before they reach production.<\/p>\n\n\n\n<p>Key strategies to integrate SRE observability into DevOps pipelines include:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>Alerting as Code<\/strong>: Storing <a href=\"https:\/\/prometheus.io\/docs\/alerting\/latest\/alertmanager\/\">Prometheus Alertmanager<\/a> configurations in Git repositories so alerts can be version-controlled and deployed alongside code changes. <\/li>\n\n\n\n<li><strong>SLO Gates<\/strong>: Setting SLO thresholds for services (e.g., ensuring a 99.9% API success rate) and automatically failing builds or blocking promotions if these thresholds are breached. <\/li>\n\n\n\n<li><strong>Chaos Testing<\/strong>: Using tools like <strong>Gremlin<\/strong> or <strong>LitmusChaos<\/strong> to intentionally simulate adverse conditions such as network latency, CPU spikes, or pod crashes during pre-production testing. This helps assess how the system behaves under stress before it reaches production. 
<\/li>\n<\/ul>\n<\/div>\n\n\n<p>Redgate\u2019s Flyway integration can enhance this process by embedding database SLOs into the CI\/CD pipeline. For instance, if a database schema migration is expected to take more than 10 seconds, the deployment is halted, and a remediation workflow is triggered.<\/p>\n\n\n\n<p style=\"padding-right:0;padding-left:var(--wp--preset--spacing--md)\"><strong>What is a database migration?<\/strong> In this context, a migration refers to any database schema change applied to a target environment, such as creating or altering tables, adding indexes, modifying constraints, or running data-population scripts. Migrations can range from safe, additive changes (like adding a column) to non\u2013backward-compatible operations such as dropping tables, renaming columns, or restructuring schemas.<\/p>\n\n\n\n<p>Given this variability, teams should scope their SLOs based on the nature and risk level of each migration. For instance:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>Safe changes<\/strong> (e.g., adding a nullable column) may skip strict SLO enforcement. <\/li>\n\n\n\n<li><strong>High-impact or irreversible changes<\/strong> (e.g., column drops, large data rewrites) should trigger latency SLO checks and rollback readiness protocols. <\/li>\n<\/ul>\n<\/div>\n\n\n<p>Here\u2019s an example shell script to enforce a 10-second migration time threshold using Flyway:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"block\" highlight=\"false\" decode=\"true\">START=$(date +%s)\n\nflyway -url=jdbc:postgresql:\/\/localhost:5432\/mydb -user=dbuser -password=secret migrate\n\nEND=$(date +%s)\nDURATION=$((END - START))\n\nTHRESHOLD=10\n\nif [ \"$DURATION\" -gt \"$THRESHOLD\" ]; then\n  echo \"\u26a0\ufe0f Migration exceeded latency threshold ($DURATION seconds &gt; $THRESHOLD seconds). 
Aborting.\"\n  exit 1\nfi<\/pre><\/div>\n\n\n\n<p>This script can be extended to evaluate change types based on naming conventions or metadata (e.g., flagging destructive operations for stricter checks<\/p>\n\n\n\n<p>Here\u2019s another shell script you can use in a Jenkins pipeline to detect when a latency-based Service Level Objective (SLO) limit has been breached. While this script doesn&#8217;t enforce a rollback, it can be used as a gate that stops the deployment process early, helping teams catch issues before they reach production.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"block\" highlight=\"false\" decode=\"true\">LATENCY=$(curl -s http:\/\/metrics\/api_latency | jq .p95)  # Fetches the 95th percentile latency metric\nTHRESHOLD=500  # Sets the threshold for acceptable latency in milliseconds\n\nif [ \"$LATENCY\" -gt \"$THRESHOLD\" ]; then\n  echo \"Latency SLO breached: ${LATENCY}ms exceeds ${THRESHOLD}ms.\"\n  echo \"Aborting deployment to prevent pushing underperforming code.\"\n  exit 1\nfi<\/pre><\/div>\n\n\n\n<p>This script functions as an early exit checkpoint\u2014if the 95th percentile latency exceeds the defined threshold it prevents the pipeline from progressing further. In a full implementation, this would typically be followed by rollback commands or automated remediation steps, such as reverting to a stable build, notifying on-call engineers via Slack or PagerDuty, or logging the event for audit purposes.<\/p>\n\n\n\n<p>Note<em>:<\/em> This snippet is intentionally simplified to demonstrate the SLO check. 
In production systems, you&#8217;d likely wrap this in a larger script that includes rollback logic, notification workflows, or integration with deployment platforms.<\/p>\n\n\n\n<p><strong>More Explanation of the Code Snippet:<\/strong><\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><code>LATENCY=$(curl -s http:\/\/metrics\/api_latency | jq .p95):<\/code> This command fetches the 95th percentile latency (<code>p95<\/code>) from a metrics endpoint (<code>http:\/\/metrics\/api_latency<\/code>) using <code>curl<\/code> and processes the result with <code>jq<\/code> to extract the desired metric. The value is assumed to be an integer number of milliseconds, since the shell\u2019s <code>-gt<\/code> only compares integers. <\/li>\n\n\n\n<li><code>THRESHOLD=500<\/code>: Sets a threshold for latency (in milliseconds). In this case, if the latency exceeds 500ms, the deployment is considered to have failed. <\/li>\n\n\n\n<li><code>if [ \"$LATENCY\" -gt \"$THRESHOLD\" ]; then<\/code>: The script compares the fetched latency with the threshold. If the latency exceeds the threshold, it triggers the actions inside the if block. <\/li>\n\n\n\n<li><code>echo<\/code> statements: Output messages to the console indicating that the latency threshold has been breached and that the deployment is being aborted. <\/li>\n\n\n\n<li><code>exit 1<\/code>: Exits the script with a non-zero status, which in Jenkins causes the deployment to fail, preventing the release of potentially faulty code. <\/li>\n<\/ul>\n<\/div>\n\n\n<p>This type of gate is crucial because it prevents poorly performing builds from reaching production, thereby protecting the end-users from experiencing degraded service. By embedding SLO checks into the CI\/CD pipeline, teams can proactively ensure that reliability objectives are met before a deployment progresses to production.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-aligning-metrics-and-kpis\">Aligning Metrics and KPIs<\/h2>\n\n\n\n<p>Metrics are powerful tools for driving behavior. 
If your team\u2019s metrics aren\u2019t aligned, their priorities won\u2019t be either. DevOps tends to prioritize speed, measured by <a href=\"https:\/\/dora.dev\/guides\/dora-metrics-four-keys\/\">DORA metrics<\/a>, while SRE emphasizes stability, tracked via SLIs (Service Level Indicators), SLOs (Service Level Objectives), and error budgets. An effective strategy strikes a balance between both.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-merge-dora-slo-metrics\">Merge DORA + SLO Metrics<\/h3>\n\n\n\n<p>To get a complete picture of software performance, it&#8217;s crucial to combine the following key metrics from both DevOps and SRE:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>DORA Metrics<\/strong>: <div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>Deployment frequency<\/strong>: How often code is deployed to production. <\/li>\n\n\n\n<li><strong>Lead time for changes<\/strong>: The time it takes for a code change to go from development to production. <\/li>\n\n\n\n<li><strong>Change failure rate<\/strong>: The percentage of changes that fail in production. <\/li>\n\n\n\n<li><strong>MTTR (Mean Time to Recovery)<\/strong>: The time it takes to recover from a failure. <\/li>\n<\/ul>\n<\/div><\/li>\n\n\n\n<li><strong>SRE Metrics<\/strong>: <div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>SLO adherence<\/strong>: How well your system meets its reliability goals. <\/li>\n\n\n\n<li><strong>Error budget burn rate<\/strong>: The rate at which the error budget is being consumed, indicating the system&#8217;s stability. <\/li>\n\n\n\n<li><strong>Time spent on toil<\/strong>: Operational work that is manual, repetitive, and doesn\u2019t contribute to long-term reliability. 
<\/li>\n<\/ul>\n<\/div><\/li>\n<\/ul>\n<\/div>\n\n\n<p>When you merge these metrics, you can gain a more comprehensive understanding of both the speed (from DORA) and stability (from SRE) of your software. For example, if your team deploys multiple times daily, but constantly violates SLOs with incidents spiking post-deployment, this signals a reliability issue that needs addressing.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-how-integration-looks-in-practice\">How Integration Looks in Practice<\/h3>\n\n\n\n<p>To integrate both sets of metrics, here are some practical steps:<\/p>\n\n\n\n<p><strong>Tag deployments in your observability tools (e.g., Grafana, Datadog, New Relic)<\/strong><\/p>\n\n\n\n<p>Deployment markers help correlate changes with performance shifts. For instance, when a new release is pushed, a tag allows teams to visualize when that happened and whether key metrics (latency, error rates, CPU usage) changed immediately afterward.<\/p>\n\n\n\n<p><strong>Correlate service performance with any kind of change event<\/strong><\/p>\n\n\n\n<p>This includes code deployments, configuration changes, and database migrations. Use CI\/CD metadata, commit hashes, or deployment annotations to connect change events to service metrics like API latency or transaction success rates. This creates a traceable path between what changed and how the system behaved.<\/p>\n\n\n\n<p><strong>Trigger alerts for SLO violations after deployments<\/strong><\/p>\n\n\n\n<p>For example, if latency increases or query timeout rates spike beyond acceptable thresholds, this may indicate a degraded experience or reliability issue. 
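<\/p>\n\n\n\n<p>As a sketch, such a check can be expressed as a standard Prometheus alerting rule. The metric names below match the <code>http_request_duration_seconds<\/code> histogram used in the next snippet; the 99% target, the 10-minute hold, and the alert name are illustrative assumptions, not prescriptions:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"block\" highlight=\"false\" decode=\"true\">groups:\n  - name: slo-alerts\n    rules:\n      - alert: LatencySLOViolation\n        # Fire when fewer than 99% of requests complete within 0.5s,\n        # sustained for 10 minutes.\n        expr: |\n          (\n            rate(http_request_duration_seconds_bucket{le=\"0.5\"}[5m])\n            \/\n            rate(http_request_duration_seconds_count[5m])\n          ) &lt; 0.99\n        for: 10m\n        labels:\n          severity: warning\n        annotations:\n          summary: \"Latency SLO at risk after a recent change\"<\/pre><\/div>\n\n\n\n<p>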
These alerts help teams proactively catch performance regressions before they affect users.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-code-snippet-1-promql-query-for-slo-burn-rate\">Code Snippet 1: PromQL Query for SLO Burn Rate<\/h4>\n\n\n\n<p>Here\u2019s a Prometheus Query Language (PromQL) query to measure the SLO burn rate for HTTP requests:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"block\" highlight=\"false\" decode=\"true\"># This query calculates the proportion of HTTP requests\n# that meet a latency SLO (e.g., responses under 0.5s)\nrate(http_request_duration_seconds_bucket{le=\"0.5\", status!=\"500\"}[5m])\n\/\nrate(http_request_duration_seconds_count[5m])<\/pre><\/div>\n\n\n\n<p>The query returns a ratio: the fraction of requests over the last 5 minutes that completed successfully (i.e., under 0.5s and not HTTP 500 errors).<\/p>\n\n\n\n<p style=\"padding-right:0;padding-left:var(--wp--preset--spacing--md)\"><strong>Note on terminology<\/strong>: This query shows how close you are to meeting your SLO, not how much of your error budget has been \u201cused up.\u201d While the term &#8220;burn rate&#8221; is commonly used in SRE contexts, it\u2019s important to understand that this isn&#8217;t a cumulative metric. 
It\u2019s a real-time indicator of performance degradation \u2014 if the ratio drops consistently below your SLO target, then you start consuming your error budget.<\/p>\n\n\n\n<p><strong>Explanation of the Code Snippet<\/strong><\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><code>rate(http_request_duration_seconds_bucket{le=\"0.5\", status!=\"500\"}[5m]):<\/code> <br>This line calculates the rate of HTTP requests over the last 5 minutes that completed in 0.5 seconds or less, excluding failed responses <code>(status!=\"500\").<\/code> <br>The <code>_bucket<\/code> metric comes from a Prometheus histogram, which tracks the cumulative count of observations that fall below a certain latency threshold (<code>le=\"0.5\"<\/code> stands for \u201cless than or equal to 0.5 seconds\u201d). <br>This allows us to measure what portion of traffic is fast enough to meet the latency SLO. <\/li>\n\n\n\n<li><code>rate(http_request_duration_seconds_count[5m])<\/code>: <br>This line calculates the total rate of all HTTP requests over the last 5 minutes, regardless of duration. <br><code>_count<\/code> represents the total number of observations made by the histogram. <\/li>\n\n\n\n<li> Dividing these two gives a proportion: <br>The percentage of requests that met the latency target in the last 5 minutes. If this ratio drops, it suggests that latency is increasing and the service is drifting away from its SLO target. <\/li>\n<\/ul>\n<\/div>\n\n\n<p>What\u2019s the difference between _bucket and _count?<\/p>\n\n\n\n<p>In Prometheus, <code>_bucket<\/code> is used to measure latency distributions by counting requests that are <strong>less <\/strong>than or equal to specific thresholds (e.g., 0.5s, 1s, 2s). 
<code>_count<\/code> tracks the total number of requests, so it acts as the denominator for calculating ratios or percentages.<\/p>\n\n\n\n<p><strong>Learn more about Prometheus histograms and SLOs:<\/strong><\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/prometheus.io\/docs\/concepts\/metric_types\/#histogram\">Prometheus Documentation: Histogram Metric Type<\/a> <\/li>\n\n\n\n<li><a href=\"https:\/\/medium.com\/@benpourian\/measuring-latency-using-prometheus-d3b3fe1cac57\">How To Monitor Latency With Prometheus<\/a> <\/li>\n\n\n\n<li><a href=\"https:\/\/sre.google\/sre-book\/service-level-objectives\/\">Google SRE Book: Service Level Objectives<\/a> <\/li>\n<\/ul>\n<\/div>\n\n\n<h4 class=\"wp-block-heading\" id=\"h-code-snippet-2-git-tagging-for-deployment-metadata\">Code Snippet 2: Git Tagging for Deployment Metadata<\/h4>\n\n\n\n<p>Incorporating deployment metadata into your observability stack helps teams correlate performance changes with specific deployments. One common technique is to tag commits in your Git repository at the point of deployment. 
This gives you a timestamped, versioned reference to when a specific codebase was shipped \u2014 which can later be visualized in tools like Grafana, Datadog, or Honeycomb.<\/p>\n\n\n\n<p>Here\u2019s a sample Git command used during deployment automation to create a timestamped tag:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"block\" highlight=\"false\" decode=\"true\"># Create a Git tag that marks a deployment, using the current date and time.\n# The format avoids colons to ensure cross-platform compatibility.\ngit tag -a deploy-app-$(date +%F-%H%M%S) -m \"Deployment: version 1.2.3\"\n\n# Push the tag to the remote repository so it can be used by observability tools.\ngit push origin --tags\n<\/pre><\/div>\n\n\n\n<p><strong>Explanation of the Code Snippet<\/strong><\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><code>git tag -a deploy-app-$(date +%F-%H%M%S) -m \"Deployment: version 1.2.3\"<\/code>: This creates a new Git tag for the deployment. The <code>$(date +%F-%H%M%S)<\/code> portion dynamically generates the current timestamp in the format <code>YYYY-MM-DD-HHMMSS<\/code> (no colons, which keeps the tag name valid across platforms), ensuring that each deployment is uniquely tagged by time. The <code>-m<\/code> flag adds a message (e.g., version 1.2.3) to the tag for easy identification. <\/li>\n\n\n\n<li><code>git push origin --tags<\/code>: This command pushes the new tag to the remote Git repository, making it available for tracking in your CI\/CD pipeline and observability tools. <\/li>\n<\/ul>\n<\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"h-dashboards-and-visibility\">Dashboards and Visibility<\/h3>\n\n\n\n<p>A unified dashboard strategy is key to improving cross-team visibility and accountability. 
Rather than having separate <a href=\"https:\/\/cloud.google.com\/blog\/products\/devops-sre\/using-devops-and-sre-principles-to-manage-looker\">DevOps and SRE dashboards<\/a>, it\u2019s more effective to consolidate them into a single, cohesive observability pane. This gives everyone, from engineers to management, a shared, real-time view of system health and performance trends. A well-structured dashboard should include:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>Real-time deployment metrics<\/strong>: This shows how frequently code is deployed, how quickly it\u2019s deployed, and its impact on system stability. <\/li>\n\n\n\n<li><strong>SLA\/SLO adherence charts<\/strong>: These charts track the adherence to Service Level Agreements (SLAs) and Service Level Objectives (SLOs), giving teams visibility into how close they are to meeting reliability goals. <\/li>\n\n\n\n<li><strong>Error budget usage trends<\/strong>: Monitoring the rate at which your error budget is being consumed helps you understand how much room you have left for failures before violating your reliability commitments. <\/li>\n\n\n\n<li><strong>Alerts on critical regressions<\/strong>: When performance dips or reliability issues occur, these alerts highlight critical regressions in real time. <\/li>\n<\/ul>\n<\/div>\n\n\n<p>Many teams use tools like Grafana to build and maintain unified observability dashboards.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-why-grafana\">Why Grafana?<\/h4>\n\n\n\n<p><a href=\"https:\/\/grafana.com\/\">Grafana<\/a> is an excellent choice for building unified observability dashboards. It\u2019s highly customizable and supports:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>Templating<\/strong>: This allows you to create dynamic dashboards where users can select specific services, regions, or time periods for the displayed data. 
<\/li>\n\n\n\n<li><strong>Multiple data sources<\/strong>: Grafana can pull data from a variety of sources, including Prometheus, Elasticsearch, and others, making it easy to integrate metrics across various systems. <\/li>\n\n\n\n<li><strong>Alert routing<\/strong>: Grafana allows you to configure alerts based on specific thresholds and route them to tools like Slack, PagerDuty, or email to notify the team. <\/li>\n<\/ul>\n<\/div>\n\n\n<p>For enhanced collaboration, share these dashboards in Slack incident channels so that the team is immediately aware of any issues. You can also rotate ownership weekly to ensure that all team members stay familiar with the health of the system. Standups and retrospectives are great opportunities to review these dashboards, discuss any problems, and make improvements.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-example-grafana-dashboard-panel-for-error-budget-tracking\">Example: Grafana Dashboard Panel for Error Budget Tracking<\/h4>\n\n\n\n<p>Here\u2019s an example of how you might visualize error budget remaining in Grafana using a gauge panel. While Grafana dashboards can be exported or configured using JSON, most teams use the web-based UI to build them interactively.<\/p>\n\n\n\n<p>This JSON configuration defines a gauge panel:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"block\" highlight=\"false\" decode=\"true\">{\n  \"type\": \"gauge\",\n  \"title\": \"Error Budget Remaining\",\n  \"targets\": [\n    {\n      \"expr\": \"100 - (slo_burn_rate * 100)\",\n      \"refId\": \"A\"\n    }\n  ]\n}<\/pre><\/div>\n\n\n\n<p><strong>Explanation of the Code Snippet<\/strong><\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><code>type: \"gauge\":<\/code> This specifies that the panel is a gauge, which visually represents the remaining error budget as a percentage. 
A gauge is particularly useful for tracking real-time metrics, as it provides a clear indication of how much room you have before reaching the error budget threshold. <\/li>\n\n\n\n<li><code>title: \"Error Budget Remaining\"<\/code>: This sets the title of the panel to indicate that it\u2019s tracking how much of the error budget is still available. <\/li>\n\n\n\n<li><code>targets:<\/code> This section defines the query that fetches the data. The <code>expr<\/code> field contains the Prometheus query for calculating the error budget remaining: <div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><code>slo_burn_rate:<\/code> This variable represents the rate at which the error budget is being consumed. A burn rate higher than expected means the service is nearing failure or downtime. <\/li>\n<\/ul>\n<\/div><\/li>\n\n\n\n<li><code>100 - (slo_burn_rate * 100):<\/code> This calculation subtracts the burn rate from 100 to determine how much error budget remains. A value of 0 means the error budget is completely burned, and no further issues can be tolerated. <\/li>\n\n\n\n<li><code>refId<\/code>: This is a unique identifier for the query target, allowing Grafana to differentiate between multiple data sources or metrics. <\/li>\n<\/ul>\n<\/div>\n\n\n<p><strong>What this does:<\/strong><\/p>\n\n\n\n<p>The panel calculates the remaining error budget by subtracting the current SLO burn rate from 100. For instance, if your burn rate is 0.25 (25%), this panel would display 75% error budget remaining.<\/p>\n\n\n\n<p><strong>How to use it in practice:<\/strong> <\/p>\n\n\n\n<p>In Grafana\u2019s UI, you would:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li> Go to your dashboard. <\/li>\n\n\n\n<li> Add a <strong>new panel<\/strong> \u2192 select <strong>Gauge<\/strong> as the visualization type. 
<\/li>\n\n\n\n<li> Under <strong>metrics<\/strong>, paste the PromQL expression: <code>100 - (slo_burn_rate * 100)<\/code> <\/li>\n\n\n\n<li> Customize the title and thresholds as needed. <\/li>\n<\/ul>\n<\/div>\n\n\n<p>This configuration is useful when paired with alert thresholds \u2014 for example, triggering a warning when the remaining error budget drops below 50%, or a critical alert below 10%.<\/p>\n\n\n\n<p><strong>See Grafana&#8217;s panel editor in action<\/strong>:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"2563\" height=\"3068\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/06\/word-image-107018-1.png\" alt=\"\" class=\"wp-image-107019\" srcset=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/06\/word-image-107018-1.png 2563w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/06\/word-image-107018-1-251x300.png 251w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/06\/word-image-107018-1-855x1024.png 855w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/06\/word-image-107018-1-768x919.png 768w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/06\/word-image-107018-1-1283x1536.png 1283w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/06\/word-image-107018-1-1711x2048.png 1711w\" sizes=\"auto, (max-width: 2563px) 100vw, 2563px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"2563\" height=\"2323\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/06\/word-image-107018-2.png\" alt=\"\" class=\"wp-image-107020\" srcset=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/06\/word-image-107018-2.png 2563w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/06\/word-image-107018-2-300x272.png 300w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/06\/word-image-107018-2-1024x928.png 1024w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/06\/word-image-107018-2-768x696.png 768w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/06\/word-image-107018-2-1536x1392.png 1536w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/06\/word-image-107018-2-2048x1856.png 2048w\" sizes=\"auto, (max-width: 2563px) 100vw, 2563px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-common-pitfalls-and-how-to-avoid-them\">Common Pitfalls and How to Avoid Them<\/h2>\n\n\n\n<p>Even with the best intentions, teams often stumble when merging DevOps and SRE. Understanding common failure modes can help you avoid them.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-silos-reappearing\">Silos Reappearing<\/h3>\n\n\n\n<p>Ironically, merging DevOps and SRE can recreate old boundaries. DevOps teams focus on deployment, while SREs chase uptime.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-solution\">Solution<\/h4>\n\n\n\n<p>Co-author quarterly roadmaps and OKRs that align business velocity with error budget policies. 
For example:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li> Q2 Objective: Launch 10 new features with &lt;5% SLO budget consumption <\/li>\n\n\n\n<li> Q3 Objective: Reduce time-to-detect incidents by 30% while maintaining weekly releases <\/li>\n<\/ul>\n<\/div>\n\n\n<p>Create shared quarterly reliability objectives across squads.<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li> Shared OKR: &#8220;Launch 15 features with &lt;5% increase in error rate&#8221; <\/li>\n\n\n\n<li> Shared incident channel: #incident-sre-devops <\/li>\n<\/ul>\n<\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"h-tooling-overload\">Tooling Overload<\/h3>\n\n\n\n<p>More tools \u2260 better outcomes. Fragmented platforms lead to unclear ownership, duplicated effort, and increased maintenance overhead.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-solution-0\">Solution<\/h4>\n\n\n\n<p>Standardize around a minimal, interoperable toolchain, ideally one that\u2019s used consistently across DevOps, SRE, and development teams.<\/p>\n\n\n\n<p>For database automation, use a tool like <a href=\"https:\/\/www.red-gate.com\/products\/flyway\/community\/\">Flyway<\/a> (or <a href=\"https:\/\/www.liquibase.com\/\">Liquibase<\/a>). Flyway supports version-controlled schema migrations, integrates smoothly with CI\/CD pipelines, and is simple enough for both developers and DBAs to adopt.<\/p>\n\n\n\n<p>Assign clear owners for each tool in your stack and document internal usage guidelines. This ensures teams know how tools should be used, prevents duplication, and reduces context-switching overhead. 
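<\/p>\n\n\n\n<p>To make the Flyway recommendation concrete, here is a minimal sketch of a CI\/CD step (the JDBC URL, credential variable, and migrations path are placeholder values; <code>migrate<\/code> is Flyway\u2019s standard command):<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"block\" highlight=\"false\" decode=\"true\"># Apply version-controlled schema migrations as a pipeline step.\n# Flyway exits non-zero on failure, which fails the CI job automatically.\nflyway -url=jdbc:postgresql:\/\/db.example.com:5432\/appdb -user=ci_deployer -password=\"$DB_PASSWORD\" -locations=filesystem:sql\/migrations migrate<\/pre><\/div>\n\n\n\n<p>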
A shared, consistent toolset empowers collaboration and makes system behavior more predictable at scale.<\/p>\n\n\n\n<p>Maintain a single, easily accessible tool-ownership.md file (a Markdown file typically stored in your internal documentation repo) that lists each major tool in your engineering stack, along with its primary and backup owners. This helps prevent confusion, ensures continuity during PTO or turnover, and improves accountability for tooling decisions.<\/p>\n\n\n\n<p>For example:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><p><strong>Tool <\/strong><\/p><\/td><td><p><strong>Primary Owner <\/strong><\/p><\/td><td><p><strong>Backup Owner<\/strong><\/p><\/td><\/tr><tr><td><p>Flyway<\/p><\/td><td><p>DBA Team <\/p><\/td><td><p>Platform Eng<\/p><\/td><\/tr><tr><td><p>Grafana<\/p><\/td><td><p>SRE Team <\/p><\/td><td><p>DevOps Team <\/p><\/td><\/tr><tr><td><p>Prometheus<\/p><\/td><td><p>Observability <\/p><\/td><td><p>SRE Team <\/p><\/td><\/tr><tr><td><p>Jenkins <\/p><\/td><td><p>DevOps Team <\/p><\/td><td><p>Platform Eng<\/p><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p style=\"padding-right:0;padding-left:var(--wp--preset--spacing--md)\"><strong>Tip<\/strong><em>:<\/em> This file should be part of your internal engineering handbook or runbook, preferably version-controlled (e.g., in GitHub or GitLab), and linked from your developer portal or team wiki. Make it easy for anyone to find out who to contact for tooling questions or issues.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-not-defining-toil-or-slis-clearly\">Not Defining Toil or SLIs Clearly<\/h3>\n\n\n\n<p>When toil is untracked, the monotonous nature of the work can cause people to silently (and sometimes not so silently) burn out. 
Toil refers to manual, predictable tasks that are operational but don\u2019t contribute to long-term system improvements, such as restarting stuck services, manually pushing builds, or cleaning up logs.<\/p>\n\n\n\n<p>These activities may seem minor in isolation, but over time, they consume team capacity, create frustration, and distract from higher-value work like automation, optimization, or feature delivery. Similarly, when SLIs (Service Level Indicators) are vague or poorly defined, they fail to reflect what matters to end users. This leads to misaligned priorities and missed reliability goals.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-solution-1\">Solution<\/h4>\n\n\n\n<p>Quantify toil and set goals to reduce it.<\/p>\n\n\n\n<p>Examples:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li> Reduce manual data restores by 80% in 2 months <\/li>\n\n\n\n<li> Automate 100% of schema validation by end of Q1 <\/li>\n<\/ul>\n<\/div>\n\n\n<p>Also, align SLIs with user experience:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li> API response time &lt;200ms (95th percentile) <\/li>\n\n\n\n<li> Query failure rate &lt;0.5% over rolling 7-day window <\/li>\n\n\n\n<li> DB Migration success rate > 99.95% <\/li>\n\n\n\n<li> API Latency p95 &lt; 400ms <\/li>\n<\/ul>\n<\/div>\n\n\n<p>Track toil using Jira\/Linear with a custom label toil and query reports monthly:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"block\" highlight=\"false\" decode=\"true\">jira search \"labels = toil AND updated &gt;= -30d\"<\/pre><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-wrapping-up-working-with-devops-and-sre\">Wrapping Up: Working with DevOps and SRE<\/h2>\n\n\n\n<p><a href=\"https:\/\/www.red-gate.com\/simple-talk\/devops\/culture\/site-reliability-engineering-vs-devops\/\">DevOps and SRE<\/a> are not mutually exclusive. 
Together, they represent the next evolution of software engineering. They combine speed with safety, autonomy with accountability, and agility with observability.<\/p>\n\n\n\n<p>The successful integration of DevOps and SRE requires more than adopting tools or changing titles. It\u2019s a holistic effort that spans cultural transformation, shared metrics, and thoughtful tooling. Redgate Flyway exemplifies how a tool can operationalize this integration. It brings automation, versioning, and reliability to a traditionally opaque part of the stack: the database.<\/p>\n\n\n\n<p>Organizations that implement this fusion not only ship faster but also build systems that withstand change. They align their teams on shared goals, empower them with reliable tools, and create feedback loops that continuously improve delivery and operations.<\/p>\n\n\n\n<p>When you bridge the gap between DevOps and SRE, you&#8217;re not choosing between delivery velocity and reliability, you&#8217;re choosing both. And in doing so, you&#8217;re building the kind of resilient, scalable, and responsive systems that define modern software excellence.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In Part 1 of this series, we covered the operational overlap between DevOps and Site Reliability Engineering (SRE). While DevOps emerged from the need for agile and automated software delivery cycles, SRE has its roots in teams doing systems engineering. SRE emphasizes stability, observability, and proactive failure management. 
On the surface, they might appear to&#8230;&hellip;<\/p>\n","protected":false},"author":342511,"featured_media":107021,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143512,53],"tags":[5970],"coauthors":[159023],"class_list":["post-107018","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-devops","category-featured","tag-devops"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/107018","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/342511"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=107018"}],"version-history":[{"count":6,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/107018\/revisions"}],"predecessor-version":[{"id":107029,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/107018\/revisions\/107029"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media\/107021"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=107018"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=107018"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=107018"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=107018"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated"
:true}]}}