Developer Productivity Engineering Blog

GenAI Won’t Replace Your Continuous Delivery Pipeline—It Will Stress It

Generative AI (GenAI) is reshaping software engineering, introducing unprecedented opportunities for rapid innovation alongside significant new complexities. While GenAI accelerates coding and experimentation, it also creates new challenges around code quality and comprehension, batch sizing, pipeline friction, troubleshooting, and regulatory compliance.

This paper examines GenAI’s impacts and limitations across software delivery—highlighting the necessity of reliable and robust testing, human oversight, efficient CI/CD pipelines, and rigorous security practices. Emphasizing core DevOps practices—including continuous delivery—captured by DORA metrics, we explore strategies organizations must adopt to fully realize GenAI’s benefits while effectively managing its inherent risks.

Strategic outlook: GenAI adoption and usage in enterprise software development (next 5–10 years)

Some envision a future where GenAI eliminates the need for traditional software delivery practices—automatically producing flawless, deployable software. This vision overlooks three fundamental realities.

  • First, GenAI produces source code, not executable binaries or bytecode. Traditional build pipelines remain essential to transform that code into working software. 
  • Second, this source code is unproven. Like any hypothesis about intended behavior, it must be empirically validated through rigorous testing and real-world feedback. 
  • Third, compliance requirements demand traceability, auditability, and demonstrable reliability—none of which can be satisfied by opaque, probabilistic AI outputs alone. 

Together, these realities underscore a central thesis of this paper: rather than displacing established continuous delivery practices, GenAI amplifies their importance.

Why GenAI can’t replace your build pipeline

[Figure: an example of Java bytecode]

GenAI cannot reliably generate bytecode or machine-level binaries, both of which demand deterministic structure, architectural precision, and runtime correctness. These constraints exceed the capabilities of probabilistic models—especially for binaries, where precision and system-level integrity are critical.
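
To make the gap concrete, here is a trivial Java class together with the kind of JVM bytecode a build toolchain produces from it. The disassembly shown in the comments is a sketch of what `javap -c` prints; exact constant-pool indices vary by compiler version.

```java
public class Greeter {
    public static void main(String[] args) {
        System.out.println("Hello");
    }
}

// Disassembled with `javap -c Greeter` (indices vary by javac version):
//
//   public static void main(java.lang.String[]);
//     Code:
//        0: getstatic     #7   // Field java/lang/System.out:Ljava/io/PrintStream;
//        3: ldc           #13  // String Hello
//        5: invokevirtual #15  // Method java/io/PrintStream.println:(Ljava/lang/String;)V
//        8: return
```

Every instruction, operand, and constant-pool index must be exactly right for the JVM to accept the class file: a level of precision that compilers guarantee by construction and probabilistic models do not.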

In addition, GenAI-generated bytecode and binaries lack the traceability required for compliance, making verification and accountability difficult. Regulatory standards, industry frameworks, and compliance requirements demand clear, auditable evidence of reliability, security, and correctness. As a result, the direct and reliable generation of these artifacts remains beyond practical reach, meaning we need to rely on robust build pipelines.

Testing as validation—and specification

In most practical scenarios—whether human-authored or AI-generated—software is fundamentally a hypothesis about its intended behavior, requiring empirical validation through thorough testing, monitoring, and real-world feedback.

This means that generated code needs rigorous testing to ensure it performs as expected and meets regulatory compliance requirements. The unproven nature of source code, especially when generated by probabilistic AI, necessitates this validation process to transform it into reliable software.

With GenAI, testing becomes even more critical—not only to validate correctness, but to define what correctness is. Automated tests act as executable specifications that document the code’s expected behavior. Fortunately, with GenAI’s assistance, it will also become significantly easier to write tests and achieve higher test coverage.
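
As an illustration, here is a minimal JUnit 5 sketch of a test acting as an executable specification; the pricing rule and the tiny implementation under test are hypothetical stand-ins for production code.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Executable specification: the tests document the business rule
// ("orders of 100+ units get a 10% discount") and verify any
// implementation against it, whether human-written or AI-generated.
class VolumeDiscountSpec {

    // Toy implementation under test; in practice this lives in production code.
    static double discountedTotal(int units, double unitPrice) {
        double total = units * unitPrice;
        return units >= 100 ? total * 0.90 : total;
    }

    @Test
    void ordersOfOneHundredUnitsOrMoreGetTenPercentOff() {
        assertEquals(900.0, discountedTotal(100, 10.0), 0.001);
    }

    @Test
    void smallerOrdersPayFullPrice() {
        assertEquals(990.0, discountedTotal(99, 10.0), 0.001);
    }
}
```

If an AI agent later regenerates `discountedTotal`, the tests still define what “correct” means.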

Formal verification: a theoretical—but impractical—alternative to testing

In theory, formal verification could make testing unnecessary. It refers to the use of mathematical proofs to demonstrate that a program behaves exactly as intended under all conditions—eliminating the need for empirical validation through test cases. If software could be developed this way at scale, correctness would be guaranteed by design.

GenAI introduces new possibilities here. By generating both implementation code and corresponding specifications, it may lower the entry barrier to formal methods—particularly in domains that demand high assurance but have traditionally been held back by the labor-intensive nature of formal specification. GenAI could help automate the derivation of formal properties or invariants, narrowing the historical gap between what code does and what it’s supposed to do.

However, this ideal remains far from reality for most software teams. Formal methods—such as those implemented in Lean, Coq, or Agda—depend on precisely defined, static, and unambiguous specifications. This level of precision is rarely achievable in enterprise environments, where business logic is ambiguous, evolving, and often negotiated informally. Even with GenAI’s assistance, expressing such dynamic requirements in machine-verifiable form remains highly challenging.
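
For readers unfamiliar with these tools, here is the kind of statement they machine-check: a toy Lean 4 theorem whose proof covers every possible input, unlike a test suite that samples a few.

```lean
-- A toy Lean 4 theorem: once proved, the property holds for *all*
-- natural numbers a and b, with no test cases required.
theorem add_comm_toy (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The difficulty in enterprise settings is not checking such statements but writing them: few business rules reduce to properties this crisp.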

Continuous delivery pipelines, with robust test automation and continuous monitoring, will continue to be the primary mechanism for ensuring correctness, compliance, and quality.

Ambiguity in business logic

Business requirements are rarely static or fully explicit. They evolve through conversation, interpretation, and iteration—often containing implicit assumptions and subtle ambiguities. Capturing such fluid and collaborative processes remains a significant challenge, even with GenAI’s assistance.

In this context, automated tests play a dual role. Beyond checking functional correctness, they serve as executable specifications—they document what “correct” behavior means in practice, and are constantly run in CI environments to ensure those specifications are met. They embody acceptance criteria and clarify business intent in a way that both developers and stakeholders can understand and verify.

This role becomes even more critical in GenAI-assisted development. When machines are writing the code, we need an independent and rigorous means of defining what that code is supposed to do. Automated tests offer that definition. They tell us when the code is complete, support sign-off from business stakeholders, form the basis for compliance audits, and provide essential scaffolding for troubleshooting.

In effect, testing is no longer just about catching bugs—it is about creating a shared, testable understanding that is constantly verified against the application code. In the GenAI era, this makes robust test suites not just useful but foundational to safe and reliable software delivery.

Compliance reinforces testing needs

Testing continues to provide essential empirical validation needed to demonstrate regulatory compliance, offering auditable proof of software reliability and correctness in regulated environments.

Automated tests and robust continuous delivery pipelines are now the primary means of defining, validating, and governing software correctness. In a world where code can be generated at scale and speed, the ability to rigorously test, monitor, and trace what that code is supposed to do becomes the linchpin of responsible software delivery.

Balancing human-led and AI-generated software development in enterprise environments

In this section, we explore why human-led development remains essential, how AI-agent software development will expand, and the critical continuous delivery practices required to ensure safe and efficient deployments.

The continuing importance of human-led development

Human-led software development will remain an important and enduring component of enterprise software delivery, especially where complex legacy systems, ambiguous requirements, and regulatory compliance issues exist.

  • Implicit business knowledge: Ambiguous and evolving business requirements are best navigated by humans who can intuitively interpret and adapt to shifting priorities and nuanced domain knowledge.
  • Complexity of legacy systems: Human developers possess deep expertise needed to interpret intricate dependencies, subtle side effects, and undocumented behaviors common in legacy environments.
  • Risk management and compliance: Stringent regulatory, security, and compliance requirements necessitate human oversight to fully assess the implications and risks associated with software changes.
  • Trust and accountability: Enterprises value human-in-the-loop governance models to ensure clear accountability, trustworthiness, and oversight in critical software decisions.

The emergence and growth of AI-agent-generated software

While human oversight remains vital, AI-agent-generated software will become increasingly prevalent in enterprise environments due to its productivity advantages and the speed of development it enables—e.g., reducing mundane developer activities like writing boilerplate code, generating test data, and resolving merge conflicts. However, adopting this development approach requires careful management of potential risks through comprehensive continuous delivery practices.

Essential continuous delivery practices for AI-agent-generated software

To effectively leverage AI-agent-generated software, enterprises must invest in robust continuous delivery practices that ensure software reliability, security, and compliance. The most competitive enterprises achieve software delivery excellence with:

  • Robust validation and testing automation: Write, run, and maintain extensive automated tests, including property-based and fuzz testing, to rigorously validate agent-generated software (see the sketch following this list).
  • Rapid rollback and recovery mechanisms: Provide automated rollback capabilities and efficient recovery strategies to minimize disruption and reduce Mean Time to Recovery (MTTR).
  • Continuous security and compliance checks: Embed automated static and dynamic security analyses and compliance verification directly into continuous delivery pipelines.
  • Human-in-the-loop governance: Establish human checkpoints for reviewing critical AI-agent-generated software outputs, particularly in sensitive or highly regulated domains.
  • Versioning and provenance tracking: Maintain comprehensive traceability and versioning mechanisms to facilitate audits and enhance accountability.
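
As a sketch of the property-based testing mentioned above, the example below uses jqwik to assert an invariant over arbitrary generated inputs; the toy `slugify` implementation is a hypothetical stand-in for agent-generated code.

```java
import net.jqwik.api.ForAll;
import net.jqwik.api.Property;

class SlugifierProperties {

    // Toy implementation under test; a real suite would target production code.
    static String slugify(String input) {
        return input.toLowerCase()
                .replaceAll("[^a-z0-9]+", "-")
                .replaceAll("^-|-$", "");
    }

    // Property-based test: for ANY generated string, the invariant must hold.
    // jqwik runs this against many random and edge-case inputs.
    @Property
    boolean slugsContainOnlyUrlSafeCharacters(@ForAll String input) {
        return slugify(input).matches("[a-z0-9-]*");
    }
}
```

Properties like this catch classes of bugs that example-based tests written (or generated) for a handful of inputs can miss.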

In summary, the future of enterprise software development is not a choice between human or AI-agent-driven approaches but rather an integration of both.

Human-led development will remain crucial, providing expertise, accountability, and adaptive insight where necessary. 

Why DORA metrics and CD practices are even more critical in the GenAI era

[Figure: Google Cloud’s DORA is the largest and longest-running research program of its kind]

In the age of GenAI, feedback cycle frequency, code complexity, and code volume all increase dramatically, magnifying the potential for errors, unintended behaviors, and deployment risks. DORA’s continuous delivery practices become not just relevant but strategically indispensable—as does their evaluation by DORA’s four key metrics:

  1. Deployment Frequency
  2. Lead Time for Changes
  3. Mean Time to Recovery (MTTR)
  4. Change Failure Rate
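
As a rough illustration of how these metrics are derived, the sketch below computes two of them from a list of deployment records; the `Deployment` record and its fields are hypothetical, not a standard schema.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical deployment record: when the change was committed, when it
// reached production, and whether it caused a failure there.
record Deployment(Instant committedAt, Instant deployedAt, boolean causedFailure) {}

class DoraMetricsSketch {

    // Change Failure Rate: share of deployments causing a production failure.
    static double changeFailureRate(List<Deployment> deployments) {
        if (deployments.isEmpty()) return 0.0;
        long failures = deployments.stream().filter(Deployment::causedFailure).count();
        return (double) failures / deployments.size();
    }

    // Lead Time for Changes (median commit-to-deploy); assumes a non-empty list.
    static Duration medianLeadTime(List<Deployment> deployments) {
        List<Duration> sorted = deployments.stream()
                .map(d -> Duration.between(d.committedAt(), d.deployedAt()))
                .sorted()
                .toList();
        return sorted.get(sorted.size() / 2);
    }
}
```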

DORA offers technology-agnostic, outcome-oriented measurement and proven methodologies essential for maintaining control, quality, and speed amid accelerated GenAI-driven development. By enabling rapid feedback, rigorous automated testing, observability, and effective risk management, DORA and CD practices empower organizations to harness the productivity gains of GenAI while safeguarding software reliability, ensuring regulatory compliance, and accelerating delivery of high-quality software.

Shift-left and working in small batches

“The significant shift AI has brought in terms of developer productivity and code generation speed may have inadvertently led the field to overlook the importance of small batch sizes.”

– DORA.dev

DORA has consistently shown that “larger changes are slower and more prone to creating instability.” Working in small batches, combined with shift-left practices, forms the foundational core of continuous delivery and software delivery excellence. Small batches enable rapid, frequent integration and deployment, significantly reducing complexity and facilitating faster feedback loops.

Shift-left ensures issues are detected and addressed early and often in the development cycle, when fixing them is cheapest. Together, shift-left and working in small batches reinforce each other, driving higher deployment frequency, shorter lead times, improved reliability, and ultimately superior software delivery performance.

GenAI introduces a notable risk of increasing batch sizes and shifting feedback loops to the right, potentially undermining software delivery excellence and your capability to continuously integrate.

In general, one primary reason for large batch sizes is the significant fixed cost associated with handoffs between stages in the software delivery pipeline. This handoff cost is influenced by three main factors:

  • Feedback time: Delays in receiving pipeline feedback increase waiting periods, leading developers to batch more changes.
  • Pipeline friction: Build and test pipeline flakiness introduces uncertainty, discouraging frequent merges.
  • Troubleshooting time: Prolonged debugging of pipeline failures further incentivizes developers to merge less frequently and in larger batches.

GenAI compounds these issues by inherently encouraging larger batch sizes due to:

  • Increased frequency of feedback events: The sheer speed of generating new code with GenAI significantly increases the total number of pipeline executions, causing congestion and exacerbating delays and bottlenecks.
  • Verbosity: GenAI produces substantial volumes of code rapidly.
  • Accessibility and speed: The ease and speed of generating extensive code segments make frequent large-scale refactoring commonplace.
  • Lower code comprehension: Developers may find themselves managing and maintaining code they do not fully understand, increasing the complexity and surface area affected by each batch.

The second primary reason for large batch sizes is low test coverage. When code changes are not sufficiently covered by automated tests, confidence in the development feedback loop erodes. A common response is to add more manual checks and processes, such as lengthy code or pull request reviews, which results in integrating changes less frequently. Developers fear that frequent integration will mean numerous difficult merges, more time spent in reviews, and prolonged troubleshooting incidents down the line.

“Improving the development process does not automatically improve software delivery—at least not without proper adherence to the basics of successful software delivery, like small batch sizes and robust testing mechanisms.”

– DORA.dev

With GenAI, there is a massive opportunity to bridge coverage gaps by using AI to create many more tests. But more tests also raise the handoff cost: running them takes longer, which lengthens feedback time, increases pipeline friction, and extends the time needed to troubleshoot failures. If unaddressed, this trend creates a vicious cycle:

  • More batches will increase the load on CI, causing slower feedback cycles.
  • Larger batches will increase feedback cycle time and load on CI.
  • Larger batches consume significantly more troubleshooting time, particularly when changes span extensive codebases that developers may not deeply understand.
  • Increased troubleshooting time, coupled with lengthy builds and tests, further extends effective feedback cycles, exacerbating the problem by incentivizing even larger batches.

To break this cycle, pipelines must significantly improve their performance and troubleshooting capabilities. A pipeline that is “GenAI-ready” can absorb a load several times higher than what is required today while delivering the same quality of service at a reasonable cost.

The effective feedback time, crucially influenced by pipeline build times, test times, and troubleshooting durations, must be drastically reduced. Simply allocating more resources (e.g., increasing the number of CI agents) is economically inefficient and unsustainable. Without fundamentally improving pipeline efficiency, such strategies quickly become cost-prohibitive.

Enhanced pipelines using efficiency technologies—such as universal caching, smart resource allocation, and predictive test selection—become essential to counter the inherent drift toward larger batch sizes introduced by GenAI.
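
To illustrate the core idea behind one of these technologies, the sketch below shows build caching in miniature: outputs are keyed by a hash of a task’s inputs, so an unchanged task is never re-executed. This is a conceptual sketch only, not the Develocity or Gradle Build Cache API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Conceptual build cache: if a task's inputs hash to a known key, reuse the
// stored output instead of re-running the task. Real caches are remote and
// shared across CI agents and developer workstations.
class BuildCacheSketch {
    private final Map<String, byte[]> entries = new ConcurrentHashMap<>();

    static String cacheKey(String taskName, byte[] taskInputs) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        sha.update(taskName.getBytes(StandardCharsets.UTF_8));
        sha.update(taskInputs);
        return HexFormat.of().formatHex(sha.digest());
    }

    Optional<byte[]> load(String key) { return Optional.ofNullable(entries.get(key)); }

    void save(String key, byte[] output) { entries.put(key, output); }
}
```

The economics follow directly: every cache hit removes a task execution from the critical path, which is what keeps feedback time flat as GenAI multiplies the number of pipeline runs.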

Additionally, developers must be able to get quality feedback on their workstations before CI is even involved. The failure-troubleshooting experience must also improve significantly to cope with the added complexity of debugging GenAI-generated code.

Ultimately, only significantly enhanced pipelines can preserve the capability of continuous delivery amidst the pressures exerted by GenAI. Organizations must proactively address these issues, aiming to shift feedback loops back to the left—towards earlier stages of development—or risk deteriorating deployment frequencies, slower releases, and declining software quality.

Many organizations can (and do) invest significant time and resources building in-house expertise to maintain and accelerate their CD pipelines. Alternatively, for organizations that cannot afford this approach, there is Develocity®.

Develocity is a toolchain observability and acceleration platform designed to handle the scale and complexity of GenAI-driven development. It accelerates feedback cycles, increases CI efficiency while decreasing cost, and offers complete oversight of your build pipeline. Unlike typical CI tools, Develocity shifts visibility left and enhances CI systems, ensuring fast builds, reliable tests, and efficient teams.

A once-in-a-lifetime opportunity for rapid experimentation

Rapid experimentation is central to software delivery excellence. The faster your teams can iterate, deploy changes, and gather user feedback, the sooner your organization realizes tangible customer value.

GenAI dramatically accelerates the speed of coding experimentation, presenting an unparalleled opportunity for swift innovation. However, to fully leverage this transformative capability, your delivery pipeline must be equally prepared. Without an optimized and efficient pipeline, GenAI’s extraordinary potential for accelerated go-to-market velocity and rapid experimentation remains unrealized.

Leveraging artifact provenance and quality gates across the software delivery lifecycle

As discussed, AI-generated code introduces unique challenges to software development, including reduced developer comprehension, rapid and frequent code changes, and increased risks associated with security, compliance, and unexpected behaviors. This is particularly the case for agent-generated software. To effectively manage these challenges, detailed artifact provenance—comprehensive information about how artifacts are produced (e.g., JDK versions, build tools, plugins, test libraries, and test outcomes)—becomes essential.

By capturing this detailed provenance data, organizations can implement precise, enforceable quality gate policies throughout all stages of the software delivery lifecycle (SDLC). These policies facilitate systematic validation at multiple points, including code development in IDEs, continuous integration builds, testing phases, and deployments.
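
As a sketch of what this could look like, the example below pairs a provenance record with an enforceable gate; the field names and thresholds are illustrative, not a standard schema (real-world formats include SLSA provenance attestations).

```java
import java.util.List;

// Illustrative provenance record mirroring the examples in the text:
// toolchain versions, plugins, and test outcomes. Not a standard schema.
record ArtifactProvenance(
        String artifactId,
        String jdkVersion,
        String buildTool,
        List<String> plugins,
        int testsRun,
        int testsFailed,
        boolean generatedByAiAgent) {}

class QualityGate {

    // Enforceable policy with hypothetical thresholds: an artifact advances
    // to the next SDLC stage only if its provenance satisfies every check.
    static boolean passes(ArtifactProvenance p) {
        boolean testsGreen = p.testsRun() > 0 && p.testsFailed() == 0;
        boolean approvedToolchain = p.jdkVersion().startsWith("21")
                && p.buildTool().startsWith("gradle");
        // AI-generated artifacts might additionally require a recorded human review.
        return testsGreen && approvedToolchain;
    }
}
```

The same check can run in the IDE, in CI, and at deployment time, giving every stage the same answer about whether an artifact is fit to advance.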

This rigorous, lifecycle-wide approach provides the following critical benefits:

  • Enhanced traceability and auditability: Every artifact can be accurately traced to its source, enabling swift identification, diagnosis, and resolution of issues at any stage of development.
  • Comprehensive security and compliance assurance: Provenance-based policies provide a lightweight, cost-effective alternative to traditional security scanners, allowing rapid and frequent checks throughout the entire SDLC. This aligns seamlessly with DORA’s emphasis on pervasive security, embedding proactive security checks efficiently into developers’ daily workflows.
  • Greater confidence and trust: Systematic verification of artifact provenance at multiple SDLC stages ensures consistent quality, reliability, and compliance, empowering teams to confidently leverage AI-driven development.

In summary, detailed artifact provenance and robust, multi-stage quality gate policies are the essential guardrails that enable organizations to safely maximize the productivity gains offered by AI-generated code.

Strategic imperatives for thriving in the GenAI-driven software landscape

The era of GenAI offers transformative potential but also demands critical adjustments to traditional software delivery approaches. Software in the age of AI will need robust build and test pipelines more than ever.

Human-led development remains indispensable due to legacy system complexity, business knowledge intricacies, and regulatory demands. Where AI agents contribute directly, strong continuous delivery practices—enhanced observability, robust automated testing, rapid recovery capabilities, and strict governance—are essential for safely managing AI-generated software.

DORA metrics and continuous delivery practices emerge as more strategically valuable than ever, addressing challenges posed by larger batch sizes, increased feedback frequency, and pipeline bottlenecks driven by GenAI. Shift-left methodologies combined with smaller batches remain critical to maintaining integration speed and quality. Enhanced pipeline efficiency—leveraging smart resource allocation, caching, predictive test selection, and AI-powered failure analytics—ensures developers receive rapid, high-quality feedback at every SDLC stage, without incurring unsustainable infrastructure costs.

Finally, capturing detailed artifact provenance and implementing rigorous, multi-stage quality gates secures traceability, compliance, and confidence in AI-generated artifacts.

By adopting these comprehensive strategies, organizations can take advantage of GenAI’s potential for innovation, experimentation, and quicker go-to-market, while maintaining strong and adaptable software delivery practices. 
