How does LinkedIn create a culture of developer productivity that is capable of building over 100,000 times a day?
Szczepan Faber is architect and technical leader for development tools at LinkedIn. Szczepan recently sat down with Hans Dockter, founder and CEO of Gradle, to discuss the transformation of Build engineering, what a healthy Build culture looks like, and the future of developer tools automation.
Thank you for the many excellent questions we received during the live webcast!
Hans: Hello, everyone. I’m Hans Dockter, founder of Gradle and the CEO of Gradle Inc., and this is the first episode of our new webinar series focused on build engineering and developer productivity. As every industry vertical is transformed by software, developer productivity has become a key mission for many organizations. It’s a key competitive advantage. And I’m very happy to have Szczepan Faber from LinkedIn with me today as our first guest. He is the founder of Mockito and a true expert in the field, and he’s leading the DevTools team at LinkedIn. Welcome, Szczepan.
Szczepan: Thank you very much for having me, Hans. I’m happy to be here. Let’s talk about automation.
Hans: Yeah, cool. So let’s start with the question, “What brought you to build engineering?”
Szczepan: Many years ago, I found that the best use of my time is when I actually help other engineers in my organization to be productive. So I started making tools for engineers. You might use Mockito, as an example. I joined Gradle very early; there were four of us developing Gradle back in 2011. And I’m also extremely passionate about automation, which I think is key for developer productivity. If everything is automated, and you can ship your change to production as fast as possible, that is a key indicator of productivity.
Hans: So talking about that, how many engineers do you have to support at LinkedIn?
Szczepan: We have about 3,000 engineers, and they are supported by a group of 300 engineers within the foundation team. So we build this core developer infrastructure for all the devs so that they can ship high-quality products to production as fast, as reliably, and as consistently as possible.
Hans: And that was always the thing that impressed me with LinkedIn, how early they significantly invested in that space. I know organizations with, you know, 1,000 engineers, whose foundation team is a handful of people, and you can see the result. And so maybe you can share a couple more metrics regarding the scale at which you’re building at LinkedIn.
Szczepan: Yeah, I’m happy to. So today we have 100,000 Gradle builds per day, which is an amazing scale, and we’re still trying to figure out how to manage that really well. And our most complicated and biggest application, which is LinkedIn.com, ships to production several times a day. And across all the codebases that form LinkedIn.com, we have over 1,000 commits per week, which, for us, is a very big scale. And of course, there are some companies out there that operate at an even higher scale. Still, the level of challenges we have is pretty amazing.
Hans: Yeah and that you ship multiple times a day.
Szczepan: We ship multiple times a day. So our web front-end LinkedIn.com site that you might know, if you use it, it ships three times per day to production. The web services behind it, the media services, they also ship several times a day. Some of those services would be shipping even every change to production. It depends on which layer of architecture we’re looking at.
Hans: And is that a team responsibility to make the decision, oh, we want to do continuous deployment, or we want to have a different release lifecycle? Or how is the organization figuring this out?
Szczepan: That’s a great question. For LinkedIn.com, we discovered very early that we really want to ship to production multiple times a day. It’s key for us because it puts pressure on the entire ecosystem of tools and processes in the organization. And achieving that model, with releasing to production so frequently, gives us a competitive advantage where we can test business theories and ship features to production very, very early.
Szczepan: One key insight that we have discovered is that we want to separate the code push from the feature push. So a code push is when new binaries are out there serving production traffic, but the feature is not yet visible to everybody; it’s gated. And we can gradually expose the feature to a growing population of our users, and that manages the risks. And that’s part of our automation, part of our developer infrastructure, a part of our commit-to-production pipeline.
Hans: And I think what is fascinating with that is, everyone wants to ship that often. And for small start-up teams, that’s pretty easy. But at the scale of LinkedIn, I hardly know any organization that’s able to pull that off. There are a few, but not many in the world. So we still have these organizations that say, oh, we’re trying to get from three months to one month, right? Or we’re trying to get to biweekly.
Hans: Right, and what is even more fascinating is that you haven’t just been doing this since last year. At what point were you able to say, OK, we can now ship multiple times a day? It wasn’t always the case at LinkedIn.
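As an illustration that is not part of the conversation, the separation of code push from feature push can be sketched as a percentage-based feature gate. This is a minimal sketch under stated assumptions: the function names, the hashing scheme, and the feature name are hypothetical and do not describe LinkedIn’s actual gating system.

```python
import hashlib


def rollout_bucket(feature: str, user_id: str) -> int:
    """Map a (feature, user) pair to a stable bucket in the range 0..99."""
    # Hashing keeps the bucket deterministic, so a user does not flip
    # in and out of the feature between requests.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_feature_enabled(feature: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    return rollout_bucket(feature, user_id) < rollout_percent


# The code is already deployed; only the percentage changes over time.
# "new-profile-page" is a hypothetical feature name.
visible_to_user = is_feature_enabled("new-profile-page", "user-42", rollout_percent=10)
```

Because a user’s bucket never changes, raising the percentage only adds users to the exposed population and never removes anyone who already sees the feature.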
Szczepan: No, no.
Hans: It was a conscious decision to invest and to make it a reality, right?
Szczepan: No. We do have more products than just LinkedIn.com. LinkedIn.com is something that everybody knows. Other products can be shipped to production even more frequently, like, at the level of every change, or maybe less frequently, depending on the sort of business. In the past, we used to have a monolithic architecture, where all our microservices were in one giant code repository. That was a long time ago. And we found that it hurt our ability to ship code. Like, the levels of investment needed were super high. And we decided that this is not the architecture, this is not how we want software development at LinkedIn. We want independent development cycles and autonomy within our teams. We still want to have a system that manages that. We want to have a standardized developer workflow that is optimized for trunk-based development, for shipping to production quickly, for doing code reviews of every change so that we have a system that helps with quality at our scale. At the same time, you want the independent development cycles, independent deployments.
Hans: Yep. And I mean we don’t have time for a deep dive into the whole monorepo versus multirepo topic. But it’s interesting. For me, it’s a question, what is the boundary of the version control system, of the source repository? And a monorepo would mean there’s only one boundary to the whole organization. But I think there are other good boundaries, where you create more independence when you go with a multirepo approach. But you have to do the investment to connect all the pieces.
Hans: For me, a very important thing is, how do you make sure that when one repository breaks another repository, that you learn about it very fast? And it’s not always the consumers of that repository that have to figure out, hey, why am I not working anymore? So is there anything you do in this area at LinkedIn?
Szczepan: Yeah. I don’t know if you want to go down that path, because the multirepo, monorepo discussion, is like, we could have a separate webinar on that topic.
Szczepan: We found at LinkedIn that, when we invested more in a multi-codebase environment, like today we have 10,000 codebases at LinkedIn, and some of them would be sort of test projects that don’t matter, you know, like, some little things. Some of them would be like software libraries, relatively small. But then, on that spectrum, we also have pretty large applications like our LinkedIn.com, for example. And all that is developed in our multi-codebase environment at LinkedIn. And we found that we need to build a lot of tools to organize that environment. Like, we want to know what the dependencies of all our products are.
Hans: Yes, yes.
Szczepan: Like having that dependency graph, we found it’s very useful because we can do very interesting things with it. For example, if I’m pushing a change to my library, we can automatically, in our CI pipeline, run builds and tests for all of the products that depend on that software library.
Hans: As part of a pull request build?
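As an illustration that is not part of the conversation, finding every product to build and test after a library change amounts to walking the dependency graph in reverse. The sketch below assumes the graph is available as a plain mapping from each codebase to its direct dependencies; the function name and the codebase names are hypothetical.

```python
from collections import defaultdict, deque


def downstream_products(depends_on: dict[str, list[str]], changed: str) -> list[str]:
    """Return every codebase that directly or transitively depends on `changed`."""
    # Invert the graph: for each dependency, record who consumes it.
    consumers = defaultdict(set)
    for product, deps in depends_on.items():
        for dep in deps:
            consumers[dep].add(product)

    # Breadth-first walk over consumers, starting from the changed library.
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for consumer in consumers[current]:
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return sorted(seen)


# Hypothetical codebases, each mapped to its direct dependencies.
graph = {
    "web-frontend": ["profile-service", "ui-kit"],
    "profile-service": ["common-utils"],
    "ui-kit": [],
    "common-utils": [],
}
to_build = downstream_products(graph, "common-utils")
```

A CI pipeline could then schedule a build and test run for each entry in `to_build` before the library change is accepted.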
Hans: More or less?
Szczepan: Yes, more or less.
Hans: That’s where you want to be. And I think that’s part of our mission to provide this as a commodity solution to the world.
Hans: So but that’s great insight.
Szczepan: That’s actually what I’m missing a lot. Like, when you’re a company and you’re growing, maybe you start with a monorepo because that’s the easiest way to start. At some point, you have different products, different technology stacks. It does not make sense for you to have one giant codebase. But then you split out, and you have those separate teams, separate codebases, but there are no tools out there to manage that landscape of all your projects.
Szczepan: Like, to understand your dependencies, to help you resolve conflicts if you’re reusing code in your organization. And we do want to reuse code in the organization, right?
Szczepan: So I’d love to have a solution to that.
Hans: Yes. So that’s good. But you did the investment, and I think for you, it’s a really good system you have developed. And just sharing something from our experience at Gradle. Gradle is a monorepo. We have, I don’t know, a million lines of code. And for us, the build scales easily, so that is all under control. But what we have seen, the disadvantage of that is, it reduces the engagement of the community. If you just want to contribute to the C++ plugins of Gradle, and you have to deal with this big monorepo with complex test automation, it’s not as much fun. And that’s why we have now tried something different with some of the new projects, like the Kotlin DSL, which is in a separate repository. And it’s by far the biggest community engagement we’ve ever seen. So there are many reasons why you want to think about the boundaries of repositories. And I was surprised that, at least for an open-source project, this was a very important aspect.
Szczepan: I can totally imagine that. Like, I’m a contributor. I want to contribute to this code. And yet, this is a pretty big codebase. Like, I need to import all that into IntelliJ, into my IDE, and there are just so many classes, so much source code. Where do I even start?
Szczepan: I need to run the build, but what tests do I run? Which submodule? The build is going to take one hour, like, how long? So I absolutely get it.
Hans: And then the issue tracker, right? There’s a whole ecosystem around this boundary. The issue tracker and, hey, you want to follow the progress of only that repository. So another thing I found fascinating is the metric of the number of builds. It sounds like a very simple metric that you could easily game and whatnot. But what I’ve seen is that organizations with a very high number of builds are the ones that are able to ship fast and are the ones that are much more productive.
Szczepan: Sure, sure.
Hans: And the differences between organizations are orders of magnitude when it comes to the number of builds. And that is, for me one fascinating thing. That’s such a simple metric. Of course, people could game it. But in general, it gives some strong suggestions already about how productive is this organization. How many feedback cycles do they have?
Szczepan: I like that. I think this is interesting. Like with the continuous delivery model, with trunk-based development, with those modern development practices that optimize for developer productivity, the number of builds you run is associated with how you partition your changes, how you develop code. Instead of cooking your change for weeks, you want to work in increments. And you push that incremental change, and that change should be compatible. Because the change is small, the code review turnaround is relatively fast.
Szczepan: So in general, you know, the whole pipeline is smooth. The whole development cycle is smoother. To me, it’s also part of lean development. Like small batches, right?
Hans: And I think if I were a developer at LinkedIn, I’d say, hey, LinkedIn has infrastructure that gives me feedback when I’ve broken something downstream. Yeah, I want that feedback. So it’s an indication that people trigger builds to get feedback. So when the developer productivity team, the dev tools team, has an infrastructure that provides a lot of valuable feedback, I ask for that a lot. When I have a build infrastructure that does not give me similar feedback opportunities, I have less incentive to ask for the feedback.
Szczepan: That’s true. That’s what we found. The quality of signals from our CI builds, from the tests in our builds, is critical to having a healthy ecosystem, for shipping changes fast, for being able to upgrade versions at scale, because most of the changes are compatible changes. So we can automatically propagate version bumps in our ecosystem. And we know the dependency graph, right?
Szczepan: So we know what to upgrade.
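As an illustration that is not part of the conversation, knowing the dependency graph also gives an order in which to propagate a version bump, so that each codebase is upgraded only after the codebases it depends on. The sketch below uses Python’s standard `graphlib.TopologicalSorter` and assumes an acyclic graph; the function name and codebase names are hypothetical, and real propagation would also depend on the CI signal for each step.

```python
from graphlib import TopologicalSorter


def upgrade_order(depends_on: dict[str, set[str]], bumped: str) -> list[str]:
    """List the codebases to upgrade after `bumped` releases, dependencies first."""
    # static_order() yields each node only after all of its dependencies.
    full_order = list(TopologicalSorter(depends_on).static_order())

    affected = {bumped}
    ordered = []
    for name in full_order:
        deps = depends_on.get(name, ())
        # A codebase needs an upgrade if any of its dependencies was affected.
        if name != bumped and not affected.isdisjoint(deps):
            affected.add(name)
            ordered.append(name)
    return ordered


# Hypothetical codebases, each mapped to the set of its direct dependencies.
graph = {
    "app": {"service", "ui"},
    "service": {"core"},
    "ui": set(),
    "core": set(),
}
order = upgrade_order(graph, "core")
```

Here `service` comes before `app` in the result, because `app` should only pick up the new `core` version once `service` has been rebuilt against it.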
Hans: I’m curious, and I’m not sure if you want to answer this, but have you had, let’s say in 2017, one incident where you couldn’t release for a couple of days?
Szczepan: So it depends on the product.
Szczepan: So looking at, let’s say, LinkedIn.com, our most prominent product. It could have happened. I don’t know the exact details, but it’s pretty rare. Our goal is to release to production three times a day, and most of the developers are in Pacific Standard Time. We don’t have a team working on that codebase that is spread out through the entire world. So we’re talking about three releases within eight hours. And if, let’s say, for four or five hours, half of the day, we don’t have a release to production, this is a very strong signal, and there’s going to be a group of people working on fixing it. And it’s relatively urgent. And if there’s no release during a day, that’s a major problem. Like, on-calls are busy, sweating, like, let’s figure it out. This thing has to be resolved. Because then, imagine, we haven’t had a release in a day, so the number of changes that accumulate is so big that the chance that they’re going to break the next release is higher.
Hans: Yes, exactly.
Szczepan: So you can end up in this vicious cycle where you end up just delaying the release and like, oh no, we have to roll back because there’s this problem.
Hans: Yeah, cool.
Szczepan: I’d say that it’s very rare that we have several days with no release. It could have happened. It’s a big issue even if there’s no release for half a day.
Hans: It’s one of my favorite anecdotes. There was a situation a couple of years ago where, I don’t know what bank, it was a public incident. Anyhow, their website, including online banking, was down for three days. Three days, I mean, it’s unimaginable. But the fascinating thing is, we were doing Gradle consulting for another bank. And I asked one of their engineering leads whether they had heard about it. They said, yes. And the instructions they got from their management were basically to not release for the next six months, to do a complete release freeze, to kind of understand what happened and how it could be avoided.
Szczepan: It solves the problem, right? There’s not going to be a regression if you don’t do any changes, right?
Hans: Exactly. Wow, is that really the consequence of that, instead of investing in a rollback functionality? Anyhow, I was like, wow.
Szczepan: Yeah. It’s fascinating. It’s completely against the continuous delivery philosophy, where you really want to release often, because then you practice it. You make the releases boring. You make them so easy, right?
Szczepan: And this enables engineering teams to focus on the products, rather than fixing the releases, managing rollbacks, and cherry-picking what is a good change to include in the release or not.
Hans: But the key thing, why I think this is so important: everyone wants to do it like this. But I would say 99% of the enterprises in the world are not there yet. Releasing means a day of work, and maybe it happens every two weeks. And that already requires extreme determination. So in that respect, for you, because you’ve been doing this for so many years, it’s now, of course, that’s the way we’re doing it, and we want to do it. But hardly any organization, I would say, starting with a couple of engineers or more, is there yet. So that’s still the reality in the industry.
Szczepan: Yeah. For us, it’s a constant challenge. It’s not like, oh, we build it, and this automation works, and for the next couple of years, we’re good. You develop your change and you’ll catch the next release train, and it goes out. No. This requires constant focus. We have developer productivity teams, developer infra teams. And there are new challenges that arise with new problems, and we have to be solving them.
Hans: And I think that is, for me, if you would have told another enterprise, hey, we want to build up a foundation team. We have 300 people, and we have 2,000 engineers. They would say, what? 10% of the engineering workforce is helping with the manufacturing process of software? They probably would have been upset by the idea that they have to invest so many resources into that. But they are now stuck with three-month release cycles, and you release a couple of times a day. And if you competed with them, if you were in the same vertical, good night for them. That’s the reality.
Szczepan: Yeah. It’s a good point. For some organizations, it’s hard to justify those teams, the productivity teams or the build infrastructure, developer infrastructure teams. We at LinkedIn, we learned the hard way. So we used to be in the position where releasing was hard. We released every month, and then we shrunk it to every two weeks. But it was like a major hassle, and every release was stressful, and every release introduced a lot of overhead because top engineers on the team would not be working on products, but they would be working on the “stabilization phase” of the release. So we learned the hard way. And then we found that we want to be shipping to production as fast as we can. So we discovered that we need really strong foundations. We need a really strong infrastructure. And I can see that more organizations are interested in this. There are, like, meetups for developer productivity, amazing talks from all over the place. So it’s changing, and it’s awesome.
Hans: I agree. I think we see now the realization that this needs to be a first-class problem at the CTO level. Many organizations, even, let’s say, organizations that are not so engineering-driven, culturally, like LinkedIn. And that’s great. And I think another part of that is there are the economic aspects of that. Competitive advantage, in terms of shipping fast and reducing waiting time for engineers. It is a big part of the budget. But then there’s also the satisfaction you have as an engineer when you can be productive when you can roll out your changes very quickly. So I guess, what is your perspective on, let’s say, engineering churn and developer productivity at an organization?
Szczepan: So I have this automation bug, like a virus, and because of it I love automation and shipping to production very often. And I think that it really helps engineers in general. Like when I can ship to production safely and frequently, and I’m working on my change this morning, and I’m ready, and somebody reviewed my code and later during the same day it goes out, it liberates me. It helps me deliver faster, provide faster results.
Szczepan: And then, I’m going to contrast that with, let’s say you release every week, which is not too bad. But then, if I’m an engineer on that team, and the release is on Wednesday, and it’s already Tuesday, and I promised that my change will go out, what am I doing on Tuesday? I’m really frantically coding. And then, you know, with the unit tests, maybe I’ll do them later because I really want my change to be included in the release. And we don’t want those wrong incentives, this model where you’re stressed and you cannot operate in the mode where every day I’m focused on building the product, a high-quality product. And I’m coding and I’m not worrying about when the next release is, because it’s, you know, around the corner. The release train is almost there.
Hans: So as you said, this is a never-ending investment that you have to make into developer productivity. New languages, new frameworks, new growth. So, you know, there’s so many things that introduce regressions, or that introduce new challenges. So developers are never 100% happy, I guess, with the work of the developer productivity team. And so for me, there’s one question I like to ask developer productivity teams. So if you made the build 20% faster, would anyone say thank you, or would anyone even notice? I mean, it’s an amazing 20%, wow, how much money this saves, and whatnot. But I think it’s not that much faster, that everyone would realize oh, it’s so much faster. So how are you trying to kind of communicate to developers, hey, this is the great work we’re doing, and this is the progress that we have achieved, even if they haven’t realized it?
Szczepan: Most of the teams at LinkedIn, especially the big ones, the important ones, they care a lot about the build speed, like a lot. One of the key indicators that we look at when we assess developer productivity is how long it takes from commit to production. But actually, we partition that. The time that we track is how long it takes from pushing a commit to getting the binary that is ready to be shipped to production. This helps us a lot, because this chunk of time, this part of the pipeline, is absolutely managed by tools. Like the quality of your tests, the quality of your automation. So we have a direct influence on this, we as a build tools organization or the developer infra organization, because product developers, they write those tests. They implement the tests.
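As an illustration that is not part of the conversation, the commit-to-production-ready interval described above can be computed from two timestamps per change. The sketch below assumes each record is a pair of push time and binary-ready time; the function name, the record shape, and the choice of the median as the summary statistic are hypothetical.

```python
from datetime import datetime
from statistics import median


def median_commit_to_ready_minutes(records):
    """Median minutes from pushing a commit to having a production-ready binary.

    `records` is an iterable of (pushed_at, ready_at) datetime pairs.
    Raises statistics.StatisticsError if there are no records.
    """
    durations = [
        (ready_at - pushed_at).total_seconds() / 60
        for pushed_at, ready_at in records
    ]
    return median(durations)


# Hypothetical sample: three commits pushed on the same morning.
sample = [
    (datetime(2018, 1, 15, 9, 0), datetime(2018, 1, 15, 9, 30)),
    (datetime(2018, 1, 15, 10, 0), datetime(2018, 1, 15, 10, 45)),
    (datetime(2018, 1, 15, 11, 0), datetime(2018, 1, 15, 12, 30)),
]
result = median_commit_to_ready_minutes(sample)
```

A median is used here only because it is less sensitive to a single slow build than a mean; which statistic a team actually tracks is a separate decision.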
Hans: Yes. So you work in a centralized tooling team, so you’re not part of an application team.
Hans: I mean, we’re providing a platform with Gradle. And what we’re always struggling with, if we have to support something, let’s say, an IDE we’re not using ourselves, right, the empathy. How do you create empathy within your team for problems you’re not facing yourself? You know what I mean?
Szczepan: It’s hard to keep the level of engagement between the application development teams and the developer productivity and infrastructure teams. And you manage that through process, through, you know, meeting regularly and working together regularly, sometimes embedding engineers from productivity teams within the application team. So there are ways.
Hans: That’s cool.
Szczepan: At the same time, we have to acknowledge this challenge because it is a challenge. You know, we have different goals. And that’s absolutely fine. That’s actually healthy, because if you have those different goals, then when they all merge, we have a great developer ecosystem. One thing I want to call out is that the application team, the product team, they want to go as fast as possible. They are like drivers on the German autobahn, just shipping features and developing. Now, the developer infra team, we don’t want to go that fast, you know. We are building a foundation and infrastructure.
Hans: Yes, yes.
Szczepan: We have to be really cautious, because our mistakes and the problems we introduce will have a very large-scale impact. So there are differences, and there are slightly different goals, but overall, this helps us. And we need to manage that challenge, and we do. And you will too if you build those teams.
Hans: Yep. That’s cool. So I like the embedding idea because of what I’ve seen a lot in the wild, at least with traditional build engineering teams. I’ve seen teams that have only their own agenda. We just want to have a stable release pipeline. We don’t care if any new features are in there, it just has to be stable. And they did not care about what other developers need. They ignored the complexity of the challenge. When I’m doing consulting, I always want to talk directly with the developers, not just with the build team. They don’t necessarily like that. They want to be the team that gives me the information. And then I learn a lot of interesting stuff. And in a healthy organization, there is cooperation. In unhealthy ones, you have a lot of friction, and it’s almost like they’re enemies. That has to change.
Szczepan: We also have friction. It’s not a rose garden all the time. Not all our tools are up to the quality that we want them to be. So it is a challenge.
Hans: But the key is, right, that it is a functioning relationship?
Szczepan: It has to be. One of the problems that we found at LinkedIn is that the foundation team was sometimes referred to as the tools team at LinkedIn. We traditionally were the guys that deal with the tools. Like, we build products, and then you can throw the ball over the fence. Those guys will do the tools. That’s not really healthy. We need to work together. We want to really understand what tooling needs you have, what are your use cases, what are the problems. At the same time, we don’t work on all the tooling needs for the entire LinkedIn. We have to focus on the core developer infrastructure and the challenges that are sufficiently generic that apply to every software team at LinkedIn. Not like, oh, there’s this one team, they have this one problem, but this is tools, we are not going to deal with this because we’ll just tell those guys to fix it. That’s not going to work.
Szczepan: Sometimes with many of the automation and tools, we have joint ownership, like software-wise, where we would be jointly working together on some of the pieces. So that’s also something that we use.
Hans: Ah, that’s cool. That’s great. That’s great because it’s the same for us, right? We don’t have the domain expertise for every ecosystem.
Szczepan: Absolutely. And we want to provide it, like for the dev infra team or the Gradle foundation team. We have a Gradle foundation team at LinkedIn. We want to provide this Gradle expertise and consulting. But it doesn’t mean that that team will own all the Gradle plugins across LinkedIn because we have 500 Gradle plugins at LinkedIn, developed for all kinds of technology stacks and challenges and use cases. And the team only owns the core Gradle platform. We can’t really own every plugin. It just wouldn’t scale at all.
Hans: Yeah, it makes complete sense. But you’re still the experts, right? You can give them advice. I think that’s a really good system.
Hans: So one question I have is, in terms of recruiting for the build engineering team, what are the qualities you’re looking for, like, 20 years of experience with Maven, Gradle, or what?
Szczepan: It’s useful, of course. Like having, you know, many years of experience in the domain of automation, that’s useful. I’d say that the key attribute we look at is being able to get yourself unstuck when you’re working on problems, because what we’ve found is that developer productivity devs or the build tools guys, they work with hundreds of codebases across different teams and even different technology stacks. You don’t have a traditional team on one product, where you have 10 devs working on the same product. There, if you don’t know something, you just ask, hey, how do I do this? And a teammate tells you. But here you work on many problems, and you don’t really have a bunch of other people who you can ask how to solve particular problems. You have to be really able to figure things out yourself. And that’s the key attribute we look at. And I found it really hard to even learn on the job. This is really something pretty inherent.
Szczepan: One other thing I want to call out, which is, I think, something that we found we were successful in teaching, is this empathy. Like that you really want to understand the problem, understand a use case, before you start developing. So it cannot be this gung-ho approach, where somebody comes with a feature and you go, yeah, sure, why not? It’s a good idea, and you just code. It is about understanding: hey, do you really need that feature? OK, why don’t you use that? OK, how does it fit our strategy of building tools for engineers? And then like, oh, OK, now that makes sense. OK, let’s start building something.
Hans: Yes. You’re looking for people that can really have product ownership and have a product mindset saying, hey, why do you need that? Not just, oh yeah, I do it, right? And that requires a little bit of maturity, or some maturity on the developer side, that they do not just say, oh, we want that feature, but that they are willing to discuss the problem space and also aren’t arrogant on their side, saying just do it and shut up.
Szczepan: Absolutely. It’s like, to me, all the developers at LinkedIn are like our customers.
Szczepan: I absolutely want to make sure we solve their problems and the use cases. This does not mean that we’ll do everything that customers want us to do. I really want to understand why, why you need this. And this is interesting, and this is challenging for developer productivity teams because those teams often don’t have traditional product managers, as you have with a typical product. The typical product team has a product manager who has the vision for the product and who can help the team design that product. On the developer infra side, well, you know, those are engineers. They build tools for engineers, so they should be good at being product managers. But that’s not the case. With this product thinking, people have to take that lead. So engineers have to be able to wear that hat and do it.
Hans: And two things I see out in the wild. One is there’s not this product thinking. And basically, everything is based on escalation. Developers are upset about something, that is the next priority. They don’t have their own kind of roadmap and their own criteria to say hey, this is what we think is good for the customers.
Szczepan: That’s true.
Hans: I mean, the customers don’t always know. They’re not necessarily deep experts in all the automation questions, so you have to have your own opinion on what you think is good for them. So that is one thing we’re seeing right there, teams just run by escalation. And you think, hey, you have this obvious problem. They say, oh, but no one is complaining about it. Is that really how you want to drive your priorities? For me, thinking about process engineers in other industries, those are real experts. They have studied that topic, and they have their very own agenda, their very own opinion on what makes a productive environment. And I think we need to have people that have product vision. Otherwise, you’re just driven like a leaf in the wind.
Hans: And then the other thing we’re seeing that prevents that from happening is that the build teams get completely swamped by support requests. “My CI build is not working.” And you want to have product people on a team. And then all they get to do is fix a build failure, which is not even related to the build logic but to a change in the codebase.
Szczepan: Absolutely. I would just say that we see that at LinkedIn. I love the quote from Henry Ford. It’s like, “If I did what my customers wanted me to do, I would give them faster horses.” So you want to listen to the customers, but you really want to design that product yourself. And what you said earlier reminds me of the squeaky wheel problem, that you will fix the squeaky wheel. And that squeaky wheel is the team that is screaming the most that, hey, our build fails, and stuff. Or, you know, we have this big problem, and they will be prioritized because they are the loudest. To some degree, having separate organizations helps with that problem. Where, you know, a different organization has the budget and resources to decide what we work on. And then the escalations have to be resolved, and the prioritization has to be agreed on.
Hans: That’s great insight. When you look at the engineering leadership at LinkedIn, you already talked about the metric of commit to production. Are there any other metrics they’re interested in, basically where they hold you accountable for improving it?
Szczepan: Yes. So our leadership looks at various metrics related to developer productivity because, at the end of the day, we want to have an organization where developers can really focus on building great products and being productive. So we look at everything from commit volume to various code review metrics, since all changes at LinkedIn have to go through code review. We look at how long it takes from when the review is created to when the review is approved for shipping to production. We look at how long deployments take: is it half an hour to deploy your entire application to all the hosts, or slower? And the one that we particularly look at from the standpoint of foundation, like my team, is commit to production-ready. So you commit your change, and when do you have a binary that you can ship to production? That’s one of the key metrics as well. We also look at local dev metrics, like development cycles, because we want them to be as fast as possible. You want developers to be productive.
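A minimal sketch of how two of the metrics mentioned above (review latency and commit to production-ready) could be computed. The record layout, field names, and timestamps are hypothetical and are not taken from LinkedIn's tooling.

```python
from datetime import datetime
from statistics import median

# Hypothetical change records; field names and timestamps are made up.
changes = [
    {
        "review_created": datetime(2019, 5, 6, 9, 0),
        "review_approved": datetime(2019, 5, 6, 11, 0),
        "committed": datetime(2019, 5, 6, 11, 30),
        "production_ready": datetime(2019, 5, 6, 12, 15),
    },
    {
        "review_created": datetime(2019, 5, 6, 10, 0),
        "review_approved": datetime(2019, 5, 6, 10, 30),
        "committed": datetime(2019, 5, 6, 10, 45),
        "production_ready": datetime(2019, 5, 6, 11, 45),
    },
    {
        "review_created": datetime(2019, 5, 6, 13, 0),
        "review_approved": datetime(2019, 5, 6, 17, 0),
        "committed": datetime(2019, 5, 6, 17, 10),
        "production_ready": datetime(2019, 5, 6, 17, 40),
    },
]


def hours_between(start, end):
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600


def median_hours(records, start_key, end_key):
    """Median elapsed hours between two events across all records."""
    return median(hours_between(r[start_key], r[end_key]) for r in records)


review_latency = median_hours(changes, "review_created", "review_approved")
commit_to_ready = median_hours(changes, "committed", "production_ready")

print(f"median review latency: {review_latency:.2f} h")
print(f"median commit to production-ready: {commit_to_ready:.2f} h")
```

The median is used here instead of the mean so that a single very slow review does not dominate the number; that choice is an assumption of this sketch, not something stated in the interview.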
Hans: Over the next, let’s say, three to five years, what are the features that you think are most missing, the ones that would have the biggest impact on improving where we are right now in the automation field? What would some of those features be, either ones you hope someone else provides, or ones you might want to develop yourself at LinkedIn?
Szczepan: Let’s zoom out and think about the global landscape. I’d love to see more tools, automation, and solutions, both for the multi-codebase environment and for the monorepo use cases. The multi-codebase environment I mentioned before. I would like a really complete, holistic, end-to-end solution for it. Like, hey, I have this 1,000-developer organization. We have many, many codebases, but we want to organize around that. So we want to have code review systems. We want to have trunk-based development. We want to be able to understand what versions are in production, what the version conflicts are, how to resolve them, the dependency graph, all that. I would love to see a solution to that.
And I don’t think anyone is actually developing it. You have a bunch of tools that you can integrate, and you can build yourself. We built it at LinkedIn. But there’s no solution for that, and I would love to have that. Because ideally, at LinkedIn I would prefer not to be building the core developer infra. I would like to build the stuff that is really unique to LinkedIn, like solving our unique challenges, rather than, how do we do, you know, code reviews. How do we do that kind of stuff? I would like to be able to use an off-the-shelf solution that is proven to work in other organizations as well. That would be my preference.
Szczepan: And one last thing: I’d also love to see progress on the monorepo tooling. Facebook and Google are investing in build systems. Facebook is investing in Mercurial, the source control system. Microsoft is investing in the virtual file system for Git. So I’d love to see more there. I don’t think those things are there yet. I don’t think that if I have really, really massive development teams, I can just take those tools and get them working. I think I would still need a lot of investment. But at some point in the future, in a couple of years, I think those tools will be there.
Hans: Yeah. For us, it’s the same. It’s a different dimension of scalability. You need to scale for very large repositories, and you need to be able to scale for many repositories. That is what a modern build infrastructure needs to provide. If you look at where the industry is, they’re kind of in a difficult state. They have their monorepo, which is a monolith in most cases. And they got really burned by that. Now they’re trying to get away from it, chipping away and creating many, many small repositories. But the majority of the code is still in the monolith. So they still have this big monolith thing, but on top, they now have the orchestration of 1,000 codebases to deal with. So in terms of complexity, they’re in the worst possible state. And that is, I would say, where many, many organizations are right now, and it will not change. They will not be able to get rid of the monolith for quite a few years.
Szczepan: And you have to account for cleaning up the tech debt; you really have to fund it. I want to have a team that cleans up the tech debt. Now I want to share some numbers.
Hans: Yes, please.
Szczepan: Last July, we completely removed the old monorepo that we traditionally had at LinkedIn. Getting rid of it took about 2.5 years for a couple of devs. When we started, the repo was, I think, 12 million lines of code, and when we finished, we had created 800 separate codebases. Probably half of that code, or maybe 30%, we deleted. It was dead code. So it’s just an interesting data point.
Hans: Nice, yeah. I think we’re ready for some questions. “What kind of changes did your team make to help take LinkedIn from releasing once a month to multiple times daily, and how long did that take?”
Szczepan: There were many changes, and it was also a gradual process. First, we started to think about how we could shrink that process from one month to two weeks, and then to one week. At some point, we were fairly productive with weekly releases. But still, we wanted releases several times a day. That was a decision we made, and it really forced us to organize and build the necessary tooling. And it’s not only tooling; it’s also building processes, like teaching engineers how to write high-quality tests that give high-quality signals in a short time.
So what are the changes? I think there was an automation side, and there was a process side. Process would be to shift away from this thinking that, oh, you know, we can manually test something a little bit later. I can ship that change today, and tomorrow I will verify it. You couldn’t have that at all. Every day, you have to produce the highest possible quality of code, because it goes to production today. So it’s a major mental shift for engineering teams, where there’s no stabilization phase for the release, where you cherry-pick code changes that work or bug fixes and stuff like that. It’s really that every day, you focus on quality. And this also puts a lot of pressure on how you develop your tests and how you structure them. You have to be thinking about your testing pyramid, right? Those slow UI tests, you don’t want to have too many of those. You want to have a lot of unit tests. And how do you manage flaky tests? That was a big one for us.
One of the biggest challenges was how to ensure that our trunk, our master branch, is always healthy. We really cannot have a broken master branch and then just say, oh, somebody is fixing it, and in the next half hour, he’ll be done. You can’t have that. If you have a large team, and you want to ship to production several times a day, you have to front-load all your testing before the code is merged, so that you have the highest chance that your trunk, your master branch, is healthy at any point in time. So those are some of the challenges. And this is a lot, so I hope that it answers your question, at least to some degree.
Hans: I’m just curious were there engineers that couldn’t deal with that new world, and rather preferred a world with kind of longer cycles and stabilization phase? For me, it was always, ship, ship, ship, right? But was it a cultural change that affected how you recruit people, that you were looking, because of that, potentially for different engineering profiles, in terms of what engineers to recruit?
Szczepan: That’s a great question. For the teams that ship to production very frequently, like our LinkedIn.com, there was no major pushback. Engineers like to ship their changes to production frequently. It is part of their productivity. So they are happy. It does put a lot of pressure on writing a lot of tests and on making sure every engineer, from time to time, is in the on-call rotation that handles releases. So there are certain chores, and there is certain overhead for an engineer. He cannot be responsible only for happy and merry coding of features while somebody else deals with the releases and with stabilization.
Hans: No, it’s hard work.
Szczepan: Yeah. It’s hard work. So every engineer has to be kind of a DevOps guy. So I don’t see the problem there. Now if we look at the entire LinkedIn, we have teams that cannot ship to production a few times a day. We have teams that work on our data storage, which has to have great performance. And you cannot run those performance tests in a few hours. You need to test for a couple of days. So it really depends on what kind of domain we’re talking about. Some teams wouldn’t be able to, and it’s undesirable for them. It’s fine if they have a different release cadence.
Hans: Cool. There are tons of great questions. So Mary is asking, “When solving a developer infrastructure tooling limitation, how do you determine whether to build a solution in-house, as opposed to buying a product off the shelf?”
Szczepan: This is an amazing question. This is the whole build versus buy dilemma. And it’s all about ROI. So you have to build some kind of return on investment model. What are the costs and what are the benefits? A useful indicator is: what is your core business as an organization? Does it make sense for you to maintain something? And then, OK, you can accept the off-the-shelf solution, but then you have to accept the trade-offs. Maybe some of the things will not work perfectly for you. Maybe for some of the use cases, it’s, ah, you know, I don’t like this feature. I don’t like how it works, but on the other hand, we don’t want to build it ourselves. And then in build infra, there is a common trap, maybe in product teams as well, where you think that, oh, we build it and it works. You build it once, and it’s fine, this is a one-time cost. It’s never a one-time cost. Every line of code that you write has to be maintained for as long as that code is in production.
Szczepan: Whatever you build, another 100 lines of code, another function, another class, it adds to your long-term maintenance. And if you keep building those customizations, custom tools, and custom features, then at some point you find that, well, I have to hire a bunch of new devs, because otherwise I cannot build any new features, because I have to keep maintaining the old stuff. So build versus buy is a big question. Build a good ROI, return on investment, analysis. The data will tell you.
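The build versus buy reasoning above can be sketched as a small cost model. Every number and parameter name below is a made-up assumption for illustration; the only point carried over from the discussion is that building is never a one-time cost, because maintenance accrues for as long as the code is in production.

```python
def total_cost_build(initial_dev_months, upkeep_dev_months_per_year, years, dev_month_cost):
    """Cost of building: initial effort plus maintenance for every year in production."""
    return (initial_dev_months + upkeep_dev_months_per_year * years) * dev_month_cost


def total_cost_buy(license_per_year, integration_dev_months, years, dev_month_cost):
    """Cost of buying: recurring license plus a one-time integration effort."""
    return license_per_year * years + integration_dev_months * dev_month_cost


def roi(benefit, cost):
    """Return on investment as a ratio: net gain divided by cost."""
    return (benefit - cost) / cost


# Hypothetical inputs for illustration only.
YEARS = 3
DEV_MONTH_COST = 15_000
ANNUAL_BENEFIT = 120_000  # assumed identical for both options

build_cost = total_cost_build(6, 4, YEARS, DEV_MONTH_COST)
buy_cost = total_cost_buy(50_000, 2, YEARS, DEV_MONTH_COST)
benefit = ANNUAL_BENEFIT * YEARS

build_roi = roi(benefit, build_cost)
buy_roi = roi(benefit, buy_cost)

print(f"build: cost={build_cost}, ROI={build_roi:.2f}")
print(f"buy:   cost={buy_cost}, ROI={buy_roi:.2f}")
```

With different inputs the comparison can easily flip, for example when the off-the-shelf product covers only part of the use cases and the benefit is lower; the model only makes the assumptions explicit.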
Hans: Data-driven, right, have good insights. That’s for me another fascinating thing. You come to an organization. You ask them, how many builds do you have a day? How often are they failing? How long does a developer build take? How long does a CI build take? No one knows. You think, OK, then how do you want to build an ROI model, right, when you don’t know how often is it executed by the developer and things like this.
Szczepan: I missed that metric when we talked about metrics. We also track the stability of our pipelines.
Hans: Right. Very important.
Szczepan: How many tool failures we have, how many user failures, we also track that.
Hans: And people, when you ask them, completely underestimate, for example, the number of failures. We ask them, how many of your CI builds are failing? Oh, a couple of percent. And then you measure, oops. 25%. It’s not necessarily a bad thing, but it’s fascinating. It’s hard to make those educated decisions without data.
And next question, “How does LinkedIn support the build pipeline for different platforms, web, iOS, Android.”
Szczepan: So every project at LinkedIn, regardless of the technology stack, from C++ to Python and some other things, is built with Gradle. Previously, it did not scale for us because we had build engineers specializing in different build tools all over the place. So the level of silos within the team was high. And certain code would be duplicated across the technology stacks because the build systems were different, and you couldn’t really reuse that code. So we wanted a common build infrastructure on top of one technology, and we chose Gradle. We find it useful for helping our team scale.
Hans: Yes. “How do you get buy-in from developers used to doing long-lived branches to trunk-based development? And then, what is a good way to help people change their mindset to move that way?”
Szczepan: This is hard. At LinkedIn, we never recommended long-lived branches, so we didn’t have the exact problem that you are referring to. We used to have a release branch, which you always have when you have weekly or monthly releases. We used to have development on branches as a work-around for the “let’s deal with the integration problems later, I want to focus on my feature” kind of mentality. So we had some of that in the past. Going to trunk-based development was something we did a long time ago. The decision to do that was circa 2011. So for us, it’s in the past.
I’d say that the key thing to use to convince developers to go to trunk-based development is the overhead of integration, which is always higher with long-lived branches. There’s always going to be a chunk of development time dedicated to integration, to cherry-picking, to stabilizing, and that’s pure waste.
Hans: It’s a nightmare. And for me, when I’ve seen this, it was often a measure of desperation. You have a big monorepo, but the build is so slow, and the tests are flaky. So the only way you could have some illusion of progress, is you have your own branch. But then the merge nightmare comes. So for me, it was almost like a desperate measure, because you don’t know what else to do. And not just wait all the time and be broken.
Szczepan: And delivering incrementally in small batches is key. It really helps. It’s like you amortize the complexity and cost over time. And that is what trunk-based development provides.
Hans: But you can do it because you have fast feedback cycles. That is always the key, right?
Szczepan: I want to say one last thing, which is that trunk-based development does not mean that you don’t do branches at all. Some people say you have either trunk-based development or a pull request model. That’s just untrue. You can absolutely do pull requests and do trunk-based development. You can even cook your pull request for a month if you need to, because the change is hard to do and you have to do a lot of figuring things out. That is fine, as long as this is not the default way for the team to operate, where every developer works in a one-month-long development branch, and when the last Friday of the month comes, it’s, let’s do it, let’s merge it, and whoever goes first has it easy. No merge conflicts and stuff. So you do not want to end up there. But it’s fine to work with branches.
Hans: This is a very interesting question. So in your opinion, “What kind of size does a team need to be before they really start paying attention to build automation and DevOps concerns? Do you think there should be much of a concern for tiny teams, let’s say, 10 developers?”
Szczepan: This is an amazing question. I’m going to use one interesting data point. When we started developing the new LinkedIn.com in 2015, we found that once we had more than 15 engineers on the team delivering to the same codebase, Git stopped scaling for us, in the sense that a developer couldn’t push his change. Why? Because he has to merge with the upstream first. So he merges with the upstream. He runs the build, or sends a build request to run it in the cloud. Half an hour later, the build is finished. It’s all good, so he tries to push. Oh, I cannot push, because somebody else has pushed in the meantime. So developers were not able to push code, right? That is also one of the reasons we don’t use a pull request model.
Internally at LinkedIn, we have trunk-based development. So this was our trigger point for, oh, shoot, we need some kind of commit merge queue to get all the commits in a queue and start merging them and running the build. This is an example of where we found out that, at a certain team size, something stopped scaling for us, and we had to come up with a solution.
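The commit merge queue idea described above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not LinkedIn's implementation: the `Commit` and `MergeQueue` names are hypothetical, and the `passes_build` flag stands in for actually rebasing on the trunk head and running the build.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Commit:
    author: str
    passes_build: bool  # stand-in for rebasing on trunk and running the real build


@dataclass
class MergeQueue:
    trunk: list = field(default_factory=list)
    pending: deque = field(default_factory=deque)
    rejected: list = field(default_factory=list)

    def submit(self, commit):
        """Developers enqueue their change instead of racing each other to push."""
        self.pending.append(commit)

    def process(self):
        """Take commits one at a time, in order, and merge only the ones that build."""
        while self.pending:
            commit = self.pending.popleft()
            if commit.passes_build:
                self.trunk.append(commit)  # trunk only ever receives verified commits
            else:
                self.rejected.append(commit)  # sent back to the author, trunk untouched


queue = MergeQueue()
queue.submit(Commit("alice", passes_build=True))
queue.submit(Commit("bob", passes_build=False))
queue.submit(Commit("carol", passes_build=True))
queue.process()

print([c.author for c in queue.trunk])     # authors whose commits were merged
print([c.author for c in queue.rejected])  # authors who need to fix their change
```

Because the queue serializes merges, no developer has to repeat the merge, build, push cycle after losing a race, and a failing change never lands on the trunk.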
You know, it really depends, because you will find different trigger points for different problems, for deployment, for local development. You’ll find that, oh, the codebase is now too big to use in IntelliJ or in the IDE. We have to do something, right?
Hans: But at the same time my answer is simple. If this is a serious project, not just a four-week experiment, even if it’s three people, even if it’s just myself, I would invest in automation. I would make automation a first-class citizen from the very first day. And it would be very different automation, with very different scalability requirements, than you have at LinkedIn. But with automation, if you don’t do it, it will bite you. You will pay a price after 30 days. You are already less productive than if you had invested in automation on the first day.
So I would start any project with a build, and then, as soon as I have to do something manually a couple of times, I would invest in automating it. Because it’s a key part of lean production and it makes it easy to ask questions. When you don’t have automation, and the question is, oh, was there a performance regression compared to yesterday, and it takes me two hours to set everything up to ask the question, I don’t ask the question.
Szczepan: I agree. Keep thinking about automation continually, and consider it from day one. At the same time, you don’t want to overcook it. You want to allow some evolutionary changes. You want to be prepared for scaling up. You want to think, OK, what’s going to happen next year, when we double the size of the engineering team? At the same time, you have to know that whatever you come up with today to scale out next year, most likely a year or two from now you’re going to have to scrap it, because you’ll have different scalability issues, and you’ll have to build different things. So I think, manage your own and the leadership’s expectations that this is a continual effort.
Hans: Yes. That’s key.
Szczepan: Collecting the data, tracking the data from day one, around productivity, helps to solicit resources to the effort of automation.
Hans: Keep it simple, but automate.
Szczepan: I feel that we give very generic answers.
Hans: No, for me, it’s like, immediately. Automation immediately, because I’ve seen that it also gets your people into a productive mindset. I mean, when I asked you the question, what brought you into this domain, you said automation. When I joined a team, it was always the first responsibility I took on. We need to increase the amount of automation. I cannot be productive without it. So for me, it’s keeping it simple. Don’t say, oh, for a small codebase, we need to split it over 50 repositories because maybe in two years that helps. Keep it simple. But automate what you need to automate.
So the next question, “I assume you have end-to-end tests for everything before you do a release, or at least some smoke tests.”
Szczepan: We have all kinds of tests. The end-to-end tests or UI tests are expensive. So we manage our testing strategy carefully, because in order to release several times a day, your tests have to fit a certain time box. Your tests cannot run for six hours. It’s just not going to work. So this is, for us, a forcing function to get our testing strategy right, and also to think about what you don’t want to test. What is the stuff where you think, this is not worth building tests for? We monitor our production rollouts. We can hide a feature if it’s problematic. So you introduce different ways of managing your risk, not only tests.
Hans: Right. That’s a very good point. Nice. Second to last question, “With such a rapid production release schedule, how often do changes need to be reverted? Maybe never? How time-consuming would reverting production changes typically take?”
Szczepan: It depends on the product. Very rarely is it going to be a rollback or revert for us. We usually forward fix; it is just a more natural way of doing things. Keep in mind that shipping code does not necessarily expose the feature yet. We have ways of controlling exposure, so we can hide it. If the feature is problematic, we can hide it, like, automatically. So you don’t have to revert the code from the production system.
There is also a way to roll back. In trunk-based development at LinkedIn, there is a way to create a hotfix, a change on the version that you built, let’s say, yesterday or two days ago. So we have tooling around that. We have processes around that. It’s possible, but it is pretty rare. Usually, you forward fix. For the API service behind LinkedIn.com, let’s say, some kind of rollback happens a couple of times a week. So this happens. And I think it’s a bit too high. We should bring it down. I hope that answered the question.
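Hiding a problematic feature without reverting code is commonly done with feature flags. The sketch below is a generic illustration of that idea; the `FeatureFlags` class, the flag name, and the 5% error-rate threshold are all hypothetical and say nothing about how LinkedIn's own system works.

```python
class FeatureFlags:
    """In-memory flag store; a real system would back this with a config service."""

    def __init__(self):
        self._exposed = set()

    def expose(self, name):
        self._exposed.add(name)

    def hide(self, name):
        self._exposed.discard(name)

    def is_exposed(self, name):
        return name in self._exposed


def hide_if_unhealthy(flags, name, errors, requests, max_error_rate=0.05):
    """Hide the feature automatically when its observed error rate is too high."""
    if requests and errors / requests > max_error_rate:
        flags.hide(name)
        return True
    return False


def render_header(flags):
    """The new code path is deployed, but only used while the flag is exposed."""
    return "new header" if flags.is_exposed("new_header") else "old header"


flags = FeatureFlags()
flags.expose("new_header")
before = render_header(flags)

# Monitoring reports 12 errors in 100 requests, above the assumed 5% threshold.
hidden = hide_if_unhealthy(flags, "new_header", errors=12, requests=100)
after = render_header(flags)

print(before, hidden, after)
```

The deployed binary is the same before and after; only the flag state changes, which is why this is usually faster than a rollback.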
Hans: Cool. Last question. “How do you handle flaky tests?”
Szczepan: Flaky tests are always fun. One of the things we have discovered is that flaky tests are one of the most disheartening, morale-sapping, and energy-sapping problems in an engineering team. You want to ship your change, and yet a test that was written by somebody else is failing, and it’s flaky. You can’t get your change out, and it’s very problematic. What we have also discovered is that flaky tests really hurt a frequent release cadence, because your build stops and then you have to rerun things.
So here are some of the things that we have discovered for managing them. First, tracking flaky tests, so that you understand which of your tests are flaky. You want to track that. On some of our codebases, we have automated tools to disable flaky tests. They run all the tests overnight without any changes. If you run a test 1,000 times without any code change, you’ll find out if it’s flaky or not. That’s one way to do it.
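The "run it many times without any code change" approach described above can be sketched as follows. The `classify` helper and the sample tests are hypothetical; the counter-based test is only a deterministic stand-in for real nondeterminism such as timing or shared state.

```python
def classify(test_fn, runs=1000):
    """Run a test repeatedly on unchanged code and classify it by its outcomes."""
    outcomes = set()
    for _ in range(runs):
        try:
            test_fn()
            outcomes.add("pass")
        except AssertionError:
            outcomes.add("fail")
        if len(outcomes) == 2:
            break  # both outcomes observed on identical code: flaky, stop early
    if outcomes == {"pass"}:
        return "stable"
    if outcomes == {"fail"}:
        return "broken"
    return "flaky"


def always_passes():
    assert 1 + 1 == 2


def always_fails():
    assert 1 + 1 == 3


call_count = {"n": 0}


def sometimes_fails():
    # Fails on every 7th call; a stand-in for order- or timing-dependent behavior.
    call_count["n"] += 1
    assert call_count["n"] % 7 != 0


report = {
    fn.__name__: classify(fn)
    for fn in (always_passes, always_fails, sometimes_fails)
}
print(report)
```

A tool like this could feed a list of tests to disable or quarantine; what to do with the flagged tests is a policy decision outside this sketch.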
Hans: That’s cool.
Szczepan: You will have to manage flaky tests. In a small team that doesn’t release to production frequently, it’s fine to have a flaky test. Probably the cost is really low. For larger teams with a high commit volume, you want to deal with them. We don’t do it that often, but some organizations deal with flakiness with various kinds of retries. If you invest in retries, it’s a slippery slope. You have to be watchful. When you are doing retries, retry from the point of failure, rather than retrying the entire, big operation, because you want to optimize for speed. And keep tracking that.
You want to make the flaky tests visible and also attributable to teams or even individuals. So you have this concept that visibility leads to responsibility, which leads to results. So make the flaky tests visible. You know, how many flaky tests do we have? And then, oh wow, we have 5% flaky tests. And those who introduced flaky tests should be held accountable for fixing them.
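Retrying from the point of failure, rather than rerunning the whole operation, might look like the sketch below. The `run_pipeline` function, the step names, and the limit of three attempts are assumptions made for illustration; the retry counts are returned so that retries stay visible and can be tracked.

```python
def run_pipeline(steps, max_attempts=3):
    """Run steps in order; on failure, retry only the failing step.

    Returns a dict of retry counts per step so that retries can be tracked.
    """
    retries = {}
    for name, step in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                step()
                break  # step succeeded, move on without redoing earlier steps
            except Exception:
                if attempt == max_attempts:
                    raise  # give up: this is a real failure, not flakiness
                retries[name] = retries.get(name, 0) + 1
    return retries


runs = {"compile": 0, "integration_test": 0}


def compile_step():
    runs["compile"] += 1


def integration_test_step():
    # Fails on the first attempt only; a stand-in for a flaky step.
    runs["integration_test"] += 1
    if runs["integration_test"] == 1:
        raise RuntimeError("transient failure")


retry_counts = run_pipeline([
    ("compile", compile_step),
    ("integration_test", integration_test_step),
])

print(runs)          # compile ran once even though a later step was retried
print(retry_counts)  # which steps needed retries, for tracking
```

The expensive early step is not repeated when a later step fails, which is where the speed benefit comes from.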
Hans: Cool, awesome. So, yeah. Thanks for the great questions. Thanks for the fantastic answers. I learned a lot.
Szczepan: Thank you.
Hans: Thanks for coming. That was great, and yeah, we’re continuing the discussion and feedback on Twitter. Mockitoguy is Szczepan’s Twitter handle. Mine is hans_d. For more on Gradle, see gradle.org and gradle.com.
Particularly on gradle.com, we have a couple of blog posts that also discuss this topic: the challenge of build engineering, the cost of builds, building ROI models around the build. So you’ll find some good resources there, on gradle.com. And then our next webcast will be on build performance troubleshooting, sometime in June. We will let you know once the date is established.
Thanks a lot for attending. Have a great day. Have a great evening, bye.
Szczepan: Thank you very much for watching, guys. I hope you learned a lot. And keep automating, keep automating. I can’t stress it more.
Hans: Cool. Thanks a lot.