Summary:

Welcome to an insightful discussion with Jez Humble, a Site Reliability Engineer at Google and a co-founder of DORA. In this talk from Prodacity, Jez shares his extensive experience in DevOps, continuous delivery, and the impact of these practices on organizational outcomes.

Transcript:

Jez Humble (00:17):

Good morning everyone. Thanks very much for that amazing introduction, Bryon. I hope I can live up to that in some way. So yes, I'm right now a site reliability engineer at Google. I am tech lead of the SRE team that manages Cloud Run, App Engine, Cloud Functions, which is kind of the best job in the world. I really love doing that.

(00:38):

I'm here today to talk about DevOps. As Bryon said, I co-founded DORA. We are, I think, probably the longest-running research program into DevOps. We've been going since the State of DevOps Report in 2014, which was a Puppet production. Nicole and I came on board (Nicole was the CEO of DORA), and she turned that into a rigorous scientific program where we investigated what works, what doesn't work, and how to implement it. So, DevOps: probably everyone is familiar with DevOps. Who came to see Nathen [Harvey] speak on Monday? Okay, so I'm just going to briefly recap some of this for people who weren't there. We define DevOps at Google as an organizational and cultural movement that aims to increase software delivery velocity, improve service reliability, and build shared ownership among software stakeholders.

(01:41):

So everyone's got their own definition. That's by design. This is how we think about defining DevOps at Google; take your own definition, that's totally fine. We're still learning how to get better at doing these things, how to get better at building and operating software systems. New technologies are coming out all the time, so there's no one way of doing it. In this program, we found a way to measure software delivery performance, the impact of changing software delivery performance on organizational outcomes, and the factors that drive improved performance. So, how do we measure software delivery performance, and why is it important? We found that software delivery actually matters. We were told for a long time that it's not strategic to be able to deliver software faster and better, but actually we found it is. It improves not just commercial outcomes such as profitability, productivity, and market share; it also impacts non-commercial goals that we're all familiar with in government, such as the ability to achieve organizational and mission goals, the quality and quantity of products or services we can provide, our ability to deliver customer satisfaction, and improved operating efficiency as well.

(02:58):

So these capabilities matter. They matter to your organization or to your business. There are four things that we found are important in driving those organizational outcomes, which we call software delivery and operational performance. Two things to do with speed: deployment frequency, how frequently you can deploy into production, and then lead time for changes, how long does it take you to go from writing code to getting that code out in production? And then, two things that are concerned with stability: change fail rate, when you push a change out to production, what percentage of the time do you have to roll back or remediate because something went wrong? And then time to restore service: when something goes wrong in production, how long does it take you to actually fix that problem?
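
To make those four measures concrete, here is a minimal sketch of how a team might compute them from its own deployment records. This is not from the talk or from DORA's survey instrument; the Deployment fields and function names are illustrative assumptions.

```python
# Minimal sketch (illustrative): the four software delivery metrics,
# computed from a team's own deployment records.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional

@dataclass
class Deployment:
    first_commit_at: datetime               # "hands on keyboard" for the change
    deployed_at: datetime                   # when the change reached production
    failed: bool                            # needed a rollback or remediation?
    restored_at: Optional[datetime] = None  # when service came back, if it failed

def deployment_frequency(deploys: List[Deployment], window_days: int) -> float:
    """Deployments per day over the observation window."""
    return len(deploys) / window_days

def lead_time_for_changes(deploys: List[Deployment]) -> timedelta:
    """Median time from starting work on a change to it running in production."""
    return median(d.deployed_at - d.first_commit_at for d in deploys)

def change_fail_rate(deploys: List[Deployment]) -> float:
    """Fraction of deployments that had to be rolled back or remediated."""
    return sum(d.failed for d in deploys) / len(deploys)

def time_to_restore(deploys: List[Deployment]) -> timedelta:
    """Median time to restore service for the deployments that failed."""
    failures = [d for d in deploys if d.failed and d.restored_at]
    return median(d.restored_at - d.deployed_at for d in failures)
```

A team could feed these functions from whatever deployment log its delivery pipeline already produces; the point is that all four numbers fall out of data the pipeline already has.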

(03:46):

What we find, and we found this consistently every year, there's no reason it should come out like this, but it always does. We do cluster analysis to take the responses to our surveys and group them into clusters in a way that's statistically valid. What we find every time is that our responses split into three or four groups. There's always an elite-performing group or a high-performing group that's able to deploy faster and achieve high stability. And, at the bottom, a low-performing group that deploys infrequently and has worse stability, and then one or two groups in between. And that's really amazing for me. Every time we run the stats, I kind of hold my breath, because we've been told for so long that speed and stability is a zero-sum game, that if you go faster, you're going to break things. And what we find is that that's not true. Actually, speed and stability are complementary. The best teams are able to go fast, and produce higher stability, and higher quality, and have happier teams as well. And so that's what we're actually working towards. Not go faster and break things, but go faster, higher stability, higher quality, happier teams. We find this works everywhere. You can see that 20% plus of our responses are from organizations with 10,000 people or more.

(05:04):

We have most of our responses coming from technology organizations, but we have responses from highly regulated organizations, including healthcare, telecoms, government, energy, and we find that these practices work everywhere. It's not that they work better in some places than other places. What we find in big companies is that it's very heterogeneous. You'll find teams that are high performers and teams that are low performers, and managers and leadership move around, and the teams shift performance based on that, basically. That's not something, by the way, that we find in the survey; it's my personal experience that it tends to be leadership driven. But the point is this works everywhere, works in big organizations, small organizations, highly regulated, not highly regulated. You can do this anywhere. And as Bryon says, I worked at 18F, where the team I worked on built Cloud.gov, which is a government service that uses continuous delivery to deploy changes.

(06:04):

So, we've thrown this word out a lot. What do we mean by it? Continuous delivery is about making releases boring. We want to be able to push changes out to production at any time, with no drama. Who here works in an organization where people have to work evenings and weekends to push out releases? Okay, well thanks very much for putting your hand up. I'm sorry about that. The reason that Dave Farley and I wrote the book in 2010 is that we worked on a team that started doing that, and then found ways to stop doing that, and we never wanted anyone to have to do that again. So we haven't succeeded yet, but that's the goal. You should be able to push out changes at any time, and it should be a complete non-event. And what we found, since we wrote the book in 2010, Dave and I, is that this is possible anywhere. We have case studies from, again, government, from financial services, from any kind of organization you like, firmware, you can do this anywhere. It is hard. It takes time, it takes investment.

(07:08):

It's actually a whole bunch of different practices. We found - I think there's probably like 13 things on this technical practices list - that help drive this capability of making releases boring, and being able to perform them whenever you want. Implementing those things not only drives better performance, it also drives cultural change. So as Nathen said on Monday, you can't change the way you think to change the way you act. You have to change the way you act to change the way you think. Changing the way you behave is what changes culture. And so, implementing these practices drives cultural change, which in turn drives performance. And then implementing these practices, and achieving this capability, also results in teams that are less burnt out, that have less pain deploying, and that have high quality measured in terms of the amount of time people spend doing rework. So a lot of these things on the left seem kind of a bit random perhaps, but they all do have one thing in common, which is they're all about building quality in. Instead of this idea that we'll build something and then after it's built, we'll test it, and then we'll do the security analysis...

(08:19):

That doesn't work. You can't take software that wasn't built to be secure or built to be performant and then, once it's built, fix those problems afterwards. You can't wave your magic wand and have the DevOps fairies come and make your insecure software secure. That's not something you can fix after the fact. You have to actually build that in. And so a lot of these practices are about giving developers really fast feedback, from the beginning, about the impact of what they're doing. Has this change introduced some security vulnerability? Has it made the system less performant? Have I introduced a defect? You want to get developers that information straight away so that they can make sure their software is always deployable from day one. Even if you wouldn't necessarily want users seeing that stuff, it's got to work from day one, and it has got to achieve all those different requirements that we care about.

(09:14):

In order to do that, we need a lot of testing, and we need to be doing testing not after dev complete. We need to be doing testing all the time. So continuous testing is one of the key cornerstones of continuous delivery. That means having a lot of automated tests. Testing that the individual methods and functions do what they're supposed to do, but also that we can actually run a user journey end-to-end and that it works, in an automated way; testing performance and security and all those kinds of things in an automated way; and doing manual exploratory testing all the way through the software delivery lifecycle, not just at the dev-complete point. We tie all that stuff together into what's called a deployment pipeline. A deployment pipeline means everything you need to reproduce the configuration of your production environment has to be in version control.
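
As a rough illustration of those two layers of automated tests, here is a minimal pytest-style sketch, not from the talk: the add_item function, the /cart and /checkout endpoints, and the locally running test service are all hypothetical.

```python
# Minimal sketch of unit-level and journey-level automated tests.
# The function under test and the HTTP endpoints are made-up examples.
import requests  # assumes the 'requests' HTTP client library is installed

def add_item(cart: dict, sku: str, qty: int) -> dict:
    """Tiny example unit under test."""
    cart[sku] = cart.get(sku, 0) + qty
    return cart

def test_add_item_accumulates_quantity():
    # Unit level: does this individual function do what it's supposed to?
    cart = add_item({}, "sku-123", 2)
    cart = add_item(cart, "sku-123", 1)
    assert cart["sku-123"] == 3

def test_checkout_user_journey():
    # Journey level: can a user complete an end-to-end flow against a
    # test instance of the service? Assumes one is running locally.
    session = requests.Session()
    session.post("http://localhost:8080/cart", json={"sku": "sku-123", "qty": 1})
    response = session.post("http://localhost:8080/checkout")
    assert response.status_code == 200
```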

(10:05):

I should be able to stand up a production environment, or a production-like environment, in a fully automated way, using information from version control, and be able to deploy my service to that environment in a fully automated way. That's what we're looking for. And then any time I make a change to that system, we run automated tests. If those tests fail, we fix them straight away. Once we have a build that passes the automated tests, that goes downstream for maybe performance testing or security testing or other kinds of exploratory testing that we're doing all the time. And again, the moment we find a problem, we fix it straight away. Once we have builds that are deployable, we can do a push-button deploy into a staging or pre-prod environment, and then we can use exactly the same process that we've used for that to do deployments into production environments.
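
Here is a minimal sketch of that stage-by-stage flow, assuming nothing about any particular CI/CD product; the stage names and the print-based reporting are placeholders for whatever your build system actually does.

```python
# Toy deployment pipeline: every stage is driven from a commit in version
# control, and a failure at any stage stops that build from going further.
from typing import Callable, List

Stage = Callable[[str], bool]  # takes a commit SHA, returns pass/fail

def build_and_automated_tests(commit: str) -> bool:
    # Placeholder: compile and run the fast automated test suite.
    print(f"building and testing {commit}")
    return True

def performance_and_security_tests(commit: str) -> bool:
    # Placeholder: slower downstream checks on a build that already passed.
    print(f"performance/security tests for {commit}")
    return True

def deploy_to_staging(commit: str) -> bool:
    # Placeholder: push-button, fully automated deploy to pre-prod.
    print(f"deploying {commit} to staging")
    return True

def deploy_to_production(commit: str) -> bool:
    # Placeholder: exactly the same mechanism as the staging deploy.
    print(f"deploying {commit} to production")
    return True

PIPELINE: List[Stage] = [
    build_and_automated_tests,
    performance_and_security_tests,
    deploy_to_staging,
    deploy_to_production,
]

def run_pipeline(commit: str) -> bool:
    for stage in PIPELINE:
        if not stage(commit):
            print(f"{stage.__name__} failed for {commit}; fix it straight away")
            return False
    return True

if __name__ == "__main__":
    run_pipeline("abc1234")
```

The point of modelling it this way is that the production deploy is just another stage using the same mechanism as the staging deploy, not a separate manual process.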

(10:56):

The key metric that we care about with the deployment pipeline is lead time. Mary and Tom Poppendieck have a way of thinking about this, because lead time is actually hard to measure. Of those four key metrics, it's probably the one that's hardest to measure, because you've got to start from when hands are on the keyboard, not when the change is pushed into version control, but from hands on keyboard, actually writing your production code, right through to production. And that's something that a lot of teams don't actually track. So here's a way to think about it, from Mary and Tom Poppendieck. They say, "How long would it take your organization to deploy a change that involves just one single line of code? Do you do this on a repeatable, reliable basis?" This is important in terms of software delivery performance, in terms of velocity, but it's also really important in terms of reliability. All you've got to think about is what happens when you discover a vulnerability. Say some library - who was around for the Log4j vulnerability, right - you remember that? OK.

(12:05):

If you work in an organization which doesn't have this kind of automation and configuration management in place, what you were probably doing is standing up teams to do archeology and hunt around and find out which systems contain those vulnerable versions of Log4j. Try and find the source code, and then try and work out how you deploy that thing, which may not have been deployed for months, or even years. And then try and get all that stuff deployed. And that's actually a really miserable problem to have to deal with. Anyone here working in an organization where you don't have the source code to some of your mission-critical systems? Alright. Anyone working in an organization where you have to buy parts for your hardware for mission-critical systems on eBay? Yep. There's one person there. Always at least one person. That's a real thing.

(12:54):

People say that continuous delivery is risky. But actually, when you have this problem, solving it becomes really hard, and continuous delivery is about making it extremely straightforward to do that. I was actually on call at Google, for the services I spoke about earlier, when the Log4j vulnerability hit. And in Google, all our code is in a single monorepo. And for every library, there's exactly one version checked into version control. So, fixing that problem for Log4j at Google was... we looked in our version control repository to see which version of Log4j we had checked in. It wasn't a vulnerable one, we were fine. Had it been a vulnerable one, we would've validated the latest version of Log4j. We always check anything we import into our repo, do a bunch of validations of it to make sure that it's secure and passes all the tests that we care about, import it, and then we declare all our dependencies within the Google codebase. So you just check that in, and then the system will automatically rebuild anything that consumes that library, and deploy it into production in a fully automated way. And that's because we have this capability of complete traceability from source control to production, and everything is automated. But that is extremely hard to do. It takes a lot of investment. It's expensive, but it gives you this amazing capability which you care about not just from the velocity point of view; it's also critical for security.
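
To illustrate the shape of that lookup (a toy sketch of the idea, not Google's actual tooling), suppose the pinned third-party versions and a reverse-dependency index both live in version control; the manifest contents and target names below are invented, and the advisory data is only an example.

```python
# Toy sketch: with exactly one pinned version of each third-party library in
# version control, "are we exposed, and what needs rebuilding?" is a lookup,
# not archeology. All data below is invented for illustration.
THIRD_PARTY_PINS = {          # hypothetical manifest checked into the monorepo
    "log4j": "2.17.1",
    "guava": "32.1.3",
}
REVERSE_DEPS = {              # hypothetical build-graph index: library -> consumers
    "log4j": ["//billing/server", "//search/indexer"],
}
VULNERABLE_VERSIONS = {       # example advisory data, not a real feed
    "log4j": {"2.14.1", "2.15.0", "2.16.0"},
}

def affected_targets(library: str) -> list:
    """Return the build targets that must be rebuilt and redeployed."""
    pinned = THIRD_PARTY_PINS.get(library)
    if pinned is None or pinned not in VULNERABLE_VERSIONS.get(library, set()):
        return []                         # pinned version is not vulnerable
    return REVERSE_DEPS.get(library, [])  # everything that consumes the library

print(affected_targets("log4j"))  # [] with the pin above; change the pin to see the rebuild list
```

The capability Jez describes adds automated rebuilds and deploys on top of this lookup; the sketch only shows why a single pinned version makes the question answerable at all.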

(14:24):

The same thing happened with Apache Struts a few years ago, and it's exactly the same problem. They just didn't know which of their production systems were using that vulnerable version, and they had to do archeology to work out which services were impacted and how to fix them. So, we find in our research that high performers build security into software. They're not using a downstream process to validate security. They're conducting security reviews frequently. They complete changes rapidly. And they're running security tests as part of the deployment pipeline. And the job of InfoSec is not to be a gatekeeper. It's to make it easy for development teams to do the right thing by building toolchains and processes, and making it easy for developers to get fast feedback and do the right thing. So building security in is a key capability that enables continuous delivery.

(15:16):

I want to talk a bit about architecture. When people talk about architecture for continuous delivery, they often talk about Kubernetes and all these microservices, and all this kind of thing. And those are great ideas and technologies, but it's the outcomes that those technologies enable that are important, not the technologies. What we find is actually really important in continuous delivery is whether you can answer yes to these five questions: "Can I make large-scale changes to the design of my system without the permission of someone outside my team, or without depending on other teams? Can my team complete its work without needing fine-grained communication and coordination with people outside the team? Can my team deploy and release its product or service on demand, independently of other services the product or service depends on? Can my team do most of its testing on demand, without requiring complex, manually configured integrated testing environments that take days or weeks to get set up?" And then finally, the key piece: "Can my team perform deployments during normal business hours with negligible downtime?"

(16:20):

Now, you can do this stuff on mainframes. I've seen teams do it. You could also invest years in moving everything to Kubernetes and having a spiffy microservices-style architecture and not be able to answer yes to these questions, in which case you basically wasted your money. It's not about the particular technologies and processes you use, it's about the outcomes those things enable. That's what you have to keep an eye on as someone involved in a technology transformation.

(16:54):

Obviously I work for Google Cloud, we like the Clouds, it's good. When I was working for the federal government...I still have my NIST folder with all the NIST documentation including 800-53 with all the controls that you have to implement for a government service, but one of my favorite NIST documents is 800-145, which is NIST's definition of a Cloud. And it's definitely the shortest NIST document that I own. It's only a few pages long and it defines Cloud in terms of these five characteristics: on-demand self-service, which means that your people must be able to self-service stuff from the Cloud. You see this a lot where agencies procure the Cloud, but as a developer, I still have to create a ticket, or send an email, and wait days in order to get a VM. That's not the Cloud. You just paid a lot of money for some virtualization which didn't change your outcomes at all in your delivery process.

(17:58):

Broad network access, which means that teams have to be able to access the resources from whatever devices they need. Resource pooling, which is a small number of hardware devices supporting a large number of virtual devices. Rapid elasticity, which is the illusion of infinite resources, the ability to scale up and scale down on demand. And then finally measured service, which means you only pay for what you actually use. And so, back in 2019, we asked teams who was actually doing these things. These are teams that said they were using the Cloud, and only 29% of respondents actually met NIST's definition of what the Cloud is. And you see this a lot. People buy the Cloud, but they use the same old-school data center practices to manage that infrastructure, to make changes to it, and in the deployment pipeline leading up to it. And that doesn't work. It doesn't make a difference. You've got to implement these practices. You've got to think differently about how you manage Cloud-based infrastructure, and how you manage software delivery to Cloud-based infrastructure. And when you do that, we find that you're 24x more likely to be in that elite-performing group. It can make a huge difference if you actually do it right.

(19:16):

So, I want to end by talking about how you actually implement all this stuff. This is probably a bit overwhelming. It is hard. It takes time. A lot of people try and implement continuous delivery through a traditional program management approach, where you do a lot of upfront planning, and then spend a lot of time trying to do it, and then there's an end date, and then we're done. And that is completely antithetical to the way continuous delivery works. It's not the best way to do this kind of thing. You shouldn't try and do it all at once. And if you try and do it all at once, you'll probably fail. We actually investigated "what are effective strategies for transforming your organization?" And what we find is that a lot of the very popular ones are not in fact used by elite performers. So things like a center of excellence: not to say these things can't work, just that you've got to be really careful. If you create a center of excellence, for example, and you put all the excellent people in the center of excellence, who are all the people who are actually doing the work? They're the people that are not excellent. How good is that? How do you think people are going to feel about that?

(20:31):

It's problematic. What we find actually works really well is not top-down, not bottom-up, but building community structures that involve everyone in the organization. So that means building communities of practice, doing grassroots stuff, experimenting with ideas, and then taking the things that don't work and talking about those: "we tried this and it didn't work... here's why." Doing retrospectives and postmortems, and then taking the things that do work and helping those organically move out through the organization. And I've got an example of how we did that at Google. Google did not do a lot of automated testing early on. I've been working in XP teams since around 2005, when I joined a company called ThoughtWorks, and I was brainwashed into it. I remember reading the XP book in 2001 and thinking, well, this can't possibly work, what a load of crap. And then joining a team in 2005 that was doing XP, and getting brainwashed, and then realizing, this actually works. It's amazing. I never want to work any other way.

(21:40):

But it wasn't being told you have to have 80% test coverage from your unit tests. That doesn't work. So Google did not have a testing culture, and the testing culture was built by a community of practice that had no official funding; they basically did it in their 20% time. At Google, assuming you're doing fine in your performance reviews, you get 20% of your time, or as we sometimes jokingly call it, 120% of your time, to work on things that you personally care about. And so there was a group of people that created what they called the testing grouplet. It was a bunch of Google developers who really cared about testing and wanted to build a testing culture. Again, no budget, no official project to work on it. But what they did is they came up with all these ideas to help build the testing culture.

(22:33):

One of them was called Testing on the Toilet, which is a monthly newsletter that they would print out and then paste on the back of all the toilet cubicles, so that you could never escape from learning about testing. And if you go to a Google campus today, you can still see, in the restrooms, Testing on the Toilet newsletters still printed out every month. And so it was kind of thinking of ingenious ways to help get people excited about this, and talk about it, and start working in this way. And that is how we implemented Google's testing culture. And now we have things like: I can't check in a change without writing tests for it. It won't pass code review. We have static checks to make sure that you have good coverage and stuff like that. But that doesn't work unless you have the culture in place, and the culture is created by people who are interested in it, and infect other people, and get other people excited about it, and make it just the way that we work. So that, in my experience, is the way to do it. I've talked about a lot of stuff. You can go to cloud.google.com/devops or dora.dev and see all our research; all those practices I've talked about, there are lots of examples that you can deep dive into. You can take the quick check. There's tons of free material there to use. I think I have some time at the end for questions as well. Do I have some time or am I done?

(23:59):

Any questions? No one's told me to stop.

Audience (24:08):

So first of all, I'm so happy. [Inaudible] and I guess one of the questions - we're working with [Inaudible] and I know that a few weeks ago [Inaudible]. What does that mean, and what metrics matter? We were talking about 18F. Are there things about the DORA metrics today that need to be adopted differently in government, or not?

Jez Humble (24:43):

Okay. So the question is "are there things about the DORA metrics that need to be different in government, that you need to think about differently in government?" Yes, in the sense that every organization is different, and you need to always take this stuff and apply it to your organization in your own way.

(25:09):

Everyone says - and I have a whole talk about this - everyone says, "oh, that sounds great. It won't work here." And everyone's right. You can't just take what someone else did, and copy it, and expect it to work the same way. It won't. You've got to work out your own solutions. And in particular, you've got to be really careful about taking these metrics and applying them in a top-down way. So back when I worked at ThoughtWorks, there was a story from a friend of mine about a company he worked at where they said, we're going to do TDD, and we're going to implement automated testing, and we're going to have a rule that there has to be 80% test coverage. And their contractor was like, "okay, fine, we'll implement that." And they got 80% test coverage, and they went and looked at the tests, and what they found was that in all these automated tests, they would call the methods and then they would assert true. And so they got 80% test coverage. And then they put a new rule in place that said, okay, "all the tests have to have assertions at the end." And then they looked at those tests, and they did what they were supposed to do, and at the end of those tests it said assert five is two plus three. So if you implement metrics in a top-down way, you will get the numbers you want. Engineers are amazing at achieving the outcomes that you set for them. They might just not do it in the way that you want them to. So, I think you've got to be really careful with the DORA metrics about not making it "we will achieve this" in the top-down mandate way. You have to apply the communities of practice thing. Find teams who are interested in doing this stuff, and interested in working this way, and give them the support, and resources, and time they need to experiment with it, and make mistakes, and implement it. And then get other people to talk to them, have internal conferences, and get them to spread that knowledge, and reward them.
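
To make the coverage-gaming point concrete, here is a small made-up example, not from the talk: both tests below exercise the discount function and count toward a coverage target, but only the second one can actually fail when the behaviour is wrong.

```python
# Made-up example of why "80% coverage" is easy to game as a top-down target.

def discount(price_cents: int, percent: int) -> int:
    """Price after a percentage discount, in whole cents."""
    return price_cents * (100 - percent) // 100

def test_discount_coverage_only():
    discount(10000, 10)   # executes the code, so it counts toward coverage...
    assert True           # ...but asserts nothing about the result

def test_discount_meaningful():
    assert discount(10000, 10) == 9000   # fails if the behaviour regresses
```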

(27:12):

If someone does something cool, or if someone does something audacious that fails, talk about it and say, "that was really great." Reward those people. Make it clear that they've done a great job. Talk about them in front of the rest of the organization. That's how you do it. I mean, there's a guy called [Inaudible] who was the CFO of Statoil, and despite being the CFO of a Norwegian oil company, he is one of the funniest speakers I've ever seen. He's awesome. One of the things he says is, "you can't change command and control through command and control."

(27:50):

So when you're implementing the DORA metrics, they're a really great framework for thinking about how we actually change outcomes by implementing practices. How do we implement practices? With leadership support. But not with leadership telling people, okay, "we're going to achieve this time to restore, and this deploy frequency," but actually by helping teams experiment with "what's our goal going to be for the next three months? How are we going to achieve that? What are we going to implement? How will we see if it works or not?" And building those communities of practice. And you've got to make your own mistakes. You've got to work out your own solutions. So that's my answer to that question. But great question. Thank you very much. Alright, I think we're out of time. Thanks very much everyone.