Complexity: A Potential Enemy of Delivery
Summary:
Join Mike Lanciano as he explores the intricate balance between simplicity and complexity in technology delivery. From his unique experiences across various sectors to his current role at the Department of Defense (DoD), Mike shares insightful anecdotes and lessons on the trade-offs between embracing complexity and striving for simplicity. Whether you're a developer, engineer, or tech enthusiast, this talk sheds light on the hidden costs of over-engineering and the pursuit of efficient, scalable solutions.
Transcript:
Mike Lanciano (0:15)
So, if you're reading the slide title, you're not in the wrong place. This is the renaming of this original presentation. Little backstory about the original name. The original name was Wonton Burrito Meals. And if you're curious why, there's a scene in "Futurama" where the professor says, "I teach a class every year and I make it so complex and I make the name as impossible to remember so that nobody takes the class." And that class is the Advanced Mathematics of Quantum Neutrino Fields. And Fry sits there and he writes down Wonton Burrito Meals, which proved two things as I, like, gave this talk before. Nobody watches "Futurama" nearly as much as I do. So very lost reference. But even more importantly, if you make the title that, people don't want to listen to you talk about it. So this is the reworking of the title, which is "The Complexity Chaos: The Hidden Cost of Over-engineering Delivery." Talk a little bit about, it's day three. Thank you all for coming. I know that you're all super excited to be here. I'm super excited to be here. I was here last year as well and my talk was on the organizational mapping using AI. So this is a little bit of a different style of talk this year and I've been listening to the Hamilton soundtrack all week to get myself amped for this.
(1:40)
Okay, so this is a large picture of me so that we can make sure that this presentation is zero trust. My designers told me it would look good. I didn't know I would have a screen this large. My name is Mike Lanciano, affectionately known in a lot of the DoD as Mikey Lasagna. I am the Director of Engineering Growth for Clarity. Why are you the Director of Engineering Growth? Because I got to pick my title and I didn't like the Director of Engineering title. I wanted to be more people focused. I'm a software engineer turned platform engineer. My last push to production was two and a half weeks ago. So I still do code. I still do all sorts of organizational mapping, and I still do things that matter, which is hence my challenge with the Director of Engineering title that I had.
(2:24)
And I like to say that I play a data scientist on weekends and I'll explain a little bit more about that. And my career spans everything from five-person Silicon Valley startups where you are the person doing everything, to 50,000 person companies, every different sector from banking, finance, healthcare, energy, you name it. And when I get free time, I go nowhere near a computer. I like to spend a lot of time working with wood or playing hockey. That was actually my icebreaker. And this quote kind of on the side here showcases a very real world that we would like to live in, which is, "Simplicity does not precede complexity but follows it." But oftentimes we don't ever get to return to some of that simplicity. And that's kind of what this talk is about and kind of the challenges around simplicity in an organization.
(3:20)
So welcome to the DoD, right? I said I had a lot of different jobs before. A friend of mine, I was interviewing a couple different companies, a friend of mine calls and says, "Why don't you come work for us with the Department of Defense?" And I was like, "I don't know. That doesn't sound fun." Like, you know, we talked a little bit about Kohl’s…like, how do you get people engaged? And how do you kind of do all these things? And I was like, I don't know if it's for me. I really liked, I was pitching VCs at the time. I was spending a lot of time doing demos. I was spending a lot of time doing rapid prototyping, fast and loose, move fast and break stuff was like, you know, tattooed on my arm. And it was like, let's go. And so when my friend was like, "Why don't you come to the DoD?" And I was like, "That doesn't sound great." And then he is like, he said something very, like, astute and something that made me change my mind, which is, "You know, the DoD has logistics problems of Amazon. They have the data processing challenges of Google and the information delivery challenges of Netflix." And I was like, "I'm in. Sold. Like, let's do it." And I joined a company at the time that was doing some advanced data analytics and I got to do some of the coolest stuff that I've ever gotten to do. And I was doing it at a scale that I had never seen before. So I was processing billions of events a day, looking at different metrics, and using Bayesian networks, which is my, I play a data scientist on TV speech, right? To look at how different things happened in the world.
(4:47)
And I just thought it was the coolest thing. I had never had access in all of my career to such technology. And I had worked in some really mission critical things like banking and finance because it turns out, if you don't get your paycheck on time, people get very angry. So I do consider that mission critical. That's the world that I had lived in and getting the opportunity to do this was great. And then I kind of moved into my next job at the DoD. And I got to kind of explain different technologies and work with different technologies. And that's kind of the precipice of this talk. And when I was there, all the tools that I got to play with were super cool, right? So Amazon had done this amazing job. I'm sorry, not Amazon. Netflix had done this amazing job of the Netflix open source stack and delivering all of these great technologies for how we could horizontally scale on commodity hardware and how we could make homogenous data centers far more viable and adopt different cloud technologies as we wanted to move for cost effectiveness and all the things that we wanted to achieve. And I was like, this is the model, right? Papers are coming out about Elastic MapReduce. That's how old I am. They're coming out about Kubernetes. They're coming out about container technology. They're coming out about the Netflix open source stack and all those things that are happening while I'm in the DoD and while I'm getting all this stuff going. It's making our lives at scale, right? Easier. And I'm like, this is great.
(6:17)
Everybody should do a microservice architecture pattern. Everybody should do microservices. Everybody should do these things because these things make your lives easier and you can always hit scale. And that was the theory, right? And when I went to a lot of organizations and a lot of organizations were like, this is what we're doing, this is what we're challenged with. And ultimately, we want to be able to do that at scale. What I never applied was the thing that I learned at the startup, which is, what scale? And what I'd never, like, heard a lot of customers talk about was, you know, like, who is your customer? And everybody's answer was everybody, right? Everybody in the DoD. Like, at what scale? And like, who is those people? There's an xkcd comment or a comic where the guy who is paid the most in the room, the only thing that he says in every single meeting is, but will it scale, right? And there's a lot of people that get paid a lot of money to ask, like, will it scale? Okay.
(7:17)
Then last year happened. And we've been through a lot over the last three years with the pandemic and everything like that. But this article shook Reddit, right? And if Reddit is obviously the best place we go for all of our information about how the world feels about everything. So it's a great understanding of the world. Although I will actually say, somebody made the comment yesterday that, like, if you want to get an understanding of the pains of the DoD, read Reddit. It's actually not inaccurate. So Amazon releases this article about like, hey, we went back to a monolith, right? That's the article. That's the thing that they sell. And everyone's like, "Hmm, you just told us to use all of your cloud technology. You told us to do all these microservices architecture. You are the company that set this scale. What does that mean?" And everybody's like, "Well, let's dive in." So we start looking and pull back the onion a little bit on Amazon and breaking up the, like, breaking back to a monolith. And as we look at that, we see, right, that it's not actually a monolith. What they really did was actually, like, they were using a lot of serverless technologies. They were using a lot of scalable compute technologies. They were using a lot of step functions, which are like ingrained cloud service functions that made their lives easier because they wanted to hit scale, right? And what we actually found was there was a cost to that scale because they were never actually hitting the end user scale that they needed. And when they were using these technologies, which were supposed to give them this kind of infinite scale, what happened was the cost of transferring data, because it never hit those efficiencies, never was realized. And what that really means, for anybody in the audience who, like, may be, like, great, that's a lot of technical stuff. I don't know what that actually means. 
What that really means is, like, they were trading off the cost of the thinking at volume, how much they had to move between services for the fact that they could scale. And then they realized, like, we don't actually move that much. We don't actually have that much volume. We don't actually have that much throughput. There's a soft spot on the stage here. So if I trip, it's not my fault. Like, we don't have that much volume. And it's actually easier if we do all of our processing in a single processing, like, queue in a single processing application and we scale that application horizontally and that will make our lives infinitely more efficient.
(9:51)
The numbers on this thing are staggering, right? By the way, it's like, they never hit, like, more than 5% of their peak capacity or anticipated capacity. It's absolutely worth a read. And so we find, like, this isn't really, like, everyone's angry. Everyone's super angry. Like, well this isn't really a monolith. Oh, it's definitely still a services architecture. It's this thing. It's that thing. You know, oh, and like, only this was only for Prime Video. It wasn't for Shipping. It wasn't for this thing. And, like, all those things are true, but, like, you realize like people hold a lot of very valuable opinions about scale and kind of where they go. And then of course everything goes back to ROI. So what was the cost? Amazon saves 90% of their costs. Like, right? Like everyone loves a good ROI story. 90% is a huge staggering number at Amazon. So why not? And obviously, the bottom quote there just kind of, is, like, the key takeaway actually from this article that, like, people completely missed and highlighted, which is the approach results in an even higher quality and even better customer service experience. 'Cause that's our goal. Customer service.
(11:00)
Okay. So my team, I'm known affectionately as an individual who will use the phrase spicy take. And that take is not spicy. So my team likes to send me pictures of mayonnaise sandwiches whenever I make a statement that I consider spicy. So I expect probably a lot of mayonnaise sandwiches after this. But I always caveat my speeches with a little bit about what the speech is so that we don't get the wrong takeaways. There's a great O'Reilly, like, Photoshop that says, like, getting the wrong idea about a thing you heard at a conference definitely can happen. And this is not an anti-complexity talk. Actually, like, what this is, is to talk about trade-offs at a time where complexity can outweigh outcomes, which is the outcome that customer experience, that delivering value, that thing that you're actually trying to achieve in engineering for futures that are hypothetical and may never actually exist. And the trade-offs that you make in your architecture patterns, your delivery patterns, all of the things you do as a team, as an organization, and how ultimately you caveat that and you make those bargains with your team and how you understand, like, your long-term and short-term goals as an organization. So this isn't anti Kubernetes. This isn't anti anything. It's not pro monolith. I'm not saying, like, go out and build monoliths after this, right? What this talk is mostly about is looking at your organization saying, what are our strengths? What are our trade offs? And does our architecture support that vision? And are we engineering for a potential future that is not a reality?
(12:39)
Okay, so when we look at the current landscape, right? And, like, the why this isn't the anti Kubernetes talk. I have eight engineers right now at KubeCon and they are all enjoying themselves, but this is the CNCF roadmap and I know that Bryon put it up on the first day and he showed this, like, so, you know. Even if we break this down to some of the smaller core components, each of those components has a cost associated with operating it, delivering it, maintaining it, updating it, rotating it, doing different things like STIGs, and hardening, and monitoring, and observing. All of that... The best part is, like, if you have a problem in your organization, there's probably a tool in this chart that will solve that problem for you at the cost of you introducing yet another tool into your tool chain for that. So that's, like, really the interesting part of this landscape. If we have a very complex landscape with a lot of different tools that solve a lot of different problems and we are instantly eager to adopt those tools. And I'm going to talk a little bit about a time when adopting those tools with real world scenario didn't actually exist.
(14:00)
Okay, how many people here have been part of an outage? This is…That's actually way more than I expected.
[Ricky] Yeah, this week.
[Mike] Thanks, Ricky. Yeah, so this is the, like, one of the things where we had, like, our retro where we had our moment of, like, are we doing things that don't matter? Are we doing things that are keeping us from delivering outcomes? We had a five-day outage, right? Not necessarily all within our control. You know, you get to the JRSS and all those other components of the network and you're, like, "Well, great, that's out of my control." But we had a five-day outage and that brought everybody together and everyone was like, what do we do? What's broken? How do we fix it? So there's a moment of, no pun intended, clarity in a, yeah, in a five-day outage where you're, like, okay, like, we need to get back to the basics and we need to think about this. And what we found was when we looked at it, our GitOps process contained 14 different tools, right? So if you remember the chart that I saw before, the team was like, I have 14 different tools, these are 14 different things. They solve these problems and these are the problems that I have, and therefore let's go. And we put those into the chain and we were like, this is going to solve all of our issues. But we also had 14 different failure points, right? We had 14 different things that could break even if I had two people on each thing all the time. It was 28 people to run essentially what was just our GitOps process with the amount of tools. Now, even if you just include half an FTE for each one of those things, that's still a large amount of people for how we kind of get to this. And even worse, is, like, we saw demos, we saw real capability, real-world scenarios where we were like, this is the tool, this is the thing, this is the thing that will get me there.
(15:53)
And as we kind of adopted these things, we realized, like, we weren't experts in these things. Other people were. Right? So we were using tools that, like, we didn't build, we didn't understand, and therefore, like, when they broke, like, we were calling people. You know, the open source is free like a puppy that Bryon likes to talk about. Like, we were calling other people and being, like, let's go get it. Like, what's wrong? Like, what are we missing? And like, we got really great answers sometimes. It was, like, yeah, just restart it, right? Like, the typical IT thing, like just kick that container. It'll be great. Don't worry about it. They're like, "Oh, all right." And sometimes it actually worked, which is even more frustrating. And we chose, like, so these tools weren't necessarily industry standard. These tools were not tools that we had built. These tools were not tools that we understood. We had kind of outsourced a lot of our knowledge based off of people trying to make our lives and our delivery easier. So we bring the team in and we kind of get together and we say like, "Okay, like, let's simplify these tools, and let's make some trade offs, and let's understand like what we're going to do, and what's going to be, like, a different way of doing things." And ultimately, we moved down to, I think there ended up being six tools that we kept total and we spent fewer hours, 60 hours actually, removing the tools from our GitOps chains, which was less than our total outage, which was 350 hours for the year. And if you look at the chart, like, the chart is real numbers, obviously obscured for certain reasons, but, like, that's what happened, right? Like, our ability to operate and deliver, we spent less time in outage and we were able to spend more time delivering features.
And the whole team was able to spend more time doing things that they actually enjoyed doing and delivering new capabilities and be more cross-functional because we didn't have as much complexity in our chain.
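The arithmetic in that story can be laid out as a minimal Python sketch. The figures (14 tools, two people or half an FTE per tool, six tools kept, 60 hours of removal work, 350 hours of outage for the year) are the ones quoted in the talk. The helper name `staffing_for` and the idea of applying the same half-FTE ratio to the six remaining tools are assumptions added for illustration, not something the team is described as having calculated.

```python
# Back-of-the-envelope numbers from the GitOps story in the talk.
# staffing_for() is a hypothetical helper used only for illustration.

def staffing_for(tool_count: int, people_per_tool: float) -> float:
    """People needed if every tool in the chain gets the same staffing."""
    return tool_count * people_per_tool

TOOLS_BEFORE = 14
TOOLS_AFTER = 6              # "I think there ended up being six tools"
REMOVAL_HOURS = 60           # time spent removing tools
OUTAGE_HOURS_PER_YEAR = 350  # total outage for the year

full_coverage = staffing_for(TOOLS_BEFORE, 2)    # two people on each tool
half_fte = staffing_for(TOOLS_BEFORE, 0.5)       # half an FTE per tool
half_fte_after = staffing_for(TOOLS_AFTER, 0.5)  # assumption: same ratio after cleanup
removal_share = REMOVAL_HOURS / OUTAGE_HOURS_PER_YEAR

print(f"14 tools, 2 people each:  {full_coverage:.0f} people")
print(f"14 tools, 0.5 FTE each:   {half_fte:.1f} FTE")
print(f"6 tools, 0.5 FTE each:    {half_fte_after:.1f} FTE")
print(f"Removal effort vs outage: {removal_share:.0%} of the year's outage hours")
```

Even under the lighter half-FTE assumption, every tool added to the chain carries a staffing cost, and the cleanup cost was a fraction of the time already being lost to outages.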
(17:50)
And so when I meet with a lot of customers, and by the way, like, it's worth it to say those 14 different tools at the time were delivering 10 apps, okay? So we had built this massive pipeline for 10 applications. And we were told at one point it could be 60 apps, right? So we were like, well, we got to make sure that's easier. We got to make sure that's a developer experience. By the way, like, I'm a developer. I like nothing better than developer experience the way it's easy and how we make that happen. But the reality was we were spending more time engineering the delivery of applications because we were told that we're going to have to hit a certain scale. And the same was true by the way, like, when you do this with architectures, with application architectures and things. This just happens to be an example that's specifically focused on delivery. So we talk about why we use these things, why we use microservices, why we use all these, like, why we use all the values of these things. And you know, the question I always get is like, shouldn't we be building for scale? Shouldn't we be building highly scalable systems, right? MongoDB, famous for their, like, web scale DB tagline. Everyone's really not sure what that means. Every time I ask, I'm like, "What does that mean?" And the sales person I think goes, "I don't know." I get sold a lot of Mongo. But the reality is, like, scalability is a refactoring problem. So the patterns that you're actually delivering with the ability to use modern technologies like containers to use patterns like 12-factor applications are building for eventual scale. So what you are doing is you're offloading the scale portion of this. Like, I often like to say, like, scale problems are a good problem to have, and people don't understand what that means. And what that means is really, like, the pain of trying to scale is oftentimes if you've architected in a way that allows you to understand the cost by which you're scaling, i.e.
like your test harness, your testing processes, how you're going to get there, the frameworks that you use, and getting there is a lot easier than to understand the cost, and you can plan the technical debt rather than trying to scale and realize that simplicity should have been your first option.
(20:17)
So the things are, like, when you're doing something manually, right? Sometimes there's a benefit to doing it manually and continuing to do it manually until it becomes too painful for your organization to continue to do it manually. And then you invest in a lot of that complexity. And it turns out you can actually measure these things, right? So on the left hand side, and this is why I said this is not really an anti Kubernetes talk or why it's not an anti-complexity talk specifically is, like, we have different types of complexity. And like I said, it's actually about complexity trade-offs, right? So we talked about switching costs. We talked about other things and a lot of people, whoop. There we go. Okay. We talked a lot about different types of complexity because you really don't get away from complexity.
(21:08)
Simplicity is kind of an illusion on how some of these things actually function. It's easy to look at something and say it's simple, but really you're actually making trade offs. And some of those trade offs are organizational complexity, right? The things that you're doing from, like, how, the switching costs, how much is it going to take? What's the manpower? What is all of the things that I need to actually make these things work? I.e. am I going to have to scale this person? Like, am I going to have to add another person exponentially for every single time I want to do this? And how do I get there? And what are my actual end points for an organization? Like, how many customers will I ultimately service? And how am I going to be able to sustain this organization? So for instance, if I pick this very obscure thing, I was at an organization recently where they told me to adopt a new technology, they had to go hire this PhD. And I was like, that doesn't seem like a technology that you would want to adopt. And it's actually a very common technology, but it was a couple of different things. And what really was seeing inside of the hood was I can't go hire people off the street with this type of technical efficiency that will actually help my organization deliver and to do things manually, and therefore I have to go out and outsource that. Or I have to go find this, like, diamond in the rough, this person or these people that exist or don't exist. And that's an organizational complexity, an engineering complexity, right? If you can run a monolith, right? But, like, what is your actual engineering? How are you going to deliver it? Are you set up to do release cadences? So if you're making the trade off between the engineering side, which is, like, you're going to run a monolith, your organizational complexity is, how am I going to do release planning? How are all these functions within a monolith actually going to come together?
(22:53)
And the operational complexity, which is like, how often do I have to spend time looking at this? How often do I have to spend time thinking about this? Like, how easy is it to monitor? And how easy is it to get to? And like, what's great about this is this conference has mentioned a lot of different things. Nate gave a conversation on DORA metrics on Monday. Like, love DORA. But you know, we've talked a lot about metrics. We've talked a lot about actionable things. And understanding is, if you look at the graph on the right, this is three hypothetical architectures, right? That one of them is a microservice architecture, one of them is a monolith architecture, and one of them is a, I'm blanking on what the other one is, but the important part of this diagram specifically is not the actual architecture. Like, it's a very interesting illustration of understanding quantifiably what are the things that you care about as an organization? If you are an organization that cares about your agility and your sustainability, i.e. your ability to hire, your ability to go out and find people that need to maintain, then you probably want the architecture that scores highest in those areas. But it causes you as an organization to look back and think, what are the things that I care about? What are the actual ilities of my organization that I need today in a real sense? And how do I get there? And, like, how do I measure that? And 'cause the answer is, like, I want it all all the time, right? Like, that doesn't exist. Like, that's not a real thing. That's not a world we live in.
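One way to make "which ilities do we care about" quantifiable is a simple weighted score. The sketch below is a hypothetical illustration in Python, not the chart from the slide: every score, weight, and name in it (including `third_option`, which stands in for the architecture the speaker did not name) is made up. It only shows the mechanics of how different organizational priorities can rank the same architectures differently.

```python
# Hypothetical 1-5 scores for three architectures against a few "ilities".
# All names and numbers are invented for illustration.
SCORES = {
    "microservices": {"agility": 4, "sustainability": 2, "scalability": 5, "operability": 2},
    "monolith":      {"agility": 3, "sustainability": 5, "scalability": 2, "operability": 4},
    "third_option":  {"agility": 4, "sustainability": 3, "scalability": 3, "operability": 3},
}

def weighted_score(scores: dict[str, int], weights: dict[str, int]) -> int:
    """Sum of score * weight; weights are whole percentages that add up to 100."""
    return sum(scores[ility] * weight for ility, weight in weights.items())

def rank(weights: dict[str, int]) -> list[tuple[str, int]]:
    """Architectures sorted from best to worst fit for the given priorities."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must add up to 100")
    totals = {name: weighted_score(s, weights) for name, s in SCORES.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# An organization that cares most about agility and sustainability (hiring, maintaining).
people_first = {"agility": 40, "sustainability": 40, "scalability": 10, "operability": 10}
# An organization that genuinely has a scale problem today.
scale_first = {"agility": 20, "sustainability": 10, "scalability": 60, "operability": 10}

for label, weights in [("people first", people_first), ("scale first", scale_first)]:
    print(label, rank(weights))
```

With these made-up numbers, the people-first weighting favors the monolith and the scale-first weighting favors microservices. The scores themselves matter less than the exercise of writing the weights down.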
(24:21)
We have this thing called the CAP theorem in computer science, which is you get consistency, availability, or partition tolerance and you get to pick two and that's it, right? Like, so you get to pick what your trade-offs are for each of those things. But if your organization can't understand and measure, right, the things that it cares about the most and its journey and its outcomes and the thing that it's delivering, i.e. the ability to sustain, the ability to interoperate, the ability to actually observe the things that you need to observe and measure. If you can't make those trade offs and concessions, that's where you really start with understanding the complexities that you introduce to your organization and how you understand whether or not this is something you need to do today, whether you need to do something tomorrow, and whether or not you're potentially architecting for a future that would never really come to fruition. And so we're going to talk a little bit about how we can actually start getting to a place where we ask ourselves questions that help us understand some of those things that I showed on the diagram earlier, which is accounting for the complexity trade-offs in your delivery and in your architectures. So the first question is, how difficult would it be for us to refactor and re-platform when the time comes? We talked about vendor lock-in on day one, right? Like, are you really locked in or is there a cost, right? And what is that cost? That cost has an ROI. That cost, like, has a person power attached to it. It has a very real architecture solution attached to it. How easy is it for us to lift from ECS to EKS? This was actually the hardest thing for me. I'm a big Kubernetes fan, but, like, one time somebody said to me, they were like, "Did you watch Netflix last night?" I'm like, yeah, a lot of Netflix in this talk. They're like, "You watched Netflix last night?" Like, yeah. They're like, "Hey, what architecture did it come from?" It's like, what?
Like, "What container did it come from?" "I don't know." "Was it running on EKS or ECS? Or was it running on Google Cloud?" "I don't know. I watch my Netflix." "Cool, then you got user value. You got the user value you were looking for and you don't care at all what the backend architecture is." As architects and engineers, we care. We want those things. We care about our developer experience, but our end users, our people who need the downstream value do not care, and they don't understand. And that's like, it is a good thing. How confident are you in your ability to refactor? Your ability to be a testing organization and to understand how you could potentially refactor your architecture or your application or your delivery process at any given time in the trade-offs that you're making, the three trade-offs that we talked about, is a confidence factor, right? A vote of, we used to do vote of five after every single scrum, like, planning, which is, like, how confident are we that we can deliver this scrum? And if we were anything below a three, it was like we had to re-vote. We had to remove things. We had to figure those things out. It's the same thing.
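The vote-of-five idea can be sketched as a small check. This is a hypothetical Python illustration: the function name `needs_replan` is invented, and it assumes that "anything below a three" means any single vote under three forces a re-vote, which is one possible reading of the rule rather than something the talk spells out.

```python
# A sketch of a "vote of five" confidence check, under the assumption
# that any single vote below the threshold sends the team back to re-plan.

def needs_replan(votes: list[int], threshold: int = 3) -> bool:
    """Return True if any confidence vote (1-5) falls below the threshold."""
    if not votes:
        raise ValueError("at least one vote is required")
    if any(vote < 1 or vote > 5 for vote in votes):
        raise ValueError("votes must be between 1 and 5")
    return min(votes) < threshold

# Hypothetical votes on "how confident are we that we can refactor this?"
print(needs_replan([4, 5, 3, 4]))  # everyone is at three or above
print(needs_replan([4, 2, 5, 4]))  # one low vote triggers the conversation
```

The same check works whether the question is about delivering the planned work or about the team's confidence in refactoring or re-platforming later.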
(27:16)
Your organization should have a confidence factor on how well you'll be able to refactor. Am I spending more time maintaining than I am delivering? I've lived this nightmare a million times. Yeah. Am I actually doing the things I care about? Am I like, or am I spending more time actually trying to fix what's broken? Am I sacrificing delivery for developer experience? Yeah. Obviously, I'm a first-class citizen of the developer experience, right? Like, I want my experience to be good. But at a certain point, there's a trade off and there has to be an organizational trade off between what my experience is and what the end user experience is. And if I'm not making those bargaining decisions, ultimately that could suffer including the end user experience. And will organizational turnover affect my ability to sustain operations?
(28:09)
People leave. In the DoD, it's even worse because people are on rotation. People like have a very set time clock. It's hard to find certain people, right? I love to write Rust. Finding good qualified Rust engineers. Not as easy as you'd think, right? Finding highly qualified Kubernetes engineers. Also not as easy as you think. So this is the starting point. There's no real call to action specifically in this talk, but this is one of those things where as you look at these points, you can start to formalize in your organization thoughts around what am I making trade offs on? And am I making trade offs for things that I shouldn't be making trade offs for? Am I working towards an eventual scale that will never be a reality, right? Can I quantify the users that I have, the things that I'm trying to deliver? And can I understand the trade-offs of when I will make those trade-offs? And when I will do that re-scaling, when I will do that re-platforming, when I'll introduce that next problem that I actually have and figure out how to solve it. So this is really just, this talk is about understanding how to start those conversations in your organization, understanding how to make those trade-offs, and understanding how to go forth, and understand that complexity itself is not just balanced with simplicity, but complexity is trade-offs with other complexity that you have in your organization. Okay, so that's all my time I think. I don't think I have time for questions. I don't know who tells me that or not, but thank you very much for your time and attention. And if you have any questions, you can catch up with me later and we'll talk soon.