The DevOps Handbook starts out by reiterating many of the themes outlined in The Phoenix
Project: Dev & Ops must avoid working at odds with each other; when they do, technical debt grows exponentially, and as a result the culture becomes inefficient, cynical, and even hopeless. The prologue concludes by prescribing DevOps to instill a culture driven by hypothesis, collaboration, and automation, which ultimately affects the customer positively and in turn fosters a great internal workplace.
In the foreword, the outline of the book is shared:
- PART 1 – The Three Ways: A brief history of DevOps and an introduction to the underpinning theory and key themes. This section includes a high-level overview of The Three Ways (Flow, Feedback, and Continual Learning & Experimentation).
- PART 2 – Where To Start: Description of how and when to start DevOps. This section presents concepts including value streams, organization design, organizational patterns, and ways to begin winning hearts and minds in an organization lacking DevOps.
- PART 3 – The First Way: The Technical Practices of Flow: Acceleration of flow by building foundations of deployment pipelines. This includes a detailed look at automated testing, continuous integration (CI), continuous delivery (CD), and architecture for low-risk releases.
- PART 4 – The Second Way: The Technical Practices of Feedback: Acceleration of feedback by creating effective production metrics. This section features lessons on perceiving problems prior to their growth, and also creating review and communication processes between Dev & Ops.
- Part 5 – The Third Way: The Technical Practices of Continual Learning and Experimentation: Acceleration of continual learning through culture, reserving time for organizational learning, and processes for globalizing knowledge gained from the first two ways.
- Part 6 – The Technical Practices of Integrating Information Security, Change Management, and Compliance: Integration of security, utilizing shared source code repositories and services, enhancing telemetry, protecting the deployment pipeline, and concurrently achieving change management goals.
PART 1: The Three Ways
The authors start by underscoring key themes that will be central to the rest of the handbook. One of those themes is the Agile methodology, which the writers argue is not the opposite of DevOps but rather a natural evolutionary ancestor of it. In the historical context of software development, Agile set the stage by being rooted in speed and continuity.
Chapter 1: Agile, Continuous Delivery and the Three Ways
Another theme introduced as a central pillar is the Value Stream. Lead time is also centrally introduced, with reference to Toyota Kata. Our first look at a technical indicator is “%C/A” (percent complete and accurate), a quality assurance indicator: what percentage of the time can downstream work centers rely on work from upstream as-is? This chapter is one of the briefest and generally sets the table for DevOps as a way of thinking that values efficiency in complex processes.
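As a rough illustration of the indicator, %C/A can be computed from handoff records. This is a sketch with made-up field names and data, not code from the book:

```python
def percent_complete_accurate(handoffs):
    """%C/A: the percentage of work items a downstream work center
    could use as-is, without clarification or rework."""
    usable = sum(1 for h in handoffs if h["usable_as_is"])
    return 100.0 * usable / len(handoffs)

# Four handoffs from Dev to QA; one needed rework downstream.
handoffs = [
    {"item": "login fix", "usable_as_is": True},
    {"item": "search API", "usable_as_is": False},
    {"item": "billing job", "usable_as_is": True},
    {"item": "UI polish", "usable_as_is": True},
]
print(percent_complete_accurate(handoffs))  # → 75.0
```

A downstream team tracking this per handoff quickly reveals which upstream work centers generate rework.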
Chapter 2: The First Way: The Principles of Flow
It is explained to the reader that Flow focuses on work from “left to right” across an entire system. This means two things: Firstly we want to increase flow from development to operations, and secondly, we want to do it across the global system and not merely increase flow in little pockets of the system.
There is a caveat in increasing flow in Information Technology. Unlike in manufacturing, IT flow is invisible. In manufacturing, flow moves physical products, but in IT, flow moves code and configurations. Therefore, central to increasing flow in IT is increasing visibility of flow, which can be done with visibility tools including Kanban boards.
In the IT flow, just like in the manufacturing flow, we want to limit Work In Progress (WIP), which is partially finished work waiting for completion to move onto the next stage in the process. A common creator of WIP is multitasking, which we must limit by controlling queue time & reducing daily juggling of tasks.
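The WIP-limiting idea can be made concrete with a toy Kanban column that refuses new work once the limit is reached. The class and task names below are illustrative, not from the book:

```python
class KanbanColumn:
    """A work center with a WIP limit: new work is refused until
    something already in progress is finished, curbing multitasking."""

    def __init__(self, name, wip_limit):
        self.name = name
        self.wip_limit = wip_limit
        self.in_progress = []

    def pull(self, task):
        """Pull a task into this column; refuse it if at the WIP limit."""
        if len(self.in_progress) >= self.wip_limit:
            return False
        self.in_progress.append(task)
        return True

    def finish(self, task):
        self.in_progress.remove(task)

doing = KanbanColumn("Doing", wip_limit=2)
print(doing.pull("deploy fix"))   # True
print(doing.pull("write docs"))   # True
print(doing.pull("review PR"))    # False: finish something first
doing.finish("deploy fix")
print(doing.pull("review PR"))    # True
```

Making the refusal explicit is the point: the board, not the individual's willingness to juggle, decides when new work may start.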
Also, to increase flow we want to reduce batch sizes rather than have big, infrequent, bulky deployments. By reducing batch sizes, we make error discovery easier, have faster lead times, less WIP, and less rework.
Moreover, we want to reduce handoffs. We must avoid the creation of unneeded queues, and avoid communication shortfalls regarding dependencies for example. In other words we don’t want “too many cooks in the kitchen” bumping into each other, miscommunicating orders, etcetera.
Furthermore, we must elevate constraints. Where are the bottlenecks? That is where we must allocate resources to keep the show rolling. As Eliyahu Goldratt famously wrote, any improvement not made at the bottleneck is an illusion.
Additionally, we want loosely coupled architecture so that we don’t need big committee meetings. Instead small teams can maximize productivity locally to them with minimal constraints rather than having big teams all interdependent on each other to operate.
Lastly, we want to eliminate hardships and waste in the value stream. “The use of any material beyond what the customer needs and is willing to pay for,” as said by Taiichi Ohno, one of the pioneers of the Toyota Production System. He believed that waste was the biggest threat to business viability, and that there were seven major kinds of manufacturing waste: inventory, overproduction, extra processing (“goldplating”), transportation, waiting, motion, and defects.
Chapter 3: The Second Way: The Principles of Feedback
Unlike Flow which deals with movement of work from left to right in the value stream, feedback considers movement from right to left. The bottom line with feedback is we want the first way (Flow) to generate information that helps us make the system more safe and resilient. To that end, failures become opportunities for learning rather than the blame game.
Dr. Sidney Dekker observed that in complex systems, doing the same thing twice won’t necessarily lead to the same result, because error is inherent to life; therefore static checklists aren’t always enough. Dr. Steven Spear added that, beyond checklists, systems can be made safer when four conditions are met:
- Complex work is managed so that problems in design and operations are revealed
- Problems are swarmed and solved
- New local knowledge is exploited throughout the organization
- Leaders create leaders.
Feedback loops are how feedback manifests. Waterfall methodology lacks them, so with waterfall it can take a year for bugs to surface in the test phase. In DevOps, by contrast, feedback loops are what allow you to “steer.”
Problem swarming: Dr. Spear stated that the goal of swarming is to contain problems before they spread, and to diagnose and treat them so they can’t recur, which converts upfront ignorance into knowledge. An example comes from Toyota’s manufacturing plants, where there is a cord above each factory worker; if there is a problem, the worker pulls the cord and the team leader is notified. If the team leader can’t solve the problem in the allotted time, the entire process is halted and resources are allocated to the problem until resolution. In turn, no technical debt accumulates. Crucially, the culture must make it acceptable to pull the cord.
Chapter 4: The Third Way: The Principles of Continual Learning and Experimentation
When we achieve The Third Way, constant individual knowledge creates constant team knowledge. Instead of work being rigidly defined and enforced, there is an acceptance that individuals know their work best. Leadership requires and actively promotes learning, the system of work is dynamic, line workers are experimenting, we document results, and so on. The key to unlocking The Third Way is a high trust culture, where we believe ourselves to be lifelong learners, taking risks in daily work, and adhering to a scientific approach.
When we don’t achieve The Third Way, the root cause of accidents is too often considered to be “human error” and management’s “blame-culture” grows. Here Dr. Sidney Dekker pointed out how this negative culture disallows research into root cause analysis, and worse, a growing secrecy of work creates more future problems which can’t be easily identified. Then, problems aren’t visible until a disaster occurs.
Also we must institutionalize the improvement of daily processes. In the absence of improvement, processes actually get worse. Therefore we must reserve time to regularly pay down technical debt and fix defects.
We must value the creation of artifacts, or in other words “convert ignorance into nuggets of knowledge.” For example, code repositories should be shared across an organization, globalizing knowledge and letting it manifest in a tangible form.
Also we must inject faults, tension, and stress into processes so that we rehearse major failures and become better prepared.
Lastly, we should rethink the nature of leadership. Historically, leaders are seen as making all the right decisions, allocating all the right resources, and orchestrating perfection. However, as Mike Rother notes in the Coaching Kata, leadership should emphasize the scientific method and let strategic goals inform shorter-term goals, for example by setting target conditions (“reduce lead time by 10%”). Leadership then interprets the results and revises for the next iteration, and calls upon front-line workers to apply the same problem-solving approach locally. This is at the core of the Toyota Production System: leadership is not an all-knowing god, but an observer motivating and empowering people to improve local systems.
PART 2: Where To Start
At this point the book turns to how to initiate a DevOps transformation. The writers prescribe selecting which value streams to start with, understanding the work being done in candidate value streams, designing the organization and architecture with Conway’s Law in mind, and enabling teams to be locally empowered to achieve DevOps.
Chapter 5: Selecting Which Value Stream to Start With
Chapter 5 highlights an interesting case study in which the Nordstrom team picked the right projects to initiate DevOps. They had to decide where to start, and they didn’t want to cause upheaval; their primary goal was to demonstrate early DevOps wins to instill confidence. To do this, they focused on three areas: the customer mobile app, their in-store restaurant systems, and their digital properties. Each of these areas had business goals that weren’t being met, but none was so essential to the company that failure would be dangerous, making them relatively low-risk endeavors. The results: for the customer mobile app, they doubled the features delivered per month and halved the defects, and for the cafe part of the business, lead times were reduced by 60%. Ultimately this success resulted in the promotion of the DevOps leader, and organizationally Nordstrom came to believe in DevOps.
We then consider Dr. Robert Fernandez of MIT, who described the ideal phases used by change agents:
- First, we find innovators and early adopters who are kindred spirits who actually want to undertake change. Ideally these people are also respected in the organization.
- Second, we build critical mass and a silent majority. We build a stable base of support, expand the coalition, and create the bandwagon effect.
- Third, we identify the holdouts, high profile influential detractors who may resist or sabotage efforts. If Steps 1 & 2 are achieved, the holdouts won’t have much of a choice but to join the group.
Chapter 6: Understanding the Work in Our Value Stream, Making it Visible, and Expanding it Across the Organization
After selecting an initial candidate project, we need to identify all the members of the value stream, and see who is responsible for working together to add value to the customer. These individuals can include: product owners, development, QA, operations, infosec, release managers, tech executives or value stream managers. Once the group is formed, we create the value stream map.
Value Stream Maps: Start by identifying places where work must wait weeks or months and places where significant rework is generated. Begin with high-level process blocks, choosing 5–15 of them, and include in each block the lead time and process time for a work item to be processed, as well as the percent complete and accurate as measured by downstream teams. Then construct an ideal value stream map, i.e., a future state of work as a target utopia, and work backwards to determine how to reach that state through experimentation.
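The per-block metrics roll up into stream-level ones. The blocks, numbers, and the “rolled” %C/A product below are my own illustration of the arithmetic, not figures from the book:

```python
# Each block carries: process time (value-adding work), lead time
# (including waiting), and %C/A as measured by the downstream team.
blocks = [
    {"name": "Design",  "process_hours": 8,  "lead_hours": 80,  "pca": 0.90},
    {"name": "Develop", "process_hours": 40, "lead_hours": 160, "pca": 0.80},
    {"name": "Test",    "process_hours": 16, "lead_hours": 120, "pca": 0.70},
    {"name": "Deploy",  "process_hours": 2,  "lead_hours": 40,  "pca": 0.95},
]

total_lead = sum(b["lead_hours"] for b in blocks)
total_process = sum(b["process_hours"] for b in blocks)
efficiency = total_process / total_lead  # fraction of elapsed time spent working

# Chance a work item survives every handoff without rework:
rolled_pca = 1.0
for b in blocks:
    rolled_pca *= b["pca"]

print(total_lead, round(efficiency, 3), round(rolled_pca, 4))
```

Even with optimistic per-block numbers, the rolled-up figures make it obvious where waiting time and rework dominate, which is exactly where the experiments should aim.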
These DevOps transformations inevitably conflict with ongoing business operations, so we need to create a dedicated transformation team that works hand in hand with the rest of the business rather than fighting its processes internally.
The nature of transformation work should have cadence. For each iteration the team should agree on a small set of goals that generate some value. Then teams should review progress and do it again. Typically a 2–4 week timeframe is a good planning horizon. Keeping the intervals short achieves flexibility, decreases delays, and strengthens feedback loops.
An example that ties much of this together is Operation InVersion at LinkedIn. Shortly after their IPO, their internal application carried so much technical debt that they could barely keep up with the interest payments on it. Operation InVersion halted all new features until key services were decoupled from an archaic internal application called Leo and nearly all the technical debt was paid down. Doing this created fertile ground for the huge scaling LinkedIn underwent after the IPO.
Increase the visibility of work: Everyone needs to know the current state of work, and it should be kept up to date. Dev and Ops should share the same tools and nomenclature. With a shared Kanban board in particular, Dev and Ops will begin to share the same work queues, making issues obviously identifiable. This also calls for a unified backlog so the technical debt that needs to be paid off can be seen together.
Chatting: Chat rooms (e.g., Slack, Openfire, Campfire) help reinforce visibility. These tools allow fast sharing of information, keep history logs, and make it easy to invite anyone into the conversation. This creates a lot of visibility, and better yet, rapid visibility.
Chapter 7: How to Design Our Organization and Architecture with Conway’s Law in Mind
In 1968, computer scientist Melvin Conway observed that how people are assigned to a project (the number of people, the design of the team, etc.) directly shapes the project’s outcome: organizations are constrained to produce designs which are copies of their communication structures. The larger an organization is, the less flexibility it has, and the more pronounced this phenomenon becomes. How we organize our teams has a powerful effect on the software produced, as well as on the architectural and production outcomes.
In the DevOps world, if we ignore Conway’s Law, teams would be designed to be too tightly coupled, all waiting on each other for work to be done, with even small changes creating potentially disastrous global problems. Instead, we must avoid single points of global failure in the team structure, which will avoid single points of global failure in the architecture.
The best way to avoid the harms of ignoring Conway’s Law is to create market-oriented teams. We must encourage cross-functionality and embed functional engineers and skills into each service team. This enables each service team to independently add value to the customer without having to open other tickets with other groups.
Service-oriented architectures: Services should be isolated by being independently testable and deployable. A key feature in DevOps is loosely coupled services with bounded contexts. Services should be independently updatable; for example, a developer should be able to understand and update the code of a service without knowing anything about the internals of its peer services.
Lastly, keep teams small with domains small and bounded. A great example of this is the “Two Pizza Rule” at Amazon, where team sizes are about the amount of people who can be fed by two pizzas. This ensures the team has a clear, shared understanding of their domain. Also it limits the rate at which their domain can evolve, and it offers team autonomy. Lastly, the Two Pizza Rule offers opportunity for leadership experience without risk of catastrophe.
Chapter 8: How to Get Great Outcomes by Integrating Operations into the Daily Work of Development
Ops must be absorbed into Dev. Doing so improves relationships, speeds up code releases, and reduces lead times. Employ the following three broad strategies to integrate Ops into Dev:
- 1) Create self-service capabilities to enable developers to maximize productivity
- 2) Embed Ops engineer into service teams
- 3) If embedding Ops is not possible, assign Ops liaisons to service teams
The integration should happen on a daily basis by being written into Dev’s daily rituals (e.g., a morning kick-off meeting). There is also a huge cultural aspect to this, which creates positive unintended consequences.
Overall, when integrating Ops into Dev, we are creating an environment where Dev is still reliant on the capabilities provided by Ops, but not reliant on the actual individuals in Ops.
PART 3: The First Way: The Technical Practices of Flow
Our goal in Flow is to create practices to sustain fast movement of work from Dev to Ops without causing chaos and disruption to the production environment and customers. To do this we must implement continuous delivery. In this Part we will look at:
- Creating foundations of automated deployment pipeline
- Ensuring we have automated tests that validate that we are in a deployable state
- Having developers integrate code daily
- Architecting environments and code to enable low risk releases
Chapter 9: Create Foundations of Deployment Pipeline
Automated and on-demand self-serviced environments stored in version control are a major pillar of DevOps, with the goal being to ensure we can recreate everything instantaneously based off of what’s in version control.
We must use automation for tasks like copying virtualized environments, booting an Amazon Machine Image in EC2, using infrastructure-as-code configuration tools (e.g., Puppet, Chef, Ansible, Salt), using automated operating system configuration tools (e.g., Solaris Jumpstart, Red Hat Kickstart), assembling an environment from a set of virtual images or containers (e.g., Docker), and spinning up new environments in a public cloud.
We shall make changes with “commits,” which enable us to undo changes and track who did what when. Keep in mind, version control is for everyone, and there should be artifacts reflecting everyone’s true ideal state (Data teams, Security, Dev, Ops, etc).
It is not enough to be able to recreate the entire previous state of the production environment; we must also be able to recreate the entire pre-production and build processes. Therefore we need to put into version control everything relied upon by build processes, e.g., tools and their dependencies.
For example, in Puppet Labs’ 2014 State of DevOps Report, the use of version control by Ops was the highest predictor of both IT performance AND organizational performance. Version control in Ops was a higher predictor than whether or not Dev used version control! The likely reason is that there are orders of magnitude more configurable settings in an Ops environment than in Dev.
The bottom line of Chapter 9 is we must make it easier to rebuild things than to repair things. We must be rooted in reducing risk by offering Dev on-demand production-like environments and using them with high frequency. By doing this we avoid a developer thinking their work is done, when it doesn’t actually work in a production environment. With CI/CD, if there is a problem, it can quickly and easily be identified and resolved. This sets the stage for enabling comprehensive test automation.
Chapter 10: Enable Fast and Reliable Automated Testing
If we have a separate QA team find and fix errors in infrequent test phases throughout the year, then errors will be unveiled late, and the cause of the error will be taxing to identify. Worse, our ability to learn from the mistake and integrate that knowledge is diminished. Therefore, we must have automated testing, because without it, the more code we write, the more time and money is required to test our code which isn’t viable for business.
An example is Mike Bland’s account of the Google Web Server team. Before automated testing, there was a fear of creating problems; now there is a healthy culture of “if you break my project, that’s fine, and if I break your project, that’s fine,” because automated tests are constantly running.
We need three conditions to confirm that we are in a working continuous integration state:
- Comprehensive reliable set of automated tests that validate that we are deployable
- A culture that stops the entire production line when there are failures (AKA pulling the cord at Toyota)
- Developers working in small batches on trunk (main branch) rather than long-lived feature branches.
Test-driven development: We want to write tests before doing the work, beginning every change to the system by first writing an automated test that validates the expected behavior, and watching it fail. Developed by Kent Beck in the 1990s, test-driven development has three steps:
- Write a test for the next bit of functionality you want to add, ensure the test fails, and check in.
- Write the functional code until the test passes, ensure the tests pass, and check in.
- Refactor both old and new code to make it well-structured, ensure the tests still pass, and check in again.
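The TDD loop can be sketched with Python’s `unittest`; the `slugify` function and its test are invented for illustration:

```python
import unittest

# Step 1 (red): write the test for the next bit of functionality first,
# run it, and watch it fail before any functional code exists.
class TestSlugify(unittest.TestCase):
    def test_lowercases_and_hyphenates(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

# Step 2 (green): write just enough functional code to make the test pass.
def slugify(title):
    return title.strip().lower().replace(" ", "-")

# Step 3 (refactor): restructure old and new code, re-run the tests,
# and check in again once everything is green.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSlugify)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```

The discipline is in the ordering: the failing test exists before the code, so every behavior ends up covered by construction.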
We also want to automatically test our environments to validate that they have been built and configured properly, and we must test the dependencies which our functional requirements rely upon.
Lastly the cultural norm needs to be that if something breaks, we stop everyone (because the team is most important), and get back into a green build state ASAP.
Chapter 11: Enable and Practice Continuous Integration
Branching lets developers work productively without introducing errors into the main branch. However, with more branches come more integration challenges, especially if we wait until the end of a project to integrate. Therefore, merges into trunk should be part of everyone’s daily work.
Use gated commits: Reject any commits that take us out of a deployable state. The deployment pipeline first confirms that the submitted change will successfully merge, and that it will pass automated tests before actually being merged. If not, the developer will be notified without impacting others in the value stream.
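A gated commit can be sketched as a check against trunk state; the function names and the integer revision scheme are invented for illustration:

```python
def run_automated_tests(change):
    """Stand-in for the pipeline's automated test suite."""
    return change.get("tests_pass", False)

def merges_cleanly(change, trunk):
    """Stand-in for a trial merge against the latest trunk revision."""
    return change["base"] == trunk["revision"]

def gated_commit(change, trunk):
    """Merge only changes that both merge cleanly and pass the tests;
    otherwise reject and notify the author without disturbing others."""
    if not merges_cleanly(change, trunk):
        return "rejected: rebase onto latest trunk"
    if not run_automated_tests(change):
        return "rejected: failing tests, author notified"
    trunk["revision"] += 1
    return "merged"

trunk = {"revision": 41}
print(gated_commit({"base": 41, "tests_pass": True}, trunk))  # merged
print(gated_commit({"base": 40, "tests_pass": True}, trunk))  # rejected: rebase onto latest trunk
```

The key property is that a bad change is bounced back to its author before it ever lands on trunk, so the rest of the value stream never sees it.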
In 2012, Bazaarvoice was preparing for an IPO, relying on 5 million lines of code running on 12,000 servers. They wanted to tackle some core problems:
- They lacked automation which made any testing inadequate to prevent large scale failures.
- They lacked version control branching strategy which allowed developers to check in code right up to the production release
- Their teams running microservices were performing independent releases, which compounded the first two problems.
In the six weeks that followed, Bazaarvoice allowed no new features; instead they prepared automated tests for CI, which led to successful outcomes and the IPO.
Chapter 12: Automate and Enable Low Risk Releases
If you want more changes, you need more deployments. Thus, we must reduce friction associated with production deployments, making sure they can be done frequently and easily. This is done by extending our deployment pipeline. Instead of just continuously integrating code, we enable the promotion into production of any build that passes all tests — either on demand or automated.
Automated deployment processes: There must be a mechanism for automated deployments, and this mechanism must be documented to help us simplify and automate as many steps as possible. Areas to look for simplification and automation are:
- Package code in ways suitable for deployment
- Create pre-configured VMs or containers
- Automate deployment and configuration of middleware
- Copy packages or files into production servers
- Restart servers, applications, or services
- Generate configuration files from templates
- Run smoke tests to ensure the system is working
- Script database migrations
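Two of the steps above, generating configuration files from templates and running smoke tests, can be sketched together. Every key, value, and check here is invented for illustration:

```python
from string import Template

# Environment settings live in version control; the deploy step renders
# them into the concrete config file the service boots from.
deploy_template = Template(
    "listen_port=$port\ndb_host=$db_host\nrelease=$release\n"
)

def render_config(settings):
    return deploy_template.substitute(settings)

def smoke_test(config_text):
    """A trivial post-deploy check: the rendered config must contain
    every key the service needs to start."""
    required = ("listen_port=", "db_host=", "release=")
    return all(key in config_text for key in required)

config = render_config({"port": 8080, "db_host": "db.internal", "release": "1.4.2"})
print(smoke_test(config))  # True
```

In a real pipeline the smoke test would hit a health endpoint after deployment; the point is that both rendering and checking are scripted, not manual.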
Deploy vs Release: Note deployment and release are not synonymous, and they have different purposes. Deployment is the installation of a specified version of software into a given environment. A deployment may or may not be associated with a release of a feature to customers. A release is when we make a feature available to customers. When we conflate the two terms, it makes it difficult to create accountability for good outcomes. Decoupling these two empowers Dev and Ops to be responsible for success of fast frequent deployments, while separately enabling product owners to be responsible for successful business outcomes of a release.
Feature toggling: Feature toggling wraps features in a conditional statement, which is easily driven by configuration files (e.g., JSON or YAML). Toggles enable us to roll back easily, and features that create problems can be switched off quickly. (Don’t forget to test feature toggles just like anything else!)
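A minimal toggle sketch, assuming a JSON config file checked into version control (the flag names and functions are illustrative):

```python
import json

# In practice this JSON would be read from a version-controlled file.
toggles_json = '{"new_checkout": true, "beta_search": false}'
TOGGLES = json.loads(toggles_json)

def feature_enabled(name):
    # Unknown flags default to off, so forgetting one fails safe.
    return TOGGLES.get(name, False)

def checkout(cart):
    if feature_enabled("new_checkout"):
        return "new checkout flow"
    return "old checkout flow"

print(checkout(cart=[]))  # new checkout flow
# Rolling back is a config change, not a code change:
TOGGLES["new_checkout"] = False
print(checkout(cart=[]))  # old checkout flow
```

Because the rollback is a data change rather than a redeploy, a misbehaving feature can be disabled in seconds.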
Perform dark launches: This is where we deploy all the functionality into production and then perform testing of that functionality while it is still invisible to customers. For large or risky changes this is often done for weeks before the production launch.
Chapter 13: Architect for Low Risk Releases
The architecture of any successful organization will necessarily evolve over its lifetime. This is true even for companies like Google, which is on its sixth complete rewrite of its architecture.
In the case of eBay, when they needed to re-architect, they would first run a small pilot project to prove to themselves that they understood the problems well enough before re-architecting. They used a technique called the “strangler application pattern”: instead of ripping out old services whose architecture no longer fits organizational goals, they put the existing functionality behind an API and avoid making further changes to it. All new functionality is then implemented in new services using the new, desired architecture, making calls to the old system when and if needed. This method is especially useful for migrating portions of a monolithic or too tightly coupled application to a loosely coupled one.
The writers then walk the reader through a case study of evolutionary architecture at Amazon in 2002. Amazon in 1996 was a monolithic application running on a web server talking to a database on the backend. This application evolved to hold all the business logic, display logic, and functions Amazon is now famous for (e.g., similarities and recommendations). Eventually the old application became too intertwined across functions; it had no room left to evolve. This caused Amazon to create a service-oriented architecture that allowed them to build many software components rapidly and independently. In other words, they moved from a two-tier monolith to a fully distributed, decentralized services platform.
The lessons from Amazon’s undertakings are:
- Service-orientation achieves isolation, ownership, and control
- Prohibiting direct database access by clients makes scaling much easier
- Dev and Ops greatly benefit from switching to service-orientation, as it creates fertile ground for them to innovate quickly with a strong customer focus
- Each service can have a team associated with it, and that team should be completely responsible for the service
PART 4: Feedback
Now that the rate of flow from left to right has increased, we need to increase the speed of feedback from right to left, which in turn makes the left-to-right flow even faster and more efficient.
Chapter 14: Create Telemetry to Enable Seeing and Solving Problems
High performers use a disciplined approach to problem solving, using telemetry to analyze root cause. On the other end of the spectrum, lower performers tend to just reboot the server without investigating an issue before acting. In other words, metrics inspire confidence in individuals creating solutions. The example given is at Etsy, “if it moves, they have a graph for it, and if it doesn’t move, they have a graph for it for if it starts moving.” The 2015 State of DevOps Report states that high performers solve problems 168x faster than their peers. The top two technical practices that enable fast mean time to recovery (MTTR) are:
- Having version control by operations
- Having telemetry and proactive monitoring in the prod environment
In addition to collecting telemetry from production, we want to collect it in Dev and Test as well. For example, if builds are taking twice as long as normal, we want to know in real time.
Relevant information worth logging includes: authentication/authorization decisions, system and data access, system and data changes, invalid inputs, resource usage (RAM, CPU, bandwidth), health and availability, starts and shutdowns, faults and errors, delays, and backups.
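One common way to record such measurables is structured (JSON-per-line) event logging, so graphing and alerting tools can parse every field. The `emit` helper and event names below are my own sketch:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("telemetry")

def emit(event, **fields):
    """Emit one structured telemetry event as a JSON line, so downstream
    tools can graph, search, and alert on any field."""
    record = {"ts": time.time(), "event": event, **fields}
    log.info(json.dumps(record))
    return record

emit("auth_decision", user="u123", allowed=False, reason="bad_password")
emit("resource_sample", cpu_pct=87.5, ram_mb=1450)
emit("fault", service="checkout", error="TimeoutError", delay_ms=3021)
```

Consistent field names across Dev, Test, and production are what make the later chapters’ analysis (means, deviations, outliers) possible.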
By recording these measurables, you are not only quickly solving problems, but you’re also allowing for the possibility to prevent problems by improving design. As we know, prevention is better than cure.
A key piece of solving problems with telemetry is integrating self-service access to it, empowering local, immediate solutions. With self-service, the individuals closest to the problem are the most informed about it, and they are in turn the best ones to fix it.
Find and fill telemetry gaps: We want metrics from the following levels, so look in these places to ensure your world of telemetry isn’t porous:
- Business level (sales transactions, revenue, user signups),
- Application level (transaction times, user response times, application faults)
- Infrastructure level (database, OS, networking, storage)
- Client software level (application errors, user measured transaction times)
- Deployment pipeline level (build pipeline status, change deployment lead times, deployment lead times, environment status)
Chapter 15: Analyze Telemetry to Better Anticipate Problems and Achieve Goals
We must create tools to discover variances and weak failure signals hidden in our metrics to avoid customer-impacting errors.
Outlier detection: This is a method from Netflix for dealing with misbehaving nodes: simply compute what the current “normal” is, and any node not fitting that behavior is removed. Just kill the misbehaving node and log it; Ops doesn’t even need to be involved.
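Netflix’s actual implementation isn’t shown in the book; a toy version of the idea, using the median and median absolute deviation as the “current normal,” might look like this (node names and thresholds are invented):

```python
from statistics import median

def misbehaving_nodes(latencies_by_node, tolerance=3.0):
    """Flag nodes whose latency deviates from the cluster's current
    normal (the median) by more than `tolerance` times the typical
    deviation; these are candidates for automatic termination."""
    values = list(latencies_by_node.values())
    mid = median(values)
    spread = median(abs(v - mid) for v in values) or 1.0
    return [node for node, v in latencies_by_node.items()
            if abs(v - mid) / spread > tolerance]

cluster = {"node-a": 101, "node-b": 99, "node-c": 103, "node-d": 420}
print(misbehaving_nodes(cluster))  # → ['node-d']
# The automation can then kill node-d and log the event for review.
```

Using the median rather than the mean keeps the “normal” itself from being dragged around by the very outlier you are trying to detect.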
Use means and standard deviations to determine when something is “significantly different.” For example, if something is running far slower than usual, or if unauthorized login attempts are three standard deviations above the mean, we need to know what’s going on. We must also ensure our data has a Gaussian distribution; anomalies in non-Gaussian distributions can be detected too, but require different techniques.
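A minimal three-sigma check along those lines, using Python’s standard library (the login counts are made up):

```python
from statistics import mean, stdev

def three_sigma_alert(history, current):
    """Alert when the current value is more than three standard
    deviations above the historical mean (assumes roughly Gaussian data)."""
    mu, sigma = mean(history), stdev(history)
    return current > mu + 3 * sigma

# Hourly counts of unauthorized login attempts over the past day:
logins = [12, 9, 14, 11, 10, 13, 12, 11]
print(three_sigma_alert(logins, current=45))  # True: investigate
print(three_sigma_alert(logins, current=15))  # False: within normal variation
```

In production this would run continuously over a sliding window of telemetry rather than a fixed list.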
Having a statistician, or someone skilled in statistics and statistical software such as Tableau, can be very helpful in drawing the conclusions needed to anticipate problems.
CHAPTER 16: Enable Feedback so Dev and Ops Can Safely Deploy Code
The fear of deploying code (shared by both Dev and Ops) has an antidote: Get the feedback to teams. It isn’t enough to collect info, we must disperse it. In turn, we create the cultural confidence in our code that comes with the knowledge gained from the telemetry.
One straightforward way to do this is to have developers watch their work being used downstream, so they see customer hardship first-hand. For example, UX teams can observe users working with their product, and may gain useful feedback such as noticing the product requires far too many clicks. Another way is to have developers briefly self-manage their services before they go into production, using Ops engineers as consultants along the way. Doing all of this is the best way to avoid, for example, waking up Ops team members at 2 AM because of an outage.
Launch guidance and requirements should include the following checks:
- Does the application perform as designed?
- Is the application creating an unsupportable number of alerts in production?
- Is coverage of monitoring sufficient to restore service in the case of mishaps?
- Is the service loosely coupled enough to support a high rate of change?
- Is there a predictable automation process?
- Does the Dev team have sufficiently good production habits?
Ultimately funneling feedback back to Dev & Ops creates a better working relationship between Dev & Ops, because shared goals and empathy are reinforced.
Chapter 17: Integrate Hypothesis-Driven Development and A/B Testing into Daily Work
One of the Handbook's authors, Jez Humble, once said: "The most inefficient way to test a business model or product idea is to build a complete product to see whether the predicted demand actually exists." That is, before building, we must ask, "should we build it?" Then we validate through the scientific method whether features are worth the labor required to birth them.
We want a world where every employee can run rapid experiments daily. For example, Intuit's TurboTax team runs experiments during tax season, despite it being their peak traffic period. In fact, they run the experiments then precisely because it is peak traffic time.
In A/B testing, a user is randomly shown one of two versions (A, the control, and B, the variation). The method was pioneered in the marketing world. Before email and social media, direct marketers prospected through postcard campaigns, experimenting with different copywriting styles, wording, colors, and more. Each experiment often required a new, long, expensive mailing, but those expenses were easily repaid if, for example, the rate of sales rose by a significant percentage.
As it relates to DevOps, each feature should be treated as a hypothesis: build it as the B in an A/B test, with the product without the feature remaining as A. Then we run through the scientific process to determine B's effect, and whether B is even worth releasing to production.
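The "scientific process" step can be made concrete with a standard two-proportion z-test comparing conversion rates of A and B. This is a generic statistical sketch, not code from the Handbook, and the conversion counts are invented:

```python
from math import sqrt

def ab_z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: is variant B's conversion rate
    significantly different from control A's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # pooled rate under the null hypothesis that A and B convert equally
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# control converted 200 of 10,000 users; the variant 260 of 10,000
z = ab_z_score(200, 10_000, 260, 10_000)
print(round(z, 2))  # |z| > 1.96 → significant at roughly the 95% level
```

Real experimentation platforms add guardrails (minimum sample sizes, multiple-comparison corrections), but the core decision is this comparison.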
To conclude Chapter 17, we must outperform our competition, and experimentation unleashes the power of creativity to get business wins.
Chapter 18: Create Review and Coordination Processes to Increase Quality of our Current Work
Along with the Three Ways, we can simultaneously improve quality by injecting review into our daily work. Typically, we tend to review changes just prior to deployment, approvals often come from external teams far removed from the work, and the time required for approvals lengthens lead times.
GitHub is an example of a company that has optimized change review. The team at GitHub believes inspection can increase quality while being integrated into normal daily work, with pull requests serving as the mechanism for peer review. GitHub Flow is composed of five steps:
- 1. To work on something new, an engineer creates a branch off of master
- 2. That engineer commits to the branch locally, regularly pushing their work to the same-named branch on the server
- 3. When they think it is ready for merging, they open a pull request
- 4. When they get their desired reviews and approvals, the engineer can then merge into the master branch
- 5. Once changes are merged and pushed to master, the engineer deploys them into production
In Puppet Labs' 2014 State of DevOps Report, one of the key findings was that high-performing organizations relied more on peer review and less on external approval of changes. Consider how odd it can be to have external teams, rather than peers, approving changes: for a 50-line code change, will a change board really know whether it should be approved better than the developer who sits next to the one who submitted it?
The principle of small batch sizes also applies to peer review. Developers work in small incremental steps and look at each other’s work along the way.
As is the case with so much of this book, culture is key. Optimal culture will value code review as much as it does code writing.
To conclude Part 4, we now see that by implementing feedback loops, we can enable everyone to work together towards shared goals, see problems in real-time, and with fast detection and recovery, ensure features operate as designed while achieving organizational goals and learning.
PART 5: The Third Way: The Technical Practices of Continual Learning and Experimentation
This part is all about opportunistically capitalizing on accidents and failures, and continually making the system safer, resulting in higher resilience and ever-growing collective knowledge.
Chapter 19: Enable and Inject Learning into Daily Work
Even with checklists, we still cannot predict all outcomes of all actions. Risk is inherent; we will never achieve zero risk, nor should we want to. Our goal is to create an organization that can constantly heal itself.
One example given: an entire availability zone (AZ) of AWS once went down, yet Netflix, which many thought should have been affected, was not affected at all. Years before the outage, Netflix had rearchitected and designed for significant failures, including the loss of an entire AWS AZ. If an AZ went out, Netflix would serve static content such as cached or unpersonalized results, which requires less computing. Moreover, Netflix had been running "Chaos Monkey," which simulated AWS failures by constantly and randomly killing production services.
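The idea behind Chaos Monkey can be sketched as a random kill round. This is purely illustrative: the real Chaos Monkey (open-sourced by Netflix) terminates actual cloud instances, whereas here `instances` is just a list of made-up names and "termination" is a list append:

```python
import random

def chaos_round(instances, kill_probability=0.1, seed=None):
    """Randomly pick instances to 'terminate', Chaos-Monkey style.
    The system under test must stay healthy despite the losses."""
    rng = random.Random(seed)  # seedable for repeatable demos
    survivors, killed = [], []
    for inst in instances:
        (killed if rng.random() < kill_probability else survivors).append(inst)
    return survivors, killed

survivors, killed = chaos_round([f"node-{i}" for i in range(10)], 0.3, seed=42)
print("killed:", killed)
```

The point of running something like this continuously in production is cultural as much as technical: teams are forced to design services that tolerate random instance loss.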
The willingness Netflix showed creates an environment that rejects "the bad apple theory." Human error is not the cause of our troubles; rather, human error is a result of the design of the tools we create. Instead of playing the blame game, the goal should be to maximize opportunities for continual learning. If you punish failures, you guarantee the failures will happen again.
What we seek is blameless post-mortems after accidents occur, and injection of failure routinely before accidents occur. In blameless post-mortem meetings, construct a timeline of what happened as soon as you can after the mistake, so that the link between cause and effect is fresh. Include relevant stakeholders: Who is affected by the problem, who contributed to it, who diagnosed it, etc.
Oftentimes we create countermeasures that assume we will be more careful and less stupid in the future. Instead of assuming we will magically improve as individuals, we must design real countermeasures that prevent these errors from recurring at the design and system level. Examples of true countermeasures include:
- New automated tests
- Adding production telemetry
- Adding more peer review
- Conducting rehearsals of this failure as part of regular activity
After the post-mortem meeting, publish the findings as broadly as possible and place them in a central location for all. For such information to spread, we ultimately need the right posture and tone from leadership: do they value failures in production as a good learning opportunity, or is that too threatening?
Institute "game days" to rehearse failures: a service is not truly tested until it is broken in production. At Amazon, they will literally power off facilities without notice and then let people follow their processes wherever they lead. Doing this exposes fundamental defects in a system and surfaces single points of failure.
Chapter 20: Convert Lccal Discoveries into Global Improvements
Now, we must create mechanism to capture knowledge locally and as soon as possible disperse it globally.
We can use chatrooms and chatbots to facilitate fast communication. Also, we can put automation tools into the middle of chat room functions. You can look into chat logs as well, so that if you’re new, information is there for you. An example of this is “Hubot” at GitHub, an app that interacted with the Ops team in their chat rooms. Here, users could execute commands by instructing the bot to perform actions. The bot’s response would also appear on the chat room. Benefits of Hubot were that everyone sees everything happening, new engineers saw what daily work was, people were more culturally accepting of asking for help, and rapid organizational learning was accumulated.
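The Hubot pattern, a bot that executes commands in a shared channel, can be sketched generically. Note that the real Hubot is a Node.js project; the `opsbot` name and its commands below are invented for illustration:

```python
# A toy ChatOps dispatcher in the spirit of GitHub's Hubot.
def deploy(app):
    return f"deploying {app}..."

def status(app):
    return f"{app}: all checks green"

COMMANDS = {"deploy": deploy, "status": status}

def handle_message(text):
    """Parse 'opsbot <command> <arg>' and reply in-channel, so everyone
    sees both the request and the result (the key cultural benefit)."""
    parts = text.split()
    if len(parts) < 3 or parts[0] != "opsbot":
        return None  # not addressed to the bot
    command, arg = parts[1], parts[2]
    handler = COMMANDS.get(command)
    return handler(arg) if handler else f"unknown command: {command}"

print(handle_message("opsbot deploy web-frontend"))  # → deploying web-frontend...
```

Because requests and responses both land in the chat log, the log itself becomes the searchable record of daily operational work that the chapter describes.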
Another element is to express as many processes and policies in code as possible. Knowledge that has never been written down, known spontaneously by certain people but not by everyone, breeds organizational ignorance. Instead, if a company has best architectural practices, for example, it should encode them as templates so they are widely accessible and known.
Create a single firm-wide shared source code repository: For example, by 2015 Google had built a single source code repository with over 1 billion files and over 2 billion lines of code, shared across all 25,000 engineers and spanning every Google property. One valuable effect is that engineers can leverage the diverse expertise of everyone globally. It also cuts out the need for coordination to access code: people can access what they need directly.
We can also spread knowledge by using automated tests in the shared libraries, so that our tests are essentially documenting our processes and features. Any engineer looking to understand how the system works can directly review the tests and test results to deduce how things work.
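Tests as documentation might look like the following; `normalize_username` and its rules are a hypothetical example, not from the book:

```python
def normalize_username(raw):
    """Hypothetical firm-wide rule: usernames are lower-case,
    trimmed, and never empty."""
    name = raw.strip().lower()
    if not name:
        raise ValueError("username cannot be blank")
    return name

def test_normalize_username():
    # A newcomer can read these assertions to learn the intended
    # behavior, no tribal knowledge required.
    assert normalize_username("  Alice ") == "alice"
    assert normalize_username("BOB") == "bob"
    try:
        normalize_username("   ")
        raise AssertionError("blank names must be rejected")
    except ValueError:
        pass

test_normalize_username()
print("all documented behaviors hold")
```

When such tests live in the shared repository alongside the library, the documentation can never drift out of date: if the behavior changes, the "documentation" fails.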
Chapter 21: Reserve Time to Create Organizational Learning and Improvement
The Toyota Production System has "improvement blitzes," also known as "kaizen blitzes": a group is gathered to focus on a process and its problems, with process improvement as the objective, and the approach is to have people from outside the process advise the people inside it. This often yields solutions that otherwise could not have been created.
At Target, they have a "DevOps Dojo," where DevOps coaches help teams work in an isolated environment, sometimes for up to 30 days at a time, with capacity for 8 teams doing 30-day challenges simultaneously. Highly strategic, critical capabilities have come through the Dojo, including point of sale (PoS), inventory, pricing, promotion, and more. An extra benefit: when a team is done and has improved its process, it carries that knowledge back to local teammates who may not have attended.
Also, schedule rituals to pay down technical debt, as outlined in prior sections of this summary. We repeat the topic because paying down technical debt not only increases flow; setting aside time for "spring and fall cleanings" also improves knowledge of where failures, waste, and WIP are forming.
Additionally, attend external conferences where possible, and make sure that those who attend disperse what they gained to non-attending employees. Similarly, where possible, have internal conferences!
To conclude Chapter 21 and Part 5: we must use knowledge to improve our daily work, and that is as important as the daily work itself.
PART 6: The Technical Practices of Integrating Information Security, Change Management, and Compliance
These practices won't only improve security; they will also create processes that are easier to audit and that support compliance and regulatory obligations. Part 6 outlines the following:
- Making security a part of everyone’s job
- Integrating preventative controls into our shared source code repository
- Integrating security with our deployment pipeline
- Integrating security with our telemetry
- Protecting our deployment pipeline
- Integrating our deployment activity with change approval processes
- Reducing reliance on separation of duty
Chapter 22: Information Security is Everyone's Job Every Day
One major objection to implementing DevOps is “Infosec and Compliance won’t let us.” However, implementing DevOps is one of the best ways to implement InfoSec into daily activity. InfoSec teams are typically hugely outnumbered by everyone else, so it also makes sense to infuse InfoSec into everyone’s day so that there is better coverage, so to speak.
We can get InfoSec involved sooner with feature teams, ideally as early as possible. This helps InfoSec better understand the team goals in the context of organizational goals, observe implementations as they’re being built, and give guidance and feedback early in the project while there is time to amend.
When possible, we also want to integrate security into post-mortems and defect tracking. Sharing a work tracking system for this purpose can be greatly advantageous.
Integrate security into our deployment pipeline: here we automate as many InfoSec tests as possible, giving Dev, Ops, and Security real-time feedback on the acceptability and compliance of their work.
We should also build design patterns into developers' daily tools to help them prevent accidental infractions. For example, we can grey out submit buttons when commits shouldn't be happening, or provide password-storing systems.
Ensure security of the environment: We want to ensure environments are in a hardened, low-risk state with monitoring controls and sound configuration. A great example of this is the FedRAMP approvals for the United States Federal Government, being applied to their massive digitization and DevOps overhaul.
Also, we can build security telemetry into our environments, monitoring and alerting on items such as security group changes, configuration changes, cloud infrastructure changes, web server errors, and much more.
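Such security telemetry can be sketched as a simple filter over an event stream. The event shape and type names below are assumptions for illustration, not a real cloud provider's schema:

```python
# Event types the chapter suggests watching; names are invented here.
WATCHED = {"security_group_change", "config_change", "infra_change"}

def alerts_from_events(events):
    """Return an alert line for every security-relevant event,
    so it can be pushed to the same channels the teams already watch."""
    return [
        f"ALERT: {e['type']} by {e['user']}"
        for e in events
        if e["type"] in WATCHED
    ]

events = [
    {"type": "login", "user": "alice"},
    {"type": "security_group_change", "user": "bob"},
    {"type": "config_change", "user": "carol"},
]
print(alerts_from_events(events))
```

Routing these alerts into the shared chat rooms described in Chapter 20 keeps security visible to everyone, not just the InfoSec team.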
Lastly, protecting deployment pipelines means ensuring each CI process runs in isolation: CI processes should run in their own isolated container or virtual machine, version control credentials should be read-only, and so on. This way of integrating security objectives into daily work is fully detailed in the next chapter.
Chapter 23: Protecting the Deployment Pipeline
The nature of changes as defined by ITIL is important to know. Standard changes are low risk, and they follow an established approval process or can be pre-approved. Normal changes are higher risk, and require review or approval from an agreed upon authority. Urgent changes are an emergency, which may need to be put into production immediately.
As it relates to DevOps, we ideally look to automate and simplify standard changes, and to recategorize as many changes as appropriate into pre-approved standard changes. We also want to reduce reliance on separation of duties, because, as we know, we are trying to reduce handoffs.
Better security means we defend our data and are sensible about it, and that we can recover from security mishaps before catastrophe hits. Finally, we make security better than ever by inviting InfoSec into our processes rather than working at odds with them.