Featuring a hand-picked lineup of presenters that we think best emphasize the values and ideas that will move our community forward.
Senior Software Engineer, Chef
Co-Author, "Effective DevOps"
Infrastructure Operations Engineer, Travis CI
Co-Author, "Effective DevOps"
Senior Director of Operations, Threat Stack
Director of Technical Advocacy, HashiCorp
Security Operations Engineer, GitHub
Sr. Software Development Engineer, Chef
Developer Advocate Extraordinaire, Logz.io
J. Paul Reed
Author, "DevOps in Practice"
Co-host, "The Ship Show"
First time conference presenter!
Principal Technologist, Pivotal Cloud Foundry
Co-host, "Arrested DevOps"
Incident and Alerting Specialist, VictorOps
Author, "Post-incident Reviews"
9am-5pm each day. Session times TBD.
View the schedule online or download the schedule in iCal format
Put Some Dev in Your Devops
Presented by Katherine Daniels
One of the goals of modern operations teams is to provide and operate services that other engineering teams use to do their jobs, enabling developers to get their work done with as little friction or operational overhead as possible. Some of those services, such as hardware provisioning and server configuration, have typically fallen under the domain of system administration, but there’s no reason that they have to be developed or operated in an old-school, sysadmin sort of way.
This talk will look at how Etsy’s operations team has worked to add some dev to their ops, including test coverage, refactoring, and deployment processes, with a focus on how to add better development processes to existing infrastructure and tools (since we didn’t always have the luxury of throwing things out and starting from scratch!). It will discuss specific tools such as Chef, Nagios, and Etsy’s Deployinator, but the concepts will be applicable to most operations tech stacks.
Platform Agnostic and Self Organizing Software Packages
Presented by Nell Shamrell
One of the dreams of development is to build a software package once, then be able to deploy it anywhere. With current open source projects this dream is closer than ever. Come to this talk to learn how to create software packages that run (almost) anywhere. You will see how the same application can be run on bare metal, on a VM, or in a container, with everything needed to automate that application already built into the package itself. This even works with a mixed infrastructure: metal for your static, compute-heavy loads, VMs for your persistent data stores, and ephemeral, short-lived containers for your applications managed by Kubernetes or other container scheduling services.
Come to this talk to also learn how to build and deploy these packages with the intelligence to self-organize into topologies, no central orchestrator needed. Learn how the dream of platform-agnostic and self-organizing packages is fulfilled today and where it will evolve in the future.
Security in Automation
Presented by Jamesha Fisher
Security automation involves so much more than just infrastructure as code. How do we make it easier for engineers to do their jobs and at the same time have security in all that we do, including for companies just starting out? In Jamesha Fisher’s talk, she expands upon the world of Security in DevOps, and how automation helps engineering overall.
Next-Gen Bug Catching with Fault Localization: The Future is Now
Presented by Robbie McKinstry
Sometimes the systems we need to scale are human systems, not software systems. Fault localization tooling improves team member autonomy, enabling engineers to find and fix defects faster and with greater confidence. In this talk, we'll step through a simple fault localizer you could write yourself and talk about how fault localization fits into your workflow, from development to CI/CD and on to production. Finally, we'll talk about next-generation localization tools and how they will not only enable developer autonomy but also enable system autonomy. After all, the goal of DevOps isn't automated systems but autonomous systems.
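As a rough illustration of what a simple, write-it-yourself fault localizer might look like, here is a minimal sketch of spectrum-based fault localization using the Ochiai suspiciousness score. The abstract does not say which technique the talk covers, so the choice of technique, the function name `ochiai_ranking`, and the sample coverage data are all assumptions made for illustration.

```python
import math


def ochiai_ranking(coverage, outcomes):
    """Rank code elements from most to least suspicious.

    coverage: dict mapping test name -> set of code elements the test executed
    outcomes: dict mapping test name -> True if the test passed, False if it failed
    """
    total_failed = sum(1 for passed in outcomes.values() if not passed)
    elements = set().union(*coverage.values())
    scores = {}
    for element in elements:
        # Count failing and passing tests that executed this element.
        failed = sum(1 for test, covered in coverage.items()
                     if element in covered and not outcomes[test])
        passed = sum(1 for test, covered in coverage.items()
                     if element in covered and outcomes[test])
        # Ochiai: failed(e) / sqrt(total_failed * (failed(e) + passed(e)))
        denominator = math.sqrt(total_failed * (failed + passed))
        scores[element] = failed / denominator if denominator else 0.0
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical coverage data: which functions each test exercised.
coverage = {
    "test_a": {"parse", "validate"},
    "test_b": {"parse", "save"},
    "test_c": {"parse", "validate", "save"},
}
outcomes = {"test_a": True, "test_b": False, "test_c": False}

ranking = ochiai_ranking(coverage, outcomes)
for element, score in ranking:
    print(f"{element}: {score:.3f}")
```

In this made-up data set, `save` is executed only by failing tests, so it ranks first. In a real workflow the coverage and outcomes would come from a test runner's coverage report rather than hand-written dictionaries.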
Scale it to a Billion: How to build it, keep it safe, and keep it running
Presented by Pete Cheslock
Over the past three years, Threat Stack has been working on building a scalable distributed system to manage a continually growing corpus of customers’ critical data. Like many growing companies, Threat Stack has limited time, money, and resources, but that doesn’t offer an excuse to skimp on things such as high availability and security.
I'll share the operational and security practices that helped Threat Stack scale while staying stable and secure, covering technology and tools and the various scale points that forced hard decisions. Along the way, we'll also explore approaching security not as a dedicated team but as a culture that everyone owns.
- Going from five servers and a few hundred thousand events per day to several hundred servers and ten billion unique events each day
- How on-demand telemetry helps Threat Stack scale
- Early design decisions that worked (and those that didn’t)
- When to use distributed systems (and when not to)
Designing for Operations
Presented by Craig McLuckie
Kubernetes and Linux application containers are making it easier than ever to build and deploy distributed systems software. They also provide a gateway to more intrinsically sustainable operations. During this session we will explore how to build more intrinsically operable systems with Kubernetes that are easy to live with, and how these systems are changing the game for operations teams that are responsible for the care and feeding of production systems.
Reactive Infrastructure with Consul
Presented by Seth Vargo
Consul is an open source tool for service discovery, monitoring, and infrastructure configuration. There are two sides to monitoring: exposing problems with alerts, and acting upon those alerts to automatically resolve them when possible or notify an operator. For exposing problems, Consul works much like other monitoring solutions. Users can define any script for Consul to intelligently check and report the health status of a node in a cluster. In this way, Consul is compatible with Nagios- and Sensu-style checks, but the problem with monitoring systems like Nagios or Sensu is that they are knowledge silos. They are designed to ingest health information and expose it to human operators. Consul supports health monitoring using Nagios-style plugins, but it is designed to expose that information in a way that is both machine- and human-actionable.
With Consul custom watches and service discovery integration, infrastructure can automatically react and adjust around failures. If a web node is reporting an unhealthy state, Consul can automatically remove the node from the load balancer. If a disk space health check is low, Consul can automatically run logrotate and delete everything in `/tmp`. If CPU load is high, Consul can trigger a script to add more nodes to the cluster. In this way, Consul pushes the existing paradigms of monitoring, making it much more than a notification system.
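To make the "react" side concrete, here is a minimal sketch of the decision logic a watch handler might run. It assumes a Consul watch of type `checks`, which passes the current health checks to its handler as a JSON array on stdin, with fields such as `Node`, `CheckID`, and `Status`. The check IDs, the `REMEDIATIONS` table, and the `plan_actions` function are hypothetical, and the sketch only decides what to do rather than doing it.

```python
import json

# Hypothetical mapping from a check ID to the remediation it should trigger.
REMEDIATIONS = {
    "web-health": "remove node from load balancer",
    "disk-space": "run logrotate",
    "cpu-load": "add nodes to cluster",
}


def plan_actions(watch_payload):
    """Return (node, action) pairs for every check that is not passing.

    watch_payload: JSON text in the shape a `checks` watch hands to its handler.
    """
    checks = json.loads(watch_payload) or []
    actions = []
    for check in checks:
        if check.get("Status") == "passing":
            continue
        # Fall back to a human when no automated remediation is known.
        action = REMEDIATIONS.get(check.get("CheckID"), "notify an operator")
        actions.append((check.get("Node"), action))
    return actions


# Made-up payload for illustration only.
sample_payload = json.dumps([
    {"Node": "web1", "CheckID": "web-health", "Status": "critical"},
    {"Node": "db1", "CheckID": "disk-space", "Status": "passing"},
    {"Node": "app3", "CheckID": "queue-depth", "Status": "warning"},
])

for node, action in plan_actions(sample_payload):
    print(f"{node}: {action}")
```

A real handler would read the payload from `sys.stdin` and invoke the actual remediation scripts instead of printing labels.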
From Turing to Big Data: A Look at Computing and Analytics
Presented by PJ Hagerty
A look at where computing and analytics began and where they are headed. The basis of all DevOps and development starts with our measurements. We need to understand what role analytics and metrics play in modern computing. To find out, we look at where analytics began at the advent of computer science and where we stand today, with an eye toward the future.
Swimming in Services: Navigating Unknown Waters
Presented by Jennifer Davis
Whether we run a monolithic or microservices architecture, in-house solutions or third-party platforms, our environments are trending towards increased complexity. What measures can we take to ensure quality work experiences for ourselves, companies, and customers?
In this talk, we will examine operational patterns, practices and tools that serve in these unstable and rapidly changing waters. While we can't know what we don't know, we can be prepared with appropriate responses to avoid drowning and in the process change the way we build our products.
Topics will include:
- Qualifying and Quantifying Risk
- Recognizing and Responding to Issues
- Recovering from and Reflecting on Failure
Zebras all the way down: The engineering challenges of the data path
Presented by Bryan Cantrill
Much attention is rightfully devoted to the development and deployment of stateless services, but these services are not themselves devoid of persistent state; rather, they rely on other services to manage this state for them. This data path, however -- that stack of software that is emphatically not stateless, being responsible for distributed and/or persistent state -- is entirely different in its constraints and failure modes. This software takes years or even decades to get right, can be arduous to upgrade, and -- even in a post-cloud era -- lives and dies by the fickle whims of hardware and firmware. This talk will reflect on two decades of building the data path, from the dawn of storage networking through modern cloud storage services.
I Volunteer as Tribute - the Future of Oncall
Presented by Bridget Kromhout
Living #opslife makes us keenly aware of the cavernous gap between lofty ideals and 3am reality. In a perfect world, everyone would be devopsing sans effort. In the real world, sharing oncall is not as easy as giving devs prod AWS creds, adding them to the rotation, and saying “good luck! have fun!”
From tightly-guarded fiefdoms to “of course all the devs are on call” to carefully negotiated compromises, I’ve lived this movie enough times to see what works (and what definitely doesn’t). I spent 1999 to 2015 on call for production infrastructure and made mistakes so you don’t have to! Spoiler alert: instead of volunteering as tribute to the vagaries of the pager, volunteer to invest in your architecture and your co-workers; you’ll sleep better at night.
Scrutinizing the Scrutiny
Presented by Jason Hand
Common approaches to post-incident reviews are often short-sighted in their focus and rarely bring about any real improvements to our overall systems.
This talk will provide insight into new ways teams are analyzing incidents in retrospect in order to continuously improve system uptime.
Many organizations have found great value in retrospective analysis following incidents that impact the reliability and availability of a service. Commonly known as post-incident reviews or postmortems, these exercises let companies analyze what went wrong in retrospect. This talk will point out the true value of a post-incident review as well as how to perform one so that it exposes the most improvements for an organization’s people, process, and technology.
Let’s explore a deeper understanding of failure in complex systems and key metrics leveraged to consistently improve the availability and reliability of systems. Jason will point out common flaws in the way many organizations approach retrospective analysis of outages and service disruptions as well as uncover areas often overlooked during a retrospective (such as what engineers were thinking when they made their decisions).
Drawing on the new O’Reilly Media book “Post-Incident Reviews: Learning From Failure for Improved Incident Response”, this talk will give the audience a broader understanding of the purpose of these reviews and how to get started on a new path towards continuously improving the uptime of systems and services. Jason will also provide a template for audience members to take back to their teams to use as a starting point for a new approach.
Audience challenges & takeaways:
- What’s broken about current methods of incident retrospective exercises?
- Why is “the human element” important?
- What is the true purpose of a post-incident review?
- What are the key components of the exercise?
- How can we continuously improve this process?
"Failure" as Success: the Mindset, the Methods, and the Landmines
Presented by J. Paul Reed
"Failing fast," "failing forward" and "Learning from failure" are all the rage in the tech industry right now. The tech company "unicorns" seem to talk endlessly about how they reframe failure into success. And yet, many of us are still required to design and implement backup system capabilities, redundancies, and controls into our software and operations processes. And when those fail, we cringe at the conversation with management that will ensue. So is all this talk of reframing "failure" as "success" within our organizations just that: talk? And what does that look like, anyway? We'll explore mindset, the history it's rooted in, as well as effective methods to move your organization toward it and some land mines to avoid along the way.
Presented by You(?)
Five minute presentations by attendees. No product pitches. Sign-ups at the event.
We believe in creating high quality events that are accessible to all. Thanks to these sponsors for helping us maintain low ticket prices. Email firstname.lastname@example.org for information about sponsoring this and other events or view our prospectus.
Venue & Hotel
Our event will be held in the auditorium of the Union Trust Building located at 501 Grant Street in Pittsburgh, PA. The official conference hotel is the Omni William Penn Hotel located directly across Oliver Avenue from the conference venue. To book a room at a special conference rate, call and mention Code & Supply Uptime or use the hotel's provided online booking service.