More often than I’d like, I’ll get a message from someone “important”:
“Did you know that [insert application] is down at the moment?”
“Of course,” I lie. “We have people working on it right now.”
I’ll then find someone on my team:
“Did you realise that [insert application] is down at the moment?”
“Of course …”
Actually, in the two and a half years that I’ve been doing this job, I’ve only been a liar in the first part of the reply: I’ve never found people sitting on their hands while an application has been down.
“Oh, thanks for letting us know … yes, we should get right on it,” they’ve never said.
What has happened in this little interchange is that I’ve been disturbed, my team has been disturbed and finally the team working on the problem has been disturbed. Because we’re all “important”, the really important people, the team fixing the outage, have had to stop, explain what they’re doing and answer any stupid questions I might have. I usually have several.
We are trying to change that in our group. It’s an initiative we call Working in plain sight. The idea is simple: if we can design work so that simply doing the work gives off signals that show what’s happening, we can leave people alone to work without disruption. It’s an essential part of being Agile: working in a transparent way.
The situation I’ve described above is an example of unplanned work; we didn’t see this coming. Working in plain sight applies equally to planned work, and I’ll talk about that in a subsequent blog post on what we’re doing with Agile Programme Management. The goal of management is to maximise the proportion of planned to unplanned work because you can control planned work.
But you also have to deal with the unplanned and unexpected. This is what we are doing with our new processes and tooling for dealing with system outages. The goal is to notify the right people and report status with minimal interruption for the team working the problem.
This is an early design, so we’ll be adjusting as we go. The diagram shows a simple cycle of what happens:
- Firstly, we’ll get an alert when a system encounters a problem. If we’re doing a good job, that alert will come through our monitoring tools, such as New Relic. But it could also happen that we hear about it from a user – when we’re not doing such a good job. We use PagerDuty to alert the teams in either case. The alerting tools will also set up the workspace for the teams to collaborate. In our case, this is a Slack thread. We communicate the location of the Slack thread to everyone on the alert list (stakeholders, users, support teams).
- Work for us happens in ServiceNow where we log our production tickets. If the fix requires a development team, they will do their work in Jira. An important part of doing the work is collaboration – across a variety of teams and subject matter experts from the user community. Doing the collaboration in plain sight in the Slack thread is what is new here.
- Reporting is the stage where things are really going to be different. It’s different because the place where we do the work and the place where we report are the same place.
The Slack thread in our production operations channels (we have several channels for the different parts of our application portfolio) carries the production ticket number. Everyone can see what the team is doing to solve the problem by simply following the activity in the thread. That means, if the team remembers that they have the world’s eyes upon them, they can let everyone know what’s going on simply by recording their activity in the thread. It’s something they’d have to do anyway to collaborate on solving the problem (see work above).
In other words, just doing the work and connecting everyone to where the work is being done meets the reporting needs. “Important” people like me can dive in to ask useful questions like “What help do you need?” Sometimes I even add some value by helping to interpret what’s going on for other observers.
If we are successful, we’ll remove the need for separate status calls/charts that the team would have had to prepare in the past.
- The team will resolve the ticket in ServiceNow and again, the status will go back out in the Slack channel and also in our status reporting systems. We haven’t settled on a tool for that part of the activity yet. The resolution will trigger an alert to everyone who is subscribed.
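To make the cycle above concrete, here is a minimal sketch of the kind of glue code that could sit between an alert and the Slack workspace. Everything in it is an assumption for illustration – the channel names, the alert field names, the ticket format and the routing rules are invented, not our actual tooling or schema.

```python
# Hypothetical sketch: turning an incoming alert into the opening
# message of a Slack thread that carries the production ticket number.
# All names below (channels, fields, ticket ids) are illustrative.

def route_channel(service: str) -> str:
    """Pick the production-operations channel for a service.

    We have several channels for different parts of the application
    portfolio; this per-service map is an invented example of that idea.
    """
    channels = {
        "orders": "#prod-ops-orders",
        "marketing": "#prod-ops-marketing",
    }
    return channels.get(service, "#prod-ops-general")


def build_thread_opener(alert: dict, ticket: str) -> dict:
    """Compose the message that opens the incident's Slack thread.

    Putting the ticket number in the opening message is what lets
    observers follow the work without interrupting the team.
    """
    return {
        "channel": route_channel(alert["service"]),
        "text": (
            f"[{ticket}] {alert['summary']} "
            f"(severity: {alert['severity']}) "
            "- follow this thread for status."
        ),
    }


if __name__ == "__main__":
    # Example alert payload, loosely shaped like a monitoring webhook.
    alert = {
        "service": "orders",
        "summary": "API latency spike",
        "severity": "high",
    }
    msg = build_thread_opener(alert, "INC0012345")
    print(msg["channel"])
    print(msg["text"])
```

In a real implementation, the message dictionary would be posted via the Slack Web API and the ticket number would come from ServiceNow; the point of the sketch is only that one small piece of automation can connect the alert, the workspace and the ticket so that status falls out of doing the work.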
The processes and systems are new. We don’t have all the automation we’d like in place yet, and our process is still very application-centric; if you’re an end user of one of our applications, you probably want to know what you can or can’t do when a system is down. We can’t tell you that yet. And we don’t yet consistently have the monitoring dashboards that we’d like. Those dashboards need to bring together the salient information from the underlying tools.
But we’re on our way there and we’re hopeful that we’ll have a better way of doing work when we have an outage. In the spirit of Agile, we’re trying to minimise hand-offs and to work transparently. Stopping to communicate with senior management is one of those hand-offs that doesn’t help the team – unless they’re blocked, of course. If we can eliminate as much of that as possible when the unexpected happens, “important” people like me can go back to doing whatever we’re normally supposed to do.
As you’ll see, it’s been some time since I previously blogged (almost five years). But I’m resurrecting the blog to talk about the Agile practices that my group within the IBM CIO organisation (we look after IBM’s Sales & Marketing Systems) is implementing. I’m describing some of the initiatives in this blog; my friend Rash Khan is describing another important initiative called Brilliant at the Basics. That programme addresses embedding Agile practices in our 320 teams worldwide. He introduces the series of articles in this post.