1
00:00:02,939 --> 00:00:05,819
Narrator: You're listening to
the Humans of DevOps Podcast, a

2
00:00:05,819 --> 00:00:09,449
podcast focused on advancing the
humans of DevOps through skills,

3
00:00:09,479 --> 00:00:13,799
knowledge, ideas and learning,
or the SKIL framework.

4
00:00:33,330 --> 00:00:36,090
Jason Baum: Hey everyone, it's
Jason Baum, Director of Member

5
00:00:36,090 --> 00:00:40,860
Experience at DevOps Institute.
And this is the Humans of DevOps

6
00:00:40,860 --> 00:00:45,090
Podcast. Welcome back. Hope you
had another great week this

7
00:00:45,090 --> 00:00:48,240
week. I always hope you have a
great week. So I hope this one

8
00:00:48,240 --> 00:00:52,620
was even greater than the last
one. Today we're gonna be

9
00:00:52,620 --> 00:00:56,850
talking about incidents,
mistakes, do-overs. We talk

10
00:00:56,850 --> 00:01:00,210
about blameless culture. It's
one of the core principles of

11
00:01:00,210 --> 00:01:05,310
DevOps. But in reality, is it as
easy as just saying, "We have a

12
00:01:05,310 --> 00:01:08,940
blameless culture"? In preparing
for today's episode, I'm

13
00:01:08,940 --> 00:01:12,510
reminded of a quote by Phoebe
Waller-Bridge, the creator of

14
00:01:12,510 --> 00:01:16,260
Fleabag and showrunner of
Killing Eve: "That's the very

15
00:01:16,260 --> 00:01:19,950
reason they put rubbers on the
ends of pencils, because people

16
00:01:19,950 --> 00:01:25,290
make mistakes." I love that
quote. If you don't know who she

17
00:01:25,290 --> 00:01:29,250
is, you know, Waller-Bridge has
made a career of bringing to

18
00:01:29,250 --> 00:01:33,300
life unconventional women who
make a lot of mistakes. But what

19
00:01:33,300 --> 00:01:36,900
her series have in common is
that her flawed characters get a

20
00:01:36,900 --> 00:01:40,830
chance at redemption, moving
past mistakes, and offering them

21
00:01:40,830 --> 00:01:43,050
an opportunity to prove
something to themselves and come

22
00:01:43,050 --> 00:01:47,670
out stronger and more confident
on the other side. I feel like

23
00:01:47,670 --> 00:01:50,940
this quote, the metaphor she
made is a perfect setup for our

24
00:01:50,940 --> 00:01:54,660
conversation today. Incidents
are a great opportunity to

25
00:01:54,660 --> 00:01:58,410
gather both context and skill.
They take people out of their

26
00:01:58,410 --> 00:02:01,350
day to day roles and force teams
to solve unexpected and

27
00:02:01,350 --> 00:02:05,700
challenging problems together.
Joining me today to discuss this

28
00:02:05,700 --> 00:02:09,990
topic is Lisa Karlin Curtis.
Lisa is a product engineer

29
00:02:10,020 --> 00:02:15,750
at incident.io. In fact, Lisa was
employee number two at incident.io.

30
00:02:16,410 --> 00:02:19,680
She started out as a consultant
working at Accenture before

31
00:02:19,680 --> 00:02:22,620
accidentally becoming a
developer. I'd love to hear how

32
00:02:22,620 --> 00:02:26,190
that happened. Lisa loves
building stuff, but is also

33
00:02:26,190 --> 00:02:28,440
interested in how people
interact with each other in a

34
00:02:28,440 --> 00:02:32,310
work environment, particularly
in software engineering. Outside

35
00:02:32,310 --> 00:02:35,220
of work, Lisa loves cooking, and
pretty much any competitive

36
00:02:35,220 --> 00:02:40,740
sports. We definitely have that
in common. But I guess she likes

37
00:02:40,740 --> 00:02:44,040
the British ones. She says, I
really don't know that much

38
00:02:44,040 --> 00:02:48,120
about them. So Lisa, welcome to
the podcast. Thank you for

39
00:02:48,120 --> 00:02:48,870
joining me.

40
00:02:49,409 --> 00:02:50,759
Lisa Karlin Curtis: Hey, lovely
to be here. I'm really looking

41
00:02:50,759 --> 00:02:52,319
forward to it.

42
00:02:52,349 --> 00:02:55,709
Jason Baum: Awesome. Are you ready to get
human? Alright, let's do it. All

43
00:02:55,709 --> 00:03:03,149
right. So we're talking about
incidents. How can incidents be

44
00:03:03,149 --> 00:03:06,209
considered an opportunity to
gather context and skill?

45
00:03:07,740 --> 00:03:10,950
Lisa Karlin Curtis: So I kind of
started thinking about this

46
00:03:11,520 --> 00:03:14,640
when I was reflecting on, yeah, I
kind of became a software

47
00:03:14,640 --> 00:03:17,610
engineer accidentally, and I've
accelerated quite quickly, I've

48
00:03:17,610 --> 00:03:20,760
been very fortunate. And part of
that is because a lot of the

49
00:03:20,760 --> 00:03:22,920
stuff I did before I was an
engineer was actually quite

50
00:03:22,920 --> 00:03:27,180
useful. But also part of it was,
I realized, I started doing this

51
00:03:27,180 --> 00:03:30,630
thing where I was basically
running towards the fire. So

52
00:03:30,660 --> 00:03:33,000
stuff would go wrong. And I'd be
like, Oh, that looks kind of

53
00:03:33,000 --> 00:03:37,500
interesting. And all of the
times where I learned most, the

54
00:03:37,500 --> 00:03:40,560
sort of step changes in my
understanding or my context,

55
00:03:41,370 --> 00:03:45,000
were around incidents. So they
were like, something would go

56
00:03:45,000 --> 00:03:47,400
wrong. And either I would learn,
like, while we were fixing that

57
00:03:47,400 --> 00:03:50,070
problem, I would learn about
your stuff. Or straight

58
00:03:50,070 --> 00:03:51,930
afterwards, when I was like
reflecting on it and talking to

59
00:03:51,930 --> 00:03:55,380
people about it, I'd learn a
bunch of stuff. And so I kind of

60
00:03:55,380 --> 00:03:57,690
started thinking about this and
talking to people about it. And

61
00:03:57,840 --> 00:04:00,450
it turns out other people had
had the same experience, what a

62
00:04:00,450 --> 00:04:04,800
surprise. And so I think that
there is something very

63
00:04:04,800 --> 00:04:08,010
unusual about an incident, which is
why it is an incident, right?

64
00:04:08,010 --> 00:04:10,560
Like something, something
happens that is unexpected that

65
00:04:10,560 --> 00:04:12,660
you didn't know was going to
happen. And then you have to

66
00:04:12,660 --> 00:04:14,970
react to it. And that pushes
people outside their comfort

67
00:04:14,970 --> 00:04:17,880
zone. And it pushes you to do
things and see things that you

68
00:04:17,880 --> 00:04:21,450
wouldn't otherwise see. So I
guess like, I can think of like

69
00:04:21,450 --> 00:04:24,330
three key areas where
it's really useful. So one is

70
00:04:24,330 --> 00:04:26,850
about like broadening your
horizons because you see the

71
00:04:26,850 --> 00:04:30,360
stuff that you wouldn't see in
your day to day. One is about

72
00:04:30,360 --> 00:04:33,600
like teaching you how to build
stuff that fails gracefully, and

73
00:04:33,600 --> 00:04:38,730
in an observable way. So I
think that one of the

74
00:04:38,730 --> 00:04:41,160
key differentiators between good
software engineering and great

75
00:04:41,160 --> 00:04:43,740
software engineering is about
what happens when the thing that

76
00:04:43,740 --> 00:04:48,180
you didn't think could happen
happens. So like step one, make

77
00:04:48,180 --> 00:04:52,290
it work. Step two, make it work
really fast. Step three, make it

78
00:04:52,290 --> 00:04:54,720
work really fast. And when you
get a negative number that you

79
00:04:54,720 --> 00:04:57,780
weren't expecting, you explode
really, really loudly as opposed

80
00:04:57,780 --> 00:05:00,480
to just take the negative number
and let's just pay somebody a

81
00:05:00,480 --> 00:05:04,290
negative amount of money or you
know, whatever it might be. And

82
00:05:04,290 --> 00:05:06,960
then sorry, just to finish off,
the third is about building your

83
00:05:06,960 --> 00:05:09,960
network. So you have a whole
bunch of connections with

84
00:05:09,960 --> 00:05:12,000
different people in your
organization. You work with,

85
00:05:12,000 --> 00:05:15,540
like your team. But in an
incident, often you have to, you

86
00:05:15,540 --> 00:05:17,760
have to work with lots and lots
of people from across the

87
00:05:17,760 --> 00:05:21,180
organization. And that builds
bonds that are really important.

88
00:05:21,240 --> 00:05:24,300
And I think really valuable both
to you as an individual and to

89
00:05:24,300 --> 00:05:24,930
the company.

90
00:05:25,230 --> 00:05:29,040
Jason Baum: Yeah, absolutely. If
you listen to this podcast,

91
00:05:29,040 --> 00:05:33,210
you've heard me use parenting
often as examples. Because I

92
00:05:33,210 --> 00:05:38,640
think parenting is so applicable
to what goes on in day to day

93
00:05:38,640 --> 00:05:43,920
life outside of your home. And
one of the things that I was

94
00:05:43,920 --> 00:05:49,350
told when I was a new parent was
plan for the unexpected or plan

95
00:05:49,350 --> 00:05:52,440
for the unplannable. And I think
that's applicable here. I think

96
00:05:52,440 --> 00:05:55,170
with incidents and mistakes,
it's almost like, you need to

97
00:05:55,170 --> 00:05:57,840
expect it, it's going to happen,
if you're building a program,

98
00:05:57,840 --> 00:06:01,800
it's going to have a bug. It's
what happens after it happens.

99
00:06:02,310 --> 00:06:06,150
That really matters, right?
That's where everything

100
00:06:06,150 --> 00:06:06,780
happens.

101
00:06:07,439 --> 00:06:09,209
Lisa Karlin Curtis: Yeah, I
think that's the differentiator

102
00:06:09,239 --> 00:06:13,709
in terms of, if you're, if
you're building a system, you,

103
00:06:13,739 --> 00:06:16,319
you will predict a certain
number of the possible things

104
00:06:16,319 --> 00:06:18,299
that are going to happen, and
you will make your system behave

105
00:06:18,299 --> 00:06:21,179
well in them. And that's all
great. And then you read a book

106
00:06:21,179 --> 00:06:23,699
that's like, oh, you should make
your system observable. And you

107
00:06:23,699 --> 00:06:25,739
go, Okay, I will add some
loglines. And I will add some

108
00:06:25,739 --> 00:06:28,379
metrics. And like, it's really
easy to do that in a way that

109
00:06:28,379 --> 00:06:30,359
doesn't really add any value.
And we've all we've all seen

110
00:06:30,359 --> 00:06:32,399
examples of that. We've all seen
dashboards that don't really mean

111
00:06:32,399 --> 00:06:35,939
anything. And the way to get
from, like, having read it in a book

112
00:06:36,029 --> 00:06:38,849
to being actually able to do it,
valuably, I think is just to see

113
00:06:38,849 --> 00:06:41,879
it. And it's very difficult to
do that and to learn that in the

114
00:06:41,879 --> 00:06:43,949
abstract, but as soon as you see
someone trying to debug a

115
00:06:43,949 --> 00:06:47,759
problem, and you see like, you
know, what, what is the

116
00:06:47,759 --> 00:06:50,039
breadcrumb? What are the
breadcrumbs that they are

117
00:06:50,039 --> 00:06:54,329
following, in order to get from
"our API is slow" to "this is the

118
00:06:54,329 --> 00:06:57,539
root cause, I can now fix this
problem. And if you see somebody

119
00:06:57,539 --> 00:06:59,789
do that enough times, you start
to be able to lay your own

120
00:06:59,789 --> 00:07:02,999
breadcrumbs. Because you can
kind of imagine you can, you can

121
00:07:03,029 --> 00:07:05,219
empathize, you can put yourself
into that person's shoes who's

122
00:07:05,219 --> 00:07:08,939
trying to debug it and be like,
Oh, maybe it'll be useful for me

123
00:07:08,939 --> 00:07:11,729
to like, have a metric here.
Because if this specific bit

124
00:07:11,789 --> 00:07:15,269
starts to go weird, we want to
know about it. And I think that

125
00:07:15,269 --> 00:07:18,929
that's something as you say,
where it's like, it's all about

126
00:07:18,929 --> 00:07:22,169
preparing for the unexpected.
And a lot of that is actually

127
00:07:22,199 --> 00:07:24,659
counter intuitively, perhaps
it's not being able to handle

128
00:07:24,659 --> 00:07:27,899
every case, it's being able to
either be sure that what you're

129
00:07:27,899 --> 00:07:32,279
doing is right, or get a human
to help you out, right, and that

130
00:07:32,309 --> 00:07:34,529
in code is like throwing an
exception or panicking or

131
00:07:34,529 --> 00:07:37,409
whatever you want to call it.
And that's the most important

132
00:07:37,409 --> 00:07:39,479
thing, particularly if you're
building software,

133
00:07:39,479 --> 00:07:42,689
like cars and planes and you
know, software that we trust

134
00:07:42,689 --> 00:07:46,619
with our lives, then you need to
be sure that if the software

135
00:07:46,619 --> 00:07:48,959
sees anything that it doesn't
expect, it defers to the human.

136
00:07:49,289 --> 00:07:51,599
And that's the same in like
FinTech, which is my background.

137
00:07:51,599 --> 00:07:54,269
And it's the same in lots of
bits of software. And that's

138
00:07:54,269 --> 00:07:56,669
something that like, you're not
really taught that and if you

139
00:07:56,669 --> 00:07:58,709
read it in a book, it doesn't
really land. But once you see

140
00:07:58,709 --> 00:08:00,689
it, you can really start to
engage with it and sort of do it

141
00:08:00,689 --> 00:08:01,229
yourself.

142
00:08:01,980 --> 00:08:04,470
Jason Baum: So I mentioned
blameless culture. And that's

143
00:08:04,470 --> 00:08:09,000
really important in DevOps, with
incidents, or blameless

144
00:08:09,000 --> 00:08:15,360
culture anywhere, really. But I
feel like it is something that

145
00:08:15,360 --> 00:08:19,320
is said, I've heard it said
so much that it almost becomes a

146
00:08:19,320 --> 00:08:23,190
buzzword. And you really
question the authenticity

147
00:08:23,640 --> 00:08:28,170
when someone says, "Oh, we have a
blameless culture." How do you

148
00:08:28,200 --> 00:08:33,390
actually have a blameless
culture? How do incidents

149
00:08:33,630 --> 00:08:39,330
lead to learning without
feeling like the person who

150
00:08:39,330 --> 00:08:47,910
made the mistake is getting in
the way or, you know, is... well,

151
00:08:47,910 --> 00:08:50,520
yeah, feeling like you made a
mistake, let people down. I

152
00:08:50,520 --> 00:08:53,400
think that's inherent
nature for all of us, right?

153
00:08:54,390 --> 00:08:55,890
Lisa Karlin Curtis: Yeah,
absolutely. I think that

154
00:08:55,890 --> 00:08:59,190
a lot of people have a lot of
shame around, like making

155
00:08:59,190 --> 00:09:03,480
mistakes at work. And that is a
very, like, human thing. Humans

156
00:09:03,480 --> 00:09:07,770
are incredibly susceptible to
shame. And what that means is

157
00:09:07,770 --> 00:09:10,260
that there is so much
psychological pressure when you

158
00:09:10,260 --> 00:09:13,290
make a mistake to try and cover
it up. And that is like the

159
00:09:13,290 --> 00:09:15,750
worst possible thing that you
can do in a software engineering

160
00:09:15,750 --> 00:09:19,140
environment. And we know that,
and yet all of us still have

161
00:09:19,140 --> 00:09:21,090
that, right. All of us still
have that moment when we find a

162
00:09:21,090 --> 00:09:24,540
mistake was made. And we're
like, maybe I'll just fix it,

163
00:09:24,660 --> 00:09:28,620
and it will be fine. And no one
will ever know. And I think that

164
00:09:28,620 --> 00:09:32,700
is such a, it's so hardwired
into our brains, that you have

165
00:09:32,700 --> 00:09:36,090
to work very, very hard to
combat it. And so some of the

166
00:09:36,090 --> 00:09:39,000
obvious things that you can do
as an individual, try and be

167
00:09:39,000 --> 00:09:41,610
very open about your mistakes,
particularly if you're in a

168
00:09:41,610 --> 00:09:45,210
leadership role, or if
you have quite a lot of social

169
00:09:45,210 --> 00:09:47,010
capital, because you've been in
that organization for a long

170
00:09:47,010 --> 00:09:50,610
time. That means that people
will kind of monkey see, monkey

171
00:09:50,610 --> 00:09:53,430
do, right? If you do it, other
people will copy you and

172
00:09:53,430 --> 00:09:56,550
will follow your example. And
then there's another part of it,

173
00:09:56,550 --> 00:10:00,210
which is I think you talked
about failing together. So

174
00:10:00,210 --> 00:10:04,530
most technology is
more complicated than one person

175
00:10:04,530 --> 00:10:08,310
made one error. It's normally
lots of people made lots of

176
00:10:08,310 --> 00:10:10,950
decisions that have all
coalesced into a bad thing.

177
00:10:11,550 --> 00:10:16,170
There was a famous one at a company
I used to work at, where a

178
00:10:16,170 --> 00:10:19,230
junior engineer had kind of gone
on to there, they'd written some

179
00:10:19,230 --> 00:10:22,380
code that was supposed to send
an email, telling people who

180
00:10:22,380 --> 00:10:25,050
weren't paying that they should
pay basically kind of trying to

181
00:10:25,050 --> 00:10:28,500
prompt and increase conversion.
And the logic was a bit wrong.

182
00:10:28,530 --> 00:10:30,690
And it was actually targeting
all the people who were paying.

183
00:10:30,840 --> 00:10:33,900
And they ran it in staging. And
customer support got inundated

184
00:10:33,900 --> 00:10:37,680
with requests. And the junior
engineer is sitting there being

185
00:10:37,680 --> 00:10:39,870
like, I put it in staging, I'm
really confused. This is very

186
00:10:39,870 --> 00:10:42,780
stressful, like the team jumps
in, like they go to support.

187
00:10:42,780 --> 00:10:45,810
Support then is sort of told,
this is a mistake, don't worry,

188
00:10:45,810 --> 00:10:49,290
your billing's fine. And they
start to look back. And it turns

189
00:10:49,290 --> 00:10:52,080
out that somebody has seeded
staging with production data to

190
00:10:52,080 --> 00:10:55,200
run some load tests, and they
didn't anonymize the emails. And

191
00:10:55,200 --> 00:10:57,990
so they've just run,
basically, they've run their

192
00:10:57,990 --> 00:11:01,530
code in production, but they
didn't know. And in something like

193
00:11:01,530 --> 00:11:04,800
that, the junior engineer is
really mortified because they've

194
00:11:04,830 --> 00:11:06,900
done this thing. And they've
clicked a button and a

195
00:11:06,900 --> 00:11:10,050
bunch of emails went out. And
that's really bad. But I think

196
00:11:10,050 --> 00:11:12,720
it's important to look at that
as a group and be like, Well,

197
00:11:13,260 --> 00:11:16,110
how would you have known that,
possibly, right? Why did we

198
00:11:16,110 --> 00:11:19,440
put production data in staging,
why did we not anonymize it? Why

199
00:11:19,590 --> 00:11:22,200
is staging set up so that it can
send unlimited emails to

200
00:11:22,200 --> 00:11:24,810
unlimited numbers of people? And
there are a whole load of other

201
00:11:24,810 --> 00:11:26,640
questions, right, and you can
start to look at it as a

202
00:11:26,640 --> 00:11:28,770
systemic problem. Or you can
look at it, there's like the

203
00:11:28,770 --> 00:11:31,530
Swiss cheese analogy of like,
all the holes have to line up.

204
00:11:31,860 --> 00:11:35,460
And I think that if you talk
about things like that a lot,

205
00:11:35,490 --> 00:11:38,700
then people get it, and people
buy into it. And at that point,

206
00:11:38,700 --> 00:11:41,670
it's much more comfortable to
admit your mistake, because you

207
00:11:41,670 --> 00:11:43,650
know that your team is going to
gather around you and you know

208
00:11:43,650 --> 00:11:46,290
that your team is going to take
accountability. And so,

209
00:11:46,290 --> 00:11:48,930
like, if you succeed together, if
you fail together, you can build

210
00:11:48,930 --> 00:11:52,320
this blameless culture. But if
you hang people out to dry, if

211
00:11:52,320 --> 00:11:55,530
you mock people, if you're mean,
that's just going to reinforce

212
00:11:55,530 --> 00:11:58,020
the shame that that person is
already worried about.

213
00:11:59,159 --> 00:12:03,029
Jason Baum: So if I'm
hearing you, it's, it's that

214
00:12:03,029 --> 00:12:06,989
proactive honesty. It's, the
mistake is made, calling it out.

215
00:12:07,169 --> 00:12:11,789
But saying, basically, it's
calling it out for what it is

216
00:12:11,819 --> 00:12:16,319
this happened. How do we address
it? What do we do coming

217
00:12:16,319 --> 00:12:20,159
together, getting everybody to
rally around it? Without

218
00:12:20,189 --> 00:12:21,389
pointing the finger?

219
00:12:22,380 --> 00:12:24,000
Lisa Karlin Curtis: Yeah, I
think that's exactly it. And

220
00:12:24,000 --> 00:12:27,450
then you need to combine that
with incident shouldn't be a big

221
00:12:27,450 --> 00:12:32,100
scary monster. So I think
there's a blog post on our blog

222
00:12:32,100 --> 00:12:34,410
about, like, incidents are no bad
thing, you should be declaring

223
00:12:34,410 --> 00:12:36,480
more incidents. And there are
lots of

224
00:12:36,480 --> 00:12:38,460
organizations who sort of
measure their success on number

225
00:12:38,460 --> 00:12:41,430
of incidents, which I think is
a really perverse incentive.

226
00:12:42,060 --> 00:12:44,790
But I think that if incidents
become the norm, then mistakes

227
00:12:44,790 --> 00:12:47,190
become the norm. And if you're
all talking about incidents, and

228
00:12:47,190 --> 00:12:49,560
if the information about
those incidents is really

229
00:12:49,560 --> 00:12:53,130
accessible to people in your
organization, then you've made a

230
00:12:53,130 --> 00:12:55,170
mistake, just like everybody
else on the team has made a

231
00:12:55,170 --> 00:12:58,620
mistake, has made hundreds of
mistakes. Whereas if that is all

232
00:12:58,620 --> 00:13:02,010
kept hush hush within the team,
and it's not broadcast, then all

233
00:13:02,010 --> 00:13:03,840
of a sudden, like that's the
first mistake anyone in the

234
00:13:03,840 --> 00:13:06,210
company has ever made as far as
you're concerned. And that's a

235
00:13:06,210 --> 00:13:07,560
really terrifying place to be.

236
00:13:08,309 --> 00:13:12,599
Jason Baum: Yeah, and we all
know that's not true. But yet we

237
00:13:12,599 --> 00:13:15,989
feel it's just inherent human
nature, right? To think that

238
00:13:15,989 --> 00:13:18,539
your mistake is the worst
mistake ever made, and oh my

239
00:13:18,539 --> 00:13:23,669
god, they're gonna fire me or
they're gonna, like, blacklist

240
00:13:23,669 --> 00:13:26,129
me or something bad is gonna
happen. I'm never gonna work

241
00:13:26,129 --> 00:13:28,589
again, this is the end of my
life. And like, we just have

242
00:13:28,589 --> 00:13:31,799
this habit, I think, even
from, like, kids through

243
00:13:32,129 --> 00:13:35,639
adulthood. I always hoped that
that feeling would go away, and

244
00:13:35,639 --> 00:13:39,629
it never has. Why?

245
00:13:40,740 --> 00:13:43,260
Lisa Karlin Curtis: I think it's
kind of it's partly the imposter

246
00:13:43,260 --> 00:13:47,430
syndrome thing of like, the more
you progress, the more

247
00:13:47,460 --> 00:13:49,980
responsibility you have,
the more you can see all the

248
00:13:49,980 --> 00:13:55,620
things that you don't know how to
do. But I think that also there

249
00:13:55,620 --> 00:14:01,140
is a there's another part of it,
which is, I think we

250
00:14:01,170 --> 00:14:04,350
inherently strive for
perfection. And we want

251
00:14:04,350 --> 00:14:07,440
ourselves to be perfect, because
it's quite inconvenient that

252
00:14:07,440 --> 00:14:10,890
we're not, because every single
decision you make, you're like,

253
00:14:10,920 --> 00:14:13,440
I think I'm right. You as a
software engineer, you spend

254
00:14:13,440 --> 00:14:16,830
your entire day going. Yeah, I
think this is right, let's do

255
00:14:16,830 --> 00:14:19,950
it. And if you've got a little
voice in your head all the time

256
00:14:19,950 --> 00:14:22,710
going, what if you're not, what
if you're not, that becomes

257
00:14:22,710 --> 00:14:26,010
really stressful and really
tiring. And so what we do is we

258
00:14:26,010 --> 00:14:29,250
go, yeah, I'm right. Most of the
time, this is kind of fine. And

259
00:14:29,250 --> 00:14:31,650
then when something goes wrong,
that punctures that sense of

260
00:14:31,650 --> 00:14:34,590
confidence, and then that's
really destructive. And then we

261
00:14:34,590 --> 00:14:36,840
get really stressed because we
think that the only reason

262
00:14:36,840 --> 00:14:39,810
anybody has hired us is because
we're right all the time. And

263
00:14:39,810 --> 00:14:42,540
they haven't. But because
that makes our day to

264
00:14:42,540 --> 00:14:45,480
day easier. I think it's very
easy to kind of fall into that

265
00:14:45,840 --> 00:14:49,890
and fall into the trap of almost
believing your own rubbish that

266
00:14:49,920 --> 00:14:52,380
you are, in fact, perfect and
flawless.

267
00:14:53,519 --> 00:14:56,729
Jason Baum: It's so funny
because it's even, I would say

268
00:14:56,759 --> 00:15:01,019
probably, if not the most common
one. have the most common

269
00:15:01,019 --> 00:15:05,399
questions that come up in an
interview processes. Name and

270
00:15:05,399 --> 00:15:08,579
incident that happened that were
you like, for example, where you

271
00:15:08,579 --> 00:15:11,519
made a mistake, but how did you
handle it? And how did you

272
00:15:11,519 --> 00:15:14,099
overcome it? How did you
overcome it? Right? That's like

273
00:15:14,099 --> 00:15:17,789
one of the most common
questions, I think. If you

274
00:15:17,789 --> 00:15:21,089
haven't, if you haven't had it,
I don't know, maybe you haven't

275
00:15:21,269 --> 00:15:24,419
ever applied for a job before in
your life? Because I think it

276
00:15:24,419 --> 00:15:27,809
must be the most common one. And
yet, when you're hired to do a

277
00:15:27,809 --> 00:15:31,409
job, it's almost like yeah, we
do strive for perfection. And

278
00:15:31,409 --> 00:15:33,959
that question went out the
window. It's like, now I can't

279
00:15:33,959 --> 00:15:34,859
make a mistake.

280
00:15:36,480 --> 00:15:40,260
Lisa Karlin Curtis: Yeah, it's,
it's really, it feels like one

281
00:15:40,260 --> 00:15:42,540
of those things that there
should be a better answer to as

282
00:15:42,540 --> 00:15:44,640
well. But I don't think there is
I don't think there's a silver

283
00:15:44,640 --> 00:15:47,190
bullet. I think like, you talk
about it, you lead from the

284
00:15:47,190 --> 00:15:53,880
front. You try and catch it
when it does happen. And, and

285
00:15:53,880 --> 00:15:57,150
you hope that slowly but surely,
people, you know that that

286
00:15:57,150 --> 00:15:59,340
feeling of wanting to hide it,
that feeling of shame just

287
00:15:59,340 --> 00:16:02,460
becomes less and less strong,
and the muscle, you develop this

288
00:16:02,460 --> 00:16:06,120
muscle of overriding it. But you
know, I'm talking about this and

289
00:16:06,120 --> 00:16:10,590
evangelizing, and I still absolutely
have that instinct. The only

290
00:16:10,590 --> 00:16:12,960
thing that I've learned is I
have a muscle that I can now be

291
00:16:12,960 --> 00:16:15,600
like, I can recognize it. And I
can look it in the face and be

292
00:16:15,600 --> 00:16:18,210
like, we're not doing that
today. Because that's not

293
00:16:18,210 --> 00:16:22,380
useful. But it's definitely
an active thing. It's not it's

294
00:16:22,380 --> 00:16:23,580
not a sort of default.

295
00:16:23,789 --> 00:16:27,479
Jason Baum: Yeah, yeah. And, and
we have to look at it, it's

296
00:16:27,509 --> 00:16:31,049
based on what you're saying.
It's like, isolated. Each

297
00:16:31,139 --> 00:16:34,949
incident is an incident,
isolated, we have to take care

298
00:16:34,949 --> 00:16:39,239
of it, learn from it, and just
move past that. If I'm gathering

299
00:16:39,239 --> 00:16:39,629
that.

300
00:16:40,380 --> 00:16:42,420
Lisa Karlin Curtis: I think
emotionally, absolutely. I think

301
00:16:42,420 --> 00:16:45,660
there's another side to
that, right, which is, if you as

302
00:16:45,660 --> 00:16:48,180
an organization, I think
incidents are a really important

303
00:16:48,180 --> 00:16:50,790
source of data to understand
where you should be putting your

304
00:16:50,790 --> 00:16:54,600
chips. So if you have good
reporting, if you can look at

305
00:16:54,600 --> 00:16:57,780
your incidents in aggregate, if
you have like a good way of

306
00:16:57,780 --> 00:17:00,960
recording them, and categorizing
them, then you can start to use

307
00:17:00,960 --> 00:17:04,860
that data and start to go, oh,
this bit of tech seems to cause

308
00:17:04,860 --> 00:17:07,290
us a lot of problems or, you
know, this process seems to be

309
00:17:07,290 --> 00:17:11,040
really risky for us. And now
that will help that will help us

310
00:17:11,040 --> 00:17:13,680
decide like, where are we going
to invest next. So I think from

311
00:17:13,680 --> 00:17:15,900
that point of view, you don't
want to kind of leave them

312
00:17:15,900 --> 00:17:18,510
behind at all. And that's in
direct conflict with what you

313
00:17:18,510 --> 00:17:22,470
want people to do emotionally,
which is to, you know, be there

314
00:17:22,470 --> 00:17:25,710
in the moment, solve the
problem, close it and not worry

315
00:17:25,710 --> 00:17:28,320
about it. And I think that
that's quite difficult to

316
00:17:28,320 --> 00:17:31,530
manage, because you
simultaneously are telling

317
00:17:31,530 --> 00:17:34,170
people to leave it behind, and
also telling them to constantly

318
00:17:34,170 --> 00:17:37,230
be thinking about them and be
you know, sitting in a quarterly

319
00:17:37,230 --> 00:17:39,210
review being like, what went
wrong this quarter? What do we

320
00:17:39,210 --> 00:17:42,300
want to invest in to help make
our platform more reliable,

321
00:17:42,300 --> 00:17:42,630
right?

322
00:17:43,560 --> 00:17:46,050
Jason Baum: You know, what it
makes me think of, and apologies

323
00:17:46,050 --> 00:17:50,460
ahead of time, American
football, we have a quarterback

324
00:17:50,520 --> 00:17:54,930
and the quarterback, throws
interceptions. It's a given

325
00:17:54,930 --> 00:17:57,930
thing. Everyone knows it,
no matter who it is, Tom

326
00:17:57,930 --> 00:18:02,220
Brady, I mean, they're going
to throw interceptions. And yet,

327
00:18:02,760 --> 00:18:06,120
they strive for perfection.
Because what athlete

328
00:18:06,120 --> 00:18:10,650
doesn't, right? And then when
they happen, the one thing that

329
00:18:10,650 --> 00:18:14,010
I would say is in that culture
is they get on the phone, or

330
00:18:14,010 --> 00:18:16,770
they go next to the offensive
coordinator coach, and they look

331
00:18:16,770 --> 00:18:18,960
at what happened in that play.
Here's what happened. Here's why

332
00:18:18,960 --> 00:18:20,850
he didn't see it. This is what
happens. And then they are

333
00:18:20,850 --> 00:18:23,760
supposed to forget it. Forget it
ever happened and move on.

334
00:18:23,760 --> 00:18:25,710
Because how do you move on with
the rest of the game, if all

335
00:18:25,710 --> 00:18:30,810
you're thinking about is the one
big mistake you made? And I just

336
00:18:31,620 --> 00:18:35,760
I think that for me, when you're
talking about kind

337
00:18:35,760 --> 00:18:38,670
of forgetting it, that
instantly popped into my head.

338
00:18:40,200 --> 00:18:42,900
So many things could be applied
to that, I think.

339
00:18:44,070 --> 00:18:45,630
Lisa Karlin Curtis: Yeah, I
think also when we talk about

340
00:18:45,630 --> 00:18:48,990
incidents, there's nothing
that's specific to engineering

341
00:18:48,990 --> 00:18:51,990
about them. Really. I think the
engineers talk about them a lot.

342
00:18:51,990 --> 00:18:55,710
We have a language that we
discussed them in. But there are

343
00:18:55,710 --> 00:18:58,650
loads of examples of incidents
that are not engineering. And I

344
00:18:58,650 --> 00:19:02,670
think almost all the stuff that
we discuss around incidents

345
00:19:02,670 --> 00:19:05,400
being a chance to build your
network with other people being

346
00:19:05,400 --> 00:19:08,070
a chance to touch things that
you don't normally interact

347
00:19:08,070 --> 00:19:11,880
with, you know, being a chance
to watch what your system does

348
00:19:11,910 --> 00:19:15,120
when it fails. Like that feels
very engineering. But actually,

349
00:19:15,150 --> 00:19:18,390
if you're a customer success
team, you know what happens when

350
00:19:18,390 --> 00:19:21,450
your processes fall over? What
happens when the person who's

351
00:19:21,450 --> 00:19:24,240
doing all of the glue work has
gone on holiday and all of a

352
00:19:24,240 --> 00:19:26,700
sudden something bad happens?
And like you're still stress

353
00:19:26,700 --> 00:19:30,150
testing, you're still finding
the edges. It's just a slightly

354
00:19:30,150 --> 00:19:30,930
different environment.

355
00:19:34,020 --> 00:19:36,510
Ad: Are you looking to get
DevOps certified? Demonstrate

356
00:19:36,510 --> 00:19:38,520
your DevOps knowledge and
advance your career with a

357
00:19:38,520 --> 00:19:41,490
certification from DevOps
Institute. Get certified in

358
00:19:41,490 --> 00:19:45,300
DevOps Leader, SRE or DevSecOps,
just to name a few. Learn

359
00:19:45,300 --> 00:19:49,020
anywhere, anytime. The choice is
yours. Choose to get certified

360
00:19:49,020 --> 00:19:52,500
through our vast partner network
self study programs, or our new

361
00:19:52,500 --> 00:19:55,470
SKILup eLearning videos. The
exams are developed in

362
00:19:55,470 --> 00:19:58,020
collaboration with industry
thought leaders, and subject

363
00:19:58,020 --> 00:20:01,050
matter experts in the DevOps
space. Learn more at

364
00:20:01,050 --> 00:20:03,150
devopsinstitute.com/certifications.

365
00:20:08,790 --> 00:20:10,140
Lisa Karlin Curtis: I think
what I find interesting

366
00:20:10,140 --> 00:20:13,740
about that is that you start
at, you're like, oh, when things

367
00:20:13,740 --> 00:20:17,250
go wrong, it's bad. And
we've had a number of incidents

368
00:20:17,250 --> 00:20:20,100
where I would say the net impact
on our company has been

369
00:20:20,100 --> 00:20:24,090
positive. Because somebody
reports it, we see it, we've got

370
00:20:24,870 --> 00:20:27,480
some really quite good
observability. So often, we can

371
00:20:27,480 --> 00:20:30,180
like, find it pretty quickly, fix
it, turn it around in, say, half

372
00:20:30,180 --> 00:20:33,900
an hour. And the customer ends
that interaction, actually

373
00:20:33,900 --> 00:20:36,660
feeling better about us than
when they started, which is

374
00:20:36,660 --> 00:20:39,300
probably quite counterintuitive,
because really, if we just

375
00:20:39,300 --> 00:20:41,490
hadn't broken it in the first
place, maybe that would have

376
00:20:41,490 --> 00:20:45,810
been better for them. But
we've joked internally about

377
00:20:45,810 --> 00:20:48,780
maybe we should deliberately
come up with bugs because of how

378
00:20:48,810 --> 00:20:51,240
much, like, great feedback we
get when we fix things from

379
00:20:51,240 --> 00:20:51,600
people.

380
00:20:51,630 --> 00:20:55,140
Jason Baum: Right? I feel like
that's the evolution, right of

381
00:20:55,140 --> 00:21:01,590
any good product is the
feedback. So, you know, in

382
00:21:01,590 --> 00:21:04,200
thinking about letting it go and
thinking about these

383
00:21:04,200 --> 00:21:08,760
improvements. I feel like there
must be obstacles, though,

384
00:21:08,760 --> 00:21:12,960
besides ourselves, right? And
our own internal turmoil that we

385
00:21:12,960 --> 00:21:15,540
put ourselves through, when we
make a mistake, or when an

386
00:21:15,540 --> 00:21:20,130
incident happens. There's
deadlines, and you need to hit

387
00:21:20,130 --> 00:21:23,880
them, you need to meet them. And
when you miss them, that's a big

388
00:21:23,880 --> 00:21:31,530
deal. So how does that play into
when incidents happen? How does

389
00:21:31,530 --> 00:21:37,260
that, I guess,
impact that feeling that we're

390
00:21:37,260 --> 00:21:39,900
already feeling right, the shame
that you talked about, and then

391
00:21:39,900 --> 00:21:42,930
we have this deadline looming
over our heads?

392
00:21:43,980 --> 00:21:45,360
Lisa Karlin Curtis: I think
that's really interesting. I

393
00:21:45,360 --> 00:21:49,020
think it's very, very difficult
because you have a trade off

394
00:21:49,050 --> 00:21:53,850
generally. So normally, there's
that triangle of, like,

395
00:21:53,850 --> 00:21:57,840
speed, and quality, and I can't
remember what's on the other end of

396
00:21:57,840 --> 00:22:01,980
the triangle. But the idea being
you have to trade

397
00:22:01,980 --> 00:22:05,130
off something. Number of people,
maybe, maybe we should just

398
00:22:05,130 --> 00:22:11,220
scrap all of that, I'm gonna
start again, that's fine. I

399
00:22:11,220 --> 00:22:14,250
think there's a trade off here.
So when something goes wrong,

400
00:22:15,330 --> 00:22:18,420
the first thing is I need to fix
the thing that's broken. And that takes

401
00:22:18,420 --> 00:22:21,270
you however long it takes you
and basically nothing else

402
00:22:21,270 --> 00:22:24,840
matters. Generally, there is
a sort of a type of

403
00:22:24,930 --> 00:22:27,960
failure mode, where you're just
trying to bring your system back

404
00:22:27,960 --> 00:22:31,470
up, or resolve the bug or stop
anything getting any worse. And

405
00:22:31,470 --> 00:22:33,900
that's a really easy decision,
because it's there, it's on

406
00:22:33,900 --> 00:22:37,170
fire, we got to fix it. And then
you get to a sort of second

407
00:22:37,170 --> 00:22:41,850
stage of an incident, which
maybe is like follow ups. Or

408
00:22:42,030 --> 00:22:44,700
maybe you're still kind of in
the incident mode where nothing

409
00:22:44,700 --> 00:22:47,850
is on fire anymore. But there's
a lot of things that you could

410
00:22:47,850 --> 00:22:51,540
do that would make it less
likely to happen or resolve it

411
00:22:51,540 --> 00:22:55,260
in a neater way. And that's
where you need your strong

412
00:22:55,260 --> 00:22:58,860
engineers to come in and make
those trade offs. And it's like,

413
00:22:58,860 --> 00:23:00,870
what is the value of this piece
of work? How long is it going to

414
00:23:00,870 --> 00:23:05,010
take us? How much in the wrong
direction? Is it from what we

415
00:23:05,010 --> 00:23:08,100
thought we were doing? And can
we afford to punt it? Can we

416
00:23:08,100 --> 00:23:12,090
afford to delay it? And you get
to this point where you've got a

417
00:23:12,090 --> 00:23:15,420
deadline and a set of
problems. And you need to make a

418
00:23:15,420 --> 00:23:19,260
decision about basically which
is more important. And that is a

419
00:23:19,410 --> 00:23:22,740
straight shootout trade
off often because it's like, we

420
00:23:22,740 --> 00:23:25,440
have four people in our team, we
have two weeks, what shall we

421
00:23:25,440 --> 00:23:29,850
do? And the answer to that is
not "work 16-hour days," because

422
00:23:29,880 --> 00:23:32,640
in all likelihood, you won't get
anything more done, in my

423
00:23:32,640 --> 00:23:36,270
experience. And so instead it
has to be: right, which of these

424
00:23:36,300 --> 00:23:38,430
is more of a risk
to us? What happens if we miss

425
00:23:38,430 --> 00:23:40,350
the deadline. And that's a
decision that needs to get

426
00:23:40,350 --> 00:23:43,380
escalated to someone who has the
authority to make that call and

427
00:23:43,380 --> 00:23:46,950
the information. So
that means that you have to make

428
00:23:46,950 --> 00:23:49,410
that information really
available to them in terms of,

429
00:23:49,590 --> 00:23:51,960
you know, what, what is the work
that we could do to mitigate it?

430
00:23:51,990 --> 00:23:56,670
What would it be mitigating? And
versus how far behind are we on

431
00:23:56,670 --> 00:23:59,970
the deadline? What does it mean if
we don't hit the deadline? And

432
00:24:00,030 --> 00:24:03,840
that is one of those. I think
lots of people have this,

433
00:24:03,870 --> 00:24:07,140
oh, we'll find a creative
solution. And sometimes there's

434
00:24:07,140 --> 00:24:09,420
a creative solution and
somebody's overskirt something

435
00:24:09,420 --> 00:24:11,910
and actually, it's all gonna be
fine. And sometimes there isn't.

436
00:24:12,510 --> 00:24:15,450
There isn't enough time, and you
have to pick something. And I

437
00:24:15,450 --> 00:24:17,730
think identifying that is really
important and being really

438
00:24:17,730 --> 00:24:20,310
honest, from a kind of
motivation and human point of

439
00:24:20,310 --> 00:24:23,340
view. I think the times
where I've seen that go badly are

440
00:24:23,340 --> 00:24:25,710
when people either kind of try
and have their cake and eat it

441
00:24:25,770 --> 00:24:28,260
and sort of say, Oh, I know you
said you can't do these two

442
00:24:28,260 --> 00:24:31,830
things. But what if I told you
you could. And then there's

443
00:24:31,830 --> 00:24:35,010
another problem where there is a
trade off and nobody makes the

444
00:24:35,010 --> 00:24:38,640
decision. And then you just end
up in a situation where

445
00:24:38,640 --> 00:24:41,310
the team is all kind of
looking at each other being

446
00:24:41,310 --> 00:24:43,950
like, do we make the decision
now because we don't think it's

447
00:24:43,950 --> 00:24:47,190
our choice, but I guess no
one's telling us what to do. And

448
00:24:47,190 --> 00:24:48,840
then somebody shouts at them
afterwards because they made the

449
00:24:48,840 --> 00:24:49,350
wrong decision.

450
00:24:51,030 --> 00:24:53,940
Jason Baum: I think that leads
us right to my next question.

451
00:24:54,930 --> 00:24:59,970
It's what can the leaders do to
make this mindset part of the

452
00:25:00,000 --> 00:25:04,500
culture of the organization and
encourage it across all teams.

453
00:25:06,450 --> 00:25:09,360
Lisa Karlin Curtis: So I think
the first thing to look for is

454
00:25:10,530 --> 00:25:15,120
look out for anybody playing
the hero. In all organizations

455
00:25:15,480 --> 00:25:18,810
that I've ever seen, there are a
group of people who take on more

456
00:25:18,810 --> 00:25:20,490
than their fair share of the
burden of dealing with

457
00:25:20,490 --> 00:25:24,570
incidents. And we could call
them heroes. And that is really

458
00:25:24,570 --> 00:25:27,240
good until it's really bad. And
it's good, because they're

459
00:25:27,240 --> 00:25:28,980
probably very good at dealing
with incidents, because they've

460
00:25:28,980 --> 00:25:31,320
had a lot of practice. And
they're often the people who've

461
00:25:31,320 --> 00:25:34,440
been at the company for a long
time. But it's bad because it

462
00:25:34,440 --> 00:25:36,690
means that nobody else is
learning how to do it. And so

463
00:25:36,690 --> 00:25:38,820
all of those benefits that we
talked about right at the start

464
00:25:39,300 --> 00:25:42,180
on, no one else is getting that.
So they're kind of gatekeeping,

465
00:25:42,360 --> 00:25:45,840
the skill needed to debug these
issues. And that's very

466
00:25:45,840 --> 00:25:49,080
problematic, because it stunts
other people's growth. And when

467
00:25:49,080 --> 00:25:51,840
that person burns out, or when
that person goes on holiday, or

468
00:25:51,840 --> 00:25:53,940
when that person leaves the
company, you're suddenly in a

469
00:25:53,940 --> 00:25:56,370
really bad situation. So you end
up with these really bad key man

470
00:25:56,370 --> 00:26:00,420
dependencies. And as a leader, I
think it's really important to

471
00:26:00,450 --> 00:26:04,860
identify those patterns. And if
you're using tooling,

472
00:26:04,860 --> 00:26:07,230
you can look at who's answering,
who's getting paged, who's

473
00:26:07,230 --> 00:26:09,870
taking your on call load, you
can look at your incidents, who's

474
00:26:09,870 --> 00:26:12,510
leading your incidents, you
know, have you got somebody

475
00:26:12,510 --> 00:26:14,910
who's leading 50% of your
incidents? That's probably not a

476
00:26:14,910 --> 00:26:19,380
good sign. And you can use that
data to find those people and

477
00:26:19,380 --> 00:26:21,540
then chat to them, be like, why
are you doing that? And probably

478
00:26:21,540 --> 00:26:24,330
the answer is, well, because I
think it's useful. And that's

479
00:26:24,330 --> 00:26:27,240
like, great. But now we're gonna
have a conversation about why we

480
00:26:27,240 --> 00:26:29,880
need to spread this load across
the team. And it's a combination

481
00:26:29,880 --> 00:26:32,850
of like protecting you and your
mental health, frankly, but

482
00:26:32,850 --> 00:26:35,370
also, it's about spreading the
knowledge and spreading the

483
00:26:35,370 --> 00:26:40,110
experience. So I think that that
kind of pattern of having those

484
00:26:40,110 --> 00:26:44,370
superheroes is really damaging.
And it restricts your ability to

485
00:26:44,370 --> 00:26:48,000
scale your incident response.
And then, as a leader, the other

486
00:26:48,000 --> 00:26:51,960
things you can do is encourage
people to show their working. So

487
00:26:52,770 --> 00:26:54,660
if you want people to learn from
incidents, you need to make that

488
00:26:54,660 --> 00:26:57,060
information available to them.
And then you need to make it

489
00:26:57,060 --> 00:27:00,060
accessible. What I mean by
available is, like, have your

490
00:27:00,060 --> 00:27:03,180
conversations in a public Slack
channel. Ideally, use some

491
00:27:03,180 --> 00:27:06,360
incident tooling so that you can
curate those conversations and

492
00:27:06,360 --> 00:27:09,060
build a timeline that somebody
can interact with, write a post

493
00:27:09,060 --> 00:27:12,750
mortem, share the post mortem.
And then by accessible, I mean,

494
00:27:12,900 --> 00:27:15,600
try and make it really easy for
people to get that information.

495
00:27:15,840 --> 00:27:19,080
So have it in a knowledge base
that people can look at to find

496
00:27:19,080 --> 00:27:21,900
something that they're
interested in. And if you're,

497
00:27:22,320 --> 00:27:25,110
it's like step one, write the
thing. But if you just write a

498
00:27:25,110 --> 00:27:28,470
post mortem that goes into a drawer,
no one's really won at that

499
00:27:28,470 --> 00:27:33,030
point. So push it out to people
and make it clear to people that

500
00:27:33,030 --> 00:27:36,240
reading those materials is part
of their job and considered a

501
00:27:36,240 --> 00:27:39,480
very good use of their time. And
that's a difficult balance,

502
00:27:39,480 --> 00:27:41,850
because there are some,
sometimes you need people to

503
00:27:41,850 --> 00:27:45,390
ship stuff. But I think it's
important to talk about this

504
00:27:45,390 --> 00:27:48,210
explicitly, and talk about the
fact that if you look at what

505
00:27:48,210 --> 00:27:50,580
other people did in their
incidents, you can build better

506
00:27:50,580 --> 00:27:53,460
software, you're going to have
less, fewer incidents or less

507
00:27:53,460 --> 00:27:57,900
fewer, you're gonna have fewer
incidents, or your incidents are

508
00:27:57,900 --> 00:28:02,010
going to be less severe and
easier to debug. And that then

509
00:28:02,010 --> 00:28:04,230
means that you'll be able to
sort of teach the next

510
00:28:04,230 --> 00:28:06,510
generation to the next
generation, and you get this

511
00:28:06,570 --> 00:28:08,970
great positive feedback loop, if
everybody's talking about it,

512
00:28:08,970 --> 00:28:11,460
and learning from each other, as
opposed to the negative feedback

513
00:28:11,460 --> 00:28:13,860
loop, where people are keeping
it very secret where people are

514
00:28:13,860 --> 00:28:16,590
gatekeeping it and where not
everyone is getting involved.

515
00:28:17,880 --> 00:28:21,420
Jason Baum: I feel like the word
of the day, the word of the day

516
00:28:21,630 --> 00:28:25,530
is transparency. I feel like
that that's pretty much what

517
00:28:25,530 --> 00:28:29,070
you're saying. Not to
put a word in your mouth,

518
00:28:29,070 --> 00:28:31,590
because I don't think you've
said it specifically. But what

519
00:28:31,590 --> 00:28:34,680
I'm hearing is transparency,
transparency, and transparency.

520
00:28:34,770 --> 00:28:37,230
As someone who works with the
engineering team, or you know,

521
00:28:37,230 --> 00:28:42,060
I'm just relaying information
from people to people, on a

522
00:28:42,090 --> 00:28:45,960
leadership team, for example,
and half the time with a

523
00:28:45,960 --> 00:28:49,140
deadline. It's because the
leadership doesn't necessarily

524
00:28:49,140 --> 00:28:52,680
understand it. You know, they
don't necessarily know,

525
00:28:53,010 --> 00:28:56,340
what is the specific issue? What
is the specific reason why a

526
00:28:56,340 --> 00:29:00,780
deadline isn't being hit.
You know, and that's where I

527
00:29:00,780 --> 00:29:05,820
feel like the culture part can
sometimes go awry, right. We

528
00:29:05,820 --> 00:29:08,970
allow it to happen when there
isn't transparency.

529
00:29:10,080 --> 00:29:12,330
Lisa Karlin Curtis: Yeah, I
think that when you lack

530
00:29:12,330 --> 00:29:18,060
transparency, it
gives humans a lot more remit to

531
00:29:18,060 --> 00:29:22,560
try and get what they want in
bad, bad ways, basically. And if

532
00:29:22,560 --> 00:29:27,600
you don't have transparency, you
can lie. And you can put forward

533
00:29:27,600 --> 00:29:31,830
an argument that suits whatever
you think your goal is. And in

534
00:29:31,830 --> 00:29:34,380
an ideal world, everybody in
your organization has exactly

535
00:29:34,380 --> 00:29:36,390
the same goal, and they're all
pulling in the same direction.

536
00:29:36,930 --> 00:29:40,020
But in reality, often people
view their goals as being

537
00:29:40,110 --> 00:29:43,740
slightly like in conflict with
other people's goals, because

538
00:29:43,740 --> 00:29:45,660
they're trying to get more
resources for their team,

539
00:29:45,810 --> 00:29:49,170
because they think their problem
is the most important thing. And

540
00:29:49,950 --> 00:29:52,920
if you don't have
transparency, it's very

541
00:29:52,920 --> 00:29:55,110
difficult to hold people
accountable to those things.

542
00:29:55,320 --> 00:29:59,730
Whereas if you do and if people
are very honest and open, then

543
00:29:59,880 --> 00:30:02,190
the organization can make
the right choice for the

544
00:30:02,190 --> 00:30:05,100
organization. And if you think
about a sort of individual-first

545
00:30:05,100 --> 00:30:08,340
versus organization-first type
culture, ideally, as an

546
00:30:08,340 --> 00:30:10,950
organization, you should be
putting your chips on the thing

547
00:30:10,950 --> 00:30:13,710
that is most important, not on
the thing that has the best

548
00:30:13,740 --> 00:30:19,140
argument. And the way that you
don't make that mistake is to

549
00:30:19,140 --> 00:30:22,200
use transparency, and to be open
and honest and like generate

550
00:30:22,200 --> 00:30:25,350
that culture and generate the
culture of it being okay to make

551
00:30:25,350 --> 00:30:28,860
mistakes. And also it being okay
to say, I don't think this is so

552
00:30:28,860 --> 00:30:32,160
important. And to know
that that's not gonna impact

553
00:30:32,160 --> 00:30:35,310
your career. And I think that,
that's one of the reasons why

554
00:30:35,310 --> 00:30:38,400
people don't do that, because
there is this view that like to

555
00:30:38,400 --> 00:30:41,370
get promoted, or to get that job
that you really want, to get

556
00:30:41,370 --> 00:30:43,440
influence, you need to be the
most important person, you need

557
00:30:43,440 --> 00:30:45,930
to be doing the most important
thing. And none of us are always

558
00:30:45,930 --> 00:30:49,470
doing the most important thing.
This week at work, I'm not doing

559
00:30:49,470 --> 00:30:51,930
the most important thing that is
very clear. And that's

560
00:30:51,930 --> 00:30:55,950
kind of fun. But it also means
that if somebody else needs an

561
00:30:55,950 --> 00:30:57,930
extra pair of hands, I'm going
to jump on their thing I'm not

562
00:30:57,930 --> 00:31:01,440
going to just persist with mine,
because I want to look good. And

563
00:31:01,440 --> 00:31:04,020
that's the sort of team first
thinking that I think you need

564
00:31:04,050 --> 00:31:08,070
to try and get into your
cultural DNA as an organization.

565
00:31:08,460 --> 00:31:11,550
Jason Baum: I love that. I would
love to hear in a team

566
00:31:11,550 --> 00:31:14,880
update with the company. This
week, I'm not working on the

567
00:31:14,880 --> 00:31:18,090
most important thing. You know,
I don't think we ever hear that.

568
00:31:18,090 --> 00:31:20,370
Because everyone does want
to be the most important, I

569
00:31:20,370 --> 00:31:26,670
think. So, you know, you
said accountability. And I

570
00:31:26,670 --> 00:31:29,610
hadn't planned on asking this
question, but it does now

571
00:31:29,640 --> 00:31:35,370
trigger something in me.
Incidents are okay. And they are

572
00:31:35,370 --> 00:31:39,810
learning experiences. We just
spent the past 30 minutes

573
00:31:39,810 --> 00:31:45,990
talking about it. But when is
how does accountability play

574
00:31:45,990 --> 00:31:52,020
into this? When are incidents...
not that they're not okay. But

575
00:31:52,260 --> 00:31:55,590
when do we need to hold
accountability? Because I think

576
00:31:55,590 --> 00:31:58,860
there needs to be an element of
accountability. How does that

577
00:31:58,860 --> 00:32:02,760
play in? Especially in a
blameless culture?

578
00:32:02,760 --> 00:32:05,940
Lisa Karlin Curtis: I think it's
a really difficult balance. And

579
00:32:05,940 --> 00:32:09,150
I think I'd come back to the
stuff I was saying about failing

580
00:32:09,150 --> 00:32:15,870
together. So I think that you,
you can, as a team, you take

581
00:32:15,870 --> 00:32:19,230
accountability for what happens
in your team. There are

582
00:32:19,230 --> 00:32:22,260
obviously occasions where
somebody goes rogue and does

583
00:32:22,260 --> 00:32:25,230
something that the team thinks
was a terrible idea, but that's

584
00:32:25,410 --> 00:32:29,790
sort of an HR issue, frankly, and
I think it's very separate. But

585
00:32:29,790 --> 00:32:33,210
generally, it's you as a team
have made some choices. And

586
00:32:33,360 --> 00:32:37,050
you're now looking at the
consequences of those choices. I

587
00:32:37,050 --> 00:32:40,050
think that the way to hold teams
accountable around incidents is

588
00:32:40,050 --> 00:32:43,140
the same way that you hold teams
accountable for any kind of

589
00:32:43,140 --> 00:32:48,030
delivery. So if you imagine an
incident is normally something

590
00:32:48,030 --> 00:32:53,880
that that team maintains, has
broken in some way. And so that

591
00:32:53,880 --> 00:32:56,610
team has a kind of agreement or
a contract with the rest of the

592
00:32:56,610 --> 00:32:58,890
org that they will have this
service and it will do this

593
00:32:58,890 --> 00:33:02,910
stuff. And sometimes it won't,
because incidents happen and

594
00:33:03,000 --> 00:33:06,240
mistakes happen. I think you
make people accountable by

595
00:33:06,240 --> 00:33:09,150
making them transparent, and by
making them expose the trade

596
00:33:09,150 --> 00:33:12,840
offs that they're making. And so
as an example, if you're in a

597
00:33:12,840 --> 00:33:16,920
team, which is under loads and
loads of pressure for time, and

598
00:33:16,920 --> 00:33:18,540
they're like, you've got to ship
this thing as quickly as you can

599
00:33:18,540 --> 00:33:21,960
as quickly as you can, if you as
the senior engineer, or

600
00:33:21,960 --> 00:33:24,960
the tech lead are looking at
them saying, cool, we'll do

601
00:33:24,960 --> 00:33:28,050
that. But there's risk. And
these are the things that might

602
00:33:28,410 --> 00:33:32,040
go wrong. If we do that, are we
comfortable with that risk? Then

603
00:33:32,040 --> 00:33:34,320
you're accountable. Because if
something goes wrong, it's

604
00:33:34,320 --> 00:33:36,870
either like, yeah, I said, these
things will go wrong. And we

605
00:33:36,870 --> 00:33:39,870
talked about it. And we decided
we were okay with the risk, or

606
00:33:39,900 --> 00:33:42,540
it's something completely
different has gone wrong, that

607
00:33:42,540 --> 00:33:45,690
is maybe significantly more
severe than the things that we

608
00:33:45,690 --> 00:33:48,360
thought might. Right, let's talk
about why we thought this was

609
00:33:48,360 --> 00:33:52,170
safe, and why
that wasn't in our risk

610
00:33:52,170 --> 00:33:54,990
assessment. And so you're
holding people accountable for

611
00:33:54,990 --> 00:33:57,060
the trade offs that they're
making. You're not holding

612
00:33:57,060 --> 00:33:59,850
people accountable for an
individual thing that went

613
00:33:59,850 --> 00:34:04,200
wrong. It's not like why did
this happen? It's why did you

614
00:34:04,200 --> 00:34:07,140
think that we should take this
risk, what pressure was put on

615
00:34:07,140 --> 00:34:10,920
you, and let's look at it like
at a system level, as opposed to

616
00:34:11,070 --> 00:34:13,590
some person pressed a button on
a backfill, and it made the

617
00:34:13,590 --> 00:34:15,780
database really sad. And now
we're going to run around

618
00:34:15,780 --> 00:34:19,140
screaming saying that they
should be fired. And I think the

619
00:34:19,140 --> 00:34:23,760
other thing about accountability
is it's about time. So I think

620
00:34:23,760 --> 00:34:26,670
incidents are generally a
lagging indicator, as opposed to

621
00:34:26,730 --> 00:34:29,220
a leading indicator, if that's
terminology people are familiar

622
00:34:29,220 --> 00:34:32,070
with. A leading indicator
basically means you find out

623
00:34:32,070 --> 00:34:34,710
pretty quickly whether a choice
you're making is good or bad.

624
00:34:34,830 --> 00:34:37,560
And a lagging indicator is
something where a choice that

625
00:34:37,560 --> 00:34:42,060
you make has impact sometime in
the future. And because

626
00:34:42,060 --> 00:34:45,360
incidents are a lagging
indicator, often the people

627
00:34:45,360 --> 00:34:47,790
handling the incidents are not
the people who made those trade

628
00:34:47,790 --> 00:34:51,150
offs. And that's really
important to recognize when that

629
00:34:51,150 --> 00:34:55,560
is true. And to understand what
are the root causes,

630
00:34:55,590 --> 00:34:58,500
to have that kind of discussion
whether you go down the

631
00:34:58,500 --> 00:35:01,140
five whys route or some other
route. But to have a really

632
00:35:01,140 --> 00:35:04,110
meaningful discussion about what
were the choices we could have

633
00:35:04,110 --> 00:35:08,340
made to avoid this. And why
didn't we make them? Was it

634
00:35:08,340 --> 00:35:10,800
because we had loads of pressure
on delivery? Was it because no

635
00:35:10,800 --> 00:35:12,690
one was thinking about it, and
we just didn't think it was a

636
00:35:12,690 --> 00:35:15,180
risk. And then that's the
problem that you have to solve.

637
00:35:15,210 --> 00:35:17,070
And those are the things that
you can hold people accountable

638
00:35:17,070 --> 00:35:17,670
for, I think.

639
00:35:18,300 --> 00:35:20,580
Jason Baum: Awesome. Thank you
for answering that one. I

640
00:35:20,580 --> 00:35:23,490
don't know, it just came when
you said accountability. It

641
00:35:23,490 --> 00:35:26,400
just popped into my head because
I feel like all we've ever heard

642
00:35:26,910 --> 00:35:29,370
for years was accountability,
accountability, who's

643
00:35:29,400 --> 00:35:31,440
accountable for this? who's
accountable for that? All the

644
00:35:31,440 --> 00:35:37,860
accountability, stuff that
people say? And, yeah, it's hard

645
00:35:37,860 --> 00:35:42,000
to be blameless when
that's what the buzzword was

646
00:35:42,030 --> 00:35:48,420
before blameless. So now, we're
at the point of the podcast,

647
00:35:48,780 --> 00:35:54,120
where I like to ask sort of a
fun question of you personally,

648
00:35:54,780 --> 00:35:57,840
because this is the humans of
DevOps, and we're all about the

649
00:35:57,840 --> 00:36:02,010
humans. So if there was one
thing that you could be

650
00:36:02,010 --> 00:36:04,890
remembered for, what would that
be?

651
00:36:06,420 --> 00:36:10,140
Lisa Karlin Curtis: I think I
would like to be remembered as

652
00:36:11,310 --> 00:36:16,590
someone who made systems work
better for people.

653
00:36:17,610 --> 00:36:20,310
Jason Baum: Great. I think
that's certainly

654
00:36:20,310 --> 00:36:24,600
applicable for today's
conversation. Well, thank you,

655
00:36:24,600 --> 00:36:27,270
Lisa, so much for joining me
today. It was an absolute

656
00:36:27,270 --> 00:36:27,960
pleasure.

657
00:36:28,680 --> 00:36:30,900
Lisa Karlin Curtis: Thanks so
much. I really enjoyed it. And

658
00:36:30,900 --> 00:36:31,110
thank

659
00:36:31,110 --> 00:36:33,210
Jason Baum: you for listening to
this episode of the humans of

660
00:36:33,210 --> 00:36:36,390
DevOps Podcast. I'm going to end
this episode the same way I

661
00:36:36,390 --> 00:36:38,970
always do encouraging you to
become a member of DevOps

662
00:36:38,970 --> 00:36:42,390
Institute to get access to even
more great resources just like

663
00:36:42,390 --> 00:36:46,320
this one. Until next time, stay
safe, stay healthy, and most of

664
00:36:46,320 --> 00:36:48,960
all, stay human, live long and
prosper.

665
00:36:52,230 --> 00:36:54,330
Narrator: Thanks for listening
to this episode of the humans of

666
00:36:54,330 --> 00:36:57,900
DevOps podcast. Don't forget to
join our global community to get

667
00:36:57,900 --> 00:37:01,230
access to even more great
resources like this. Until next

668
00:37:01,230 --> 00:37:04,680
time, remember, you are part of
something bigger than yourself.

669
00:37:04,980 --> 00:37:05,790
You belong

