This file contains notes about people active in resilience engineering, as well as some influential
researchers who are no longer with us, organized alphabetically. It also includes people and papers
from related fields, such as cognitive systems engineering and naturalistic decision-making.
For each person, I list concepts that they reference in their writings, along
with some publications. The publications lists aren't comprehensive:
they're ones I've read or have added to my to-read list.
Allspaw is the former CTO of Etsy. He applies concepts from resilience engineering to the tech industry.
He is one of the founders of Adaptive Capacity Labs, a resilience engineering consultancy.
Cook is a medical doctor who studies failures in complex systems. He is one of the founders of Adaptive Capacity Labs, a resilience engineering consultancy.
Dekker is a human factors and safety researcher with a background in aviation.
His books aimed at a lay audience (Drift Into Failure, Just Culture, The Field Guide to 'Human Error' Investigations)
have been enormously influential. He was a founder of the MSc programme in Human Factors & Systems Safety at Lund University.
His PhD advisor was David Woods.
Dekker developed the theory of drift, characterized by five concepts:
Scarcity and competition
Decrementalism, or small steps
Sensitive dependence on initial conditions
Unruly technology
Contribution of the protective structure
Just Culture
Dekker examines how cultural norms defining justice can be re-oriented to minimize the negative impact and maximize learning when things go wrong.
Retributive justice as society's traditional idea of justice: distributing punishment to those responsible based on the severity of the violation
Restorative justice as an improvement for both victims and practitioners: distributing obligations of rebuilding trust to those responsible based on who is hurt and what they need
First, second, and third victims: an incident's negative impact is felt by more than just the obvious victims
Learning theory: people break rules when they have learned there are no negative consequences, and there are actually positive consequences - in other words, they break rules to get things done to meet production pressure
Reporting culture: contributing to reports of adverse events is meant to help the organization understand what went wrong and how to prevent recurrence, but accurate reporting requires appropriate and proportionate accountability actions
Complex systems: normal behavior of practitioners and professionals in the context of a complex system can appear abnormal or deviant in hindsight, particularly in the eyes of non-expert juries and reviewers
The nature of practitioners: professionals want to do good work, and therefore want to be held accountable for their mistakes; they generally want to help similarly-situated professionals avoid the same mistake.
Doyle is a
control systems researcher. He is seeking to identify the universal laws that capture the
behavior of resilient systems, and is concerned with the architecture of such
systems.
Ericsson introduced the idea of deliberate practice as a mechanism for
achieving a high level of expertise.
Ericsson isn't directly associated with the field of resilience engineering.
However, Gary Klein's work is informed by Ericsson's, and I have a particular
interest in how people improve in expertise, so I'm including him here.
Feltovich is a retired Senior Research Scientist at the Florida Institute for Human & Machine Cognition (IHMC),
who has done extensive research in human expertise.
Herrera is an associate professor in
the department of industrial economics and technology management at NTNU and a
senior research scientist at SINTEF. Her areas of expertise include safety management and
resilience engineering in aviation and air traffic management.
Hollnagel proposed that there is always a fundamental tradeoff between
efficiency and thoroughness, which he called the ETTO principle.
Safety-I vs. Safety-II
Safety-I: avoiding things that go wrong
looking at what goes wrong
bimodal view of work and activities (acceptable vs unacceptable)
find-and-fix approach
prevent transition from 'normal' to 'abnormal'
causality credo: the belief that adverse outcomes happen because something goes wrong (they have causes that can be found and treated)
it either works or it doesn't
systems are decomposable
functioning is bimodal
Safety-II: performance variability rather than bimodality
the system’s ability to succeed under varying conditions, so that the number
of intended and acceptable outcomes (in other words, everyday activities) is
as high as possible
performance is always variable
performance variation is ubiquitous
things that go right
focus on frequent events
remain sensitive to possibility of failure
be thorough as well as efficient
FRAM
Hollnagel proposed the Functional Resonance Analysis Method (FRAM) for modeling
complex socio-technical systems.
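As a rough sketch of the idea (my own illustration, not Hollnagel's notation): a FRAM model describes each function by six aspects (input, output, precondition, resource, time, control), and functions become coupled when an aspect of one depends on the output of another, so performance variability can resonate across those couplings. The class and function names below are invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class FramFunction:
    """One function in a FRAM-style model, described by its six aspects."""
    name: str
    inputs: list[str] = field(default_factory=list)         # what the function acts on
    outputs: list[str] = field(default_factory=list)        # what the function produces
    preconditions: list[str] = field(default_factory=list)  # what must hold before it can start
    resources: list[str] = field(default_factory=list)      # what it consumes while executing
    time: list[str] = field(default_factory=list)           # temporal constraints
    control: list[str] = field(default_factory=list)        # what supervises or regulates it

def couplings(functions: list[FramFunction]) -> list[tuple[str, str, str]]:
    """Find (upstream, downstream, aspect) links where one function's output
    feeds another function's input, precondition, resource, time, or control."""
    links = []
    for up in functions:
        for down in functions:
            if up is down:
                continue
            for aspect in ("inputs", "preconditions", "resources", "time", "control"):
                for item in getattr(down, aspect):
                    if item in up.outputs:
                        links.append((up.name, down.name, aspect))
    return links

# Tiny invented example: variability in "triage patient" can propagate into "administer treatment".
triage = FramFunction("triage patient", inputs=["arriving patient"], outputs=["priority assessment"])
treat = FramFunction("administer treatment", inputs=["priority assessment"],
                     resources=["available staff"], outputs=["treated patient"])
print(couplings([triage, treat]))  # [('triage patient', 'administer treatment', 'inputs')]
```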
Johannesen is currently a UX researcher and community advocate at IBM.
Her PhD dissertation work examined how humans cooperate, including studies of anesthesiologists.
Macrae is a social psychology
researcher who has done safety research in multiple domains, including aviation
and healthcare. He helped set up the new healthcare investigation agency in
England. He is currently a professor of organizational behavior and psychology
at the Nottingham University Business School.
Maguire is a cognitive systems engineering researcher who is currently completing a PhD at Ohio State University.
Maguire has done safety work in multiple domains, including forestry, avalanches, and software services.
Perrow is a sociologist who studied the Three Mile Island disaster. "Normal Accidents" is cited by numerous other influential systems engineering publications such as Vaughan's "The Challenger Launch Decision".
Concepts
Complex systems: A system of tightly-coupled components with common mode connections that is prone to unintended feedback loops, complex controls, low observability, and poorly-understood mechanisms. They are not always high-risk, and thus their failure is not always catastrophic.
Normal accidents: Complex systems with many components exhibit unexpected interactions in the face of inevitable component failures. When these components are tightly-coupled, failed parts cannot be isolated from other parts, resulting in unpredictable system failures. Crucially, adding more safety devices and automated system controls often makes these coupling problems worse.
Common-mode: The failure of one component that serves multiple purposes results in multiple associated failures, often with high interactivity and low linearity - both ingredients for unexpected behavior that is difficult to control.
Production pressures and safety: Organizations adopt processes and devices to improve safety and efficiency, but production pressure often defeats any safety gained from the additions: the safety devices allow or encourage more risky behavior. As an unfortunate side-effect, the system is now also more complex.
Jens Rasmussen was a very influential researcher in human factors and safety systems.
Contributions
Skill-rule-knowledge (SRK) model
TBD
Dynamic safety model
Rasmussen proposed a state-based model of a socio-technical system as a system
that moves within a region of a state space. The region is surrounded by
different boundaries:
boundary of economic failure
boundary of unacceptable workload
boundary of functionally acceptable performance (the safety boundary)
Gradients of management pressure toward efficiency and of practitioner pressure toward least effort push the operating point toward the safety boundary.
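A minimal toy sketch of the model (all coordinates, pressure values, and boundary positions are invented for illustration): the operating point drifts under competing gradients, and the question of interest is how close it comes to the boundary of acceptable performance.

```python
# Toy illustration of Rasmussen's dynamic safety model: an operating point in a
# 2-D state space (efficiency, effort) drifts under competing gradients.
# All numbers and boundary positions are made up for this sketch.

def step(point, mgmt_pressure=0.05, effort_pressure=0.05):
    """Management pushes toward greater efficiency, practitioners toward less effort;
    both gradients move the operating point toward the safety boundary."""
    efficiency, effort = point
    return (efficiency + mgmt_pressure, effort - effort_pressure)

def near_safety_boundary(point, boundary=1.0):
    """Here the safety boundary is a simple threshold on efficiency;
    crossing it means functionally unacceptable performance."""
    efficiency, _ = point
    return efficiency >= boundary

point = (0.5, 0.5)  # start well inside the region of acceptable operation
for t in range(20):
    point = step(point)
    if near_safety_boundary(point):
        print(f"step {t}: operating point {point} has drifted to the safety boundary")
        break
```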
Reason is a psychology researcher who did work on understanding and categorizing human error.
Contributions
Accident causation model (Swiss cheese model)
Reason developed an accident causation model that is sometimes known as the Swiss cheese model of accidents.
In this model, Reason introduced the terms "sharp end" and "blunt end".
Human Error model: Slips, lapses and mistakes
Reason developed a model of the types of errors that humans make:
slips: attentional failures, where the action carried out is not the one intended
lapses: memory failures, such as omitting a step in a sequence
mistakes: the plan itself is inadequate, even when carried out as intended
Roth is a cognitive psychologist who
serves as the principal scientist at Roth Cognitive Engineering, a small
company that conducts research and application in the areas of human factors
and applied cognitive psychology (cognitive engineering).
Scott is an anthropologist who also does research in political science. While
Scott is not a member of the resilience engineering community, his book Seeing
Like a State has long been a staple of the cognitive systems engineering and
resilience engineering communities.
Shorrock is a chartered psychologist and a chartered ergonomist and human
factors specialist. He is the editor-in-chief of EUROCONTROL
HindSight
magazine. He runs the excellent Humanistic Systems blog.
Woods has a research background in cognitive systems engineering and did work
researching NASA accidents. He is one of the founders of Adaptive Capacity
Labs, a resilience engineering
consultancy.
Woods has contributed an enormous number of concepts.
The adaptive universe
Woods uses the adaptive universe as a lens for understanding the behavior of
all different kinds of systems.
All systems exist in a dynamic environment, and must adapt to change.
A successful system will need to adapt by virtue of its success.
Systems can be viewed as units of adaptive behavior (UAB) that interact. UABs
exist at different scales (e.g., cell, organ, individual, group, organization).
All systems have competence envelopes, which are constrained by boundaries.
The resilience of a system is determined by how it behaves when it comes near
to a boundary.
Events will produce demands that challenge boundaries on the adaptive
capacity of any UAB
Adaptive capacities are regulated to manage the risk of saturating CfM (capacity for manoeuvre)
No UAB can have sufficient ability to regulate CfM to manage the risk of saturation alone
Some UABs monitor and regulate the CfM of other UABs in response to changes
in the risk of saturation
Adaptive capacity is the potential for adjusting patterns of action to
handle future situations, events, opportunities and disruptions
Performance of a UAB as it approaches saturation is different from the
performance of that UAB when it operates far from saturation
All UABs are local
There are bounds on the perspective of any UAB, but these limits are overcome
by shifts and contrasts over multiple perspectives.
Reflective systems risk mis-calibration
(Shorter wording)
Boundaries are universal
Surprise occurs, continuously
Risk of saturation is monitored and regulated
Synchronization across multiple units of adaptive behavior in a network is necessary
Risk of saturation can be shared
Pressure changes what is sacrificed when
Pressure for optimality undermines graceful extensibility
All adaptive units are local
Perspective contrast overcomes bounds
Mis-calibration is the norm
Concepts
Many of these are mentioned in Woods's short course.
the adaptive universe
unit of adaptive behavior (UAB), adaptive unit
adaptive capacity
continuous adaptation
graceful extensibility
sustained adaptability
Tangled, layered networks (TLN)
competence envelope
adaptive cycles/histories
precarious present (unease)
resilient future
tradeoffs, five fundamental
efflorescence: the degree to which changes in one area tend to recruit or open up
beneficial changes in many other aspects of the network, opening new
opportunities across the network
reverberation
adaptive stalls
borderlands
anticipate
synchronize
proactive learning
initiative
reciprocity
SNAFUs
robustness
surprise
dynamic fault management
software systems as "team players"
multi-scale
brittleness
decompensation
working at cross-purposes
proactive learning vs getting stuck
oversimplification
fixation
fluency law, veil of fluency
capacity for manoeuvre (CfM)
crunches
sharp end, blunt end
adaptive landscapes
law of stretched systems: Every system is continuously stretched to operate at capacity.
cascades
adapt how to adapt
unit working hard to stay in control
you can monitor how hard you're working to stay in control (monitor risk of saturation)
reality trumps algorithms
stand down
time matters
Properties of resilient organizations
Tangible experience with surprise
uneasy about the precarious present
push initiative down
reciprocity
align goals across multiple units
goal conflicts, goal interactions (follow them!)
to understand system, must study it under load
adaptive races are unstable
adaptive traps
roles, nesting of
hidden interdependencies
net adaptive value
matching tempos
tilt toward florescence
linear simplification
common ground
problem detection
joint cognitive systems
automation as a "team player"
"new look"
sacrifice judgment
task tailoring
substitution myth
directability
directed attention
inter-predictability
error of the third kind: solving the wrong problem