The best-case scenario for the Root Cause Analysis (RCA) approach to safety is when it is used to investigate a “close call”. In our experience, close calls are infrequently reported. Most commonly, an RCA is conducted after a fatality, injury, or other significant damage has occurred, making it a posteriori. Even so, it is a necessary step in preventing future incidents. So how do you identify Organizational/Systemic Failure in an RCA?
Begin at the incident
Every incident involves the basic details of who, what, where, when, why, and how. These generally break down into specific issues. From there, you look at each issue and determine which surrounding issues influenced the outcome and contributed to the incident. Sometimes it is clear: the supervisor moved the forklift while the worker was attaching the forks. Other times it requires further investigation: the worker’s feet slid on the gravel far more rapidly than initially expected. The surrounding issues become questions of training, direction, equipment, worker/client health, etc. These all then turn back to management of the worker or expectations with the client.
The important detail to remember in RCA is that it looks beyond the individual injured. There are incidents where the worker/client was the primary cause, but that does not make them a “root cause”. OSHA provides a useful handout that discusses this in further detail.
From the incident, examine the distribution of responsibility and authority. Using the list of red flags provided in previous posts, examine documentation on who, what, where, when, why, and how the decisions were made. The most challenging part of this process is identifying what is missing from the provided materials: what does not exist but should, as a matter of professional practice. For example, a common failure mode that we see is an organization that either does not do an RCA or does it incorrectly.
Here we would like to give a clear definition of “Organizational Failure” from the safety coordinate system (see previous post for that discussion). It is the circumstance where a failure mode is present and the issues surrounding it maintain the failure, at risk to workers/clients/process-essential equipment. For example, if an organization’s safety officer responds to pressure not to mention one of the primary root causes of an incident, that is likely a perpetuating failure mode. It becomes an Organizational Failure when no other part of the organization addresses the failure mode either, making it perpetual.
A system is made up of the connections between groups/organizations. Organizational Failure can be qualitatively attributed to an individual or small group, such as a dysfunctional relationship between the heads of Human Resources (HR) and Maintenance. Systemic Failure is measured by how the Organizational Failure not only persists but propagates.
For example, the dysfunctional relationship may be occurring in a Special Utility District (SUD) that provides ambulance services. The HR head may be focused on hiring loyal, as opposed to competent, staff, which leads to mishandling of invoices, overcharging (or undercharging), and potential fraud. The Maintenance head may be passive-aggressively retaliating with fewer vehicles in service, incomplete inspections, and repair requests prioritized independent of need. The lower-quality, more expensive service carries risk to the taxpaying clients (who have no real choice of an alternative). Thus far, the Organizational Failure mode can persist indefinitely.
How does this situation become systemic? Let’s consider the HR head convincing the elected SUD Directors to outsource the maintenance, and the incompetent staff hired by HR failing to find the correct Contractor and/or the correct price. The Contractor may have the authority to do the repairs, but without the responsibility to do them correctly, and with a hefty interest in maximizing profit, the Contractor itself becomes vulnerable to developing a failure mode.
Now let’s consider a retirement community being served by the ambulance service. The managers of the retirement community notice that the response times are long, the care is minimal, and the rate of injury to their clients is higher, bringing a cost. They may consider contracting with a specialized ambulance service, but that has its own cost. Instead, they decide to contribute to the campaigns of the SUD Directors in charge of the ambulance service, who in turn encourage prioritizing the retirement community in its policies, which in turn de-prioritizes the taxpayers. A new failure mode has developed.
As the problems continue to propagate, surrounding organizations/groups can decide to implement defensive actions.
For example, nearby ambulance services may have an agreement with the one experiencing the Organizational Failure to respond across boundaries in certain circumstances, such as for vehicle accidents on a major interstate highway that passes through the territory. More and more frequently, these services may be called on to respond. Rumors spread among the EMTs, and the functional service sees enough of the failure mode to modify its agreement with the SUD, setting a cap. This boundary, while increasing the risk to the taxpayers served by the failing ambulance service, is the simplest way to protect the taxpayers/clients in the functional ambulance service’s community.
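The distinction drawn above between a failure mode, a (perpetuating) Organizational Failure, and a (propagating) Systemic Failure can be written down as a small decision rule. The Python sketch below is only an illustration of the definitions in this post, not a tool we use in practice; the names `FailureMode`, `FailureClass`, `addressed_internally`, `propagated_to`, and `classify` are hypothetical, and the two records are the theoretical ambulance example, not case data.

```python
from dataclasses import dataclass, field
from enum import Enum


class FailureClass(Enum):
    FAILURE_MODE = "failure mode"
    ORGANIZATIONAL = "Organizational Failure (perpetuating)"
    SYSTEMIC = "Systemic Failure (propagating)"


@dataclass
class FailureMode:
    """Hypothetical record of one failure mode found during an RCA."""
    description: str
    organization: str
    # True if some other part of the same organization addresses the failure mode.
    addressed_internally: bool = False
    # Other groups/organizations where this failure has led to new failure modes.
    propagated_to: list[str] = field(default_factory=list)


def classify(mode: FailureMode) -> FailureClass:
    """Apply the post's definitions in order (an assumed simplification)."""
    if mode.addressed_internally:
        # Someone inside the organization corrects it, so it is not perpetual.
        return FailureClass.FAILURE_MODE
    if mode.propagated_to:
        # It persists and has spread beyond the original organization.
        return FailureClass.SYSTEMIC
    # It persists, but only inside the original organization.
    return FailureClass.ORGANIZATIONAL


hr_vs_maintenance = FailureMode(
    description="Dysfunctional relationship between HR and Maintenance heads",
    organization="SUD ambulance service",
)
after_outsourcing = FailureMode(
    description="Maintenance outsourced without responsibility for correctness",
    organization="SUD ambulance service",
    propagated_to=["maintenance contractor", "retirement community"],
)

for mode in (hr_vs_maintenance, after_outsourcing):
    print(f"{mode.description}: {classify(mode).value}")
```

In a real analysis the two inputs are the hard part: whether anyone inside the organization addresses the failure, and where it has spread, are findings of the RCA rather than fields that can simply be filled in.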
The ambulance example is a zoomed-in look at a specific case. It does not necessarily have to be an isolated Systemic Failure. Let’s ask what organization has supervisory authority over the Special Utility Districts. For this example, let’s assume it is only the State Auditor, and no other entity has jurisdiction. The State Auditor does not have the resources to investigate SUDs, and its priorities are set by Federal guidelines (ensuring that the state does not lose any funding due to malfeasance) and the State Legislature. A tiny Special Utility District has a minuscule budget compared to the entities the Auditor usually audits, and the taxpayers impacted are less than 0.1% of the population. The State Auditor will have no interest in investigating. What does that mean for the thousands of other SUDs in the relatively large state? This is a failure mode. If a compilation of various local news articles shows failures similar to the ambulance service occurring in scattered communities throughout the state, this is a Systemic Failure as well. (In the next post, we will talk about the unique conditions present in organizations as a matter of size; if the cumulative impacted population is 1-2% of the state, is it really a Systemic Failure? It depends.)
In our experience as expert witnesses, the corrective step in that arena is litigation. It is imperfect, taking a number of years and carrying a low probability of instigating a correction. Generally, multiple lawsuits are required to trigger a correction.
Looking at the example above (there is no such situation in our archives; this is purely theoretical), legal counsel could be retained if (a) a patient is injured or dies because of equipment that wasn’t maintained, (b) an EMT is injured or dies because of equipment that wasn’t maintained, (c) a worker for the maintenance contractor is injured and the contractor does not have workers’ compensation insurance, (d) a worker for the maintenance contractor is fatally injured, or (e) any other situation arises that gives workers/clients standing to sue. Out of this situation, ACS could be retained by the plaintiff, the maintenance company’s defense, or the SUD’s defense. In talking to the lawyer initially, we listen to the known details of the case. No case ever looks like Organizational/Systemic Failure at this point. The way these situations tend to play out is highly varied, but here are some examples:
- The plaintiff’s lawyer hires us to perform an independent Root Cause Analysis; we find that the maintenance company failed to follow equipment operators’ manuals and that the ambulance service failed to enforce its supervisory terms in the contract. The lawyer focuses on the maintenance company as having the stronger evidence base, and the ambulance service successfully claims that it was not required to supervise. These cases rarely reveal Organizational/Systemic Failures.
- The maintenance company hires us to analyze another expert’s RCA as well as additional information provided as a result of the initial findings. We find that while the company did not follow the maintenance schedules, the ambulance service was not timely in providing the equipment to be serviced in the first place and was not paying its invoices in full for the services being conducted. Then one of two things can happen: (1) the initial expert modifies their report’s findings given the new information, and a consensus is reached that the ambulance service had neglected safety; or (2) the initial expert is not given the opportunity to supplement the initial report, and the plaintiff’s lawyer moves forward using our opinion, saving their client money. These cases can reveal the existence of Organizational/Systemic Failure.
- The lawyer for the ambulance service (a) gives us the impression that the maintenance company failed to fulfill its contract, we find out differently, and no report is issued; or (b) gives us enough information for us to tell them that there are significant safety problems in the organization and that all we can do is make sure the other experts are honest. We generally are not hired at this point. In situation (a), we can detect Organizational/Systemic Failure if present; however, no documentation is produced, as the lawyer does not request a report.
The approach described above is very costly, and the probability of Organizational/Systemic Failure being detected is low. An alternative that corrects failures prior to litigation is for an organization to hire outside consultants to review its operations. We have done this a few times. In one case, an Organizational Failure was detected between two managers. The manager who hired us was fatally injured by mishandling toxic substances, and our interpretation of the situation was that he hired us to prevent more fatalities among his workers arising from the ongoing failure. Our report led to the organizational branch that we had investigated being shut down, likely due to regulatory/litigation concerns. The benefit is that the exposure of future workers at the facility never occurred.
ACS experts have experience in correcting or preventing the propagation of Systemic Failure but, to our knowledge, have not yet done so while working on an ACS project. We do not request to be informed of case outcomes.
Safety, the Key Measure
As discussed previously, the most reliable indicator of a failure is a safety incident, whether a “close call”, injury/fatality, or essential equipment loss. Root Cause Analysis into the failure can provide information indicating that the failure is a (perpetuating) Organizational Failure or (propagating) Systemic Failure. How this concept scales up in size with larger entities will be discussed next.