Washington (CNN)
A massive and devastating computer glitch at the Federal Aviation Administration this week that canceled or delayed thousands of flights has Americans uneasily confronting the technology behind U.S. air travel for at least the second time in a month.
As the country picks up the pieces again, embattled air travelers may wonder why flying has suddenly become so susceptible to devastating IT problems.
According to current and former industry officials, government reports and outside analysts, the answer involves not just aging hardware and software but institutional failures that make technology updates more challenging.
Over the years, in the face of explosive demand for air travel, bureaucratic confusion and deferred maintenance have produced an increasingly fragile system, even as it grows more complex, with far more points of failure than many consumers may realize.
Southwest’s system-wide meltdown in recent days, in the midst of a winter storm and during the most critical travel period of the year, and Wednesday’s widespread flight disruptions may have put many of these problems front and center for U.S. passengers. But they are just the latest manifestations of a long-standing and enormously complex problem.
The week’s biggest headache was a corrupted database file in a pilot advisory system that issues warnings, called NOTAMs, of various hazards that could affect flying, from runway closures to the presence of nearby construction equipment. A source familiar with the matter told CNN that the corrupted files also existed on the FAA’s backup system, and CNN first reported the details on Wednesday.
Officials began bringing the main NOTAM system back online early Wednesday morning but failed to fully restore it by the start of the East Coast rush hour, leading the FAA to order a ground stop. A senior U.S. official told CNN on Wednesday that there was no evidence of foul play in the incident, a detail the FAA later publicly confirmed.
“The FAA is continuing a thorough review to determine the root cause of the outage in the Notice to Air Missions (NOTAM) system,” the agency said in a statement Wednesday night. “Our initial work has traced the outage to a corrupted database file. At this time, there is no evidence of a cyber attack. The FAA is working hard to further identify the cause of this issue and take all necessary steps to prevent such outages from happening again.”
The FAA said late Thursday that the data files were “corrupted by personnel who failed to follow procedures.”
The NOTAM issue comes days after the FAA said an “air traffic computer problem” delayed flights to Florida airports for several hours on Jan. 2. The system, called ERAM, is responsible for tracking hundreds of flights at a time and is considered an important part of the FAA’s effort to modernize U.S. airspace.
In Southwest’s case, the airline’s weather-related problems were exacerbated by an outdated dispatch system that couldn’t automatically adjust to disruptions caused by severe winter weather, requiring painstaking human intervention.
Although they have begun modernizing their equipment, airlines and the U.S. government in some cases still rely on technologies that are years or even decades old.
The 30-year-old FAA software that failed this week is at least six years away from an update, a U.S. administration official told CNN on Thursday, though the official said Transportation Secretary Pete Buttigieg has been pushing since the outage to speed up that timeline.
The notifications issued by the FAA’s NOTAM system are “Jurassic,” said Kathleen Bangs, a former airline pilot and aviation expert. “It’s a clumsy system that often overburdens pilots with pages and pages of not-so-urgent notifications, written in antiquated code, that sometimes obscure the very important piece of safety information that pilots really need.”
The FAA acknowledges the age of the NOTAM system. In its most recent budget request to Congress, the agency called for funding to help “eliminate the failing legacy hardware behind it.”
Back in 2012, the FAA decided to replace the aging legacy voice switches used in air traffic control communications with new Internet-based communications technologies. But because of a contract dispute, the FAA now intends to keep using the old switches until at least 2030, according to a report last year by the Transportation Department’s inspector general.
The ERAM air traffic system at the center of the Jan. 2 disruption is younger and was not fully operational until 2015. But according to a 2020 inspector general report, the system was supposed to be fully implemented five years earlier, replacing a system that had been in operation for more than 40 years. The FAA is currently working to update ERAM’s hardware and software. The system has had at least seven failures since 2014, a record that has drawn congressional scrutiny. But according to the 2020 report, the ERAM upgrade may not be complete until 2026.
Meanwhile, aviation experts say many of the IT systems airlines rely on were custom-built long ago, some running on legacy mainframe computers that weren’t designed to handle the flood of incoming information.
“This isn’t a standard Windows server or modern VMware architecture,” says Seth Miller, IT consultant, aviation reporter and editor for travel publication PaxExAero. “These are old, old systems.”
As a result, a severe crisis could easily overwhelm such fragile facilities, said an airline industry official, who spoke on condition of anonymity to discuss the issue more freely.
“These systems were built when airlines were smaller, and they weren’t necessarily able to handle so much data at once,” the official said. “When you have something like a massive winter storm over the holidays, it can’t handle a lot of change at the same time because it’s in a system that wasn’t built to handle moving data sets that large.”
Industry experts say a technology’s age isn’t always inherently problematic. The problem is what that age implies: an inability to scale to meet new demands, and a lack of proper support as the rest of the world moves on. Using custom technology rather than off-the-shelf solutions exacerbates the problem, Miller said, because maintaining it requires increasingly specialized parts and expertise.
Trying to integrate old systems with new ones, always in real time because the global aviation industry never sleeps, also creates opportunities for things to go catastrophically wrong.
While all flight delays and cancellations tend to present a similar experience for air travelers, the root cause of the disruption can vary widely. Far more can go wrong than you might expect, which highlights the complexity of the airline industry and how quickly and easily IT-related travel disruptions can arise.
Industry experts say getting a flight off the ground involves a complex mix of information, and disruptions in any part of the information supply chain could cause delays.
These vulnerabilities are magnified by the sheer number of companies involved in the ecosystem—not just the airlines, but their suppliers, and their suppliers’ suppliers.
“There are so many different systems talking to each other,” said Ross Feinstein, a former spokesman for American Airlines and the TSA.
For example, Feinstein said, TSA reviews airline bookings. “If TSA goes down, it stops the review process for bookings, which means passengers can’t check in and get their boarding passes. It could be a weather company down and pilots can’t get the latest weather updates for takeoff, en route or arrival data.”
In 2019, problems with computers at a third-party company whose flight-planning tool helps airlines calculate the weight and balance of planes caused delays for several airlines across the country.
In 2021, an outage at Sabre, one of the world’s largest airline booking companies, caused disruption across the globe.
The interconnectedness of the aviation industry spans dozens of countries, companies, institutions and databases, creating multiple points of failure. Backup and redundancy can help, but it is still a very complex system of systems.
Beneath the surface of the airline industry’s IT problems lie deeper, messier and more human challenges.
Take, for example, the FAA’s attempt to replace voice switches in air traffic control. According to the inspector general’s report, the main cause of the failure was a standoff between the FAA and its prospective supplier over contract requirements. The dispute centered on possible software flaws in the new switches and whether the vendor could still deliver quality products on time.
The root of the problem was not technical in itself. It was a procurement issue. But it had a lasting impact on FAA technology. The final termination of the contract means the FAA will need to spend more than $270 million through 2030 to continue using its aging legacy voice switches, the report said.
“Continued reliance on these switches creates the risk of communication disruption,” the report concluded.
A similar dynamic played out in the debate around 5G wireless technology near airports, which threatened to cause major disruption last year. Bureaucratic disagreements and years-delayed avionics upgrades have led to a crisis in which U.S. planes are not equipped with the technology to handle potential 5G interference.
Meanwhile, the FAA continues to be led by an acting administrator and lacks a Senate-confirmed head. That has had a real-world impact on IT upgrades and other projects, according to a person familiar with the agency, who spoke on condition of anonymity to discuss the matter more freely.
“It’s really hard to set direction and vision when you don’t know if you’re going to be there for a week or if you’re going to be there for 18 months,” the person said.
Meanwhile, much of the airline industry’s outstanding technical debt can be traced to the spate of mergers and bankruptcies after 9/11, when many airlines were more focused on finances than technology upgrades, industry officials said.
Bureaucratic myopia bears its own share of responsibility for today’s technological malaise in the aviation industry. In some cases, institutional inertia and business priorities have outweighed investments in expensive and tedious infrastructure upgrades.
But the increasingly interconnected and digital nature of systems now means that when things go wrong, they can go wrong in far more catastrophic ways.
Aviation experts say the challenge can only be met with more investment and better planning.
“[The FAA] are doing more with less, and they need more money to modernize,” Feinstein said. “This will again be a battle when the FAA reauthorization bill comes out.”
CNN’s Pete Muntean, Gregory Wallace and Marnie Hunter contributed to this report.