When codes rule the world
Software underpins everything from hospitals to global air travel, yet a single glitch can have worldwide effects. How do interconnected systems amplify risk, and what can organisations do to stop failures becoming systemic crises?
December 19, 2025

Host/Producer: Ezgi Toper
Guest: Professor Salil Kanhere
Craft Editor: Nasrullah Yilmaz
Production Team: Afzal Ahmed, Ahmet Ziya Gumus, Mucteba Samil Olmez, Khaled Selim
Executive Producer: Nasra Omar Bwana

TRANSCRIPT

SALIL KANHERE, PROFESSOR AT UNSW SYDNEY: So, we have this handful of providers who are essentially supporting the entire digital economy. So, and if something goes wrong, they're so intricately connected that the failure sort of starts to cascade. So if something went wrong in one region, the ripple effects get felt everywhere, right?

If AWS goes down, then so many other things that are relying on it – the hospitals, government services, consumer websites – I mean, everything sort of can go down.

EZGI TOPER, HOST: You’re listening to “In the Newsroom”, and I’m Ezgi Toper. In this podcast, we have conversations with colleagues and experts that go beyond the headlines. 

Software quietly powers almost every part of our daily lives: from the hospitals we rely on, to banking systems and air travel. Most of the time, it works so seamlessly that we barely notice it’s there. But when it fails, the consequences can ripple far beyond a single app or platform.

Take what happened last month when a number of high-profile websites, including X and ChatGPT, went down for many people due to issues affecting a major internet infrastructure firm: Cloudflare. 

NEWS REPORTER 1: Let’s get back to our breaking news right now. Internet infrastructure provider Cloudflare says it’s deploying a fix for an issue…

NEWS REPORTER 2: New at 8:30, we’re learning about a tech outage that’s impacting ChatGPT and social media site X.

NEWS REPORTER 3: The winner for this week’s award for taking down the internet goes to Cloudflare.

EZGI: According to Cloudflare, 20 percent of all websites worldwide use its services in some form. So one glitch on their end has an enormous ripple effect.

Or what about when AWS, the world’s largest cloud provider, crashed in October and took everything from Snapchat, Reddit and Signal to smart beds offline. 

The cause? A small bug in automation software that had widespread consequences. 

You see, all these recent incidents underscore just how reliant the internet has become on a small number of core infrastructure providers. 

In this episode, I speak with Professor Salil Kanhere from the School of Computer Science and Engineering at UNSW Sydney in Australia. He leads a group on information security and privacy. 

He helps us understand why software failures are becoming more disruptive, how interconnected systems amplify risk, and what governments and organisations can do to prevent small glitches from becoming systemic crises. 

Welcome to the show, Salil. Thank you so much for being with me today.

PROFESSOR SALIL: Thanks, Ezgi for having me. It's wonderful to talk to you.

EZGI: So from your perspective, when did software cross the threshold from being a tool to becoming a form of critical infrastructure?

PROFESSOR SALIL: Yeah, that's a great lead question. So I would say I don't think of it as a single moment, right? It sort of happened gradually. I mean, we've been using software for many years now. Earlier, software was essentially being used as a tool to help people do things faster, more efficiently. But then if you start looking at the turn of the millennium, say through the 2000s and 2010s, over that period you started to see software become pervasive, right? It was being used in everything, really. It started to get used in all vital services: banking, healthcare, logistics, energy, transport, government, I mean, you name it.

And this was happening on a very large scale. And all these services were using it in real time, essentially, right? The software systems were always being called on. And I think when we reached the pandemic, that's when things really became heavily dependent on software.

So, if you look back, 5-6 years ago, suddenly everyone had to shift to online systems. Software was just everywhere. People were sitting in their homes, just on their screens all the time, right? Be it work, be it education, healthcare, I mean, you name it, right? Everything was essentially reliant on software functioning properly, securely and fairly.

EZGI: Right, but software isn't always foolproof. We've seen a lot of recent outages, like the AWS event. What does that reveal to us about the fragility of things like cloud dependence today?

PROFESSOR SALIL: If you think about it, this doesn't necessarily… we are not sort of saying the cloud systems themselves are poorly engineered. Actually, they're quite robust, I would say. It just exposes how concentrated these software systems essentially are in terms of their dependency, right?

So, it's not about the fragility of an individual server or a data centre. It's more at a systemic level. So, we have this handful of providers who are essentially supporting the entire digital economy, right? And if something goes wrong, they're so intricately connected that the failure sort of starts to cascade. So, if something went wrong in one region, the ripple effects get felt everywhere, right?

So, you mentioned AWS. If AWS goes down, then so many other things that are relying on it – the hospitals, government services, consumer websites – I mean, everything sort of can go down. So, that's essentially, again, leading to the point that all these platforms have crossed this barrier into critical infrastructure territory. 

Another thing that comes to mind is that a lot of the time organisations, particularly smaller ones, because they don't have the capacity, subscribe to these platforms and are then under the illusion that, OK, I'm using AWS or whatever, and I don't have to worry about anything because they will take care of everything, right? And I think that is not the right thinking, even though you are relying on these providers and certainly offloading a lot of things to them.

Each organisation should have some resilience planning in mind, because at some point this service is going to go down. We are seeing that already. So, each organisation should also plan for what happens if that one provider goes down, and that's sort of hinting that they should not rely on a single provider for all of their services. 
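The multi-provider planning Professor Kanhere describes can be sketched in a few lines of code. This is a minimal, hypothetical illustration, not a real cloud API: the provider names and the `fetch_from` stub are invented stand-ins for the example, which simply tries a backup provider when the primary is unreachable.

```python
# Hypothetical failover sketch: try the primary provider first, then
# fall back to a secondary one. 'primary' is hard-coded to fail here
# to simulate an outage; nothing below is a real cloud API.

def fetch_from(provider: str, resource: str) -> str:
    """Stand-in for a network request; the 'primary' provider is down."""
    if provider == "primary":
        raise ConnectionError(f"{provider} is unreachable")
    return f"{resource} served by {provider}"

def fetch_with_failover(resource: str, providers: list) -> str:
    """Try each provider in order; fail only if every one is down."""
    last_error = None
    for provider in providers:
        try:
            return fetch_from(provider, resource)
        except ConnectionError as err:
            last_error = err  # note the failure and try the next provider
    raise RuntimeError(f"all providers failed: {last_error}")

print(fetch_with_failover("patient-records", ["primary", "secondary"]))
# prints: patient-records served by secondary
```

The point of the sketch is the ordering logic: an organisation only finds out whether its failover path works if one actually exists and is exercised.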

EZGI: One of the most shocking software crashes of 2025 occurred in late November, when Airbus, the major European aerospace company, issued an urgent software fix for its aircraft.

Airlines around the world were forced to ground thousands of planes following the discovery of the software problem. Around 6,000 A320 planes were thought to be affected.

And speaking of that ripple effect you mentioned, one incident that came to mind was the Airbus software update issue that ended up grounding those aircraft and disrupting global air travel. So, what did that incident teach us about the risks of software updates in safety-critical systems?

PROFESSOR SALIL: Yeah. Again, that was really shocking, and that's a very good question. So, if you look at safety critical systems, we need to understand that any changes being made have risks.

Normally when you think of software updates, a lot of the time they happen because you're trying to fix some bugs, improve performance, increase security, right? Whatever. I mean, so the intentions are correct. Yes, you want to do all that, but then in sectors like aviation, where it's super safety critical, due to the complexity and all these interdependencies, you must understand that this can introduce unforeseen risks. So, software updates in these sorts of ecosystems should not be treated like routine maintenance, like, oh, we are just doing a simple update.

This has to be really well orchestrated. So, things like: you stage the rollout, and you have rollback mechanisms. So, if something goes wrong, you can quickly go back and the system still operates, and it doesn't cause these sorts of problems. And then trying to isolate the problems, so that if a problem happens, it's contained and doesn't bring down the entire ecosystem. 
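The staged rollout with rollback that Professor Kanhere describes can be sketched in code. Everything below is illustrative: the `healthy` check and the stage fractions are invented stand-ins for a real deployment pipeline's health monitoring and canary percentages.

```python
# Illustrative staged rollout: deploy to a small slice of the fleet,
# check health, widen the stage only if all is well, and roll back
# immediately on the first unhealthy node. All names are invented.

def healthy(node: str, version: str) -> bool:
    """Stand-in health check; version '2.0-bad' fails on every node."""
    return version != "2.0-bad"

def staged_rollout(version, fleet, stages=(0.05, 0.25, 1.0)):
    """Roll out in growing stages; roll back the moment a node fails."""
    deployed = []
    for fraction in stages:
        cutoff = max(1, int(len(fleet) * fraction))
        for node in fleet[len(deployed):cutoff]:
            deployed.append(node)
            if not healthy(node, version):
                deployed.clear()  # revert every node touched so far
                return {"status": "rolled_back", "deployed": deployed}
    return {"status": "complete", "deployed": deployed}

fleet = [f"node-{i}" for i in range(20)]
print(staged_rollout("2.0-bad", fleet)["status"])  # rolled_back
print(staged_rollout("2.1", fleet)["status"])      # complete
```

A bad version here never gets past the first small stage, which is the containment property: the blast radius of a faulty update is limited to the canary slice.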

I think this kind of reminds us, essentially, that the agility in deploying these updates must be balanced with uncompromising safety assurance, if you will, right? You have to be careful: moving so fast that you break things is never an option in these kinds of sectors.

EZGI: Right. And do you think, professor, if these types of safety measures were put into place, these types of failures wouldn't happen? Are these kinds of failures inevitable in systems at this scale, or do you believe they're fully preventable?

PROFESSOR SALIL: Yeah, so that's also a good question. So, failures, I think, are inevitable, right? Given the scale we are talking about – I mean, we have billions of people using all these software systems. But what I think is important is that catastrophic failures should not be inevitable. So, yes, failures will happen, we know that, and as we all know, these systems are so complex.

It's not competence that we are blaming. Something will go wrong because no one has foreseen a particular fault. Now, the key is, can you keep this contained, right? Can this fault, whenever it happens, stay local? Can we recover from it, and ensure that it does not cascade into a full system disruption?

In many of these incidents, the root cause might be a very simple bug, but it's the architectural design choice that has caused the failure, because the architecture hasn't ensured this containment. So even this small bug has spread everywhere and brought down the whole system. So, yes, failures will happen, but I don't think grounding a whole fleet of airplanes or halting essential services should happen. That's, I think, more a governance and design failure than a purely technical failure. 

EZGI: And as you mentioned, the systems are very complex, and it can be hard to keep a failure local when there's all that interconnectedness of cloud platforms, mobile networks, logistics software, AI. Do you believe all of that interconnectedness actually intensifies this risk of the ripple effects, this widespread disruption that you speak of?

PROFESSOR SALIL: Yes, certainly, I agree. I do agree that this certainly escalates things, right? Each of those layers you talked about – cloud, mobile networks, logistics, and AI – by themselves, yes, they are generally fairly resilient, but the danger arises once you start connecting them. So if one thing goes down, it propagates, right?

So, if you talk of cloud platforms, they're often hosting various services. You have mobile networks that provide access. Then you have your logistics, which could manage some physical operations, and then AI is sort of looking at all the data and making real-time decisions. Now, if one component fails, then maybe you'll have a situation where the AI system is using old or incomplete data. You might have the logistics system lose visibility of some of the ecosystem. You might have mobile users who can't access your system. So, all of these essentially cause this cascade rather than one single outage. 

I think it's very important as an organisation to understand these interactions, because as you mentioned, a lot of things are automated these days, right? So if one thing goes wrong and there's automation, then it percolates so quickly that before humans can even react, things have escalated to the point where the whole system goes down.

EZGI: And I want to circle back to something you said earlier, that a lot of the time these kinds of epic fails are governance issues as well. Do you believe governments are prepared for these software-based infrastructure failures at national scales?

PROFESSOR SALIL: I would say, maybe, partially. I mean, yes, governments are realising this as a result of some of these failures happening, but perhaps they're a bit slow, and the scale is perhaps not appreciated, right? If you look at most governments, we have plans in place for traditional infrastructure failures, right? If the power goes out, if there's a natural disaster, if there's a physical attack, generally we have systems to handle that and recover from that.

The issue with software is, yeah, as we kind of already discussed, it's different. It spreads much faster. It crosses boundaries. In fact, it may already be out of the reach of a single government, right? Because a lot of the time we are looking at private platforms like Amazon and Cloudflare, so they're not really under the direct control of any one government. 

I must say a lot of governments in the world have made progress, so they have put in place various cybersecurity frameworks for how you manage incident response. There's even critical infrastructure regulation now in place, but I think we're still lagging, so there needs to be a realisation that software failures are not isolated incidents. If you look at any contingency plan, you need to model the scenario where perhaps one component affects another component, and then how you deal with that. 

So, that's essentially something that needs to happen, and certainly governments can regulate the outcomes, but many times they will not have control, because these large entities extend beyond singular governments.

And then having systems which can work in this safe mode, as they're called. So, if something goes wrong, you can isolate the system and it still works. It gives you some basic functionality, rather than completely shutting down and then nothing works.
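The "safe mode" idea, degrading to basic functionality instead of shutting down completely, can be sketched like this. The cache contents and the failing `fetch_live` stub are hypothetical placeholders for a real upstream dependency:

```python
# "Safe mode" sketch: when the live dependency fails, serve a cached
# (possibly stale) answer so basic functionality survives the outage.
# The cache contents and fetch_live stub are hypothetical.

CACHE = {"schedule": "last known schedule (possibly stale)"}

def fetch_live(key: str) -> str:
    """Stand-in for an upstream call; always down in this demo."""
    raise TimeoutError("upstream service down")

def get_with_safe_mode(key: str) -> str:
    """Prefer live data; degrade to the cache instead of failing hard."""
    try:
        return fetch_live(key)
    except TimeoutError:
        return CACHE.get(key, "service temporarily unavailable")

print(get_with_safe_mode("schedule"))
# prints: last known schedule (possibly stale)
```

The design choice here is that a stale answer, clearly labelled as such, is usually better for users than no answer at all.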

And then, of course, stress testing essentially doing what we call “fire drills.” You could call them “digital fire drills” or things like that, where you simulate and make sure you're thoroughly testing all of this. And, also having some human oversight. I mean, yes, automation is good. That's fantastic. But when there are certain critical systems, you do want the human to have some say. Otherwise, a problem will escalate, and before you even know it, the whole system could go down. So, yeah, I think essentially we should accept failure as inevitable, but just that the failures are contained, and we can recover from them.
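A "digital fire drill" of the kind described can be simulated very simply: take each component down in turn and check whether the functions you declare critical survive. The service names below are invented for the illustration:

```python
# Tabletop "digital fire drill": knock out each service in turn and
# record whether the services declared critical are still standing.
# The service names here are invented for the illustration.

def drill(services, critical):
    """Return, per injected failure, whether critical services survive."""
    results = {}
    for victim in services:
        still_up = {s for s in services if s != victim}
        results[victim] = critical <= still_up  # all critical services up?
    return results

report = drill(["auth", "payments", "search"], critical={"auth"})
print(report)
# {'auth': False, 'payments': True, 'search': True}
```

A report like this immediately shows the single points of failure: any service whose loss the system cannot survive is exactly where containment work should go.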

EZGI: Is it possible we'll reach a point where code is regulated like physical infrastructure?

PROFESSOR SALIL: I think so, yes, and in many ways we are sort of headed in that direction. It's a little patchy and uneven, but we will get there. Of course, we perhaps may not regulate all code. That might not be the priority, but at least code that supports essential services – the cloud platforms you talked about, payment systems, healthcare, energy, communications. These are extremely critical for everything we do today. So, certainly, the code that underpins all that should be regulated, and we're getting towards that. 

And there are already trends if you look at it. We talked of how societies depend on it so heavily. We have crossed that threshold, essentially. And then governments are thinking of these risk-based regulations. So, you want assurance, certification, testing, and then auditing when failures can cause systemic harm. And we are seeing all that already in sectors like aviation and financial systems. People are even talking of AI: how do you regulate AI systems, right? 

I think we'll get there. That said, the way the regulation may look would probably be different from physical things like steel, concrete, and all that. Because software is such a global ecosystem. It's so interconnected. So, the regulation will have to be different, but the goals will probably be the same. You want duty of care, you want resilience, you want public accountability.

EZGI: As our conversation with Professor Salil Kanhere makes clear: software underpins essential services, national infrastructure and global systems. Failures may be inevitable in systems this complex, but catastrophic breakdowns don’t have to be. 

Whether it’s better system design, stronger governance, digital “fire drills,” or rethinking how we regulate code that supports essential services, the challenge ahead is learning how to contain failure before it cascades.

Thanks for tuning in. Until next time, I’m Ezgi Toper, and this was “In the Newsroom”.
