Programming note: I recently launched a weekly podcast, Complex Systems with Patrick McKenzie. About 50% of the conversations cover Bits about Money's beat. The remainder will be on other interesting intersections of technology, incentives, culture, and organizational design. The first three episodes covered teaching trading, Byrne Hobart on the epistemology of financial firms, and the tech industry vs. tech reporting divide. Subscribe to it anywhere you listen to podcasts. If you enjoy it, writing a review (in your podcast app or to me via email) helps quite a bit.
On July 19th, a firm most people have sensibly never heard of knocked out a large portion of the routine operations at many institutions worldwide. This hit the banking sector particularly hard. It has been publicly reported that several of the largest U.S. banks were affected by the outage. I understand one of them to have idled tellers and bankers nationwide for the duration. (You’ll forgive me for not naming them, as it would cost me some points.) The issue affected institutions across the size spectrum, including large regionals and community banks.
You might sensibly ask why that happened and, for that matter, how it was possible for it to happen at all.
You might be curious about how to quickly reconstitute the financial system from less legible sources of credit when it is down. (Which: probably less important as a takeaway, but it is quite colorful.)
Brief necessary technical context
Something like 20% of the readership of this column has an engineering degree. To you folks, I apologize in advance for the following handwaviness. (You may be better served by the Preliminary Post Incident Review.)
Many operating systems have a distinction between the “kernel” supplied by the operating system manufacturer and all other software running on the computer system. For historical reasons, that area where almost everything executes is called “userspace.”
In modern software design, programs running in userspace (i.e. almost all programs) are relatively limited in what they can do. Programs running in kernelspace, on the other hand, get direct access to the hardware under the operating system. Certain bugs in kernel programming are very, very bad news for everything running on the computer.
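To make that concrete, here is a minimal sketch (mine, not from the incident review; it assumes a POSIX system) of why the same bug is survivable in userspace and catastrophic in kernelspace:

```c
/* The child process dereferences a NULL pointer. The operating system
 * kills only the child; the parent (standing in for everything else
 * running on the machine) carries on. A kernel driver making the same
 * mistake has no supervising layer to contain the damage, so the whole
 * machine goes down instead.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid < 0)
        return 1;
    if (pid == 0) {
        volatile int *p = NULL;
        *p = 42;        /* userspace fault: the OS terminates this process only */
        _exit(0);       /* never reached */
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))
        printf("child killed by signal %d; rest of the system unaffected\n",
               WTERMSIG(status));
    return 0;
}
```

The operating system sits above userspace and cleans up after its failures; nothing sits above the kernel to return the favor.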
CrowdStrike Falcon is endpoint monitoring software. In brief, “endpoint monitoring” is a service sold to enterprises which have tens or hundreds of thousands of devices (“endpoints”). Those devices are illegible to the organization that owns them due to sheer scale; no single person or group of people understands what is happening on them. This means there are highly variable levels of how-totally-effed those devices might be at exactly this moment in time. The pitch for endpoint monitoring is that it gives your teams the ability to make those systems legible again while also benefiting from economies of scale, with you getting a continuously updated feed of threats to scan for from your provider.
One way an endpoint might be effed is if it was physically stolen from your working-from-home employee earlier this week. Another way is if it has recently joined a botnet orchestrated from a geopolitical adversary of the United States after one of your junior programmers decided to install warez because the six figure annual salary was too little to fund their video game habit. (No, I am not reading your incident reports, I clarify for every security team in the industry.)
In theory, you perform ongoing monitoring of all of your computers. Then, your crack security team responds to alerts generated by your endpoint monitoring solution. This will sometimes merit further investigation and sometimes call for immediate remedial work. The conversations range from “Did you really just install cracked Starcraft 2 on your work PC? … Please don’t do that.” to “The novel virus reported this morning compromised 32 computers in the wealth management office. Containment was achieved by 2:05 PM ET, by which point we had null routed every packet coming out of that subnet then physically disconnected power to the router just to be sure. We have engaged incident response to see what if any data was exfiltrated in the 47 minutes between detection and null routing. At this point we have no indications of compromise outside that subnet but we cannot rule out a threat actor using the virus as a beachhead or advanced persistent threats being deployed.”
(Yes, that does sound like a Tom Clancy novel. No, that is not a parody.)
Falcon punched
CrowdStrike shipped Falcon a configuration bug. In brief, this means that rather than writing new software (which, in modern development practice, hopefully goes through fairly extensive testing and release procedures), CrowdStrike sent a bit of data to systems with Falcon installed. That data was intended to simply update the set of conditions that Falcon scanned for. However, due to an error at CrowdStrike, it actually caused existing, already-reviewed Falcon software to fail catastrophically.
Since that failure happened in kernelspace at a particularly vulnerable time, this resulted in Windows systems experiencing total failure beginning at boot. The user-visible symptom is sometimes called the Blue Screen of Death.
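The Preliminary Post Incident Review describes the mechanism as problematic content in a configuration update triggering an out-of-bounds memory read in code that had shipped long before. The sketch below is my illustration of that general class of bug, with invented names and numbers throughout; it is emphatically not CrowdStrike’s code:

```c
/* Illustration of the general configuration-bug class (invented names,
 * not CrowdStrike's code): logic that passed review long ago trusts a
 * count field arriving in a data update that did not get the same
 * scrutiny.
 */
#include <stdio.h>

#define MAX_RULES 8

struct threat_config {
    unsigned rule_count;       /* supplied by each configuration update */
    int rules[MAX_RULES];
};

int apply_update(const struct threat_config *cfg) {
    /* This guard is the difference between "rejected update" and
     * "out-of-bounds read". Miss it in userspace and you crash a
     * process; miss it in kernelspace and you crash the machine,
     * potentially on every boot.
     */
    if (cfg->rule_count > MAX_RULES)
        return -1;             /* refuse malformed configuration */
    int sum = 0;
    for (unsigned i = 0; i < cfg->rule_count; i++)
        sum += cfg->rules[i];  /* stand-in for "scan using each rule" */
    return sum;
}

int main(void) {
    struct threat_config bad = { .rule_count = 1u << 20 };
    if (apply_update(&bad) < 0)
        puts("malformed update rejected; machine still boots");
    return 0;
}
```

Note that the code never changed; the data did, which is exactly what lets a “mere” content push skip the scrutiny new code would have received.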
Configuration bugs are a disturbingly large portion of engineering decisions which cause outages. (Citation: let’s go with “general knowledge as an informed industry observer.” As always, while I’ve previously worked at Stripe, neither Stripe nor its security team necessarily endorses things I say in my personal spaces.)
However, because this configuration bug hit very widely distributed software running in kernelspace almost universally across machines used by the workforce of lynchpin institutions throughout society (most relevantly to this column, banks, but also airlines, etc etc), it had a blast radius much, much larger than typical configuration bugs.
Have I mentioned that IT security really likes military metaphors? “Blast radius” means “given a fault or failure in system X, how far afield from X will we see negative user impact.” I struggle to recall a bug with a broader direct blast radius than the Falcon misconfiguration.
Once the misconfiguration was rolled out, fixing it was complicated by the tiny issue that a lot of the people needed to fix it couldn’t access their work systems because their machines Blue Screen of Death’ed. Remediation frequently meant booting each affected machine into a recovery environment and deleting the offending configuration file by hand, which does not scale gracefully across a fleet of tens of thousands of endpoints.
Why? Well, we put the vulnerable software on essentially all machines in a particular institution. You want to protect all the devices. That is the point of endpoint monitoring. It is literally someone’s job to figure out where the devices that aren’t endpoint monitored exist and then to bring them into compliance.
Why do we care about optimizing for endpoint monitoring coverage? Partly it is for genuinely good security reasons. But a major part of it is that small-c compliance is necessary for large-C Compliance. Your regulator will effectively demand that you do it.
Why did Falcon run in kernelspace rather than userspace?
Falcon ran in kernelspace rather than userspace in part because the most straightforward way to poke its nose into other programs’ business is to simply ignore the security guarantees that operating systems give to programs running in userspace. Poking your nose into another program’s memory is generally considered somewhere between rude and forbidden-by-very-substantial-engineering-work. However, endpoint monitoring software considers that other software running on the device may be there at the direction of the adversary. It therefore considers that software’s comfort level with its intrusion to be a distant secondary consideration.
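To give a flavor of the bargain userspace programs live under: on Linux, even reading another process’s memory means asking the kernel through a dedicated syscall, and the kernel is free to refuse. A minimal sketch (the syscall, process_vm_readv, is real; the target address is an arbitrary stand-in):

```c
/* Userspace must ask the operating system for permission to inspect
 * another process's memory, and the OS is free to say no. Run this
 * against a process you do not own and the kernel refuses the read.
 * Kernel code is the thing doing the refusing, so it can simply look.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    char buf[64];
    struct iovec local  = { .iov_base = buf, .iov_len = sizeof buf };
    /* 0x400000 is an arbitrary stand-in address; the lesson here is
     * the permission check, not the data behind it. */
    struct iovec remote = { .iov_base = (void *)0x400000, .iov_len = sizeof buf };
    if (process_vm_readv(atoi(argv[1]), &local, 1, &remote, 1, 0) < 0)
        printf("kernel said no: %s\n", strerror(errno)); /* typically EPERM */
    else
        printf("read permitted (we own or may trace that process)\n");
    return 0;
}
```

Kernelspace code sits on the other side of that permission check, which is why endpoint monitoring vendors want to live there, and also why their bugs hurt so much.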
Another reason Falcon ran in kernelspace was, as Microsoft told the WSJ, Microsoft was forbidden by an understanding with the European Commission from firmly demoting other security software developers down to userspace. This was because Microsoft both a) wrote security software and b) necessarily always had the option of writing it in kernelspace, because Microsoft controls Windows. The European Commission has pushed back against this characterization and pointed out that This Sentence Uses Cookies To Enable Essential Essay Functionality.
Regulations which strongly suggest particular software purchases
It would be an overstatement to say that the United States federal government commanded U.S. financial institutions to install CrowdStrike Falcon and thereby embed a landmine into the kernels of all their employees’ computers. Anyone saying that has no idea how banking regulation works.
Life is much more subtle than that.
The United States has many, many different banking regulators. Those regulators have some desires for their banks which rhyme heavily, and so they have banded into a club to share resources. This lets them spend their limited brainsweat budgets on things banking regulators have more individualized opinions on than simple, common banking regulatory infrastructure.
One such club is the Federal Financial Institutions Examination Council. They wrote the greatest crossover event of all time if your interests are a) mandatory supervisory evaluations of financial institutions and b) IT risk management: the FFIEC Information Technology Examination Handbook's Information Security Booklet.
The modal consumer of this document is probably not a Linux kernel programmer with a highly developed mental model of kernelspace versus userspace. That would be an unreasonable expectation for a banking supervisor. They work for a banking regulator, not a software company, doing important supervisory work, not merely implementation. Later this week they might be working on capital adequacy ratios, but for right now, they’re asking your IT team about endpoint monitoring.
The FFIEC ITEH ISB (the acronym just rolls off the tongue) is not super prescriptive about exactly what controls you, a financial institution, have to have. This is common in many regulatory environments. HIPAA, to use a contrasting example, is unusual in that it describes a control environment that you can reduce to a checklist with Required or Optional next to each of them. (HIPAA spells that second category “Addressable”, for reasons outside the scope of this essay, but which I’ll mention because I don’t want to offend other former HIPAA Compliance Officers.)
To facilitate your institution’s conversation with the examiner who drew the short straw, you will conduct a risk analysis. Well, more likely, you’ll pay a consulting firm to conduct a risk analysis. In the production function that is scaled consultancies, this means that a junior employee will open U.S. Financial Institution IT Security Risk Analysis v3-edited-final-final.docx and add important client-specific context like a) their name and b) their logo.
That document will heavily reference the ITEH, because it exists to quickly shut down the line of questioning from the examiner. If you desire a career in this field, you will phrase that as “guiding the conversation towards areas of maximum mutual interest in the cause of 'advanc[ing] the nation’s monetary, financial, and payment systems to build a stronger economy for all Americans.'” (The internal quotation is lifted from a job description at the Federal Reserve.)
Your consultants are going to, when they conduct the mandatory risk analysis, give you a shopping list. Endpoint monitoring is one item on that shopping list. Why? Ask your consultant and they’ll bill you for the answer, but you can get my opinion for free and it is worth twice what you paid for it: II.C.12 Malware Mitigation.
Does the FFIEC have a hugely prescriptive view of what you should be doing for malware monitoring? Well, no:
Management should implement defense-in-depth to protect, detect, and respond to malware. The institution can use many tools to block malware before it enters the environment and to detect it and respond if it is not blocked. Methods or systems that management should consider include the following: [12 bullet points which vary in specificity from whitelisting allowed programs to port monitoring to user education].
But your consultants will tell you that you want a very responsive answer to II.C.12 in this report and that, since you probably do not have Google’s ability to fill floors of people doing industry-leading security research, you should just buy something which says Yeah We Do That.
CrowdStrike’s sales reps will happily tell you Yeah We Do That. This web page exists as a result of a deterministic process co-owned by the Marketing and Sales departments at a B2B software company to create industry-specific “sales enablement” collateral. As a matter of fact, if you want to give CrowdStrike your email address and job title, they will even send you a document which is not titled Exact Wording To Put In Your Risk Assessment Including Which Five Objectives And Seventeen Controls Purchasing This Product Will Solve For.
CrowdStrike is not, strictly speaking, the only vendor that you could have installed on every computer you owned to make your regulators happy with you. But, due to vagaries of how enterprise software sales teams work, they sewed up an awful lot of government-adjacent industries. This was in part because they aggressively pursued writing the sort of documents you need if the people who read your project plans have national security briefs.
I’m not mocking the Federal Financial Institutions Examination Council for cosplaying as having a national security brief. (Goodness knows that that happens a lot in cybersecurity... and government generally. New York City likes to pretend it has an intelligence service, which is absolutely not a patronage program designed to have taxpayers fund indefinite foreign vacations with minimal actual job duties.)
But money is core societal infrastructure, like the power grid and transportation systems are. It would be really bad if hackers working for a foreign government could just turn off money. That would be more damaging than a conventional missile being fired at random into New York City, and we might be more constrained in responding.
And so, we ended up in a situation where we invited an advanced persistent threat into kernelspace.
It is perhaps important to point out that security professionals understand security tools to themselves introduce security vulnerabilities. Partly, the worry is that a monoculture could have a particular weakness that could be exploited in a particular way. Partly, it is that security tools (and security personnel!) frequently have more privileges than is typical, and therefore they can be directly compromised by the adversary. This observation is fractal in systems engineering: at every level of abstraction, if your control plane gets compromised, you lose. (Control plane has a specific meaning in networking but for this purpose just round it to “operating system (metaphorical) that controls your operating systems (literal).”)
CrowdStrike maintains that they do not understand it to be the case that a bad actor intentionally tried to bring down global financial infrastructure and airlines by using them as a weapon. No, CrowdStrike did that themselves, by accident, of their own volition. But this demonstrates the problem pretty clearly: if a junior employee tripping over a power cord at your company brings down computers worldwide, the bad guys have a variety of options for achieving directionally similar aims by attacking directionally similar power cords.
When money stops money-ing
I found out about the CrowdStrike vulnerability in the usual fashion: Twitter. But then my friendly local bank branch cited it (as quote the Microsoft systems issue endquote) when I was attempting to withdraw cash from the teller window.
My family purchased a duplex recently and is doing renovation prior to moving in. For complex social reasons, a thorough recitation of which would make me persona non grata across the political spectrum, engaging a sufficient number of contractors in Chicago will result in one being asked to make frequent, sizable payments in cash.
This created a minor emergency for me, because it was an other-than-minor emergency for some contractors I was working with.
Many contractors are small businesses. Many small businesses are very thinly capitalized. Many employees of small businesses are extremely dependent on receiving compensation exactly on payday and not after it. And so, while many people in Chicago were basically unaffected on that Friday because their money kept working (on mobile apps, via Venmo/Cash App, via credit cards, etc), cash-dependent people got an enormous wrench thrown into their plans.
I personally tried withdrawing cash at three financial institutions in different weight classes, and was told it was absolutely impossible (in size) at all of them, owing to the Falcon issue.
At one, I was told that I couldn’t use the tellers but could use the ATM. Unfortunately, like many customers, I was attempting to take out more cash from the ATM than I ever had before. Fortunately, their system that flags potentially fraudulent behavior will let a customer unflag themselves by responding to an instant communication from the bank. Unfortunately, the subdomain that communication directs them to runs on a server apparently protected by CrowdStrike Falcon.
It was not impossible at all financial institutions. I am aware of a few around Chicago which ran out of physical cash on hand at some branches, because all demand for cash on a Friday was serviced by them versus by “all of the financial institutions.” (As always happens during widespread disturbances in infrastructure, there quickly arises a shadow economy of information trading which redirects relatively sophisticated people to the places that are capable of servicing them. This happens through offline social networks since time immemorial and online social networks since we invented those. The first is probably more impactful but the second is more legible, so banking regulators pretend this class of issues sprang fully formed from the tech industry just in time to bring down banks last year.)
I have some knowledge of the history of comprehensive failures of financial infrastructure, and so I considered doing the traditional thing when convertibility of deposits is suspended by industry-wide issues: head to the bar.
A hopefully unnecessary disclaimer: the following is historical fact despite rhyming with stereotype.
Back in 1970, there was a widespread and sustained (six months!) strike in the Irish banking sector. Workers were unable to cash paychecks because tellers refused to work. So, as an accommodation for customers, operators of pubs would cash the checks from the till, trusting that eventually checks drawn on the accounts of local employers would be good funds again.
Some publicans even cashed personal checks, backed by the swift and terrible justice of the credit reporting bureau We Control Whether You Can Ever Enjoy A Pint With Your Friends Again. This kept physical notes circulating in the economy.
As I told my contractors, to their confusion, I was unable to simply go down to the local bar to get them cash with the banks down. I don’t have sufficient credit with the operator of the local bar, as I don’t drink.
I told them, to their even greater confusion, that I had considered going down to the parish and buying all their cash on hand with a personal check. Churches, much like bars, receive much of their weekly income through electronic payments but still do a substantial amount of cash management through the workweek heading into the weekend. I’m much more a known quantity at church than I am at the friendly neighborhood watering hole. (Also, when attempting to work around financial infrastructure bugs to get workers their wages, consider relying on counterparties with common knowledge of James 5:4.)
I eventually resolved the issue in a more boring fashion: I texted someone I reasonably assumed to have cash and asked them to bring it over.
Financial infrastructure normally functions to abstract away personal ties and replace favor-swapping with legibly-priced broadly-offered services.
Thankfully, while this outage was surprisingly deep and broad, banks were mostly back to normal on the following Monday.