“4:11,” the Northeast blackout that began on August 14, 2003, at 4:11 p.m. EDT, immediately brought to mind the events of 9/11. Although the notion was quickly dispelled, our first thoughts at 4:11 jumped to physical terrorism. Next, many of us considered cyber terrorism. After all, it had been only two days since the Blaster worm had struck around the world, and only a few months earlier, in January 2003, a computer worm had disabled a safety system in a US nuclear power plant.
Ironically, just as 911 is the emergency telephone number, 411 – now known as directory assistance – was formerly called “information” by telephone service providers. At 4:11, Internet border gateway router chatter spiked orders of magnitude above normal levels, just as it had when the SQL Slammer worm took out huge sections of the Internet a few months earlier. This time, the large numbers of route withdrawals were caused by the power outage rather than by a cyber attack. However, the cyber symptoms were virtually identical, underscoring the interdependence of the power grid and the Internet. Independent attacks or failures on either one can cause serious harm to the other. A coordinated attack on both would wreak havoc – say, a broad Slammer-like worm strike on the Internet launched simultaneously with a distributed denial-of-service attack on critical grid asset communications.
The North American power grid is both a marvel and an enigma. It encompasses 15,000-plus generators, 10,000-plus power plants, transmission and distribution lines that could encircle the globe more than ten times, and millions upon millions of networked devices. With no central control, the grid works remarkably well. Or at least it did until recently.
Was 4:11 a harbinger of the grid of the future? A grid driven more by greed than by reliability? A grid willing to sacrifice future capacity for today’s profits? An antiquated grid operating in the 21st century under 1950s design constraints? An industry unwilling to invest in the future? We can expect to see more 4:11-like events if we don’t take some action.
From what we have learned about 4:11 so far, it appears that it wasn’t caused directly by a security breach – a computer worm, virus or hacker attack. However, it is clear that multiple computer failures in the hours preceding 4:11 set off the cascade of events resulting in the largest power outage ever.
Recall that meteorologists called the storm that hit North America’s eastern seaboard in October 1991 a “perfect storm” because of the rare combination of factors that created it. The boat in that true story was armed with electronic navigational tools and signaling systems, but the sheer force of the violent seas overwhelmed them and rendered them useless.
Figure 1. Interaction of Grid, Computer and Human Events (Source: Interim Report: Causes of the August 14th Blackout in the United States and Canada, U.S.-Canada Power System Outage Task Force, November 2003)
Sound familiar? At 4:11, the northeastern portion of North America’s electrical grid encountered a “near-perfect e-storm.” It collapsed because of a rare combination of factors. System operators were armed with electronic supervisory control tools and telecommunication systems, but the sheer force of the violent power swings overwhelmed them and rendered them useless.
Figure 1, above, depicts on a timeline the juxtaposition of the power grid, computer and human events occurring just before 4:11. Taking a closer look…
At about 2:14 p.m. on August 14, FirstEnergy’s Energy Management System (EMS) lost its capability to process and to communicate alarms to system operators. Such alarms provide vital notification of power system events and out-of-acceptable-range condition measurements. FirstEnergy control center operators were unaware of the alarm processing malfunction and did not know that power network conditions were changing.
A few minutes later, a number of FirstEnergy’s substation remote terminal units (RTUs) stopped communicating with the EMS master. These failures may have resulted from buffer overflows in the master and some RTUs.
FirstEnergy’s EMS architecture utilizes multiple primary servers running various applications and one backup server able to run any of the applications.
At 2:41 p.m., the primary EMS server hosting the alarm processing function failed, probably because of the buffer overflows. At that point, the EMS performed an automatic “failover” to the backup server. The backup server continued running the stalled alarm processing function for 13 minutes until it also failed.
At 2:54 p.m., both the primary and the backup EMS servers running the alarm processing function stopped running all their applications. The EMS continued operating without these two servers, but with diminished performance. Operator screen refresh rates slowed from a few seconds to almost a minute per screen, further inhibiting the operators’ capabilities to observe what was happening to their power system.
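To picture how a failover can inherit the very condition that crippled the primary, here is a minimal sketch in Python. The queue bound, server names and alarm counts are my own illustrative assumptions, not FirstEnergy’s actual EMS design; the point is only the shape of the failure: a backup that resumes the same stalled workload fails the same way.

```python
import queue

ALARM_QUEUE_LIMIT = 1000   # hypothetical bound on buffered alarm messages


class AlarmProcessor:
    """Toy stand-in for the EMS alarm application running on one server."""

    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue(maxsize=ALARM_QUEUE_LIMIT)
        self.stalled = False

    def ingest(self, alarm):
        try:
            self.inbox.put_nowait(alarm)
        except queue.Full:
            # Once the buffer fills and nothing drains it, the application
            # stops making progress: it is "up" but effectively dead.
            self.stalled = True


def run_with_failover(primary, backup, incoming_alarms):
    """Naive failover: the backup simply resumes the same stalled workload."""
    active = primary
    for alarm in incoming_alarms:
        active.ingest(alarm)
        if active.stalled:
            if active is primary:
                active = backup                 # automatic failover...
            else:
                return "both servers down"      # ...which fails the same way
    return "alarms processed"


print(run_with_failover(AlarmProcessor("primary"), AlarmProcessor("backup"),
                        (f"alarm-{i}" for i in range(5000))))
# -> both servers down
```

When the root cause is the workload itself, an automatic failover to an identical backup only buys time – roughly the thirteen minutes FirstEnergy got between 2:41 and 2:54 p.m.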
At 3:05 p.m., a FirstEnergy 345-kV line into Cleveland contacted an overgrown tree, faulted, tripped and locked out. The loss of this 345-kV path caused the remaining three southern 345-kV lines into Cleveland to pick up more load. FirstEnergy control system operators received no notification of these major events and went about their duties unaware of what had happened.
At 3:08 p.m., FirstEnergy staff successfully rebooted the primary server. However, the alarm processing function was still stalled and the server was taken out of service again at 3:46 p.m.
At 3:32 p.m., another 345-kV line into Cleveland contacted a tree, faulted, tripped and locked out. Loading on the remaining two 345-kV lines increased again. Once again, because of their failed EMS alarming capability, FirstEnergy control system operators were unaware of what had happened.
At 3:41 p.m., FirstEnergy operators watched the lights flicker as the control center lost line power and automatically switched to an emergency backup power source.
At 3:42 p.m., an alert FirstEnergy system operator concluded that the EMS alarm system had malfunctioned and began to take independent action.
Because of the alarm and RTU communication failures, FirstEnergy control center operators did not know they were losing significant portions of their power system. By the time they began to assimilate and respond to external reports of the true power system condition, it was too late to stop the cascading sequence of events leading up to the 4:11 blackout.
Although most folks outside the electric power industry — at least the ones I talk to — believe 4:11, the largest blackout in history, was caused by a couple of overgrown trees in Ohio, it’s clear that computer failures exacerbated the physical events. Had the computers not malfunctioned, FirstEnergy control center system operators would have been able to react appropriately and prevent the blackout cascade.
FirstEnergy’s EMS servers apparently suffered no malicious attacks on August 14. Had they been attacked and had they succumbed, 4:11 would have been much worse.
To reduce risks to the reliability of the bulk electric systems from any compromise of critical cyber assets (computers, software and communication networks) that support those systems, the North American Electric Reliability Council (NERC) membership voted last June to adopt Urgent Action Standard 1200 – Cybersecurity. NERC’s Board of Trustees adopted the standard in August.
Initially, control areas and reliability coordinators are required to complete cybersecurity self-assessments in early 2004. Ultimately, the permanent version of the standard, currently under development, will require every entity performing a reliability authority, balancing authority, interchange authority, transmission service provider, transmission operator, generator, or load-serving entity function to create and maintain a cybersecurity policy for its specific implementation of the standard. NERC anticipates that full compliance requirements will take effect in early 2005.
NERC’s Cybersecurity Standard 1200 transcends existing checklists and guidelines, requiring North American electric utilities to plan and implement specific security programs, with sanctions and financial penalties for noncompliance. The standard mandates specific NERC due diligence security reporting requirements, authorized by an officer of the entity, with random spot checks to monitor compliance. It also includes a whistleblower provision for reporting entities that appear to disregard the standard.
Under the new standard, utilities must identify and protect their critical cyber assets – certain computer hardware, software, networks and databases, including control center systems – contained within defined cybersecurity perimeters. Perimeter protection must include physical security of the cyber assets and electronic security of any communications crossing that perimeter.
The latest version of the permanent standard goes on to provide more detail concerning what constitutes a minimum list of critical cyber assets. It specifically includes those cyber assets providing telemetry, SCADA, automatic generation control (AGC), load shedding, black start, real-time power system modeling, substation automation control and real-time inter-utility data exchange.
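As a thought experiment – not anything the draft prescribes – here is how a utility might start tagging its inventory against that minimum list. Every asset name and function assignment below is made up for illustration:

```python
# Minimum critical functions named in the draft permanent standard.
CRITICAL_FUNCTIONS = {
    "telemetry", "SCADA", "AGC", "load shedding", "black start",
    "real-time modeling", "substation automation", "inter-utility data exchange",
}

# Hypothetical asset inventory: asset name -> functions it supports.
inventory = {
    "ems-primary-01": {"SCADA", "AGC", "real-time modeling"},
    "historian-02":   {"after-the-fact reporting"},
    "rtu-gateway-07": {"telemetry", "substation automation"},
}

# Any asset supporting at least one critical function belongs inside
# a defined cybersecurity perimeter.
critical_assets = {name for name, functions in inventory.items()
                   if functions & CRITICAL_FUNCTIONS}

print(sorted(critical_assets))   # ['ems-primary-01', 'rtu-gateway-07']
```

The real exercise is far messier, of course; the point is that the standard expects an explicit, defensible inventory rather than a vague sense of which boxes matter.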
The standard even extends compliance requirements to non-critical cyber assets that share a network with critical assets inside a cybersecurity perimeter. The permanent standard draft also provides some metrics for data communications between critical cyber assets. Each such data communication stream must provide 99.5% availability over the period of a year, regardless of the communications technology employed: leased-line or dial-up telephone, point-to-multipoint or spread-spectrum radio, microwave, fiber optics, etc.
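It is worth doing the arithmetic on what 99.5% availability over a year actually permits. The percentage and the annual basis come straight from the draft; the rest is simple math:

```python
HOURS_PER_YEAR = 365 * 24            # 8,760 hours
required_availability = 0.995        # 99.5% figure from the draft standard

downtime_budget_hours = HOURS_PER_YEAR * (1 - required_availability)
print(f"Outage budget: {downtime_budget_hours:.1f} hours per year "
      f"({downtime_budget_hours / 24:.1f} days)")
# Outage budget: 43.8 hours per year (1.8 days)
```

That is less than two days of cumulative downtime a year for a channel serving a critical cyber asset – a demanding target for some of the dial-up and radio links the standard contemplates.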
Most importantly, the standard calls for all such data communications conducted over shared public network resources to be encrypted, with appropriate confidentiality, integrity, authentication and (in some cases) non-repudiation functionality. What does that mean? In simple terms…
- Confidentiality means that nobody who ought not to see the data can see it.
- Integrity means that no portion of the data goes missing or gets replaced by bogus data.
- Authentication means verifying that a sender is, in fact, who they claim to be.
- Non-repudiation means that if someone sends or alters the data, they cannot deny having done so later.
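To make those four terms concrete, here is a minimal Python sketch using the widely available `cryptography` package. Authenticated encryption with AES-GCM covers confidentiality, integrity and shared-key authentication in one operation; true non-repudiation requires an asymmetric digital signature, which I have only noted in a comment. The payload and key handling are illustrative assumptions, not a prescription from the standard:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)    # secret shared by the two endpoints
aesgcm = AESGCM(key)

nonce = os.urandom(12)                       # must be unique for every message
telemetry = b"BUS_A voltage=345.2kV breaker=CLOSED"   # hypothetical SCADA payload

# Confidentiality: the payload is unreadable without the key.
# Integrity and authentication: a built-in tag detects any tampering and
# proves the message came from a holder of the shared key.
ciphertext = aesgcm.encrypt(nonce, telemetry, b"rtu-17")

# Decryption raises InvalidTag if even one bit has been altered in transit.
assert aesgcm.decrypt(nonce, ciphertext, b"rtu-17") == telemetry

# Non-repudiation would additionally require signing the message with one
# party's private key (e.g., RSA or ECDSA), so the sender cannot deny it later.
```

Note that a shared key authenticates a channel, not an individual, which is one reason non-repudiation takes the extra step of a digital signature.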
Given the state of utility communications today, these are some pretty serious requirements and we all need to take them seriously. Exercising due diligence, every entity potentially covered by the standard should, at the very least, begin setting up an internal cybersecurity task force right away.
The first step is to identify a high-level internal cybersecurity advocate – someone who is well respected, above reproach and at a high enough level in your organization to command authority. This isn’t as easy as it sounds. Many organizations assign top level cybersecurity responsibility to the head of information technology (IT). I’ve seen situations in some of the largest utilities where that IT head had little understanding of the real-time EMS and SCADA systems. On the other hand, I’ve also seen organizations assign all security functions, including cybersecurity, to the head of physical security – the person in charge of the gates, guards and guns.
Either of these two extremes could lead to cybersecurity standard compliance failure. It’s important to assemble a balanced, multifunctional team within your organization – one thoroughly knowledgeable both about IT and about operational cyber assets (don’t forget telecommunications paths and remote sites). In some organizations, this can prove difficult because of internal hierarchies and balances of power. In that case, you may need neutral outside help.
In response to what I’ve said above, you may be saying, “We don’t have a budget for cybersecurity. Who’s going to pay for this?” The answer is that we all are – one way or another.