Winning Systems & Security Practitioners 5. Resilience

1450 words, 5-12 minutes.


“In defeat: Defiance” - Winston S. Churchill.

This is part 5 of 6 in a short series of posts on winning systems for Information Security practitioners. It aims to plug the gap between policy and products and put you, the practitioner, back in the driving seat. After all, if you don’t know what system you’re implementing, how can you decide what products or features are important to you? How can you evaluate what they might be worth?

Congratulations. You’re prepared. You’ve turned the odds around through rigorous application of default-deny. You’ve implemented responsiveness. When assets are under attack they automatically block/ban unwelcome connections. You’ve taken care to select the most robust software for high-risk and exposed services. Now it’s time to think the unthinkable. What happens when, after all that, the attacker somehow defeats you? What happens when robust wasn’t robust enough?

Resilience is the ability to carry on and return to a good state after a period of stress which caused failure or damage. An organism which is resilient might suffer the loss of an appendage, or may elect to sacrifice some part of itself in order to continue living so that it may return to health in the near future. In Information Security, resilience is about anticipating failure, containing the negative effects of it, and being able to resume normal service at a low cost in time and effort.

Resilience is a system like all the others discussed in this series. It’s a continuous process you commit to and cycle around. It’s not something you do once then close the lid on. The system of resilience I’m talking about here is:

  1. Consider points of likely failure.
  2. Consider modes of failure.
  3. Accept that failure will happen sooner or later.
  4. Formulate ways of containing the failure.
  5. Formulate ways of quickly recovering from it.
  6. Incorporate lessons learned for the next cycle.
  7. Repeat.

If you’ve implemented the other systems discussed previously, failures will be very rare. If you skipped my advice or failed to implement it fully, you’ll be having your resilience tested quite regularly. In order to help you think practically about the first 3 steps above, consider these scenarios:

  • What happens if you get a wrong netmask/IP address on a Firewall rule?
  • What happens if you delete a Firewall rule or 2?
  • What happens if you get a VLAN or an interface/virtual interface wrong?
  • What happens when your AV encounters a new virus or zero day threat?
  • What if a hacker obtains a password/login?
  • What if attackers execute arbitrary code on a public facing system?
  • What if they obtain a company laptop?
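
To make the first of these scenarios concrete, here is a minimal sketch using Python's standard `ipaddress` module. The addresses and the helper name `exposure` are hypothetical, chosen only for illustration. It shows how far a single slip in a prefix length widens a rule.

```python
import ipaddress

def exposure(intended: str, actual: str) -> int:
    """Count the extra addresses the rule as typed admits beyond the
    rule as intended (hypothetical helper, illustration only)."""
    want = ipaddress.ip_network(intended)
    # strict=False accepts host bits being set, as a hurried edit might leave them
    got = ipaddress.ip_network(actual, strict=False)
    if not want.subnet_of(got):
        raise ValueError("the rule as typed does not cover the intended network")
    return got.num_addresses - want.num_addresses

# Intended: allow a single /24. Typed by mistake: a /16.
extra = exposure("10.1.2.0/24", "10.1.2.0/16")
print(f"{extra} addresses allowed that should not be")
```

A check like this could run before a rule change is committed, which is one way of treating a fat-fingered netmask as an anticipated mode of failure rather than a surprise.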

Too easy? Now the difficult part.

  • What happens when the bad guy is inside your browser?
  • What happens when the bad guy is inside your endpoint?
  • What happens when the bad guy is inside your building?
  • What happens when the bad guy is inside your vendor?
  • What happens when the bad guy is an employee?
  • What happens when the bad guy is on your SysAdmin team?
  • What happens when you are the bad guy?

By the way, this is what peak paranoia feels like.

Compartmentalisation & Containment
Operating Systems

Initially your primary concern is containing a breach. Fortunately we have an embarrassment of technological riches with which to do this. The exact choice of which technologies to use will depend on individual trade-offs between cost, complexity, security, and many other factors. Taking just an OS-centric view we have:

  • User/Group/Role separation including Mandatory Access Control.
  • Write-only or read-only filesystems for critical data such as transactions or configuration files.
  • Chroot environments for simple applications, components of apps, or scripts.
  • Jails, Zones, or Containers for more comprehensive isolation.
  • Virtual Machines.
  • Logical Partitions (LPARs).
  • Physical Separation (e.g. servers, blades, microservers).

Let us assume our attacker exploits a vulnerability in both our web application and the web server executing that app. Even if he finds himself with an open command shell it would be for a non-privileged, single-role user (probably the web server user for that particular application or microsite). He’d be looking at a largely read-only filesystem. He’d be inside a Jail or Container. Even if he escaped the Jail into the host VM, he’d have no raw access to the network for snooping. He’d have to subvert the hypervisor. Perhaps the hypervisor is inside a logical partition. He is looking at a chain of 6 or 7 exploits just to get to a point where he can sniff some network traffic and collect credentials. Of course we’re assuming the network has something useful to show him. Speaking of which…
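
A rough way to see why a chain of exploits is so much harder than a single one is to multiply the odds. The sketch below assumes each layer fails independently and uses made-up probabilities. Neither assumption comes from measurement, and real layers are rarely fully independent, so treat it as arithmetic rather than a risk model.

```python
import math
from typing import Sequence

def chain_success(per_layer: Sequence[float]) -> float:
    """Probability that every layer in the chain is defeated,
    assuming the layers fail independently of one another."""
    return math.prod(per_layer)

# Illustrative figures only: a generous 50% chance of beating each layer.
one_layer = chain_success([0.5])
seven_layers = chain_success([0.5] * 7)
print(one_layer, seven_layers)
```

Even with a coin-flip chance at every step, seven stacked layers leave the attacker with under a 1% chance of getting all the way through.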

Compartmentalisation & Containment
Networks

At the network layer we have a number of potential technologies to prevent this kind of compromise spreading. These technologies make it less likely that other systems can be exploited, or useful credentials or valuable traffic can be gathered by the attacker once he gets to this point.

  • Soft isolation (VPN tunnels, SSL, SSH, simple routing).
  • VLANs.
  • MPLS VRFs.
  • Data diodes.
  • Physically separate paths/networks.

Don’t forget, we already implemented default deny, responsive blocking, and where possible have selected robust software/services. Our attacker just broke out of a locked box, escaped his cell, and climbed the wall of the prison, only to find himself in the middle of a desert. His packet sniffer now confirms that desert is on Mars.

Fast Recovery

Today we’re lucky: we have many possible fast-recovery mechanisms. We aren’t limited to traditional high-effort, high-cost backup and restore. Most recovery actions can be triggered automatically should the target fail a regular integrity check. Spot a server or desktop process that isn’t one of the 30 expected OS components or the 5 applications? Shoot that instance in the head and re-provision it. What’s important is that the integrity of the replacement is cryptographically proven. Where automatic integrity checks indicate just one Zone, Container, or Jail is suspicious, rebuild or reboot that single element from an immutable source.

  • Snapshots.
  • Immutable containers.
  • Immutable VM templates.
  • Netboot-style provisioning from “gold” images.
  • Fast provisioning using automation frameworks.
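
The check-then-rebuild loop described above can be sketched in a few lines. Everything named here (`unexpected_processes`, `image_is_trusted`, `decide`, and the three outcome strings) is hypothetical. A plain SHA-256 comparison stands in for whatever proof of integrity you actually use, and it assumes the known-good digest is stored somewhere the attacker cannot reach.

```python
import hashlib
import hmac

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def image_is_trusted(image: bytes, expected_digest: str) -> bool:
    # Compare against a digest recorded when the gold image was built.
    return hmac.compare_digest(sha256_of(image), expected_digest)

def unexpected_processes(running, allowed):
    """Anything running that is not on the allow-list, sorted for stable output."""
    return sorted(set(running) - set(allowed))

def decide(running, allowed, image, expected_digest) -> str:
    if not unexpected_processes(running, allowed):
        return "ok"
    if image_is_trusted(image, expected_digest):
        return "reprovision"  # replace the instance from the verified image
    return "halt"             # never rebuild from an image that fails the check

gold = b"contents of a gold image"   # placeholder bytes, not a real image
digest = sha256_of(gold)
print(decide(["sshd", "nginx", "nc"], {"sshd", "nginx"}, gold, digest))
```

The third outcome matters as much as the second: if the replacement itself cannot be verified, automatic recovery should stop and ask for a human.
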

Incorporating Lessons-Learned

Before we can incorporate lessons into the next cycle of our system, we need first to have learned them. We can’t learn anything unless it’s written down. Accurately. This means secure log files. It means revision control. It means configuration management. It means integrity checking. It means having a record of what your IT is doing, ready for the day you need to examine that record.

  • Enabling the most detailed logging you can sustain.
  • Enabling enhanced accounting on exposed systems.
  • Keeping those logs somewhere secure.
  • Ensuring their integrity, cryptographically.
  • Logging from servers, desktops, networks, applications.
  • Having a means of searching/sorting/replaying them.
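
One common way to make log integrity checkable is a hash chain, where each entry's digest covers the previous entry's digest. The sketch below is a minimal illustration, not a product recommendation. The names `GENESIS`, `append` and `verify` are my own, and it assumes log messages are plain strings.

```python
import hashlib

GENESIS = "0" * 64  # assumed starting value for the chain

def _link(prev_hash: str, message: str) -> str:
    data = (prev_hash + "\n" + message).encode("utf-8")
    return hashlib.sha256(data).hexdigest()

def append(log: list, message: str) -> None:
    """Add an entry whose digest depends on every entry before it."""
    prev = log[-1][1] if log else GENESIS
    log.append((message, _link(prev, message)))

def verify(log: list) -> int:
    """Return the index of the first entry that fails the check, or -1 if intact."""
    prev = GENESIS
    for i, (message, digest) in enumerate(log):
        if _link(prev, message) != digest:
            return i
        prev = digest
    return -1

log = []
for line in ["login alice", "sudo alice", "logout alice"]:
    append(log, line)
print(verify(log))
```

A chain like this only shows that entries were altered after the fact. An attacker who can rewrite the whole file can recompute it, which is why the latest digest also needs to be kept somewhere secure, away from the system that produced it.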

Is now a good time to remind you that synchronised enterprise-wide time is a must?

This is the one system out of all systems where having some skills is important. At the very least, you’ll need to have a record of what happened to show someone else skilled, so they can figure out what went wrong. That person needs to be able to spot the first signs of a successful intrusion by an attacker. They’ll need to know when your last good snapshot was before that system or sub-system was compromised. They’ll need to know what the entry-point was before it can be closed. They’ll need to factor that closure into your updated resiliency cycle. Maybe you need a configuration change, an upgrade, a patch, or replacement of fragile software. Maybe there is a product or service which would have prevented this breach. Your logs and packet traces may even help vendors get a fix out faster if one is needed.

The point is, in the unlikely event there is a breach, it will be strictly contained and the environment sterilised.

Resilience is a continuous process, like all the other systems. Without having to face one successful attack, there’s still more than enough “what ifs?” here to keep you busy thinking about resilience for a long time. If that sounds daunting then remember this: It beats fire fighting. It beats working on the weekend, or through the night. It’s much easier than having to deliver bad news to executives, shareholders, and customers. Using these systems puts you back in control of your workload. By now you can see that it’s really systems that will keep you safe, not products, nor skills, but winning systems around which you can iterate.

If you still need convincing the systems approach is right, give me one last shot.

If you’re pretty sure it’s a bad idea then my final post will remove any doubt.

Finally, if you’re on-board 100%, I can help you convince everyone else. Read this.

Nick Hutton

Engineer, Investor, Founder, Product Manager

London, England