14 AUGUST 2023

How we made an IoT controller Cyber Immune

Controllers are installed at the border of the physical and digital worlds, so they must be protected from cyberattacks. Here is how we provided that protection.

The use of internet of things (IoT) technologies in urban environments optimizes and automates urban processes, making cities safer, more efficient and more comfortable places to live. IoT programmable logic controllers (PLCs) can collect information from individual sensors, connected engineering systems and other equipment, and send it to cloud platforms and applications for further analytics.

Because they form the boundary between the physical and digital worlds, such devices are also a cybersecurity bottleneck. By hacking them, cybercriminals can directly affect processes in the physical world, as well as access or modify sensitive data. In this way, they move from the IT segment into the OT (operational technology) segment.

We took a typical PLC for a smart city and, together with the manufacturer (ISS), made it Cyber Immune, that is, guaranteed to achieve the set security goals, even under attack. This was a major step toward realizing the potential of the smart city concept, while avoiding the associated critical cyber-risks. This post explains exactly what we did and why, and how it helped to eliminate the cyber-risks.

What we used as the basis

For the hardware part, we took a fairly typical PLC with an ARM9 processor, which looked as follows:

The PLC communicated with top-level cloud platforms and applications via Ethernet, and with field devices via RS-485.
The PLC was Debian-based and designed to handle the main operation scenarios of the device:

Communication with the cloud to receive commands and send telemetry
Processing commands received from the cloud and, on their basis, generating control signals for execution units
Controlling execution units and getting readings from them
Various device update and configuration scenarios for administrators

To implement these use cases, the PLC architecture was kept quite minimal. It consisted of several components, the main one being a Mosquitto message broker for sending and receiving data via the MQTT protocol. This component formed the heart of the first build, since it channeled all information flows:

Control data – from the cloud to the device via the TLS protocol
Administrative data – between device components (components subscribed to events and exchanged messages for workflow management)
Field data – between the device and the sensors (the conversion of the high-level MQTT protocol into signals was handled by several drivers)

Schematically, the architecture can be represented as follows:

PLC security issues

Looking at the PLC from a security perspective, we can identify a number of fundamental problems:

Mixing of data flows. All flows are processed within a single component: the MQTT broker. If one of the data flows is compromised, say, as a result of CVE exploitation in a component, cybercriminals can develop an attack on the broker and gain control of the entire system.
For instance, a vulnerability in the network stack (such as the Linux RCE vulnerability CVE-2019-11815 or similar) can be exploited to execute arbitrary code on the device.
Large broker codebase. Any code contains bugs, some of which lead to vulnerabilities, and the more code a component has, the more likely vulnerabilities become. Eliminating them from the broker entirely is a tough task: its codebase is huge, and because it is an open-source component, every update must also pass numerous checks.
For example, an attacker can carry out a Man-in-the-Middle attack by forcing the victim to connect to a fake MQTT server and exploit a vulnerability in the MQTT protocol implementation. An example of such a vulnerability is CVE-2017-2892.
Potential for privilege escalation. By exploiting vulnerabilities in the network stack (for example, that same Linux RCE vulnerability CVE-2019-11815) or in an external network driver, cybercriminals can escalate privileges and gain control over the device.
Need for precise configuration. Since the TLS protocol was implemented by the broker, a malfunction or wrong configuration could have led either to errors in the protocol implementation (all the way up to lack of encryption or certificate verification) or to vulnerabilities that likewise would have allowed privilege escalation in the system.
Lack of protection against field-level attacks. Cybercriminals can replace a field-level execution unit with their own and generate an arbitrary signal that is incorrectly processed by a driver or business logic component, which can compromise the device.
For example, they could hack a device connected to the RS-485 port and thus gain access to the bus (most likely remotely). Next, they could exploit a vulnerability in the Modbus RTU protocol, allowing them to remotely execute arbitrary code in the PLC.

In practice, the above issues greatly increase the information security risks, as they make an incident not only more likely but also more damaging. The burden on the operations team also increases: if the vendor releases a patch for a vulnerability, the administrator must respond immediately and update the device.

What did we do?

Our task was to make the solution Cyber Immune. That meant designing the system so that the properties we identified as important were maintained under all circumstances, even in the event of a cyberattack. To do so, we had to address the fundamental issues outlined above.

The first thing we did was to define what security meant in our particular case by posing the question: What are we protecting the PLC from? That is, we formulated the security goals and assumptions. The security goals were those invariant properties which, if achieved, would ensure the safe operation of the system in any possible use case, factoring in the security assumptions. The security assumptions consisted of additional constraints imposed on the operating conditions of the system.

Next, we designed the PLC architecture so that the stated security goals would be met without fail, taking into account the security assumptions. To achieve this, we modified the architecture of the original PLC.

Above all, it was necessary to bulletproof the system not only at the application code level, but also at the deeper level of the operating system. For this reason, our first step in designing the architecture was to migrate the solution from Debian to KasperskyOS, which meets the requirements for out-of-the-box Cyber Immunity.

That done, we adjusted the architecture of the application part of the PLC, eliminating the fundamental architectural problems discussed above: we isolated the MQTT broker, separated the upstream and downstream data flows, and improved administration security.

The result was a Cyber Immune system guaranteed to achieve the security goals, as we have verified through multiple tests and checks.

Now let’s review the steps we went through in order.

Step 1. Security goals and assumptions

The Cyber Immune approach is based on the Secure by Design philosophy, which states that the security of a system is not an additional requirement, but an integral part of its design. That is, security must be built in at the design stage.

At the same time, Secure by Design in its purest form says little about the specifics of how to achieve this.

This is where Cyber Immunity comes in, which adds flesh to the bones:

Methodology – how exactly to organize the process and the desired results/artifacts of each stage
Design requirements – how exactly to design the system to implement Secure by Design in a cost-effective way

Putting the Cyber Immunity approach into practice, the first step was to formulate an answer to the question: What are we protecting the PLC from? That is, to explicitly describe the PLC values and potential troubles unacceptable for us – at the level of the product’s business purpose.

We realized that we needed:

Guaranteed secure interaction with the cloud platform while maintaining the integrity and confidentiality of transmitted data under any circumstances
Commands received from the cloud platform to be correctly converted into signals (integrity, authenticity) and routed to the field level
All system information (configuration, certificates, login credentials, device ID) to be securely stored and processed (again: integrity, authenticity)
The event log and the security log of the device to be accumulated and stored with integrity, authenticity and failsafe operation
Secure administration of the device

These were the things that we decided must be protected, and protected from any potential problem rather than from specific vulnerabilities or attack scenarios. We proceeded from the premise that “everything that can be hacked will be hacked.” That is, any violation of the aspects outlined above is unacceptable for us, whether it stems from a known vulnerability or from a zero-day that appears later.

Defining what’s important for us underpins our understanding of what system security actually is. After all, there is no such thing as “general security.” Security is always specific to each product, depending on its business purpose and context of use. Aiming at general security means trying to make the system invulnerable by protecting every single component from every single threat (which is impossible, not least due to limited resources).

Instead, the Cyber Immune approach introduces security goals and assumptions. With this in mind, we documented the above-described values in more formal language in the form of security goals.

In summary, the security goals for the PLC look as follows:

The solution provides secure (integrity, confidentiality) data transfer between the PLC and the cloud platform.
The solution delivers secure (integrity, authenticity) storage and processing of configuration information (configuration files, commands) as well as identification/authentication information (TLS certificates, login credentials, device ID).
The solution delivers secure (integrity, authenticity) conversion and transmission of control/measurement information (field device readings) and command information (control commands) to the PLC.
The solution delivers accumulation and secure (integrity, authenticity, failsafe operation) storage of the event log (system time changes; confirmation of receipt and execution status of telecontrol commands; connection/disconnection to/from the cloud platform; configuration changes) and of the security log (access made during local diagnostics; content of actions during local diagnostics).
The solution delivers secure PLC diagnostics (retrieval of information with no possibility of configuration) for users who used certificate-based authorization to establish a secure channel.

In defining the security assumptions, we set a number of restrictions. In particular, we defined and identified:

• Constraints on physical access to the PLC
• Potential actions by internal and external intruders
• Trusted configuration sources
• Threats that we do not provide protection against (for example, we accepted that we do not protect against anthropogenic threats)

Step 2. Porting the solution to KasperskyOS

The next important step was to design the system so as to deliver the set security goals, taking into account the security assumptions. To achieve this, we modified the architecture of the original PLC. Let’s take a look at what we did and why.

Trust in the system level of the solution is essential – without it all further efforts to build a secure system as a whole will be futile. So, the first design step was to port the solution from the original Debian operating system to KasperskyOS, which meets the requirements for out-of-the-box Cyber Immunity. Due to support for most POSIX calls and the large variety of basic components in the SDK (for example, network and file subsystems), porting did not entail a complete rewrite of the entire code.

The main task was to switch the administrative data flow (the way components interact within the solution) from the MQTT protocol to IPC (interprocess communication – the core mechanism of interaction within KasperskyOS). This solved several problems at once:

Vulnerabilities in individual components cannot automatically propagate an attack to other components and the operating system kernel. This is because each component of the solution, drivers included, runs as a separate entity in an isolated address space, in user mode.
It is extremely difficult to develop an attack if a component is compromised. Each interaction between entities is controlled by a security monitor in line with the security policies laid down by the architect at the design stage. The policies describe all allowed interactions between components, as well as the data transmitted through them. If these rules are violated (for example, if a component is compromised and an attacker tries to develop an attack through it), the requests are rejected by the operating system kernel, and a corresponding entry appears in the audit event log.
The system configuration cannot be changed during operation. The system prohibits running new entities, modifying relationships and changing security policies – the operating system enforces this automatically.

Accordingly, the NEAT principle was observed:

• Non-bypassable – protection mechanisms cannot be bypassed
• Evaluatable – all security policies are stored inside the monitor, able to be easily analyzed and verified as correct and sufficient
• Always-invoked – security policies are applied to any interaction in the system
• Tamperproof – the system cannot be modified during operation
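
To make the default-deny behavior of a security monitor more concrete, here is a minimal Python sketch. It is a conceptual model only, not KasperskyOS policy syntax or its API; the entity and method names (TlsTerminator, MqttBroker, CommandHandler, FieldDriver and so on) are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    """One allowed interaction: source entity -> destination entity, by method."""
    source: str
    destination: str
    method: str


@dataclass
class SecurityMonitor:
    """Toy default-deny monitor: anything not explicitly allowed is rejected."""
    allowed: frozenset
    audit_log: list = field(default_factory=list)

    def check(self, source: str, destination: str, method: str) -> bool:
        verdict = Rule(source, destination, method) in self.allowed
        if not verdict:
            # Denied requests are recorded, similar to an audit event log entry.
            self.audit_log.append(f"DENY {source} -> {destination}.{method}")
        return verdict


# Hypothetical policy: the rule set is fixed at design time and never extended at runtime.
POLICY = frozenset({
    Rule("TlsTerminator", "MqttBroker", "deliver"),
    Rule("MqttBroker", "CommandHandler", "publish"),
    Rule("CommandHandler", "FieldDriver", "write_signal"),
})

monitor = SecurityMonitor(allowed=POLICY)
```

In this model, a compromised broker that tries to call the field driver directly is refused, because only the command handler has a rule for that interaction. The rules are also directional: the reverse call from the driver to the command handler is denied unless it is listed separately.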

Step 3. Isolating the broker

After protecting the system layer, we also needed to secure the solution’s application layer. This began by analyzing the central component of the whole solution: the MQTT broker.

The MQTT broker has the largest attack surface (due to its large codebase and interaction with the external network), and its correct operation underpins the whole solution. This component also implements a TLS connection – the main mechanism for secure interaction with the cloud.

What’s more, the MQTT protocol contains almost no security features: the MQTT 5.0 specification contains only general recommendations for the developer. Above all, this jeopardizes the security goal for communication with the cloud platform. But a compromised broker is also a stepping stone for attacks on other system components and, thus, a threat to the other security goals.

To increase broker security and simplify the procedure for verifying the correctness of the TLS connection, there was no need to invent ad hoc protection; instead, we used one of the many KasperskyOS design patterns: TLS Terminator.

Design patterns are replicable architectural constructs that address common design issues within some frequently occurring context.

The TLS Terminator pattern provides for the use of a separate component to establish the TLS connection, which:

Delivers simplicity and transparency – TLS Terminator provides a network socket at the operating system level, so the MQTT broker sees no difference between a TLS-protected connection and a regular network connection, which lowers the broker configuration requirements.
Reduces the attack surface – the broker can connect only through TLS Terminator; the security policies prohibit unencrypted connections, so the only object the broker can interact with is the cloud.
Simplifies verification – the TLS Terminator component is based on the Mbed TLS library, which has a much smaller codebase than the popular OpenSSL. More details can be found in the report by our colleague Daria Zimarina.

The result is the following communication pattern:

That is:

The broker no longer interacts with the external network directly
A direct attack on the broker from the external network is rendered impossible
Only authorized subscribers can access the broker

On top of that, we maintain strict control over all interactions with TLS Terminator as per the security policies, which have been flexibly configured specifically for this component.

As a result of the work carried out, the attack surface of the MQTT broker seen by an intruder in the external network was drastically reduced, and verification of the TLS connection mechanism was simplified.
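
One reason a dedicated TLS component is easier to verify is that all the security-relevant settings sit in one small place. The Python sketch below illustrates that idea for the outbound (client) side. It uses Python's standard ssl module rather than Mbed TLS, so it is an analogy and not the actual TLS Terminator code; the function name build_cloud_tls_context and its ca_file parameter are assumptions made for the example.

```python
import ssl


def build_cloud_tls_context(ca_file=None):
    """Return a client-side TLS context that refuses unverified peers."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    # Refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # Fail early if a later change weakens the verification settings.
    if context.verify_mode != ssl.CERT_REQUIRED or not context.check_hostname:
        raise RuntimeError("TLS verification must stay enabled")
    return context


context = build_cloud_tls_context()
```

With the settings concentrated like this, a reviewer can confirm in a few lines that encryption and certificate verification cannot be switched off by accident, which is much harder when the same options are spread across a large broker configuration.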

Step 4. Separating the data flows

The next step was to separate the flows of data coming from the cloud (control data flow) and telemetry data collected from field-level execution units (field data flow).

This was accomplished as follows:

The business logic components were divided into three groups:
a) common – responsible for device operation (logging, network stack, configuration, etc.);
b) cloud-interacting – responsible for generating commands based on data received from the cloud, including common components;
c) field level-interacting – responsible for converting data received from execution units into messages to be passed to the cloud.
The MQTT broker component was run in two instances in separate, isolated entities. The first instance communicated exclusively with the cloud and common business logic components. The second instance interacted solely with field-level business logic components.
In line with the security policy, the system was configured so that components communicating with the cloud could only transfer data to field-level device drivers in unidirectional mode.
The same unidirectional data transfer rule was configured between field-level business logic components and the network stack.
The logic for checking data coming from field-level devices was placed in a separate component.

The outcome was the following communication pattern:

As a result of the work carried out, the device was protected from attacks by an intruder in the internal network.
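
As an illustration of what a separate component for checking field-level data might do, the Python sketch below validates the framing of a Modbus RTU message before anything else touches it. This is an assumption-laden example: the post does not describe the actual checks, and the function names are hypothetical. The CRC itself is the standard CRC-16/MODBUS (initial value 0xFFFF, reflected polynomial 0xA001, transmitted low byte first).

```python
def crc16_modbus(data: bytes) -> int:
    """CRC-16/MODBUS: initial value 0xFFFF, reflected polynomial 0xA001."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def is_valid_rtu_frame(frame: bytes) -> bool:
    """Accept a frame only if its size is plausible and its CRC matches."""
    # A Modbus RTU frame is 4 to 256 bytes: address, function, data, 2-byte CRC.
    if not 4 <= len(frame) <= 256:
        return False
    payload, received = frame[:-2], frame[-2:]
    # The CRC is transmitted low byte first.
    return crc16_modbus(payload).to_bytes(2, "little") == received


def with_crc(payload: bytes) -> bytes:
    """Helper for the example: append the CRC to a payload."""
    return payload + crc16_modbus(payload).to_bytes(2, "little")


# Example request: slave 1, function 3, start address 0, 10 registers.
frame = with_crc(bytes([0x01, 0x03, 0x00, 0x00, 0x00, 0x0A]))
```

A CRC check only catches malformed or corrupted frames. It does not authenticate the sender, so a real validator would also need to check function codes, address ranges and value limits against what the connected equipment is expected to send.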

Step 5. Enhancing administration security

One of the customer’s main requirements for the smart controller was usability. For this purpose, a web server was deployed on the device, providing the administrator with a wide range of options. However, from a security perspective, the web server, like the MQTT broker, is a highly vulnerable component that requires protection.

A particularly important security goal for the PLC is to ensure secure administration of the device. But how is this administration organized? For example, when diagnosing the device status, an authorized user sends a request to the web server. The web server, in turn, forwards a diagnostic request to the components to share their status and then sends the result back to the user. When updating configurations, the web server passes the dataset to the component responsible for validating and installing the update package. So, by hacking the web server, which has a large attack surface, an attacker can cause damage to other components of the solution.

By isolating the web server in a separate component, we solved multiple issues related to privilege escalation, but the question of administrator authorization remained. Certificate-based authorization is the most reliable option, but implementing this mechanism in the web server itself gives rise to the same issues as when using TLS in the MQTT broker.

To solve this problem, we again used the TLS Terminator pattern, this time in the server version rather than the client version (which establishes a connection at the request of the component).

The server version of TLS Terminator (also based on Mbed TLS) makes it possible to deploy a server available for connection from an external network using the TLS protocol, and then notify the hooked-up component that a connection has appeared. As a result, the most important operations related to certificate-based authorization are moved to a separate, small entity with a minimal attack surface.
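
The server-side idea can be sketched in the same hedged way: a small piece of code whose only job is to insist on a client certificate before any connection reaches the web server. The example uses Python's standard ssl module, not Mbed TLS, and the function name and file parameters are hypothetical; in a real deployment the certificate, key and CA paths would have to be supplied.

```python
import ssl


def build_admin_tls_context(cert_file=None, key_file=None, client_ca_file=None):
    """Server-side TLS context that rejects clients without a valid certificate."""
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # Require every administrator to present a certificate during the handshake.
    context.verify_mode = ssl.CERT_REQUIRED
    if client_ca_file:
        # CA that issued the administrators' certificates (assumed to exist).
        context.load_verify_locations(cafile=client_ca_file)
    if cert_file:
        # The device's own certificate and private key.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


admin_context = build_admin_tls_context()
```

Because the handshake fails for any client without an accepted certificate, the web server behind such a component only ever sees connections that have already passed certificate-based authorization.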

The result?

Having taken all these steps, we got a product that can be described as Cyber Immune, that is, guaranteed to achieve its set security goals, given the security assumptions.

After much work to make the PLC Cyber Immune, the solution architecture was transformed from this:

To this:

As the diagram shows, all components in the final architecture are divided into three groups as per the security goals of the system:

Green = trusted. Components on which the achievement of the security goals directly depends.
Yellow = highly trusted. Components that increase trust in the data that passes through them.
Red = untrusted. All the others.

Red untrusted components are placed at the edge of the system; green trusted ones at the heart; yellow highly trusted ones in between.

In this arrangement, basic protection and verification were sufficient for untrusted components, because the achievement of the security goals did not depend on them. The protection and verification requirements for trusted components are certainly high, but since the trusted components are “covered” by highly trusted ones, their protection requirements could be relaxed. So only the highly trusted components, which, as you can see, were few, needed to be protected and verified to the maximum extent. The result was a heterogeneous system in which only a small fraction of highly trusted components needed to be thoroughly protected and verified.

Thanks to cyber-immunization, we not only eliminated the security issues present in the original PLC, but fundamentally changed the approach to security by embedding it at the design stage. Thus, we are now protected not only against known threats, but also against potential future ones. This means that administrators can sleep peacefully: they now have time to respond and install security patches, because there won’t be so many of them.

Note that the Cyber Immune practices described in this post are applicable not only to a specific device or industry, but to many other devices and industries in today’s cyber-physical world. In this sense, the Cyber Immune approach provides a concrete yet universal answer to the question of how to build highly secure systems while keeping costs at an acceptable level.

Authors:

Alexander Vinyavsky, Technology Evangelist, Kaspersky

Maxim Dontsov, Security Analysis Group Manager, Kaspersky