Technology Final Report
Secure/Resilient Systems and
Data Dissemination/Provenance
September 2017
Prepared for
The Northrop Grumman Cyber Research Consortium
As part of IS Sector Investment Program
Prepared by
Bharat Bhargava
CERIAS, Purdue University
Table of Contents

1 Executive Summary
1.1 Statement of Problem
1.2 Current State of Technology
1.3 Proposed Solution
1.4 Technical Activities, Progress, Findings and Accomplishments
1.5 Distinctive Attributes, Advantages and Discriminators
1.6 Tangible Assets Created by Project
1.7 Outreach Activities and Conferences
1.8 Intellectual Property Accomplishments
2 General Comments and Suggestions for Next Year
List of Figures

Figure 1. High-level view of proposed resiliency framework
Figure 2. Service acceptance test
Figure 3. View of space and time of MTD-based resiliency solution
Figure 4. Moving target defense application example
Figure 5. High-level resiliency framework architecture
Figure 6. System states of the framework
Figure 7. Data d1 leakage from Service X to Service Y
Figure 8. Data sensitivity probability functions
Figure 9. Encrypted search over database of active bundles (by Leon Li, NG "WAXEDPRUNE" project)
Figure 10. Experiment setup for Moving Target Defense (MTD)
Figure 11. EHR dissemination in cloud (created by Dr. Leon Li, NGC)
Figure 12. AB performance overhead with browser's crypto capabilities on / off
Figure 13. Encrypted search over encrypted database
List of Tables

Table 1. Executive Summary
Table 2. Operations supported by different crypto systems
Table 3. Moving Target Defense (MTD) Measurements
Table 4. Encrypted Database of Active Bundles. Table 'EHR_DB'
1 Executive Summary
Title: Secure/Resilient Systems and Data Dissemination/Provenance
Author(s): Bharat Bhargava
Principal Investigator: Bharat Bhargava
Funding Amount: $200,000
Period of Performance: September 1, 2016 - August 31, 2017
Was this a continuation of an Investment Project? Yes
TRL Level: 3
Key Words: data provenance, data leakage, resiliency, adaptability, security, self-healing, MTD
Key Partners & Vendors:

Table 1: Executive Summary
1.1 Statement of Problem
In a cloud-based environment, the enlarged attack surface along with the constant use of zero-day exploits hampers attack mitigation, especially when attacks originate at the kernel level. In a
virtualized environment, an adversary that has fully compromised a virtual machine (VM) and
has system privileges (kernel level, not the hypervisor) without being detected by traditional
security mechanisms exposes the cloud processes and cloud-resident data to attacks that might
compromise their integrity and privacy, jeopardizing mission-critical functions. The main
shortcoming of traditional defense solutions is that they are tailored to specific threats, therefore
limited in their ability to cope with attacks originating outside their scope. There is need to
develop resilient, adaptable, reconfigurable infrastructure that can incorporate emerging
defensive strategies and tools. The architectures have to provide resiliency (withstand cyberattacks, and sustain and recover critical function) and antifragility (increase in capability,
resilience, or robustness as a result of mistakes, faults, attacks, or failures).
The volume of information and real-time requirements have increased with the advent of multiple input channels such as email, text, voice, and tweets. This information flows into government agencies such as the US State Department for dissemination to many stakeholders. Classified information (cyber data, user data, attack event data) must be identified as classified (secret) and disseminated, based on access privileges, only to the right user in a specific location on a specific device. For forensics and provenance, the identities of all who have accessed, updated, or disseminated the sensitive cyber data, including attack event data, must be determinable. There is a need to build systems capable of collecting, analyzing, and reacting to dynamic cyber events across all domains while also ensuring that cyber threats do not propagate across security domain boundaries and compromise the operation of the system.
Solutions that develop a science of cyber security that can apply to all systems, infrastructure,
and applications are needed. The current resilience schemes based on replication lead to an
increase in the number of ways an attacker can exploit or penetrate the systems. It is critical to
design a vertical resiliency solution from the application layer down to physical infrastructure in
which the protection against attacks is integrated across all the layers of the system (i.e.,
application, runtime, network) at all times, allowing the system to start secure, stay secure and
return secure+ (i.e., return with greater security than before) [13] after performing its function.
1.2 Current State of Technology
Current industry-standard cloud systems such as Amazon EC2 provide coarse-grain monitoring
capabilities (e.g. CloudWatch) for various performance parameters for services deployed in the
cloud. Although such monitors are useful for handling issues such as load distribution and
elasticity, they do not provide information regarding potentially malicious activity in the domain.
Log management and analysis tools such as Splunk [1], Graylog [2] and Kibana [3] provide
capabilities to store, search and analyze big data gathered from various types of logs on
enterprise systems, enabling organizations to detect security threats through examination by
system administrators. Such tools mostly rely on human intelligence for threat detection and need to be complemented with automated analysis and accurate threat detection capabilities that can quickly respond to possibly malicious activity in the enterprise and increase resiliency through automated response actions. In addition, Splunk is expensive.
There are well-established moving target defense (MTD) solutions designed to combat specific threats, but they are limited when exploits fall outside their scope. For instance,
application-level redundancy and replication schemes prevent exploits that target the application
code base, but fail against code injection attacks that target runtime execution, e.g. buffer and
heap overflows, and control flow of the application.
Instruction set randomization [51], address space randomization [4], and randomization of the runtime [5] and system calls [6] have been used to effectively combat system-level (i.e., return-oriented/code-injection) attacks. System-level diversification and randomizations are considered
mature and tightly integrated into some operating systems. Most of these defensive security
mechanisms (i.e. instruction/memory address randomizations) are effective for their targets,
however, modern sophisticated attacks require defensive solution approaches to be deeply
integrated into the architecture, from the application-level down to the infrastructure
simultaneously and at all times.
Several general approaches have been proposed for controlling access to shared data and
protecting its privacy. DataSafe is a software-hardware architecture that supports data
confidentiality throughout its lifecycle [7]. It is based on additional hardware and uses a
trusted hypervisor to enforce policies, track data flow, and prevent data leakage. Applications
running on the host are not required to be aware of DataSafe and can operate unmodified and
access data transparently. Hosts without DataSafe can only access encrypted data, and DataSafe is unable to track data once it is disclosed to non-DataSafe hosts. The use of a special architecture limits the solution to well-known hosts that already have the required setup. It is not practical to assume that all hosts will have the required hardware and software components in a cross-domain service environment.
A privacy-preserving information brokering (PPIB) system has been proposed for secure
information access and sharing via an overlay network of brokers, coordinators, and a central
authority (CA) [8]. The approach does not consider the heterogeneity of components such as
different security levels of clients' browsers, different user authentication schemes, and trust levels of services. The use of a trusted third party (TTP) creates a single point of trust and failure.
Other solutions address secure data dissemination in untrusted environments. Pearson et al. present a case study of the EnCoRe project, which uses sticky policies to manage the privacy of shared
data across different domains [9]. In the EnCoRe project, the sticky policies are enforced by a
TTP and allow tracking of data dissemination, which makes it prone to TTP-related issues. The
sticky policies are also vulnerable to attacks from malicious recipients.
1.3 Proposed Solution
We propose an approach for enterprise system and data resiliency that is capable of dynamically
adapting to attack and failure conditions through performance/cost-aware process and data
replication, data provenance tracking and automated software-based monitoring &
reconfiguration of cloud processes (see Figure 1). The main components of the proposed solution
and the challenges involved in their implementation are described below.
Figure 1. High-level view of proposed resiliency framework
1.3.1 Software-Defined Agility & Adaptability
Adaptability to adverse situations and restoration of services are essential for high performance
and security in a distributed environment. Changes in both service context and the context of
users can affect service compositions, requiring dynamic reconfiguration. While changes in user
context can result in updated priorities such as trading accuracy for shorter response time in an
emergency, as well as updated constraints such as requiring trust levels of all services in a
composition to be higher than a particular threshold in a critical mission, changes in service
context can result in failures requiring the restart of a whole service composition. Advances in
virtualization have enabled rapid provisioning of resources, tools, and techniques to build agile
systems that provide adaptability to changing runtime conditions. In this project, we will build
upon our previous work in adaptive network computing [10], end-to-end security in SOA [11]
and the advances in software-defined networking (SDN) to create a dynamically reconfigurable
processing environment that can incorporate a variety of cyber defense tools and techniques. Our
enterprise resiliency solution is based on two main industry-standard components: The cloud
management software of OpenStack [12] – Nova, which provides virtual machines on demand;
and the Software Defined Networks (SDN) solution – Neutron, which provides networking as a
service and runs on top of OpenStack.
The solution that we developed for monitoring cloud processes and dynamic reconfiguration of
service compositions as described in [10] involved a distributed set of monitors in every service
domain for tracking service/domain-level performance and security parameters and a central
monitor to keep track of the health of various cloud services. Even though the solution enables
dynamic reconfiguration of entire service compositions in the cloud, it requires replication,
registration and tracking of services at multiple sites, which could have performance and cost
implications for the enterprise. In order to overcome these challenges, the proposed framework
utilizes live monitoring of cloud resources to dynamically detect deviations from normal service
behavior and integrity violations, and self-heal by reconfiguring service compositions through
software-defined networking of automatically migrated service instances. A component of this
software-defined agility and adaptability solution is live monitoring of services as described
below.
1.3.1.1 Live Monitoring
Cyber-resiliency is the ability of a system to continue degraded operations, self-heal, or deal with
the present situation when attacked [13]. We may need to shut down less critical computations,
communications and allow for weaker consistency as long as the mission requirements are
satisfied. For this we need to measure the assurance level, (integrity/accuracy/trust) of the system
from the Quality of Service (QoS) parameters such as response time, throughput, packet loss,
delays, consistency, acceptance test success, etc.
To ensure the enforcement of SLAs and provide high security assurance in enterprise cloud
computing, a generic monitoring framework needs to be developed. The challenges involved in
effective monitoring and analysis of service/domain behavior include the following:
• Identification of significant metrics, such as response time, CPU usage, and memory usage, for service performance and behavior evaluation.
• Development of models for identifying deviations from performance goals (e.g., keeping total response time below a specific threshold) and security goals (e.g., keeping service trust levels above a certain threshold).
• Design and development of adaptable service configurations and live migration solutions for increased resilience and availability.
Development of effective models for detection of anomalies in a service domain relies on careful
selection of performance and security parameters to be integrated into the models. Model
parameters should be easy-to-obtain and representative of performance and security
characteristics of various services running on different platforms. We plan to investigate and
utilize the following monitoring tools that provide integration with OpenStack in order to gather
system usage/resiliency parameters in real time [14]:
1. Ceilometer [15]: Provides a framework to meter and collect infrastructure metrics such as
CPU, network, and storage utilization. This tool provides alarms that are triggered when a metric crosses a predefined threshold, and can be used to send alarm information to external
servers.
2. Monasca [16]: Provides a large framework for various aspects of monitoring including
alarms, statistics, and measurements for all OpenStack components. Tenants can define
what to measure, what statistics to collect, how to trigger alarms and the notification
method.
3. Heat [17]: Provides an orchestration engine to launch multiple composite cloud
applications based on templates in the form of text files that can be treated like code, enabling actions such as autoscaling based on alarms received from Ceilometer.
As a further improvement for dynamic service orchestration and self-healing, we plan to investigate models based on a graceful degradation approach for service composition: services that do not pass acceptance tests (see Figure 2) are replaced, based on user-specified or context-based policies, with services that are more likely to pass the tests, at the expense of decreased performance.
Figure 2. Service acceptance test
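As an illustration of this graceful-degradation idea only, the following sketch uses hypothetical service names and a made-up acceptance test (it is not the project's deployed implementation): candidates are tried from highest to lowest performance, and a failing candidate is replaced by the next one, which is assumed more likely to pass.

import random

# Candidate services for one role, ordered from highest to lowest performance (illustrative).
CANDIDATES = [
    {"name": "fast-analytics-v2", "expected_pass_rate": 0.80},
    {"name": "analytics-v1", "expected_pass_rate": 0.95},
    {"name": "baseline-analytics", "expected_pass_rate": 0.99},
]

def invoke(service):
    # Placeholder for the real service call; returns simulated QoS measurements.
    return {
        "response_time_ms": random.randint(50, 400),
        "trust_level": 0.9 if random.random() < service["expected_pass_rate"] else 0.4,
    }

def acceptance_test(result, policy):
    # User-specified or context-based acceptance test on a service result.
    return (result["response_time_ms"] <= policy["max_response_time_ms"]
            and result["trust_level"] >= policy["min_trust_level"])

def run_with_degradation(policy):
    # Try candidates in decreasing performance order; fall back to the next
    # (slower but more likely to pass) candidate when the test fails.
    for service in CANDIDATES:
        result = invoke(service)
        if acceptance_test(result, policy):
            return result
    raise RuntimeError("no candidate service passed the acceptance test")

# Example: run_with_degradation({"max_response_time_ms": 300, "min_trust_level": 0.8})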
1.3.2 Moving Target Defense for Resiliency/Self-healing
The traditional defensive security strategy for distributed systems is to prevent attackers from
gaining control of the system using known techniques such as firewalls, redundancy,
replications, and encryption. However, given sufficient time and resources, all these methods can
be defeated, especially when dealing with sophisticated attacks from advanced adversaries that
leverage zero-day exploits. This highlights the need for more resilient, agile and adaptable
solutions to protect systems. MTD is a component of the NGC Cyber Resilient System project [13]. Sunil Lingayat of NGC has taken an interest in this work and connected us with other NGC researchers working in Dayton.
Our proposed Moving Target Defense (MTD) [18, 19] attack-resilient virtualization-based
framework is a defensive strategy that aims to reduce the need to continuously fight against
attacks by decreasing the gain-loss balance perception of attackers. The framework narrows the
exposure window of a node to such attacks, which increases the cost of attacks on a system and
lowers the likelihood of success and the perceived benefit of compromising it. The reduction in
the vulnerability window of nodes is mainly achieved through three steps:
1. Partitioning the runtime execution of nodes in time intervals
2. Allowing nodes to run only with a predefined lifespan (as low as a minute) on
heterogeneous platforms (i.e. different OSs)
3. Proactively monitoring their runtime below the OS
The main idea of this approach is allowing nodes to run on a given computing platform (i.e.
hardware, hypervisors and OS) for a controlled period of time chosen in such a manner that
successful ongoing attacks become ineffective as suggested in [20, 21, 22, 23]. We accomplish
such a control by allowing nodes to run only for a short period of time to complete n client
requests on a given underlying computing platform, then vanish and appear on a different
platform with different characteristics, i.e., guest OS, Host OS, hypervisor, hardware, etc. We
refer to this randomization and diversification technique of vanishing a node to appear in another
platform as reincarnation.
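As a rough sketch of this lifespan-bounded execution (hypothetical platform list and request counter; in the actual framework the reincarnation is driven through OpenStack Nova and Neutron as described in Section 1.3.2.2), a node could be cycled across heterogeneous platforms as follows:

import random
import time

# Heterogeneous platforms the node may reincarnate onto (illustrative values).
PLATFORMS = [
    {"guest_os": "Ubuntu 14.04", "hypervisor": "KVM"},
    {"guest_os": "Ubuntu 12.04", "hypervisor": "Xen"},
    {"guest_os": "CentOS 7", "hypervisor": "KVM"},
]

LIFESPAN_SECONDS = 60     # lifespan can be as low as a minute
MAX_REQUESTS = 1000       # or bounded by n served client requests

def reincarnate(node, platform):
    # Vanish the node and start a fresh replica on a different platform.
    # In the real framework: create a fresh VM via Nova, rewire via Neutron.
    print(f"reincarnating {node} onto {platform}")

def run_node(node):
    while True:
        platform = random.choice(PLATFORMS)
        reincarnate(node, platform)
        started, served = time.time(), 0
        # Serve requests only until the lifespan or request budget expires.
        while time.time() - started < LIFESPAN_SECONDS and served < MAX_REQUESTS:
            served += 1           # placeholder for handling one client request
            time.sleep(0.01)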
The proposed framework introduces resiliency and adaptability to systems. Resilience has two
main components: (1) continuing operation and (2) fighting through compromise [13]. The MTD
framework takes into consideration these components since it transforms systems to be able to
adapt and self-heal when ongoing attacks are detected, which guarantees operation continuity.
The initial target of the framework is to prevent successful attacks by establishing short lifespans
for nodes/services to reduce the probability of attackers’ taking over control. In case an attack
occurs within the lifespan of a node, the proactive monitoring system triggers a reincarnation of
the node.
The attack model considers an adversary taking control of a node undetected by the traditional
defensive mechanisms, a valid assumption in the face of novel attacks. The adversary gains high
privileges of the system and is able to alter all aspects of the applications. Traditionally, the
advantage of the adversaries, in this case, is the unbounded time and space to compromise and
disrupt the reliability of the system, especially when it is replicated (i.e. colluding). The
fundamental premise of the proposed framework is to eliminate the time and space advantage of the
adversaries and create agility to avoid attacks that can defeat system objectives by extending the
cloud framework. We assume the cloud management software stack (i.e. framework) and the
virtual introspection libraries are secure.
1.3.2.1 Resiliency Framework Design
The criticality of diversity as a defensive strategy in addition to replication/redundancy was first
proposed in [24]. Diversity and randomization allow the system defender to deceive adversaries
by continuously shifting the attack surface of the system. We introduce a unified generic MTD
framework designed to simultaneously move in space (i.e. across platforms) and in time (i.e.
time intervals as low as a minute). Unlike state-of-the-art singular MTD solution approaches
[25, 26, 27, 28, 29, 30], we view the system space as a multidimensional space where we apply
MTD on all layers of the space (application, OS, network) in short time intervals while we are
aware of the status of the rest of the nodes in the system.
Figure 3 illustrates how the MTD framework works. The y-axis depicts the view of the space
(e.g. application, OS, network) and the x-axis the runtime (i.e. elapsed time). The figure
compares the traditional replicated systems without any diversification and randomization
technique, the state-of-the-art systems [25, 26, 27, 28, 29, 30] with diversification and
randomization techniques applied to certain layers of the infrastructure (application, OS or
network), and the proposed solution, which applies MTD to all layers.
Figure 3. View of space and time of MTD-based resiliency solution
As illustrated in Figure 3.c, nodes/services that are not reincarnated in a particular time-interval
are marked with the result of an observation (e.g. introspection) of either Clean (C) or Dirty (D)
(i.e., not compromised / compromised). To illustrate, in the third reincarnation round with replica n, we detect replica 1 to be clean (marked C) and replica 2 to be dirty (marked D in that time-interval entry). In the next time interval, we reincarnate the node whose entry shows D ahead of the regularly scheduled node.
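A minimal sketch of this scheduling rule (hypothetical replica names and observation values; the real observations come from introspection) simply lets Dirty nodes jump the reincarnation queue:

from collections import deque

def next_reincarnation(queue, observations):
    # Pick the next node to reincarnate: Dirty (D) nodes preempt the
    # regularly scheduled node; otherwise follow the round-robin order.
    dirty = [n for n in queue if observations.get(n) == "D"]
    chosen = dirty[0] if dirty else queue[0]
    queue.remove(chosen)
    queue.append(chosen)          # it goes to the back after its reincarnation
    return chosen

# Example: replica2 was observed Dirty, so it is reincarnated before replica3.
queue = deque(["replica3", "replica2", "replica1"])
print(next_reincarnation(queue, {"replica1": "C", "replica2": "D"}))   # replica2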
Two important factors need to be considered in the design of this framework: the lifespan of
nodes or virtual machines, and the migration technique used in the reincarnation. Figure 4 shows
a possible scenario in which virtual machines running on a platform become IDLE when an
attack occurs and is detected. When to and how to reincarnate nodes are our main research
questions.
Figure 4. Moving target defense application example
Long lifespans increase the probability of success of an ongoing attack, while lifespans that are too short impact the performance of the system. Novel ways to determine when to vanish a node and run its replica in a new VM need to be developed. In [23], VMs are reincarnated at fixed intervals chosen using Round Robin or a randomized selection mechanism. We propose the
implementation of a more adaptable solution, which uses Virtual Machine Introspection (VMI) to
persistently monitor the communication between virtual requests and available physical
resources and switch the VM when anomalous behaviors are observed.
The other crucial factor in our design is the live migration technique used for the virtual machine
reincarnation. Migrating operating system instances across different platforms has traditionally been used to facilitate fault management, load balancing, and low-level system maintenance
[31]. Several techniques have been proposed to carry out the majority of migration while OSes
continue to run to achieve acceptable performance with minimal service downtimes. We propose
to integrate some of these techniques [31, 32, 33] in a clustered environment to our MTD
solution to guarantee adaptability and agility in our system. When virtual machines are running
live services, it is important that the reincarnation occurs in such a manner that both downtime
and total transfer time are minimal. The downtime refers to the time when no service is available
during the transition. The total transfer time refers to the time it takes to complete the transition
[31]. Our main idea is to continue running the service in the source VM until the destination VM
is ready to offer the service independently. In this process there will be some time where part of
the state (the most unchangeable state information) is copied to the destination VM while the
source VM is still running. At some point, the source VM will be stopped to copy the rest of the
information (the most changeable state information) to the destination VM, which will take
control after the information is copied. No service will be available when the source VM is
stopped and the copying process has not been completed. This period of time defines the
downtime.
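A simplified sketch of this pre-copy idea (toy page set and dirty-page tracker; real migration is handled by the hypervisor as in [31]) shows where the downtime comes from:

def precopy_migrate(source_pages, dirty_tracker, max_rounds=5):
    # Iteratively copy memory while the source runs, then stop-and-copy the
    # remaining dirty pages; only the final phase contributes to downtime.
    copied = {}
    # Phase 1: source keeps running; copy everything, then re-copy dirty pages.
    copied.update(source_pages)
    for _ in range(max_rounds):
        dirty = dirty_tracker()          # pages modified since the last round
        if not dirty:
            break
        copied.update(dirty)
    # Phase 2 (downtime): pause the source and copy whatever is still dirty.
    final_dirty = dirty_tracker()
    copied.update(final_dirty)           # destination VM now takes over
    return copied, len(final_dirty)      # size of the stop-and-copy set

# Example with a toy dirty-page generator: fewer dirty pages each round.
rounds = iter([{"p3": "v2", "p7": "v2"}, {"p3": "v3"}, {}])
pages, downtime_pages = precopy_migrate({"p1": "v1", "p3": "v1", "p7": "v1"},
                                        lambda: next(rounds, {}))
print(downtime_pages)   # 0 dirty pages left at stop-and-copy => minimal downtime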
1.3.2.2 Resiliency Framework Infrastructure
Our framework will be built on top of OpenStack cloud framework [12], a widely adopted open
source cloud management software stack. Figure 5 shows the high-level architecture of our
framework on top right and OpenStack cloud framework on the bottom left.
Figure 5. High-level resiliency framework architecture
In the cloud framework, the bottom layer of the infrastructure is the hardware. Each hardware node has a host OS, a hypervisor (KVM/Xen) to virtualize the hardware for the guest VMs on top of it, and the cloud software stack, OpenStack [12] in our case. The
vertical bars are some of the OpenStack framework implementation components: nova, neutron,
horizon, and glance. In addition, the libvmi library for virtual introspection interfaces with the
libvirt library, which is used by the hypervisor for virtualization. This library allows us to
intercept the resource-mapping requests (i.e. memory) from the VM to the physical available
resource in order to detect anomalous behavior even in the event of VM/OS compromise.
We introduce two abstraction layers: a high-level System State (top) and the Application
Runtime (bottom), dubbed time-interval runtime. To illustrate the system state, we consider
Desired as the desired system state at all times, and Undesired as the states we want to avoid (i.e., the turbulence, compromised, failed, or exit system states). The driving engine of these two high-level states is the set of indirect outputs of the time-interval runtime, depicted as the dotted arrows.
The Application Runtime defines the time an application runs in a VM. Our framework
transforms traditional services, designed to be protected for their entire runtime (as shown on the guest VMs in the cloud framework), into services that deal with attacks in time intervals, as depicted in Figure 5 (Time Interval Runtime). This transformation is achieved by allowing the applications to run on heterogeneous OSs and variable underlying computing platforms (i.e., hardware and hypervisors), thereby creating mechanically generated system instances that are diversified in time and space. The Application Runtime can vary depending on
the detection of anomalous behaviors.
The System State and the Application Runtime are two abstraction layers that operate in
synchrony. At the application layer, we refresh and map one or more Apps/VMs (App1 ... Appn) to different platforms (Hardware1 ... HWn) in pre-specified time intervals, referred to as the time-interval runtime. To gain a holistic view of the high-level system state, we continuously re-evaluate the system state (with libvmi, depicted by the horizontal blue arrow) at the end of each
interval to determine the current state of the system at that specific time interval.
System state is the state of the system at any given time. The state changes are dictated by the
application runtime status. For instance, if the application fails or crashes, then the system is in
the failed state. Similarly, the system is in a compromised state when the attacker succeeds and is
undetected. These two states, failed and compromised, fall under the Undesired category. Figure 6
shows the possible different states, where TIRE (Time Interval Runtime Execution) represents
the observation of the current state at the end of the runtime, D the Desired state, C the
Compromised state, F the Failed state and E the Exit state.
Figure 6. System states of the framework
The key objective of the framework is to start the system in a Desired state and stay in that state
as often as possible. This technique implements resiliency in the three phases of the system
lifecycle: "Start Secure. Stay Secure. Return Secure" [13]. In the event that the system transitions into an Undesired state, a realistic assumption in cyberspace, it bounces back seamlessly into the Desired state even in the event of an OS compromise. For this, applications run in a
specific VM for a pre-specified time and then are moved to a new one.
1.3.2.3 Live Reincarnation of Virtual Machines
Reincarnation is a technique for enhancing the resiliency of a system by terminating a running node and starting a fresh one in its place, possibly on a different platform/OS, which continues to perform the tasks of its predecessor from the point at which it was dropped off the network. One key question is determining when to reincarnate a machine. One approach
is setting a fixed period of time for each machine and reincarnating them after that lifespan. In
this first approach machines to be reincarnated are selected either in Round Robin or randomly.
However, attacks can occur within the lifespan of each machine, which makes live monitoring
mechanisms a crucial element. Whether an attack is going on at the beginning of the
reincarnation determines how soon the source VM must be stopped to keep the system resilient.
When no threats are present both source VM and destination VM can participate in the
reincarnation process. The source VM can continue running until the destination VM is ready to
continue. On the contrary, in case an attack is detected the source VM should be stopped
immediately and the reincarnation must consist simply of copying the state of the source to the
destination VM and continue the tasks after the process. The latter case presents a higher
downtime than the former case. Our target is to add adaptability to the system so that we can distinguish between these two cases.
In a clustered environment the reincarnation of a machine needs to concentrate on physical
resources such as disk, memory and network. Disk resources are assumed to be shared through a
Network-Attached Storage (NAS) so not much needs to be done regarding this. Memory and
network are the targets of our study. There is a tradeoff we need to consider to manage these
resources. Stopping the source virtual machine at once to copy its entire address space consumes a lot of network resources, which negatively impacts performance. The decision for the
migration should involve consideration of various parameters including the number of modified
pages that need to be migrated, available network bandwidth, hardware configuration of the
source and destination, load on the source and destination etc. We previously proposed a model
based on mobile agent-based services [52] that migrate to different platforms in the cloud to
optimize the running time of mobile applications under varying contexts. By utilizing the
statefulness of mobile agents, the model enables resuming processes after migration to a different
platform. We plan to build on our knowledge in adaptable systems and performance optimization
to design a model for live migration of services and restoration on the destination platform to
enable high performance and continuous availability. We will focus on the following aspects:
• Memory Management: We are interested in developing an adaptable solution that copies the memory state to the destination VM while the source VM is still running, when no attacks are detected. A daemon will be in charge of this task. Initially the entire memory space is copied to the destination, and later only dirty pages are copied when the destination VM is ready to take over. This guarantees a much smaller downtime when machines are reincarnated because their lifespan expires (no attacks detected).
• Network Management: The new virtual machine must be set up with the same IP address as its predecessor. All other machines in the cluster must be made aware of this change, which can be achieved by sending ARP replies to the rest of the machines (see the sketch after this list).
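The sketch below uses Scapy to broadcast a gratuitous ARP reply for the inherited IP address so that cluster peers update their ARP caches. The tooling choice is an assumption for illustration; an equivalent effect could be obtained with arping or an SDN rule.

# Gratuitous ARP announcement after reincarnation (requires scapy and root privileges).
from scapy.all import ARP, Ether, sendp

def announce_ip_takeover(ip_addr, new_mac, iface="eth0"):
    # Broadcast an ARP reply so that peers map ip_addr to new_mac.
    pkt = Ether(dst="ff:ff:ff:ff:ff:ff", src=new_mac) / ARP(
        op=2,             # ARP reply
        psrc=ip_addr,     # the IP inherited from the predecessor VM
        hwsrc=new_mac,    # MAC of the freshly reincarnated VM
        pdst=ip_addr,
        hwdst="ff:ff:ff:ff:ff:ff",
    )
    sendp(pkt, iface=iface, verbose=False)

# announce_ip_takeover("10.0.0.5", "52:54:00:ab:cd:ef")   # placeholder values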
1.3.3 Data Provenance and Leakage Detection in Untrusted Cloud
Monitoring data provenance and adaptability in the data dissemination process is crucial for resilience against data leakage and privacy violations in untrusted cloud environments. Our
solution ensures that each service can only access data for which it is authorized. In addition to
that, our approach detects several classes of data leakages made by authorized services to
unauthorized ones. Context-sensitive evaporation of data in proportion to the distance from the
data owner can make illegitimate disclosures less probable, as distant data guardians tend to be
less trusted.
We extended our "WaxedPrune" prototype, demonstrated at Northrop Grumman Tech Fest 2016,
to support automatic derivation of data provenance through interaction of the Active Bundle
(AB) engine [34] with the central cloud monitor responsible for automated monitoring and
reconfiguration of cloud enterprise processes. The extension [50] supports detection of several types
of data leakages, in addition to enforcing fine-grained security policies. Some of the ideas were
proposed by Leon Li (NG) during our weekly meetings.
1.3.3.1 Data Leakage Detection
In the current implementation of the secure data dissemination mechanism, data is transferred
among services in encrypted form by means of Active Bundles (AB) [40, 41]. This privacy-preserving mechanism protects against tamper attacks, eavesdropping, spoofing, and man-in-the-middle attacks. Active Bundles ensure that a party can access only those portions of data it is authorized for. However, an authorized service may leak sensitive data to an unauthorized one. A leakage scenario is illustrated in Fig. 7: Service X, which is authorized to read data item d1, can leak this item behind the scenes to an unauthorized service, e.g., Service Y. This leakage needs to be detected and reported to the data owner. Let us denote the data D = {d1, d2, ..., dn} and the set of policies P = {p1, p2, ..., pk}. Data leakage can occur in two forms:
1) Encrypted data is leaked, i.e., the whole AB (the whole Electronic Health Record).
Active Bundle data can only be extracted by the Active Bundle's kernel after authentication has passed and access control policies have been evaluated by the Policy Enforcement Engine. If an AB is sent by Service X to unauthorized Service Y behind the scenes, Service Y will not be able to decrypt data it is not authorized for. When Y tries to decrypt d1, the AB kernel will first query the Central Monitor (CM) to check whether d1 is supposed to be at Y's side. If not, a data leakage alert is raised.
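A minimal sketch of this pre-decryption check follows. The REST endpoints and field names are hypothetical placeholders for the interface between the AB kernel and the Central Monitor in the WaxedPrune extension [50]; they are shown for illustration only.

import requests   # assumes the CM exposes a simple HTTP API (an assumption)

CM_URL = "https://cm.example.org"   # hypothetical Central Monitor endpoint

def decrypt_item(ab, item_id, requester):
    # AB kernel hook: consult the CM before releasing a decrypted item.
    resp = requests.get(f"{CM_URL}/expected-location",
                        params={"item": item_id, "service": requester},
                        timeout=5)
    if not resp.json().get("authorized_here", False):
        # The item reached a service it should never be at: raise a leakage
        # alert so the CM can notify the data owner.
        requests.post(f"{CM_URL}/leakage-alert",
                      json={"item": item_id, "found_at": requester})
        raise PermissionError(f"data leakage: {item_id} found at {requester}")
    return ab.decrypt(item_id)      # normal policy-based decryption path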
In addition to the CM enforcing data obligations, there is another embedded protection measure that relies on digital watermarks embedded into Active Bundles, which can be verified by web crawlers. If an attacker uploads the illegal content to publicly available web hosting that can be scanned by web crawlers, a crawler can verify the watermark and detect the copyright violation. However, this approach is limited to cases where the attacker uploads unauthorized content to a publicly accessible location on the network.
Figure 7. Data d1 leakage from Service X to Service Y
2) Plaintext data is leaked.
If Service X, which is authorized to see d1, takes a picture of a screen and sends it via email to Service Y, which is not authorized to see d1, then our solution relies on visual watermarks embedded into the data. A visual watermark that remains visible on a captured image can be used in court to prove data ownership.
The key challenge here is that it is hard to come up with an approach that covers all possible
cases of data leakage. Embedded watermarks can help to detect leakage of a document, but if the watermarks are removed then the protection is gone. Consider, for instance, a service that gets access to a credit card number, writes it down on a piece of paper in order to remember it, and later leaks it via email to an unauthorized party. In such cases, there are several ways to mitigate the problem:
• Layered approach: don't give all the data to the requester at once
  o First give part of the data (incomplete, less sensitive)
  o Watch how the data is used and monitor the trust level of the using service
  o If the trust level is sufficient, give the next portion of data
• Raise the level of data classification to prevent leakage repetition
• Intentional leakage to create uncertainty and lower the data value
• Monitor network messages
  o Check whether they contain, e.g., a credit card number that satisfies a specific pattern and can be validated using regular expressions
• After a leakage is detected, make the system stronger against similar attacks
  o Separate the compromised role into two: e.g., suspicious_role and benign_role
  o Send new certificates to all benign users for the benign role
  o Create a new AB with new policies restricting access for suspicious_role (e.g., all doctors from the same hospital as the malicious one)
  o Increase the sensitivity level of the leaked data items, e.g., the diagnosis
Data Leakage Damage Assessment
After a data leakage is detected, the damage is assessed based on:
• To whom the data was leaked (a service with a low trust level vs. a service with a high trust level)
• The sensitivity (classification) of the leaked data (classified vs. unclassified)
• When the leaked data was received
• Whether other sensitive data can be derived from the leaked data (e.g., a diagnosis can be derived from a leaked medical prescription)

Damage = P(Data is Sensitive) * P(Service is Malicious) * P(t)    (1)

where P(t) is the probability function for data sensitivity over time.
Fig.8. Data sensitivity probability functions
Fig. 8 shows three different data sensitivity probability functions. A data-related event (e.g., a product release) occurs at time t0. The threat from data being leaked before t0 is high. The threat from data being leaked after t0 either drops to zero right away (e.g., leaking a picture of a new smartphone after the product has already been released), degrades linearly (e.g., a new encryption algorithm that is studied by attackers after it has been released), or remains high over time (e.g., a novel product design).
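A small sketch of Eq. (1) with the three P(t) shapes from Fig. 8; the parameter values are made up for illustration only.

def p_t_step(t, t0):
    # Sensitivity drops to zero once the event has happened (e.g., phone photo).
    return 1.0 if t < t0 else 0.0

def p_t_linear(t, t0, horizon=30.0):
    # Sensitivity degrades linearly after the event (e.g., released algorithm).
    return 1.0 if t < t0 else max(0.0, 1.0 - (t - t0) / horizon)

def p_t_constant(t, t0):
    # Sensitivity stays high over time (e.g., novel product design).
    return 1.0

def damage(p_sensitive, p_malicious, p_t):
    # Eq. (1): Damage = P(Data is Sensitive) * P(Service is Malicious) * P(t)
    return p_sensitive * p_malicious * p_t

# Example: a leak 10 days after a release, to a low-trust service.
print(damage(0.9, 0.7, p_t_linear(t=40, t0=30)))   # ~0.42 with linear decay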
1.3.3.2 Data Provenance
Each time a service tries to access data from an AB, the CM is notified and provenance data is recorded at the CM. The following provenance data is stored in encrypted form in a log file at the CM (a sketch of one such record follows the list):
• Who tried to decrypt the data
• What type (class) of data
• When
• Where the data came from (who is the sender)
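A sketch of one such provenance record as it might be appended, encrypted, to the CM log; the field names and the encrypt callback are hypothetical placeholders.

import json
from datetime import datetime, timezone

def make_provenance_record(requester, data_class, sender):
    # Build one provenance entry: who / what / when / where-from.
    return {
        "who": requester,                       # service that tried to decrypt
        "what": data_class,                     # type (class) of data requested
        "when": datetime.now(timezone.utc).isoformat(),
        "sender": sender,                       # where the AB came from
    }

def append_to_cm_log(record, encrypt, log_path="cm_provenance.log"):
    # The CM stores entries in encrypted form; `encrypt` is supplied by the CM.
    with open(log_path, "a") as log:
        log.write(encrypt(json.dumps(record)) + "\n")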
Although there are overheads and challenges related to collecting provenance [36, 39], not collecting it has serious implications for the reliability, veracity, and reproducibility of Big Data tracking and for the utilization of data and process provenance in workflow-driven analytical and other applications. In addition, forensics requires provenance data for investigating data leakage incidents.
As a next step, we started investigating a blockchain-based mechanism to ensure the integrity of provenance data, i.e., the integrity of the log files.
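As a first step in that direction, a simple hash chain over log entries (a sketch, not a full blockchain) already makes undetected tampering with earlier records difficult:

import hashlib
import json

def chain_hash(prev_hash, record):
    # The hash of the previous entry is folded into the current one, so editing
    # any earlier record invalidates every later hash.
    payload = json.dumps(record, sort_keys=True).encode() + prev_hash.encode()
    return hashlib.sha256(payload).hexdigest()

def append_chained(log, record):
    prev = log[-1]["hash"] if log else "0" * 64
    log.append({"record": record, "hash": chain_hash(prev, record)})

def verify_chain(log):
    prev = "0" * 64
    for entry in log:
        if entry["hash"] != chain_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True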
1.3.3.3 Privacy-preserving publish/subscribe cloud data storage and dissemination
Figure 9. Encrypted search over database of Active Bundles (by Leon Li, NGC “WaxedPrune” project)
Figure 9 (created by Leon Li at NGC) shows a public cloud that provides a repository for storing
and sharing intelligence feed data from various sources. This is used by the analytics software to
derive different data trends. The use of a public cloud requires the stored data to be encrypted.
The cloud includes three main components:
(a) Collection Agent used to gather data from multiple distributed intelligence feeds; it maps
similar data across different feeds and logically orders them for storage.
(b) CryptDB [44] that stores encrypted data and provides SQL query capabilities over
encrypted data, using a search engine based on SQL-aware encryption schemes.
(c) Subscription API that provides various methods for data consumers to enable authorized
access to data.
Our solution relies on CryptDB [44], which is used to support encrypted search queries over encrypted data. CryptDB acts as a proxy to a database server. It never releases decryption keys to the database; thus, the database can be hosted by an untrusted cloud provider. If the database is compromised, only ciphertext is revealed and data leakage is limited to the data of currently logged-in users. A user's search query is stemmed, stopwords are removed, and the query is then converted into a SQL query (see the sketch below).
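A toy sketch of that preprocessing step, with a tiny stopword list and crude suffix stemming standing in for a real NLP library; it is not the project's actual query translator and performs no input sanitization.

STOPWORDS = {"the", "a", "an", "of", "for", "with", "patients"}

def stem(word):
    # Very crude suffix stripping, standing in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def to_sql(user_query, table="EHR_DB"):
    terms = [stem(w) for w in user_query.lower().split() if w not in STOPWORDS]
    predicate = " AND ".join(f"Diagnosis LIKE '%{t}%'" for t in terms)
    return f"SELECT ID FROM {table} WHERE {predicate};"

# Example: "patients with concussions" ->
#   SELECT ID FROM EHR_DB WHERE Diagnosis LIKE '%concussion%';
print(to_sql("patients with concussions"))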
Table 2 shows the operations supported by different crypto systems.
Table 2. Operations supported by different crypto systems
CryptDB has several weak points:
(a) The order-preserving encryption (OPE) scheme is not secure, as it reveals the order of values.
(b) Not all SQL queries are supported; e.g., a query with two arithmetic operations such as a + b*c is not supported.
A possible solution is to use Fully Homomorphic Encryption (FHE).
1.4 Technical Activities, Progress, Findings and Accomplishments
The main technical activities for this project consisted of the implementation of an enterprise system and data resiliency solution that is capable of dynamically adapting to attack and failure conditions through performance/cost-aware process and data replication, data provenance tracking, and automated software-based monitoring and reconfiguration of cloud processes. As the system was built, experiments were conducted to measure its performance. The results were
used in two publications and in two posters for the CERIAS Security Symposium at Purdue. The
details of the technical activities and our main findings are described below.
1.4.1 Live Monitoring
Cyber-resiliency is the ability of a system to continue degraded operations, self-heal, or deal with
the present situation when attacked [13]. For this we need to measure the assurance level
(integrity/accuracy/trust) of the system from the Quality of Service (QoS) parameters such as
response time, throughput, packet loss, delays, consistency, etc.
The solution developed for dynamic reconfiguration of service compositions as described in [10]
involved a distributed set of monitors in every service domain for tracking performance and
security parameters and a central monitor to keep track of the health of various cloud services.
Even though the solution enables dynamic reconfiguration of entire service compositions in the
cloud, it requires replication, registration and tracking of services at multiple sites, which could
have performance and cost implications for the enterprise. To overcome these challenges, the
framework proposed in this work utilizes live monitoring of cloud resources to dynamically
detect deviations from normal behavior and integrity violations, and self-heal by reconfiguring
service compositions through software-defined networking [45] of automatically migrated
service/VM instances.
As the goal of the proposed resiliency solution is to provide a generic model for the detection of possible threats and failures in a cloud-based runtime environment, limiting the utilized anomaly detection models to supervised learning algorithms will not provide the desired applicability. Hence, unsupervised learning models such as k-means clustering [46] and one-class SVM classification [47] to detect outliers (i.e., anomalies) in service and VM behavior will be more appropriate. Algorithm 1 shows an adaptation of the k-means algorithm to cluster service
performance data under normal system operation conditions and algorithm 2 shows how to
detect outliers by measuring the distance of the performance vector of a service at a particular
point in time to all clusters formed during training. Additionally, virtual machine introspection
(VMI) [48] techniques need to be utilized to check the integrity of VMs at runtime to ensure that
the application’s memory structure has not been modified in an unauthorized manner. The results
of the monitoring and anomaly detection processes help decide when to reincarnate VMs as
described in the next section.
Algorithm 1. Anomaly training algorithm
Algorithm 2. Anomaly detection algorithm
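Since Algorithms 1 and 2 appear as figures in the original report, the following is a compact sketch of the same idea using scikit-learn; the cluster count, feature set, and outlier threshold (a slack factor on each cluster's training radius) are assumptions, not the exact algorithms from the figures.

import numpy as np
from sklearn.cluster import KMeans

def train_clusters(normal_vectors, k=3):
    # Algorithm 1 (sketch): cluster per-service performance vectors
    # (e.g., response time, CPU %, memory %) gathered under normal operation.
    X = np.asarray(normal_vectors, dtype=float)
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Per-cluster radius: largest distance of a training point to its own center.
    dists = np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1)
    radii = np.array([dists[model.labels_ == c].max() for c in range(k)])
    return model, radii

def is_anomalous(model, radii, vector, slack=1.2):
    # Algorithm 2 (sketch): an observation is an outlier if it lies farther from
    # every cluster center than that cluster's radius (scaled by some slack).
    d = np.linalg.norm(model.cluster_centers_ - np.asarray(vector, dtype=float), axis=1)
    return bool(np.all(d > slack * radii))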
1.4.2 Moving Target Defense
The main idea of this MTD technique is to allow a node running a distributed application to stay on a given computing platform for a controlled period of time before vanishing it. The allowed running time is chosen in such a manner that successful ongoing attacks become ineffective, and a new node with different computing platform characteristics is created and inserted in place of the vanishing node. The new node is updated by the remaining nodes after the replacement completes. The required synchronization time is determined by the application and the amount of data that needs to be transferred to the new node, as the reincarnation process does not keep the state of the old node.
The randomization and diversification technique of vanishing a node to appear in another
platform is called node reincarnation [18, 19]. One key question is determining when to
reincarnate a node. One approach is setting a fixed period of time for each node and
reincarnating them after that lifespan. In this first approach nodes to be reincarnated are selected
either in Round Robin or randomly. However, attacks can occur within the lifespan of each
machine, which makes live monitoring mechanisms a crucial element. Whether an attack is
going on at the beginning of the reincarnation process determines how soon the old node must be
stopped to keep the system resilient. When no threats are present both the old node and new node
can participate in the reincarnation process. The old node can continue running until the new
node is ready to take its place. On the contrary, in case an attack is detected the old node should
be stopped immediately and the reincarnation should occur without its participation, which from
the perspective of the distributed application represents a greater downtime of the node.
Our main contribution here is the design and implementation of a prototype that speeds up the
node reincarnation process using SDN, which allows configuring network devices on the fly via
OpenFlow. We avoid swapping virtual network interfaces of the nodes involved in the process as
proposed in [18] to save time in the preparation of the new virtual machine. The new virtual
machine is created and automatically connected to the network. The machine then starts
participating in the distributed application when routing flows are inserted to the network devices
to redirect the traffic directed to the old VM to the new one.
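A sketch of such a redirection rule using the standard ovs-ofctl tool; the bridge name, output port, and addresses are placeholders, not the experiment's actual topology.

import subprocess

def redirect_to_new_vm(bridge, old_ip, new_ip, new_mac, out_port):
    # Insert an OpenFlow rule so traffic addressed to the old VM is rewritten
    # and delivered to the freshly created VM.
    flow = (f"priority=100,ip,nw_dst={old_ip},"
            f"actions=mod_dl_dst:{new_mac},mod_nw_dst:{new_ip},output:{out_port}")
    subprocess.run(["ovs-ofctl", "add-flow", bridge, flow], check=True)

# redirect_to_new_vm("br-int", "10.0.0.5", "10.0.0.9", "52:54:00:12:34:56", 7)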
1.4.2.1 Experiments and Results
Experiments to evaluate the operation times of the proposed MTD solution were conducted.
Figure 10 shows the experiment setup. A Byzantine fault tolerant (BFT-SMaRt) distributed
application was run on a set of Ubuntu (either 12.04 or 14.04 randomly selected) VMs in a
private cloud, which are connected with an SDN network using Open vSwitch. The reincarnation
is stateless, i.e. the new node (e.g. VM1’) does not inherit the state of the replaced node (e.g.
VM1). The set of new VMs are periodically refreshed to start clean and the network is
reconfigured using OpenFlow when a VM is reincarnated to provide continued access to the
application.
Figure 10. Experiment Setup For Moving Target Defense (MTD)
Table 3 presents the results: virtual machine restarting and creation time, and Open vSwitch flow
injection time. Note that the important factor for system downtime here is the Open vSwitch flow
injection time, as VM creation and restart take place periodically to create fresh backup copies,
and do not affect the downtime.
Table 3: Moving Target Defense (MTD) Measurements

Measurement                          Time
VM restarting time                   ~7 s
VM creation time                     ~11 s
Open vSwitch flow injection time     ~250 ms
1.4.3 Data Provenance and Leakage Detection in Untrusted Cloud
Fig.11. EHR dissemination in cloud (created by Dr. Leon Li, NGC)
Based on the architecture framework illustrated in Fig. 11, we performed several experiments to evaluate the performance overhead of our approach. They are included in the IEEE paper "Privacy-preserving data dissemination in untrusted cloud" [49], written in collaboration with Donald Steiner, Leon Li, and Jason Kobes as co-authors.
Experimental setup
Hardware: Intel Core i7 CPU 860 @ 2.8 GHz x8, 8 GB DRAM
OS: Linux Ubuntu 14.04.5, kernel 3.13.0-107-generic, 64-bit
Browser: Mozilla Firefox for Ubuntu, ver. 50.1.0
We use an Active Bundle which, in addition to tamper resistance, supports detection of the client browser's cryptographic capabilities and authentication method. A local request for the patient's contact information is sent by the doctor's service to an Active Bundle, which represents the EHR and runs on a Purdue University server (waxedprune.cs.purdue.edu:3000). We use an Active Bundle with 8 access control policies of similar complexity. The data request is issued from a service running on the same host as the Active Bundle. Thus, we measure the RTT for a local data request and exclude network delays that could affect the measurements.
Fig.12. AB performance overhead with browser's crypto capabilities on / off
1.4.4 Encrypted Search over Encrypted Data
We deployed a database of Active Bundles in encrypted form. The database contains extra attributes used for indexing an AB, i.e., a short abstract of the AB and its keywords. In our case, diagnosis and age are used as extra attributes, and they are stored in encrypted form. The ID maps each record to its corresponding Active Bundle, i.e., to the Electronic Health Record of a given patient. As can be seen from Fig. 13, a user's search query is converted to a SQL query, and the SQL query is then converted into encrypted form using user-defined functions. The open-source CryptDB [44] engine is able to execute a set of SQL queries operating on encrypted data.
Table 4. Encrypted Database of Active Bundles. Table 'EHR_DB'

ID    Diagnosis           Age
001   Concussion          35
002   Insomnia            30
003   ColdSore Herpes1    40
Fig.13. Encrypted Search over Encrypted Database
Query example:
SELECT ID FROM EHR_DB WHERE Age BETWEEN 35 AND 40;
Converted query:
SELECT c1 FROM Alias1 WHERE ESRCH ( Enc(age), Enc(35, 40) );
The result is {001, 003}; it is received in plaintext form on the client's side, but it remains encrypted on the cloud provider's side and is thus protected against curious or malicious cloud administrators.
1.5 Distinctive Attributes, Advantages and Discriminators
The distinctive attributes of the cloud security/resiliency approach we proposed in this
project are the following:
Automated, self-healing cloud services: Existing cloud enterprise systems lack robust
mechanisms to monitor compliance of services with security and performance policies
under changing contexts, and to ensure uninterrupted operation in case of failures. The
proposed work will demonstrate that it is possible to enforce security and performance
requirements of cloud enterprise systems even in the presence of anomalous
behavior/attacks and failure of services. The resiliency and self-healing will be
accomplished through automated reconfiguration, migration and restoration of services
with software-defined networking. Our approach will be complementary to functionality
provided by log management tools such as Splunk in that it will develop models that
accurately analyze the log data gathered by such tools to immediately detect deviations
from normal behavior and quickly respond to such anomalous behavior in order to
provide increased automation of threat detection as well as resiliency. This work will be
done to contribute to IRADS on cyber resilient system and enterprise resiliency.
Agile and resilient virtualization-based systems: We will design and implement the
resiliency framework on top of a cloud framework with special emphasis on time (as low
as a minute) and space diversification and randomization across heterogeneous cloud
platforms (i.e. OS, Hypervisors) while proactively monitoring (i.e. virtual introspection)
the nodes. We abstract the system runtime from the virtual machine (VM) instance to
formally reason about its correct behavior. This abstraction allows the framework to enable MTD capabilities for all types of systems regardless of their architecture or communication model (i.e., asynchronous or synchronous) on all kinds of cloud
platforms (e.g. OpenStack). The developed prototype including modules for service
monitoring, trust management and dynamic migration/restoration can be easily integrated
into NGC cybersecurity software. The modular architecture and use of standard
software in the monitoring framework allows for easy plugin to IRAD software. The
resiliency work will allow identification of NGC clients’ requirements for building
capabilities in prototypes for Air Force Research Lab (AFRL), DoD and SSA, IRS and
State Department. We plan to work closely with Daniel Goodwin of NGC on BAA
proposals.
Policy-based data access authorization, provenance and leakage detection: Our secure
data dissemination model, based on Active Bundles, ensures that services can get access
only to those data items for which they are authorized. In addition to policy-based
access control, we support trust-based and context-based data dissemination. Trust level
of services is continuously monitored and determined. In addition, we check the cryptographic capabilities of the browser used by a client that tries to access sensitive data. Data dissemination takes into consideration how secure the client's browser is, how secure the client's network is, and how secure the client authentication method is (e.g., password-based vs. hardware-based or fingerprint authentication). We implemented a data leakage detection mechanism, so that if an authorized service leaks data to an unauthorized one, the leak is detected by the Central Monitor and reported to the data owner. Data provenance
tracking allows greater, fine-grain control over data access control decisions and enables
limiting data leaks.
1.6 Tangible Assets Created by Project
The tangible assets created by this project include active bundle software, a tutorial describing
the use of the developed software, as well as demos for Secure Data Dissemination/Provenance
and our Moving Target Defense (MTD) technique.
Moving Target Defense (MTD) Solution:
Demo:
Code:

Secure Data Dissemination/Provenance:
Demo:
Code for Northrop Grumman 'WAXEDPRUNE' extended prototype:
In addition, we have the following related material available:
• The tutorial explaining the use of the active bundle software
• Code repository for the "WAXEDPRUNE" prototype: Ulybysh/absoa16
• Demo videos demonstrating the privacy-preserving data dissemination capabilities of the system:
  o Video 1
  o Video 2
We have also created the following demo videos to demonstrate the capabilities of the adaptable
service compositions and policy enforcement system for cloud enterprise agility:
• Introduction to the user interface and a composite web service for ticket reservations
• Composite trust algorithms for cloud-based services: v=6uHEfoxjEgs
• Trust update mechanisms
• Enforcing client's QoS and security policies
• Service redirection policy
• Adaptable service composition: va.wmv
1.7 Outreach Activities and Conferences