Six Elements of Big Data Security



Table of Contents

Introduction
Chapter 1: Big Data Security Rationales
    Finding Threats Faster Versus Trusting a Tool
    Big Data Potentially Can Change the Entire Architecture of Business and IT
Chapter 2: Securing HeavyD
    Why Big Data Security is Necessary
    Does Security Even Work?
Chapter 3: How Does Big Data Change Security?
    Frameworks and Distributions
    Shrink the Square Peg to Fit a Round Hole?
Chapter 4: Understanding Big Data Security Failures
    Scope of the Problem
    Can We Get Beyond CIA?
Chapter 5: Framing the Big Data Security Challenge
    Why Not Give Up and Wait?
    Can Privacy Help Us?
Chapter 6: Six Elements of Big Data Security
    Threat Model for a Hadoop Environment
    Six Elements
    Automation and Scale
    Bottom Line on Network and System Security
    Element 2: Data Protection
    Bottom Line on Data Protection
    Element 3: Vulnerability Management
    Bottom Line on Vulnerability Management
    Element 4: Access Control
    Bottom Line on Access Control
    Bottom Line on Policies
Conclusion


Introduction
You can’t throw a stick these days without hitting a story about the future of Artificial
Intelligence or Machine Learning. Many of those stories talk at a very high level about the
ethics involved in giant automation systems. Should we worry about how we use newfound power in big data systems? While the abuse of tools is always interesting, behind the
curtain lies another story that draws far less attention.
These books are written with the notion that all tools can be used for good or bad, and
ultimately what matters for engineers is to find a definition of reasonable measures of
quality and reliability. Big data systems need a guide to be made safe, because ultimately
they are a gateway to enhanced knowledge. When you think of the abuse that can be done
with a calculator, looking across the vast landscape of fraud and corruption, imagine now if
the calculator itself cannot be trusted. The faster a system can analyze data and provide a
“correct” action or answer, the more competitive advantage to be harnessed in any industry. A complicated question emerges: how can we make automation tools reliable and predictable enough to be trusted with critical decisions?
The first book takes the reader through the foundations for engineering quality into big
data systems. Although all technology follows a long arc with many dependencies, there
are novel and interesting problems in big data that need special attention and solutions.
This is similar to our book on “Securing the Virtual Environment” where we emphasize a
new approach based on core principles of information security. The second book then
takes the foundations and provides specific steps in six areas to architect, build, and assess
big data systems. While industry rushes ahead to cross the bridges of data we are excitedly
building, we might still have time to establish clear measurements of quality, as it relates
to whether these bridges can be trusted.





Chapter 1: Big Data Security Rationales
This chapter aims to help security become an integral part of any big data systems discussion, whether it is before or after deployment. We all know security isn’t embedded yet.
Security is too often left out of the deployment discussion, let alone the planning phase. "I'm sure someone is thinking about that" might end up being you. If you have a group talking about real security steps and delivery dates early in your big data deployment, you are likely the exception. This gap between theory and reality exists partly because security practitioners lack perspective on why businesses are moving towards big data systems; the trained risk professionals are not at the table to think through how best to approach threats and vulnerabilities as technology is adopted. We faced a similar situation with cloud technology. The
business jumped in early, occasionally bringing someone from the security community in
to look around and scratch their head as to why this even was happening.
Definitions of a big data environment are not the point here (we’ll get to that in a minute), although it’s tempting to spend a lot of time on all the different descriptions and
names that are floating around. That would be like debating what a cloud environment really is. The semantics and marketing are useful, yet ultimately they do not move us along much
in terms of engineering safety. Suffice it up front to say this topic of security is about more
than just a measure of data size and is something less tangible, more sophisticated, and
unknown in nature. We say data has become big because size matters to modes of operation, while really we also imply here a change in data rates and variations. In rough terms,
the systems we ran in the past are like horses compared to these new steam engine discussions, so we need to take off our client-server cowboy hat and spurs, in order to start thinking about the risks of trains and cars running on rails and roads. Together, the variables
have become known as engines that run on 3V (Volume, Velocity, Variety), a triad which
apparently Gartner coined first around 2001.
The rationale for security in this emerging world of 3V engines is really twofold. On the
one hand security is improved by running on 3V (you can’t predict what you don’t know) and
on the other hand, security has to protect 3V in order to ensure trust in these engines. Better
security engines will result from 3V, assuming you can trust the 3V engines. Few things speak to this situation, the need for faster and better risk knowledge from safe automation, than the Grover Shoe Factory disaster of 1905.


On the left you see the giant factory, almost an entire city block, before disaster. On the
right you see the factory and even neighboring buildings across the street turned into
nothing more than rubble and ashes.
The background to this story comes from an automation technology rush. Around 1890
there were 100,000 boilers installed, as Americans could not wait to deploy steam engine
technology throughout the country. During this great boom, in the years 1880 to 1890, over
2,000 boilers were known to have caused serious disasters. We are not just talking about
factories in remote areas. Trains with their giant boilers up front were ugly disfigured things
that looked like Cthulhu himself was the engineer.

Despite decades of death and destruction through the late 1800s, the Grover Shoe Factory still suffered a catastrophic explosion in 1905, with cascading failures that leveled the entire building and burned it to the ground with workers trapped inside.


This example helps illustrate why trusted 3V engines are as important as, if not more important than, the performance benefits of a 3V engine. Nobody wants to be the Grover Shoe Factory of
big data, so that is why we look at the years before 1905 and ask how the rationale for security was presented. Who slept through safety class or ignored warnings when building a
big data engine? We need to take the security issue very seriously, because these “engines”
are being used for very important work, and security issues can have a cascading effect if
not properly implemented.
There is a clear relationship between the two sides of security: better knowledge from
data and more trusted engines to process the data. I have found that most people in the security community are feverishly working on improving the former, generating lots of
shoes as quickly as possible. The latter has mostly been left unexplored, leaving unfinished
or unknown how exactly to build a safe big data engine. That is why I am focusing primarily
on the rationale for security in big data with a business perspective in mind, rather than
just looking at security issues for security market/industry purposes.

Finding Threats Faster Versus Trusting a Tool
Don’t get me wrong. There is much merit in the use of 3V systems for data collection in
order to detect and respond to threats faster. The rationale is to use big data to improve
the quality of security itself. Many people actively are working on better security paradigms
and tools based on the availability of more data, which is being collected faster than ever
before with more detail. If you buy a modern security product, it is most likely running on a
big data distribution. You could remove the fancy marketing material and slick interface
and build one yourself. One might even argue this is just a natural evolution from the existing branches of detection, including IDS, SIEM, AV, and anti-SPAM. In all of these products,
the collection and analysis of as much data as possible is justified by the need to more
quickly address real threats and vulnerabilities.
Indeed, as the threat intelligence community progressed towards an overwhelming flow
of data being shared, they needed better tools. From collection and correlation to visualization to machine learning solutions, products have emerged that can sift through data and get a better signal from the noise. For example, let's say three threat intelligence feeds carry essentially the same indicator of compromise, each only slightly altered, making it hard for humans to see the similarities. A big data engine can find these overlaps much quicker.
However, one wonders whether the engine itself is safe, while being used to quickly improve our knowledge of threats and vulnerabilities.
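
To make the correlation idea concrete, here is a minimal Python sketch of flagging near-duplicate indicators across feeds; the feed names, sample indicators, and the similarity threshold are hypothetical illustrations, not drawn from any particular product or feed.

    from difflib import SequenceMatcher
    from itertools import combinations

    # Hypothetical indicators of compromise (IOCs) from three feeds, each slightly
    # altered (typo-squatted domain, transposed digits) so a human scan misses the overlap.
    feeds = {
        "feed_a": ["evil-updates.example.com", "198.51.100.23"],
        "feed_b": ["evil-updates.example.com.cdn.example.net", "198.51.100.32"],
        "feed_c": ["evi1-updates.example.com", "203.0.113.7"],
    }

    def similar(a: str, b: str, threshold: float = 0.8) -> bool:
        """Treat two indicators as related if their character overlap is high."""
        return SequenceMatcher(None, a, b).ratio() >= threshold

    # Compare every pair of indicators across different feeds.
    for (name1, iocs1), (name2, iocs2) in combinations(feeds.items(), 2):
        for a in iocs1:
            for b in iocs2:
                if similar(a, b):
                    print(f"possible match: {a!r} ({name1}) ~ {b!r} ({name2})")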

Big Data Potentially Can Change the Entire Architecture of Business and IT



It makes a lot of sense at face value that instead of doing analysis on multiple sources of
information and disconnected warehouses, a centralized approach could be a faster path
with better insights. The rationale for big data, meaning a centralized approach, can thus
be business-driven, rather than driven by whatever reasons people had to keep the data
separate, like privacy or accuracy.
Agriculture is an excellent example of how an industry can evolve with new technology.
Replace the oxen with a tractor, and look how much more grain you have in the silos. Now
consolidate silos with automobiles and elevators and measure again. Eventually we are
reaching a world where every minute piece of data about inputs and outputs from a field
could help improve the yield for the farmer.
Fly a drone over orchards and collect thermal imagery that predicts crop yields or the
need for water, fertilizer, or pesticides; these inexpensive bird's-eye views and collection systems are very attractive because they can significantly increase knowledge. Did the crop
dusting work? Is one fertilizer more effective at less cost? Will almonds survive the
drought? Answers to the myriad of these business questions are increasingly being asked
of big data systems.
Drones even can collect soil data directly, not waiting for visuals or emissions from
plants, predicting early in a season precisely what produce yields might look like at the
end. Robots can roam amongst cattle with thermal monitors to assess health and report
back like spies on the range. Predictive analysis using remote feeds from distributed areas,
which is changing the whole business of agriculture and risk management, depends on 3V
engines running reliably.
Today the traditional engines of agriculture (diesel-powered tractors) are being set up
to monitor data constantly and provide feedback to both growers and their suppliers. In
this context, there is so much money on the line, with entire markets depending on accurate prediction; everyone has to trust the data environment is safe against compromise or
tampering.
A compromise in the new field of 3V engines is not always obvious or absolute. When
growers upload their data into a supplier’s system, such as a seed company, that data suddenly may be targeted by investors who want to get advance knowledge of yields to game
the market. A central collection system would know crucial details about market-wide supply changes long before the food is harvested. Imagine having just one giant boiler in a factory, potentially failing and setting the whole business on fire, rather than having multiple
redundant engines where one can be shut down at the first sign of trouble.



To be fair, the 1905 shoe factory had dual boilers. It is a mystery to this day why the
newer, more secure model wasn’t being used. Instead, they kept running an older one at
unsafe performance levels. Perhaps someone thought the new one was harder to manage
or was not yet as efficient because of safety enhancements.
Again I want to emphasize that it is very, very easy to find warnings about the misuse or
danger of 3V systems. The ethics of using an engine that carries great power are obvious.
Examples of what is really at stake can be found nearly anywhere. Search on Google, for
example, using terms like "professional hair" and "unprofessional hair" and you quickly see a problem.


Social scientists could talk for days about the significance of obviously imbalanced results like these and how they perpetuate bias. This is a shocking yet somewhat lighthearted example. Even more troubling is predictive policing technology that consistently ranks risk based on race, further entrenching racism in justice systems. This
kind of analytic error leads to systems that do a poor job of predicting actual violent crime,
making obvious mistakes. This old cartoon hints at the origin of the black hoodie for stoking fear when talking about hackers or criminals of any kind, really.



Clearly people driving the engine aren’t exactly working that hard at ensuring common
concepts of safety or accuracy are in place for actually useful results. It is almost as though
Douglas Adams was right in his joke that the world’s smartest computer, when asked the
meaning of life, would simply reply “42.” That is what makes careless errors so easy to find.
And it is fundamentally a different problem than what I would describe as issues of quality
in the engines beneath these poorly executed usages. In fact, I would argue quality in the
engine is set to become an even more serious issue as we put pressure on users to think
about bias and prejudice in their application development. The troubles will shift behind
the curtain.
At least you have some leverage when algorithms are poorly orchestrated. A Google engineer can say “Oops, I forgot to train the algorithm on non-white faces.” When the algorithms depend on infrastructure that fails, will the same be true? Will engineers be able to
say, “Oops, I see that someone two days ago was injecting bad data into our training set”
and set about fixing things? To make a finer point, it doesn’t matter how good an algorithm
is if the engine lacks data integrity controls. Network communication or storage without a
clear and consistent way to prevent malicious modification is a problem big data environments have to anticipate.
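
As one concrete illustration of a data integrity control, here is a minimal sketch of signing and verifying records with an HMAC before they reach a training set; the shared key handling and the record format are placeholder assumptions, not a recommendation for any specific big data platform.

    import hmac
    import hashlib

    SECRET_KEY = b"replace-with-a-managed-secret"  # placeholder; use a real key store

    def sign_record(record: bytes) -> str:
        """Producer side: attach an HMAC so tampering in transit or storage is detectable."""
        return hmac.new(SECRET_KEY, record, hashlib.sha256).hexdigest()

    def verify_record(record: bytes, signature: str) -> bool:
        """Ingest side: reject records whose HMAC does not match before they reach a training set."""
        expected = hmac.new(SECRET_KEY, record, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, signature)

    record = b'{"sensor": "field-7", "moisture": 0.31}'
    sig = sign_record(record)
    assert verify_record(record, sig)                                           # untouched record passes
    assert not verify_record(b'{"sensor": "field-7", "moisture": 0.99}', sig)   # modified record fails
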
The results we see often can be explained as a function of machines presenting answers
to questions that reinforce our existing bias. What also needs to be investigated and prepared for, and what is more the focus of this set of books, is how to build protection for the underlying
systems themselves. We need to go deeper to where we are talking about failures beyond
bias or error by operators. We need to be talking about a threat model exercise, and thinking about how to stay safe when it comes to attackers who intend to poison, manipulate,
or otherwise break your engine.
We saw a very real-world case of this with Microsoft’s “learning” bot called TayTweets.
Within a day of it being launched on Twitter, a concerted effort was made by attackers to
poison its “learning” system. I use scare-quotes here because I quickly uncovered that this
supposedly intelligent system was being tricked using a dictation flaw, and not really learning. For every strange statement by the bot I looked at the conversation thread and found
someone saying “Repeat after me!” Here you can see what I found and how I found it.



TayTweets was presented as a system able to handle huge amounts of data and learn
from it. However, by issuing a simple dictation command the attackers could bypass
“learning” and instead expose a “copy and paste” bot. Then the attacker simply had to dictate whatever they wanted TayTweets to say so they could take a screenshot and declare
(false) victory. It really is a terrible big data design that Microsoft tried to pass off as advanced, when the foundation was so weak it was almost immediately compromised by
simpletons.
Don’t let this environment be yours, where some adversary can bypass advanced algorithms and simply set answers to be whatever serves their purposes. When you think about
the hassle of distributed denial of service (DDoS) attacks today as an annoyance, imagine
someone dumping poison into your training set or polluting your machine learning interface. The only good news in the Microsoft TayTweet case was that the attackers were too
dumb to cover their tracks and effectively painted a giant target on themselves; we unintentionally had a sweet “honey pot” that allowed us to capture bad identities as they eagerly sent streams of “copy and paste” commands.
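
A minimal sketch of the kind of input guard that appears to have been missing: screen incoming messages for dictation-style commands before they are treated as training data. The marker phrases and the pipeline hook are hypothetical illustrations, not Microsoft's actual design.

    # Hypothetical pre-filter for a learning chat bot: drop dictation-style prompts
    # ("repeat after me ...") before they are treated as training input.
    DICTATION_MARKERS = ("repeat after me", "say exactly", "echo this")

    def is_dictation_attempt(message: str) -> bool:
        text = message.lower()
        return any(marker in text for marker in DICTATION_MARKERS)

    def ingest_for_learning(message: str, training_queue: list) -> None:
        if is_dictation_attempt(message):
            # Log and discard instead of learning from (or parroting) the input.
            print(f"rejected dictation attempt: {message!r}")
            return
        training_queue.append(message)

    queue: list = []
    ingest_for_learning("Repeat after me! <offensive statement>", queue)
    ingest_for_learning("What is the weather like today?", queue)
    print(queue)  # only the benign message remains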

A Look at Definitions
Let’s talk about 3V definitions in more detail. The most generic and easiest definition for
people to agree upon seems to have originated in 2001 from Doug Laney, VP of Research at
Gartner. He wrote "Velocity, Variety and Volume" are definitive characteristics. This has been widely accepted and quoted, as any search for a definition will tell you. Gartner offers
an updated version in their IT glossary:

Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

I find this definition easy to work with. Given that many people talk about big data like
streams leading into a “data lake,” I would like to take a minute to use the analogy of data
being a leaky pipe.

Imagine that a pipe drips data twenty times per minute, which is probably what you’ve
seen if you ever noticed a leaky faucet. In one day that is about 28,800 drips. Assuming
there is 0.25 milliliter per drip, and roughly 15,000 drips in a gallon, we are talking about a pipe losing about 700 gallons each year.
That kind of data leakage obviously is a problem, albeit a relatively small one on its
own. 700 gallons sounds large, yet in real life, as you listen to a leaky faucet drip, drip, drip, we know that 20 drips a minute easily can go unnoticed for quite a while.
Now let’s take this from basic math into the social science realm of possibilities. Multiply
the leaky pipe example by 10,000, which is a modest number of potentially faulty pipes in
any city or suburb. Hold on to your hat, as we suddenly are talking about 288 million drips,
or 19,000 gallons being wasted every single day! That is a lot of gallons thrown away every
day.
At what point would alarms be raised somewhere? In 30 million seconds, which is about
one year, these 10,000 pipes are going to throw away 7 million gallons. Our little drip math
example becomes a BIG problem for regulators when we collect it into a single place of
study and see the overall picture of what has been happening at the macro level. Of course,
that assumes we can get some record or report showing these 7 million gallons are wasted, rather than actually dripping onto a thirsty plant in an irrigation system. Perhaps you were already thinking ahead to the problem of differentiation. We should be able to devise a way to tell wasted drips every few seconds apart from normal water use.
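
The arithmetic above is easy to reproduce; here is a minimal Python sketch using the same rough assumptions (20 drips per minute, roughly 15,000 drips per gallon, 10,000 pipes).

    # Rough leaky-pipe arithmetic from the text, using the same approximations.
    DRIPS_PER_MINUTE = 20
    DRIPS_PER_GALLON = 15_000        # rough figure; ~0.25 mL per drip
    PIPES = 10_000

    drips_per_day = DRIPS_PER_MINUTE * 60 * 24                            # 28,800 drips per pipe per day
    gallons_per_pipe_per_year = drips_per_day * 365 / DRIPS_PER_GALLON    # ~700 gallons
    city_drips_per_day = drips_per_day * PIPES                            # 288 million drips
    city_gallons_per_day = city_drips_per_day / DRIPS_PER_GALLON          # ~19,200 gallons
    city_gallons_per_year = city_gallons_per_day * 365                    # ~7 million gallons

    print(f"{drips_per_day:,} drips/day per pipe")
    print(f"{gallons_per_pipe_per_year:,.0f} gallons/year per pipe")
    print(f"{city_gallons_per_day:,.0f} gallons/day across {PIPES:,} pipes")
    print(f"{city_gallons_per_year:,.0f} gallons/year across {PIPES:,} pipes")
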
All of this is before we even add in measuring variety, such as detecting various pollutants in every drip versus clean water. What kind of drips are they? Adding variety magnifies any kind of effort required (e.g., the water has unknown and varied ingredients that
require testing to identify). Thus BIG means you need “cost-effective, innovative forms of
information processing” for you to understand the leaks and decide what to do about it.
Or more to the point of common big data usage, maybe our reports do not come from
liquid sensors. People could take pictures of these leaky pipes and post to Instagram, or Tweet about drips, or have a "report a leak" application based on an API that gets queried
and reports are posted in a shared folder. The possibilities of knowledge gathering engines
are really ridiculously infinite when you consider the output possible for every sensor type
for every related concern.
Now let’s go to national scale, just for the sake of argument. If we build an analysis map
of water, we might end up with something like Surging Seas. Their interactive guide shows
what happens to the coastline when oceans rise. Basically the eastern pointy bits of the
country disappear.

Risk Areas from Surging Seas
A 3V world of data like this might seem obvious when you pull back far enough on the
scope controls to see rising water level impact on the coastline. The important thing here is that traditional IT systems would struggle to crunch data quickly enough to generate one city of data, let alone accurate national interactive maps. Think about the latest superhero movie and all the artificial disaster worlds generated on supercomputers; we quickly are becoming dependent on the highest-performance analysis money can buy. Processing lots of varied data that you're acquiring very quickly, meaning creating
knowledge from the freshest inputs possible, is the foundation of big data engines.
Take a little bit of data and you can capture, store, and measure it with your own tools
in an environment you bought and paid for as a one-time purchase. Increase the rate
enough, which leads to an increase in volume, and special tools and skills become required, potentially leading you to a datacenter-like environment you have to share without
owning. We are talking about the shift from working with 700 gallons to 7 trillion gallons or
ultimately even being able to measure the ocean. It reminds me of when I used to work for a
CIO who was fond of warning me “don’t boil the ocean” at the start of every project. Little
did he realize that the power of distributed nodes (humans and our machines) generating
output (carbon emissions) would create climate change fast enough to actually boil the
ocean. With big data environments, I could have said back to him, “We’ll see if we can figure out how to cool it down.”

With a simple 3V definition in hand, you would think things could progress smoothly.
And yet the more I met with people to discuss definitions and examples, the more I found
3V didn’t give much room to talk about security. Big security for big data? Didn’t sound
right. High security? That sounds more normal. Can we have high security big data? Eventually, at the suggestion of others, I experimented with heavy instead of high terminology. The thought at the time was to start using a new term and see if it would stick. It didn't, but it helped me find answers about how to get security into the definition. In the next chapter I will explain why there is a certain gravity to big data when discussing how to define risk.




Chapter 2: Securing HeavyD
Heavy data, shortened to HeavyD, started off as sort of a joke. Humor supposedly helps
with difficult subjects, and creates new insights. Whereas big is relative to volume, heavy
relates to force required. It is sort of like saying high security, but even more scientific. Back
to the leaking pipes example we used in the definition above, Newton’s second law of motion helps explain safety in terms of water:

A specific force, measured in newtons (N), is required to keep an object at the surface instead of under water. A person who weighs 75kg and falls off a ferry is already supported almost entirely by the water their body displaces, so only a small net force pulls them under. A 275N lifejacket, an adult standard size, gives 27.5kg of uplift (at roughly 10N per 1kg), more than enough to float and survive.
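
As a quick sanity check on those numbers, here is a minimal sketch of the newton-to-kilogram conversion used above, keeping the same rough figure of 10 N per kilogram.

    # A rough check of the lifejacket numbers in the passage above, using the
    # same approximation of about 10 N of force per 1 kg of mass.
    NEWTONS_PER_KG = 10          # rough conversion used in the text (g ~ 9.8 m/s^2)
    lifejacket_newtons = 275     # a standard adult lifejacket buoyancy rating

    uplift_kg = lifejacket_newtons / NEWTONS_PER_KG
    print(f"A {lifejacket_newtons} N lifejacket adds about {uplift_kg:.1f} kg of uplift")
    # About 27.5 kg of extra buoyancy, far more than the few kilograms of net
    # downward force on a nearly neutrally buoyant swimmer.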

Survival sounds really good. It reminded me of a lock or cryptography surviving an attack. So could we talk about the effort required for survival within data lakes in terms of
heavy or light? The analogy is tempting, yet I found most people don’t like to think about a
world in terms of lifejackets, never mind put one on, or to think about Newton. “The guy hit
in the head by an apple” doesn’t have the right ring to it. Einstein is apparently all the rage
these days. It seemed to make more sense to use the increasingly popular framework of
relativity to explain weight and heavy data; our ability to hold weight grows bigger and bigger depending on where you are in the timeline of transitions from analog to digital tools.
We are working to capture and interpret an infinite amount of analog data with our highly limited digital tools. From that perspective, today's big data tools in about five years'
time no longer would be considered big, given Moore’s law of progress in compute power. A
tour of the Computer History Museum may have reinforced my reasoning on this point.
The meaning of big data today versus hundreds of years ago, put in terms of our ever-changing computational power, is a reflection of how our tools today operate versus then.
Walking along evolutions of machines seemed to say what we consider big depends on the
time, as we have been working on big data "problems" for a very, very long time. The problems of navigation in the 1400s were astronomically difficult. Today navigation is so trivial,
everyone has a tiny chip that can tell you for virtually no cost how to find the best route to
black pepper vendors in India. Columbus-era big data would be a laugh, just like his tiny
boats, compared to what we can do with technology now.
My father used to explain cultural relativity in a similar way. You and I might live in different time zones. For someone in Moscow, it can be 8 in the morning while it is 10 at night
for someone in San Francisco. Yet both people share a global idea of absolute time. A definition of big data in this light would be that you and I share a definition of data but big for
you is a different number than for me. This brings us to the awesome question, “Can the
same security solutions fit an absolute definition, or must they be relative, also?” Perhaps
it is more like asking whether a watch can work in different time zones versus whether it
can work with different intervals of time; as far as I know, no one tried to monitor milliseconds with a sundial.
Managers today seek “enhanced insight and decision making” but they will not escape
the fundamentals of data like integrity or availability. This is clear. At the same time, we are
entering a new world in security where the old rules may no longer apply. We simultaneously need a different approach than we have used in the past to handle new speeds, sizes,
and changes while also honoring concepts we know so well from where we started. Petabytes already no longer are an exception or considered big, as the IDC explained in “The
Digital Universe in 2020.”

From 2005 to 2020, the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, or 40 trillion gigabytes (more than 5,200 gigabytes for every man, woman, and child in 2020). From now until 2020, the digital universe will about double every two years.
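
As a quick check on that projection, here is a minimal sketch of the growth rate implied by the figures quoted above.

    import math

    # Figures quoted from the IDC projection above.
    start_exabytes, end_exabytes = 130, 40_000
    years = 2020 - 2005

    growth_factor = end_exabytes / start_exabytes          # ~308x, the "factor of 300"
    annual_rate = growth_factor ** (1 / years)              # ~1.47x per year
    doubling_time = math.log(2) / math.log(annual_rate)     # ~1.8 years

    print(f"growth factor: {growth_factor:.0f}x")
    print(f"implied doubling time: {doubling_time:.1f} years")  # roughly "every two years"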

The predictions about size usually just sound like a bunch of numbers. This size, then this size, then this size. What does any of this really mean to heavy data? One of our best
sources of breaches for studying how to respond and improve global security actually will
fit on a thumb drive. That’s right; in the world of security, about 2GB was what was considered big for some current definitions of heavy. Meanwhile the CEO of Pivotal was boasting
in 2014 that his customers were gearing up to work with 500PB/month. When security analysts are excited to work with 2 gigabytes for insights and trends and knowledge, and businesses are pushing past 500,000 terabytes/month in their plan, I see a significant delta. Fortunately, things change and I expect security to see a massive increase.
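
The delta is easier to appreciate with the numbers side by side; here is a minimal sketch comparing the two figures mentioned above, using decimal units as in "500,000 terabytes."

    # Comparing the two data volumes mentioned above: a ~2 GB breach corpus that
    # security analysts treat as "big" versus a 500 PB/month business workload.
    GB_PER_PB = 1_000_000            # decimal units, as in "500,000 terabytes"

    security_corpus_gb = 2
    business_monthly_gb = 500 * GB_PER_PB

    ratio = business_monthly_gb / security_corpus_gb
    print(f"The business workload is {ratio:,.0f}x the security corpus")  # 250,000,000x
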
Perhaps a good real-world example of the potential rate of growth was in 2008 when I
worked at Barclays Global Investors. We secured an environment with many petabytes of
real-time global financial data. Our size in terms of data analysis (quant) capabilities was
arguably leading the world back then. Just five years later, “many petabytes” no longer
was considered exceptional in the industry. That is a strikingly fast evolution of size. With
luck, the security industry will be catching up any moment now to other industries.
Our concept of what is really heavy already should be headed into an exabyte range
(1000 petabytes). Some of our present security tools will survive these transitions from
light to heavy data, and some may not. Data everywhere, for purposes of this book, means
using current tools to acquire, process, and store the heaviest possible amount of data.
While our notion of whether a tool is current becomes far less constant (e.g., cloud computing gives us incredibly rapid expansion of capabilities), our security controls must be designed to maintain consistency across this different scope of data to be effective.
Let me use Security Information and Event Management (SIEM) tools as an example. We are seeing a continuous shift from collection and correlation as a market
to the present demand for real-time analysis and forensics of as much data as possible.

Although there were a plethora of vendors five years ago in the log management market
(over 15!) that offered proprietary off-the-shelf log management infrastructure, today they
compete with the even larger cloud infrastructure (IaaS) market. “Pallets shipped” used to
be a metric of SIEM vendor success, which due to the success of Amazon’s service model
has become almost an extinct concept. This transformation parallels the more general IT
market and can be illustrated with three generations of data analysis capability: Batch,
Near-time and Real-time.

Batch systems were the SIEM objective and leaders of five years ago. They solved for
problems of integrity and availability, such as UDP packet loss and the need for dedicated
immutable storage for investigations. They struggled with over-promising analytic and investigation capabilities, which has opened the door to a new set of vendors. In short, the
foundation capabilities were to collect all data and store it. As this value became commoditized by cloud infrastructure, customers sought real analytics and speed for knowledge.


Because a query on the established archive data system could take weeks or even
months, a new era of near-time systems evolved. These products assume varied volumes
of high-velocity data (batch infrastructure) and provide significant analytic performance
enhancements, such that a human analyst can get results fast enough to pivot and explore
through ongoing investigations. What looks like a suspicious external IP address? Perhaps
you found a command and control destination. Now what internal IP addresses are talking
to that external IP? Perhaps you found infected machines.
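
A minimal sketch of that pivot over flow records, assuming a simple in-memory list of connection tuples; the field layout and the suspect address are made-up examples, not the output of any particular near-time product.

    # Hypothetical flow records: (internal_ip, external_ip, bytes_out)
    flows = [
        ("10.0.4.12", "203.0.113.50", 48_213),
        ("10.0.4.99", "198.51.100.7", 1_022),
        ("10.0.7.33", "203.0.113.50", 910_554),
    ]

    # Step 1: an analyst flags a suspicious external address (e.g., a suspected C2 host).
    suspect_c2 = "203.0.113.50"

    # Step 2: pivot to find which internal hosts talked to it.
    likely_infected = sorted({src for src, dst, _ in flows if dst == suspect_c2})
    print(likely_infected)  # ['10.0.4.12', '10.0.7.33'] -> candidates for deeper forensics
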
Real time is the newest and emerging market. These tools attempt to raise the performance bar again, offering forensic-level capability to capture attacks as they happen
and peel apart detailed indicators of compromise. They also introduce machine learning
capabilities to reduce the false-positive and noise rate, which will likely be the Achilles heel
of the near-time market. That really means that the further you move from left to right on
this continuum, the heavier the data you should be able to handle, safely.

All of these security data management solutions will offer batch generation capability,
although many are off-loading this architecture to the cloud. Some solutions offer near-time capability, which usually indicates big data technology instead of just batch. In fact, if
you pull back the curtain you likely will find a big data distribution sitting in front of you.
Only a few solutions so far are integrating digital forensics and incident response tools to
provide real-time capabilities from their big data near-real-time systems. Of course, integration of the three isn't required. All three of these phases can be built and configured instead of packaged and sold off the shelf. However, if you're talking truly heavy data, then
questions quickly arise as to how safe things are.
Walking into a meeting with an entire security group can sometimes be intimidating. On
one occasion, I was led through long marble-lined hallways past giant solid wooden doors
to a long table surrounded by very serious looking faces. “Would you like some water?” the
head of the table asked me, as if to indicate I should be properly prepared for a very dry
conversation. “We understand your system is collecting voice traffic” he continued, “and
we have strict instructions not to collect any voice data in the system." Their concern, rightly so, was that someone breaching the repository of data would be able to play back conversations. The Nixon tapes had made an impression on someone, I supposed, and they
wanted to talk about how to keep the security big data system free of conversations. Of
course, I wanted to talk about how safe the system could be made.

Why Big Data Security is Necessary
You have seen several examples already of how big data is sold as a way to achieve rapid intelligence for success and avoid failure from delays. Cure for disease, better agriculture, safer travel, saving animals from extinction; the list gets longer every day. It is sold as everything from finding a new answer to looking up prior knowledge to avoid mistakes and
waste. In many ways and in virtually every industry, velocity can be pitched as a better
path because it can fundamentally alter business processes. This is the steam engine transition away from horses and to a whole new concept of speed. Yet how comfortable really
should you be with the brakes and suspension of a car before you push the accelerator to
the floor? Perhaps you have heard of someone who avoided disaster just in time due to
finding what they needed with enhanced speed. Or maybe you know of someone who
rushed ahead without crucial input and created a disaster?

There are risks ahead that seem awfully familiar to anyone who reads the unfortunate
story of the Grover Shoe Factory. The question is whether we can figure out the baseline of reliability or control we need to avoid serious disaster. The tools necessary for high-performance data management and analysis include data protection. The reality is that big data
environments presently require investment and planning beyond any default configuration
(default safe configuration simply is not the way IT runs). Before we open up the power of
big data, consider whether we have put in place the confidentiality, integrity, and availability we need to keep things pointed in the right direction.
An interesting example might be the 2013 UNICEF Digital Maps project. A global organization asked youth to be “urgency rank testers” and “prioritize issues and reduce disaster
risk” by uploading photos in their neighborhoods to a cloud site. On the face of it, this
project sounds like an opportunity for kids to try and make their piece of the world a better
place. In aggregate, it gives global planners a better idea of where to send clean-up resources or change policies.

You might imagine, however, that some people do not appreciate kids sending pictures
of their pollution to an external authority for judgement - whistleblowers in literally every
place a child can go. So the big data system becomes very heavy very quickly, and has to
work hard to protect the confidentiality and integrity of reports.


I intend to discuss measurements across the usual areas, to avoid reinvention of ideas
where possible. What I’m talking about is confidentiality, integrity and availability. Trying
to get somewhere fast with the analysis of data, especially when you have kids acting as
sensors to report environmental issues, can easily run into trouble with all kinds of threat
models. I’ll touch on threats to availability first. Perhaps my favorite example of threat
modeling comes from a very large retail company that consolidated all its data warehouses
into a central one. Once data was pooled, access became easy for all its developers and scientists. Easy access also unfortunately meant easy mistakes could be made, worsened by a
lack of any accountability or audit trail.
A single wrong command, by a well-intentioned intern, destroyed their entire consolidated data set. The reload took several days, negating all the speed gains for their analysis. In
other words, moving to the consolidated big data environment was justified because they saw answers to queries in a day instead of a week. Losing all the data meant they were
back to a week before they could get their answers.
This real-world example was echoed at a big data conference when a man from Home
Depot stood on stage and told the audience his CEO almost cancelled the entire Hadoop
project when it became non-responsive. Initial results were so promising, so much insight
gained, that the company rushed forward with their infrastructure. At the point of using it
in production there was an availability error and the CEO apparently was fuming mad as
engineers scrambled to resurrect the system. Cool story.
“Oops, I dropped the entire table” or “it might be a few more hours” becomes a catastrophic event in the high-speed open and uncontrolled big data environment. Any time
saved from data being so easily accessible is lost as data goes completely unavailable.
Availability therefore is presented in this book as the first facet to protect data performance.
Once a system is available, we will then discuss data integrity. If the data both sits available for use and has not been corrupted, we can then talk about protecting confidentiality.
You may have noticed the simple reason for approaching data protection in this particular
order. There is no need to measure integrity and confidentiality when you are in a "no go"
situation. Systems are offline. When systems fail, the other measurements basically are
zero. So first we might as well talk about ensuring systems are “go” before we get to the
levels of how to protect data on those systems.
Next, assuming I have achieved reasonable availability through fancy new distributed
computing technology for big data environments, data integrity failure comes into focus.
This tends to be the source of some very interesting results, changing the very knowledge
and intelligence we gain from big data, which is why I anticipate it will become the most
important future area of risk. High availability of bad data (spreading bad information) can be as bad as no availability at all. Maybe it can be worse. Some say it is more dangerous to publish incorrect results than to fail to publish anything at all, since you then have to apply resources toward correction and cleanup. I still get the impression from environments I work in that they would rather be online than offline, so I'm sticking with availability, then
integrity. Your mileage may vary.
When talking about integrity, I’m tempted to use the old saying, “You can’t connect dots
you don’t have.” This implies that if you remove privacy, you get more data, which produces more knowledge. It sounds simple. Yet it really is also an integrity discussion. What
comes to mind when you see this image from Saarland University researchers (http://www.mia.uni-saarland.de/Research/IP_Compress.shtml)?

Do you see a big white bird or a plane? Do you see someone's right arm? Some people I have shown this image to have said, "that's Lena!" The image you are looking at has been reconstructed after 98% of the data was lost. That's right, only 2% of the data remained, integrity almost completely destroyed. Perhaps I should say compressed. Since
the beginning of computers, people have seen great financial benefit to reducing data
while maintaining integrity of images. Streaming video and music are great examples of
this. So here is a look at the “random” data that actually was stored or transmitted to generate the above image.

Who is Lena? A picture of a woman in a magazine was scanned into digital format and
used for image compression tests and has become a standard. Lena, after 98% was removed, looks like random dots until you apply smoothing and diffusion. In the old days of
the Internet, such as greyscale or black and white indexed screens, compression of Lena’s
