USING ARTIFICIAL NEURAL NETWORKS TO IDENTIFY IMAGE SPAM

A Thesis
Presented to
The Graduate Faculty of The University of Akron

In Partial Fulfillment
of the Requirements for the Degree
Master of Science






Priscilla Hope
August, 2008

USING ARTIFICIAL NEURAL NETWORKS TO IDENTIFY IMAGE SPAM

Priscilla Hope

Thesis

Approved: Accepted:
_______________________________ _______________________________
Advisor Dean of the College
Dr. Kathy J. Liszka Dr. Ronald F. Levant

_______________________________ _______________________________


Faculty Reader Dean of the Graduate School
Dr. Timothy W. O'Neil Dr. George R. Newkome

_______________________________ _______________________________
Faculty Reader Date
Dr. Tim Margush

_______________________________
Department Chair
Dr. Wolfgang Pelz

ABSTRACT

Internet technology has made international communication easy and convenient. This
convenience has led many people to rely on electronic mail in almost all
spheres of life, personal and business alike. Unscrupulous organizations and individuals have taken
undue advantage of this convenience and populate users’ inboxes with unwanted
messages, making email spam a menace. Even as anti-spam software producers think they
have almost solved the problem, spammers come out with new techniques. One such
tactic in the spammers’ toolbox comes in the form of image spam: messages that contain
little more than a link to an image rendered in an HTML mail reader. The image typically
contains the spam message one hopes to avoid, yet it is able to bypass most filters due to
the composition and format of these pictures.
This research focuses on identifying these images as spam by using an artificial
neural network (ANN), a software program for recognizing patterns that is modeled on the
biological neural networks in our brains. As information propagates through a neural
network, it “learns” about the data. A large collection of both spam and non-spam images
has been used to train an ANN and then to test the effectiveness of the trained network
against sets of pictures that it has and has not seen before. This process involves
formatting images and adding the desired training values expected by the ANN. Several
different ANNs have been trained using different configurations of hidden layers and
nodes per layer. A detailed process for preprocessing spam image files is given, followed
by a description of how to train an artificial neural network to distinguish between ham
and spam. Finally, the trained network is tested against both known and unknown images.


ACKNOWLEDGEMENTS

This research would not have been possible without Jason Bowling making his ideas
available for further study. I’m grateful to him for his generosity. I also appreciate
Garth Bruen of Knujon for contributing spam images, without which my corpus would
have been small. I appreciate my committee members, Dr. Tim O’Neil and Dr.
Margush, for their insightful corrections. My sincerest gratitude goes to Dr. Kathy J.
Liszka, my supervisor, with whose help this research became a joy to work on. Thanks,
Dr. Liszka, you are the best supervisor!










TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER
I. INTRODUCTION
II. THE NATURE OF SPAM
2.1 Basic Definitions
2.2 History and Statistics
2.3 The Long Arm of Spam
2.4 Spam Filters
2.5 Who Are Spammers and Why Can’t We Stop Them
2.6 The Cost of Spam
2.7 Why Are We Reading Spam
2.8 Getting Past the Spam Filter
2.9 Related Research
III. IMAGES AND THE CORPUS
3.1 Image Spam Creating Techniques
3.2 Image Formats
3.3 Image Preparation
3.4 Corpus
IV. THE ARTIFICIAL NEURAL NETWORK
4.1 Fast Artificial Neural Network (FANN)
4.2 Creating the Artificial Neural Network
4.3 Training the Artificial Neural Network
4.4 Testing the Artificial Neural Network
V. TRAINING RESULTS
5.1 Training files
5.2 Test Results
5.3 Sample Runs
VI. CONCLUSION AND FUTURE WORK
REFERENCES
APPENDIX



LIST OF TABLES

Table
3.1 Corpus Statistics
5.1 Training Image Times for 50 Hidden Neurons
5.2 Training Image Times for 75 Hidden Neurons
















LIST OF FIGURES

Figure
1.1 Image Spam Examples
2.1 The First Generally Acknowledged Email Spam
2.2 Sample Text-based Spam Message
3.1 Text-only image
3.2 Assembled Images
3.3 Original six individual images
3.4 Script for Checksum
3.5 Unix Script for Reformatting File Names
4.1 Perceptron or feed-forward ANN
4.2 Script automating executing image2fann utility
4.3 Sample content of a file containing a set of image files to be run through image2fann utility
4.4 Sample preprocessed images to be trained
4.5 ANN for Spam Image Identification
4.6 Sample partial output from train.c
4.7 Process flow of ANN training and testing
4.8 Process of testing a network
5.1 Sample preprocessed images to be tested
5.2 Sample output file from train.c (partial)
5.3 Sample output file from test.c (partial)
5.4 ANN of 572 trained images using 50 hidden neurons and tested with 53 untrained images
5.5 ANN of 572 trained images using 75 hidden neurons and tested with 53 untrained images
5.6 ANN of 227 trained images using 50 hidden neurons and tested with 53 untrained images
5.7 ANN of 227 trained images using 75 hidden neurons and tested with 53 untrained images
5.8 ANN of 2000 trained images using 75 hidden neurons and tested with 2000 trained images
5.9 ANN of 2000 trained images using 75 hidden neurons and tested with 100 untrained images
5.10 ANN of 2000 trained images using 50 hidden neurons and tested with 2000 trained images
5.11 ANN of 2000 trained images using 50 hidden neurons and tested with 100 untrained images
5.12 ANN of 2000 trained “images with mostly words” using 75 hidden neurons and tested with 100 trained images
5.13 ANN of 2000 trained “images with mostly words” using 75 hidden neurons and tested with 100 untrained images
5.14 ANN of 2000 trained “images with mostly words” using 50 hidden neurons and tested with 100 trained images
5.15 ANN of 2000 trained “images with mostly words” using 50 hidden neurons and tested with 100 untrained images
5.16 Jason Bowling on a hiking trip
5.17 Ham Image Wrongly Classified as Spam

CHAPTER I
INTRODUCTION

Select, delete, repeat. It’s what most email users spend the first ten minutes of
every day doing: purging spam from their inboxes. Spam has become as popular in casual
conversation as the weather. Clearly spam is not going away, at least not in the
foreseeable future. People still respond to it, buy products from it, and are scammed by it.
Communication on the Internet has been eagerly encouraged due to its ease of use, the
opportunities to develop personal and professional contacts with colleagues around the
world that previously would have been difficult or impossible, and the possibility to
broadcast questions, discussion topics, opinions, documents, and more to thousands of
colleagues around the world virtually simultaneously. Communication on the Internet
comes in many forms: email, discussion groups, Usenet news, chat groups, IRC (Internet
Relay Chat), video and audio conferencing, and Internet telephony / SMS (short message
service). People responded enthusiastically to communicating on the Internet since they wanted
to share ideas in an inviting, trusting atmosphere. Unfortunately, the boom in cyberspace
population came along with other social vices, as happens in every growing human
society. Large chunks of the Internet took shape as a marketplace, a soapbox, and
mischievous, even hostile playgrounds, resulting in a less trusting atmosphere. Suspicion
has overridden trust, resulting in some users wanting to shut down communication rather
than open it up.

Figure 1.1 Image Spam Examples

Many factors contribute to this hostile environment, but one plays a major role
in the slow erosion of the old, idealistic Internet philosophy. This factor has led
people to regard the Internet with mistrust, as a cyberspace
full of pornography and people trying to hijack bank accounts, steal identities, and
otherwise manipulate, deceive, and trick them. It's a nuisance that every Internet user
deals with every day. It is called spam!
Filters are available to combat these unsolicited nuisances, but spammers continually
develop new techniques to avoid detection by filters. This thesis focuses on one specific
category of unsolicited bulk email: image spam. It is a fairly recent phenomenon that
has appeared in the past few years. In 2005, it comprised roughly 4.8% of all emails, then
grew to an estimated 25% by mid-2006 [1]. Image spam reached its peak in January 2007,
accounting for 52% of all email spam [2]. These messages come as image attachments that contain
text, with what looks like a legitimate subject and from address. They are
successfully getting by traditional spam filters and optical character recognition (OCR)
systems. As a result, they are often referred to as OCR-evading spam images. Two
common examples are shown in Figure 1.1. They vary in file type, may be split into
multiple images that are assembled in the message, and may be rotated by a slight
degree.
This research examines a method for identifying image spam by training an artificial
neural network. Chapter two presents an overall view of the spam problem and a brief
summary of current research. A detailed process for preprocessing spam image files is
given in chapter three, along with a discussion of how the corpus was developed. A
description of artificial neural networks is given in chapter four with instructions on how
to train the network to distinguish between ham and spam. Chapter five presents results
derived from creating and testing the trained network against unknown and known
images. Finally, conclusions and future work are discussed in chapter six.







CHAPTER II
THE NATURE OF SPAM


Many believe spam is an acronym for "sales promotional advertising mail" or
"simultaneously posted advertising message". Other acronyms associated with spam
include UBE (Unsolicited Bulk Email), MMF (Make Money Fast) and MLM (Multi-Level
Marketing). There seem to be two popular theories of why the name spam is
associated with Unsolicited Commercial Email (UCE). Most seem to associate spam with
the brand name product SPAM ("Shoulder Pork and hAM" or "SPiced hAM") marketed by
Hormel. SPAM luncheon meat is a canned precooked meat product made by the Hormel
Foods Corporation [3]. Email spam, like its lunchmeat namesake, is something no one asks for
or wants. Those who do happen to get it most likely throw it away. Another group
holds that the name spam was borrowed from a British television series, Monty Python’s
Flying Circus, in which actors sang a song entitled “Spam” with the word “spam”
repeated over and over, drowning out all other sounds [7].
Whichever theory one goes with, there is some truth underneath all the silliness.
Spam obscures legitimate business and personal correspondence that we want to read.
Worse, some of it is downright unpleasant, even offensive to view once opened.


2.1 Basic Definitions

Email spam comes with a variety of definitions including:
• unsolicited e-mail on the Internet [4];
• the abuse of electronic messaging systems to indiscriminately send unsolicited
bulk messages, as defined by [5]; and
• unsolicited e-mail, often of a commercial nature, sent indiscriminately to multiple
mailing lists, individuals, or newsgroups; junk e-mail [6].

Some examples of spam include:

• unsolicited communication, including "pop-ups" and "pop-unders";
• irrelevant, inappropriate, or repetitious e-mail or message board post;
• advertisement for some product or service;
• commercial, political and social commentaries sent indiscriminately to many
recipients; and
• email chain letters.

2.2 History and Statistics

May 3, 2008 marked the 30th anniversary of email spam. The first email spam was
recorded on May 3, 1978 from a Digital Equipment Corporation marketing
representative, Gary Thuerk. He sent this email, shown in Figure 2.1, to all Arpanet
addresses on the west coast [7] [8]. Technically, by definition, the very first spam was
recorded in a telegram on September 13, 1904 [5]. Undoubtedly, this is not what Joseph
Henry and Samuel Morse had imagined!
Mail-from: DEC-MARLBORO rcvd at 3-May-78 0955-PDT
Date: 1 May 1978 1233-EDT
From: THUERK at DEC-MARLBORO
Subject: ADRIAN@SRI-KL
DIGITAL WILL BE GIVING A PRODUCT PRESENTATION OF THE NEWEST MEMBERS OF
THE DECSYSTEM-20 FAMILY; THE DECSYSTEM-2020, 2020T, 2060, AND 2060T.
THE DECSYSTEM-20 FAMILY OF COMPUTERS HAS EVOLVED FROM THE TENEX OPERATING
SYSTEM AND THE DECSYSTEM-10 COMPUTER ARCHITECTURE. BOTH THE DECSYSTEM-2060T
AND 2020T OFFER FULL ARPANET SUPPORT UNDER THE TOPS-20 OPERATING SYSTEM.
THE DECSYSTEM-2060 IS AN UPWARD EXTENSION OF THE CURRENT DECSYSTEM 2040 AND
2050 FAMILY. THE DECSYSTEM-2020 IS A NEW LOW END MEMBER OF THE
DECSYSTEM-20 FAMILY AND FULLY SOFTWARE COMPATIBLE WITH ALL OF THE OTHER
DECSYSTEM-20 MODELS. WE INVITE YOU TO COME SEE THE 2020 AND HEAR ABOUT THE

DECSYSTEM-20 FAMILY AT THE TWO PRODUCT PRESENTATIONS
WE WILL BE GIVING IN CALIFORNIA THIS MONTH. THE LOCATIONS WILL BE:
TUESDAY, MAY 9, 1978 - 2 PM
HYATT HOUSE (NEAR THE L.A. AIRPORT)
LOS ANGELES, CA
THURSDAY, MAY 11, 1978 - 2 PM
DUNFEY'S ROYAL COACH
SAN MATEO, CA
(4 MILES SOUTH OF S.F. AIRPORT AT BAYSHORE, RT 101 AND RT 92)
A 2020 WILL BE THERE FOR YOU TO VIEW. ALSO TERMINALS ON-LINE TO OTHER
DECSYSTEM-20 SYSTEMS THROUGH THE ARPANET. IF YOU ARE UNABLE TO ATTEND,
PLEASE FEEL FREE TO CONTACT THE NEAREST DEC OFFICE FOR MORE
INFORMATION ABOUT THE EXCITING DECSYSTEM-20 FAMILY.

Figure 2.1 The First Generally Acknowledged Email Spam

Email spam with a specific commercial bent started with what has now been dubbed
the “Green Card” spam. This infamous event took place on March 15, 1994 [5] when
two attorneys, Laurence Canter and Martha Siegel, put together a bulk Usenet posting to
advertise their services in helping immigrants obtain the visa known as a “green
card”. Savvy entrepreneurs caught on quickly, and spam has been growing since with no
end in sight.


The following figures give an estimate of absolute numbers through the years:
• 1978 – message sent to 600 addresses [7]
• 1994 – spam sent to 6000 newsgroups, getting to millions of people [9]
• 2005 – (June) 30 billion spam per day [10]
• 2006 – (June) 55 billion spam per day [10]
• 2006 – (December) 85 billion spam per day [5]
• 2007 – (February) 90 billion spam per day [5]
• 2007 – (June) 100 billion spam per day [11]
• 2007 – (November) between 65% and 85% of email is spam [12]
• 2008 – (July) 87.56% of email is spam with image spam being 12.87% [35]

2.3 The Long Arm of Spam

Virtually no electronic form of communication is safe from this nuisance. Although
we normally associate spam with email, it comes in many forms, including instant
messaging (spim), chat rooms, newsgroups, mobile phones, online game messaging,
search engine spam (spamdexing), blogs, wikis and guest books, and video sharing sites.
In short, where there is technology, spam is sure to follow.
There are two basic forms of spam received in emails:
• Text-based – an email consisting of text only to convey the sender’s information, as
shown in Figure 2.2.
• Image-based – the spammer’s message is sent in the form of a graphic or an
image, as shown in Figure 1.1.

free casino games, bingo cards, free Poker chips, Sportsbook, no deposit
online casino where you find more games, more winners, more often!
www.casino-vip1.com?” – Sender: Julia Broussard ,
Subject: Native American casino

Figure 2.2 Sample text-based spam message

Spam with an attached image is a relatively new phenomenon, which only started to
appear in numbers in the second half of 2005. Image spam exploded in mid-2006 and by
year’s end, over 50% of total spam received was image spam. It has since declined and
now accounts for around 20% [13]. According to the paper Image Spam – the New Face
of Email Threat, image spam makes up around 12.87% of total email spam, which in turn
makes up around 87.56% of all email [35].
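
Taken together, the two percentages quoted from [35] imply the share of all email that is image spam. The following is a minimal sketch of that arithmetic, assuming the two figures can simply be multiplied; the variable names are illustrative and do not come from the source.

```python
# Figures quoted from [35]; the variable names are illustrative only.
spam_share_of_email = 0.8756   # 87.56% of all email is spam
image_share_of_spam = 0.1287   # 12.87% of that spam is image spam

# Share of ALL email that is image spam: the product of the two fractions.
image_spam_share_of_email = spam_share_of_email * image_share_of_spam

print(f"{image_spam_share_of_email:.2%}")
```

Under this assumption, a little over one email in ten would be image spam.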
Popular Internet browsers, such as Firefox and Internet Explorer, coupled with
powerful search engines like Google, have changed our lives, as we search for answers to
movie trivia and place bids on eBay. Unfortunately, there seems to be a correlation
between time spent on the Internet and the amount of spam received. Spammers obtain
our email addresses in a number of common ways. Some of these include:
• Scanning Usenet for email addresses.
• Using programs to extract email addresses from article headers.
• Harvesting subscribers to mailing lists from servers.
• Address harvesting programs (bots or spiders) crawling through the Internet
looking for email addresses in web pages. An especially good place to look is in
HTML mailto: links.
• People finder sites also contribute to the spammers’ email lists. For example,
Microsoft Network’s Hotmail automatically adds new email addresses to some
white page directories.
• Email addresses can be bought, usually cheaply, from others who have compiled
the addresses either legitimately or not.
• The three major domain contact points found by “whois-style” searches usually
provide email contacts for administrative, technical and billing staff.
• A dictionary search of the email servers of large email hosting companies is done by
picking known domain suffixes (e.g. computer.org), sending emails to guessed prefixes,
and seeing what doesn’t come back.
• Chat rooms also serve as a hotbed for email harvesting.

2.4 Spam Filters

Spammers have two stages of spamming: landing the mail in a user’s inbox and
enticing the unsuspecting user to read it. Similarly, users have two defensive tactics. They
can take on the arduous task of weeding out the unwanted spam by hand or they can use
software filters. Spam filters prevent spam from reaching an inbox. Manual weeding may
still need to be done for spam that successfully bypasses the filter. Manual weeding
is also necessary to identify and retrieve legitimate messages that have been
wrongly marked as spam. Unfortunately, it is left to the user to identify the spam and
manually delete it or report to the spam filter that a message has succeeded in
circumventing it. This is where the spammers’ trickery comes into play. Some users try to
save labor and time by identifying and deleting spam based on the sender’s address or
subject matter. There are myriad spam filters available on the market, some free with
limited capabilities, and others ranging in price. Examples include McAfee, AVG,
SurfControl, Symantec, and TrustedSource™. Among the capabilities of many email
filters are:
• Compare incoming email with a classified database of spam and junk email
content.
• Use fifteen pre-defined content dictionaries classified by categories such as
“Spam”, “Adult” or “Hate Speech”.
• Use context-sensitive language analysis.
• Train dynamically to recognize and understand an organization’s
specific proprietary content.
• Strip HTML out of the message.
• Verify the existence of an email sender by using reverse client DNS lookup.
• Allow users to create their own “explicit deny list”.
• Check message reputation and fingerprints to see if the email content has
elements of spam that have been seen before.
• Fingerprint images to see if they contain similarities to cataloged
spam images.
• Image property space, a technique that uses rules to extract properties of images
in an email that might be spam.
• Analyze the format, layout and structure of an email. Most spammers use the
same template with changes in only the image attached.
• Image hashing, a technique that creates a digital signature of the actual image.
This is effective when the same image is sent for several days in a row, as is
common with some spammers. Otherwise, it doesn’t help.
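
The limitation of image hashing noted in the last item can be illustrated with a short sketch. It assumes SHA-256 as the signature function and uses made-up byte strings in place of real image files. The function names, the in-memory store and the choice of hash are all assumptions for illustration and are not taken from any particular filter product.

```python
import hashlib

# Signatures of images already identified as spam (hypothetical in-memory store).
known_spam_signatures = set()


def image_signature(image_bytes: bytes) -> str:
    """Return a digital signature (SHA-256 hex digest) of the raw image bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


def remember_spam(image_bytes: bytes) -> None:
    """Record the signature of an image that was identified as spam."""
    known_spam_signatures.add(image_signature(image_bytes))


def is_known_spam(image_bytes: bytes) -> bool:
    """True only if this exact byte sequence has been recorded before."""
    return image_signature(image_bytes) in known_spam_signatures


# Stand-in for an image file; a real filter would read the attachment bytes.
original = b"hypothetical image payload"
# The same "image" with a single bit flipped, standing in for a spammer
# changing one pixel or adding a little random noise.
altered = original[:-1] + bytes([original[-1] ^ 1])

remember_spam(original)
print(is_known_spam(original))  # the exact resend matches a stored signature
print(is_known_spam(altered))   # the altered copy has a different signature
```

Because a cryptographic hash changes completely when any bit of the input changes, an exact-match signature only catches images that are resent unmodified, which is consistent with the caveat in the list above.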

2.5 Who are Spammers and Why Can’t We Stop Them?

There are professional spammers who are in the business for many reasons ranging
from advertisement to malware propagation. This does not take into account the junk
email, in the form of jokes from family, friends and colleagues at work. Surprisingly, or
maybe not, there are actually a few anti-spam software companies that use spam to
advertise their latest anti-spam products. Mainsleaze is the name given to what are
generally considered reputable companies that have real products, as opposed to fly-by-
night outfits promising a happier life with Viagra. One example of mainsleaze is Kraft
Food’s spam marketing blitz for its Gevalia gourmet coffee products [5]. Another
example is IDate’s email harvesting of subscribers to the popular social networking site
Quechup [14].

According to new research, it appears that just 10 domain name registrars
account for more than 75% of all Web sites advertised through spam [15]. These are:
• Xinnet Bei Gong Da Software
• BEIJING Networks
• Todynamic
• Joker
• eNom, Inc.
• MONIKER
• Dynamic Dolphin
• The NameitCo/AITDOMAINS.COM
• PDR
• Intercosmos/DIRECTNIC.

The research company Knujon rated the registrars, taking into consideration the size of each
registrar and the ratio of the number of reported spam-advertised junk product sites to the
total number of domains the registrar holds. They used the Knujon Aggression rating.
This rating compares the volume of reported spam messages from a given registrar to
other registrars and then looks at that number against the total number of domains held by
that registrar [16]. This data is based on tens of thousands of spam messages collected on
any given day. They are all provided by public subscribers who “donate” their spam.
Knujon is also the major provider of the spam corpus used in this thesis research.
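
The exact formula behind the rating is not given here. As a rough illustration only, the sketch below computes the simpler ratio described above, reported spam-advertised sites divided by total domains held, for two made-up registrars. The registrar names, the counts and the function name are all hypothetical, and this is not Knujon's actual calculation.

```python
def spam_site_ratio(reported_spam_sites: int, total_domains: int) -> float:
    """Fraction of a registrar's domains reported as spam-advertised sites."""
    if total_domains <= 0:
        raise ValueError("total_domains must be positive")
    return reported_spam_sites / total_domains


# Hypothetical data: (reported spam-advertised sites, total domains held).
registrars = {
    "registrar_a": (5_000, 100_000),
    "registrar_b": (5_000, 2_000_000),
}

ratios = {name: spam_site_ratio(*counts) for name, counts in registrars.items()}

# Rank registrars from the highest ratio to the lowest.
ranked = sorted(ratios, key=ratios.get, reverse=True)
for name in ranked:
    print(f"{name}: {ratios[name]:.4f}")
```

Two registrars with the same number of reported sites can rank very differently once size is taken into account, which is the reason for dividing by the total number of domains.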

2.6 The Cost of Spam

Unlike postal junk mail, which requires paper, an envelope, and a stamp, spam has
relatively small operating costs for the spammer. Their major expenditure and effort is in managing
mailing lists, which makes a massive campaign economically viable. The public and ISPs, on
the other hand, bear the cost of low productivity, reduced bandwidth, and potential fraud.
The direct costs include consumption of network resources, disk space, the cost to
purchase and maintain commercial spam filters, and, of course, the human time and
attention to dismiss those unwanted messages.
Of a more serious nature, a whole new breed of crimes has evolved as a result of spam.
Our increasing digital presence has made us vulnerable to financial, identity, data and
intellectual property theft. Viruses and other malware infections harm our data and cause
data loss, resulting in harm ranging from lost productivity to major business losses. To an
organization, spam is not only a nuisance, it is expensive.

2.7 Why Are We Reading Spam?

We all claim that we delete our spam, but if that were true, spammers would have no
reason to continue pushing it through the pipe to us. Obviously, enough people are
reading enough spam to make it lucrative for spammers to continue. The strategy adopted
by the spammer consists of finding victim addresses, landing the mail in users’ in-boxes
and enticing them to open it. A few of the tactics include [17]:
• Cultural preoccupation: These types of messages tap into society’s cultural
obsession with comparison, better known as “keeping up with the Joneses”. “More” or
“less” seems to be on everyone’s mind. Spammers capitalize on this, sending
messages with subjects implying one or the other, with the sender appearing to
be a legitimate well-known online service.
• Typical concerns: Money, success, weight loss, and freedom from depression are
basic human needs which can be played on by the spammer to entice an email
user to read spam. They promise the golden elixir for our insecurities and deepest
desires.
• Sexuality: Most women want to be attractive to men, and vice versa, in whatever
form that takes. If you have read email in the last week, you’ve been subjected to
these myriad types of ads. To some users it’s an offensive eyesore while others
welcome it and take the bait.
• Anxiety: Usually messages sent in this category induce anxiety in users, thereby
making it difficult for them to ignore the message. The header often has nothing
to do with the content presented in the email. The purpose is simply to get the
user to open the email. An example would be something like “Your PayPal
account has been compromised.” In some cases, the email may be an
actual PayPal phishing attempt, but often, the user opens it to find a link to a
pharmaceutical site or other scam.
• Faking a personal touch: When these first started arriving, it was easy to fall prey
to emails with one’s name in the subject and a sender with a real name instead of
an email address. This is still an effective tactic to draw in the naïve user.
• Faking replies and interactions: Usually a message prefixed with “re:” makes
users think it might be a reply to a message they sent or a reply to a message in an
email group they might belong to. Sneaky, but effective.
• Faking informality and acquaintance: The idea here is to create subjects worded in
a casual, friendly style. This gives the impression that the message is coming from
someone familiar to the user. An example might be a subject line of “How’s it
going?”

2.8 Getting Past the Spam Filter

There are many techniques employed by email spammers to bypass spam filtering
engines. According to John Graham-Cumming [18], spamming techniques fall into
several categories. Although he has listed seven general tactics, the most common seem
to target traditional Bayesian filters. The three techniques that are relevant to this thesis
are:
• Bad Word Obfuscation: Filters typically scan a message looking for certain
questionable words and phrases (Viagra, weight loss, phentermine, bad credit,
etc.). If these words and phrases can somehow be masked so that the filter does not
identify them, the email has a much better chance of
slipping through. Intentional misspelling of these words is a simplistic but
effective approach. HTML tags, used creatively, can show the message
to the user in readable form while disguising it from the filter.
• Good Word Insertion: These emails were originally very puzzling to the
uninformed user. In addition to the annoying commercial, they have additional
“harmless” words embedded in some fashion, or simply appended outright to the
end of the message, that read like nonsensical rambling. The intent is clearly to
confuse a statistical filter.
• Token Avoidance: Rather than worry about disguising “bad” words, or tipping the
scale with “good” words, these techniques attempt to prevent a filter from
tokenizing a message in the first place. This is where image spam enters the
