MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF PEDAGOGY
FINAL REPORT
UNIVERSITY-LEVEL SCIENCE AND TECHNOLOGY PROJECT
LINEAR SUPPORT VECTOR MACHINE WITH LINEAR COMPLEXITY
PROJECT CODE: CS2014.19.39
Host institution: Faculty of Information Technology
Principal investigator: Dr. Lê Minh Trung
HO CHI MINH CITY – 12/2015
MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF PEDAGOGY
FINAL REPORT
UNIVERSITY-LEVEL SCIENCE AND TECHNOLOGY PROJECT
LINEAR SUPPORT VECTOR MACHINE WITH LINEAR COMPLEXITY
PROJECT CODE: CS2014.19.39
Certified by the host institution
(signature, full name)
Principal investigator
(signature, full name)
HO CHI MINH CITY – 12/2015
Form 1.10 CS
INFORMATION ON RESEARCH RESULTS
UNIVERSITY-LEVEL SCIENCE AND TECHNOLOGY PROJECT
Project title: Linear Support Vector Machine with linear complexity
Project code: CS2014.19.39
Principal investigator: Dr. Lê Minh Trung
Tel: 0937856369
E-mail:
Host institution: Faculty of Information Technology, Ho Chi Minh City University of Pedagogy
Cooperating institutions and individuals: ………………………………………………..
…………………………………………………………………………………………….
Duration: 11/2014 to 12/2015
1. Objectives: to investigate and develop linear Support Vector Machines with linear complexity in the context of online learning and distributed computing.
2. Main contents:
Develop several linear SVM algorithms in the context of online learning
Develop a linear SVM algorithm that runs on distributed systems such as Spark
3. Main results obtained (scientific, applied, training, socio-economic):
1) Trung Le, Dinh Phung, Khanh Nguyen, Svetha Venkatesh: Fast One-Class Support Vector Machine for Novelty Detection. PAKDD 2015: 189-200
2) Khanh Nguyen, Trung Le, Vinh Lai, Duy Nguyen, Dat Tran, Wanli Ma: Least Square Support Vector Machine for Large-scale Dataset. IJCNN 2015: 1-8
3) Trung Le, Van Nguyen, Anh Nguyen, Khanh Nguyen: Adaptable Linear Support Vector Machine. NICS 2015.
Form 1.11 CS
SUMMARY
Project Title: Linear Support Vector Machine with Linear Complexity
Code number: CS2014.19.39
Coordinator: Dr. Lê Minh Trung
Implementing Institution: Faculty of Information Technology, HCMc University of
Pedagogy
Cooperating Institution(s)…………………………………………………………………
…………………………………………………………………………………………….
Duration: from 11/2014 to 12/2015
1. Objectives: investigating linear SVM with linear complexity in the context of online
learning and distributed computing.
2. Main contents:
Develop linear SVM in the context of online learning
Develop linear SVM in the context of distributed computing
3. Results obtained
1. Trung Le, Dinh Phung, Khanh Nguyen, Svetha Venkatesh: Fast One-Class
Support Vector Machine for Novelty Detection. PAKDD 2015: 189-200
2. Khanh Nguyen, Trung Le, Vinh Lai, Duy Nguyen, Dat Tran, Wanli Ma: Least
square Support Vector Machine for large-scale dataset. IJCNN 2015: 1-8
3. Trung Le, Van Nguyen, Anh Nguyen, Khanh Nguyen: Adaptable Linear
Support Vector Machine. NICS 2015.
Adaptable Linear Support Vector Machine
Trung Le, Van Nguyen, Anh Nguyen, Khanh Nguyen
Abstract—Linear Support Vector Machine (LSVM) has recently become one of the most prominent learning methods for classification and regression problems because of its applications in text classification, word-sense disambiguation, and drug design. However, LSVM and its variations can neither adapt to a dynamic dataset nor learn in online mode. In this paper, we introduce an Adaptable Linear Support Vector Machine (ALSVM) which scales linearly with the size of the training set. The key feature of ALSVM is that its decision boundary is adapted in closed form when data are added or removed.
Index Terms—Linear Support Vector Machine, adaptable method, linear method
I. INTRODUCTION
Support Vector Machine (SVM) [1] is a very well-known tool for classification and regression problems. Plenty of variations of SVM have been proposed and applied to real-world problems, e.g., Least Square Support Vector Machine (LS-SVM) [11] and Proximal Support Vector Machine (PSVM) [4]. SVM is associated with a non-linear or linear kernel. Although non-linear kernels often offer higher classification accuracies on real-world datasets, in many applications, e.g., text classification, word-sense disambiguation, and drug design, the linear kernel is preferred because of its much shorter training time [7].
Many methods have been proposed for Linear Support Vector Machine (LSVM). Active Support Vector Machine (ASVM) [8] and Lagrangian Support Vector Machine (La-SVM) [9] are based on the formulation of Proximal Support Vector Machine and can scale linearly with the training set size $N$. However, both ASVM and La-SVM require computing the inverse of a matrix of size $(d+1)\times(d+1)$, which costs over $O(d^{2.373})$, and they scale roughly as $O(Nd^2)$ in the best case [7]. SVM-Perf [7], which is regarded as a variation of Structural SVM, uses the cutting-plane technique to solve its optimization problem. SVM-Perf was proven much faster than decomposition methods, e.g., SVM-Light [6] and LIBSVM [2], and scales as $O(Ns)$, where $s$ is the sparsity of the training set. In Pegasos [10], the stochastic gradient descent method is applied to solve the optimization problem of LSVM in the primal. Pegasos was proven to scale with complexity $O(Ns)$, but it is really challenging to determine appropriate values for the number of iterations $T$ and the sub-sample size $k$. Other methods address the unconstrained or constrained form of SVM and employ optimization techniques to find their solutions, e.g., [3, 5]. Among these methods, LibLinear [5] emerges as the most popular tool and has been widely used by the research community. The computational complexity of LibLinear to gain an $\varepsilon$-precision solution is known to be $O(Nd\log\frac{1}{\varepsilon})$.
None of the above-mentioned LSVM variations can adapt to a dynamic dataset in closed form. In this paper, we depart from the formulation of Least Square Support Vector Machine (LS-SVM) [11] to propose Adaptable Linear Support Vector Machine (ALSVM), whose decision boundary can be adapted in closed form when data are added or removed. Concretely, when one data sample is added to or removed from the current training set, the decision boundary is adapted with computational complexity $O(d^2)$. This feature enables ALSVM to run in online mode. For the batch mode, by applying an inductive strategy, besides ALSVM we also propose Fast Adaptable Linear Support Vector Machine (FALSVM), where data samples are subsequently added to the training set and adaptations only occur for incorrect classifications. In batch mode, the computational complexity of ALSVM and FALSVM is $O(Nd^2)$. The experiments established in this paper show that, in batch mode, the proposed methods are superior in computation time to other methods for datasets of thousands of dimensions.
II. VARIATIONS OF SUPPORT VECTOR MACHINE
A. Least Square Support Vector Machine (LS-SVM)
Least Square Support Vector Machine (LS-SVM) [11] is a variation of SVM. LS-SVM uses the square loss function in its formulation and replaces the inequality constraints by equality ones:
$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i^2$$
$$\text{s.t.}\ \forall_{i=1}^{N}: y_i\left(w^T\phi(x_i)+b\right) = 1-\xi_i$$
where $\phi$ is a transformation from the input space to the feature space and $\xi=[\xi_1,\xi_2,\ldots,\xi_N]$ is the vector of slack variables.
B. Proximal Support Vector Machine (PSVM)
Proximal Support Vector Machine (PSVM) [4] is another variation of SVM. PSVM employs the square loss function and keeps the samples as close as possible to the margins:
$$\min_{w,b,\xi}\ \frac{1}{2}\left(\|w\|^2+b^2\right) + C\sum_{i=1}^{N}\xi_i^2 \qquad (1)$$
$$\text{s.t.}\ \forall_{i=1}^{N}: y_i\left(w^T\phi(x_i)+b\right) = 1-\xi_i$$
III. LINEAR LEAST SQUARE SUPPORT VECTOR MACHINE (LLS-SVM)

A. Formulation
To eliminate the bias parameter $b$, we apply the transformation $x \leftarrow [x^T\ 1]^T$ and $w \leftarrow [w^T\ b]^T$ to the formulation of LS-SVM. The formulation of linear LS-SVM is as follows:
$$\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i^2 \qquad (2)$$
$$\text{s.t.}\ \forall_{i=1}^{N}: y_i w^T x_i = 1-\xi_i$$
where each $x_i\in\mathbb{R}^{d+1}$ is assumed to have a constant 1 appended and $w\in\mathbb{R}^{d+1}$.
Actually, the optimization problem of LLS-SVM in Eq. (2) is strictly equivalent to that of PSVM in the linear case as shown in Eq. (1).

B. Solution
Substituting the constraints into the objective function in Eq. (2), we need to minimize the following objective:
$$L(w) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\left(1-y_i w^T x_i\right)^2$$
Setting the derivative to 0, we gain:
$$\frac{\partial L}{\partial w} = 0 \;\rightarrow\; w + 2C\sum_{i=1}^{N} y_i x_i\left(y_i w^T x_i - 1\right) = 0$$
$$\rightarrow\; w = -2C\left(\sum_{i=1}^{N} x_i x_i^T w - \sum_{i=1}^{N} y_i x_i\right)$$
$$\rightarrow\; Gw = \left(AA^T+\nu I\right)w = \sum_{i=1}^{N} y_i x_i = Ay$$
where $G=\sum_{i=1}^{N} x_i x_i^T+\nu I\in\mathbb{R}^{(d+1)\times(d+1)}$, $A=[x_i]_{i=1}^{N}\in\mathbb{R}^{(d+1)\times N}$, $y=[y_i]_{i=1\ldots N}$, and $\nu=\frac{1}{2C}$.
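As a small illustration of the closed-form batch solution above, the following Python/NumPy sketch (our own, not the authors' C# implementation; function names are ours) solves $Gw = Ay$ for linear LS-SVM with the bias folded into an appended constant coordinate.

```python
import numpy as np

def lls_svm_fit(X, y, C=1.0):
    """Batch linear LS-SVM: solve (A A^T + nu*I) w = A y.

    X: (N, d) data matrix, y: (N,) labels in {-1, +1}.
    A constant 1 is appended to each sample to absorb the bias b.
    """
    N, d = X.shape
    nu = 1.0 / (2.0 * C)
    A = np.hstack([X, np.ones((N, 1))]).T        # (d+1, N), columns are augmented x_i
    G = A @ A.T + nu * np.eye(d + 1)             # G = A A^T + nu * I
    w = np.linalg.solve(G, A @ y)                # closed-form normal vector
    return w

def lls_svm_predict(w, X):
    """Predicted label is the sign of w^T [x; 1]."""
    return np.sign(np.hstack([X, np.ones((len(X), 1))]) @ w)
```

Solving the $(d+1)\times(d+1)$ system once costs $O(Nd^2+d^3)$; the point of ALSVM in the next section is to avoid re-solving it from scratch when the training set changes.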
IV. ADAPTABLE LINEAR SUPPORT VECTOR MACHINE
A. Adding New Data
Assume that the new data samples $\{(x_{N+1},y_{N+1}),\ldots,(x_{N+n},y_{N+n})\}$ are being added to the current training set. Let us denote $A'=[x_i]_{i=N+1}^{N+n}\in\mathbb{R}^{(d+1)\times n}$ and $y'=[y_i]_{i=N+1\ldots N+n}$. The current and new normal vectors $w, w'$ are solutions of the following systems of equations:
$$Gw = \left(AA^T+\nu I\right)w = Ay$$
$$G'w' = \left(AA^T+A'A'^T+\nu I\right)w' = Ay+A'y'$$
Let us denote $P=G^{-1}$ and $P'=G'^{-1}$. Using the Kailath Variant formula, we can evaluate $P'$ as follows:
$$P' = G'^{-1} = \left(G+A'A'^T\right)^{-1} = G^{-1}-G^{-1}A'\left(I+A'^TG^{-1}A'\right)^{-1}A'^TG^{-1} = P-PA'\left(I+A'^TPA'\right)^{-1}A'^TP \qquad (3)$$
To find the formula for updating the normal vector, we derive as follows:
$$G'w'-A'y' = Ay = Gw = \left(G'-A'A'^T\right)w = G'w-A'A'^Tw$$
$$\rightarrow\; w'-w = P'\left(A'y'-A'A'^Tw\right)$$
$$\rightarrow\; w' = w+P'A'\left(y'-A'^Tw\right)$$
We now suppose that only one data sample $x\in\mathbb{R}^{d+1}$ with label $y$ is added to the current training set. Eq. (3) becomes:
$$P' = P-Px\left(1+x^TPx\right)^{-1}x^TP = P-\frac{1}{1+a}Px\left(Px\right)^T \qquad (4)$$
where $a=x^TPx$.
Eq. (4) is reasonable since $P=G^{-1}$ is symmetric by the symmetry of $G$. Moreover, we show how to efficiently evaluate $a$ and $Px$. Assuming that the sparsity of the training set is $s$, the computational complexity of evaluating $Px$ is $O((d+1)s)$. Therefore, $a$ is calculated with a cost of $O(s)$. The cost to update $P$ is $O\left((d+1)^2\right)$. To update $w$, we first calculate $P'x$ as follows:
$$P'x = Px-\frac{1}{1+a}Pxx^TPx = Px-\frac{a}{1+a}Px = \frac{1}{1+a}Px \qquad (5)$$
According to Eq. (5), the cost to calculate $P'x$ is $O(d+1)$. Finally, $w$ is updated with a cost of $O(d+1+s)$ as follows:
$$w' = w+P'x\left(y-w^Tx\right) \qquad (6)$$
In total, the computational complexity of adding one data sample is $O(d^2)$.

B. Removing Current Data
Assume that the current data samples $\{(x_{N-n+1},y_{N-n+1}),\ldots,(x_N,y_N)\}$ are being removed from the current training set. Let us denote $A'=[x_i]_{i=N-n+1}^{N}\in\mathbb{R}^{(d+1)\times n}$ and $y'=[y_i]_{i=N-n+1\ldots N}$. The current and new normal vectors $w, w'$ are solutions of the following systems of equations:
$$Gw = \left(AA^T+\nu I\right)w = Ay$$
$$G'w' = \left(AA^T-A'A'^T+\nu I\right)w' = Ay-A'y'$$
Let us denote $P=G^{-1}$ and $P'=G'^{-1}$. Using the Kailath Variant formula, we can evaluate $P'$ as follows:
$$P' = G'^{-1} = \left(G-A'A'^T\right)^{-1} = G^{-1}+G^{-1}A'\left(I-A'^TG^{-1}A'\right)^{-1}A'^TG^{-1} = P+PA'\left(I-A'^TPA'\right)^{-1}A'^TP \qquad (7)$$
To find the formula for updating the normal vector, we derive as follows:
$$G'w'+A'y' = Ay = Gw = \left(G'+A'A'^T\right)w = G'w+A'A'^Tw$$
$$\rightarrow\; w'-w = P'\left(-A'y'+A'A'^Tw\right)$$
$$\rightarrow\; w' = w+P'A'\left(-y'+A'^Tw\right)$$
We now suppose that only one data sample $x\in\mathbb{R}^{d+1}$ with label $y$ is removed from the current training set. Eq. (7) becomes:
$$P' = P+Px\left(1-x^TPx\right)^{-1}x^TP = P+\frac{1}{1-a}Px\left(Px\right)^T \qquad (8)$$
where $a=x^TPx$.
Eq. (8) is reasonable since $P=G^{-1}$ is symmetric by the symmetry of $G$. Moreover, we show how to efficiently evaluate $a$ and $Px$. Assuming that the sparsity of the training set is $s$, the computational complexity of evaluating $Px$ is $O((d+1)s)$. Therefore, $a$ is calculated with a cost of $O(s)$. The cost to update $P$ is $O\left((d+1)^2\right)$. To update $w$, we first calculate $P'x$ as follows:
$$P'x = Px+\frac{1}{1-a}Pxx^TPx = Px+\frac{a}{1-a}Px = \frac{1}{1-a}Px \qquad (9)$$
According to Eq. (9), the cost to calculate $P'x$ is $O(d+1)$. Finally, $w$ is updated with a cost of $O(d+1+s)$ as follows:
$$w' = w+P'x\left(-y+w^Tx\right)$$
In total, the computational complexity of removing one data sample is $O(d^2)$.

C. Inductive Algorithm for Training Data
In this section, we present how to use the above adaptive process to learn the optimal hyperplane from the training set. We start with the empty training set, and the labeled data samples $\{(x_i,y_i)\}_{i=1}^{N}$ are subsequently added to the training set. The hyperplane is gradually adapted along with the insertion of data.
At first, $(x_1,y_1)$ is added to the training set. We have:
$$P_1 = G_1^{-1} = \left(\nu I+x_1x_1^T\right)^{-1} = \nu^{-1}I-\nu^{-1}Ix_1\left(1+\nu^{-1}x_1^Tx_1\right)^{-1}\nu^{-1}x_1^T = \nu^{-1}I-\nu^{-2}a_1x_1x_1^T$$
where $a_1=\left(1+\nu^{-1}x_1^Tx_1\right)^{-1}$, and
$$w_1 = y_1P_1x_1$$
According to Eqs. (3) and (6), when the labeled data sample $(x_{k+1},y_{k+1})$ is added, $P$ and $w$ are updated as follows:
$$P_{k+1} = P_k-\frac{1}{1+a_{k+1}}P_kx_{k+1}\left(P_kx_{k+1}\right)^T,\qquad a_{k+1}=x_{k+1}^TP_kx_{k+1}$$
$$P_{k+1}x_{k+1} = \frac{1}{1+a_{k+1}}P_kx_{k+1}$$
$$w_{k+1} = w_k-P_{k+1}x_{k+1}\left(w_k^Tx_{k+1}-y_{k+1}\right)$$
The algorithm for Adaptable Linear Support Vector Machine (ALSVM) is shown in Algorithm 1. It is obvious that the computational complexity of ALSVM is $O(Nd^2)$.

Algorithm 1 Algorithm for ALSVM.
  $a = \left(1+\nu^{-1}x_1^Tx_1\right)^{-1}$
  $P = \nu^{-1}I-\nu^{-2}ax_1x_1^T$
  $w = y_1Px_1$
  for $k = 2$ to $N$ do
    $Q = Px_k$
    $a = x_k^TQ$
    $P = P-\frac{1}{1+a}QQ^T$
    $R = \frac{1}{1+a}Q$
    $w = w-R\left(w^Tx_k-y_k\right)$
  endfor

To propose another version of ALSVM, namely Fast ALSVM (FALSVM), which can efficiently reduce the training time of ALSVM, we first randomize the training set and, when the data samples are added, we only update $P$ and $w$ for the data samples that cause an error. The detail of the training process of FALSVM is given in Algorithm 2 as follows:

Algorithm 2 Algorithm for FALSVM.
  $a = \left(1+\nu^{-1}x_1^Tx_1\right)^{-1}$
  $P = \nu^{-1}I-\nu^{-2}ax_1x_1^T$
  $w = y_1Px_1$
  for $k = 2$ to $N$ do
    if $y_kw^Tx_k < 0$ then
      $Q = Px_k$
      $a = x_k^TQ$
      $P = P-\frac{1}{1+a}QQ^T$
      $R = \frac{1}{1+a}Q$
      $w = w-R\left(w^Tx_k-y_k\right)$
    endif
  endfor
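For concreteness, here is a minimal Python/NumPy sketch of the rank-one add/remove updates above (our own illustrative code, not the authors' C# implementation; dense matrices are used for clarity, whereas the paper exploits sparsity).

```python
import numpy as np

class ALSVM:
    """Incremental linear LS-SVM maintaining P = G^{-1} and w (Eqs. (3)-(9))."""

    def __init__(self, dim, C=1.0):
        self.nu = 1.0 / (2.0 * C)
        self.P = np.eye(dim + 1) / self.nu        # inverse of G = nu*I before any data
        self.w = np.zeros(dim + 1)

    @staticmethod
    def _augment(x):
        return np.append(x, 1.0)                  # fold the bias into the last coordinate

    def add(self, x, y):
        x = self._augment(x)
        Px = self.P @ x
        a = x @ Px                                # a = x^T P x
        self.P -= np.outer(Px, Px) / (1.0 + a)    # Eq. (4)
        Pnew_x = Px / (1.0 + a)                   # Eq. (5)
        self.w += Pnew_x * (y - self.w @ x)       # Eq. (6)

    def remove(self, x, y):
        x = self._augment(x)
        Px = self.P @ x
        a = x @ Px
        self.P += np.outer(Px, Px) / (1.0 - a)    # Eq. (8)
        Pnew_x = Px / (1.0 - a)                   # Eq. (9)
        self.w += Pnew_x * (self.w @ x - y)       # w' = w + P'x(-y + w^T x)

    def predict(self, x):
        return np.sign(self.w @ self._augment(x))
```

FALSVM (Algorithm 2) is obtained by wrapping `add()` in the misclassification check `if y * (model.w @ model._augment(x)) < 0`, so that only erroneous samples trigger an update.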
V. EXPERIMENTS
A. Experimental Settings
We conducted experiments on 11 datasets. The details of the datasets are given in Table I. We compared ALSVM and FALSVM with two other well-known methods, Pegasos and L1 LibLinear. All experiments were run on a personal computer with a Core i5 CPU and 4 GB of RAM. All methods are implemented in C#.
The trade-off parameter $C$ was searched in the grid $\{2^{-15}, 2^{-3}, \ldots, 2^{3}, 2^{5}\}$. To make the parameters consistent, in Pegasos the parameter $\lambda$ is computed from $C$ as $\lambda=\frac{1}{NC}$. In Pegasos, the sub-sample size $k$ is set to 1 and the maximum number of iterations $T$ is set to 10,000. In LibLinear, the precision $\varepsilon$ was set to its default value 0.01. Cross-validation with 5 folds was carried out.
Table I
EXPERIMENTAL DATASETS.

Dataset          Size       Dimension
a9a              32,561     123
Cod-Rna          271,617    8
Covertype        581,012    54
Mushroom         8,124      112
w8a              49,749     300
SvmGuide3        1,243      21
German           1,000      24
Poker            548,831    10
Shuttle          34,145     9
IJCNN            49,990     22
SensIT Vehicle   57,650     100
B. How Fast are ALSVM and Fast ALSVM Compared to Existing Methods?
The experimental results, including the training times and accuracies, are reported in Table II. To highlight the conclusion, for each dataset we boldfaced the method that yields the shortest training time and italicized the method that achieves the highest accuracy. As shown in Table II, regarding the accuracies, ALSVM and Fast ALSVM are comparable with the others and always appear in italic for all datasets. With reference to the training time, FALSVM is always the fastest over all datasets, and ALSVM is faster than LibLinear and Pegasos over all datasets except w8a.
It may be surprising that our proposed methods, which scale exactly as $O(Nd^2)$, are faster than both LibLinear ($O(Nd\log\frac{1}{\varepsilon})$) and Pegasos ($O(Ns)$). This is partially explained as follows. The computational complexities of ALSVM and FALSVM are exactly $N\left(d^2+ds+2(d+s)\right)$ and $m\left(d^2+ds+2(d+s)\right)$, where $m$ is the number of adaptations and may be very small compared to $N$, i.e., $m \ll N$, while the computational complexities of Pegasos and LibLinear are exactly $\text{const}\times Ns$ and $\text{const}\times Nd\log\frac{1}{\varepsilon}$. It may happen that, for the experimental datasets, the constant factors in Pegasos and LibLinear lead to an actually greater number of unit operations than in our proposed methods.
Figure 1. Training times of ALSVM and FALSVM on Covertype (left) and Cod-Rna (right).

C. How do the Training Times of ALSVM and Fast ALSVM Scale with the Number of Training Examples and the Dimension?
To visually demonstrate how ALSVM and FALSVM scale with the number of training examples, we established an experiment over the datasets Covertype and Cod-Rna. We measured the training times of ALSVM and FALSVM for 50%, 60%, 70%, 80%, 90%, and 100% of each dataset. As shown in Figure 1, ALSVM and FALSVM possess asymptotically linear complexities, and FALSVM is approximately tens of times faster than ALSVM.
ALSVM and FALSVM scale really well with the number of training examples, but not so well with the dimension. It is crucial to answer the question: "What is the threshold of dimension under which ALSVM or FALSVM is faster than the others?". To this end, we performed an experiment on the dataset Real-Sim, whose dimension is 20,958, and on news20, whose dimension is 1,355,191. We compared ALSVM and FALSVM with LibLinear. The dimension is varied with a step of 50 or 100 and the training times are recorded. As shown in Figures 2 and 3, FALSVM scales with the dimension better than ALSVM. As also indicated by this experiment, FALSVM is faster than LibLinear if the dimension is less than around one thousand; otherwise it becomes gradually slower than LibLinear.

Figure 2. Training times of ALSVM, FALSVM, and LibLinear on Real-Sim when the dimension is varied.

Figure 3. Training times of ALSVM, FALSVM, and LibLinear on news20 when the dimension is varied.
Table II
THE TRAINING TIMES (IN SECONDS) AND THE ACCURACIES.

Dataset          ALSVM          Fast ALSVM     LibLinear       Pegasos
                 Time    Acc    Time    Acc    Time     Acc    Time      Acc
a9a              82      82%    4.5     79%    193      82%    197       82%
Cod-Rna          9       93%    0.7     72%    1,683    91%    8,332     89%
Covertype        271     64%    5.7     60%    1,592    62%    1,826     61%
Mushroom         14      92%    0.1     94%    16       92%    92        91%
w8a              403     97%    3.9     97%    316      97%    505       97%
SvmGuide3        0.27    82%    0.04    82%    5        81%    5         81%
German           0.26    75%    0.05    75%    0.54     74%    4         75%
Poker            35      91%    3.2     91%    4,811    91%    12,490    91%
Shuttle          0.7     100%   0.06    100%   52       100%   17        100%
IJCNN            10      93%    0.21    93%    252      92%    213       93%
SensIT Vehicle   629     88%    8.7     89%    4,695    89%    4,452     86%

VI. CONCLUSION
In this paper, we have proposed Adaptable Linear Support Vector Machine (ALSVM) and Fast Adaptable Linear Support Vector Machine (FALSVM), which both scale linearly with the
number of data examples of the training set. The most brilliant feature of ALSVM is that its decision boundary can be adapted in closed form when data are added to or removed from the training set. Our experiments have indicated that ALSVM and FALSVM are efficient for large-scale datasets of thousands of dimensions.
REFERENCES
[1] B. E. Boser, I. M. Guyon, and V. Vapnik. A training
algorithm for optimal margin classifiers. In Proceedings
of the 5th Annual ACM Workshop on Computational
Learning Theory, pages 144–152. ACM Press, 1992.
[2] C.-C. Chang and C.-J. Lin. Libsvm: A library for support
vector machines. ACM Trans. Intell. Syst. Technol., 2(3):
27:1–27:27, May 2011. ISSN 2157-6904.
[3] K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate
descent method for large-scale l2-loss linear support
vector machines. J. Mach. Learn. Res., 9:1369–1398,
June 2008. ISSN 1532-4435.
[4] G. Fung and O. L. Mangasarian. Proximal support
vector machine classifiers. In Proceedings of the seventh
ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 77–86, 2001.
[5] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and
S. Sundararajan. A dual coordinate descent method
for large-scale linear svm. In Proceedings of the 25th
international conference on Machine learning, ICML
’08, pages 408–415. ACM, 2008. ISBN 978-1-60558-205-4.
[6] T. Joachims. Advances in kernel methods. chapter
Making Large-scale Support Vector Machine Learning
Practical, pages 169–184. 1999. ISBN 0-262-19416-3.
[7] T. Joachims. Training linear svms in linear time. In
Proceedings of the 12th ACM SIGKDD international
conference on Knowledge discovery and data mining,
KDD ’06, pages 217–226, 2006. ISBN 1-59593-339-5.
[8] O. L. Mangasarian and David R. Musicant. Active
support vector machine classification. In NIPS, pages
577–583, 2000.
[9] O. L. Mangasarian and David R. Musicant. Lagrangian
support vector machines. J. Mach. Learn. Res., 1:161–
177, September 2001. ISSN 1532-4435.
[10] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos:
Primal estimated sub-gradient solver for svm. In Proceedings of the 24th international conference on Machine
learning, ICML ’07, pages 807–814, 2007. ISBN 978-1-59593-793-3.
[11] J. A. K. Suykens and J. Vandewalle. Least squares
support vector machine classifiers. Neural Process. Lett.,
9(3):293–300, June 1999. ISSN 1370-4621.
Least Square Support Vector Machine for
Large-scale Dataset
Khanh Nguyen, Trung Le, Vinh Lai, Duy Nguyen, Dat Tran, Wanli Ma
Abstract—Support Vector Machine (SVM) is a very well-known tool for classification and regression problems. Many applications require SVMs with non-linear kernels for accurate classification. Training time complexity for SVMs with non-linear kernels is typically quadratic in the size of the training dataset. In this paper, we depart from the very well-known variation of SVM, the so-called Least Square Support Vector Machine, and apply the steepest sub-gradient descent method to propose Steepest Sub-gradient Descent Least Square Support Vector Machine (SGD-LSSVM). It is theoretically proven that the convergence rate of the proposed method to gain an $\varepsilon$-precision solution is $O(\log\frac{1}{\varepsilon})$. The experiments established on large-scale datasets indicate that the proposed method offers comparable classification accuracies while being faster than the baselines.
Index Terms—Support Vector Machine, solver, kernel method, steepest gradient descent

I. INTRODUCTION
Support Vector Machine (SVM) [2], [5] is a very popular learning method for classification and regression problems. There have been lots of variations of SVM proposed and applied to real-world problems, e.g., Least Square Support Vector Machine (LS-SVM) [19], Proximal Support Vector Machine (PSVM) [8], and Structural Support Vector Machine [22]. Although SVMs are known to give excellent classification results, their application to problems with large-scale datasets is encumbered by cumbersome training time requirements. Recently, much research has concentrated on designing fast learning algorithms that can efficiently handle large-scale datasets for linear SVMs [13], [10], [16]. Even though linear SVMs are preferable for sparse datasets, e.g., text classification, word-sense disambiguation, and drug design [13], many real-world applications require SVMs with non-linear kernels for accurate classification. Training time complexity for SVMs with non-linear kernels is typically quadratic in the size of the training dataset [17].
In this paper, we depart from the formulation of Least Square Support Vector Machine (LS-SVM) [19] and apply the steepest sub-gradient descent method to propose Steepest Sub-gradient Descent Least Square Support Vector Machine (SGD-LSSVM). Unlike the original steepest gradient descent method, in the proposed method we use a sub-gradient rather than the entire gradient for updating at each iteration. As a consequence, the computational complexity of each iteration is economical, around $O(N(t+\ln(t)+1)+4t+2)\approx O(N)$, where $t$ is usually a small number, e.g., $t=1,2$. The computational complexity of the proposed method is at worst quadratic in general, as for other methods. Nonetheless, it is theoretically proven that the proposed method scales linearly with the training set size under some specific conditions, e.g., the dataset is separated with margin $m^*$ in the feature space. Actually, this condition is not rare, since data tend to be compacted and clustered in the feature space. Furthermore, we also theoretically prove that the convergence rate to gain an $\varepsilon$-precision solution is $O(\log\frac{1}{\varepsilon})$. The experiments show that our proposed method is able to efficiently handle large-scale datasets and usually offers comparable classification accuracy while being faster than the baselines.

II. RELATED WORK
Solving a quadratic program is the main component of all SVMs. There has been extensive research on how to efficiently solve this quadratic program in terms of both computational and memory complexity. Interior-point methods, which have been proved effective on convex quadratic programs in other domains, have been applied to this quadratic program [6], [7], [9]. However, the density, size, and ill-conditioning of the kernel matrix make them very expensive for large-scale datasets.
Yet another conventional approach is the decomposition method, like SMO [15], SVM-Light [12], LIBSVM [3], and SVMTorch [4]. By using the caching technique, the decomposition methods can handle the memory used in training on large-scale datasets very well. Nonetheless, their super-linear scaling behavior with the dataset size [12], [15], [11] makes their use inefficient or even intractable on large-scale datasets.
An alternative for reducing the training time is to take advantage of sparsity. Some methods select a set of basis vectors a priori. This includes sampling randomly from the training set in the Nystrom method [23], greedily minimizing reconstruction error [18], and variants of the Incomplete Cholesky factorization [7], [1]. However, these selection methods are not part of the optimization process, which makes a goal-directed choice of basis vectors difficult [14]. Another method is to yield a linear system instead of a quadratic program [20]. The Hestenes-Stiefel conjugate gradient method was then applied to solve the linear system $Ax=B$ where $A\in\mathbb{R}^{n\times n}$ and $B\in\mathbb{R}^{n}$. It converges in at most $r+1$ steps, where $r=\operatorname{rank}(C)$ and $A=I+C$. However, the rate of convergence depends on the condition number of the matrix. In the case of using the RBF kernel, it is influenced by the choice of $(C,\gamma)$, where $C$ is the trade-off parameter and $\gamma$ is the kernel width parameter [21]. Cutting-Plane Subspace Pursuit (CPSP) [14] converts its quadratic optimization problem to an equivalent 1-slack one. The cutting-plane technique is applied to subsequently add constraints to a set of active constraints. To utilize sparsity, the preimage technique is used to approximate the active vector of each constraint by a linear combination of the images of $k$ vectors in the input space. The computational complexity of each iteration of CPSP is evaluated as $O(m^3+k^2m+km^2+kmn)$ plus the time it takes to solve the preimage problem, where $m\approx 30$ is the number of active constraints and $k\le k_{\max}$ is the maximal number of preimage vectors in the input space.
III. BACKGROUND

A. Support Vector Machine
Support Vector Machine (SVM) [2], [5] builds the optimal hyperplane such that the margin, the distance from the closest sample to the hyperplane, is maximized. SVM uses the Hinge loss function in its formulation:
$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2+C\sum_{i=1}^{N}\xi_i \qquad (1)$$
$$\text{s.t.}\ \forall_{i=1}^{N}: y_i\left(w^T\phi(x_i)+b\right)\ge 1-\xi_i,\qquad \forall_{i=1}^{N}:\xi_i\ge 0$$
where $\phi$ is a transformation from the input space to the feature space, $\xi=[\xi_1,\ldots,\xi_N]$ is the vector of slack variables, and $w^T\phi(x)+b$ is the equation of the optimal hyperplane.

B. Least Square Support Vector Machine
Least Square Support Vector Machine (LS-SVM) [19] is a variation of SVM. LS-SVM makes use of the square loss function in its formulation and replaces the inequality constraints by equality ones:
$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2+C\sum_{i=1}^{N}\xi_i^2 \qquad (2)$$
$$\text{s.t.}\ \forall_{i=1}^{N}: y_i\left(w^T\phi(x_i)+b\right)=1-\xi_i$$

C. Proximal Support Vector Machine
Proximal Support Vector Machine (PSVM) [8] is another variation of SVM. PSVM employs the square loss function and keeps the samples as close as possible to the margins:
$$\min_{w,b,\xi}\ \frac{1}{2}\left(\|w\|^2+b^2\right)+C\sum_{i=1}^{N}\xi_i^2 \qquad (3)$$
$$\text{s.t.}\ \forall_{i=1}^{N}: y_i\left(w^T\phi(x_i)+b\right)=1-\xi_i$$

IV. STEEPEST SUB-GRADIENT DESCENT TO LEAST SQUARE SUPPORT VECTOR MACHINE

A. Optimization Problem
To eliminate the bias parameter $b$, we depart from the formulation of LS-SVM and apply the transformation $x\rightarrow\begin{bmatrix}\phi(x)\\1\end{bmatrix}$ and $w\rightarrow\begin{bmatrix}w\\b\end{bmatrix}$. The optimization problem of LS-SVM becomes:
$$\min_{w,\xi}\ \frac{1}{2}\|w\|^2+C\sum_{i=1}^{N}\xi_i^2 \qquad (4)$$
$$\text{s.t.}\ \forall_{i=1}^{N}: y_iw^T\phi(x_i)=1-\xi_i$$
where $\phi(x_i)$ is assumed to be appended by a constant 1. The optimization problem in Eq. (4) is strictly equal to that of PSVM as shown in Eq. (3).

B. Solution
Using the KKT theorem, the Lagrange function of the optimization problem in Eq. (4) is as follows:
$$L(w,\xi,\alpha)=\frac{1}{2}\|w\|^2+C\sum_{i=1}^{N}\xi_i^2-\sum_{i=1}^{N}\alpha_i\left(y_iw^T\phi(x_i)-1+\xi_i\right)$$
Setting the derivatives to 0, we have:
$$\frac{\partial L}{\partial w}=0\rightarrow w=\sum_{i=1}^{N}y_i\alpha_i\phi(x_i),\qquad \frac{\partial L}{\partial\xi_i}=0\rightarrow 2C\xi_i=\alpha_i$$
Substituting into the Lagrange function, we come up with:
$$\min_{\alpha}\ f(\alpha)=\frac{1}{2}\alpha^TQ\alpha-e^T\alpha$$
where $Q=[Q_{ij}]_{N\times N}$ with $Q_{ij}=y_iy_jK(x_i,x_j)+\nu\delta_{ij}$, $\nu=\frac{1}{2C}$, and $\delta_{ij}$ is the Kronecker symbol.

C. Steepest Sub-gradient Descent Approach
Let us denote $f(\alpha)=\frac{1}{2}\alpha^TQ\alpha-e^T\alpha$. Let $B$ be a subset of $\{1,2,\ldots,N\}$, i.e., $B\subseteq\{1,2,\ldots,N\}$. The general update rule is as follows:
$$\alpha_{\text{new}}=\alpha+d=\alpha-\theta g_B$$
where $d=-\theta g_B$ with $g_i=\nabla_if(\alpha)$ if $i\in B$ and $g_i=0$ if $i\notin B$. We have:
$$\varphi(\theta)=f(\alpha_{\text{new}})-f(\alpha)=\frac{1}{2}d^TQd+\nabla f(\alpha)^Td=\frac{1}{2}d_B^TQ_{BB}d_B+g_B^Td_B=\frac{1}{2}\left(g_B^TQ_{BB}g_B\right)\theta^2-\theta\,g_B^Tg_B$$
We have $\theta_{\min}=\operatorname{argmin}_\theta\varphi(\theta)=\frac{g_B^Tg_B}{g_B^TQ_{BB}g_B}$ and $\varphi_{\min}=-\frac{\left(g_B^Tg_B\right)^2}{2g_B^TQ_{BB}g_B}$.
From the above, to maximally reduce $f(\alpha_{\text{new}})$, the update rule becomes
$$\alpha^{(k+1)}=\alpha^{(k)}-\theta_kg_B,\qquad \theta_k=\frac{g_B^Tg_B}{g_B^TQ_{BB}g_B}$$
Assuming that $|B|=t$, we choose the indices of the $t$ maximal values of the set $\{|g_i|:1\le i\le N\}$ to formulate $B$, i.e., $B=t\text{-}\operatorname{argmax}_{1\le i\le N}\{|g_i|\}$. The resulting procedure is summarized in Algorithm 1.

Algorithm 1 General Algorithm.
  $\forall_{i=1}^{N}:\alpha_i=0$;  $w=\vec{0}$;  $\forall_{i=1}^{N}:g_i=-1$;
  repeat
    $B=t\text{-}\operatorname{argmax}_{1\le i\le N}\{|g_i|\}$;
    $\theta=\frac{g_B^Tg_B}{g_B^TQ_{BB}g_B}$;
    $\alpha=\alpha-\theta g_B$;
    $w=w-\theta\sum_{i\in B}y_ig_i\phi(x_i)$;
  until $\|g\|_\infty<\varepsilon$

Let us define $\alpha^*=\operatorname{argmin}_\alpha f(\alpha)$. It is obvious that $\nabla f(\alpha^*)=Q\alpha^*-e=0$. The error function $V(\alpha)$ is defined as follows:
$$V(\alpha)=\frac{1}{2}\left(\alpha-\alpha^*\right)^TQ\left(\alpha-\alpha^*\right)=f(\alpha)+\frac{1}{2}\alpha^{*T}Q\alpha^*$$
Corollary 1. $V\left(\alpha^{(k+1)}\right)=(1-\gamma_k)V\left(\alpha^{(k)}\right)$ where $\gamma_k=\frac{\left(g_B^Tg_B\right)^2}{g_B^TQ_{BB}g_B\times g_k^TQ^{-1}g_k}$.
Proof:
$$V\left(\alpha^{(k)}\right)=\frac{1}{2}\left(\alpha^{(k)}-\alpha^*\right)^TQ\left(\alpha^{(k)}-\alpha^*\right)=\frac{1}{2}\left(Q\alpha^{(k)}-Q\alpha^*\right)^TQ^{-1}\left(Q\alpha^{(k)}-Q\alpha^*\right)=\frac{1}{2}g_k^TQ^{-1}g_k$$
Therefore, we gain the following:
$$\gamma_k=\frac{V\left(\alpha^{(k)}\right)-V\left(\alpha^{(k+1)}\right)}{V\left(\alpha^{(k)}\right)}=\frac{f\left(\alpha^{(k)}\right)-f\left(\alpha^{(k+1)}\right)}{V\left(\alpha^{(k)}\right)}=\frac{\left(g_B^Tg_B\right)^2}{2g_B^TQ_{BB}g_B}\times\frac{2}{g_k^TQ^{-1}g_k}$$
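To make the dual objective concrete, here is a small Python/NumPy sketch (our own illustration; function names are ours) that builds $Q$ with an RBF kernel and evaluates $f(\alpha)$ and its gradient $g=Q\alpha-e$:

```python
import numpy as np

def rbf_kernel(X, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2) for all pairs of rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def build_Q(X, y, C, gamma):
    """Q_ij = y_i y_j K(x_i, x_j) + nu * delta_ij, with nu = 1/(2C)."""
    nu = 1.0 / (2.0 * C)
    return np.outer(y, y) * rbf_kernel(X, gamma) + nu * np.eye(len(y))

def dual_objective_and_gradient(Q, alpha):
    """f(alpha) = 0.5 alpha^T Q alpha - e^T alpha, and its gradient g = Q alpha - e."""
    g = Q @ alpha - 1.0
    f = 0.5 * alpha @ Q @ alpha - alpha.sum()
    return f, g
```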
Corollary 2. $\gamma_k\ge\frac{t}{N}\times\frac{\lambda_{\min}}{\lambda_{\max}}$, where $\lambda_{\min}(Q)$ and $\lambda_{\max}(Q)$ indicate the minimal and maximal eigenvalues of the matrix $Q$ respectively, $\lambda_{\min}(t)=\min_B\lambda_{\min}(Q_{BB})>0$, $\lambda_{\max}(t)=\max_B\lambda_{\max}(Q_{BB})>0$, $\lambda_{\min}=\min_t\lambda_{\min}(t)$, and $\lambda_{\max}=\max_t\lambda_{\max}(t)$.
Proof:
Since $Q$ is a positive definite matrix, its pivot matrices $Q_{BB}$ are positive definite as well. According to Rayleigh's quotient inequality, we have:
$$\lambda_{\min}\le\lambda_{\min}(Q_{BB})\le\frac{g_B^TQ_{BB}g_B}{g_B^Tg_B}\le\lambda_{\max}(Q_{BB})\le\lambda_{\max} \qquad (5)$$
$$\lambda_{\min}\left(Q^{-1}\right)\le\frac{g_k^TQ^{-1}g_k}{g_k^Tg_k}\le\lambda_{\max}\left(Q^{-1}\right)=\frac{1}{\lambda_{\min}(Q)}\le\frac{1}{\lambda_{\min}} \qquad (6)$$
Furthermore, according to the selection procedure for $B$, we have:
$$g_B^Tg_B\ge\frac{t}{N}\,g_k^Tg_k \qquad (7)$$
According to Eqs. (5, 6, 7), we have:
$$\gamma_k=\frac{g_B^Tg_B}{g_B^TQ_{BB}g_B}\times\frac{g_B^Tg_B}{g_k^Tg_k}\times\frac{g_k^Tg_k}{g_k^TQ^{-1}g_k}\ge\frac{t}{N}\times\frac{\lambda_{\min}}{\lambda_{\max}}$$

Theorem 3. $\lim_{k\rightarrow\infty}\alpha^{(k)}=\alpha^*$.
Proof:
$$V\left(\alpha^{(k)}\right)=(1-\gamma_{k-1})V\left(\alpha^{(k-1)}\right)=\prod_{i=0}^{k-1}(1-\gamma_i)\,V\left(\alpha^{(0)}\right)\le\left(1-\frac{t}{N}\times\frac{\lambda_{\min}}{\lambda_{\max}}\right)^{k}V\left(\alpha^{(0)}\right) \qquad (8)$$
Therefore, we conclude $\lim_{k\rightarrow\infty}V\left(\alpha^{(k)}\right)=0$, or $\lim_{k\rightarrow\infty}\alpha^{(k)}=\alpha^*$. The inequality in Eq. (8) also reveals that the convergence rate to gain an $\varepsilon$-precision solution of the proposed method is $O\left(\log\frac{1}{\varepsilon}\right)$.

V. THE EFFICIENT COMPUTATION
In this section, we propose how to efficiently calculate the quantities in Algorithm 2 and discuss the computational complexity of this algorithm as well.
Let us denote $gyx=\sum_{i\in B}g_iy_i\phi(x_i)$. According to Lemma 7 in the Appendix, we have the following:
$$g_B^TQ_{BB}g_B=\nu g_B^Tg_B+gyx^Tgyx$$
Let us define $\forall_{i=1}^{N}: a_i=gyx^T\phi(x_i)$. It appears that:
$$\forall_{i=1}^{N}: a_i=\sum_{j\in B}g_jy_jK(x_j,x_i),\qquad gyx^Tgyx=\sum_{i\in B}g_iy_ia_i$$
Moreover, the gradient can be maintained incrementally. We have $\forall_{i=1}^{N}: g_i=Q_i\alpha-1=y_iw^T\phi(x_i)+\nu\alpha_i-1$, and after one update
$$\forall_{i=1}^{N}: g_i^{\text{new}}=y_i\Big(w-\theta\sum_{j\in B}g_jy_j\phi(x_j)\Big)^T\phi(x_i)+\nu\alpha_i^{\text{new}}-1=g_i-\theta y_ia_i-\nu\theta\,1_B(i)\,g_i$$
where $1_B(x)=1$ if $x\in B$ and $1_B(x)=0$ if $x\notin B$ is the characteristic function of the set $B$.
Algorithm 2 The Steepest Sub-gradient Descent Least Square Support Vector Machine (SGD-LSSVM).
  $\forall_{i=1}^{N}:\alpha_i=0$;  $w=\vec{0}$;  $\forall_{i=1}^{N}:g_i=-1$;
  repeat
    $B=t\text{-}\operatorname{argmax}_{1\le i\le N}\{|g_i|\}$;                                   // O(N ln(t))
    $gg=\sum_{i\in B}g_i^2$;                                                                       // O(t)
    for $i=1$ to $N$ do $a_i=\sum_{j\in B}g_jy_jK(x_j,x_i)$;                                       // O(Nt)
    $gyxNorm=\sum_{i\in B}g_iy_ia_i$;                                                              // O(t)
    $gQg=\nu\times gg+gyxNorm$;                                                                    // O(1)
    $\theta=\frac{gg}{gQg}$;                                                                       // O(1)
    foreach $i\in B$ do $\alpha_i=\alpha_i-\theta g_i$;                                            // O(t)
    foreach $i\in B$ do $g_i=(1-\nu\theta)g_i$;                                                    // O(t)
    for $i=1$ to $N$ do $g_i=g_i-\theta y_ia_i$;                                                   // O(N)
  until $\|g\|_\infty<\varepsilon$
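A compact Python/NumPy rendering of Algorithm 2 (our own illustrative sketch, not the authors' C/C++ code; it precomputes the full kernel matrix for clarity, whereas the paper only evaluates the $t$ kernel columns needed per iteration):

```python
import numpy as np

def sgd_lssvm(K, y, C=1.0, t=1, eps=0.01, max_iter=100_000):
    """Steepest sub-gradient descent for the LS-SVM dual (Algorithm 2).

    K: (N, N) kernel matrix, y: (N,) labels in {-1, +1}.
    Returns the dual variables alpha.
    """
    N = len(y)
    nu = 1.0 / (2.0 * C)
    alpha = np.zeros(N)
    g = -np.ones(N)                      # gradient of f at alpha = 0 is -e
    for _ in range(max_iter):
        B = np.argsort(-np.abs(g))[:t]   # indices of the t largest |g_i|
        gg = np.sum(g[B] ** 2)
        a = (g[B] * y[B]) @ K[B, :]      # a_i = sum_{j in B} g_j y_j K(x_j, x_i)
        gQg = nu * gg + np.sum(g[B] * y[B] * a[B])   # g_B^T Q_BB g_B (Lemma 7)
        theta = gg / gQg                 # steepest step size
        alpha[B] -= theta * g[B]
        g[B] *= (1.0 - nu * theta)       # regularization part of the gradient update
        g -= theta * y * a               # kernel part of the gradient update
        if np.max(np.abs(g)) < eps:
            break
    return alpha
```

A prediction for a new point $x$ then uses the primal representation $w=\sum_i y_i\alpha_i\phi(x_i)$, i.e., $f(x)=\sum_i y_i\alpha_iK(x_i,x)$.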
The computational complexity of each iteration in Algorithm 2 is $O(N(t+\ln(t)+1)+4t+2)\approx O(N)$. Therefore, the computational complexity of Algorithm 2 crucially depends on the number of iterations. In what follows, we prove that this number of iterations is linear in general and is a constant, independently of the training set size $N$, under some specific conditions, e.g., the training set is linearly separated with the margin $m^*$ in the feature space.

Theorem 4. Algorithm 2 terminates after a finite number of iterations.
Proof:
$$f\left(\alpha^{(k)}\right)-f\left(\alpha^{(k+1)}\right)=\frac{\left(g_B^Tg_B\right)^2}{2g_B^TQ_{BB}g_B}=\frac{g_B^Tg_B}{2g_B^TQ_{BB}g_B}\,g_B^Tg_B\ge\frac{g_B^Tg_B}{2\lambda_{\max}(t)} \qquad (9)$$
Furthermore, according to the selection procedure for $B$, we have:
$$g_B^Tg_B\ge\max_{i\in B}g_i^2=\|g\|_\infty^2\ge\varepsilon^2 \qquad (10)$$
From Eqs. (9, 10), we achieve:
$$f\left(\alpha^{(k)}\right)-f\left(\alpha^{(k+1)}\right)\ge\frac{\varepsilon^2}{2\lambda_{\max}(t)} \qquad (11)$$
$$|f(\alpha^*)|\ge-f(\alpha^*)\ge-f\left(\alpha^{(n)}\right)=f\left(\alpha^{(0)}\right)-f\left(\alpha^{(n)}\right)=\sum_{k=0}^{n-1}\left(f\left(\alpha^{(k)}\right)-f\left(\alpha^{(k+1)}\right)\right)\ge\frac{n\varepsilon^2}{2\lambda_{\max}(t)}$$
Therefore, the number of iterations $n$ cannot exceed:
$$n\le\frac{2\lambda_{\max}(t)\,|f(\alpha^*)|}{\varepsilon^2} \qquad (12)$$
Without loss of generality, we can assume that the data are bounded in the feature space, i.e., $\|\phi(x)\|\le R,\forall x$. According to Lemma 9 in the Appendix, the upper bound in Ineq. (12) becomes:
$$n\le\frac{2\left(tR^2+\nu\right)|f(\alpha^*)|}{\varepsilon^2} \qquad (13)$$

Theorem 5. The number of iterations in Algorithm 2 is $O(N)$ in general, and its computational complexity is therefore $O(N^2)$ in general.
Proof:
We have the following:
$$0>f(\alpha^*)>\sum_{i=1}^{N}\left(\frac{1}{2}\nu\alpha_i^{*2}-\alpha_i^*\right)\ge N\times\min_t\left(\frac{1}{2}\nu t^2-t\right)=-\frac{N}{2\nu}=-NC$$
Therefore, the number of iterations in Algorithm 2 cannot exceed:
$$n\le\frac{2\left(tR^2+\nu\right)NC}{\varepsilon^2}\approx O(N)$$

Theorem 6. Under one of the two following conditions, the number of iterations in Algorithm 2 is $O(\text{const})$, independently of the training set size $N$:
1) $\|\alpha^*\|_1$ is upper bounded by a constant $M$ independently of $N$.
2) The training set is linearly separated by the margin $m^*$.
Proof:
In both cases, we need to bound $|f(\alpha^*)|$ by a constant independently of the training set size $N$.
1) We have that $\nabla f(\alpha^*)=0$ and thereby $Q\alpha^*=e$. It follows that:
$$0>f(\alpha^*)=\frac{1}{2}\alpha^{*T}Q\alpha^*-e^T\alpha^*=\frac{1}{2}e^T\alpha^*-e^T\alpha^*=-\frac{1}{2}e^T\alpha^*=-\frac{1}{2}\|\alpha^*\|_1\ge-\frac{1}{2}M$$
Therefore, we gain: $|f(\alpha^*)|\le\frac{1}{2}M$.
2) Given a training set, to find the optimal hyperplane with the maximal margin, we need to solve the following optimization problem:
$$\min_{w,b}\ \frac{1}{2}\|w\|^2\qquad\text{s.t.}\ \forall_{i=1}^{N}: y_iw^T\phi(x_i)\ge 1$$
The Lagrange function is as follows:
$$L(w,\alpha)=\frac{1}{2}\|w\|^2-\sum_{i=1}^{N}\alpha_i\left(y_iw^T\phi(x_i)-1\right)$$
Let $(\bar{w},\bar{\alpha})$ be the optimal solution. By setting the derivative of $L(\cdot,\cdot)$ to 0, we gain the following:
$$\bar{w}=\sum_{i=1}^{N}y_i\bar{\alpha}_i\phi(x_i)$$
The training set being linearly separated with the margin $m^*$ implies that:
$$m^*\le\frac{1}{\|\bar{w}\|}\quad\text{or}\quad\frac{1}{2}\|\bar{w}\|^2\le\frac{1}{2m^{*2}}$$
At the optimal solution, strong duality holds:
$$\frac{1}{2m^{*2}}\ge\frac{1}{2}\|\bar{w}\|^2=-\frac{1}{2}\|\bar{w}\|^2+\sum_{i=1}^{N}\bar{\alpha}_i=\max_\alpha\left(-\frac{1}{2}\sum_{i,j}y_iy_jK_{ij}\alpha_i\alpha_j+\sum_{i=1}^{N}\alpha_i\right)$$
where $K_{ij}=K(x_i,x_j)$. Therefore, we have
$$\min_\alpha\left(\frac{1}{2}\sum_{i,j}y_iy_jK_{ij}\alpha_i\alpha_j-\sum_{i=1}^{N}\alpha_i\right)\ge-\frac{1}{2m^{*2}}$$
Finally, we reach the conclusion as follows:
$$0>f(\alpha^*)>\frac{1}{2}\sum_{i,j}y_iy_jK_{ij}\alpha_i^*\alpha_j^*-\sum_{i=1}^{N}\alpha_i^*\ge\min_\alpha\left(\frac{1}{2}\sum_{i,j}y_iy_jK_{ij}\alpha_i\alpha_j-\sum_{i=1}^{N}\alpha_i\right)\ge-\frac{1}{2m^{*2}}$$
Hence, we have: $|f(\alpha^*)|<\frac{1}{2m^{*2}}$.

VI. EXPERIMENT

A. Experimental Settings
We established the experiments over 8 datasets. All datasets are publicly available for download. The details of the experimental datasets are given in Table I. We compared our proposed method SGD-LSSVM with SVM-Light [12], SVM-Perf [14], LIBSVM [3], Nystrom [23], and Incomplete Cholesky Factorization (ICF) [1]. All codes are implemented in C/C++. Actually, we used Nystrom and ICF as implemented in SVM-Perf, as described in [14]. In all experiments, the RBF kernel, given by $K(x,x')=e^{-\gamma\|x-x'\|^2}$, was employed. A grid search was executed, where the trade-off parameter $C$ was varied in the grid $\{2^{-5},2^{-3},\ldots,2^{15}\}$ and the kernel width $\gamma$ was searched in the grid $\{2^{-15},2^{-3},\ldots,2^{5}\}$. To make the parameters consistent, in SVM-Light and SVM-Perf the parameter $C$ is regarded as $C\times N$, where $N$ is the training set size. Cross-validation with 5 folds was carried out. In addition, for all methods the precision $\varepsilon$ was set to 0.01. The parameter $t$ of SGD-LSSVM was set to 1.

Table I
EXPERIMENTAL DATASETS.

Dataset      Size       Dimension
a9a          32,561     123
Cod-Rna      271,617    8
Rcv1         20,242     47,236
Mushroom     8,124      112
SvmGuide1    3,089      4
Shuttle      34,145     9
IJCNN1       49,990     22
Splice       1,000      60

B. How Fast and Accurate is the Proposed Method Compared with the Existing Methods?
In Table II and Figure 1, we report the accuracies and the training times corresponding to the optimal values of $C$, $\gamma$ for each method on each dataset. As seen from Table II and Figure 1, our proposed method SGD-LSSVM is comparable with the others in terms of accuracy, i.e., the six columns of every dataset are nearly equal in the lower chart of Figure 1. With reference to the training time, SGD-LSSVM is faster than the others on most datasets except for Rcv1 and Shuttle, i.e., the plot of SGD-LSSVM in the upper chart of Figure 1 is almost always lower than the others.
Figure 1. The training times and accuracies on the datasets.
Table II
THE TRAINING TIMES (IN SECONDS) AND THE ACCURACIES (IN %).

Dataset     SGD-LSSVM     SVM-Light     SVM-Perf      LIBSVM        Nystrom      ICF
            Time    Acc   Time    Acc   Time    Acc   Time    Acc   Time   Acc   Time    Acc
a9a         88      83    2,089   85    21,343  83    149     85    162    85    648     80
Cod-Rna     169     91    1,949   93    1,392   90    303     94    809    89    1,786   89
Rcv1        239     97    1,001   97    3,097   97    982     97    66     95    1,180   94
Mushroom    1       100   18      100   29      100   1.5     100   14     100   157     100
SvmGuide1   0.4     97    9       97    15      97    1.6     97    2      97    4       97
Shuttle     25      100   50      100   229     100   6       100   18     100   197     100
IJCNN1      169     99    1,949   99    1,391   99    303     99    809    98    1,786   98
Splice      0.13    89    0.62    87    75      88    0.15    87    2      85    6.5     85
C. How does the Parameter t Affect the Training Time and the Number of Iterations?
To investigate the effect of the parameter $t$ on the training time, the number of iterations, and the accuracies, we established an experiment on the datasets Rcv1 and IJCNN1. The optimal values for $C$, $\gamma$ were used in the experiment and the precision $\varepsilon$ was set to 0.01. We varied the parameter $t$ in the grid $\{1,2,4,8,16,32,64,128,256,512\}$ and recorded the training time, accuracy, and number of iterations in each case.
As observed from Figure 3, when $t$ increases, the number of iterations decreases. This fact can be explained by Corollaries 1 and 2, which say that when $t$ becomes larger, the objective function is reduced by a larger amount across the iterations. Figure 2 shows the training times when $t$ is varied. As shown in Figure 2, when $t$ becomes larger, the training times tend to increase. The reason is that, although the number of iterations decreases when $t$ is increased, the computation cost of each iteration increases as a trade-off. Figure 4 indicates the accuracies when $t$ is varied in its domain. As seen from Figure 4, the accuracies vary only slightly and are stable with the variation of $t$. This experiment also recommends that the optimal value for $t$ is either 1 or 2. With these choices for $t$, the computation cost in each iteration is very cheap, which results in a reduced training time.
Figure 3. The number of iterations when t is varied.
Figure 4. The accuracies when t is varied.
Figure 2. The training times when t is varied.

D. How does the Proposed Method Scale with the Number of Training Examples?
To visually show how the training time of the proposed method scales with the number of training examples, we established an experiment on the datasets Rcv1 and IJCNN1. We selected 10%, 20%, ..., 100% of each dataset and measured the training time for each case. The optimal parameters $C$, $\gamma$ were used in this experiment, the parameter $t$ was set to 1, and the precision $\varepsilon$ was set to 0.01. As shown in Figure 5, the training time approximately scales with linear complexity. This may be partially explained by the fact that the optimal values $|f(\alpha^*)|$ of the sub-datasets increase only slightly, thereby leading to a slight increase in the number of iterations, since $n\le\frac{2(tR^2+\nu)|f(\alpha^*)|}{\varepsilon^2}$.
Figure 5. The training times of the percentages on the datasets.
VII. CONCLUSION
We have departed from Least Square Support Vector Machine and proposed Steepest Sub-gradient Descent Least Square Support Vector Machine (SGD-LSSVM). SGD-LSSVM has good properties regarding convergence rate and computational complexity: 1) its convergence rate to gain an $\varepsilon$-precision solution is $O(\log\frac{1}{\varepsilon})$; and 2) its computational complexity is $O(N^2)$ in general and $O(N)$ under some specific conditions that are not rare in practice. This makes the proposed method efficient and tractable for handling large-scale datasets.
APPENDIX
Lemma 7. $g_B^TQ_{BB}g_B=\nu g_B^Tg_B+gyx^Tgyx$.
Proof:
We have
$$g_B^TQ_{BB}g_B=\sum_{i\in B}\sum_{j\in B}g_ig_jq_{ij}=\sum_{i\in B}\sum_{j\in B}g_ig_j\left(y_iy_jK(x_i,x_j)+\nu\delta_{ij}\right)=\sum_{i\in B}\sum_{j\in B}g_ig_jy_iy_jK(x_i,x_j)+\nu\sum_{i\in B}g_i^2=\nu g_B^Tg_B+gyx^Tgyx$$
Lemma 8. $gyx^Tgyx=\sum_{i\in B}g_iy_ia_i$.
Proof:
We have
$$gyx^Tgyx=gyx^T\sum_{i\in B}g_iy_i\phi(x_i)=\sum_{i\in B}g_iy_i\,gyx^T\phi(x_i)=\sum_{i\in B}g_iy_ia_i$$
Lemma 9. Let $Q$ be an $m\times m$ positive definite matrix and $\lambda_{\max}=\lambda_{\max}(Q)$ the maximal eigenvalue of $Q$. We have the following:
1) $\lambda_{\max}\le\|Q\|$ for any matrix norm.
2) If $Q=\left[K(x_i,x_j)+\nu\delta_{ij}\right]_{i,j=1}^{m}$ and $\max_iK(x_i,x_i)^{1/2}=\max_i\|\phi(x_i)\|\le R$, then $\lambda_{\max}\le mR^2+\nu$.
Proof:
1) Let us denote the eigenvector of $\lambda_{\max}$ by $x$. It follows that:
$$\lambda_{\max}\|x\|=\|\lambda_{\max}x\|=\|Qx\|\le\|Q\|\,\|x\|$$
2) From the above, we have
$$\lambda_{\max}\le\|Q\|_1=\max_i\sum_j|Q_{ij}|=\nu+\max_i\sum_jK(x_i,x_j)\le\nu+\max_i\|\phi(x_i)\|\sum_j\|\phi(x_j)\|\le\nu+mR^2$$

REFERENCES
[1] F. R. Bach and M. I. Jordan. Predictive low-rank decomposition for
kernel methods. In Proceedings of the 22Nd International Conference
on Machine Learning, ICML ’05, pages 33–40, New York, NY, USA,
2005. ACM.
[2] B. E. Boser, I. M. Guyon, and V. Vapnik. A training algorithm for
optimal margin classifiers. In Proceedings of the 5th Annual ACM
Workshop on Computational Learning Theory, pages 144–152. ACM
Press, 1992.
[3] C.-C. Chang and C.-J. Lin. Libsvm: A library for support vector
machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1–27:27, May 2011.
[4] R. Collobert and S. Bengio. Svmtorch: Support vector machines for
large-scale regression problems. J. Mach. Learn. Res., 1:143–160,
September 2001.
[5] C. Cortes and V. Vapnik. Support-vector networks. In Machine Learning,
pages 273–297, 1995.
[6] M. C. Ferris and T. S. Munson. Interior-point methods for massive
support vector machines. SIAM J. on Optimization, 13(3):783–804,
August 2002.
[7] S. Fine and K. Scheinberg. Efficient svm training using low-rank kernel
representations. J. Mach. Learn. Res., 2:243–264, March 2002.
[8] G. Fung and O. L. Mangasarian. Proximal support vector machine
classifiers. In Proceedings of the seventh ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 77–86,
2001.
[9] E. M. Gertz and S. J. Wright. Object-oriented software for quadratic
programming. ACM Trans. Math. Softw., 29(1):58–81, March 2003.
[10] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan.
A dual coordinate descent method for large-scale linear svm. In
Proceedings of the 25th international conference on Machine learning,
ICML ’08, pages 408–415. ACM, 2008.
[11] D. Hush and C. Scovel. Polynomial-time decomposition algorithms for
support vector machines. Machine Learning, 51(1):51–71, 2003.
[12] T. Joachims. Advances in kernel methods. chapter Making Large-scale
Support Vector Machine Learning Practical, pages 169–184. MIT Press,
Cambridge, MA, USA, 1999.
[13] T. Joachims. Training linear svms in linear time. In Proceedings of the
12th ACM SIGKDD international conference on Knowledge discovery
and data mining, KDD ’06, pages 217–226, 2006.
[14] T. Joachims and C.-N. Yu. Sparse kernel svms via cutting-plane training.
Machine Learning, 76(2-3):179–193, 2009. European Conference on
Machine Learning (ECML) Special Issue.
[15] J. C. Platt. Advances in kernel methods. chapter Fast Training of Support
Vector Machines Using Sequential Minimal Optimization, pages 185–
208. 1999.
[16] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated
sub-gradient solver for svm. In Proceedings of the 24th international
conference on Machine learning, ICML ’07, pages 807–814, 2007.
[17] S. Shalev-Shwartz and N. Srebro. In William W. Cohen, Andrew
McCallum, and Sam T. Roweis, editors, ICML, volume 307, 2008.
[18] A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation
for machine learning. In Proceedings of the Seventeenth International
Conference on Machine Learning, ICML ’00, pages 911–918, 2000.
[19] J. A. K. Suykens and J. Vandewalle. Least squares support vector
machine classifiers. Neural Process. Lett., 9(3):293–300, June 1999.
[20] JAK Suykens, Lukas Lukas, Paul Van Dooren, Bart De Moor, Joos
Vandewalle, et al. Least squares support vector machine classifiers:
a large scale algorithm. Citeseer.
[21] Johan AK Suykens, Jos De Brabanter, Lukas Lukas, and Joos Vandewalle. Weighted least squares support vector machines: robustness and
sparse approximation. Neurocomputing, 48(1):85–105, 2002.
[22] I. Tsochantaridis. Support Vector Machine Learning for Interdependent
and Structured Output Spaces. PhD thesis, Providence, RI, USA, 2005.
AAI3174684.
[23] C. Williams and M. Seeger. Using the nystrom method to speed up
kernel machines. In Advances in Neural Information Processing Systems
13, pages 682–688. MIT Press, 2001.
Fast One-Class Support Vector Machine for
Novelty Detection
Trung Le1 , Dinh Phung2 , Khanh Nguyen1 , Svetha Venkatesh2
1
2
Department of Information Technology, HCMc University of Pedagogy, Vietnam
Center for Pattern Recognition and Data Analytics, Deakin University, Australia
Abstract. Novelty detection arises as an important learning task in several applications. Kernel-based approaches to novelty detection have been widely used due to their theoretical rigor and elegance of geometric interpretation. However, computational complexity is a major obstacle in this approach. In this paper, leveraging the cutting-plane framework with the well-known One-Class Support Vector Machine, we present new solutions that can scale up seamlessly with data. The first solution is exact and linear when viewed through the cutting-plane; the second employs a sampling strategy that remarkably has a constant computational complexity defined relative to the probability of approximation accuracy. Several datasets are benchmarked to demonstrate the credibility of our framework.
Keywords: One-class Support Vector Machine, Novelty Detection, Large-scale dataset.
1 Introduction
Real data rarely conform to regular patterns. Finding subtle deviation from the norm, or novelty detection (ND), is an important research topic in many data analytics and machine learning tasks, ranging from video security surveillance and network abnormality detection to detection of abnormal gene expression sequences. Novelty detection refers to the task of finding patterns in data that do not conform to expected behaviors [4]. These anomaly patterns are interesting because they reveal actionable information, the known unknowns and unknown unknowns. Novelty detection methods have been used in many real-world applications, e.g., intrusion detection [8] and fraud detection (credit card fraud detection, mobile phone fraud detection, and insurance claim fraud detection) [7].
There are various existing approaches applied to novelty detection, including the neural network based approach [2], Bayesian network based approach [1], rule based approach [6], and kernel based approach [12,14]. However, these approaches are either ad hoc (rule-based) or lack a principled way to scale up with data. Moreover, none of these methods has been experimented with large datasets. In this paper, we depart from a kernel-based approach, which has an established theoretical foundation, and propose a new novelty detection machine that scales up seamlessly with data.
The idea of using kernel-based methods for novelty detection is certainly not new. At its crux, the geometry of the normal data is learned from data to define the domain of novelty. One-Class Support Vector Machine (OCSVM) [12] aims at constructing an optimal hyperplane that separates the origin and the data samples such that the margin, the distance from the origin to the hyperplane, is maximized. OCSVM is used in the case where data of only one class is available and we wish to learn the description of this class. In another approach, Support Vector Data Description (SVDD) [14], the novelty domain of the normal data is defined as an optimal hypersphere in the feature space, which becomes a set of contours tightly covering the normal data when mapped back to the input space.
There are also noticeable research efforts in scaling up kernel-based methods to linear complexity, including Pegasos [13] (kernel case) and SVM-Perf [10]. However, to the best of our knowledge, none has attempted the novelty detection problem.
Starting with the formulation of the cutting plane method [11], we propose in this paper a new approach for novelty detection. Our approach is a combination of the cutting plane method applied to the well-known One-Class Support Vector Machine [12] – we term our model Fast One-Class SVM (FOSVM). We propose two complementary solutions for training and detection. The first is an exact solution that has linear complexity. The number of iterations to reach a very close accuracy, termed θ-precision (defined later in the paper), is $O(\frac{1}{\theta})$, making it attractive in dealing with large datasets. The second solution employs a sampling strategy that remarkably has constant computational complexity defined relative to the probability of approximation accuracy. In all cases, rigorous proofs are given for the convergence analysis as well as the complexity analysis.
2 Related Work
Most closely related to our present work is the Core Vector Machine (CVM) [18] and its simplified version, Ball Vector Machine (BVM) [17]. CVM builds on achievements in computational geometry [3] to reformulate a variation of L2-SVM as a problem of finding a minimal enclosing ball (MEB). CVM shares with ours the use of the principle of extremity in developing the algorithm. Nonetheless, CVM does not solve its optimization problem directly and does not offer the insight of the cutting plane method as ours does.
The cutting plane method [11] has been applied to building solvers for kernel methods in [9,10,15,16]. In [10], the cutting plane method is employed to train SVM-struct; the N-slack optimization problem with N constraints is converted to an equivalent 1-slack optimization problem with $2^N$ constraints (binary classification), and the constraints are subsequently added to the optimization problem. In [15,16], the bundle method is used for solving the regularized risk minimization problem. Though it has been proven in those works that the bundle method is very efficient for the regularized risk minimization problem and that the number of iterations to gain an ε-precision solution is $O(\frac{1}{\varepsilon})$ for non-smooth problems and $O(\log\frac{1}{\varepsilon})$ for smooth problems, they are implausible to extend to the non-linear kernel case.
3 Proposed Fast One-Class Support Vector Machine (FOSVM1)

3.1 One-Class Support Vector Machine
One-Class Support Vector Machine
We depart from the One-Class Support Vector Machine proposed by [12]. Given
a training set with only positive samples D = {x1 , . . . , xN }, One-Class Support
Vector Machine (OCSVM) learns the optimal hyperplane that can separate the
origin and all samples such that the margin, the distance from the origin to the
hyperplane, is maximized. The optimization problem of soft model of OCSVM
is as follows:
min
w,ρ
1
w
2
2
−ρ+
1
vN
N
ξi
(1)
i=1
T
s.t. : ∀N
i=1 : w Φ(xi ) ≥ ρ − ξi
∀N
i=1 : ξi ≥ 0
If the slack variables are not used, i.e., all samples in the training set must stay completely on the positive side of the optimal hyperplane, the soft model becomes the hard model, associated with the following optimization problem:
$$\min_{w,\rho}\ \frac{1}{2}\|w\|^2-\rho \qquad (2)$$
$$\text{s.t.}\ \forall_{i=1}^{N}: w^T\Phi(x_i)\ge\rho$$
3.2 Cutting Plane Method for One-Class SVM
To apply the cutting plane method to the optimization problem of OCSVM, we rewrite the OCSVM optimization problem as follows:
$$\min_{w,\rho}\ \frac{1}{2}\|w\|^2-\rho \qquad (3)$$
$$\text{s.t.}\ \forall_{i=1}^{N}: \begin{bmatrix}\Phi(x_i)\\-1\end{bmatrix}^T\begin{bmatrix}w\\\rho\end{bmatrix}\ge 0\ \ (\text{constraint }C_i)$$
The feasible set of the OCSVM optimization problem is composed of the intersection of $N$ half-spaces corresponding to the constraints $C_i\ (1\le i\le N)$ (see Figure 1). Inspired by the cutting plane method, the following algorithm is proposed:
Algorithm 1 Cutting method to find the θ-approximate solution.
  for n = 2, 3, 4, ...
    $(w_n,\rho_n)=\text{solve}(n)$;
    $n+1=i_{n+1}=\operatorname{argmin}_{i>n}\begin{bmatrix}\Phi(x_i)\\-1\end{bmatrix}^T\begin{bmatrix}w_n\\\rho_n\end{bmatrix}=\operatorname{argmin}_{i>n}\left(w_n^T\Phi(x_i)-\rho_n\right)$;
    $o_{n+1}=w_n^T\Phi(x_{n+1})-\rho_n$;
    if ($o_{n+1}\ge-\theta\rho_n$) return $(w_n,\rho_n)$;
  endfor
The procedure solve(n) stands for solving the optimization problem in Eq. (3) where the first $n$ constraints $C_i\ (1\le i\le n)$ are activated. For convenience, we assume that the chosen index is $i_{n+1}=n+1$. The satisfaction of the condition $o_{n+1}\ge-\theta\rho_n$, where $\theta\in(0,1)$, means that the current solution $(w_n,\rho_n)$ is that of the relaxation of the optimization problem in Eq. (3) in which the constraints $C_i\ (i>n)$ are replaced by their relaxations $C_i'\ (i>n)$ as follows:
$$\min_{w,\rho}\ \frac{1}{2}\|w\|^2-\rho$$
$$\text{s.t.}\ \forall_{i=1}^{n}: \begin{bmatrix}\Phi(x_i)\\-1\end{bmatrix}^T\begin{bmatrix}w\\\rho\end{bmatrix}\ge 0\ \ (\text{constraint }C_i)$$
$$\qquad\ \forall_{i=n+1}^{N}: \begin{bmatrix}\Phi(x_i)\\-1\end{bmatrix}^T\begin{bmatrix}w\\\rho\end{bmatrix}\ge-\theta\rho_n\ \ (\text{constraint }C_i')$$
Otherwise, the hyperplane corresponding to the constraint $C_{n+1}$ separates the current solution $s_n=(w_n,\rho_n)$ and the feasible set of the full optimization problem in Eq. (1), since $\begin{bmatrix}\Phi(x_{n+1})\\-1\end{bmatrix}^T\begin{bmatrix}w_n\\\rho_n\end{bmatrix}<-\rho_n\theta<0$; the constraint $C_{n+1}$ is added to the active constraint set (see Figure 1). Furthermore, the distance from the current solution to the chosen hyperplane, as shown in Eq. (4), is maximized so that the current feasible set rapidly approaches the destined one (see Figure 1).
$$d_{n+1}=d(s_n,C_{n+1})=\frac{\rho_n-w_n^T\Phi(x_{n+1})}{\|w_n\|} \qquad (4)$$

3.3 Approximate Hyperplane
We present an interpretation of the proposed Algorithm 1 under the framework of the approximate hyperplane. Let us start by clarifying the approximate hyperplane notion. Let $A\subseteq[N]$ be a subset of the set of the first $N$ positive integers and $D_A$ be the subset of the training set $D$ including the samples whose indices are in $A$.
Fig. 1. The constraint $C_{n+1}$ is chosen to be added to the current active constraint set (red is active and green is inactive).
Denote the optimal hyperplane induced by $D_A$ by $(H_A): w_A^T\Phi(x)-\rho_A=0$. Given a sample $x$, let us define the membership of $x$ with regard to the positive side of the hyperplane $(H_A)$ by:
$$m_A(x)=\begin{cases}1-\frac{d(\Phi(x),H_A)}{d(0,H_A)}=1+\frac{w_A^T\Phi(x)-\rho_A}{\rho_A}=\frac{w_A^T\Phi(x)}{\rho_A} & \text{if } w_A^T\Phi(x)-\rho_A<0\\[2pt] 1 & \text{otherwise}\end{cases}$$
Intuitively, the membership of $x$ is exactly 1 if $\Phi(x)$ lies on the positive side of $H_A$; otherwise this membership decreases as $\Phi(x)$ moves further from the hyperplane on the negative side.
We say that the hyperplane $H_A$ is a θ-approximate hyperplane if it is the optimal hard-margin hyperplane induced by $D_A$ and the memberships with respect to $H_A$ of the data samples in $D\setminus D_A$ are all greater than or equal to $1-\theta$, i.e., $\forall x\in D\setminus D_A: m_A(x)\ge 1-\theta$. The visualization of a θ-approximate hyperplane is shown in Figure 2.
Using the θ-approximate hyperplane, Algorithm 1 can now be rewritten as:
Fig. 2. θ-approximate hyperplane (green is active, black is inactive); the red line stands for the optimal hard-margin hyperplane induced by $D_A$.
Algorithm 2 Algorithm to find the θ-approximate hyperplane.
  A = {1, 2}; n = 2;
  do
    $(w_n,\rho_n)=\text{OCSVM}(\{x_i: i\in A\})$;
    $n+1=i_{n+1}=\operatorname{argmin}_{i\in A^c}\left(w_n^T\Phi(x_i)-\rho_n\right)$;
    $o_{n+1}=w_n^T\Phi(x_{n+1})-\rho_n$;
    if ($o_{n+1}\ge-\theta\rho_n$) return $(w_n,\rho_n)$;
    else $A=[n+1]$;
  while (true)
Algorithm 2 aims at finding a θ-approximate hyperplane, in which the θ-approximate notion can be interpreted as the slack variables of the soft model. It is worthwhile to note that the stopping criterion in Algorithm 2 means that $\forall_{i=n+1}^{N}: m_A(x_i)\ge 1-\theta$.
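A minimal Python sketch of this outer loop (our own illustration; `ocsvm_solve` is a hypothetical placeholder for any hard-margin OCSVM solver over the active set, e.g., a small QP):

```python
import numpy as np

def fosvm1(K, theta, ocsvm_solve):
    """Cutting-plane outer loop of Algorithm 2.

    K: (N, N) kernel matrix (isometric kernel, so K[i, i] == 1).
    ocsvm_solve(K_AA) -> (alpha, rho): hypothetical hard-margin OCSVM solver
        returning dual weights over the active set and the offset rho.
    Returns the active index set A and the current dual solution.
    """
    N = K.shape[0]
    A = [0, 1]                                       # start with two samples
    while True:
        alpha, rho = ocsvm_solve(K[np.ix_(A, A)])
        # scores_i = w^T Phi(x_i) - rho, with w = sum_{j in A} alpha_j Phi(x_j)
        scores = alpha @ K[np.ix_(A, np.arange(N))] - rho
        outside = [i for i in range(N) if i not in A]
        worst = min(outside, key=lambda i: scores[i]) if outside else None
        if worst is None or scores[worst] >= -theta * rho:
            return A, alpha, rho                     # theta-approximate hyperplane reached
        A.append(worst)                              # add the most violated constraint
```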
In the next section, we prove that Algorithm 2 terminates after a finite number of iterations independent of the training size $N$. To simplify the derivation, we assume that an isometric kernel, e.g., the Gaussian kernel, is used, which means that $\|\Phi(x)\|=K(x,x)^{1/2}=1,\ \forall x$. Furthermore, let us define $w^*=w_\infty=\lim_{n\rightarrow\infty}w_n$ and $d^*=d_\infty=\lim_{n\rightarrow\infty}d_n$, where $d_n=d(0,H_n)=\frac{\rho_n}{\|w_n\|}$ is the margin of the current active set $D_{[n]}$ and $d^*$ is the margin of the whole dataset.
Theorem 1. Algorithm 2 terminates after at most $n_0$ iterations, where $n_0$ depends only on $\theta$ and the margin of the whole dataset $d^*=d_\infty$.
Proof. We sketch the proof as follows. The main task is to prove that
$$\forall_{j=1}^{N}: m_n(x_j)\ge 1-g(n)$$
where $g(n)$ is an explicit function of $n$ and $d^*$. It is easy to see that $\lim_{n\rightarrow\infty}g(n)=0$. Therefore, there exists $n_0$, a constant independent of $N$, such that $m_n(x_j)>1-\theta$ for all $n\ge n_0$.
Theorem 2. The number of iterations in Algorithm 2 is $O(1/\theta)$.
Proof. Let us denote $\theta'=\frac{\theta d^{*2}}{1-d^{*2}}$. Solving $1-g(n)\ge 1-\theta$ for $n$ yields
$$\frac{n}{2}+1\ge\frac{\sqrt{2\theta'+1}+1}{2\theta'}$$
It is obvious that:
$$\frac{\sqrt{2\theta'+1}+1}{2\theta'}\le\frac{\text{const}}{2\theta'}\sim O(1/\theta')=O(1/\theta)$$
Therefore, we gain the conclusion.
From Theorem 2, we also gain that the number of iterations in Algorithm 2 is $O(1/\theta)$. Theorem 1 reveals that the complexity of Algorithm 2 is $O(N)$, since the complexity of each iteration is $O(N)$.
4 Sampling Fast One-Class Support Vector Machine (FOSVM2)
In this section, we develop a second solution to train the proposed model using a sampling strategy. We now show how to use a sampling technique to predict the furthest vector with an accuracy that can be made as close as we like to the exact solution, specified via a bounded probability similar to the concept of a p-value.
Let $0<\epsilon<1$ and suppose $m$ vectors are sampled from the $N$ vectors of the training set $D$. We denote the set of $\epsilon$-top furthest vectors in $D$ by $B$. We estimate the probability that at least one of the $m$ sampled vectors belongs to $B$. We call the probability of this event $\Pr(\epsilon\text{-top})$. First, we note from our definition that $n=|B|=\epsilon N$. Hence, the probability of interest becomes $\Pr(\epsilon\text{-top})=1-\binom{N-n}{m}\big/\binom{N}{m}$. With some effort of manipulation, this reduces to:
$$\Pr(\epsilon\text{-top})=1-\left[(N-n)!(N-m)!\right]\left[N!(N-n-m)!\right]^{-1}=1-\prod_{i=N-n-m+1}^{N-m}i\Big/\prod_{i=N-n+1}^{N}i$$
Algorithm 3 Algorithm to find the θ-approximate hyperplane.
  A = {1, 2}; n = 2;
  do
    $(w_n,\rho_n)=\text{OCSVM}(\{x_i: i\in A\})$;
    $B=\text{Sampling}(A^c,m)$;
    $n+1=i_{n+1}=\operatorname{argmin}_{i\in B}\left(w_n^T\Phi(x_i)-\rho_n\right)$;
    $o_{n+1}=w_n^T\Phi(x_{i_{n+1}})-\rho_n$;
    if ($o_{n+1}\ge-\theta\rho_n$) return $(w_n,\rho_n)$;
    else $A=[n+1]$;
  while (true)
Taking the logarithm yields
$$\ln\left[1-\Pr(\epsilon\text{-top})\right]=\sum_{i=N-n-m+1}^{N-m}\ln\frac{i}{i+m}=\sum_{i=N-n-m+1}^{N-m}\ln\left(1-\frac{m}{i+m}\right)$$
Now applying the inequality $\ln(1-x)\le-x,\ \forall x\in[0,1)$, we have:
$$\ln\left[1-\Pr(\epsilon\text{-top})\right]\le-\sum_{i=N-n-m+1}^{N-m}\frac{m}{i+m}\le-\sum_{i=N-n+1}^{N}\frac{m}{i}\le-\frac{mn}{N}$$
Therefore, $\Pr(\epsilon\text{-top})\ge 1-e^{-\frac{mn}{N}}=1-e^{-\epsilon m}$. If we wish to have at least 95% probability of approximation accuracy (relative to the exact solution) with $\epsilon=5\%$, we can calculate the number of sampling points $m$ from this bound: $\Pr(\epsilon\text{-top})\ge 1-e^{-\epsilon m}>0.95$. With $\epsilon=0.05$ we have $m\ge 20\ln 20\approx 59$. Remarkably, this is a constant complexity w.r.t. $N$. We note that the disappearance of $N$ is due to the fact that the accuracy of our approximation to the exact solution is defined in terms of a probability. What is remarkable is that with a very tight bound required on the accuracy (e.g., within 95%), the number of points required to be sampled is remarkably small (e.g., 59 points regardless of the data size $N$).
Hence the complexity of Algorithm 3 is $O(\text{const})$.
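The sample-size calculation above is easy to reproduce; a tiny Python check (ours) of the bound $m\ge\ln\frac{1}{1-p}/\epsilon$:

```python
import math

def required_samples(eps, prob):
    """Smallest m with 1 - exp(-eps*m) >= prob, i.e. m >= ln(1/(1-prob)) / eps."""
    return math.ceil(math.log(1.0 / (1.0 - prob)) / eps)

# Reproduces the figure quoted in the text: about 59-60 samples for eps = 0.05
# and prob = 0.95, independent of the dataset size N.
print(required_samples(0.05, 0.95))   # -> 60 (= ceil(20 * ln 20))
```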
5 Experimental Results
In the first set of experiments, we demonstrate that our proposed FOSVM is comparable with both OCSVM and SVDD in terms of learning capacity but is much faster. OCSVM and SVDD are implemented using the state-of-the-art LIBSVM solver [5]. All compared methods are coded in C# and run on a computer with 4 GB of memory.
The datasets in the experiment were constructed by choosing one class as the normal class, appointing the remaining classes as the abnormal class, and randomly removing samples from the abnormal class such that the ratio of data