MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
---------------------------------------

Tien Nam NGUYEN

SKELETON-BASED HUMAN ACTIVITY REPRESENTATION AND RECOGNITION

MASTER OF SCIENCE THESIS IN INFORMATION SYSTEM

Cohort: 2018B

Hanoi - 2019
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
---------------------------------------
Tien Nam NGUYEN
SKELETON-BASED HUMAN ACTIVITY REPRESENTATION AND
RECOGNITION
Speciality: Information System
MASTER OF SCIENCE THESIS IN:
INFORMATION SYSTEM
SUPERVISOR:
1. Assoc. Prof. Thi Lan LE
Hanoi - 2019
SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

CONFIRMATION OF MASTER'S THESIS REVISIONS

Author of the thesis: Nguyễn Tiến Nam
Thesis title: Research and development of methods for skeleton-based human activity representation and recognition
Speciality: Information System
Student ID: CBC18019

The author, the scientific supervisor, and the thesis examination committee confirm that the author has revised and supplemented the thesis according to the minutes of the committee meeting on 26/10/2019, with the following contents:

1. Committee's request: Merge Chapters 4 and 5.
   Revision: Chapters 4 and 5 have been merged into a single chapter entitled "Experimental results".

2. Committee's request: Explain the reasons for choosing the recognition methods used in the thesis.
   Revision: The student has added a detailed explanation of the reasons for choosing the methods in Chapter 1, Section 3.

3. Committee's request: Add the evaluation metrics Precision, Recall, and F1.
   Revision: The student has added information on how the evaluation metrics are computed, presented in Chapter 4, Section 2 (Evaluation metric). Precision, Recall, and F1 score can all be used to evaluate a recognition system. However, in the thesis, in order to compare with previously proposed methods, different metrics are used depending on the dataset: the MSRAction3D dataset uses Accuracy, while the CMDFall dataset uses the F1 score. In the revised version of the thesis, besides the metrics used for each dataset, the student has added Table 4.7 in Chapter 4 with recognition results on all metrics for the two experimental datasets.

November 7, 2019
Supervisor          Chairman of the Committee          Thesis author
Acknowledgements
I would first like to thank my thesis advisor, Associate Professor Le Thi Lan, head of the Computer Vision Department at MICA Institute. The door of Assoc. Prof. Lan's office was always open whenever I ran into a trouble spot or had a question about my research or writing. She consistently allowed this thesis to be my own work, but steered me in the right direction whenever she thought I needed it.
I would also like to thank the experts who were involved in the validation survey for this thesis: Dr. Vu Hai, Assoc. Prof. Tran Thi Thanh Hai, and PhD student Pham Dinh Tan, who participated and gave me much useful information. Without their passionate participation and input, the validation survey could not have been successfully conducted.
I would also like to acknowledge the School of Information and Communication Technology, which provided me with the best conditions to complete this master's thesis, and I am gratefully indebted to the teachers of SOICT for their very valuable comments on this thesis.
Finally, I must express my very profound gratitude to my parents, my sister, and also to my colleagues at Toshiba Software Development Vietnam (Nhu Dinh Duc, Pham Van Thanh, and many others) for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you!
Abstract
Human action recognition (HAR), which aims to predict what action a person is performing, is currently receiving increasing attention from computer vision researchers due to its wide range of potential applications in fields such as human-computer interaction, video surveillance, robotics, and health care. Recently, the release of cost-effective depth cameras such as the Microsoft Kinect and Asus Xtion PRO LIVE has opened new opportunities for HAR, as they provide richer information about the scene. Thanks to these sensors, besides color images, depth and skeleton information are also available. Moreover, the latest research results on human pose estimation in RGB video show that the human pose and skeleton can be accurately estimated even in complex scenes. Using skeleton information for human action recognition has several advantages in comparison with using color and depth information. As a result, a wide range of methods for HAR using skeleton information have been introduced [1]. The methods proposed for skeleton-based HAR can be categorized into two groups: hand-crafted features and deep learning. Each has its own advantages and disadvantages. Deep learning based techniques obtain impressive results on several benchmark datasets; however, they usually require large datasets and high-performance computing hardware. Among hand-crafted descriptors for action representation, Cov3DJ, with its covariance matrix of 3D joint positions, proves its effectiveness and computational efficiency [2]. To take into account the duration variation of actions, a temporal hierarchy representation with multiple layers is introduced. However, the disadvantage of Cov3DJ is that it uses all joints in the skeleton, which causes a computational burden and may become ineffective, as each joint has a certain level of engagement in an action. Moreover, the authors employ only joint positions as joint features, which may not be discriminative enough to represent an action. Therefore, other features (joint velocities) are investigated and combined with joint positions to create a more discriminative representation of each action. This thesis improves the Cov3DJ method presented in [2] with two improvements: (1) proposing two different schemes to select the most informative joints for action representation, and (2) combining velocity information with the positions of the joints for action representation. To evaluate the effectiveness of the proposed method, extensive experiments have been performed on two public datasets (MSRAction3D [3] and CMDFall [4]). On MSRAction3D, the experimental results show that the proposed method obtains an improvement of 6.17% over the original method and outperforms many state-of-the-art methods. On the CMDFall dataset, the proposed method, with an F1 score of 0.64, outperforms the deep learning networks ResTCN (F1 score: 0.39) [4] and LSTM (F1 score: 0.46) [5]. The contributions of the thesis have been published in an international conference.
Contents

List of Acronyms

1 Introduction
  1.1 Motivations
  1.2 Challenges and open issues in skeleton-based HAR
  1.3 Objectives and Contributions
  1.4 Outline of the thesis

2 State of the Art
  2.1 Overview of skeletal data and skeleton-based human action recognition
  2.2 Pre-processing techniques
  2.3 Hand-crafted features-based approach
    2.3.1 Spatial-temporal descriptors
    2.3.2 Geometric descriptors
  2.4 Deep learning based approaches
  2.5 Conclusions

3 Proposed Method
  3.1 The proposed approach
  3.2 The most informative joints detection
    3.2.1 Strategy 1 (FMIJ) for most informative joints detection
      3.2.1.1 Detect candidate joints for each action
      3.2.1.2 Select the most informative joints of each action
    3.2.2 Strategy 2 (AMIJ) for most informative joints detection
  3.3 Action representation by covariance descriptor
    3.3.1 Temporal covariance descriptor with position information
    3.3.2 Temporal covariance descriptor with velocity information
    3.3.3 Temporal hierarchy covariance descriptor
  3.4 Classification with support vector machine
    3.4.1 Linear separable training
    3.4.2 Non-linear separable training

4 Experimental results
  4.1 Datasets
    4.1.1 MSRAction3D
    4.1.2 CMDFall
  4.2 Evaluation metric
  4.3 Experiment Environments
  4.4 Evaluation of features used for joint representation
    4.4.1 Results on MSRAction3D dataset
      4.4.1.1 ActionSet1
      4.4.1.2 ActionSet2
      4.4.1.3 ActionSet3
    4.4.2 Results on CMDFall dataset
  4.5 Evaluation of the most informative joints selection
    4.5.1 The effect of the number of most informative joints
    4.5.2 Comparison between two strategies
  4.6 Comparison with state-of-the-art methods
  4.7 Time computation

5 Conclusions
  5.1 Conclusions
  5.2 Future works

Publications

References
List of Figures

1.1 Main flow of human action recognition from skeletal data.
1.2 Example of noisy data in CMDFall dataset with action Walk.
1.3 Example of continuous action in CMDFall dataset.
2.1 Id of the joints of a skeleton in MSRAction3D dataset.
2.2 Coordinate system in Kinect.
2.3 Illustration of the temporal normalization [6].
2.4 Illustration of the Fourier Temporal Pyramid [7].
2.5 The Cov3DJ descriptor [2].
2.6 The construction of the CovP3DJ [8].
2.7 Informative joints determined for MSRAction3D dataset [9].
2.8 Representation in Lie group and mapping to Lie algebra.
2.9 Two input streams architecture [10].
2.10 Hierarchical RNNs for action representation learning [11].
2.11 Part-aware LSTM for action recognition [12].
2.12 Action recognition by combining CNN and LSTM [13].
3.1 Overview of the proposed method.
3.2 Skeleton sequence of action Throw in MSRAction3D.
3.3 Candidate joints of Action 6 (Throw) in MSRAction3D dataset. The candidate joints are marked in red.
3.4 Determination of most informative joints for Action 6 (Throw). The final most informative joints are marked in magenta.
3.5 Illustration of weighted joints in Strategy 2.
3.6 Two-layer hierarchy with (a) non-overlapping and (b) overlapping (OL) windows.
3.7 An example of linearly separable and non-linearly separable data [14].
3.8 Transformation of data with a non-linear kernel.
4.1 The statistical length of each action in MSRAction3D dataset.
4.2 The statistical number of occurrences of each action in MSRAction3D and CMDFall.
4.3 The statistical length of each action in CMDFall dataset.
4.4 Accuracy obtained with different features on ActionSet1.
4.5 Confusion matrix obtained for ActionSet1.
4.6 Accuracy obtained with different features on ActionSet2.
4.7 Confusion matrix obtained for ActionSet2.
4.8 Accuracy obtained with different features on ActionSet3.
4.9 Confusion matrix of ActionSet3.
4.10 Results obtained for CMDFall dataset.
4.11 Confusion matrix of CMDFall.
4.12 Comparison of the distribution of each class based on covariance features on position in ActionSet1 of MSRAction3D.
4.13 Comparison of the distribution of each class based on covariance features on velocity in ActionSet1 of MSRAction3D.
4.14 Comparison of the distribution of each class based on covariance features on the combination of position and velocity in ActionSet1 of MSRAction3D.
4.15 Fixing MIJ.
4.16 Adaptive MIJ.
List of Tables

4.1 20 actions and 3 subsets of MSRAction3D dataset.
4.2 List of action classes and categorization.
4.3 Confusion matrix.
4.4 The selected joints on AS3 of MSRAction3D dataset with two strategies.
4.5 Accuracy (%) of state-of-the-art methods on MSRAction3D dataset. Three best results are in bold. * The experiments of these authors did not use all datasets.
4.6 F1 score of state-of-the-art methods on CMDFall dataset.
4.7 Obtained results of the proposed methods on two datasets.
4.8 Time computation (s) in testing phase.
List of Acronyms

AMIJ Adaptive Most Informative Joints.
C3D Convolutional Three Dimensional.
CNN Convolutional Neural Network.
DTW Dynamic Time Warping.
FMIJ Fixing Most Informative Joints.
HAR Human Action Recognition.
LSTM Long Short-Term Memory.
MIJs Most Informative Joints.
NN Neural Network.
RNN Recurrent Neural Network.
SVM Support Vector Machine.
Chapter 1
Introduction
1.1 Motivations
We are living in the era of Industry 4.0, where everything is connected, information is exploding, and interactions between objects are becoming deeper and broader. Machines are going to collaborate with people in many fields, and to do this, understanding human behavior is a core technology.
The main aim of human action recognition (HAR) is to recognize the types of human actions using information captured by sensors. These actions can be executed by one person, two people, or a group. Human action recognition from video refers to methods that recognize human actions from one or several cameras. Although numerous approaches have been proposed for understanding human action [15], HAR remains a challenge for researchers due to the large number of actions and subjects and the huge variation in action speed and environment.
The release of cost-effective depth cameras such as the Microsoft Kinect [16] and Asus Xtion PRO LIVE opens new opportunities for HAR, as they provide richer information about the scene. Besides color images, depth and skeleton information are also available. Moreover, the latest research results on human pose estimation in RGB video show that the human pose and skeleton can be accurately estimated even in complex scenes. Using skeleton information for human action recognition has several advantages in comparison with using color and depth information. Firstly, unlike color, the skeleton is robust to variations in human appearance. Secondly, HAR based on skeletons is effective in terms of memory requirements and computational time (e.g., a human skeleton contains only 20 joints with 3 coordinates (x, y, z)). Finally, Johansson's study proved that human vision is able to recognize human activity and the velocity of different motion patterns by observing the movement of the skeleton's joints [17].
As a result, a wide range of methods for HAR using skeleton information have been introduced [18], [1]. Previous studies have shown that, to a certain extent, skeleton-based HAR methods have solved some of the problems of human action recognition with RGB cameras or video, and have demonstrated good recognition performance on several benchmark datasets. However, on more challenging datasets, recognition rates are still low.
My thesis aims at improving the performance of human action recognition based on
skeleton. In the next section, the challenges and open issues in skeleton-based HAR will
be discussed. Then, the objectives and the contributions of my thesis will be provided.
1.2 Challenges and open issues in skeleton-based HAR
Fig. 1.1 shows the main flow of human action recognition from skeletal data. Normally, this problem includes two main steps: action spotting and action recognition. The role of the first step is to find the duration during which the action takes place, while the second step determines the type of action. This thesis focuses on the second step, by assuming that the actions of interest have already been detected by the first step.

Figure 1.1: Main flow of human action recognition from skeletal data.

HAR from skeletal data faces several difficulties and challenges:

• As the quality of skeleton estimation is not always perfect, especially for non-standing postures, the skeletal data usually contains noise. Fig. 1.2 illustrates an example of the Walk action in CMDFall where some joints are not accurately estimated.
• There exists a high inter-class similarity in the action set. For example, three actions (draw X, draw circle, draw tick) in MSRAction3D are very similar.

• As each subject has his/her own manner of performing actions in terms of speed and phase, the action dataset has a high intra-class variation.
• In some datasets, for instance CMDFall, the beginning of an action can be confused with the previous action, as actions are composed of several sub-actions (e.g., sit on a chair then stand up in the CMDFall dataset, as illustrated in Fig. 1.3).

Figure 1.2: Example of noisy data in CMDFall dataset with action Walk.

Figure 1.3: Example of continuous action in CMDFall dataset.
1.3 Objectives and Contributions
Current approaches for skeleton-based action recognition can be roughly divided into two main categories. The first category uses hand-crafted features, while the second investigates deep learning methods to automate the feature extraction process. Deep learning based techniques usually require large datasets and high-performance computing hardware. Among hand-crafted descriptors for action representation, Cov3DJ, with its covariance matrix of 3D joint positions, proves its effectiveness and computational efficiency [2]. To account for the time dimension of skeletal data, a temporal hierarchy representation is introduced in [2] with multiple layers. In the first layer, the covariance matrix is computed on all frames; in later layers, covariance matrices are computed on shorter temporal windows. Cov3DJ is evaluated on the MSRAction3D and MSRC12 datasets. However, a disadvantage of Cov3DJ is its use of all joints in the computation, which causes a computational burden and may become ineffective, as each joint has a certain level of engagement in an action [19]. This motivated me to combine the idea of joint subsets from [19] with the covariance feature: besides reducing the feature dimension, it also helps to create more discriminative features between classes. Moreover, the authors employ only joint positions as joint features, which may not be discriminative enough to represent an action. Therefore, other features (joint velocities) are investigated and combined with joint positions to create a more discriminative feature for each action, as sketched below.
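As a simple illustration of the velocity feature, joint velocities can be approximated by first-order temporal differences of the joint positions. The following minimal Python sketch is my own illustration under an assumed (T, K, 3) array layout, not code from [2]:

    import numpy as np

    def joint_velocities(positions):
        """positions: (T, K, 3) array of T frames, K joints, (x, y, z) coordinates.
        Returns velocities of the same shape, computed as forward differences;
        the last frame repeats the previous velocity so the shapes match."""
        v = np.diff(positions, axis=0)               # (T-1, K, 3)
        return np.concatenate([v, v[-1:]], axis=0)   # pad back to T frames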
This thesis improves the covariance-based action recognition method presented in
[2] by (1) proposing two different schemes to select the most informative joints and (2)
combining velocity information with positions of the joints for action representation.
To evaluate the effectiveness of the two proposed improvements, extensive experiments have been performed on two public datasets (MSRAction3D [3] and CMDFall
[4]). On MSRAction3D, the experimental results show that the proposed method obtains an improvement of 6.17% over the original method and outperforms many state-of-the-art methods. On the CMDFall dataset, the proposed method, with an F1 score of 0.64, outperforms the deep learning networks ResTCN (F1 score: 0.39) [4] and LSTM (F1 score: 0.46) [5]. The contributions of the thesis have been published in an international conference.
1.4 Outline of the thesis
The thesis consists of 5 main chapters. In Chapter 2, I present the characteristics of skeletal data and pre-processing techniques, and analyze in detail the advantages and disadvantages of the state-of-the-art approaches proposed for skeleton-based HAR. Then, the proposed method is described in Chapter 3. Chapter 4 presents experimental results on two benchmark datasets (MSRAction3D and CMDFall). Finally, conclusions and future work are given in Chapter 5.
Chapter 2
State of the Art
Overview
Feature engineering is one of the most fundamental research problems in computer vision and machine learning. From the perspective of feature engineering, methods can be broadly grouped into two main approaches: hand-crafted and deep learning. With its ability to learn features from data, deep learning can find features that hand-crafted methods cannot create. In recent years, deep learning has received tremendous interest from the research community as a strong feature engineering method. However, precisely extracting specific features from raw data is a hard task even with deep learning; that is why hand-crafted features still play an important role in this problem. This chapter analyzes the state-of-the-art methods proposed for human action recognition. I divide these methods into two main categories: hand-crafted features-based and deep learning based.
2.1 Overview of skeletal data and skeleton-based human action recognition
According to [17], a skeleton is considered a schematic model of the locations of the torso, head, and limbs of the human body. The parameters and motion of such a skeleton can be used as a representation of human actions; the human body pose is therefore defined by means of the relative locations of the joints in the skeleton.
The number of joints in a human pose depends on the device used to extract the skeleton. For example, the Kinect sensor v1 provides 20 joints, while 25 joints are available with the Kinect v2. In this thesis, all experiments are performed on two benchmark datasets, MSRAction3D and CMDFall, which provide 20 joints. The id of each joint is shown in Fig. 2.1. In this thesis, I use the following notation.

The $i$-th joint of the skeleton at time $t$ is represented by $p_i^t = (x_i^t, y_i^t, z_i^t)$, where:

• $x_i^t$, $y_i^t$ and $z_i^t$ are the coordinates of the joint;

• $i = 1, 2, ..., K$; $K$ is the number of joints used for representing the human skeleton ($K = 20$ in this thesis);

• $t = 1, 2, ..., T$; $T$ is the duration of the action of interest;

• the origin $(0, 0, 0)$ is located at the center of the IR sensor on the Kinect (see Fig. 2.2); the x-axis grows to the sensor's left, the y-axis grows up (note that this direction is based on the sensor's tilt), and the z-axis grows out in the direction the sensor is facing; the unit used is the meter.

Given a skeleton sequence $P = \{p_i^t\}$, $i = 1, 2, ..., K$, $t = 1, 2, ..., T$, skeleton-based human action recognition aims to determine the label of the action class.
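For concreteness, a skeleton sequence can be stored as a plain array. The sketch below is an illustrative assumption about the in-memory layout (it is not prescribed by the datasets): a (T, K, 3) array holding the (x, y, z) coordinates, in meters, of K = 20 joints over T frames.

    import numpy as np

    T, K = 45, 20                                  # e.g., a 45-frame action
    sequence = np.zeros((T, K, 3), dtype=np.float32)

    # p_i^t, the i-th joint at time t, is then simply:
    t, i = 10, 7
    p = sequence[t, i]                             # (x_i^t, y_i^t, z_i^t)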
2.2 Pre-processing techniques
Pre-processing is a common step in machine learning problems; it can include several techniques with different purposes. It is often used to handle raw data before moving to the next step. In the human action recognition problem in particular, there are many factors to deal with: differences among performers, different action speeds, varying lengths of the action sequences, and noise or missing data. Pre-processing is able to address these issues. I have synthesized and adopted suitable techniques proposed by authors of skeleton-based action recognition works.

Figure 2.1: Id of the joints of a skeleton in MSRAction3D dataset.

Figure 2.2: Coordinate system in Kinect.

Figure 2.3: Illustration of the temporal normalization [6].

In general, to address factors such as the differences among performers and action speeds, in [2], [8] the authors used normalization to scale the skeletal data into the range [0, 1]. Another approach to normalization, used in [6], [20], sets the hip joint as the coordinate origin (0, 0, 0); the coordinates of the remaining joints are computed by subtracting the hip coordinate. Instead of subtracting the hip coordinate from the remaining joints, in [9] the skeleton is rotated to align the horizontal x-axis with the direction from the left hip (HL) to the right hip (HR). To deal with the varying length of actions, in [20] the authors first define a desired number of frames and use an interpolation algorithm based on the known frames to ensure that all actions have the same length.
Figure 2.4: Illustration of the Fourier Temporal Pyramid [7].
On the other hand, in [6] the authors proposed an algorithm (see Fig. 2.3) called cubic spline interpolation of kinematic features for temporal normalization. Another approach to handling the varying length of actions, used in [21], [2], is a pyramidal approach (see Fig. 2.4). To deal with differences in action style, following [14], most authors apply dynamic time warping to compute a nominal curve representing the action [20].
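To make the two most common normalization steps concrete, the following sketch is a minimal Python illustration under my own assumptions (a (T, K, 3) array layout and a dataset-dependent hip-joint index); it centers each frame on the hip joint and resamples the sequence to a fixed number of frames with linear interpolation, whereas the cited works may use more elaborate schemes such as cubic splines:

    import numpy as np

    def hip_center(seq, hip_idx=0):
        """Subtract the hip joint from every joint in every frame.
        seq: (T, K, 3) skeleton sequence; hip_idx is dataset-dependent."""
        return seq - seq[:, hip_idx:hip_idx + 1, :]

    def resample(seq, n_frames):
        """Linearly interpolate a (T, K, 3) sequence to a fixed length."""
        T = seq.shape[0]
        src = np.linspace(0.0, 1.0, T)
        dst = np.linspace(0.0, 1.0, n_frames)
        flat = seq.reshape(T, -1)                  # (T, K*3) channels
        out = np.stack([np.interp(dst, src, flat[:, c])
                        for c in range(flat.shape[1])], axis=1)
        return out.reshape(n_frames, *seq.shape[1:])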
2.3 Hand-crafted features-based approach
In the hand-crafted approach, features are manually designed and extracted based on the characteristics of actions. The methods belonging to this approach are categorized into two groups: spatial-temporal descriptors and geometric descriptors.
2.3.1 Spatial-temporal descriptors
In this approach, the authors try to compute spatial-temporal features for each joint in the skeleton. The method introduced by Hussein et al. [2] belongs to this approach. The authors concatenate the information of all joints $p_i^t = (x_i^t, y_i^t, z_i^t)$, $i = 1, 2, ..., K$, at time $t$ to create a vector $S^t = [x_1^t, x_2^t, ..., x_K^t, y_1^t, y_2^t, ..., y_K^t, z_1^t, z_2^t, ..., z_K^t]$. For an entire sequence, they compute the covariance of these vectors as the representation. Moreover, the authors also propose computing the descriptor on smaller windows to better distinguish actions that have temporal structure.

Figure 2.5: The Cov3DJ descriptor [2].

Inspired by [2], in [8] the authors also compute a covariance matrix, named CovP3DJ, on separate parts of the body instead of on all joints (see Fig. 2.6). This idea reduces the memory occupation by 78.26% to 80.35%, but the accuracy is not significantly improved, since it does not use the relationships among joints in different parts of the body.
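To make the descriptor concrete, the sketch below is a minimal re-implementation under my own assumptions (not the authors' code). Each frame is flattened into a 3K-dimensional vector, the covariance matrix over the whole sequence is computed, and its upper triangle is kept as the feature vector (the matrix is symmetric); a two-layer temporal hierarchy then concatenates the descriptors of the full sequence and of its two halves. Note that flattening frame by frame yields a coordinate ordering that is a permutation of the $[x_1..x_K, y_1..y_K, z_1..z_K]$ layout above, which leaves the descriptor equivalent up to a permutation of its entries.

    import numpy as np

    def cov3dj(seq):
        """Cov3DJ-style descriptor (sketch). seq: (T, K, 3) skeleton sequence."""
        T = seq.shape[0]
        S = seq.reshape(T, -1)          # (T, 3K): one vector per frame
        C = np.cov(S, rowvar=False)     # (3K, 3K) covariance over time
        iu = np.triu_indices(C.shape[0])
        return C[iu]                    # upper-triangular part only

    def hierarchical_cov3dj(seq):
        """Two-layer temporal hierarchy: full sequence plus its two halves."""
        half = seq.shape[0] // 2
        return np.concatenate([cov3dj(seq),
                               cov3dj(seq[:half]),
                               cov3dj(seq[half:])])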
The covariance matrix has shortcomings such as being prone to singularity, limited capability in modeling complicated feature relationships, and a fixed form of representation. To address these issues, in [22] the authors used nonlinear kernel matrices to modify the original covariance matrix. This proposal obtained promising results and reached state-of-the-art performance on benchmark datasets. Instead of employing all joints for representing an action, in [19] the authors stated that each action can be discriminated by a few specific joints, so they proposed to extract the most informative joints for each action. This idea has been applied in much subsequent research. In [7], the authors proposed the concept of the actionlet, built on a subset of joints. Their model is more robust to errors in the features, and it can better characterize the intra-class variations of the actions. The features they pass to the actionlets are the relative positions