VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
--------------------
NGUYEN NGOC HAI DANG
APPLICATION OF MACHINE LEARNING ON
AUTOMATIC PROGRAM REPAIR OF SECURITY
VULNERABILITIES
Major: Computer Science
Major code: 8480101
MASTER’S THESIS
HO CHI MINH CITY, July 2023
THIS THESIS IS COMPLETED AT
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM
Supervisors: Assoc. Prof. Dr. Huynh Tuong Nguyen, Assoc. Prof. Dr. Quan Thanh Tho
Examiner 1: Dr. Truong Tuan Anh
Examiner 2: Assoc. Prof. Dr. Nguyen Van Vu
This master’s thesis was defended at Ho Chi Minh City University of Technology,
VNU-HCM, on July 11, 2023.
Master’s Thesis Committee:
1. Chairman: Assoc. Prof. Dr. Le Hong Trang
2. Secretary: Dr. Phan Trong Nhan
3. Reviewer 1: Dr. Truong Tuan Anh
4. Reviewer 2: Assoc. Prof. Dr. Nguyen Van Vu
5. Member: Assoc. Prof. Dr. Nguyen Tuan Dang
Approval of the Chairman of the Master’s Thesis Committee and the Dean of the Faculty
of Computer Science and Engineering after the thesis has been corrected (if any).
CHAIRMAN OF THESIS COMMITTEE
DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING
VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
THE TASK SHEET OF MASTER’S THESIS
Full name: Nguyen Ngoc Hai Dang
Student ID: 1970513
Date of birth: 24/11/1997
Place of birth: Lam Dong
Major: Computer Science
Major ID: 8480101
I. THESIS TITLE:
ỨNG DỤNG HỌC MÁY VÀO CHƯƠNG TRÌNH TỰ ĐỘNG SỬA CHỮA LỖ
HỔNG BẢO MẬT - APPLICATION OF MACHINE LEARNING ON
AUTOMATIC PROGRAM REPAIR OF SECURITY VULNERABILITIES
II. TASKS AND CONTENTS:
- Research and build a system to automatically repair vulnerabilities.
- Research and propose methods to improve the accuracy of the model.
- Experiment and evaluate the results of the proposed methods.
III. THESIS START DAY: 05/09/2022
IV. THESIS COMPLETION DAY: 09/06/2023
V. SUPERVISORS: Assoc. Prof. Dr. Huynh Tuong Nguyen, Assoc. Prof. Dr. Quan Thanh Tho
Ho Chi Minh City, date ………
SUPERVISOR
(Full name and signature)
CHAIR OF PROGRAM COMMITTEE
(Full name and signature)
DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING
(Full name and signature)
Acknowledgement
I would like to acknowledge the people who have helped me with their knowledge,
encouragement, and patience during the work on this thesis. The thesis would not
have been completed without your help and inspiration. First and foremost, I would
like to thank my supervisor at Ho Chi Minh City University of Technology, Professor
Quan Thanh Tho. Thank you for your unwavering support. Your insightful feedback
and contributions have pushed and guided me throughout this work. I would also
like to thank my other supervisor at the Norwegian University of Science and
Technology, Professor Nguyen Duc Anh. Thank you for your feedback and help.
Lastly, I would like to thank my friends and family for their endless patience, support,
and encouragement.
Abstract
We have, as individuals and as a society, become increasingly dependent on
software; consequently, the cost of failing software has also grown. Identifying the
failing parts of a piece of software and fixing them manually can be time-consuming,
expensive, and frustrating. The growing research field of automated code repair aims
to tackle this problem by applying machine learning techniques to repair software
automatically. With the abundance of data on bugs and patches, research on the use
of deep learning in code repair has been on the rise and has proven effective, with
many systems [1], [2] achieving state-of-the-art performance. However, this approach
requires a large dataset to be applicable, and that condition cannot be met for every
type of bug. One such type is the security vulnerability, which attackers exploit to
cause great harm to the organizations that use the affected applications. The need to
automatically identify and fix vulnerabilities is therefore evident, as doing so can
significantly reduce the harm caused to these organizations.
In our work, we focus on the application of deep learning to vulnerability repair and
experiment with a solution for handling the lack of data, a prerequisite for applying
deep learning models effectively, through the use of embeddings extracted from large
language models of code such as CodeBERT [3] and UnixCoder [4]. Although our
results show that this approach does not bring significant improvement, they can help
other researchers gain more insight into the proximity between the repair tasks for
different types of bugs.
Tóm tắt luận văn
Chúng ta, với tư cách cá nhân và xã hội, ngày càng trở nên phụ thuộc nhiều hơn vào
phần mềm, do đó, hậu quả của việc phần mềm bị lỗi cũng trở nên lớn hơn. Việc xác
định các phần bị lỗi của phần mềm và sửa các phần này theo cách thủ công có thể tốn
thời gian, tốn kém và gây khó chịu. Lĩnh vực nghiên cứu sửa chữa mã tự động đang
phát triển nhằm mục đích giải quyết vấn đề này, bằng cách áp dụng các kỹ thuật máy
học để có thể sửa chữa phần mềm theo cách tự động. Với lượng dữ liệu dồi dào về
lỗi và bản vá lỗi, nghiên cứu về việc sử dụng học sâu trong sửa mã ngày càng nhiều
và được chứng minh là hiệu quả với sự xuất hiện của nhiều hệ thống [1] [2] đạt hiệu
năng tiên tiến nhất. Tuy nhiên, cách tiếp cận này dựa trên một tập dữ liệu lớn và
không phải mọi loại lỗi trong ứng dụng cũng đáp ứng được điều kiện này, một trong
số đó là lỗ hổng bảo mật, vốn là mục tiêu khai thác bảo mật của những kẻ tấn công
nhằm gây hại cho các tổ chức sử dụng các ứng dụng chứa những lỗ hổng này. Do đó,
nhu cầu tự động xác định và sửa các lỗ hổng này là hiển nhiên và có thể giảm đáng kể
những thiệt hại có thể xảy ra cho các doanh nghiệp.
Trong luận văn này, chúng tôi tập trung vào việc ứng dụng học sâu trong việc
khắc phục lỗ hổng bảo mật và thử nghiệm một giải pháp có thể sử dụng để xử lý tình
trạng thiếu dữ liệu, vốn là yêu cầu để các mô hình này trở nên hiệu quả, thông qua
việc sử dụng các embeddings được trích xuất từ những mô hình ngôn ngữ lớn như
CodeBERT [3] và UnixCoder [4]. Mặc dù kết quả của chúng tôi cho thấy cách tiếp
cận như vậy không mang lại sự cải thiện đáng kể, nhưng chúng vẫn có thể được các
nhà nghiên cứu khác sử dụng để hiểu rõ hơn về khoảng cách giữa các nhiệm vụ sửa
chữa của các loại lỗi khác nhau.
Declaration
I, Nguyen Ngoc Hai Dang, declare that this thesis, with the Vietnamese title "Ứng dụng
của học máy vào chương trình tự động sửa chữa lỗ hổng bảo mật" and the English title
"Application of machine learning on automatic program repair of security
vulnerabilities", is my own work and contains no material that has been submitted
previously, in whole or in part, for the award of any other academic degree or diploma.
Signature
Nguyen Ngoc Hai Dang
Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Research Questions
  1.4 Thesis Outline
2 Literature review
  2.1 Background on Neural Networks and Deep Learning
    2.1.1 Recurrent Neural Network (RNN)
    2.1.2 Vanilla recurrent neural network
    2.1.3 Long short-term memory network (LSTM)
    2.1.4 Transformer Neural Network
  2.2 Transfer Learning
  2.3 Learning Paradigm
    2.3.1 Sequence to Sequence Learning
    2.3.2 Graph-based Learning
    2.3.3 Tree-to-tree Learning
  2.4 Bug Repairing and Vulnerability Repairing
  2.5 Source code Representation
    2.5.1 GumTree
    2.5.2 Byte Pair Encoding
  2.6 Source code embeddings
    2.6.1 CodeBERT
    2.6.2 UnixCoder
3 The state-of-the-art program repair approaches
  3.1 Template-based approach
  3.2 Generative-based approach
    3.2.1 SeqTrans
    3.2.2 VRepair
4 Proposed Methods
5 Experiments and Results
  5.1 Datasets
    5.1.1 Validation method
  5.2 Metrics of performance
  5.3 Preprocessing the code as plain text
  5.4 Extracting embeddings from large language models for code
  5.5 Environment
  5.6 Results
6 Discussions and Conclusion
  6.1 Discussions of the results
  6.2 Main Contribution
  6.3 Future works
References
Appendix
List of Figures

2.1 The basic architecture of a recurrent neural network
2.2 Recurrent Neural Network design patterns
2.3 LSTM network with three repeating layers
2.4 Attention-integrated recurrent network
2.5 The encoder-decoder architecture of the transformer
2.6 Attention head operations
2.7 Dataset used for CodeBERT
2.8 CodeBERT architecture for the replaced token detection task
2.9 A Python code snippet with its comment and AST
2.10 Input for the contrastive learning task of UnixCoder
3.1 Workflow of VuRLE
3.2 Architecture of SeqTrans
3.3 Input of SeqTrans
3.4 Normalized code segment
3.5 The VRepair pipeline
4.1 Design of our pipeline
5.1 Sample of buggy code and its patch
5.2 Input sequence
5.3 Output sequence
5.4 Syntax of the output sequence
List of Tables

5.1 Experiments replicating the VRepair pipeline
5.2 Experiments with embeddings as input
6.1 Complete set of hyperparameters used in our models built with OpenNMT-py
1 Introduction
In modern society, software systems play a crucial role in almost every aspect of
our lives [5]. These systems have become the backbone of our interconnected world,
enabling us to communicate, work, learn, and entertain ourselves efficiently and
effectively [6]. From mobile applications and social media platforms to e-commerce
websites and financial systems, software systems have revolutionized the way we
interact, transact, and navigate the digital landscape. They have transformed
industries, streamlined processes, and empowered individuals by providing access to
information and services at our fingertips. The importance of software systems lies
in their ability to automate tasks, enhance productivity, enable innovation, and foster
connectivity on a global scale. They have become indispensable tools for businesses,
governments, healthcare, education, and countless other sectors, driving progress,
enabling efficiency, and shaping the future.
With the increasing reliance on software systems for critical functions, such as
communication, finance, healthcare, and infrastructure, ensuring the security of
these systems is of paramount importance. Software security involves protecting
software applications and data from unauthorized access, breaches, and malicious
activities. The consequences of software security breaches can be severe, ranging
from financial loss and reputational damage to compromised privacy and even
threats to national security. While detecting software security issues can be done
during and after software release, addressing security issues early in the
development process saves time and resources. It is generally easier and less costly
to fix vulnerabilities during the development stage than after the software has been
deployed and is in active use.
Software security detection involves using various techniques and tools to identify
vulnerabilities, weaknesses, and potential threats within the codebase. This can
include static code analysis, dynamic testing, penetration testing, and security
auditing. By actively searching for security issues, developers can uncover and
address potential flaws before the software is deployed, reducing the risk of
exploitation by malicious actors.
Once security issues are detected, code repair comes into play. It involves
remediation efforts to fix the identified vulnerabilities and weaknesses. This may
involve patching code, implementing security controls, updating dependencies, or
improving the overall design of the software. Code repair is a critical step in
mitigating security risks and ensuring that the software meets the necessary security
standards.
In our research, we will explore code repair for software security issues. We will
empirically investigate the application of Deep Learning to create patches for such
vulnerabilities in software in an automatic manner. The contributions of this thesis
are twofold:
• A literature review of state-of-the-art Deep Learning approaches to code repair
for software security
• Experiments with different DL approaches for vulnerable code repair
1.1 Motivation
In the field of software testing, security vulnerabilities are a type of bug that is hard
both to detect and to patch: they do not explicitly affect the software's functionality,
and they are only exposed, and cause great harm, when exploited intentionally.
Methods used for creating code patches can be classified into template-based and
generative-based. Template-based patching involves using predefined templates or
patterns to guide the creation of code patches. These templates provide a structure or
framework for making specific modifications to the code. Developers can fill in the
template with the necessary changes based on the identified security issues. This
approach simplifies the patching process by providing a consistent and predictable
format, making it easier to apply fixes across different codebases. Template-based
patching is especially useful for addressing common security vulnerabilities that
have well-defined solutions. On the other hand, generative-based patching takes a
more automated and algorithmic approach to create code patches. Instead of relying
on predefined templates, this approach leverages machine learning techniques, code
analysis, and algorithms to generate patches automatically. Generative-based
patching involves analyzing the codebase, identifying problematic areas, and
generating code changes that address the detected security issues. This approach can
be more flexible and adaptive, as it can handle a wider range of vulnerabilities and
adapt to different programming languages and code structures.
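As a toy illustration of the template-based idea, the following Python sketch applies a
single hypothetical rewrite rule, replacing an unbounded strcpy call with a bounded
strncpy, to a C snippet. Real template-based tools match far richer, AST-level
patterns; this sketch only conveys the mechanism, and the rule itself is our invention.

    import re

    # One hypothetical template: a vulnerable pattern paired with its fix.
    # Real systems match on ASTs rather than raw text; a regex keeps the
    # sketch short.
    TEMPLATES = [
        (re.compile(r"strcpy\(\s*(\w+)\s*,\s*(\w+)\s*\)"),
         r"strncpy(\1, \2, sizeof(\1) - 1)"),
    ]

    def apply_templates(source: str) -> str:
        """Apply every template rule to a C source snippet."""
        for pattern, replacement in TEMPLATES:
            source = pattern.sub(replacement, source)
        return source

    buggy = "char buf[16]; strcpy(buf, name);"
    print(apply_templates(buggy))
    # char buf[16]; strncpy(buf, name, sizeof(buf) - 1);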
Automated program repair is an emerging field of Software Engineering (SE)
research that enables the automated rectification of software errors and vulnerabilities
[7]. Program repair, also known as code repair, automated program repair, or
software patching, refers to the process of automatically identifying and fixing
software bugs or defects without manual intervention. Program repair techniques
typically analyze the source code, identify the root cause of the bug, and generate a
patch or modification that resolves the issue. There is an emerging research interest in
applying Machine Learning techniques to automate the task of repairing faulty code
[7]–[13].
Inspired by the work of Zimin Chen [14], whose research uses transfer learning to
leverage the knowledge a deep learning model acquires from generic code repair
tasks to improve performance on the vulnerability repair task, our main focus will be
to investigate the problem of generating vulnerability patches automatically using a
generative model, and to leverage the knowledge learned by large programming
language models such as CodeBERT [3] to improve the performance of the
generative model through the use of embeddings extracted from CodeBERT.
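As a preview of how such embeddings can be obtained, the following minimal sketch
extracts CodeBERT token embeddings with the Hugging Face transformers library;
the mean-pooling step at the end is our illustrative choice, not a detail fixed by
CodeBERT or by any particular repair pipeline.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Load the published CodeBERT checkpoint.
    tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    model = AutoModel.from_pretrained("microsoft/codebert-base")

    code = "static int add(int a, int b) { return a + b; }"
    inputs = tokenizer(code, return_tensors="pt",
                       truncation=True, max_length=512)

    with torch.no_grad():
        outputs = model(**inputs)

    token_embeddings = outputs.last_hidden_state   # (1, seq_len, 768)
    code_embedding = token_embeddings.mean(dim=1)  # (1, 768), mean-pooled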
1.2 Problem Statement
The problem of generating patches for vulnerabilities using knowledge learned from
data is quite new and still presents challenges. One such challenge is the lack of
sufficient data, in terms of both volume and diversity; this makes the performance of
current vulnerability patch generation systems quite unreliable, even for
template-based methods, which are not as data-hungry as the generative-based ones
dominated by deep learning architectures.
Unlike the scarce data in the domain of vulnerability repair, the data for buggy code
in general is plentiful; one can easily crawl a dataset from repositories on GitHub
based on commits, where each fix commit yields a pair of bad code and its patch, as
sketched below. The question, then, is how to use this abundance of data from the
broader domain of code repair in a solution for vulnerability repair.
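As a sketch of this mining idea, the snippet below walks the commit history of an
already-cloned repository with the GitPython library; the repository path, branch
name, file extension, and the keyword filter on commit messages are all illustrative
assumptions rather than a recipe used by any specific dataset.

    from git import Repo  # GitPython

    repo = Repo("/path/to/cloned/repo")  # placeholder path
    pairs = []

    for commit in repo.iter_commits("master", max_count=1000):
        # Keep only commits that have a parent and look like fixes.
        if not commit.parents or "fix" not in commit.message.lower():
            continue
        parent = commit.parents[0]
        for diff in parent.diff(commit):
            # Both blobs must exist (file modified, not added/deleted).
            if diff.a_blob and diff.b_blob and diff.a_path.endswith(".c"):
                before = diff.a_blob.data_stream.read().decode("utf-8", "ignore")
                after = diff.b_blob.data_stream.read().decode("utf-8", "ignore")
                pairs.append((before, after))  # (bad code, its patch)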
As it turns out, situations where the availability of data is limited are not uncommon;
in the field of deep learning, there is a method built on exactly this intuition, in which
knowledge learned from one task is transferred to another, called transfer learning.
The application of transfer learning to the vulnerability repair task has already been
investigated in [15] [2] and achieved promising results on the limited dataset.
In our work, we go even further by combining source code modeling techniques with
transfer learning, in the hope that this could further improve previously reported
results.
1.3 Research Questions
Our research will focus on answering the following three Research Questions:
Research Question 1: What do we know about Deep Learning in vulnerable
program repair?
• Reason: Deep Learning has shown great potential in various software
development problems, and exploring its applicability in vulnerable program
repair can lead to advancements in automated bug fixing. Understanding the
existing knowledge about Deep Learning in this context can provide insights
into its effectiveness, limitations, and potential for improving the efficiency and
accuracy of program repair techniques.
Research Question 2: How effective are existing generative-based methods for the
problems of code repair and vulnerability repair?
• Reason: Investigating the effectiveness of these methods in code and
vulnerability repairing can provide valuable insights into their strengths,
weaknesses, and limitations.
Research Question 3: Can code embedding extend the capabilities of these
methods?
• Reason: Code embedding is a technique that represents source code as vectors
in a continuous space, enabling the application of machine learning algorithms
to code analysis and understanding. Exploring the use of code embedding in
generative-based methods for code and vulnerability repairing can offer new
perspectives on improving the effectiveness and generalizability of these
techniques.
1.4 Thesis Outline
The thesis is structured as follows: in Section 2 we discuss the necessary background
knowledge of deep learning and its applications in the broader field of code repair,
and then review the previous methods used for the vulnerability repair problem; in
Sections 3 and 4 we discuss some of the prominent methods in detail, leading to our
proposed methods.
2 Literature review
We will first introduce the fundamentals of Deep Learning, then Learning
Paradigms, and finally Deep Learning in Code Repair.
2.1 Background on Neural Networks and Deep Learning
As mentioned in previous sections, the current focus of research in the field is on
Deep Learning-based methods. In such methods, source code is treated as a sequence
of tokens, following the intuition that source code is raw text in a specific language,
and the model learns to predict the token sequences of the respective patches. We
start with the general technique, the neural network (NN). A neural network is a
general term for a model built by stacking basic layers or networks to form a larger
network, where the output of one layer becomes the input of later layers; the
configuration of layers can vary with the specific task and the nature of the data
itself. In the context of sequence data and text data, the prominent basic networks
commonly used are recurrent neural networks and transformer neural networks. In
this section, we will go through these networks in turn to build a basic understanding
of them and of how they can be used in our problem of vulnerability repair. The
section is structured as follows: first, we touch on the basics of a neural network, its
components and learning algorithm (parameter optimization); second, we briefly
discuss the basics of the recurrent neural network, then advance to the long
short-term memory network, which is intuitively an enhancement of the vanilla
recurrent neural network for long-term memorization; in the last section, we go into
the details of the transformer neural network along with its attention mechanism.
2.1.1 Recurrent Neural Network (RNN)
In the world of data, there are properties that shape the information or knowledge
that models try to learn; one such property is the order of values in sequence data. To
be more formal, we define a sequence as a type of data in which the values at later
positions are determined by the values at previous positions:
$X_T = \{x_1, x_2, \ldots, x_t, \ldots, x_T\}$
• $T$ denotes the length of the sequence
• $t$ denotes the position of $x$ in the sequence
• $x_t$ is the value at position $t$
• $X_T$ represents sequence data of length $T$
Recurrent neural networks [16] come in two main variants: with gated units and
without gated units. In the next sections, we will discuss both network types using
one representative network for each: the vanilla recurrent neural network (without
gated units) and the long short-term memory network (with gated units).
2.1.2 Vanilla recurrent neural network
Given a sample of sequence data $X_T = \{x_1, x_2, \ldots, x_t, \ldots, x_T\}$, in
which $x_t$ is the value of the sequence at position (or time step) $t$, each value is
fed into the network in turn as an argument to the functions inside, which extract
information from that timestep, propagate it to downstream operations, and finally
make the predictions required by the task at hand. These functions in an RNN can be
intuitively described as extracting information from the current timestep using the
knowledge provided by previous calculations, which also implies that information
from the current timestep is propagated throughout the network. The mathematical
representation of this description is shown below, along with Figure 2.1, which
depicts an RNN of three layers.
$h_t = \tanh(x_t \cdot W_{xh} + h_{t-1} \cdot W_{hh} + b) \qquad (2.1)$
First, we need to walk through the notation used in Figure 2.1 before going into the
details of the operations of a typical RNN.
• The tanh function is an activation that returns values in the range $[-1, 1]$.
• The term $b$ is the bias, added to allow the model to generalize better.
• $W_{hh}$ is the weight matrix representing the connection between the hidden
values at position $t-1$ and position $t$, which allows information learned at the
previous position $x_{t-1}$ to be forwarded to the current position $x_t$.
• $W_{xh}$ is the weight matrix that extracts information from the current position
$x_t$.
• $h_t$ represents the latent information extracted from the current layer.

Figure 2.1: The basic architecture of a recurrent neural network
Figure 2.1 shows an RNN with three layers, each of which extracts the latent values
$h_t$ from its respective input $x_t$. To do so, the network is fed the input $x_t$
along with the previously extracted information $h_{t-1}$, as described in equation
(2.1). Note, however, that the hidden values at each layer are computed using the
same weight matrices mentioned above, and only $h_t$ is propagated to the next
layer; therefore, in some texts, the RNN and its variants, including the LSTM, are
drawn in the more compact form shown on the left-hand side. This propagation of
information from one layer to the next is what makes an RNN capable of handling
sequence data, since the defining property of a sequence is the dependency between
its values.
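To make equation (2.1) concrete, the following NumPy sketch unrolls this
computation over one sequence; the dimensions are arbitrary choices for illustration,
and the single shared set of weights $W_{xh}$, $W_{hh}$, $b$ is exactly what the
compact form alludes to.

    import numpy as np

    rng = np.random.default_rng(0)
    input_dim, hidden_dim, seq_len = 4, 8, 5

    # One shared set of parameters, reused at every timestep.
    W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
    W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    b = np.zeros(hidden_dim)

    X = rng.normal(size=(seq_len, input_dim))  # a sequence x_1 ... x_T
    h = np.zeros(hidden_dim)                   # initial hidden state h_0

    for x_t in X:
        # Equation (2.1): h_t from the current input and previous hidden state.
        h = np.tanh(x_t @ W_xh + h @ W_hh + b)

    print(h.shape)  # (8,): the final hidden state h_T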
However, depending on the task the network performs, there may or may not be a
prediction $y_t$ for each $x_t$; this leads to a categorization of common design
patterns for recurrent network architectures: one-to-one, many-to-one,
many-to-many, and one-to-many. As depicted in Figure 2.2, the main difference
between these design patterns is the number of inputs $x_t$ that must be fed into the
network before either a single prediction $y_t$ or a sequence of predictions
$y_t, \ldots, y_{T'}$ can be generated. The sequence-to-sequence design discussed
above is also included in the figure.
Figure 2.2: Recurrent Neural Network design patterns
2.1.3 Long short-term memory network (LSTM)
Although the structure of RNNs allows them to handle sequence data efficiently via
the connections between hidden layers, which let information flow between
timesteps, their abilities fall short on long sequences, in which information in the
later part of a sequence depends on context provided by values far from the current
position. The LSTM [16] handles this problem of the vanilla RNN using gated units,
which serve as controllers that manipulate the information flow propagating through
the network. There are three gated units in an LSTM: the forget gate, the output gate,
and the input gate. Accompanying these gated units is a cell state, the key component
that allows an LSTM to retain information from previous timesteps; at each timestep,
the information contained in the cell state is regulated by the units, which discard old
information and add new information extracted from recent timesteps.
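For reference, a standard formulation of these gates follows (this is the common
textbook presentation; the concatenation $[h_{t-1}, x_t]$ and the element-wise
product $\odot$ are notational conventions, not choices specific to this thesis):

    \begin{align}
    f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
    i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
    \tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) && \text{(candidate cell state)} \\
    C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state update)} \\
    o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
    h_t &= o_t \odot \tanh(C_t) && \text{(new hidden state)}
    \end{align}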
As mentioned in the previous section, the gated units regulate the information flow
stored in the cell state, which serves as information storage allowing the network to
retain information from the far "past". However, this is just the intuition of the
LSTM, and we need to break down the mathematical representation of these gates
and their operations. The formal representation of each gate is a function whose
arguments are the input $x_t$, the hidden state $h_t$, and the cell state $C_t$, with
the choice of arguments depending on the specific gate. Along with the new units in
the LSTM, the number of learnable weight matrices also increases corresponding to
the new units; the weights matrix of