HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
MASTER’S GRADUATION THESIS
PM2.5 Prediction Using Genetic Algorithm-Based Feature Selection and Encoder-Decoder Model
NGUYEN MINH HIEU
Major: Data Science and Artificial Intelligence
Thesis advisor: Dr. Nguyen Phi Le
Institute: School of Information and Communication Technology
HANOI, 09/2021
Graduation Thesis Assignment
Name: Nguyen Minh Hieu
Phone:
Email :
Class: 20BKHDL-E
Affiliation: Hanoi University of Science and Technology
I, Nguyen Minh Hieu, hereby warrant that the work and presentation in this thesis were
performed by myself under the supervision of Dr. Nguyen Phi Le. All the results
presented in this thesis are truthful and are not copied from any other works. All
references in this thesis, including images, tables, figures, and quotes, are clearly
and fully documented in the bibliography. I take full responsibility for any copy that
violates school regulations.
Hanoi, 28th September, 2021
Author
Nguyen Minh Hieu
Attestation of thesis advisor:
…………………………….......................................................
………………………………………………………………………………………………..
………………………………………………………………………………………………..
Hanoi, 28th September, 2021
Thesis Advisor
Dr. Nguyen Phi Le

Acknowledgements
First of all, I would like to deeply thank my family, especially my parents, who have
worked hard to raise me. My parents have always been with me and created the best
conditions for me to have all the necessities needed for my studies. They are my spiritual
fulcrum, giving me a springboard to overcome difficulties and challenges.
I would like to express my gratitude to my advisor, Dr. Nguyen Phi Le, for supporting my
studies and research on this subject. She is a very kindhearted and supportive
person, who has guided me from the first day I worked with her. Moreover, I would like
to thank Dr. Nguyen Thanh Hung, who has spent his precious time supporting and advising
me and, along with Dr. Nguyen Phi Le, giving me opportunities to work on many
amazing projects.
My sincere thanks also go to all the people in the ICN laboratory of the BK.AI center. I
have had a wonderful time working with talented and special peers. I learned a lot from them,
and they always spread positive energy to me.
Finally, I would like to thank my friends who have always stood by me, shared joys and
sorrows, and always supported and helped me all the time.
Abstract
The concentration of fine particulate matter (PM2.5), which represents inhalable particles
with diameters of 2.5 micrometers and smaller, is a vital air quality index. Such particles
can penetrate deep into the human lungs and severely affect human health. This paper
studies accurate PM2.5 prediction, which can potentially contribute to reducing or
avoiding the negative consequences. Our approach’s novelty is to utilize the genetic
algorithm (GA) and an encoder-decoder (E-D) model for PM2.5 prediction. The GA
benefits feature selection and removes outliers to enhance the prediction accuracy. The
encoder-decoder model with long short-term memory (LSTM), which relaxes the
restrictions between the input and output of the model, can be used to effectively predict
the PM2.5 concentration. We evaluate the proposed model on air quality datasets from
Hanoi and Taiwan. The evaluation results show that our model achieves excellent
performance. By merely using the E-D model, we can obtain more accurate (up to 53.7%)
predictions than those of previous works. Moreover, the GA in our model has the
advantage of obtaining the optimal feature combination for predicting the PM2.5
concentration. By combining the GA-based feature selection algorithm and the E-D model,
our proposed approach further improves the accuracy by at least 13.7%.
Contents
List of Figures
List of Tables
List of Equations

Abbreviations and terms

Acronym / Term    | Meaning
LSTM              | Long Short-Term Memory
RNN               | Recurrent Neural Network
MAE               | Mean Absolute Error
RMSE              | Root Mean Squared Error
Sequence Length   | Number of past time steps fed into the model
Horizon           | Number of future time steps to be predicted
Introduction
1.1 PM2.5 forecasting problem
Industrialization and urbanization have brought considerable convenience to human lives.
However, they are generally associated with severe air pollution. Accordingly, people have
raised concerns about air quality, especially near living areas. Particulate matter 2.5
(PM2.5) is one of the most important indexes to evaluate the severity of air pollution, which
is directly related to human health. PM2.5 particles in the air can bypass the nose and
throat and penetrate deep into the lungs, causing many diseases, such as cardiovascular
disease and respiratory disease. In [1], the authors reveal that long-term exposure to PM2.5
may lead to heart attack and stroke. Therefore, accurate PM2.5 forecasting is crucial and
may help governments and citizens find suitable solutions to control or prevent negative
conditions.
1.2 Existing solutions and problems
PM2.5 forecasting is a time series prediction problem that is commonly solved using
recurrent neural networks (RNNs), including LSTM [2]. The LSTM-based model has
advantages in air quality prediction [3]. In [4], the authors also use LSTM but combine gas
and PM2.5 concentrations to predict air quality in Taiwan. The work in [5] exploits deep
learning to build a hybrid neural network model that can forecast PM2.5 multiple steps
ahead. In [6], Yanlin et al. present a hybrid model that integrates graph convolutional
networks and LSTM to predict PM2.5. In [7], the authors utilize the k-nearest neighbor
algorithm to mine spatial-temporal information. The historical information of related
locations is then used as the input of the LSTM, adaptive temporal extractor (ASE), and
artificial neural network (ANN) models. Several other deep learning models for predicting
air quality can be found in [8]-[11]. Despite considerable effort, air quality prediction models
still suffer from two issues: restrictions of the input and output lengths and unoptimized
feature selection. The first issue indicates that the number of time steps in a model’s output
cannot exceed that of the input; i.e., the model cannot predict the future with upcoming
steps that exceed the input data’s length. Therefore, it is essential to remove this limitation
in PM2.5 prediction. The second issue arises from the fact that air quality data include
dozens of factors other than PM2.5, such as various concentrations, temperature, and
humidity. These factors may or may not be related to PM2.5. However, appropriate use of
some of these factors may improve the prediction accuracy. Meanwhile, misuse may not
only degrade the accuracy but also add extra computational time. Therefore, choosing the
optimal feature combination is essential.
1.3 Goals and approaches
This paper aims to address the two issues described above. As a solution, we propose a
novel PM2.5 prediction model that combines a genetic algorithm (GA) and an encoder-decoder (E-D) model. The GA is exploited to perform feature selection in a near-optimal
manner, thereby enriching the prediction model. Additionally, we leverage the encoder-decoder model to build a PM2.5 prediction model with high accuracy. As a result, the
proposed model can efficiently handle different sizes (in terms of the number of time steps)
of input and output. To demonstrate the effectiveness of our proposed approach, we
evaluate the GA-based feature selection on the Hanoi [12] and Taiwan [11] datasets. The
evaluations show that the GA-based feature selection outperforms other methods. We then
compare our model to the state-of-the-art method ST-DNN [11] using the Taiwan dataset.
Compared to ST-DNN, our model improves the accuracy by 14.82% to 41.71%. By
combining the GA-based feature selection algorithm and the E-D model, our proposed
approach further increases the accuracy by at least 3%.
1.4 Structure of thesis
The remainder of this paper is organized as follows. We describe the motivations in
Section II. Section III presents our proposal. The performance evaluation is presented
in Section IV. Section VI introduces related works. Finally, Section VII concludes the
paper.
1 Related works
This section briefly reviews work related to PM2.5/air quality prediction models and GA-related methods. Kök et al. [3] predicted air quality using a deep learning model that
includes three parts. In the first part, the training data are fed into an LSTM layer with an
input sequence length of 8 and output length of 1. Second, the predicted data are labeled
according to the daily air quality index (AQI) values. Finally, a decision unit is developed
to map the observed data and predicted alarm situations. The model succeeds in employing
LSTM with high accuracy, but the input and output are not flexible. Several other models,
such as ST-DNN [11], deep air learning (DAL) [10], and GC-DCRNN [14], exploit spatial data
to formulate the relationships between spatial-temporal data. However, ST-DNN and GC-DCRNN do not identify the factors that affect air quality. Additionally, the models have a
high time cost because of the preprocessing. The DAL model performs feature selection
during the training process by inserting a layer between the input layer and the second
layer of the neural network. DAL, however, aims to discover the importance of different
input features to the predictions, not to increase the prediction accuracy. The authors reveal
the main relevant factors to the variation in air quality and provide proof to support air
pollution prevention and control.
In [14], the authors use a sequence-to-sequence model to predict PM2.5. They feed all air
pollutants into the model without concern for their appropriation. The main problem is that
predicting all air pollutants will result in an ‘‘accumulation of errors.’’ For example, when
each feature’s prediction results are inaccurate, it will negatively affect PM2.5 forecasting.
Even if a feature does not affect PM2.5, it will cause the outcomes to be more inaccurate.
In [6], the dataset consists of five features other than PM2.5; therefore, there are 2^5 = 32 possible feature
combinations. However, the authors do not describe how to select the optimal combination.
In the experiments, the authors present results for seven combinations without explaining
why these combinations are selected. L. Yan et al. use the E-D model to predict PM2.5 in
[15]. The authors use all other features, including the monthly average PM2.5
concentration, daily average PM2.5 concentration, PM10 concentration, AQI, SO2, CO,
NO2, O3, average temperature, humidity, pressure, and wind speed per hour per day. As
there is no consideration of the optimal feature selection, the use of all features may
complicate the model and degrade the prediction accuracy. In [16], the authors divide the
features into groups and feed each group into a separate encoder. This approach has two
serious problems. First, similar to other existing works, the use of all features, including
features not related to PM2.5, degrades the prediction accuracy. Second, as they use a
separate encoder for each feature group, the number of encoders is equal to the number of
feature groups. Consequently, the model becomes complex when the number of features is
extensive. Bo Zhang et al. in [6] conduct two training phases to predict PM2.5. In the
former phase, the authors use an auto-encoder model to extract the relationship between
multiple climate variables and the PM2.5 concentration. This phase is also responsible for
compressing the input data to reduce the complexity. The output of the first phase is then
fed into the second phase, which uses a Bi-LSTM network to predict the future PM2.5
based on the historical data. In [7], the authors take advantage of three types of data,
including recent air pollutants, meteorology, and adjacent station’s PM2.5. The authors first
utilize a one-dimensional convolutional neural network to extract the air quality data’s
spatiotemporal correlation. The extracted feature vector is then fed into an LSTM layer
with an attention mechanism to predict future air quality. In summary, the existing works
use various variants of the E-D model to perform PM2.5 prediction, but none consider
feature selection.
The choice of hyperparameters generally has a significant impact on the performance of a
deep-learning model. A non-exhaustive list of parameters includes the number of hidden
layers, the number of neurons in each layer, the weights, and the learning rate. In previous
works, these hyperparameters are typically chosen based on trial and evaluation, which is
time-consuming and does not guarantee optimal performance. A feasible approach to
address this problem is to exploit search techniques to find the optimal settings. In such a
context, the genetic algorithm (GA), which is a meta-heuristic search and optimization
method, shows potential. In [17], the authors use GA to determine the optimal structure of
a deep belief network for detecting different types of attacks in IoT networks. They then
propose a GA-based approach to adaptively adjust the number of hidden layers and the
number of neurons in each layer according to the attack type. Bouktif et al. in [18] exploit
a GA to determine the optimal time lag and number of layers of an LSTM model for
predicting the future electric load. In [19], the authors propose a deep long short-term
memory (DLSTM) model to predict petroleum production. A GA is used to infer the
optimal selection of the model hyperparameters, such as the number of epochs, the number
of hidden neurons, and the lag. W. Liu et al. leverage support vector machine (SVM) for
PM2.5 prediction in [20]. They use the genetic algorithm (GA) and particle swarm
optimization (PSO) to optimize the model’s parameters, thereby improving the prediction
accuracy. In [21], the authors propose a PM2.5 prediction model that utilizes multiple
resolution data as input. An ensemble approach is leveraged to combine information
retrieved from the low- and high-resolution data. The authors use the nondominated sorting
genetic algorithm (NSGA-II) to optimize the ensemble weights. These works differ from
our approach in that they leverage a GA to tune the parameters, not to optimize the input
feature combination. Indeed, the models proposed in [20] and [21] take all the features as
the input.
In this research, we apply a GA for air quality prediction model feature selection. The
integration substantially improves the accuracy of PM2.5 prediction. Another distinct
characteristic of our model is the ability to predict multiple steps ahead, as desired.
2 Theoretical Background
1.5 Artificial Intelligence
Artificial intelligence (AI) refers to the simulation of human intelligence in machines that
are programmed to think like humans and mimic their actions. The term may also be
applied to any machine that exhibits traits associated with a human mind such as learning
and problem-solving. Artificial intelligence methodology has attracted a significant amount
of attention due to its high accuracy and versatility in solving different problems in many
domains, such as computer vision, speech recognition, and wireless communications.
The ideal characteristic of artificial intelligence is its ability to rationalize and take actions
that have the best chance of achieving a specific goal.
Artificial intelligence can be split into two broad types: narrow AI and general AI. Narrow
AI is the intelligent system that has been taught or learned to carry out specific tasks
without being explicitly programmed how to do so. This type of machine intelligence is
evident in the speech and language recognition of the Siri virtual assistant on the Apple
iPhone, in the vision-recognition systems on self-driving cars, or in the recommendation
engines that suggest products you might like based on what you bought in the past. Unlike
humans, these systems can only learn or be taught how to do defined tasks, which is why
they are called narrow AI. On the other hand, General AI is very different and is the type of
adaptable intellect found in humans, a flexible form of intelligence capable of learning how
to carry out vastly different tasks, anything from haircutting to building spreadsheets, or
reasoning about a wide variety of topics based on its accumulated experience.
In general, all of the achievements mentioned so far stemmed from a subset of artificial
intelligence, machine learning, which refers to the concept that computer programs can
automatically learn from and adapt to new data without being assisted by humans. Deep
learning, a more advanced technique of machine learning, enables this automatic learning
through the absorption of huge amounts of data such as text, images, or video. In the next
sections 1.6 and 1.7, we will go into detail about two subsets of Artificial Intelligence:
Machine Learning and Deep Learning.
1.6 Machine learning overview
1.7 Deep learning overview
Over the last decade, deep learning has attracted much attention and become the dominant
technology in the artificial intelligence community. Deep learning is a subset of machine
learning that is based on deep artificial neural networks. These neural networks attempt to
simulate the behavior of the human brain, allowing the model to learn from a large amount of data.
An artificial neural network is a graph computing system inspired by the neural network of
humans. An artificial neural network consists of many layers, which can be classified into
three main categories: input layer, hidden layer, and output layer. Figure 1 shows a basic
structure of an artificial neural network.
Figure 1. An example of an artificial neural network.
Deep Learning can handle different problems involving large amounts of data with many
different types of input. In addition, Deep Learning reduces the cost of feature
engineering, which is time-consuming in classical Machine Learning. Input data only needs to be
processed to form datasets for training, testing, and evaluation; once fed into the models,
Deep Learning models learn and process the data themselves and produce results.
The disadvantage of Deep Learning is that it incurs high training costs and, in many cases,
requires parallel computing resources. Besides, deep learning requires the researcher to
tune the parameters by trial and error to obtain the best results, without any general
formula or theory.
In recent years, many deep learning architectures have been invented and widely adopted
in many fields such as time series prediction, computer vision, natural language
processing, speech recognition, etc. Specifically, for time-series problems, many deep
learning models such as Recurrent Neural Networks (RNN), Gated Recurrent Units
(GRU), and Long Short-Term Memory (LSTM) have been widely used and have shown promising
results compared to traditional models. For the problem of predicting the fine dust index
to assess the quality of the environment, Deep Learning is a powerful tool, since the
number of data fields and records is high and data preprocessing would be time-consuming
without a deep learning approach.
1.8 Long short-term memory
The Long Short-Term Memory (LSTM) network is a type of recurrent neural network
(RNN). An RNN is a form of artificial neural network called recurrent because it performs
the same task for all elements, using the output (hidden state) of the previous computation
step as an input of the next step. Unlike traditional neural networks, RNNs can store
previously computed information. Therefore, RNNs have many benefits when dealing with
sequential data.
Figure 2. Structure of RNN.
In Figure 2, $x_1, x_2, \ldots, x_t$ are the input data at time steps $1, 2, \ldots, t$, respectively. $h_1, h_2, \ldots, h_t$ are the hidden states at the corresponding time steps, and they are usually computed by an activation function. $y_t$ is the output data at time step $t$. The hidden state is initialized with the value 0. $W_{xh}$, $W_{hh}$, and $W_{hy}$ are weight matrices. The equations that define the RNN are shown below:

$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
$$y_t = g(W_{hy} h_t + b_y)$$

Equation 1. RNN equation.

where $f$ and $g$ are activation functions.
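As a concrete illustration of Equation 1, the following minimal NumPy sketch computes one RNN step; the function name rnn_step, the choice of tanh for f, and the identity for g are assumptions made for this example, not details taken from the thesis.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One vanilla RNN step following Equation 1 (here f = tanh and g = identity)."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)  # new hidden state
    y_t = W_hy @ h_t + b_y                            # output at time step t
    return h_t, y_t

# Toy usage: 3 input features, hidden size 4, scalar output.
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))
h = np.zeros(4)
h, y = rnn_step(rng.normal(size=3), h, W_xh, W_hh, W_hy, np.zeros(4), np.zeros(1))
```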
However, to train an RNN we need to backpropagate through time, and at each step the
gradient is calculated. The gradient is used to update weights in the network. If the
effect of the previous layer on the current layer is small, then the gradient value will be
small, and vice versa. If the gradient of the previous layer is small, then the gradient of the
current layer will be even smaller. This makes the gradients shrink exponentially as
we backpropagate. A smaller gradient means it has little effect on the weight updates. Due to
this, the network does not learn the effect of earlier inputs, thus causing the short-term
memory problem.
To overcome this problem, two specialised versions of RNN were created: 1) the GRU
(Gated Recurrent Unit) and 2) the LSTM (Long Short-Term Memory). LSTMs and GRUs
make use of a memory cell to store the activation values of earlier elements in long
sequences. Here, the concept of gates comes into the picture. Gates are used for controlling
the flow of information in the network. Gates are capable of learning which inputs in the
sequence are important and store their information in the memory unit. They can pass the
information along long sequences and use it to make predictions.
Nevertheless, LSTM often outperforms both the vanilla RNN and the GRU, and it is tremendously helpful in
multistep time series forecasting. It performs the same task for all elements, with the output
of the current step used as the input of the upcoming step. Unlike a
traditional neural network, LSTM can store previously calculated information, thereby
gaining benefits when the processed data are in the form of a sequence. The original RNN
suffers from a critical weakness: the vanishing gradient problem, where earlier steps'
contribution becomes insignificant in future steps, so the model fails to capture long-term
dependencies. LSTM addresses the vanishing gradient problem by introducing three
special gates: the forget gate, the update gate, and the output gate. The gates can be used to decide
which information and how much of each type of information should be remembered. The
main advantage of LSTM compared to other forms of RNNs is the ability to learn long-term dependencies.
Figure 3 shows the typical structure of an LSTM unit, consisting of three main gate
structures: the update gate $i_t$, the forget gate $f_t$, and the output gate $o_t$. The input of an LSTM unit
comprises three elements: a cell state $c_{t-1}$ and a hidden state $h_{t-1}$ outputted by the previous LSTM
unit, and the input data of the current unit $x_t$. $\tilde{c}_t$ is a candidate for replacing the memory cell,
which is calculated from the current input and the previous hidden state. The forget gate
and the update gate are both computed with a sigmoid function, whose input
comprises the previous hidden state and the current input. Thus, the output values of the
forget gate and the update gate are within the range (0, 1). The value of the forget gate is
applied to the previous cell state $c_{t-1}$, while the value of the update gate is multiplied by the
new memory candidate $\tilde{c}_t$ to form the new cell state. Intuitively, the forget gate controls to what extent
previous information is forgotten, while the update gate decides the extent to which the current
input and the previous hidden state are written onto the new cell state. The closer the
value is to 0, the more information is forgotten and ignored; the closer the value is to 1,
the more information is maintained and stored. $o_t$ is the output gate, which determines what
the next hidden state should be. The equations for each state are as follows:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$

Equation 2. LSTM equation.

where $W_f$, $W_i$, $W_c$, and $W_o$ correspond to the weight matrices; $b_f$, $b_i$, $b_c$, and $b_o$ are the bias coefficients; and $\sigma$ and $\tanh$ are activation
functions.
Figure 3. Structure of the LSTM unit.
The new cell state $c_t$ is calculated from the previous state $c_{t-1}$ and the new memory candidate $\tilde{c}_t$. First,
the previous cell state $c_{t-1}$ is pointwise multiplied by the forget vector $f_t$, which has the ability to
drop values in the cell state if $f_t$ goes to 0. Then, we add $i_t \odot \tilde{c}_t$, which updates the cell state with new
values that the neural network finds relevant. This process yields the new cell state
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, which decides the information to forget and update. Finally, the new hidden
state is obtained by multiplying the output gate $o_t$ with the $\tanh$ of the new cell state. The new
cell state and the new hidden state are carried over to the next time step. In summary, the
greatest advantages of LSTM and its gates are that the forget gate decides what is
relevant to keep from previous steps, the update gate decides what information is relevant
to add from the current step, and the output gate determines what the next hidden state
should be.
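To make Equation 2 concrete, the following sketch implements a single LSTM step in NumPy. It is only illustrative: the weight shapes, the hidden size, and the helper name lstm_step are assumptions made for this example, not part of the thesis or of any specific library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Equation 2.
    W and b are dicts of weight matrices and biases (hypothetical containers)."""
    concat = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ concat + b["f"])          # forget gate
    i_t = sigmoid(W["i"] @ concat + b["i"])          # update (input) gate
    c_tilde = np.tanh(W["c"] @ concat + b["c"])      # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde               # new cell state
    o_t = sigmoid(W["o"] @ concat + b["o"])          # output gate
    h_t = o_t * np.tanh(c_t)                         # new hidden state
    return h_t, c_t

# Toy usage with random weights: 3 input features, hidden size 4.
n_in, n_h = 3, 4
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(n_h, n_h + n_in)) for k in "fico"}
b = {k: np.zeros(n_h) for k in "fico"}
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
```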
1.9 Encoder-Decoder model
An encoder is a network (FC, CNN, RNN, etc.) that takes the input and outputs a feature
map/vector/tensor. This feature vector holds the information, the features, that represents
the input. The decoder is a network (usually the same network structure as the encoder but
in the opposite orientation) that takes the feature vector from the encoder and produces the
closest match to the actual input or intended output. The technique is used in various
applications such as translation and generative models.
Figure 4. The basic structure of the encoder-decoder model.
Figure 4 shows the structure of an encoder-decoder model. The model can be constructed
in three steps:
1. Encoder: It accepts a single element of the input sequence at each time step,
processes it, collects information for that element, and propagates it forward.
2. Encoder vector: This is the final internal state produced from the encoder part of
the model. It contains information about the entire input sequence to help the
decoder make accurate predictions.
3. Decoder: Given the encoder vector, it predicts an output at each time step.
Let $X = (x_1, x_2, \ldots, x_T)$ be the input data and $Y = (y_1, y_2, \ldots, y_{T'})$ be the output data. Besides, let $f_{enc}$ and $f_{dec}$ be the functions that map an
input to an output. Mathematically, the output of the decoder can be represented as follows:

$$z = f_{enc}(x_1, x_2, \ldots, x_T), \qquad y_t = f_{dec}(z, y_1, \ldots, y_{t-1})$$
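The PM2.5 forecasting model in this thesis builds on this encoder-decoder idea. As a rough illustration only (the thesis does not specify the framework, layer sizes, or variable names used here), an LSTM-based encoder-decoder mapping seq_len input steps with n_features features to horizon PM2.5 values could be sketched in Keras as follows:

```python
import tensorflow as tf

seq_len, horizon, n_features = 24, 12, 6   # illustrative sizes, not from the thesis

model = tf.keras.Sequential([
    # Encoder: reads the whole input sequence and summarizes it in its final state.
    tf.keras.layers.LSTM(64, input_shape=(seq_len, n_features)),
    # Encoder vector: repeated once per output time step for the decoder.
    tf.keras.layers.RepeatVector(horizon),
    # Decoder: unrolls over the desired number of future steps.
    tf.keras.layers.LSTM(64, return_sequences=True),
    # One PM2.5 value per predicted time step.
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),
])
model.compile(optimizer="adam", loss="mae")   # MAE matches the fitness used later
```

Because the decoder length (horizon) is independent of the encoder length (seq_len), this structure removes the input/output-length restriction discussed in Section 1.2.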
1.10 Feature engineering
Time series prediction problems play an important role in providing preventive
measures that reduce the risk of not responding in time when an unexpected
phenomenon occurs. Examples of such prediction problems include air quality forecasting,
water level forecasting, rain forecasting, and population forecasting. However, to make such
predictions, for example PM2.5 prediction, one needs an air quality data set or data related to it.
The issue arises from the fact that air quality data include dozens of factors other than
PM2.5, such as various concentrations, temperature and humidity. These factors may or
may not be related to PM2.5. However, appropriate use of some of these factors may
improve the prediction accuracy. Meanwhile, misuse may not only degrade the accuracy
but also add extra computational time. Therefore, the selection of features is essential. This
is a specific technique in feature engineering. Feature engineering is the process of
converting the original raw data set into a set of features that can help represent the original
data set better, facilitate solving problems more easily, and help with compatibility with
each specific prediction model, as well as improve the accuracy of the current prediction
model.
Feature engineering tries to represent the original data set in the form that is most compatible with the
prediction model being used. The features in the data set directly affect the predictive
model, so we need to define the structure of the features well to best express the nature of
the data set.
1.10.1 The importance of features
Having a lean set of features will contribute to simplifying the computational complexity
of the model so that the calculation is faster and easier to interpret for the user. For
example, when using a decision tree model, if we use too many features in the prediction
process, even though it gives very good results, it will be difficult for users to observe and
interpret the prediction results. Feature selection solves these problems by automatically
selecting a subset of the original features so that the selected features are suitable for the
current problem.
Attribute importance assessment helps data analysts decide which attributes
should be included in the training data. There are many methods to evaluate the importance
of an attribute; after evaluation, each attribute receives a separate ranking score. The
higher the score, the more likely that attribute will be included in the training process, while
attributes with low scores may be dropped.
1.10.2 Feature extraction
In some problems, the data have many attributes, which makes the input data set large, makes the
learning process time-consuming, and can cause inaccurate results because some
attributes are not relevant. Feature extraction reduces the dimensionality of
the data so that the original raw data become simpler and more compact before they are further
processed and fed into the training model. This method helps the predictive model to
reduce the training time significantly. Depending on the problem, there are different
methods of attribute extraction. Figure 5 illustrates an example of feature extraction.
Figure 5. An example of feature extraction.
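As a small illustration of feature extraction (not a method prescribed by the thesis), principal component analysis (PCA) from scikit-learn can compress a multivariate air quality table into a few components; the data below are randomly generated for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical air quality matrix: rows = hourly records, columns = raw features.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))        # e.g. gases, temperature, humidity, ...

pca = PCA(n_components=3)              # keep 3 compact components
X_reduced = pca.fit_transform(X)       # shape (1000, 3)
print(pca.explained_variance_ratio_)   # how much variance each component keeps
```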
1.10.3 Feature selection
The attributes in each problem will have different importance and influence on the target
attribute. Some attributes are necessary for the prediction, while others are redundant and
need to be removed because they are not suitable for the problem under
consideration. The feature selection method helps to select a subset of attributes from the
original parent set of attributes to suit the problem. There are many algorithms applied to
this attribute selection such as correlation evaluation and trial and error methods. A
common method of feature selection is to use correlation evaluation algorithms, such as
Pearson's correlation [15]. The algorithms first determine the correlation between features and then
select the most correlated features to include in the prediction model. Another approach is
to use an embedded model such as Lasso [2] or random forest [3], which have associated
feature selection methods. Figure 6 illustrates an example of feature selection.
Figure 6. An example of feature selection.
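As a minimal sketch of the correlation-based selection mentioned above (the column names, the random data, and the threshold are assumptions for illustration, not values used in the thesis), Pearson correlations with PM2.5 can be computed with pandas and the most correlated features kept:

```python
import pandas as pd
import numpy as np

# Hypothetical hourly air quality table.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 5)),
                  columns=["PM2.5", "PM10", "CO", "TEMP", "HUMIDITY"])

# Pearson correlation of every feature with the PM2.5 target.
corr = df.corr(method="pearson")["PM2.5"].drop("PM2.5")
selected = corr[corr.abs() > 0.2].index.tolist()   # illustrative threshold
print(corr.sort_values(ascending=False))
print("Selected features:", selected)
```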
1.10.4 Feature construction
Building new features is a specific task that requires researchers to spend a lot of time and
effort. There is no model that can build a new feature automatically; it relies on human
effort, so constructing a new feature requires creativity and experimentation with
many different cases. Figure 7 illustrates an example of feature construction.
Figure 7. An example of feature construction.
1.11 Genetic algorithm
1.11.1 Overview
In nature, every species must change and adapt to the environment to survive and thrive.
Any species that cannot adapt to the environment fast enough will become extinct,
following Darwin's law of natural selection.
In the body of any individual, the genes are linked together in a series of structural formats,
called chromosomes. The characteristic of chromosomes is unique for each species and
determines the character of that species. As a result of the constant change in the
environment, the structure of chromosomes is also changed to adapt to the environment
and the next generation would always be more adaptive than previous generations. This is
due to the random exchange of information with the external environment or between the
chromosomes.
Genetic algorithms are designed based on the mechanism of genes. Many terms
used in genetic algorithms are taken from the genetics domain, such as hybridization,
mutation, crossover, variant, chromosome, and individual. Genetic algorithms are widely
applied to a wide range of optimization problems, such as graph coloring and pattern
recognition, and are adopted in many domains, from financial markets to multi-objective
engineering optimization. Genetic algorithms show many advantages over traditional
optimization algorithms. The two most notable advantages are the ability to deal with
complex problems and parallelism. However, genetic algorithms also have some
disadvantages: the selection of the fitness function, the population size, and the choice of
hyper-parameters must be handled carefully for the algorithm to converge; otherwise, the
result of the algorithm is meaningless.
The process of Genetic Algorithms involves using an encoding function to encode the
optimization parameters into a set of bits or characters in a linear order to represent
chromosomes, creating variants from the initial chromosomes using genetic operators, and
selecting chromosomes using a fitness function to find a good (or optimal) solution to the
addressed problem. The whole process often consists of the following procedure:
1. Encoding the set of optimization parameters (cost functions).
2. Defining a fitness function to evaluate the performance of a chosen set of
parameters (selection criterion).
3. Creating an initial population of individuals.
4. Iterating: evaluating the individuals in the population, creating a new population
using genetic operators and fitness-proportionate reproduction, and replacing the
old population with the newly generated one.
5. Choosing an optimal solution and decoding the result to get the solution for the
problem.
Figure 8. The basic structure of Genetic Algorithm
The steps of the genetic algorithm are shown in Figure 8.
1.11.2 Genetic operators
Genetic algorithms have four main genetic operators: initialization, crossover, mutation,
and selection. Their roles can be very different.
1. Initialization: Randomly generating a population of individuals with a
predetermined size.
2. Crossover: Swapping parts of the solution with another in chromosomes or solution
representations. The main role is to provide mixing of the solutions and
convergence in a subspace.
3. Mutation: The change of parts of one solution randomly, which increases the
diversity of the population and provides a mechanism for escaping from a local
optimum.
4. Selection: The use of the solutions with high fitness to pass on to the next
generations, which is often carried out by keeping some form of the best
solutions.
The details of these operators are presented in Sections 1.11.3 to 1.11.6.
1.11.3 Initialization
Traditionally, binary strings of 0s and 1s are used to represent an individual of the
population, but other encoding methods are also possible, e.g., decimal. The evolution
usually starts from a population of randomly generated individuals and happens in
generations. The reproduction operators select chromosomes from the population to be
parents and conduct the crossover process. In each generation, the fitness of every
individual in the population is evaluated. The selection of a chromosome for parenthood
can range from a totally random process to one that is biased by the chromosome's fitness.
The fitness evaluation returns the fitness value of an individual in the GA. This
evaluation function judges the quality of a chromosome, i.e., its fitness to solve the problem. The
results of the fitness evaluation function determine the probability that a possible solution
is selected to produce new solutions in the next generation.
1.11.4 Crossover
Crossover is an interchange of parts between two individuals. The crossover operator
creates two offspring chromosomes, which contain some genetic material of their parents.
The crossover operator is applied with the hope that, by taking the best characteristics
from each of the parents, it can produce offspring better than both parents. There are many
types of crossover operators, such as one-point, two-point, and uniform crossover.
1.11.5 Mutation
The mutation operator is used to maintain diversity in the population. The
basic mutation operator randomly selects the position of one of the bits in a bit string;
the bit at that position is then changed. Finally, the generated individual is
moved into the new population of the next generation. There are many different types of
mutation operators used with a binary representation, such as flip-bit and interchange. The
flip-bit mutation operator creates the new offspring chromosome based on a randomly
generated mutation chromosome: if a bit of the mutation chromosome is 1, the corresponding
bit in the parent chromosome is flipped (0 to 1 and 1 to 0). With interchange mutation, two
genes exchange their positions to form a new chromosome.
1.11.6 Selection
The selection process uses the result of the objective function as guidance on which
individuals to choose from the population. Many selection methods exist, e.g., Roulette
Wheel Selection, Rank Selection, Steady State Selection, Tournament Selection, Elitism
Selection, and Boltzmann Selection. Elitism Selection ensures that the best individuals
survive in the population. However, very strong elitism may lead to premature convergence.
Thus, combining elitism with other forms of selection is necessary.
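To tie Sections 1.11.3 to 1.11.6 together, the sketch below runs a tiny GA over binary chromosomes with one-point crossover, flip-bit mutation, tournament selection, and simple elitism. The toy fitness function (number of 1-bits) and all sizes are illustrative assumptions, not the fitness used in this thesis.

```python
import random

CHROM_LEN, POP_SIZE, GENERATIONS = 12, 20, 30

def fitness(chrom):
    # Toy objective: maximize the number of 1-bits (placeholder for a real fitness).
    return sum(chrom)

def init_population():
    return [[random.randint(0, 1) for _ in range(CHROM_LEN)] for _ in range(POP_SIZE)]

def one_point_crossover(p1, p2):
    point = random.randint(1, CHROM_LEN - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def flip_bit_mutation(chrom, rate=0.05):
    return [1 - g if random.random() < rate else g for g in chrom]

def tournament_select(pop, k=3):
    return max(random.sample(pop, k), key=fitness)

population = init_population()
for _ in range(GENERATIONS):
    new_population = [max(population, key=fitness)]          # simple elitism
    while len(new_population) < POP_SIZE:
        p1, p2 = tournament_select(population), tournament_select(population)
        c1, c2 = one_point_crossover(p1, p2)
        new_population += [flip_bit_mutation(c1), flip_bit_mutation(c2)]
    population = new_population[:POP_SIZE]

best = max(population, key=fitness)
print("Best chromosome:", best, "fitness:", fitness(best))
```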
1.12 Research methods
The research methods include (i) surveying the current state of research for the problem
under consideration, (ii) proposing a deep learning model to solve the problem of
predicting environmental monitoring indicators, and (iii) testing and evaluating the performance of
the proposed model and comparing it with the models proposed by other groups.
3 Proposed Forecasting Framework
(OFFGED)
1.13 Overview
Figure 9 presents an overview of our proposed forecasting framework OFFGED (Optimal
Forecasting Framework using GA-based feature selection and Encoder-Decoder model),
which includes GA-based feature selection and a prediction module that are detailed in
Section 1.14 and Section 1.15, respectively. In the left module, each individual in the initial
population represents a feature combination. The GA-based method includes numerous
iterations (i.e., generations), in each of which genetic operations of crossover and mutation
generate new individuals. Each individual is evaluated with respect to fitness. We define
fitness as the MAE, which is the prediction model’s output with a specified feature
combination acting as the model’s input. In each generation, the individuals with better
fitness remain; the others are removed from the population. By repeating these processes,
the fitness values of the population are improved in each generation. Finally, we identify
the individual with the optimal fitness value. The right module is the encoder-decoder-based
prediction for PM2.5. The module includes a prediction model that uses a data set
that consists of features selected by the GA-based module as the input.
Figure 9. Overview of the proposed model.
1.14 GA-based feature selection
Let n be the number of all features in the data set. Each individual is a binary string of
length n that encodes a feature combination. Specifically, the i-th gene obtains the value of
either 1 or 0, which indicates whether the i-th feature is selected (as shown in Figure 10). A
genetic algorithm adopts the fitness value to represent how ‘‘good’’ an individual is. In our
algorithm, we define the fitness of an individual as the mean absolute error (MAE), which
is obtained by the prediction model. For each generation, we first determine all the newly
created individuals, each of which is used to train the prediction model. We then evaluate
the model to obtain the MAEs.
Figure 10. Encoding a feature combination (the white and gray cells represent the selected
feature encoded by 1 and 0, respectively).
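A minimal sketch of how an individual could be decoded and scored in this framework is given below. The helper names (train_and_evaluate, FEATURES) and the data handling are assumptions for illustration; the thesis only specifies that the fitness is the MAE returned by the prediction model trained on the selected features.

```python
import random

# Hypothetical full feature list of the air quality data set.
FEATURES = ["PM10", "CO", "NO2", "O3", "TEMP", "HUMIDITY", "WIND", "PRESSURE"]

def decode(individual):
    """Map a binary string to the corresponding feature combination."""
    return [f for f, gene in zip(FEATURES, individual) if gene == 1]

def fitness(individual, train_and_evaluate):
    """Fitness = MAE of the prediction model trained on the selected features.

    train_and_evaluate is a placeholder callback that trains the encoder-decoder
    model on the chosen columns and returns its MAE on a validation set.
    """
    selected = decode(individual)
    if not selected:                       # an empty combination is invalid
        return float("inf")
    return train_and_evaluate(selected)    # lower MAE = better individual

# Example: one random individual over 8 features, scored with a dummy evaluator.
individual = [random.randint(0, 1) for _ in FEATURES]
dummy_eval = lambda cols: 10.0 / len(cols)   # stand-in for real model training
print(decode(individual), fitness(individual, dummy_eval))
```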