MINISTRY OF EDUCATION AND TRAINING
HANOI NATIONAL UNIVERSITY OF EDUCATION
NGUYEN VAN TINH
LINK PREDICTION IN HETEROGENEOUS
INFORMATION NETWORKS AND ITS
APPLICATIONS IN PREDICTING ASSOCIATIONS
BETWEEN NON-CODING RNAS AND DISEASES
Major: Computer Science
Major code: 9480101
SUMMARY OF DISSERTATION IN COMPUTER
SCIENCE
Hanoi, 2023
The dissertation was completed at the Faculty
of
Information Technology, Hanoi National University of
Education.
Supervisors:
1. Assoc. Proc. Dr. Tran Dang Hung
2. Dr. Le Thi Tu Kien
Reviewer 1: Assoc.Prof.Dr. Nguyen Long Giang, Institute
of Information Technology- Vietnam Academy of Science
and Technology.
Reviewer 2:
University.
Assoc.Prof.Dr. Le Duc Hau, Thuyloi
Reviewer 3: Assoc.Prof.Dr. Nguyen Ngoc Hoa, University
of Engineering and Technology-Vietnam National
University, Hanoi (VNU).
The dissertation will be defensed at The university
committee: Hanoi National University of Education.
At: ………………..
The dissertation can be read at: National Library of
Vietnam, Hanoi, Vietnam or Library of Hanoi National
University of education, Hanoi, Vietnam
PUBLICATION
[VTN1] Van Tinh Nguyen, Thi Tu Kien Le and Dang Hung Tran,
"A new method on lncRNA-disease-miRNA tripartite graph to predict
lncRNA-disease associations", 2020 12th International Conference
on Knowledge and Systems Engineering (KSE), 2020, pp. 287-293,
doi: 10.1109/KSE50997.2020.9287563 (Scopus indexed).
[VTN2] Van Tinh Nguyen, Thi Tu Kien Le, Tran Quoc Vinh Nguyen
and Dang Hung Tran, “Inferring miRNA-disease associations using
collaborative filtering and resource allocation on a tripartite
graph”, BMC
Med
Genomics 14, 225
(2021).
(ISI Q2 journal).
[VTN3] Van Tinh Nguyen and Dang Hung Tran, "An improved
computational method for prediction of lncRNA-disease associations
based on collaborative filtering and resource allocation", 2021 13th
International Conference on Knowledge and Systems Engineering
(KSE), 2021, pp. 1-6, doi: 10.1109/KSE53942.2021.9648632 (Scopus
indexed).
[VTN4] Van Tinh Nguyen, Thi Tu Kien Le, Khoat Than and Dang
Hung Tran, “Predicting miRNA–disease associations using improved
random walk with restart and integrating multiple similarities”, Sci
Rep 11, 21071 (2021). />(ISI Q1 journal).
1
INTRODUCTION
Nowadays, we are in a connected world where data or objects’
information, actors or agents, object groups or component groups are
interacted with each other to compose large networks. These networks
are complex. They contain multiple types of nodes and multiple types
of interactions. These networks are called heterogeneous information
networks (HINs). They are rich in semantic information and can be
constructed from multiple data sources. Analyzing of heterogeneous
information network (HIN) generates a trendy research of mining of
data, retrieving of information, link prediction, mining of graph,
network science, and so forth…
Link prediction is a crucial and active task in HIN analysis. It
benefits many researchers and organizations in a variety of fields. The
link prediction’s main objective is to discover absent links in a
network or to forecast links which may soonly occur in a network.
Link prediction has been broadly applied in various domains
from social networks to biological systems. For biological systems,
link prediction has been used to discover the relationships or
associations among biological objects such as disease-phenotype/gene
associations, drug-protein interactions, drug-miRNA associations,
disease-drug associations, non-coding RNA-disease associations, and
so forth. Especially, for a long time, identifying non-coding RNAs
(ncRNAs) in the human genome is difficult. They were treated as
noise. However, ncRNAs play vital roles in life activities.
Identifying relationships between ncRNAs and diseases has
exposed opportunities for therapeutic and diagnostic of human
diseases. Therefore, the studies of ncRNA-disease relationships have
extensively been executed in recent years
Conventional biological experiments make it costly, time-
2
consuming, and laborious to discover potential ncRNA-disease
relationships. Therefore, it requires to have computational methods for
identifying ncRNA-disease associations, especially associations
between non-coding RNAs (ncRNAs) and diseases. In the past few
years, various computational methods for predicting ncRNA-disease
associations have been developed.
Although actual computational methods have made massive
benefits in revealing disease‐associated ncRNAs in each category and
typically decrease the cost as well as time of biological experiments.
Howerver, there are still some limitations which are needed to be
solved as follows.
Firstly, the computational approaches for predicting ncRNAdisease associations ought to deal with sparse data problem. It bases
on the reality that the known ncRNA-disease associations' number is
quite smaller compared to the unknown associations. Hence, it is
difficult to obtain a reliable network to represent a reasonable
biological network. Therefore, it limits prediction accuracy.
Secondly, due to the sparsity data problem, it causes another issue
that the is unbalancing of positive and negative samples in performing
computational methods for predicting ncRNA-disease associations. It
is the reason that the prediction performance of computational
methods is not very reliable.
Thirdly, the similarity calculation in existing computational
methods depends excessively on known associations between
ncRNAs and diseases. It could generate the noticeable bias to
construct computational models for predicting ncRNA-disease
associations. Therefore, it requires to reasonably fuse different
similarity scores from different souces of biological information to
enhance ncRNA-disease association prediction performance.
3
Fourthly, most of existing computational methods are not
applicable to predict associations for isolated diseases or ncRNAs
(miRNAs or lncRNAs) which have not any known association with
other ncRNAs or other diseases in the examined data sets. So, it is
nessessary to combine different biological information to improve the
capability of prediction ncRNA-disease association for isolated cases.
Fifthly, there are too many parameters that need to be adjusted in
many computational methods leading to the difficulty in performing
ncRNA-disease association prediction. It means that the researchers
need to develop more computational methods which will be easier to
employ in ncRNA-disease association prediction.
And finally, since more and more biological databases become
available so it requiresto effectively fuse data from multiple data
sources to enhance the reliability and performance of prediction.
Up to date, a numerous number of research are weekly published
in scientific journals or conferences to show new results of research
on developing computational methods for ncRNA-disease association
prediction. Many of them concentrates on solving the above
mentioned limitations. Additionally, based on the fact that selecting
useful data from heterogeneous information to build up a reliable HIN
is still a challenge, it remains room for scientists and researchers to
research for constructing a reliable HIN and training an useful
computational method to achieve more decisive performance of
ncRNA-disease associations prediction.
It is the reason that the Ph.D student selects the topic “Link
prediction in heterogeneous information networks and its
applications in predicting associations between non-coding RNAs
and diseases” for this dissertation.
• Dissertation objective and research problem
4
Through this dissertation, the research will focus on: proposing
computational methods or models to improve prediction performance
for predicting human non-coding RNA-disease associations on
heterogeneous information networks by solving the following
problems:
- Solving the sparse data problem to improve the accuracy of
human ncRNA-disease associations prediction performance.
- Fusing multi-types of information from different biological
datasets to have more realistic similarities and to decrease the
impact of depending on known human ncRNA-disease
associations excessively.
- Inheriting the computational methods from other domains such
as predicting of microbe-disease associations, drug-disease
relationships, drug-target interactions and so forth, and
improving them to achieve better performance in predicting
human non-coding RNA-disease associations.
• Thesis’s scientific contributions:
The thesis has the following scientific contributions:
Contribution 1: Proposed an improved computational model by
combining collaborative filtering and a resource allocation
process on a tripartite graph using multiple known associations'
types to forecast ncRNA-disease associations.
Contribution 2: Proposed a new miRNA-disease associations
prediction method which used a WKNKN algorithm to preprocess data to decrease unverified associations in miRNAdisease association dataset and uncover latent miRNA-disease
associations using improved random walk with restart (RWR)
algorithm and fusing multiple similarities from HINs.
The contribution 1 is presented in Chapter 2 of the dissertation,
5
related contents of this contribution were published the Proceeding of
the KSE2020 ([VTN1], Scopus Indexed), BMC Medical Genomics
journal in 2021 ([VTN2], ISI Q2 journal), and Proceeding of the
KSE2021 ([VTN3], Scopus Indexed).
Thesis’s structure: Thesis structure is shown in the below figure.
CHAPTER 1. BACKGROUND
1.1. Basic concepts
1.1.1. Heterogeneous information networks
• Information network
Definition 1.1. Information network. An information network
is represented by a graph 𝐺 = (𝑉, 𝐸) where 𝑉 represents set of nodes
and 𝐸 represents set of edges of the graph. The graph G contains an
6
mapping function of object type, ϕ: V → A. It also contains a mapping
function of link type, ψ: E → R. Each node v ϵ V has only one
particular object type, ϕ(v) ϵ A. Each link e ϵ E has only one particular
link type, ψ(e) ϵ R. If two links share the same starting and the same
ending object type, they are same link type.
• Heterogeneous/Homogeneous information network.
Definition
1.2.
Heterogeneous/Homogeneous
information network. If the information network contains
moreover one object type or one link type, it is known as a HIN,
typically |A|>1 or |R|>1; otherwise it is called a homogeneous
information network, typically, |A|=1 and |R|=1.
1.1.2. Biological systems
Biological systems are a special class of heterogeneous information
networks which consists of a large number of biological entities such
as genes, miRNAs, lncRNAs, gene expressions, phenotypes, and so
forth.
1.1.3. Non-coding RNAs (ncRNAs)
The RNAs which can not be transferred into proteins are referred
to as non-coding RNAs Micro RNAs (miRNAs)
MiRNAs are a class of single-stranded, endogenous, small,
evolutionarily conserved ncRNAs with length of 22-26 nucleotides.
Long non-coding RNAs (lncRNAs)
LncRNAs belong to a subclass of ncRNAs. They lengthen more
than 200 nucleotides.
1.2. Link prediction in heterogeneous information networks
1.2.1. Link prediction problem
Definition 1.5. Link prediction in heterogeneous information
networks
Giving a network which is represented by a graph 𝐺 = (𝑉1 ∪
7
𝑉2 ∪ … ∪ 𝑉𝑀 , 𝐸1 ∪ 𝐸2 ∪ … ∪ 𝐸𝑁 ), where 𝑉𝑖 (𝑖 = 1,2, … , 𝑀) indicates
the node set with type i, 𝐸𝑗 (𝑗 = 1,2, … , 𝑁) relects the link set with type
j. The link prediction task is to find if there will exist a link 𝑒𝑘 between
𝑣𝑖 (𝑣𝑖 ∈ 𝑉𝑖 ) and 𝑣𝑗 (𝑣𝑗 ∈ 𝑉𝑗 ). In this sense, 𝑒𝑘 reflects the prediction
task's target link.
Input: The graph 𝐺 = (𝑉1 ∪ 𝑉2 ∪ … ∪ 𝑉𝑀 , 𝐸1 ∪ 𝐸2 ∪ … ∪ 𝐸𝑁 ):
𝑉𝑖 (𝑖 = 1,2, … , 𝑀) is the node set with type i and 𝐸𝑗 (𝑗 = 1,2, … , 𝑁)
reflects the link set with type j.
Output: For any two potentially connected objects 𝑣𝑖 (𝑣𝑖 ∈
𝑉𝑖 ) and 𝑣𝑗 (𝑣𝑗 ∈ 𝑉𝑗 ), whether link 𝑒𝑘 exists (1) or not exist (0)?
1.2.2. Link prediction methods
In general, the link prediction methods can roughly be
categorized into the following types: Network similarity-based
methods, probabilistic and maximum likelihood-based methods,
machine learning based methods and deep learning-based methods.
1.2.3. Link prediction applications in biological systems
In biological networks, link prediction is normally used in
predicting associations between diseases and genes, prioritizing of
disease candidate metabolites, drug development, predicting drugprotein interactions, identifying drug-miRNA associations,
discovering disease-drug associations, predicting associations
between non-coding RNAs and diseases, and so forth.
1.3. Computational methods for predicting associations between
non-coding RNAs and diseases
1.3.1. Predicting non coding RNA-disease association prediction
as a link prediction problem
Predicting ncRNA-disease associations problem is considered as a
HIN-based link prediction problem. It typically uses a heterogeneous
network with multiple biological objects and relationships among
8
them. These biological objects and links among them were collected
from different sources including nodes belonging to ncRNA (miRNA,
lncRNA) type and disease type. Then, it predicts potential associations
between ncRNAs and diseases. The potential associations may be new
associations or missing ones.
1.3.2. Materials used for ncRNA-disease association prediction
Information about miRNAs and miRNA-target associations could
be gathered from different data sources such as miRBase, miReg,
miRTarBase, miRecords, and so on. The verified miRNA-disease
associations could be collected from multiple public databases such as
MiRCancer, MiR2Disease, HMDD, MiREC, DbDEMC, and so on.
LncRNAs’ information could be obtained from different data sources
such as LNCipedia, NONCODE database, LncRBase, and so on.
LncRNA-related interactions’ information could be gathered from
different databases such as DIANA-LncBase, lncRNA2Target, and so
on. LncRNA-disease associations’ information could be collected
from divergent databases such as LncRNADisease, Lnc2Cancer,
MNDR, and so on.
1.3.3. Similarity calculation and network construction
Disease similarity calculation
A typical method calculates disease similarity by computing the
the disease's ancestral nodes' contribution in a tree structure like
MeSH. One other type of method used other related biological
molecules' information to calculate disease’s similarity.
Non-coding RNA similarity calculation
The most popular method is to measure ncRNA similarity using
the biological information of ncRNAs themselves
Heterogeneous network construction
After obtaining similarities, a heterogeneous network is constructed.
9
1.3.4. Literature review of computational methods to predict
ncRNA-disease associations
Numerous computational approaches have recently been
developed for ncRNA-disease associations prediction. They can
generally be arranged into the accompanying categories: networkbased, recommendation-based, resource allocation-based, machine
learning-based, deep learning-based, and multi-biological information
and multi-model integration-based methods. Each computational
methods' category contains their own advantages and disadvantages.
1.4. Thesis’s research directions
The research can be implemented in the following directions
Firstly, it is necessary to develop network representation methods
and more reasonable methods for feature extraction, similarity
calculation, and fusion to solve sparse data problem or enhance the
reliability of prediction performance.
Secondly, the thesis can concentrate on integrating different
biological datasets to construct more reasonable similarities and
proposing new computational approachess for forecasting non-coding
RNA-disease associations.
Thirdly, the computational approaches for revealing ncRNAdisease associations can also be applied to other research areas such
as drug-disease associations, microbe-disease associations,
metabolite-disease associations, and so on. Therefore, the new
approaches or models for predicting non-coding RNA-disease
associations can also borrow the computational techniques from the
aforementioned fields as well and acclimate them in order to improve
performance in the interested area.
1.5. Some evaluation methods and metrics to evaluate
prediction performance
10
In this dissertation, the prediction performance of the proposed
models is evaluated by using Area under Roc Curve (AUC) as well
as Area under Precision-Recall Curve (AUPR) by doing 5-fold-crossvalidation and leave-one-out-cross-validation (LOOCV) experiments.
Besides that, to support the prediction performance reliability, some
case studies could be implemented in each proposed method.
Additionally, although the time complexity was not usually taken into
account to evaluate the performance of a method but in this
dissertation, the time complexity of proposed methods were
quantitatively estimated to assure that they will be finished in
acceptable execution time.
1.6. Chapter summary
In this chapter, firstly, some basic concepts are introduced.
Secondly, the problem of link prediction in HINs was stated and its
applications in biological systems were summarized. Thirdly, the
computational methods for predicting associations between ncRNAs
and diseases were reviewed. From the literature review of
computational methods to predict ncRNA-disease associations, the
thesis research directions are already figured out. Finally, some
evaluation methods and metrics which are used in evaluating
prediction performance of the proposed models, were presented.
CHAPTER 2. NCRNA-DISEASE ASSOCIATIONS
PREDICTION WITH COLLABORATIVE FILTERING AND
RESOURCE ALLOCATION PROCESS ON A TRIPARTITE
GRAPH
2.1. Motivations
Numerous computational approaches have recently been proposed
to forecast potential ncRNA-disease associations, particularly
11
miRNA-disease associations as well as lncRNA-disease associations.
Many of them heavily depended on known ncRNA-disease
associations. They need to utilize different similarity matrices which
are not straightforwardly connected with the ncRNA-disease
associations. Therefore, recently, numerous computational methods
have been constructed by using different known associations' types
among different objects' types to discover latent ncRNA-disease
associations. Generally, computational methods that take into account
a variety of known associations among a variety of objects typically
contributes to an increase in prediction accuracy for predicting
ncRNA-disease associations.This chapter introduces a brand-new
computational model that integrates a variety of known associations
between multiple objects to solve the sparsity data problem and
enhance prediction accuracy. The ncRNA-disease associations were
predicted by employing a process for allocating resources on a
tripartite graph.
2.2. Main related works
2.2.1. The item-based collaborative filtering algorithm for
ncRNA-disease association prediction.
Our proposed model used the item-based CF algorithms for
recommendation to solve the sparse data problem.
2.2.2. Resource allocation on a tripartite graph
The resource allocation algorithm on a tripartite graph was
successfully implemented in some computational approaches to
forecast associations between ncRNAs and diseases, including
TPGLDA and ncPred models.
2.3. The proposed model
The proposed model's flowchart is provided in Figure 2.1. It typically
consists of four main stages. In the initial stage, by using known
12
miRNA-disease associations, known lncRNA-disease associations,
and virified miRNA-lncRNA interactions, a tripartite graph G 0 is
constructed. During the second stage, an item-based CF algorithm is
Figure 2.1. The proposed model's flowchart
13
applied on graph G0 to solve the problem of sparsity data and to
produce a tripartite graph Gu. At the third stage, a resource allocation
algorithm uses the graph Gu to determine the final resource score of
each disease's ncRNA candidates. And finally, all ncRNA candidates’
resource score for each disease are ranked in descending order at the
final stage, so that the candidate with higher resource score will have
greater possibility of being verified in the future
2.4. Employing the proposed model to infer miRNA-disease
associations based on collaborative filtering and resource
allocation
2.4.1. Proposed model's stages in inferring miRNA-disease
associations
• Stage 1: Construction of a tripartite graph G0
• Stage 2: Construction of a tripartite graph Gu
• Stage 3: Employing resource allocation process on the tripartite
graph Gu to infer miRNA-disease associations
• Stage 4: Ranking all candidate miRNAs’ Rscores for each disease
in descending order
2.4.2. Proposed method’s experiments and results
• Step 1. Preparing datasets.
The proposed method used datasets which came from the study of
Zhao et al. There are 190 diseases, 111 lncRNAs, 264 miRNAs, 1880
known lncRNA-miRNA associations, 936 known associations
between diseases and lncRNAs, and 3552 verified miRNA-disease
associations in these datasets. Figure 2.2 shows the relationships
among different datasets and the number of data nodes used in the
proposed method.
• Step 2: Implementing the proposed method and estimating its
computational time complexity
The proposed method was implemented by using the Python
14
Figure 2.2. The datasets and the numbers of data nodes in the
proposed method
programming language. Its time complexity is equivalent to O(n3). It
is polynomial time complexity.
• Step 3. Evaluating prediction performance
The 5-fold cross-validation experiments, and the Area under roc curve
(AUC) as well as Area under precision-recall curve (AUPR) are used
to evaluate the proposed method's performance in identifying miRNAdisease associations.
Evaluating the AUC under 5-fold-cross validation
Figure 2.3. ROC curve and AUC value of the proposed method with
γ = 0.9 in one experimental running time
The proposed method achieves the best averaged value of 0.9788
15
with γ = 0.9 after performing the experiments for 10 times under 5fold-cross-validation. Figure 2.3 illustrates ROC curve and AUC value
of proposed method with γ = 0.9 in one experimental running time.
Evaluating the AUPR under 5-fold-cross validation
The proposed method achieves the best averaged AUPR value
0.9373 with γ = 0.9 under 5-fold cross experiments. Figure 2.4 shows
Precision-Recall curve and AUPR value of proposed method with γ =
0.9 in one experimental running time.
Figure 2.4. Precision-Recall curve and AUPR value of the
proposed method with γ = 0.9 in one experimental running
time
Comparing prediction performance with other related models
The performance of proposed method is compared with the
performance of DCSMDA and TPGLDA methods. The performances
of these methods are shown in the Table 2.1.
Table 2.1. Performance comparison with other related methods
Method
AUC value
AUPR value
TPGLDA
0.9703
0.7421
DCSMDA
0.8155
The proposed method
0.9788
• Step 4. Checking case studies
0.9373
16
Case studies on Prostatic Neoplasms, Heart Failure, Glioma and
Open-angle Glaucoma are constructed to show the ability of our model
in order to identify new disease-associated miRNAs.
2.5. Employing the proposed model to predict lncRNA-disease
associations based on collaborative filtering and resource
allocation
2.5.1. Proposed model's stages in predicting lncRNA-disease
associations
• Stage 1: Construction of a tripartite graph G0
• Stage 2: Applying collaborative filtering algorithm on known
lncRNA-disease associations and verified lncRNA-miRNA
interactions to obtain a new tripartite graph Gu
• Stage 3: Using improved resource allocation process to obtain
predicted lncRNA-disease associations
• Stage 4: Ranking all predicted lncRNAs for each disease to have
final results
2.5.2. Proposed method’s experiments and results
• Step 1. Preparing datasets
Figure 2.5. The relationships between the different data sources and
the numbers of data nodes used in the proposed method
The proposed method used the datasets came from studies of Fu
et al and Yao et al. These datasets contained 2697 tentatively known
lncRNA-disease associations (LDAs), 13562 known miRNA-disease
associations (MDAs) and 1002 verified lncRNA-miRNA interactions
17
(LMIs) among 240 lncRNAs, 420 diseases and 495 miRNAs. Figure
2.5 illustrates the data nodes’ numbers and the relationships of the
different data sources used in the proposed method.
•
Step 2: Implementing the proposed method and estimating
time complexity of proposed method
This proposed method was also implemented by using the Python
programming language. The time complexity of the proposed method
is O(nl*nd*nm) ≈ O(n3). It means its time complexity is polynomial.
• Step 3: Evaluating prediction performance
The 5-fold cross-validation experiments were employed and AUC
and AUPR were used to estimate the proposed method's performance.
Evaluating AUC under 5-fold cross-validation
The new proposed method achieves the best AUC values when
𝛾 = 0.8 under 5-fold cross validation experiments. The Figure 2.6
demonstrates the 5 experimental running times' ROC curves as well
as AUC values with 𝛾 = 0.8
Figure 2.6. The proposed method's ROC curves and AUC values in 5
running times of experiments with 𝛾 = 0.8
• Evaluating AUPR under 5-fold cross-validation
The new proposed method accomplishes best AUPR values when
𝛾 = 0.8 under 5-fold cross validation experiments. Figure 2.7 shows