Social Media as Sensor for Healthcare:
A Machine Learning Approach
Dao Duy Bo
Doctor of Philosophy
2016
Deakin University
Social Media as Sensor for Healthcare:
A Machine Learning Approach
by
Bo Duy Dao
M.Sc
Submitted in fulfillment of the requirements for the degree of
Doctor of Philosophy
Deakin University
July, 2016
Acknowledgements
First of all, from the bottom of my heart, I would like to thank all my supervisors, Dr.
Thin Nguyen, Prof. Dinh Phung and Prof. Svetha Venkatesh. Their professional guidance, constructive criticism, thorough comments, practical suggestions, constant encouragement and support inspired me to conduct the research. Dinh and Thin were patient
and encouraging when I was slow and lost. I will never forget Svetha’s scientific writing
workshops which greatly helped me to improve my research skills. Without their support,
this thesis would have been much more difficult. I am deeply indebted to each of them.
I would like to extend my deep thanks to my sponsors, Binh Dinh College and Deakin
University for granting me time, financial support and scholarships which have enabled
me to pursue my PhD research.
My sincere gratitude goes to all staff members of the Centre of Pattern Recognition and Data
Analytics for their invaluable support during my study. My appreciation also goes to my
fellow PhDs and friends for their friendship and mutual support.
Finally, this thesis is dedicated to my family for their encouragement and support through
my long journey. I especially devote my deepest gratitude to the memory of my late
father, Chap, for his unconditional sacrifice and encouragement. I am deeply indebted
to my dear mother, Thu, and parents-in-law, Vinh and Dao, for their endless love and
support. I cannot thank my beloved wife, Loan, enough for her
unconditional love, silent sacrifice and tolerance during my difficult times. I will be forever
grateful.
Relevant Publications
Parts of this thesis have been published or documented elsewhere. These publications are listed below.
Chapter 3:
• Dao, B., Nguyen, T., Venkatesh, S., and Phung, D. (2015). Nonparametric Discovery of Online Mental Health-related Communities. In Proceedings of the IEEE
International Conference on Data Science and Advanced Analytics (IEEE DSAA),
pages 1-10, Paris, France, October 2015.
• Dao, B., Nguyen, T., Venkatesh, S. and Phung, D. (2016). Latent Sentiment Topic
Modelling and Nonparametric Discovery of Online Mental Health-related Communities. International Journal of Data Science and Analytics, Springer. (Under
revision)
Chapter 4:
• Dao, B., Nguyen, T., Venkatesh, S. and Phung, D. (2014). Analysis of Circadian
Rhythms from Online Communities of Individuals with Affective Disorders. In
Proceedings of the International Conference on Data Science and Advanced Analytics (DSAA), pages 463-469, Shanghai, China, October 2014.
Chapter 5:
• Nguyen, T., Dao, B., Phung, D., Venkatesh, S. and Berk, M. (2013). Online Social
Capital: Mood, Topical and Psycholinguistic Analysis. In Proceedings of the AAAI
International Conference on Weblogs and Social Media (ICWSM), pages 449-456,
Boston, USA, July 2013.
Chapter 6:
• Nguyen, T., Phung, D., Dao, B., Venkatesh, S. and Berk, M. (2014). Affective
and Content Analysis of Online Depression Community. IEEE Transactions on
Affective Computing, 5(3), 217-226, IEEE.
Chapter 7:
• Dao, B., Nguyen, T., Phung, D. and Venkatesh, S. (2014). Effect of Mood, Social
Connectivity and Age in Online Depression Community via Topic and Linguistic
Analysis. In Proceedings of the International Conference on Web Information Systems Engineering (WISE), pages 398-407, Thessaloniki, Greece, October 2014.
• Dao, B., Nguyen, T., Venkatesh, S. and Phung, D. (2016). Effect of Social Capital on Emotion, Language Style and Latent Topics in Online Depression Community. In Proceedings of the IEEE-RIVF International Conference on Computing
and Communication Technologies, Hanoi, Vietnam, November 2016.
Chapter 8:
• Dao, B., Nguyen, T., Venkatesh, S. and Phung, D. (2016). Discovering Latent Affective Dynamics among Individuals in Online Mental Health-related Communities. In Proceedings of the IEEE International Conference on Multimedia and Expo
(ICME), pages 1-6, Seattle, USA, July 2016.
Contents
Acknowledgements
Relevant Publications
Abstract
Abbreviations
1 Introduction
1.1 Aims and Approaches
1.2 Contributions and Significance
1.3 Thesis Structure
2 Background
2.1 Social Media: An Overview
2.1.1 What is Social Media?
2.1.2 Social Media Categories
2.1.3 Core Functionalities of Social Media
2.2 Social Media for Health Care
2.2.1 Advantages of Using Social Media
2.2.2 Issues with Using Social Media
2.3 Weblogs for Social Media Analytics
2.4 Social Media Feature Extraction for Analytics
2.4.1 Mood and Affective Words
2.4.2 Psycholinguistic Features
2.4.3 Topic Modelling and Topic-Induced Features
2.5 Machine Learning for Social Media Analytics
2.5.1 Supervised Learning Approach
2.5.1.1 Classification Techniques
2.5.1.2 Classification Evaluation
2.5.2 Unsupervised Learning Approach
2.5.2.1 Clustering Techniques
2.5.2.2 Clustering Evaluation
2.5.3 Topic Modelling: Overview
2.5.3.1 Parametric Approaches
2.5.3.2 Bayesian Nonparametric Approaches
2.6 Conclusion
3 Nonparametric Discovery of Online Mental Health Communities
3.1 Datasets
3.2 Feature Extraction
3.2.1 Mood-based Features
3.2.2 Affective-based Features
3.2.3 Language style-based Features
3.2.4 Generic-word Topic Features
3.3 Nonparametric Discovery Methods
3.3.1 Dirichlet Process Mixture
3.3.2 Hierarchical Dirichlet Process
3.3.3 Community Representation
3.3.3.1 Latent Mood-based Community Representation
3.3.3.2 Mood Usage-based Community Representation
3.3.3.3 Latent ANEW-based Community Representation
3.3.3.4 ANEW Usage-based Community Representation
3.3.3.5 Latent ANEW-based over Community Representation
3.3.3.6 Latent Generic-word-based Community Representation
3.3.3.7 Psycholinguistic-based Community Representation
3.3.4 Community Clustering and Evaluation
3.3.4.1 Nonparametric Clustering
3.3.4.2 Evaluation of Clustering
3.4 Experiments and Analysis
3.4.1 Meta-community Discovery
3.4.1.1 Mood-based Meta-communities
3.4.1.2 ANEW-based Meta-communities
3.4.1.3 Generic-word-based Meta-communities
3.4.1.4 Psycholinguistic-based Meta-communities
3.4.2 Discussion on Discovered Meta-Communities
3.4.3 Topics of Interest in the Autism and Depression Cohorts
3.5 Conclusion
4 Analysis of Circadian Rhythms and Affective Disorders
4.1 Circadian Rhythm, Mental Health and Social Media
4.2 Data Study Cohorts
4.2.1 Clinical Cohort
4.2.2 Control Cohort
4.2.3 Feature Extraction
4.3 Circadian Rhythms in Online Communities
4.3.1 Circadian Rhythms Derived from Posting Behaviours
4.3.2 Circaseptan Rhythms of Posting Behaviours
4.3.3 Seasonal Posting Patterns
4.4 Experiment and Analysis
4.4.1 Analysis of Negative and Positive Affect
4.4.2 Analysis of Affective Mood Rhythms
4.5 Conclusion
5 Online Social Capital and Mental Health
5.1 Overview
5.2 Social Capital and Healthcare
5.3 Experiments and Mood Prediction
5.3.1 Dataset Description
5.3.2 Mood Usage of Online Social Capital
5.3.3 Mood Prediction on Online Social Capital
5.4 Topics and Language Styles of Online Social Capital
5.4.1 Feature Extraction
5.4.2 Hypothesis Testing and Classification
5.4.3 Analysis of Psycholinguistic Processes
5.4.4 Analysis of Latent Topics
5.4.5 Social Capital Classification
5.5 Conclusion
6 Affective and Content Analysis for Online Mental Health Communities
6.1 Introduction
6.2 Experiments
6.2.1 Datasets
6.2.2 Feature Extraction
6.2.3 Classification and Feature Selection
6.2.3.1 Blog post Classification
6.2.3.2 Community Classification
6.2.3.3 Feature Selection
6.3 Results and Analysis
6.3.1 Classification Performance
6.3.2 Sentiment Analysis
6.3.3 Analysis of Latent Topics
6.3.4 Analysis of Linguistic Styles
6.4 Conclusion
7 Influence of Social Connectivity, Emotion and Demography
7.1 Introduction
7.2 Hypotheses
7.3 Experimental Setup
7.3.1 Datasets
7.3.2 Feature Sets
7.3.3 Feature Selection and Classification
7.3.4 Evaluation of Classification
7.4 Experimental Results and Analysis
7.4.1 Statistical Hypothesis Testing Analysis
7.4.2 Classification Performance
7.4.3 Social Connectivity Impact
7.5 Conclusion
8 Egocentric Discovery of Affective Dynamics
8.1 Introduction
8.2 Framework
8.2.1 Affective Transitions
8.2.2 Joint Factor Analysis for Latent Affective Dynamics
8.2.2.1 Joint Factor Analysis
8.2.2.2 Latent Affective Factors
8.3 Experiments
8.3.1 Datasets
8.3.2 Results and Analysis
8.3.2.1 Patterns of Mood Transitions
8.3.2.2 Latent Factors of Affective Dynamics
8.4 Conclusion
9 Conclusion
9.1 Summary
9.2 Future Directions
Bibliography
List of Figures
2.1 Tag cloud of 132 predefined mood tags used in the datasets.
2.2 Examples of blog posts tagged as either ‘happy’ or ‘sad’ mood.
2.3 Top 150 ANEW words used in the content of the datasets.
2.4 Visualisation of Linguistic Inquiry Word Count (LIWC).
2.5 Diagram of main steps in a supervised machine learning.
2.6 Example of the supervised machine learning problem.
2.7 Diagram of key steps in an unsupervised machine learning.
2.8 Example of an unsupervised machine learning problem.
2.9 Graphical model representation for the hierarchical Dirichlet process (HDP).
3.2 HDP representation for latent topics discovery in each community.
3.3 Word cloud of latent topics inferred from the HDP.
3.4 Mood-based meta-communities by LMCR.
3.5 ANEW-based meta-communities by LACR.
3.6 Mood-based meta-communities by MUCR.
3.7 Psycholinguistic-based meta-communities by LIWCR.
3.8 Visualisation of the distance in the interest of HDP mood topics among communities and meta-communities.
3.9 Visualisation of the distance in the interest of HDP ANEW topics among communities and meta-communities.
3.10 Visualisation of the distance in the interest of HDP generic word topics among communities and meta-communities.
3.11 The proportion of mood-based topics being used in each meta-community.
3.12 The proportion of mood-based topics being used by each community of the meta-community of autism and depression communities.
3.13 The proportion of ANEW-based topics being shared between autism and depression communities.
4.1 Tag cloud of 24 moods tagged by the clinical cohort.
4.2 Tag cloud of 24 moods tagged by the control cohort.
4.3 Behaviour matrices of 5 clinical subgroups: bipolar, depression, self-harm, separation and suicide.
4.4 Behaviour matrices of the 5 control subgroups: fashion, food, parenting, pets and technology.
4.5 Distribution of posting activities by hour of day of two study cohorts.
4.6 Weekly posting patterns over 4 periods of the day.
4.7 Difference in average positive and negative emotion of the clinical group by hour of day.
4.8 Difference in average positive and negative emotions of the control group by hour of day.
4.9 Difference in weekly patterns of negative and positive emotions for the control and clinical groups.
4.10 Difference in affect by seasonal patterns.
4.11 Difference in valence of the clinical and control groups.
4.12 Difference in the use of the 24 moods of the clinical and control groups.
4.13 Top 20 moods tagged daily to blog posts by the clinical group.
4.14 Circadian rhythms of moods of the clinical group.
4.15 Circadian rhythms of moods of the control group.
5.1 Histogram of moods tagged in a large corpus of posts from the initial dataset.
5.2 Visualisation of 24 primary moods extracted from the data on the core affect model of emotions using valence and arousal scores.
5.3 Difference in valence and arousal between low-high online social capital cohorts for all six categories.
5.4 Prediction accuracy (IG features are used in M2 and M3).
5.5 Difference in the use of linguistic styles between low and high social capital cohorts, based on the number of followers.
5.6 F-measure for different algorithms in the classifications of low and high social capital cohorts, based on the number of friends.
6.1 ANEW usage by the clinical and control groups.
6.2 Mood usage by the clinical and control groups.
6.3 Differences in the use of mood tags and affective words for the control and clinical groups.
6.4 Visualisation of the distance of communities in mood-based representation.
6.5 Visualisation of the distance of communities in topic-based representation.
7.1 Age distribution in the online depression community.
7.2 Cloud visualisation of mood tags used in the community.
7.3 Examples of differences in the use of LIWC features between blog posts tagged with a low and high valence mood.
7.4 Features selected by Lasso to predict posts in the high valence mood group using topics and LIWC features as predictors.
7.5 Prediction performance.
7.6 Examples of differences in the use of top LIWC features between low and high valence mood groups.
8.1 A representation of the emotional states of an individual on the core affect model with 16 quadrants.
8.2 Overview of latent factors shared by the cohorts by NMF.
8.3 Visualisation of 132 LiveJournal pre-defined common mood tags from the experimental datasets.
8.4 Visualisation of the relationship between latent mood factors and mood tag usage in three cohorts.
8.5 Visualisation of the relationship between affective transitional states and hidden affective factors.
8.6 Latent factors and their contribution to each cohort.
8.7 Latent mood factors with their memberships.
8.8 Latent affective factors and their memberships.
List of Tables
3.1 Depression communities: Details of the number of members, number of posts and a self-description of each community.
3.2 Autism communities: Details of the number of members, number of posts and a self-description of each community.
3.3 General communities: Details of the number of members, number of posts and a self-description of each community.
3.4 A summary of community representations.
3.5 Clustering results with all different community representations.
3.6 Clustering performance with the 3 groups’ ground truth.
3.7 Clustering performance with the 11 subgroups’ ground truth.
3.8 Generic word-based topic meta-communities by LGWCR.
3.9 Mood-based topic meta-communities by LMCR.
3.10 ANEW-based meta-communities by LACR.
3.11 Differences in the topics of interest of the autism and depression groups.
4.1 Statistical breakdown of the online communities of five affective disorder subgroups of the clinical group.
4.2 Statistical breakdown of the online communities of five general subgroups of the control cohort.
5.1 Statistics of the social connectivity and social support categories used to define low and high online social capital groups for six categories.
5.2 The number of posts made by users in each social capital group across six categories.
5.3 Classification results for feature selection.
5.4 Accuracy in mood prediction for low and high online social capital cohorts.
5.5 The contingency table for the classification of online social capital groups.
5.6 Number of rejections in the Wilcoxon tests on the hypothesis of equal medians in the use of LIWC features between low and high social capital groups.
5.7 Number of rejections in the Wilcoxon tests on the hypothesis of equal medians in the use of latent topics between low and high social capital groups.
5.8 Difference in the use of top five latent topics of interest between low and high online social capital groups, based on number of followers.
6.1 Details of creation date, number of members, number of posts, and self-description of each community in the clinical group.
6.2 Details of creation date, number of members, number of posts, and self-description of the community in the control group.
6.3 Mean scores of psycholinguistic features for the posts written by bloggers from the control and clinical communities.
6.4 Regression models learned by Lasso to classify the clinical vs control communities.
6.5 Regression models learned by Lasso to classify blog posts.
6.6 Topics corresponding to the highest absolute value of the magnitude of the coefficients in the Lasso model used to predict depression posts.
6.7 Topics selected in the model to predict the emotions of online depression communities.
7.1 Statistics of social connectivity indicators used to define low and high social capital cohorts.
7.2 The number of topics and linguistic features rejected from the hypotheses testing of equal median in the use of topics and linguistic styles by cohort.
7.3 Examples of LIWC features selected by Lasso to predict blog posts of low versus high mood valence groups.
7.4 Strong joint features to identify the significant difference between high and low valence mood groups.
7.5 Joint features selected by Lasso for prediction of high and low valence mood groups.
7.6 Cloud visualisation of top selected topics in joint features for the classification of high and low valence mood groups.
7.7 Results of the statistical test for three feature sets on three cohorts of different social connectivity indicators.
7.8 Latent topics are found to be significantly different between low and high online social capital users (LF-HF category).
7.9 Results of the statistical test for significant difference in the use of language styles via 68 LIWC features between low and high online social capital cohorts.
7.10 Good predictors (highest standardized β coefficients) of the classification of users in low vs high social capital cohorts for each cohort.
7.11 The model to predict users in the LFo-HFo cohort using LIWC features as predictors.
7.12 The model to predict users in the LC-HC group by using LIWC features as predictors.
7.13 The model to predict users in the LF-HF cohort using LIWC features as predictors.
7.14 Topics as predictors of LF-HF bloggers in the depression community.
7.15 Topics as predictors of LFo-HFo bloggers in the depression community.
7.16 Classification performance of constructed models for the three cohorts.
8.1 Annotations of emotional transitions across 4 quadrants of the circumplex model of affect.
Abstract
Social media are an online means of interaction among individuals. People are increasingly using social media to discuss health issues and seek support. In addition, online
communities have become avenues for individuals to express their opinions and share information and advice on a variety of issues in their daily life, especially their health and
well-being. The rapid emergence of diverse social media resources provides a wealth of
information on various aspects of healthcare. Social media data can be mined for patterns and knowledge which can be leveraged to make useful inferences about population
health, especially mental health. To better understand and build knowledge from healthcare data, advanced data analytical techniques that can effectively transform the data into
meaningful information about health are required.
The first major contribution of this thesis is to investigate which social media data can be used, and how, to study novel problems in healthcare. By taking a data-driven approach, we use existing machine learning algorithms to induce cluster structures from the data. Each cluster is expected to reveal a meaningful sub-type from the
population data. To this end, we examine online communities with and without mental
health-related conditions, to investigate and identify the meta-clusters of these communities. We use probabilistic topic models to infer latent topics and patterns from the corpus
of affective information in the blog posts of the communities. Second, we aim to study
the characteristics of online mental health-related communities, which we refer to as the
clinical groups, in comparison with the control groups of general online communities. All
aspects of the blog posts, namely mood, written content and language styles, are found
to be significantly different between the two study cohorts. Identifying differences
in language styles and content topics that have good predictive power is an important
step in understanding social media and its use in healthcare, especially mental health.