
Disaster Tweets Classification in Disaster Response with BERT-based Models



I. Introduction

a. Background of the study

As various disasters take place around the globe, disaster management has become a matter of concern for governments, NGOs, and emergency agencies. With good, in-time disaster response, lives can be saved, damage minimized, and recovery efforts accelerated (Brown et al. 2017; Waeckerle 1991).

In recent years, social media platforms like Facebook, Twitter, and Snapchat have become

integral parts of an individual’s everyday life. Twitter, for example, sees an average of about 10 thousand tweets per second, corresponding to 36 million tweets sent per hour, 867 million per day, and 361 billion every year (M. 2023).

Twitter is well-known for its real-time engagement. In times of crisis, many people choose this platform as their first channel to post updates on the situation, report damage, call for help, give instructions, etc. (Abedin and Babar 2018). Thus, the platform has been used widely in the field of natural and human-made disasters, providing useful insights so that rescue teams can act effectively (Kapoor et al. 2018; Kim and Hastak 2018). However, due to the high volume and velocity of this data, it is nearly impossible to manually monitor and analyze it in search of informative posts, which calls for research into a solution that can automate such a task (Kaur 2019).

b. Statement of the problem

Creating an automatic classification algorithm that can detect informative tweets is challenging because of tweets’ unique properties. They are limited in size (maximum 280 characters) and may contain grammar errors, special characters, or unconventional vocabulary that infers different meanings (Nguyen et al. 2016). In real-world settings, tweet data can be very imbalanced across classes, having more samples in some labels than in others. These



characteristics can severely affect the ability of the classifier to learn patterns, make predictions, and accurately mine useful information relating to disasters (Ghosh, Bellinger, and Corizzo 2022).

Furthermore, apart from knowing the informativeness of social media posts, humanitarian organizations are interested in dividing these tweets into different categories to better coordinate their response (Nguyen et al. 2016). Thus, there are 2 problems to address:

i. Informative vs non-informative tweets: Most posts on Twitter may contain irrelevant information and are not useful for disaster management. In this case, a binary classification solution (labeling tweets as “informative” or “non-informative”) can be performed to retain only useful information.

ii. Information types of disaster tweets: Informative tweets can fall into different categories, like damage reports, requests for help, etc., that require more targeted actions. This is represented as a multi-label classification problem.

In consideration of the above issues, this research proposes a system to classify and retrieve disaster-related information from Twitter posts. Particularly, the system’s workflow will perform steps of (1) data cleaning, (2) binary filtering of informative tweets, and (3) classifying tweets into different information categories. For such classification tasks, this research utilizes an ensemble learning approach based on Bidirectional Encoder Representations from Transformers (BERT) models.
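The three-step workflow can be sketched as a simple filtering pipeline. The function below is an illustrative stand-in, not the actual implementation: `clean_tweet`, `binary_model`, and `multilabel_model` are hypothetical placeholders for the real cleaning routine and the BERT classifiers.

```python
# Hypothetical sketch of the proposed three-step workflow; the model
# arguments are illustrative stand-ins, not the paper's implementations.
def classify_tweets(tweets, binary_model, multilabel_model, clean_tweet):
    """Run (1) data cleaning, (2) informative filtering, (3) categorization."""
    results = []
    for raw in tweets:
        text = clean_tweet(raw)                     # step 1: data cleaning
        if binary_model(text) == "informative":     # step 2: binary filter
            results.append((text, multilabel_model(text)))  # step 3: categories
    return results

# Toy usage with trivial stand-in models:
demo = classify_tweets(
    ["Flood in the city! #help", "nice coffee today"],
    binary_model=lambda t: "informative" if "flood" in t else "other",
    multilabel_model=lambda t: ["caution_and_advice"],
    clean_tweet=str.lower,
)
```

Because non-informative tweets are discarded at step 2, the more expensive multi-label classifier only runs on the informative subset, which is the efficiency motivation discussed above.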

c. Purpose and significance of the study

The main objective of this study is to develop a precise and efficient machine learning system for disaster-related tweet classification by implementing BERT-based models and a robust processing workflow. To the best of my knowledge, no work has been reported where the binary classification of disaster tweets is chained with multi-label classification tasks. However, in light of the large



volume of Twitter data, this method contributes to achieving optimization when deploying the model to a production environment. Ultimately, the research provides a theoretical foundation for implementing a scalable, applicable tool that aids humanitarian organizations in enhancing situational awareness and expediting effective response strategies during times of crisis.

d. Research questions and objectives

Research questions:

i. How can we determine whether a tweet is disaster-related or not?

ii. How can a tweet be classified into different information categories relating to a crisis?

iii. How can both aforementioned tasks be combined in a single workflow?

iv. How accurate and effective is the proposed system in disaster tweet labeling?

II. Literature Review

a. Overview of the field

Recently, many methods have been developed to detect and extract crisis-related information from tweet data, ranging from location estimation and picture analysis to text mining techniques (Kaur 2019; Prasad et al. 2023). A lot of these approaches are based on traditional



machine learning models like SVM, Logistic Regression, or Gradient Boosting (Le 2022). However, with the rapid development of Deep Learning and NLP, research efforts in the field that implement these techniques have shown state-of-the-art performance on this classification task, especially when involving Bidirectional Encoder Representations from Transformers (BERT) models (Imran et al. 2018; Le 2022; Ma 2019; Ningsih and Hadiana 2021; Prasad et al. 2023; Zahera et al. 2019).

b. Previous studies and findings related to the topic

Considering the main objective of the research, this section will further discuss relevant literature on 3 topics:

1) Disaster & non-disaster tweets classification: A fine-tuned BERT-Large uncased architecture, with 24 transformer blocks, 16 attention heads, and 340M parameters, was proposed by Le (2022) to perform the customized classification task. He also built several machine-learning models and paired them with 4 different vectorizers (TF-IDF, Count Vector, Skip-gram Vector, and CBoW). Evaluated on the Kaggle dataset from the Natural Language Processing with Disaster Tweets competition, the BERT model shows significantly improved performance (F1 score = 0.88) compared to traditional models (max F1 score = 0.81).

On another Kaggle dataset, Ningsih and Hadiana (2021) saw a similar increase when implementing a BERT model. In particular, they added a 0.5 dropout layer to complete BERT’s pre-trained model. A dense layer with ReLU activation then follows to produce probabilities for the “real disaster” and “non-real disaster” labels. The model achieved an accuracy score of 0.85 on average.
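A minimal sketch of such a classification head (dropout followed by a dense layer) may clarify the architecture. It is an illustration, not Ningsih and Hadiana's code: the 768-dimensional vector stands in for a real BERT pooled output, the weights are random placeholders, and the final softmax normalization is added here so the two outputs form a probability distribution.

```python
import numpy as np

# Illustrative classification head: dropout -> dense (ReLU) -> probabilities.
# The "pooled" vector is a stand-in for BERT's pooled output; weights are
# random placeholders, not trained parameters.
rng = np.random.default_rng(0)
pooled = rng.standard_normal(768)

def head(x, W, b, drop_rate=0.5, training=False):
    if training:                               # dropout is active only in training
        mask = rng.random(x.shape) >= drop_rate
        x = x * mask / (1.0 - drop_rate)       # inverted-dropout scaling
    z = np.maximum(0.0, W @ x + b)             # dense layer with ReLU activation
    e = np.exp(z - z.max())
    return e / e.sum()                         # softmax over the 2 labels

W = rng.standard_normal((2, 768)) * 0.01       # 2 output units: disaster / not
b = np.zeros(2)
probs = head(pooled, W, b)                     # inference: dropout disabled
```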

Prasad et al. (2023) further highlighted the importance of incorporating a stochastic gradient descent optimizer for pre-training self-attention language models, using BERT with the AdamW optimizer to perform binary tweet classification. Through experimentation, their


In the same year, Ma (2019) collected 74,346 samples across 7 labels to perform tweet classification for disaster management. He created 4 BERT-based models: BERT-Base, BERT+NL, BERT+CNN, and BERT+LSTM. After testing, all BERT-based models surpassed baseline performance, with BERT+LSTM and BERT-Base working best. Additionally, although customized BERT models attain higher precision, the default BERT produces a higher recall score. Later on, Naaz, Abedin, and Rizvi (2021) discovered that using a balanced dataset and a suitable data-splitting strategy can solve this problem in Ma’s research and produce a better classification result.

3) Ensemble learning approaches to tweet classification: Several multi-modal systems have been proposed to solve the tweet classification problem. Kumar et al. (2022) presented a deep multi-modal neural network that combines long short-term memory and VGG-16 networks to identify disaster-related content using text and images together. Even with tweet texts only, the system achieved F1-scores varying from 61% to 92%. Additionally, using Twitter images with semantic descriptors (annotations), Rizk et al. (2019) adopted a two-level multi-modal classification scheme which achieved 92.43% accuracy with computational efficiency. Ensemble learning that uses BERT models has also been gaining popularity in recent years (Cui et al. 2023; Mnassri et al. 2022; Xu, Barth, and Solis 2020); however, it has not yet been implemented for this problem.



c. Gaps in the literature and the need for the proposed study

A pattern that we see in the studied literature is that although BERT-based models performed better than traditional machine learning and some deep learning models on the task (with ~80% accuracy / F1-score), they still fall behind multi-modal mechanisms in predicting disaster tweets. Hence, as an effort to further enhance the performance of their implementation, this research creates a novel multi-modal system that utilizes BERT-based classifiers for mining crisis-related information from Twitter data. Considering the large volume of data in a real-world implementation of such a system, we also add a BERT binary classification model that can filter out irrelevant tweets before categorizing them into different labels. Potentially, these design suggestions would bring higher accuracy and efficiency to the task compared to single BERT-based models.

d. Theoretical framework or conceptual framework

In this section, we will discuss several theoretical concepts that guide the design of our proposed system.

1) BERT & BERT-based models:

BERT is a pre-trained language model proposed by Google AI Language researchers (Devlin et al. 2019) that has become state-of-the-art in natural language processing. In the original research paper, Devlin et al. (2019) explained that BERT uses the Transformer, an attention mechanism that learns the contextual relationship between words or tokens in a text. BERT’s distinct characteristic is that it reads text from both directions (bidirectionally), which enables it to understand context based on a text’s entire surroundings.
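The bidirectional reading shows up directly in the attention weights: every token attends to every position, to its left and to its right. A minimal single-head, scaled dot-product self-attention sketch (with toy dimensions, not BERT's actual sizes or weights) illustrates the mechanism:

```python
import numpy as np

# Minimal scaled dot-product self-attention, the core of the Transformer.
# Each token's output mixes information from ALL positions (left and right),
# which is what makes the encoding bidirectional. Dimensions are toy values.
def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])            # token-to-token affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # softmax over all positions
    return weights @ V                                # context-mixed representations

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 8))                       # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                   # one vector per token
```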

Given that BERT is trained on a massive amount of data, transfer learning with BERT-based models (fine-tuned models built on top of the BERT architecture) achieves high performance in specific NLP tasks like text classification, even with limited labeled data (Naaz et al. 2021).


3) Ensemble deep learning:

Ensemble learning is a machine learning approach that combines multiple models to perform a given task. An ensemble comprises a group of base learners that are trained for the same problem, then integrated to attain a final result (Mnassri et al. 2022). The application of ensemble learning helps capture diverse patterns; through this, the risk of overfitting/underfitting is mitigated, and significantly more robust prediction/classification results are achieved.

A comprehensive study of ensemble deep learning was conducted by Ganaie et al. (2022), which details different ensemble mechanisms: classical methods like boosting and stacking, and fusion strategies (which are paired more often with deep learning) like unweighted model averaging or majority voting. To improve disaster tweet classification, we will implement and evaluate our system with the majority-voting strategy.
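Hard majority voting is simple to state in code: each base learner casts one vote per sample and the most common label wins. The sketch below uses hypothetical label names for illustration; it is the generic strategy, not our trained ensemble.

```python
from collections import Counter

# Hard majority voting over the per-sample predictions of several models.
def majority_vote(predictions):
    """predictions: list of per-model label lists, one label per sample."""
    per_sample = zip(*predictions)          # regroup votes by sample
    return [Counter(votes).most_common(1)[0][0] for votes in per_sample]

# Three base classifiers voting on four tweets (hypothetical labels):
votes = [
    ["informative", "other", "informative", "other"],        # model 1
    ["informative", "informative", "other", "other"],        # model 2
    ["informative", "other", "informative", "informative"],  # model 3
]
final = majority_vote(votes)
# -> ["informative", "other", "informative", "other"]
```

With an odd number of base learners and two labels, ties cannot occur, which is one reason voting ensembles often use an odd ensemble size.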

III. Methodology


a. Dataset

Our analyses are based on the publicly available dataset from the Kaggle Natural Language Processing with Disaster Tweets competition. The dataset consists of 10,083 tweets which have been labeled as 0 and 1 for “not disaster” and “disaster” (Ningsih and Hadiana 2021). Real disaster tweets account for 42.97% of the dataset. We eliminate other metadata and keep only the tweet text for processing.

To perform the multi-label classification task, we distribute tweets into 7 information types, following the taxonomy in the paper of Ma (2019). These labels are:

Caution and advice: Warnings, tips, and advice by concerned authorities and individuals.

Infrastructure and utilities damage: Posts related to damaged objects, buildings, and services.

Donations and volunteering: Regarding the donation of food, clothes, medicines, human power, etc.

Affected individuals: Information about injured or dead people, or disaster victims.

Sympathy and emotional support: Posts spreading prayers and wishes.

Other useful information: Information relating to a disaster.

Not related or not informative: Tweets not related or not useful for disaster management.

From these tweets, we create 3 datasets:

i. D1 – For binary classification: This dataset will contain 10,083 records and have 2 features: tweet texts and binary labels (0 or 1). Since the classes are quite balanced, we will not perform further data augmentation techniques.



ii. D2 – For multi-label classification: This dataset contains 4,333 samples (omitting data labeled “Not related or not informative”) and also has 2 features: tweet text and information type. To deal with the imbalance problem, we apply sampling methods to convert this into a balanced dataset with an equal distribution of tweets in each label.

iii. D3 – For testing the overall system: Containing all records with multi-labeling. This dataset will be used to evaluate the performance of our proposed system against multi-label tweet classification models.

On all the datasets, data is divided with a ratio of 80% for training and 20% for testing.
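The 80/20 split can be sketched as follows; the fixed seed is an illustrative choice for reproducibility, not a value from the study.

```python
import random

# Shuffle-then-cut 80/20 train/test split; seeded for reproducibility.
def split_80_20(records, seed=42):
    shuffled = records[:]                     # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

train, test = split_80_20(list(range(100)))   # 80 training, 20 test records
```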

b. Data Pre-processing

i. Data Cleansing: First, all tweets are transformed to lowercase. Regular expressions are used to remove common elements like emails, URLs, HTML tags, emojis, special characters, etc. Then, abbreviations are replaced with the expressions we mapped them to. For fluency purposes, we do not remove stop words (Naaz et al. 2021).
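A simplified sketch of these cleansing steps is shown below; the abbreviation map and regex patterns are illustrative approximations, not the exact rules used in the study.

```python
import re

# Illustrative tweet-cleansing routine; the abbreviation map is a made-up
# example and the regexes are simplified approximations of the steps above.
ABBREVIATIONS = {"pls": "please", "govt": "government"}

def clean_tweet(text):
    text = text.lower()                                  # lowercase first
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = re.sub(r"\S+@\S+", " ", text)                 # email addresses
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"[^\x00-\x7f]", " ", text)            # emojis / non-ASCII
    text = re.sub(r"[^a-z0-9\s']", " ", text)            # special characters
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]
    return " ".join(words)                               # stop words are kept

cleaned = clean_tweet("Pls HELP! Flood at <b>Main St</b> http://t.co/x 🌊")
# -> "please help flood at main st"
```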

ii. Word Embedding: Word embedding is a technique to represent words in a form that can be understood by machine learning algorithms (Prasad et al. 2023). We use the TF-IDF method, which is based on word weighting (Le 2022), as the embedding mechanism.
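A from-scratch sketch of the TF-IDF weighting makes the mechanism concrete; a production system would typically use a library vectorizer, and smoothing conventions vary between implementations.

```python
import math

# Minimal TF-IDF: weight = (term frequency in doc) * log(N / document freq).
def tf_idf(docs):
    n = len(docs)
    vocab = sorted({w for d in docs for w in d.split()})
    df = {w: sum(w in d.split() for d in docs) for w in vocab}
    vectors = []
    for d in docs:
        words = d.split()
        vec = {}
        for w in vocab:
            tf = words.count(w) / len(words)     # term frequency
            idf = math.log(n / df[w])            # inverse document frequency
            vec[w] = tf * idf
        vectors.append(vec)
    return vectors

vecs = tf_idf(["flood warning issued", "flood victims need help"])
# "flood" occurs in every document, so its idf (and weight) is 0;
# rarer words like "warning" get positive weights.
```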

c. Model Selection

To find the best candidate models for our multi-modal system, this study will also propose and measure the performance of several BERT-based models in our 2 main tasks:

Binary tweets classification models:

i. BERT-Base: Default BERT model with hyperparameter tuning, used as the baseline model.

ii. Fine-tuned BERT: Add a sequential non-linear layer on top of the BERT-Large model.


Multi-label tweets classification models:

i. BERT-Base: Default BERT model with hyperparameter tuning, used as the baseline model.

ii. Fine-tuned BERT: Use pre-trained BERT to obtain word vectors, then build 10 stacked layers on top of the BERT outputs (Zahera et al. 2019).

iii. BERT with LSTM: Stacking a bidirectional LSTM on the default BERT model (Ma 2019; Naaz et al. 2021).

We use 2 variations of the cross-entropy loss function for training in the 2 cases:

Binary cross-entropy loss:

L = −∑_{c=1}^{C′} x′_c · log(x_c)

where C′ = 2 is the number of binary classes, x′ is the true distribution of a data point, and x is the model’s predicted distribution.

Multiclass cross-entropy loss:

L = −∑_{c=1}^{M} y_{o,c} · log(p_{o,c})

where M is the number of classes, y_{o,c} is the binary indicator (0 or 1) that class c is the correct label for observation o, and p_{o,c} is the predicted probability that observation o is of class c.

The models’ performances are evaluated based on the F1 score, the harmonic mean of the Precision and Recall scores:

F1 = (2 × Precision × Recall) / (Precision + Recall)
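A small numeric sketch of these definitions, with illustrative values only (the distributions and precision/recall figures below are made up, not results from the study):

```python
import math

# Cross-entropy over a true distribution and a predicted distribution.
# Works for the binary case (2 classes) and the multiclass case alike.
def cross_entropy(true_dist, pred_dist):
    # 0 * log(p) contributes nothing, so only the true class's term matters
    # for one-hot targets; pred_dist entries must be positive where used.
    return -sum(t * math.log(p) for t, p in zip(true_dist, pred_dist) if t > 0)

def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

loss = cross_entropy([1.0, 0.0], [0.9, 0.1])   # confident, correct prediction
f1 = f1_score(0.8, 0.6)                        # illustrative precision/recall
```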

Based on this, we will choose the best-performing models to implement in our system.

d. Multi-modal system proposal and evaluation

Our system architecture consists of 4 layers:

