
Improving communication efficiency of Federated
Learning systems

Dinh Thi Quynh Khanh
Department of Computer Vision, MICA
Hanoi University of Science and Technology
Supervisor
Assoc. Prof. Le Thi Lan
In partial fulfillment of the requirements for the degree of
Master of Computer Science
August 4, 2023


Acknowledgements

First of all, I would like to express my sincere gratitude towards my advisors,
Assoc. Professor Le Thi Lan and Assoc. Professor Tran Thi Thanh Hai.
I still remember the first day I came to the office to see them to start my
master's study, and now that I look back to the day Professor Le Thi Lan
assigned me to study the exciting topic of Federated Learning, I realize
what a wonderful journey it has been, going from zero knowledge in the
area to building a tool that can help the research community and practitioners.
I have learned many valuable lessons from my advisors on becoming a better
problem solver. I realized that I need to value all of my (failed) attempts
and imperfect initial works, and see them as a starting point. My advisors have
taught me that failures are just temporary and that instead of being discouraged
by them, I should use them as feedback to polish my work and hence move myself
forward. Thanks to them, I see the beauty of research and finally decided
to pursue a PhD in the field; but more importantly, I truly know who I
want to become and how I could contribute to our community with my
work.


I would also like to thank every member of the Comvis laboratory. I had
so much joy and so many memories there, and they always made me feel welcome.
I could not ask for better fellows.
Finally, I would love to thank my family for their support and encouragement
throughout my years of study and through the process of researching
and writing this thesis. This accomplishment would not have been possible
without them. From the bottom of my heart, thank you!


This material is based upon work supported by the Air Force Office of Scientific Research under award number FA2386-20-1-4053 and Hanoi University
of Science and Technology (HUST) under project number T2021-SAHEP003.


Abstract
In recent years, Computer Vision has witnessed a revolution in different tasks such as object detection, object classification, and action recognition.
Much of this success is based on collecting vast amounts of data, often in
a privacy-invasive manner. Federated Learning (FL) is a new discipline
of machine learning that allows training models while the data remain decentralized, i.e., instead of sharing data, participants collaboratively train a
model by only sending weight updates to a server. Therefore, FL can help
to protect data privacy and utilize the computing power of the huge
number of endpoint devices. This motivates us to investigate the use of FL
in computer vision tasks.
However, SOTA methods for computer vision tasks are mainly based on
deep learning models with a huge number of parameters to train.
Therefore, training models for computer vision tasks in a
federated manner faces several obstacles.
Firstly, FL often incurs significantly more communication overhead
than centralized learning, since the model parameters have to be
exchanged between clients and the central server during the training process.
Secondly, the study of FL for computer vision tasks requires running experiments with complicated programs, as FL is more sophisticated than
conventional training in terms of implementation, making the field at times
accessible only to researchers and engineers with strong engineering skills,
especially for human action recognition (HAR) tasks.
Therefore, in this thesis, we attempt to push forward the development of
FL for computer vision by solving these two challenges.


To mitigate the first issue, we propose a model weight compression and
encoding scheme applied during model upload for Federated Averaging (FedAvg), a
widely used algorithm in FL. Our weight compression is inspired by the Sparse
Ternary Compression algorithm, with a modification to make it applicable to
FedAvg. We also exploit the characteristics of the compressed weights to encode them,
so that the communication cost can be further reduced. The proposed method has
been evaluated on one computer vision task, namely image classification.
Experimental results on the MNIST dataset demonstrate that our method is
able to reduce the communication cost without considerably worsening the
model accuracy.
Regarding the second challenge, although some FL frameworks
have recently been developed to facilitate the study and application of FL in
specific tasks, those frameworks are unfortunately limited to testing deep
models on object classification tasks. Therefore, we introduce a novel framework, called FlowerAction, to study FL-based human action recognition
from video data. To the best of our knowledge, this is the first FL framework
for video-based action recognition. FlowerAction is built upon the Flower
framework and incorporates various techniques, including data loading, data
partitioning, widely used network architectures for HAR (e.g. SlowFast,
I3D, R3D), and FL aggregations (e.g. FedAvg, FedBN, FedPNS, and STC).
First, we present the main components of FlowerAction, which are developed on top of the existing Flower framework to interface with the external
components (data loaders, deep models) and the internal ones of Flower (model
communication and aggregation algorithms). We then demonstrate the effectiveness of deep models and FL algorithms in recognizing human actions
using benchmark datasets (e.g. HMDB51 and EgoGesture) for both simulation and real distributed training. Our experimental results show that
the FlowerAction framework operates properly, helping researchers to
approach FL with less effort and conduct federated training benchmarks
quickly. Furthermore, our analysis of the performance of different FL
algorithms in terms of top-k accuracy and communication cost could
give instructive suggestions for the selection and deployment of a deep
model for FL-based action recognition in the future.


Contents

List of Acronyms

1 Introduction
   1.1 Motivation
   1.2 Problem definition
   1.3 Contributions
   1.4 Outline of the thesis

2 Related works
   2.1 Communication-efficient Federated Learning
   2.2 Video-based human action recognition (HAR) and federated learning for HAR
      2.2.1 Deep learning models for HAR
      2.2.2 Federated learning for HAR
   2.3 Federated learning frameworks
      2.3.1 LEAF
      2.3.2 FedVision
      2.3.3 Flower
      2.3.4 FedCV
   2.4 Conclusions

3 Communication cost reduction using sparse ternary compression and encoding for FedAvg
   3.1 Proposed method
      3.1.1 Layer-wise sparse ternary compression
      3.1.2 Model weight encoding
   3.2 Experiments
   3.3 Conclusion

4 FlowerAction - A Federated Learning framework for Human Action Recognition
   4.1 Proposed framework
      4.1.1 Framework overview
      4.1.2 Workflow of FlowerAction framework
      4.1.3 FL algorithms
         4.1.3.1 FedAvg (Federated Averaging)
         4.1.3.2 FedBN (Federated BatchNorm)
         4.1.3.3 FedPNS (Federated Probabilistic Node Selection)
         4.1.3.4 STC (Sparse Ternary Compression)
      4.1.4 Deep learning models for human action recognition
         4.1.4.1 C3D
         4.1.4.2 R3D
         4.1.4.3 Inflated 3D ConvNet - I3D
         4.1.4.4 SlowFast
      4.1.5 Data pipeline
         4.1.5.1 Data pre-processing
         4.1.5.2 Data partitioning
      4.1.6 Characteristics of FlowerAction
   4.2 Experiments
      4.2.1 Human action recognition datasets
      4.2.2 Implementation and setup
      4.2.3 Experimental results
   4.3 Conclusion

5 Conclusions
   5.1 Summary of achievements and limitations
   5.2 Future works

References

List of Figures

1.1 Next word prediction application in mobile phones [1]
2.1 The overall architecture of FedVision [2]
2.2 The overall architecture of Flower [3]
2.3 Experiments on edge devices are supported by Flower [3]
3.1 Overview of the proposed method. (1) Server sends the global model to clients; (2) Clients update the model with their local data; (3) Clients apply the proposed weight compression and encoding techniques; (4) Clients send local models to the server; (5) Server updates the global model.
3.2 Samples from the MNIST handwritten digit dataset [4]
3.3 Network architecture used in our experiments for MNIST classification
3.4 Accuracy at different values of compression factor
3.5 Relationship between communication cost and compression factor
3.6 Confusion matrix at p = 73% with non-IID data distribution
3.7 Confusion matrix at p = 78% with IID data distribution
4.1 FlowerAction's main flow
4.2 FlowerAction's architecture. Dashed blocks indicate work in progress.
4.3 Flower's template
4.4 FlowerAction's data pipeline
4.5 IID data partition in FlowerAction
4.6 Programming interface of FlowerAction's client
4.7 FlowerAction's Client API
4.8 Programming interface of FlowerAction's server
4.9 FlowerAction's Server API
4.10 Some examples from the HMDB51 dataset [5]
4.11 Some examples from the EgoGesture dataset [6]
4.12 Number of samples per class of client 0 and client 10 from HMDB51 in non-IID data partition, C = 20, frac = 0.25
4.13 Number of samples per class of client 0 and client 5 from EgoGesture in non-IID data partition, C = 8, frac = 0.25
4.14 Top-1 accuracy of FedAvg, FedBN and FedPNS on HMDB51 with simulated clients
4.15 Top-1 accuracy of FedAvg, FedBN and FedPNS on EgoGesture with simulated clients
4.16 Number of nodes kept after Optimal Aggregation on HMDB51, C = 70, frac = 0.1, IID data
4.17 Number of nodes kept after Optimal Aggregation on HMDB51, C = 70, frac = 0.1, non-IID data
4.18 Cumulative communication cost of FedAvg and STC (p = 0.04) on HMDB51 through the rounds
4.19 Top-1 accuracy by epoch of FedAvg and STC with p = 0.04 and p = 0.01 on HMDB51
4.20 t-SNE visualization on HMDB51 test set with model trained with FedAvg
4.21 t-SNE visualization on EgoGesture test set with model trained with FedAvg
4.22 t-SNE visualization on HMDB51 test set with model trained with STC through the rounds


List of Tables

3.1 Parameters used in the experiments
3.2 Compression factor, communication cost and accuracies obtained in two scenarios (IID and non-IID) of the proposed method
4.1 Centralized training configurations
4.2 Top-1 accuracy (%) on HMDB51 and EgoGesture datasets in the literature
4.3 Top-1 accuracy from simulated federated training on HMDB51
4.4 Top-1 accuracy from simulated federated training on EgoGesture
4.5 Framework-based experiment results with FedAvg
4.6 Framework-based experiment results with STC


List of Acronyms

C3D Convolutional Three Dimensional.
CNN Convolutional Neural Network.
DL Deep Learning.
FedAvg Federated Averaging.
FedBN Federated BatchNorm.
FedPNS Federated Probabilistic Node Selection.
FL Federated Learning.
HAR Human Action Recognition.
IID Independent and Identically Distributed.
LSTM Long Short-Term Memory.
NN Neural Network.
RNN Recurrent Neural Network.
STC Sparse Ternary Compression.
SVM Support Vector Machine.


Chapter 1
Introduction
1.1 Motivation

Artificial Intelligence (AI) is developing at an unprecedented rate with the rise of Deep
Learning (DL): new exciting research papers come out daily, along with astonishing
applications that demonstrate intelligence close to human level. The
success of deep learning comes primarily from the availability of large data and computation power. However, one of the main challenges that DL models currently face in
this digital era is the privacy of data, since these data can be so sensitive and private
that their owners are usually not willing to share them, which hinders the process
of collecting datasets for DL. Therefore, AI needs to ensure that the privacy of users'
data is not violated, and the study of training paradigms that protect data confidentiality is more and more in demand. Besides, with the rapid evolution of
technology in general, edge devices are now more capable in terms of computation power.
One question arises in that situation: “Can we train the model without gathering
all the data to a central server?”. Fortunately, Federated Learning (FL), which
studies algorithms that allow several machines to collaboratively train a model, is
the answer. With FL, not only is the privacy of user data protected, but the
computational power of a sheer number of devices is also fully utilized [7].
FL, first introduced by scientists at Google in 2017 [8], is a machine learning approach
that allows data to remain on local devices while a global model is trained. It
is a decentralized approach to machine learning, where multiple devices collaborate to
train a shared model without directly exchanging raw data.
The idea behind FL is to address the challenge of training machine learning models
on sensitive or confidential data that cannot be transferred to a central server. This
approach enables organizations to train models on data that is distributed across various devices or servers, without transferring the data to a centralized server. Instead,
only the model updates are sent from the devices to a central server where they are
aggregated to improve the global model.
FL has several advantages over traditional centralized machine learning approaches
and FL methods play a critical role in supporting privacy-sensitive applications where
the training data are distributed at the edge devices. Examples of potential applications include: learning sentiment, semantic location, or activities of mobile phone
users; adapting to pedestrian behavior in autonomous vehicles; and predicting health
events like heart attack risk from wearable devices [9], [10], [11]. We discuss several
applications of FL below:
• Smart phones. By jointly learning user behavior across a large pool of mobile
phones, statistical models can power applications such as next-word prediction,
face detection, and voice recognition [1], [12]. However, users may not be willing
to share their data in order to protect their personal privacy or to save the
limited bandwidth/battery power of their phone. FL has the potential to enable
predictive features on smart phones without diminishing the user experience or
leaking private information.
• Organizations. Organizations or institutions can also be viewed as ‘devices’ in the
context of FL. For example, hospitals are organizations that contain a multitude
of patient data for predictive healthcare. However, hospitals operate under strict
privacy practices, and may face legal, administrative, or ethical constraints that

require data to remain local. FL is a promising solution for these applications
[10], as it can reduce strain on the network and enable private learning between
various devices/organizations.

Figure 1.1: Next word prediction application in mobile phones [1]
• Internet of things. Modern IoT networks, such as wearable devices, autonomous
vehicles, or smart homes, may contain numerous sensors that allow them to
collect, react, and adapt to incoming data in real-time. For example, a fleet of
autonomous vehicles may require an up-to-date model of traffic, construction, or
pedestrian behavior to safely operate. However, building aggregated models in
these scenarios may be difficult due to the private nature of the data and the
limited connectivity of each device. FL methods can help to train models that
efficiently adapt to changes in these systems while maintaining user privacy [11],
[13].
While FL offers many benefits over centralized machine learning, it also presents
some challenges. Here are some of the main challenges in FL:
Heterogeneous data: In FL, heterogeneous data refers to data that is distributed
across different devices or servers and has different statistical properties or data distributions. This can occur when the data comes from different sources, such as different
regions, different types of devices, or different user groups.
The problem with heterogeneous data is that it can make it challenging to train a
single global model that performs well on all devices. This is because a global model
that is optimized for one device may not perform well on another device with different
data distributions.
To address the challenge of heterogeneous data in FL, researchers have developed
various techniques, such as model personalization [14], where a separate model is
trained for each device, or transfer learning, where the global model is initialized with
a pre-trained model and fine-tuned for each device. Another approach is to use
meta-learning, where the model learns to adapt to different data distributions across devices
[15].
Addressing heterogeneous data in FL is an active area of research, and developing
effective techniques for dealing with heterogeneity is crucial for the success of FL in
practical applications.
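To make the notion of heterogeneous data concrete, the small sketch below contrasts an IID split with a simple label-skewed (non-IID) split of a dataset across clients. The function name `partition` and the contiguous-shard scheme are illustrative only, one common way to simulate heterogeneity, not a method proposed in this thesis.

```python
import random

def partition(labels, num_clients, iid=True, seed=0):
    """Return, for each client, a list of sample indices.

    iid=True  -> indices are shuffled, so every client sees all classes.
    iid=False -> indices are sorted by label, producing label skew:
                 each client ends up with only a few classes.
    """
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    if iid:
        rng.shuffle(idx)                   # mix classes uniformly
    else:
        idx.sort(key=lambda i: labels[i])  # group by label -> skew
    shard = len(idx) // num_clients
    return [idx[c * shard:(c + 1) * shard] for c in range(num_clients)]

# Toy example: 50 samples of class 0 and 50 of class 1, two clients.
labels = [0] * 50 + [1] * 50
parts = partition(labels, 2, iid=False)
# Non-IID: client 0 holds only class 0, client 1 holds only class 1.
```

Under the non-IID split each client's local optimum differs, which is exactly why a single global model can degrade, as described above.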
Communication costs: Communication cost is a significant challenge in FL,
particularly for devices with limited bandwidth or connectivity. In FL, the local devices
or servers compute the gradient of the loss function with respect to their local model
parameters and send them to the central server for aggregation. However, this process
can be computationally expensive and can lead to high communication costs.
The communication cost challenge in FL can be addressed in several ways. One
approach is to reduce the amount of data that needs to be communicated between
devices and the central server. This can be done by compressing the model updates or
by using techniques such as quantization [16], sparsification [17], or differential privacy
[18] to reduce the amount of information that needs to be transferred.
Another approach is to use local model aggregation, where the model updates from
different devices are aggregated locally before being sent to the central server. This
can reduce the communication costs by reducing the number of messages that need to
be sent to the central server.
Moreover, researchers are exploring new approaches to reduce communication costs
in FL, such as using hierarchical or decentralized architectures, or developing new compression techniques that can reduce the size of the model updates without sacrificing
accuracy [19].
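As a concrete illustration of the sparsification idea, the sketch below keeps only the k largest-magnitude entries of an update and replaces each with a shared magnitude, in the spirit of Sparse Ternary Compression (STC). This is our own simplified sketch, not the exact algorithm of chapter 3 or of any library.

```python
def sparse_ternary(update, k):
    """Keep only the k largest-magnitude entries of `update`,
    replacing each with +/- mu, where mu is the mean magnitude of
    the kept entries; all other entries become 0."""
    # Indices of the k entries with the largest absolute value.
    idx = sorted(range(len(update)),
                 key=lambda i: abs(update[i]), reverse=True)[:k]
    mu = sum(abs(update[i]) for i in idx) / k
    out = [0.0] * len(update)
    for i in idx:
        out[i] = mu if update[i] > 0 else -mu
    return out

delta = [0.9, -0.1, 0.05, -0.8, 0.2]
compressed = sparse_ternary(delta, 2)
# Only the entries 0.9 and -0.8 survive, both with magnitude ~0.85.
```

Only the k surviving positions and a single magnitude value then need to be transmitted, which is what makes the compact encodings discussed above possible.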
Solving the communication cost challenge in FL is crucial for making FL practical
for real-world applications, particularly for devices with limited connectivity or
bandwidth.
Privacy and security: FL aims to preserve the privacy and security of user data
by keeping it on local devices. However, ensuring data privacy and security can be
challenging, particularly if the local devices are not trusted or secure.
In the federated setting, the data remains on local devices or servers, and only
the model updates are sent to the central server for aggregation. However, this process
can still pose privacy and security risks, particularly if the local devices or servers are
not trusted or secure.
One of the main challenges in ensuring privacy and security in FL is to prevent the
central server from learning any sensitive information from the local devices. To address this challenge, researchers have developed various techniques, such as differential
privacy, secure aggregation, and homomorphic encryption.
Differential privacy is a technique that adds noise to the model updates to prevent
the central server from learning any sensitive information [20]. Secure aggregation is
a technique that allows the central server to aggregate the model updates without
learning any sensitive information [21]. Homomorphic encryption is a technique that
allows the central server to compute on encrypted data without decrypting it, thereby
preserving the privacy of the data [22].
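A toy numerical sketch of the differential-privacy idea described above: clip each client update to bound its sensitivity, then add Gaussian noise before it leaves the device. The clipping bound and noise scale here are hypothetical hyperparameters, and the sketch omits the privacy accounting a real mechanism requires.

```python
import math
import random

def privatize(update, clip=1.0, sigma=0.5, seed=0):
    """Clip `update` to L2 norm <= clip, then add Gaussian noise."""
    rng = random.Random(seed)
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip / norm) if norm > 0 else 1.0
    clipped = [x * scale for x in update]          # bound sensitivity
    return [x + rng.gauss(0.0, sigma) for x in clipped]  # mask contribution

# With sigma = 0 the function only clips: [3, 4] has norm 5,
# so it is scaled down to norm 1, i.e. roughly [0.6, 0.8].
clipped_only = privatize([3.0, 4.0], sigma=0.0)
```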
Moreover, researchers are exploring new approaches to ensure privacy and security
in FL, such as developing new cryptographic techniques, secure multi-party computation, and blockchain-based approaches [23].
Ensuring privacy and security in FL is essential for applications in domains where
privacy and security are critical, such as healthcare or finance.
Limited computational resources: Computation is also a significant challenge
in FL, particularly for devices with limited computational resources. In FL, the local

devices or servers compute the gradient of the loss function with respect to their local
model parameters and send them to the central server for aggregation. However, this
process can be computationally expensive and can lead to high computation costs,
particularly for devices with limited processing power.


The computation challenge in FL can be addressed in several ways. One approach is
to use model compression techniques, such as pruning [24], quantization, or knowledge
distillation [25], to reduce the size of the models and make them more efficient to
compute. This can reduce the computational overhead on the local devices or servers
and reduce the communication costs between the devices and the central server.
Moreover, researchers are exploring new approaches to reduce the computation
challenge in FL, such as developing new algorithms that can be trained using less
computation or using hardware accelerators, such as GPUs or TPUs, to speed up the
computation.
Addressing these challenges requires developing new FL algorithms that can handle
heterogeneity, limited computational resources, and non-IID data, while also ensuring
data privacy and security. Additionally, it requires finding ways to reduce communication costs and developing robust methods for aggregating model updates from different
devices.
From a researcher's and practitioner's perspective, one difficulty is that, since FL is
still relatively new, there are still only a few tools for benchmarking and deployment in
the field, as FL is innately much more complicated than centralized model training
in terms of software development. It can be observed that the implementation of an
FL research paper is usually done on top of a pure DL framework (e.g. TensorFlow or
PyTorch). Even with the arrival of new FL frameworks, which can help to conduct real
distributed training experiments, the amount of work aimed at specific DL
problems beyond standard Computer Vision and Natural Language Processing
tasks, such as image classification and next-word prediction, is still limited.
One of the potential applications of FL is the video-based human action recognition
(HAR) problem. HAR is one of the most important Computer Vision tasks and has
been an attractive research topic in recent years due to its various applications. In the
literature, numerous methods have been proposed to solve this problem [26]. Some
primary challenges (e.g. the high number of action categories [27], variation in performing
the actions [28], variation in lighting conditions and complex backgrounds, and viewpoint
changes [29; 30]) have been gradually overcome by improving model architectures [31],
utilizing domain adaptation [32], or multimodality combination [33], etc. However,
there exist some further issues. Since videos are of enormous volume and cameras continuously provide data, transferring such large amounts of data would
be time-consuming and lead to latency. Besides, video data are often sensitive, especially in the fields of medicine, in-home surveillance, and financial transactions.
In such contexts, privacy is strictly required. As a result, the strategy of transferring all
of the data to a server and then performing annotation and training may be unsuitable for
such applications. Therefore, FL can be a secure paradigm to solve these issues, as FL
leverages the computational power of a large number of edge devices while preserving
the privacy of the data.
Investigation of FL-based action recognition requires answers to the following questions: which deep learning model is suitable for deployment in a federated setting;
what is the acceptable communication cost; how can the performance be improved; and
what is the strategy to select good clients? It is then necessary to provide FL frameworks to quickly experiment with FL algorithms and answer the above questions.
While some FL frameworks have recently been developed and made available to users/researchers,
such as LEAF [34], Flower [3], FedCV [35], and FedVision [36], they are largely limited to
simulated data or to specific domains. Particularly in the computer vision field, proposed
FL frameworks such as FedVision and FedCV are mainly dedicated to
object detection and classification tasks [37]. FL for human action recognition
(HAR), a long-standing problem of computer vision, is still in its infancy.
FL for HAR has its own challenges compared to FL for object detection and recognition: HAR models are usually designed to capture long spatio-temporal relationships,
and are therefore often big in size, thus causing substantial computational cost. Moreover,
as the human body has a high degree of freedom, it leads to high inter-class similarity and
high intra-class variation issues. This explains why the performance of HAR based on FL
can be drastically reduced compared to the centralized performance, as stated in several
works [38]. Recently, some attempts have been dedicated to the development of an FL
scheme for human activity recognition [39]. However, most of them are based on motion data that requires less computational time [40], [41]. There is still a lack of a general FL
framework that can incorporate state-of-the-art deep learning models for vision-based
human activity recognition.

1.2 Problem definition

FL can be formulated mathematically as follows. Suppose we have a large dataset D
that is distributed across N devices or servers, denoted as D = {D_1, D_2, ..., D_N}. Each
device i has a subset of the data, denoted as D_i. Let f_\theta be the global model with
parameters \theta, and let f_i(\theta) be the local model on device i. Then, the objective of FL
is to minimize the following loss function:

L = \sum_{i=1}^{N} \frac{n_i}{N} L_i(f_\theta, D_i)    (1.1)

where L_i is the loss function on device i, n_i = |D_i| is the size of device i's local dataset, and \theta is the parameter
vector of the global model.
The above objective function can be interpreted as a weighted sum of the local
losses, where the weights are determined by the size of the local dataset. The goal is to
find the optimal θ that minimizes the weighted sum of local losses, without transferring
raw data to a central server. Instead, the local devices or servers compute the gradient
of the loss function with respect to their local model parameters, and send them to the
central server, which aggregates them to update the global model [42].
The aggregation process can be done in different ways, such as Federated Averaging or Federated SGD (Stochastic Gradient Descent), depending on the specific FL
algorithm used.
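The weighted aggregation of Eq. (1.1) can be sketched in a few lines: each client reports a parameter vector together with its local dataset size n_i, and the server averages with weights proportional to n_i. The function name `aggregate` and the flat-list representation are illustrative, not taken from a specific framework.

```python
def aggregate(client_updates):
    """FedAvg-style weighted average of client parameter vectors.

    client_updates: list of (params, n_i) pairs, where params is a
    list of floats and n_i = |D_i| is the local sample count.
    """
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    global_params = [0.0] * dim
    for params, n in client_updates:
        w = n / total  # weight proportional to local dataset size
        for j in range(dim):
            global_params[j] += w * params[j]
    return global_params

# Example: two clients; client 0 holds 3x as much data as client 1,
# so its parameters dominate the average.
updates = [([1.0, 0.0], 3), ([0.0, 1.0], 1)]
print(aggregate(updates))  # -> [0.75, 0.25]
```

In Federated SGD the same weighted average is applied to gradients every step, whereas FedAvg applies it to model parameters after several local epochs.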

1.3 Contributions

The objective of my master's work is to construct a communication-efficient FL
system and a unified software toolbox for applying FL to the HAR problem. It thus
mainly consists of two parts: one is reducing the communication cost, the other is an
FL framework for HAR that helps researchers and practitioners accelerate their study
and experiments.
First, as previously mentioned, training models in a federated setting requires a large
amount of communication resources, especially for training modern deep learning models
that contain millions or even billions of parameters. This not only causes a huge
communication cost overhead but also a heavy burden on the participants' devices. We
investigate techniques that reduce the communication costs during the training process
of decentralized models, in particular, weight quantization and encoding algorithms
that can be applied to FedAvg [42], a popular FL algorithm.
Secondly, to facilitate the development of FL for vision-based human action recognition tasks, we propose an FL framework, named FlowerAction, which is built on top of the well-known FL framework Flower. Flower is a framework for studying FL algorithms on simple tasks and benchmarks. We adopt Flower as the FL core and extend it to the video-based action recognition problem. We also provide extensive benchmarks on two popular HAR datasets. We believe that providing such a benchmark, along with the implementation of models, data partitioning, and evaluation/profiling tools, can make a significant contribution to the research community in human action recognition with FL.

1.4

Outline of the thesis

The remainder of this thesis is organized as follows:
• In chapter 2, we briefly summarize three main approaches to reducing communication overhead in FL settings and review how FL has been applied to HAR problems in the literature.
• In chapter 3, we describe our proposed communication-efficient FL algorithm, which is, in essence, FedAvg with two modifications: layer-wise ternary compression and model weight encoding. We then present experimental results to demonstrate the effectiveness of the proposed method.

• In chapter 4, we describe our proposed FL framework for HAR, FlowerAction, detailing the main components of the framework, the workflow, our implementation of deep models, data processing, FL algorithms, and additional tools for evaluation (metrics, visualization, communication cost measurement). The experiment section in this chapter presents the datasets and our experimental results with simulated and real deployments of the framework. Analyses of the performance of deep models in centralized and FL settings are reported and discussed. We also evaluate the role of the data distribution, the fraction of clients participating in training, and the client selection strategy. The findings lead to new recommendations for deploying the framework in practical applications.
• Finally, we conclude and give some ideas for future work in chapter 5.

Chapter 2
Related works
Overview
This chapter discusses recent developments in the two areas of interest in our study: communication-efficient FL and FL for video-based HAR. In particular, we introduce communication-efficient FL algorithms, which can be broadly divided into three groups: communication delay, quantization, and sparsification. We then discuss video-based human action recognition and federated learning applications for HAR by enumerating common HAR datasets and popular model architectures for video-based HAR, and reviewing works on applying FL to HAR as well as existing FL frameworks in the community.

2.1

Communication-efficient Federated Learning


FL is an incredibly interesting topic because it allows users to keep their data private while a high-quality model can still be trained on it. However, several issues remain open and need to be addressed for FL. One of them is the communication overhead, especially when neural networks are applied: modern neural networks commonly have millions of parameters, and sending updates for millions of weights from a mobile device to a server is not really desirable.
Many methods have been proposed for communication-efficient FL.

They can be divided into three categories based on how they reduce the communication cost. Federated Averaging [42], classified into the first group, relies on a communication delay mechanism, in which the clients compute a weight update by performing multiple Stochastic Gradient Descent (SGD) iterations before communicating with the server.
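The client side of this communication-delay scheme can be illustrated with a toy least-squares model. This is a hedged sketch only: the model, the learning rate, and the function names are our own choices, and a real client would run mini-batch SGD on a deep network instead.

```python
import numpy as np

def local_update(w, X, y, epochs=5, lr=0.1):
    """Client-side step of the communication-delay scheme: run
    several local (full-batch) gradient descent epochs on the
    private data (X, y), then return the updated weights.

    Plain least-squares model y ≈ X @ w with an MSE loss.
    """
    w = w.copy()
    for _ in range(epochs):
        grad = 2.0 / len(X) * X.T @ (X @ w - y)  # MSE gradient
        w -= lr * grad
    return w  # one upload per `epochs` local passes, not per step

# toy private client data
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w_server = np.zeros(3)
w_client = local_update(w_server, X, y, epochs=50)
print(np.round(w_client, 2))  # approximately recovers w_true
```

The key point is the ratio of computation to communication: fifty local passes over the data cost a single round-trip of the parameter vector, whereas plain distributed SGD would upload a gradient after every step.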
The second group follows the quantization approach, which aims at reducing the number of bits needed to represent the model weights/gradients. One of the most prominent papers following this approach is QSGD (SGD via gradient quantization and encoding) [43], which focuses on the trade-off between the transmission channel bandwidth and the convergence time. The idea behind QSGD is simple and intuitive: each weight w_i of a layer w can be written as

w_i = \|w\|_2 \cdot \mathrm{sign}(w_i) \cdot \frac{|w_i|}{\|w\|_2},

and the algorithm attempts to approximate the fraction |w_i| / \|w\|_2 with two integers l and s, where s is a pre-defined quantization level. Several QSGD-based variants have since been proposed. Han et al. [44] proposed a dense quantization method in which the weights of a Deep Neural Network (DNN) are quantized with minimal loss of accuracy. Wen et al. [45] applied stochastic quantization to gradients and introduced "TernGrad", achieving a 16× compression rate, though the accuracy decreased by more than 2% on a complex DNN model. Seide et al. [46] introduced a dense quantization method that uses 1-bit quantization for weight updates; SGD is performed by accumulating the quantization error from previous rounds, ensuring high convergence speed while all gradients are still taken into account during model training.
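As an illustration of this family of methods, a TernGrad-style stochastic ternarization can be sketched as follows. This is our own minimal NumPy reading of the idea in [45], not the authors' implementation; the point it demonstrates is that mapping each component to {-s, 0, +s} can be made unbiased in expectation.

```python
import numpy as np

def ternarize(grad, rng):
    """Stochastically map each gradient component to {-s, 0, +s},
    with s = max|grad|. Component i is kept with probability
    |g_i| / s, so E[q_i] = s * sign(g_i) * |g_i| / s = g_i
    (an unbiased quantizer)."""
    s = np.max(np.abs(grad))
    if s == 0.0:
        return np.zeros_like(grad)
    # Bernoulli mask: keep component i with probability |g_i| / s
    mask = rng.random(grad.shape) < np.abs(grad) / s
    return s * np.sign(grad) * mask

rng = np.random.default_rng(42)
g = np.array([0.1, -0.4, 0.8, 0.0])
# averaging many independent quantizations recovers g (unbiasedness)
avg = np.mean([ternarize(g, rng) for _ in range(20000)], axis=0)
print(np.round(avg, 2))  # close to [0.1, -0.4, 0.8, 0.0]
```

Since each component now takes one of three values, it can be transmitted with two bits (plus one float s per tensor) instead of 32 bits, which is the source of the compression rates reported above.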
The work of Sattler et al. [47] is a demonstration of the sparsification approach: they propose a scheme combining gradient sparsification and binarization to reduce the number of elements that are taken into account, and the method yields a greater than 32× compression rate. In their work, top-k sparsification is shown empirically to suffer least from the Non independent and identically distributed (Non-IID) scenario, which is common in federated settings. When sparsification is applied to a tensor, only the elements with the largest magnitudes, i.e., those above a certain predefined threshold, are kept, while the others are set to zero. They further combine binarization with sparsity, i.e., instead of communicating the fraction of largest elements at full precision, the
