VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

NGUYỄN THỊ TY

RESEARCH AND DEVELOP SOLUTIONS TO TRAFFIC
DATA COLLECTION BASED ON VOICE TECHNIQUES

Major: Computer Science
Major code: 8480101

MASTER THESIS

HO CHI MINH CITY, July 2023



THIS THESIS IS COMPLETED AT
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM

Supervisor: Assoc. Prof. Trần Minh Quang

Examiner 1: Assoc. Prof. Nguyễn Văn Vũ

Examiner 2: Assoc. Prof. Nguyễn Tuấn Đăng

This master’s thesis was defended at Ho Chi Minh City University of Technology, VNU-HCM, on July 11, 2023.
The Master’s Thesis Defense Council includes (full name and academic rank of each member):
1. Chairman: Assoc. Prof. Lê Hồng Trang
2. Secretary: Dr. Phan Trọng Nhân
3. Examiner 1: Assoc. Prof. Nguyễn Văn Vũ
4. Examiner 2: Assoc. Prof. Nguyễn Tuấn Đăng
5. Commissioner: Assoc. Prof. Trần Minh Quang
Approval of the Chairman of the Master’s Thesis Committee and the Dean of the Faculty of
Computer Science and Engineering after the thesis is corrected (if any).
CHAIRMAN OF THESIS COMMITTEE

DEAN OF FACULTY OF
COMPUTER SCIENCE AND ENGINEERING


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

THE TASK SHEET OF MASTER’S THESIS
Full name: NGUYỄN THỊ TY

Student code: 2171072

Date of birth: 22/11/1996

Place of birth: Binh Dinh Province

Major: Computer Science

Major code: 8480101

I. THESIS TITLE:
Research and develop solutions to traffic data collection based on voice techniques (Nghiên cứu và phát triển các giải pháp thu thập dữ liệu giao thông dựa trên các kỹ thuật giọng nói).
II. TASKS AND CONTENTS:
• Task 1: Traffic Data Collection and Processing.
The first task involves collecting comprehensive traffic data. Extensive research will be conducted to identify reliable data sources, followed by the
implementation of appropriate data collection techniques. Subsequently, experiments will be carried out to determine the most effective data processing
methods. The aim is to enhance data quality and optimize processing efficiency for further analysis.
• Task 2: Research and Experimentation for Automatic Speech Recognition Model Development.
In this phase, the focus will be on researching and experimenting with various architectures to develop high-performance automatic speech recognition
models. Different techniques will be explored to achieve accurate speech-to-text conversion. The goal is to identify the best-performing model that meets
the project’s requirements.


• Task 3: Automatic Speech Recognition Model Evaluation and Future Work.
Once the automatic speech recognition models are developed, a comprehensive evaluation process will be undertaken. The achieved results will be analyzed using appropriate metrics and techniques to assess their performance.
Strengths and weaknesses of each model will be identified. Based on this
analysis, recommendations for future work will be provided, outlining potential enhancements or modifications to the automatic speech recognition
models.
III. THESIS START DAY: 06/02/2023.
IV. THESIS COMPLETION DAY: 09/06/2023.
V. SUPERVISOR: Assoc. Prof. TRẦN MINH QUANG.
Ho Chi Minh City, June 9, 2023

SUPERVISOR
(Full name and signature)

CHAIR OF PROGRAM COMMITTEE
(Full name and signature)

DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING
(Full name and signature)


ACKNOWLEDGMENTS

I would like to extend my sincere gratitude to the individuals who have provided
invaluable support and assistance throughout my research journey. I would like to
express my formal appreciation to Assoc. Prof. Trần Minh Quang for his exceptional
guidance, expertise, and unwavering support. His mentorship has been instrumental in helping me navigate the necessary steps to complete this thesis. Whenever I
encountered difficulties or felt lost, Assoc. Prof. Quang provided invaluable advice
that steered me back in the correct direction. His suggestion to process the data to
enhance its quality was a significant contribution to my research. Furthermore, his
assistance in establishing contact with esteemed researchers working on topics similar to mine and facilitating connections with individuals who could provide server
support for training large models, such as automatic speech recognition models, has
been immensely valuable.
I would like to express my profound gratitude to the esteemed researchers, Mr.
Nguyễn Gia Huy and Mr. Nguyễn Tiến Thành, for their generous contributions in
sharing their profound insights and knowledge. Their willingness to address my inquiries regarding the Urban Traffic Estimation System, collected data, and existing
issues has significantly enriched my comprehension of the subject matter. Furthermore, I am sincerely thankful to my sisters, Ms. Nguyễn Thị Nghĩa and Ms. Nguyễn
Thị Hiển, as well as Lương Duy Hưng and Vũ Anh Nhi, for their invaluable support
in meticulously creating precise transcripts for the audio files.
Furthermore, I would like to express my deep appreciation to Mr. Tăng Quốc
Thái for his diligent efforts in meticulously collecting and securely storing the traffic
reports from VOH 95.6MHz. Additionally, I am profoundly grateful to Mr. Mai Tấn
Hà, who graciously provided me with access to a server for the training of automatic
speech recognition models. His generosity and support have been instrumental in enabling the successful execution of the model training process. I would also like to
extend my formal gratitude to Dr. Lê Thành Sách and Mr. Nguyễn Hoàng Minh from
the Data Science Laboratory at Ho Chi Minh City University of Technology (HCMUT)
for their kind approval in granting me the opportunity to utilize an independent
server for automatic speech recognition model training. Their trust and support from
the Data Science Laboratory have been pivotal in facilitating the smooth progress of
my research. In addition, I am sincerely thankful for the invaluable support rendered
by my friends, Mr. Nguyễn Tấn Sang and Mr. Huỳnh Ngọc Thiện, in working with
the server that has limited permissions. Their expertise and assistance have been indispensable in effectively navigating the constraints imposed by the server limitations.
Lastly, I would like to express my heartfelt gratitude to my boss, co-workers,
friends, and family for their unwavering emotional support and understanding during
the challenging times that I encountered throughout this research endeavor. Their
encouragement and belief in my abilities have been instrumental in my success.
Once again, I am deeply grateful to all of the individuals mentioned above for
their significant contributions and support, without which this thesis would not have
been possible.


ABSTRACT
This thesis addresses two fundamental challenges in the current intelligent traffic system, specifically the Urban Traffic Estimation (UTraffic)
System. The first challenge is the lack of data that meets the standards required for training the automatic speech recognition (ASR) model that will
be deployed in the UTraffic system. The current dataset predominantly consists of
synthesized data, resulting in a bias towards recognizing synthesized traffic speech
reports while struggling to accurately transcribe real-life traffic speech reports submitted by UTraffic users. The second challenge is the accuracy of the ASR
model deployed in the current UTraffic system, particularly in transcribing real-life
traffic speech reports into text.
To address these challenges, this research proposes several approaches. Firstly,
an alternative traffic data source is identified to reduce the reliance on synthesized
data and mitigate the bias. Secondly, a pipeline incorporating audio processing techniques such as sampling rate conversion and speech enhancement is designed to effectively process the dataset, with the ultimate objective of improving ASR model
performance. Thirdly, advanced ASR architectures suited to the task are evaluated
on the processed dataset to identify the best model for deployment
within the UTraffic system.
Significant achievements have been obtained through this research. Firstly, a new
dataset of superior quality compared to the previous one has been developed. Continuous data collection from the alternative traffic data source can further enhance
this dataset, making it a valuable resource for future research endeavors aiming to improve the ASR model deployed in the UTraffic system. Additionally, notable progress
has been made in improving the accuracy of the ASR model compared to the results
achieved by the current architecture of the UTraffic system’s ASR model.


THESIS SUMMARY
This thesis addresses two fundamental challenges in the current intelligent traffic
system, specifically the Urban Traffic Estimation System (UTraffic). The first challenge
concerns the shortage of data that meets the standards required for training the
automatic speech recognition (ASR) model to be deployed in the UTraffic system. The
current dataset consists mainly of synthesized data, which biases the model towards
recognizing traffic reports generated from synthesized speech, while it struggles to
accurately convert spoken traffic reports provided by UTraffic users into text. The
second challenge concerns the accuracy of the ASR model deployed in the current
UTraffic system.
To address these challenges, this research proposes several approaches. First, an
alternative traffic data source is identified to reduce the dependence on synthesized
data. Second, a suitable processing pipeline is designed, combining audio processing
techniques such as sampling rate conversion and speech enhancement to process the
existing dataset effectively, with the ultimate goal of improving ASR model
performance. Third, the processed dataset is tested on advanced ASR architectures to
identify the best model for deployment in the UTraffic system.
This research has achieved significant results. First, a new dataset has been built
whose quality is superior to that of the original dataset. Continued data collection
from the alternative source can further improve the quality of the existing dataset,
making it a valuable resource for future research efforts to improve the performance of
the ASR model deployed in the UTraffic system. In addition, compared with the results
achieved by the current ASR model in the UTraffic system, considerable progress has
been made, particularly in improving speech recognition accuracy.



DECLARATION
I, Nguyễn Thị Ty, solemnly declare that this thesis titled "Research and develop
solutions to traffic data collection based on voice techniques" is the result of my own
work, conducted under the supervision of Assoc. Prof. Trần Minh Quang. I affirm that all the information presented in this thesis is based on my own knowledge,
research, and understanding, acquired through extensive study and investigation.
I further declare that any external assistance, whether in the form of data, ideas,
or references, has been duly acknowledged and properly cited in accordance with
the established academic conventions. I have provided appropriate references and
citations for all the sources and materials used in this thesis, giving credit to the
original authors and their contributions.
I acknowledge that this thesis is intended to fulfill the demands of society and to
contribute to the existing body of knowledge in the field. It represents the culmination
of my efforts, dedication, and commitment to advancing knowledge and understanding in this area.
I hereby affirm that this thesis is an authentic and original piece of work, and I
take full responsibility for its content. I understand the consequences of any act of
plagiarism or academic dishonesty, and I assure that this thesis has been prepared
with utmost integrity and honesty.
Nguyễn Thị Ty
June 9, 2023


Contents

List of Figures
List of Tables

1 INTRODUCTION
  1.1 General Introduction
  1.2 Problem Description
  1.3 Objective
  1.4 Scope Of Work
  1.5 Contribution
  1.6 Thesis Structure

2 BACKGROUND
  2.1 Traffic Data Processing for ASR Model
  2.2 Components in ASR Model
    2.2.1 Acoustic Modeling
    2.2.2 Language Modeling
    2.2.3 Decoding Algorithm
  2.3 Challenges in Traffic Speech Recognition

3 RELATED WORK

4 APPROACH
  4.1 Choosing ESPnet for ASR Model Development
  4.2 Data Collection and Data Processing
    4.2.1 Data Collection
    4.2.2 Data Processing
  4.3 Training and Decoding for End-to-End ASR
    4.3.1 Attention-based Encoder Decoder
    4.3.2 Hybrid CTC/Attention End-to-End ASR
    4.3.3 Evaluation Metric

5 EXPERIMENT AND EVALUATION
  5.1 Dataset
  5.2 Experimental Setup
  5.3 Experimental Result and Analysis
    5.3.1 Data Processing Method Experiment
    5.3.2 RNNLM Training Experiment
    5.3.3 Architecture Comparison Experiment
    5.3.4 Language Model Weight Variation Experiment
    5.3.5 CTC Weight Variation Experiment
    5.3.6 VOH Data Impact Assessment Experiment
  5.4 Discussion
  5.5 ASR Deployment
    5.5.1 bktraffic-analyxer and Training Server Environments
    5.5.2 ASR Deployment Result
    5.5.3 ASR Deployment Result Analysis

6 CONCLUSION

References

List of Figures

2.1 System Architecture of Automatic Speech Recognition [5].
2.2 The Architecture of the Conformer Encoder Model [6].
2.3 The Architecture of the Branchformer Encoder Block [7].
4.1 Taxonomy of Methods for Constructing our End-to-End ASR System.
4.2 The Experimental Flow of the Standard ESPnet Recipe [54].
4.3 Distribution of Audio Hours among Three Data Sources.
5.1 Training Time in the First Scenario.
5.2 Training Time in the Second Scenario.
5.3 Training Time in the Third Scenario.
5.4 Training Time in the Fourth Scenario.
5.5 GPU Maximum Cached Memory of Transformer-based Architecture.
5.6 GPU Maximum Cached Memory of Architecture with Conformer-based Encoder.
5.7 GPU Maximum Cached Memory of Architecture with Branchformer-based Encoder.
5.8 WER for Transformer-based Encoder-Decoder Architecture.
5.9 WER for Conformer-based Encoder Transformer-based Decoder Architecture.
5.10 WER for Branchformer-based Encoder Transformer-based Decoder Architecture.


List of Tables

4.1 Comparison of Audio Transcription Approaches.
4.2 WER for System Combination of Speech Enhancement with Speech Recognition on CHiME-4 Corpus [37].
4.3 Comparison of Word Error Rate in Different Scenarios.
5.1 Language Model Perplexity.
5.2 Comparison of ASR Models based on WER.
5.3 ASR Model Latency and RTF.
5.4 Models & Trainable Parameters and Training Time.
5.5 Comparison of Training and Validation Metrics for Transformer, Conformer, and Branchformer Models.
5.6 Impact of LM weight on WER and Sentence Error Rate (S.Err).
5.7 Impact of CTC Weight Variation during the Decoding Stage on WER and S.Err.
5.8 Impact of CTC Weight Variation in the Training Stage on WER and S.Err.
5.9 ASR Model WER, Latency & RTF with Different Models and Test Sets.
5.10 Audio Files Conversion Time and Transcription Results of ASR Model in bktraffic-analyxer.


Chapter 1

INTRODUCTION
1.1 General Introduction

The Urban Traffic Estimation System (UTraffic) [1], developed by Ho Chi Minh City
University of Technology, is an intelligent traffic system designed to estimate and
predict urban traffic conditions. It utilizes real-time traffic data to generate accurate
traffic estimations. By employing advanced technologies such as data analytics, machine learning, and deep learning, UTraffic provides valuable insights and support
for traffic management and planning, aiming to improve traffic efficiency, reduce
congestion, and enhance overall transportation systems in urban areas.
One crucial feature of UTraffic is the functionality to convert traffic speech reports into text through web and mobile applications. This feature significantly enhances the system’s usability and effectiveness. The details of this functionality are
as follows:
• User Interaction: The web and mobile applications provide an interface for users
to submit their traffic speech reports. Users can access the application from their
devices, record their voice, and provide relevant details about the traffic situation
they are reporting.
• Voice-to-Text Conversion: The recorded speech reports are processed using
automatic speech recognition (ASR) technology. ASR algorithms analyze
the audio input and convert it into text.
• Text Processing: The converted text is then further processed to extract relevant
information, such as the location, type of incident, severity, and any additional
details provided by the user. Natural language processing (NLP) techniques may
be applied to understand and extract meaning from the text effectively.
• Data Integration: The extracted information in the form of segments from the
speech reports is integrated into the UTraffic system’s database, enabling further
analysis and utilization for purposes such as real-time traffic monitoring, incident
detection, and traffic prediction.
• Route Optimization: By analyzing the speech reports, the UTraffic system can
identify the most efficient routes for specific destinations based on real-time traffic conditions. It can provide recommendations to drivers to suggest alternate
routes that minimize travel time and avoid congested areas.
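
The Voice-to-Text Conversion and Text Processing steps above can be pictured with a small sketch. Everything in it is hypothetical and chosen only for illustration: the `TrafficReport` fields, the `INCIDENT_KEYWORDS` table, the naive " at " location rule, and the `transcribe` callback that stands in for the ASR model. It does not describe how UTraffic actually implements these steps.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TrafficReport:
    """Hypothetical container for information extracted from one speech report."""
    transcript: str
    location: Optional[str] = None
    incident_type: Optional[str] = None


# Hypothetical keyword table; a real system would likely use NLP models instead.
INCIDENT_KEYWORDS = {
    "congestion": ("jam", "congestion", "congested"),
    "accident": ("accident", "collision", "crash"),
}


def extract_report(transcript: str) -> TrafficReport:
    """Toy text processing: keyword lookup plus a naive 'at <place>' rule."""
    lowered = transcript.lower()
    incident = next(
        (label for label, words in INCIDENT_KEYWORDS.items()
         if any(word in lowered for word in words)),
        None,
    )
    location = None
    if " at " in lowered:
        # Everything after the first " at " is treated as the location.
        location = transcript[lowered.index(" at ") + 4:].strip() or None
    return TrafficReport(transcript, location, incident)


def process_speech_report(audio: bytes,
                          transcribe: Callable[[bytes], str]) -> TrafficReport:
    """Voice-to-text conversion followed by text processing."""
    return extract_report(transcribe(audio))


if __name__ == "__main__":
    # A stub replaces the ASR model so the sketch runs without audio input.
    stub_asr = lambda _audio: "Heavy traffic jam at Hang Xanh intersection"
    print(process_speech_report(b"", stub_asr))
```

Passing the recognizer in as a callback keeps the text-processing step independent of any particular ASR model, which matches the way the list above separates the two stages.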
By converting traffic speech reports into text, the UTraffic system can efficiently
process and analyze user-generated data, enabling better traffic management and decision-making. This feature makes it more accessible and convenient for users to contribute
to the system, as they can report traffic incidents through voice inputs that are transformed into actionable information within the UTraffic ecosystem. In this thesis, our
primary focus is the Voice-to-Text Conversion aspect of this
feature.
The paper by Nguyen et al. [2] presents the development of an ASR model
and its deployment in the UTraffic system. The ASR model uses a Conformer-based encoder and Transformer-based decoder architecture and was trained
on the VIVOS dataset [3]. The model was evaluated on a set of 80
traffic speech reports.
Although the authors did not fine-tune the model on additional data, they
took measures to address the limited training data. They incorporated
a feature within the UTraffic web and mobile applications that allows users to voluntarily contribute their own traffic speech reports. By prompting users to speak and
record their speech, the training dataset can be enriched. Additionally, the authors
used a speech synthesis tool called Vbee to generate synthetic traffic speech reports, resulting in an additional 122,569 seconds of audio data. Together, these efforts
produced approximately 35 hours of audio data, which will be used
for future training of the ASR model.
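
As a quick check of scale, the synthesized portion can be converted from seconds to hours. The sketch below uses only the 122,569-second figure quoted above; the helper name `seconds_to_hours` is ours.

```python
def seconds_to_hours(seconds: float) -> float:
    """Convert a duration in seconds to hours."""
    return seconds / 3600


if __name__ == "__main__":
    synthetic_hours = seconds_to_hours(122_569)
    print(f"Synthesized audio: {synthetic_hours:.2f} hours")  # 34.05 hours
```

Since the total is given only as approximately 35 hours, the synthesized audio accounts for nearly all of it, which is consistent with the bias towards synthesized speech discussed in the problem description.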
As for the architecture of the ASR models deployed in the UTraffic system, the Conformer-based encoder and Transformer-based decoder architecture
was found to outperform alternative configurations, namely Recurrent Neural Network
(RNN)-based encoder-decoder and Transformer-based encoder-decoder, as measured
by the word error rate (WER). This result indicated its suitability
for achieving accurate speech recognition within the UTraffic system during the evaluation period.
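
For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in [2] or in this thesis; the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "traffic jam at hang xanh intersection"
    hypothesis = "traffic jam hang sanh intersection"
    # One deletion ("at") and one substitution ("xanh" -> "sanh") over 6 words.
    print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.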

1.2 Problem Description

There are several challenges associated with the current implementation of the
ASR model in the UTraffic system. Firstly, the ASR model deployed in UTraffic
encounters a domain adaptation problem. Although the VIVOS training dataset contains a diverse range of speech recordings from various speakers, encompassing
different domains and topics, the trained model is applied to recognize traffic speech
reports from the web or mobile application of the UTraffic system. This introduces
potential degradation in accuracy and recognition performance due to differences in
acoustic and linguistic characteristics, including speaker variability, noise conditions,
language style, and vocabulary, between the source domain and the target domain
(traffic domain).
The second challenge relates to the quality of the training data that has been prepared
for future ASR model training, as mentioned by the authors [2]. While UTraffic
incorporates a feature that allows the system’s web or mobile applications to collect
traffic speech reports from users reading predefined transcripts, this functionality has not attracted significant user engagement. As a result, the data available for training ASR
models remains limited. To address
this limitation, the authors [2] used the
Vbee speech synthesis tool [4] to generate additional artificial traffic data. However,
synthesized traffic data has inherent
weaknesses, which need to be carefully evaluated and accounted for in the training
process:
• Synthesized data typically lacks the diversity and variability found in real-life
speech patterns. As a result, the model trained predominantly on synthesized
data may struggle to generalize well to real-world scenarios. It may not effectively handle the natural variations in speech styles, accents, intonations, and
background noise present in genuine traffic speech reports.
• Over-reliance on synthesized data can lead to a bias in the trained model. The
model becomes accustomed to the specific characteristics of synthesized speech,
making it less effective in recognizing and transcribing real speech patterns accurately. This bias can significantly impact the model’s performance on authentic
data and reduce its overall accuracy.
• Synthesized speech often lacks the naturalness and nuances present in human
speech. The model trained primarily on synthesized data may struggle to accurately transcribe or understand real speech patterns due to the differences in
prosody, pronunciation, and other subtle speech variations that exist in genuine
traffic speech reports.

The final concern regarding the ASR model used in the UTraffic system is
its architecture, specifically the Conformer-based encoder and Transformer-based decoder. While the Conformer architecture incorporates parallelizable operations
such as depthwise separable convolutions and multi-head self-attention,
it still relies on sequential computations within each layer. Consequently, this architecture exhibits limitations in terms of parallelism. Moreover, the Conformer-based
encoder requires an extended training time due to its sequential nature. Additionally,
although the Conformer-based encoder can handle relatively long sequences, it faces
challenges when processing excessively long input sequences. As the sequence
length increases, the sequential operations of the Conformer may introduce computational bottlenecks, thereby reducing efficiency and increasing memory requirements.
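
One well-understood contributor to this memory growth is the multi-head self-attention module inside each Conformer block, which builds a score matrix with one entry per pair of input frames for every head. The sketch below only counts those entries; the head count of 4 and the frame counts are illustrative assumptions, not the configuration used in UTraffic or in this thesis.

```python
def attention_score_entries(num_frames: int, num_heads: int = 4) -> int:
    """Entries in one layer's attention score tensor: heads x frames x frames."""
    if num_frames < 0 or num_heads <= 0:
        raise ValueError("num_frames must be >= 0 and num_heads must be > 0")
    return num_heads * num_frames * num_frames


if __name__ == "__main__":
    for num_frames in (250, 500, 1000):
        entries = attention_score_entries(num_frames)
        print(f"{num_frames:5d} frames -> {entries:,} score entries per layer")
```

Doubling the number of frames quadruples the number of score entries, so long recordings cost disproportionately more than short ones.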
Considering the characteristics of the dataset and the computational constraints
of the current ASR model employed in the UTraffic system, the Conformer-based
encoder is considered suitable. However, if the dataset or computational constraints
were to change, it would be advisable to explore alternative architectures beyond the
Conformer-based encoder to identify the most appropriate choice for the ASR model.
This would involve evaluating other architectures that can effectively handle parallel
processing, reduce training time, and efficiently process long sequences, thus optimizing the performance and scalability of the ASR model in the given context.

1.3 Objective

The objectives of this thesis are to address the aforementioned problems encountered in the ASR model used in the UTraffic system. To address the first problem,
an alternative approach is proposed to acquire additional traffic data for training the
ASR model. A suitable data source with the following characteristics is sought:
• The data source should provide speech samples specifically related to the local
traffic conditions of Ho Chi Minh City or the surrounding region, as our ASR
model is intended for traffic-related tasks in this specific area. This ensures the
relevance and applicability of the training data to the target region.
• The chosen data source should offer real-life traffic updates and reports, serving
as a valuable resource for obtaining authentic traffic speech reports.
• It is important that the data source exhibits diverse speech styles and vocabulary.
Training our ASR model with such data exposes it to a wide range of speech
patterns, facilitating improved generalization and robustness. This allows the
model to adapt to various speakers and variations in speech delivery commonly
encountered in real-life traffic reports.
• Accessibility is a key consideration, thus a data source that is readily available
and easily accessible is preferred. Ensuring access to a reliable and convenient
data source is essential for the effective building and training of our ASR model.
Subsequently, a pipeline is proposed to process the collected data, aiming to enhance
the performance of our ASR model.
To address the second problem, an in-depth exploration is conducted to identify
advanced architectures suited to the ASR model in the domain of traffic automatic speech recognition. Several such architectures are evaluated
to determine the best choice for the ASR model to be deployed within the
UTraffic system. Particular attention is given to architectures capable of addressing the weaknesses associated with the Conformer-based encoder, such as limited
parallelism, extended training time, high computational resource requirements, and
difficulties encountered when handling long sequences.
By pursuing these objectives, this thesis aims to overcome the identified challenges and improve the effectiveness and efficiency of the ASR model utilized in the
UTraffic system.

1.4 Scope Of Work

The scope of work for this thesis encompasses several key areas. Firstly, it involves data collection, where an additional data source will be identified and selected
for training the ASR model. Extensive research and evaluation will be conducted to
identify sources that provide speech samples specifically related to the local traffic
conditions of Ho Chi Minh City or the surrounding region. The focus is on data
sources that offer real-life traffic updates and reports, encompassing diverse speech
styles and vocabulary, while also being easily accessible.
Secondly, a data processing pipeline will be developed to effectively process the
collected data. This pipeline aims to enhance the quality and suitability of the data for
training the ASR model. The steps involved may include sampling rate conversion,
speech enhancement techniques, and aligning the audio data with corresponding text
transcripts.
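
To make the sampling rate conversion step concrete, here is a deliberately simple sketch that resamples a list of audio samples by linear interpolation. It is an illustration only: the 44,100 Hz source rate and 16,000 Hz target rate are assumptions, the function name `resample_linear` is ours, and a real pipeline would normally use a dedicated resampler with a low-pass (anti-aliasing) filter, which this sketch omits.

```python
from typing import List, Sequence


def resample_linear(samples: Sequence[float], src_rate: int,
                    dst_rate: int) -> List[float]:
    """Resample by linear interpolation (no anti-aliasing filter)."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if not samples:
        return []

    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    last = len(samples) - 1
    out: List[float] = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate  # fractional index into the source signal
        i = int(pos)
        if i >= last:
            out.append(float(samples[last]))  # hold the final sample at the edge
        else:
            frac = pos - i
            out.append(samples[i] * (1.0 - frac) + samples[i + 1] * frac)
    return out


if __name__ == "__main__":
    one_second = [0.0] * 44_100  # one second of silence at an assumed 44.1 kHz
    converted = resample_linear(one_second, 44_100, 16_000)
    print(len(converted))  # 16000 samples, i.e. one second at 16 kHz
```

Converting every recording to a single rate before training means the model always sees features computed on the same time scale, regardless of which source a recording came from.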
Thirdly, the thesis will explore advanced ASR model architectures suitable for
traffic automatic speech recognition. Different architectures will be studied and compared, with a specific focus on addressing the limitations associated with
the Conformer-based encoder. The objective is to identify the architecture that best improves parallelism, reduces training time, minimizes computational resource requirements, and handles long sequences effectively.
Fourthly, extensive experimental evaluation will be conducted to assess the performance of the proposed approaches. This will involve training and testing the ASR
models using the collected data and the selected architectures. Evaluation metrics
such as transcription accuracy, computational efficiency, latency, and real-time factor will be employed. Comparative analysis will be conducted to determine
the improvements achieved over the existing ASR model deployed in the
UTraffic system.
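
Latency and real-time factor (RTF) can be measured with a few lines of code. The sketch below assumes the common definitions, with latency as the wall-clock time taken to transcribe one recording and RTF as that time divided by the recording's duration; the `transcribe` callback is a hypothetical stand-in for whichever ASR model is being evaluated.

```python
import time
from typing import Callable, Tuple


def real_time_factor(processing_time_s: float, audio_duration_s: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_duration_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_time_s / audio_duration_s


def timed_transcription(transcribe: Callable[[bytes], str], audio: bytes,
                        audio_duration_s: float) -> Tuple[str, float, float]:
    """Run the recognizer once and return (text, latency in seconds, RTF)."""
    start = time.perf_counter()
    text = transcribe(audio)
    latency_s = time.perf_counter() - start
    return text, latency_s, real_time_factor(latency_s, audio_duration_s)


if __name__ == "__main__":
    # A stub recognizer keeps the sketch runnable without a trained model.
    text, latency_s, rtf = timed_transcription(lambda _audio: "ok", b"", 5.0)
    print(f"text={text!r} latency={latency_s:.6f}s rtf={rtf:.6f}")
```

In practice the measurement would be averaged over a whole test set, since a single run is sensitive to warm-up effects and system load.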
Finally, documentation and reporting are included within the scope of
work. This covers the research methodology,
experimental setup, data collection process, pipeline development, model architecture
exploration, and experimental results. A final report will
summarize the findings, draw conclusions, and provide recommendations based on
the research conducted.




1.5 Contribution

This thesis makes five contributions. Firstly, it provides an enhanced dataset for training and evaluating the ASR model. By collecting
additional traffic speech reports and pairing each audio sample with its corresponding transcript, the dataset reduces bias towards synthesized data and improves the ASR
model’s ability to recognize real-life traffic speech reports. This dataset serves as a
resource for future research in the field of traffic automatic speech recognition.
Secondly, it improves the accuracy of the ASR model. By exploring alternative
architectures that are better suited to the dataset, the thesis enhances the transcription accuracy of the ASR model for traffic speech reports. Through the selection
and implementation of a more suitable architecture, the ASR model achieves higher
accuracy in recognizing and transcribing traffic-related speech.
Thirdly, it develops a methodological pipeline for processing the collected data.
This pipeline includes important steps such as sampling rate conversion and speech
enhancement, ensuring the quality and suitability of the data for training the ASR
model. The developed pipeline serves as a valuable methodology for future researchers
working in similar domains, providing guidelines for effective data processing in traffic automatic speech recognition.
Fourthly, it presents a comprehensive comparative analysis and provides insights
into the proposed approaches. Through the evaluation of performance metrics such
as transcription accuracy, computational efficiency, latency, and real-time factor, the
thesis offers valuable insights into the improvements achieved by the proposed approaches. This comparative analysis guides future researchers in selecting and optimizing ASR model architectures for traffic automatic speech recognition.
Lastly, the thesis suggests future research directions to further enhance the performance of the UTraffic ASR model. Based on the analysis of results, the thesis
provides recommendations for future research, encouraging the exploration of innovative
approaches, the addressing of remaining challenges, and the advancement of
the field of traffic automatic speech recognition.
In summary, these contributions advance the understanding and
capabilities of the ASR model in traffic-related tasks, providing valuable insights and
directions for researchers and practitioners in the field.

1.6 Thesis Structure

The structure of this thesis consists of six chapters, which are as follows:
• Chapter 1 - INTRODUCTION: This chapter provides an overview of the research topic, presents the objectives and significance of the study, and outlines
the structure of the thesis.
• Chapter 2 - BACKGROUND: In this chapter, the relevant background information and theoretical foundations related to traffic automatic speech recognition are discussed. It includes an overview of ASR models, speech recognition
techniques, and the challenges specific to traffic-related speech recognition.
• Chapter 3 - RELATED WORK: This chapter reviews the existing literature
and research studies related to ASR models in the context of traffic automatic
speech recognition. It discusses the approaches, methodologies, and findings of
previous works, highlighting the gaps and limitations that the current thesis aims
to address.
• Chapter 4 - APPROACH: This chapter presents the proposed approaches and
methodologies for improving the ASR model in the UTraffic system. It describes
the alternative data collection methods, the development of the data processing
pipeline, and the exploration of advanced ASR model architectures. The chapter
provides a detailed explanation of the rationale behind each approach and the
techniques employed.


• Chapter 5 - EXPERIMENT AND EVALUATION: This chapter presents the
experimental setup, data analysis procedures, and evaluation metrics used to assess the performance of the proposed approaches. It includes details on the training and testing processes, performance evaluation criteria, and comparative analysis of the results obtained. The chapter provides insights into the effectiveness
and efficiency of the proposed approaches.
• Chapter 6 - CONCLUSION: The final chapter summarizes the key findings,
conclusions, and contributions of the thesis. It discusses the implications of
the research, highlights the limitations, and suggests avenues for future research.
This chapter concludes the thesis by emphasizing the significance of the work
conducted and its impact on the field of traffic automatic speech recognition.
Each chapter is structured to provide a comprehensive and logical progression of
the research, contributing to the overall understanding and advancement of the ASR
model in the context of the UTraffic system.


Chapter 2

BACKGROUND
This chapter provides the necessary background information and theoretical foundations related to traffic automatic speech recognition. It offers an overview of ASR
models, speech recognition techniques, and the unique challenges encountered in the
domain of traffic-related speech recognition.

2.1 Traffic Data Processing for ASR Model

Traffic data processing comprises the procedures used to preprocess and refine
traffic-related data before it is used to train ASR models. These steps ensure that
the data is suitable for effective ASR model training. The first step is data
collection, in which speech data specifically associated with traffic scenarios is gathered. This entails recording audio samples from diverse sources, including traffic
radio broadcasts, roadside microphones, or in-car voice recordings. Data collection
is followed by the transcription and annotation phase, which produces the labeled dataset needed for training ASR
models. Transcription converts audio recordings into text, while annotation labels specific segments or events within
the audio, such as traffic-related terms, road names, or spoken commands.
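To make the distinction between transcription and annotation concrete, the sketch below shows one possible record layout for a labeled utterance. Every name in it (`LabeledUtterance`, `Annotation`, `tag_phrase`, and the labels `ROAD_NAME` and `TRAFFIC_TERM`) is a hypothetical choice for illustration, not the schema used in the UTraffic dataset, and annotations are stored here as character spans in the transcript rather than as time spans in the audio.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    start_char: int  # inclusive offset into the transcript
    end_char: int    # exclusive offset into the transcript
    label: str       # hypothetical tag, e.g. "ROAD_NAME" or "TRAFFIC_TERM"


@dataclass
class LabeledUtterance:
    audio_path: str   # location of the recording
    transcript: str   # textual representation of the audio
    annotations: List[Annotation] = field(default_factory=list)

    def spans(self, label: str) -> List[str]:
        """Return the transcript substrings tagged with the given label."""
        return [
            self.transcript[a.start_char:a.end_char]
            for a in self.annotations
            if a.label == label
        ]


def tag_phrase(utt: LabeledUtterance, phrase: str, label: str) -> None:
    """Annotate the first occurrence of a phrase in the transcript."""
    start = utt.transcript.find(phrase)
    if start < 0:
        raise ValueError(f"phrase not found in transcript: {phrase!r}")
    utt.annotations.append(Annotation(start, start + len(phrase), label))
```

For instance, after creating `LabeledUtterance("sample.wav", "heavy congestion on Cong Hoa street")` (an invented example) and calling `tag_phrase(utt, "Cong Hoa street", "ROAD_NAME")`, the call `utt.spans("ROAD_NAME")` returns the tagged road name.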