VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

PHẠM THANH HỮU

QA SYSTEM FOR REAL ESTATE LAW
IN VIETNAM

Major: Computer Science
Major code: 8480101

MASTER’S THESIS

HO CHI MINH CITY, July 2023


THIS RESEARCH IS COMPLETED AT
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU – HCM

Supervisor 1: ASSOC. PROF. QUAN THANH THO, PhD.

Supervisor 2: DR. NGUYEN TIEN THINH, PhD.

Examiner 1 : DR. TRAN TUAN ANH, PhD.

Examiner 2 : DR. BUI THANH HUNG, PhD.
The master’s thesis was defended at HCM City University of Technology,
VNU-HCM City, on 13/07/2023.
Master’s Thesis Committee:
1. Chairman: ASSOC. PROF. VO THI NGOC CHAU


2. Secretary: DR. PHAN TRONG NHAN
3. Reviewer 1: DR. TRAN TUAN ANH
4. Reviewer 2: DR. BUI THANH HUNG
5. Commissioner: DR. BUI CONG GIAO

Approval of the Chairman of Master’s Thesis Committee and Dean of
Faculty of Computer Science and Engineering after the thesis is
corrected (If any).
CHAIRMAN OF THESIS COMMITTEE

DEAN OF FACULTY OF COMPUTER
SCIENCE AND ENGINEERING


VIETNAM NATIONAL
UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY
OF TECHNOLOGY

SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom - Happiness

THE TASK SHEET OF MASTER’S THESIS
Full name: PHAM THANH HUU

Student code: 2171066

Date of birth: 03.03.1978

Place of birth: Quang Ngai


Major: Computer Science

Major code : 8480101

I. THESIS TITLE (In Vietnamese): HỆ THỐNG HỎI ĐÁP TỰ ĐỘNG LUẬT BẤT
ĐỘNG SẢN VIỆT NAM.

II. THESIS TITLE (In English) : QA SYSTEM FOR REAL ESTATE LAW IN
VIETNAM.
III. TASKS AND CONTENTS: Developing a chatbot capable of responding to legal
real estate queries.
IV. THESIS START DATE: 22.12.2022
V. THESIS COMPLETION DATE: 09.06.2023

VI. INSTRUCTOR: ASSOC. PROF. QUAN THANH THO, PhD and DR. NGUYEN
TIEN THINH, PhD.

INSTRUCTOR

INSTRUCTOR

HCM City, 09/06/2023
CHAIRMAN OF
PROGRAM COMMITTEE


DEAN OF
COMPUTER SCIENCE AND ENGINEERING



Acknowledgment
I would like to express my deepest gratitude to my advisor, Assoc. Prof. Quan Thanh Tho, for his
valuable and constructive suggestions during the planning and development of this research work. His
willingness to give his time so generously has been very much appreciated. I am also grateful for his
advice on algorithms and his recommendations on solutions whenever I ran into problems during this
research.
Finally, I wish to thank IVS JSC for funding this study.



Abstract
Intelligent legal services have emerged in recent years through the application of AI technology to the
legal industry; however, they have yet to be developed in Vietnam because of a lack of research into
automatic processing of the Vietnamese language. In this thesis, the author proposes a chatbot
that can effectively and automatically answer legal questions, especially those related to real estate.
The most important module of the chatbot is Legal Statute Identification (LSI), which identifies
the legal statutes relevant to a given description of facts or evidence in a legal document (such as a legal
question or a description of a legal fact). To deploy the LSI model, the author has built an LSI dataset
including more than 300,000 legal questions and millions of judgments of the Supreme People’s Court
of Vietnam. Three models are presented in this thesis. The first is an ML-based model in which
LSI is performed by a Support Vector Machine after the input questions have been embedded
with TF-IDF. The second model, based on deep learning, performs the LSI downstream
task after using a new model called LegarBERT to construct word embeddings for the input question.
Finally, the author attempts to build LSI using graph machine learning by encoding legal reasoning
as nodes and edges that represent queries, legal articles, and legal keywords (legal terminology).

TÓM TẮT LUẬN VĂN THẠC SĨ
Các dịch vụ pháp lý thông minh đã xuất hiện trong những năm gần đây nhờ sự áp dụng của công
nghệ Trí tuệ Nhân tạo vào ngành luật; tuy nhiên, tại Việt Nam, chúng vẫn chưa được phát triển do
thiếu nghiên cứu về xử lý tự động trong tiếng Việt. Trong luận văn này, tác giả đề xuất xây dựng một
chatbot có khả năng trả lời tự động và hiệu quả các câu hỏi pháp lý, đặc biệt là các câu hỏi liên quan
đến bất động sản. Mô-đun quan trọng nhất của chatbot là Hệ thống Xác định Căn cứ Pháp lý (Legal
Statutes Identification - LSI), được sử dụng để xác định các căn cứ pháp lý liên quan đến một mô tả
cụ thể về sự kiện hoặc bằng chứng từ một văn bản pháp lý (như một câu hỏi pháp lý hoặc một mô
tả về sự kiện pháp lý). Để triển khai mô hình LSI, tác giả đã xây dựng một tập dữ liệu LSI gồm hơn
300.000 câu hỏi pháp lý và hàng triệu bản án của Tòa án Nhân dân Tối cao Việt Nam. Luận văn này
trình bày ba mô hình. Mô hình đầu tiên dựa trên máy học (ML), trong đó LSI được thực hiện bằng
Máy Vector Hỗ trợ sau khi câu hỏi đầu vào được biểu diễn bằng phương pháp Nhúng TF-IDF. Mô
hình thứ hai, dựa trên học sâu, sẽ thực hiện các tác vụ LSI sau khi sử dụng một mô hình mới được gọi
là LegarBERT để xây dựng việc nhúng từ cho câu hỏi đầu vào. Cuối cùng, tác giả cố gắng xây dựng
LSI bằng cách sử dụng học máy đồ thị bằng cách mã hóa lý luận pháp lý thành các nút và cạnh, biểu
thị bằng các truy vấn, các điều khoản pháp lý và thuật ngữ pháp lý.

Keywords: LSI, Law Graph, Intelligence Law Service, Vietnamese Law Questions and Answers,
Vietnamese Embedded Word, Law Prediction.


Declaration of Authenticity
I guarantee that this research is my own, conducted under the supervision of Assoc. Prof. Quan Thanh
Tho. The contents and results of this research are legitimate and have not been published in any form
prior to this. The data and materials used for the analysis and feedback are derived from various
resources, which are appropriately listed in the References section.

The data and results of several other authors and organizations have been used and have been aptly
cited.
If there is any plagiarism, I stand by my actions and am to be held responsible for it. Ho Chi Minh
City University of Technology is not responsible for any copyright infringement relating to this dissertation.

Ho Chi Minh City, June 2023
Author
Pham Thanh Huu



Contents
1 Introduction . . . 1
  1.1 Motivation . . . 1
  1.2 Objectives and Scope . . . 4
    1.2.1 Business Objectives . . . 4
    1.2.2 Research Objectives . . . 4
    1.2.3 Thesis’s Scope . . . 4
  1.3 Contributions of the Thesis . . . 5
  1.4 Organization of the Thesis . . . 5
2 Legal Document Structure and Data . . . 7
  2.1 VN-LandLaw-2013 Corpus . . . 7
  2.2 Formal Structure of a Legal Document . . . 9
  2.3 Legal Data Sourcing . . . 10
  2.4 Data Preparation . . . 11
    2.4.1 LSI Labeling . . . 11
    2.4.2 Legal Entity Extraction . . . 15
    2.4.3 Legal Relation Extraction . . . 16
    2.4.4 The TF-IDF Matrix of Vietnam Land Law . . . 18
  2.5 Legal Data Summary Statistics . . . 21
    2.5.1 Legal Data Classification . . . 21
    2.5.2 Basic Legal Data Statistics . . . 21
    2.5.3 Unbalanced Legal Data . . . 26
    2.5.4 Vietnam Land Law Article Semantic Relations Matrix . . . 27
    2.5.5 Vietnam Land Law Article Co-occurrence Matrix in LSI Dataset . . . 28
3 Literature Review . . . 31
  3.1 QAS Research in NLP . . . 33
  3.2 Law-related Global QAS Research . . . 35
  3.3 Vietnamese Law-related QAS Research . . . 37
4 Background . . . 40
  4.1 Technical Background . . . 40
    4.1.1 Term Frequency-Inverse Document Frequency (TF-IDF) . . . 40
    4.1.2 Support Vector Machine (SVM) . . . 40
    4.1.3 Attention Model . . . 41
    4.1.4 Autoencoder Model . . . 42
    4.1.5 BiLSTM . . . 43
    4.1.6 PhoBERT . . . 44
    4.1.7 Masked Language Modeling (MLM) . . . 45
    4.1.8 Fine-tune a Pretrained Model . . . 45
    4.1.9 Graph Convolution Neural Network (GCN) and Graph Attention Network (GAT) . . . 46
  4.2 Business Background . . . 47
  4.3 Legal Domain Background . . . 47
5 The Proposed System . . . 51
  5.1 Business Model . . . 51
  5.2 Overall System Architecture . . . 52
  5.3 The Main User Cases . . . 53
  5.4 UI Design . . . 54
  5.5 The Evaluation/Acceptance Criteria . . . 56
    5.5.1 Chatbot System Acceptance Criteria . . . 56
    5.5.2 LSI Model Metrics . . . 56
    5.5.3 Comparative Methods . . . 59
  5.6 Primary Challenges . . . 62
6 LSI by Linear Support Vector Classification with TF-IDF Embedding . . . 63
  6.1 Introduction . . . 63
  6.2 Related Works . . . 63
  6.3 Network Architecture . . . 64
  6.4 Experiments . . . 64
    6.4.1 Dataset Creation . . . 64
    6.4.2 Training Setup . . . 65
    6.4.3 Metrics . . . 65
  6.5 Results and Conclusions . . . 65
7 LSI by Multi Label Classification with LegarBert . . . 67
  7.1 Introduction . . . 67
  7.2 Related Works . . . 67
  7.3 Legar Answering Engine . . . 69
  7.4 Network Architecture . . . 70
    7.4.1 LegarBert Training from PhoBert . . . 70
  7.5 Experiments . . . 71
    7.5.1 Dataset Creation . . . 71
  7.6 Legal-Masked Strategy . . . 71
    7.6.1 Training Setup . . . 73
    7.6.2 Metrics . . . 73
  7.7 Results and Conclusions . . . 75
8 LegarHKB: A LSI Retrieval Model using Heterogeneous Knowledge Graph for the Vietnamese Law Domain . . . 76
  8.1 Introduction . . . 76
  8.2 Related Works . . . 76
  8.3 Network Architecture . . . 79
  8.4 Experiments . . . 80
    8.4.1 Dataset Creation . . . 80
    8.4.2 Training Setup . . . 81
    8.4.3 Metrics . . . 82
  8.5 Results and Conclusions . . . 82
9 Conclusions . . . 84
  9.1 Conclusions . . . 84
  9.2 Future Works . . . 85
10 List of Deliverables . . . 87
List of Publications . . . 88
References . . . 103


List of Figures
1.1 LSI ChatBot’s response . . . 5
2.1 Data collection procedure . . . 12
2.2 Data labeling application screenshot - Login Page . . . 13
2.3 Data labeling application screenshot - Home Page . . . 14
2.4 Data labeling application screenshot - Labeling Page . . . 14
2.5 The tree of "Chủ thể" (subjects of legal relations) . . . 15
2.6 The tree of "Hành vi" or "Quan hệ pháp lý" (acts/legal relations) . . . 17
2.7 The TF-IDF Matrix of Vietnam Land Law . . . 18
2.8 Legal data categories . . . 21
2.9 Supervised LSI training data statistics per legal category . . . 22
2.10 Supervised LSI training data statistics per legal category (Distribution) . . . 22
2.11 Semi-supervised LegarBert (MLM) training data statistics per legal category (From books) . . . 23
2.12 Semi-supervised LegarBert (MLM) training data statistics per legal category (From books) (Distribution) . . . 24
2.13 Semi-supervised LegarBert (MLM) training data statistics per legal category (From Supreme People’s Court) . . . 25
2.14 Semi-supervised LegarBert (MLM) training data statistics per legal category (From Supreme People’s Court) (Distribution) . . . 26
2.15 Unbalanced legal data phenomenon . . . 27
2.16 Heatmap of 212 legal documents’ TF-IDF vectors’ cosine similarity . . . 27
2.17 Semantic relations of articles 35-51 of chapter IV (Land use master plans and plans) . . . 28
2.18 Heatmap of Vietnam Land Law articles co-occurrence in LSI dataset . . . 28
2.19 Article Samples . . . 29
2.20 High concurrency, high semantic similarity . . . 29
2.21 High concurrency, low semantic similarity . . . 30
3.1 Timeline of automated law research . . . 31
4.1 Architecture of attention Model . . . 42
4.2 Architecture of auto encoder . . . 43
4.3 Architecture of BiLSTM . . . 44
4.4 Architecture of PhoBERT . . . 45
4.5 Graph Attention Neural Network . . . 46
4.6 IVS JSC overview . . . 47
4.7 Vietnamese legal structure . . . 48
4.8 Vietnam real estate law structure . . . 48
4.9 IRAC method . . . 49
4.10 IRAC example . . . 50
5.1 Business Model . . . 51
5.2 Overall system architecture . . . 53
5.3 The main user cases . . . 54
5.4 Main screens of the chatbot . . . 54
5.5 Legal quick lookup popup . . . 55
5.6 Example of KU calculation . . . 58
5.7 Vietnam Land Law 2013 Long-tail dataset . . . 61
6.1 LSI by Support Vector Machine with TF-IDF Embedding model . . . 64
7.1 The Answering Engine of the Legar System . . . 69
7.2 LegarBert embedding model training by MLM tasks . . . 70
7.3 LSI by Multi Label Classification with LegarBert . . . 71
8.1 LSI by Heterogeneous Knowledge Graph . . . 79
8.2 Data transformation process . . . 80
8.3 Nodes and Edges Definition . . . 80
8.4 Graph Design . . . 81
8.5 Graph Demo with some nodes and edges . . . 81
9.1 The directions for future work . . . 86



List of Tables
1.1 NLP techniques used in legal domain . . . 3
2.1 S, O, R, TO, T legal question analysis . . . 10
2.2 Entity extraction example . . . 16
2.3 Top 100 single-word TF-IDF values from 212 Vietnamese Land Law articles . . . 19
2.4 Top 200 single-word TF-IDF values from 212 Vietnamese Land Law articles . . . 20
2.5 Legal Document Sentences and Words Statistics . . . 26
2.6 Most-paired articles . . . 28
3.1 QAS research in NLP . . . 35
3.2 Law-related global QAS research . . . 37
3.3 Vietnamese Law-related QAS research . . . 39
5.1 Comparative Methods . . . 60
5.2 Long-tail dataset . . . 61
6.1 Train/Val/Test Dataset . . . 64
6.2 Model 1 Hyperparameters . . . 65
6.3 LSI by Support Vector Machine with TF-IDF Embedding results . . . 66
7.1 Hyperparameter of LegarBert training . . . 74
7.2 Hyperparameter of LSI by LegarBert . . . 74
7.3 Perplexity comparing with MLM task . . . 75
7.4 LSI by Multi Label Classification with LegarBert results . . . 75
7.5 LSI by Multi Label Classification with LegarBert K-Utility . . . 75
8.1 Hyperparameter of LSI by Heterogeneous Knowledge Graph . . . 82
8.2 LSI by Heterogeneous Knowledge Graph results . . . 82
8.3 LSI by Heterogeneous Knowledge Graph K-Utility . . . 83
9.1 Comparing 3 models by Precision/Recall/F1 . . . 84
9.2 Comparing 2 models by KU . . . 85
9.3 Summary of the embedding capability of 3 models. (ì is None, ã is Having, N is Might Have) . . . 85


Chapter 1
Introduction

1.1 Motivation

In most nations, the legal system is overburdened by a backlog of cases, particularly in low-level
judiciaries. Although speedy justice acts exist, legal processes remain extremely laborious.
Artificial Intelligence (AI) tools can automate these tasks and accelerate the delivery of
justice [1]. The legislation by which businesses and citizens must abide is growing constantly
in both complexity and volume. The data in legislation is mostly held in an unstructured format
in legal documents [2]. This makes information retrieval highly inefficient and time-consuming,
particularly when huge quantities of data are involved. Further, the utility of such
data varies widely and depends on its representation and structure. In this scenario, legal professionals
and users may find it difficult to explore legal data while investigating a specific case
or dealing with particular circumstances, even when the data is accessible [3]. These problems have
created the need for better methods of structuring and searching across huge amounts
of legal data [4]. For this reason, Legal Statute Identification (LSI) is significant in
the legal domain: it identifies the probable set of statutory laws that are
applicable, or that may have been violated, based on a factual description of a scenario given in
natural language. This process has to be carried out at various phases of litigation by experts such
as judges, lawyers, and police personnel. Therefore, automating LSI can significantly increase
access to the law for professionals and the wider public [5].
Due to the rapid advances in deep learning (DL) and natural language processing (NLP), many
question-answering systems (QAS) have been developed for applications such as navigation,
virtual assistants, chatbots, and search engines [6][7]. Such systems can therefore be applied in other fields,
including law, to improve efficiency. The primary purpose of a QAS is to comprehend user intentions
and provide appropriate responses. The QAS extracts its data autonomously in response to a user
query [8]. There are many user-friendly documents on the Internet, but there are also many that are
less so. Consequently, a successful QAS must use relevant documents efficiently.
Because a QAS requires relevant input information, Legal Statute Identification (LSI) is necessary [5]. As
mentioned, LSI identifies the legal statutes that are pertinent to a given description of facts or evidence
in a legal document (such as a legal question or a description of a legal fact).
Recent technological advances have seen Artificial Intelligence (AI) applied in various fields,
and AI applications can now be used for processing, understanding, and interpreting human language.
Massive quantities of often semantically complex information can be represented using knowledge
graphs, which provide structured querying capabilities and easy access. Moreover, models based
on Semantic Web standards and knowledge graphs can be employed for reasoning and inferring new
data or finding inconsistencies in information. Further, they enable downstream applications, such as
classification, prediction, dialogue, and QAS [9].

Master's Thesis, Semester 222, Academic year 2022 - 2023
Page 1/109
Ho Chi Minh City University of Technology
Faculty of Computer Science and Engineering

Legal Artificial Intelligence (LegalAI), which focuses predominantly on AI in legal applications, has
garnered enormous interest in recent years [10]. LegalAI approaches rely on Natural Language
Processing (NLP) [11] because the majority of resources in the judicial system are textual, such as
contracts, judgment documents, and legal provisions. Knowledge graphs are a comparatively
state-of-the-art technology in NLP, and they are well suited to structuring legal data. They use
graph models to describe knowledge and build relations among various entities. Knowledge
graphs can be classified into two groups: general knowledge graphs and domain knowledge graphs.
The former type is the most commonly used because of its broad information coverage across
various domains [12]. Domain knowledge graphs, on the other hand, are intended for particular
domains and emphasize depth of knowledge. As the information available in the legal domain has
rigorous and complicated knowledge features, domain knowledge graphs are preferred. At the same
time, advances in graph databases and machine learning have opened a promising way to
construct legal document knowledge graphs [13].
In the legal field, knowledge graphs can be constructed by depicting the cases filed in courts as nodes
and citations as edges, thereby enabling numerous graph processes. Representing legal documents
in this way can enhance the performance of several downstream applications, such as finding
similar cases, judgment prediction, text summarization, question answering, and legal cognitive assistance
[1]. Legal intelligence aims to use AI technologies, such as speech recognition and NLP, to empower
the domain of intelligent justice [14]. Most LSI techniques are based on simple machine
learning and statistical methods [5]. Recently, with the availability of high-quality legal texts,
NLP approaches have become the most common choice in legal text mining, which is
gradually becoming a widely studied topic [15]. Multiple models have been devised
in the past for producing knowledge graphs from unstructured data.
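
The node-and-edge representation described above can be made concrete with a minimal sketch. It assumes a toy citation graph stored as an adjacency list; the case identifiers and the `similar_cases` helper are hypothetical, and ranking by Jaccard overlap of citation sets is only an illustrative stand-in for the "finding similar cases" task, not the method of any cited work.

```python
from collections import defaultdict

# Hypothetical toy data: each pair (a, b) means "case a cites case b".
CITATIONS = [
    ("case_A", "case_C"),
    ("case_A", "case_D"),
    ("case_B", "case_C"),
    ("case_B", "case_D"),
    ("case_E", "case_D"),
]


def build_citation_graph(citations):
    """Return an adjacency list mapping each citing case to the set of cases it cites."""
    graph = defaultdict(set)
    for citing, cited in citations:
        graph[citing].add(cited)
    return graph


def similar_cases(graph, case):
    """Rank the other citing cases by Jaccard overlap of their citation sets."""
    target = graph.get(case, set())
    scores = []
    for other, cited in graph.items():
        if other == case:
            continue
        union = target | cited
        score = len(target & cited) / len(union) if union else 0.0
        scores.append((other, score))
    # Highest overlap first; ties are broken alphabetically by case identifier.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


graph = build_citation_graph(CITATIONS)
ranking = similar_cases(graph, "case_A")
```

Here `case_B` cites exactly the same cases as `case_A`, so it ranks first, while `case_E` shares only one citation and ranks second.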
Convolutional Neural Networks (CNN) and Sequence-to-Sequence (Seq2Seq) frameworks such as the
Recurrent Neural Network (RNN) have performed well in several NLP tasks, such as document
classification, information retrieval, and sentiment analysis [6][2]. Furthermore, deep learning
is a modern technique in AI and has been used extensively owing to its promising outcomes in
classification and prediction problems across various fields. In particular, knowledge matching of deep
learning components, including input features, hidden units and layers, and output predictions, with
ontologies and knowledge graphs can make the internal mechanisms of these processes more understandable
and transparent. Additionally, the query and reasoning strategies of knowledge graphs allow
innovative explanations, such as interactive and cross-disciplinary explanations [16].
Table 1.1 reviews existing schemes that apply NLP in the legal
domain to identify the knowledge contained in legal texts, with an emphasis on deep learning approaches.
On the other hand, when examining legal documents within the context of professional LSI systems,
studies in the field of CS and AI often approach them as textual documents, following traditional
approaches to document representation and classification. Nonetheless, this approach faces certain
limitations when applied to legal documents. Firstly, legal documents encompass specific legal entities
and relationships, such as subjects and objects, which require careful consideration and effective
integration with emerging embedding techniques in NLP. This presents a non-trivial challenge.
Secondly, previous techniques tended to classify a legal document containing facts or evidence under a single
legal term or provision. However, a legal matter such as a question or consultation issue often involves
multiple provisions. For instance, if the area of the land purchased is smaller than the area stated on the
certificate, one can refer to several relevant provisions of the Vietnam Land Law 2013, such as Articles
98, 114, 115, and 116, to address this issue.
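
Because one question can point to several provisions at once, LSI is naturally framed as multi-label classification. The following is a minimal sketch of the target encoding, assuming the 212 articles of the Vietnam Land Law 2013 as the label space; the helper names `encode_articles` and `decode_articles` are hypothetical and not taken from the thesis.

```python
NUM_ARTICLES = 212  # assumed label space: Articles 1..212 of the Land Law 2013


def encode_articles(article_numbers, num_articles=NUM_ARTICLES):
    """Turn a collection of relevant article numbers into a multi-hot vector."""
    vector = [0] * num_articles
    for number in article_numbers:
        if not 1 <= number <= num_articles:
            raise ValueError(f"article {number} is outside 1..{num_articles}")
        vector[number - 1] = 1  # Article n is stored at index n - 1
    return vector


def decode_articles(vector):
    """Recover the sorted article numbers from a multi-hot vector."""
    return [index + 1 for index, flag in enumerate(vector) if flag]


# The land-area question from the text is linked to four articles at once.
target = encode_articles([98, 114, 115, 116])
```

A single-label classifier would have to pick one of the four articles, whereas a multi-hot target lets a model be trained and scored on the whole set.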


Table 1.1: NLP techniques used in legal domain.

Authors | Methods | Advantages | Disadvantages
Paul, S., et al. [5] | Legal Statute Identification using Citation Network (LeSICiN) | LeSICiN offered high generalizability and was effective in producing a feature-rich and robust representation of the document. | It failed to capture the semantic relationships in the heterogeneous citation graph.
Sovrano, F., et al. [17] | Question-Answering (QA) Algorithm | This method was effective in determining the probable answers for every question, even with limitations to knowledge explicitly. | This algorithm was trained for solving processes associated with common sense, and so produced poor results with reasoning.
Li, G., et al. [18] | Graph Long Short-Term Memory (Graph LSTM) | This technique produced high classification accuracy and computational efficiency. | The Graph LSTM method was inefficient in producing better results with reduced data size.
Zhao, Q., et al. [14] | Graph neural network-based Legal Judgment Prediction (LJP) | This model was capable of extracting feature information fully and alleviating the data imbalance problem. | The network suffered from overfitting issues.
Ji, D., et al. [15] | Deep Neural Network (DNN) | This technique was capable of leveraging contextual information with high efficiency and modeling the implicit relationship among entities. | This scheme failed to consider incorporating structural information and domain knowledge for enriching the semantic meanings of abbreviations to enhance performance.
Sulis, E., et al. [19] | Co-occurrence network (CN) | This technique was effective in automatically identifying the link classes and implicit links between norms of legal texts with high accuracy. | The method was futile in improving the classification scheme to attain better results.
Zhu, G., et al. [20] | Bidirectional Long-Short-Term Memory with Conditional Random Field (Bi-LSTM-CRF) | The Bi-LSTM-CRF was successful in fully mining information tacit knowledge, and enhancing the retrieval efficiency. | The method suffered from high computational complexity.
Vuong, Y.T.H., et al. [21] | Supporting Model with BERT for Case law Retrieval (SM-BERT-CR) | This scheme was suitable for retrieving case details from all legal case documents, irrespective of the document length. | Though this model was efficient in identifying the support relation directly, it was unsuccessful in improving the F1-score.

Page 3/109


Ho Chi Minh City University of Technology
Faculty of Computer Science and Engineering
To overcome those drawbacks, we evaluated the following three approaches:
• Machine Learning based approach: Implementation of LSI task by Linear Support Vector Classification with TF-IDF Embedding
• Deep Learning based approach: Implementation of LSI task by Multi-Label Classification with
LegarBert (Our proposed BERT-based pre-trained model for Vietnam Law Domain)
• Graph machine learning based approach: Implementation of LSI task with LegarHKB (Our
proposed heterogeneous knowledge graph for the Vietnam Law Domain)
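
As a rough illustration of the embedding step behind the first approach, the sketch below computes TF-IDF weights for already tokenized questions using only the Python standard library. The function name `tfidf_vectors`, the toy tokens, and the smoothed IDF formula are assumptions made for illustration; they are not the configuration used in this thesis. A linear support vector classifier would then be trained on vectors of this kind.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute sparse TF-IDF weights for tokenized documents.

    Assumed weighting (one common smoothed variant):
      tf  = count of token in document / document length
      idf = ln((1 + N) / (1 + df)) + 1
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            token: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
            for token, count in counts.items()
        })
    return vectors


# Toy, hypothetical tokenized questions (word-segmented Vietnamese).
docs = [
    ["chuyển_nhượng", "đất"],
    ["cấp", "giấy_chứng_nhận", "đất"],
]
vectors = tfidf_vectors(docs)
# "đất" occurs in every document, so it is weighted lower than "chuyển_nhượng".
print(vectors[0])
```

Tokens that appear in every question receive the lowest weight, which is why a linear classifier over these features tends to focus on the more distinctive legal terms.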


1.2

Objectives and Scope

1.2.1

Business Objectives

The business objective of this thesis is to build an intelligent legal service based on chatbot
technology as the foundation for implementing an on-demand legal service for IVS JSC.

1.2.2

Research Objectives

• Construct a chatbot application that can respond to legal questions related to real estate.
• Develop a knowledge graph based on expert legal knowledge, using advanced AI technologies for Legal Statute Identification (LSI), to reduce the number of errors made by laypeople and attorneys.
• Determine the significance of Vietnamese words, which helps decide whether they should be included in Legal Statute Identification (LSI).
• Understand how to collect and process data, particularly Vietnamese legal data, in order to construct an efficient natural language processing system.
• Understand the BERT model and other fundamental AI models, including standard pre-trained language models and the training of a domain-specific language model.
• Understand the entire procedure for constructing a natural language processing system using DL/ML models.

1.2.3

Thesis’s Scope


• This thesis tests, selects, and deploys an NLP model that predicts the pertinent [law, clause, term] in response to a real estate legal query (LSI for the Vietnamese Land Law 2013). Figure 1.1 depicts a user inputting a query and the LSI ChatBot responding.
• The scope also covers the integration of this model with related third-party components, such as the RASA chatbot platform, the module that summarizes [law, clause, term] to provide answers, the user authorization module, the user contract management module, and the module that connects users with legal experts. Although the thesis author did not implement these modules directly, third parties created all module source code in accordance with the author’s original concept, vision, guidance, analysis, and enhancement requirements.


Figure 1.1: LSI ChatBot’s response

1.3

Contributions of the Thesis

The main contributions of this thesis are as follows:
• Legal knowledge representation and legal reasoning representation using deep embedding and legal domain knowledge.
• The LegarBert language model repository, created for Vietnam Land Law. Digitized data from the Vietnam Land Law 2013 was processed using advanced methods and embedded using BERT’s masked language model.
• LegarHKB, a new heterogeneous knowledge graph-based model for Vietnam Land Law. The model includes legal terminology, subject, object, relation, event, and time nodes from a large Vietnam Law database. This graph data warehouse digitizes and labels approximately 1 million Supreme Court of Vietnam cases.
• A new metric suited to commercializing LSI products, where traditional ML and DL metrics such as Precision, Recall, and F1 are incomprehensible to consumers. This metric is referred to as KU and is introduced in the following sections.

1.4

Organization of the Thesis

This thesis is organized as follows. Chapter 1 introduces the LSI problem and outlines the contributions of the thesis. Chapter 2 provides a legal expert’s view of the structure and components of a legal document (which could be a legal question or a description of a legal fact), from which the reader can gain insight into the data to be processed by the system. This chapter also covers data collection, labeling, and processing. Chapter 3 reviews the literature on contemporary studies of QAS in NLP, international studies on automating rules with AI techniques, and research applying AI to Vietnam law automation. Chapter 4 provides the technical, business, and legal domain background required to understand the research presented in the following chapters. Chapter 5 introduces the Legar structure (a chatbot system that answers questions on Vietnam Land Law documents), including system design, functional design, screen design, and system requirements upon acceptance. Chapter 6 discusses using Linear Support Vector Classification and TF-IDF Embedding to solve the LSI problem. Chapter 7 introduces the approach using LegarBert, a model we pre-trained using masked language modeling (MLM) on BERT. Chapter 8 presents an approach using LegarHKB, an LSI retrieval model using a Heterogeneous Knowledge Graph for the Vietnamese Law Domain. We conclude the thesis in Chapter 9 with some insights into future directions. Chapter 10 presents the deliverables of the thesis, including source code, published models, published data, and articles (in progress, submitted, or accepted, if any) based on the research of this thesis.



Chapter 2
Legal Document Structure and Data
2.1

VN-LandLaw-2013 Corpus

The VN-LandLaw-2013 corpus is a collection of questions, answers, and corresponding labels related to the Vietnamese Land Law 2013. Preparing this dataset involved two main steps: data collection and data labeling. In the data collection step, we acquired a digitized version of the Vietnamese Land Law 2013. We then gathered conversations from land-law-related online forums, which reflect real legal consultations by experts on the application of the Vietnamese Land Law 2013. Our team of legal experts labeled the collected conversations to extract relevant information such as doctype, legislation, article, clause, and point. In total, 5,910 data samples were collected for this corpus.
The conversation given in Example 2.1 is a real sample extracted from the VN-LandLaw-2013 corpus. Listing 2.1 provides the full information of this data item as stored in the corpus. As can be observed, a data item is annotated with a substantial amount of legal information. In this thesis, we focus on the annotated Articles of the Land Law 2013, which for this item are Articles 168, 188, and 186. These Articles are used as the labels for this item. Thus, our consultation problem is realized as a multi-label classification task, where the Legar system classifies a user’s concern into the corresponding Articles.
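
The multi-label framing can be made concrete with a small sketch. It assumes a hypothetical label vocabulary of four article numbers (chosen only for illustration, not the actual label set of the corpus) and encodes the Articles annotated for one item as a multi-hot vector, which is the usual target format for a multi-label classifier. The helper names `build_label_index` and `multi_hot` are likewise illustrative.

```python
from typing import Dict, List, Sequence


def build_label_index(articles: Sequence[str]) -> Dict[str, int]:
    """Map each distinct article number to a fixed column position."""
    ordered = sorted(set(articles), key=int)
    return {article: position for position, article in enumerate(ordered)}


def multi_hot(item_articles: Sequence[str], label_index: Dict[str, int]) -> List[int]:
    """Encode one item's articles as a 0/1 vector over the label vocabulary."""
    vector = [0] * len(label_index)
    for article in item_articles:
        vector[label_index[article]] = 1
    return vector


# Hypothetical vocabulary; a real one would cover every Article seen in the corpus.
label_index = build_label_index(["101", "168", "186", "188"])

# Labels of the sample item discussed above: Articles 168, 188, and 186.
target = multi_hot(["168", "188", "186"], label_index)
print(target)  # [0, 1, 1, 1]
```

Each position in the vector corresponds to one Article, so a single question can activate several Articles at once, unlike single-label classification.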
Example 2.1. In a real conversation concerning a legal matter in a land transaction, a user expresses the following concern: ‘Tôi đang thực hiện giao dịch mua đất của Ông A, hiện tại đã ký xong hợp đồng chuyển nhượng có công chứng. Nay tôi được công chứng viên báo là ông A đang bị khởi tố vì tội lừa đảo chiếm đoạt tài sản và có thể giao dịch mua đất của tôi sẽ bị tạm dừng do tài sản của ông A sẽ bị phong toả? Xin hỏi công chứng viên nói như vậy có đúng không? Xin cảm ơn!’ (‘I am currently executing the purchase transaction of land from Mr. A. The transfer contract has been signed and notarized. Today, the notary public informed me that Mr. A is under investigation for the offense of fraudulent misappropriation of assets, and as a result, the land purchase transaction may be temporarily suspended due to the possibility of asset freezing against Mr. A. May I inquire if the notary public’s statement is accurate? Thank you!’).


{
  "id": "92",
  "question": "Tôi đang thực hiện giao dịch mua đất của Ông A, hiện tại đã ký xong hợp đồng chuyển nhượng có công chứng. Nay tôi được công chứng viên báo là ông A đang bị khởi tố vì tội lừa đảo chiếm đoạt tài sản và có thể giao dịch mua đất của tôi sẽ bị tạm dừng do tài sản của ông A sẽ bị phong toả? Xin hỏi công chứng viên nói như vậy có đúng không? Xin cảm ơn!",
  "answer": "Điều 188 Luật Đất đai 2013 quy định điều kiện thực hiện quyền chuyển nhượng quyền sử dụng đất như sau: - Có Giấy chứng nhận, trừ trường hợp quy định tại khoản 3 Điều 186 và trường hợp nhận thừa kế quy định tại khoản 1 Điều 168 của Luật đất đai 2013; - Đất không có tranh chấp; Quyền sử dụng đất không bị kê biên để bảo đảm thi hành án; - Trong thời hạn sử dụng đất. Căn cứ quy định nêu trên, nếu mảnh đất mà bạn đang thực hiện giao dịch mua bán, chuyển nhượng đáp ứng các điều kiện trên thì người bán có quyền chuyển nhượng mảnh đất này. Hợp đồng chuyển nhượng được công chứng theo đúng quy định pháp luật. Căn cứ quy định tại Điều 82 Luật Công chứng 2014, tổ chức hành nghề công chứng có thể cập nhật thông tin về biện pháp ngăn chặn được áp dụng đối với tài sản có liên quan đến hợp đồng, giao dịch đã được công chứng, cụ thể như sau: “Điều 62. Cơ sở dữ liệu công chứng 1. Cơ sở dữ liệu công chứng bao gồm các thông tin về nguồn gốc tài sản, tình trạng giao dịch của tài sản và các thông tin về biện pháp ngăn chặn được áp dụng đối với tài sản có liên quan đến hợp đồng, giao dịch đã được công chứng.” Căn cứ quy định trên, văn phòng công chứng nơi bạn đã công chứng hợp đồng chuyển nhượng sẽ cập nhật được những thông tin ngăn chặn đối với tài sản mà bạn đang giao dịch. Do đó, nếu những thông tin ngăn chặn đối với tài sản bạn đang giao dịch đã được cập nhật trên cơ sở dữ liệu công chứng thì nội dung văn phòng công chứng đưa ra như bạn nêu trên là hoàn toàn có cơ sở. Tuy hợp đồng chuyển nhượng quyền sử dụng đất của bạn đã được công chứng nhưng theo quy định tại Điều 11a Thông tư 24/2014/TT-BTNMT (được bổ sung bởi Thông tư 33/2017/TT-BTNMT), cơ quan có thẩm quyền sẽ dừng giải quyết thủ tục cấp Giấy chứng nhận quyền sử dụng đất đối với hồ sơ xin cấp Giấy chứng nhận quyền sử dụng đất của bạn. Xem thêm: Trường hợp nào bị phong tỏa tài khoản ngân hàng? Trên đây là nội dung tư vấn dựa trên những thông tin mà luật sư đã nhận được. Nếu còn bất kỳ thắc mắc nào liên quan, vui lòng liên hệ 19006199 để được hỗ trợ kịp thời. Xin cảm ơn!",
  "summary": "Nếu những thông tin ngăn chặn đối với tài sản bạn đang giao dịch đã được cập nhật trên cơ sở dữ liệu công chứng thì nội dung văn phòng công chứng đưa ra như bạn nêu trên là hoàn toàn có cơ sở. Tuy hợp đồng chuyển nhượng quyền sử dụng đất của bạn đã được công chứng nhưng theo quy định tại Điều 11a Thông tư 24/2014/TT-BTNMT (được bổ sung bởi Thông tư 33/2017/TT-BTNMT), cơ quan có thẩm quyền sẽ dừng giải quyết thủ tục cấp Giấy chứng nhận quyền sử dụng đất đối với hồ sơ xin cấp Giấy chứng nhận quyền sử dụng đất của bạn.",
  "legals": [
    {"doctype": "Luật", "legislation": "Luật Đất đai 2013", "article": "188", "clause": "Không xác định", "point": ""},
    {"doctype": "Luật", "legislation": "Luật Đất đai 2013", "article": "168", "clause": "1", "point": ""},
    {"doctype": "Luật", "legislation": "Luật Đất đai 2013", "article": "186", "clause": "3", "point": ""},
    {"doctype": "Luật", "legislation": "Luật Công chứng 2014", "article": "82", "clause": "Không xác định", "point": ""},
    {"doctype": "Luật", "legislation": "Luật Công chứng 2014", "article": "62", "clause": "Không xác định", "point": ""},
    {"doctype": "Thông tư", "legislation": "Thông tư 24/2014/TT-BTNMT (được bổ sung bởi Thông tư 33/2017/TT-BTNMT)", "article": "11", "clause": "Không xác định", "point": ""}
  ]
}

Listing 2.1. An Example of Data Item in VN-LandLaw-2013 Corpus.
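
To show how the labels could be read from a stored item, the following sketch parses a record in the format of Listing 2.1 and keeps only the Articles that belong to the Land Law 2013. The function name `land_law_articles` and the trimmed sample record are assumptions made for illustration; the field names (`legals`, `legislation`, `article`) follow the listing.

```python
import json
from typing import List


def land_law_articles(record_json: str,
                      legislation: str = "Luật Đất đai 2013") -> List[str]:
    """Return the distinct articles of one legislation cited in a corpus record."""
    record = json.loads(record_json)
    articles: List[str] = []
    for reference in record.get("legals", []):
        article = reference.get("article")
        # Keep only references to the requested legislation, without duplicates.
        if reference.get("legislation") == legislation and article not in articles:
            articles.append(article)
    return articles


# Trimmed, illustrative copy of the item in Listing 2.1 (long text fields omitted).
sample = json.dumps({
    "id": "92",
    "legals": [
        {"doctype": "Luật", "legislation": "Luật Đất đai 2013", "article": "188",
         "clause": "Không xác định", "point": ""},
        {"doctype": "Luật", "legislation": "Luật Đất đai 2013", "article": "168",
         "clause": "1", "point": ""},
        {"doctype": "Luật", "legislation": "Luật Đất đai 2013", "article": "186",
         "clause": "3", "point": ""},
        {"doctype": "Luật", "legislation": "Luật Công chứng 2014", "article": "82",
         "clause": "Không xác định", "point": ""},
    ],
}, ensure_ascii=False)

labels = land_law_articles(sample)
print(labels)  # ['188', '168', '186']
```

References to other legislation, such as the Notarization Law entry above, are filtered out, which matches the focus on Land Law Articles as classification labels.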


2.2

Formal Structure of a Legal Document

People without specialized legal knowledge, and even inexperienced lawyers, have difficulty examining and interpreting complex legal documents [10]. The volume of legal text on the Internet and in specialized systems has grown exponentially, along with other natural-language text data such as scientific publications, news stories, and social media [22]. In contrast to other literature, legal texts contain strong logical links between sentences or articles, expressed through words, phrases, concerns, concepts, and variables related to the law [23]. These logical links create ambiguity, which makes finding information and providing answers in the legal field more difficult than in other fields [24].
Researchers [25][26] have identified several basic ontology design patterns that are regularly used to model legal norms: i) Agent-role-time; ii) Event-time-place-jurisdiction; iii) Agent-action-time [27]; iv) Object-document [27]; v) Legal deontic ontology [28][26]. These patterns, combined with linguistic taxonomies, could provide a good solution for bridging the variants of legal definitions and the conceptualization level [29].
In Vietnamese legal science, identifying the specific laws and provisions that resolve a legal question typically involves analyzing the following factors: SUBJECT (S), OBJECT (O), LEGAL RELATIONSHIP (R), TRANSACTION OBJECT (TO), and TIME of legal events (T). Determining these factors is essential because each legal code governs its own set of S, O, R, TO, and T, through which we can identify the laws and provisions that apply to the problem.
• Subject (“Chủ thể”, subjects of legal relations): an individual or organization with legal capacity and capacity for legal acts that participates in legal relations.
• Object (“Khách thể”, objects of legal relations): the material benefits, spiritual benefits, or both that the subject parties want to achieve when entering a particular legal relationship.
• Relationship (“Hành vi” or “Quan hệ pháp lý”, acts or legal relations): something that people do or cause to happen.
• Transaction Object (“Đối tượng bị tác động”): the objects of legal transactions, such as houses, land, money, and gold.
• Time: the time at which the legal action or event occurs.
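Example and analysis:
Question: “Vợ chồng bà A đang sử dụng 150 m² đất tại thành phố Hồ Chí Minh. Mảnh đất được chuyển nhượng năm 1970. Vợ chồng bà A có hợp đồng mua bán đất có chứng nhận của cơ quan có thẩm quyền đương thời (Năm 1970). Vợ chồng bà A có quyền được cấp giấy chứng nhận quyền sử dụng đất không? Ngoài ra Vợ chồng bà A có quyền giao dịch quyền sử dụng đất nêu trên không?” (“Mrs. A and her husband are using 150 m² of land in Ho Chi Minh City. The land was transferred in 1970. They hold a land sale contract certified by the competent authority of the time (1970). Are they entitled to be granted a land use right certificate? In addition, do they have the right to conduct transactions involving this land use right?”)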

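Table 2.1: S, O, R, TO, T legal question analysis.
Subject (S): Vợ chồng bà A (Mrs. A and her husband); cơ quan có thẩm quyền (competent authority)
Objects (O): hợp đồng mua bán (sale contract); quyền sử dụng đất (land use right)
Relationship (R): sử dụng (use); chuyển nhượng (transfer); chứng nhận (certify); cấp giấy chứng nhận (grant a certificate); giao dịch (transact)
Transaction object (TO): đất (land)
Time (T): 1970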

Assuming that the sets S, O, R, TO, and T are labeled and identified correctly, we can expect to develop an AI model that recognizes Article 101 and Article 188 of the Vietnam Land Law 2013 for the question in the preceding example.
Due to time constraints, this thesis only examines the extraction of S and R and the development of an AI model that effectively leverages S and R to perform LSI. This is one of the key techniques of the thesis.
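
The five factors can be pictured as a simple record. The sketch below is illustrative only: the class name `LegalFactors`, its field names, and the `lsi_features` helper are hypothetical, and the grouping of the Object values into two entries is an assumption. The values are taken from the example question analyzed in Table 2.1, and the helper returns only S and R, the two factors this thesis uses for LSI.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LegalFactors:
    """S, O, R, TO, T factors extracted from one legal question."""
    subjects: List[str] = field(default_factory=list)
    objects: List[str] = field(default_factory=list)
    relationships: List[str] = field(default_factory=list)
    transaction_objects: List[str] = field(default_factory=list)
    times: List[str] = field(default_factory=list)

    def lsi_features(self) -> List[str]:
        """Return prefixed S and R values, ignoring O, TO, and T."""
        subject_features = [f"S:{value}" for value in self.subjects]
        relation_features = [f"R:{value}" for value in self.relationships]
        return subject_features + relation_features


# Values from the example question (Table 2.1).
factors = LegalFactors(
    subjects=["Vợ chồng bà A", "cơ quan có thẩm quyền"],
    objects=["hợp đồng mua bán", "quyền sử dụng đất"],  # assumed to be two entries
    relationships=["sử dụng", "chuyển nhượng", "chứng nhận",
                   "cấp giấy chứng nhận", "giao dịch"],
    transaction_objects=["đất"],
    times=["1970"],
)
print(factors.lsi_features())
```

Keeping all five factors in the record while exposing only S and R to the model leaves room to add the remaining factors later without changing the data format.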

2.3

Legal Data Sourcing

The majority of the data is collected from online sources and law books (which are scanned and converted
to digital format using OCR technology). The main data sources are as follows.
a) Academic Q&A group:
• />
b) Question and answer group about specific cases:
• Ministry of Finance: />
• Ministry of Labour, Invalids and Social Affairs: />
• Ministry of Education and Training: />
• Ministry of Justice: />
• Ministry of Industry and Trade: />
• Ministry of Culture, Sports and Tourism: />
• Ministry of Transport: />
• Ministry of Construction: />
• Ministry of Agriculture and Rural Development: />
• Ministry of Information and Communications: />
• Ministry of Public Security: />
• Ministry of Science and Technology: />
• Ministry of Home Affairs: />
• Thuvienphapluat: />
c) Publication of the judgment of the Supreme People’s Court:
There are more than 1 million judgments published by the people’s courts at


2.4

Data Preparation

2.4.1

LSI Labeling

After acquiring the data, the author organized a team of law students (approximately 50 students over three
months) to label the LSI data. From each set of questions and answers, the labeler must identify the question’s [law, clause, term] and summarize the response.



Figure 2.1: Data collection procedure
