VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
BUI MINH HIEU
RESEARCH AND DEVELOP SOLUTIONS TO ESTIMATE
TRAFFIC DENSITY FROM TRAFFIC CAMERAS AT MAIN
INTERSECTIONS
Major: COMPUTER SCIENCE
Major code: 8480101
MASTER’S THESIS
HO CHI MINH CITY, month 07 year 2023
THIS THESIS IS COMPLETED AT
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM
Supervisor: Assoc. Prof. Ph. D Tran Minh Quang
Examiner 1: Assoc. Prof. Ph. D Nguyen Van Vu
Examiner 2: Assoc. Prof. Ph. D Nguyen Tuan Dang
This master’s thesis is defended at HCM City University of Technology,
VNU- HCM City on July 11th 2023
Master’s Thesis Committee:
(Please write down full name and academic rank of each member of the
Master’s Thesis Committee)
1. Chairman: Assoc. Prof. Ph. D Le Hong Trang
2. Secretary: Ph. D Phan Trong Nhan
3. Review 1: Assoc. Prof. Ph. D Nguyen Van Vu
4. Review 2: Assoc. Prof. Ph. D Nguyen Tuan Dang
5. Member: Assoc. Prof. Ph. D Tran Minh Quang
Approval of the Chairman of Master’s Thesis Committee and Dean of Faculty
of Computer Science and Engineering after the thesis being corrected (If any).
CHAIRMAN OF THESIS COMMITTEE
HEAD OF FACULTY OF COMPUTER
SCIENCE AND ENGINEERING
VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom - Happiness
THE TASK SHEET OF MASTER’S THESIS
Full name : Bui Minh Hieu
Student ID : 2170461
Date of birth : 17/02/1997
Place of birth : HCM
Major : Computer science
Major ID : 8480101
I. THESIS TITLE : RESEARCH AND DEVELOP SOLUTIONS TO
ESTIMATE TRAFFIC DENSITY FROM TRAFFIC CAMERAS AT MAIN
INTERSECTIONS (NGHIÊN CỨU, XÂY DỰNG CÁC PHÉP ƯỚC LƯỢNG
MẬT ĐỘ GIAO THÔNG DỰA VÀO DỮ LIỆU CAMERA Ở NHỮNG NÚT
GIAO THÔNG QUAN TRỌNG).
II. TASKS AND CONTENTS :
The objective of this thesis is to research and develop a method for estimating
traffic density at intersection areas using images or videos. Accordingly, the tasks
involved in this work include determining the calculation method and evaluating
traffic density for a given area, comparing existing papers and systems with the
proposed method to identify any differences, proposing solutions to address
challenges, and advancing the calculation method of the thesis. Additionally,
designing a prototype system to demonstrate the functionality will be undertaken.
III. THESIS START DAY : (According to the decision on assignment of
Master’s thesis) 06/09/2022
IV. THESIS COMPLETION DAY : (According to the decision on assignment of
Master’s thesis) 08/06/2023
V. SUPERVISOR : (Please fill in the supervisor’s full name and academic rank)
Assoc. Prof. Ph.D Tran Minh Quang
HCM City, June 8th 2023
SUPERVISOR
(Full name and signature)
CHAIR OF PROGRAM COMMITTEE
(Full name and signature)
HEAD OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING
(Full name and signature)
ACKNOWLEDGEMENT
First and foremost, I would like to express my heartfelt gratitude to Assoc. Prof.
Ph. D Tran Minh Quang, the instructor who guided and helped me throughout the
completion of this thesis.
I would also like to thank the Computer Science and Engineering instructors, as well as
the teachers who generously shared their knowledge with me throughout my time at
Ho Chi Minh City University of Technology. Finally, I would like to thank my family and
friends, who have always supported and encouraged me throughout this research.
HCM City, June 8th 2023
Bui Minh Hieu
ABSTRACT
Traffic congestion has become a pressing issue, not only for the government but also
for the general public, as it directly impacts the quality of life. It is necessary to
research and develop solutions to manage traffic conditions and minimize the impact
of congestion. With this in mind, this study presents a method to
estimate traffic conditions, specifically traffic density at intersections. To achieve
this, the research team divided the task into two main components: vehicle counting
and area measurement. Vehicle counting is relatively simple with the assistance of
modern technologies and techniques. Area measurement is more challenging, so this study
proposes methods to calculate the area of a region from images, without relying on
camera specifications, based on the concept of object reference. Additionally,
we present the design of the testing system, the challenges faced
and the accompanying solutions during the design process, as well as the results
achieved when applied in a real-world environment. Finally, alongside the expected
outcomes, some limitations need to be discussed and addressed in the future.
Keywords: Traffic density, Area of a region, Object reference
MASTER’S THESIS SUMMARY
Traffic congestion has become a pressing problem, not only for the government but also
for the public, because it directly affects quality of life. Researching and developing
traffic management solutions that reduce the impact of congestion has become more
necessary than ever. To better understand this problem, this study provides readers
with a method for estimating and evaluating traffic conditions, specifically by
calculating traffic density at intersections. To achieve this goal, the research team
divided the work into two main parts: counting vehicles and measuring the area of the
region. Counting vehicles is fairly simple with the support of modern technologies and
techniques. In contrast, measuring the area of the region is harder and more
challenging, so this study proposes methods for calculating the area of a region from
images, without relying on camera specifications, based on the concept of object
reference. In addition, through this study, we also present the system architecture we
designed, the challenges encountered while running the system and the accompanying
solutions, as well as the results achieved when applied in a real-world environment.
Finally, alongside the expected results, some limitations need to be discussed and
addressed in the future.
Keywords: Traffic density, Area of a region, Object reference
THE COMMITMENT
I confirm that this is my research. The data utilized in the thesis's complete analytic
process has a clear and transparent provenance, and it was released in compliance
with scientific research standards and ethics. In this thesis, I have presented the results
of my study openly and fairly. The thesis results are presented in this report for the
first time and have not been published in any earlier thesis.
HCM City, June 8th 2023
Bui Minh Hieu
TABLE OF CONTENTS
I. INTRODUCTION
1.1 Research problem
1.2 Objectives of the topic
1.3 Scope of study
1.4 Scientific and practical significances
1.4.1 Practical significance
1.4.2 Academic significance
II. THEORETICAL BASIS
2.1 Definition of traffic density
2.2 Definition of level of service (LOS)
2.3 Definition of computer vision - machine learning
2.3.1 Haar-cascade
2.3.1.1 Calculating Haar Features
2.3.1.2 Creating Integral Images
2.3.1.3 Adaboost Training
2.3.1.4 Implementing Cascading Classifiers
2.3.2 Convolutional neural network models
2.3.2.1 Convolution layer
2.3.2.2 Pooling layer
2.3.2.3 Fully connected layer
2.3.3 YOLO
2.3.3.1 Residual blocks
2.3.3.2 Bounding box regression
2.3.3.3 Intersection over union (IOU)
2.3 Definition of pixel per meter
III. RELATED WORKS
3.1 Traffic situation in Ho Chi Minh City
3.1.1 Overview of traffic situation in Ho Chi Minh City
3.1.2 Statistics of damage
3.2 Vehicles detect and count approaches
3.2.1 Foreground extraction
3.2.2 Haar-cascade
3.2.3 Convolutional neural network
3.2.4 YOLO
3.3 Object size measuring methods
3.3.1 Math-based calculating method
3.3.2 Object reference method
3.4 Traffic density calculating
IV. PROPOSED SOLUTIONS
4.1 Calculate traffic density
4.2 Vehicle counting and categorizing
4.2.1 Vehicle counting
4.2.2 Vehicle classifying and converting
4.3 Calculate intersection area
4.3.1 Distance-based method
4.3.2 Mean-based method
V. EXPERIMENTAL RESULT
5.1 Experiment setup environment
5.2 Experimental system architecture
5.2.1 Data collection module
5.2.2 Training server module
5.2.3 Data analysis
5.2.4 Diagnosis
5.2.4.1 Obtain real area approach
5.2.4.2 Calculate error rate
5.3 Result from training and detecting
5.4 Result from inferring intersection area
5.5 Result from evaluating traffic density
VI. DISCUSSION AND FURTHER RESEARCH
6.1 Achievement
6.2 Limitation of the study
6.3 Recommendations for further research
PUBLICATIONS
REFERENCES
I. INTRODUCTION
1.1 Research problem
Urbanization is understood as the process of urban expansion expressed as a
percentage of the urban area or population over the total area or population of an area
or region. Moreover, urbanization is also considered a huge development process,
improving quality of life, maintaining a balanced population, controlling population
density, etc.
By the end of June 2021, the coverage rate of urban zoning planning compared
to construction land area in urban areas across the country had reached about 53%,
with the 2 special urban areas (Hanoi and Ho Chi Minh City) and the 19 grade I cities
reaching about 80–90%, and urban areas of grades II, III, and IV about 40–50%. The
coverage rate of detailed urban planning was about 39% of the area of
construction land [1]. According to several recent reports, the urbanization of
Vietnam is gradually accelerating, with urban coverage reaching 40% of the
country in 2019 [2]. This process of urbanization brings many
benefits to a country, such as accelerating economic growth, shifting labor and
economic structures, and changing population distribution. Cities are not only big
consumers of goods but also places that create job opportunities and income for
workers [3].
Consequently, rapid population growth in large cities has increased the need for
travel and the number of vehicles, which in turn has worsened traffic congestion.
Traffic jams are a constant nuisance for citizens of urbanized cities because they
make travel uncomfortable. They are also a burden for Vietnam's government:
establishing an effective plan to solve the problem costs a great deal of money and
consideration, yet leaving the problem unaddressed is dangerous, since transport
within Vietnam will be delayed and the economy will suffer [3]. It is estimated that
traffic jams in Ho Chi Minh City, one of the most urbanized cities, can cost the
government budget up to 6 billion USD annually [4], in addition to the cost to
road users of gasoline wasted in traffic jams [5]. Furthermore, the government has
made enormous investments in the installation of closed-circuit television (CCTV)
camera systems, although their full potential has not been realized.
Therefore, the research team set out to develop ways to evaluate traffic that utilize
the capabilities of these cameras. Many kinds of metrics are used to measure the level
of traffic on roadways and in particular areas. This study focuses on traffic
density. Two quantities must be considered when
calculating traffic density: the number of vehicles and the area of the region where the
counting takes place, in this case, the junction area. Several studies and modern
technologies, particularly machine learning, have been devoted to the
vehicle counting problem. On the other hand, calculating the area of an intersection
presents a different level of complexity. Each camera has different attributes and is
positioned at different heights and locations, making the calculation challenging and
requiring significant effort in collecting these parameters. Additionally, for the
convenience of applying the solution in practical settings and across the majority of
intersections, we aim to find a solution to generalize the aforementioned problem,
which means calculating the area without relying on those specific technical
specifications.
To solve the preceding issue, we propose an approach that relies on a
reference object [31]. The reference-object strategy uses the
known size of one object in a scene to estimate the size of another. This study found
that reference objects can have dynamic rather than static attributes. For the
computations, we use road vehicles, which are frequently recognized, as reference objects.
Based on that concept, we obtained several promising results from this study:
- Propose a way to calculate traffic density at intersections.
- Propose a way to count vehicles appropriately.
- Propose solutions for calculating the region’s area and evaluate them in real-world circumstances.
- Develop an experimental system for practical application.
1.2 Objectives of the topic
The objectives of the topic are mentioned as follows:
- Proposing a way to calculate traffic density at intersections.
- Comparing and selecting appropriate machine learning models for
determining the number of vehicles in a given region and categorizing them as
motorbikes, cars, trucks, and so on.
- Proposing methods to convert the intersection area from pixel to square meters
- Developing an experimental system for collecting and processing data.
1.3 Scope of study
The main scientific idea of the project is to research and develop solutions to
estimate traffic density from traffic cameras or videos at main intersections. The data
used in this study is collected directly from the Transportation Facilities’ CCTV
cameras. Ho Chi Minh City is the area used as a research target. The study focuses
on the issue of evaluating traffic density at intersections using images and solving
concerns related to calculating traffic density.
1.4 Scientific and practical significances
This study contributes to traffic density estimation by using an image-based technique,
which has both scientific and practical implications. It provides
insight into the unique traffic patterns seen in Ho Chi Minh City while
also outlining feasible options for strategic planning and traffic management
within the city's transportation infrastructure.
1.4.1 Practical significance
Regarding practical significance, several aspects should be considered:
- Real-time traffic information: the suggested technique might contribute to the
development of real-time traffic information systems. By continually measuring
traffic density at intersections, such a system can give drivers up-to-date
information. This allows them to choose the best route and avoid crowded
areas, increasing traffic efficiency and cutting down on travel times.
- Resource allocation and infrastructure planning: accurate traffic density
estimates can help transportation agencies allocate resources effectively.
By identifying junctions with high traffic density, authorities can prioritize
investments in infrastructure enhancements, such as additional lanes, traffic
signal optimization, or intelligent transportation systems.
- Congestion estimation tool: this research contributes to the practical aspect of
traffic engineering by providing a tool for identifying congestion at
intersections, which helps transportation agencies implement strategies to
reduce traffic congestion in Ho Chi Minh City.
- Intelligent Transportation Systems (ITS) development: the suggested technique
can support the development of ITS by accurately estimating traffic density at
intersections. By incorporating the traffic density data into an ITS framework,
transportation agencies may improve the overall effectiveness and
efficiency of traffic management, including adaptive traffic signal control,
incident detection, and real-time traffic information distribution. This might result
in better traffic flow, reduced congestion, and better overall performance
of the transportation system in Ho Chi Minh City.
1.4.2 Academic significance
Regarding academic significance, this research makes the following contributions:
- Improvement in traffic density estimation: the research makes a scientific
contribution by suggesting a novel image-based approach for calculating traffic
density at intersections. Compared with previous methods, this new strategy has
the potential to increase the accuracy and reliability of traffic density estimation.
- Application of machine learning algorithms: the study uses machine learning
techniques to identify objects in traffic images. This aspect of the
research highlights the use of machine learning methods in the transportation
industry, showing the potential of artificial intelligence to address real traffic
issues.
- Validation and comparative analysis of methods for calculating a region’s area:
the study reviews and contrasts several approaches to determining the
total area of a region. By comparing the proposed strategies with existing ones,
it advances our understanding of the advantages and disadvantages of the
various methods. This comparative analysis may inform how researchers and
practitioners select and use suitable traffic density estimation methodologies
in various urban settings.
- Application to urban areas: although the research focuses on estimating traffic
density in Ho Chi Minh City, it has the potential to be generalized to other
cities that are experiencing comparable traffic problems.
By addressing the unique traffic circumstances and complexity of Ho Chi Minh
City, the research offers insights and approaches that may be used elsewhere.
This broadens the scientific relevance beyond a particular
city and helps in the development of practical traffic management plans
for different urban settings.
II. THEORETICAL BASIS
In this chapter, we aim to present the theoretical foundations that our research
team has relied on as a basis for developing corresponding solutions to serve the
study. We will utilize appropriate theories, alongside evaluating and considering
theories that are not suitable for the context of the problem, in Chapter III.
2.1 Definition of traffic density
Traffic density is the number of vehicles occupying a given length of a traffic
lane on a highway. It is measured in vehicles per mile, vehicles per kilometer, or
vehicles per meter. For example, if four automobiles occupy 500 feet of a lane,
the traffic density is 4 × 5280 / 500 = 42.24 automobiles per mile. Speed and traffic
volume are inversely related to density: if the density is lower, the speed and traffic
volume will be higher, and if the density is higher, the speed and traffic volume will
be lower. When traffic congestion occurs at a certain spot, we may consider
expanding the road, building a flyover, or installing an underpass based on the
peak-hour traffic flow [18].
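The unit conversion in the example above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `linear_density` and its parameters are hypothetical and are not part of the thesis's system.

```python
FEET_PER_MILE = 5280.0


def linear_density(vehicle_count, segment_length, unit_length):
    """Return vehicles per unit length (e.g. vehicles per mile).

    segment_length and unit_length must be expressed in the same unit
    (here: feet). Hypothetical helper, used only for illustration.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return vehicle_count * unit_length / segment_length


# Four automobiles observed on a 500-foot stretch of a lane.
density_per_mile = linear_density(4, 500.0, FEET_PER_MILE)
print(round(density_per_mile, 2))  # 42.24
```

The same helper gives vehicles per kilometer or per meter if the segment length and the unit length are both given in meters.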
2.2 Definition of level of service (LOS)
According to Wikipedia and the Oxford dictionary, level of service (LOS) is a
qualitative metric used to assess the quality of motor vehicle traffic service.
LOS is used to analyze roads and intersections by classifying traffic flow and
assigning quality levels based on performance measures such as vehicle speed,
density, and congestion. More generally, all services in the asset
management sector may fall under the umbrella of levels of service [7].
In general, the standards used to assess the level of traffic congestion vary
between nations and cities. However, they are all based on the same primary factors,
namely average speed and traffic flow on the road, along with secondary factors such
as waiting times at nodes, service levels, state persistence times, and waiting line
length. In Vietnam, there are so far no specific criteria for determining traffic
congestion based on the basic parameters of traffic flow (velocity, volume, density,
etc.) [8], [9].
2.3 Definition of computer vision - machine learning
Computer vision is the technique of using computers to understand digital
images and videos. It aims to automate tasks that human vision can perform.
It uses techniques for acquiring, processing, and understanding digital images, and
for extracting data from the real world, to produce information. Its
subfields include object recognition, video tracking, and motion prediction, which
make it useful for navigation, object visualization, and other applications.
Machine learning, a branch of artificial intelligence, is the study of
algorithms and statistical models that let systems carry out a task by relying on
patterns and inference rather than explicit instructions. It therefore applies to
pattern recognition, software engineering, and computer vision. Computers perform
machine learning with only a little assistance from software programmers. Data is
used to make decisions, and it may be used in a variety of ways across fields.
Learning can be divided into three categories: supervised learning, semi-supervised
learning, and unsupervised learning.
Computer vision and machine learning are two closely related fields. Machine
learning has improved computer vision for tracking and recognition by providing
efficient acquisition, image processing, and object focus techniques. Computer
vision has in turn broadened the application of machine learning. A computer vision
system comprises a digital image or video, a sensing device, an interpreting device,
and the interpretation stage; machine learning is used in the interpreting device and
the interpretation stage [12].
2.3.1 Haar-cascade
This novel technique was initially published by Paul Viola and Michael Jones
in their 2001 study, Rapid Object Detection Using a Boosted Cascade of Simple
Features [11], and has since become one of the most cited papers in computer vision
research. This technology allowed for real-time object recognition in video feeds.
Viola and Jones concentrate on identifying faces in pictures. However, the framework
can be used to develop detectors for any "things," including bananas, cooking
utensils, buildings, automobiles, and structures.
In general, the Haar cascade is a sliding-window technique that computes
features in each window and decides whether the window may contain an object
[11], [12]. Although the Viola-Jones framework undoubtedly paved the way for
object detection, subsequent approaches, such as deep learning and histogram of
oriented gradients (HOG) + linear SVM, have far surpassed it.
The Haar cascade algorithm is divided into four stages: computing Haar
features, creating integral images, training with Adaboost, and implementing
cascading classifiers. However, it is important to recall that this approach, like
other machine learning models, requires a large number of positive images (containing
the object) and negative images (not containing it) to train the classifier.
2.3.1.1 Calculating Haar Features
Computing Haar features is the initial phase, as shown in Fig. 1. Put
simply, a Haar feature is a calculation performed on adjacent rectangular regions
at a specific position in a detection window. The calculation sums the pixel
intensities in each region and takes the difference between the sums. Computing
these features over a large image can be expensive, however. Integral
images come into play here because they reduce the number of operations required.
Fig. 1. Some examples of Haar feature types [11].
2.3.1.2 Creating Integral Images
According to Paul Viola and Michael Jones [11], rectangle features can be computed
quickly using an intermediate image representation known as the integral image:
$$ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y') \tag{1}$$
$$s(x, y) = s(x, y - 1) + i(x, y) \tag{2}$$
$$ii(x, y) = ii(x - 1, y) + s(x, y) \tag{3}$$
where ii(x, y) is the integral image, i(x, y) is the original image, and s(x, y) is the
cumulative row sum, with s(x, -1) = 0 and ii(-1, y) = 0.
The integral image can be computed in a single pass over the original image.
Using the integral image, any rectangular sum can be computed with four array
references. The difference between two rectangular sums can therefore be computed
with eight references. Because the two-rectangle features described above involve
adjacent rectangular sums, they can be computed with six array references,
three-rectangle features with eight, and four-rectangle features with nine.
Fig. 2. The value of the integral image at position 1 is the sum of the pixels in
rectangle A. The sum within D can be computed as 4 + 1 - (2 + 3) [11].
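To make the four-reference lookup concrete, the following is a minimal pure-Python sketch of the integral image, a rectangle sum, and a two-rectangle Haar feature. The function names and the zero-padding convention are assumptions made for illustration; they are not taken from [11] or from the thesis's implementation.

```python
def integral_image(img):
    """Build the integral image of a 2D list in a single pass.

    The result has one extra row and column of zeros, so that
    ii[y + 1][x + 1] is the sum of img[y'][x'] for x' <= x, y' <= y.
    The padding plays the role of the boundary conditions
    s(x, -1) = 0 and ii(-1, y) = 0.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    s = [0] * w  # s[x] accumulates i(x, y) over y, as in Eq. (2)
    for y in range(h):
        for x in range(w):
            s[x] += img[y][x]
            ii[y + 1][x + 1] = ii[y + 1][x] + s[x]  # Eq. (3)
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle with top-left pixel (x, y).

    Uses four array references: 4 + 1 - (2 + 3) in the notation of Fig. 2.
    """
    return ii[y + h][x + w] + ii[y][x] - ii[y][x + w] - ii[y + h][x]


def two_rect_feature(ii, x, y, w, h):
    """Left-half sum minus right-half sum of a window (w must be even)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


img = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))          # 6 + 7 + 10 + 11 = 34
print(two_rect_feature(ii, 0, 0, 4, 3))  # 33 - 45 = -12
```

For clarity, this sketch evaluates the two rectangles independently (eight references); an optimized version would share the two corner references on the common edge to reach the six references mentioned above.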
2.3.1.3 Adaboost Training
In practice, Adaboost chooses the best features and trains the classifiers that
use them. It combines many "weak classifiers" to create a "strong classifier" that
the algorithm can use to detect objects.
Weak learners are produced by moving a window across the input image and
computing the Haar features for each region of the image. Each feature value is
compared to a learned threshold that separates objects from non-objects.
Because these are "weak classifiers," a large number of Haar features is needed to
build an accurate strong classifier. The last phase, which employs cascade
classifiers, combines these weak learners into a strong learner.
Fig. 3. Representation of a boosting algorithm [11].
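The idea of combining thresholded features into a weighted vote can be sketched as follows. This is a generic discrete AdaBoost with decision stumps on toy data, given only as an illustration; it is not the exact training procedure of [11], and all names and sample values are hypothetical.

```python
import math


def stump_predict(sample, index, threshold, polarity):
    """Weak classifier: compare one feature value to a threshold."""
    return polarity if sample[index] >= threshold else -polarity


def train_adaboost(samples, labels, rounds):
    """Return a list of (alpha, index, threshold, polarity) stumps.

    samples: list of feature-value lists; labels: +1 (object) or -1.
    """
    n = len(samples)
    weights = [1.0 / n] * n
    stumps = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for index in range(len(samples[0])):
            for threshold in sorted({s[index] for s in samples}):
                for polarity in (1, -1):
                    err = sum(
                        w for s, y, w in zip(samples, labels, weights)
                        if stump_predict(s, index, threshold, polarity) != y
                    )
                    if best is None or err < best[0]:
                        best = (err, index, threshold, polarity)
        err, index, threshold, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        stumps.append((alpha, index, threshold, polarity))
        # Increase the weight of misclassified samples, then normalize.
        weights = [
            w * math.exp(-alpha * y * stump_predict(s, index, threshold, polarity))
            for s, y, w in zip(samples, labels, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return stumps


def strong_predict(stumps, sample):
    """Strong classifier: sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(sample, i, t, p) for a, i, t, p in stumps)
    return 1 if score >= 0 else -1


# Toy data: feature 0 is informative, feature 1 is noise.
samples = [[0.1, 5.0], [0.4, 4.0], [0.35, 1.0], [0.8, 0.5], [0.9, 3.0], [0.7, 2.0]]
labels = [-1, -1, -1, 1, 1, 1]
stumps = train_adaboost(samples, labels, rounds=3)
predictions = [strong_predict(stumps, s) for s in samples]
print(predictions == labels)  # True
```

In a Haar cascade, each feature value would be a Haar feature computed on a window, so the selection step above also acts as feature selection.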
2.3.1.4 Implementing Cascading Classifiers
The cascade classifier consists of several stages, each of which is made up of
a group of weak learners. Boosting is used to train the weak learners, so that a
highly accurate classifier can be built from the mean prediction of all weak
learners.
Based on this prediction, the classifier either indicates that an object was
identified (positive) or moves on to the next region (negative).
Because the majority of windows do not contain anything of interest, the stages are
designed to reject negative samples as quickly as possible. Classifying an object as
a non-object would significantly hamper the object recognition method; hence, it is
crucial to keep the false-negative rate low.
Before using the Haar cascade, it is important to keep in mind that the model must
be trained with proper hyperparameters.
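The early-rejection behaviour of a cascade can be illustrated with a short sketch. The stage scoring functions and thresholds below are made-up toy values, chosen only to show the control flow; a real cascade would use boosted Haar-feature classifiers at each stage.

```python
def cascade_classify(window_features, stages):
    """Run a window through the stages in order.

    stages: list of (score_function, threshold) pairs.
    Returns (is_object, stages_evaluated). A window is rejected as soon
    as one stage scores below its threshold, so most negative windows
    cost only one or two stage evaluations.
    """
    for evaluated, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(window_features) < threshold:
            return False, evaluated
    return True, len(stages)


# Toy stages of increasing cost (hypothetical scores and thresholds).
stages = [
    (lambda f: f[0], 0.5),
    (lambda f: f[0] + f[1], 1.2),
    (lambda f: f[0] + f[1] + f[2], 2.0),
]

print(cascade_classify([0.1, 0.9, 0.9], stages))  # (False, 1)
print(cascade_classify([0.6, 0.2, 0.9], stages))  # (False, 2)
print(cascade_classify([0.9, 0.8, 0.7], stages))  # (True, 3)
```

Lowering a stage threshold lets more windows through that stage, which reduces false negatives at the price of more work in later stages.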
2.3.2 Convolutional neural network models
Convolutional neural networks (CNNs) are a type of artificial neural network that
has come to dominate various computer vision tasks and has drawn interest across a
range of domains. According to [13], their design is inspired by the organization
of the animal visual cortex. A CNN uses building blocks such as convolution
layers, pooling layers, and fully connected layers to learn spatial hierarchies of
features automatically and adaptively through backpropagation. The first two,
convolution and pooling layers, extract features, while the third, the fully
connected layer, maps the extracted features to the final output, such as a
classification or detection result. The convolution layer is the key component of
a CNN; it is made up of a stack of mathematical operations such as convolution, a
specialized type of linear operation.
As stated in [13], pixel values in digital pictures are stored in a
two-dimensional (2D) grid, i.e., an array of numbers, and a small grid of
parameters called the kernel, an optimizable feature extractor, is applied at each
image position. Because a feature may occur anywhere in the picture, this makes
CNNs particularly efficient for image processing. Extracted features can evolve
hierarchically and become progressively more complex as one layer feeds its output
into the next layer. Training is the process of adjusting parameters such as
kernels to reduce the difference between outputs and ground truth labels using
optimization algorithms such as backpropagation and gradient descent.
Fig. 4. An overview of a convolutional neural network (CNN) architecture and the
training process [13].
2.3.2.1 Convolution layer
A convolutional layer performs a convolution operation on its input and passes
the result to the following layer. A convolution combines all of the pixels in its
receptive field into a single value. For example, applying a convolution without
padding to a picture reduces the image size while combining all of the information
in each receptive field into a single pixel. The output of a convolutional layer is
a feature map. Different types of convolutions can be used depending on the
problem to be solved and the features to be learned [14].
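A minimal sketch of this operation on a single-channel image is given below. It assumes no padding and a stride of 1, and, like most deep learning libraries, it slides the kernel without flipping it (cross-correlation). The function name `conv2d` and the edge-detecting kernel are chosen for illustration only.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Each output value combines every pixel of one receptive field.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    output = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            row.append(sum(image[y + i][x + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        output.append(row)
    return output


# 4x4 image with a vertical edge between the dark and bright halves.
image = [[0, 0, 1, 1]] * 4
# Hand-picked 2x2 kernel that responds to dark-to-bright transitions.
kernel = [[-1, 1],
          [-1, 1]]
feature_map = conv2d(image, kernel)   # 3x3 map, strongest in the middle column
```

The 4x4 input shrinks to a 3x3 feature map, and the response is non-zero only where the receptive field covers the edge. In a trained CNN the kernel values are learned rather than hand-picked.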
2.3.2.2 Pooling layer
According to [14], a pooling layer performs a typical down-sampling operation
that reduces the in-plane dimensionality of the feature maps, which introduces
translation invariance to small shifts and distortions and reduces the number of
subsequent learnable parameters. It is worth noting that pooling layers have no
learnable parameters, although filter size, stride, and padding are
hyperparameters of pooling operations, as they are of convolution operations.
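As an illustration, the sketch below implements max pooling, one common form of pooling; the choice of max pooling, the function name `max_pool2d`, and the default 2x2 window with stride 2 are assumptions made for this example.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window (no padding).

    There is nothing to learn here: size and stride are hyperparameters.
    """
    h, w = len(feature_map), len(feature_map[0])
    return [[max(feature_map[y + i][x + j]
                 for i in range(size) for j in range(size))
             for x in range(0, w - size + 1, stride)]
            for y in range(0, h - size + 1, stride)]


feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 0],
               [7, 2, 9, 8],
               [0, 1, 3, 4]]
pooled = max_pool2d(feature_map)   # 4x4 input becomes a 2x2 output
```

Moving a strong activation by one pixel inside its window leaves the pooled output unchanged, which is the small-shift invariance described above.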
2.3.2.3 Fully connected layer
The output feature maps of the final convolution or pooling layer are typically
flattened, that is, turned into a one-dimensional (1D) array of numbers (a
vector). They are then connected to one or more fully connected layers, also
called dense layers, in which every input is linked to every output by a learnable
weight [14]. Once the features have been extracted by the convolution layers and
down-sampled by the pooling layers, they are mapped by the fully connected layers
to the network's final outputs, such as the probabilities for each class in
classification tasks. The number of output nodes in the final fully connected
layer is normally equal to the number of classes. As previously explained, each
fully connected layer is followed by a nonlinear function such as ReLU.
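The flatten and dense steps can be sketched as below. The weights are arbitrary illustrative numbers rather than trained values, and the softmax used to turn the final scores into class probabilities is an assumption of this sketch; the text above only states that the outputs are class probabilities.

```python
import math


def flatten(feature_maps):
    """Turn a stack of 2D feature maps into one 1D vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def dense(inputs, weights, biases):
    """Fully connected layer: every input is linked to every output by a weight."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    """Assumed output activation: maps scores to probabilities that sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


feature_maps = [[[1.0, 2.0],
                 [3.0, 4.0]]]                      # one 2x2 feature map
x = flatten(feature_maps)                          # [1.0, 2.0, 3.0, 4.0]
hidden = relu(dense(x, [[0.5, -0.5, 0.5, -0.5],
                        [0.1, 0.1, 0.1, 0.1]], [0.0, 0.0]))
# Final layer: three output nodes for a hypothetical three-class problem.
scores = dense(hidden, [[1.0, 0.5], [0.2, -0.3], [-1.0, 2.0]], [0.0, 0.0, 0.0])
probabilities = softmax(scores)
```

The final layer has one output node per class, so `probabilities` has three entries here.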
2.3.3 YOLO
YOLO is short for You Only Look Once and was first described in the seminal 2015
paper by Joseph Redmon et al. [15]. The method detects and localizes multiple
objects in images in real time. YOLO treats object detection as a regression
problem and outputs the class probabilities of the detected objects.
To detect objects in real time, YOLO uses a convolutional neural network (CNN).
As the name implies, the approach requires just one forward propagation through
the network to identify objects. This means that predictions for the entire image
are made in a single run of the algorithm. The CNN predicts several class
probabilities and bounding boxes at the same time. YOLO exists in several
versions, such as YOLO v1, YOLO v2 (YOLO9000) in 2016, YOLO v3 in 2018, and
YOLO v4 in 2020. YOLO is now well established for object detection tasks, which
makes it a natural fit for this thesis. The reasons are as follows:
- Speed: The method predicts objects in real time, which makes detection fast.
- High accuracy: YOLO produces accurate results with few background errors.
- Learning capabilities: The algorithm has excellent learning capabilities that
allow it to pick up on object representations and use them to its advantage
when detecting objects.
In general, this approach includes three techniques: residual blocks, bounding
box regression, and intersection over union (IOU).
2.3.3.1 Residual blocks
The picture is first divided into a grid of equally sized square cells. Figure 5
below depicts how an input image is divided into such a grid. Each grid cell
detects the objects that appear within it. For example, if the center of an object
falls within a certain grid cell, that cell is in charge of detecting it.
Fig. 5. An image is divided into grids. Green grid cells detect the pedestrian [40].
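The rule that assigns an object to a grid cell can be sketched as follows. The function name `responsible_cell`, the 448 x 448 image size, and the 7 x 7 grid are illustrative assumptions for this example.

```python
def responsible_cell(center_x, center_y, image_w, image_h, grid_size):
    """Return (row, col) of the grid cell that contains the object center."""
    col = min(int(center_x / image_w * grid_size), grid_size - 1)
    row = min(int(center_y / image_h * grid_size), grid_size - 1)
    return row, col


# An object centered at pixel (300, 100) in a 448 x 448 image with a 7 x 7 grid.
cell = responsible_cell(300, 100, 448, 448, 7)
```

The `min` call keeps a center that lies exactly on the right or bottom image border inside the last cell.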
2.3.3.2 Bounding box regression
A bounding box is an outline that highlights an object in a picture. Every
bounding box predicted by YOLO has the following properties: width (bw); height
(bh); class (person, automobile, traffic light, etc.), denoted by the letter c;
and bounding box center (bx, by). YOLO predicts the height, width, center, and
class of objects using a single bounding box regression. Figure 6 below depicts an
example of a bounding box, which is drawn with a yellow outline.
Fig. 6. The information provided by a bounding box [40].
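The bounding box properties listed above map naturally onto a small record type. The sketch below is illustrative only: the class name `BoundingBox` and the `corners` helper are assumptions, and the corner form it returns is the [x1, y1, x2, y2] layout commonly used when comparing boxes.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    bx: float   # x coordinate of the box center
    by: float   # y coordinate of the box center
    bw: float   # box width
    bh: float   # box height
    c: str      # predicted class, e.g. "person"

    def corners(self):
        """Convert center/size form to (x1, y1, x2, y2) corner form."""
        return (self.bx - self.bw / 2, self.by - self.bh / 2,
                self.bx + self.bw / 2, self.by + self.bh / 2)


box = BoundingBox(bx=100, by=80, bw=40, bh=60, c="person")
x1, y1, x2, y2 = box.corners()
```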
2.3.3.3 Intersection over union (IOU)
Intersection over union (IOU) is a measure used in object detection that
describes how much two boxes overlap. YOLO uses IOU to produce an output box that
tightly encloses each object. Each grid cell predicts bounding boxes and their
confidence scores.
IOU is mostly used in object detection applications, where a model is trained to
produce a box that precisely encloses an object. In YOLO implementations, IOU is
also used in non-maximum suppression, which removes boxes that are positioned
close to the same object, keeping the box with the higher confidence [16].
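Non-maximum suppression can be sketched as a greedy loop over boxes sorted by confidence. This is a minimal sketch and not the exact routine of any particular YOLO implementation; boxes are assumed to be [x1, y1, x2, y2] corner lists, and the 0.5 overlap threshold is an illustrative default.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest confidence first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

The first two boxes cover nearly the same region, so only the more confident one survives, while the third box, which overlaps neither, is kept.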
The concept behind the IOU calculation is illustrated in Fig. 7. Let us assume
that box 1 is represented by [x1, y1, x2, y2], and box 2 is represented by
[x3, y3, x4, y4]. Following Fig. 8, IOU is calculated as in Eq. (4):
$$\mathrm{IOU} = \frac{\text{Area of Intersection of two boxes}}{\text{Area of Union of two boxes}} \tag{4}$$
Fig. 7. Given two boxes, the dark area is the intersection area [16].
Fig. 8. The IOU equation presented in picture form [16].
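In terms of the box coordinates, the two areas can be written out explicitly. The expressions below are a sketch that assumes (x1, y1) and (x3, y3) are the top-left corners and (x2, y2) and (x4, y4) the bottom-right corners of the two boxes; the symbols x_l, x_r, y_t, y_b, A_I, and A_U are introduced here only for illustration.

```latex
\begin{align*}
x_l &= \max(x_1, x_3), & y_t &= \max(y_1, y_3), \\
x_r &= \min(x_2, x_4), & y_b &= \min(y_2, y_4), \\
A_I &= \max(0,\, x_r - x_l) \cdot \max(0,\, y_b - y_t), \\
A_U &= (x_2 - x_1)(y_2 - y_1) + (x_4 - x_3)(y_4 - y_3) - A_I, \\
\mathrm{IOU} &= \frac{A_I}{A_U}.
\end{align*}
```

For example, for box 1 = [0, 0, 2, 2] and box 2 = [1, 1, 3, 3], the intersection is a unit square, so A_I = 1, A_U = 4 + 4 - 1 = 7, and IOU = 1/7, which is approximately 0.14. The max(0, ...) terms make IOU equal to zero when the boxes do not overlap.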
2.3 Definition of pixel per meter
According to Techopedia, a pixel is the smallest unit of a digital image or
graphic that can be displayed and represented on a digital display device. Pixels
are combined to form a complete image, video, text, or anything else visible on a
computer monitor. A pixel is also known as a picture element (pix = picture,
el = element), and it can be used as a countable unit in an image [19].