H.C.M CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF INFORMATION TECHNOLOGY
MASTER’S THESIS
SUPERVISORS: Prof. Dr. MARC BUI, Prof. Dr. CAO HOANG TRU
STUDENT: HO XUAN NGUYEN
CLASS: Master 2005
STUDENT ID: 00705147
H.C.M City
July 2007
ABSTRACT
Text is a powerful index for content-based image and video indexing, whether as
inscriptions in stele images or as captions in video frames. Although many
systems and algorithms have been proposed for localizing and extracting text,
they are not well suited to our specific problem (text in stele images) because
they are tuned to other particular databases.
Therefore, to recover this information, we propose an automatic system that
detects and localizes text and extracts it into characters associated with
metadata, with good performance.
Our system takes a grayscale image as input, produces HanNom characters as
output, and consists of four stages. First, a noise reduction stage based on a
morphological operator enhances the input image. Next, a text detection and
localization stage (coarse-to-fine text detection) is applied using a discrete
wavelet transform (Haar DWT), median filtering, and thresholding techniques. A
combination of connected component analysis and morphological operators is then
used for fine detection. After non-text components are eliminated, a
density-based region growing algorithm and a splitting-line framework are
developed to collect all single text-like lines.
In the next stage, text line verification, we propose a two-step algorithm to
remove false text lines: a thresholding verification step and a neural
network-based verification step. Finally, a character segmentation stage applies
projection profile-based segmentation and, where touching or kerned characters
remain, a genetic algorithm to find the best non-linear segmentation path.
ACKNOWLEDGEMENTS
Most of all I would like to express my greatest appreciation to my
supervisor, Prof. Dr. Cao Hoang Tru, for his supervision and kind help
during my study. I would like to thank him for guiding me into this research
area and sharing with me many insightful experiences on doing good research,
and especially for reading through the thesis and suggesting corrections.
I would like to thank Prof. Dr. Duong Nguyen Vu for his support and
evaluation, and Prof. Dr. Marc Bui for the useful discussions on a variety of
image processing and other topics. I also would like to thank Prof. Dr.
Christian Wolf for his invaluable help with many technical problems.
I am also grateful to Mr. Tran Giang Son, with whom I first had the
opportunity to become acquainted with the HanNom character recognition area.
Sincere thanks also go to all my friends and colleagues for helping me with
further experiments on the large databases.
Finally, I would like to thank all my family for passing to me their passion
for learning and research: thank you for your encouragement through all my
life, for your love and unconditional support in all my doings.
Characters Extraction for HanNom Stele Images
Student: Ho Xuan Nguyen
TABLE OF CONTENTS
List of Figures......................................................................................................... III
List of Tables ............................................................................................................V
Chapter 1: Problem Definition ................................................................................1
1.1 Motivation..................................................................................................1
1.2 Problem Area .............................................................................................1
1.3 Objective and Scope ..................................................................................3
1.4 Contributions..............................................................................................4
Chapter 2: Literature Review..................................................................................5
2.1 Image Enhancement...................................................................................5
2.2 Character Areas Detection and Localization .............................................6
2.2.1 Region-based Methods .....................................................................6
2.2.2 Texture-based Methods ..................................................................10
2.2.3 Text Extraction in Compressed Domain.........................................13
2.2.4 Other Approaches ...........................................................................13
2.3 Character Areas Verification ...................................................................14
2.4 Character Extraction ................................................................................14
Chapter 3: Related Background Knowledge........................................................17
3.1 Discrete Wavelet Transforms ..................................................................17
3.2 Artificial Neural Networks ......................................................................18
3.3 Genetic Algorithms ..................................................................................19
3.4 The Basic Image Processing Techniques.................................................20
3.4.1 Morphological Operators for Binary Image ...................................20
3.4.2 Projection Profile Analysis .............................................................22
Chapter 4: Text Detection and Character Extraction System ...........................23
4.1 Image Enhancement Stage .......................................................................23
4.2 Text Detection and Localization Stage ....................................................24
4.2.1 Coarse Detection.............................................................................24
4.2.2 Fine Detection.................................................................................28
4.3 Text Line Verification Stage....................................................................36
4.3.1 Text Line Verification by Thresholding Technique .......................36
4.3.2 Text Line Verification by Neural Networks ...................................39
4.4 Character Extraction Stage ......................................................................42
4.4.1 Merging Regions Procedure ...........................................................44
4.4.2 Searching a Non-Linear Segmentation Path...................................44
Chapter 5: Experiments and Comparisons ..........................................................48
5.1 Text Localization Evaluation ...................................................................48
5.1.1 Evaluation and Comparison on Stele Image Database ...................50
5.1.2 Evaluation and Comparison on Video Frame Database .................52
5.2 Character Extraction Evaluation ..............................................................54
5.2.1 Evaluation and Comparison on Stele Image Database ...................54
5.2.2 Evaluation and Comparison on Video Frame Database .................55
5.3 Full Evaluation of the Proposed System..................................................56
Chapter 6: Conclusions and Future Works..........................................................57
6.1 Conclusions..............................................................................................57
6.2 Future Works ...........................................................................................57
Chapter 7: References ............................................................................................59

List of Figures
Fig 1.1 Fundamental concepts in image processing ..............................................................1
Fig 1.2 The outline of our program........................................................................................3
Fig 1.3 An illustrated example for character segmentation ...................................................3
Fig 2.1 A system of character extraction ...............................................................................5
Fig 2.2 An approach of Clark and Mirmehdi [13]...............................................................13
Fig 2.3 The basic segmentation in Hong’s approach [25] ...................................................15
Fig 2.4 The fine segmentation in Hong’s approach [25] .....................................................16
Fig 3.1 An example of DWT tree ........................................................................................17
Fig 3.2 The DWT for image decomposition........................................................................18
Fig 3.3 An artificial neural network.....................................................................................19
Fig 3.4 The genetic algorithm [15] ......................................................................................20
Fig 3.5 An example for applying erosion operator [16] ......................................................21
Fig 3.6 An example for applying dilation operator [16]......................................................21
Fig 3.7 An example for projection profile analysis [67]......................................................22
Fig 4.1 The flow chart of our system for character extraction ............................................23
Fig 4.2 An example for applying our image enhancement..................................................24
Fig 4.3 The 2-D Haar DWT for an image [37]....................................................................25
Fig 4.4 The wavelet energy image (a combination of LH, HH and HL subbands).............26
Fig 4.5 The thresholding image after applying median filtering and thresholding techniques on the wavelet energy image ...............................................................28
Fig 4.6 The flow chart for applying CCA and opening operator to remove non-text regions .............................................................................................................29
Fig 4.7 The result image after applying CCA and opening morphological operator ..........30
Fig 4.8 The density-based region growing algorithm..........................................................30
Fig 4.9 The text regions found by applying the density-based region growing algorithm..31
Fig 4.10 The proposed splitting line algorithm ...................................................................32
Fig 4.11 An illustrated example for the splitting line framework .......................................34
Fig 4.12 The output image of the text detection and localization stage ..............................35
Fig 4.13 The output image for the text line verification stage.............................................42
Fig 4.14 The proposed character extraction algorithm ........................................................43
Fig 4.15 The SPZ of a non-linear segmentation path ..........................................................44
Fig 4.16 The cross operator for the genetic algorithm.........................................................46
Fig 4.17 The mutation operator for the genetic algorithm...................................................46
Fig 4.18 The extracted characters (final results) for our system .........................................47
Fig 5.1 The result images of the text localization in different algorithms (sample image #1 in the stele image database) .................................................................51
Fig 5.2 The result images of the text localization in different algorithms (sample image #2 in the stele image database) .................................................................51
Fig 5.3 The result images of the text localization in different algorithms (sample image #1 in the video frame database) ................................................................53
Fig 5.4 The result images of the text localization in different algorithms (sample image #2 in the video frame database) ................................................................53
Fig 5.5 The results of the character segmentation in different algorithms ..........................55
Fig 5.6 The result images of the proposed system...............................................................56

List of Tables
Table 4.1 Choosing thresholds for the coarse detection ......................................................27
Table 4.2 The thresholds for CCA to remove noise ............................................................29
Table 4.3 The thresholds for the density-based growing algorithm ....................................31
Table 4.4 The thresholds for the FilterSmallRegion procedure...........................................33
Table 4.5 The thresholds for the splitting line framework ..................................................33
Table 4.6 The thresholds for the FilterLargeRegion procedure...........................................35
Table 4.7 Sample text lines and number of pixels in different kinds of directions .............38
Table 4.8 Sample text lines and the WWGVD values in different kinds of directions .......38
Table 4.9 The thresholds for the fill factor and the non-direction factor of text line ..........39
Table 4.10 The number patterns and images for training neural network...........................41
Table 5.1 Experimental results on the text localization stage for the stele image database 50
Table 5.2 Experimental results on the text localization stage for the video frame database 52
Table 5.3 Experimental results on the character segmentation stage for the stele image database..........................................................................................54
Table 5.4 Experimental results on the character segmentation stage for the video frame database.........................................................................................55
Table 5.5 Experimental results on the proposed system for the two databases...................56
Chapter 1: Problem Definition
1.1 Motivation
The Vietnamese people are accustomed to saying: “one remembers the source from which
one drinks the water” (“Uống nước nhớ nguồn”) [72]. One expression of this tradition is
the erection of steles inscribed with the names, birth dates, and birth places of doctors
and other excellent graduates who took part in examinations since 1442. At present many
such steles, written in HanNom characters, still stand in the premises of Van Mieu and
elsewhere, and some of them have eroded seriously over the years. This motivates a plan
for preserving the information (inscriptions) on these stele images, as expressed by two
verses of the following popular song (“ca dao”) [72]:
The stele of stone erodes after a hundred years
The words of people continue to remain in force after a thousand years
(Trăm năm bia đá thì mòn
Ngàn năm bia miệng vẫn còn trơ trơ)
To advance this plan, character extraction must be applied to all stele images in
order to obtain the characters before character recognition and storage can begin. For
this reason, we propose the research work presented here - “Development of an image
segmentation software dedicated to ancient inscriptions of VietNam” - that is, to build
an image segmentation algorithm for HanNom stele images. This research has an immediate
and motivating application in a philology research framework which aims at providing a
modern research tool, using up-to-date information and communication technologies, for
the historical knowledge of VietNam.
1.2 Problem Area
The problem of extracting character information from visual clues has attracted wide
attention for many years. It specifically involves image processing techniques,
especially image segmentation. To survey image processing in general and image
segmentation in particular, we briefly summarize the fundamental concepts of this
area [21].
Fig 1.1 Fundamental concepts in image processing
A knowledge base contains information about a problem domain and is built into an image
processing system. This knowledge may be simple, such as detailing the regions of an
image where the information of interest is known to be located, so that the search can
be limited and performed quickly. In other cases, however, it becomes quite complicated,
for example a database of high-resolution satellite images of a region used in
change-detection applications [21].
Image acquisition is the first process in an image processing system, as shown
in Fig 1.1. Generally, the image acquisition stage involves pre-processing, such as
scaling. In some cases, this process is as simple as being given an image that is
already in digital form.
Image enhancement is among the simplest processes in image processing, but an appealing
and important area. Usually, this stage brings out detail and makes the problem clearer.
Sometimes it merely highlights some features or properties of an image, such as by
contrast or brightness modification, to produce a good input for the next processing
stages.
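As a minimal sketch of such a brightness/contrast modification, a linear point operation can be applied to each pixel; the gain and offset values below are illustrative defaults, not parameters from this thesis:

```python
import numpy as np

def adjust_contrast_brightness(image, alpha=1.5, beta=20):
    """Linear point operation g(x, y) = alpha * f(x, y) + beta.

    alpha > 1 stretches contrast and beta shifts brightness; results
    are clipped back to the valid 8-bit range.
    """
    out = alpha * image.astype(np.float64) + beta
    return np.clip(out, 0, 255).astype(np.uint8)

# A tiny 2x2 grayscale "image"; the brightest pixel saturates at 255.
img = np.array([[10, 100], [150, 250]], dtype=np.uint8)
print(adjust_contrast_brightness(img))
```

The clipping step matters in practice: without it, bright pixels would wrap around when cast back to 8 bits.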
Image segmentation is considered the most important stage and a difficult task in image
processing. It has been widely used to analyze and examine the content of an image in
order to extract meaningful knowledge, and is also applied in image classification,
where the aim is to partition an image into a set of disjoint regions whose
characteristics (intensity, color, and so on) are similar. Given its important role, it
can be said that “the more accurate the segmentation, the more likely recognition is to
succeed”.
After the segmentation stage, a representation and description stage almost always
follows. Usually, the data at this stage consist of raw pixels, constituting either the
boundary of a region or all the pixels in the region itself. In either case, a data
conversion is needed to transform the original data into a form suitable for computer
processing. The choice of data representation is the first decision to make and is based
on the following properties [21]:
• Boundary representation is convenient for problems that relate to external shape
characteristics, such as corners and inflections.
• For internal properties (texture, skeletal shape), regional representation is the
suitable choice.
Besides this selection, a specific method must also be chosen for the data description
so that the features of interest are highlighted.
Recognition can be seen as the last step in the image processing area; its aim is to
assign a label to an object, such as “character” or “face”, which are among the most
interesting subjects at present.
After this overview of image processing techniques, we focus on the image segmentation
technique, where our research lies. To achieve good results, we also have to examine
several related techniques, such as image enhancement, which are described in more
detail in the following sections.
1.3 Objective and Scope
As presented above, the objectives of this research can be divided into two major parts:
• Survey the state of the art of image segmentation techniques.
• Propose an innovative algorithm and compare it with other techniques. From this
algorithm, we will develop software which automatically segments a huge collection (a
corpus) of stele images into a collection of extracted HanNom character images,
associated with metadata.
We will build a program based on our algorithm which takes input and produces output as
follows:
Fig 1.2 The outline of our program
The desired results are demonstrated as follows:
Fig 1.3 An illustrated example for character segmentation
We now describe the scope of our problem - an important part of the research.
Considering the content of these stele images, we divide them into two categories:
unlimited-area and limited-area images. An unlimited-area image contains part of a stele
(not necessarily the whole) together with other objects, whereas a limited-area image
contains only a stele and nothing else. In this research we work on unlimited-area
images unless explicitly stated otherwise.
Besides image content, another property has to be pre-defined correctly: whether the
text planes (text regions) in an image are fronto-parallel to the camera, where
“fronto-parallel” can only be understood in a relative way. In this research we are
concerned with images having a fronto-parallel view.
Finally, some images may contain characters of a variety of fonts, sizes, and
orientations, but because research time is limited we focus on images whose characters
have the same size and a vertical writing style. The character size varies from
(40×40) to (80×80) pixels for the stele images (the primary database), and from (10×10)
to (16×16) pixels for the video images, which serve as the secondary database for
experiments.
1.4 Contributions
As one of its objectives, this thesis proposes an efficient system to segment and
collect HanNom characters from grayscale images (stele images and video frames). In
addition, open problems and future work are presented and discussed. For an easy
overview, the contributions of this thesis are listed below.
First, for noise removal, an image enhancement step applies an erosion morphological
operator to the grayscale image. In fact, for some high-quality databases (the video
frame database) this stage can be omitted, whereas it is very important for others such
as the stele image database.
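Grayscale erosion replaces each pixel by the minimum of its neighbourhood, thinning bright background noise around dark strokes. A minimal NumPy sketch with a flat 3×3 structuring element follows; the element size is an assumption for illustration, since the thesis does not fix it in this section:

```python
import numpy as np

def grey_erode3(image):
    """Grayscale erosion with a flat 3x3 structuring element.

    Pads the image with its edge values, then takes the elementwise
    minimum over the nine shifted 3x3 windows.
    """
    p = np.pad(image, 1, mode='edge')
    h, w = image.shape
    windows = [p[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.min(windows, axis=0)
```

Applied to a stele image, a single bright noise pixel surrounded by darker background disappears, since the minimum of its neighbourhood is the background level.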
Second, as one of the major goals of this thesis, a text detection and localization
stage is proposed, which can be considered a coarse-to-fine text detection system. The
coarse detection applies a discrete wavelet transform (Haar DWT) to the image to make
text-like components emerge clearly, and then median filtering and thresholding are
applied for enhancement. The fine detection combines connected component analysis with
morphological operators to remove non-text-like components. Next, a density-based
region growing algorithm clusters text pixels, and all single text-like lines are
obtained by a projection profile-based (splitting-line) framework.
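The Haar DWT step can be sketched directly in NumPy: one decomposition level splits the image into LL, LH, HL, and HH subbands by pairwise averaging and differencing, and the three detail subbands are combined into an energy map on which text strokes stand out. The root-sum-of-squares combination below is one common choice, not necessarily the exact rule used in the thesis:

```python
import numpy as np

def haar_dwt2(image):
    """One-level 2-D Haar DWT (unnormalised averages/differences).

    Returns LL, LH, HL, HH subbands; image dimensions must be even.
    """
    a = image.astype(np.float64)
    # Rows: low band = pairwise mean, high band = pairwise half-difference.
    lo = (a[:, 0::2] + a[:, 1::2]) / 2
    hi = (a[:, 0::2] - a[:, 1::2]) / 2
    # Columns: repeat the same split on each band.
    ll = (lo[0::2, :] + lo[1::2, :]) / 2
    lh = (lo[0::2, :] - lo[1::2, :]) / 2
    hl = (hi[0::2, :] + hi[1::2, :]) / 2
    hh = (hi[0::2, :] - hi[1::2, :]) / 2
    return ll, lh, hl, hh

def wavelet_energy(image):
    """Combine the three detail subbands into a coarse text-energy map."""
    _, lh, hl, hh = haar_dwt2(image)
    return np.sqrt(lh ** 2 + hl ** 2 + hh ** 2)
```

On a flat background the energy map is zero everywhere, while high-contrast character strokes produce strong responses in the detail subbands.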
Third, for the text line verification stage, we propose a two-step algorithm to remove
false text lines. In the first step, thresholding verification, three features of a text
line (the fill factor, the non-direction factor, and the weighted wavelet gradient
vector direction) are extracted and compared with specific thresholds to identify false
text lines. In the second step, neural network-based verification continues removing
false text lines after training on sample patterns; the first-order statistics (in the
wavelet domain), text line periodicity, and the contrast between text and its background
are used as the inputs of the neural network.
Finally, a character segmentation stage is proposed as a combination of projection
profile analysis and non-linear segmentation for touching and kerned characters. In this
algorithm, a genetic algorithm is used to find the non-linear segmentation path, and
some dimensional heuristics about HanNom (or Chinese) characters are also utilized.
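The projection-profile part of this stage can be sketched as follows: summing ink pixels along one axis of a binarized text line gives a profile whose zero-valued columns mark the gaps between characters. This is a minimal illustration, not the thesis's full algorithm; runs that contain touching characters would be handed on to the non-linear (genetic) stage:

```python
import numpy as np

def projection_cuts(binary_line):
    """Segment a binary text line (1 = ink) by its projection profile.

    Columns whose projection is zero are background; each maximal run
    of non-zero columns becomes one candidate character interval
    [start, end).
    """
    profile = binary_line.sum(axis=0)
    segments, start = [], None
    for x, v in enumerate(profile):
        if v > 0 and start is None:
            start = x            # a new ink run begins
        elif v == 0 and start is not None:
            segments.append((start, x))
            start = None         # the run ended at the blank column
    if start is not None:
        segments.append((start, len(profile)))
    return segments
```

For vertically written HanNom text, the same routine applies with the roles of rows and columns swapped.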
The rest of this thesis is organized into six chapters. Chapter 2 briefly presents a
literature review of related work. Chapter 3 describes the background knowledge of DWT,
neural networks, genetic algorithms, and some basic image processing techniques used in
the thesis. Chapter 4 presents our proposed system in full detail. Chapter 5 shows the
experimental results and comparisons, using the two databases (the stele image and video
frame databases) for evaluation as stated in Section 1.3. Chapter 6 consists of a
concluding summary of the achievements and limitations of our proposed text localization
and character segmentation system and proposes some future research directions, and
Chapter 7 contains the references.
Chapter 2: Literature Review
Character extraction is an important part of an OCR (Optical Character Recognition)
system. It is a challenging area in which many researchers work and propose algorithms.
Usually, it can be divided into four sub-stages (or problems): image enhancement,
character area detection and localization, character area verification, and finally
character extraction.
Fig 2.1 A system of character extraction
2.1 Image Enhancement
In this stage, pre-processing techniques are used to improve the quality of an image and
make it better suited to subsequent analysis; this is usually necessary before
performing any analysis on natural images. Enhancement emphasizes or highlights salient
features of the original image and improves the visibility of fine patterns and object
details, for example by retrieving an edge map or removing background noise. In brief,
this process aims to produce an image that is more suitable than the original for a
specific application; here, “specific” refers to criteria convenient for the next
processing steps (detection and localization).
Some useful smoothing and sharpening filters are presented by Gonzalez and Woods [21];
in the frequency domain, these are referred to as low-pass and high-pass filters
respectively. Sometimes de-skewing an image helps a lot, as it improves the readability
of scanned images [30].
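A frequency-domain low-pass filter can be sketched with NumPy's FFT routines. The ideal (hard-cutoff) filter below is chosen only for brevity, and is known to cause ringing; it is not a specific filter from the cited texts:

```python
import numpy as np

def ideal_lowpass(image, cutoff):
    """Ideal low-pass filtering in the frequency domain.

    Shifts the DC component to the spectrum centre, zeroes every
    Fourier coefficient farther than `cutoff` from that centre, and
    transforms back.
    """
    f = np.fft.fftshift(np.fft.fft2(image))
    r, c = image.shape
    y, x = np.ogrid[:r, :c]
    mask = (y - r // 2) ** 2 + (x - c // 2) ** 2 <= cutoff ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))
```

A constant image passes through unchanged (only the DC term is non-zero), while fine noise, which lives far from the centre of the shifted spectrum, is suppressed.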
2.2 Character Areas Detection and Localization
In this stage there is initially no signal or information indicating whether an input
image contains any text. Therefore, text detection techniques are generally applied
first to determine the existence or non-existence of text in the image; once confirmed,
text localization is performed. This stage requires image segmentation techniques.
It can be said that most algorithms belong to one of two types, region-based and
texture-based, a division based on the features utilized. Besides these, text detection
and localization in the compressed domain is an interesting new approach.
2.2.1 Region-based Methods
Region-based methods are popular approaches that use color or grayscale, two properties
of a text region, to distinguish it from the corresponding properties of the background.
These methods can be divided into two sub-approaches: connected component (CC)-based and
edge-based. Both work in a bottom-up style, in which all small sub-structures (CCs,
edges, or both) are identified and then merged to create bounding boxes for text.
Techniques such as histogram thresholding and edge detection are frequently used in
these approaches.
CC-based Methods
As presented above, CC-based methods follow a bottom-up approach: large components are
built from small ones by grouping or clustering them, and this process continues until
all regions in the image are identified. To merge text components, geometric analysis is
combined with the spatial arrangement of the components so as to filter out non-text
components and mark the boundaries of the text regions [29].
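The grouping step starts from connected component labelling. A minimal sketch of 4-connected labelling by breadth-first search, using only the standard library and NumPy, is shown below; real CC-based systems would then compute and merge bounding boxes of the labelled components:

```python
import numpy as np
from collections import deque

def label_components(binary):
    """4-connected component labelling by breadth-first search.

    Each connected run of foreground (non-zero) pixels receives its
    own integer label, starting from 1. Returns the label image and
    the number of components found.
    """
    labels = np.zeros(binary.shape, dtype=int)
    count = 0
    for sy, sx in zip(*np.nonzero(binary)):
        if labels[sy, sx]:
            continue                      # already visited
        count += 1
        labels[sy, sx] = count
        queue = deque([(sy, sx)])
        while queue:
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < binary.shape[0] and 0 <= nx < binary.shape[1]
                        and binary[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = count
                    queue.append((ny, nx))
    return labels, count
```

Library routines such as `scipy.ndimage.label` implement the same operation more efficiently; the explicit version simply makes the bottom-up flavour of the approach visible.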
Ohya et al. [50] present an algorithm to segment scene images into regions based on
adaptive thresholding. Their system has four stages:
• Local thresholding is used for binarization.
• Character candidate regions are detected by observing grey-level differences between
adjacent regions.
• The character candidates are compared with standard patterns (in a database), from
which similarities are calculated for character recognition.
• A final relaxation operation updates the similarities.
This algorithm has the great advantage of extracting and recognizing characters,
including multi-segment characters, and works well under varying illumination conditions
and varying properties (sizes, positions, and fonts) when dealing with scene text images
(freight trains, signboards, …). However, the algorithm also has disadvantages,
especially for video documents: its binary segmentation is ill-suited to documents
having several objects with different grey levels, high levels of noise, and variations
in illumination. Moreover, because it assumes upright, unconnected (monochrome)
characters as input, it encounters several restrictions related to text alignment.
The algorithm of Zhong et al. [69] is also a CC-based approach. It first uses horizontal
spatial variance to find the approximate locations of text lines and then extracts text
components using color segmentation. The algorithm works on a “color reduction”
principle: the peaks in the histogram (RGB space) are used for color space quantization,
and text regions are clustered in this quantized space.
The authors then build a filtering stage through which each text component must pass,
using a number of heuristics such as:
• The regularity of most characteristics.
• Printed characters of approximately the same size and line thickness are located at a
regular distance from each other.
Moreover, based on these regularities, the authors also consider text regions as having
a certain texture; the test set for this system consists of CD images and book cover
images.
Two systems for detecting, segmenting, and recognizing text in videos have been
developed by Lienhart et al. [38, 39]. In both systems, text regions are treated as CCs
having the same or similar color and size. In [38] the authors use a color-based
split-and-merge algorithm and track text only short-term in order to rule out non-text
regions; text, however, is segmented and recognized on a per-frame basis and not
associated over time, and an iterative text recognition algorithm then converts the text
into ASCII. The latter system [39] relies on anisotropic diffusion; text is tracked over
its lifetime, and multiple text occurrences are integrated over time. What the two
systems have in common is the exploitation of monochromaticity in image segmentation.
Another method, introduced by Hase et al. [24], is also a CC-based method and is applied
to color documents. The authors assume that every character is printed in a single
color, a common assumption in text localization algorithms. The RGB color space is
converted into the L*a*b* color space, and the color histogram is used as a criterion
for selecting representative colors. The image is then divided into several binary
images, after which a multi-stage relaxation [23] performs string extraction on each
binary image. After merging the results from the individual binary images, some
candidate strings are not actually characters, and when the candidate strings of all
images are superimposed, some strings overlap each other. To overcome this, character
strings are selected by their likelihoods (the area ratio of black pixels in elements to
their bounding rectangle, and the standard deviation of the line widths of the initial
elements) using conflict resolution rules. All conflicts (inclusion and overlap) are
then resolved by a filtering stage, for which the authors use a tree representation, a
problem of great importance that other text localization techniques neglect. The
algorithm performs well on shadowed and curved strings, but the likelihood is hard to
define accurately, which, as the authors describe, results in missed text or false
alarms.
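One of the likelihood features above, the area ratio of black pixels in a component to its bounding rectangle, can be sketched as follows. This is an illustrative helper, not the authors' implementation; the function name and the coordinate-list representation are assumptions.

```python
def fill_ratio(component):
    """Area ratio of foreground pixels to the component's bounding rectangle.

    `component` is a list of (row, col) pixel coordinates of one connected
    component; a ratio near 1 means the component fills its box densely.
    """
    rows = [r for r, _ in component]
    cols = [c for _, c in component]
    box_area = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    return len(component) / box_area

# A 2x3 solid block fills its bounding box completely.
solid = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
print(fill_ratio(solid))  # 1.0
```

A diagonal pair of pixels, by contrast, fills only half of its 2x2 box, so such a feature separates dense strokes from scattered noise.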
Duong et al. [17] present a document analysis system using the CC method for grayscale
document images. Geometric features and some texture features are the criteria for
clustering and classifying text versus non-text zones. Regions of interest (text areas)
are retrieved via cumulative gradient considerations combined with some entropic
heuristics. The system can be divided into three stages:
• Area detection - regions with significant grey-level variations in the horizontal
direction are assumed to be text areas.
• Binarization - instead of performing a global thresholding of the whole image, the
authors prefer to threshold every chip separately.
• Text separation - the set of areas found via gradient discrimination is clustered
using geometric features combined with some texture primitives.
In this work the authors assume that text elements are written in Latin alphanumeric
symbols and set in horizontal lines.
Generally, CC-based methods may segment a character into multiple CCs, especially when
dealing with polychrome text strings and low-resolution, noisy video images. All the
systems examined above share the same four-stage structure:
• Pre-processing, also called enhancement, such as skew correction, color clustering,
and noise reduction.
• CC generation, to generate all connected components.
• Eliminating fake candidates, i.e. filtering out non-text components. Several
threshold values are needed for this filtering; they are chosen by observing all images
or videos from the databases.
• Finally, a component grouping stage groups all desired components. The performance of
these algorithms depends mainly on this stage, in which projection analysis (text line
selection) plays a central role.
As we have seen, CC-based methods are easy to understand and simple to implement.
Therefore, these methods are widely used in practice and proposed by many researchers.
Edge-based Methods
The main principle of this category is built on the property of high contrast between
the text and the background, and edges are one of the criteria that satisfy this
principle. Once edges are detected, the text boundary must be found through an
identification and merging process, and a combination of several heuristics is then
used to filter out the non-text regions. Edge filters based on the Canny or Sobel
operator are proposed by many authors for the edge detection, and a smoothing operation
or a morphological operator is used for the merging stage [29].
A morphological approach is presented by Hasan and Karam [22] for text extraction from
images; its main advantages are insensitivity to noise, skew, and text orientation, as
well as freedom from artifacts. The main idea of the algorithm is as follows:
• Build an intensity image Y computed from the color input image as follows:

Y = 0.299 R + 0.587 G + 0.114 B    (2.1)

where R, G, and B are the red, green, and blue components, respectively.
• The authors then use a morphological gradient operator to retrieve all edges. The
identified edges go through a thresholding stage to obtain a binary edge image; an
adaptive threshold is applied to each candidate region (a possible text area) in the
intensity image. To build the candidate regions, the authors use a dilation that groups
spatially close edges, while small components are removed by erosion.
• Finally, fake candidates (non-text components) are filtered using size, thickness,
aspect ratio, and grey-level homogeneity.
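The first two steps above, intensity conversion by Eq. (2.1) and a morphological gradient (dilation minus erosion), can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the authors' implementation; the 3x3 structuring element and the boundary handling are assumptions.

```python
def luminance(r, g, b):
    # Eq. (2.1): intensity from the red, green, and blue components.
    return 0.299 * r + 0.587 * g + 0.114 * b

def morph_gradient(img):
    """Morphological gradient: local max (dilation) minus local min (erosion)
    over a 3x3 neighborhood; border pixels use the available neighbors."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            neigh = [img[j][i]
                     for j in range(max(0, y - 1), min(h, y + 2))
                     for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = max(neigh) - min(neigh)
    return out

flat = [[5] * 4 for _ in range(4)]
print(morph_gradient(flat))  # all zeros: a flat region has no edges
print(luminance(255, 255, 255))  # 255.0: white maps to full intensity
```

A step in intensity, e.g. a row [0, 0, 9, 9], produces nonzero gradient only at the transition, which is exactly what the subsequent thresholding stage exploits.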
This method is simple, has many variants that adapt it to color images, and appears
robust to noise, as shown by experiments with noisy images. As noted above, its main
advantages are insensitivity to skew and text orientation, and it can be applied to
extract curved text strings. However, it is hard to apply to objects that have similar
grayscale values but different colors in a color space, and it has difficulty with
prominent text regions that are not easy to detect in the grey-level image. Finally,
only three images were used for testing when comparing this method with others.
To improve on the algorithm in [22], Wu et al. [62] present a morphology-based text
line extraction algorithm for extracting text regions from images. A novel set of
morphological operations extracts text line candidates with significant contrast. To
detect the orientation of a text line, the authors present a moment-based method; from
this orientation, text-like components are extracted for verification using an
x-projection technique. A novel recovery algorithm is then applied to recover complete
text lines that were fragmented earlier. Finally, a verification scheme checks all
extracted potential text lines according to their text geometries. The average accuracy
of this algorithm, as described by the authors, is 95.4%, and the algorithm is
insensitive to skewed text lines. The authors also compare this algorithm with the
algorithm in [22] and report a better result.
Gao and Yang [19] present an adaptive algorithm for automatic detection of text in
natural scenes. It belongs to prototype systems that detect and recognize sign inputs
automatically from a video camera and translate the signs into English text or voice
streams. Here, "signs" are categorized as follows: names (streets, buildings,
companies, etc.); information (designations, directions, notices, etc.); commercial
(announcements, advertisements, etc.); and traffic (warnings, limitations, etc.). The
algorithm consists of:
• A multi-scale edge detection algorithm - edge detection obtains the initial candidate
text regions under varying lighting conditions, and a multi-resolution approach then
compensates for other variations, such as noise and contrast.
• Adaptive searching and color modeling in the neighborhood of the initial candidate
text regions - this layer performs an adaptive search and color modeling near the
initially detected text cues to discriminate text from non-text regions and find all
the text regions in the image.
• Layout analysis of the detected text regions.
As reported by the authors, 93.3% of text regions are detected without missing
characters, 5.9% are detected with missing characters, and only 0.8% of the text
regions in the database (823 patterns) are not detected, with a false alarm rate of
10.1%. However, no experiment compares their method with other methods.
Smith and Kanade [56] localize text by first detecting vertical edges with a predefined
template (using a 3×3 horizontal differential filter followed by thresholding). The
authors then apply a smoothing process that eliminates small edges, group the vertical
edges into text regions, and create a bounding box by connecting the adjacent edges. To
collect text regions, heuristics such as the aspect ratio, fill factor, and size of
each bounding box are used to filter out non-text regions. Finally, clusters with
similar texture and shape characteristics are retrieved by intensity histogram
observation. This method is fast but produces many false alarms because many background
regions may also have strong horizontal contrast.
2.2.2 Texture-based Methods
The key point of this category is to find differences between text pixels and
background pixels, in other words to use distinctive textural properties to distinguish
text regions from the background. Many techniques, based on Gabor filters, wavelets,
FFT, spatial variance, etc., can be used to detect these properties [29]. Generally,
the methods in this category are more complicated than those of the first category
(region-based methods).
Park et al. [52] present a texture-based method, using neural networks, for vehicle
license plate localization. To locate a license plate, the horizontal variance of the
text is combined with neural networks that act as filters, analyzing small windows of
an image and deciding whether each window contains a license plate. The authors use
time-delay neural networks (TDNNs) as texture discriminators in the HSI color space,
one as a horizontal and one as a vertical filter; each network takes a small window of
HSI color values as input and decides whether or not the window contains part of a
license plate number. To create the bounding boxes that contain license plates, the
authors apply projection profile analysis to the combination of the two filtered images
(the outputs of the two neural networks). The method is robust when dealing with noisy
images.
In contrast, Wu et al. [63, 64] propose an image segmentation approach using a
multi-scale texture segmentation scheme. The authors exploit the distinctive
characteristics that make text stand out from other image material: text possesses
certain frequency and orientation information, and it shows spatial cohesion -
characters of the same text string have similar heights, orientation, and spacing.
The first phase of this four-phase system treats text as a distinctive texture; a
texture segmentation scheme focuses attention on regions where text may occur. The
detection is based on nine second-order Gaussian derivative filters, and a non-linear
transformation is applied to each filtered image. The local energy is then estimated
and clustered with the K-means algorithm, where each feature component is the output of
the nonlinear transformation at a pixel. Next, a five-step chip generation stage is
initiated: stroke generation (all character strokes are generated), stroke filtering
(fake strokes are eliminated), stroke aggregation (strokes are combined into the
desired regions), chip filtering, and chip extension.
This algorithm is a multi-scale method that detects text over a wide range of sizes
before mapping it back onto the original image. This makes it insensitive to the image
resolution, which is its main advantage, but small text may be missed in some cases.
Sin et al. [55] use the number of edge pixels (horizontal and vertical) as frequency
features and then use the spectrum from the Fourier transform to detect text regions in
real-scene images. The frequency features of text images are highly intuitive and thus
appealing. Most text regions lie on a rectangular background; the authors find the
lines of these rectangles, which bound the desired regions, by applying the Hough
transform to the detected edges. However, it is not clear how these three stages are
merged to generate the final result.
Wavelet transforms are also used in texture-based methods, for example by Mao et al.
[49]. Their system is built on a multi-scale scheme, and its most interesting aspect is
the use of local energy analysis for hybrid Chinese / English text detection. For each
scale, the local energy variation is computed over a local region from the wavelet
transform coefficients of the image. A binary image is obtained by thresholding the
local energy variation and is then analyzed by connected component-based filtering
using geometric attributes such as size and aspect ratio. Finally, the text regions
detected at the several scales are combined.
Liang and Chen [37] dedicate another wavelet transform-based approach, in which the
Haar DWT is employed to detect edges of candidate text regions. The authors show that
one distinction between text edges and non-text edges is their intensities. Therefore,
thresholding is applied to each detail component sub-band produced by the Haar DWT to
remove non-text edges. Morphological dilation operators are then used to merge isolated
text edges of each detail sub-band in a transformed binary image. Finally, the text
regions are determined by applying a logical AND operator to the three detail component
sub-bands. As reported by the authors, the average correct rate is high (98.57%) and
the processing time is fast (0.39 s for each 1024×768 video frame or image in BMP or
MPEG format). Combinations of wavelet transforms and neural networks (or support vector
machines, SVMs) are also presented by several authors [7, 66].
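The scheme above, one level of 2-D Haar decomposition followed by thresholding the three detail sub-bands and ANDing them, can be sketched as follows. This is an unnormalized average/difference version on plain lists, not Liang and Chen's code; the sub-band names and the threshold are assumptions, and the dilation step is omitted for brevity.

```python
def haar2d(img):
    """One-level 2-D Haar DWT (unnormalized): returns the approximation LL
    and the three detail sub-bands LH, HL, HH. Image sides must be even."""
    # Row pass: pairwise averages (low-pass) and differences (high-pass).
    lo = [[(r[2*i] + r[2*i+1]) / 2 for i in range(len(r) // 2)] for r in img]
    hi = [[(r[2*i] - r[2*i+1]) / 2 for i in range(len(r) // 2)] for r in img]
    # Column pass over each of the two row-pass outputs.
    def cols(m, op):
        return [[op(m[2*j][i], m[2*j+1][i]) for i in range(len(m[0]))]
                for j in range(len(m) // 2)]
    avg = lambda a, b: (a + b) / 2
    dif = lambda a, b: (a - b) / 2
    return cols(lo, avg), cols(hi, avg), cols(lo, dif), cols(hi, dif)

def edge_mask(lh, hl, hh, t):
    """Threshold each detail band and AND them, as in the scheme above."""
    return [[1 if abs(lh[j][i]) > t and abs(hl[j][i]) > t and abs(hh[j][i]) > t
             else 0 for i in range(len(lh[0]))] for j in range(len(lh))]
```

On a flat (constant) image all three detail bands are zero, so the mask is empty; strong edges in all three directions are needed for a pixel to survive the AND, which is what suppresses isolated non-text edges.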
Jung [28] uses a neural network-based texture discrimination method to locate text in
color images and demonstrates that the proposed method is applicable to text location
in complex color images for efficient content-based indexing. A set of texture
discrimination masks is fed into a neural network for training in order to improve the
classification of text versus non-text regions. As described by the author, the color
bands (R, G, and B) are treated as textural properties and examined.
An algorithm combining the FFT and a neural network is proposed by Chun et al. [12].
Its most distinctive feature is line sampling: texts are extracted in real time without
processing the whole image. Characters intuitively have higher frequency content, and
this property is used to find candidate text areas. To save processing time, the FFT is
computed on overlapping line segments of 1×64 pixels, and the output for each line
segment consists of 32 features. For pixel discrimination, a feed-forward neural
network with one hidden layer is applied to the 32-dimensional feature vector. After
the neural network stage, noise elimination and labeling operations are performed. The
actual processing time is not reported, although the authors state that their system
runs well in real time.
Another approach is presented by Hua et al. [26] for automatic location of text in
video frames. The properties of interest are that text regions are rich in corners and
edges. This assumption drives the text detection scheme: corner detection, refinement,
and merging find the candidate text regions, and edge maps (Sobel vertical and
horizontal edge detection) are then used to split these regions into single text lines.
Finally, the authors use feature analysis of the detected text lines to reduce false
alarms.
Clark and Mirmehdi [13] present a texture-based approach that combines several
statistical properties of local image neighborhoods to locate text in real-scene
images. The statistical measures they describe extract properties of the image that
characterize text and are, to a large degree, invariant to the orientation, scale, and
color of the text in the scene. According to the authors, the five predefined measures
can identify specific properties of visible text that differentiate it from most other
parts of an image. In the text detection stage, the authors combine these statistical
measures and use a neural network to classify regions of an image as text or non-text
based on them; for each input image, the measures are computed and the neural network
uses them to determine likely text regions.
Fig 2.2 An approach of Clark and Mirmehdi [13]
This approach avoids the use of different thresholds for various situations, and it is
also useful when text is too small to read or the text plane is not fronto-parallel to
the camera.
2.2.3 Text Extraction in Compressed Domain
Nowadays, most digital images and videos are stored, processed, and transmitted in a
compressed form to save space and time. Therefore, methods that operate directly on
images in MPEG or JPEG compressed formats have recently been presented. These methods
require only a small amount of decoding, resulting in faster algorithms, and the DCT
coefficients and motion vectors in an MPEG video are also useful in text detection
[68].
Zhong et al. [68] present one such method for localizing captions in JPEG images and in
I-frames of MPEG compressed videos. Textural properties such as the directionality and
periodicity of local image blocks are derived from the DCT coefficients. The results
are then refined using morphological operations and connected component analysis. As
described by the authors, the algorithm is very fast (approximately 0.006 seconds to
process a 240×350 image) and has a recall rate of 99.17% with a false alarm rate of
1.87%. However, some disadvantages remain; for example, precise localization results
cannot be generated for text and non-text regions.
2.2.4 Other Approaches
In the text localization stage, many binarization techniques using global, local, or
adaptive thresholding can be applied. Thresholding gives good results on images with a
homogeneous background; for a non-homogeneous background, however, it is inadequate, so
adaptive thresholding is used instead. This technique usually slides a local window of
predefined size across the image and applies thresholding within that window; the
interesting question in this technique is the choice of the window size.
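The local-window idea above can be sketched as a mean-based adaptive threshold: each pixel is compared with the mean of a small window around it. The window size, the mean criterion, and the `bias` knob are illustrative assumptions; published adaptive methods (e.g. Niblack-style) use more elaborate local statistics.

```python
def adaptive_threshold(img, win=3, bias=0):
    """Binarize by comparing each pixel with the mean of a win x win window
    around it (truncated at image borders); pixels above mean + bias -> 1."""
    h, w = len(img), len(img[0])
    r = win // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            patch = [img[j][i]
                     for j in range(max(0, y - r), min(h, y + r + 1))
                     for i in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = 1 if img[y][x] > sum(patch) / len(patch) + bias else 0
    return out

# A bright pixel on a dark background survives; the background does not.
print(adaptive_threshold([[10, 10, 10], [10, 50, 10], [10, 10, 10]]))
```

Because the threshold is recomputed per window, a slow illumination gradient across the image does not defeat the binarization the way a single global threshold would.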
With this feature, these methods are suitable for images with black characters on a
white background, such as document images. Moreover, due to their relatively simple
implementation, many adaptations of this approach have been proposed for specific
applications, such as locating addresses on postal mail or the courtesy amount on
checks [9]. In addition, some hybrid approaches have been developed to overcome the
limitations of any single approach; usually these methods combine the advantages of the
previous methods.
As presented in the previous sections, the algorithm of Zhong et al. [69] belongs to
the CC-based methods; however, it can also be considered a fusion of the CC-based and
texture-based approaches. After the bounding boxes are retrieved by the texture-based
method, the CC-based method is applied to fill in characters extending beyond the
bounding boxes. No quantitative analysis of the performance enhancement is given by the
authors for this hybrid scheme. Other hybrid methods are presented by authors such as
Gandhi et al. [18] and Strouthopoulos et al. [57].
2.3 Character Areas Verification
In this section we address character area verification. This stage verifies the
text-area candidates produced by the character area (text area) detection and
localization stage in order to eliminate fake candidates (non-text areas); its results
are the text areas supplied to the character extraction stage. To speed up the system,
the text tracking or text verification stage should take no longer than the text
detection and localization stage. This stage can also perform recovery of the original
image. Even so, there have not been many approaches for this stage.
For text tracking in video, Lienhart [39, 40] uses a block-matching algorithm of the
kind standardized for video compression schemes such as MPEG. This algorithm refines
extracted text regions based on temporal text motion information. The matching
criterion is the minimum of the mean absolute differences, and every localized block is
examined against a given threshold: a block is treated as a text component when the
grayscale difference between the blocks is less than the threshold value.
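The matching criterion above, the mean absolute difference (MAD) between two blocks, can be sketched directly. The function name and list-of-lists block representation are illustrative; a real tracker would evaluate this over a search window and keep the minimum.

```python
def mad(block_a, block_b):
    """Mean absolute difference between two equal-sized grayscale blocks,
    the matching criterion used in block-matching text tracking."""
    n = len(block_a) * len(block_a[0])
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b)) / n

a = [[10, 10], [10, 10]]
b = [[12, 8], [10, 14]]
print(mad(a, a))  # 0.0: identical blocks, a perfect match
print(mad(a, b))  # 2.0
```

A tracked text block is accepted when its minimum MAD over candidate positions falls below the chosen threshold, linking the same caption across consecutive frames.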
Li et al. [36] present a text tracking approach intended to handle several
circumstances (captions, printed texts). A pure translational motion model is built
from the sum of squared differences, based on multi-resolution matching to reduce the
computational complexity. In addition, to deal with slightly larger text blocks and to
keep the tracking process stable, text contours and edge maps are used. However, this
method does not seem suitable for handling scale changes and rotation.
Moreover, some heuristics can be applied in text tracking to pinpoint the desired
areas; these heuristics depend on the specific problem. For the license plate location
problem, Park et al. [52] present an algorithm whose final step creates bounding boxes
around plate candidate regions using license plate characteristics such as size, shape,
and height / width ratio. For example, for Korean license plates the authors select
plate candidate regions using the following heuristics: the aspect ratio of a bounding
box is within the range [1.5, 2.5]; the area is larger than 1000 pixels and smaller
than 10000 pixels; and the ratio of the component area to the size of its bounding box
is greater than 0.9.
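The three Korean-plate heuristics above translate directly into a candidate filter. The function name and argument layout are illustrative; the numeric bounds are the ones quoted from [52].

```python
def is_plate_candidate(width, height, fill_pixels):
    """Heuristic filter from the Korean license-plate example: aspect ratio
    in [1.5, 2.5], bounding-box area in (1000, 10000) pixels, and a
    component-area / bounding-box ratio above 0.9."""
    area = width * height
    aspect = width / height
    return (1.5 <= aspect <= 2.5
            and 1000 < area < 10000
            and fill_pixels / area > 0.9)

print(is_plate_candidate(100, 50, 4800))  # True: aspect 2.0, 5000 px, 0.96 fill
print(is_plate_candidate(100, 10, 990))   # False: too elongated and too small
```

Such hard-threshold filters are cheap to evaluate, which is why they are commonly run as the last step after the more expensive detection stages.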
2.4 Character Extraction
In an OCR system, the text extraction stage provides the input of the recognition stage
and therefore deserves much attention. The recognition stage performs much better when
the result of the previous stage (text extraction) is examined carefully. Hence,
enhancement techniques are preferably applied before text extraction, and several
widely used algorithms in this stage combine text enhancement and character extraction.
In Sato et al. [53], a linear interpolation technique is proposed for text enhancement,
magnifying small text to a higher resolution. This magnification is applied after text
regions are found, in order to obtain higher-resolution images. Image quality is then
improved by multi-frame integration using the resolution-enhanced frames, under the
assumption that the background may move but video captions remain stable across
frames. Note that when both the text and the background move in parallel, the
background is not cleaned up. After the image enhancement stages, the character
extraction stage applies a projection analysis. This projection analysis is an
essential technique for character segmentation and is used by many authors [65, 67].
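Projection-profile segmentation, as used above, counts foreground pixels per column and cuts characters at empty-column gaps. The following is a minimal sketch on a binary image given as lists of 0/1 rows; function names are illustrative, and real text would need a gap-width threshold rather than exactly-empty columns.

```python
def projection_profile(img):
    """Column-wise count of foreground (1) pixels: the vertical projection
    profile used to find inter-character gaps."""
    return [sum(col) for col in zip(*img)]

def split_at_gaps(profile):
    """Character segments = maximal runs of non-empty columns, returned as
    (start, end) column index pairs."""
    segments, start = [], None
    for x, v in enumerate(profile):
        if v > 0 and start is None:
            start = x
        elif v == 0 and start is not None:
            segments.append((start, x - 1))
            start = None
    if start is not None:
        segments.append((start, len(profile) - 1))
    return segments

# Two "characters" separated by one empty column.
img = [[1, 1, 0, 1, 1],
       [1, 0, 0, 0, 1]]
print(projection_profile(img))          # [2, 1, 0, 1, 2]
print(split_at_gaps([2, 1, 0, 1, 2]))   # [(0, 1), (3, 4)]
```

This is exactly the step that fails on touching or kerned characters, where no empty column exists between glyphs, which motivates the non-linear segmentation discussed later in this thesis.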
A similar algorithm is presented by Hong et al. [25]. This method performs basic
segmentation and fine segmentation based on varying spacing thresholds and a minimum
variance criterion. The segmentation process is broken down into several steps:
• Basic segmentation is implemented first by varying the space threshold; this stage
produces the five most probable segmentations, each with a specific space threshold.
Projection profiles are taken in the horizontal (X-axis) and vertical (Y-axis)
directions. Given a threshold, the horizontal projection is scanned from left to right;
when a gap larger than the threshold is found, that gap is chosen as a boundary. The
threshold value is then varied and the process repeated, noting that once a particular
segmentation is decided, the median and variance of the segment widths are calculated
and stored for the next stage. In the basic segmentation, this process is repeated five
times.
Fig 2.3 The basic segmentation in Hong’s approach [25]
• Note that when more than one radical is joined together or the distance between
radicals is too large, the basic segmentation is not effective. The authors therefore
introduce a fine segmentation that splits joined segments and merges over-segmented
parts. The top two candidate segmentations from the basic segmentation, those with
minimum variance, are selected for fine segmentation. The following process is applied
to extract characters and repeated until no new segmentation is created:
o Scan each segment from left to right.
o If the width of the segment is too small compared with the median, combine this
segment with the next few segments.
o Otherwise, if the width of the segment is too large, estimate how many parts the
segment should be split into and split it.
o Then calculate the median and variance for the new segmentation. A new segmentation
is called a "valid way" and stored if its variance decreases; otherwise it is
discarded.
Fig 2.4 The fine segmentation in Hong’s approach [25]
As described by the authors, the sentence-level accuracy, with an average of eight
handwritten characters per sentence, is 49% for the top choice and 56% for the top five
choices; the average character recognition accuracy is 85%. However, the method does
not explain the purpose of the vertical projection profile; presumably it is applied to
characters written in vertical lines.
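Hong's minimum-variance criterion above scores a candidate segmentation by how uniform its segment widths are. A small sketch, with illustrative helper names, assuming a segmentation is given as a sorted list of cut positions (column indices):

```python
import statistics

def segment_widths(boundaries):
    """Widths of the segments delimited by consecutive cut positions."""
    return [b - a for a, b in zip(boundaries, boundaries[1:])]

def width_variance(boundaries):
    """Population variance of segment widths; Hong's criterion prefers the
    candidate segmentation whose widths are most uniform (lowest variance)."""
    return statistics.pvariance(segment_widths(boundaries))

even = [0, 10, 20, 30]    # three equal-width segments
uneven = [0, 4, 20, 30]   # widths 4, 16, 10
print(width_variance(even))                            # 0.0
print(width_variance(uneven) > width_variance(even))   # True
```

This matches the rule in the fine segmentation step: a split or merge is kept only if it decreases this variance, i.e. makes the extracted characters more uniform in width.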
A binarization algorithm that uses a filtering stage to refine the binarization results
is proposed by Antani et al. [1] for text extraction. This filtering stage serves as
image enhancement and is designed based on color, size, spatial location, topography,
and shape. The authors also use the assumption that text is usually brighter than the
background; since this is not always true, the algorithm may have to be run on both the
original image and its inverse.
Chen et al. [10] propose a character extraction method based on connected component
(CC) analysis. The localized text regions (grayscale images) from the previous stage
are passed to an EM-based segmentation stage, in which CC analysis with geometric
filtering is used to obtain characters. The experimental results reported by the
authors appear good.
Chapter 3: Related Background Knowledge
As stated earlier, this chapter briefly introduces the techniques and theories used in
this thesis. A short introduction to DWT theory is presented first, followed by the
theories of artificial neural networks and genetic algorithms. Finally, some basic
image processing techniques, namely the morphological operators (erosion, dilation) and
projection profile analysis, are formally introduced. For more details, we refer to
[32, 58], [28], [2, 3], and [34, 45] for wavelet, neural network, genetic algorithm,
and image processing theories, respectively.
3.1 Discrete Wavelet Transforms
Like the Fourier transform and many others, the wavelet transform is an orthogonal
transform, often used in image compression [54, 60] and text detection [20, 37]. As an
orthogonal transform it is one-to-one, so the signal can be reconstructed from its
coefficients. There are two types of wavelet transforms: the discrete wavelet transform
(DWT) and the continuous wavelet transform (CWT).
The CWT of a function f(x) ∈ L²(R) is defined [11] as:

CWT{f(x); a, b} = ∫ f(x) Ψ_{a,b}(x) dx    (3.1)

where the functions Ψ_{a,b}(x) form the family of wavelet functions, with a ≠ 0 the
scale of the transform and b the spatial location.
However, the CWT cannot be applied directly to discrete signals. In this section we
therefore focus on the DWT, a special case of the CWT that provides a compact
representation of a signal in time and frequency and can be computed efficiently. The
DWT of a discrete signal f(n) (n ∈ Z) is defined by the following equation:
DWT{f(n); 2^j, 2^j k} = Σ_{n∈Z} f(n) g*_j(n − 2^j k)    (3.2)

with g_j(n − 2^j k) the discrete equivalent of Ψ_{j,k}(x) = 2^{−j/2} Ψ(2^{−j} x − k).
The DWT analysis can be performed using a fast pyramidal algorithm related to
multi-rate filterbanks [47]. The DWT [48] is computed with a cascade of filters, each
followed by subsampling by a factor of 2 (Fig 3.1).
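The cascade in Fig 3.1 can be sketched with the Haar filters: each stage produces approximation coefficients cA (low-pass) and detail coefficients cD (high-pass), both subsampled by 2, and the next stage is applied to cA. This uses unnormalized averages/differences for readability (proper Haar filters carry a 1/√2 factor), so treat it as an illustrative sketch.

```python
def haar_step(signal):
    """One filter-bank stage: pairwise average (low-pass) and pairwise
    difference (high-pass), each effectively subsampled by a factor of 2.
    The signal length must be even."""
    cA = [(signal[2*i] + signal[2*i+1]) / 2 for i in range(len(signal) // 2)]
    cD = [(signal[2*i] - signal[2*i+1]) / 2 for i in range(len(signal) // 2)]
    return cA, cD

# Two-level cascade as in Fig 3.1: f(n) -> (cA1, cD1), then cA1 -> (cA2, cD2).
f = [4, 2, 6, 6, 5, 3, 2, 0]
cA1, cD1 = haar_step(f)
cA2, cD2 = haar_step(cA1)
print(cA1, cD1)  # [3.0, 6.0, 4.0, 1.0] [1.0, 0.0, 1.0, 1.0]
print(cA2, cD2)  # [4.5, 2.5] [-1.5, 1.5]
```

Each level halves the length, which is the pyramidal structure that makes the DWT computable in linear time.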
Fig 3.1 An example of a DWT tree: f(n) is decomposed into cA1 and cD1, and cA1 is
further decomposed into cA2 and cD2