Journal of Science & Technology 101 (2014) 179-181
Building Databases for Good Quality Vietnamese Synthesis
Trinh Van Loan¹*, Dinh Dong Luong², Pham Thi Kim Ngoan², Le Xuan Thanh¹
¹ Hanoi University of Science and Technology, No. 1, Dai Co Viet Str., Hai Ba Trung, Ha Noi, Viet Nam
² Nha Trang University
Received: March 05; accepted: April 22, 2014
Abstract
Vietnamese is a monosyllabic, tonal language. To produce high-quality synthesized Vietnamese
units, it is therefore necessary to synthesize the six tones with characteristics as close to
natural language as possible. In this paper, we propose a new approach to building Vietnamese
databases for synthesizing the tones of Vietnamese with good quality. In addition, the databases
can be used for other Vietnamese synthesis applications based on the concatenation method.
Keywords: Vietnamese database, good quality, tonal, concatenation
1. Introduction
Vietnamese synthesis using the concatenation method has so far achieved some initial results
[1-3]. However, these results are still limited. Through practice and research, we have found
that the quality of Vietnamese synthesis depends mostly on the quality of tonal synthesis and
of the database. We have therefore built a database that satisfies both factors, with tonal
synthesis coming first, for good quality Vietnamese synthesis. In this article, the first part
introduces some basic characteristics of Vietnamese phonetics as background for the paper.
The next part describes the stages carried out to build a database for good quality Vietnamese
synthesis using sound unit concatenation, and the last part is the assessment. The scientific
significance of a database built by our method is that a synthesizer with unlimited vocabulary
can be implemented for any individual voice once the database of that speaker has been built.
Another method of building a Vietnamese database for a synthesizer based on sound unit
concatenation used synthesized tones rather than natural tones [4]. With our method, the
quality of Vietnamese tones is quite natural. A trial of this method with a limited vocabulary
has proven its advantage [5].
* Corresponding author. Tel.: (+84) 903.277.732

2. Basic characteristics of Vietnamese phonetics
Vietnamese is a monosyllabic language [6]. Words do not change their morphology or endings
(tails) to indicate grammatical categories. In terms of word structure, Vietnamese does not
use affixes or small morphemes. Vietnamese is an analytic language with no distinction between
syllable boundaries and morpheme boundaries: each syllable is one morpheme. Vietnamese
words formed from one, two or more morphemes are called monosyllables, disyllables and
polysyllables, respectively.
Vietnamese is a tonal language with 6 tones: level (no mark), hanging, sharp, heavy, asking
and tumbling. In Vietnamese, tone acts as a component of the syllable, taking part in forming
syllables and words and in distinguishing word meanings. Moreover, tone is the factor that
gives Vietnamese its specific characteristics.

3. Vietnamese phonemes and syllable structure system
In modern Vietnamese, the basic phoneme system includes 14 vowels and 22 consonants [6].
Each vowel can stand alone or combine with one or two others to form a rhyme. In its full
form, each Vietnamese syllable includes 5 parts: initial, onset, nucleus, coda and tone. All
parts except the initial are together called the final, or rhyme. They are arranged as shown
in Table 1.

Table 1. Syllabic structure in Vietnamese
+-----------------------------------------+
|                  Tone                   |
+---------+-------------------------------+
| Initial |         Final (Rhyme)         |
|         +---------+---------+-----------+
|         |  Onset  | Nucleus |   Coda    |
+---------+---------+---------+-----------+

For example, the syllable "toán" is analyzed as follows: initial /t/, onset /o/, nucleus /a/,
coda /n/ and sharp tone.
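The five-part structure described above can be written down as a small data type. The sketch below is only an illustration: the class name `Syllable`, its field names and the `rhyme` property are our own choices, not part of the paper, and the coda of the example syllable is taken to be /n/.

```python
from dataclasses import dataclass

# The six tone names used in the text.
TONES = ("level", "hanging", "sharp", "heavy", "asking", "tumbling")


@dataclass(frozen=True)
class Syllable:
    """Hypothetical container for the five parts of a Vietnamese syllable."""
    initial: str
    onset: str
    nucleus: str
    coda: str
    tone: str

    def __post_init__(self):
        # Reject tone names that are not among the six listed tones.
        if self.tone not in TONES:
            raise ValueError(f"unknown tone: {self.tone}")

    @property
    def rhyme(self):
        # Everything except the initial belongs to the final (rhyme).
        return (self.onset, self.nucleus, self.coda, self.tone)


# The example syllable from the text, assuming its coda is /n/.
toan = Syllable(initial="t", onset="o", nucleus="a", coda="n", tone="sharp")
print(toan.initial, toan.rhyme)
```

Keeping the tone inside the rhyme mirrors the statement that all parts other than the initial form the final.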
3.1. Initial consonant system
Vietnamese has 22 initial consonants, listed in Table 2:
Table 2. Vietnamese consonants
No.  Letter   Phoneme    No.  Letter   Phoneme
1    b        /b/        12   nh       /ɲ/
2    c/k/q    /k/        13   ng/ngh   /ŋ/
3    ch       /c/        14   p        /p/
4    d/gi     /z/        15   ph       /f/
5    đ        /d/        16   r        /ʐ/
6    g/gh     /ɣ/        17   s        /ʂ/
7    h        /h/        18   t        /t/
8    kh       /x/        19   th       /tʰ/
9    l        /l/        20   tr       /ʈ/
10   m        /m/        21   v        /v/
11   n        /n/        22   x        /s/

3.2. Onset
Onsets, which function as tonal depression, are semi-vowels. Vietnamese has 2 semi-vowels:
/i̯/ and /u̯/.

3.3. Nucleus
Vietnamese has 16 vowels categorized into 14 groups, as shown in Table 3:

Table 3. Nucleus system in Vietnamese
No.  Letter   Phoneme    No.  Letter         Phoneme
1    a        /a/        8    ô              /o/
2    ă        /ă/        9    ơ              /ɤ/
3    â        /ɤ̆/        10   u              /u/
4    e        /ɛ/        11   ư              /ɯ/
5    ê        /e/        12   ua/uô          /uo/
6    i, y     /i/        13   ưa/ươ          /ɯɤ/
7    o        /ɔ/        14   ia/iê/ya/yê    /ie/

3.4. Coda
Apart from the /zero/ coda, the Vietnamese coda system has 6 consonants and 2 semi-vowels,
as shown in Table 4:

Table 4. Coda system in Vietnamese
No.  Letter   Phoneme    No.  Letter   Phoneme
1    m        /m/        5    t        /t/
2    n        /n/        6    c/ch     /k/
3    ng/nh    /ŋ/        7    i/y      /i̯/
4    p        /p/        8    o/u      /u̯/

4. Database building
We have constructed the database to synthesize good quality Vietnamese with the aim of
recreating the most natural tones. Tonal quality, rather than the size of the database, has
been given top priority. To construct a database that meets the requirements above, two
issues should be considered: the database must allow tones close to the natural voice to be
synthesized, and the speech signals in the database must be recorded under good conditions.
Furthermore, to make the database better suited to synthesis, we need to solve problems such
as building a complete database that satisfies the requirements, choosing the voices to
record, and organizing the script. The choice of speaker depends only on the type of voice
(male, female, old or young) that we want to synthesize.
To meet these requirements, the database should consist of sound files, each corresponding
to one syllable. Each recorded syllable serves a defined unit to be synthesized. Following
this idea, each syllable is divided into 2 units: initial and coda. The main parts of the
initial unit and the coda unit correspond to the initial and the nucleus, respectively, as
shown in Table 1 (syllabic structure in Vietnamese). According to the results of the research
in [7] and this division into 2 sound units, the initial units are associated with the level
tone, while the coda units are associated with all 6 tones. Therefore, when building the
database for initial units, we record only sounds with the level tone. For coda units, we
record all 6 tones.

4.1. Syllable list foundation in database
Based on the Vietnamese syllable structure and using a computer, we have established the
complete list of syllables that need recording. The list is built by a combinatorial method
with the purpose of covering all possible cases of Vietnamese syllables. After the
combination stage, we eliminate the cases that do not exist in Vietnamese and manually
filter the list of sounds to record. Syllables are recorded according to the defined number
of initial and coda units.
Initial foundation: by combining initial units with nucleus vowels, we get 324 combinations.
After the manual elimination phase, 294 combinations remain. For instance, some combinations
that do not exist in Vietnamese are "ce", "cê", "ci", "nghu", "nghư", ...
Coda foundation: by combining onsets, nuclei and codas from the table of Vietnamese syllable
structure, we finally get 721 combinations existing in Vietnamese. In particular, by
combining onsets with nuclei and removing non-existing combinations, we get 187 combinations.
Combining these 187 combinations with the codas, we obtain 2244 combinations. Next, we
exclude the combinations that do not exist in Vietnamese, and 721 combinations remain. For
example, some eliminated combinations that do not exist in Vietnamese are "at", "ăt", "ât",
"ap", "ăp", "ă", "ăi", "ăo", ...
In total, 1015 combinations have been established. These combinations are combined with the
necessary characters to form a list of syllables to record, some of which have similar
pronunciations. Accordingly, we only have to record 976 syllables.
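The combine-then-eliminate procedure of Section 4.1 can be sketched as follows. This is a toy illustration, not the authors' program: the function names and the tiny inventories are hypothetical, and the exclusion set contains only the non-existent combinations quoted in the text (with the diacritics read as "cê" and "nghư", which is our assumption). In the paper the elimination is done by hand on the full inventory.

```python
from itertools import product


def build_candidates(first_units, second_units):
    """Form every concatenation of a first unit with a second unit."""
    return [a + b for a, b in product(first_units, second_units)]


def drop_nonexistent(candidates, nonexistent):
    """Keep only the candidates that are not flagged as non-existent."""
    return [c for c in candidates if c not in nonexistent]


# Hypothetical toy inventories, far smaller than the real ones.
initials = ["c", "ngh", "t"]
nuclei = ["e", "ê", "i", "u", "ư"]
# Non-existent combinations quoted in the text (diacritics are our reading).
nonexistent = {"ce", "cê", "ci", "nghu", "nghư"}

candidates = build_candidates(initials, nuclei)
kept = drop_nonexistent(candidates, nonexistent)
print(len(candidates), "candidates,", len(kept), "kept")

# Consistency checks on the figures reported in the paper.
# 2244 / 187 = 12 suggests 12 coda units; that count is our inference.
assert 187 * 12 == 2244
assert 294 + 721 == 1015
```

The two assertions only confirm that the reported counts agree with one another: 294 initial combinations plus 721 coda combinations give the stated total of 1015.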
4.2. Recording scripts
After establishing the syllable list, we need to prepare a recording script that gives the
best results. For coda combinations, we place /n/ or /t/ in front of these units. For
example, to obtain the coda combinations "ương" and "oan", we record the syllables "tương"
and "toan", or "nương" and "noan". The consonant /n/ (or any voiced consonant) or /t/ has
been chosen as the first phoneme of a recorded syllable because the coda can be extracted
more easily from syllables constructed with these consonants. This method allows the
extraction of syllables and sound units to be done automatically or semi-automatically.
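At the text level, the carrier-consonant idea described above can be sketched as below. The function names are hypothetical, and the code handles spellings only; the extraction described in the paper is performed on the recorded audio, which this sketch does not attempt.

```python
# Carrier consonants named in the text: /t/ and /n/.
CARRIERS = ("t", "n")


def carrier_syllables(coda_unit, carriers=CARRIERS):
    """Spell the syllables to record for one coda unit, one per carrier."""
    return [c + coda_unit for c in carriers]


def strip_carrier(syllable, carriers=CARRIERS):
    """Recover the coda-unit spelling from a recorded carrier syllable."""
    for c in carriers:
        if syllable.startswith(c):
            return syllable[len(c):]
    raise ValueError(f"no known carrier in: {syllable}")


for unit in ("ương", "oan"):
    print(unit, "->", carrier_syllables(unit))
```

Because the coda units begin with a vowel or semi-vowel letter, removing a single leading "t" or "n" is unambiguous for the spellings used here.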
To reduce coarticulation to the lowest level, the syllables in the list should be displayed
independently on the computer screen. At any one moment, just one syllable to be recorded is
shown, for 1 second.
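A rough sketch of such a prompting schedule is given below. The helper names are hypothetical, and the 976-syllable count is the figure reported in Section 4.1; nothing here reflects the actual prompting software used for the recordings.

```python
def prompt_schedule(syllables, seconds_per_prompt=1.0):
    """Pair each syllable with the time (in seconds) at which it is shown."""
    return [(i * seconds_per_prompt, s) for i, s in enumerate(syllables)]


def total_prompt_time(n_syllables, seconds_per_prompt=1.0):
    """Total display time in seconds when prompts follow one another."""
    return n_syllables * seconds_per_prompt


print(prompt_schedule(["toan", "noan", "tương"]))
minutes = total_prompt_time(976) / 60
print(f"976 prompts at 1 s each: about {minutes:.1f} minutes")
```

At one prompt per second, 976 syllables take a little over 16 minutes of display time before any breaks are counted.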
4.3. Recording
The recording equipment is the Computerized Speech Lab Model 4500 (CSL) from KayPENTAX,
designed for speech recording and analysis. The recording room is isolated from external
environmental noise. Recording was carried out in the studio of the School of Information
Technology and Communications at Hanoi University of Science and Technology. The sampling
frequency is 16000 Hz with 16 bits per sample. The speaker reads the syllables to be
recorded evenly, clearly and decisively. With an average duration of about 250 ms per
syllable, the recording time is 244000 ms (244 seconds) for 976 syllables.
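The duration figure can be checked with a few lines of arithmetic, and the same parameters give a rough estimate of the raw audio size. The sketch assumes single-channel, uncompressed PCM and counts speech only (no silence between syllables and no file headers); these assumptions are ours, and the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 16000   # sampling frequency given in the text
BITS_PER_SAMPLE = 16     # sample width given in the text


def speech_duration_s(n_syllables, avg_ms):
    """Total speech time in seconds for n syllables of a given average length."""
    return n_syllables * avg_ms / 1000


def pcm_bytes(duration_s, rate_hz=SAMPLE_RATE_HZ, bits=BITS_PER_SAMPLE, channels=1):
    """Size in bytes of uncompressed PCM audio (mono by assumption)."""
    return int(duration_s * rate_hz) * (bits // 8) * channels


duration = speech_duration_s(976, 250)
print(duration, "s of speech")
print(pcm_bytes(duration) / 1e6, "MB of raw samples")
```

Under these assumptions, 244 seconds of speech corresponds to roughly 7.8 MB of raw samples.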
At first, we recorded three voices: one man's, one woman's and one child's. The continuous
recording time for a 976-syllable set is 20 minutes (breaks between syllables included).
The total size of the 1015 syllables is 10 MB for each voice. This is the database we built
for research purposes. For practical applications, after the initial or coda units are
extracted for synthesis, the remainder should be discarded. The total size then falls to
5.8 MB. According to the calculated results, the average signal-to-noise ratio is 21 dB,
which is good and acceptable [4].
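The paper does not say how the signal-to-noise ratio was computed. A minimal sketch using the usual definition, 10·log10 of the ratio of mean signal power to mean noise power, is shown below; the function name and the idea of estimating noise from a silent stretch of the recording are our assumptions.

```python
import math


def snr_db(signal, noise):
    """SNR in dB from the mean power of speech samples and of noise samples."""
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)


# Toy samples: the speech amplitude is ten times the noise amplitude.
speech = [10, -10, 10, -10]
silence = [1, -1, 1, -1]
print(snr_db(speech, silence), "dB")
```

A tenfold amplitude ratio is a hundredfold power ratio, that is, 20 dB, which is close to the 21 dB average reported for the recordings.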
5. Conclusion
In summary, we have introduced a method of building databases for good quality Vietnamese
synthesis. The initial results suggest that the synthesized voice is satisfactory. We
believe that building the database by this method creates favorable conditions for
synthesizing Vietnamese dialects and any voice that we want to synthesize. In addition, the
database that we have built can also be used for other synthesis applications, especially
Vietnamese synthesis using the concatenation method.
References
[1] Tran Do Dat, Eric Castelli, Serignat Jean-Francois, Trinh Van Loan, Le Xuan Hung,
"Linear F0 Contour Model for Vietnamese Tones and Vietnamese Syllable Synthesis with
TD-PSOLA", Proc. TAL 2006, La Rochelle, April 2006.
[2] Nguyen Thanh Kien, Nguyen Duc Thang, Le Thai Hoa, Trinh Van Loan, "DSP-based Embedded
System for Text to Speech Synthesis of Vietnamese", Proceedings of the 2nd Asia Pacific
International Conference on Information Science and Technology, Hanoi, December (2007)
215-219.
[3] Hansjörg Mixdorff, Nguyen Hung Bach, Hiroya Fujisaki, Mai Chi Luong, "Quantitative
Analysis and Synthesis of Syllabic Tones in Vietnamese", EuroSpeech 2003, Geneva.
[4] Tran Do Dat, Eric Castelli, Trinh Van Loan, Le Viet Bac, "Building a large Vietnamese
Speech Database", Tap chi Khoa hoc va Cong nghe (ISSN 0868-3980) Vol. 46/47 (2004) 13-17.
[5] La The Vinh, Trinh Van Loan, "Vietnamese Recognition and Synthesis with T-engine
Embedded System", Proceedings of the 2nd Asia Pacific International Conference on
Information Science and Technology, Hanoi, (2007) 133-137.
it ban
[7] Tran Do Dat, Eric Castelli, Serignat Jean-Francois, Le Xuan Hung, Trinh Van Loan,
"Influence of F0 on Vietnamese syllable perception", Proc. of Interspeech 2005, Lisbon,
(2006) 1697-1700.