Vietnamese speech synthesis for some assistant services on mobile devices

NGUYEN TIEN THANH

MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
---------------------------------------

Nguyen Tien Thanh

COMPUTER SCIENCE

VIETNAMESE SPEECH SYNTHESIS FOR
SOME ASSISTANT SERVICES ON MOBILE DEVICES

MASTER OF SCIENCE THESIS
COMPUTER SCIENCE

2014B
Hanoi – 2016


Master of science thesis 2016

MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
---------------------------------------

Nguyen Tien Thanh

VIETNAMESE SPEECH SYNTHESIS FOR
SOME ASSISTANT SERVICES ON MOBILE DEVICES

Department : International research institute MICA


MASTER THESIS OF SCIENCE
COMPUTER SCIENCE

SUPERVISOR:
Dr. Mac Dang Khoa

Hanoi – 2016

Nguyễn Tiến Thành - Vietnamese speech synthesis for assistant services on mobile devices


COMMITMENT
I commit myself to be the person who was responsible for conducting this
study. All reference figures were extracted with clear derivation. The presented
results are truthful and have not been published in any other person’s work.
Nguyễn Tiến Thành


ACKNOWLEDGEMENT
During my time as a master's student, many people gave me generous help and
inspiration.
I wish to thank all my professors and colleagues at MICA International
Research Institute for their generous support. The advice and knowledge they
imparted to me are gratefully appreciated and inspired me to finish this thesis.
Special thanks go to my supervisor, Dr. Mạc Đăng Khoa, and to my colleagues in
the Speech Communication Department, MICA Institute, for their advice and
encouragement, and especially to Assoc. Prof. Trần Đỗ Đạt for the thorough review
and invaluable suggestions.
I would like to thank Mr. Nguyễn Mạnh Hà and Ms. Nguyễn Hằng Phương for
their guidance in recording the corpus. I would also like to thank the many MICA
members who spent much of their time testing for my research.
I am grateful to Prof. Eric Castelli, Dr. Nguyễn Việt Sơn and MICA’s
directorate for providing me with the best working conditions at MICA
International Research Institute.
Finally, I owe a great deal to my parents and my younger brother for their
encouragement and support. They have given me strength and motivation in my
work and in my life.

Nguyễn Tiến Thành


List of figures
Figure 1-1 Representation of sound (Huang et al. 2001)
Figure 1-2 A schematic diagram of the human speech production apparatus (Huang
et al. 2001)
Figure 1-3 Glottal airflow and the resulting sound pressure at the mouth (Rabiner
and Juang 1993)
Figure 1-4 Waveform plot of the beginning of the utterance “It’s time” (Huang et al.
2001)
Figure 1-5 Signal of sound “my speech” and its spectrogram
Figure 1-6 Speech recognition and speech synthesis (Chandra and Akila 2012)
Figure 1-7 Schematic of text-to-speech synthesis
Figure 1-8 A schematic of the construction of an articulatory speech synthesizer and
how such a synthesizer may be considered to contain a model of information
encoding in the speech signal (Palo 2006)
Figure 1-9 Block diagram of a synthesis-by-rule system. Pitch and formants are
listed as the only parameters of the synthesizer for convenience. In practice, such
a system has about 40 parameters. (Huang et al. 2001)
Figure 1-10 Core architecture of HMM-based speech synthesis system (Yoshimura
2002)
Figure 1-11 General HMM-based synthesis scheme (Zen et al. 2009)
Figure 1-12 A diagram of the Hunt and Black algorithm, showing one particular
sequence of units and how the target cost measures a distance between a unit and
the specification, and how the join cost measures a distance between the two
adjacent units (Taylor 2009)
Figure 2-1 Schematic diagram of Hanoi Vietnamese tones (Michaud 2004)
Figure 2-2 Base system of Vu Hui Quan consists of 2 parts: training part and
synthesis part (Quan and Nam 2009)
Figure 2-3 Vietnamese speech recognition system (Vu et al. 2006)
Figure 2-4 Non-uniform unit selection model (Van Do et al. 2011)
Figure 2-5 Parse tree to search (Van Do et al. 2011)
Figure 3-1 Target cost of target units and candidate units (Tran 2007)
Figure 3-2 Sentence splits into phrases and syllables
Figure 3-3 Average length of syllables in different positions (Tran 2007)
Figure 3-4 Average length of syllables (Tran 2007)
Figure 3-5 Signal of “giỏi” syllable in two different positions
Figure 3-6 Sub-cost based on the difference in position of phrase
Figure 3-7 Sub-cost based on the difference in context of preceding syllable and
following syllable
Figure 3-8 Syllable “Quanh” is composed of four phonemes
Figure 3-9 Sub-cost based on the difference in context of preceding phoneme and
following phoneme
Figure 3-10 Acoustic units network
Figure 3-11 The algorithm of separating sentence into as long as possible phrases
Figure 3-12 Finding the longest phrase in database
Figure 3-13 Search space before applying acoustic units network
Figure 3-14 Search space after applying acoustic units network
Figure 3-15 Finding candidates of word “chúng tôi”
Figure 4-1 Interface of Adobe Audition 3.0
Figure 4-2 Interface of Praat
Figure 4-3 Most test result by domain
Figure 4-4 Perception test
Figure 4-5 Result of the perception test
Figure 4-6 Speed of synthesis process of two systems


List of tables
Table 1-1 Types of using some popular units
Table 2-1 The concluded structure of Vietnamese syllables (Tran 2003)
Table 2-2 Symbol of Vietnamese tones
Table 2-3 Advantages and disadvantages between two synthesis systems of Quan
and Thao
Table 3-1 Position difference and cost value (min is better). Target unit is begin or
end of sentence
Table 3-2 Position difference and cost value (min is better). Target unit is both
begin and end or is middle of sentence.
Table 3-3 Phoneme types in Vietnamese (Tran 2007)
Table 3-4 Direction and complexity of Vietnamese tones
Table 4-1. Number of sentences and distinct syllables in each domain
Table 4-2 Tags and Meaning of xml file


Contents
COMMITMENT
ACKNOWLEDGEMENT
List of figures
List of tables
Introduction
Chapter 1. Overview of speech processing and text-to-speech
1.1. Speech and speech processing
1.1.1. Sound
1.1.2. Human vocal mechanism
1.1.3. Speech representation in the time and frequency domains
1.1.4. Speech processing
1.2. Text-To-Speech
1.2.1. Introduction
1.2.2. Speech synthesis techniques
1.2.3. Articulatory synthesis
1.2.4. Formant synthesis
1.2.5. Concatenative synthesis
1.2.6. Statistical Parametric synthesis
1.3. From concatenative synthesis to unit selection synthesis
1.3.1. Extending concatenative synthesis
1.3.2. The algorithm of Hunt and Black
1.3.3. Speech synthesis based on non-uniform units selection
1.4. Conclusion
Chapter 2. Text-to-speech for Vietnamese
2.1. Overview Vietnamese language and phonology
2.1.1. Characteristics
2.1.2. Vietnamese syllable structure
2.2. Overview text-to-speech in Vietnamese
2.3. Discussion and proposal
2.4. Conclusion
Chapter 3. Improvement of Non-uniform unit selection technique for Vietnamese
Text-to-speech
3.1. Quality improvement: using target costs for unit selection
3.1.1. Target costs in Vietnamese synthesis
3.1.2. Separating sentence into phrases
3.1.3. Target cost computation
3.2. Performance improvement: using acoustic units network
3.2.1. Acoustic units network
3.2.2. Separating sentence into the longest phrases
3.2.3. Searching candidates
3.3. Conclusion
Chapter 4. Implementations and evaluation
4.1. System overview
4.2. Building database
4.2.1. Text database building
4.2.2. Speech corpus recording
4.2.3. Database processing

4.3. Evaluation
4.3.1. Quality of synthesized speech
4.3.2. Cost target improvement
4.3.3. Performance
4.4. Conclusion
Chapter 5. Conclusions and perspectives
References


Introduction
Context
Most people have heard synthetic voices at some point in their lives, in a
number of situations. For instance, some telephone information systems give
automated speech responses, and speech synthesis is often used as an aid for
people with disabilities.
Text-to-speech (TTS) systems have been integrated into many applications. One
useful application is a reader for blind people, which can take any text from a
book and convert it into speech. An application of this kind, known as Talkback,
has been developed by Google and integrated into Android OS. Talkback reads the
text displayed on the screen of an Android device to help blind people use the
device easily.
The mainstream adoption of TTS has long been severely limited by its quality. In
recent years, considerable advances in quality have made TTS systems more
common. Probably the main use of TTS today is in call-centre automation, where a
user calls to pay an electricity bill or book some travel and conducts the entire
transaction through an automatic dialogue system. Beyond this, TTS systems have
been used for reading news stories, weather reports, travel directions and a wide
variety of other applications.
In recent times, smart devices such as smartphones and tablets have become
increasingly popular and play an important role in our lives. They are used in
education, medicine, transport, communication, and so on. In Vietnam, some TTS
systems have been studied and developed for mobile devices, such as vnSpeak and
Viettel Speak. At MICA International Research Institute, researchers have also
developed TTS systems integrated into a number of applications, such as VIVA,
VIVAVU and VIQ on Google Play. However, these systems still have limitations
such as poor voice quality and long response time.

Our goal is to build a system that speaks from text, can be deployed on smart
devices and overcomes the weaknesses mentioned above. We hope that this system
will bring practical benefits to everyday life.
Objective of this thesis
This thesis was carried out at the Speech Communication Department of MICA
Institute. Its main goal is to build a high-quality Vietnamese speech synthesis
system that can be integrated into electronic devices running Android OS.
The basic theory of speech synthesis is studied first. Then, new methods are
proposed to improve the quality of the existing Vietnamese synthesis system, which
is designed to run on smartphones and smart devices.
The first task is to build a speech corpus for synthesizing Vietnamese
utterances. With this corpus, we can synthesize almost all Vietnamese syllables
and apply the text-to-speech system to any Vietnamese document.
After that, based on research on Vietnamese phonetics and Vietnamese
synthesis, we propose some new costs for computing the optimal path in speech
synthesis using the unit selection technique. These costs are expected to help us
choose more suitable units for synthesizing an utterance.
Moreover, we also suggest using a phonetic units network to reduce the time
needed to search for and select candidate units.
Finally, all of this research and these suggestions are applied to a speech
synthesis system that can be embedded in assistant applications on smartphones.
Thesis structure
Chapter 1 presents basic theories of speech, giving the background of speech
signal, speech signal processing and speech synthesis.

Chapter 2 focuses on theories of speech synthesis using the unit selection
technique. It also introduces current Vietnamese speech synthesis systems and
gives suggestions.
Chapter 3 presents our research on the target cost used for selecting units in
speech synthesis. We also describe an acoustic unit network that is used to
improve the performance of the TTS system.
Chapter 4 presents our work on building the Vietnamese speech corpus, together
with the experiments for evaluating the quality of the new TTS system.
The final part concludes the thesis and gives suggestions for further work.


Chapter 1. Overview of speech processing and text-to-speech
1.1. Speech and speech processing
In this section, we briefly review speech sounds and the human speech production
system. We also show how a speech signal can be represented.
1.1.1. Sound
Sound is a longitudinal pressure wave formed of compressions and rarefactions
of air molecules, in a direction parallel to that of the application of energy.
Compressions are zones where air molecules have been forced by the application of
energy into a tighter-than-usual configuration, and rarefactions are zones where air
molecules are less tightly packed.
The alternating configurations of compression and rarefaction of air molecules
along the path of an energy source are sometimes described by the graph of a sine
wave, as shown in Figure 1-1.

Figure 1-1 Representation of sound (Huang et al. 2001)
In this representation, crests of the sine curve correspond to moments of
maximal compression and troughs to moments of maximal rarefaction. Two
important parameters, amplitude and wavelength, describe a sine wave.
Frequency, measured in cycles per second or hertz (Hz), is also used to
characterize the waveform.
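
As a small illustration of how these parameters relate, the following Python sketch generates a sampled sine wave and converts a frequency into a wavelength. The speed of sound (343 m/s), the sampling rate and the function names are assumptions chosen for this example and do not come from the thesis.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C (assumed value)

def wavelength(frequency_hz, speed=SPEED_OF_SOUND):
    """Wavelength in metres of a tone with the given frequency."""
    return speed / frequency_hz

def sine_wave(amplitude, frequency_hz, sample_rate, duration_s):
    """Sample amplitude * sin(2*pi*f*t) at the given sampling rate."""
    n_samples = int(sample_rate * duration_s)
    return [amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate)
            for n in range(n_samples)]

# A 100 Hz tone sampled at 8 kHz for 10 ms: one full cycle of 80 samples.
samples = sine_wave(1.0, 100.0, 8000, 0.01)
```

The crest of the cycle falls a quarter of a period after the start, which is where the compression is maximal in the representation above.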


1.1.2. Human vocal mechanism
A schematic diagram of the human vocal mechanism is shown in Figure 1-2.
The gross components of the speech production apparatus are the lungs, trachea,
larynx (organ of voice production), pharyngeal cavity (throat), and oral and nasal
cavities. The pharyngeal and oral cavities are typically referred to as the vocal
tract, and the nasal cavity as the nasal tract. As illustrated in Figure 1-2, the
human speech production apparatus consists of:
- Lungs: source of air during speech.
- Vocal cords (larynx): when the vocal folds are held close together and
oscillate against one another during a speech sound, the sound is said to be voiced.
When the folds are too slack or tense to vibrate periodically, the sound is said to be
unvoiced. The place where the vocal folds come together is called the glottis.
- Velum (Soft Palate): operates as a valve, opening to allow passage of air (and
thus resonance) through the nasal cavity. Sounds produced with the flap open
include m and n.
- Hard palate: a long relatively hard surface at the roof inside the mouth, which,
when the tongue is placed against it, enables consonant articulation.
- Tongue: flexible articulator, shaped away from the palate for vowels, placed
close to or on the palate or other hard surfaces for consonant articulation.
- Teeth: another place of articulation used to brace the tongue for certain
consonants.
- Lips: can be rounded or spread to affect vowel quality, and closed completely
to stop the oral air flow in certain consonants (p, b, m).


Figure 1-2 A schematic diagram of the human speech production apparatus (Huang et al.
2001)

Air enters the lungs via the normal breathing mechanism. As air is expelled
from the lungs through the trachea (or windpipe), the tensed vocal cords within the
larynx are caused to vibrate (in the mode of a relaxation oscillator) by the air
flow. The air flow is chopped into quasi-periodic pulses, which are then modulated
in frequency in passing through the pharynx (the throat cavity), the mouth cavity,
and possibly the nasal cavity. Depending on the positions of the various
articulators (i.e. jaw, tongue, velum, lips, mouth), different sounds are produced.
The glottal air flow (volume velocity waveform) and the resulting sound
pressure at the mouth for a typical vowel sound are shown in Figure 1-3. The glottal
waveform shows a gradual build-up to a quasi-periodic pulse train of air, taking
about 15 ms to reach steady state. This build-up is also reflected in the acoustic
waveform shown at the bottom of the figure.


Figure 1-3 Glottal airflow and the resulting sound pressure at the mouth (Rabiner and
Juang 1993)

Speech is produced as a sequence of sounds. Hence, the state of the vocal cords,
as well as the positions, shape, and sizes of the various articulators, changes over
time to reflect the sound being produced.

1.1.3. Speech representation in the time and frequency domains
In general, we have three ways to represent a speech signal. The first is the
time-domain waveform itself. The speech signal is a slowly time-varying signal:
when it is examined over a sufficiently short period of time, its characteristics
are fairly stationary; over long periods of time, however, the signal
characteristics change to reflect the different speech sounds being spoken.
There are several ways of labeling events in speech. One of the simplest and most
straightforward is via the state of the speech-production source, the vocal cords.
We use a three-state representation, which includes:
- Silence (S), where no speech is produced.
- Unvoiced (U), in which the vocal cords are not vibrating, so the resulting speech
waveform is aperiodic or random in nature.
- Voiced (V), in which the vocal cords are tensed and therefore vibrate
periodically when air flows from the lungs, so the resulting speech waveform is
quasi-periodic.
It should be clear that the segmentation of the waveform into well-defined
regions of silence, unvoiced, and voiced signal is not exact; it is often difficult to
distinguish a weak, unvoiced sound from silence, or a weak, voiced sound from
unvoiced sounds or even silence.
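
The three-state labeling can be sketched with a common textbook heuristic based on short-time energy and zero-crossing rate. This heuristic is not described in the thesis, and both thresholds, the test signals and the function names below are illustrative assumptions; as noted above, real speech makes the boundaries much less clear-cut.

```python
import math
import random

def frame_features(frame):
    """Return (mean energy, zero-crossing rate) of one short frame."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return energy, crossings / (len(frame) - 1)

def classify_frame(frame, silence_energy=1e-4, unvoiced_zcr=0.25):
    """Label a frame S, U or V; both thresholds are illustrative guesses."""
    energy, zcr = frame_features(frame)
    if energy < silence_energy:
        return "S"          # almost no energy: silence
    # Noise-like (aperiodic) frames change sign far more often than
    # quasi-periodic voiced frames.
    return "U" if zcr > unvoiced_zcr else "V"

SAMPLE_RATE = 8000
rng = random.Random(0)
silence = [0.0] * 160
noise = [rng.uniform(-0.1, 0.1) for _ in range(160)]   # stand-in for an unvoiced sound
tone = [0.5 * math.sin(2 * math.pi * 100 * n / SAMPLE_RATE)
        for n in range(160)]                            # stand-in for a voiced sound
labels = [classify_frame(f) for f in (silence, noise, tone)]
```

Each 160-sample frame covers 20 ms at 8 kHz, short enough for the signal to be treated as fairly stationary.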

Figure 1-4 Waveform plot of the beginning of the utterance “It’s time”, with
segments labeled silence (S), unvoiced (U) and voiced (V) (Huang et al. 2001)
An alternative way of characterizing the speech signal and representing the
information associated with the sounds is via a spectral representation. Perhaps
the most popular representation of this type is the sound spectrogram, in which a
three-dimensional representation of the speech intensity, in different frequency
bands, over time is portrayed. Figure 1-5 shows an example of speech represented
by a spectrogram. In this figure, the spectral intensity at each point in time is
indicated by the intensity (darkness) of the plot at a particular analysis frequency.
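
A spectrogram of this kind is usually computed with a short-time Fourier transform. The sketch below is a minimal pure-Python version, assuming a Hann window, a frame length of 64 samples and a hop of 32 samples; these settings and the function name are choices made for the example, not values taken from the thesis.

```python
import cmath
import math

def spectrogram(signal, frame_len=64, hop=32):
    """Magnitude spectrogram: one list of frequency-bin magnitudes per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]                 # Hann window
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):              # bins from 0 Hz to Nyquist
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))                    # intensity of bin k
        frames.append(spectrum)
    return frames

SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(256)]
spec = spectrogram(tone)
# Bin k corresponds to k * SAMPLE_RATE / frame_len Hz, so a 1000 Hz tone
# concentrates its energy around bin 8.
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
```

Plotting each frame as a column, with larger magnitudes drawn darker, gives the kind of picture described above. A practical system would use an FFT library rather than this direct transform.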
A third way of representing the time-varying signal characteristics of speech is
via a parameterization of the spectral activity based on the model of speech
production. Because the human vocal tract is essentially a tube, or concatenation
of tubes, of varying cross-sectional area that is excited either at one end (by the
vocal cord puffs of air) or at a point along the tube (corresponding to turbulent air at
a constriction), acoustic theory tells us that the transfer function of energy from the
excitation source to the output can be described in terms of the natural frequencies
or resonances of the tube.
Such resonances are called formants for speech, and they represent the
frequencies that pass the most acoustic energy from the source to the output.
Typically, there are about three resonances of significance for a human vocal
tract below about 3500 Hz. There is a good correspondence between the estimated
formant frequencies and the points of high spectral energy in the spectrogram. The
formant frequency representation is a highly efficient, compact representation of
the time-varying characteristics of speech. The major problem, however, is the
difficulty of reliably estimating the formant frequencies for low-level voiced
sounds, and the difficulty of defining the formants for unvoiced or silent regions.
As such, this representation is more of theoretical than practical interest.
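
To see where "about three resonances below 3500 Hz" can come from, consider the simplest idealization: a uniform tube closed at the glottis and open at the lips, whose resonances lie at odd multiples of a quarter wavelength. The sketch below assumes a tube length of 17 cm and a speed of sound of 340 m/s; both values and the function name are illustrative assumptions, and a real vocal tract with varying cross-section deviates from this model.

```python
def tube_resonances(length_m, count=3, speed_of_sound=340.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    F_n = (2n - 1) * c / (4 * L) for n = 1, 2, 3, ...
    """
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, count + 1)]

# Assumed 17 cm tract: resonances near 500, 1500 and 2500 Hz.
formants = tube_resonances(0.17)
```

Under these assumptions the fourth resonance would fall at roughly 3500 Hz, which agrees with the statement that about three significant resonances lie below that frequency.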


Figure 1-5 Signal of sound “my speech” and its spectrogram

1.1.4. Speech processing

Speech processing enables a growing number of language processing
applications. We have already seen examples in the form of real-time dialogue
between a user and a machine: voice-activated telephone servers and embedded
conversational agents that control devices such as jukeboxes and VCRs. In such
systems, a speech recognition module transcribes the user’s speech into a word
stream. The word stream is then processed by a language engine dealing with
syntax and semantics, and finally by the back-end application program.

Figure 1-6 Speech recognition and speech synthesis (Chandra and Akila 2012)

A speech synthesizer converts the resulting answers (strings of characters) into
speech for the user. Figure 1-6 shows how speech processing is located within a
language processing architecture, here a natural language interface to a
database.
Speech recognition is also an application in itself, as with speech dictation
systems. Such systems enable users to transcribe speech into written reports or
documents without the help of a keyboard. Most speech dictation systems have no
other module than the speech engine and a statistical language model. They do not
include further syntactic or semantic layers.
Within the scope of this thesis, we focus on speech synthesis.

1.2. Text-To-Speech
1.2.1. Introduction
This field of study is known both as speech synthesis, that is, the “synthetic”
(computer) generation of speech, and as text-to-speech (TTS), the process of
converting written text into speech.
Text-to-speech systems have an enormous range of applications. Their first real
application was in reading systems for the blind, where a system would read some
text from a book and convert it into speech. Today, quite sophisticated systems exist
that facilitate human-computer interaction for the blind, in which the TTS can help
the user navigate around a windows system (Taylor 2009).

[Figure: Input text → Text and linguistic analysis → Prosody and speech generation
→ Synthesized speech]

Figure 1-7 Schematic of text-to-speech synthesis

As Figure 1-7 shows, synthesis starts from text input. Nowadays this may
be plain text or marked-up text, e.g. HTML or something similar. If the text uses
some sort of mark-up, it may already contain some or all of the information made
available by the text and linguistic analysis stage. Regardless of the quality of the
input text, after this stage we have a description of the text at the phonetic
level.
The first stage in the synthesis phase is to take the words we have just found
and encode them as phonemes. We do this because it provides a more compact
representation for further synthesis processes to work on. The words, phonemes and
phrasing form an input specification to the unit selection module. Actual synthesis
is performed by accessing a database of pre-recorded speech so as to find units
contained there that match the input specification as closely as possible.
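
The idea of matching database units against an input specification can be sketched as follows. This is only a toy illustration that counts mismatching features; the feature names (`phoneme`, `prev`, `next`), the three-entry database and the function names are hypothetical, and the cost functions actually proposed in this thesis are the subject of Chapter 3.

```python
def mismatch(spec, unit):
    """Number of specified features that the candidate unit fails to match."""
    return sum(1 for key, value in spec.items() if unit.get(key) != value)

def select_unit(spec, database):
    """Pick the database unit that matches the specification most closely."""
    return min(database, key=lambda unit: mismatch(spec, unit))

# Hypothetical pre-recorded units, each described by its phoneme and context.
database = [
    {"id": 1, "phoneme": "a", "prev": "t", "next": "n"},
    {"id": 2, "phoneme": "a", "prev": "b", "next": "n"},
    {"id": 3, "phoneme": "o", "prev": "b", "next": "n"},
]
spec = {"phoneme": "a", "prev": "b", "next": "n"}
best = select_unit(spec, database)
```

Here unit 2 matches every specified feature, while units 1 and 3 each miss one, so unit 2 is selected. A real system weights the features and also considers how well neighbouring units join.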
The second stage is prosody generation and the third stage is speech signal
generation. During the prosody stage, linguistic information is used to generate F0
contours, timing information for the phones, and so on. Finally, the synthesized
speech itself is generated from these specifications. In normal TTS, the generated
speech takes the form of an audio signal.
The pre-recorded speech can take the form of a database of waveform
fragments, and when a particular sequence of these is chosen, signal processing is
used to stitch them together to form a single continuous output speech waveform.
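The selection and stitching steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system built in this thesis: the `Unit` class, the two cost functions (both based on pitch alone), the toy database and the linear cross-fade are all hypothetical choices made for the example. For each target phoneme there are several candidate units; a dynamic-programming search picks the sequence with the lowest sum of target cost (mismatch with the specification) and join cost (discontinuity between neighbouring units), and the chosen fragments are then blended at their boundaries.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Unit:
    """Hypothetical database entry: one recorded fragment of one phoneme."""
    phoneme: str
    f0: float              # average pitch of the fragment, in Hz
    samples: List[float]   # waveform fragment


def target_cost(unit: Unit, wanted_f0: float) -> float:
    # Assumed cost: distance between the unit's pitch and the requested pitch.
    return abs(unit.f0 - wanted_f0)


def join_cost(left: Unit, right: Unit) -> float:
    # Assumed cost: pitch jump at the boundary between two units.
    return abs(left.f0 - right.f0)


def select_units(spec: List[Tuple[str, float]],
                 database: Dict[str, List[Unit]]) -> List[Unit]:
    """Dynamic-programming search for the cheapest unit sequence."""
    if not spec:
        return []
    # best[i][j] = (cost of the best path ending in candidate j, backpointer)
    best: List[List[Tuple[float, int]]] = []
    for i, (phoneme, wanted_f0) in enumerate(spec):
        row = []
        for unit in database[phoneme]:
            cost = target_cost(unit, wanted_f0)
            if i == 0:
                row.append((cost, -1))
            else:
                prev_units = database[spec[i - 1][0]]
                total, back = min(
                    (best[i - 1][k][0] + join_cost(prev, unit), k)
                    for k, prev in enumerate(prev_units)
                )
                row.append((cost + total, back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(spec) - 1, -1, -1):
        path.append(database[spec[i][0]][j])
        j = best[i][j][1]
    return path[::-1]


def crossfade_concat(fragments: List[List[float]], overlap: int) -> List[float]:
    """Join fragments, blending up to `overlap` samples at each boundary."""
    if not fragments:
        return []
    out = list(fragments[0])
    for frag in fragments[1:]:
        n = min(overlap, len(out), len(frag))
        for k in range(n):
            w = (k + 1) / (n + 1)  # weight of the incoming fragment
            idx = len(out) - n + k
            out[idx] = (1 - w) * out[idx] + w * frag[k]
        out.extend(frag[n:])
    return out


# Toy database with two candidates per phoneme (made-up values).
database = {
    "a": [Unit("a", 100.0, [1.0] * 4), Unit("a", 200.0, [2.0] * 4)],
    "b": [Unit("b", 110.0, [3.0] * 4), Unit("b", 210.0, [4.0] * 4)],
}
spec = [("a", 195.0), ("b", 205.0)]   # (phoneme, requested pitch in Hz)
chosen = select_units(spec, database)
wave = crossfade_concat([u.samples for u in chosen], overlap=2)
```

With this toy data the search prefers the two high-pitched candidates, since they are close both to the requested pitch and to each other. A real system would use much richer costs (spectral distance, duration, phonetic context) and a more careful joining method.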
This is essentially how one type of modern TTS works. One may well ask
why it takes an entire book to explain this then, but as we shall see, each stage of
this process can be quite complicated, and so we give extensive background and
justification for the approaches taken. Additionally, while it is certainly possible to
produce a system that speaks something with the above recipe, it is considerably
more difficult to create a system that consistently produces high quality speech no
matter what the input is.
1.2.2. Speech synthesis techniques

According to their development history, speech synthesis techniques can be
divided into generations. The techniques researched and developed in the first
generation are formant synthesis and articulatory synthesis.
Formant synthesis was the first genuine synthesis technique to be developed
and was the dominant technique until the early 1980s. Formant synthesis is often
called synthesis by rule.
Synthesis systems by concatenation are often collectively called second-generation
synthesis systems.

However, in concatenative synthesis we can never collect enough data to cover
all the effects we wish to synthesize, and often the coverage we have in the database
is very uneven. Furthermore, the concatenative approach always limits us to
recreating what we have recorded; in a sense all we are doing is reordering the
original data.
An alternative is to use statistical, machine-learning techniques to infer the
specification-to-parameter mapping from data. While this and the concatenative
approach can both be described as data-driven, in the concatenative approach we are
effectively memorizing the data, whereas in the statistical approach we are
attempting to learn the general properties of the data.
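The difference between memorizing and generalizing can be made concrete with a toy example. The sketch below is not taken from the thesis: the feature (number of phonemes in a syllable), the duration values and both function names are invented for illustration. A lookup table stands in for the concatenative approach and can only return what was recorded, while an ordinary least-squares line stands in for a (very simple) statistical model and can also answer for a specification that never occurred in the data.

```python
from typing import Dict, List, Optional, Tuple

# Invented training data: (phonemes in syllable, duration in ms).
recorded: List[Tuple[int, float]] = [(1, 100.0), (2, 150.0), (4, 250.0)]


def lookup_duration(n_phonemes: int) -> Optional[float]:
    """Memorizing: return a recorded value, or None if nothing matches."""
    table: Dict[int, float] = dict(recorded)
    return table.get(n_phonemes)


def fit_line(data: List[Tuple[int, float]]) -> Tuple[float, float]:
    """Generalizing: least-squares fit of duration = slope * n + intercept."""
    count = len(data)
    mean_x = sum(x for x, _ in data) / count
    mean_y = sum(y for _, y in data) / count
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(recorded)
unseen = 3
memorized = lookup_duration(unseen)        # no recording with 3 phonemes
predicted = slope * unseen + intercept     # the fitted line still answers
```

Here `memorized` is `None` because no syllable with three phonemes was recorded, whereas the fitted line interpolates a duration between the neighbouring cases. Real statistical synthesizers learn far richer mappings, but the trade-off is the same.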
While many approaches to statistical synthesis are possible, most work
has focused on using hidden Markov models (HMMs) or deep neural networks
(DNNs). These and the unit-selection techniques are termed third-generation
techniques.
This thesis focuses on one third-generation technique, namely the unit
selection technique.
1.2.3. Articulatory synthesis
Perhaps the most obvious way to synthesize speech is to try a direct simulation
of human speech production. This approach is called articulatory synthesis and is
actually the oldest in the sense that the famous talking machine of von Kempelen
can be seen as an articulatory synthesizer (Von 1791).

Figure 1-8 A schematic of the construction of an articulatory speech synthesizer
and how such a synthesizer may be considered to contain a model of information
encoding in the speech signal (Palo 2006)

Summary
• These models generate speech by direct modeling of human articulator
behavior.
• They are by their very nature the most “natural” way of generating speech,
and in principle speech generation in this way should involve control of a
simple parameter space with only a few degrees of freedom.
• In practice, acquiring data to determine rules and models is very difficult.
• Mimicking the human system closely can be very complex and
computationally intractable.
• Because of these difficulties, there is little engineering work in articulatory
synthesis, but it is central in the other areas of speech production, articulator
physiology and audio-visual or talking-head synthesis.

1.2.4. Formant synthesis
Formant synthesis was the first genuine synthesis technique to be developed
and was the dominant technique until the early 1980s. Formant synthesis is often
called synthesis-by-rule. As we shall see, most formant synthesis techniques do in
fact use rules of the traditional form, but data driven techniques have also been
used.


Figure 1-9 Block diagram of a synthesis-by-rule system. Pitch and formants are
listed as the only parameters of the synthesizer for convenience. In practice, such
a system has about 40 parameters (Huang et al. 2001).

Summary
• Formant synthesis works by using individually controllable formant filters,
which can be set to produce accurate estimations of the vocal-tract transfer
function.
• An impulse train is used to generate voiced sounds and a noise source to
generate obstruent sounds. These are then passed through the filters to
produce speech.
• The parameters of the formant synthesizer are determined by a set of rules
concerning the phone characteristics and phone context.
• In general, formant synthesis produces intelligible but not natural-sounding
speech.
• It can be shown that very natural speech can be generated so long as the
parameters are set very accurately. Unfortunately, it is extremely hard to do
this automatically.
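The source-filter idea summarized above can be sketched as follows. This is a minimal illustration, not a full synthesis-by-rule system: it assumes a cascade of second-order digital resonators (one per formant), a fixed pitch, and hand-picked formant frequencies and bandwidths that only roughly resemble an open vowel. All function names and numeric values are assumptions made for the example, and no rules for choosing parameters from phone context are included.

```python
import math
import random
from typing import List, Tuple


def impulse_train(f0: float, n_samples: int, sample_rate: int) -> List[float]:
    """Voiced source: one unit pulse per pitch period, zeros elsewhere."""
    period = max(1, round(sample_rate / f0))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def noise(n_samples: int, seed: int = 0) -> List[float]:
    """Noise source, as used for obstruent (unvoiced) sounds."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def resonator(signal: List[float], freq: float, bandwidth: float,
              sample_rate: int) -> List[float]:
    """Second-order resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    c = -math.exp(-2.0 * math.pi * bandwidth / sample_rate)
    b = (2.0 * math.exp(-math.pi * bandwidth / sample_rate)
         * math.cos(2.0 * math.pi * freq / sample_rate))
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to one
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize(source: List[float], formants: List[Tuple[float, float]],
               sample_rate: int) -> List[float]:
    """Pass the source through a cascade of formant resonators."""
    signal = source
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth, sample_rate)
    return signal


SAMPLE_RATE = 16000
# Illustrative (frequency, bandwidth) pairs in Hz; not measured values.
vowel_formants = [(700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)]

voiced_source = impulse_train(120.0, 1600, SAMPLE_RATE)
vowel = synthesize(voiced_source, vowel_formants, SAMPLE_RATE)
fricative = synthesize(noise(1600), [(4000.0, 500.0)], SAMPLE_RATE)
```

The result is 0.1 s of a buzzy, vowel-like sound and 0.1 s of filtered noise. In a synthesis-by-rule system, the formant values would change over time according to rules about each phone and its context, which is exactly the part that is hard to set accurately.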
