UNIVERSITY OF ECONOMICS HO CHI MINH CITY
SCHOOL OF ECONOMICS
NOTES ON TIME SERIES AND PANEL
TIME-SERIES ECONOMETRICS FOR
JUNIOR RESEARCHERS USING STATA
Author: Phùng Thanh Bình
2020
PREFACE
In 2009, I had a chance to give a series of lectures on time series econometrics for the Vietnam – The Netherlands Programme (VNP) for the M.S. in Applied Economics. In those days, time series econometrics was rarely recognized as a separate course in the economics curricula of Vietnamese universities. It took me many months to read textbooks, reference manuals and research articles, and then prepare the first draft of the notes. Master's theses and published articles using time series models have gradually increased since then. In early 2018, I received an email from a professor of macroeconomics and time series econometrics in Lucerne, Switzerland, and his detailed comments motivated me to revise the initial draft. It was then used as a supplementary reading for undergraduates of the economics programmes at the School of Economics, UEH1. In 2019, EfD (www.efd.vn) and VNP (www.vnp.edu.vn) offered me an opportunity to complete this first edition of Notes on Time Series and Panel Time-Series Econometrics for Junior Researchers Using Stata.
The main purpose of the notes is to explain the basic concepts and models in the simplest possible language for students and junior researchers in economics. I hope the notes are also a useful reference for other courses, such as econometrics for finance, economic forecasting and data analysis for undergraduates, and applied econometrics for graduates. Whenever a term is clearly defined in a proper textbook, the notes do not restate it; instead, an exact reference is given for further reading in the original source. Various examples, including datasets, are borrowed from econometrics textbooks, Stata time series/panel data reference manuals and research articles, but the hands-on instructions make these notes quite different. As in other econometrics texts, the use of mathematics and statistics is unavoidable, but I try to use a verbal approach and the backward substitution method to help undergraduates in economics who do not have a strong mathematical background understand the statistical tests and models. In addition, the notes present all the statistical tests in a step-by-step manner and explain how each test can be carried out using Stata do-files. I think this approach makes the study of time series and panel time-series econometrics more fascinating and more enjoyable.
I hope this series of notes is a useful complementary reading to the well-known textbooks in time series and panel time-series econometrics used at UEH & VNP. I also hope it can serve as a technical guidebook for junior researchers in economics, especially in environmental economics within the EfD network and the Economy & Environment Partnership for Southeast Asia (EEPSEA).
Thank you very much for reading the notes. If you find any errors or typos, please let me know by e-mail.
Phùng Thanh Bình
1. www.UEH.edu.vn
Contents

Preface
1. Introduction
2. The structure of economic data
3. Data management in Stata
4. The nature of time series
5. Stationary stochastic processes
   5.1 Definition
   5.2 AR and MA processes
   5.3 Examples of AR and MA processes in Stata
   5.4 Invertibility
   5.5 ARMA processes
6. Nonstationary stochastic processes
   6.1 Trends and random walks
   6.2 Unit root stochastic processes
7. Spurious regressions
   7.1 Understanding the concept
   7.2 Explaining the spurious regression problem
8. Testing for non-stationarity
   8.1 Graphical analysis
   8.2 Tests for white noise
   8.3 Autocorrelation function and correlogram
   8.4 Tests for non-stationarity
   8.5 Performing unit root tests in Stata
   8.6 Unit root tests in empirical studies
9. Short-run and long-run relationships
   9.1 Understanding the concepts
   9.2 Autoregressive distributed lag (ARDL) model and error correction model
10. Cointegration and error correction models
   10.1 Cointegration
   10.2 Common trends
   10.3 Tests of cointegration: Residual-based approaches
   10.4 Tests of cointegration: Bounds testing approaches
   10.5 Tests of cointegration with structural breaks
   10.6 Interpreting the error correction (EC) model
   10.7 ARDL or EC models
   10.8 Estimation of ARDL models
   10.9 Estimation of long-run parameters
   10.10 Numerical examples
   10.11 Applications of ECM-ARDL models in empirical studies
11. Vector autoregressive models
   11.1 Bivariate VAR models
   11.2 Estimating VAR models in Stata
   11.3 Applications of VAR models
12. Vector error correction models
   12.1 A two-variable VEC model
   12.2 A general VEC model
   12.3 Johansen tests of cointegration
   12.4 Estimation of Johansen tests and VEC models in Stata
   12.5 Some empirical examples
13. Causality analysis
   13.1 The standard Granger causality test
   13.2 The augmented Granger causality test
   13.3 Causality tests in Stata
   13.4 Toda-Yamamoto long-run causality test
14. Panel time-series models
   14.1 Introduction
   14.2 A new field of panel econometrics
   14.3 An overview of large-N-large-T panel data
   14.4 Cross-sectional dependence tests
   14.5 Slope homogeneity tests
   14.6 Panel unit root tests
   14.7 Panel cointegration tests
   14.8 Estimation of cointegrating relations in panels
   14.9 Panel Granger causality analysis
   14.10 Empirical studies with panel time-series data
15. Panel vector autoregression
   15.1 Introduction
   15.2 Panel VAR estimation in Stata
16. Concluding remarks
References
Appendix A: Partial regression coefficients
Appendix B: ARIMA models
Appendix C: ARDL and EC models
Appendix D: An explanation of ADF test equations
Appendix E: Impulse response function
1. INTRODUCTION
I write this series of notes on time series and panel time-series econometrics for my students in Applied Economics at the University of Economics Ho Chi Minh City (UEH). Since most economics students in Vietnam are likely to have problems with English as a second language, with their mathematics background, and especially with access to updated resources for self-study, this series hopefully makes some helpful contributions. The aim is to help my students understand key concepts of time series and panel time-series econometrics through hands-on examples in Stata. To this end, they should become able to read articles using time-series and panel time-series data. Moreover, I also expect that they will develop sufficient interest to write a thesis in this field. At the time of preparing this series of lecture notes, I believe that Vietnamese time series data are long enough to conduct such a study. In addition, large-N-large-T panel data2 may also be a good source for empirical research in macroeconomics.
This is just a concise summary of the body of knowledge in time series and panel time-series econometrics according to my own limited understanding. Obviously, it has little scientific value for citation, because many parts of the text are drawn from the original sources listed in the references. Research using bivariate models is not highly appreciated by the editors of prestigious academic journals3, nor by university supervisors. Thus, multivariate time series with structural breaks and panel time-series models deserve special attention. As a junior researcher, you must be independently and fully responsible for your own choice of research project. My advice is to start with the research problem of interest, not with data availability and statistical techniques. Use a model only when you really need it and understand it crystal clearly.
Some topics, such as ordinary least squares regression with stationary time series, modelling volatility clustering, seemingly unrelated regression equations, nonlinear time series modelling (switching, threshold autoregression and smooth transition autoregression), nonlinear autoregressive distributed lag and time-varying coefficient models, and traditional panel data models, are beyond the scope of these notes. You can find them in advanced econometrics textbooks, journal articles, and updated Stata reference manuals. After studying this series of notes with hands-on practice in Stata (at least version 15), you should have a basic understanding of the following topics in time series and panel time-series econometrics:
▪ The structure of economic data
▪ Data management in Stata
▪ The nature of time series
▪ The concepts of stationarity, non-stationarity, autoregressive (AR), moving
average (MA), and random walk processes
▪ ARMA and ARIMA models
▪ The concept of spurious regression
▪ Tests for non-stationarity (ADF, DF-GLS, PP)
▪ Tests for stationarity (KPSS)
▪ Tests for non-stationarity with structural breaks (Zivot-Andrews, Perron-Vogelsang, Clemente-Montanes-Reyes)
▪ The short-run dynamics and long-run relationship
▪ Autoregressive distributed lag (ARDL) model
▪ Error-correction (EC) models
▪ Tests for cointegration: Residual-based approach (AEG, CRDW)
▪ Tests for cointegration: Bounds testing approach (ARDL bounds test)
▪ Test for cointegration: Gregory and Hansen approach
▪ Estimation of long-run parameters (fully-modified OLS - FMOLS, dynamic
OLS - DOLS, canonical cointegrating regression – CCR)
▪ Vector autoregressive (VAR) models
▪ Impulse-response function and forecast-error variance decomposition
▪ Tests for cointegration: Johansen tests (multiple equation approach)
▪ Vector error correction (VEC) models
▪ Granger causality analysis (standard, augmented versions and Toda-Yamamoto)
▪ Panel time-series analysis (cross-sectional dependence tests, slope homogeneity
tests, panel unit root tests, panel cointegration tests, estimation of long-run
parameters (e.g., FMOLS and DOLS methods), panel ARDL models, and panel
Granger causality analysis)
▪ Panel vector autoregression models (model selection, estimation, impulse-response function, and Granger causality analysis).
To get started, you should be familiar with basic econometrics and statistics4. Searching for research articles, I realize that time series and panel time-series data analyses have been widely applied in macroeconomics, financial economics, and especially energy economics. The time series and panel time-series models in these notes therefore only equip you with basic tools for doing empirical research; specialized knowledge from the literature is the real key. Furthermore, a good way to use these notes is to replicate the do-file examples in Stata and to read the original articles that are used to illustrate the respective tests.
2. The most important data sources for such studies are the World Bank's World Development Indicators, International Financial Statistics, the Penn World Tables, UNIDO INDStat, Thomson Reuters, the General Statistics Office, Google Trends, etc.
3. See, for example, Öztürk (2010), Omri (2014), Smyth and Narayan (2015).
4. Suggested textbooks for undergraduates: Gujarati and Porter (2009), Gujarati (2011, 2015), Hill et al. (2017), Studenmund (2017), Wooldridge (2016), Daniels and Minot (2020), etc.
2. THE STRUCTURE OF ECONOMIC DATA
There are four types of data sets studied in econometrics: cross-sectional data, time series data, pooled cross sections, and panel data.
A cross-sectional data set consists of a sample of economic units (i.e., individuals,
households, firms, districts, provinces, states, countries, and so on) taken at a particular
point in time. An important feature of cross-sectional data is the assumption that they
are collected by random sampling from the underlying population. Because cross-sectional observations on a number of economic units at a given time are often generated
by way of a random sample, they are typically uncorrelated (Hill et al., 2017: p.418). In
economics, the analysis of cross-sectional data is closely aligned with the applied
microeconomics fields. Data on economic units at a specific point in time are important
for testing microeconomic hypotheses and evaluating economic policies (Wooldridge,
2016: p.5).
Table 1.1: Cross-sectional data example.

ID     Y       X1    X2    …    Xk
1      3.10    11    2     …    0
2      3.24    12    22    …    1
3      3.00    11    2     …    0
…      …       …     …     …    …
525    11.56   16    5     …    1
526    3.50    14    5     …    0
A time series data set consists of observations of one or several variables of a specific
economic unit over time such as stock prices, money supply, consumer price index,
gross domestic product (GDP), energy consumption, CO2 emissions, sales figures, and
so on. Because past events can influence future events and lags in behaviour are
prevalent in the social sciences, time is an important dimension in a time series data set.
A key feature of time series data is that most economic series are strongly related to
their recent histories. In other words, there is the likely correlation between different
observations. Another feature of time series data is its natural ordering according to
time. Therefore, if one rearranges time series observations, there is a danger of
confounding what is their most important distinguishing feature: the possible existence
of dynamic-evolving relationships between variables5. In addition, the data frequency is also a peculiar feature of time series data. In economics, the most common frequencies are daily, weekly, monthly, quarterly, and annual. Many weekly, monthly, and quarterly economic time series display a strong seasonal pattern, which can be an important factor in a time series analysis (Wooldridge, 2016: p.7). This type of data is the primary focus throughout the notes.

5. A dynamic relationship is one in which the change in a variable now has an impact on that same variable, or other variables, in one or more future time periods (Hill et al., 2017: p.418).
Table 1.2: Time series data example.

ID    Year    Y      X1     …    Xk
1     1950    0.20   20.1   …    878.7
2     1951    0.21   20.7   …    925.0
3     1952    0.23   22.6   …    1015.9
…     …       …      …      …    …
37    1986    3.35   58.1   …    4281.6
38    1987    3.35   58.2   …    4496.7
A pooled cross section is a data set that has both cross-sectional and time series
features. Pooling cross sections or units from different years is often an effective way
of analysing the effects of a policy intervention. The idea is to collect data from the
years before and after a key policy change. A pooled cross section is analysed much
like a standard cross section, except that we often need to account for secular differences
in the variables across time. In addition to increasing the sample size, the point of a
pooled cross-sectional analysis is often to see how a key relationship has changed over
time (Wooldridge, 2016: p.8).
Table 1.3: Pooled cross section data example.

ID    Year    Y        X1    …    Xk
1     1993    85500    42    …    1600
2     1993    67300    36    …    1440
3     1993    134000   38    …    2000
…     …       …        …     …    …
250   1993    243600   41    …    2600
251   1995    65000    16    …    1250
252   1995    182400   20    …    2200
…     …       …        …     …    …
520   1995    57200    16    …    1100
A panel or longitudinal data set consists of a time series for each cross-sectional
member in the data set. For example, we have wage, education, and employment history
for a set of individuals followed over a 10-year period. Or we might collect information
such as investment and financial data, about the same set of firms over a five-year time
period. Panel data can also be collected on geographical units such as data for the same
set of countries on energy consumption, economic growth and CO2 emission over
several years. The key feature of panel data that distinguishes them from a pooled cross
section is that the same cross-sectional units are followed over a given time period.
There are at least two advantages of using panel data sets over cross-sectional data or
even pooled cross-sectional data. First, the use of more than one observation can
facilitate causal inference in situations where inferring causality would be very difficult
if only a single cross section were available. Second, they often allow us to study the
importance of lags in behaviour or the result of decision making (Wooldridge, 2016:
p.9-10). A panel with a large number of units observed over a long time period, especially in finance and macroeconomics (i.e., a macro panel), is the secondary focus of these notes.
Table 1.4: Panel data example.

ID    Year    X1     …    Xk
1     1990    45.2   …    1.716231
1     1991    43.7   …    1.682067
…     …       …      …    …
1     2020    300    …    1.700939
2     1990    28.8   …    1.688059
2     1991    31.5   …    1.682296
…     …       …      …    …
2     2020    210    …    4.413671
3     1990    570    …    5.434941
3     1991    574    …    5.975792
…     …       …      …    …
3     2020    1056   …    5.106615
4     1990    276    …    4.842814
4     1991    275    …    4.853465
…     …       …      …    …
4     2020    581    …    4.624482
…     …       …      …    …
20    1990    7.5    …    4.564457
20    1991    7.7    …    4.498013
…     …       …      …    …
20    2020    46.6   …    3.471086
3. DATA MANAGEMENT IN STATA
In this section, we introduce how to manage time series data, which are the fundamental focus of these notes6. According to Hamilton (2012: p.13), data management encompasses the initial tasks of creating a dataset, editing to correct errors, identifying missing values, and adding internal documentation such as variable and value labels. It also encompasses many other jobs required by ongoing projects, such as adding new observations or variables; reorganizing, simplifying or sampling from the data; separating, combining or collapsing datasets; converting variable types; and creating new variables through algebraic or logical expressions.
Creating a new dataset
We can create a new dataset by typing in data, by copying and pasting, or by reading an external file (e.g., .xls). A by-hand approach is practical with small datasets, or may be unavoidable when the original information is printed material such as a table in a book or an industry report (Hamilton, 2012: p.16). This is not common with time series and panel datasets, so we will not discuss it here7. When the original data source is electronic, such as a web page, text file, spreadsheet or word processor document, we can bring the data into Stata by copy and paste (Hamilton, 2012: p.21)8. However, this approach is not convenient in research projects, because we usually store the commands in a do-file. Because the original data source is often in the form of a spreadsheet, I suggest reading the .xls or .xlsx file directly into Stata using an import command9. We will illustrate some examples with the import command.
▪ Daily data
Table13_6.xls10 is daily data on a stock price (close, lnclose = ln(close)), where the first row contains the variable names.

import excel using D:\Table13_6.xls, firstrow clear
des
format date %td11
edit
6. A good reference is Das (2019): Chapters 1 and 9.
7. For more details, please read Hamilton (2012: p.16-21).
8. For more details, please read Hamilton (2012: p.21-22).
9. For more details, please read Hamilton (2012: p.42-43).
10. This dataset can be downloaded here: …
11. For more details, type help dates and times at the Command window.
Table 3.1: Data editor after the import command in Stata (daily data).
Note. lnclose = log(close) and time is a trend variable.
▪ Monthly data
Table14_8.xls12 (Gujarati, 2011: Chapter 14) is monthly data on treasury bill rates, where the first row contains the variable names.

import excel using D:\Table14_8.xls, firstrow clear

Table 3.2: Data editor after the import command in Stata (monthly data).

We see that the obs variable is in string format (i.e., shown in orange, with a capital letter M, whereas Stata's monthly date format uses lowercase m). Therefore, the next step is to replace it with a numeric monthly date variable using the following commands:

gen month = tm(1981m1) + _n-1
format month %tm
drop obs
12. This dataset can be downloaded here: …
Table 3.3: Data editor after reformatting the time variable in Stata (monthly data).
Notes. tb3 is the three-month treasury bill rate, tb6 is the six-month treasury bill rate, time is a trend variable, and time2 is the squared trend variable.
▪ Quarterly data
ser_corr.xls13 (Asteriou and Hall, 2016: Chapter 7) is quarterly data on food expenditure, disposable income and the relative price index of food, where the first row contains the variable names (note that obs contains a capital letter Q).

import excel using D:\ser_corr.xls, firstrow clear
gen quarter = tq(1985q1) + _n-1
format quarter %tq
edit

Table 3.4: Data editor after reformatting the time variable in Stata (quarterly data).

13. This dataset can be downloaded here: …
▪ Yearly data
Sweden.xls (Asteriou and Hall, 2016: Chapter 17) is yearly data on macroeconomic and financial variables, where the first row contains the variable names.

import excel using D:\Sweden.xls, firstrow clear
edit

Table 3.5: Data editor after the import command in Stata (yearly data).
Changing variable properties
The three most important properties of a variable in a dataset are its name, label and format. For example, if we want to rename, label and format the variables in Table 3.5 to obtain a clearer representation, we can use the following commands14 (after importing the data into Stata):

rename Time year
rename res_mon X1
rename claims_pri X2
rename (claims_oth m3 share_price) (X3 X4 Y)
label variable X4 "Ratio of money supply to GDP, percent"
label variable Y "Stock market return, percent"
format Y %5.2f15
edit

14. For more details, see Hamilton (2012: p.18-22).
15. This command specifies a fixed display format five numerals wide, with two digits to the right of the decimal. Type help format at the Command window for more information.
Table 3.6: Data editor after changing variable properties in Stata (yearly data).
By labelling data and variables, we obtain a dataset that is more self-explanatory
(Hamilton, 2012: p.20). To have an overview of the dataset, we can use the command
describe or simply des.
Specifying subsets of the data: in and if qualifiers
Many Stata commands can be restricted to a subset of the data by adding an in or if qualifier. in specifies the observation numbers to which the command applies. For example, list in 5 tells Stata to list only the 5th observation (Hamilton, 2012: p.23). To list the 1st through 5th observations, type list in 1/5, list in f/5, or list X1 X2 Y in 1/5; to list the last 10 observations, type list in -10/l16. We sometimes use the command sort before the command list if we want to display the observations with the highest values of some variable. For example, global1.dta (Hamilton, 2012)17 contains data on global temperature anomalies from 1880 to 2011. Among the 1584 months in the global temperature data, which 10 months had the highest temperature anomalies, meaning they were farthest above the 1901-2011 average for that month? To find out, we can use the following commands:

use "D:\global1.dta", clear
sort temp
list in -10/l
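Equivalently, Stata's gsort command sorts in descending order (note the minus sign in front of the variable name), so the same 10 months can be listed from the top; a small sketch:

gsort -temp              // sort from highest to lowest anomaly
list year month temp in 1/10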
16. Note that f and l denote 'first' and 'last' observations.
17. This dataset can be downloaded here: …
The if qualifier also has broad applications; it selects observations based on specific variable values (Hamilton, 2012: p.24). For example, to see the mean and standard deviation of temperature anomalies before and after 1970, type:

sum temp if year < 1970
sum temp if year >= 1970

The " < " (is less than) and " >= " (is greater than or equal to) signs are relational operators:

==   is equal to
!=   is not equal to (~= also works)
>    is greater than
<    is less than
>=   is greater than or equal to
<=   is less than or equal to
Two or more relational operators can be combined within a single if expression by the use of logical operators (Hamilton, 2012: p.24). Stata's logical operators include:

&   and
|   or (the symbol is a vertical bar, not the number one or the letter "l")
!   not (~ also works)

Parentheses allow us to specify the precedence among multiple operators. For example:

sum temp if (month == 1 | month == 2) & year >= 1940 & year < 1970
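The same restriction can also be written with Stata's inlist() and inrange() functions, which some readers find easier to scan; a sketch of the equivalent command:

sum temp if inlist(month,1,2) & inrange(year,1940,1969)   // same subset as above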
Converting and extracting time variable components
Sometimes the original dataset includes separate variables for day, month and year (i.e., three different columns of time information), so we need to convert the day, month and year information into a single numerical index of time. Stata's mdy() function does this, creating an elapsed-date variable. MILwater.dta (Hamilton, 2012)18 is data on daily water consumption for the town of Milford, New Hampshire over seven months, from January through July 1983; it contains separate month, day and year variables. We need to create an elapsed-date variable (named date here) by using the following commands:

gen date = mdy(month,day,year)
format date %td

18. This dataset can be downloaded here: …
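Once the elapsed-date variable exists, it is convenient to declare the dataset as time series right away, since the time-series operators used later in this section require it; a minimal sketch:

tsset date                 // declare date as the time variable
list date water in 1/5     // a quick check of the first observations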
Sometimes we need to extract the month and year components from a variable with %tm format (e.g., month in Table 3.3) or the quarter and year components from a variable with %tq format (e.g., quarter in Table 3.4).

▪ Extract month and year
The do-file includes the following commands:

import excel using D:\Table14_8.xls, firstrow clear
drop time time2
gen time = tm(1981m1) + _n-1
format time %tm
drop obs
format time %10.0g
gen date = dofm(time)
format date %td
format time %tm
gen month = month(date)
gen year = year(date)
drop date
▪ Extract quarter and year
The do-file includes the following commands:

import excel using D:\ser_corr.xls, firstrow clear
drop obs
gen time = tq(1985q1) + _n-1
format time %10.0g
gen date = dofq(time)
format date %td
format time %tq
gen quarter = quarter(date)
gen year = year(date)
drop date
format LCONS LDISP LPRICE %5.2f
Generating and replacing variables
The generate (or simply gen) and replace commands allow us to create new variables or change the values of existing variables (note that replace may not be abbreviated).
▪ Generate a trend variable
gen time = _n
▪ Generate a squared trend variable
gen time2 = time^2
or
gen time2 = time*time
▪ Generate an intercept dummy variable
If there is a structural break in the data (e.g., DU = 0 before 1990, DU = 1 from 1990), we can create an intercept dummy variable (see the Zivot and Andrews unit root test in Section 8.4) as follows:

gen DU = 0
replace DU = 1 if year >= 1990

or

gen DU = (year >= 1990)

If there is a structural break in quarterly data (e.g., DU = 0 before 1990q1, DU = 1 from 1990q1), we can create the intercept dummy variable in the same way (after extracting the quarter and year components):

gen DU = (year >= 1990)

If there is a structural break in monthly data (e.g., DU = 0 before 2010m1, DU = 1 from 2010m1), we can create it as follows (after extracting the month and year components):

gen DU = (year >= 2010)
▪ Generate a slope dummy variable
If there is a structural break in the data (e.g., DT = 0 before 1990, DT = trend variable from 1990), we can create a slope dummy variable (see the Zivot and Andrews unit root test in Section 8.4) as follows:

gen DT = 0
replace DT = year - 1989 if year >= 1990
▪ Generate quarterly dummy variables
Suppose the time variable is quarter (e.g., Table 3.4) and we want to create four quarterly dummy variables, DQ1 for quarter 1, DQ2 for quarter 2, etc. The commands are as follows:

gen q = quarter(dofq(quarter))
tab q, gen(DQ)
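As a quick illustration of how such dummies are typically used, a hedged sketch of a regression with a trend and three quarterly dummies, using the quarterly dataset of Table 3.4 (DQ1 is omitted so that quarter 1 is the base season and the dummy-variable trap is avoided):

gen trend = _n                    // linear trend variable
regress LCONS trend DQ2 DQ3 DQ4   // quarter 1 is the omitted base category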
▪ Generate monthly dummy variables
Suppose the time variable is month (e.g., Table 3.3) and we want to create twelve monthly dummy variables, DM1 for January, DM2 for February, etc. The commands are as follows:

gen m = month(dofm(month))
tab m, gen(DM)
▪ Generate a variable in logarithmic form
gen lnY = log(Y)
▪ Generate lags and differences (see Hamilton, 2012: p.366-7)
Time series analysis often involves lagged variables, i.e., values from previous periods. Lags can be specified by explicit subscripting:

gen Y_1 = Y[_n-1]    [Yt-1: first lagged value, or 1st lag]
gen Y_2 = Y[_n-2]    [Yt-2: second lagged value, or 2nd lag]

Alternatively, we can accomplish the same thing using tsset data [i.e., the observations are declared to be time series using the tsset command followed by the name of the time variable, e.g., tsset date (Adkins and Hill, 2011: p.27)] with Stata's L. (lag) operator19 [see Adkins and Hill (2011: p.273-5) and Greene (2018: p.1022) for more information]. Lag operators are simpler than the explicit-subscripting approach:

gen Y_1 = L.Y     [Yt-1]
gen Y_2 = L2.Y    [Yt-2]
Time series analysis often involves (first) differenced variables. Using tsset data with Stata's D. (difference) operator [see Adkins and Hill (2011: p.273-5) for more information], we can create differenced variables:

gen diff1Y = D.Y     [first difference: ΔYt = Yt – Yt-1]20
gen diff2Y = D2.Y    [second difference: Δ2Yt = (Yt – Yt-1) – (Yt-1 – Yt-2)]

Similarly, using tsset data with Stata's S. (seasonal difference) operator, we can create seasonally differenced variables:

gen sdiff1Y = S.Y      [same as D.; seasonal difference: Yt – Yt-1]
gen sdiff2Y = S2.Y     [second seasonal difference: Yt – Yt-2]
gen sdiff4Y = S4.Y     [4th seasonal difference: Yt – Yt-4]
gen sdiff12Y = S12.Y   [12th seasonal difference: Yt – Yt-12]

In the case of seasonal differences, S12. does not mean the 12th difference, but rather a first difference at lag 12. Therefore, instead of the explicit-subscripting approach, we should use the seasonal difference operator for monthly data (S12.) or quarterly data (S4.).
▪ Generate a growth rate variable [i.e., the growth rate of Yt: Rt = (Yt – Yt-1)/Yt-1 ≈ ln(Yt/Yt-1) = ln(Yt) – ln(Yt-1) = Δln(Yt); or, for year-on-year growth with monthly data, Rt = (Yt – Yt-12)/Yt-12 ≈ ln(Yt/Yt-12) = ln(Yt) – ln(Yt-12)]21:

gen R = (Y - L.Y)/L.Y
gen R = (Y - L12.Y)/L12.Y

or

gen lnY = log(Y)
gen R = D.lnY
gen R = S12.lnY

19. The lag operator, L, is a device that greatly simplifies the mathematics of time series analysis. The operator defines the lagging operation: LYt = Yt-1, L2Yt = L(LYt) = LYt-1 = Yt-2, …, LpYt = Yt-p, LpLqYt = LpYt-q = Lp+qYt = Yt-p-q (Greene, 2018: p.1022).
20. Using the lag operator, the first difference operator is also defined as ΔYt = Yt – Yt-1 = Yt – LYt = (1 – L)Yt (Greene, 2018: p.1023).
21. Suggested reference: Asteriou and Hall (2016: p.24-5).
▪ Generate a detrended variable (i.e., dtrend = Y – a – bTime, or dtrend = Y – a – bTime – cTime2, where Y is a time series variable and Time is a trend variable):

regress Y Time              [OLS regression: a linear trend regression model]
predict dtrend, residuals
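For the quadratic case, dtrend = Y – a – bTime – cTime2, the same two-step logic applies; a sketch:

gen Time2 = Time^2            // squared trend
regress Y Time Time2          // quadratic trend regression
predict dtrend2, residuals    // the residuals are the detrended series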
▪ Smoothing
Many time series exhibit high-frequency variations that make it difficult to discern underlying patterns. Smoothing such series breaks the data into two parts, one that varies gradually, and a second "rough" part containing the leftover rapid changes (Hamilton, 2012: p.353): data = smooth + rough.

Various powerful smoothing tools22 are available through the tssmooth commands. All but tssmooth nl can handle missing values (see Hamilton, 2012: p.256-9).

tssmooth ma             moving-average filters
tssmooth exponential    single exponential filters
tssmooth dexponential   double exponential filters
tssmooth hwinters       nonseasonal Holt-Winters smoothing
tssmooth shwinters      seasonal Holt-Winters smoothing
tssmooth nl             nonlinear filters
For example, using MILwater.dta, we can create new smoothed variables:

tsset date
tssmooth ma ma3 = water, window(1 1 1)
tssmooth ma ma5 = water, window(2 1 2)
tssmooth exponential exp1 = water
tssmooth exponential exp2 = water, p(.4)
tssmooth dexponential dexp1 = water
tssmooth dexponential dexp2 = water, p(.7)
tssmooth hwinters hw1 = water
tssmooth hwinters hw2 = water, parms(.7 .3)
tssmooth shwinters shw1 = water
tssmooth shwinters shw2 = water, additive
tssmooth nl nl1 = water, smoother(5)
tssmooth nl nl2 = water, smoother(3RSSH)
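To see the decomposition data = smooth + rough in practice, we can recover the rough part and plot the raw and smoothed series together; a sketch using the 5-term moving average created above:

gen rough = water - ma5    // rough = data - smooth
tsline water ma5           // compare the raw and smoothed series
tsline rough               // the leftover rapid changes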
22. For more details, read Hanke and Wichern (2014: Chapter 3) and, at a more advanced level, Mills (2019: Chapter 9).

Collapsing data
In some cases the data are organised by days, but our analysis calls for monthly data; or some variables are organised by months, but our analysis calls for annual data. In such cases, we need to aggregate the data into means, medians or other statistics for groups defined by one or more variables (e.g., days into months, or months into years). Stata's collapse command does this. For example, the dataset global2.dta (Hamilton, 2012) contains monthly global temperatures from January 1880 to December 2011 (1584 months over 132 years). The monthly data can be collapsed into yearly means using the following command:

collapse (mean) temp, by(year)
The collapse command can create variables based on any of the following summary statistics:

mean     means (the default, if no statistic is specified)
median   medians
p1       1st percentiles
p2       2nd percentiles
sd       standard deviations
semean   standard errors of the mean (sd/sqrt(n))
sum      sums
count    numbers of nonmissing observations
max      maximums
min      minimums
iqr      interquartile ranges
first    first values
last     last values

In financial econometrics, researchers face similar situations: asset prices and their returns are often observed daily.
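For instance, a daily price series can be collapsed to monthly frequency; in the sketch below, the variable names date and close are hypothetical, the month's price is taken as the last daily close, and a monthly average is kept as well:

gen month = mofd(date)    // month index from the daily elapsed date
format month %tm
collapse (last) close (mean) meanclose=close, by(month)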
4. THE NATURE OF TIME SERIES
In time series analysis, it is extremely important that analysts clearly understand the term random or stochastic process. A stochastic process is defined as a collection of random variables ordered in time. If we let Y denote a random variable, and if it is continuous, we denote it as Y(t); if it is discrete, we denote it as Yt. An example of the former is an electrocardiogram, and an example of the latter is GDP (gross domestic product). Since most economic data are collected at discrete points in time, we usually use the notation Yt23 rather than Y(t). If we let Y represent GDP, we have Y1, Y2, Y3, …, Y99, where the subscript 1 denotes the first observation (e.g., GDP for the third quarter of 1993) and the subscript 99 denotes the last observation (e.g., GDP for the first quarter of 2018). Keep in mind that each of these Y's is a random variable (Gujarati and Porter, 2009: p.740).

23. The subscript t represents time, with t = 1 being the first observation available on Y and t = T being the last. The complete set of times t = 1, 2, …, T is often referred to as the observational period (Mills, 2019: p.1).

Figure 4.1: Realization versus stochastic process.

In what sense can we regard GDP as a stochastic process? Consider, for instance, the Vietnamese GDP of 836,270 billion VND for the third quarter of 2017. In theory, the GDP figure for the third quarter of 2017 could have been any number, depending on the prevailing economic and political climate. The figure of 836,270 billion VND is just a
particular realization of all such possibilities. In this case, we can think of the value of 836,270 billion VND as the mean value of all possible values of GDP for the third quarter of 2017. In other words, the GDP value at a specific point in time can be characterized by a normal distribution. Therefore, we can say that GDP is a stochastic process, and the series of actual values observed for the period from the second quarter of 1993 to the first quarter of 2018 is a single realization of the process, GDPt. The entire history over such a period constitutes a realization of the process. At least in economics, the process cannot be repeated; there is no counterpart to repeated sampling in a cross section, or to the replication of an experiment involving a time series process in physics or engineering (Greene, 2018: p.985). Gujarati and Porter (2009: p.740) state that "the distinction between the stochastic process and its realization in time series data is just like the distinction between population and sample in cross-sectional data". Just as we use sample data to draw inferences about a population, in time series we use the realization to draw inferences about the underlying stochastic process.
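Since Figure 4.1 cannot convey this fully here, a small simulation may help: each seed below produces a different realization of the same stochastic process (a sketch; the AR(1) form and its coefficient 0.5 are illustrative assumptions, not a model of GDP):

clear
set obs 100
gen t = _n
tsset t
set seed 1
gen Y1 = rnormal() in 1
replace Y1 = 0.5*L.Y1 + rnormal() in 2/l   // first realization
set seed 2
gen Y2 = rnormal() in 1
replace Y2 = 0.5*L.Y2 + rnormal() in 2/l   // second realization of the same process
tsline Y1 Y2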
The reason why we mention this term before examining specific models is that all the basic assumptions in time series models relate to the stochastic process (i.e., properties of the population, such as its mean, variance and covariance). Stock and Watson (2015: p.523) note that the assumption that the future will be like the past is an important one in time series regression, sufficiently so that it is given its own name: stationarity. If the future is like the past, then historical relationships can be used to forecast the future. But if the future differs fundamentally from the past, then historical relationships might not be reliable guides to the future. Therefore, in the context of time series regression, the idea that historical relationships can be generalized to the future is formalized by the concept of stationarity. This concept is described in the next section.
5. STATIONARY STOCHASTIC PROCESSES
5.1 Definition
According to Gujarati and Porter (2009: p.740), a key concept underlying stochastic processes that has received great attention and investigation by time series analysts is the stationary stochastic process. Broadly speaking, "a stochastic process is said to be stationary if its mean and variance are constant over time and the value of the covariance24 between the two periods depends only on the distance or gap or lag between the two time periods and not the actual time at which the covariance is computed" (Gujarati and Porter, 2009: p.740)25. In the time series literature, such a stochastic process is known as weakly stationary or covariance stationary. By contrast, a time series is strictly stationary26 if all the moments of its probability distribution are invariant over time. If, however, the stationary process is normal, the weakly stationary stochastic process is also strictly stationary (Gujarati and Porter, 2009: p.740). For most practical applications, the weak type of stationarity generally suffices. According to Kocenda and Cerny (2017: p.17), the most frequently used stationarity concept in econometrics is covariance stationarity. Therefore, throughout this series of notes, we will for simplicity use only the term stationarity instead of covariance or weak stationarity. According to Asteriou and Hall (2016: p.277), a weakly stationary series:
(a) exhibits mean reversion, in that it fluctuates around a constant long-run mean;
(b) has a finite variance that is time-invariant; and
(c) has a theoretical correlogram27 that diminishes as the lag length increases.
In its simplest terms, a time series Yt is said to be (weakly) stationary (hereafter, stationary) if it has the following properties (Asteriou and Hall, 2016: p.277; Gujarati and Porter, 2009: p.740; Hill et al., 2018: p.566; Kocenda and Cerny, 2017: p.17):

(a) mean: E(Yt) = μ (constant for all t)   (5.1)
i.e., μt = μt+k = μ < ∞ for all t, k

(b) variance: Var(Yt) = E(Yt – μ)² = σ² (constant for all t)   (5.2)
i.e., σ²t = σ²t+k = σ² < ∞ for all t, k

(c) covariance: Cov(Yt, Yt+k) = γk = E[(Yt – μ)(Yt+k – μ)]   (5.3)
i.e., Cov(Yt, Yt+k) = Cov(Yt+s, Yt+s+k) = γk < ∞ for all t, s, k

24. The coefficient of autocorrelation (i.e., the ratio of the covariance between Yt and Yt+k to the variance of Yt) is also used interchangeably.
25. For example, the covariance between Yt and Yt-3 is the same as the covariance between Yt-2 and Yt-5 (i.e., the distance is 3 lags in both cases), but the covariance between Yt and Yt-3 is different from the covariance between Yt and Yt-5.
26. For more details, see Mills (2019: p.32-33).
27. A correlogram is a graph of the autocorrelations at various lags of a time series (Hanke and Wichern, 2014: p.21). Autocorrelation is the correlation between a variable lagged one or more time periods and itself (Hanke and Wichern, 2014: p.18).
where γk, the covariance (or, more precisely, autocovariance) at lag k, is the covariance between the values of Yt and Yt+k, that is, between two Y values k periods apart. If k = 0, we obtain γ0, which is simply the variance of Y (= σ²); if k = 1, γ1 is the covariance between two adjacent values of Y. Note that we sometimes use Yt-k (where Yt is the current value of Y) in place of Yt+k (where Yt is the origin of Y). According to Asteriou and Hall (2016: p.277), the quantities in the above equations would remain the same whether the observations for the series were, for example, from 2000 to 2010 or from 2010 to 2020. Suppose we shift the origin of Y from Yt to Yt+m (say, from 1998Q3 to 2008Q3 for our GDP data). Now, if Yt is to be stationary, the mean, variance, and autocovariances of Yt+m must be the same as those of Yt. Translated into plain language, the above properties mean that a time series is (covariance) stationary if its mean and variance are constant and finite over time and if the covariance (or autocorrelation) depends only on the time distance k between the two elements of the time series (i.e., the number of periods k between them), not on the actual point in time t (Kocenda and Cerny, 2017: p.17-8; Hill et al., 2018: p.427).
Gujarati and Porter (2009: p.741) state that such a time series will tend to return to its mean (i.e., mean reversion)28 and that fluctuations around this mean (measured by its variance) will have a broadly constant amplitude. In other words, a stationary process will not drift far away from its mean value, because of the finite variance. Note also that, for a stationary process, the speed of mean reversion depends on the covariances: it is quick when the covariances are small and slow when they are large (Gujarati and Porter, 2009: p.741). Shocks to a stationary series are necessarily temporary; over time, the effects of the shocks dissipate and the series reverts to its long-run mean level. As such, long-term forecasts of a stationary series will converge to the unconditional mean of the series (Asteriou and Hall, 2016: p.277). Another characteristic of stationary variables is that they are weakly dependent, that is, their sample autocorrelations29 cut off or tend to decline geometrically, dying out at long lags (Hill et al., 2018: p.566). In other words, weak dependence implies that, as k → ∞ (i.e., as observations get further and further apart in time), they become almost independent. For k large enough, the autocorrelations become negligible (Hill et al., 2018: p.428).
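A quick way to see mean reversion and geometrically dying autocorrelations is to simulate a stationary AR(1) series and inspect its sample moments and correlogram (a sketch; the coefficient 0.5 is an illustrative assumption):

clear
set obs 500
set seed 123
gen t = _n
tsset t
gen Y = rnormal() in 1
replace Y = 0.5*L.Y + rnormal() in 2/l   // stationary AR(1) process
summarize Y                              // sample mean near zero, finite variance
corrgram Y, lags(10)                     // autocorrelations decline geometrically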
If a time series is not stationary in the sense just defined, it is called a nonstationary
time series. In other words, a nonstationary time series will have a time-varying mean
or a time-varying variance or both (Gujarati and Porter, 2009: p.741). In addition,
another characteristic of nonstationary variables is that their sample autocorrelations
remain large at long lags, that is, they exhibit strong dependence (Hill et al., 2018:
28. In other words, if a time series is stationary, then any shock that occurs at time t has a diminishing effect over time and finally disappears at time t + k as k → ∞. In contrast, the effect of a shock may remain present in the same magnitude at all future dates (Kocenda and Cerny, 2017: p.17). Having a constant mean and fluctuations in the series that tend to return to the mean are characteristics of stationary variables. This is called the property of mean reversion (Hill et al., 2018: p.566).
29. Autocorrelation coefficients are introduced in the next section.