Automatically Extracting and Tagging Business Information for E-Business Systems
marketplaces in which they compete. The World
Wide Web is a rich but unmanageably huge source
of human-readable business information—some
novel, accurate, and relevant—some repetitive, wrong, or out of date. As the flood of Web documents tops 11.5 billion pages and continues
to rise (Gulli & Signorini, 2005), the human task
of grasping the business information it bears
seems more and more hopeless. Today’s Really
Simple Syndication (RSS) news syndication and
aggregation tools provide only marginal relief to
information-hungry, document-weary managers
and investors. In the envisioned Semantic Web,
business information will come with handles
(semantic tags) that computers can intelligently
grab onto, to perform tasks in the business-to-
business (B2B), business-to-consumer (B2C), and
consumer-to-consumer (C2C) environments.
Semantic encoding and decoding is a difficult
problem for computers, however, as any very ex-
pressive language (for example, English) provides
a large number of equally valid ways to represent
a given concept. Further, phrases in most natural
(i.e., human) languages tend to have a number of
different possible meanings (semantics), with the
correct meaning determined by context. This is
especially challenging for computers. As a stan-
dard artificial language emerges, computers will
become semantically enabled, but humans will face a monumental encoding task. For e-business applications, it will no longer be sufficient
to publish accurate business information on the
Web in, say, English or Spanish. Rather, that
information will have to be encoded into the ar-
tificial language of the Semantic Web—another
time-consuming, tedious, and error-prone process.
Pre-standard Semantic Web creation and editing
tools are already emerging to assist early adopters
with Semantic Web publishing, but even as the
tools and technologies stabilize, many businesses
will be slow to follow. Furthermore, a great deal
of textual data in the pre-Semantic Web contains
valuable business information floating there
along with the out-dated debris. However, the
new Web vessels—automated agents—cannot
navigate this old-style information. If the rising
sea of human-readable knowledge on the Web is
WREHWDSSHGDQGVWUHDPVRILWSXUL¿HGIRUFRP-
puter consumption, e-business systems must be
developed to process this information, package
it, and distribute it to decision makers in time for
competitive action. Tools that can automatically
extract and semantically tag business information
from natural language texts will thus comprise
an important component of both the e-business
systems of tomorrow, and the Semantic Web of
the day after.
In this chapter, we give some background on
the Semantic Web, ontologies, and the valuable sources of Web information available for e-business applications. We then describe how textual information can be extracted to produce XML files automatically. Finally, we discuss future
trends for this research and conclude.
BACKGROUND
The World Wide Web Consortium (W3C) is lead-
ing efforts to standardize languages for knowledge
representation on the Semantic Web and is de-
veloping tools that can verify that a given docu-
ment is grammatically correct according to those
standards. The XML standard, already widely
adopted commercially as a data interchange
format, forms the syntactic base for this layered
framework. XML is semantically neutral, so the
resource description framework (RDF) adds a
protocol for defining semantic relationships be-
tween XML-encoded data components. The Web
ontology language (OWL) adds to RDF tools for
defining more sophisticated semantic constructs
(classes, relationships, constraints) still using the
RDF-constrained XML syntax. Computers can
EHSURJUDPPHGWRSDUVHWKH;0/V\QWD[¿QG
RDF-encoded semantic relationships, and resolve
meanings by looking for equivalence relation-
ships as defined by OWL-based vocabularies, or
ontologies.
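As an illustration of this layering, the following minimal sketch uses the Python rdflib library to assert a few OWL constructs over RDF triples and serialize them in the RDF-constrained XML syntax. The "ex" and "other" namespaces, and every term in them, are hypothetical, invented for this example.

# Minimal sketch, assuming the rdflib library; all names in the
# example namespaces are invented for illustration.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/finance#")        # hypothetical
OTHER = Namespace("http://example.org/other-onto#")  # hypothetical

g = Graph()
g.bind("ex", EX)

# OWL constructs (a class and a property) layered on RDF triples.
g.add((EX.Company, RDF.type, OWL.Class))
g.add((EX.reportsSales, RDF.type, OWL.ObjectProperty))
g.add((EX.reportsSales, RDFS.domain, EX.Company))

# An equivalence relationship of the kind a program can follow to
# resolve the same concept defined differently in two ontologies.
g.add((EX.Sales, OWL.equivalentClass, OTHER.Revenue))

# Serialize using the RDF-constrained XML syntax.
print(g.serialize(format="xml"))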
Ontologies are virtual dictionaries that formally define the meanings of relevant concepts.
Ontologies may be foundational (general), or
domain-specific, and are often specified hierarchi-
cally, relating concepts to one another via their
attributes. As ontologies emerge across the Seman-
tic Web, many will overlap, and different terms
will come to define any given concept. Semantic
maps will be built to relate the same concepts
defined differently from one ontology to another
(Doan, Madhavan, Domingos, & Halevy, 2002).
Software programs called intelligent agents will
be built to navigate the Semantic Web, searching
not only for keywords or phrases, but also for
concepts semantically encoded into Web docu-
ments (Berners-Lee, Hendler, & Lassila, 2001).
They may also find semantic content by negoti-
ating with semantically enhanced Web services,
which Medjahed, Bouguettaya, and Elmagarmid
(2003) define as sets "of functionalities that can
be programmatically accessed through the Web”
(p. 333). Web services may process information
from domain-specific knowledge bases, and the
facts in these knowledge bases may, in turn, be
represented in terms of an ontology from the
same domain. An important tool for constructing
domain models and knowledge-based applications
with ontologies is Protégé (n.d.), a free, open-source platform.
Ontologies are somewhat static, and should be
created carefully by domain experts. Knowledge bases, while structurally static, should have dy-
namic content. That is, to be useful, especially in
the competitive realm of business, they should be
continually updated with the latest, best-known
information in the domain and regularly purged of
knowledge that has become stale or been proven
wrong. In business domains, the world evolves
quickly, and processing the torrents of information
describing that evolution is a daunting task. Much
of the emerging information about the business
world is published online daily in government
reports, financial reports such as those in the
electronic data gathering, analysis, and retrieval
(EDGAR) system database, and Web articles by
such sources as the Wall Street Journal (WSJ),
Reuters, and the Associated Press. Such sources
contain a great deal of information, but in forms
that computers cannot use directly. They there-
fore need to be processed by people before the
facts can be put into a database. It is desirable,
but impossible for a person, and expensive for
a company, to retrieve, read, and synthesize all
of the day’s Web news from a given domain and
enter the resulting knowledge into a knowledge
base to support the company’s decision making
for that day. While the protocols and information
retrieval technologies of the Web make these ar-
ticles reachable by computer, they are written for
human consumption and still lack the semantic
tags that would allow computers to process their content easily. It is a difficult proposition to teach
a computer to correctly read (syntactically parse)
natural language texts and correctly interpret
(semantically parse) all that is encoded there.
However, automatically learning even some of the
daily emerging facts underlying Web news articles
could provide enough competitive advantage to
justify the effort. We envision the emergence of
e-business services, based on knowledge bases
fed from a variety of Web news sources, which
serve this knowledge to subscribing customers in
a variety of ways, including both semantic and
nonsemantic Web services.
One domain of great interest to investors is that
dealing with the earnings performance and fore-
casts of companies. Many firms provide market
analyses on a variety of publicly traded corpora-
tions. However, profit margins drive their choices
of which companies to analyze, leaving over half
of the 10,000 or so publicly traded U.S. companies
unanalyzed (Berkeley, 2002). Building tools that automatically parse the earnings statements of these thousands of unanalyzed smaller companies, and that convert these statements into XML for Web distribution, would benefit investors
and those companies themselves, whose public
exposure would increase, and whose disclosures
to regulatory agencies would be eased. A number of XML-based languages and ontologies have
been developed and proposed as standards for
representing such semantic information in the fi-
nancial services industry, but most have struggled
to achieve wide adoption. Examples include News
Markup Language (NewsML) (news), Financial
products Markup Language (FpML) (derivatives),
Investment Research Markup Language (IRML)
(investment research), and the Financial Exchange
Framework (FEF) Ontology (FEF: Financial
Ontology, 2003; Market Data Markup Language,
2000). However, the eXtensible Business Reporting Language (XBRL), an XML derivative, has
been emerging over the last several years as an
HEXVLQHVVVWDQGDUGIRUPDWIRUHOHFWURQLF¿QDQFLDO
reporting, having enjoyed early endorsement by
such industry giants as NASDAQ, Microsoft,
and PricewaterhouseCoopers (Berkeley, 2002).
By 2005, the U.S. Securities and Exchange Com-
mission (SEC) had begun accepting voluntary
financial filings in XBRL, the Federal Deposit
Insurance Corporation (FDIC) was requiring
XBRL reporting, and a growing number of pub-
licly traded corporations were producing financial
statements in XBRL (XBRL, 2006).
We present a prototype system that uses natu-
ral language processing techniques to perform
information extraction of specific types of facts from corporate earnings articles of the Wall Street Journal. These facts are represented in template form to demonstrate their structured nature and converted into XBRL for Web portability.
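To make the target format concrete, the sketch below renders one extracted template fact as simplified XBRL-style markup using Python's standard xml.etree module. The element and attribute names are illustrative stand-ins, not taken from an official XBRL taxonomy.

# Sketch: emit an extracted template as simplified XBRL-style XML.
# Element names and the context id are illustrative, not an official
# XBRL taxonomy.
import xml.etree.ElementTree as ET

template = {"item": "Sales", "status": "declined", "change": "42%"}

root = ET.Element("xbrl")
fact = ET.SubElement(root, template["item"], contextRef="FY1987")
fact.set("status", template["status"])   # non-standard, for readability
fact.text = template["change"]

print(ET.tostring(root, encoding="unicode"))
# <xbrl><Sales contextRef="FY1987" status="declined">42%</Sales></xbrl>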
EXTRACTING INFORMATION FROM
ONLINE ARTICLES
This section discusses the process of generating
XML-formatted files from online documents. Our
system, Flexible Information extRaction SysTem
(FIRST), analyzes online documents from the
WSJ using syntactic and simple semantic analysis
(Hale, Conlon, McCready, Lukose, & Vinjamur,
2005; Lukose, Mathew, Conlon, & Lawhead,
2004; Vinjamur, Conlon, Lukose, McCready,
& Hale, 2005). Syntactic analysis helps FIRST
to detect sentence structure, while semantic
analysis helps FIRST to identify the concepts that
are represented by different terms. The overall
process is shown in Figure 1. This section starts
with a discussion of the information extraction
literature. Later, we discuss how FIRST extracts
information from online documents to produce
XML-formatted files.
Information Extraction
The explosion of textual information on the Web
requires new technologies that can recognize
information originally structured for human con-
sumption rather than for data processing. Research
in artificial intelligence (AI) has been trying to find ways to help computers perform tasks that would otherwise require human judgment. NLP, a subarea of AI, deals with spoken and written human languages. NLP
subareas include machine translation, natural
language interfaces, language understanding, and text generation. Since NLP tasks are very difficult, few NLP application areas have been developed commercially. Currently, the most successful applications are grammar checking and machine translation programs.

Figure 1. Information extraction and XML tagging process
To deal with textual data, information systems
need to be able to understand the documents
they read. Information extraction (IE) research
has sought automated ways to recognize and
convert information from textual data into more
structured, computer-friendly formats, such as
display templates or database relations (Cardie,
1997; Cowie & Lehnert, 1996).
Many business areas can benefit from IE research, such as underwriting, clustering, and extracting information from financial documents.
Some previous IE research prototypes include Sys-
tem for Conceptual Information Summarization, Organization, and Retrieval (SCISOR) (Jacobs &
Rau, 1990), EDGAR-Analyzer (Gerdes, 2003),
and Edgar2xml (Leinnemann, Schlottmann, Seese, & Stuempert, 2001). Moens, Uyttendaele, and
Dumortier (2000) researched the extraction of
information from databases of court decisions.
The major research forum promoting in-
formation extraction technology is the Message
Understanding Conference (MUC). MUC’s origi-
nal goals were to evaluate and support research on
the automation and analysis of military messages
containing textual information.
IE systems’ input documents are normally
domain specific (Cardie, 1997; Cowie & Leh-
nert, 1996). Generally, documents from the same
publisher, reporting stories in the same domain,
have similar formats and use common vocabular-
ies for expressing certain types of facts—styles
that people can detect as patterns. If knowledge
engineers who build computer systems team up
ZLWKVXEMHFWPDWWHUH[SHUWVZKRDUHÀXHQWLQWKH
information types and expression patterns of the
domain, computer systems can be built to look
for the concepts represented by these familiar
patterns. Humans do this now, but computers
will be able to do it much faster.
Unfortunately, the extraction process presents
many difficulties. One involves the syntactic struc-
ture of sentences, and another involves inferring
sentence meanings. For example, it is quite easy
IRUDKXPDQWRUHFRJQL]HWKDWWKHVHQWHQFHV³7KH
Dow Jones industrial average is down 2.7%” and
³7KH'RZ-RQHVL QGX VW U LDODYHUDJHGLSSHG´

are semantically synonymous, though slightly
different. For a computer to extract the same
meaning from the two different representations,
LWPXVW¿UVWEHWDXJKWWRSDUVHWKHVHQWHQFHVDQG
then taught which words or phrases are synonyms.
Also, just as children learn to recognize which
sentences in a paragraph are the topic or key
sentences, computers must also be taught how to
recognize which sentences in a text are paramount
versus which are simply expository. Once these
key sentences are found, the computer programs
will extract the vital information from them for
inclusion in templates or databases.
There are two major approaches to building
information extraction systems: the knowledge
engineering approach and the automatic train-
ing approach (Appelt & Israel, 1999). In the
knowledge engineering approach, knowledge
engineers employ their own understanding of
natural language, along with the domain expertise
they extract from subject matter experts, to build
rules which allow computer programs to extract
information from text documents. With this ap-
proach, the grammars are generated manually,
and written patterns are discovered by a human
expert, analyzing a corpus of text documents from
the domain. This becomes quite labor-intensive
as the size, number, and stylistic variety of these
training texts grows (Appelt & Israel, 1999).
Unlike the knowledge engineering approach, the automatic training approach does not require
computer experts who know how IE systems work
or how to write rules. A subject matter expert
annotates the training corpus. Corpus statistics
or rules are then derived automatically from the
training data and used to process novel data.
Since this technique requires large volumes of
training data, finding enough training data can be difficult (Appelt & Israel, 1999; Manning &
Schutze, 2002). Research using this approach
includes Neus, Castell, and Martín (2003).
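As a toy illustration of the automatic training idea, the sketch below derives one simple corpus statistic, namely which words immediately precede spans an annotator has marked as financial items. The bracketed [FI]...[/FI] annotation scheme is our invention for the example.

# Toy sketch: derive statistics from an annotated corpus. The
# [FI]...[/FI] markup is a hypothetical annotation scheme.
import re
from collections import Counter

annotated = [
    "The company said [FI]sales[/FI] rose sharply .",
    "Quarterly [FI]profits[/FI] were flat .",
    "A 10% increase in [FI]sales[/FI] was reported .",
]

preceding = Counter()
for sentence in annotated:
    for match in re.finditer(r"(\S+) \[FI\]", sentence):
        preceding[match.group(1)] += 1   # word just before a marked item

print(preceding.most_common())
# [('said', 1), ('Quarterly', 1), ('in', 1)]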
Advanced research in information extraction
appears in journals and conferences run by several
AI and NLP organizations, such as the MUC,
the Association for Computational Linguistics
(ACL) (www.aclweb.org/), the International Joint Conference on Artificial Intelligence (IJCAI) (http://www.ijcai.org/), and the American Association for Artificial Intelligence (AAAI) (http://www.aaai.org/).
FIRST: Flexible Information
extRaction SysTem
This section discusses our experimental system
FIRST. FIRST extracts information from financial documents to produce XML files for other
e-business applications.
According to Appelt and Israel (1999), the
knowledge engineering approach performs best when linguistic resources such as lexicons are
available, when knowledge engineers who can
write rules are available, and when training data
LVVSDUVHDQGH[SHQVLYHWR¿QG%DVHGRQWKHVH
constraints, our system, FIRST, employs the
knowledge engineering approach. FIRST is an
experimental system for extracting semantic facts
from online documents. Currently, FIRST works
LQWKHGRPDLQRI¿QDQFHH[WUDFWLQJSULPDULO\
from the WSJ. The inputs to FIRST are news
articles while the output is the information in an
explicit form contained in a template. After the
extraction process is completed, this information
can be put into a database or converted into an
XML-formatted file. Figure 2 shows FIRST's
system architecture.
FIRST is built in two phases: the build phase
and the functional phase. The build phase uses
resources such as the training documents and
some tools, such as a KeyWord In Context (KWIC)
index builder (Luhn, 1960), the CMU-SLM
toolkit (Clarkson & Rosenfeld, 1997; Clarkson & Rosenfeld, 1999), and a part-of-speech tag-
ger, to analyze patterns in the documents from
our area of interest. Through the knowledge
engineering process, we learn how the authors
of the articles write the stories—how they tend
to phrase recurring facts of the same type. We
employ these recurring patterns to create rules which FIRST uses to extract information from new Web articles.

Figure 2. System architecture of FIRST
In addition to detecting recurring patterns,
we use lexical semantic relation information
from WordNet (Fellbaum, 1998; Miller, Beck-
with, Fellbaum, Gross, & Miller, 1990; Miller,
1995) to expand the set of keywords to include
additional relevant terms that share semantic
relationships with the original keywords. The
following subsection describes our corpus, the
KWIC index generator, the CMU-SLM toolkit,
and the part of speech tagger. WordNet, which
contains information on lexical semantic relations,
is discussed after that.
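A rough sketch of such keyword expansion, here using WordNet through the NLTK interface (a stand-in; FIRST's actual WordNet integration may differ):

# Sketch: expand a keyword list via WordNet's lexical relations,
# using NLTK's interface (assumes nltk and its wordnet data).
from nltk.corpus import wordnet as wn

def expand_keywords(words):
    expanded = set(words)
    for word in words:
        for synset in wn.synsets(word):
            for lemma in synset.lemmas():   # synonyms in each sense
                expanded.add(lemma.name().replace("_", " "))
    return expanded

print(sorted(expand_keywords(["sale", "rise"])))
# typical output includes related terms such as "sales event" and "ascend"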
The Corpus and Rule Extraction
Process
To generate rules that enable FIRST to extract
information from online documents, we look
for written patterns in a number of articles in
the same domain. FIRST’s current goal is to ex-
tract information from the WSJ in the domain of
corporate finance. Figure 3 shows a sample WSJ document published in 1987.
We use articles from the WSJ written in 1987
DVDWUDLQLQJGDWDVHWWRKHOSXV¿QGSDWWHUQVLQ
the articles. Each article is tagged using Standard
Generalized Markup Language (SGML). SGML
LVDQLQWHUQDWLRQDOVWDQGDUGIRUWKHGH¿QLWLRQRI

device-independent, system-independent meth-
ods of representing texts in electronic form. These
tags include information about, for example, the
document number, the headline, the date, the
document source, and the text.
Since there are many ways to express the same
sentence meaning, we have to look at as many pat-
terns as possible. We generate a KWIC index to
see the relationships between potential keywords
and other words in the corpus sentences.
$.:,&LQGH[¿OHLVFUHDWHGE\SXWWLQJHDFK
ZRUGLQWRD¿HOGLQWKHGDWDEDVH$IWHUWKDWWKH
¿UVWZRUGLVUHPRYHGDQGHDFKUHPDLQLQJZRUGLV
VKLIWHGRQH¿HOGWRWKHOHIWLQWKHURZ7KHSURFHVV
continues until the last word in the sentence is put
LQWRWKH¿UVWSRVLWLRQ)LJXUHVKRZVSDUWRIWKH
Figure 3. A sample WSJ document published in 1987
2420
Automatically Extracting and Tagging Business Information for E-Business Systems
.:,&LQGH[WKH¿UVWZRUGVIRUWKHVHQWHQFH
³3DWWHQ &RUS VDLG LW LV QHJRWLDWLQJ D SRVVLEOH
joint venture with Hearst Corp. of New York and
Anglo-French investor Sir James Goldsmith to
sell land on the East Coast.”
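The construction just described can be sketched in a few lines of Python; the five-field width used here is arbitrary:

# Sketch: build KWIC-style rows. Each row drops the first word of the
# previous row, until the last word reaches the first field.
def kwic_rows(sentence, n_fields=5):
    words = sentence.split()
    rows = []
    for i in range(len(words)):
        row = words[i:i + n_fields]                      # shifted left i times
        rows.append(row + [""] * (n_fields - len(row)))  # pad short rows
    return rows

for row in kwic_rows("Patten Corp. said it is negotiating a possible joint venture")[:3]:
    print(row)
# ['Patten', 'Corp.', 'said', 'it', 'is']
# ['Corp.', 'said', 'it', 'is', 'negotiating']
# ['said', 'it', 'is', 'negotiating', 'a']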
We have generated more than 5 million rows of
data. When many sentences are generated in the
file, we look at the key terms that we believe may be used to express important information—the specific types of information we aim to extract.
For example, suppose we believe that the word sale will lead to important information about
stock prices, but we are not sure how other words
relate to the word sale. We therefore select all the
rows in the database that contain the word sale,
using the following structured query language
(SQL) statement:
Select W1, W2, W3, W4, W5
From WSJ_1987
Where W1 like ‘sale%’
Order by W1, W2;
Many rows are returned from this SQL state-
ment. Some rows are useful and show interesting
patterns but some are not. Figure 5 shows some
sample rows that have the word sales appearing
in column 1. Using this technique, we are able
WR¿QG VHYHUDOSDWWHUQV ZLWKLQZKLFKWKHZRUG
sales appears.
We also look for patterns using n-gram data
produced by the Carnegie Mellon Statistical
Language Modeling (CMU-SLM) Toolkit (http://
www.speech.cs.cmu.edu/SLM_info.html). The
CMU-SLM toolkit provides several functions,
including word frequency lists and vocabularies, word bigram and trigram counts, bigram- and trigram-related statistics, and various backoff bigram and trigram language models. Table 1 shows some 3-gram and 4-gram data.

Figure 4. Sample rows from a KWIC index file

Figure 5. Sample rows from a KWIC index file where "sales" appears in column 1
There are two types of n-gram patterns we are interested in for a word such as sales (a small counting sketch follows the list):

• Patterns where sales is the first term, with n-1 words after it:
sales declined 42%, to $53.4
sales declined to $475.6 million

• Patterns where sales is the last word, with n-1 words before it:
increase of 50% in sales
10% increase in the sales
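The small sketch below shows how such patterns can be collected by filtering n-grams on their first or last position; the single token list is a toy stand-in for the corpus:

# Sketch: collect n-grams where a keyword is the first or last term.
# The token list is a toy stand-in for the corpus.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "sales declined 42% , to $53.4 million".split()
grams = ngrams(tokens, 4)

first = Counter(g for g in grams if g[0] == "sales")   # keyword-first patterns
last = Counter(g for g in grams if g[-1] == "sales")   # keyword-last patterns
print(first.most_common(3))
print(last.most_common(3))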
These patterns help us to generate rules for
information extraction. The following shows a
simple rule that FIRST uses to extract informa-
WLRQDERXW¿QDQFLDOLWHPVVXFKDVVDOHVDQGWKHLU
UHODWLRQWR¿QDQFLDOstatus words such as increase
or decrease.
Extraction Rule to Identify Financial Status (increase, decrease…)

• Example of a proximity rule for financial status:

Let n1 be the optimal proximity within which a financial item appears before financial status
Let n2 be the optimal proximity within which a financial item appears after financial status

for each financial status keyword
    if a financial item is present within n1 words before the keyword
    or a financial item is present within n2 words after the keyword
    then consider the keyword as a possible candidate for financial status
    end if
end for
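A direct Python rendering of this proximity rule might look as follows; the lexicons are abbreviated and the window sizes are placeholders, since the tuned optimal values of n1 and n2 are not stated here:

# Sketch of the proximity rule. Lexicons are abbreviated; n1 and n2
# are placeholder window sizes, not the tuned optima.
FINANCIAL_ITEMS = {"sales", "revenue", "profits"}
STATUS_KEYWORDS = {"rose", "declined", "increase", "decrease", "flat"}

def status_candidates(tokens, n1=3, n2=3):
    candidates = []
    for i, token in enumerate(tokens):
        if token not in STATUS_KEYWORDS:
            continue
        before = tokens[max(0, i - n1):i]     # up to n1 words before keyword
        after = tokens[i + 1:i + 1 + n2]      # up to n2 words after keyword
        if FINANCIAL_ITEMS & set(before) or FINANCIAL_ITEMS & set(after):
            candidates.append((i, token))     # candidate financial status
    return candidates

print(status_candidates("sales declined 42% , to $53.4".split()))
# [(1, 'declined')]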
Table 1. Sample 3-grams and 4-grams
7KXVIRUWKHVHQWHQFH³sales declined 42%, to
´),567¿OOVWKHVORWVLQWKHWHPSODWHDV
Financial Item: sales
Financial Status: decline
Percentage Change: 42%
Change Description: to $53.4
Syntactic Analysis
Syntactic analysis helps FIRST to identify the
role of each word and phrase in a sentence. It tells
whether a word or phrase functions as subject,
verb, object, or modifier. The sentence "Jim works for a big bank in NY," for example, has the following sentence structure, or parse tree. (See Exhibit A.)
This parse tree shows that the subject of the
VHQWHQFHLV³-LP´WKHYHUELVWKHSUHVHQWWHQVH

RI³ZRUN´DQGWKHPRGL¿HULVWKHSUHSRVLWLRQDO
SKUDVH³IRUDELJEDQNLQ1<´
In general, most natural language processing
systems require a sophisticated knowledge base,
the lexicon (a list of words or lexical entries), and
a set of grammar rules. While it would be pos-
VLEOHWRXVHDIXOOÀHGJHGSDUVHUWRLQFRUSRUDWH
grammar rules into a system like ours, we felt that
a part-of-speech tagger would be more robust.
Specifically, we use the Lingua-EN-Tagger (n.d.)
as a tool for part-of-speech tagging. Lingua-EN-
Tagger is a probability-based, corpus-trained
tagger. It assigns part-of-speech tags based on a
lookup dictionary and a set of probability values.
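Lingua-EN-Tagger is a Perl module; a rough Python equivalent of the tagging step, using NLTK's corpus-trained tagger as a stand-in (its Penn Treebank tags are uppercase variants of those shown below), is:

# Stand-in for the Perl Lingua-EN-Tagger: NLTK's corpus-trained
# part-of-speech tagger (assumes nltk plus its tokenizer and tagger
# models are installed).
import nltk

sentence = ("Campbell shares were down $1.04, or 3.5 percent, at $28.45 "
            "on the New York Stock Exchange on Friday morning.")
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# [('Campbell', 'NNP'), ('shares', 'NNS'), ('were', 'VBD'), ('down', 'RB'), ...]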
7KH VHQWHQFH ³&DPSEHOO VKDUHV ZHUHGRZQ
$1.04, or 3.5 percent, at $28.45 on the New York
Stock Exchange on Friday morning,” for example,
is tagged by Lingua-EN-Tagger (n.d.) as:
<nnp>Campbell</nnp> <nns>shares</nns> <vbd>were</vbd> <rb>down</rb> <ppd>$</ppd> <cd>1.04</cd> <ppc>,</ppc> <cc>or</cc> <cd>3.5</cd> <nn>percent</nn> <ppc>,</ppc> <in>at</in> <ppd>$</ppd> <cd>28.45</cd> <in>on</in> <det>the</det> <nnp>New</nnp> <nnp>York</nnp> <nnp>Stock</nnp> <nnp>Exchange</nnp> <in>on</in> <nnp>Friday</nnp> <nn>morning</nn> <pp>.</pp>
Here, nnp, for example, indicates a proper noun. The tagged output and knowledge of English
grammar help us to identify sentence structure.
7KLVKHOSVXVWROHDU Q³ZKRGRHVZKDWWRZKRP"´
from a sentence, which in turn helps us to extract
information more accurately.
FIRST uses information from the part-of-
speech tagger to identify the sentence structure.
This structure is then used in its extraction
rules. Some sample FIRST rules using syntactic
information, analyzing sentences that contain
key terms used in the proximity rule (earlier),
are shown next:
Case 1: Examples where financial status appears after the financial item (financial item and status words bolded).
<nns>Sales</nns> <vbd>rose</vbd> <cd>5.7</cd> <nn>%</nn> <to>to</to> <ppd>$</ppd> <cd>2.22</cd> <cd>billion</cd> <in>from</in> <ppd>$</ppd> <cd>2.1</cd> <cd>billion</cd> <det>a</det> <nn>year</nn> <in>ago</in> <pp>.</pp>
<nnp>Campbell</nnp> <nnp>Soup</nnp> <nnp>Co.</nnp> <vbd>said</vbd> <nnp>Friday</nnp> <jj>quarterly</jj> <nns>profits</nns> <vbd>were</vbd> <jj>flat</jj> <pp>.</pp>
The following is an example of a rule to identify
¿QDQFLDOVWDWXVIRUWKLVFDVH
for each keyword that is a candidate for denoting a financial item (e.g., sales)
    if the tagger has identified that keyword as a noun or plural noun in the sentence
    then a form of a corresponding financial status keyword (e.g., "increase") should be present in the immediately following verb phrase
    end if
end for
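In Python, this Case 1 check might be sketched over (word, tag) pairs as follows; it is a simplification with abbreviated lexicons, not FIRST's actual rule engine:

# Sketch of the Case 1 rule over (word, tag) pairs: a financial item
# tagged as a noun, followed closely by a financial status form.
FINANCIAL_ITEMS = {"sales", "profits", "revenue"}
STATUS_FORMS = {"rose", "declined", "increased", "decreased", "flat"}

def case1(tagged):
    for i, (word, tag) in enumerate(tagged):
        if word.lower() in FINANCIAL_ITEMS and tag in ("nn", "nns"):
            # look in the immediately following words for a status form
            for later_word, _ in tagged[i + 1:i + 4]:
                if later_word.lower() in STATUS_FORMS:
                    return {"Financial Item": word,
                            "Financial Status": later_word}
    return None

print(case1([("Sales", "nns"), ("rose", "vbd"), ("5.7", "cd"), ("%", "nn")]))
# {'Financial Item': 'Sales', 'Financial Status': 'rose'}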
We maintain a lexicon of candidate keywords
IRUGHQRWLQJ¿QDQFLDOLWHPVHJsales, revenue,
and SUR¿WV), as well as a lexicon of candidate
NH\ZRUGVIRUGHQRWLQJ¿QDQFLDOVWDWXVHJ
rise, increase, and decrease). In the previous
examples:
<vbd>rose</vbd>
and
<vbd>were</vbd> <jj>flat</jj>
form the respective verb phrases. Thus for the
¿UVWVHQWHQFH),567¿OOVWKHORWVDV
Financial Item: sales
Financial Status: rose
)RUWKHVHFRQGVHQWHQFH),567¿OOVWKHORWV
as:
)LQDQFLDO,WHPSUR¿WV
)LQDQFLDO6WDWXVZHUHÀDW
Case 2: Examples where financial status appears before the financial item (financial item and status words bolded).

<det>The</det> <nnp>Camden</nnp> <ppc>,</ppc> <nnp>N.J.</nnp> <ppc>,</ppc> <nn>company</nn> <vbd>saw</vbd> <jj>strong</jj> <nns>sales</nns> <pp>.</pp>

<det>The</det> <nn>company</nn> <vbd>saw</vbd> <det>a</det> <cd>6</cd> <nn>%</nn> <to>to</to> <cd>9</cd> <nn>%</nn> <nn>sequential</nn> <nn>increase</nn> <in>in</in> <nns>sales</nns> <in>from</in> <jj>last</jj> <nn>quarter</nn> <pp>.</pp>
Exhibit A. Parse tree for "Jim works for a big bank in NY" (subject NP "Jim"; verb "works"; prepositional phrase modifier "for a big bank in NY")