Tải bản đầy đủ (.pdf) (51 trang)

Xây dựng công cụ tìm kiếm tập tin văn bản theo nội dung

Bạn đang xem bản rút gọn của tài liệu. Xem và tải ngay bản đầy đủ của tài liệu tại đây (21.06 MB, 51 trang )

TRTIoNG DAr

KHoA xY

Hec rAv o6

rnudr c6xc

NGHp

--pfEcz--

xu6,r

LUAN

16r

NGHrpp

CHTIYEN NGANH CONG NGHE THONG TIN

oB rAr: xAv Dqlyc cONc cq riur xmur
TAP TIN vAN TAN THEo NQI DUNG

SVTH: t Avr

rry cnAM pA

LOP DAI HQC TIN HQC 4
MSSV:0951010050



uA so: MSDT_KLDH_3.15
GVHD: Th.s NGUYTN Crri
I

cAN THo 2ot3
L.003736

cUONc


Loi cdm on

r-,OT

Sau b6n n6m tr6n

cim

cm di5n

CAM ON

giing dudng d4i hgc,

di

di5n nay

s6p ra truong


tdi thenh thAt

toin th€ thiy c6 gi6o trong trudrng dgi hgc Tdy D6 dd quan t6m theo ddi,

ti6p sirc t6i trong qu5 trinh hqc t?p ctng nhu trong c6c kj, thi.
Trong qu6 trinh ldm lu4n vdn t6t nghiQp, t6i dA nh4n dugc sg hudng d6n cfing nhu

trg girip nhiQt tinh cta Thiy Nguy6n Chi Culng,

li

gi6o vi6n hucmg d6n ThAy d5 tich

cyc trong qu6 trinh rdn luyQn k! ning l4p trinh cfing nhu ki6n thfc cho t6i. D6n nay
chuong trinh dd hoin thinh t6i xin

Ngodi ra, t6i xin gtii ldi cim
di5n

gi6ng


gti loi cim on chdn thinh
crn

!

Thiy.


siu sic d6n cha mg, ngudi d5 cho t6i budc chdn

dai hoc, cho t6i hoc c6c ki6n thric m6i, c6 kinh nghiQm nhi6u hcrn

trong cuQc s5ng. Ngoii ra, t6i cffng xin gui loi cim
quan tdm e6p

t16n

crn

thdn thi6t d6n t6t c6 bpn bd de

d6n bni lu{n v6n cf,ng nhu de girip dd t6i.

Cutii cung bdi lupn vdn dd hoan thinh, t5t nhi6n kh6ng thi5 tr6nh kh6i c6c sai s6t
mong nhfln

du-o.

c sU g6p y cira quf thdy cd vd c6c bpn.

Xin chdn thenh cim

crn

!

Cin tho,ngiy 03 th6ng 06 ndm 2013

Sinh vi6n thuc hi€n

LAm Thf ChIm Pa

Trang

i


Ldi ndi ddu

r-,OT

NOI DAU

Lufn v5n t6t nghiQp li mQt m6n hgc do sinh vi6n tg hgc vd tim hi6u,
ct?u

hoin thdnh dii 6n, nhim tt6p tmg nhu

cAu

tIrS

nghiCn

tfuc ti6n trong doi sting cflng nhu trong

hec tfp.


Ngiy nay, thgc t(: c6

rdt nhi6u nhu

ciu ddi h6i chirng ta phii giii

quyi5t sao cho

hqp ly. Thpt v4y, trong xE hQi tiiSn bQ hiQn nay khi cugc s6ng dd c6 ph6n sung tric
tiQn nghi hcm, viQc

vi

lya chon mQt phuong ph5p s5ng sao cho thoii m6i, kh6ng phi nhi6u

thoi gian ld tti€u mong udc cria m6i ngudi.
O m5i khia cpnh kh6c nhau c6 c6c y6u cAu kh6c nhau, ri6ng AOi vOi ngdnh c6ng
nghQ th6ng

tin, didu ldm t6i quan t0m d6 ld viQc tim kiiSm tpp tin v[n bin tr6n m6y tinh

c6 nhdn. Edy thyc sg

li mQt ddi h6i cdn thi6t, viQc tim kitim hic ndo cfing dugc uu ti€n

hdng dAu, vi chi khi c6 mQt phAn m6m hd trg tim kii5m t6t thi n6 sE c5 trlc dgng to l6n

trong c6ng viQc cfrng nhu trong hgc tap ctra b4n.
Ban biiSt ddy, hiQn nay thdi gian


li mQt kh6i niQm r6t hr quan trgng, r6t quy b6u

moi ngudi vi vfly phii bi6t cSch su dgng sao cho hqp ly vd hiru fch. Cho n6n,
trong dcrn vi hgc trinh cira tdi, t6i da chgn cho minh ae tai *Xdy dpg c6ng cg tim kii5m
AOi vOi

tpp tin vdn

bin theo nQi dung" ldm bii kh6a lufn ttit nghiQp.

Trang

ii


T6m tdt

TOM TAT
C6ng cg tim ki6m ld mQt phAn m6m hiru ich

vi cAn thitit cho m6y vi tinh, sE kh6

khln khi phii tim thti c6ng, n6 ldm ban m6t qu5 nhi6u thoi gian, ldm chflm tr6 c6ng
vi6c cria b4n, thay vdo d6 khi da c6 c6ng cg tim ki6m trong tay, b4n

sE

nhin rdi hon, vi

ti6t kiQm dugc nhi€u thdi gian hcrn cho c6ng viQc khric.


vi

Cdng cg tim ki6m tap tin v6n bAn theo nQi dung ld mQt phAn mOm kh6 quan trdng
cAn thi6t cho nhirng ai sir dung m6y vi tinh.

Trong bdi b6o c6o kh6a lufn ndy t6i sE trinh biy qu6 trinh xdy dlmg mdt c6ng cu
tim ki6m t4p tin v6n bin theo nQi dung, d0 mqi ngudi hi€u rd hon sg cdn thi6t ctra n6.

ABSTRACT
Search Engine is helpful and necessary software for a computer.it's so

manual find, it's

difficult for

will lost more time of you, and maybe your work must to delay.

Instead, if you use the search engine, you

will saving much time for a next work.

Search text files by content is so important and necessary for a users computer.

In my thesis report, i will be presents the processes to do a search text files by
content. I hope this became useful with everybody.

Trang

iii



Nh,2n xdt cua etdo vien huong diln

NHAN xET

cul crAo vrfiN Hr.IoNG nAN
***

Cin tho, ngdy ..... th6ng ..... ndm 2013
Gi6o vi6n huong d6n

t

Nguy6n Chi Cudng

Trang iv


Nhdn xdt cila gido vi€n phan biQn

NHAN xET cu.q,

crAo vrEN PHAN BrEN

____________* *

{<

___________


Cdn tho,ngey ..... thring ..... ndm 2013
Gi6o vi6n phin biQn

Le Thi Thu Lan

Trang v


Danh muc hinh

DANH MUC HiNn

Hinh 2.1. B-tree bqc 2 c6 3 muc..

.......... 13

Ilinh 2.2 : Tim ki€m gid tri X.........

........14

22............
....... 16
Hinh 2.4: Qud trinh tdch node khi thOm kh6a 15
.................... 17
Ilinh 2.5: ThAmvdo cdc khda 35, 7,26 vd /8............
..............17
Hinh 2.6: Qud trinh tdch node khi thAm kh6a 22 ............
........17
Hinh 2.7: Qud trinh tdch node khi th1m khoa i ..............

........ 1g
Hinh 2.8: ThAmvdo cdc kh6a 42, 13,46, 27 vd 8..............
..... 18
Hinh 2.9: Qud trinh tdch node khi th1m kh6a 32 ............
........ lg
Hinh 4.10: Thdm vdo 38, 24 vd 45 ............
............. 18
Hinh 2.1l: Qud trinh tdch node dQ quy khi thAm kh6a 25 ............
........... 19
Hinh j.2. Cdy nhi phdn..
.......29
Hinh 4.l: Giao diQn chinh cila chuong trinh ........
.................. 36
Hinh 4.2: Giao diQn dgcfile
..................37
Hinh 4.3: Giao diQn qudt file ..............
....................37
Hinh 4.4: Giao diQn qudt thu mttc .........
.................3g
Hinh 4.5: Giao di€n cdc k€t qud tim drqc..
............3g
Hinh 4.6: Giao di€n tim theo ftnfile
.....39
Hinh 4.7: Giao diQn tim theo n6i dung........
...........39
Hinh 2.3 : Qud trinh tdch node khi thlm kh6a

Trang vi



Danh muc bang

DANH MUC TANC

'
'

Bdng

3.

t Bi€u didn cdu trilc

lm,t

trfi chi m4c theo m6 hinh tra c*u lugn ly.......Errorl

Bookmark not defined.
Bdng 3.2 MO td cdch ddnh s6 cho tqp tin trong m6 hinh lqp chi mttc nguqc............24
Bdng 3.3 Xdy d,rng tQp chi mqc nghlch dao co sdp x€p vd gom nh6m....................25

Trang vii


Muc

cHI/oNG r: TONC QUAN
1. rY oo cHo. N op

lr,rc


.......

rat...

I

..................1

2. TiNH HiNHNGHTEN ctru LrEN euAN
2.1 C6ng cp tim kitim cira Windows
2.2 C6ng cp tim kitim cira Google
2.3 C6ng cp tim ki6m ctia Ping.........
3. MUC TIEU VA NHrEM VU CUA KHOA LUAN
3.1 Mpc ti6u
3.2 NhiQm vu.............
4. HTJONG GrAr QUYET...
cHTIONG 2z CA sd r,f THUyfT

....... I
..................

1

.....................2
.................3

............4
.....................4


.......4
.................. s

......................6

1. lY ruuvpr
1.1

t.2

HQ th6ng

tim ki6m tr0n m6y tinh c6 nhdn.........

Trang

viii

............8


Mttc ltrc

2.2

Gi6i thigu

B-Tree

.....12


CHITONG 3: NQI DUNG THTIC HIEN

l.

LAP CHi

MUC

...............23

1.1 Phucrng ph6p chi mgc nghich

1.2

d6o..........

X6y dpg c6u truc luu trfi cho tu
nghich d6o..........

2
:

BQ TiM

4

diiSn

...............30


xML........

DocID
Tpp tin chi muc
Tap tin

Lr.ru

...........23

trong phucrng ph6p lpp chi mpc
................26

ruprr{

xAy D\NG TAp rrN

3.2
3.3

..23

...................30
..........30
........31

D0..........

...............32


4.1 Luu tpo chi muc cho h1e...........
4.2 Luu qu6t 6 dia/ thu mqc.........
s. u0 simr
5.1 T6ng qu6t.........
5.2 Tim ki6m
<16
<16

............32
...............33

......34
........34
..................35

CHI.IONG 4: TTT LUAN.....

TAI LryU THAM KHAo

.....41

Trang ix


Chuong

cHrIoNG
1.


Lv

1:

t:

T6ngquan

r6NC QUAN

DO CHQN DE TAr

Tn khi linh vuc c6ng nghQ th6ng tin chi6m linh thi trudng, thi viQc tim kii5m
kh6ng cdn

li

thupt ngfi m6i me v6i ngudi ding nta. Vi6c tra cr?u, tim ki6m c6c tu kh6a

tr6n web kh6ng m6t qu6

vii

gi6y Ae ni6n thi ki5t qu6. D6i v6i nhirng ai

li

tin dd cira

m?ng thi viQc tra tir kh6a tim ki6m thdng tin ld di6u kh6ng tne tnieu AOi vOi frg.

BCn canh d6, nhu cdu tim ki6m th6ng tin tr6n mrly tinh c6

nhin cflng kh6 quan

trgng, viQc truy xu6t dtln c6c nQi dung b€n trong thu muc ctng nhu vi6c tim kitim th6ng

tin tr6n webpage vQy, d6u trA vC kiSt qu6 nhu y bgn mong dqi vC dich vp mi bpn

str

dung.

V0 nguy6n t6c viQc luu trfr th6ng tin vd tim ki6m th6ng tin thi don giin. Gi6 su

tii liQu vd mQt ngudi mudn tim c6c tdi li6u li6n quan di5n y€u cAu cria
minh. Ngudi d6 c6 th6 tlqc tet cir c6c tii li6u trong kho, git lui circ tii lieu li6n quan vi
c6 mQt kho chrla

b6 di c:ic tdi liQu kh6ng li6n quan. R6 rdng

giii

phrlp ndy kh6ng thuc t6 vi t6n r6t ntriOu

thdi gian.
Chinh vi vfy, vi6c xdy dUng c6ng cp tim kii5m tpp tin vdn bin theo nQi dung ld
mQt rt6 tdi it nhiAu c6 sric

inh huong vi sy quan tdm cria ngudi dung


hiQn nay. Dua

tr6n c6ng cp ndy chirng ta chi cdn nh6 mQt vdi tri kh6a c6 li6n quan d6n nQi dung

di5

tim ra tfp tin d6.
2. TiNH nixH NGHTEN cLfu LrEN euAN
2.1 C6ng cg tim ki6m cfra Windows

2.1.1

Uu di6m

Trong c6c hQ di6u hinh cria Windows ddu c6 sEn c6ng cu tim kii5m d6y ld
chucmg trinh tiQn ich girip tim ki6m cilc th6ng tin trong hQ th6ng, cric th6ng tin ndy c6

the

le cilc tQp tin, thu mgc, tim mQt mriy tinh trong

hQ

th6ng m?ng ho{c tim mQt ngudi

,^
ndo <1o trong sd dia chi Address book vd ngodi ra c6 th6 tim
Internet.
Trang


I

th6ng tin tr6n mAng


Chuong 1: Tdngquan

-

Tim

chc tQp tin hinh 6nh, 6m thanh ho{c video. D6nh d6u vdo c6c 6 iI6 chqn

lo4i tap tin vd nhap t6n cira tQp tin mu6n tim, c6 th6 chi can nnap mQt phin cta t6n
tQp

tin, n6u dC tr6ng thi chuong trinh

-

Tim c6c

tQp

sE

tim t6t ch

circ tpp


tin c6 trong

hQ

th6ng.

tin tdi li0u, vin bin, bing tinh,... Chgn c6c thdi ili6m tpo t4p tin

hoflc chgn Don't remember n6u kh6ng nhd vd nhap t6n cira t4p tin mudn tim.

-

Tim tdt ch chc tQ,p tin vi thu mqc, nh4p t6n cta tpp tin ho{c thu muc

cAn

tim,

c6 thi5 nhAp mQt hay nhiAu tu c6 trong nQi dung cira tpp tin d6 tim vd chgn 6 dia cdn

tim trong Look in.

-

Tim mQt m6y tfnh trong

hQ thiSng mAng

ho{c tim mQt nguoi ndo d6 trong sd


dia chi Address book. N6u chgn tim People in your address book thi

sE

xu6t hign

cria sd cho ph6p nhQp c6c th6ng tin vA ngudi mu5n tim, nhflp mQt ho{c tdt ch cilc
thdng tin vd nh6n Find now

-

dt5

tim.

Ngodi c5c thitit l6p tim ki6m co bin, chuong trinh cdn cho ph6p thii5t l6p c6c

ki€u tim ki6m mo r6ng hcrn nhu:

.,
n
"

Thdi gian tpp tin dugc tpolsua.
Kich thu6c cta

tQp

tin.


Chgn c6c thi6t l4p mo rQng.

,,, Chon kieu dinh dpng tpp tin mu5n tim.
,,, Tim trong c6c thu mpc hQ thting.

.
.

Tim c6ct$p tin vi thu mpc An

"
o

Tim chinh x6c t6n theo cht thuong holc chfr hoa.

Tim trong c5c thu mqc con.

Tim trong c6c thi6t bi sao ltru.

2.1.2 Nhuqc di€m
ViQc tim kii5m thi r6t hiQu qui, c6 th6 tim dfng theo tu kh6a md b4n nhflp,
nhung ktit qu6

tri

vd thi vO s5, ban phii trich lgc ra dO chgn dugc tpp tin md b4n cAn.

2.2 Cdngcg tim ki5m cnia Google

2.2.1


Uu di6m

Trang 2


Chuong

t:

T6ngquan

Ding Google Desktop dC tim kii5m tr6n PC dA ding vd nhanh ch6ng hon
. .x
nhieu so voi c6ng cu tuong t.u tich hgrp sIn trong Windows. N6 don gi6n

vi

tiQn

dung.

Tru6c h6t, Google Desktop khoi dQng nhanh hon c6ng cg tim ki6m ctra
Windows. ngudi ding chi

cin nh6y k6p vdo bi6u tucrng cria chuong trinh trong

th6ng

dO


mo trang tim ki6m trong trinh duyQt, ti6p d6 96 tu kh6a tim ki6m

vd nh6n Enter

vi

c6c ktlt

khay

hQ

qui

sC

xu6t hiQn. H9 cfing kh6ng cAn chgn gita tim ki6m

theo tOn file hay theo nQi dung ho{c
ci

mQt

lAn.

Thri hai, Google Desktop lo4i b6 nhu ciu su dUng c6c c6ng cu tim kiiSm

vi


chpm

kh6 tich hqp trong c6c phdn m6m web, e-mail vd vin phdng cria

Microsoft.

Tinh nlng tim ki6m nhanh, tilt ch mQt 6n cria Google Desktop girip nguoi
dung khic phuc dugc sp hing tfng khi mu5n tim mQt

tii liQu md kh6ng nh6 rd

ki6u file. Hcrn nira, n6u chuong trinh khdng cho ra k6t qui mong mu6n, mQt cri
nh6p chuQt

vio

du
link trOn cua s6 Google Desktop

s€

gtii truy v6n d6n ngudi

anh'em Google Web Search cira n6.

Nguoc lpi, m6i lAn nguoi dirng dua ra m6t lQnh tim ki6m tr6n giao diQn
th6ng thulng cria Google, k6t qui tim ki6m tir Google Desktop s€ dugc b6 sung tp
dQng



phia tr6n mdn hinh.

2.2.2 Nhtqc

diAm

Sg ri6ng tu cira bpn sE bf soi m6i (khi b4n sri dpng chung mdy tinh vdi ngudi
kh6c Google Desktop hi6n thi tht cir dii liQu cria mgi ngudi, tru phi duqc thi6t lap
preferences di5 lo4i bo mQt si5 thu muc hoic dinh dang.

hqp cria bpn, hdy cdn nh6c
2.3 C6ng cg

2.3.1

Vi vfy, ntiu ddy li trubng

k! tru6c khi cdi d[t chucmg trinh).

tim ki5m cria Ping

Uu di6m

Tim ki6m Ti6ng ViQt cira Ping nhin chung kh6 t5t. Ping c6 vC uu tiOn nQi dung
hcrn vd c5c

k! thupt thdn thiQn v6i m5y tim ki6m.

Trang 3


Chuong

V6i m6i ki5t qui tim ki6m, b4n c6 mQt si5 th6ng tin mi6u
thudng

li

c6c mi6u

ti

snippet md Google hay su dgng vd

ti

t:

T6ngquan

chi tii5t hcrn vd trang,

d{t bigt

la sitelink. C6c

sitelink niy minh nhan thAy Bing to raxhc dfnh don gi6n hon dga vio menu vd s5 l6n
xuAt hign hon


li

ngir nghia r,hung lai hi6u qui kh6ng k6m Google, thQm chi ld cdn

chinh x6c hcrn.

Ngoii ph6n nQi dung k6t qui tim ki6m vi pop-up th6ng tin trang ktit qu6, m6y tim
ki6m Bing cdn cung cdp cho b4n bing gqi y c6c tu kh6a li6n quan

vi llch st c6c tim

ki6m de thgc thi.

Microsoft dang c5 ging khai th6c vdo di6m mpnh cfing ld chd tr6ng cira Google ld
phdn giao diQn.

2.3.2 Nhuqc di6m
Chua c6 phi€n bdn ti6ng

vi6t, ld c6ng cu cdn non k6m.ViQc t6ch loc tiiing ViQt

trong c6c dlnh d4ng phirc tpp hon nhu Flash, PDF chua dugc ttit.

Ciic c6ng cg h5 trg Webmaster cria Microsoft cdn b6o thir vdi cdc dinh dpng vd
.A

cnuan neng.

3. MUC TrEU VA NHrEM VV CUA KHOA LUAN

3.1Mgc ti6u
NQi dung kh6a lupn ld mdn hoc khri m6i chua dugc hgc trong chuong trinh, cAn
du-o.

c nghiCn cr?u vd hgc h6i. Girip chring ta rdn luyQn

k! nlng

ph6n tich, xu ly v6n dA

vi khi ndng tu duy itQc lfp, cirng v6i sy trg girip cria gi6ng vi€n hudng din tin ring sC
hodn thinh ctuqc dc tdi xdy dpg c6ng cg tim kitim tap tin vdn b6n theo n6i dung.
3.2 NhiQm vg

.
,,

Tim hi€u vC hQ th6ng tim ki6m th6ng tin.
HQ thdng

tim kii5m tfp tin tr6n m6y tfnh

c6 nhdn.

,,,
..,

BQ phan lAp chi mpc. C6c phuong ph6p ldp chi mqc.

.


B0 phQn tim ki6m.

,,,

XAy d1mg c6ng cg.

Phuong phrip luu

trf

b0 chi mgc.

Trang 4


Chuong 1: Tdngquan

"

LQp trinh tr6n Windows Form: Dgc nQi dung tpp

tin vin bin (Word,

HTML, HTM, TXT, RTF...).

"

Luu trfr chi mgc theo c6u truc B-Tree.


4. HTIoNG GrAr euyEr
Dti x6y dtmg c6ng cp tim kii5m t4p tin vdn bin ta c6 nhirng huong
sau:

o
o
o
o

Ng6n ngt lfp trinh C#.
C6ng cU lap trinh Microsoft Visual Studio 2010

Lpp chi muc bing phuong phrip chi muc nghich d6o
Su dqng B-Tree dC ltru chi mpc

Trang 5

giii

quytit nhu


Chuong 2: Co so l!,thuyAt

CHTIONG 2: CO SO

Lf THUYNT

1. r,f ruuynr
1.1Tim ki6m th6ng tin

1.1.1 Dinh nghia
Dinh nghia hQ th5ng tim ki6m th6ng tin cua mQt s6 t6c giit:
Salton (1989): "H€ th6ng tim ki6m th6ng tin xtr

ly cictpp tin luu vd nhirng

ve th6ng tin, x5c dinh vd tim tir c6c tpp tin nhirng th6ng tin
v0 th6ng tin. ViQc truy tim nhirng th6ng tin d{c

th6ng tin dugc luu tr&
c5c thuQc

thi

y6u cAu

phi hqp v6i nhfrng y6u c6u

php thuqc

vi sp tuong thc gifia cic

vi c6c y€u cAu, dugc dSnh gi6 bing c6ch so s6nh c6c gi5 trf cria

tinh dugc luu trft vd c6c y6u ciu vd th6ng tin".

Kowalski (1997): "HQ th5ng truy tim th6ng tin ld mQt hq thiing c6 khi ndng luu
trfr, truy tim vd duy tri th6ng tin. Th6ng tin trong nhirng trucrng hgp ndy c6 th6 bao
g6m vrn bin, hinh 6nh, 6m thanh video


"MOt

hQ

vi

nhfrng dtii tugrrg da phucrng ti6n kh6c."

th6ng tim ki6m th6ng tin ld mQt phAn m6m girip ngudi su dpng tim ki6m

th6ng tin (information) h9 cAn. Hay mQt hQ thdng th6ng tin girip ngudi sir dung tim
ki6m nhirng

tii

liQu (document) chr?a nhtng th6ng tin (information) hg c0n

dung sE rut trfch th6ng tin cAn thi6t

"Truy tim th6ng tin

li

fi

vi ngudi su

nhirng tdi liQu d6."

tim ki6m th6ng tin (thucmg


li

c6c tdi li€u) o mQt d4ng kh6ng

c6 c6u truc (th6ng thudng ld v5n bin), th6a m6n nhu c6u th6ng tin tir trong nhirng
ngu6n thdng tin lon (duo. c luu trft tr6n c5c m6y tinh)." [irbookprint]

Tim kii5m thpng tin (information retrieval): tu mQt ngudn r6t ntri6u
ti6ng n6i, tim ra rrhirng tQp c6 nQi dung li6n quan tltin mQt v6n
(hay

tri loi). Di6n hinh cria c6ng nghQ niy li


tQp

vin

bdn hay

(c6u h6i) ta c6n bitit

Google, mQt he tim ki6m th6ng tin tr6n

Web, md hAu nhu chring ta d6u dung thuong xuy6n.

Hi6u don giin


hQ

th5ng tim ki6m thdng tin

sfr dgng tim ki5m th6ng tin mQt

1.1.2 Mrlc ti6u vi

chrfrc

li

mQt hQ th6ng h5

cich nhanh ch6ng vi d6 ding.

ning

Trang 6

trg cho ngutri


Chuong 2: Co so l!,thuy€t

MUc ti6u cira mQt h€ th6ng tim ki6m thdng tin ld tim ki6m

ldi cho nhu

cAu thdng


vi tri

vA c6c tdi liQu

tin cira ngudi ding.

MOt he th6ng tim ki6m th6ng tin c6 hai chric ning chinh d6 le lpp chi mpc
cr?u hay

tri

vi tra

tim ki6m:

o
tin tir

Lap chi mpc

tii

liQu vd biOu di6n lpi

"tt",

c6 thti le

o


li giai do4n phin tich tni liQu dC rut trich c6c tii li6u boi c6c dcrn vi th6ng tin d6. Ecrn vith6ng

hopc phric tpp hon ld "cpm

tli",

ookh6i

tin

niQm", ...

Tra criu (hay tim ki6m) ld giai dopn tim ki6m trong co sd dfr liQu Oieu

di6n cho

cfuc

titi lieu boi c5c don v! th6ng tin trong chric n6ng lap chi muc) nhirng

tdi liQu pht hqp v6i nQi dung c0u truy v6n.

1.1.3 Ki6n trric
Theo

I

thuy6t thi nQi dung cdu truy v6n c6 th6 so srinh tryc ti6p vdi nQi dung cta


tii liQu dC t6y ra nhlrng tdi liQu 1i6n quan dtin nhu ciu ngudi ding.
Nhtmg tr6n thyc tis thi kh6ng thti thpc hiQn nhu vpy dugc vi c5c ciu truy v6n vir cric

tdi liQu dugc luu trii d c5c d4ng kh5c nhau, nguoi dirng khi doc vdo n6i dung c6u truy
v6n

vi tni liQu c6 th6 tu duy ra dugc m5i quan hQ giffa chring.
Vi vpy trong ki6n truc cria mQt hQ th6ng tim

d0

thfc

kii5m th6ng tin c6 bQ phpn trung gian

hiQn c6ng vi6c chuytin tlOi c6c th6ng tin cira

tii

liQu thinh thdng tin

du-o.

c biiiu

di6n trong hQ th6ng, d6 le b9 phpn lpp chi mgc.
BQ phan thri hai ld b0 ph0n

tim ki6m,


bQ phpn

niy c6 chric nrng phdn tich nQi dung

c6u hoi (truy v6n) cria ngudi ding thdnh nhirng th6ng tin gdn gi6ng nhu trong hQ th6ng

di luu trfr, sau d6 so sSnh th6ng tin ndy vdi th6ng tin c6 trong hQ th6ng de tra ve t6t
que.

a)

B0 phqn lQp chi muc
NhiQm vU chfnh cria bQ ph4n

niy li:

dgc nQi dung tu mQt kho tdi liQu l6n,

t6 etin

phdn tich, thting k6, bii5u diSn lai nQi dung ctia ttng tdi liQu trong kho vd luu nhirng
th6ng tin

niy vdo co so dt

liQu theo mQt c6u trirc

Co so dfr li0u ndy ggi ld tap tin chi muc cta


hQ

Trang

7

nio d6 theo quy dinh cira hQ th6ng.

thdng.


Chuong 2: Co so l!,thuyAt

C6 nhi6u c6ch dO bi6u di6n lai nQi dung cria tdi li6u trong tpp tin chi muc nhu: nQi
dung g5c, rfit trich

Vi

ngt nghia, rut trich

dflc trung cria tdi liQu,...

vln bin thi ta c6 th0 sri dgng nQi dung gdc hoic y chinh

dp: mQt tdi liQu

tpp tin d6 dC l6p chi muc;

ctia


tii liQu ld hinh enh hoic 6m thanh thi chirng ta l6y ra d{c

tnmg cira tdi liQu g6c Ae tpp chi mgc.
TQp

tin chi muc sC dugc hQ th6ng luu trt, quin li, bd sung, cQp nhflt,...

trinh tim ki6m

sE diSn ra

tr6n nQi dung cria t4p tin ndy

vi

qu6

mi khdng cdn t5c dQng d6n c6c

tii lieu g5c trong kho.
b)

B0 phdn tim ki6m

Th6ng tin ngudi dung cAn tim sE dugc nh$p thdng qua mQt giao diQn (c6 the le
giao diQn web), thong tin
nhi€n hay mQt c6u trfrc
phdn tich

vi


niy co th6 0 d4ng mQt c6u hoi gAn gi6ng v6i ng6n ngfr t.u

nio d6 do hQ th6ng qui udc. C6c cdu hoi niy

s€ dugc hQ th6ng

bi6u di6n lpi o mQt hinh thric m6i gi6ng nhu phAn lap chi mpc, khi

dung dugc bi6u di6n l4i

<16

nQi

niy "d6ng d?ng" v6 mpt c6u truc v6i tpp tin chi mgc vi chfng

ta cfing dC dang so s6nh tryc tiifp gifra hai sg bi6u di6n ndy.

l.2lJlg thting tim ki6m trGn m6y tinh ci nhin
1.2.1 Girfi thiQu
HQ th6ng tim ki6m tdi liQu tr6n miiy tinh c6 nhdn ld tim kii5m th6ng tin ngudi

ding tr6n c6c lopi t4p tin vdn bin

ptrO Uitin nhu Word,

HTML, HTM, TXT, RTF... vd

hi6n thi nhfing tpp tin c6 1i6n quan d6n tir khori ngudi ding nh4p vio.


1.2.2 C6c vin dd
HiQu qui cria qu6 trinh lpp chi muc cho c6c tdi
nhiAu

liOu luu

vio ytiu t6 b9 doc cho ttng lo4i tpp tin kh6c nhau. Do

trfi tr6n dia phu thuQc r5t

c6c lo4i tdi li6u kh6c nhau

thi c6u truc kh6c nhau vd v6n d0 doc chinh x5c nQi dung vd c6u truc cria chting ld tucrng
d6i kh6 kh6n.
Tucmg

ty nhu cric hQ th6ng tim ki6m tr6n web,

hQ

th6ng tim ki6m tr6n m6y tinh

c6 nhdn cflng vfly, qu6 trinh lpp chi mUc diSn ra nhanh hay chgm tuj,thuQc vdo ki6n

thric vd kinh nghiQm cta nguoi

xiy dpg

hQ


th6ng vC nhi6u y6u tri nhu: c5u tnic chi

mqc (c6u tnic ciy, bing b6m,...), c6u trric file AC tuu trir chi mqc (file dpng text, nhi
ph6n...).
Trang 8


Chuong 2: Co sd l!,thuy€t

v6n rld trong tim ki6m thdng tin
Ke fir nhirng nim 40, c6c v6n dO trong viQc luu

1.3 MOt s6

tin itd thu hirt sp chir

trt

th6ng tin vd tim kii5m th6ng

y rdt l6n. V6i mQt lugng th6ng tin kh6ng 16 thi viQc tim ki6m

chinh x5c vir nhanh ch6ng cing tro nOn kh6 khdn hon. V6i sU ra ddi cria m5y tinh,
nhiAu

rit

y tuong l6n dugc dua ra nh[m cung c6p mQt hQ th6ng tim ki6m th6ng minh vd


chinh x6c. Tuy nhi6n v5n

dC

tim kii5m sao cho hiQu qui vdn chua dugc gi6i quy6t.

Vd nguy6n t6c viQc luu trft

vi tim ki6m

th6ng tin thi dcrn gi6n.

Gii

su c6 mQt kho

chria c6c tdi liQu vd mQt ngudi mu6n tim cric tdi liEu li6n quan d6n y€u c6u cta minh.

Ngudi doc c6 th6 dgc tdt chc6c tdi liQu trong kho, gifi lpi c6c tii liQu li6n quan vd bo di
crlc tdi li€u khdng li6n quan. RO rdng

giii

vi ti5n r6t nhi6u thoi

ph6p ndy khdng thUc t6

gian.

V6i


sU ra

ddi cria m6y vi tinh c6 t6c dQ cao, m6y tinh c6 th6 "dec" thay cho con

ngudi d6 trich ra chc tdi liQu c6 li6n quan trong toan b$ tflp dfi li6u. Tuy nhi6n vAn d6

li

hic niy

ldm sao d6 x6c dinh dugc tdi liQu nio li6n quan di5n c6u hoi. Mpc dich ctia

mQt hQ thiSng

tim ki6m th6ng tin tg ilQng ld truy luc dugc t6t ch cic tdi liQu c6 li6n quan

-a^l yeu cau.
den

2.

THI/C HANH

2.lLfp
HQ

chi mgc
th6ng lpp chi muc hay cdn ggi ln hQ th6ng phin tich vh xtr


li dt liQu, thgc hiQn

viQc phdn tich, trich lgc nhfrng thdng tin cAn thi6t (thuong ld c6c tir dcrn, tu gh6p, cgm

tu quan trgng) tu kho

tii

lieu thu th4p dugc vd t6 chr?c thinh

sd dfi liQu ri6ng

ccy

tim ki6m tr6n d6 mQt c6ch nhanh ch6ng, hiQu qu6. HQ thdng chi mgc

thi5

c6c tir kho6, chi rd c6c tu khoil niro xu6t hiQn trong t4p tin nio, vi

o


c6

li danh s6ch

tri nio.


C6c phucrng phSp lap chi muc

2.1.1 M6 hinh tra cru luqn ly
C6ng viQc chri yiSu khi xdy

kho

tii

dpg

mQt hQ thting tim ki6m

li

viQc lap chi mpc cho

li6u ctia chring ta, sau qu6 trinh lpp chi mgc thi nQi dung cria chi mpc

diSn nQi dung cria kho tdi li6u o mQt dpng c6u truc ndo d6 do hQ th6ng quy
Trang 9

sC biiSu


Chrong 2: Co sd $ thuyiit

C6u truc chi muc co
cr?u lu?n


bin vi

dcrn gian nh6t

ly (boolean retrieval boolean). Cdu truc chi muc niy

DC minh hoa cho qu6

trinh xdy

hi6u 16 v6n d6: tim kii5m t6n c6c nhdn

dpg

duo. c

x6y

dpg

Ae UiCu

tii liQu.

di6n cho tdt chc6c tu (word) xu6t hiQn trong

Sau qu6

li c6u truc chi mpc theo m6 hinh tra


chi muc theo c6ch tr€n, ta xem vi du sau d6

vft trong c6cthcphAm

trinh thu th4p c6c t6c phAm v6n hgc cria Shakespeare

cria nhd

vi

v[n Shakespeare.

th6ng kC c6c tu ngt ta

dpg bing th6ng kd g6m t6n nhdn vflt vd t€n c6c vo kfch. Bing thting kC c6 th6
dugc m6 ti nhu sau:

xdy

'l'eic

Anton-\,
anrl
Clleopatra

r\ntorr,v

Z


'l'h*
Julius 'lentpest

('easar'

I

[]rutus

I

I

I

I

0

I

I

0

Mercy'

I

0


w0rser

I

0

Bdng

3.

t

BiAu di6n cdu

ffilc ltru tric chi

I

lamlet Othello

00

I

('easar

Calpumia
('lcopatra


-c

phim

0r

{lt
00
{i0
II
l1

mqtc theo m6 hinh

N4acbcrh

0

I

t)

0

I

I

0


0

0

0

I

I

I

0

tra ctat luQn t!'

Circ gi6 tri trong ma trQn (ki6u boolean) dnng dC bi6u di6n cho su c6 m{t cira
nh6n vpt trong vo kich. Gi6 tri ndy
gi6 tri ndy

li

1 (true) niSu nhdn

vft

c6 trong vo kfch, ngugc lpi

li 0 (false) ni5u nhin v4t kh6ng c6 trong vo kich.


Chfng ta c6 c6c vector bits m6i tu bi6u di6n trong kho

tii

liQu.

Vi du tu

"Antony" le 110001 (1 ddng thuQc ma trfn tr6n).
Trong m6 hinh Boolean chring ta c6 th6 d{t ra U6t tcy truy v6n nio dudi dpng
m6t bii5u thfc logic tr6n tir ngt, trong d6 circ tri ngfr ndy k6t hqrp vdi nhau boi crlc phdp
to6n AND, OR, vd NOT.
Trang 10


Chuong 2: Co so lj,thuyAt

Gi6 su y6u cAu tim ki6m ld c6c vo kich c6 chria 'oBrutus"

vi "Caesar"

nhtmg

"kh6ng c6 Calpurnia". Khi ti6 c6u truy v6n c6 th6 bi€u ditin nhu: Brutus AND Caesar

AND (NOT Calpurnia).
ViQc tim ki6m sC ti6n hdnh ph6p tinh sau:
110100

AND 110111 AND (NOT(o10000;


110100

AND 110111 AND 101111

Gi6

tri

dey bit tr€n
ki5t

F

1 the hiQn

:

:

100100

vo kich o vi tri tuong img

li

ktit qui cin

tri


ve. Oi6n gi6i cho

li vo kfch
qui tri vii.

Cric hpn ch6 c6 th6 th6y dugc o m6 hinh ndy d6 ld:

o

Ktit qui cira m6 hinh ndy le

1

tei li6u

se dugc x6c
li

dirng hay kh6ng dring

Otii vOi 1 c6u truy v6n tim ki6m (tim chfnh x6c). Hon nfra c6c tdi liQu tr6 vC chi c6 th.5

dugc s8p xtip dga vdo id cira c6c tii liQu d6.

o

C6u truy v6n ph6i o dpng cria mQt biiSu thric logic vd nhu th6 se r6t kh6


khln cho

ngudi sri dgng th6ng thudrng.

o
c6c

M6 hinh tra

cr?u lupn

l;f sri d\rng m6t ma trpn nhi phdn

tii liQu, viQc x6y dlrng ma trdn ndy ld rilt "ldngphi".

liQu titing Anh, vd m6i tai liQu c6

tdi liQu). Ta
chring ta
1

sC

tlO

Gi6 su chring ta c6 1 triQu

tii

khoing 1000 tu (tucrng duong 2-3 trangtr6n m5i t4p


c6 khoing 500.000 tu kh6a ph6n biQt AOi vOi ng6n

phii x6y dpg

bi6u diSn tpp chi mUc

mQt ma tr4n g6m

lM x 0,5M

:

ngt Anh, c6 nghia ld

500 tj, phAn tir, trong d6 chi c6

tf phin tti mang gi6 tr! 1.
2.1.2 Phrong phdp chi mqc nghich ddo
V6i nhirng h4n ch6 d5 n6u,
li


qu5 trinh lap chi muc

th6y m6 hinh tra ct?u lufln ly kh6ng th6 6p dqng d€ xdy d1mg c6c
Phucrng ph6p lap chi muc nghich
phdp


niy

sO du
hQ

tii

liQu, ta nhpn

th6ng tim ki6m lcyn.

dio ra ttdi nhim gi6i quyiit hpn chti tr6n.(Phuong

nghiCn cftu & phAn nQi dung thuc hiqn).

Trang I I


Chuong 2: Cct sd ty thuyiit

2.2 Gi6n thiQu B-Tree

Trong thrc t6 c6 nhi€u chi muc kich thudc cria n6 ld r6t lon (thgc t6 1 table
trong thgc t6 s5 record cira n6 khoang hon

phii

du-o.


I

tripu ddng ld diAu binh thucmg) n6n cin

c luu tr6n cric thi6t bi phAn ctmg (thi6t bi nh0p xuSt - HDD), ngdy nay khi md

dung luong cria c6c thilit bi ltru trii ting l6n r6t nhanh vd gi6 thenh 16 n€n cAn 1 phucrng

ph6p truy xdt nhanh tr6n 1 lugng dir liQu lon nhu
B-Tree

li

dugc chia

1 c6u

vay.

truc dfi liQu d4ng cdy 'da phdn c6n blng'c6u truc 1 node cta n6 thuong

lim nhi6u phAn,

cdy dugc t6 chric

voi I qui tic nh6t



(thuong thi gi6 tri cria c6c node tdng din tu trrii qua ph6i, node con b6n tnli

sE

tu diu

chria gi6

tri nh6 hon node con b6n phai...)
B-Tree

li

chinh gita chting
hpn boi

1 sU t6ng qu6t h6a cria c6y nhi phdn tim kiiSm @ST) sg kh6c biQt

li

s6 lugng node con cria B-Tree n6 sC nhi6u hcrn chr? kh6ng bi gioi

2 phdntu nhu cria cdy nh! phin tim ki6m. Muc ti€u o ddy li lim

sao d6 gi6m

ttii da sti tAn truy xu6t cric thitit bi ngopi vi o ddy ld fia, tpi b6t cri thoi di6m nio chfng
ta sE thir ilinh vi nhfing bin ghi, chring ta mu5n dQ cao cira ciy da phin ndy phii le nh6

*r6t


Ae

giim chi phi tim kii5m, de dat duoc muc ti6u niy thi sd lugng nh6nh trong c6y

phii

l0n.

Ni5u B-Tree c6

bfc ld m thi I node ctra n6 sE c6 tOi ea m nh6nh con vd c6 m-l kh6a <16

tim ki6m.

o

C6c reord chi ilugc luu 0 nrit 16 cria cdy

"

T6t

ci cfuc nirt (ngopi tru ntit g6c) c6 tOi tnieu DL rnlz vi t6i da ld m ciy(nh5nh)

con.

o Nrit g6c vd 16 ph:ii it nh6t c6 2 con.
o B6n trong 1 node luu gi6 tri tim ki6m, vdi chi dugc sri dung nhu nhtng "ngudi
chi dudng" cho qu6 trinh tim ki6m, nhtng kh6a niy cin phii dugc s6p x6p theo 1 thf

tg thting nh6t tu tru6c (thucmg thi t[ng dAn tri tr6i qua phai).

o

Tuy thuQc vio kich thudc cria 1 record so vdi kich thu6c cria 1 kh6a thi node

cria B-Tree bpc m c6 th6 luu nhi6u hon ho{c

Trang 12

16

it hon m records. Th6ng thudmg kich


Chuong 2: Co sd

thu6c cria 1 node ld bQi sO cria

I

ttri5i

af

fiQu trong O Aa

ft6n windows m{c

tf


thuydt


5l2byte).

o

Cdc node

16

cria cdy thudng duqc li6n kt{t voi nhau de

e" thenh I danh s6ch li€n

tiSt, aieu nny nhim mgc ti6u la clti khOng phni duyQt tro lgi node g6c trong qu5 trinh
drryQt

tim kitim.

2.2.1 DinhnglrTo
MQt B-hee bAc N co cdc d{c finh sau:

(+) M6i node c6 tOi (+) M5i node ho[c

li


node

16

hofc c6 t6i thiiSu Nl2+l node con

Vf dp:

J.

8

34.37.43

t2. t4.16 t9

78.9{J

Ilinh 2.1. B-tree bdc 2 cd 3 m*c
Khai brio:
#define ORDER 5
typedef struct tagNODE
{
NumTree;

int

int

ll sd cdy con crla node hiQn hdnh

Key[ORDER-ll; ll mdng lmt trii cdc khod cfia node

tagNODE* Branch[ORDER]; // cdc con trd chi d€n cdc node con
*NODEPTR, +BTREE; ll contrd node
NODE,
)

NODEPTR

Root;// contrd node gdc

Trang

Ij


Chuong 2: Co sd ly thuyAt

2.2.2

Tim kiilm ffong B-tree

Ifinh 2.2 : Tim kiiim gid
X6t node trong hirdr.2.2, kh6a c6n tim

ti X

li X. Gia sri nQi dung cria X nim trong bQ

nh6- Voi m dri lon ta sri dgng phuong ph6p tim ki6m nh! phen, ntiu m nh6 ta sri dung

phuong phrip tim kitim tuen tg.
N6u X kh6ng th6y sE co ba trudng hqp sau:

(+) Ki < X < K;*1. Tiiip tqc tim kiiSm trin c6y con Ci
(+) Km < X.
(+) X

Tiiip tgc tim ki6m tr6n Cm

1Kr.

Ti6p tqc tim ki6m tr€n C6

Qu6 trinh ndy ti6p tuc cho 15

md v6n kh6ng tim thAy khoq viQc tim ki6m

a)

di di diSn node

h th6t bai.

Hdm NODESEARCH

Tra v€ v! tri nh6 nhSt cria kh6a trong nrit p bit dAu lon hon hay bing k. Trudng hqp

k lon hon


dt cir cickh6a

trong nrit p thi tra vc vi tri p-> NumTrees-l

int NodeSearch (NODEPTR p, int k)
{

i:0;

int
for (i:0; i< p>NumTrees
refurn i;

-l &&p->Key[i] < k; i++;;

)

Him NodeSearch tlugc dirng

aC

tim kh6a k c6 trong nrit p hay kh6ng. N6u kh6a k

kh6ng co hong nut P thi ph6p toiin ndy
cria p a6 ti6p tUc tim kh6a k trong

tri

vC


vi tri girip chfng ta chgn n6t con phn hqp

ntt con niy.

b) Hdn SEARCH
Tim kh6a k tran B-Tree. Con tn6 p xuSt ph6t t& gtic vd tti xuting c6c nhrffi c6y
con ph[r hqp iI6 tim kh6a k co trong mQt n6t p hay kh6ng.
N6u c6 kh6a k t4i nrit p tr6n cdy:
+ Bi6n found tra vA g16

H TRUE
Trang 14


Chuong 2: Cct so lj,thuydt

+ Him Search( ) tr6 vC con tr6 chi nft p c6 chria kh6a k
+ Bi6n position

tri

vC

vi tri cira kh6a k c6 trong nirt p ndy

N€u kh6ng cd kh6a k trOn cdy: liuc niy p
th6m kh6a k

'


: NULL vi q (ntt cha cira p) chi ntit 16 c6 th6

vio nrit niry dugc.

+ Birin found

tri

vC giltriFALSE

tri

+ Hdm Search( )

vC con

tr6 q li nrit

16

c6 th6m

nft k viro

+ Bitin position trd v€ vi tri c6 th6 chdn kh6a k vio nrit ki q ndy

NODEPTR Search(int
{

b int *pPosition, int *pFound)


int i:0;

NODEPTR p : Root, q :NULL;;
while (p !:NULL)
t

i:

NodeSearch (p, k);

if (i< p->numtress-1
t

&.&k::

p->Key[i])

*pFound :
*pPosition
return

:

p;

TRUE;
i; ilL'i rt'i ti*,t thriv l;kd* k
ii Node t:ti chwt *lxi* K


)

q:
p:

p;
p ->Branch[i];

)

ri' ({1ung tnet i}tiir'. lirr: na3 p
'N{ll.l." r,a ul !a nrttle lii *ri rltc tl:rnr khriii
r,ii* node ntirr'" pitosizir:n l;\ r,i tri u* tl:c *hin khi>a k *,,
*pFound : FALSE;
*PPosition: i;
return g, .. I lit r * notlc l*
)

2.2.3

DuyQt cdy B-tree

DuyQt c6c kh6a cira B-Tree theo

r

tht

t.u


tu nh6

void Scan(NODEPRT pRoot)
t

if (pRoot:: NULL)
retum;

Trang 15


qui

k


×