TRTIoNG DAr
KHoA xY
Hec rAv o6
rnudr c6xc
NGHp
--pfEcz--
xu6,r
LUAN
16r
NGHrpp
CHTIYEN NGANH CONG NGHE THONG TIN
oB rAr: xAv Dqlyc cONc cq riur xmur
TAP TIN vAN TAN THEo NQI DUNG
SVTH: t Avr
rry cnAM pA
LOP DAI HQC TIN HQC 4
MSSV:0951010050
uA so: MSDT_KLDH_3.15
GVHD: Th.s NGUYTN Crri
I
cAN THo 2ot3
L.003736
cUONc
Loi cdm on
r-,OT
Sau b6n n6m tr6n
cim
cm di5n
CAM ON
giing dudng d4i hgc,
di
di5n nay
s6p ra truong
tdi thenh thAt
toin th€ thiy c6 gi6o trong trudrng dgi hgc Tdy D6 dd quan t6m theo ddi,
ti6p sirc t6i trong qu5 trinh hqc t?p ctng nhu trong c6c kj, thi.
Trong qu6 trinh ldm lu4n vdn t6t nghiQp, t6i dA nh4n dugc sg hudng d6n cfing nhu
trg girip nhiQt tinh cta Thiy Nguy6n Chi Culng,
li
gi6o vi6n hucmg d6n ThAy d5 tich
cyc trong qu6 trinh rdn luyQn k! ning l4p trinh cfing nhu ki6n thfc cho t6i. D6n nay
chuong trinh dd hoin thinh t6i xin
Ngodi ra, t6i xin gtii ldi cim
di5n
gi6ng
gti loi cim on chdn thinh
crn
!
Thiy.
siu sic d6n cha mg, ngudi d5 cho t6i budc chdn
dai hoc, cho t6i hoc c6c ki6n thric m6i, c6 kinh nghiQm nhi6u hcrn
trong cuQc s5ng. Ngoii ra, t6i cffng xin gui loi cim
quan tdm e6p
t16n
crn
thdn thi6t d6n t6t c6 bpn bd de
d6n bni lu{n v6n cf,ng nhu de girip dd t6i.
Cutii cung bdi lupn vdn dd hoan thinh, t5t nhi6n kh6ng thi5 tr6nh kh6i c6c sai s6t
mong nhfln
du-o.
c sU g6p y cira quf thdy cd vd c6c bpn.
Xin chdn thenh cim
crn
!
Cin tho,ngiy 03 th6ng 06 ndm 2013
Sinh vi6n thuc hi€n
LAm Thf ChIm Pa
Trang
i
Ldi ndi ddu
r-,OT
NOI DAU
Lufn v5n t6t nghiQp li mQt m6n hgc do sinh vi6n tg hgc vd tim hi6u,
ct?u
hoin thdnh dii 6n, nhim tt6p tmg nhu
cAu
tIrS
nghiCn
tfuc ti6n trong doi sting cflng nhu trong
hec tfp.
Ngiy nay, thgc t(: c6
rdt nhi6u nhu
ciu ddi h6i chirng ta phii giii
quyi5t sao cho
hqp ly. Thpt v4y, trong xE hQi tiiSn bQ hiQn nay khi cugc s6ng dd c6 ph6n sung tric
tiQn nghi hcm, viQc
vi
lya chon mQt phuong ph5p s5ng sao cho thoii m6i, kh6ng phi nhi6u
thoi gian ld tti€u mong udc cria m6i ngudi.
O m5i khia cpnh kh6c nhau c6 c6c y6u cAu kh6c nhau, ri6ng AOi vOi ngdnh c6ng
nghQ th6ng
tin, didu ldm t6i quan t0m d6 ld viQc tim kiiSm tpp tin v[n bin tr6n m6y tinh
c6 nhdn. Edy thyc sg
li mQt ddi h6i cdn thi6t, viQc tim kitim hic ndo cfing dugc uu ti€n
hdng dAu, vi chi khi c6 mQt phAn m6m hd trg tim kii5m t6t thi n6 sE c5 trlc dgng to l6n
trong c6ng viQc cfrng nhu trong hgc tap ctra b4n.
Ban biiSt ddy, hiQn nay thdi gian
li mQt kh6i niQm r6t hr quan trgng, r6t quy b6u
moi ngudi vi vfly phii bi6t cSch su dgng sao cho hqp ly vd hiru fch. Cho n6n,
trong dcrn vi hgc trinh cira tdi, t6i da chgn cho minh ae tai *Xdy dpg c6ng cg tim kii5m
AOi vOi
tpp tin vdn
bin theo nQi dung" ldm bii kh6a lufn ttit nghiQp.
Trang
ii
T6m tdt
TOM TAT
C6ng cg tim ki6m ld mQt phAn m6m hiru ich
vi cAn thitit cho m6y vi tinh, sE kh6
khln khi phii tim thti c6ng, n6 ldm ban m6t qu5 nhi6u thoi gian, ldm chflm tr6 c6ng
vi6c cria b4n, thay vdo d6 khi da c6 c6ng cg tim ki6m trong tay, b4n
sE
nhin rdi hon, vi
ti6t kiQm dugc nhi€u thdi gian hcrn cho c6ng viQc khric.
vi
Cdng cg tim ki6m tap tin v6n bAn theo nQi dung ld mQt phAn mOm kh6 quan trdng
cAn thi6t cho nhirng ai sir dung m6y vi tinh.
Trong bdi b6o c6o kh6a lufn ndy t6i sE trinh biy qu6 trinh xdy dlmg mdt c6ng cu
tim ki6m t4p tin v6n bin theo nQi dung, d0 mqi ngudi hi€u rd hon sg cdn thi6t ctra n6.
ABSTRACT
Search Engine is helpful and necessary software for a computer.it's so
manual find, it's
difficult for
will lost more time of you, and maybe your work must to delay.
Instead, if you use the search engine, you
will saving much time for a next work.
Search text files by content is so important and necessary for a users computer.
In my thesis report, i will be presents the processes to do a search text files by
content. I hope this became useful with everybody.
Trang
iii
Nh,2n xdt cua etdo vien huong diln
NHAN xET
cul crAo vrfiN Hr.IoNG nAN
***
Cin tho, ngdy ..... th6ng ..... ndm 2013
Gi6o vi6n huong d6n
t
Nguy6n Chi Cudng
Trang iv
Nhdn xdt cila gido vi€n phan biQn
NHAN xET cu.q,
crAo vrEN PHAN BrEN
____________* *
{<
___________
Cdn tho,ngey ..... thring ..... ndm 2013
Gi6o vi6n phin biQn
Le Thi Thu Lan
Trang v
Danh muc hinh
DANH MUC HiNn
Hinh 2.1. B-tree bqc 2 c6 3 muc..
.......... 13
Ilinh 2.2 : Tim ki€m gid tri X.........
........14
22............
....... 16
Hinh 2.4: Qud trinh tdch node khi thOm kh6a 15
.................... 17
Ilinh 2.5: ThAmvdo cdc khda 35, 7,26 vd /8............
..............17
Hinh 2.6: Qud trinh tdch node khi thAm kh6a 22 ............
........17
Hinh 2.7: Qud trinh tdch node khi th1m khoa i ..............
........ 1g
Hinh 2.8: ThAmvdo cdc kh6a 42, 13,46, 27 vd 8..............
..... 18
Hinh 2.9: Qud trinh tdch node khi th1m kh6a 32 ............
........ lg
Hinh 4.10: Thdm vdo 38, 24 vd 45 ............
............. 18
Hinh 2.1l: Qud trinh tdch node dQ quy khi thAm kh6a 25 ............
........... 19
Hinh j.2. Cdy nhi phdn..
.......29
Hinh 4.l: Giao diQn chinh cila chuong trinh ........
.................. 36
Hinh 4.2: Giao diQn dgcfile
..................37
Hinh 4.3: Giao diQn qudt file ..............
....................37
Hinh 4.4: Giao diQn qudt thu mttc .........
.................3g
Hinh 4.5: Giao di€n cdc k€t qud tim drqc..
............3g
Hinh 4.6: Giao di€n tim theo ftnfile
.....39
Hinh 4.7: Giao diQn tim theo n6i dung........
...........39
Hinh 2.3 : Qud trinh tdch node khi thlm kh6a
Trang vi
Danh muc bang
DANH MUC TANC
'
'
Bdng
3.
t Bi€u didn cdu trilc
lm,t
trfi chi m4c theo m6 hinh tra c*u lugn ly.......Errorl
Bookmark not defined.
Bdng 3.2 MO td cdch ddnh s6 cho tqp tin trong m6 hinh lqp chi mttc nguqc............24
Bdng 3.3 Xdy d,rng tQp chi mqc nghlch dao co sdp x€p vd gom nh6m....................25
Trang vii
Muc
cHI/oNG r: TONC QUAN
1. rY oo cHo. N op
lr,rc
.......
rat...
I
..................1
2. TiNH HiNHNGHTEN ctru LrEN euAN
2.1 C6ng cp tim kitim cira Windows
2.2 C6ng cp tim kitim cira Google
2.3 C6ng cp tim ki6m ctia Ping.........
3. MUC TIEU VA NHrEM VU CUA KHOA LUAN
3.1 Mpc ti6u
3.2 NhiQm vu.............
4. HTJONG GrAr QUYET...
cHTIONG 2z CA sd r,f THUyfT
....... I
..................
1
.....................2
.................3
............4
.....................4
.......4
.................. s
......................6
1. lY ruuvpr
1.1
t.2
HQ th6ng
tim ki6m tr0n m6y tinh c6 nhdn.........
Trang
viii
............8
Mttc ltrc
2.2
Gi6i thigu
B-Tree
.....12
CHITONG 3: NQI DUNG THTIC HIEN
l.
LAP CHi
MUC
...............23
1.1 Phucrng ph6p chi mgc nghich
1.2
d6o..........
X6y dpg c6u truc luu trfi cho tu
nghich d6o..........
2
:
BQ TiM
4
diiSn
...............30
xML........
DocID
Tpp tin chi muc
Tap tin
Lr.ru
...........23
trong phucrng ph6p lpp chi mpc
................26
ruprr{
xAy D\NG TAp rrN
3.2
3.3
..23
...................30
..........30
........31
D0..........
...............32
4.1 Luu tpo chi muc cho h1e...........
4.2 Luu qu6t 6 dia/ thu mqc.........
s. u0 simr
5.1 T6ng qu6t.........
5.2 Tim ki6m
<16
<16
............32
...............33
......34
........34
..................35
CHI.IONG 4: TTT LUAN.....
TAI LryU THAM KHAo
.....41
Trang ix
Chuong
cHrIoNG
1.
Lv
1:
t:
T6ngquan
r6NC QUAN
DO CHQN DE TAr
Tn khi linh vuc c6ng nghQ th6ng tin chi6m linh thi trudng, thi viQc tim kii5m
kh6ng cdn
li
thupt ngfi m6i me v6i ngudi ding nta. Vi6c tra cr?u, tim ki6m c6c tu kh6a
tr6n web kh6ng m6t qu6
vii
gi6y Ae ni6n thi ki5t qu6. D6i v6i nhirng ai
li
tin dd cira
m?ng thi viQc tra tir kh6a tim ki6m thdng tin ld di6u kh6ng tne tnieu AOi vOi frg.
BCn canh d6, nhu cdu tim ki6m th6ng tin tr6n mrly tinh c6
nhin cflng kh6 quan
trgng, viQc truy xu6t dtln c6c nQi dung b€n trong thu muc ctng nhu vi6c tim kitim th6ng
tin tr6n webpage vQy, d6u trA vC kiSt qu6 nhu y bgn mong dqi vC dich vp mi bpn
str
dung.
V0 nguy6n t6c viQc luu trfr th6ng tin vd tim ki6m th6ng tin thi don giin. Gi6 su
tii liQu vd mQt ngudi mudn tim c6c tdi li6u li6n quan di5n y€u cAu cria
minh. Ngudi d6 c6 th6 tlqc tet cir c6c tii li6u trong kho, git lui circ tii lieu li6n quan vi
c6 mQt kho chrla
b6 di c:ic tdi liQu kh6ng li6n quan. R6 rdng
giii
phrlp ndy kh6ng thuc t6 vi t6n r6t ntriOu
thdi gian.
Chinh vi vfy, vi6c xdy dUng c6ng cp tim kii5m tpp tin vdn bin theo nQi dung ld
mQt rt6 tdi it nhiAu c6 sric
inh huong vi sy quan tdm cria ngudi dung
hiQn nay. Dua
tr6n c6ng cp ndy chirng ta chi cdn nh6 mQt vdi tri kh6a c6 li6n quan d6n nQi dung
di5
tim ra tfp tin d6.
2. TiNH nixH NGHTEN cLfu LrEN euAN
2.1 C6ng cg tim ki6m cfra Windows
2.1.1
Uu di6m
Trong c6c hQ di6u hinh cria Windows ddu c6 sEn c6ng cu tim kii5m d6y ld
chucmg trinh tiQn ich girip tim ki6m cilc th6ng tin trong hQ th6ng, cric th6ng tin ndy c6
the
le cilc tQp tin, thu mgc, tim mQt mriy tinh trong
hQ
th6ng m?ng ho{c tim mQt ngudi
,^
ndo <1o trong sd dia chi Address book vd ngodi ra c6 th6 tim
Internet.
Trang
I
th6ng tin tr6n mAng
Chuong 1: Tdngquan
-
Tim
chc tQp tin hinh 6nh, 6m thanh ho{c video. D6nh d6u vdo c6c 6 iI6 chqn
lo4i tap tin vd nhap t6n cira tQp tin mu6n tim, c6 th6 chi can nnap mQt phin cta t6n
tQp
tin, n6u dC tr6ng thi chuong trinh
-
Tim c6c
tQp
sE
tim t6t ch
circ tpp
tin c6 trong
hQ
th6ng.
tin tdi li0u, vin bin, bing tinh,... Chgn c6c thdi ili6m tpo t4p tin
hoflc chgn Don't remember n6u kh6ng nhd vd nhap t6n cira t4p tin mudn tim.
-
Tim tdt ch chc tQ,p tin vi thu mqc, nh4p t6n cta tpp tin ho{c thu muc
cAn
tim,
c6 thi5 nhAp mQt hay nhiAu tu c6 trong nQi dung cira tpp tin d6 tim vd chgn 6 dia cdn
tim trong Look in.
-
Tim mQt m6y tfnh trong
hQ thiSng mAng
ho{c tim mQt nguoi ndo d6 trong sd
dia chi Address book. N6u chgn tim People in your address book thi
sE
xu6t hign
cria sd cho ph6p nhQp c6c th6ng tin vA ngudi mu5n tim, nhflp mQt ho{c tdt ch cilc
thdng tin vd nh6n Find now
-
dt5
tim.
Ngodi c5c thitit l6p tim ki6m co bin, chuong trinh cdn cho ph6p thii5t l6p c6c
ki€u tim ki6m mo r6ng hcrn nhu:
.,
n
"
Thdi gian tpp tin dugc tpolsua.
Kich thu6c cta
tQp
tin.
Chgn c6c thi6t l4p mo rQng.
,,, Chon kieu dinh dpng tpp tin mu5n tim.
,,, Tim trong c6c thu mpc hQ thting.
.
.
Tim c6ct$p tin vi thu mpc An
"
o
Tim chinh x6c t6n theo cht thuong holc chfr hoa.
Tim trong c5c thu mqc con.
Tim trong c6c thi6t bi sao ltru.
2.1.2 Nhuqc di€m
ViQc tim kii5m thi r6t hiQu qui, c6 th6 tim dfng theo tu kh6a md b4n nhflp,
nhung ktit qu6
tri
vd thi vO s5, ban phii trich lgc ra dO chgn dugc tpp tin md b4n cAn.
2.2 Cdngcg tim ki5m cnia Google
2.2.1
Uu di6m
Trang 2
Chuong
t:
T6ngquan
Ding Google Desktop dC tim kii5m tr6n PC dA ding vd nhanh ch6ng hon
. .x
nhieu so voi c6ng cu tuong t.u tich hgrp sIn trong Windows. N6 don gi6n
vi
tiQn
dung.
Tru6c h6t, Google Desktop khoi dQng nhanh hon c6ng cg tim ki6m ctra
Windows. ngudi ding chi
cin nh6y k6p vdo bi6u tucrng cria chuong trinh trong
th6ng
dO
mo trang tim ki6m trong trinh duyQt, ti6p d6 96 tu kh6a tim ki6m
vd nh6n Enter
vi
c6c ktlt
khay
hQ
qui
sC
xu6t hiQn. H9 cfing kh6ng cAn chgn gita tim ki6m
theo tOn file hay theo nQi dung ho{c
ci
mQt
lAn.
Thri hai, Google Desktop lo4i b6 nhu ciu su dUng c6c c6ng cu tim kiiSm
vi
chpm
kh6 tich hqp trong c6c phdn m6m web, e-mail vd vin phdng cria
Microsoft.
Tinh nlng tim ki6m nhanh, tilt ch mQt 6n cria Google Desktop girip nguoi
dung khic phuc dugc sp hing tfng khi mu5n tim mQt
tii liQu md kh6ng nh6 rd
ki6u file. Hcrn nira, n6u chuong trinh khdng cho ra k6t qui mong mu6n, mQt cri
nh6p chuQt
vio
du
link trOn cua s6 Google Desktop
s€
gtii truy v6n d6n ngudi
anh'em Google Web Search cira n6.
Nguoc lpi, m6i lAn nguoi dirng dua ra m6t lQnh tim ki6m tr6n giao diQn
th6ng thulng cria Google, k6t qui tim ki6m tir Google Desktop s€ dugc b6 sung tp
dQng
phia tr6n mdn hinh.
2.2.2 Nhtqc
diAm
Sg ri6ng tu cira bpn sE bf soi m6i (khi b4n sri dpng chung mdy tinh vdi ngudi
kh6c Google Desktop hi6n thi tht cir dii liQu cria mgi ngudi, tru phi duqc thi6t lap
preferences di5 lo4i bo mQt si5 thu muc hoic dinh dang.
hqp cria bpn, hdy cdn nh6c
2.3 C6ng cg
2.3.1
Vi vfy, ntiu ddy li trubng
k! tru6c khi cdi d[t chucmg trinh).
tim ki5m cria Ping
Uu di6m
Tim ki6m Ti6ng ViQt cira Ping nhin chung kh6 t5t. Ping c6 vC uu tiOn nQi dung
hcrn vd c5c
k! thupt thdn thiQn v6i m5y tim ki6m.
Trang 3
Chuong
V6i m6i ki5t qui tim ki6m, b4n c6 mQt si5 th6ng tin mi6u
thudng
li
c6c mi6u
ti
snippet md Google hay su dgng vd
ti
t:
T6ngquan
chi tii5t hcrn vd trang,
d{t bigt
la sitelink. C6c
sitelink niy minh nhan thAy Bing to raxhc dfnh don gi6n hon dga vio menu vd s5 l6n
xuAt hign hon
li
ngir nghia r,hung lai hi6u qui kh6ng k6m Google, thQm chi ld cdn
chinh x6c hcrn.
Ngoii ph6n nQi dung k6t qui tim ki6m vi pop-up th6ng tin trang ktit qu6, m6y tim
ki6m Bing cdn cung cdp cho b4n bing gqi y c6c tu kh6a li6n quan
vi llch st c6c tim
ki6m de thgc thi.
Microsoft dang c5 ging khai th6c vdo di6m mpnh cfing ld chd tr6ng cira Google ld
phdn giao diQn.
2.3.2 Nhuqc di6m
Chua c6 phi€n bdn ti6ng
vi6t, ld c6ng cu cdn non k6m.ViQc t6ch loc tiiing ViQt
trong c6c dlnh d4ng phirc tpp hon nhu Flash, PDF chua dugc ttit.
Ciic c6ng cg h5 trg Webmaster cria Microsoft cdn b6o thir vdi cdc dinh dpng vd
.A
cnuan neng.
3. MUC TrEU VA NHrEM VV CUA KHOA LUAN
3.1Mgc ti6u
NQi dung kh6a lupn ld mdn hoc khri m6i chua dugc hgc trong chuong trinh, cAn
du-o.
c nghiCn cr?u vd hgc h6i. Girip chring ta rdn luyQn
k! nlng
ph6n tich, xu ly v6n dA
vi khi ndng tu duy itQc lfp, cirng v6i sy trg girip cria gi6ng vi€n hudng din tin ring sC
hodn thinh ctuqc dc tdi xdy dpg c6ng cg tim kitim tap tin vdn b6n theo n6i dung.
3.2 NhiQm vg
.
,,
Tim hi€u vC hQ th6ng tim ki6m th6ng tin.
HQ thdng
tim kii5m tfp tin tr6n m6y tfnh
c6 nhdn.
,,,
..,
BQ phan lAp chi mpc. C6c phuong ph6p ldp chi mqc.
.
B0 phQn tim ki6m.
,,,
XAy d1mg c6ng cg.
Phuong phrip luu
trf
b0 chi mgc.
Trang 4
Chuong 1: Tdngquan
"
LQp trinh tr6n Windows Form: Dgc nQi dung tpp
tin vin bin (Word,
HTML, HTM, TXT, RTF...).
"
Luu trfr chi mgc theo c6u truc B-Tree.
4. HTIoNG GrAr euyEr
Dti x6y dtmg c6ng cp tim kii5m t4p tin vdn bin ta c6 nhirng huong
sau:
o
o
o
o
Ng6n ngt lfp trinh C#.
C6ng cU lap trinh Microsoft Visual Studio 2010
Lpp chi muc bing phuong phrip chi muc nghich d6o
Su dqng B-Tree dC ltru chi mpc
Trang 5
giii
quytit nhu
Chuong 2: Co so l!,thuyAt
CHTIONG 2: CO SO
Lf THUYNT
1. r,f ruuynr
1.1Tim ki6m th6ng tin
1.1.1 Dinh nghia
Dinh nghia hQ th5ng tim ki6m th6ng tin cua mQt s6 t6c giit:
Salton (1989): "H€ th6ng tim ki6m th6ng tin xtr
ly cictpp tin luu vd nhirng
ve th6ng tin, x5c dinh vd tim tir c6c tpp tin nhirng th6ng tin
v0 th6ng tin. ViQc truy tim nhirng th6ng tin d{c
th6ng tin dugc luu tr&
c5c thuQc
thi
y6u cAu
phi hqp v6i nhfrng y6u c6u
php thuqc
vi sp tuong thc gifia cic
vi c6c y€u cAu, dugc dSnh gi6 bing c6ch so s6nh c6c gi5 trf cria
tinh dugc luu trft vd c6c y6u ciu vd th6ng tin".
Kowalski (1997): "HQ th5ng truy tim th6ng tin ld mQt hq thiing c6 khi ndng luu
trfr, truy tim vd duy tri th6ng tin. Th6ng tin trong nhirng trucrng hgp ndy c6 th6 bao
g6m vrn bin, hinh 6nh, 6m thanh video
"MOt
hQ
vi
nhfrng dtii tugrrg da phucrng ti6n kh6c."
th6ng tim ki6m th6ng tin ld mQt phAn m6m girip ngudi su dpng tim ki6m
th6ng tin (information) h9 cAn. Hay mQt hQ thdng th6ng tin girip ngudi sir dung tim
ki6m nhirng
tii
liQu (document) chr?a nhtng th6ng tin (information) hg c0n
dung sE rut trfch th6ng tin cAn thi6t
"Truy tim th6ng tin
li
fi
vi ngudi su
nhirng tdi liQu d6."
tim ki6m th6ng tin (thucmg
li
c6c tdi li€u) o mQt d4ng kh6ng
c6 c6u truc (th6ng thudng ld v5n bin), th6a m6n nhu c6u th6ng tin tir trong nhirng
ngu6n thdng tin lon (duo. c luu trft tr6n c5c m6y tinh)." [irbookprint]
Tim kii5m thpng tin (information retrieval): tu mQt ngudn r6t ntri6u
ti6ng n6i, tim ra rrhirng tQp c6 nQi dung li6n quan tltin mQt v6n
(hay
tri loi). Di6n hinh cria c6ng nghQ niy li
tQp
vin
bdn hay
(c6u h6i) ta c6n bitit
Google, mQt he tim ki6m th6ng tin tr6n
Web, md hAu nhu chring ta d6u dung thuong xuy6n.
Hi6u don giin
hQ
th5ng tim ki6m thdng tin
sfr dgng tim ki5m th6ng tin mQt
1.1.2 Mrlc ti6u vi
chrfrc
li
mQt hQ th6ng h5
cich nhanh ch6ng vi d6 ding.
ning
Trang 6
trg cho ngutri
Chuong 2: Co so l!,thuy€t
MUc ti6u cira mQt h€ th6ng tim ki6m thdng tin ld tim ki6m
ldi cho nhu
cAu thdng
vi tri
vA c6c tdi liQu
tin cira ngudi ding.
MOt he th6ng tim ki6m th6ng tin c6 hai chric ning chinh d6 le lpp chi mpc
cr?u hay
tri
vi tra
tim ki6m:
o
tin tir
Lap chi mpc
tii
liQu vd biOu di6n lpi
"tt",
c6 thti le
o
li giai do4n phin tich tni liQu dC rut trich c6c
tii li6u boi c6c dcrn vi th6ng tin d6. Ecrn vith6ng
hopc phric tpp hon ld "cpm
tli",
ookh6i
tin
niQm", ...
Tra criu (hay tim ki6m) ld giai dopn tim ki6m trong co sd dfr liQu Oieu
di6n cho
cfuc
titi lieu boi c5c don v! th6ng tin trong chric n6ng lap chi muc) nhirng
tdi liQu pht hqp v6i nQi dung c0u truy v6n.
1.1.3 Ki6n trric
Theo
I
thuy6t thi nQi dung cdu truy v6n c6 th6 so srinh tryc ti6p vdi nQi dung cta
tii liQu dC t6y ra nhlrng tdi liQu 1i6n quan dtin nhu ciu ngudi ding.
Nhtmg tr6n thyc tis thi kh6ng thti thpc hiQn nhu vpy dugc vi c5c ciu truy v6n vir cric
tdi liQu dugc luu trii d c5c d4ng kh5c nhau, nguoi dirng khi doc vdo n6i dung c6u truy
v6n
vi tni liQu c6 th6 tu duy ra dugc m5i quan hQ giffa chring.
Vi vpy trong ki6n truc cria mQt hQ th6ng tim
d0
thfc
kii5m th6ng tin c6 bQ phpn trung gian
hiQn c6ng vi6c chuytin tlOi c6c th6ng tin cira
tii
liQu thinh thdng tin
du-o.
c biiiu
di6n trong hQ th6ng, d6 le b9 phpn lpp chi mgc.
BQ phan thri hai ld b0 ph0n
tim ki6m,
bQ phpn
niy c6 chric nrng phdn tich nQi dung
c6u hoi (truy v6n) cria ngudi ding thdnh nhirng th6ng tin gdn gi6ng nhu trong hQ th6ng
di luu trfr, sau d6 so sSnh th6ng tin ndy vdi th6ng tin c6 trong hQ th6ng de tra ve t6t
que.
a)
B0 phqn lQp chi muc
NhiQm vU chfnh cria bQ ph4n
niy li:
dgc nQi dung tu mQt kho tdi liQu l6n,
t6 etin
phdn tich, thting k6, bii5u diSn lai nQi dung ctia ttng tdi liQu trong kho vd luu nhirng
th6ng tin
niy vdo co so dt
liQu theo mQt c6u trirc
Co so dfr li0u ndy ggi ld tap tin chi muc cta
hQ
Trang
7
nio d6 theo quy dinh cira hQ th6ng.
thdng.
Chuong 2: Co so l!,thuyAt
C6 nhi6u c6ch dO bi6u di6n lai nQi dung cria tdi li6u trong tpp tin chi muc nhu: nQi
dung g5c, rfit trich
Vi
ngt nghia, rut trich
dflc trung cria tdi liQu,...
vln bin thi ta c6 th0 sri dgng nQi dung gdc hoic y chinh
dp: mQt tdi liQu
tpp tin d6 dC l6p chi muc;
ctia
tii liQu ld hinh enh hoic 6m thanh thi chirng ta l6y ra d{c
tnmg cira tdi liQu g6c Ae tpp chi mgc.
TQp
tin chi muc sC dugc hQ th6ng luu trt, quin li, bd sung, cQp nhflt,...
trinh tim ki6m
sE diSn ra
tr6n nQi dung cria t4p tin ndy
vi
qu6
mi khdng cdn t5c dQng d6n c6c
tii lieu g5c trong kho.
b)
B0 phdn tim ki6m
Th6ng tin ngudi dung cAn tim sE dugc nh$p thdng qua mQt giao diQn (c6 the le
giao diQn web), thong tin
nhi€n hay mQt c6u trfrc
phdn tich
vi
niy co th6 0 d4ng mQt c6u hoi gAn gi6ng v6i ng6n ngfr t.u
nio d6 do hQ th6ng qui udc. C6c cdu hoi niy
s€ dugc hQ th6ng
bi6u di6n lpi o mQt hinh thric m6i gi6ng nhu phAn lap chi mpc, khi
dung dugc bi6u di6n l4i
<16
nQi
niy "d6ng d?ng" v6 mpt c6u truc v6i tpp tin chi mgc vi chfng
ta cfing dC dang so s6nh tryc tiifp gifra hai sg bi6u di6n ndy.
l.2lJlg thting tim ki6m trGn m6y tinh ci nhin
1.2.1 Girfi thiQu
HQ th6ng tim ki6m tdi liQu tr6n miiy tinh c6 nhdn ld tim kii5m th6ng tin ngudi
ding tr6n c6c lopi t4p tin vdn bin
ptrO Uitin nhu Word,
HTML, HTM, TXT, RTF... vd
hi6n thi nhfing tpp tin c6 1i6n quan d6n tir khori ngudi ding nh4p vio.
1.2.2 C6c vin dd
HiQu qui cria qu6 trinh lpp chi muc cho c6c tdi
nhiAu
liOu luu
vio ytiu t6 b9 doc cho ttng lo4i tpp tin kh6c nhau. Do
trfi tr6n dia phu thuQc r5t
c6c lo4i tdi li6u kh6c nhau
thi c6u truc kh6c nhau vd v6n d0 doc chinh x5c nQi dung vd c6u truc cria chting ld tucrng
d6i kh6 kh6n.
Tucmg
ty nhu cric hQ th6ng tim ki6m tr6n web,
hQ
th6ng tim ki6m tr6n m6y tinh
c6 nhdn cflng vfly, qu6 trinh lpp chi mUc diSn ra nhanh hay chgm tuj,thuQc vdo ki6n
thric vd kinh nghiQm cta nguoi
xiy dpg
hQ
th6ng vC nhi6u y6u tri nhu: c5u tnic chi
mqc (c6u tnic ciy, bing b6m,...), c6u trric file AC tuu trir chi mqc (file dpng text, nhi
ph6n...).
Trang 8
Chuong 2: Co sd l!,thuy€t
v6n rld trong tim ki6m thdng tin
Ke fir nhirng nim 40, c6c v6n dO trong viQc luu
1.3 MOt s6
tin itd thu hirt sp chir
trt
th6ng tin vd tim kii5m th6ng
y rdt l6n. V6i mQt lugng th6ng tin kh6ng 16 thi viQc tim ki6m
chinh x5c vir nhanh ch6ng cing tro nOn kh6 khdn hon. V6i sU ra ddi cria m5y tinh,
nhiAu
rit
y tuong l6n dugc dua ra nh[m cung c6p mQt hQ th6ng tim ki6m th6ng minh vd
chinh x6c. Tuy nhi6n v5n
dC
tim kii5m sao cho hiQu qui vdn chua dugc gi6i quy6t.
Vd nguy6n t6c viQc luu trft
vi tim ki6m
th6ng tin thi dcrn gi6n.
Gii
su c6 mQt kho
chria c6c tdi liQu vd mQt ngudi mu6n tim cric tdi liEu li6n quan d6n y€u c6u cta minh.
Ngudi doc c6 th6 dgc tdt chc6c tdi liQu trong kho, gifi lpi c6c tii liQu li6n quan vd bo di
crlc tdi li€u khdng li6n quan. RO rdng
giii
vi ti5n r6t nhi6u thoi
ph6p ndy khdng thUc t6
gian.
V6i
sU ra
ddi cria m6y vi tinh c6 t6c dQ cao, m6y tinh c6 th6 "dec" thay cho con
ngudi d6 trich ra chc tdi liQu c6 li6n quan trong toan b$ tflp dfi li6u. Tuy nhi6n vAn d6
li
hic niy
ldm sao d6 x6c dinh dugc tdi liQu nio li6n quan di5n c6u hoi. Mpc dich ctia
mQt hQ thiSng
tim ki6m th6ng tin tg ilQng ld truy luc dugc t6t ch cic tdi liQu c6 li6n quan
-a^l yeu cau.
den
2.
THI/C HANH
2.lLfp
HQ
chi mgc
th6ng lpp chi muc hay cdn ggi ln hQ th6ng phin tich vh xtr
li dt liQu, thgc hiQn
viQc phdn tich, trich lgc nhfrng thdng tin cAn thi6t (thuong ld c6c tir dcrn, tu gh6p, cgm
tu quan trgng) tu kho
tii
lieu thu th4p dugc vd t6 chr?c thinh
sd dfi liQu ri6ng
ccy
tim ki6m tr6n d6 mQt c6ch nhanh ch6ng, hiQu qu6. HQ thdng chi mgc
thi5
c6c tir kho6, chi rd c6c tu khoil niro xu6t hiQn trong t4p tin nio, vi
o
c6
li danh s6ch
tri nio.
C6c phucrng phSp lap chi muc
2.1.1 M6 hinh tra cru luqn ly
C6ng viQc chri yiSu khi xdy
kho
tii
dpg
mQt hQ thting tim ki6m
li
viQc lap chi mpc cho
li6u ctia chring ta, sau qu6 trinh lpp chi mgc thi nQi dung cria chi mpc
diSn nQi dung cria kho tdi li6u o mQt dpng c6u truc ndo d6 do hQ th6ng quy
Trang 9
sC biiSu
Chrong 2: Co sd $ thuyiit
C6u truc chi muc co
cr?u lu?n
bin vi
dcrn gian nh6t
ly (boolean retrieval boolean). Cdu truc chi muc niy
DC minh hoa cho qu6
trinh xdy
hi6u 16 v6n d6: tim kii5m t6n c6c nhdn
dpg
duo. c
x6y
dpg
Ae UiCu
tii liQu.
di6n cho tdt chc6c tu (word) xu6t hiQn trong
Sau qu6
li c6u truc chi mpc theo m6 hinh tra
chi muc theo c6ch tr€n, ta xem vi du sau d6
vft trong c6cthcphAm
trinh thu th4p c6c t6c phAm v6n hgc cria Shakespeare
cria nhd
vi
v[n Shakespeare.
th6ng kC c6c tu ngt ta
dpg bing th6ng kd g6m t6n nhdn vflt vd t€n c6c vo kfch. Bing thting kC c6 th6
dugc m6 ti nhu sau:
xdy
'l'eic
Anton-\,
anrl
Clleopatra
r\ntorr,v
Z
'l'h*
Julius 'lentpest
('easar'
I
[]rutus
I
I
I
I
0
I
I
0
Mercy'
I
0
w0rser
I
0
Bdng
3.
t
BiAu di6n cdu
ffilc ltru tric chi
I
lamlet Othello
00
I
('easar
Calpumia
('lcopatra
-c
phim
0r
{lt
00
{i0
II
l1
mqtc theo m6 hinh
N4acbcrh
0
I
t)
0
I
I
0
0
0
0
I
I
I
0
tra ctat luQn t!'
Circ gi6 tri trong ma trQn (ki6u boolean) dnng dC bi6u di6n cho su c6 m{t cira
nh6n vpt trong vo kich. Gi6 tri ndy
gi6 tri ndy
li
1 (true) niSu nhdn
vft
c6 trong vo kfch, ngugc lpi
li 0 (false) ni5u nhin v4t kh6ng c6 trong vo kich.
Chfng ta c6 c6c vector bits m6i tu bi6u di6n trong kho
tii
liQu.
Vi du tu
"Antony" le 110001 (1 ddng thuQc ma trfn tr6n).
Trong m6 hinh Boolean chring ta c6 th6 d{t ra U6t tcy truy v6n nio dudi dpng
m6t bii5u thfc logic tr6n tir ngt, trong d6 circ tri ngfr ndy k6t hqrp vdi nhau boi crlc phdp
to6n AND, OR, vd NOT.
Trang 10
Chuong 2: Co so lj,thuyAt
Gi6 su y6u cAu tim ki6m ld c6c vo kich c6 chria 'oBrutus"
vi "Caesar"
nhtmg
"kh6ng c6 Calpurnia". Khi ti6 c6u truy v6n c6 th6 bi€u ditin nhu: Brutus AND Caesar
AND (NOT Calpurnia).
ViQc tim ki6m sC ti6n hdnh ph6p tinh sau:
110100
AND 110111 AND (NOT(o10000;
110100
AND 110111 AND 101111
Gi6
tri
dey bit tr€n
ki5t
F
1 the hiQn
:
:
100100
vo kich o vi tri tuong img
li
ktit qui cin
tri
ve. Oi6n gi6i cho
li vo kfch
qui tri vii.
Cric hpn ch6 c6 th6 th6y dugc o m6 hinh ndy d6 ld:
o
Ktit qui cira m6 hinh ndy le
1
tei li6u
se dugc x6c
li
dirng hay kh6ng dring
Otii vOi 1 c6u truy v6n tim ki6m (tim chfnh x6c). Hon nfra c6c tdi liQu tr6 vC chi c6 th.5
dugc s8p xtip dga vdo id cira c6c tii liQu d6.
o
C6u truy v6n ph6i o dpng cria mQt biiSu thric logic vd nhu th6 se r6t kh6
khln cho
ngudi sri dgng th6ng thudrng.
o
c6c
M6 hinh tra
cr?u lupn
l;f sri d\rng m6t ma trpn nhi phdn
tii liQu, viQc x6y dlrng ma trdn ndy ld rilt "ldngphi".
liQu titing Anh, vd m6i tai liQu c6
tdi liQu). Ta
chring ta
1
sC
tlO
Gi6 su chring ta c6 1 triQu
tii
khoing 1000 tu (tucrng duong 2-3 trangtr6n m5i t4p
c6 khoing 500.000 tu kh6a ph6n biQt AOi vOi ng6n
phii x6y dpg
bi6u diSn tpp chi mUc
mQt ma tr4n g6m
lM x 0,5M
:
ngt Anh, c6 nghia ld
500 tj, phAn tir, trong d6 chi c6
tf phin tti mang gi6 tr! 1.
2.1.2 Phrong phdp chi mqc nghich ddo
V6i nhirng h4n ch6 d5 n6u,
li
qu5 trinh lap chi muc
th6y m6 hinh tra ct?u lufln ly kh6ng th6 6p dqng d€ xdy d1mg c6c
Phucrng ph6p lap chi muc nghich
phdp
niy
sO du
hQ
tii
liQu, ta nhpn
th6ng tim ki6m lcyn.
dio ra ttdi nhim gi6i quyiit hpn chti tr6n.(Phuong
nghiCn cftu & phAn nQi dung thuc hiqn).
Trang I I
Chuong 2: Cct sd ty thuyiit
2.2 Gi6n thiQu B-Tree
Trong thrc t6 c6 nhi€u chi muc kich thudc cria n6 ld r6t lon (thgc t6 1 table
trong thgc t6 s5 record cira n6 khoang hon
phii
du-o.
I
tripu ddng ld diAu binh thucmg) n6n cin
c luu tr6n cric thi6t bi phAn ctmg (thi6t bi nh0p xuSt - HDD), ngdy nay khi md
dung luong cria c6c thilit bi ltru trii ting l6n r6t nhanh vd gi6 thenh 16 n€n cAn 1 phucrng
ph6p truy xdt nhanh tr6n 1 lugng dir liQu lon nhu
B-Tree
li
dugc chia
1 c6u
vay.
truc dfi liQu d4ng cdy 'da phdn c6n blng'c6u truc 1 node cta n6 thuong
lim nhi6u phAn,
cdy dugc t6 chric
voi I qui tic nh6t
(thuong thi gi6 tri cria c6c node tdng din tu trrii qua ph6i, node con b6n tnli
sE
tu diu
chria gi6
tri nh6 hon node con b6n phai...)
B-Tree
li
chinh gita chting
hpn boi
1 sU t6ng qu6t h6a cria c6y nhi phdn tim kiiSm @ST) sg kh6c biQt
li
s6 lugng node con cria B-Tree n6 sC nhi6u hcrn chr? kh6ng bi gioi
2 phdntu nhu cria cdy nh! phin tim ki6m. Muc ti€u o ddy li lim
sao d6 gi6m
ttii da sti tAn truy xu6t cric thitit bi ngopi vi o ddy ld fia, tpi b6t cri thoi di6m nio chfng
ta sE thir ilinh vi nhfing bin ghi, chring ta mu5n dQ cao cira ciy da phin ndy phii le nh6
*r6t
Ae
giim chi phi tim kii5m, de dat duoc muc ti6u niy thi sd lugng nh6nh trong c6y
phii
l0n.
Ni5u B-Tree c6
bfc ld m thi I node ctra n6 sE c6 tOi ea m nh6nh con vd c6 m-l kh6a <16
tim ki6m.
o
C6c reord chi ilugc luu 0 nrit 16 cria cdy
"
T6t
ci cfuc nirt (ngopi tru ntit g6c) c6 tOi tnieu DL rnlz vi t6i da ld m ciy(nh5nh)
con.
o Nrit g6c vd 16 ph:ii it nh6t c6 2 con.
o B6n trong 1 node luu gi6 tri tim ki6m, vdi chi dugc sri dung nhu nhtng "ngudi
chi dudng" cho qu6 trinh tim ki6m, nhtng kh6a niy cin phii dugc s6p x6p theo 1 thf
tg thting nh6t tu tru6c (thucmg thi t[ng dAn tri tr6i qua phai).
o
Tuy thuQc vio kich thudc cria 1 record so vdi kich thu6c cria 1 kh6a thi node
cria B-Tree bpc m c6 th6 luu nhi6u hon ho{c
Trang 12
16
it hon m records. Th6ng thudmg kich
Chuong 2: Co sd
thu6c cria 1 node ld bQi sO cria
I
ttri5i
af
fiQu trong O Aa
ft6n windows m{c
tf
thuydt
5l2byte).
o
Cdc node
16
cria cdy thudng duqc li6n kt{t voi nhau de
e" thenh I danh s6ch li€n
tiSt, aieu nny nhim mgc ti6u la clti khOng phni duyQt tro lgi node g6c trong qu5 trinh
drryQt
tim kitim.
2.2.1 DinhnglrTo
MQt B-hee bAc N co cdc d{c finh sau:
(+) M6i node c6 tOi
(+) M5i node ho[c
li
node
16
hofc c6 t6i thiiSu Nl2+l node con
Vf dp:
J.
8
34.37.43
t2. t4.16 t9
78.9{J
Ilinh 2.1. B-tree bdc 2 cd 3 m*c
Khai brio:
#define ORDER 5
typedef struct tagNODE
{
NumTree;
int
int
ll sd cdy con crla node hiQn hdnh
Key[ORDER-ll; ll mdng lmt trii cdc khod cfia node
tagNODE* Branch[ORDER]; // cdc con trd chi d€n cdc node con
*NODEPTR, +BTREE; ll contrd node
NODE,
)
NODEPTR
Root;// contrd node gdc
Trang
Ij
Chuong 2: Co sd ly thuyAt
2.2.2
Tim kiilm ffong B-tree
Ifinh 2.2 : Tim kiiim gid
X6t node trong hirdr.2.2, kh6a c6n tim
ti X
li X. Gia sri nQi dung cria X nim trong bQ
nh6- Voi m dri lon ta sri dgng phuong ph6p tim ki6m nh! phen, ntiu m nh6 ta sri dung
phuong phrip tim kitim tuen tg.
N6u X kh6ng th6y sE co ba trudng hqp sau:
(+) Ki < X < K;*1. Tiiip tqc tim kiiSm trin c6y con Ci
(+) Km < X.
(+) X
Tiiip tgc tim ki6m tr6n Cm
1Kr.
Ti6p tqc tim ki6m tr€n C6
Qu6 trinh ndy ti6p tuc cho
15
md v6n kh6ng tim thAy khoq viQc tim ki6m
a)
di di diSn node
h th6t bai.
Hdm NODESEARCH
Tra v€ v! tri nh6 nhSt cria kh6a trong nrit p bit dAu lon hon hay bing k. Trudng hqp
k lon hon
dt cir cickh6a
trong nrit p thi tra vc vi tri p-> NumTrees-l
int NodeSearch (NODEPTR p, int k)
{
i:0;
int
for (i:0; i< p>NumTrees
refurn i;
-l &&p->Key[i] < k; i++;;
)
Him NodeSearch tlugc dirng
aC
tim kh6a k c6 trong nrit p hay kh6ng. N6u kh6a k
kh6ng co hong nut P thi ph6p toiin ndy
cria p a6 ti6p tUc tim kh6a k trong
tri
vC
vi tri girip chfng ta chgn n6t con phn hqp
ntt con niy.
b) Hdn SEARCH
Tim kh6a k tran B-Tree. Con tn6 p xuSt ph6t t& gtic vd tti xuting c6c nhrffi c6y
con ph[r hqp iI6 tim kh6a k co trong mQt n6t p hay kh6ng.
N6u c6 kh6a k t4i nrit p tr6n cdy:
+ Bi6n found tra vA g16
H TRUE
Trang 14
Chuong 2: Cct so lj,thuydt
+ Him Search( ) tr6 vC con tr6 chi nft p c6 chria kh6a k
+ Bi6n position
tri
vC
vi tri cira kh6a k c6 trong nirt p ndy
N€u kh6ng cd kh6a k trOn cdy: liuc niy p
th6m kh6a k
'
: NULL vi q (ntt cha cira p) chi ntit 16 c6 th6
vio nrit niry dugc.
+ Birin found
tri
vC giltriFALSE
tri
+ Hdm Search( )
vC con
tr6 q li nrit
16
c6 th6m
nft k viro
+ Bitin position trd v€ vi tri c6 th6 chdn kh6a k vio nrit ki q ndy
NODEPTR Search(int
{
b int *pPosition, int *pFound)
int i:0;
NODEPTR p : Root, q :NULL;;
while (p !:NULL)
t
i:
NodeSearch (p, k);
if (i< p->numtress-1
t
&.&k::
p->Key[i])
*pFound :
*pPosition
return
:
p;
TRUE;
i; ilL'i rt'i ti*,t thriv l;kd* k
ii Node t:ti chwt *lxi* K
)
q:
p:
p;
p ->Branch[i];
)
ri' ({1ung tnet i}tiir'. lirr: na3 p
'N{ll.l." r,a ul !a nrttle lii *ri rltc tl:rnr khriii
r,ii* node ntirr'" pitosizir:n l;\ r,i tri u* tl:c *hin khi>a k *,,
*pFound : FALSE;
*PPosition: i;
return g, .. I lit r * notlc l*
)
2.2.3
DuyQt cdy B-tree
DuyQt c6c kh6a cira B-Tree theo
r
tht
t.u
tu nh6
void Scan(NODEPRT pRoot)
t
if (pRoot:: NULL)
retum;
Trang 15
qui
k