Learning Apache Mahout Classification
Table of Contents
Learning Apache Mahout Classification
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Classification in Data Analysis
Introducing the classification
Application of the classification system
Working of the classification system
Classification algorithms
Model evaluation techniques
The confusion matrix
The Receiver Operating Characteristics (ROC) graph
Area under the ROC curve
The entropy matrix
Summary
2. Apache Mahout
Introducing Apache Mahout
Algorithms supported in Mahout
Reasons for Mahout being a good choice for classification
Installing Mahout
Building Mahout from source using Maven
Installing Maven
Building Mahout code
Setting up a development environment using Eclipse
Setting up Mahout for a Windows user
Summary
3. Learning Logistic Regression / SGD Using Mahout
Introducing regression
Understanding linear regression
Cost function
Gradient descent
Logistic regression
Stochastic Gradient Descent
Using Mahout for logistic regression
Summary
4. Learning the Naïve Bayes Classification Using Mahout
Introducing conditional probability and the Bayes rule
Understanding the Naïve Bayes algorithm
Understanding the terms used in text classification
Using the Naïve Bayes algorithm in Apache Mahout
Summary
5. Learning the Hidden Markov Model Using Mahout
Deterministic and nondeterministic patterns
The Markov process
Introducing the Hidden Markov Model
Using Mahout for the Hidden Markov Model
Summary
6. Learning Random Forest Using Mahout
Decision tree
Random forest
Using Mahout for Random forest
Steps to use the Random forest algorithm in Mahout
Summary
7. Learning Multilayer Perceptron Using Mahout
Neural network and neurons
Multilayer Perceptron
MLP implementation in Mahout
Using Mahout for MLP
Steps to use the MLP algorithm in Mahout
Summary
8. Mahout Changes in the Upcoming Release
Mahout new changes
Mahout Scala and Spark bindings
Apache Spark
Using Mahout’s Spark shell
H2O platform integration
Summary
9. Building an E-mail Classification System Using Apache Mahout
Spam e-mail dataset
Creating the model using the Assassin dataset
Program to use a classifier model
Testing the program
Second use case as an exercise
The ASF e-mail dataset
Classifiers tuning
Summary
Index
Learning Apache Mahout Classification
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, and its
dealers and distributors will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2015
Production reference: 1210215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78355-495-9
www.packtpub.com
Credits
Author
Ashish Gupta
Reviewers
Siva Prakash
Tharindu Rusira
Vishnu Viswanath
Commissioning Editor
Akram Hussain
Acquisition Editor
Reshma Raman
Content Development Editor
Merwyn D’souza
Technical Editors
Monica John
Novina Kewalramani
Shruti Rawool
Copy Editors
Sarang Chari
Gladson Monteiro
Aarti Saldanha
Rashmi Sawant
Project Coordinator
Neha Bhatnagar
Proofreaders
Simran Bhogal
Steve Maguire
Indexer
Monica Ajmera Mehta
Graphics
Sheetal Aute
Abhinash Sahu
Production Coordinator
Conidon Miranda
Cover Work
Conidon Miranda
About the Author
Ashish Gupta has been working in the field of software development for the last 8 years.
He has worked in different companies, such as SAP Labs and Caterpillar, as a software
developer. While working for a start-up where he was responsible for predicting potential
customers for new fashion apparels using social media, he developed an interest in the
field of machine learning. Since then, he has worked on using big data technologies and
machine learning for different industries, including retail, finance, insurance, and so on.
He has a passion for learning new technologies and sharing the knowledge thus gained
with others. He has organized many boot camps for the Apache Mahout and Hadoop
ecosystem.
First of all, I would like to thank open source communities for their continuous efforts in
developing great software for all. I would like to thank Merwyn D’Souza and Reshma
Raman, my editors for this project. Special thanks to the reviewers of this book.
Nothing can be accomplished without the support of family, friends, and loved ones. I
would like to thank my friends, family, and especially my wife and my son for their
continuous support throughout the writing of this book.
About the Reviewers
Siva Prakash is working as a tech lead in Bangalore. He has extensive development
experience in the analysis, design, development, implementation, and maintenance of
various desktop, mobile, and web-based applications. He loves trekking, traveling, music,
reading books, and blogging.
You can find him on LinkedIn.
Tharindu Rusira is currently a computer science and engineering undergraduate at the
University of Moratuwa, Sri Lanka. As a student researcher, he has strong interests in
machine learning, compilers, and high-performance computing.
Tharindu has also worked as a research and development software engineering intern at
Zaizi Asia (Pvt) Ltd., where he first started using Apache Mahout during the
implementation of an enterprise-level content management and information retrieval
system.
He sees the potential of Apache Mahout as a scalable machine learning library for
industry-level implementations and has even contributed to the Mahout 0.9 release, the
latest stable release of Mahout.
He is available on LinkedIn.
Vishnu Viswanath is a senior big data developer who has many years of industrial
expertise in the arena of machine learning. He is a tech enthusiast who is passionate about
big data and has expertise in most big-data-related technologies.
You can find him on LinkedIn.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and
ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as
a print book customer, you are entitled to a discount on the eBook copy. Get in touch with
us for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up
for a range of free newsletters and receive exclusive discounts and offers on Packt books
and eBooks.
Do you need instant solutions to your IT questions? PacktLib is Packt’s online digital
book library. Here, you can search, access, and read Packt’s entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access
PacktLib today and view 9 entirely free books. Simply use your login credentials for
immediate access.
Preface
Thanks to the progress made in the hardware industry, our storage capacity has
increased, and because of this, many organizations now want to store all types of
events for analytics purposes. This has given birth to a new era of machine learning. The
field of machine learning is very complex, and writing these algorithms is not a piece of
cake. Apache Mahout provides us with ready-made algorithms in the area of machine
learning and saves us from the complex task of algorithm implementation.
The intention of this book is to cover the classification algorithms available in Apache
Mahout. Whether you have already worked on classification algorithms using some other
tool or are completely new to the field, this book will help you. So, start reading this book
to explore the classification algorithms in one of the most popular open source projects,
which enjoys strong community support: Apache Mahout.