Data Science from Scratch
Joel Grus
Copyright © 2015 O’Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles. For more
information, contact our corporate/institutional sales department: 800-998-9938.
Editor: Marie Beaugureau
Production Editor: Melanie Yarbrough
Copyeditor: Nan Reinhardt
Proofreader: Eileen Cohen
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
April 2015: First Edition
Revision History for the First Edition
2015-04-10: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Science from
Scratch, the cover image of a Rock Ptarmigan, and related trade dress are trademarks of
O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and the
author disclaim all responsibility for errors or omissions, including without limitation
responsibility for damages resulting from the use of or reliance on this work. Use of the
information and instructions contained in this work is at your own risk. If any code
samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.
978-1-491-90142-7
[LSI]
Preface
Data Science
Data scientist has been called “the sexiest job of the 21st century,” presumably by
someone who has never visited a fire station. Nonetheless, data science is a hot and
growing field, and it doesn’t take a great deal of sleuthing to find analysts breathlessly
prognosticating that over the next 10 years, we’ll need billions and billions more data
scientists than we currently have.
But what is data science? After all, we can’t produce data scientists if we don’t know what
data science is. According to a Venn diagram that is somewhat famous in the industry, data
science lies at the intersection of:
Hacking skills
Math and statistics knowledge
Substantive expertise
Although I originally intended to write a book covering all three, I quickly realized that a
thorough treatment of “substantive expertise” would require tens of thousands of pages. At
that point, I decided to focus on the first two. My goal is to help you develop the hacking
skills that you’ll need to get started doing data science. And my goal is to help you get
comfortable with the mathematics and statistics that are at the core of data science.
This is a somewhat heavy aspiration for a book. The best way to learn hacking skills is by
hacking on things. By reading this book, you will get a good understanding of the way I
hack on things, which may not necessarily be the best way for you to hack on things. You
will get a good understanding of some of the tools I use, which will not necessarily be the
best tools for you to use. You will get a good understanding of the way I approach data
problems, which may not necessarily be the best way for you to approach data problems.
The intent (and the hope) is that my examples will inspire you to try things your own way.
All the code and data from the book is available on GitHub to get you started.
Similarly, the best way to learn mathematics is by doing mathematics. This is emphatically
not a math book, and for the most part, we won’t be “doing mathematics.” However, you
can’t really do data science without some understanding of probability and statistics and
linear algebra. This means that, where appropriate, we will dive into mathematical
equations, mathematical intuition, mathematical axioms, and cartoon versions of big
mathematical ideas. I hope that you won’t be afraid to dive in with me.
Throughout it all, I also hope to give you a sense that playing with data is fun, because,
well, playing with data is fun! (Especially compared to some of the alternatives, like tax
preparation or coal mining.)
From Scratch
There are lots and lots of data science libraries, frameworks, modules, and toolkits that
efficiently implement the most common (as well as the least common) data science
algorithms and techniques. If you become a data scientist, you will become intimately
familiar with NumPy, with scikit-learn, with pandas, and with a panoply of other libraries.
They are great for doing data science. But they are also a good way to start doing data
science without actually understanding data science.
In this book, we will be approaching data science from scratch. That means we’ll be
building tools and implementing algorithms by hand in order to better understand them. I
put a lot of thought into creating implementations and examples that are clear,
well-commented, and readable. In most cases, the tools we build will be illuminating but
impractical. They will work well on small toy data sets but fall over on “web scale” ones.
Throughout the book, I will point you to libraries you might use to apply these techniques
to larger data sets. But we won’t be using them here.
There is a healthy debate raging over the best language for learning data science. Many
people believe it’s the statistical programming language R. (We call those people wrong.)
A few people suggest Java or Scala. However, in my opinion, Python is the obvious
choice.
Python has several features that make it well suited for learning (and doing) data science:
It’s free.
It’s relatively simple to code in (and, in particular, to understand).
It has lots of useful data science–related libraries.
I am hesitant to call Python my favorite programming language. There are other languages
I find more pleasant, better-designed, or just more fun to code in. And yet pretty much
every time I start a new data science project, I end up using Python. Every time I need to
quickly prototype something that just works, I end up using Python. And every time I
want to demonstrate data science concepts in a clear, easy-to-understand way, I end up
using Python. Accordingly, this book uses Python.
The goal of this book is not to teach you Python. (Although it is nearly certain that by
reading this book you will learn some Python.) I’ll take you through a chapter-long crash
course that highlights the features that are most important for our purposes, but if you
know nothing about programming in Python (or about programming at all) then you might
want to supplement this book with some sort of “Python for Beginners” tutorial.
The remainder of our introduction to data science will take this same approach: going
into detail where going into detail seems crucial or illuminating, at other times leaving
details for you to figure out yourself (or look up on Wikipedia).
Over the years, I’ve trained a number of data scientists. While not all of them have gone
on to become world-changing data ninja rockstars, I’ve left them all better data scientists
than I found them. And I’ve grown to believe that anyone who has some amount of
mathematical aptitude and some amount of programming skill has the necessary raw
materials to do data science. All she needs is an inquisitive mind, a willingness to work
hard, and this book. Hence this book.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements
such as variable or function names, databases, data types, environment variables,
statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined
by context.
TIP
This element signifies a tip or suggestion.
NOTE
This element signifies a general note.
WARNING
This element indicates a warning or caution.
Using Code Examples
Supplemental material (code examples, exercises, etc.) is available for download.
This book is here to help you get your job done. In general, if example code is offered
with this book, you may use it in your programs and documentation. You do not need to
contact us for permission unless you’re reproducing a significant portion of the code. For
example, writing a program that uses several chunks of code from this book does not
require permission. Selling or distributing a CD-ROM of examples from O’Reilly books
does require permission. Answering a question by citing this book and quoting example
code does not require permission. Incorporating a significant amount of example code
from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Data Science from Scratch by Joel Grus
(O’Reilly). Copyright 2015 Joel Grus, 978-1-4919-0142-7.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information.
Acknowledgments
First, I would like to thank Mike Loukides for accepting my proposal for this book (and
for insisting that I pare it down to a reasonable size). It would have been very easy for him
to say, “Who’s this person who keeps emailing me sample chapters, and how do I get him
to go away?” I’m grateful he didn’t. I’d also like to thank my editor, Marie Beaugureau,
for guiding me through the publishing process and getting the book in a much better state
than I ever would have gotten it on my own.
I couldn’t have written this book if I’d never learned data science, and I probably wouldn’t
have learned data science if not for the influence of Dave Hsu, Igor Tatarinov, John
Rauser, and the rest of the Farecast gang. (So long ago that it wasn’t even called data
science at the time!) The good folks at Coursera deserve a lot of credit, too.
I am also grateful to my beta readers and reviewers. Jay Fundling found a ton of mistakes
and pointed out many unclear explanations, and the book is much better (and much more
correct) thanks to him. Debashis Ghosh is a hero for sanity-checking all of my statistics.
Andrew Musselman suggested toning down the “people who prefer R to Python are moral
reprobates” aspect of the book, which I think ended up being pretty good advice. Trey
Causey, Ryan Matthew Balfanz, Loris Mularoni, Núria Pujol, Rob Jefferson, Mary Pat
Campbell, Zach Geary, and Wendy Grus also provided invaluable feedback. Any errors
remaining are of course my responsibility.
I owe a lot to the Twitter #datascience community, for exposing me to a ton of new
concepts, introducing me to a lot of great people, and making me feel like enough of an
underachiever that I went out and wrote a book to compensate. Special thanks to Trey
Causey (again), for (inadvertently) reminding me to include a chapter on linear algebra,
and to Sean J. Taylor, for (inadvertently) pointing out a couple of huge gaps in the
“Working with Data” chapter.
Above all, I owe immense thanks to Ganga and Madeline. The only thing harder than
writing a book is living with someone who’s writing a book, and I couldn’t have pulled it
off without their support.
Chapter 1. Introduction
“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.”
Arthur Conan Doyle
The Ascendance of Data
We live in a world that’s drowning in data. Websites track every user’s every click. Your
smartphone is building up a record of your location and speed every second of every day.
“Quantified selfers” wear pedometers-on-steroids that are ever recording their heart rates,
movement habits, diet, and sleep patterns. Smart cars collect driving habits, smart homes
collect living habits, and smart marketers collect purchasing habits. The Internet itself
represents a huge graph of knowledge that contains (among other things) an enormous
cross-referenced encyclopedia; domain-specific databases about movies, music, sports
results, pinball machines, memes, and cocktails; and too many government statistics
(some of them nearly true!) from too many governments to wrap your head around.
Buried in these data are answers to countless questions that no one’s ever thought to ask.
In this book, we’ll learn how to find them.
What Is Data Science?
There’s a joke that says a data scientist is someone who knows more statistics than a
computer scientist and more computer science than a statistician. (I didn’t say it was a
good joke.) In fact, some data scientists are, for all practical purposes, statisticians,
while others are pretty much indistinguishable from software engineers. Some are
machine-learning experts, while others couldn’t machine-learn their way out of
kindergarten. Some are PhDs with impressive publication records, while others have never
read an academic paper (shame on them, though). In short, pretty much no matter how you
define data science, you’ll find practitioners for whom the definition is totally, absolutely
wrong.
Nonetheless, we won’t let that stop us from trying. We’ll say that a data scientist is
someone who extracts insights from messy data. Today’s world is full of people trying to
turn data into insight.
For instance, the dating site OkCupid asks its members to answer thousands of questions
in order to find the most appropriate matches for them. But it also analyzes these results to
figure out innocuous-sounding questions you can ask someone to find out how likely
someone is to sleep with you on the first date.
Facebook asks you to list your hometown and your current location, ostensibly to make it
easier for your friends to find and connect with you. But it also analyzes these locations to
identify global migration patterns and where the fanbases of different football teams live.
As a large retailer, Target tracks your purchases and interactions, both online and in-store.
And it uses the data to predictively model which of its customers are pregnant, to better
market baby-related purchases to them.
In 2012, the Obama campaign employed dozens of data scientists who data-mined and
experimented their way to identifying voters who needed extra attention, choosing optimal
donor-specific fundraising appeals and programs, and focusing get-out-the-vote efforts
where they were most likely to be useful. It is generally agreed that these efforts played an
important role in the president’s re-election, which means it is a safe bet that political
campaigns of the future will become more and more data-driven, resulting in a
never-ending arms race of data science and data collection.
Now, before you start feeling too jaded: some data scientists also occasionally use their
skills for good: using data to make government more effective, to help the homeless,
and to improve public health. But it certainly won’t hurt your career if you like figuring
out the best way to get people to click on advertisements.
Motivating Hypothetical: DataSciencester
Congratulations! You’ve just been hired to lead the data science efforts at DataSciencester,
the social network for data scientists.
Despite being for data scientists, DataSciencester has never actually invested in building
its own data science practice. (In fairness, DataSciencester has never really invested in
building its product either.) That will be your job! Throughout the book, we’ll be learning
about data science concepts by solving problems that you encounter at work. Sometimes
we’ll look at data explicitly supplied by users, sometimes we’ll look at data generated
through their interactions with the site, and sometimes we’ll even look at data from
experiments that we’ll design.
And because DataSciencester has a strong “not-invented-here” mentality, we’ll be
building our own tools from scratch. At the end, you’ll have a pretty solid understanding
of the fundamentals of data science. And you’ll be ready to apply your skills at a company
with a less shaky premise, or to any other problems that happen to interest you.
Welcome aboard, and good luck! (You’re allowed to wear jeans on Fridays, and the
bathroom is down the hall on the right.)
Finding Key Connectors
It’s your first day on the job at DataSciencester, and the VP of Networking is full of
questions about your users. Until now he’s had no one to ask, so he’s very excited to have
you aboard.
In particular, he wants you to identify who the “key connectors” are among data scientists.
To this end, he gives you a dump of the entire DataSciencester network. (In real life,
people don’t typically hand you the data you need. Chapter 9 is devoted to getting data.)
What does this data dump look like? It consists of a list of users, each represented by a
dict that contains for each user his or her id (which is a number) and name (which, in one
of the great cosmic coincidences, rhymes with the user’s id):
users = [
    {"id": 0, "name": "Hero"},
    {"id": 1, "name": "Dunn"},
    {"id": 2, "name": "Sue"},
    {"id": 3, "name": "Chi"},
    {"id": 4, "name": "Thor"},
    {"id": 5, "name": "Clive"},
    {"id": 6, "name": "Hicks"},
    {"id": 7, "name": "Devin"},
    {"id": 8, "name": "Kate"},
    {"id": 9, "name": "Klein"}
]
He also gives you the “friendship” data, represented as a list of pairs of IDs:
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
For example, the tuple (0, 1) indicates that the data scientist with id 0 (Hero) and the
data scientist with id 1 (Dunn) are friends. The network is illustrated in Figure 1-1.
Figure 1-1. The DataSciencester network
Since we represented our users as dicts, it’s easy to augment them with extra data.
NOTE
Don’t get too hung up on the details of the code right now. In Chapter 2, we’ll take you through a crash
course in Python. For now just try to get the general flavor of what we’re doing.
For example, we might want to add a list of friends to each user. First we set each user’s
friends property to an empty list:
for user in users:
    user["friends"] = []
And then we populate the lists using the friendships data:
for i, j in friendships:
    # this works because users[i] is the user whose id is i
    users[i]["friends"].append(users[j])  # add j as a friend of i
    users[j]["friends"].append(users[i])  # add i as a friend of j
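One possible alternative, sketched here as an assumption rather than as the approach this chapter takes, is to store friend ids instead of whole user dicts. Because the user dicts above end up referring to one another, they can be awkward to print or serialize; a plain mapping from each id to a set of ids sidesteps that. The name friend_ids is hypothetical.

```python
from collections import defaultdict

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# hypothetical alternative: map each user id to the set of its friends' ids
friend_ids = defaultdict(set)
for i, j in friendships:
    friend_ids[i].add(j)  # j is a friend of i
    friend_ids[j].add(i)  # i is a friend of j

print(sorted(friend_ids[0]))  # [1, 2]
```

The rest of the chapter keeps the list-of-dicts representation, so this is only a point of comparison.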
Once each user dict contains a list of friends, we can easily ask questions of our graph,
like “what’s the average number of connections?”
First we find the total number of connections, by summing up the lengths of all the
friends lists:
def number_of_friends(user):
    """how many friends does _user_ have?"""
    return len(user["friends"])  # length of the friends list

total_connections = sum(number_of_friends(user)
                        for user in users)  # 24
And then we just divide by the number of users:
from __future__ import division  # integer division is lame
num_users = len(users)  # length of the users list
avg_connections = total_connections / num_users  # 2.4
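As a quick cross-check (not part of the original code): each friendship pair adds exactly one entry to two friends lists, so the total should be twice the number of pairs. The sketch below recomputes both figures from the friendships list alone. It uses float() so that the division behaves the same way with or without the __future__ import.

```python
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
num_users = 10  # ids 0 through 9

# every pair contributes to two friends lists
total_connections = 2 * len(friendships)  # 2 * 12 = 24
avg_connections = total_connections / float(num_users)  # 24 / 10 = 2.4

print(total_connections)
print(avg_connections)
```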
It’s also easy to find the most connected people: they’re the people who have the largest
number of friends.
Since there aren’t very many users, we can sort them from “most friends” to “least
friends”:
# create a list (user_id, number_of_friends)
num_friends_by_id = [(user["id"], number_of_friends(user))
                     for user in users]

sorted(num_friends_by_id,  # get it sorted
       key=lambda (user_id, num_friends): num_friends,  # by num_friends
       reverse=True)  # largest to smallest

# each pair is (user_id, num_friends)
# [(1, 3), (2, 3), (3, 3), (5, 3), (8, 3),
#  (0, 2), (4, 2), (6, 2), (7, 2), (9, 1)]
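The tuple-unpacking lambda above is Python 2 syntax. Python 3 removed tuple parameters, so the same sort needs a slightly different key function there. The following is a minimal sketch of that variant. It rebuilds the friend counts from the friendships list so that it runs on its own; degree and ranked are hypothetical names.

```python
from collections import Counter

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
num_users = 10  # ids 0 through 9

# count how many friendships each id appears in
degree = Counter()
for i, j in friendships:
    degree[i] += 1
    degree[j] += 1

num_friends_by_id = [(user_id, degree[user_id])
                     for user_id in range(num_users)]

# index into the pair instead of unpacking it in the lambda signature
ranked = sorted(num_friends_by_id,
                key=lambda pair: pair[1],  # by num_friends
                reverse=True)              # largest to smallest

print(ranked)
# [(1, 3), (2, 3), (3, 3), (5, 3), (8, 3),
#  (0, 2), (4, 2), (6, 2), (7, 2), (9, 1)]
```

Because sorted is stable, users with the same number of friends stay in id order, which is why the output matches the listing above.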
One way to think of what we’ve done is as a way of identifying people who are somehow
central to the network. In fact, what we’ve just computed is the network metric degree
centrality (Figure 1-2).
Figure 1-2. The DataSciencester network sized by degree
This has the virtue of being pretty easy to calculate, but it doesn’t always give the results
you’d want or expect. For example, in the DataSciencester network Thor (id 4) only has
two connections while Dunn (id 1) has three. Yet looking at the network it intuitively
seems like Thor should be more central. In Chapter 21, we’ll investigate networks in more
detail, and we’ll look at more complex notions of centrality that may or may not accord
better with our intuition.
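To make the intuition about Thor a little more concrete, here is a rough sketch of one alternative measure: the total number of hops from a user to everyone else, found with a breadth-first search. This is only an illustration under my own assumptions, not necessarily one of the measures Chapter 21 develops, and the names neighbors and total_distance are hypothetical.

```python
from collections import deque

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# map each user id to the set of ids of that user's friends
neighbors = {}
for i, j in friendships:
    neighbors.setdefault(i, set()).add(j)
    neighbors.setdefault(j, set()).add(i)

def total_distance(start):
    """sum of shortest-path lengths from start to every other user"""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for friend in neighbors[current]:
            if friend not in distance:  # first visit is the shortest path
                distance[friend] = distance[current] + 1
                queue.append(friend)
    return sum(distance.values())

print(total_distance(4))  # 20 (Thor)
print(total_distance(1))  # 27 (Dunn)
```

By this measure Thor is closer to the rest of the network than Dunn, even though Dunn has more direct connections.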
Data Scientists You May Know
While you’re still filling out new-hire paperwork, the VP of Fraternization comes by your
desk. She wants to encourage more connections among your members, and she asks you
to design a “Data Scientists You May Know” suggester.
Your first instinct is to suggest that a user might know the friends of friends. These are
easy to compute: for each of a user’s friends, iterate over that person’s friends, and collect
all the results:
def friends_of_friend_ids_bad(user):
    # "foaf" is short for "friend of a friend"
    return [foaf["id"]
            for friend in user["friends"]   # for each of user's friends
            for foaf in friend["friends"]]  # get each of _their_ friends
When we call this on users[0] (Hero), it produces:
[0, 2, 3, 0, 1, 3]
It includes user 0 (twice), since Hero is indeed friends with both of his friends. It includes
users 1 and 2, although they are both friends with Hero already. And it includes user 3
twice, as Chi is reachable through two different friends:
print [friend["id"] for friend in users[0]["friends"]]  # [1, 2]
print [friend["id"] for friend in users[1]["friends"]]  # [0, 2, 3]
print [friend["id"] for friend in users[2]["friends"]]  # [0, 1, 3]
Knowing that people are friends-of-friends in multiple ways seems like interesting
information, so maybe instead we should produce a count of mutual friends. And we
definitely should use a helper function to exclude people already known to the user:
from collections import Counter  # not loaded by default

def not_the_same(user, other_user):
    """two users are not the same if they have different ids"""
    return user["id"] != other_user["id"]

def not_friends(user, other_user):
    """other_user is not a friend if he's not in user["friends"];
    that is, if he's not_the_same as all the people in user["friends"]"""
    return all(not_the_same(friend, other_user)
               for friend in user["friends"])

def friends_of_friend_ids(user):
    return Counter(foaf["id"]
                   for friend in user["friends"]  # for each of my friends
                   for foaf in friend["friends"]  # count *their* friends
                   if not_the_same(user, foaf)    # who aren't me
                   and not_friends(user, foaf))   # and aren't my friends

print friends_of_friend_ids(users[3])  # Counter({0: 2, 5: 1})
This correctly tells Chi (id 3) that she has two mutual friends with Hero (id 0) but only
one mutual friend with Clive (id 5).
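The same count can also be sketched with sets of ids, which replaces the two helper functions with a single membership test. This is a hypothetical variant written to run on its own, not the version the rest of the chapter relies on; friend_ids and mutual_friend_counts are made-up names.

```python
from collections import Counter

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# map each user id to the set of ids of that user's friends
friend_ids = {}
for i, j in friendships:
    friend_ids.setdefault(i, set()).add(j)
    friend_ids.setdefault(j, set()).add(i)

def mutual_friend_counts(user_id):
    """count friends-of-friends, skipping the user and existing friends"""
    already_known = friend_ids[user_id] | {user_id}
    return Counter(foaf_id
                   for friend_id in friend_ids[user_id]  # each of my friends
                   for foaf_id in friend_ids[friend_id]  # each of their friends
                   if foaf_id not in already_known)      # new people only

print(mutual_friend_counts(3))  # Counter({0: 2, 5: 1})
```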
As a data scientist, you know that you also might enjoy meeting users with similar