
Data Science from Scratch
Joel Grus



Data Science from Scratch
by Joel Grus
Copyright © 2015 O’Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles. For more
information, contact our corporate/institutional sales department: 800-998-9938.

Editor: Marie Beaugureau
Production Editor: Melanie Yarbrough
Copyeditor: Nan Reinhardt
Proofreader: Eileen Cohen
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
April 2015: First Edition


Revision History for the First Edition
2015-04-10: First Release


The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Science from
Scratch, the cover image of a Rock Ptarmigan, and related trade dress are trademarks of
O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and the
author disclaim all responsibility for errors or omissions, including without limitation
responsibility for damages resulting from the use of or reliance on this work. Use of the
information and instructions contained in this work is at your own risk. If any code
samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.
978-1-491-90142-7
[LSI]



Preface


Data Science
Data scientist has been called “the sexiest job of the 21st century,” presumably by
someone who has never visited a fire station. Nonetheless, data science is a hot and
growing field, and it doesn’t take a great deal of sleuthing to find analysts breathlessly
prognosticating that over the next 10 years, we’ll need billions and billions more data
scientists than we currently have.
But what is data science? After all, we can’t produce data scientists if we don’t know what
data science is. According to a Venn diagram that is somewhat famous in the industry, data
science lies at the intersection of:
Hacking skills
Math and statistics knowledge
Substantive expertise
Although I originally intended to write a book covering all three, I quickly realized that a
thorough treatment of “substantive expertise” would require tens of thousands of pages. At
that point, I decided to focus on the first two. My goal is to help you develop the hacking
skills that you’ll need to get started doing data science. And my goal is to help you get
comfortable with the mathematics and statistics that are at the core of data science.
This is a somewhat heavy aspiration for a book. The best way to learn hacking skills is by
hacking on things. By reading this book, you will get a good understanding of the way I
hack on things, which may not necessarily be the best way for you to hack on things. You
will get a good understanding of some of the tools I use, which will not necessarily be the
best tools for you to use. You will get a good understanding of the way I approach data
problems, which may not necessarily be the best way for you to approach data problems.
The intent (and the hope) is that my examples will inspire you to try things your own way.
All the code and data from the book is available on GitHub to get you started.
Similarly, the best way to learn mathematics is by doing mathematics. This is emphatically
not a math book, and for the most part, we won’t be “doing mathematics.” However, you
can’t really do data science without some understanding of probability and statistics and
linear algebra. This means that, where appropriate, we will dive into mathematical
equations, mathematical intuition, mathematical axioms, and cartoon versions of big
mathematical ideas. I hope that you won’t be afraid to dive in with me.
Throughout it all, I also hope to give you a sense that playing with data is fun, because,
well, playing with data is fun! (Especially compared to some of the alternatives, like tax
preparation or coal mining.)


From Scratch
There are lots and lots of data science libraries, frameworks, modules, and toolkits that
efficiently implement the most common (as well as the least common) data science
algorithms and techniques. If you become a data scientist, you will become intimately
familiar with NumPy, with scikit-learn, with pandas, and with a panoply of other libraries.
They are great for doing data science. But they are also a good way to start doing data
science without actually understanding data science.
In this book, we will be approaching data science from scratch. That means we’ll be
building tools and implementing algorithms by hand in order to better understand them. I
put a lot of thought into creating implementations and examples that are clear,
well-commented, and readable. In most cases, the tools we build will be illuminating but
impractical. They will work well on small toy data sets but fall over on “web scale” ones.
Throughout the book, I will point you to libraries you might use to apply these techniques
to larger data sets. But we won’t be using them here.
There is a healthy debate raging over the best language for learning data science. Many
people believe it’s the statistical programming language R. (We call those people wrong.)
A few people suggest Java or Scala. However, in my opinion, Python is the obvious
choice.
Python has several features that make it well suited for learning (and doing) data science:
It’s free.
It’s relatively simple to code in (and, in particular, to understand).
It has lots of useful data science–related libraries.
I am hesitant to call Python my favorite programming language. There are other languages
I find more pleasant, better-designed, or just more fun to code in. And yet pretty much
every time I start a new data science project, I end up using Python. Every time I need to
quickly prototype something that just works, I end up using Python. And every time I
want to demonstrate data science concepts in a clear, easy-to-understand way, I end up
using Python. Accordingly, this book uses Python.
The goal of this book is not to teach you Python. (Although it is nearly certain that by
reading this book you will learn some Python.) I’ll take you through a chapter-long crash
course that highlights the features that are most important for our purposes, but if you
know nothing about programming in Python (or about programming at all) then you might
want to supplement this book with some sort of “Python for Beginners” tutorial.
The remainder of our introduction to data science will take this same approach: going
into detail where going into detail seems crucial or illuminating, at other times leaving
details for you to figure out yourself (or look up on Wikipedia).



Over the years, I’ve trained a number of data scientists. While not all of them have gone
on to become world-changing data ninja rockstars, I’ve left them all better data scientists
than I found them. And I’ve grown to believe that anyone who has some amount of
mathematical aptitude and some amount of programming skill has the necessary raw
materials to do data science. All she needs is an inquisitive mind, a willingness to work
hard, and this book. Hence this book.


Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements
such as variable or function names, databases, data types, environment variables,
statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined
by context.
TIP
This element signifies a tip or suggestion.

NOTE
This element signifies a general note.

WARNING
This element indicates a warning or caution.


Using Code Examples
Supplemental material (code examples, exercises, etc.) is available for download.
This book is here to help you get your job done. In general, if example code is offered
with this book, you may use it in your programs and documentation. You do not need to
contact us for permission unless you’re reproducing a significant portion of the code. For
example, writing a program that uses several chunks of code from this book does not
require permission. Selling or distributing a CD-ROM of examples from O’Reilly books
does require permission. Answering a question by citing this book and quoting example
code does not require permission. Incorporating a significant amount of example code
from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Data Science from Scratch by Joel Grus
(O’Reilly). Copyright 2015 Joel Grus, 978-1-4919-0142-7.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us.


Safari® Books Online
NOTE
Safari Books Online is an on-demand digital library that delivers expert content in both
book and video form from the world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and creative
professionals use Safari Books Online as their primary resource for research, problem
solving, learning, and certification training.

Safari Books Online offers a range of plans and pricing for enterprise, government,
education, and individuals.
Members have access to thousands of books, training videos, and prepublication
manuscripts in one fully searchable database from publishers like O’Reilly Media,
Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que,
Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan
Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders,
McGraw-Hill, Jones & Bartlett, Course Technology, and hundreds more. For more
information about Safari Books Online, please visit us online.


How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information.
For more information about our books, courses, conferences, and news, see our website.

Acknowledgments
First, I would like to thank Mike Loukides for accepting my proposal for this book (and
for insisting that I pare it down to a reasonable size). It would have been very easy for him
to say, “Who’s this person who keeps emailing me sample chapters, and how do I get him
to go away?” I’m grateful he didn’t. I’d also like to thank my editor, Marie Beaugureau,
for guiding me through the publishing process and getting the book in a much better state
than I ever would have gotten it on my own.
I couldn’t have written this book if I’d never learned data science, and I probably wouldn’t
have learned data science if not for the influence of Dave Hsu, Igor Tatarinov, John
Rauser, and the rest of the Farecast gang. (So long ago that it wasn’t even called data
science at the time!) The good folks at Coursera deserve a lot of credit, too.
I am also grateful to my beta readers and reviewers. Jay Fundling found a ton of mistakes
and pointed out many unclear explanations, and the book is much better (and much more
correct) thanks to him. Debashis Ghosh is a hero for sanity-checking all of my statistics.
Andrew Musselman suggested toning down the “people who prefer R to Python are moral
reprobates” aspect of the book, which I think ended up being pretty good advice. Trey
Causey, Ryan Matthew Balfanz, Loris Mularoni, Núria Pujol, Rob Jefferson, Mary Pat
Campbell, Zach Geary, and Wendy Grus also provided invaluable feedback. Any errors
remaining are of course my responsibility.
I owe a lot to the Twitter #datascience community, for exposing me to a ton of new
concepts, introducing me to a lot of great people, and making me feel like enough of an
underachiever that I went out and wrote a book to compensate. Special thanks to Trey
Causey (again), for (inadvertently) reminding me to include a chapter on linear algebra,
and to Sean J. Taylor, for (inadvertently) pointing out a couple of huge gaps in the
“Working with Data” chapter.
Above all, I owe immense thanks to Ganga and Madeline. The only thing harder than
writing a book is living with someone who’s writing a book, and I couldn’t have pulled it
off without their support.



Chapter 1. Introduction
“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.”
Arthur Conan Doyle



The Ascendance of Data
We live in a world that’s drowning in data. Websites track every user’s every click. Your
smartphone is building up a record of your location and speed every second of every day.
“Quantified selfers” wear pedometers-on-steroids that are ever recording their heart rates,
movement habits, diet, and sleep patterns. Smart cars collect driving habits, smart homes
collect living habits, and smart marketers collect purchasing habits. The Internet itself
represents a huge graph of knowledge that contains (among other things) an enormous
cross-referenced encyclopedia; domain-specific databases about movies, music, sports
results, pinball machines, memes, and cocktails; and too many government statistics
(some of them nearly true!) from too many governments to wrap your head around.
Buried in these data are answers to countless questions that no one’s ever thought to ask.
In this book, we’ll learn how to find them.


What Is Data Science?
There’s a joke that says a data scientist is someone who knows more statistics than a
computer scientist and more computer science than a statistician. (I didn’t say it was a
good joke.) In fact, some data scientists are, for all practical purposes, statisticians,
while others are pretty much indistinguishable from software engineers. Some are
machine-learning experts, while others couldn’t machine-learn their way out of
kindergarten. Some are PhDs with impressive publication records, while others have never
read an academic paper (shame on them, though). In short, pretty much no matter how you
define data science, you’ll find practitioners for whom the definition is totally, absolutely
wrong.
Nonetheless, we won’t let that stop us from trying. We’ll say that a data scientist is
someone who extracts insights from messy data. Today’s world is full of people trying to
turn data into insight.
For instance, the dating site OkCupid asks its members to answer thousands of questions
in order to find the most appropriate matches for them. But it also analyzes these results to
figure out innocuous-sounding questions you can ask someone to find out how likely
someone is to sleep with you on the first date.
Facebook asks you to list your hometown and your current location, ostensibly to make it
easier for your friends to find and connect with you. But it also analyzes these locations to
identify global migration patterns and where the fanbases of different football teams live.
As a large retailer, Target tracks your purchases and interactions, both online and in-store.
And it uses the data to predictively model which of its customers are pregnant, to better
market baby-related purchases to them.
In 2012, the Obama campaign employed dozens of data scientists who data-mined and
experimented their way to identifying voters who needed extra attention, choosing optimal
donor-specific fundraising appeals and programs, and focusing get-out-the-vote efforts
where they were most likely to be useful. It is generally agreed that these efforts played an
important role in the president’s re-election, which means it is a safe bet that political
campaigns of the future will become more and more data-driven, resulting in a
never-ending arms race of data science and data collection.
Now, before you start feeling too jaded: some data scientists also occasionally use their
skills for good, using data to make government more effective, to help the homeless,
and to improve public health. But it certainly won’t hurt your career if you like figuring
out the best way to get people to click on advertisements.


Motivating Hypothetical: DataSciencester
Congratulations! You’ve just been hired to lead the data science efforts at DataSciencester,
the social network for data scientists.
Despite being for data scientists, DataSciencester has never actually invested in building
its own data science practice. (In fairness, DataSciencester has never really invested in
building its product either.) That will be your job! Throughout the book, we’ll be learning
about data science concepts by solving problems that you encounter at work. Sometimes
we’ll look at data explicitly supplied by users, sometimes we’ll look at data generated
through their interactions with the site, and sometimes we’ll even look at data from
experiments that we’ll design.
And because DataSciencester has a strong “not-invented-here” mentality, we’ll be
building our own tools from scratch. At the end, you’ll have a pretty solid understanding
of the fundamentals of data science. And you’ll be ready to apply your skills at a company
with a less shaky premise, or to any other problems that happen to interest you.
Welcome aboard, and good luck! (You’re allowed to wear jeans on Fridays, and the
bathroom is down the hall on the right.)


Finding Key Connectors
It’s your first day on the job at DataSciencester, and the VP of Networking is full of
questions about your users. Until now he’s had no one to ask, so he’s very excited to have
you aboard.
In particular, he wants you to identify who the “key connectors” are among data scientists.
To this end, he gives you a dump of the entire DataSciencester network. (In real life,
people don’t typically hand you the data you need. Chapter 9 is devoted to getting data.)
What does this data dump look like? It consists of a list of users, each represented by a
dict that contains for each user his or her id (which is a number) and name (which, in one
of the great cosmic coincidences, rhymes with the user’s id):
    users = [
        {"id": 0, "name": "Hero"},
        {"id": 1, "name": "Dunn"},
        {"id": 2, "name": "Sue"},
        {"id": 3, "name": "Chi"},
        {"id": 4, "name": "Thor"},
        {"id": 5, "name": "Clive"},
        {"id": 6, "name": "Hicks"},
        {"id": 7, "name": "Devin"},
        {"id": 8, "name": "Kate"},
        {"id": 9, "name": "Klein"}
    ]

He also gives you the “friendship” data, represented as a list of pairs of IDs:
    friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                   (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

For example, the tuple (0, 1) indicates that the data scientist with id 0 (Hero) and the
data scientist with id 1 (Dunn) are friends. The network is illustrated in Figure 1-1.

Figure 1-1. The DataSciencester network

Since we represented our users as dicts, it’s easy to augment them with extra data.
NOTE
Don’t get too hung up on the details of the code right now. In Chapter 2, we’ll take you through a crash
course in Python. For now just try to get the general flavor of what we’re doing.


For example, we might want to add a list of friends to each user. First we set each user’s
friends property to an empty list:
    for user in users:
        user["friends"] = []

And then we populate the lists using the friendships data:
    for i, j in friendships:
        # this works because users[i] is the user whose id is i
        users[i]["friends"].append(users[j])  # add j as a friend of i
        users[j]["friends"].append(users[i])  # add i as a friend of j
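The two steps above can be tried end to end with the following minimal sketch. It is written for Python 3 and repeats the data so that it runs on its own; the `friend_names` lookup at the end is an illustrative addition, not part of the original listing.

```python
# Self-contained sketch (Python 3); data repeated from the listings above.
users = [
    {"id": 0, "name": "Hero"}, {"id": 1, "name": "Dunn"},
    {"id": 2, "name": "Sue"}, {"id": 3, "name": "Chi"},
    {"id": 4, "name": "Thor"}, {"id": 5, "name": "Clive"},
    {"id": 6, "name": "Hicks"}, {"id": 7, "name": "Devin"},
    {"id": 8, "name": "Kate"}, {"id": 9, "name": "Klein"},
]
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# Step 1: give every user an empty friends list.
for user in users:
    user["friends"] = []

# Step 2: friendship is symmetric, so record each pair on both sides.
# Indexing users[i] by id only works because ids equal list positions.
for i, j in friendships:
    users[i]["friends"].append(users[j])  # j becomes a friend of i
    users[j]["friends"].append(users[i])  # i becomes a friend of j

# Illustrative helper (an assumption, not in the book): names of each user's friends.
friend_names = {user["name"]: [friend["name"] for friend in user["friends"]]
                for user in users}
print(friend_names["Hero"])  # ['Dunn', 'Sue']
```

Note that each friends list holds references to the other user dicts rather than copies, so the structure is circular; that is fine for traversal, but printing a whole user dict will show nested data.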

Once each user dict contains a list of friends, we can easily ask questions of our graph,
like “what’s the average number of connections?”
First we find the total number of connections, by summing up the lengths of all the
friends lists:
    def number_of_friends(user):
        """how many friends does _user_ have?"""
        return len(user["friends"])  # length of the friends list

    total_connections = sum(number_of_friends(user)
                            for user in users)  # 24

And then we just divide by the number of users:
    from __future__ import division  # integer division is lame
    num_users = len(users)  # length of the users list
    avg_connections = total_connections / num_users  # 2.4
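As a quick cross-check of the figures 24 and 2.4, note that every friendship pair adds one entry to each of two friends lists, so the total must be twice the number of pairs. A minimal Python 3 sketch of that arithmetic, with the data repeated so it runs on its own:

```python
# Python 3 sketch: "/" is already true division here, so no __future__ import is needed.
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
num_users = 10  # ids 0 through 9 in the users list

# Each pair (i, j) is counted once for i and once for j.
total_connections = 2 * len(friendships)
avg_connections = total_connections / num_users

print(total_connections, avg_connections)  # 24 2.4
```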

It’s also easy to find the most connected people: they’re the people who have the largest
number of friends.
Since there aren’t very many users, we can sort them from “most friends” to “least
friends”:
    # create a list (user_id, number_of_friends)
    num_friends_by_id = [(user["id"], number_of_friends(user))
                         for user in users]

    sorted(num_friends_by_id,  # get it sorted
           key=lambda (user_id, num_friends): num_friends,  # by num_friends
           reverse=True)  # largest to smallest

    # each pair is (user_id, num_friends)
    # [(1, 3), (2, 3), (3, 3), (5, 3), (8, 3),
    #  (0, 2), (4, 2), (6, 2), (7, 2), (9, 1)]
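The lambda in the listing above unpacks its tuple argument in the parameter list, which is Python 2 syntax and a syntax error in Python 3. If you are following along in Python 3, one possible equivalent is to select the second element of each pair with `operator.itemgetter`. The sketch below starts from the same pairs, listed in id order, so that it runs on its own; the name `by_degree` is an illustrative choice.

```python
from operator import itemgetter

# (user_id, num_friends) pairs in id order, matching the data above.
num_friends_by_id = [(0, 2), (1, 3), (2, 3), (3, 3), (4, 2),
                     (5, 3), (6, 2), (7, 2), (8, 3), (9, 1)]

# Sort by the second element of each pair, largest first.
# sorted() is stable even with reverse=True, so ties keep their id order.
by_degree = sorted(num_friends_by_id, key=itemgetter(1), reverse=True)

print(by_degree)
# [(1, 3), (2, 3), (3, 3), (5, 3), (8, 3), (0, 2), (4, 2), (6, 2), (7, 2), (9, 1)]
```

Writing `key=lambda pair: pair[1]` would work equally well; `itemgetter` is simply a little more compact.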

One way to think of what we’ve done is as a way of identifying people who are somehow
central to the network. In fact, what we’ve just computed is the network metric degree
centrality (Figure 1-2).

Figure 1-2. The DataSciencester network sized by degree

This has the virtue of being pretty easy to calculate, but it doesn’t always give the results
you’d want or expect. For example, in the DataSciencester network Thor (id 4) only has
two connections while Dunn (id 1) has three. Yet looking at the network it intuitively
seems like Thor should be more central. In Chapter 21, we’ll investigate networks in more
detail, and we’ll look at more complex notions of centrality that may or may not accord
better with our intuition.
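To make the intuition about Thor concrete, here is one possible alternative measure, offered only as a hedged illustration and not necessarily the approach taken later in the book: add up the shortest-path distances from a user to everyone else, so that a smaller total means the user is "closer" to the rest of the network. The names `adjacency` and `total_distance` are assumptions introduced for this sketch (Python 3).

```python
from collections import deque

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# Build an id -> set of neighbor ids mapping from the friendship pairs.
adjacency = {}
for i, j in friendships:
    adjacency.setdefault(i, set()).add(j)
    adjacency.setdefault(j, set()).add(i)

def total_distance(start):
    """Sum of shortest-path lengths from start to every other user,
    found with a breadth-first search."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in distance:  # first visit is the shortest path
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return sum(distance.values())

print(total_distance(4))  # Thor: 20
print(total_distance(1))  # Dunn: 27
```

By this measure Thor is nearer to the network as a whole than Dunn, even though Dunn has more direct friends.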


Data Scientists You May Know
While you’re still filling out new-hire paperwork, the VP of Fraternization comes by your
desk. She wants to encourage more connections among your members, and she asks you
to design a “Data Scientists You May Know” suggester.
Your first instinct is to suggest that a user might know the friends of friends. These are
easy to compute: for each of a user’s friends, iterate over that person’s friends, and collect
all the results:
    def friends_of_friend_ids_bad(user):
        # "foaf" is short for "friend of a friend"
        return [foaf["id"]
                for friend in user["friends"]  # for each of user's friends
                for foaf in friend["friends"]]  # get each of _their_ friends

When we call this on users[0] (Hero), it produces:
    [0, 2, 3, 0, 1, 3]

It includes user 0 (twice), since Hero is indeed friends with both of his friends. It includes
users 1 and 2, although they are both friends with Hero already. And it includes user 3
twice, as Chi is reachable through two different friends:
    print [friend["id"] for friend in users[0]["friends"]]  # [1, 2]
    print [friend["id"] for friend in users[1]["friends"]]  # [0, 2, 3]
    print [friend["id"] for friend in users[2]["friends"]]  # [0, 1, 3]
Knowing that people are friends-of-friends in multiple ways seems like interesting
information, so maybe instead we should produce a count of mutual friends. And we
definitely should use a helper function to exclude people already known to the user:
    from collections import Counter  # not loaded by default

    def not_the_same(user, other_user):
        """two users are not the same if they have different ids"""
        return user["id"] != other_user["id"]

    def not_friends(user, other_user):
        """other_user is not a friend if he's not in user["friends"];
        that is, if he's not_the_same as all the people in user["friends"]"""
        return all(not_the_same(friend, other_user)
                   for friend in user["friends"])

    def friends_of_friend_ids(user):
        return Counter(foaf["id"]
                       for friend in user["friends"]  # for each of my friends
                       for foaf in friend["friends"]  # count *their* friends
                       if not_the_same(user, foaf)  # who aren't me
                       and not_friends(user, foaf))  # and aren't my friends

    print friends_of_friend_ids(users[3])  # Counter({0: 2, 5: 1})

This correctly tells Chi (id 3) that she has two mutual friends with Hero (id 0) but only
one mutual friend with Clive (id 5).
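The same mutual-friend count can be sketched in Python 3 using sets of ids rather than nested user dicts, which turns the "already a friend" check into a simple membership test. This is an illustrative variant under that assumption, not the book's listing; `friend_ids` and `mutual_friend_counts` are hypothetical names.

```python
from collections import Counter

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# id -> set of friend ids, built from the symmetric friendship pairs.
friend_ids = {}
for i, j in friendships:
    friend_ids.setdefault(i, set()).add(j)
    friend_ids.setdefault(j, set()).add(i)

def mutual_friend_counts(user_id):
    """For each friend-of-a-friend who is neither the user nor already
    a friend, count how many of the user's friends lead to them."""
    return Counter(foaf_id
                   for friend_id in friend_ids[user_id]   # each of my friends
                   for foaf_id in friend_ids[friend_id]   # each of their friends
                   if foaf_id != user_id                  # who isn't me
                   and foaf_id not in friend_ids[user_id])  # and isn't my friend

print(mutual_friend_counts(3))  # Counter({0: 2, 5: 1})
```

The result for Chi matches the output shown above: two mutual friends with Hero and one with Clive.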
As a data scientist, you know that you also might enjoy meeting users with similar


