OperatingSystems
Principles&Practice
VolumeIV:PersistentStorage
SecondEdition
ThomasAnderson
UniversityofWashington
MikeDahlin
UniversityofTexasandGoogle
RecursiveBooks
recursivebooks.com
OperatingSystems:PrinciplesandPractice(SecondEdition)VolumeIV:Persistent
StoragebyThomasAndersonandMichaelDahlin
Copyright©ThomasAndersonandMichaelDahlin,2011-2015.
ISBN978-0-9856735-6-7
Publisher:RecursiveBooks,Ltd.,
Cover:ReflectionLake,Mt.Rainier
Coverdesign:CameronNeat
Illustrations:CameronNeat
Copyeditors:SandyKaplan,WhitneySchmidt
Ebookdesign:RobinBriggs
Webdesign:AdamAnderson
SUGGESTIONS,COMMENTS,andERRORS.Wewelcomesuggestions,commentsand
errorreports,byemailto
Noticeofrights.Allrightsreserved.Nopartofthisbookmaybereproduced,storedina
retrievalsystem,ortransmittedinanyformbyanymeans—electronic,mechanical,
photocopying,recording,orotherwise—withoutthepriorwrittenpermissionofthe
publisher.Forinformationongettingpermissionsforreprintsandexcerpts,contact
Noticeofliability.Theinformationinthisbookisdistributedonan“AsIs”basis,without
warranty.NeithertheauthorsnorRecursiveBooksshallhaveanyliabilitytoanypersonor
entitywithrespecttoanylossordamagecausedorallegedtobecauseddirectlyor
indirectlybytheinformationorinstructionscontainedinthisbookorbythecomputer
softwareandhardwareproductsdescribedinit.
Trademarks:Throughoutthisbooktrademarkednamesareused.Ratherthanputa
trademarksymbolineveryoccurrenceofatrademarkedname,westateweareusingthe
namesonlyinaneditorialfashionandtothebenefitofthetrademarkownerwithno
intentionofinfringementofthetrademark.Alltrademarksorservicemarksarethe
propertyoftheirrespectiveowners.
ToRobin,Sandra,Katya,andAdam
TomAnderson
ToMarla,Kelly,andKeith
MikeDahlin
Contents
Preface
I:KernelsandProcesses
1.Introduction
2.TheKernelAbstraction
3.TheProgrammingInterface
II:Concurrency
4.ConcurrencyandThreads
5.SynchronizingAccesstoSharedObjects
6.Multi-ObjectSynchronization
7.Scheduling
III:MemoryManagement
8.AddressTranslation
9.CachingandVirtualMemory
10.AdvancedMemoryManagement
IVPersistentStorage
11FileSystems:IntroductionandOverview
11.1TheFileSystemAbstraction
11.2API
11.3SoftwareLayers
11.3.1APIandPerformance
11.3.2DeviceDrivers:CommonAbstractions
11.3.3DeviceAccess
11.3.4PuttingItAllTogether:ASimpleDiskRequest
11.4SummaryandFutureDirections
Exercises
12StorageDevices
12.1MagneticDisk
12.1.1DiskAccessandPerformance
12.1.2CaseStudy:ToshibaMK3254GSY
12.1.3DiskScheduling
12.2FlashStorage
12.3SummaryandFutureDirections
Exercises
13FilesandDirectories
13.1ImplementationOverview
13.2Directories:NamingData
13.3Files:FindingData
13.3.1FAT:LinkedList
13.3.2FFS:FixedTree
13.3.3NTFS:FlexibleTreeWithExtents
13.3.4Copy-On-WriteFileSystems
13.4PuttingItAllTogether:FileandDirectoryAccess
13.5SummaryandFutureDirections
Exercises
14ReliableStorage
14.1Transactions:AtomicUpdates
14.1.1AdHocApproaches
14.1.2TheTransactionAbstraction
14.1.3ImplementingTransactions
14.1.4TransactionsandFileSystems
14.2ErrorDetectionandCorrection
14.2.1StorageDeviceFailuresandMitigation
14.2.2RAID:Multi-DiskRedundancyforErrorCorrection
14.2.3SoftwareIntegrityChecks
14.3SummaryandFutureDirections
Exercises
References
Glossary
AbouttheAuthors
Preface
PrefacetotheeBookEdition
OperatingSystems:PrinciplesandPracticeisatextbookforafirstcoursein
undergraduateoperatingsystems.Inuseatover50collegesanduniversitiesworldwide,
thistextbookprovides:
Apathforstudentstounderstandhighlevelconceptsallthewaydowntoworking
code.
Extensiveworkedexamplesintegratedthroughoutthetextprovidestudentsconcrete
guidanceforcompletinghomeworkassignments.
Afocusonup-to-dateindustrytechnologiesandpractice
TheeBookeditionissplitintofourvolumesthattogethercontainexactlythesame
materialasthe(2nd)printeditionofOperatingSystems:PrinciplesandPractice,
reformattedforvariousscreensizes.Eachvolumeisself-containedandcanbeusedasa
standalonetext,e.g.,atschoolsthatteachoperatingsystemstopicsacrossmultiple
courses.
Volume1:KernelsandProcesses.ThisvolumecontainsChapters1-3oftheprint
edition.Wedescribetheessentialstepsneededtoisolateprogramstopreventbuggy
applicationsandcomputervirusesfromcrashingortakingcontrolofyoursystem.
Volume2:Concurrency.ThisvolumecontainsChapters4-7oftheprintedition.We
provideaconcretemethodologyforwritingcorrectconcurrentprogramsthatisin
widespreaduseinindustry,andweexplainthemechanismsforcontextswitchingand
synchronizationfromfundamentalconceptsdowntoassemblycode.
Volume3:MemoryManagement.ThisvolumecontainsChapters8-10oftheprint
edition.Weexplainboththetheoryandmechanismsbehind64-bitaddressspace
translation,demandpaging,andvirtualmachines.
Volume4:PersistentStorage.ThisvolumecontainsChapters11-14oftheprint
edition.Weexplainthetechnologiesunderlyingmodernextent-based,journaling,and
versioningfilesystems.
Amoredetaileddescriptionofeachchapterisgivenintheprefacetotheprintedition.
PrefacetothePrintEdition
WhyWeWroteThisBook
Manyofourstudentstellusthatoperatingsystemswasthebestcoursetheytookasan
undergraduateandalsothemostimportantfortheircareers.Wearenotalone—manyof
ourcolleaguesreportreceivingsimilarfeedbackfromtheirstudents.
Partoftheexcitementisthatthecoreideasinamodernoperatingsystem—protection,
concurrency,virtualization,resourceallocation,andreliablestorage—havebecome
widelyappliedthroughoutcomputerscience,notjustoperatingsystemkernels.Whether
yougetajobatFacebook,Google,Microsoft,oranyotherleading-edgetechnology
company,itisimpossibletobuildresilient,secure,andflexiblecomputersystemswithout
theabilitytoapplyoperatingsystemsconceptsinavarietyofsettings.Inamodernworld,
nearlyeverythingauserdoesisdistributed,nearlyeverycomputerismulti-core,security
threatsabound,andmanyapplicationssuchaswebbrowsershavebecomemini-operating
systemsintheirownright.
Itshouldbenosurprisethatformanycomputersciencestudents,anundergraduate
operatingsystemsclasshasbecomeadefactorequirement:atickettoaninternshipand
eventuallytoafull-timeposition.
Unfortunately,manyoperatingsystemstextbooksarestillstuckinthepast,failingtokeep
pacewithrapidtechnologicalchange.Severalwidely-usedbookswereinitiallywrittenin
themid-1980’s,andtheyoftenactasiftechnologystoppedatthatpoint.Evenwhennew
topicsareadded,theyaretreatedasanafterthought,withoutpruningmaterialthathas
becomelessimportant.Theresultaretextbooksthatareverylong,veryexpensive,andyet
failtoprovidestudentsmorethanasuperficialunderstandingofthematerial.
Ourviewisthatoperatingsystemshavechangeddramaticallyoverthepasttwentyyears,
andthatjustifiesafreshlookatbothhowthematerialistaughtandwhatistaught.The
paceofinnovationinoperatingsystemshas,ifanything,increasedoverthepastfewyears,
withtheintroductionoftheiOSandAndroidoperatingsystemsforsmartphones,theshift
tomulticorecomputers,andtheadventofcloudcomputing.
Topreparestudentsforthisnewworld,webelievestudentsneedthreethingstosucceedat
understandingoperatingsystemsatadeeplevel:
Conceptsandcode.Webelieveitisimportanttoteachstudentsbothprinciplesand
practice,conceptsandimplementation,ratherthaneitheralone.Thistextbooktakes
conceptsallthewaydowntothelevelofworkingcode,e.g.,howacontextswitch
worksinassemblycode.Inourexperience,thisistheonlywaystudentswillreally
understandandmasterthematerial.Allofthecodeinthisbookisavailablefromthe
author’swebsite,ospp.washington.edu.
Extensiveworkedexamples.Inourview,studentsneedtobeabletoapplyconcepts
inpractice.Tothatend,wehaveintegratedalargenumberofexampleexercises,
alongwithsolutions,throughoutthetext.Weusestheseexercisesextensivelyinour
ownlectures,andwehavefoundthemessentialtochallengingstudentstogobeyond
asuperficialunderstanding.
Industrypractice.Toshowstudentshowtoapplyoperatingsystemsconceptsina
varietyofsettings,weusedetailed,concreteexamplesfromFacebook,Google,
Microsoft,Apple,andotherleading-edgetechnologycompaniesthroughoutthe
textbook.Becauseoperatingsystemsconceptsareimportantinawiderangeof
computersystems,wetaketheseexamplesnotonlyfromtraditionaloperating
systemslikeLinux,Windows,andOSXbutalsofromothersystemsthatneedto
solveproblemsofprotection,concurrency,virtualization,resourceallocation,and
reliablestoragelikedatabases,webbrowsers,webservers,mobileapplications,and
searchengines.
Takingafreshperspectiveonwhatstudentsneedtoknowtoapplyoperatingsystems
conceptsinpracticehasledustoinnovateineverymajortopiccoveredinan
undergraduate-levelcourse:
KernelsandProcesses.Thesafeexecutionofuntrustedcodehasbecomecentralto
manytypesofcomputersystems,fromwebbrowserstovirtualmachinestooperating
systems.YetexistingtextbookstreatprotectionasasideeffectofUNIXprocesses,as
iftheyaresynonyms.Instead,westartfromfirstprinciples:whataretheminimum
requirementsforprocessisolation,howcansystemsimplementprocessisolation
efficiently,andwhatdostudentsneedtoknowtoimplementfunctionscorrectlywhen
thecallerispotentiallymalicious?
Concurrency.Withtheadventofmulti-corearchitectures,moststudentstodaywill
spendmuchoftheircareerswritingconcurrentcode.Existingtextbooksprovidea
blizzardofconcurrencyalternatives,mostofwhichwereabandoneddecadesagoas
impractical.Instead,wefocusonprovidingstudentsasinglemethodologybasedon
Mesamonitorsthatwillenablestudentstowritecorrectconcurrentprograms—a
methodologythatisbyfarthedominantapproachusedinindustry.
MemoryManagement.Evenasdemand-paginghasbecomelessimportant,
virtualizationhasbecomeevenmoreimportanttomoderncomputersystems.We
provideadeeptreatmentofaddresstranslationhardware,sparseaddressspaces,
TLBs,andon-chipcaches.Wethenusethoseconceptsasaspringboardfor
describingvirtualmachinesandrelatedconceptssuchascheckpointingandcopy-onwrite.
PersistentStorage.Reliablestorageinthepresenceoffailuresiscentraltothe
designofmostcomputersystems.Existingtextbookssurveythehistoryoffile
systems,spendingmostoftheirtimeadhocapproachestofailurerecoveryanddefragmentation.Yetnomodernfilesystemsstillusethoseadhocapproaches.Instead,
ourfocusisonhowfilesystemsuseextents,journaling,copy-on-write,andRAIDto
achievebothhighperformanceandhighreliability.
IntendedAudience
OperatingSystems:PrinciplesandPracticeisatextbookforafirstcoursein
undergraduateoperatingsystems.Webelieveoperatingsystemsshouldbetakenasearly
aspossibleinanundergraduate’scourseofstudy;manystudentsusethecourseasa
springboardtoaninternshipandacareer.Tothatend,wehavedesignedthetextbookto
assumeminimalpre-requisites:specifically,studentsshouldhavetakenadatastructures
courseandoneoncomputerorganization.Thecodeexamplesarewritteninacombination
ofx86assembly,C,andC++.Inparticular,wehavedesignedthebooktointerfacewell
withtheBryantandO’Hallorantextbook.Wereviewandcoverinmuchmoredepththe
materialfromthesecondhalfofthatbook.
Weshouldnotewhatthistextbookisnot:itisnotintendedtoteachtheAPIorinternalsof
anyspecificoperatingsystem,suchasLinux,Android,Windows8,OSX,oriOS.Weuse
manyconcreteexamplesfromthesesystems,butourfocusisonthesharedproblemsthese
systemsfaceandthetechnologiesthesesystemsusetosolvethoseproblems.
AGuidetoInstructors
Oneofourgoalsisenableinstructorstochooseanappropriatelevelofdepthforeach
coursetopic.Eachchapterbeginsataconceptuallevel,withimplementationdetailsand
themoreadvancedmaterialtowardstheend.Themoreadvancedmaterialcanbeomitted
withoutcompromisingtheabilityofstudentstofollowlatermaterial.Nosingle-quarteror
single-semestercourseislikelytobeabletocovereverytopicwehaveincluded,butwe
thinkitisagoodthingforstudentstocomeawayfromanoperatingsystemscoursewith
anappreciationthatthereisalwaysmoretolearn.
Foreachtopic,weattempttoconveyitatthreelevels:
Howtoreasonaboutsystems.Wedescribecoresystemsconcepts,suchas
protection,concurrency,resourcescheduling,virtualization,andstorage,andwe
providepracticeapplyingtheseconceptsinvarioussituations.Inourview,this
providesthebiggestlong-termpayofftostudents,astheyarelikelytoneedtoapply
theseconceptsintheirworkthroughouttheircareer,almostregardlessofwhat
projecttheyendupworkingon.
Powertools.Weintroducestudentstoanumberofabstractionsthattheycanapplyin
theirworkinindustryimmediatelyaftergraduation,andthatweexpectwillcontinue
tobeusefulfordecadessuchassandboxing,protectedprocedurecalls,threads,locks,
conditionvariables,caching,checkpointing,andtransactions.
Detailsofspecificoperatingsystems.Weincludenumerousexamplesofhow
differentoperatingsystemsworkinpractice.However,thismaterialchangesrapidly,
andthereisanorderofmagnitudemorematerialthancanbecoveredinasingle
semester-lengthcourse.Thepurposeoftheseexamplesistoillustratehowtousethe
operatingsystemsprinciplesandpowertoolstosolveconcreteproblems.Wedonot
attempttoprovideacomprehensivedescriptionofLinux,OSX,oranyother
particularoperatingsystem.
Thebookisdividedintofiveparts:anintroduction(Chapter1),kernelsandprocesses
(Chapters2-3),concurrency,synchronization,andscheduling(Chapters4-7),memory
management(Chapters8-10),andpersistentstorage(Chapters11-14).
Introduction.ThegoalofChapter1istointroducetherecurringthemesfoundinthe
laterchapters.Wedefinesomecommonterms,andweprovideabitofthehistoryof
thedevelopmentofoperatingsystems.
TheKernelAbstraction.Chapter2coverskernel-basedprocessprotection—the
conceptandimplementationofexecutingauserprogramwithrestrictedprivileges.
Giventheincreasingimportanceofcomputersecurityissues,webelieveprotected
executionandsafetransferacrossprivilegelevelsareworthtreatingindepth.We
havebrokenthedescriptionintosections,toallowinstructorstochooseeitheraquick
introductiontotheconcepts(upthroughSection2.3),orafulltreatmentofthekernel
implementationdetailsdowntothelevelofinterrupthandlers.Someinstructorsstart
withconcurrency,andcoverkernelsandkernelprotectionafterwards.Whileour
textbookcanbeusedthatway,wehavefoundthatstudentsbenefitfromabasic
understandingoftheroleofoperatingsystemsinexecutinguserprograms,before
introducingconcurrency.
TheProgrammingInterface.Chapter3isintendedasanimpedancematchfor
studentsofdifferingbackgrounds.Dependingonstudentbackground,itcanbe
skippedorcoveredindepth.Thechaptercoverstheoperatingsystemfroma
programmer’sperspective:processcreationandmanagement,device-independent
input/output,interprocesscommunication,andnetworksockets.Ourgoalisthat
studentsshouldunderstandatadetailedlevelwhathappenswhenauserclicksalink
inawebbrowser,astherequestistransferredthroughoperatingsystemkernelsand
userspaceprocessesattheclient,server,andbackagain.Thischapteralsocoversthe
organizationoftheoperatingsystemitself:howdevicedriversandthehardware
abstractionlayerworkinamodernoperatingsystem;thedifferencebetweena
monolithicandamicrokerneloperatingsystem;andhowpolicyandmechanismare
separatedinmodernoperatingsystems.
ConcurrencyandThreads.Chapter4motivatesandexplainstheconceptof
threads.Becauseoftheincreasingimportanceofconcurrentprogramming,andits
integrationwithmodernprogramminglanguageslikeJava,manystudentshavebeen
introducedtomulti-threadedprogramminginanearlierclass.Thisisabitdangerous,
asstudentsatthisstagearepronetowritingprogramswithraceconditions,problems
thatmayormaynotbediscoveredwithtesting.Thus,thegoalofthischapteristo
provideasolidconceptualframeworkforunderstandingthesemanticsof
concurrency,aswellashowconcurrentthreadsareimplementedinboththe
operatingsystemkernelandinuser-levellibraries.Instructorsneedingtogomore
quicklycanomittheseimplementationdetails.
Synchronization.Chapter5discussesthesynchronizationofmulti-threaded
programs,acentralpartofalloperatingsystemsandincreasinglyimportantinmany
othercontexts.Ourapproachistodescribeoneeffectivemethodforstructuring
concurrentprograms(basedonMesamonitors),ratherthantoattempttocover
severaldifferentapproaches.Inourview,itismoreimportantforstudentstomaster
onemethodology.Monitorsareaparticularlyrobustandsimpleone,capableof
implementingmostconcurrentprogramsefficiently.Theimplementationof
synchronizationprimitivesshouldbeincludedifthereistime,sostudentsseethat
thereisnomagic.
Multi-ObjectSynchronization.Chapter6discussesadvancedtopicsinconcurrency
—specifically,thetwinchallengesofmultiprocessorlockcontentionanddeadlock.
Thismaterialisincreasinglyimportantforstudentsworkingonmulticoresystems,
butsomecoursesmaynothavetimetocoveritindetail.
Scheduling.Thischaptercoverstheconceptsofresourceallocationinthespecific
contextofprocessorscheduling.Withtheadventofdatacentercomputingand
multicorearchitectures,theprinciplesandpracticeofresourceallocationhave
renewedimportance.Afteraquicktourthroughthetradeoffsbetweenresponsetime
andthroughputforuniprocessorscheduling,thechaptercoversasetofmore
advancedtopicsinaffinityandmultiprocessorscheduling,power-awareanddeadline
scheduling,aswellasbasicqueueingtheoryandoverloadmanagement.Weconclude
thesetopicsbywalkingstudentsthroughacasestudyofserver-sideload
management.
AddressTranslation.Chapter8explainsmechanismsforhardwareandsoftware
addresstranslation.Thefirstpartofthechaptercovershowhardwareandoperating
systemscooperatetoprovideflexible,sparseaddressspacesthroughmulti-level
segmentationandpaging.Wethendescribehowtomakememorymanagement
efficientwithtranslationlookasidebuffers(TLBs)andvirtuallyaddressedcaches.
WeconsiderhowtokeepTLBsconsistentwhentheoperatingsystemmakeschanges
toitspagetables.Weconcludewithadiscussionofmodernsoftware-based
protectionmechanismssuchasthosefoundintheMicrosoftCommonLanguage
RuntimeandGoogle’sNativeClient.
CachingandVirtualMemory.Cachesarecentraltomanydifferenttypesof
computersystems.Moststudentswillhaveseentheconceptofacacheinanearlier
classonmachinestructures.Thus,ourgoalistocoverthetheoryandimplementation
ofcaches:whentheyworkandwhentheydonot,aswellashowtheyare
implementedinhardwareandsoftware.Wethenshowhowtheseideasareappliedin
thecontextofmemory-mappedfilesanddemand-pagedvirtualmemory.
AdvancedMemoryManagement.Addresstranslationisapowerfultoolinsystem
design,andweshowhowitcanbeusedforzerocopyI/O,virtualmachines,process
checkpointing,andrecoverablevirtualmemory.Asthisismoreadvancedmaterial,it
canbeskippedbythoseclassespressedfortime.
FileSystems:IntroductionandOverview.Chapter11framesthefilesystem
portionofthebook,startingtopdownwiththechallengesofprovidingausefulfile
abstractiontousers.WethendiscusstheUNIXfilesysteminterface,themajor
internalelementsinsideafilesystem,andhowdiskdevicedriversarestructured.
StorageDevices.Chapter12surveysblockstoragehardware,specificallymagnetic
disksandflashmemory.Thelasttwodecadeshaveseenrapidchangeinstorage
technologyaffectingbothapplicationprogrammersandoperatingsystemsdesigners;
thischapterprovidesasnapshotforstudents,asabuildingblockforthenexttwo
chapters.Ifstudentshavepreviouslyseenthismaterial,thischaptercanbeskipped.
FilesandDirectories.Chapter13discussesfilesystemlayoutondisk.Ratherthan
surveyallpossiblefilelayouts—somethingthatchangesrapidlyovertime—we
usefilesystemsasaconcreteexampleofmappingcomplexdatastructuresonto
blockstoragedevices.
ReliableStorage.Chapter14explainstheconceptandimplementationofreliable
storage,usingfilesystemsasaconcreteexample.Startingwiththeadhoctechniques
usedinearlyfilesystems,thechapterexplainscheckpointingandwriteahead
loggingasalternateimplementationstrategiesforbuildingreliablestorage,andit
discusseshowredundancysuchaschecksumsandreplicationareusedtoimprove
reliabilityandavailability.
Wewelcomeandencouragesuggestionsforhowtoimprovethepresentationofthe
material;pleasesendanycommentstothepublisher’swebsite,
Acknowledgements
Wehavebeenincrediblyfortunatetohavethehelpofalargenumberofpeopleinthe
conception,writing,editing,andproductionofthisbook.
WestartedonthejourneyofwritingthisbookoverdinnerattheUSENIXNSDI
conferencein2010.Atthetime,wethoughtperhapsitwouldtakeusthesummerto
completethefirstversionandperhapsayearbeforewecoulddeclareourselvesdone.We
wereverywrong!Itisnoexaggerationtosaythatitwouldhavetakenusalotlonger
withoutthehelpwehavereceivedfromthepeoplewementionbelow.
Perhapsmostimportanthavebeenourearlyadopters,whohavegivenusenormously
usefulfeedbackaswehaveputtogetherthisedition:
Carnegie-Mellon
DavidEckhardtandGarthGibson
Clarkson
JeannaMatthews
Cornell
GunSirer
ETHZurich
MothyRoscoe
NewYorkUniversity
LaskshmiSubramanian
PrincetonUniversity
KaiLi
SaarlandUniversity
PeterDruschel
StanfordUniversity
JohnOusterhout
UniversityofCaliforniaRiverside
HarshaMadhyastha
UniversityofCaliforniaSantaBarbara BenZhao
UniversityofMaryland
NeilSpring
UniversityofMichigan
PeteChen
UniversityofSouthernCalifornia
RameshGovindan
UniversityofTexas-Austin
LorenzoAlvisi
UniverstiyofToronto
DingYuan
UniversityofWashington
GaryKimuraandEdLazowska
Indevelopingourapproachtoteachingoperatingsystems,bothbeforewestartedwriting
andafterwardsaswetriedtoputourthoughtstopaper,wemadeextensiveuseoflecture
notesandslidesdevelopedbyotherfaculty.Ofparticularhelpwerethematerialscreated
byPeteChen,PeterDruschel,SteveGribble,EddieKohler,JohnOusterhout,Mothy
Roscoe,andGeoffVoelker.Wethankthemall.
Ourillustratorforthesecondedition,CameronNeat,hasbeenajoytoworkwith.We
wouldalsoliketothankSimonPeterforrunningthemultiprocessorexperiments
introducingChapter6.
WearealsogratefultoLorenzoAlvisi,AdamAnderson,PeteChen,SteveGribble,Sam
Hopkins,EdLazowska,HarshaMadhyastha,JohnOusterhout,MarkRich,MothyRoscoe,
WillScott,GunSirer,IonStoica,LakshmiSubramanian,andJohnZahorjanfortheir
helpfulcommentsandsuggestionsastohowtoimprovethebook.
WethankJoshBerlin,MarlaDahlin,RasitEskicioglu,SandyKaplan,JohnOusterhout,
WhitneySchmidt,andMikeWalfishforhelpingusidentifyandcorrectgrammaticalor
technicalbugsinthetext.
WethankJeffDean,GarthGibson,MarkOskin,SimonPeter,DaveProbert,AminVahdat,
andMarkZbikowskifortheirhelpinexplainingtheinternalworkingsofsomeofthe
commercialsystemsmentionedinthisbook.
WewouldliketothankDaveWetherall,DanWeld,MikeWalfish,DavePatterson,Olav
Kvern,DanHalperin,ArmandoFox,RobinBriggs,KatyaAnderson,SandraAnderson,
LorenzoAlvisi,andWilliamAdamsfortheirhelpandadviceontextbookeconomicsand
production.
TheHelenRiaboffWhiteleyCenteraswellasDonandJeanneDahlinwerekindenough
tolendusaplacetoescapewhenweneededtogetchapterswritten.
Finally,wethankourfamilies,ourcolleagues,andourstudentsforsupportingusinthis
larger-than-expectedeffort.
IV
PersistentStorage
11.FileSystems:IntroductionandOverview
Memoryisthetreasuryandguardianofallthings.—MarcusTulliusCicero
Computersmustbeabletoreliablystoredata.Individualsstorefamilyphotos,musicfiles,
andemailfolders;programmersstoredesigndocumentsandsourcefiles;officeworkers
storespreadsheets,textdocuments,andpresentationslides;andbusinessesstoreinventory,
orders,andbillingrecords.Infact,foracomputertoworkatall,itneedstobeableto
storeprogramstorunandtheoperatingsystem,itself.
Forallofthesecases,usersdemandalotfromtheirstoragesystems:
Reliability.Auser’sdatashouldbesafelystoredevenifamachine’spoweristurned
offoritsoperatingsystemcrashes.Infact,muchofthisdataissoimportantthat
usersexpectandneedthedatatosurviveevenifthedevicesusedtostoreitare
damaged.Forexample,manymodernstoragesystemscontinuetoworkevenifone
ofthemagneticdisksstoringthedatamalfunctionsorevenifadatacenterhousing
someofthesystem’sserversburnsdown!
Largecapacityandlowcost.Usersandcompaniesstoreenormousamountofdata,
sotheywanttobeabletobuyhighcapacitystorageforalowcost.Forexample,it
takesabout350MBtostoreanhourofCD-qualitylosslesslyencodedmusic,4GB
tostoreanhour-longhigh-definitionhomevideo,andabout1GBtostore300digital
photos.Asaresultoftheseneeds,manyindividualsown1TBormoreofstoragefor
theirpersonalfiles.Thisisanenormousamount:ifyouprinted1TBofdataastext
onpaper,youwouldproduceastackabout20mileshigh.Incontrast,forlessthan
$100youcanbuy1TBofstoragethatfitsinashoebox.
Highperformance.Forprogramstousedata,theymustbeabletoaccessit,andfor
programstouselargeamountsofdata,thisaccessmustbefast.Forexample,users
wantprogramstart-uptobenearlyinstantaneous,abusinessmayneedtoprocess
hundredsorthousandsoforderspersecond,oraservermayneedtostreamalarge
numberofvideofilestodifferentusers.
Nameddata.Becauseusersstorealargeamountofdata,becausesomedatamust
lastlongerthantheprocessthatcreatesit,andbecausedatamustbesharedacross
programs,storagesystemsmustprovidewaystoeasilyidentifydataofinterest.For
example,ifyoucannameafile(e.g.,/home/alice/assignments/hw1.txt)youcanfind
thedatayouwantoutofthemillionsofblocksonyourdisk,youcanstillfinditafter
youshutdownyourtexteditor,andyoucanuseyouremailprogramtosendthedata
producedbythetexteditortoanotheruser.
Controlledsharing.Usersneedtobeabletosharestoreddata,butthissharingneeds
tobecontrolled.Asoneexample,youmaywanttocreateadesigndocumentthat
everyoneinyourgroupcanreadandwrite,thatpeopleinyourdepartmentcanread
butnotwrite,andthatpeopleoutsideofyourdepartmentcannotaccessatall.As
anotherexample,itisusefulforasystemtobeabletoallowanyonetoexecutea
programwhileonlyallowingthesystemadministratortochangetheprogram.
Nonvolatilestorageandfilesystems.Thecontentsofasystem’smainDRAMmemory
canbelostifthereisanoperatingsystemcrashorpowerfailure.Incontrast,non-volatile
storageisdurableandretainsitsstateacrosscrashesandpoweroutages;non-volatile
storageisalsocalledorpersistentstorageorstablestorage.Nonvolatilestoragecanalso
havemuchhighercapacityandlowercostthanthevolatileDRAMthatformsthebulkof
mostsystem’s“mainmemory.”
However,non-volatilestoragetechnologieshavetheirownlimitations.Forexample,
currentnon-volatilestoragetechnologiessuchasmagneticdisksandhigh-densityflash
storagedonotallowrandomaccesstoindividualwordsofstorage;instead,accessmustbe
doneinmorecoarse-grainedunits—512,2048,ormorebytesatatime.
Furthermore,theseaccessescanbemuchslowerthanaccesstoDRAM;forexample,
readingasectorfromamagneticdiskmayrequireactivatingamotortomoveadiskarm
toadesiredtrackondiskandthenwaitingforthespinningdisktobringthedesireddata
underthediskhead.Becausediskaccessesinvolvemotorsandphysicalmotion,thetime
toaccessarandomsectoronadiskcanbearound10milliseconds.Incontrast,DRAM
latenciesaretypicallyunder100nanoseconds.Thislargedifference—aboutfiveorders
ofmagnitudeinthecaseofspinningdisks—drivestheoperatingsystemtoorganizeand
usepersistentstoragedevicesdifferentlythanmainmemory.
Filesystemsareacommonoperatingsystemabstractiontoallowapplicationstoaccess
non-volatilestorage.Filesystemsuseanumberoftechniquestocopewiththephysical
limitationsofnon-volatilestoragedevicesandtoprovidebetterabstractionstousers.For
example,Figure11.1summarizeshowphysicalcharacteristicsmotivateseveralkey
aspectsoffilesystemdesign.
Goal
PhysicalCharacteristic DesignImplication
High
LargecosttoinitiateIO
performance access
Organizedataplacementwithfiles,directories,
freespacebitmap,andplacementheuristicsso
thatstorageisaccessedinlargesequentialunits
Cachingtoavoidaccessingpersistentstorage
Storagehaslarge
capacity,survives
Nameddata
crashes,andisshared
acrossprograms
Supportfilesanddirectorieswithmeaningful
names
Controlled
sharing
Includeaccess-controlmetadatawithfiles
Devicestoresmany
users’data
Reliable
storage
Crashcanoccurduring
update
Usetransactionstomakeasetofupdatesatomic
Storagedevicescanfail
Useredundancytodetectandcorrectfailures
Flashmemorycellscan
wearout
Movedatatodifferentstoragelocationstoeven
thewear
Figure11.1:Characteristicsofpersistentstoragedevicesaffectthedesignofanoperating
system’sstorageabstractions.
Performance.Filesystemsamortizethecostofinitiatingexpensiveoperations—
suchasmovingadiskarmorerasingablockofsolidstatememory—bygrouping
whereitsplacementofdatasothatsuchoperationsaccesslarge,sequentialrangesof
storage.
Naming.Filesystemsgrouprelateddatatogetherintodirectoriesandfilesand
providehuman-readablenamesforthem(e.g.,/home/alice/Pictures/summervacation/hiking.jpg.)Thesenamesfordataremainmeaningfulevenaftertheprogram
thatcreatesthedataexits,theyhelpusersorganizelargeamountsofstorage,andthey
makeiteasyforuserstousedifferentprogramstocreate,read,andedit,theirdata.
Controlledsharing.Filesystemsincludemetadataaboutwhoownswhichfilesand
whichotherusersareallowedtoread,write,orexecutedataandprogramfiles.
Reliability.Filesystemsusetransactionstoatomicallyupdatemultipleblocksof
persistentstorage,similartohowtheoperatingsystemusescriticalsectionsto
atomicallyupdatedifferentdatastructuresinmemory.
Tofurtherimprovereliability,filesystemsstorechecksumswithdatatodetect
corruptedblocks,andtheyreplicatedataacrossmultiplestoragedevicestorecover
fromhardwarefailures.
Impactonapplicationwriters.Understandingthereliabilityandperformanceproperties
ofstoragehardwareandfilesystemsisimportantevenifyouarenotdesigningafile
systemfromscratch.Becauseofthefundamentallimitationsofexistingstoragedevices,
thehigher-levelillusionsofreliabilityandperformanceprovidedbythefilesystemare
imperfect.Anapplicationprogrammerneedstounderstandtheselimitationstoavoid
havinginconsistentdatastoredondiskorhavingaprogramrunordersofmagnitude
slowerthanexpected.
Forexample,supposeyoueditalargedocumentwithmanyembeddedimagesandthat
yourwordprocessorperiodicallyauto-savesthedocumentsothatyouwouldnotlosetoo
manyeditsifthemachinecrashes.Iftheapplicationusesthefilesystemina
straightforwardway,severalofunexpectedthingsmayhappen.
Poorperformance.First,althoughfilesystemsallowexistingbytesinafiletobe
overwrittenwithnewvalues,theydonotallownewbytestobeinsertedintothe
middleofexistingbytes.So,evenasmallupdatetothefilemayrequirerewritingthe
entirefileeitherfrombeginningtoendoratleastfromthepointofthefirstinsertion
totheend.Foramulti-megabytefile,eachauto-savemayenduptakingasmuchasa
second.
Corruptfile.Second,iftheapplicationsimplyoverwritestheexistingfilewith
updateddata,anuntimelycrashcanleavethefileinaninconsistentstate,containing
amishmashoftheoldandnewversions.Forexample,ifasectioniscutfromone
locationandpastedinanother,afteracrashthesaveddocumentmayendupwith
copiesofthesectioninbothlocations,onelocation,orneitherlocation;oritmayend
upwitharegionthatisamixoftheoldandnewtext.
Lostfile.Third,ifinsteadofoverwritingthedocumentfile,theapplicationwrites
updatestoanewfile,thendeletestheoriginalfile,andfinallymovesthenewfileto
theoriginalfile’slocation,anuntimelycrashcanleavethesystemwithnocopiesof
thedocumentatall.
Programsusearangeoftechniquestodealwiththesetypesofissues.Forexample,some
structuretheircodetotakeadvantageofthedetailedsemanticsofspecificoperating
systems.Someoperatingsystemsguaranteethatwhenafileisrenamedandafilewiththe
targetnamealreadyexists,thetargetnamewillalwaysrefertoeithertheoldornewfile,
evenafteracrashinthemiddleoftherenameoperation.Insuchacase,an
implementationcancreateanewfilewiththenewversionofthedataandusetherename
commandtoatomicallyreplacetheoldversionwiththenewone.
Otherprogramsessentiallybuildaminiaturefilesystemoverthetopoftheunderlying
one,structuringtheirdatasothattheunderlyingfilesystemcanbettermeettheir
performanceandreliabilityrequirements.
Forexample,awordprocessormightuseasophisticateddocumentformat,allowingitto,
forexample,addandremoveembeddedimagesandtoalwaysupdateadocumentby
appendingupdatestotheendofthefile.
Asanotherexample,adataanalysisprogrammightimproveitsperformanceby
organizingitsaccessestoinputfilesinawaythatensuresthateachinputfileisreadonly
onceandthatitisreadsequentiallyfromitsstarttoitsend.
Or,abrowserwitha1GBon-diskcachemightcreate100files,eachcontaining10MBof
data,andgroupagivenwebsite’sobjectsinasequentialregionofarandomlyselected
file.Todothis,thebrowserwouldneedtokeepmetadatathatmapseachcachedwebsite
toaregionofafile,itwouldneedtokeeptrackofwhatregionsofeachfileareusedand
whicharefree,itwouldneedtodecidewheretoplaceanewwebsite’sobjects,andit
wouldneedtohaveastrategyforgrowingormovingawebsite’sobjectsasadditional
objectsarefetched.
Roadmap.Togetgoodperformanceandacceptablereliability,bothapplicationwriters
andoperatingsystemsdesignersmustunderstandhowstoragedevicesandfilesystems
work.Thischapterandthenextthreediscussthekeyissues:
APIandabstractions.Therestofthischapterintroducesfilesystemsbydescribing
atypicalAPIandsetofabstractions,anditprovidesanoverviewofthesoftware
layersthatprovidetheseabstractions.
Storagedevices.Thecharacteristicsofpersistentstoragedevicesstronglyinfluence
thedesignofstoragesystemabstractionsandhigherlevelapplications.Chapter12
thereforeexploresthephysicalcharacteristicsofcommonstoragedevices.
Implementingfilesanddirectories.Chapter13describeshowfilesystemskeep
trackofdatabydescribingseveralwidelyusedapproachestoimplementingfilesand
directories.
Reliablestorage.Althoughwewouldlikestoragetobeperfectlyreliable,physical
devicesfallshortofthatideal.Chapter14describeshowstoragesystemsuse
transactionalupdatesandredundancytoimprovereliability.
11.1TheFileSystemAbstraction
Today,almostanyonewhousesacomputerisfamiliarwiththehigh-levelfilesystem
abstraction.Filesystemsprovideawayforuserstoorganizetheirdataandtostoreitfor
longperiodsoftime.Forexample,Bob’scomputermightstoreacollectionofapplications
suchas/Applications/Calculatorand/ProgramFiles/TextEditandacollectionofdatafiles
suchas/home/Bob/correspondence/letter-to-mom.txt,and/home/Bob/Classes/OS/hw1.txt.
Moreprecisely,afilesystemisanoperatingsystemabstractionthatprovidespersistent,
nameddata.Persistentdataisstoreduntilitisexplicitlydeleted,evenifthecomputer
storingitcrashesorlosespower.Nameddatacanbeaccessedviaahuman-readable
identifierthatthefilesystemassociateswiththefile.Havinganameallowsafiletobe
accessedevenaftertheprogramthatcreatedithasexited,andallowsittobesharedby
multipleapplications.
Therearetwokeypartstothefilesystemabstraction:files,whichdefinesetsofdata,and
directories,whichdefinenamesforfiles.
File.Afileisanamedcollectionofdatainafilesystem.Forexample,theprograms
/Applications/Calculatoror/ProgramFiles/TextEditareeachfiles,asarethedata
/home/Bob/correspondence/letter-to-mom.txtor/home/Bob/Classes/OS/hw1.txt.
Filesprovideahigher-levelabstractionthantheunderlyingstoragedevice:theyleta
single,meaningfulnamerefertoan(almost)arbitrarily-sizedamountofdata.Forexample
/home/Bob/Classes/OS/hw1.txtmightbestoredondiskinblocks0x0A713F28,
0xB3CA349A,and0x33A229B8,butitismuchmoreconvenienttorefertothedatabyits
namethanbythislistofdiskaddresses.
Afile’sinformationhastwoparts,metadataanddata.Afile’smetadataisinformation
aboutthefilethatisunderstoodandmanagedbytheoperatingsystem.Forexample,a
file’smetadatatypicallyincludesthefile’ssize,itsmodificationtime,itsowner,andits
securityinformationsuchaswhetheritmayberead,written,orexecutedbytheowneror
byotherusers.
Afile’sdatacanbewhateverinformationauserorapplicationputsinit.Fromthepointof
viewofthefilesystem,afile’sdataisjustanarrayofuntypedbytes.Applicationscanuse
thesebytestostorewhateverinformationtheywantinwhateverformattheychoose.Some
datahaveasimplestructure.Forexample,anASCIItextfilecontainsasequenceofbytes
thatareinterpretedaslettersintheEnglishalphabet.Conversely,datastructuresstoredby
applicationscanbearbitrarilycomplex.Forexample,a.docfilescancontaintext,
formattinginformation,andembeddedobjectsandimages,anELF(Executableand
LinkableFile)filescancontaincompiledobjectsandexecutablecode,oradatabasefile
cancontaintheinformationandindicesmanagedbyarelationaldatabase.
Executing“untyped”files
Usually,anoperatingsystemtreatsafile’sdataasanarrayofuntypedbytes,leavingitup
toapplicationstointerpretafile’scontents.Occasionally,however,theoperatingsystem
needstobeabletoparseafile’sdata.
Forexample,LinuxsupportsanumberofdifferentexecutablefiletypessuchastheELF
anda.outbinaryfilesandtcsh,csh,andperlscripts.Youcanrunanyofthesefilesfrom
thecommandlineorusingtheexec()systemcall.E.g.,
>a.out
Helloworldfromhello.ccompiledbygcc!
>hello.pl
Helloworldfromhello.pl,aperlscript!
>echo‘‘Helloworldfrom/bin/echo,asystembinary!’’
Helloworldfrom/bin/echo,asystembinary!
Toexecuteafile,theoperatingsystemmustdeterminewhetheritisabinaryfileora
script.Ifitistheformer,theoperatingsystemmustparsethefiletodeterminewherein
thetargetprocess’smemorytoloadcodeanddatafromthefileandwhichinstructionto
startwith.Ifitisthelatter,theoperatingsystemmustdeterminewhichinterpreter
programitshouldlaunchtoexecutethescript.
Linuxdoesthisbyhavingexecutablefilesbeginwithamagicnumberthatidentifiesthe
file’sformat.Forexample,ELFbinaryexecutablesbeginwiththefourbytes0x7f,0x45,
0x4c,and0x46(theASCIIcharactersDEL,E,L,andF);onceanexecutableisknownto
beanELFfile,theELFstandarddefineshowtheoperatingsystemshouldparsetherest
ofthefiletoextractandloadtheprogram’scodeanddata.Similarly,scriptfilesbegin
with#!followedbythenameoftheinterpreterthatshouldbeusedtorunthescript(e.g.,
ascriptmightbeginwith#!/bin/shtobeexecutedusingtheBourneshellor#!
/usr/bin/perltobeexecutedusingtheperlinterpreter.
Alternativeapproachesincludedeterminingafile’stypebyitsnameextension—the
charactersafterthelastdot(.)inthefile’sname(e.g.,.exe,.pl,or.sh)—orincluding
informationaboutafile’stypeinitsmetadata.
Multipledatastreams
Fortraditionalfiles,thefile’sdataisasinglelogicalsequenceofbytes,andeachbytecan