About the author: Weilin, a middle-aged programmer who has worked in the telecom, mobile phone, security and chip industries. He still works on Linux development and enjoys sharing Linux-related knowledge.

Contents:
    Memory freeing
    Memory allocation
        gfp_mask
        node candidate policy
        zone candidate policy
        zone fallback policy
        lowmem_reserve mechanism
        order fallback policy
        migratetype candidate policy
        migrate fallback policy
        reclaim watermark
        reclaim methods
        alloc_pages()

Memory freeing

In the buddy system, freeing memory is simpler than allocating it, so we analyze that part first.

It shows the core idea of the buddy allocator: when a block is freed, check whether its buddy is a free block of the same order; if it is, merge the two into a block of the next higher order. The goal is to reduce memory fragmentation as much as possible.

All memory freeing eventually ends up in __free_pages():

    void __free_pages(struct page *page, unsigned int order)
    {
        /* (1) Decrement page->_refcount and test whether it reached 0.
               If it did, the page can be freed. */
        if (put_page_testzero(page))
            free_the_page(page, order);
    }

    static inline void free_the_page(struct page *page, unsigned int order)
    {
        /* (1) A single page is first freed to the pcp */
        if (order == 0)        /* Via pcp? */
            free_unref_page(page);
        /* (2) A block of 2^order pages (order > 0) is freed to the
               free_area of that order */
        else
            __free_pages_ok(page, order);
    }

    static void __free_pages_ok(struct page *page, unsigned int order)
    {
        unsigned long flags;
        int migratetype;
        unsigned long pfn = page_to_pfn(page);

        /* (2.1) Work done before the page is freed: clear some fields,
                 run some checks, call some callbacks */
        if (!free_pages_prepare(page, order, true))
            return;

        /* (2.2) Get the migratetype of the pageblock containing the page.
                 The page is freed to the free_list of that migratetype
                 in the free_area of its order */
        migratetype = get_pfnblock_migratetype(page, pfn);
        local_irq_save(flags);
        __count_vm_events(PGFREE, 1 << order);
        /* (2.3) Free the page into the zone */
        free_one_page(page_zone(page), page, pfn, order, migratetype);
        local_irq_restore(flags);
    }

free_one_page() -> __free_one_page():

    static inline void __free_one_page(struct page *page,
            unsigned long pfn,
            struct zone *zone, unsigned int order,
            int migratetype)
    {
        unsigned long combined_pfn;
        unsigned long uninitialized_var(buddy_pfn);
        struct page *buddy;
        unsigned int max_order;

        max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);

        VM_BUG_ON(!zone_is_initialized(zone));
        VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);

        VM_BUG_ON(migratetype == -1);
        if (likely(!is_migrate_isolate(migratetype)))
            __mod_zone_freepage_state(zone, 1 << order, migratetype);

        VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
        VM_BUG_ON_PAGE(bad_range(zone, page), page);

    continue_merging:
        /* (2.3.1) Try to merge the freed 2^order block upward, one order at a time */
        while (order < max_order - 1) {
            /* (2.3.1.1) Get the buddy of the 2^order block being freed.
                         The buddy is found by XOR-ing pfn with (1 << order):
                         if bit 'order' of pfn is 0, the buddy is the pfn with
                         that bit set to 1, and vice versa */
            buddy_pfn = __find_buddy_pfn(pfn, order);
            buddy = page + (buddy_pfn - pfn);

            if (!pfn_valid_within(buddy_pfn))
                goto done_merging;
            /* (2.3.1.2) Check that the buddy is in buddy state:
                         it is free and in the buddy system
                           (page->_mapcount == PAGE_BUDDY_MAPCOUNT_VALUE),
                         and its free order equals the order being freed
                           (page->private == order) */
            if (!page_is_buddy(page, buddy, order))
                goto done_merging;
            /*
             * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
             * merge with it and move up one order.
             */
            if (page_is_guard(buddy)) {
                clear_page_guard(zone, buddy, order, migratetype);
            } else {
                /* (2.3.1.3) The merge conditions hold, so start merging:
                             remove the buddy from its free list */
                list_del(&buddy->lru);
                zone->free_area[order].nr_free--;
                /* Clear the order stored in the page:
                   page->_mapcount = -1, page->private = 0 */
                rmv_page_order(buddy);
            }
            /* (2.3.1.4) The two blocks now form a free block one order higher */
            combined_pfn = buddy_pfn & pfn;
            page = page + (combined_pfn - pfn);
            pfn = combined_pfn;
            order++;
        }
        if (max_order < MAX_ORDER) {
            /* If we are here, it means order is >= pageblock_order.
             * We want to prevent merge between freepages on isolate
             * pageblock and normal pageblock. Without this, pageblock
             * isolation could cause incorrect freepage or CMA accounting.
             *
             * We don't want to hit this code for the more frequent
             * low-order merging.
             */
            if (unlikely(has_isolate_pageblock(zone))) {
                int buddy_mt;

                buddy_pfn = __find_buddy_pfn(pfn, order);
                buddy = page + (buddy_pfn - pfn);
                buddy_mt = get_pageblock_migratetype(buddy);

                if (migratetype != buddy_mt
                        && (is_migrate_isolate(migratetype) ||
                            is_migrate_isolate(buddy_mt)))
                    goto done_merging;
            }
            max_order++;
            goto continue_merging;
        }

        /* (2.3.2) Link the merged free block into the free list of its order */
    done_merging:
        /* (2.3.2.1) Store the order in the page:
                     page->_mapcount = PAGE_BUDDY_MAPCOUNT_VALUE (-128),
                     page->private = order */
        set_page_order(page, order);

        /*
         * If this is not the largest possible page, check if the buddy
         * of the next-highest order is free. If it is, it's possible
         * that pages are being freed that will coalesce soon. In case,
         * that is happening, add the free page to the tail of the list
         * so it's less likely to be used soon and more likely to be merged
         * as a higher order page
         */
        /* (2.3.2.2) Add the free block to the tail of the list for its order */
        if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)) {
            struct page *higher_page, *higher_buddy;
            combined_pfn = buddy_pfn & pfn;
            higher_page = page + (combined_pfn - pfn);
            buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
            higher_buddy = higher_page + (buddy_pfn - combined_pfn);
            if (pfn_valid_within(buddy_pfn) &&
                page_is_buddy(higher_page, higher_buddy, order + 1)) {
                list_add_tail(&page->lru,
                    &zone->free_area[order].free_list[migratetype]);
                goto out;
            }
        }

        /* (2.3.2.3) Otherwise add the free block to the head of the list */
        list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
    out:
        zone->free_area[order].nr_free++;
    }

PageBuddy() tests whether a page is in the buddy system. Many similar page helpers are defined in page-flags.h.

linux-source-4.15.0/include/linux/page-flags.h:

    #define PAGE_MAPCOUNT_OPS(uname, lname)                            \
    static __always_inline int Page##uname(struct page *page)         \
    {                                                                  \
        return atomic_read(&page->_mapcount) ==                        \
                    PAGE_##lname##_MAPCOUNT_VALUE;                     \
    }                                                                  \
    static __always_inline void __SetPage##uname(struct page *page)   \
    {                                                                  \
        VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);     \
        atomic_set(&page->_mapcount, PAGE_##lname##_MAPCOUNT_VALUE);   \
    }                                                                  \
    static __always_inline void __ClearPage##uname(struct page *page) \
    {                                                                  \
        VM_BUG_ON_PAGE(!Page##uname(page), page);                      \
        atomic_set(&page->_mapcount, -1);                              \
    }

    /*
     * PageBuddy() indicate that the page is free and in the buddy system
     * (see mm/page_alloc.c).
     */
    #define PAGE_BUDDY_MAPCOUNT_VALUE    (-128)
    PAGE_MAPCOUNT_OPS(Buddy, BUDDY)
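The buddy arithmetic used by the merge loop in __free_one_page() is easy to try out in user space. The following is a minimal sketch, not kernel code: find_buddy_pfn() and combined_pfn() mirror the XOR and AND steps described above, while toy_init(), toy_free(), NPAGES, TOY_MAX_ORDER and the free_order[] array are hypothetical names for a 16-page toy model that only tracks which free block starts at which pfn.

```c
#include <assert.h>

#define NPAGES        16   /* size of the toy "zone", in pages */
#define TOY_MAX_ORDER 4    /* largest block in the toy model: 2^4 pages */

/* free_order[pfn] = order of the free block starting at pfn, or -1 if none */
int free_order[NPAGES];

/* Buddy of the 2^order block starting at pfn: flip bit 'order' of pfn. */
unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
    assert((pfn & ((1UL << order) - 1)) == 0);  /* block must be aligned */
    return pfn ^ (1UL << order);
}

/* Start of the merged block: the lower of the two buddy pfns. */
unsigned long combined_pfn(unsigned long pfn, unsigned long buddy_pfn)
{
    return pfn & buddy_pfn;
}

/* Mark every page of the toy zone as allocated. */
void toy_init(void)
{
    for (int i = 0; i < NPAGES; i++)
        free_order[i] = -1;
}

/* Free a 2^order block and merge upward while the buddy is a free block
 * of the same order. Returns the order of the resulting free block. */
unsigned int toy_free(unsigned long pfn, unsigned int order)
{
    while (order < TOY_MAX_ORDER) {
        unsigned long buddy = find_buddy_pfn(pfn, order);

        if (free_order[buddy] != (int)order)
            break;                      /* buddy busy or of another order */
        free_order[buddy] = -1;         /* take the buddy off its "list" */
        pfn = combined_pfn(pfn, buddy);
        order++;
    }
    free_order[pfn] = (int)order;
    return order;
}
```

Freeing page 0, then page 1, then the order-1 block at pfn 2 shows the cascade: the second free merges into an order-1 block at pfn 0, and the third merges again into an order-2 block.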
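A single page is first freed to the per-CPU cache:

start_kernel() -> mm_init() -> mem_init() -> free_all_bootmem() -> free_low_memory_core_early() -> __free_memory_core() -> __free_pages_memory() -> __free_pages_bootmem() -> __free_pages() -> free_the_page() -> free_unref_page():

    void free_unref_page(struct page *page)
    {
        unsigned long flags;
        unsigned long pfn = page_to_pfn(page);

        /* (1) Preparation: page->index = migratetype */
        if (!free_unref_page_prepare(page, pfn))
            return;

        local_irq_save(flags);
        /* (2) Free the page into the pcp */
        free_unref_page_commit(page, pfn);
        local_irq_restore(flags);
    }

    static void free_unref_page_commit(struct page *page, unsigned long pfn)
    {
        struct zone *zone = page_zone(page);
        struct per_cpu_pages *pcp;
        int migratetype;

        /* (2.1) migratetype = page->index */
        migratetype = get_pcppage_migratetype(page);
        __count_vm_event(PGFREE);

        /* (2.2) Special handling for some migratetypes */
        if (migratetype >= MIGRATE_PCPTYPES) {
            /* (2.2.1) Isolated pages go to the global free list */
            if (unlikely(is_migrate_isolate(migratetype))) {
                free_one_page(zone, page, pfn, 0, migratetype);
                return;
            }
            migratetype = MIGRATE_MOVABLE;
        }

        /* (2.3) Get the pcp of this zone for the current CPU */
        pcp = &this_cpu_ptr(zone->pageset)->pcp;
        /* (2.4) Add the free page to the matching pcp list */
        list_add(&page->lru, &pcp->lists[migratetype]);
        pcp->count++;
        /* (2.5) If the pcp holds too many pages (at least pcp->high),
                 free pcp->batch pages back to the global free list */
        if (pcp->count >= pcp->high) {
            unsigned long batch = READ_ONCE(pcp->batch);
            free_pcppages_bulk(zone, batch, pcp);
            pcp->count -= batch;
        }
    }

pcp->high and pcp->batch are assigned as follows:

start_kernel() -> setup_per_cpu_pageset() -> setup_zone_pageset() -> zone_pageset_init() -> pageset_set_high_and_batch():

    static int zone_batchsize(struct zone *zone)
    {
        int batch;

        /* batch is roughly (zone_size / (1024 * 4)) * (3/2),
           with the intermediate value capped at 512KB worth of pages */
        batch = zone->managed_pages / 1024;
        if (batch * PAGE_SIZE > 512 * 1024)
            batch = (512 * 1024) / PAGE_SIZE;
        batch /= 4;        /* We effectively *= 4 below */
        if (batch < 1)
            batch = 1;

        batch = rounddown_pow_of_two(batch + batch/2) - 1;

        return batch;
    }

    static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
    {
        /* high = 6 * batch */
        pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
    }

Memory allocation

Compared with freeing, the allocation policy is far more complex and has many more factors to weigh. Let us go through them one by one.

gfp_mask

gfp_mask is a set of GFP (Get Free Page) flags that control how pages are allocated.

node candidate policy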
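On a NUMA system there are several memory nodes to choose from, and the system picks the node for the current allocation according to the memory policy.

alloc_pages() -> alloc_pages_current():

    struct page *alloc_pages_current(gfp_t gfp, unsigned order)
    {
        /* (1.1) Start from the default NUMA policy */
        struct mempolicy *pol = &default_policy;
        struct page *page;

        /* (1.2) Get the NUMA policy of the current task */
        if (!in_interrupt() && !(gfp & __GFP_THISNODE))
            pol = get_task_policy(current);

        /*
         * No reference counting needed for current->mempolicy
         * nor system default_policy
         */
        if (pol->mode == MPOL_INTERLEAVE)
            page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
        else
            /* (2) Allocate pages from the preferred node and the set of
                   candidate nodes given by the NUMA policy */
            page = __alloc_pages_nodemask(gfp, order,
                    policy_node(gfp, pol, numa_node_id()),
                    policy_nodemask(gfp, pol));

        return page;
    }

zone candidate policy

The buddy system defines several types of zone for every node:

    enum zone_type {
        ZONE_DMA,
        ZONE_DMA32,
        ZONE_NORMAL,
        ZONE_HIGHMEM,
        ZONE_MOVABLE,
        ZONE_DEVICE,
        __MAX_NR_ZONES
    };

gfp_mask also defines a set of flags that select the zone:

    /*
     * Physical address zone modifiers (see linux/mmzone.h - low four bits)
     */
    #define __GFP_DMA      ((__force gfp_t)___GFP_DMA)
    #define __GFP_HIGHMEM  ((__force gfp_t)___GFP_HIGHMEM)
    #define __GFP_DMA32    ((__force gfp_t)___GFP_DMA32)
    #define __GFP_MOVABLE  ((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE allowed */
    #define GFP_ZONEMASK   (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)

How is the zone used for an allocation chosen from the zone modifiers in gfp_mask? The system converts them with a table lookup. The code is as follows:

alloc_pages() -> alloc_pages_current() -> __alloc_pages_nodemask() -> prepare_alloc_pages() -> gfp_zone():

    static inline enum zone_type gfp_zone(gfp_t flags)
    {
        enum zone_type z;
        /* (1) The low 4 bits of the gfp flags are the zone modifiers */
        int bit = (__force int) (flags & GFP_ZONEMASK);

        /* (2) Look up the resulting candidate zone in the table.
               The kernel forbids any two or all three of __GFP_DMA,
               __GFP_HIGHMEM and __GFP_DMA32 appearing together in the
               gfp flags */
        z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
                         ((1 << GFP_ZONES_SHIFT) - 1);
        VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
        return z;
    }

    #define GFP_ZONE_TABLE ( \
        (ZONE_NORMAL << 0 * GFP_ZONES_SHIFT)                                     \
        | (OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT)                         \
        | (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT)                 \
        | (OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT)                     \
        | (ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT)                      \
        | (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT)      \
        | (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)  \
        | (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT)  \
    )

    #define GFP_ZONE_BAD ( \
        1 << (___GFP_DMA | ___GFP_HIGHMEM)                                   \
        | 1 << (___GFP_DMA | ___GFP_DMA32)                                   \
        | 1 << (___GFP_DMA32 | ___GFP_HIGHMEM)                               \
        | 1 << (___GFP_DMA | ___GFP_DMA32 | ___GFP_HIGHMEM)                  \
        | 1 << (___GFP_MOVABLE | ___GFP_HIGHMEM | ___GFP_DMA)                \
        | 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_DMA)                  \
        | 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_HIGHMEM)              \
        | 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_DMA | ___GFP_HIGHMEM) \
    )

zone fallback policy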
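The direction of zone fallback can be illustrated with a very small user-space model. This is only a sketch under simplifying assumptions: a single node, zones identified by their index (0 is the lowest zone), and a hypothetical helper pick_zone() standing in for the zonelist walk. It starts at the highest zone the request may use and only ever moves to lower zones.

```c
#include <assert.h>

/* Toy fallback walk on one node.
 * free_pages[i] is the number of free pages in zone i (0 = lowest zone).
 * high_zoneidx is the highest zone index the request is allowed to use.
 * Returns the index of the first zone, searching downward, that still has
 * at least 'need' free pages, or -1 if no allowed zone can serve it. */
int pick_zone(int high_zoneidx, const long free_pages[], long need)
{
    for (int idx = high_zoneidx; idx >= 0; idx--) {
        if (free_pages[idx] >= need)
            return idx;     /* never a zone above high_zoneidx */
    }
    return -1;
}
```

With free_pages = {100, 50, 0}, a request limited to zone 2 for 10 pages falls back to zone 1, while a request limited to zone 0 can never be served from zones 1 or 2, however much memory they hold.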
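With the candidate policies above, the node and zone for the allocation have been chosen and the allocation starts. If it fails, reclaim is not started right away; instead the fallback mechanism first tries to borrow memory from other, lower zones.

Fallback borrowing only goes from a higher zone to a lower zone, never from a lower zone to a higher one. For example, if an allocation that wanted Normal zone memory fails, it may try the DMA32 zone, because a user that can work with addresses in the Normal zone range can certainly also work with addresses in the DMA32 range. The reverse does not hold: if a user needs memory in the DMA32 range and is handed Normal zone memory, the address is above 4G and may be beyond what the DMA device can address.

The system also defines the __GFP_THISNODE flag, which restricts fallback to suitable lower zones on the local node. Without it, suitable lower zones are searched on all nodes.

The algorithm is implemented as follows:

1. Every node defines the candidate zone lists used for fallback:

    pgdat->node_zonelists[ZONELIST_FALLBACK]     cross-node fallback is enabled; links all zones of all nodes
    pgdat->node_zonelists[ZONELIST_NOFALLBACK]   used when gfp_mask has __GFP_THISNODE set; cross-node fallback is disabled; links all zones of the local node

These lists are initialized at boot:

start_kernel() -> build_all_zonelists() -> __build_all_zonelists() -> build_zonelists() -> build_zonelists_in_node_order() / build_thisnode_zonelists() -> build_zonerefs_node():

2. At allocation time, the fallback list to use is determined:

alloc_pages() -> alloc_pages_current() -> __alloc_pages_nodemask() -> prepare_alloc_pages() -> node_zonelist():

    static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
    {
        /* (1) Pick the candidate zone list depending on whether
               cross-node fallback is enabled */
        return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
    }

    static inline int gfp_zonelist(gfp_t flags)
    {
    #ifdef CONFIG_NUMA
        /* (1.1) If gfp_mask has __GFP_THISNODE, cross-node fallback is disabled */
        if (unlikely(flags & __GFP_THISNODE))
            return ZONELIST_NOFALLBACK;
    #endif
        /* (1.2) Otherwise cross-node fallback is enabled */
        return ZONELIST_FALLBACK;
    }

alloc_pages() -> alloc_pages_current() -> __alloc_pages_nodemask() -> finalise_ac():

    static inline void finalise_ac(gfp_t gfp_mask,
            unsigned int order, struct alloc_context *ac)
    {
        /* Dirty zone balancing only done in the fast path */
        ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE);

        /* (2) Pick the best candidate zone from the fallback list, i.e. the
               highest zone on this node that satisfies the zone type */
        ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                        ac->high_zoneidx, ac->nodemask);
    }

3. When allocation from the original zone fails, try the fallback zones:

alloc_pages() -> alloc_pages_current() -> __alloc_pages_nodemask() -> get_page_from_freelist():

    static struct page *
    get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
                            const struct alloc_context *ac)
    {
        struct zoneref *z = ac->preferred_zoneref;
        struct zone *zone;

        /* (1) If allocation fails, walk the zones in the fallback list
               and try them one by one */
        for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
                                    ac->nodemask) {
            /* loop body omitted here */
        }
    }

lowmem_reserve mechanism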
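The rule behind lowmem_reserve comes down to one comparison. The sketch below is a simplified user-space version of the first test in __zone_watermark_ok(), shown later in this article: it ignores ALLOC_HIGH/ALLOC_HARDER, the high-atomic reserve and CMA pages, and the function name reserve_allows() is hypothetical. The reserve passed in is the entry of the zone's protection array that corresponds to the zone the request originally targeted.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if a zone with 'free_pages' free pages may serve a request
 * of 2^order pages, given the watermark 'mark' in force and the reserve
 * the zone keeps against requests that originally targeted a higher zone.
 * Simplified: no ALLOC_HIGH/ALLOC_HARDER, highatomic or CMA handling. */
bool reserve_allows(long free_pages, unsigned int order,
                    long mark, long lowmem_reserve)
{
    free_pages -= (1L << order) - 1;
    return free_pages > mark + lowmem_reserve;
}
```

Taking the DMA zone figures from the /proc/zoneinfo output quoted in this section (free 3968, low 83, protection 2934 against DMA32 and 3859 against Normal): an order-0 request that fell back from Normal still passes, because 3968 > 83 + 3859, but an order-5 request from Normal is refused, because 3968 - 31 = 3937 is not above 3942.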
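Following on from the fallback mechanism above, a higher zone can borrow memory from a lower zone. In principle, though, lower-zone memory is more precious because there is less of it. If higher zones use it all up, users that really need lower-zone memory will fail to allocate.

To solve this, every zone has a reserve against borrowing by higher zones: memory may be borrowed, but the free memory left in the zone must not drop below this value.

The lowmem_reserve setting of each zone can be inspected with the following command. The protection field states how much memory the zone must keep when it lends to other zones:

    pwl@ubuntu:~$ cat /proc/zoneinfo
    Node 0, zone      DMA
      pages free     3968
            min      67
            low      83
            high     99
            spanned  4095
            present  3997
            managed  3976
            # This zone is DMA.
            # Lending to the DMA zone: must keep 0 pages
            # Lending to the DMA32 zone: must keep 2934 pages
            # Lending to the Normal/Movable/Device zones: must keep 3859 pages
            protection: (0, 2934, 3859, 3859, 3859)
    Node 0, zone    DMA32
      pages free     418978
            min      12793
            low      15991
            high     19189
            spanned  1044480
            present  782288
            managed  759701
            # This zone is DMA32.
            # Lending to the DMA/DMA32 zones: must keep 0 pages
            # Lending to the Normal/Movable/Device zones: must keep 925 pages
            protection: (0, 0, 925, 925, 925)
          nr_free_pages 418978
    Node 0, zone   Normal
      pages free     4999
            min      4034
            low      5042
            high     6050
            spanned  262144
            present  262144
            managed  236890
            # This zone is Normal.
            # The Movable/Device zones have size 0, so it must keep
            # 0 pages when lending to any zone
            protection: (0, 0, 0, 0, 0)
    Node 0, zone  Movable
      pages free     0
            min      0
            low      0
            high     0
            spanned  0
            present  0
            managed  0
            protection: (0, 0, 0, 0, 0)
    Node 0, zone   Device
      pages free     0
            min      0
            low      0
            high     0
            spanned  0
            present  0
            managed  0
            protection: (0, 0, 0, 0, 0)

The size of the reserve can be tuned through lowmem_reserve_ratio:

    pwl@ubuntu:~$ cat /proc/sys/vm/lowmem_reserve_ratio
    256 256 32 0 0

order fallback policy

The buddy system further divides every zone into free_areas of several orders:

    #ifndef CONFIG_FORCE_MAX_ZONEORDER
    #define MAX_ORDER 11
    #else
    #define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
    #endif

If no free memory is found in the free_area of the requested order, the free_areas of higher orders are searched one by one, up to the maximum order.

A free block taken from a higher order is split into several lower-order free blocks.

migratetype candidate policy

The buddy system further divides every order's free_area in every zone into several migratetypes:

    enum migratetype {
        MIGRATE_UNMOVABLE,
        MIGRATE_MOVABLE,
        MIGRATE_RECLAIMABLE,
        MIGRATE_PCPTYPES,    /* the number of types on the pcp lists */
        MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
        MIGRATE_CMA,
        MIGRATE_ISOLATE,     /* can't allocate from here */
        MIGRATE_TYPES
    };

gfp_mask also defines a set of flags that select the migratetype:

    #define __GFP_MOVABLE      ((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE allowed */
    #define __GFP_RECLAIMABLE  ((__force gfp_t)___GFP_RECLAIMABLE)
    #define GFP_MOVABLE_MASK   (__GFP_RECLAIMABLE|__GFP_MOVABLE)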
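Because the two mobility bits sit next to each other in gfp_mask, they turn into a migratetype with a mask and a shift. The following user-space sketch makes that concrete. The numeric bit values (0x08 for the movable bit, 0x10 for the reclaimable bit, shift 3) are assumptions taken from the 4.15 headers and may differ in other kernel versions; the names with a TOY_ prefix and flags_to_migratetype() are hypothetical.

```c
#include <assert.h>

/* Assumed bit layout (4.15): movable = 0x08, reclaimable = 0x10. */
#define TOY_GFP_HIGHMEM       0x02u
#define TOY_GFP_MOVABLE       0x08u
#define TOY_GFP_RECLAIMABLE   0x10u
#define TOY_GFP_MOVABLE_MASK  (TOY_GFP_RECLAIMABLE | TOY_GFP_MOVABLE)
#define TOY_GFP_MOVABLE_SHIFT 3

/* Same numbering as the first three entries of enum migratetype. */
enum toy_migratetype {
    TOY_MIGRATE_UNMOVABLE,    /* 0: neither mobility bit set */
    TOY_MIGRATE_MOVABLE,      /* 1: movable bit set */
    TOY_MIGRATE_RECLAIMABLE   /* 2: reclaimable bit set */
};

/* Keep only the two mobility bits and shift them down to 0, 1 or 2. */
int flags_to_migratetype(unsigned int gfp_flags)
{
    return (int)((gfp_flags & TOY_GFP_MOVABLE_MASK) >> TOY_GFP_MOVABLE_SHIFT);
}
```

Zone modifier bits such as the highmem bit do not influence the result, since they are masked out before the shift. Setting both mobility bits at once is not a valid request; the kernel function warns about it.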
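The code that converts gfp_mask into a migratetype is as follows:

alloc_pages() -> alloc_pages_current() -> __alloc_pages_nodemask() -> prepare_alloc_pages() -> gfpflags_to_migratetype():

    static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
    {
        VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
        BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
        BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);

        if (unlikely(page_group_by_mobility_disabled))
            return MIGRATE_UNMOVABLE;

        /* Group based on mobility */
        /* (1) The result is one of only three types:
               MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE */
        return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
    }

migrate fallback policy

When allocation from the free lists of the requested migratetype fails at the requested order and at every higher order, memory can be stolen from the free lists of other migratetypes in the same zone.

    static int fallbacks[MIGRATE_TYPES][4] = {
        [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
        [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
        [MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
    #ifdef CONFIG_CMA
        [MIGRATE_CMA]         = { MIGRATE_TYPES }, /* Never used */
    #endif
    #ifdef CONFIG_MEMORY_ISOLATION
        [MIGRATE_ISOLATE]     = { MIGRATE_TYPES }, /* Never used */
    #endif
    };

The fallbacks[] array defines which other migratetypes a given migratetype may steal free memory from. In essence, MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE and MIGRATE_MOVABLE can steal from one another.

The code lives in:

alloc_pages() -> alloc_pages_current() -> __alloc_pages_nodemask() -> get_page_from_freelist() -> rmqueue() -> __rmqueue() -> __rmqueue_fallback():

reclaim watermark

If the memory currently on the free lists cannot satisfy a request, memory reclaim is started. The system defines three watermarks for every zone, high, low and min, and applies a different reclaim strategy at each of them:

    pwl@ubuntu:~$ cat /proc/zoneinfo
    Node 0, zone      DMA
      pages free     3968
            min      67
            low      83
            high     99

In short: the fast path allocates against the low watermark; when that fails, kswapd is woken to reclaim asynchronously and the allocation is retried against the min watermark; when that also fails, the allocating task reclaims directly. kswapd keeps reclaiming until free memory is back above the high watermark.

reclaim methods

The system has several means of reclaiming memory. The ones that appear in the slow path analyzed below are asynchronous reclaim by kswapd, direct reclaim, memory compaction, and the OOM killer.

alloc_pages()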
This is the core code of buddy allocation.

alloc_pages() -> alloc_pages_current() -> __alloc_pages_nodemask():

    struct page *
    __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
                                nodemask_t *nodemask)
    {
        struct page *page;
        /* (1.1) The default watermark allowed is low */
        unsigned int alloc_flags = ALLOC_WMARK_LOW;
        gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
        struct alloc_context ac = { };

        /*
         * There are several places where we assume that the order value is sane
         * so bail out early if the request is out of bound.
         */
        /* (1.2) Validate the order */
        if (unlikely(order >= MAX_ORDER)) {
            WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
            return NULL;
        }

        /* (1.3) Filter gfp_mask */
        gfp_mask &= gfp_allowed_mask;
        alloc_mask = gfp_mask;
        /* (1.4) From gfp_mask, determine high_zoneidx, the candidate
                 zonelist and the migratetype */
        if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask,
                                 &ac, &alloc_mask, &alloc_flags))
            return NULL;

        /* (1.5) Pick the first suitable zone */
        finalise_ac(gfp_mask, order, &ac);

        /* First allocation attempt */
        /* (2) Attempt 1: allocate directly from the free lists
               using the low watermark */
        page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
        if (likely(page))
            goto out;

        /*
         * Apply scoped allocation constraints. This is mainly about GFP_NOFS
         * resp. GFP_NOIO which has to be inherited for all allocation requests
         * from a particular context which has been marked by
         * memalloc_no{fs,io}_{save,restore}.
         */
        /* (3.1) If NOFS/NOIO was set with memalloc_no{fs,io}_{save,restore},
                 derive it from current->flags and clear the corresponding
                 __GFP_FS and __GFP_IO bits in gfp_mask */
        alloc_mask = current_gfp_context(gfp_mask);
        ac.spread_dirty_pages = false;

        /*
         * Restore the original nodemask if it was potentially replaced with
         * &cpuset_current_mems_allowed to optimize the fast-path attempt.
         */
        /* (3.2) Restore the original nodemask */
        if (unlikely(ac.nodemask != nodemask))
            ac.nodemask = nodemask;

        /* (4) Slow path: use the min watermark and the various reclaim
               methods, then try to allocate again */
        page = __alloc_pages_slowpath(alloc_mask, order, &ac);

    out:
        if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
            unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
            __free_pages(page, order);
            page = NULL;
        }

        trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);

        return page;
    }

    static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
            int preferred_nid, nodemask_t *nodemask,
            struct alloc_context *ac, gfp_t *alloc_mask,
            unsigned int *alloc_flags)
    {
        /* (1.4.1) From gfp_mask, get the highest zone that may be used */
        ac->high_zoneidx = gfp_zone(gfp_mask);
        /* (1.4.2) From gfp_mask, get the list of all zones of the candidate nodes */
        ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
        ac->nodemask = nodemask;
        /* (1.4.3) From gfp_mask, get the migratetype:
                   MIGRATE_UNMOVABLE, MIGRATE_MOVABLE or MIGRATE_RECLAIMABLE */
        ac->migratetype = gfpflags_to_migratetype(gfp_mask);

        /* (1.4.4) If the cpuset cgroup is enabled, set the matching flags */
        if (cpusets_enabled()) {
            *alloc_mask |= __GFP_HARDWALL;
            if (!ac->nodemask)
                ac->nodemask = &cpuset_current_mems_allowed;
            else
                *alloc_flags |= ALLOC_CPUSET;
        }

        /* (1.4.5) If __GFP_FS is set, try to take the fs reclaim lock
                   (a lockdep annotation, acquired and released at once) */
        fs_reclaim_acquire(gfp_mask);
        fs_reclaim_release(gfp_mask);

        /* (1.4.6) If __GFP_DIRECT_RECLAIM is set, check that we are in a
                   non-atomic context that is allowed to sleep */
        might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);

        if (should_fail_alloc_page(gfp_mask, order))
            return false;

        /* (1.4.7) Let MIGRATE_MOVABLE allocations use MIGRATE_CMA areas */
        if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
            *alloc_flags |= ALLOC_CMA;

        return true;
    }

get_page_from_freelist()

Both the first, fast allocation attempt and the later slow-path attempts end up calling get_page_from_freelist() to take memory from the free lists.

    static struct page *
    get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
                            const struct alloc_context *ac)
    {
        struct zoneref *z = ac->preferred_zoneref;
        struct zone *zone;
        struct pglist_data *last_pgdat_dirty_limit = NULL;

        /* (2.5.1) Walk the fallback zonelist and try to allocate from every
                   zone that qualifies (idx <= high_zoneidx) */
        for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
                                    ac->nodemask) {
            struct page *page;
            unsigned long mark;

            if (cpusets_enabled() &&
                (alloc_flags & ALLOC_CPUSET) &&
                !__cpuset_zone_allowed(zone, gfp_mask))
                    continue;

            /* (2.5.2) If __GFP_WRITE says the page will be dirtied, spread
                       dirty pages evenly: check whether the node has
                       exceeded its dirty limit and move on to another
                       node if it has */
            if (ac->spread_dirty_pages) {
                if (last_pgdat_dirty_limit == zone->zone_pgdat)
                    continue;

                if (!node_dirty_ok(zone->zone_pgdat)) {
                    last_pgdat_dirty_limit = zone->zone_pgdat;
                    continue;
                }
            }

            /* (2.5.3) Get the watermark this allocation must stay above */
            mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
            /* (2.5.4) Check whether the free pages in this zone qualify:
                       1. total free pages - 2^order > watermark + lowmem_reserve
                       2. a contiguous block of 2^order pages is available */
            if (!zone_watermark_fast(zone, order, mark,
                           ac_classzone_idx(ac), alloc_flags)) {
                int ret;

                /* (2.5.5) Not enough free memory, so check the following */
                /* Checked here to keep the fast path fast */
                BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
                /* (2.5.6) If watermarks may be ignored, try to allocate directly */
                if (alloc_flags & ALLOC_NO_WATERMARKS)
                    goto try_this_zone;

                if (node_reclaim_mode == 0 ||
                    !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
                    continue;

                /* (2.5.7) Quick reclaim: try to reclaim 2^order pages.
                           Quick reclaim cannot unmap or write back. The
                           reclaim priority is 4, i.e. shrink_node() is
                           called at most 'priority' times.
                           node_reclaim() calls shrink_node() with this
                           scan_control:
                           struct scan_control sc = {
                               .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
                               .gfp_mask = current_gfp_context(gfp_mask),
                               .order = order,
                               .priority = NODE_RECLAIM_PRIORITY,
                               .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE), // default 0
                               .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),     // default 0
                               .may_swap = 1,
                               .reclaim_idx = gfp_zone(gfp_mask),
                           };
                 */
                ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
                switch (ret) {
                case NODE_RECLAIM_NOSCAN:
                    /* did not scan */
                    continue;
                case NODE_RECLAIM_FULL:
                    /* scanned but unreclaimable */
                    continue;
                default:
                    /* did we reclaim enough */
                    /* (2.5.8) Reclaim succeeded; check again whether
                               there is now enough free memory */
                    if (zone_watermark_ok(zone, order, mark,
                            ac_classzone_idx(ac), alloc_flags))
                        goto try_this_zone;

                    continue;
                }
            }

    try_this_zone:
            /* (2.5.9) Conditions met: actually take 2^order pages
                       off the free list */
            page = rmqueue(ac->preferred_zoneref->zone, zone, order,
                    gfp_mask, alloc_flags, ac->migratetype);
            if (page) {
                /* (2.5.10) Memory obtained; set up struct page */
                prep_new_page(page, order, gfp_mask, alloc_flags);

                if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
                    reserve_highatomic_pageblock(page, zone, order);

                return page;
            }
        }

        return NULL;
    }

    static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
            unsigned long mark, int classzone_idx, unsigned int alloc_flags)
    {
        /* (2.5.4.1) Number of free pages in this zone */
        long free_pages = zone_page_state(z, NR_FREE_PAGES);
        long cma_pages = 0;

    #ifdef CONFIG_CMA
        /* If allocation can't use CMA areas don't use free CMA pages */
        if (!(alloc_flags & ALLOC_CMA))
            cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
    #endif

        /* (2.5.4.2) For order 0, quickly check whether free memory suffices */
        if (!order && (free_pages - cma_pages) > mark + z->lowmem_reserve[classzone_idx])
            return true;

        /* (2.5.4.3) Slow check of whether free memory suffices */
        return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
                        free_pages);
    }

    bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
                 int classzone_idx, unsigned int alloc_flags,
                 long free_pages)
    {
        long min = mark;
        int o;
        const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));

        /* free_pages may go negative - that's OK */
        /* (2.5.4.3.1) Subtract the requested size from the total of free
                       pages, then check whether the rest is still above
                       the watermark */
        free_pages -= (1 << order) - 1;

        /* (2.5.4.3.2) For high-priority requests the watermark is halved */
        if (alloc_flags & ALLOC_HIGH)
            min -= min / 2;

        /*
         * If the caller does not have rights to ALLOC_HARDER then subtract
         * the high-atomic reserves. This will over-estimate the size of the
         * atomic reserve but it avoids a search.
         */
        /* (2.5.4.3.3) For non-harder allocations, nr_reserved_highatomic
                       pages must also be kept in reserve */
        if (likely(!alloc_harder)) {
            free_pages -= z->nr_reserved_highatomic;
        /* (2.5.4.3.4) Harder allocations are urgent, so the watermark
                       is lowered further */
        } else {
            /*
             * OOM victims can try even harder than normal ALLOC_HARDER
             * users on the grounds that it's definitely going to be in
             * the exit path shortly and free memory. Any allocation it
             * makes during the free path will be small and short-lived.
             */
            if (alloc_flags & ALLOC_OOM)
                min -= min / 2;
            else
                min -= min / 4;
        }

    #ifdef CONFIG_CMA
        /* If allocation can't use CMA areas don't use free CMA pages */
        /* (2.5.4.3.5) Allocations that cannot use CMA must also leave
                       the free CMA pages out */
        if (!(alloc_flags & ALLOC_CMA))
            free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
    #endif

        /*
         * Check watermarks for an order-0 allocation request. If these
         * are not met, then a high-order request also cannot go ahead
         * even if a suitable page happened to be free.
         */
        /* (2.5.4.3.6) Free memory must also stay above
                       (watermark + lowmem_reserve[classzone_idx]).
                       If, after subtracting all the reserves above, the
                       rest is still larger, the total free memory of the
                       zone is enough for the request. Whether a contiguous
                       block of 2^order pages exists still has to be
                       checked below */
        if (free_pages <= min + z->lowmem_reserve[classzone_idx])
            return false;

        /* If this is an order-0 request then the watermark is fine */
        /* (2.5.4.3.7) For order 0 no further check is needed: if the total
                       is enough, a suitable page can certainly be found */
        if (!order)
            return true;

        /* For a high-order request, check at least one suitable page is free */
        /* (2.5.4.3.8) Check the lists of every order from the requested
                       order upward in this zone */
        for (o = order; o < MAX_ORDER; o++) {
            struct free_area *area = &z->free_area[o];
            int mt;

            if (!area->nr_free)
                continue;

            /* (2.5.4.3.9) Check every migratetype list of this order;
                           return success if one is not empty */
            for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
                if (!list_empty(&area->free_list[mt]))
                    return true;
            }

    #ifdef CONFIG_CMA
            if ((alloc_flags & ALLOC_CMA) &&
                !list_empty(&area->free_list[MIGRATE_CMA])) {
                return true;
            }
    #endif
            if (alloc_harder &&
                !list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
                return true;
        }
        return false;
    }

rmqueue()
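Taking a block off the free lists involves a splitting step whenever the block found is larger than the request. The kernel does this in expand(), whose body is not quoted in this article. The following user-space sketch shows only the idea: the lower half is kept for the caller and the upper half of each split goes back to the free list one order lower. toy_expand() and its output arrays are hypothetical names, not kernel interfaces.

```c
#include <assert.h>

/* Split a free block of order 'high' starting at 'pfn' down to order 'low'.
 * At each step the block is halved: the lower half is kept for further
 * splitting (and finally handed to the caller), the upper half is recorded
 * as a block returned to the free list of the next lower order.
 * out_pfn[i] / out_order[i] describe the i-th returned block; the arrays
 * must have room for (high - low) entries.
 * Returns the number of blocks put back. */
int toy_expand(unsigned long pfn, unsigned int low, unsigned int high,
               unsigned long out_pfn[], unsigned int out_order[])
{
    unsigned long size = 1UL << high;
    int n = 0;

    while (high > low) {
        high--;
        size >>= 1;
        out_pfn[n] = pfn + size;   /* upper half goes back to the lists */
        out_order[n] = high;
        n++;
    }
    return n;
}
```

For example, serving an order-0 request from an order-3 block at pfn 0 puts back three blocks: pfn 4 at order 2, pfn 2 at order 1 and pfn 1 at order 0, and the caller keeps page 0.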
Once a suitable zone with enough free memory has been found, rmqueue() takes the pages off the free lists.

rmqueue() -> __rmqueue():

    static __always_inline struct page *
    __rmqueue(struct zone *zone, unsigned int order, int migratetype)
    {
        struct page *page;

    retry:
        /* (1) Allocate from the free list of the requested migratetype */
        page = __rmqueue_smallest(zone, order, migratetype);
        if (unlikely(!page)) {
            if (migratetype == MIGRATE_MOVABLE)
                page = __rmqueue_cma_fallback(zone, order);

            /* (2) If that failed, try to steal memory from the lists
                   of other migratetypes */
            if (!page && __rmqueue_fallback(zone, order, migratetype))
                goto retry;
        }

        trace_mm_page_alloc_zone_locked(page, order, migratetype);
        return page;
    }

    static __always_inline
    struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
                            int migratetype)
    {
        unsigned int current_order;
        struct free_area *area;
        struct page *page;

        /* Find a page of the appropriate size in the preferred list */
        /* (1.1) Starting at the requested order, check the free list of
                 this migratetype in each free_area for free memory */
        for (current_order = order; current_order < MAX_ORDER; ++current_order) {
            area = &(zone->free_area[current_order]);
            page = list_first_entry_or_null(&area->free_list[migratetype],
                                struct page, lru);
            if (!page)
                continue;
            /* (1.1.1) Take the block off the free list */
            list_del(&page->lru);
            /* Clear the order stored in the page:
               page->_mapcount = -1, page->private = 0 */
            rmv_page_order(page);
            area->nr_free--;
            /* (1.1.2) Put the remainder back on the free lists
                       of the lower orders */
            expand(zone, page, order, current_order, area, migratetype);
            set_pcppage_migratetype(page, migratetype);
            return page;
        }

        return NULL;
    }

__alloc_pages_slowpath()

    static inline struct page *
    __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                            struct alloc_context *ac)
    {
        bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
        const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
        struct page *page = NULL;
        unsigned int alloc_flags;
        unsigned long did_some_progress;
        enum compact_priority compact_priority;
        enum compact_result compact_result;
        int compaction_retries;
        int no_progress_loops;
        unsigned int cpuset_mems_cookie;
        int reserve_flags;

        /*
         * We also sanity check to catch abuse of atomic reserves being used by
         * callers that are not in atomic context.
         */
        if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
                    (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
            gfp_mask &= ~__GFP_ATOMIC;

    retry_cpuset:
        compaction_retries = 0;
        no_progress_loops = 0;
        compact_priority = DEF_COMPACT_PRIORITY;
        cpuset_mems_cookie = read_mems_allowed_begin();

        /*
         * The fast path uses conservative alloc_flags to succeed only until
         * kswapd needs to be woken up, and to avoid the cost of setting up
         * alloc_flags precisely. So we do that now.
         */
        /* (1) Set the flags:
               ALLOC_WMARK_MIN: lower the watermark to min
               ALLOC_HARDER: for atomic requests or rt tasks, lower it further */
        alloc_flags = gfp_to_alloc_flags(gfp_mask);

        /*
         * We need to recalculate the starting point for the zonelist iterator
         * because we might have used different nodemask in the fast path, or
         * there was a cpuset modification and we are retrying - otherwise we
         * could end up iterating over non-eligible zones endlessly.
         */
        /* (2) Recompute the starting point in the fallback zonelist */
        ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                        ac->high_zoneidx, ac->nodemask);
        if (!ac->preferred_zoneref->zone)
            goto nopage;

        /* (3) Being in the slow path means allocation at the low watermark
               has already failed, so first wake the kswapd threads for
               asynchronous reclaim */
        if (gfp_mask & __GFP_KSWAPD_RECLAIM)
            wake_all_kswapds(order, ac);

        /*
         * The adjusted alloc_flags might result in immediate success, so try
         * that first
         */
        /* (4) Attempt 2: allocate directly from the free lists
               using the min watermark */
        page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
        if (page)
            goto got_pg;

        /*
         * For costly allocations, try direct compaction first, as it's likely
         * that we have enough base pages and don't need to reclaim. For non-
         * movable high-order allocations, do that as well, as compaction will
         * try prevent permanent fragmentation by migrating from blocks of the
         * same migratetype.
         * Don't try this for allocations that are allowed to ignore
         * watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
         */
        if (can_direct_reclaim &&
                (costly_order ||
                   (order > 0 && ac->migratetype != MIGRATE_MOVABLE))
                && !gfp_pfmemalloc_allowed(gfp_mask)) {
            /* (5) Attempt 3: compact memory, then try
                   get_page_from_freelist() */
            page = __alloc_pages_direct_compact(gfp_mask, order,
                            alloc_flags, ac,
                            INIT_COMPACT_PRIORITY,
                            &compact_result);
            if (page)
                goto got_pg;

            /*
             * Checks for costly allocations with __GFP_NORETRY, which
             * includes THP page fault allocations
             */
            if (costly_order && (gfp_mask & __GFP_NORETRY)) {
                /*
                 * If compaction is deferred for high-order allocations,
                 * it is because sync compaction recently failed. If
                 * this is the case and the caller requested a THP
                 * allocation, we do not want to heavily disrupt the
                 * system, so we fail the allocation instead of entering
                 * direct reclaim.
                 */
                if (compact_result == COMPACT_DEFERRED)
                    goto nopage;

                /*
                 * Looks like reclaim/compaction is worth trying, but
                 * sync compaction could be very expensive, so keep
                 * using async compaction.
                 */
                compact_priority = INIT_COMPACT_PRIORITY;
            }
        }

    retry:
        /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
        /* (6) Wake the kswapd threads again; the ac parameters may have
               become stricter */
        if (gfp_mask & __GFP_KSWAPD_RECLAIM)
            wake_all_kswapds(order, ac);

        /* (7) Set the flags: ALLOC_NO_WATERMARKS lowers the bar further
               and ignores the watermarks altogether */
        reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
        if (reserve_flags)
            alloc_flags = reserve_flags;

        /*
         * Reset the zonelist iterators if memory policies can be ignored.
         * These allocations are high priority and system rather than user
         * orientated.
         */
        if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
            ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                        ac->high_zoneidx, ac->nodemask);
        }

        /* Attempt with potentially adjusted zonelist and alloc_flags */
        /* (8) Attempt 4: allocate directly from the free lists
               with no watermark */
        page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
        if (page)
            goto got_pg;

        /* Caller is not willing to reclaim, we can't balance anything */
        /* (9) If direct reclaim is not allowed, give up and leave the
               reclaim to the asynchronous kswapd threads */
        if (!can_direct_reclaim)
            goto nopage;

        /* Avoid recursion of direct reclaim */
        /* (10) Avoid recursive reclaim */
        if (current->flags & PF_MEMALLOC)
            goto nopage;

        /* Try direct reclaim and then allocating */
        /* (11) Attempt 5: run direct reclaim, then try
                get_page_from_freelist() */
        page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
                                &did_some_progress);
        if (page)
            goto got_pg;

        /* Try direct compaction and then allocating */
        /* (12) Attempt 6: run direct compaction, then try
                get_page_from_freelist() */
        page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
                        compact_priority, &compact_result);
        if (page)
            goto got_pg;

        /* Do not loop if specifically requested */
        /* (13) Still no memory and retrying is not allowed: fail */
        if (gfp_mask & __GFP_NORETRY)
            goto nopage;

        /*
         * Do not retry costly high order allocations unless they are
         * __GFP_RETRY_MAYFAIL
         */
        if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
            goto nopage;

        /* (14) Check whether retrying reclaim makes sense */
        if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                     did_some_progress > 0, &no_progress_loops))
            goto retry;

        /*
         * It doesn't make any sense to retry for the compaction if the order-0
         * reclaim is not able to make any progress because the current
         * implementation of the compaction depends on the sufficient amount
         * of free memory (see __compaction_suitable)
         */
        /* (15) Check whether retrying compaction makes sense */
        if (did_some_progress > 0 &&
                should_compact_retry(ac, order, alloc_flags,
                    compact_result, &compact_priority,
                    &compaction_retries))
            goto retry;

        /* Deal with possible cpuset update races before we start OOM killing */
        /* (16) Before OOM killing starts, check whether a cpuset update
                makes a retry worthwhile */
        if (check_retry_cpuset(cpuset_mems_cookie, ac))
            goto retry_cpuset;

        /* Reclaim has failed us, start killing things */
        /* (17) Attempt 7: every reclaim attempt has failed, so use the
                last resort and free memory by killing processes */
        page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
        if (page)
            goto got_pg;

        /* Avoid allocations with no watermarks from looping endlessly */
        /* (18) Keep no-watermark allocations from looping forever */
        if (tsk_is_oom_victim(current) &&
            (alloc_flags == ALLOC_OOM ||
             (gfp_mask & __GFP_NOMEMALLOC)))
            goto nopage;

        /* Retry as long as the OOM killer is making progress */
        /* (19) Retry while OOM killing makes progress */
        if (did_some_progress) {
            no_progress_loops = 0;
            goto retry;
        }

    nopage:
        /* Deal with possible cpuset update races before we fail */
        /* (20) Handle a possible cpuset update before failing */
        if (check_retry_cpuset(cpuset_mems_cookie, ac))
            goto retry_cpuset;

        /*
         * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
         * we always retry
         */
        /* (21) With __GFP_NOFAIL there is no choice but to keep retrying */
        if (gfp_mask & __GFP_NOFAIL) {
            /*
             * All existing users of the __GFP_NOFAIL are blockable, so warn
             * of any new users that actually require GFP_NOWAIT
             */
            if (WARN_ON_ONCE(!can_direct_reclaim))
                goto fail;

            WARN_ON_ONCE(current->flags & PF_MEMALLOC);
            WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);

            page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
            if (page)
                goto got_pg;

            cond_resched();
            goto retry;
        }
    fail:
        /* (22) Emit the allocation failure warning */
        warn_alloc(gfp_mask, ac->nodemask,
                "page allocation failure: order:%u", order);
    got_pg:
        return page;
    }

Article link: https://mp.weixin.qq.com/s/8svh5lTQ6f6NuKwzNAt8g

Reposted from: 人人极客社区, author 彭伟林

Article: Buddy内存管理机制(下) (The Buddy Memory Management Mechanism, Part 2)