TVM is an open-source deep learning compiler that targets a wide range of CPUs, GPUs, and other specialized accelerators. Its goal is to let us optimize and run our own models on any hardware. Unlike deep learning frameworks, which focus on developer productivity, TVM focuses on how models perform and how efficiently they run on hardware.

This article only gives a brief introduction to TVM's compilation flow and to auto-tuning your own model. For a deeper dive, see the official TVM resources:

- Documentation: https://tvm.apache.org/docs
- Source code: https://github.com/apache/tvm

## Compilation Flow

The TVM document Design and Architecture [1] describes the example compilation flow, the logical architecture components, and the device/target implementations. At a high level, the compilation flow shown in its diagram consists of the following steps:

- Import: the frontend component ingests a model into an IRModule, which is a collection of functions in the model's intermediate representation (IR).
- Transformation: the compiler transforms an IRModule into another IRModule that is functionally equivalent or approximately equivalent (e.g., in the case of quantization). Most transformations are target (backend) independent, though TVM also allows the target to affect the configuration of the transformation passes.
- Target Translation: the compiler translates (code generates) the IRModule into an executable format for the target. The result is encapsulated as a runtime.Module, which can be exported, loaded, and executed in the target runtime environment.
- Runtime Execution: the user loads a runtime.Module and runs the compiled functions in the supported runtime environment.

## Tuning the Model

The TVM User Tutorial [2] starts from compiling and optimizing a model and gradually works down to lower-level components such as TE, TensorIR, and Relay.

Here we only cover how to auto-tune a model with AutoTVM, to get a hands-on feel for how TVM compiles, tunes, and runs a model. The original tutorial is Compiling and Optimizing a Model with the Python Interface (AutoTVM) [3].

### Prepare TVM

First, install TVM. See the Installing TVM documentation [4], or the note TVM 安装 [5].

After that, the model can be tuned through the TVM Python API. Start by importing the dependencies:

```python
import onnx
from tvm.contrib.download import download_testdata
from PIL import Image
import numpy as np
import tvm.relay as relay
import tvm
from tvm.contrib import graph_executor
```

### Prepare and Load the Model

Download the pre-trained ResNet50-v2 ONNX model and load it:

```python
model_url = "".join(
    [
        "https://github.com/onnx/models/raw/",
        "main/vision/classification/resnet/model/",
        "resnet50-v2-7.onnx",
    ]
)

model_path = download_testdata(model_url, "resnet50-v2-7.onnx", module="onnx")
onnx_model = onnx.load(model_path)
```

### Prepare and Preprocess the Image

Download a test image and preprocess it into 224x224 NCHW format:

```python
img_url = "https://s3.amazonaws.com/model-server/inputs/kitten.jpg"
img_path = download_testdata(img_url, "imagenet_cat.png", module="data")

# Resize it to 224x224
resized_image = Image.open(img_path).resize((224, 224))
img_data = np.asarray(resized_image).astype("float32")

# Our input image is in HWC layout while ONNX expects CHW input, so convert the array
img_data = np.transpose(img_data, (2, 0, 1))

# Normalize according to the ImageNet input specification
imagenet_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
imagenet_stddev = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
norm_img_data = (img_data / 255 - imagenet_mean) / imagenet_stddev

# Add the batch dimension, as we are expecting 4-dimensional input: NCHW.
img_data = np.expand_dims(norm_img_data, axis=0)
```

### Compile the Model with TVM Relay

Import the ONNX model into Relay and build a TVM graph module:

```python
target = input("target [llvm]: ")
if not target:
    target = "llvm"
    # target = "llvm -mcpu=core-avx2"
    # target = "llvm -mcpu=skylake-avx512"

# The input name may vary across model types. You can use a tool
# like Netron to check input names
input_name = "data"
shape_dict = {input_name: img_data.shape}

mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))
```

Here, target is the target hardware platform. "llvm" means running on the CPU; specifying the architecture and instruction set is recommended, as it allows better optimization. The following commands show what the CPU is:

```bash
$ llc --version | grep CPU
  Host CPU: skylake
$ lscpu
```

Alternatively, check the product specifications on the vendor's website (e.g., Intel Products [6]).

### Run the Model with the TVM Runtime

Run the model with the TVM runtime and make a prediction:

```python
dtype = "float32"
module.set_input(input_name, img_data)
module.run()
output_shape = (1, 1000)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()
```

### Collect Baseline Performance Data

Collect performance data before optimization:

```python
import timeit

timing_number = 10
timing_repeat = 10
unoptimized = (
    np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
    * 1000
    / timing_number
)
unoptimized = {
    "mean": np.mean(unoptimized),
    "median": np.median(unoptimized),
    "std": np.std(unoptimized),
}

print(unoptimized)
```

These numbers will be compared against the optimized results later.

### Post-process the Output

Post-process the model output into human-readable classification results:

```python
from scipy.special import softmax

# Download a list of labels
labels_url = "https://s3.amazonaws.com/onnx-model-zoo/synset.txt"
labels_path = download_testdata(labels_url, "synset.txt", module="data")

with open(labels_path, "r") as f:
    labels = [l.rstrip() for l in f]

# Open the output and read the output tensor
scores = softmax(tvm_output)
scores = np.squeeze(scores)
ranks = np.argsort(scores)[::-1]
for rank in ranks[0:5]:
    print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
```
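Before moving on to tuning, note that the graph module built above exists only in the current process. The runtime.Module produced by target translation can also be saved and reloaded, which is the "export, load, and execute" part of the flow described earlier. The following is a minimal sketch (not part of the original tutorial), assuming the `lib`, `dev`, `input_name`, and `img_data` objects created above and a hypothetical file name `resnet50-v2-llvm.so`:

```python
import tvm
from tvm.contrib import graph_executor

# Export the compiled artifact as a shared library (hypothetical file name).
lib.export_library("resnet50-v2-llvm.so")

# Later, possibly in another process: load it back and run without recompiling.
loaded_lib = tvm.runtime.load_module("resnet50-v2-llvm.so")
loaded_module = graph_executor.GraphModule(loaded_lib["default"](dev))
loaded_module.set_input(input_name, img_data)
loaded_module.run()
loaded_output = loaded_module.get_output(0).numpy()
```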
### Tune the Model and Collect Tuning Data

Auto-tune for the target hardware platform with AutoTVM and collect the tuning data:

```python
import tvm.auto_scheduler as auto_scheduler
from tvm.autotvm.tuner import XGBTuner
from tvm import autotvm

number = 10
repeat = 1
min_repeat_ms = 0  # since we're tuning on a CPU, can be set to 0
timeout = 10  # in seconds

# create a TVM runner
runner = autotvm.LocalRunner(
    number=number,
    repeat=repeat,
    timeout=timeout,
    min_repeat_ms=min_repeat_ms,
    enable_cpu_cache_flush=True,
)

tuning_option = {
    "tuner": "xgb",
    "trials": 10,
    "early_stopping": 100,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func="default"), runner=runner
    ),
    "tuning_records": "resnet-50-v2-autotuning.json",
}

# begin by extracting the tasks from the onnx model
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

# Tune the extracted tasks sequentially.
for i, task in enumerate(tasks):
    prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
    tuner_obj = XGBTuner(task, loss_type="rank")
    tuner_obj.tune(
        n_trial=min(tuning_option["trials"], len(task.config_space)),
        early_stopping=tuning_option["early_stopping"],
        measure_option=tuning_option["measure_option"],
        callbacks=[
            autotvm.callback.progress_bar(tuning_option["trials"], prefix=prefix),
            autotvm.callback.log_to_file(tuning_option["tuning_records"]),
        ],
    )
```

The tuning_option above selects the XGBoost Grid algorithm for the optimization search, and the results are recorded into tuning_records.

### Recompile the Model with the Tuning Data

Recompile an optimized model based on the tuning data:

```python
with autotvm.apply_history_best(tuning_option["tuning_records"]):
    with tvm.transform.PassContext(opt_level=3, config={}):
        lib = relay.build(mod, target=target, params=params)

dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))

# Verify that the optimized model runs and produces the same results
dtype = "float32"
module.set_input(input_name, img_data)
module.run()
output_shape = (1, 1000)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()

scores = softmax(tvm_output)
scores = np.squeeze(scores)
ranks = np.argsort(scores)[::-1]
for rank in ranks[0:5]:
    print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
```

### Compare the Tuned and Untuned Models

Collect the optimized performance data and compare it with the unoptimized data:

```python
import timeit

timing_number = 10
timing_repeat = 10
optimized = (
    np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
    * 1000
    / timing_number
)
optimized = {"mean": np.mean(optimized), "median": np.median(optimized), "std": np.std(optimized)}

print("optimized: %s" % (optimized))
print("unoptimized: %s" % (unoptimized))
```
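To summarize the improvement as a single number, a tiny helper like the one below can be used. This is just a sketch (not part of the original tutorial) that assumes the `optimized` and `unoptimized` dictionaries built above:

```python
def speedup_summary(unoptimized, optimized):
    """Mean speedup of the tuned model and the change in run-to-run jitter (std)."""
    return {
        "mean_speedup": unoptimized["mean"] / optimized["mean"],
        "std_ratio": optimized["std"] / unoptimized["std"],
    }

print(speedup_summary(unoptimized, optimized))
# With the run shown below, this prints roughly {'mean_speedup': 1.29, 'std_ratio': 0.07},
# i.e. about 1.3x faster with far less variance.
```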
The complete run of the tuning process produces output like this:

```
$ time python autotvm_tune.py
TVM 编译运行模型
Downloading and Loading the ONNX Model
Downloading, Preprocessing, and Loading the Test Image
Compile the Model With Relay
target [llvm]: llvm -mcpu=core-avx2
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
Execute on the TVM Runtime
Collect Basic Performance Data
{'mean': 44.97057118016528, 'median': 42.52320024970686, 'std': 6.870915251002107}
Postprocess the output
class='n02123045 tabby, tabby cat' with probability=0.621104
class='n02123159 tiger cat' with probability=0.356378
class='n02124075 Egyptian cat' with probability=0.019712
class='n02129604 tiger, Panthera tigris' with probability=0.001215
class='n04040759 radiator' with probability=0.000262
AutoTVM 调优模型 [Y/n]
Tune the model
[Task  1/25]  Current/Best:  156.96/ 353.76 GFLOPS | Progress: (10/10) | 4.78 s Done.
[Task  2/25]  Current/Best:   54.66/ 241.25 GFLOPS | Progress: (10/10) | 2.88 s Done.
[Task  3/25]  Current/Best:  116.71/ 241.30 GFLOPS | Progress: (10/10) | 3.48 s Done.
[Task  4/25]  Current/Best:  119.92/ 184.18 GFLOPS | Progress: (10/10) | 3.48 s Done.
[Task  5/25]  Current/Best:   48.92/ 158.38 GFLOPS | Progress: (10/10) | 3.13 s Done.
[Task  6/25]  Current/Best:  156.89/ 230.95 GFLOPS | Progress: (10/10) | 2.82 s Done.
[Task  7/25]  Current/Best:   92.33/ 241.99 GFLOPS | Progress: (10/10) | 2.40 s Done.
[Task  8/25]  Current/Best:   50.04/ 331.82 GFLOPS | Progress: (10/10) | 2.64 s Done.
[Task  9/25]  Current/Best:  188.47/ 409.93 GFLOPS | Progress: (10/10) | 4.44 s Done.
[Task 10/25]  Current/Best:   44.81/ 181.67 GFLOPS | Progress: (10/10) | 2.32 s Done.
[Task 11/25]  Current/Best:   83.74/ 312.66 GFLOPS | Progress: (10/10) | 2.74 s Done.
[Task 12/25]  Current/Best:   96.48/ 294.40 GFLOPS | Progress: (10/10) | 2.82 s Done.
[Task 13/25]  Current/Best:  123.74/ 354.34 GFLOPS | Progress: (10/10) | 2.62 s Done.
[Task 14/25]  Current/Best:   23.76/ 178.71 GFLOPS | Progress: (10/10) | 2.90 s Done.
[Task 15/25]  Current/Best:  119.18/ 534.63 GFLOPS | Progress: (10/10) | 2.49 s Done.
[Task 16/25]  Current/Best:  101.24/ 172.92 GFLOPS | Progress: (10/10) | 2.49 s Done.
[Task 17/25]  Current/Best:  309.85/ 309.85 GFLOPS | Progress: (10/10) | 2.69 s Done.
[Task 18/25]  Current/Best:   54.45/ 368.31 GFLOPS | Progress: (10/10) | 2.46 s Done.
[Task 19/25]  Current/Best:   78.69/ 162.43 GFLOPS | Progress: (10/10) | 3.29 s Done.
[Task 20/25]  Current/Best:   40.78/ 317.50 GFLOPS | Progress: (10/10) | 4.52 s Done.
[Task 21/25]  Current/Best:  169.03/ 296.36 GFLOPS | Progress: (10/10) | 3.95 s Done.
[Task 22/25]  Current/Best:   90.96/ 210.43 GFLOPS | Progress: (10/10) | 2.28 s Done.
[Task 23/25]  Current/Best:   48.93/ 217.36 GFLOPS | Progress: (10/10) | 2.87 s Done.
[Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s Done.
[Task 25/25]  Current/Best:   25.50/  33.86 GFLOPS | Progress: (10/10) | 9.28 s Done.
Compiling an Optimized Model with Tuning Data
class='n02123045 tabby, tabby cat' with probability=0.621104
class='n02123159 tiger cat' with probability=0.356378
class='n02124075 Egyptian cat' with probability=0.019712
class='n02129604 tiger, Panthera tigris' with probability=0.001215
class='n04040759 radiator' with probability=0.000262
Comparing the Tuned and Untuned Models
optimized: {'mean': 34.736288779822644, 'median': 34.547542000655085, 'std': 0.5144378649382363}
unoptimized: {'mean': 44.97057118016528, 'median': 42.52320024970686, 'std': 6.870915251002107}

real    3m23.904s
user    5m2.900s
sys     5m37.099s
```

Comparing the performance data, we can see that the tuned model runs faster and with less variance.

## References

Notes:

- start-ai-compiler [7]

Further reading:

- 2020 The Deep Learning Compiler: A Comprehensive Survey [8]
  - Chinese translation: 深度学习编译器综述 [9]
- 2018 TVM: An Automated End-to-End Optimizing Compiler for Deep Learning [10]
  - Chinese translation: TVM: 一个自动的端到端深度学习优化编译器 [11]

## Footnotes

[1] Design and Architecture: https://tvm.apache.org/docs/arch/index.html
[2] User Tutorial: https://tvm.apache.org/docs/tutorial/index.html
[3] Compiling and Optimizing a Model with the Python Interface (AutoTVM): https://tvm.apache.org/docs/tutorial/autotvm_relay_x86.html
[4] Installing TVM: https://tvm.apache.org/docs/tutorial/install.html
[5] TVM 安装: https://github.com/ikuokuo/start-ai-compiler/blob/main/docs/tvm/tvm_install.md
[6] Intel Products: https://www.intel.com/content/www/us/en/products/overview.html
[7] start-ai-compiler: https://github.com/ikuokuo/start-ai-compiler#%E7%AC%94%E8%AE%B0
[8] 2020 The Deep Learning Compiler: A Comprehensive Survey: https://arxiv.org/abs/2002.03794
[9] 深度学习编译器综述 (Chinese translation): https://www.jianshu.com/p/ed372af7ef09
[10] 2018 TVM: An Automated End-to-End Optimizing Compiler for Deep Learning: https://www.usenix.org/conference/osdi18/presentation/chen
[11] TVM: 一个自动的端到端深度学习优化编译器 (Chinese translation): https://zhuanlan.zhihu.com/p/426994569