一、Setup

Tez is an execution engine for Hive that performs better than MR. Tez can merge several dependent jobs into a single job, so the results are written to HDFS only once and there are fewer intermediate stages, which greatly improves job performance.

1、Copy the Tez package to the cluster and extract the tarball

```shell
mkdir /opt/module/tez
tar -zxvf /opt/software/tez-0.10.1-SNAPSHOT-minimal.tar.gz -C /opt/module/tez
```

2、Upload the Tez dependencies to HDFS (start ZooKeeper and HDFS first)

```shell
hadoop fs -mkdir /tez
hadoop fs -put /opt/software/tez-0.10.1-SNAPSHOT.tar.gz /tez
```

3、Create tez-site.xml

```shell
vim $HADOOP_HOME/etc/hadoop/tez-site.xml
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>${fs.defaultFS}/tez/tez-0.10.1-SNAPSHOT.tar.gz</value>
  </property>
  <property>
    <name>tez.use.cluster.hadoop-libs</name>
    <value>true</value>
  </property>
  <property>
    <name>tez.am.resource.memory.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>tez.am.resource.cpu.vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>tez.container.max.java.heap.fraction</name>
    <value>0.4</value>
  </property>
  <property>
    <name>tez.task.resource.memory.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>tez.task.resource.cpu.vcores</name>
    <value>1</value>
  </property>
</configuration>
```

4、Add Tez to the Hadoop environment

```shell
vim $HADOOP_HOME/etc/hadoop/shellprofile.d/tez.sh
```

```shell
hadoop_add_profile tez
function _tez_hadoop_classpath
{
  hadoop_add_classpath "$HADOOP_HOME/etc/hadoop" after
  hadoop_add_classpath "/opt/module/tez/*" after
  hadoop_add_classpath "/opt/module/tez/lib/*" after
}
```

5、Switch Hive's execution engine

Edit $HIVE_HOME/conf/hive-site.xml and add:

```xml
<!-- tez -->
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
<property>
  <name>hive.tez.container.size</name>
  <value>1024</value>
</property>
```

6、Resolve the logging jar conflict

```shell
rm -f /opt/module/tez/lib/slf4j-log4j12-1.7.10.jar
```

二、Usage

1、Requirements

- Top 10 videos by view count
- Top 10 video categories by popularity
- The categories of the 20 most-viewed videos, and how many of those Top-20 videos each category contains
- The category ranking of the videos related to the Top 50 most-viewed videos
- Top 10 hottest videos within a given category, taking Music as the example
- Top 10 most-viewed videos in every category
- Top 10 users by number of uploaded videos, and the 20 most-viewed videos among their uploads

2、Preparation

pom.xml:

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.1.3</version>
  </dependency>
</dependencies>
```

1、ETL

ETL utility class:

```java
package utils;

/**
 * @program: BigData2021
 * @description: ETL utility that cleans the raw video data.
 * Rules:
 *   1. A record must contain at least 9 fields.
 *   2. Remove the spaces inside the category field.
 *   3. Join the related-video IDs with "&".
 * @param line the raw record
 * @return the cleaned record if the data is valid, otherwise null
 * @author: Mr.逗
 * @create: 2022-01-20 15:28
 */
public class ETLUtils {
    public static String etlGulivideoData(String line) {
        StringBuffer sbs = new StringBuffer();
        // 1. split the record
        String[] splits = line.split("\t");
        // 2. rule 1
        if (splits.length < 9) {
            return null;
        }
        // 3. rule 2
        splits[3] = splits[3].replaceAll(" ", "");
        // 4. rule 3
        for (int i = 0; i < splits.length; i++) {
            // fields 0-8 are the fixed columns, everything after them is a related-video id
            if (i <= 8) {
                if (i == splits.length - 1) {
                    sbs.append(splits[i]);
                } else {
                    sbs.append(splits[i]).append("\t");
                }
            } else {
                if (i == splits.length - 1) {
                    sbs.append(splits[i]);
                } else {
                    sbs.append(splits[i]).append("&");
                }
            }
        }
        return sbs.toString();
    }

    public static void main(String[] args) {
        String line = "RX24KLBhwMI\tlemonette\t697\tPeople & Blogs\t512\t24149\t4.22\t315";
        String result = etlGulivideoData(line);
        System.out.println(result);
    }
}
```

ETL Mapper:

```java
package main;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import utils.ETLUtils;

import java.io.IOException;

/**
 * @program: BigData2021
 * @description: Guli video ETL Mapper
 * @author: Mr.逗
 * @create: 2022-01-20 15:30
 */
public class GulivideoETLMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    Text outk = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String result = ETLUtils.etlGulivideoData(line);
        if (result == null) {
            return;
        }
        outk.set(result);
        context.write(outk, NullWritable.get());
    }
}
```
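The sample line in ETLUtils.main() has no related-video IDs, so rule 3 never fires there. As a quick local sanity check of all three rules together, a minimal sketch like the one below can be used; every value in the record (IDs, uploader, counts) is invented purely for illustration and is not taken from the real data set:

```java
// Minimal sketch: feed ETLUtils a made-up record that carries related-video IDs.
public class ETLUtilsDemo {
    public static void main(String[] args) {
        // 9 fixed fields followed by two related-video IDs, tab separated (all values invented).
        String line = "dMH0bHeiRNg\tjudson\t415\tPeople & Blogs\t100\t134\t4.6\t200\t50\tXxx111111A\tYyy222222B";
        // Expected output: the category loses its spaces, the first 9 fields stay
        // tab separated, and the related IDs are joined with "&":
        // dMH0bHeiRNg  judson  415  People&Blogs  100  134  4.6  200  50  Xxx111111A&Yyy222222B
        System.out.println(utils.ETLUtils.etlGulivideoData(line));
    }
}
```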
ETL Driver:

```java
package main;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @program: BigData2021
 * @description: Guli video ETL Driver
 * @author: Mr.逗
 * @create: 2022-01-20 15:31
 */
public class GulivideoETLDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(GulivideoETLDriver.class);
        job.setMapperClass(GulivideoETLMapper.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // map-only job
        job.setNumReduceTasks(0);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
```

Upload the data to HDFS:

```shell
hadoop fs -mkdir -p /gulivideo/video
hadoop fs -mkdir -p /gulivideo/user
hadoop fs -put /home/bigdata/program/hive/program/user/user.txt /gulivideo/user
hadoop fs -put /home/bigdata/program/hive/program/video.txt /gulivideo/video
```

Package the jar and run the ETL job:

```shell
hadoop jar /home/bigdata/program/hive/program/jar/etl.jar main.GulivideoETLDriver /gulivideo/video /gulivideo/video_output
```
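The console output of this job only shows the built-in MapReduce counters, so it is not obvious how many raw records the cleaning rules rejected. As a hedged variant of the Mapper above (not part of the original code; the counter group and names "ETL", "clean" and "dirty" are arbitrary labels chosen here), custom Hadoop counters could track this:

```java
package main;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import utils.ETLUtils;

import java.io.IOException;

// Hypothetical variant of GulivideoETLMapper that also counts kept and dropped records.
public class GulivideoETLCountingMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Text outk = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String result = ETLUtils.etlGulivideoData(value.toString());
        if (result == null) {
            // record rejected by the cleaning rules
            context.getCounter("ETL", "dirty").increment(1);
            return;
        }
        // record kept after cleaning
        context.getCounter("ETL", "clean").increment(1);
        outk.set(result);
        context.write(outk, NullWritable.get());
    }
}
```

If used, it would replace GulivideoETLMapper in job.setMapperClass(), and the "ETL" counter values would then appear in the job's console output next to the built-in counters.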
2、Prepare the tables

Start Hive.

Video table structure:

| Field | Note | Description |
| --- | --- | --- |
| videoId | unique video id (String) | an 11-character string |
| uploader | video uploader (String) | user name of the uploader |
| age | video age (Int) | integer number of days the video has been on the platform |
| category | video categories (Array) | categories assigned when the video was uploaded |
| length | video length (Int) | video length as an integer |
| views | view count (Int) | number of times the video has been watched |
| rate | video rating (Double) | rating out of 5 |
| ratings | traffic (Int) | the video's traffic, an integer |
| comments | comment count (Int) | integer number of comments on the video |
| relatedId | related video ids (Array) | ids of related videos, at most 20 |

User table structure:

| Field | Note | Type |
| --- | --- | --- |
| uploader | uploader user name | string |
| videos | number of uploaded videos | int |
| friends | number of friends | int |

1、Raw tables

gulivideo_ori:

```sql
create database odsvideo;

create table odsvideo.ods_gulivideo_ori(
    videoId string,
    uploader string,
    age int,
    category array<string>,
    length int,
    views int,
    rate float,
    ratings int,
    comments int,
    relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as textfile;
```

gulivideo_user_ori:

```sql
create table odsvideo.ods_gulivideo_user_ori(
    uploader string,
    videos int,
    friends int)
row format delimited fields terminated by "\t"
stored as textfile;
```

2、Final tables

gulivideo_orc:

```sql
create table odsvideo.ods_gulivideo_orc(
    videoId string,
    uploader string,
    age int,
    category array<string>,
    length int,
    views int,
    rate float,
    ratings int,
    comments int,
    relatedId array<string>)
stored as orc
tblproperties("orc.compress"="SNAPPY");
```

gulivideo_user_orc:

```sql
create table odsvideo.ods_gulivideo_user_orc(
    uploader string,
    videos int,
    friends int)
row format delimited fields terminated by "\t"
stored as orc
tblproperties("orc.compress"="SNAPPY");
```

3、Load the data

```sql
load data inpath '/gulivideo/video_output' into table odsvideo.ods_gulivideo_ori;
load data inpath '/gulivideo/user' into table odsvideo.ods_gulivideo_user_ori;

insert into table odsvideo.ods_gulivideo_orc select * from odsvideo.ods_gulivideo_ori;
insert into table odsvideo.ods_gulivideo_user_orc select * from odsvideo.ods_gulivideo_user_ori;
```

4、Verify the data

```sql
select * from odsvideo.ods_gulivideo_user_ori limit 10;
```

3、Analysis

1、Top 10 videos by view count

Approach: a global ORDER BY on the views column, keeping only the first 10 rows.

```sql
SELECT videoId, views
FROM odsvideo.ods_gulivideo_orc
ORDER BY views DESC
LIMIT 10;
```

2、Top 10 video categories by popularity

(1) That is, count how many videos each category contains and show the 10 categories with the most videos.
(2) GROUP BY category and COUNT the videoId values in each group.
(3) Because one video belongs to one or more categories, the category array has to be exploded (columns to rows) with lateral view before it can be grouped and counted.
(4) Finally sort by the count and keep the top 10.

```sql
SELECT t1.category_name,
       COUNT(t1.videoId) hot
FROM (
    SELECT videoId, category_name
    FROM odsvideo.ods_gulivideo_orc
    lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
) t1
GROUP BY t1.category_name
ORDER BY hot DESC
LIMIT 10;
```

3、Categories of the 20 most-viewed videos, and how many of those Top-20 videos each category contains

(1) First take the 20 most-viewed videos with all of their fields, sorted by views in descending order.
(2) Explode the category column of those 20 rows (columns to rows).
(3) Finally, for each category name, count how many of the Top-20 videos it contains.

```sql
SELECT t2.category_name,
       COUNT(t2.videoId) video_sum
FROM (
    SELECT t1.videoId, category_name
    FROM (
        SELECT videoId, views, category
        FROM odsvideo.ods_gulivideo_orc
        ORDER BY views DESC
        LIMIT 20
    ) t1
    lateral VIEW explode(t1.category) t1_tmp AS category_name
) t2
GROUP BY t2.category_name;
```

4、Category ranking of the videos related to the Top 50 most-viewed videos

```sql
SELECT t6.category_name,
       t6.video_sum,
       rank() over(ORDER BY t6.video_sum DESC) rk
FROM (
    SELECT t5.category_name,
           COUNT(t5.relatedid_id) video_sum
    FROM (
        SELECT t4.relatedid_id, category_name
        FROM (
            SELECT t2.relatedid_id, t3.category
            FROM (
                SELECT relatedid_id
                FROM (
                    SELECT videoId, views, relatedId
                    FROM odsvideo.ods_gulivideo_orc
                    ORDER BY views DESC
                    LIMIT 50
                ) t1
                lateral VIEW explode(t1.relatedId) t1_tmp AS relatedid_id
            ) t2
            JOIN odsvideo.ods_gulivideo_orc t3
            ON t2.relatedid_id = t3.videoId
        ) t4
        lateral VIEW explode(t4.category) t4_tmp AS category_name
    ) t5
    GROUP BY t5.category_name
    ORDER BY video_sum DESC
) t6;
```

5、Top 10 hottest videos within each category (Music as the example)

(1) To get the Music Top 10, the Music category has to be located first, which requires exploding the category array; a table holding the exploded category data could be created for this.
(2) Insert the exploded data into that table.
(3) Compute the Top 10 for the chosen category (Music).

```sql
SELECT t1.videoId, t1.views, t1.category_name
FROM (
    SELECT videoId, views, category_name
    FROM odsvideo.ods_gulivideo_orc
    lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
) t1
WHERE t1.category_name = "Music"
ORDER BY t1.views DESC
LIMIT 10;
```

6、Top 10 most-viewed videos in every category

```sql
SELECT t2.videoId, t2.views, t2.category_name, t2.rk
FROM (
    SELECT t1.videoId,
           t1.views,
           t1.category_name,
           rank() over(PARTITION BY t1.category_name ORDER BY t1.views DESC) rk
    FROM (
        SELECT videoId, views, category_name
        FROM odsvideo.ods_gulivideo_orc
        lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
    ) t1
) t2
WHERE t2.rk <= 10;
```

7、Top 10 uploaders by number of uploaded videos, and the 20 most-viewed videos among their uploads

(1) Find the 10 users who have uploaded the most videos.
(2) Join them with gulivideo_orc to get all of the videos they uploaded, and keep the 20 with the most views.

```sql
SELECT t2.videoId, t2.views, t2.uploader
FROM (
    SELECT uploader, videos
    FROM odsvideo.ods_gulivideo_user_orc
    ORDER BY videos DESC
    LIMIT 10
) t1
JOIN odsvideo.ods_gulivideo_orc t2
ON t1.uploader = t2.uploader
ORDER BY t2.views DESC
LIMIT 20;
```

三、Comparing the execution engines

1、MR

```sql
set hive.execution.engine=mr;
SELECT videoId, views FROM odsvideo.ods_gulivideo_orc ORDER BY views DESC LIMIT 10;
```

2、Tez

```sql
set hive.execution.engine=tez;
SELECT videoId, views FROM odsvideo.ods_gulivideo_orc ORDER BY views DESC LIMIT 10;
```
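To run the same comparison from code instead of the Hive CLI, a minimal JDBC sketch along these lines could be used. Everything cluster-specific in it is an assumption: it presumes HiveServer2 is reachable at hadoop102:10000, that the hive-jdbc dependency is on the classpath, and the user name "bigdata" is only a placeholder.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: run the same query under MR and Tez through HiveServer2 and time both runs.
public class EngineComparison {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 address and user -- adjust for the real cluster.
        String url = "jdbc:hive2://hadoop102:10000/odsvideo";
        String query = "SELECT videoId, views FROM ods_gulivideo_orc ORDER BY views DESC LIMIT 10";
        try (Connection conn = DriverManager.getConnection(url, "bigdata", "");
             Statement stmt = conn.createStatement()) {
            for (String engine : new String[]{"mr", "tez"}) {
                // switch the execution engine for this session, then time the query
                stmt.execute("set hive.execution.engine=" + engine);
                long start = System.currentTimeMillis();
                try (ResultSet rs = stmt.executeQuery(query)) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
                    }
                }
                System.out.println(engine + " took " + (System.currentTimeMillis() - start) + " ms");
            }
        }
    }
}
```

The printed timings include client-side overhead, so they are only a rough indication of the difference between the two engines.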