Troubleshooting: Hadoop job submissions occasionally fail
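Users reported that job submissions occasionally failed, typically with a failure notice appearing after 10 to 20 minutes of running. The jobs were Hive SQL statements submitted from system A. The logs of system A and of hiveserver2 showed the same output, both reporting YarnException: Failed to submit application_xxx to YARN : Application application_xxx was killed by user xxx at 10.10.x.x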
The hiveserver2 log:

    2023-01-18T05:14:04,073 ERROR [HiveServer2-Background-Pool: Thread-917734] exec.Task: Job Submission failed with exception "java.io.IOException(org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_xxx to YARN : Application application_xxx was killed by user userxxx at 10.x.x.x)"
    java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_xxx to YARN : Application application_xxx was killed by user userxxx at 10.x.x.x
        at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:345)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:254)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:576)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:571)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:571)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:562)
        at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:411)
        at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:151)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1232)
        at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:255)
        at org.apache.hive.service.cli.operation.SQLOperation.access$800(SQLOperation.java:91)
        at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:348)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
        at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:362)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
    Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_xxx to YARN : Application application_xxx was killed by user userxxx at 10.x.x.x
        at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:304)
        at org.apache.hadoop.mapred.ResourceMgrDelegate.submitApplication(ResourceMgrDelegate.java:299)
        at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:330)
        ... 35 more
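The job was killed about 20 minutes after submission. Did YARN fail? Did the job time out? Did the user kill it deliberately? Hard to say. Since hiveserver2 submits jobs by talking to Hadoop YARN, the first step is to check the YARN log.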
The YARN log:

    23/01/18 05:13:37 INFO resourcemanager.ClientRMService: Allocated new applicationId: 449050
    23/01/18 05:13:38 INFO rmapp.RMAppImpl: Storing application with id application_xxx
    23/01/18 05:13:38 INFO rmapp.RMAppImpl: application_xxx State change from NEW to NEW_SAVING on event = START
    ... ...
    23/01/18 05:13:44 INFO rmapp.RMAppImpl: Updating application application_xxx with final state: KILLED
    23/01/18 05:13:44 INFO rmapp.RMAppImpl: application_xxx State change from NEW_SAVING to FINAL_SAVING on event = KILL
    23/01/18 05:13:44 INFO rmapp.RMAppImpl: application_xxx State change from FINAL_SAVING to FINISHING on event = APP_UPDATE_SAVED
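When the ResourceManager log is large, it can help to pull out only the state transitions for the application in question. The following is a minimal sketch, assuming the log lines follow the "State change from ... to ... on event = ..." shape shown above. The class name StateChangeParser and its Transition type are hypothetical helpers written for this post, not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Hadoop): extracts RMAppImpl state transitions from log lines.
public class StateChangeParser {
    // Assumed line shape, taken from the log excerpt above:
    // "<appId> State change from <OLD> to <NEW> on event = <EVENT>"
    private static final Pattern TRANSITION = Pattern.compile(
        "(application_\\S+) State change from (\\w+) to (\\w+) on event = (\\w+)");

    /** One parsed transition: application id, old state, new state, triggering event. */
    public static final class Transition {
        public final String appId;
        public final String from;
        public final String to;
        public final String event;

        Transition(String appId, String from, String to, String event) {
            this.appId = appId;
            this.from = from;
            this.to = to;
            this.event = event;
        }

        @Override
        public String toString() {
            return appId + ": " + from + " -> " + to + " (" + event + ")";
        }
    }

    /** Returns the transitions found in the given lines, in order; other lines are skipped. */
    public static List<Transition> parse(List<String> lines) {
        List<Transition> result = new ArrayList<>();
        for (String line : lines) {
            Matcher m = TRANSITION.matcher(line);
            if (m.find()) {
                result.add(new Transition(m.group(1), m.group(2), m.group(3), m.group(4)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "23/01/18 05:13:38 INFO rmapp.RMAppImpl: application_xxx State change from NEW to NEW_SAVING on event = START",
            "23/01/18 05:13:44 INFO rmapp.RMAppImpl: Updating application application_xxx with final state: KILLED",
            "23/01/18 05:13:44 INFO rmapp.RMAppImpl: application_xxx State change from NEW_SAVING to FINAL_SAVING on event = KILL");
        for (Transition t : parse(sample)) {
            System.out.println(t);
        }
    }
}
```

Filtering the parsed transitions for event KILL makes it easy to see which state the application was in when the kill arrived; here that state is NEW_SAVING.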
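The log shows no anomaly: the reason it prints is still that a kill event was received, which is consistent with hiveserver2. On asking, I learned that system A has no kill feature of its own, so why does the message say kill? Did the user really kill it manually? I did not want to confront the user without evidence. Could it be a timeout? A timeout would presumably leave a timeout-style log entry. Could some condition have triggered an internal YARN mechanism that then killed the application? Time to read the YARN source and see where a kill event can come from. I searched directly for the keyword "State change from", since state-machine transitions are usually implemented in one place and the search should not return many hits.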
That led to the definition of the constant STATE_CHANGE_MESSAGE; looking up its references led to the place where the log line is printed.
According to the log, a kill event was received, so look at the definition of the kill event:

    public enum RMAppEventType {
      // Source: ClientRMService
      START,
      RECOVER,
      KILL,

      // Source: Scheduler and RMAppManager
      APP_REJECTED,
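Searching for all references to KILL turned up two call sites (possibly not exhaustive, since I did not analyze this completely): one originates from RMAppKillByClientEvent, the other from AbstractYarnScheduler.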
RMAppKillByClientEvent originates from ClientRMService, and the class comment shows that this service is what YARN clients call:

    /**
     * The client interface to the Resource Manager. This module handles all the rpc
     * interfaces to the resource manager from the client.
     */
    public class ClientRMService extends AbstractService implements
The printed log matches the log statement found by the search. So did the user kill it after all?
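The message in the hiveserver2 log already names the user and the host that issued the kill, which can be compared against the host where system A runs. A minimal sketch for pulling those fields out, assuming the message keeps the "was killed by user ... at ..." wording seen in the log above. KillOrigin is a hypothetical helper name, not a Hadoop class.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Hadoop): extracts who killed an application, and from where.
public class KillOrigin {
    // Assumed message shape, copied from the log above:
    // "Application <appId> was killed by user <user> at <host>"
    private static final Pattern KILLED_BY = Pattern.compile(
        "Application (application_\\S+) was killed by user (\\S+) at ([\\w.:-]+)");

    /** Returns {appId, user, host} if the message contains a client-kill notice, else empty. */
    public static Optional<String[]> extract(String message) {
        Matcher m = KILLED_BY.matcher(message);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new String[] {m.group(1), m.group(2), m.group(3)});
    }

    public static void main(String[] args) {
        String msg = "Failed to submit application_xxx to YARN : "
            + "Application application_xxx was killed by user userxxx at 10.x.x.x";
        extract(msg).ifPresent(p ->
            System.out.println("app=" + p[0] + " user=" + p[1] + " host=" + p[2]));
    }
}
```

If the extracted host is not one of the machines system A submits from, that points toward a separate client or script issuing the kill.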
Now the other kill path, AbstractYarnScheduler:
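If the kill had come from AbstractYarnScheduler, the log would contain diagnostic information, and that method kills all applications in an entire queue. Judging from the log, the kill did not originate there; a kill issued by the user or by some program remains the more likely explanation.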
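The next day the user told me that jobs often got stuck in the NEW_SAVING state, so they had written a shell script to kill any job in NEW_SAVING. So that is a thing... rather brute-force. The cause was found.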