排查问题提交hadoop作业偶尔失败
用户反馈偶尔有作业提交失败,一般运行10-20分钟就出现失败提醒,作业是在系统A提交的hive sql语句,系统A的日志和hiveserver2的日志输出是一样的,都提示YarnException: Failed to submit application_xxx to YARN : Application application_xxx was killed by user xxx at 10.10.x.x
hiveserver2日志如下: 2023-01-18T05:14:04,073 ERROR [HiveServer2-Background-Pool: Thread-917734] exec.Task: Job Submission failed with exception "java.io.IOException(org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_xxx to YARN : Application application_xxx was killed by user userxxx at 10.x.x.x)" java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_xxx to YARN : Application application_xxx was killed by user userxxx at 10.x.x.x at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:345) at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:254) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567) at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:576) at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:571) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:571) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:562) at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:411) at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:151) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1232) at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:255) at org.apache.hive.service.cli.operation.SQLOperation.access$800(SQLOperation.java:91) at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:348) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:362) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_xxx to YARN : Application application_xxx was killed by user userxxx at 10.x.x.x at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:304) at org.apache.hadoop.mapred.ResourceMgrDelegate.submitApplication(ResourceMgrDelegate.java:299) at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:330) ... 35 more
提交了作业20分钟被kill了,YARN出错了?作业超时了?用户主动kill了?很难说,不清楚,hiveserver2提交作业是和Hadoop YARN交互,先检查下YARN日志再说。
YARN日志如下: 23/01/18 05:13:37 INFO resourcemanager.ClientRMService: Allocated new applicationId: 449050 23/01/18 05:13:38 INFO rmapp.RMAppImpl: Storing application with id application_xxx 23/01/18 05:13:38 INFO rmapp.RMAppImpl: application_xxx State change from NEW to NEW_SAVING on event = START ... ... 23/01/18 05:13:44 INFO rmapp.RMAppImpl: Updating application application_xxx with final state: KILLED 23/01/18 05:13:44 INFO rmapp.RMAppImpl: application_xxx State change from NEW_SAVING to FINAL_SAVING on event = KILL 23/01/18 05:13:44 INFO rmapp.RMAppImpl: application_xxx State change from FINAL_SAVING to FINISHING on event = APP_UPDATE_SAVED
日志没有异常信息,日志打印的原因仍然是收到kill event,和hiveserver2是一致的。经询问系统A本身没有kill功能,那为啥会提示kill,真是用户主动kill的?不敢随意质问用户,会不会超时了,超时会有timeout之类的日志吧,会不会某情况下触发了YARN内部机制然后kill了,先看看YARN源码吧,看看kill event的可能来源。直接搜索关键词"State change from",状态机变更一般都会在一个地方实现,不会搜出很多处
找到变量定义STATE_CHANGE_MESSAGE,查找引用,找到打印日志的地方
根据日志是收到了kill event,查看kill event的定义 public enum RMAppEventType { // Source: ClientRMService START, RECOVER, KILL, // Source: Scheduler and RMAppManager APP_REJECTED,
查找所有引用kill的地方,发现2处调用(可能不全,未完整分析),一是来源RMAppKillByClientEvent,另外一个AbstractYarnScheduler
RMAppKillByClientEvent来源ClientRMService,根据其类注释可知是给YARN客户端调用 /** * The client interface to the Resource Manager. This module handles all the rpc * interfaces to the resource manager from the client. */ public class ClientRMService extends AbstractService implements
从日志打印来看,和搜索到的日志是匹配的,是用户主动kill了?
看看另外一处kill场景AbstractYarnScheduler:
如果来源是AbstractYarnScheduler,日志会有diagnostic信息,且该方法是kill整个queue的作业,从日志来看不是kill不是来源于此,还是用户或某程序主动概率大。
第二天用户告之,作业经常卡在NEW_SAVING状态,然后写了个shell用于kill NEW_SAVING状态的作业,这都可以…这么暴力…原因找打了。