排查问题提交hadoop作业偶尔失败
用户反馈偶尔有作业提交失败,一般运行1020分钟就出现失败提醒,作业是在系统A提交的hivesql语句,系统A的日志和hiveserver2的日志输出是一样的,都提示YarnException:FailedtosubmitapplicationxxxtoYARN:Applicationapplicationxxxwaskilledbyuserxxxat10。10。x。x
hiveserver2日志如下:20230118T05:14:04,073ERROR〔HiveServer2BackgroundPool:Thread917734〕exec。Task:JobSubmissionfailedwithexceptionjava。io。IOException(org。apache。hadoop。yarn。exceptions。YarnException:FailedtosubmitapplicationxxxtoYARN:Applicationapplicationxxxwaskilledbyuseruserxxxat10。x。x。x)java。io。IOException:org。apache。hadoop。yarn。exceptions。YarnException:FailedtosubmitapplicationxxxtoYARN:Applicationapplicationxxxwaskilledbyuseruserxxxat10。x。x。xatorg。apache。hadoop。mapred。YARNRunner。submitJob(YARNRunner。java:345)atorg。apache。hadoop。mapreduce。JobSubmitter。submitJobInternal(JobSubmitter。java:254)atorg。apache。hadoop。mapreduce。Job11。run(Job。java:1570)atorg。apache。hadoop。mapreduce。Job11。run(Job。java:1567)atjava。security。AccessController。doPrivileged(NativeMethod)atjavax。security。auth。Subject。doAs(Subject。java:422)atorg。apache。hadoop。security。UserGroupInformation。doAs(UserGroupInformation。java:1729)atorg。apache。hadoop。mapreduce。Job。submit(Job。java:1567)atorg。apache。hadoop。mapred。JobClient1。run(JobClient。java:576)atorg。apache。hadoop。mapred。JobClient1。run(JobClient。java:571)atjava。security。AccessController。doPrivileged(NativeMethod)atjavax。security。auth。Subject。doAs(Subject。java:422)atorg。apache。hadoop。security。UserGroupInformation。doAs(UserGroupInformation。java:1729)atorg。apache。hadoop。mapred。JobClient。submitJobInternal(JobClient。java:571)atorg。apache。hadoop。mapred。JobClient。submitJob(JobClient。java:562)atorg。apache。hadoop。hive。ql。exec。mr。ExecDriver。execute(ExecDriver。java:411)atorg。apache。hadoop。hive。ql。exec。mr。MapRedTask。execute(MapRedTask。java:151)atorg。apache。hadoop。hive。ql。exec。Task。executeTask(Task。java:199)atorg。apache。hadoop。hive。ql。exec。TaskRunner。runSequential(TaskRunner。java:100)atorg。apache。hadoop。hive。ql。Driver。launchTask(Driver。java:2183)atorg。apache。hadoop。hive。ql。Driver。execute(Driver。java:1839)atorg。apache。hadoop。hive。ql。Driver。runInternal(Driver。java:1526)atorg。apache。hadoop。hive。ql。Driver。run(Driver。java:1237)atorg。apache。hadoop。hive。ql。Driver。run(Driver。java:1232)atorg。apache。hive。service。cli。operation。SQLOperation。runQuery(SQLOperation。java:255)atorg。apache。hive。service。cli。operation。SQLOperation。access800(SQLOperation。java:91)atorg。apache。hive。service。cli。operation。SQLOperationBackgroundWork1。run(SQLOperation。java:348)atjava。security。AccessController。doPrivileged(NativeMethod)atjavax。security。auth。Subject。doAs(Subject。java:422)atorg。apache。hadoop。security。UserGroupInformation。doAs(UserGroupInformation。java:1729)atorg。apache。hive。service。cli。operation。SQLOperationBackgroundWork。run(SQLOperation。java:362)atjava。util。concurrent。ExecutorsRunnableAdapter。call(Executors。java:511)atjava。util。concurrent。FutureTask。run(FutureTask。java:266)atjava。util。concurrent。ThreadPoolExecutor。runWorker(ThreadPoolExecutor。java:1149)atjava。util。concurrent。ThreadPoolExecutorWorker。run(ThreadPoolExecutor。java:624)atjava。lang。Thread。run(Thread。java:750)Causedby:org。apache。hadoop。yarn。exceptions。YarnException:FailedtosubmitapplicationxxxtoYARN:Applicationapplicationxxxwaskilledbyuseruserxxxat10。x。x。xatorg。apache。hadoop。yarn。client。api。impl。YarnClientImpl。submitApplication(YarnClientImpl。java:304)atorg。apache。hadoop。mapred。ResourceMgrDelegate。submitApplication(ResourceMgrDelegate。java:299)atorg。apache。hadoop。mapred。YARNRunner。submitJob(YARNRunner。java:330)。。。35more
提交了作业20分钟被kill了,YARN出错了?作业超时了?用户主动kill了?很难说,不清楚,hiveserver2提交作业是和HadoopYARN交互,先检查下YARN日志再说。
YARN日志如下:23011805:13:37INFOresourcemanager。ClientRMService:AllocatednewapplicationId:44905023011805:13:38INFOrmapp。RMAppImpl:Storingapplicationwithidapplicationxxx23011805:13:38INFOrmapp。RMAppImpl:applicationxxxStatechangefromNEWtoNEWSAVINGoneventSTART。。。。。。23011805:13:44INFOrmapp。RMAppImpl:Updatingapplicationapplicationxxxwithfinalstate:KILLED23011805:13:44INFOrmapp。RMAppImpl:applicationxxxStatechangefromNEWSAVINGtoFINALSAVINGoneventKILL23011805:13:44INFOrmapp。RMAppImpl:applicationxxxStatechangefromFINALSAVINGtoFINISHINGoneventAPPUPDATESAVED
日志没有异常信息,日志打印的原因仍然是收到killevent,和hiveserver2是一致的。经询问系统A本身没有kill功能,那为啥会提示kill,真是用户主动kill的?不敢随意质问用户,会不会超时了,超时会有timeout之类的日志吧,会不会某情况下触发了YARN内部机制然后kill了,先看看YARN源码吧,看看killevent的可能来源。直接搜索关键词Statechangefrom,状态机变更一般都会在一个地方实现,不会搜出很多处
找到变量定义STATECHANGEMESSAGE,查找引用,找到打印日志的地方
根据日志是收到了killevent,查看killevent的定义publicenumRMAppEventType{Source:ClientRMServiceSTART,RECOVER,KILL,Source:SchedulerandRMAppManagerAPPREJECTED,
查找所有引用kill的地方,发现2处调用(可能不全,未完整分析),一是来源RMAppKillByClientEvent,另外一个AbstractYarnScheduler
RMAppKillByClientEvent来源ClientRMService,根据其类注释可知是给YARN客户端调用TheclientinterfacetotheResourceManager。Thismodulehandlesalltherpcinterfacestotheresourcemanagerfromtheclient。publicclassClientRMServiceextendsAbstractServiceimplements
从日志打印来看,和搜索到的日志是匹配的,是用户主动kill了?
看看另外一处kill场景AbstractYarnScheduler:
如果来源是AbstractYarnScheduler,日志会有diagnostic信息,且该方法是kill整个queue的作业,从日志来看不是kill不是来源于此,还是用户或某程序主动概率大。
第二天用户告之,作业经常卡在NEWSAVING状态,然后写了个shell用于killNEWSAVING状态的作业,这都可以这么暴力原因找打了。