1. Introduction

An earlier article already introduced the selenium library and the methods for extracting information from the browser; see: python爬虫之selenium库. The goal now: use a crawler to drive the browser, search for a keyword, and store the video information from the search results in an Excel sheet.

2. Create the Excel workbook and the Chrome driver

```python
n = 1
word = input('请输入要搜索的关键词:')
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
excl = xlwt.Workbook(encoding='utf-8', style_compression=0)
sheet = excl.add_sheet('b站视频:' + word, cell_overwrite_ok=True)
sheet.write(0, 0, '名称')      # title
sheet.write(0, 1, 'up主')      # uploader
sheet.write(0, 2, '播放量')    # play count
sheet.write(0, 3, '视频时长')  # duration
sheet.write(0, 4, '链接')      # link
sheet.write(0, 5, '发布时间')  # publish date
```

3. Define the search function

It contains button_next, which jumps to the next page. The reason By.CLASS_NAME is not used to locate it can be seen in the HTML:

<button class="vui_button vui_pagenation--btn vui_pagenation--btn-side">下一页</button>

The button's class attribute is very long and contains spaces. If selenium tries to locate it with By.CLASS_NAME, the spaces cause an error:

selenium.common.exceptions.NoSuchElementException: Message: no such element

So By.CSS_SELECTOR is used here instead.

```python
def search():
    driver.get('https://www.bilibili.com')
    input = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'nav-search-input')))
    button = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'nav-search-btn')))
    input.send_keys(word)
    button.click()
    print('开始搜索:' + word)
    windows = driver.window_handles
    driver.switch_to.window(windows[1])
    get_source()
    # jump from page 1 to page 2
    button_next = driver.find_element(By.CSS_SELECTOR, '#i_cecream > div > div:nth-child(2) > div.search-content > div > div > div.flex_center.mt_x50.mb_x50 > div > div > button:nth-child(11)')
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#i_cecream > div > div:nth-child(2) > div.search-content > div > div > div.video-list.row > div:nth-child(1) > div > div.bili-video-card__wrap.__scale-wrap')))
    button_next.click()
    get_source()
```

4. Define the next-page function

There is a dedicated next-page function here, so why does the search function above also contain a next-page step? Because inspecting the page shows the paths differ.

CSS_SELECTOR path on page 2:
#i_cecream > div > div:nth-child(2) > div.search-content > div > div > div.flex_center.mt_x50.mb_x50 > div > div > button:nth-child(11)
CSS_SELECTOR path on later pages:
#i_cecream > div > div:nth-child(2) > div.search-content > div > div > div.flex_center.mt_x50.mb_lg > div > div > button:nth-child(11)
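The two paths above differ only in a single utility class (mb_x50 vs mb_lg). As a side note, the dotted CSS-selector syntax that copes with a space-separated class attribute also lets you match on just the classes the two variants share. A minimal sketch using BeautifulSoup (already part of this project's stack) on hand-written, simplified markup — the div structure here is illustrative, not the real page:

```python
from bs4 import BeautifulSoup

# Simplified markup: two pagination containers that differ only in one
# utility class (mb_x50 on page 2, mb_lg on later pages).
page2 = '<div class="flex_center mt_x50 mb_x50"><button>下一页</button></div>'
later = '<div class="flex_center mt_x50 mb_lg"><button>下一页</button></div>'

# Dots chain compound classes, so selecting only on the classes both
# variants share matches either container:
texts = []
for html in (page2, later):
    soup = BeautifulSoup(html, 'html.parser')
    texts.append(soup.select_one('div.flex_center.mt_x50 > button').text)
print(texts)  # ['下一页', '下一页']
```

Selenium's By.CSS_SELECTOR accepts the same syntax, while By.CLASS_NAME only ever takes a single class token.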
The page-1 CSS_SELECTOR differs from the one used on later pages, which is why the page-1-to-page-2 jump was placed separately inside the search function above.

```python
def next_page():
    button_next = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#i_cecream > div > div:nth-child(2) > div.search-content > div > div > div.flex_center.mt_x50.mb_lg > div > div > button:nth-child(11)')))
    button_next.click()
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#i_cecream > div > div:nth-child(2) > div.search-content > div > div > div.video-list.row > div:nth-child(1) > div > div.bili-video-card__wrap.__scale-wrap')))
    get_source()
```

5. Define the function that grabs the page source

All of the functions defined above call get_source(); this is the function to create now. It fetches the page source and hands it to BeautifulSoup.

```python
def get_source():
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    save_excl(soup)
```

6. Extract the elements and save them to Excel

Loop over the page information with BeautifulSoup and write it into the Excel sheet created earlier.

```python
def save_excl(soup):
    list = soup.find(class_='video-list row').find_all(class_='bili-video-card')
    for item in list:
        print(item)
        video_name = item.find(class_='bili-video-card__info--tit').text
        video_up = item.find(class_='bili-video-card__info--author').string
        video_date = item.find(class_='bili-video-card__info--date').string
        video_play = item.find(class_='bili-video-card__stats--item').text
        video_times = item.find(class_='bili-video-card__stats__duration').string
        video_link = item.find('a')['href'].replace('//', 'https://')
        print(video_name, video_up, video_play, video_times, video_link, video_date)
        global n
        sheet.write(n, 0, video_name)
        sheet.write(n, 1, video_up)
        sheet.write(n, 2, video_play)
        sheet.write(n, 3, video_times)
        sheet.write(n, 4, video_link)
        sheet.write(n, 5, video_date)
        n = n + 1
```

7. Define the main function and loop through the pages

By default this fetches 10 pages of data and then stops; adjust the page count as needed. Finally, the workbook is saved under its file name.

```python
def main():
    search()
    for i in range(1, 10):
        next_page()
    driver.close()

if __name__ == '__main__':
    main()
    excl.save('b站' + word + '视频.xls')
```

8. Final code and execution results
Here the CSS_SELECTOR paths go as close to the bottom of the tree as possible, which is why they are so long: with short paths the wait is often not long enough, the page has not fully loaded, and extraction fails with an error.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import xlwt
import time

n = 1
word = input('请输入要搜索的关键词:')
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
excl = xlwt.Workbook(encoding='utf-8', style_compression=0)
sheet = excl.add_sheet('b站视频:' + word, cell_overwrite_ok=True)
sheet.write(0, 0, '名称')
sheet.write(0, 1, 'up主')
sheet.write(0, 2, '播放量')
sheet.write(0, 3, '视频时长')
sheet.write(0, 4, '链接')
sheet.write(0, 5, '发布时间')

def search():
    driver.get('https://www.bilibili.com')
    input = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'nav-search-input')))
    button = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'nav-search-btn')))
    input.send_keys(word)
    button.click()
    print('开始搜索:' + word)
    windows = driver.window_handles
    driver.switch_to.window(windows[1])
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#i_cecream > div > div:nth-child(2) > div.search-content > div > div > div.video.i_wrapper.search-all-list')))
    get_source()
    print('开始下一页:')
    button_next = driver.find_element(By.CSS_SELECTOR, '#i_cecream > div > div:nth-child(2) > div.search-content > div > div > div.flex_center.mt_x50.mb_x50 > div > div > button:nth-child(11)')
    button_next.click()
    time.sleep(2)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#i_cecream > div > div:nth-child(2) > div.search-content > div > div > div.video-list.row > div:nth-child(1) > div > div.bili-video-card__wrap.__scale-wrap > div > div > a > h3')))
    get_source()
    print('完成')

def next_page():
    button_next = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#i_cecream > div > div:nth-child(2) > div.search-content > div > div > div.flex_center.mt_x50.mb_lg > div > div > button:nth-child(11)')))
    button_next.click()
    print('开始下一页')
    time.sleep(5)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#i_cecream > div > div:nth-child(2) > div.search-content > div > div > div.video-list.row > div:nth-child(1) > div > div.bili-video-card__wrap.__scale-wrap > div > div > a > h3')))
    get_source()
    print('完成')

def save_excl(soup):
    list = soup.find(class_='video-list row').find_all(class_='bili-video-card')
    for item in list:
        print(item)
        video_name = item.find(class_='bili-video-card__info--tit').text
        video_up = item.find(class_='bili-video-card__info--author').string
        video_date = item.find(class_='bili-video-card__info--date').string
        video_play = item.find(class_='bili-video-card__stats--item').text
        video_times = item.find(class_='bili-video-card__stats__duration').string
        video_link = item.find('a')['href'].replace('//', 'https://')
        print(video_name, video_up, video_play, video_times, video_link, video_date)
        global n
        sheet.write(n, 0, video_name)
        sheet.write(n, 1, video_up)
        sheet.write(n, 2, video_play)
        sheet.write(n, 3, video_times)
        sheet.write(n, 4, video_link)
        sheet.write(n, 5, video_date)
        n = n + 1

def get_source():
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    save_excl(soup)

def main():
    search()
    for i in range(1, 10):
        next_page()
    driver.close()

if __name__ == '__main__':
    main()
    excl.save('b站' + word + '视频.xls')
```

Running it and entering "MV" as the keyword: the script executes, the Excel file is generated in the folder, and opening it confirms the information was saved. The same works with any other keyword. With that, this simple scrape of search results is complete. If you want to run it hidden on a server, see my previous article: python爬虫之selenium库.