Headless Browser Automation Compared: Playwright vs. Selenium
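"Perhaps every man has had two such women, at least two. Marry the red rose and, in time, the red becomes a smear of mosquito blood on the wall while the white is still the moonlight before the bed; marry the white rose and the white becomes a grain of rice stuck to his clothes while the red is a cinnabar mole over the heart." Eileen Chang, "Red Rose, White Rose"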
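Selenium has long been the king of open-source browser automation in the Python world. Over the past two years, however, Microsoft's open-source Playwright has risen quickly and now looks set to challenge Selenium's standing. In this article we compare Playwright and Selenium and see whether Selenium, yesterday's rose, has turned into mosquito blood.

Installing and using Playwright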
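Playwright is an end-to-end web testing and automation library open-sourced by Microsoft, so it comes with big-vendor backing and a full feature set. As a headless browser framework its main purpose is testing web applications, but in practice headless browsers are used even more often for web scraping, that is, for crawlers.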
First, run the install command in a terminal:

    pip3 install playwright

The program prints:

    Successfully built greenlet
    Installing collected packages: pyee, greenlet, playwright
      Attempting uninstall: greenlet
        Found existing installation: greenlet 2.0.2
        Uninstalling greenlet-2.0.2:
          Successfully uninstalled greenlet-2.0.2
    Successfully installed greenlet-2.0.1 playwright-1.30.0 pyee-9.0.4

At the time of writing, the latest stable release is 1.30.0.
Next, you can install the browser builds directly:

    playwright install

The program prints:

    Downloading Chromium 110.0.5481.38 (playwright build v1045) from https://playwright.azureedge.net/builds/chromium/1045/chromium-mac-arm64.zip
    123.8 Mb [====================] 100% 0.0s
    Chromium 110.0.5481.38 (playwright build v1045) downloaded to /Users/liuyue/Library/Caches/ms-playwright/chromium-1045
    Downloading FFMPEG playwright build v1008 from https://playwright.azureedge.net/builds/ffmpeg/1008/ffmpeg-mac-arm64.zip
    1 Mb [====================] 100% 0.0s
    FFMPEG playwright build v1008 downloaded to /Users/liuyue/Library/Caches/ms-playwright/ffmpeg-1008
    Downloading Firefox 108.0.2 (playwright build v1372) from https://playwright.azureedge.net/builds/firefox/1372/firefox-mac-11-arm64.zip
    69.8 Mb [====================] 100% 0.0s
    Firefox 108.0.2 (playwright build v1372) downloaded to /Users/liuyue/Library/Caches/ms-playwright/firefox-1372
    Downloading Webkit 16.4 (playwright build v1767) from https://playwright.azureedge.net/builds/webkit/1767/webkit-mac-12-arm64.zip
    56.9 Mb [====================] 100% 0.0s
    Webkit 16.4 (playwright build v1767) downloaded to /Users/liuyue/Library/Caches/ms-playwright/webkit-1767

By default this downloads the Chromium, Firefox, and WebKit builds.

Of these, Chromium-based browsers are the most widely used, the best known being Google Chrome and Microsoft's own Edge.
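Make sure Edge is installed on the machine, then let's try it out:

    from playwright.sync_api import sync_playwright
    import time

    with sync_playwright() as p:
        # channel="msedge" drives the locally installed Edge; headless=False shows the window
        browser = p.chromium.launch(channel="msedge", headless=False)
        page = browser.new_page()
        page.goto("http://v3u.cn")
        page.screenshot(path="./example-v3u.png")
        # keep the window open for a few seconds before closing
        time.sleep(5)
        browser.close()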
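Here we import sync_playwright from the sync_api module. As the name suggests, it runs synchronously, and the browser process is started through a context manager.

The channel argument then selects the Edge browser; after the screenshot is taken, the browser process is closed.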
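We can also set the headless parameter to True so the browser runs in the background:

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # headless=True: no visible window
        browser = p.chromium.launch(channel="msedge", headless=True)
        page = browser.new_page()
        page.goto("http://v3u.cn")
        page.screenshot(path="./example-v3u.png")
        browser.close()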
Besides the synchronous mode, Playwright also supports an asynchronous, non-blocking mode:

    import asyncio
    from playwright.async_api import async_playwright


    async def main():
        async with async_playwright() as p:
            browser = await p.chromium.launch(channel="msedge", headless=False)
            page = await browser.new_page()
            await page.goto("http://v3u.cn")
            print(await page.title())
            await browser.close()


    asyncio.run(main())
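It is driven by the standard asyncio library, and Playwright's own functions only need the await keyword in front of them, which is very convenient. By comparison, Selenium's own library has no asynchronous mode; you have to install a third-party extension to get one.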
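The asynchronous mode pays off when several pages are loaded at once. The following is a minimal sketch rather than a recommended pattern: `fetch_title`, `gather_limited` and `main` are illustrative names introduced here, and the limit of three concurrent pages is an arbitrary assumption. Playwright is imported inside `main` so that the concurrency helper can be reused on its own.

```python
import asyncio


async def fetch_title(browser, url):
    """Open one page in the shared browser and return its title."""
    page = await browser.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        await page.close()


async def gather_limited(jobs, limit=3):
    """Run coroutine factories concurrently, at most `limit` at a time.

    Results come back in the same order as `jobs`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run(job):
        async with semaphore:
            return await job()

    return await asyncio.gather(*(run(job) for job in jobs))


async def main(urls):
    # Imported lazily: only this function needs Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        # u=u binds each URL at definition time
        jobs = [lambda u=u: fetch_title(browser, u) for u in urls]
        titles = await gather_limited(jobs, limit=3)
        await browser.close()
    return titles


# Example call (needs Playwright and network access):
# print(asyncio.run(main(["http://v3u.cn", "https://playwright.dev/python/"])))
```

The semaphore keeps the number of open pages bounded, which is usually kinder to both the local machine and the target site than launching every request at once.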
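Coolest of all, Playwright can record what you do in the browser and turn it into code. Run the following command in a terminal:

    python -m playwright codegen --target python -o edge.py -b chromium --channel=msedge

Here the codegen command records the session with Edge as the browser and writes every action to the file edge.py.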
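Playwright also supports emulating mobile browsers, for example an iPhone:

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # built-in device descriptor (viewport, user agent, touch support, ...)
        iphone_13 = p.devices["iPhone 13 Pro"]
        browser = p.webkit.launch(headless=False)
        # apply the descriptor to the browser context
        context = browser.new_context(**iphone_13)
        page = context.new_page()
        page.goto("https://v3u.cn")
        page.screenshot(path="./v3u-iphone.png")
        browser.close()

This emulates how the site is visited from the browser of an iPhone 13 Pro.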
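The same emulation can be repeated over several device profiles. This is a hedged sketch: `screenshot_name` and `capture_on_devices` are names made up for illustration, the "v3u" file prefix is an assumption, and the device names passed in must exist in Playwright's `devices` registry.

```python
import re


def screenshot_name(device_name, prefix="v3u"):
    """Turn a device name such as 'iPhone 13 Pro' into a file name."""
    slug = re.sub(r"[^a-z0-9]+", "-", device_name.lower()).strip("-")
    return f"{prefix}-{slug}.png"


def capture_on_devices(url, device_names):
    """Take one screenshot of `url` per emulated device and return the paths."""
    # Imported lazily: only this function needs Playwright installed.
    from playwright.sync_api import sync_playwright

    paths = []
    with sync_playwright() as p:
        browser = p.webkit.launch(headless=True)
        for name in device_names:
            # one fresh context per device descriptor
            context = browser.new_context(**p.devices[name])
            page = context.new_page()
            page.goto(url)
            path = screenshot_name(name)
            page.screenshot(path=path)
            paths.append(path)
            context.close()
        browser.close()
    return paths


# Example call (needs Playwright and network access):
# capture_on_devices("https://v3u.cn", ["iPhone 13 Pro", "Pixel 5"])
```

Using a separate context per device keeps cookies and viewport settings from leaking between the emulated sessions.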
Of course, besides UI testing we also want Playwright to do some of the dirty work for us, namely scraping:

    from playwright.sync_api import sync_playwright


    def extract_data(entry):
        # each field of a country lives in its own element inside the entry
        name = entry.locator("h3").inner_text().strip()
        capital = entry.locator("span.country-capital").inner_text()
        population = entry.locator("span.country-population").inner_text()
        area = entry.locator("span.country-area").inner_text()
        return {"name": name, "capital": capital, "population": population, "area (km sq)": area}


    with sync_playwright() as p:
        # launch the browser instance and define a new context
        browser = p.chromium.launch()
        context = browser.new_context()
        # open a new tab and go to the website
        page = context.new_page()
        page.goto("https://www.scrapethissite.com/pages/simple/")
        page.wait_for_load_state("load")
        # get the countries
        countries = page.locator("div.country")
        n_countries = countries.count()
        # loop through the elements and scrape the data
        data = []
        for i in range(n_countries):
            entry = countries.nth(i)
            sample = extract_data(entry)
            data.append(sample)
        browser.close()
Here the data variable holds the scraped content:

    [
        {'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area (km sq)': '468.0'},
        {'name': 'United Arab Emirates', 'capital': 'Abu Dhabi', 'population': '4975593', 'area (km sq)': '82880.0'},
        {'name': 'Afghanistan', 'capital': 'Kabul', 'population': '29121286', 'area (km sq)': '647500.0'},
        {'name': 'Antigua and Barbuda', 'capital': "St. John's", 'population': '86754', 'area (km sq)': '443.0'},
        {'name': 'Anguilla', 'capital': 'The Valley', 'population': '13254', 'area (km sq)': '102.0'},
        ...
    ]
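Scraped values arrive as text, so a small post-processing step is usually needed before the data is stored. The sketch below assumes every field is a string, as `inner_text()` returns; `clean_record` and `to_csv` are hypothetical helper names, and the two sample records are copied from the output above.

```python
import csv
import io

FIELDS = ["name", "capital", "population", "area (km sq)"]


def clean_record(record):
    """Convert the numeric fields of one scraped record from text to numbers."""
    return {
        "name": record["name"],
        "capital": record["capital"],
        "population": int(record["population"]),
        "area (km sq)": float(record["area (km sq)"]),
    }


def to_csv(records):
    """Serialise cleaned records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


sample = [
    {"name": "Andorra", "capital": "Andorra la Vella", "population": "84000", "area (km sq)": "468.0"},
    {"name": "Anguilla", "capital": "The Valley", "population": "13254", "area (km sq)": "102.0"},
]
cleaned = [clean_record(r) for r in sample]
print(to_csv(cleaned))
```

Keeping the conversion in its own function means the same cleaning step can be applied to the output of either scraper.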
By and large, everything you would expect is there; for more features, see the official documentation: https://playwright.dev/python/docs/library

Selenium
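Selenium has long been one of the most popular open-source headless browser tools for web scraping and web automation. When scraping with Selenium we can automate the browser, interact with UI elements, and imitate user actions on a web application. Its core components include WebDriver, Selenium IDE, and Selenium Grid.

For the basics of Selenium, please see the earlier article "Python 3.7 crawler: logging in with cookies and simulating a form file upload with Selenium"; we won't repeat them here.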
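As mentioned above, compared with Playwright, Selenium needs third-party libraries for asynchronous concurrent execution, and recording a video of the session likewise requires an external solution.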
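Because Selenium's calls block, one way to overlap several jobs without an async extension is a standard-library thread pool in which each thread owns its own driver. This is thread-based parallelism, not asyncio-style concurrency. A minimal sketch, assuming Selenium 4.6 or later (where Selenium Manager locates the driver) and a local Chrome; `scrape_title` and `run_in_threads` are illustrative names, not Selenium APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_title(url):
    """Blocking job: open a headless Chrome, load `url`, return the page title."""
    # Imported lazily: only this function needs Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()


def run_in_threads(job, items, workers=4):
    """Apply a blocking `job` to every item using a pool of worker threads.

    Results are returned in the same order as `items`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job, items))


# Example call (needs Selenium, Chrome and network access):
# print(run_in_threads(scrape_title, ["http://v3u.cn", "https://www.scrapethissite.com/pages/"]))
```

Each worker starts a full browser process, so the worker count should stay small.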
Just as we did with Playwright, let's build a simple scraping script with Selenium.
First import the required modules and configure the Selenium instance, making sure headless mode is active through the Chrome options (older Selenium releases used options.headless = True; current ones pass the --headless argument):

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    # webdriver_manager: https://github.com/SergeyPirogov/webdriver_manager
    # downloads the WebDriver binary automatically;
    # Service then manages the driver's state.
    from webdriver_manager.chrome import ChromeDriverManager


    def extract_data(row):
        name = row.find_element(By.TAG_NAME, "h3").text.strip()
        capital = row.find_element(By.CSS_SELECTOR, "span.country-capital").text
        population = row.find_element(By.CSS_SELECTOR, "span.country-population").text
        area = row.find_element(By.CSS_SELECTOR, "span.country-area").text
        return {"name": name, "capital": capital, "population": population, "area (km sq)": area}


    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    # this returns the path the WebDriver was downloaded to
    chrome_path = ChromeDriverManager().install()
    # define the Chrome service and pass it to the driver instance
    chrome_service = Service(chrome_path)
    driver = webdriver.Chrome(service=chrome_service, options=options)
    url = "https://www.scrapethissite.com/pages/simple/"
    driver.get(url)
    # get the country elements
    countries = driver.find_elements(By.CSS_SELECTOR, "div.country")
    # extract the data
    data = list(map(extract_data, countries))
    driver.quit()
The data returned:

    [
        {'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area (km sq)': '468.0'},
        {'name': 'United Arab Emirates', 'capital': 'Abu Dhabi', 'population': '4975593', 'area (km sq)': '82880.0'},
        {'name': 'Afghanistan', 'capital': 'Kabul', 'population': '29121286', 'area (km sq)': '647500.0'},
        {'name': 'Antigua and Barbuda', 'capital': "St. John's", 'population': '86754', 'area (km sq)': '443.0'},
        {'name': 'Anguilla', 'capital': 'The Valley', 'population': '13254', 'area (km sq)': '102.0'},
        ...
    ]

Performance test
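With the same amount of data to scrape, we naturally want to know which of the two performs better: Playwright or Selenium?

Here we use the time module built into Python 3.10 to measure how long each scraping script takes to run.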
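A single wall-clock measurement can be noisy, since browser start-up and network latency vary between runs. As a hedged sketch of a slightly more robust setup, the helper below repeats a job several times and summarises the durations. It uses `time.perf_counter` instead of the `time.time` calls in the scripts that follow; `time_runs`, `summarize` and `fake_scrape` are hypothetical names, and `fake_scrape` merely stands in for a real scraping function.

```python
import statistics
import time


def time_runs(func, repeats=3):
    """Call func() `repeats` times and return each run's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return the fastest, average and slowest run."""
    return {
        "min": min(durations),
        "mean": statistics.mean(durations),
        "max": max(durations),
    }


def fake_scrape():
    # Stand-in for a real scraping function; it only sleeps briefly.
    time.sleep(0.01)


stats = summarize(time_runs(fake_scrape, repeats=3))
print(f"min {stats['min']:.4f}s  mean {stats['mean']:.4f}s  max {stats['max']:.4f}s")
```

Wrapping each scraper in a function and passing it to `time_runs` would give the two tools identical measurement conditions.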
Playwright:

    import time
    from playwright.sync_api import sync_playwright


    def extract_data(entry):
        name = entry.locator("h3").inner_text().strip()
        capital = entry.locator("span.country-capital").inner_text()
        population = entry.locator("span.country-population").inner_text()
        area = entry.locator("span.country-area").inner_text()
        return {"name": name, "capital": capital, "population": population, "area (km sq)": area}


    # start the timer
    start = time.time()
    with sync_playwright() as p:
        # launch the browser instance and define a new context
        browser = p.chromium.launch()
        context = browser.new_context()
        # open a new tab and go to the website
        page = context.new_page()
        page.goto("https://www.scrapethissite.com/pages/")
        # click through to the first page and wait while it loads
        page.locator("a[href='/pages/simple/']").click()
        page.wait_for_load_state("load")
        # get the countries
        countries = page.locator("div.country")
        n_countries = countries.count()
        data = []
        for i in range(n_countries):
            entry = countries.nth(i)
            sample = extract_data(entry)
            data.append(sample)
        browser.close()
    end = time.time()
    print(f"The whole script took: {end - start:.4f}")
Selenium:

    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    # webdriver_manager: https://github.com/SergeyPirogov/webdriver_manager
    # downloads the WebDriver binary automatically;
    # Service then manages the driver's state.
    from webdriver_manager.chrome import ChromeDriverManager


    def extract_data(row):
        name = row.find_element(By.TAG_NAME, "h3").text.strip()
        capital = row.find_element(By.CSS_SELECTOR, "span.country-capital").text
        population = row.find_element(By.CSS_SELECTOR, "span.country-population").text
        area = row.find_element(By.CSS_SELECTOR, "span.country-area").text
        return {"name": name, "capital": capital, "population": population, "area (km sq)": area}


    # start the timer
    start = time.time()
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    # this returns the path the WebDriver was downloaded to
    chrome_path = ChromeDriverManager().install()
    # define the Chrome service and pass it to the driver instance
    chrome_service = Service(chrome_path)
    driver = webdriver.Chrome(service=chrome_service, options=options)
    url = "https://www.scrapethissite.com/pages/"
    driver.get(url)
    # find the link to the first page and click it
    first_page = driver.find_element(By.CSS_SELECTOR, "h3.page-title a")
    first_page.click()
    # locate the container and the country elements inside it
    countries_container = driver.find_element(By.CSS_SELECTOR, "section#countries div.container")
    countries = countries_container.find_elements(By.CSS_SELECTOR, "div.country")
    # scrape the data using the extract_data function
    data = list(map(extract_data, countries))
    end = time.time()
    print(f"The whole script took: {end - start:.4f}")
    driver.quit()
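Test results: in the resulting chart the Y axis is execution time, and it is clear at a glance that Selenium is roughly five times slower than Playwright.

Red rose or white rose?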
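Admittedly, Playwright and Selenium are both excellent headless browser automation tools, and both can handle scraping jobs. We cannot yet declare either one the better tool, so the choice depends on your scraping needs, the kind of data you want to scrape, browser support, and other considerations: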
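Playwright does not support real devices, whereas Selenium can be used with real devices and remote servers.
Playwright has built-in support for asynchronous concurrency, whereas Selenium needs third-party tools.
Playwright performs better than Selenium: in the test above it was roughly five times faster.
Selenium has no support for features such as detailed reports and video recording, whereas Playwright has them built in.
Selenium supports more browsers than Playwright.
Selenium supports more programming languages.

Conclusion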
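If you have read this far, you already know in your heart which one is the best headless browser tool. As the saying goes, the strong grow stronger among the strong, and only the weak fear competition. We trust that Playwright's arrival will push Selenium to become a better version of itself and to keep up the good work.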