For these operations there is the more powerful requests library; with it, handling Cookies, login authentication, proxy settings and the like is no trouble at all.

Installation: pip install requests

Official docs: https://requests.readthedocs.io/en/latest/

1. Introductory example

The urlopen method in the urllib library actually requests a page with GET; the corresponding method in requests is simply get, which reads much more clearly. Let's look at an example:

```python
import requests

r = requests.get('https://www.baidu.com')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.text)
print(r.cookies)
```

Other request types work the same way:

```python
r = requests.post('http://httpbin.org/post')
r = requests.put('http://httpbin.org/put')
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')
```

2. GET requests

```python
import requests

data = {'name': 'germey', 'age': 22}
r = requests.get('http://httpbin.org/get', params=data)
print(r.text)
```

2.1 Fetching binary data

Let's take an image as an example:

```python
import requests

r = requests.get('http://qwmxpxq5y.hnbkt.clouddn.com/hh.png')
print(r.text)
print(r.content)
```

Some sites refuse requests that carry no headers:

```python
import requests

r = requests.get('https://mmzztt.com')
print(r.text)
```

But if we add headers with a User-Agent, the request goes through:

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get('https://mmzztt.com', headers=headers)
print(r.text)
```

3. POST requests

3.1 We have covered the most basic GET request; another very common request type is POST. Sending a POST request with requests is just as simple:

```python
import requests

data = {'name': 'germey', 'age': 22}
r = requests.post('http://httpbin.org/post', data=data)
print(r.text)
```

Test site: Juchao (cninfo) — open the site, click "资讯" (News) and choose "公开信息" (Public Information):

```python
import requests

url = 'http://www.cninfo.com.cn/data20/ints/statistics'
res = requests.post(url)
print(res.text)
```

3.2 After a request is sent, what comes back is naturally a response. In the examples above we used text and content to read the response body; there are many more attributes and methods for other information, such as the status code, response headers, and Cookies:

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get('http://www.jianshu.com', headers=headers)
print(type(r.status_code), r.status_code)
print(type(r.headers), r.headers)
print(type(r.cookies), r.cookies)
print(type(r.url), r.url)
print(type(r.history), r.history)
```

3.3 The status code is commonly used to check whether a request succeeded, and requests provides a built-in status-code lookup object, requests.codes:

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get('http://www.jianshu.com', headers=headers)
if not r.status_code == requests.codes.ok:
    exit()
else:
    print('Request Successfully')
```
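Under the hood, requests.codes is simply a lookup object that maps readable names to numeric codes, so the check above could be written against any alias. A quick illustrative sketch (the attribute names used here are standard names registered by requests):

```python
import requests

# Each alias is just an integer attribute on the lookup object
print(requests.codes.ok)                     # 200
print(requests.codes.not_found)              # 404
print(requests.codes.internal_server_error)  # 500
print(requests.codes.teapot)                 # 418
```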
3.4 Of course, ok cannot be the only condition name. The return codes and their corresponding query names are listed below:

```python
# Informational
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),

# Success
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect', 'resume_incomplete', 'resume'),  # These 2 to be removed in 3.0

# Client errors
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

# Server errors
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication')
```

4. Advanced usage

4.1 Adding a proxy

```python
import requests

proxy = {
    'http': 'http://183.162.171.78:4216',
}
# Return the current IP
res = requests.get('http://httpbin.org/ip', proxies=proxy)
print(res.text)
```

4.2 Using Kuaidaili proxy IPs

Docs: https://www.kuaidaili.com/doc/dev/quickstart/

After opening the page, keep the default HTTP protocol and choose JSON as the return format. My order is a VIP order, so I select "stable" for stability, again choose JSON as the return format, click "Generate Link", and copy the API link it produces (a sketch of wiring such a link into requests appears after section 5 below).

4.3 Suppressing warnings

```python
from requests.packages import urllib3

urllib3.disable_warnings()
```

Crawler workflow

5. A basic crawler

```python
import requests
from lxml import etree


def main():
    # 1. Define the page URLs and the parse rule
    crawl_urls = [
        'https://36kr.com/p/1328468833360133',
        'https://36kr.com/p/1328528129988866',
        'https://36kr.com/p/1328512085344642'
    ]
    parse_rule = '//h1[contains(@class, "article-title margin-bottom-20 common-width")]/text()'
    for url in crawl_urls:
        # 2. Send the HTTP request
        response = requests.get(url)
        # 3. Parse the HTML
        result = etree.HTML(response.text).xpath(parse_rule)[0]
        # 4. Save the result
        print(result)


if __name__ == '__main__':
    main()
```
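Returning to the proxy link from section 4.2: below is a rough sketch of how such a generated API link could be plugged into requests. The api_url placeholder and the JSON field names ('data', 'proxy_list') are assumptions about the response shape, not verified details of the Kuaidaili API; adjust them to whatever your generated link actually returns.

```python
import requests

# Paste the link generated on the provider's page here (placeholder, not a real order link)
api_url = 'https://example.com/your-generated-api-link'


def get_proxy():
    # Assumes the link returns JSON shaped like {"data": {"proxy_list": ["ip:port", ...]}}
    data = requests.get(api_url).json()
    ip_port = data['data']['proxy_list'][0]
    return {'http': f'http://{ip_port}', 'https': f'http://{ip_port}'}


res = requests.get('http://httpbin.org/ip', proxies=get_proxy())
print(res.text)
```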
6. Full-site crawling

6.1 A shared base module

Create a utils folder and write a base class that other programs can reuse:

```python
import random
import time

import requests
from retrying import retry
from requests.packages.urllib3.exceptions import InsecureRequestWarning
from lxml import etree

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)


class FakeChromeUA:
    first_num = random.randint(55, 62)
    third_num = random.randint(0, 3200)
    fourth_num = random.randint(0, 140)
    os_type = [
        '(Windows NT 6.1; WOW64)',
        '(Windows NT 10.0; WOW64)',
        '(X11; Linux x86_64)',
        '(Macintosh; Intel Mac OS X 10_12_6)'
    ]
    chrome_version = 'Chrome/{}.0.{}.{}'.format(first_num, third_num, fourth_num)

    @classmethod
    def get_ua(cls):
        return ' '.join([
            'Mozilla/5.0',
            random.choice(cls.os_type),
            'AppleWebKit/537.36',
            '(KHTML, like Gecko)',
            cls.chrome_version,
            'Safari/537.36'
        ])


class Spiders(FakeChromeUA):
    urls = []

    @retry(stop_max_attempt_number=3, wait_fixed=2000)
    def fetch(self, url, param=None, headers=None):
        try:
            if not headers:
                headers = {}
            headers['user-agent'] = self.get_ua()
            self.wait_some_time()
            response = requests.get(url, params=param, headers=headers)
            if response.status_code == 200:
                response.encoding = 'utf-8'
                return response
        except requests.ConnectionError:
            return

    def wait_some_time(self):
        time.sleep(random.randint(100, 300) / 1000)
```

6.2 Putting it into practice

```python
from urllib.parse import urljoin
from queue import Queue

import requests
from lxml import etree
from pymongo import MongoClient

from xl.base import Spiders

flt = lambda x: x[0] if x else None


class Crawl(Spiders):
    base_url = 'https://36kr.com/'
    # Seed URL
    start_url = 'https://36kr.com/information/technology'
    # Parse rules
    rules = {
        # Article list
        'list_urls': '//p[@class="article-item-pic-wrapper"]/a/@href',
        # Detail page data
        'detail_urls': '//p[@class="common-width margin-bottom-20"]/text()',
        # Title
        'title': '//h1[@class="article-title margin-bottom-20 common-width"]/text()',
    }
    # URL queue
    list_queue = Queue()

    def crawl(self, url):
        """Index page"""
        response = self.fetch(url)
        list_urls = etree.HTML(response.text).xpath(self.rules['list_urls'])
        for list_url in list_urls:
            print(urljoin(self.base_url, list_url))
            # Collect the article URLs
            self.list_queue.put(urljoin(self.base_url, list_url))

    def list_loop(self):
        """Consume the list queue"""
        while True:
            list_url = self.list_queue.get()
            print(self.list_queue.qsize())
            self.crawl_detail(list_url)
            # Exit once the queue is empty
            if self.list_queue.empty():
                break

    def crawl_detail(self, url):
        """Detail page"""
        response = self.fetch(url)
        html = etree.HTML(response.text)
        content = html.xpath(self.rules['detail_urls'])
        title = flt(html.xpath(self.rules['title']))
        print(title)
        data = {
            'content': content,
            'title': title
        }
        self.save_mongo(data)

    def save_mongo(self, data):
        client = MongoClient()  # Establish the connection
        col = client['python']['hh']
        if isinstance(data, dict):
            res = col.insert_one(data)
            return res
        else:
            return 'A single record must be a dict like {"name": "age"}; you passed %s' % type(data)

    def main(self):
        # 1. Listing page
        self.crawl(self.start_url)
        self.list_loop()


if __name__ == '__main__':
    s = Crawl()
    s.main()
```

requests-cache

Installation: pip install requests-cache

When writing crawlers, we often run into situations like these:

The site is complex, and we end up sending many duplicate requests.

Sometimes the crawler is interrupted unexpectedly, but we did not save the crawl state, so re-running it means crawling everything again.

A baseline test without caching:

```python
import requests
import time

start = time.time()
session = requests.Session()
for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time', end - start)
```

The same test with caching:

```python
import requests_cache
import time

start = time.time()
session = requests_cache.CachedSession('demo_cache')
for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time', end - start)
```

However, in the version above we replaced the requests Session object outright. Is there another way? For instance, could we leave the existing code untouched and only add a few lines of initialization at the top to configure requests-cache?

```python
import time
import requests
import requests_cache

requests_cache.install_cache('demo_cache')

start = time.time()
session = requests.Session()
for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time', end - start)
```

This time we simply called requests-cache's install_cache method, and the regular requests Session is used exactly as before.
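One way to confirm that install_cache is doing its job is to check the from_cache flag that requests-cache attaches to responses. A minimal sketch, assuming that attribute is available in the installed version (getattr with a default keeps it safe either way):

```python
import requests
import requests_cache

requests_cache.install_cache('demo_cache')

# The first request hits the network; the repeat should be answered from the cache
r1 = requests.get('http://httpbin.org/get')
r2 = requests.get('http://httpbin.org/get')
print(getattr(r1, 'from_cache', False))  # expected: False
print(getattr(r2, 'from_cache', False))  # expected: True
```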
We just learned that requests-cache uses SQLite as the cache backend by default. Can that be changed, say to plain files or another database?

Of course it can. For example, to switch the backend to local files:

```python
requests_cache.install_cache('demo_cache', backend='filesystem')
```

If you would rather not generate cache files in the working directory, you can point it at the system cache directory instead:

```python
requests_cache.install_cache('demo_cache', backend='filesystem', use_cache_dir=True)
```

Besides the filesystem, requests-cache also supports other backends such as Redis, MongoDB, GridFS, and even in-memory storage, though each needs its own dependency; see the table below:

| Backend | Class | Alias | Dependencies |
| --- | --- | --- | --- |
| SQLite | SQLiteCache | sqlite | |
| Redis | RedisCache | redis | redis-py |
| MongoDB | MongoCache | mongodb | pymongo |
| GridFS | GridFSCache | gridfs | pymongo |
| DynamoDB | DynamoDbCache | dynamodb | boto3 |
| Filesystem | FileCache | filesystem | |
| Memory | BaseCache | memory | |

For example, with Redis the setup can be rewritten as:

```python
backend = requests_cache.RedisCache(host='localhost', port=6379)
requests_cache.install_cache('demo_cache', backend=backend)
```

For more detailed backend configuration, see the official docs: https://requests-cache.readthedocs.io/en/stable/user_guide/backends.html#backends

Sometimes we also want certain requests not to be cached, for example caching only POST requests and not GET requests. That can be configured like this:

```python
import time
import requests
import requests_cache

requests_cache.install_cache('demo_cache2', allowable_methods=['POST'])

start = time.time()
session = requests.Session()
for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time for get', end - start)

start = time.time()
for i in range(10):
    session.post('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time for post', end - start)
```

We can also match URLs, for example setting how long to cache URLs of a given pattern:

```python
urls_expire_after = {'*.site1.com': 30, 'site2.com/static': -1}
requests_cache.install_cache('demo_cache2', urls_expire_after=urls_expire_after)
```

That covers the common usage: basic configuration, expiration times, backends, and filters. For more detail, see the official docs: https://requests-cache.readthedocs.io/en/stable/user_guide.html