Using the requests Library for Python Web Scraping
Beyond urllib there is the more powerful requests library; with it, cookies, login authentication, proxy configuration, and similar chores become trivial.

Installation: pip install requests

Official documentation: https://requests.readthedocs.io/en/latest/
1. Introductory Example
The urlopen method in urllib actually requests pages with GET; the corresponding method in requests is simply get, which reads far more clearly. An example:

```python
import requests

r = requests.get("https://www.baidu.com/")
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.text)
print(r.cookies)
```
The other HTTP methods are just as direct:

```python
r = requests.post("http://httpbin.org/post")
r = requests.put("http://httpbin.org/put")
r = requests.delete("http://httpbin.org/delete")
r = requests.head("http://httpbin.org/get")
r = requests.options("http://httpbin.org/get")
```
2. GET Requests

To attach query parameters, pass a dict via the params argument:

```python
import requests

data = {"name": "germey", "age": 22}
r = requests.get("http://httpbin.org/get", params=data)
print(r.text)
```
2.1 Fetching Binary Data
Take an image as an example: text decodes the body as text (and prints gibberish for an image), while content returns the raw bytes:

```python
import requests

r = requests.get("http://qwmxpxq5y.hn-bkt.clouddn.com/hh.png")
print(r.text)     # garbled: image bytes decoded as text
print(r.content)  # raw bytes
```
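To actually keep the image, write the bytes from r.content to a file opened in binary mode. A minimal sketch (the helper name and output filename are illustrative, not from the original):

```python
import requests


def save_binary(content: bytes, path: str) -> int:
    """Write raw response bytes to a file and return the byte count."""
    with open(path, "wb") as f:  # "wb": write in binary mode
        return f.write(content)


# Usage against the image URL from above (network permitting):
try:
    r = requests.get("http://qwmxpxq5y.hn-bkt.clouddn.com/hh.png", timeout=5)
    print(save_binary(r.content, "hh.png"), "bytes saved")
except requests.RequestException as exc:
    print("download skipped:", exc)
```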
Some sites refuse requests that do not carry suitable headers:

```python
import requests

r = requests.get("https://mmzztt.com/")
print(r.text)
```
But with a headers dict carrying a User-Agent, the request goes through (the original also had an unterminated URL string, fixed below):

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/52.0.2743.116 Safari/537.36"
}
r = requests.get("https://mmzztt.com/", headers=headers)
print(r.text)
```
3. POST Requests
3.1 We have covered basic GET requests; the other common request type is POST. With requests it is just as simple:

```python
import requests

data = {"name": "germey", "age": "22"}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)
```
Test site:

• Juchao (cninfo.com.cn) data — click "资讯" and choose public information:

```python
import requests

url = "http://www.cninfo.com.cn/data20/ints/statistics"
res = requests.post(url)
print(res.text)
```
3.2 Sending a request naturally yields a response. Above we used text and content to read the body; many other attributes expose the status code, headers, cookies, and so on:

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/52.0.2743.116 Safari/537.36"
}
r = requests.get("http://www.jianshu.com", headers=headers)
print(type(r.status_code), r.status_code)
print(type(r.headers), r.headers)
print(type(r.cookies), r.cookies)
print(type(r.url), r.url)
print(type(r.history), r.history)
```
3.3 The status code is the usual way to check whether a request succeeded, and requests ships a built-in lookup object, requests.codes:

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/52.0.2743.116 Safari/537.36"
}
r = requests.get("http://www.jianshu.com", headers=headers)
if r.status_code != requests.codes.ok:
    exit()
else:
    print("Request Successfully")
```
3.4 Of course, ok is not the only condition name. The status codes and their lookup names are listed below:

```python
# Informational status codes
100: ("continue",),
101: ("switching_protocols",),
102: ("processing",),
103: ("checkpoint",),
122: ("uri_too_long", "request_uri_too_long"),
# Success status codes
200: ("ok", "okay", "all_ok", "all_okay", "all_good", "\\o/", "✓"),
201: ("created",),
202: ("accepted",),
203: ("non_authoritative_info", "non_authoritative_information"),
204: ("no_content",),
205: ("reset_content", "reset"),
206: ("partial_content", "partial"),
207: ("multi_status", "multiple_status", "multi_stati", "multiple_stati"),
208: ("already_reported",),
226: ("im_used",),
# Redirection status codes
300: ("multiple_choices",),
301: ("moved_permanently", "moved", "\\o-"),
302: ("found",),
303: ("see_other", "other"),
304: ("not_modified",),
305: ("use_proxy",),
306: ("switch_proxy",),
307: ("temporary_redirect", "temporary_moved", "temporary"),
308: ("permanent_redirect", "resume_incomplete", "resume",),  # These 2 to be removed in 3.0
# Client error status codes
400: ("bad_request", "bad"),
401: ("unauthorized",),
402: ("payment_required", "payment"),
403: ("forbidden",),
404: ("not_found", "-o-"),
405: ("method_not_allowed", "not_allowed"),
406: ("not_acceptable",),
407: ("proxy_authentication_required", "proxy_auth", "proxy_authentication"),
408: ("request_timeout", "timeout"),
409: ("conflict",),
410: ("gone",),
411: ("length_required",),
412: ("precondition_failed", "precondition"),
413: ("request_entity_too_large",),
414: ("request_uri_too_large",),
415: ("unsupported_media_type", "unsupported_media", "media_type"),
416: ("requested_range_not_satisfiable", "requested_range", "range_not_satisfiable"),
417: ("expectation_failed",),
418: ("im_a_teapot", "teapot", "i_am_a_teapot"),
421: ("misdirected_request",),
422: ("unprocessable_entity", "unprocessable"),
423: ("locked",),
424: ("failed_dependency", "dependency"),
425: ("unordered_collection", "unordered"),
426: ("upgrade_required", "upgrade"),
428: ("precondition_required", "precondition"),
429: ("too_many_requests", "too_many"),
431: ("header_fields_too_large", "fields_too_large"),
444: ("no_response", "none"),
449: ("retry_with", "retry"),
450: ("blocked_by_windows_parental_controls", "parental_controls"),
451: ("unavailable_for_legal_reasons", "legal_reasons"),
499: ("client_closed_request",),
# Server error status codes
500: ("internal_server_error", "server_error", "/o\\", "✗"),
501: ("not_implemented",),
502: ("bad_gateway",),
503: ("service_unavailable", "unavailable"),
504: ("gateway_timeout",),
505: ("http_version_not_supported", "http_version"),
506: ("variant_also_negotiates",),
507: ("insufficient_storage",),
509: ("bandwidth_limit_exceeded", "bandwidth"),
510: ("not_extended",),
511: ("network_authentication_required", "network_auth", "network_authentication"),
```
4. Advanced Usage
4.1 Adding a proxy:

```python
import requests

proxy = {
    "http": "http://183.162.171.78:4216",
}
# httpbin.org/ip echoes the requesting IP
res = requests.get("http://httpbin.org/ip", proxies=proxy)
print(res.text)
```
4.2 Using Kuaidaili Proxy IPs
Documentation: https://www.kuaidaili.com/doc/dev/quickstart/
On that page, keep the default HTTP protocol and choose JSON as the return format. My order is a VIP order, so I select "stable" for stability; then click "generate link" and copy the API link it produces.
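Putting the pieces together, here is a sketch of turning that generated API link into a proxies dict for requests. The api_url below is a placeholder for your own generated link, and the JSON layout assumed in parse_proxy ({"data": {"proxy_list": [...]}}) is an assumption — match it to what your order actually returns:

```python
import requests


def parse_proxy(payload: dict) -> dict:
    """Build a requests-style proxies dict from the API's JSON payload.
    The {"data": {"proxy_list": [...]}} layout is an assumption; check
    the format your own Kuaidaili order actually returns."""
    ip = payload["data"]["proxy_list"][0]
    return {"http": "http://" + ip, "https": "http://" + ip}


# Hypothetical usage: paste the API link you generated in place of api_url.
try:
    api_url = "https://dps.kdlapi.com/api/getdps/"  # placeholder
    proxies = parse_proxy(requests.get(api_url, timeout=5).json())
    print(requests.get("http://httpbin.org/ip", proxies=proxies, timeout=5).text)
except (requests.RequestException, KeyError, ValueError) as exc:
    print("proxy fetch skipped:", exc)
```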
4.3 Suppressing warnings:

```python
from requests.packages import urllib3

urllib3.disable_warnings()
```
Crawler Workflow
5. A First Crawler

(The XPath rule in the original used unescaped double quotes inside a double-quoted string; fixed below.)

```python
import requests
from lxml import etree


def main():
    # 1. Define the page URLs and the parse rule
    crawl_urls = [
        "https://36kr.com/p/1328468833360133",
        "https://36kr.com/p/1328528129988866",
        "https://36kr.com/p/1328512085344642"
    ]
    parse_rule = "//h1[contains(@class,'article-title margin-bottom-20 common-width')]/text()"
    for url in crawl_urls:
        # 2. Send the HTTP request
        response = requests.get(url)
        # 3. Parse the HTML
        result = etree.HTML(response.text).xpath(parse_rule)[0]
        # 4. Save (here: print) the result
        print(result)


if __name__ == "__main__":
    main()
```
6. Full-Site Crawling
6.1 Encapsulating Shared Utilities
Create a utils folder and put a base class there for other programs to use:

```python
import random
import time

import requests
from retrying import retry
from requests.packages.urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)


class FakeChromeUA:
    first_num = random.randint(55, 62)
    third_num = random.randint(0, 3200)
    fourth_num = random.randint(0, 140)
    os_type = [
        "(Windows NT 6.1; WOW64)", "(Windows NT 10.0; WOW64)",
        "(X11; Linux x86_64)", "(Macintosh; Intel Mac OS X 10_12_6)"
    ]
    chrome_version = "Chrome/{}.0.{}.{}".format(first_num, third_num, fourth_num)

    @classmethod
    def get_ua(cls):
        return " ".join(["Mozilla/5.0", random.choice(cls.os_type),
                         "AppleWebKit/537.36", "(KHTML, like Gecko)",
                         cls.chrome_version, "Safari/537.36"])


class Spiders(FakeChromeUA):
    urls = []

    @retry(stop_max_attempt_number=3, wait_fixed=2000)
    def fetch(self, url, param=None, headers=None):
        try:
            if not headers:
                headers = {}
            headers["user-agent"] = self.get_ua()
            self.wait_some_time()
            response = requests.get(url, params=param, headers=headers)
            if response.status_code == 200:
                response.encoding = "utf-8"
                return response
        except requests.ConnectionError:
            return

    def wait_some_time(self):
        # Random 100-300 ms pause between requests
        time.sleep(random.randint(100, 300) / 1000)
```
6.2 Case Study

(As in the earlier crawler, the XPath rules originally nested unescaped double quotes; fixed below.)

```python
from queue import Queue
from urllib.parse import urljoin

from lxml import etree
from pymongo import MongoClient

from xl.base import Spiders

flt = lambda x: x[0] if x else None


class Crawl(Spiders):
    base_url = "https://36kr.com/"
    # Seed URL
    start_url = "https://36kr.com/information/technology"
    # Parse rules
    rules = {
        # Article list
        "list_urls": "//p[@class='article-item-pic-wrapper']/a/@href",
        # Detail-page body
        "detail_urls": "//p[@class='common-width margin-bottom-20']//text()",
        # Title
        "title": "//h1[@class='article-title margin-bottom-20 common-width']/text()",
    }
    # Queue of list-page URLs
    list_queue = Queue()

    def crawl(self, url):
        """Front page"""
        response = self.fetch(url)
        list_urls = etree.HTML(response.text).xpath(self.rules["list_urls"])
        for list_url in list_urls:
            # Collect the URL list
            self.list_queue.put(urljoin(self.base_url, list_url))

    def list_loop(self):
        """Consume the list pages"""
        while True:
            list_url = self.list_queue.get()
            print(self.list_queue.qsize())
            self.crawl_detail(list_url)
            # Exit once the queue is empty
            if self.list_queue.empty():
                break

    def crawl_detail(self, url):
        """Detail page"""
        response = self.fetch(url)
        html = etree.HTML(response.text)
        content = html.xpath(self.rules["detail_urls"])
        title = flt(html.xpath(self.rules["title"]))
        print(title)
        data = {
            "content": content,
            "title": title,
        }
        self.save_mongo(data)

    def save_mongo(self, data):
        client = MongoClient()  # connect to the local MongoDB
        col = client["python"]["hh"]
        if isinstance(data, dict):
            return col.insert_one(data)
        return 'A single record must look like {"name": "age"}; you passed %s' % type(data)

    def main(self):
        # 1. Crawl the tab page, then drain the queue
        self.crawl(self.start_url)
        self.list_loop()


if __name__ == "__main__":
    s = Crawl()
    s.main()
```
File Operation Modes
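A minimal sketch of the common open() mode flags used when saving scraped results to disk (the filename and sample line are illustrative):

```python
# Common open() mode flags:
#   "r"  read text (default)     "rb" read bytes
#   "w"  write text, truncate    "wb" write bytes, truncate
#   "a"  append text             "ab" append bytes
# Appending keeps earlier results when the crawler is rerun:
with open("titles.txt", "a", encoding="utf-8") as f:
    f.write("Example title\n")
with open("titles.txt", "r", encoding="utf-8") as f:
    print(f.read().splitlines()[-1])  # → Example title
```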
7. requests-cache

Installation: pip install requests-cache
When writing crawlers, we often run into situations like these:

• The site is complex and we end up making many duplicate requests.

• The crawler dies unexpectedly without saving its state, so the next run has to start crawling from scratch.
A baseline test without caching:

```python
import requests
import time

start = time.time()
session = requests.Session()
for i in range(10):
    session.get("http://httpbin.org/delay/1")
    print(f"Finished {i + 1} requests")
end = time.time()
print("Cost time", end - start)
```
The same test with requests-cache:

```python
import requests_cache
import time

start = time.time()
session = requests_cache.CachedSession("demo_cache")
for i in range(10):
    session.get("http://httpbin.org/delay/1")
    print(f"Finished {i + 1} requests")
end = time.time()
print("Cost time", end - start)
```
But that replaced the requests Session object outright. Is there a way to leave the existing code untouched and just add a few initialization lines up front to enable requests-cache? There is:

```python
import time

import requests
import requests_cache

requests_cache.install_cache("demo_cache")

start = time.time()
session = requests.Session()
for i in range(10):
    session.get("http://httpbin.org/delay/1")
    print(f"Finished {i + 1} requests")
end = time.time()
print("Cost time", end - start)
```
This time we only called requests-cache's install_cache method; the requests Session is used exactly as before.
We saw that requests-cache defaults to SQLite as the cache backend. Can that be swapped out, say for files or another database? It can.
To switch the backend to local files:

```python
requests_cache.install_cache("demo_cache", backend="filesystem")
```
If you would rather not create files in the working directory, use the system cache directory instead:

```python
requests_cache.install_cache("demo_cache", backend="filesystem", use_cache_dir=True)
```
Beyond the filesystem, requests-cache also supports other backends such as Redis, MongoDB, GridFS, and even in-memory storage, each requiring its own dependency library; see the table below:
| Backend    | Class         | Alias        | Dependencies |
|------------|---------------|--------------|--------------|
| SQLite     | SQLiteCache   | "sqlite"     |              |
| Redis      | RedisCache    | "redis"      | redis-py     |
| MongoDB    | MongoCache    | "mongodb"    | pymongo      |
| GridFS     | GridFSCache   | "gridfs"     | pymongo      |
| DynamoDB   | DynamoDbCache | "dynamodb"   | boto3        |
| Filesystem | FileCache     | "filesystem" |              |
| Memory     | BaseCache     | "memory"     |              |
With Redis, for example, the code becomes:

```python
backend = requests_cache.RedisCache(host="localhost", port=6379)
requests_cache.install_cache("demo_cache", backend=backend)
```
More configuration details are in the official documentation: https://requests-cache.readthedocs.io/en/stable/user_guide/backends.html#backends
Sometimes we also want to exclude certain requests from caching, for example caching only POST requests and not GET requests:

```python
import time

import requests
import requests_cache

requests_cache.install_cache("demo_cache2", allowable_methods=["POST"])

start = time.time()
session = requests.Session()
for i in range(10):
    session.get("http://httpbin.org/delay/1")
    print(f"Finished {i + 1} requests")
end = time.time()
print("Cost time for get", end - start)

start = time.time()
for i in range(10):
    session.post("http://httpbin.org/delay/1")
    print(f"Finished {i + 1} requests")
end = time.time()
print("Cost time for post", end - start)
```
We can also match URLs, setting a cache lifetime per URL pattern:

```python
urls_expire_after = {"*.site_1.com": 30, "site_2.com/static": -1}
requests_cache.install_cache("demo_cache2", urls_expire_after=urls_expire_after)
```
That covers the common usage: basic configuration, expiry settings, backends, and filters. For more detail, see the official documentation: https://requests-cache.readthedocs.io/en/stable/user_guide.html.