A Hands-On Guide to Writing a .NET Web Crawler, in Both C# and VB.NET
Web crawling is one of the hottest topics of the moment.

In crawler technology, Python holds most of the territory. So can the .NET family build crawlers too? The answer is yes.

C# is still reasonably popular, while VB.NET has had its domestic lunch eaten by C#. Judging by July's programming-language rankings, though, VB.NET still does well worldwide: only 0.3% behind C#, and comfortably ahead of JavaScript and PHP.

So let's pick up the two .NET veterans, C# and VB.NET, and walk step by step through writing your first crawler. This article is aimed at readers with some grounding who already know the basic syntax; treat it as a starting point.
Web Crawlers
What is a crawler? Simply put, it is a technique for fetching a page's HTML source, commonly used for data-interface interaction, information gathering, and the like. Fetching the source is only the first step; you then have to extract the useful data from it: the page title, the body content, images, and so on.

To sum up, crawling boils down to just two steps: fetch the page source > parse out the useful data.
Below are three different crawling approaches, each suited to a different scenario.

Fetching Page Source the Normal Way
After opening a page, most browsers offer a right-click "View source" option. Inside that long stretch of HTML you can find the page title, the body data, the image content, and everything else the page displays.

When we need to bulk-collect content from many URLs on a site, right-clicking each page to view its source and copying out what we need by hand is asking far too much. This is where a crawler fetches the data in bulk for us.

Fetching a URL's source, then, is step one.

Under .NET there are several ways to get a page's source, all of which work by issuing HTTP requests.
In the usual case you can use what .NET and the system already provide: the XMLHTTP object, the WebClient class, or the HttpWebRequest class. And if you are up to it, you can roll your own over raw sockets.
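For quick jobs, WebClient is essentially a one-liner. A minimal C# sketch (the method name GetHtmlViaWebClient is ours, UTF-8 pages are assumed, and there is no error handling):

using System.Net;
using System.Text;

// Minimal WebClient fetch: one call, but less control over headers, cookies, etc.
public static string GetHtmlViaWebClient(string url)
{
    using (var wc = new WebClient())
    {
        wc.Encoding = Encoding.UTF8; // decode the response as UTF-8
        return wc.DownloadString(url);
    }
}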
Here I recommend the HttpWebRequest approach.
Here is a C# function that fetches page source via HttpWebRequest:

public string GetHtmlStr(string url)
{
    // Requires: using System; using System.IO; using System.Net; using System.Text;
    try
    {
        Uri uri = new Uri(url);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        // UserAgent takes only the value, without a "User-Agent:" prefix
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705)";
        request.Accept = "*/*";
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        Stream s = response.GetResponseStream();
        // Decode the body as UTF-8; change this if the target site uses another charset
        StreamReader sr = new StreamReader(s, Encoding.GetEncoding("utf-8"));
        string html = sr.ReadToEnd();
        s.Close();
        response.Close();
        return html;
    }
    catch (Exception)
    {
        return "/error/";
    }
}
And, as promised, the VB.NET code:

Public Function GetHtmlStr(ByVal url As String) As String
    ' Requires: Imports System.IO, System.Net, System.Text
    Try
        Dim uri As Uri = New Uri(url)
        Dim request As HttpWebRequest = CType(WebRequest.Create(uri), HttpWebRequest)
        ' UserAgent takes only the value, without a "User-Agent:" prefix
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705)"
        request.Accept = "*/*"
        Dim response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        Dim s As Stream = response.GetResponseStream()
        ' Decode the body as UTF-8; change this if the target site uses another charset
        Dim sr As StreamReader = New StreamReader(s, Encoding.GetEncoding("utf-8"))
        Dim html As String = sr.ReadToEnd()
        s.Close()
        response.Close()
        Return html
    Catch ex As Exception
        Return "/error/"
    End Try
End Function
This page-fetching function is already fully wrapped, so calling it is as simple as: GetHtmlStr("some URL")
On failure it returns the string /error/. Pay particular attention to the utf-8 encoding in the code: a wrong encoding gives you garbled text. Also mind the namespace imports.
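A quick C# usage sketch (the URL is just a placeholder):

string html = GetHtmlStr("https://example.com/");
if (html == "/error/")
    Console.WriteLine("fetch failed");  // the function swallowed an exception
else
    Console.WriteLine(html.Length);     // got the page source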
Many readers will notice that the code above does not support https URLs. Don't worry, a small tweak adds support. In VB.NET:

Imports System.Net.Security
Imports System.Security.Authentication
Imports System.Security.Cryptography.X509Certificates

Public Function CheckValidationResult(ByVal sender As Object, ByVal certificate As X509Certificate, ByVal chain As X509Chain, ByVal errors As SslPolicyErrors) As Boolean
    ' Accept every server certificate, valid or not
    Return True
End Function
Then, in the GetHtmlStr function above, add the following line next to the one that creates the HttpWebRequest:

ServicePointManager.ServerCertificateValidationCallback = New System.Net.Security.RemoteCertificateValidationCallback(AddressOf CheckValidationResult)
After this change, every https URL fetches normally; even sites whose SSL certificates are broken (expired, untrusted) are retrieved without trouble.
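The tweak above is shown in VB.NET; for completeness, a minimal C# equivalent (same caveat: it blindly trusts every certificate) might look like this:

using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

public static bool CheckValidationResult(object sender, X509Certificate certificate, X509Chain chain, SslPolicyErrors errors)
{
    return true; // accept every server certificate, valid or not
}

// Add this before the HttpWebRequest is created:
ServicePointManager.ServerCertificateValidationCallback = new RemoteCertificateValidationCallback(CheckValidationResult);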
To sum up briefly: the approach described above suits ordinary http or https URLs that can be opened directly.

GET and POST Crawling with Cookies
So far we have only fetched a page's source, which is fairly basic. Next comes fetching source via GET and POST with a cookie attached. Carrying a cookie with POST and GET lets you crawl one level deeper: pages that only display after signing in to an account can then be fetched normally.
VB.NET code:

Public Function GetHtmlStr(ByVal url As String, cookies As String) As String
    Try
        ' Load the caller-supplied cookie string into a container
        Dim ck As New CookieContainer
        ck.SetCookies(New Uri(url), cookies)
        Dim uri As Uri = New Uri(url)
        Dim request As HttpWebRequest = CType(WebRequest.Create(uri), HttpWebRequest)
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705)"
        request.Accept = "*/*"
        ' Attach the cookies to the request
        request.CookieContainer = ck
        Dim response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        Dim s As Stream = response.GetResponseStream()
        Dim sr As StreamReader = New StreamReader(s, Encoding.GetEncoding("utf-8"))
        Dim html As String = sr.ReadToEnd()
        s.Close()
        response.Close()
        Return html
    Catch ex As Exception
        Return "/error/"
    End Try
End Function
C# code:

public string GetHtmlStr(string url, string cookies)
{
    try
    {
        // Load the caller-supplied cookie string into a container
        CookieContainer ck = new CookieContainer();
        ck.SetCookies(new Uri(url), cookies);
        Uri uri = new Uri(url);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705)";
        request.Accept = "*/*";
        // Attach the cookies to the request
        request.CookieContainer = ck;
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        Stream s = response.GetResponseStream();
        StreamReader sr = new StreamReader(s, Encoding.GetEncoding("utf-8"));
        string html = sr.ReadToEnd();
        s.Close();
        response.Close();
        return html;
    }
    catch (Exception)
    {
        return "/error/";
    }
}
As the code makes clear, this is just the plain version with a cookie object added. The cookies parameter of GetHtmlStr uses the format name=value, with multiple cookies separated by commas, e.g. name=123,pass=123.

With this, you can crawl pages that require a login to access.
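A quick usage sketch (the URL and cookie values are placeholders; in practice you would copy the cookie string from a logged-in browser session):

// Fetch a members-only page by replaying a logged-in session's cookies
string html = GetHtmlStr("https://example.com/member", "name=123,pass=123");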
Oh, and there is also the promised VB.NET version of the POST method:

Private Function HttpPost(Url As String, postDataStr As String) As String
    Dim request As HttpWebRequest = DirectCast(WebRequest.Create(Url), HttpWebRequest)
    request.Method = "POST"
    request.ContentType = "application/x-www-form-urlencoded"
    request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
    ' ContentLength must match the encoding actually used to write the body (gb2312 here)
    request.ContentLength = Encoding.GetEncoding("gb2312").GetByteCount(postDataStr)
    Dim myRequestStream As Stream = request.GetRequestStream()
    Dim myStreamWriter As New StreamWriter(myRequestStream, Encoding.GetEncoding("gb2312"))
    myStreamWriter.Write(postDataStr)
    myStreamWriter.Close()
    Dim response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
    Dim myResponseStream As Stream = response.GetResponseStream()
    Dim myStreamReader As New StreamReader(myResponseStream, Encoding.GetEncoding("gb2312"))
    Dim retString As String = myStreamReader.ReadToEnd()
    myStreamReader.Close()
    myResponseStream.Close()
    Return retString
End Function
The C# version:

private string HttpPost(string Url, string postDataStr)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
    request.Method = "POST";
    request.ContentType = "application/x-www-form-urlencoded";
    request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36";
    // ContentLength must match the encoding actually used to write the body (gb2312 here)
    request.ContentLength = Encoding.GetEncoding("gb2312").GetByteCount(postDataStr);
    Stream myRequestStream = request.GetRequestStream();
    StreamWriter myStreamWriter = new StreamWriter(myRequestStream, Encoding.GetEncoding("gb2312"));
    myStreamWriter.Write(postDataStr);
    myStreamWriter.Close();
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    Stream myResponseStream = response.GetResponseStream();
    StreamReader myStreamReader = new StreamReader(myResponseStream, Encoding.GetEncoding("gb2312"));
    string retString = myStreamReader.ReadToEnd();
    myStreamReader.Close();
    myResponseStream.Close();
    return retString;
}
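A usage sketch (the URL and field names are placeholders; the body is ordinary form-urlencoded key=value pairs joined with &):

// Simulate submitting a login form
string result = HttpPost("https://example.com/login.php", "user=123&pass=123");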
Here is a small exercise: adding a cookie parameter to this HttpPost function, following the cookie-enabled GET version above.

Browser-Based Crawling
Some pages have their visible content generated dynamically by JavaScript, so neither of the two fetching methods above can retrieve it. Does that mean we are out of options? Not quite. We can use the IE-based WebBrowser control that ships with .NET to let the page's JS run first, and then grab the result.
Add a WebBrowser control to the form. The code below navigates a WebBrowser control to a URL and waits for it to finish loading (i.e., waits for the JS to finish rendering the content).

VB.NET:

Public Sub WebBrowserOpenURL(W As WebBrowser, S As String)
    Try
        W.Navigate(S)
        ' Pump messages until the page (and its JS) has finished loading
        While W.ReadyState <> WebBrowserReadyState.Complete
            Application.DoEvents()
        End While
    Catch ex As Exception
    End Try
End Sub

C#:

public void WebBrowserOpenURL(WebBrowser W, string S)
{
    try
    {
        W.Navigate(S);
        // Pump messages until the page (and its JS) has finished loading
        while (W.ReadyState != WebBrowserReadyState.Complete)
            Application.DoEvents();
    }
    catch (Exception)
    {
    }
}
Once this helper returns, read the control's DocumentText property (e.g. webBrowser1.DocumentText) to get everything the control is currently displaying.
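Putting the pieces together in C# (webBrowser1 is assumed to be a WebBrowser control already on the form, and the URL is a placeholder):

WebBrowserOpenURL(webBrowser1, "https://example.com/js-rendered-page");
string html = webBrowser1.DocumentText; // the source after the JS has run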
Of course, if circumstances allow, a Chromium-based control gives better compatibility and performance.

Other Notes
The three source-fetching approaches above suit different scenarios, but in each case mind the following three points:
1) The browser UA, i.e. the user-agent string. Some pages only render properly for mobile clients (see the sketch after this list).

2) The page encoding. A wrong encoding setting yields garbled output.

3) Abide by the site's terms of use and the relevant laws and regulations; don't crawl indiscriminately.
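For point 1, the fix is a one-line change to the UserAgent assignment in any of the functions above; the string below is a typical, purely illustrative iPhone Safari UA:

// Present a mobile UA so sites that gate content on device type serve the mobile page
request.UserAgent = "Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1";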
The code above works in WinForms as well as in web back-end programming (the WebBrowser approach, being a UI control, is WinForms-only).
Space being limited, that's the quick introduction for this installment. Next time: how to parse the crawled HTML and extract the data you want.