
A Worked Example: How to Use Python's urllib2 Crawling Library

Source: 中華考試網 [September 30, 2020]

Web scraping means reading the network resource identified by a URL from the network stream and saving it locally. Python has many libraries that can fetch web pages; we start with urllib2.

The urllib2 module ships with Python 2, so it can be imported directly with no installation. In Python 3, urllib2 was replaced by urllib.request.
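For readers on Python 3, the rename can be summarized as a small lookup table. This is a minimal sketch, not part of the original demo: the dictionary name `PY2_TO_PY3` is made up for illustration, and it only lists the names this article uses.

```python
import http.client
import http.cookiejar
import urllib.error
import urllib.parse
import urllib.request

# Python 2 name used in this article -> the Python 3 object that replaces it.
# PY2_TO_PY3 is a hypothetical name chosen for this illustration.
PY2_TO_PY3 = {
    "urllib2.urlopen": urllib.request.urlopen,
    "urllib2.Request": urllib.request.Request,
    "urllib2.build_opener": urllib.request.build_opener,
    "urllib2.install_opener": urllib.request.install_opener,
    "urllib2.ProxyHandler": urllib.request.ProxyHandler,
    "urllib2.HTTPCookieProcessor": urllib.request.HTTPCookieProcessor,
    "urllib2.URLError": urllib.error.URLError,
    "urllib2.HTTPError": urllib.error.HTTPError,
    "urllib.urlencode": urllib.parse.urlencode,
    "cookielib.CookieJar": http.cookiejar.CookieJar,
    "httplib.HTTPConnection": http.client.HTTPConnection,
}

for old_name, new_obj in PY2_TO_PY3.items():
    print("%-28s -> %s.%s" % (old_name, new_obj.__module__, new_obj.__name__))
```

The exception classes moved to urllib.error and urlencode moved to urllib.parse, so a Python 3 port usually needs three imports where Python 2 needed two.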

Using urllib2, I tried logging in through a proxy, pulling cookies, following redirects, fetching images, and so on.

urllib2 documentation: http://docs.python.org/library/urllib2.html

Straight to the demo code. It covers fetching directly, using Request (POST/GET), using a proxy, handling cookies, and handling redirects.

#!/usr/bin/python
# -*- coding:utf-8 -*-
# urllib2_test.py (Python 2 only: urllib2 does not exist in Python 3)

import urllib, urllib2, cookielib, socket

url = "http://www.testurl....."  # change yourself

# The simplest way
def use_urllib2():
    try:
        f = urllib2.urlopen(url, timeout=5).read()
    except urllib2.URLError as e:
        print e.reason
        return  # nothing was fetched, so there is no length to print
    print len(f)
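The same fetch-with-error-handling pattern can be sketched for Python 3. This is an illustration under stated assumptions, not the author's code: the function name `fetch_length` is made up, and the two demo URLs (a `data:` URL and an unsupported scheme) are chosen only so the example runs without network access on Python 3.4 or later.

```python
import urllib.error
import urllib.request

def fetch_length(url, timeout=5):
    """Return the number of bytes at `url`, or None if the fetch fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read()
    except urllib.error.URLError as e:
        # HTTPError is a subclass of URLError, so HTTP failures land here too.
        print("fetch failed:", e.reason)
        return None
    return len(body)

# A data: URL carries its own payload, so no server is contacted.
print(fetch_length("data:text/plain,hello"))
# An unsupported scheme makes urlopen raise URLError.
print(fetch_length("nosuchscheme://example"))
```

Returning early from the except branch avoids using the body variable when the request never succeeded.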

# Using Request
def get_request():
    # A default timeout can be set for all sockets
    socket.setdefaulttimeout(5)
    # Parameters can be added [without data the request is a GET; passing data, as in the commented-out line below, makes it a POST]
    params = {"wd": "a", "b": "2"}
    # Request headers can be added so the server can identify the client
    i_headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1) Gecko/20090624 Firefox/3.5",
                 "Accept": "text/plain"}
    # Use POST: the params are sent to the server; if the server does not support it, an exception is raised
    #req = urllib2.Request(url, data=urllib.urlencode(params), headers=i_headers)
    req = urllib2.Request(url, headers=i_headers)
    # After creating the Request, more headers can be added; if a key repeats, the later value wins

    #req.add_header('Accept','application/json')
    # The HTTP method can be overridden
    #req.get_method = lambda: 'PUT'
    try:
        page = urllib2.urlopen(req)
        print len(page.read())
        # A GET with a query string works like this:
        #url_params = urllib.urlencode({"a":"1", "b":"2"})
        #final_url = url + "?" + url_params
        #print final_url
        #data = urllib2.urlopen(final_url).read()
        #print "Method:get ", len(data)
    except urllib2.HTTPError as e:
        print "Error Code:", e.code
    except urllib2.URLError as e:
        print "Error Reason:", e.reason
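The GET/POST distinction and the header override rule described in the comments above can be checked without sending anything. The following Python 3 sketch only builds Request objects; the helper name `build_request` and the placeholder URL are assumptions made for illustration.

```python
import urllib.parse
import urllib.request

URL = "http://example.invalid/search"  # placeholder URL; never contacted here

def build_request(url, params=None, headers=None, method=None):
    """Build (but do not send) a Request; supplying data makes it a POST by default."""
    data = None
    if params is not None:
        # urlencode returns str; Python 3 requires the request body as bytes.
        data = urllib.parse.urlencode(params).encode("ascii")
    return urllib.request.Request(url, data=data, headers=headers or {}, method=method)

get_req = build_request(URL, headers={"Accept": "text/plain"})
post_req = build_request(URL, params={"wd": "a", "b": "2"})
put_req = build_request(URL, params={"wd": "a"}, method="PUT")

# A later add_header call for the same key replaces the earlier value.
get_req.add_header("Accept", "application/json")

print(get_req.get_method(), get_req.get_header("Accept"))
print(post_req.get_method(), post_req.data)
print(put_req.get_method())
```

In Python 3 the `method` argument of Request replaces the `get_method = lambda: 'PUT'` trick shown in the Python 2 code.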

# Using a proxy
def use_proxy():
    enable_proxy = False
    proxy_handler = urllib2.ProxyHandler({"http": "http://proxyurlXXXX.com:8080"})
    # An empty dict means no proxy is used at all
    null_proxy_handler = urllib2.ProxyHandler({})
    if enable_proxy:
        opener = urllib2.build_opener(proxy_handler, urllib2.HTTPHandler)
    else:
        opener = urllib2.build_opener(null_proxy_handler, urllib2.HTTPHandler)
    # This installs the opener as urllib2's global opener
    urllib2.install_opener(opener)
    content = urllib2.urlopen(url).read()
    print "proxy len:", len(content)
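A Python 3 sketch of the same proxy switch, kept local to one opener instead of installing it globally. The function name `make_opener` and the proxy address are hypothetical, and the demo request uses a `data:` URL so that nothing is sent over the network.

```python
import urllib.request

def make_opener(proxy_url=None):
    """Return an opener that uses `proxy_url` for HTTP, or no proxy at all."""
    if proxy_url:
        proxy_handler = urllib.request.ProxyHandler({"http": proxy_url})
    else:
        # An empty dict disables proxies, including any taken from the environment.
        proxy_handler = urllib.request.ProxyHandler({})
    return urllib.request.build_opener(proxy_handler)

direct = make_opener()
proxied = make_opener("http://proxy.example.invalid:8080")  # hypothetical proxy

# opener.open() keeps the choice local; install_opener() would make it global.
print(direct.open("data:text/plain,hello").read())
```

Calling `opener.open()` directly is a reasonable choice when only some requests should go through the proxy, since `install_opener()` changes the behavior of every later `urlopen()` call in the process.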

# Cookie processor that returns the response for 400/403/500 instead of raising HTTPError
class NoExceptionCookieProcesser(urllib2.HTTPCookieProcessor):
    def http_error_403(self, req, fp, code, msg, hdrs):
        return fp

    def http_error_400(self, req, fp, code, msg, hdrs):
        return fp

    def http_error_500(self, req, fp, code, msg, hdrs):
        return fp

# Logging in and reading cookies
def hand_cookie():
    cookie = cookielib.CookieJar()
    #cookie_handler = urllib2.HTTPCookieProcessor(cookie)
    # Use the subclass so that 400/403/500 responses do not raise
    cookie_handler = NoExceptionCookieProcesser(cookie)
    opener = urllib2.build_opener(cookie_handler, urllib2.HTTPHandler)
    url_login = "https://www.yourwebsite/?login"
    params = {"username": "user", "password": "111111"}
    # Passing data makes this a POST
    opener.open(url_login, urllib.urlencode(params))
    for item in cookie:
        print item.name, item.value
    #urllib2.install_opener(opener)
    #content = urllib2.urlopen(url).read()
    #print len(content)
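The cookie handling above relies on a detail of how openers dispatch errors: when a handler defines `http_error_<code>` and returns the response object, the opener hands that response back to the caller instead of raising HTTPError. A minimal Python 3 sketch of the same idea follows; the names `TolerantCookieProcessor` and `make_cookie_opener` are made up for illustration, and no login request is sent.

```python
import http.cookiejar
import urllib.request

class TolerantCookieProcessor(urllib.request.HTTPCookieProcessor):
    """Cookie processor that returns 400/403/500 responses instead of raising."""

    def http_error_400(self, req, fp, code, msg, hdrs):
        return fp

    def http_error_403(self, req, fp, code, msg, hdrs):
        return fp

    def http_error_500(self, req, fp, code, msg, hdrs):
        return fp

def make_cookie_opener():
    """Return (opener, jar); cookies set by responses accumulate in jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(TolerantCookieProcessor(jar))
    return opener, jar

opener, jar = make_cookie_opener()
# After opener.open(login_url, data) the jar would hold the session cookies.
for cookie in jar:
    print(cookie.name, cookie.value)
print("cookies stored:", len(jar))
```

Reusing the same opener for later requests sends the stored cookies back automatically, which is what keeps a login session alive.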

# Get the final page URL after N redirects
def get_request_direct():
    import httplib
    # Print the raw HTTP exchange for debugging
    httplib.HTTPConnection.debuglevel = 1
    request = urllib2.Request("http://www.google.com")
    request.add_header("Accept", "text/html,*/*")
    request.add_header("Connection", "Keep-Alive")
    opener = urllib2.build_opener()
    f = opener.open(request)
    print f.url
    print f.headers.dict
    print len(f.read())
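Redirect following can be observed locally in Python 3 without contacting an external site. The sketch below is an illustration only: it starts a throwaway server on localhost whose `/start` path redirects to `/final`, then shows that the response's `url` attribute holds the final address. The names `DemoHandler`, `start_demo_server`, and `final_url` are all hypothetical.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: /start redirects to /final."""

    def do_GET(self):
        if self.path == "/start":
            self.send_response(302)
            self.send_header("Location", "/final")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"final page"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet

def start_demo_server():
    """Serve DemoHandler on a free localhost port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def final_url(start_url):
    """Follow redirects and return (final URL, body length)."""
    # ProxyHandler({}) bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(start_url, timeout=5) as response:
        return response.url, len(response.read())

server = start_demo_server()
port = server.server_address[1]
print(final_url("http://127.0.0.1:%d/start" % port))
server.shutdown()
server.server_close()
```

The default opener already includes a redirect handler, so no extra configuration is needed to follow the 302.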

if __name__ == "__main__":
    use_urllib2()
    get_request()
    get_request_direct()
    use_proxy()
    hand_cookie()

Editor: hym
