Web Crawlers (1)s | FightingTree

Python crawlers and regular expressions

Security

Published: 2020-04-20

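Web crawlers

A crawler is a program that imitates the way a person requests web pages: it requests pages automatically, downloads their data, and then applies a set of rules to extract the valuable parts. (Adapted from Baidu Baike.)

General-purpose crawlers and focused crawlers

General-purpose crawler: a key component of a search engine's crawling system. Its main job is to download web pages to local storage, building up a backup copy of Internet content.
Focused crawler: a crawler written for a specific need. Unlike a general-purpose crawler, it filters and processes content while it crawls, so that as far as possible only pages relevant to the need are fetched.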

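Regular expressions

A regular expression is a formula for operating on strings: predefined special characters, and combinations of them, form a rule string that expresses a filtering logic to apply to other strings.

The re module

re is the Python standard-library module that implements regular expressions.

Regular expression functions

search scans the string and, if the pattern matches, returns a match object holding the index span and the matched text of the first match
findall returns a list of all non-overlapping matches
match matches only at the beginning of the string
Global matching: re.compile(pattern).findall(string)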

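import re
# .*? is lazy; with the greedy p.*y both calls would match the whole string poyphuiy
re.search(r"p.*?y", "poyphuiy")   # first substring that satisfies the pattern: poy
re.compile(r"p.*?y").findall("poyphuiy")   # all substrings that satisfy the pattern: poy, phuiy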
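To make the difference between these functions concrete, the sketch below runs one pattern through `re.match`, `re.search`, and `findall`. The sample string and variable names are made up for illustration.

```python
import re

text = "say poy then phuiy"   # illustrative sample string
pattern = r"p.*?y"            # lazy: stop at the first "y" after each "p"

# match() only tries at position 0, so it fails here (the text starts with "s").
at_start = re.match(pattern, text)

# search() scans forward and returns a match object for the first hit.
first = re.search(pattern, text)

# findall() returns every non-overlapping match as a list of plain strings.
every = re.compile(pattern).findall(text)

print(at_start)                      # None
print(first.group(), first.span())   # matched text and its (start, end) indexes
print(every)
```

Note that `search` and `match` return match objects (or `None`), while `findall` returns strings directly, so only the first two need `.group()`.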

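Atoms

An atom is the most basic building block of a regular expression; every regular expression contains at least one. Common kinds of atoms are:

Ordinary characters
Non-printing characters
Universal (shorthand) character classes
Atom tables (character sets)

Ordinary characters as atoms

re.search("and", "you and me")   # matches and

Non-printing characters as atoms

\n newline, \t tab

re.search("\n", "hello\nworld")   # matches the newline character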

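Universal characters as atoms

\w a letter, digit, or underscore
\W any character other than a letter, digit, or underscore
\d a decimal digit
\D any character other than a decimal digit
\s a whitespace character
\S any non-whitespace character

re.search(r"\d\d", "hello17world")    # matches 17
re.search(r"\d\d\w", "hello17world")   # matches 17w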
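The uppercase classes are the complements of the lowercase ones. A small sketch, using a made-up input string, shows how each pair splits the same text:

```python
import re

sample = "id_7 = 42;"   # illustrative input

digits = re.findall(r"\d", sample)        # single decimal digits
non_digits = re.findall(r"\D+", sample)   # runs of anything that is not a digit
words = re.findall(r"\w+", sample)        # runs of letters, digits, underscores
non_words = re.findall(r"\W+", sample)    # runs of everything else

print(digits)
print(non_digits)
print(words)
print(non_words)
```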

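Atom tables as atoms

An atom table (character set) matches exactly one character chosen from those listed inside the brackets.

re.search("hel[jkl]o", "hello17world")  # matches hello
re.search("hel[lo]", "hello17world")    # matches hell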

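Metacharacters

. any single character except a newline
^ start of the string
$ end of the string
* zero or more repetitions
? zero or one repetition
+ one or more repetitions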

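Matching a single character

. matches any single character

import re
re.findall(".ood", "I say good and food")   # ['good', 'food']

[] matches any one of the characters listed inside the brackets

re.findall("[gf]ood", "I say good and food")   # ['good', 'food']

\d matches a single digit

re.findall(r"\d", "I am 40")   # ['4', '0']
re.findall(r"\d\d", "I am 40")  # ['40']

\w matches a single character from 0-9, a-z, A-Z, or _ (for ASCII text)

re.findall(r"\w", "a b!1_")  # ['a', 'b', '1', '_']

\s matches a single whitespace character, including tab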
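Since `\s` has no example of its own, here is a minimal sketch. The input line is invented, and the `re.split` call is just one common use of the class, not something the text above prescribes.

```python
import re

line = "name\tvalue  end\n"   # contains a tab, two spaces, and a newline

# Each whitespace character is matched separately.
whitespace = re.findall(r"\s", line)

# \s+ treats a run of whitespace as one separator when splitting.
fields = re.split(r"\s+", line.strip())

print(whitespace)
print(fields)
```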

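Matching a group of characters

Literal match

re.findall("good", "I say good and food")  # ['good']

Use the | separator to match either of two strings

re.findall("good|food", "I say good and food")  # ['good', 'food']

* matches zero or more occurrences of the character to its left

re.findall("go*gle", "I like google not ggle goooogle and gogle")  # * applies to o (zero or more): ['google', 'ggle', 'goooogle', 'gogle']

+ matches one or more occurrences of the character to its left

re.findall("go+gle", "I like google not ggle goooogle and gogle")  # + applies to o (one or more): ['google', 'goooogle', 'gogle']

? matches zero or one occurrence of the character to its left

re.findall("go?gle", "I like google not ggle goooogle and gogle")  # ? applies to o (zero or one): ['ggle', 'gogle']

{} sets how many times the character to its left must occur

re.findall("go{2,3}gle", "I like google not ggle goooogle and gogle")  # {2,3} applies to o (two to three times): ['google']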
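The braces accept a few forms besides a range. The sketch below, on a made-up word list, compares an exact count, an open-ended minimum, and a bounded range. Note that the separator inside the braces must be an ASCII comma with no spaces.

```python
import re

words = "gogle google gooogle goooogle"   # 1, 2, 3, and 4 o's

exactly_two = re.findall(r"go{2}gle", words)     # exactly two o's
two_or_more = re.findall(r"go{2,}gle", words)    # at least two o's
two_to_three = re.findall(r"go{2,3}gle", words)  # between two and three o's

print(exactly_two)
print(two_or_more)
print(two_to_three)
```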

^ matches only at the start of the string

re.findall("^I like", "I like blue")  # ['I like']

$ matches only at the end of the string

re.findall("blue$", "I like blue")  # ['blue']

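Pattern modifiers (flags)

I ignore case when matching
M multi-line matching (^ and $ also match at each line boundary)
L locale-aware matching (bytes patterns only)
U Unicode matching (the default for str patterns in Python 3)
S make . match newline characters as well

re.search("pyt", "Python", re.I)  # case-insensitive: matches Pyt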
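Only `re.I` is demonstrated above, so the following sketch illustrates `re.M` and `re.S` on a made-up two-line string. Flags can be combined with `|`.

```python
import re

text = "first line\nSecond line"   # illustrative two-line input

# Without re.M, ^ matches only at the very start of the string.
starts_default = re.findall(r"^\w+", text)
# With re.M, ^ also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, re.M)

# Without re.S, "." cannot cross the newline, so the match stays on line one.
dot_default = re.search(r"first.*line", text).group()
# With re.S, "." matches the newline too and the greedy .* reaches the last "line".
dot_all = re.search(r"first.*line", text, re.S).group()

# Combined flags: case-insensitive and multi-line at once.
combined = re.findall(r"^second", text, re.I | re.M)

print(starts_default, starts_multiline)
print(repr(dot_default), repr(dot_all))
print(combined)
```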

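Greedy and lazy modes

Greedy mode matches as much as possible.
Lazy mode matches as little as possible.

Greedy mode

re.search("p.*y", "pythony", re.I)  # runs on to the last y: matches pythony

Lazy mode

re.search("p.*?y", "pythony", re.I)   # stops at the first y: matches py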
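The distinction matters most when extracting repeated fragments from HTML, which is what the crawler examples later in this post do. A minimal sketch on an invented snippet:

```python
import re

html = "<b>one</b> and <b>two</b>"   # made-up markup for illustration

# Greedy: .* runs to the last </b>, swallowing everything in between.
greedy = re.findall(r"<b>(.*)</b>", html)

# Lazy: .*? stops at the nearest </b>, giving one result per tag pair.
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)
print(lazy)
```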

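Regular expression matching example

Matching .com / .cn website URLs

http://www.baidu.com
[a-zA-Z]+://[^\s]*(?:\.com|\.cn)

The two suffixes go in a group with escaped dots; square brackets would instead define a set of single characters, not a choice between .com and .cn.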
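A pattern of this shape can be checked against sample text. The sketch below writes the suffix choice as a non-capturing group, `(?:\.com|\.cn)`, so that `findall` returns whole URLs. The sample sentence and its addresses other than the Baidu one are invented.

```python
import re

# Scheme, "://", any run of non-whitespace, ending in .com or .cn
url_pattern = re.compile(r"[a-zA-Z]+://[^\s]*(?:\.com|\.cn)")

text = (
    "Visit http://www.baidu.com or https://www.example.cn today, "
    "not ftp://files.example.org"
)

urls = url_pattern.findall(text)
print(urls)
```

The `.org` address is skipped because no prefix of it ends in `.com` or `.cn`.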

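The urllib library

urllib is Python's most basic network request library. It can imitate a browser by sending a request to a given server, and it can save the data the server returns.

Functions

urlopen

Setting up a proxy server

To use urllib against sites that block crawlers, enable a proxy first; otherwise the request cannot retrieve the page.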
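The post does not say how the proxy was configured. One way to do it from inside Python, offered here only as a sketch, is `urllib.request.ProxyHandler`. The proxy address below is a hypothetical placeholder, not a value from the original.

```python
import urllib.request

# Hypothetical proxy address; replace it with a proxy you actually control.
PROXY = "http://127.0.0.1:8080"

# Route both http and https requests through the proxy.
proxy_handler = urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
opener = urllib.request.build_opener(proxy_handler)

# Either call opener.open(url) directly, or install the opener so that
# every later urllib.request.urlopen() call goes through the proxy:
# urllib.request.install_opener(opener)
```

If no handler is given, urllib falls back to the system proxy settings (for example the `http_proxy` environment variable), so an explicit handler is only needed to override them.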

Usage

In Python 3, every method related to network requests in urllib lives in the urllib.request module. Basic use of urlopen looks like this:

from urllib import request
resp = request.urlopen('https://graph.baidu.com')   # returns the response to this request
print(resp.read())

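urlretrieve

urlretrieve(url, local file path) downloads a page straight to a local file (static content only)

import urllib.request
urllib.request.urlretrieve("http://www.baidu.com","C:\\Users\\zby\\Desktop\\baidu.html")

urlcleanup

import urllib.request
urllib.request.urlcleanup()   # remove temporary files left behind by earlier urlretrieve calls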

info

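Returns the response's meta-information, that is, its headers, in a fixed format suitable for printing. It is a method of the response object, not of the content (a string) read from the response.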

import urllib.request
file = urllib.request.urlopen("http://www.baidu.com")
print(file.info())

getcode

Returns the HTTP status code of the response

import urllib.request
file = urllib.request.urlopen("http://www.baidu.com")
print(file.getcode())

geturl

Returns the URL of the page that was actually retrieved

import urllib.request
file = urllib.request.urlopen("http://www.baidu.com")
print(file.geturl())
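The response methods above can be tried without network access. The sketch below opens a `data:` URL, which urllib handles locally; that URL is an assumption made only so the example runs offline and is not part of the original post.

```python
import urllib.request

# A data: URL stands in for a real page so the sketch runs offline.
url = "data:text/plain;charset=utf-8,hello%20crawler"

with urllib.request.urlopen(url) as resp:
    headers = resp.info()        # header container, not the body
    final_url = resp.geturl()    # URL actually retrieved
    body = resp.read().decode("utf-8")

print(headers.get_content_type())
print(final_url)
print(body)
```

For a real HTTP response, `resp.getcode()` additionally returns the status code, such as 200.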

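Timeout settings

Fetching a page always takes time, depending on network speed and on the remote server. If a page goes too long without responding, the request is treated as timed out and the page cannot be opened. The timeout parameter of urlopen sets that limit in seconds.

import urllib.request
for i in range(0, 10):
    try:
        # keep the request inside try so that a timeout is caught
        file = urllib.request.urlopen("http://www.baidu.com", timeout=1)
        print(file.read().decode("utf-8"))
    except Exception as err:
        print("An exception occurred:", err)
        break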
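A slightly more structured variant is sketched below: it catches the specific errors a timeout can raise and retries a fixed number of times. The function name `fetch` and its `open_url` parameter are hypothetical, and `open_url` exists only so the function can be exercised without a network.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=1.0, attempts=3, open_url=urllib.request.urlopen):
    """Return the decoded body, or None if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            # The request itself sits inside try, so a timeout is caught.
            with open_url(url, timeout=timeout) as resp:
                return resp.read().decode("utf-8")
        except (urllib.error.URLError, socket.timeout) as err:
            # Connection problems arrive as URLError; a stalled read
            # raises socket.timeout directly.
            print(f"attempt {attempt} failed: {err}")
    return None
```

Calling `fetch("http://www.baidu.com")` would then return the page text, or `None` after three failed attempts.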

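A simple crawler

Fetching all of a page's HTML

import urllib.request
class GetHtml(object):
    def __init__(self, URL):
        self.url = URL
        self.res = None
    def response(self):    # fetch the page's HTML source
        self.res = urllib.request.urlopen(self.url)
        return self.res.read()
html = GetHtml('http://graph.baidu.com')
print(html.response())

The request sent by the code above shows up on the server with a Python User-Agent (Python-urllib/3.x), so we need to change the User-Agent to that of a normal browser.

Checking the local browser's information

Press F12 in the browser to open the developer tools; the User-Agent appears among the request headers of any request in the Network panel.

Modifying the code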

import urllib.request
class GetHtml(object):
    def __init__(self, URL, HEAD):
        self.url = URL
        self.response = None     # response
        self.head = HEAD
        self.request = None   # request
    def getContent(self):
        # set the request header
        self.request = urllib.request.Request(self.url)
        self.request.add_header("user-agent", self.head)
        # read the response
        self.response = urllib.request.urlopen(self.request)
        return self.response.read().decode("utf-8")     # decode the bytes to str; otherwise extraction fails later
html = GetHtml("https://graph.baidu.com", "your browser's User-Agent string")
print(html.getContent())
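Headers can also be passed when the `Request` is constructed, and the result can be inspected before anything is sent. In the sketch below, the User-Agent value is a placeholder, not a real browser string.

```python
import urllib.request

# Placeholder value; copy the real string from your browser's developer tools.
USER_AGENT = "Mozilla/5.0 (placeholder user agent)"

req = urllib.request.Request(
    "https://graph.baidu.com",
    headers={"User-Agent": USER_AGENT},
)

# Request stores header names in capitalized form ("User-agent"),
# so that is the spelling get_header() expects.
print(req.get_header("User-agent"))
print(req.header_items())
```

Nothing is transmitted until the request is handed to `urllib.request.urlopen(req)`.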

Crawler examples

Scraping a CSDN page to extract QQ group numbers

import urllib.request
import re
class GetHtml:
    def __init__(self, URL, HEAD):
        self.url = URL
        self.request = None
        self.response = None
        self.header = HEAD
    # fetch the response
    def getContent(self):
        # build the request
        self.request = urllib.request.Request(self.url)
        self.request.add_header("user-agent", self.header)
        # send the request
        self.response = urllib.request.urlopen(self.request)
        return self.response.read().decode("utf-8")    # decode the bytes to str; otherwise an error occurs
    # extract the QQ group numbers
    def getQQ(self):
        qq = re.compile(r"<p>(\d*?)</p>").findall(self.getContent())
        return qq
html = GetHtml("https://edu.csdn.net/huiyiCourse/detail/253", "your browser's User-Agent string")
print(html.getQQ())

For this example, the regular expression has to be written after inspecting the page's HTML source and following the patterns in that markup.
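The extraction step can be tried offline on a fragment of markup before pointing it at the live page. The HTML below is invented and only imitates the shape the pattern expects. The sketch also uses `\d+` instead of `\d*?`, a small variation that skips empty `<p></p>` tags.

```python
import re

# Made-up fragment standing in for the downloaded page source.
sample_html = """
<div class="info">
  <p>Course group</p>
  <p>123456789</p>
  <p></p>
  <p>987654321</p>
</div>
"""

# \d+ requires at least one digit, so text-only and empty paragraphs are ignored.
group_numbers = re.findall(r"<p>(\d+)</p>", sample_html)
print(group_numbers)
```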

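Extracting publisher information from Douban

The Douban URL is https://read.douban.com/provider/all. The code has the same structure as above; only the regular expression changes:

douban = re.compile("<div class=\"name\">(.*?)</div>").findall(self.getContent())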

Reposting rules

"Web Crawlers (1)s" by fightingtree is licensed under the Creative Commons Attribution 4.0 International License.
