Python crawler

I. The concept of a crawler

A crawler (also called a web spider or web robot) is an automated program or script that simulates a human operating a client (a browser or an app), sends network requests to a server, and extracts data from the responses. In short, it automatically grabs information from the Internet, pulling out the data that is valuable to us.

II. The basic crawling process

  • Initiate a request. Use an HTTP library to send a Request to the target site; the request can carry additional information such as headers. Then wait for the server to respond.
  • Obtain the response content. If the server responds normally, you get a Response whose body is the page content: HTML, JSON, images, video, etc.
  • Parse the content. HTML can be parsed with regular expressions or with third-party libraries such as BeautifulSoup or lxml's etree; JSON can be parsed with the json module; binary data can be saved directly or processed further.
  • Save the data. It can be stored in a database (MySQL, MongoDB, Redis) or written to a file.
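The four steps above can be sketched in a few lines. This is a minimal illustration, not a production crawler; the URL passed to crawl() and the output file name title.txt are placeholders of my choosing:

```python
import re
import requests

def crawl(url):
    # 1. Initiate a request to the target site
    res = requests.get(url, timeout=10)
    # 2. Obtain the response content (here, HTML text)
    html = res.text
    # 3. Parse the content, e.g. extract the page title with a regular expression
    titles = re.findall(r'<title>(.*?)</title>', html)
    # 4. Save the data locally
    with open('title.txt', 'w', encoding='utf-8') as f:
        f.write(titles[0] if titles else '')
    return titles

# The parsing step on its own works against any HTML string:
sample = '<html><head><title>Demo</title></head></html>'
print(re.findall(r'<title>(.*?)</title>', sample))  # ['Demo']
```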

III. Regular expressions

A regular expression (often shortened to regex) is a kind of logical formula for string manipulation: predefined special characters, combined with ordinary ones, form a "rule string" that expresses filtering logic for text. Regular expressions are typically used to retrieve or replace text that matches a certain pattern (rule). In Python, they are implemented by the re module.
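A short sketch of the three re functions most often used in crawlers, run on a sample string of my own (the email pattern here is deliberately simplified for illustration):

```python
import re

text = 'Contact: alice@example.com, bob@test.org'

# findall returns every non-overlapping match as a list of strings
emails = re.findall(r'[\w.]+@[\w.]+', text)
print(emails)  # ['alice@example.com', 'bob@test.org']

# search returns a match object for the first match (or None)
m = re.search(r'@([\w.]+)', text)
print(m.group(1))  # 'example.com'

# sub replaces every match with the given string
masked = re.sub(r'[\w.]+@', '***@', text)
print(masked)  # 'Contact: ***@example.com, ***@test.org'
```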

IV. Crawler examples

1. Crawl Baidu pages

import requests
# 1. Define the URL and send a request to the server
url = 'https://www.baidu.com'
res = requests.get(url=url)
# 2. Process the response data to obtain the target content
res.encoding = 'utf-8'
# 3. Persist the target data locally: write it to a file
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(res.text)

2. Crawl images from the Xiaohua ("school flower") site

import os
import re
import requests
# Target listing page
base_url = 'https://news.daxues.cn/xiaohua/ziliao/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
# Send the request
res = requests.get(url=base_url, headers=headers)
# Decode the body with the page's own character encoding
html = res.content.decode('utf-8')
# Extract every image path with a regular expression
img_paths = re.findall('<img src="(.*?)" alt', html)
# Make sure the output directory exists
os.makedirs('img', exist_ok=True)
# Start crawling
for path in img_paths:
    # Build the absolute image URL
    img_url = 'https://news.daxues.cn' + path
    # Request the image itself
    img_res = requests.get(url=img_url, headers=headers)
    # Download the image, keeping its original file name
    img_name = path.split('/')[-1]
    with open(f'img/{img_name}', 'wb') as f:
        f.write(img_res.content)