The bear kid said, "You haven't even seen Ultraman," so I quickly looked it up with Python. What I didn't expect:

Ultraman actually comes in this many varieties.

The complete article list of the 120 Python Crawlers series

  1. 10 lines of code to collect 2000 pictures of beautiful women: the 120 Python Crawlers series sets out again
  2. A Python crawler reveals that 60% of women's-clothing shop owners lurk in the cosplay field
  3. Thousands of cat pictures with Python: simple tech to satisfy your collecting urge

The goal of this blog

Crawl target

Crawl 60+ Ultramen. Target data source: http://www.ultramanclub.com/?page_id=1156


Libraries used

  • requests, re

Key learning content

  • GET requests;
  • request timeout setting via the timeout parameter;
  • regular expressions with the re module;
  • data deduplication;
  • URL address splicing.
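As a warm-up, the findall-plus-set pattern from the list above can be sketched on a small HTML fragment. The markup below is a hypothetical sample modeled on the article's description of the list page, not fetched from the live site:

```python
import re

# Hypothetical fragment modeled on the list-page markup described in this article
html = '''
<ul class="lists">
<li class="item"><a href="./tiga/">Tiga</a></li>
<li class="item"><a href="./tiga/">Tiga</a></li>
<li class="item"><a href="./dyna/">Dyna</a></li>
</ul>
'''

# Non-greedy capture of every href inside the list items
links = re.findall('<li class="item"><a href="(.*?)">', html)
print(len(links))      # 3 matches, including the duplicate

# set() removes the duplicate entry
unique = sorted(set(links))
print(unique)          # ['./dyna/', './tiga/']
```

The same findall-then-set sequence appears in the real crawler below, just applied to the downloaded page instead of a hard-coded string.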

List analysis

A quick search in the developer tools shows that each Ultraman card sits inside a <li class="item"></li> tag, and the link to its details page is in the <a href="..."> tag inside it.

The elements of the specific tags are shown in the following figure:

(Figure: the list-item markup as seen in the developer tools)

We will work out the regular expressions later, based on the data the requests actually return.

Details page

Click any entry to open its details page and grab the Ultraman picture there. The location of the picture address is shown in the figure below.

(Figure: location of the image address on the details page)


Right-click the picture to inspect its tag.

(Figure: the image tag markup)

The requirements are summarized as follows:

  1. Crawl the addresses of all Ultraman details pages from the list page;
  2. Open each details page and crawl the picture address on it;
  3. Download and save the pictures.

Code

Crawl all Ultraman details page addresses

While crawling the list page, I found that the Ultraman page is nested in an iframe. This is about the simplest anti-crawling measure there is; just extract the real link. The target data source therefore switches to http://www.ultramanclub.com/allultraman/.
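Extracting the real link from an iframe can be sketched like this. The outer markup below is a hypothetical sample modeled on the article's description, not the page's actual source:

```python
import re

# Hypothetical outer page embedding the real list inside an iframe;
# the markup is assumed, not fetched from the live site
outer_html = '<iframe src="http://www.ultramanclub.com/allultraman/" frameborder="0"></iframe>'

# Capture the src attribute, which holds the real address to crawl
match = re.search('<iframe src="(.*?)"', outer_html)
real_url = match.group(1) if match else None
print(real_url)  # http://www.ultramanclub.com/allultraman/
```

Once you have the iframe's src, you crawl that URL directly and skip the wrapper page entirely.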

import requests
import re
import time


# Crawler entry point
def run():
    url = "http://www.ultramanclub.com/allultraman/"
    try:
        # The site responds slowly, so a timeout is required
        res = requests.get(url=url, timeout=10)
        res.encoding = "gb2312"
        html = res.text
        get_detail_list(html)

    except Exception as e:
        print("Request error", e)


# Get all Ultraman details pages
def get_detail_list(html):
    start_index = '<ul class="lists">'
    start = html.find(start_index)
    html = html[start:]
    links = re.findall('<li class="item"><a href="(.*)">', html)
    print(len(links))
    links = list(set(links))
    print(len(links))


if __name__ == '__main__':
    run()

During coding I found that the site responds slowly, so the timeout parameter is set to 10 to keep the request from hanging.

The regular expression matches duplicate entries; passing the result through set() deduplicates it before converting back to a list.

Next, the captured links are spliced a second time into full details-page addresses.

The code for the second splicing is as follows:

# Get all Ultraman details pages
def get_detail_list(html):
    start_index = '<ul class="lists">'
    start = html.find(start_index)
    html = html[start:]
    links = re.findall('<li class="item"><a href="(.*)">', html)
    # links = list(set(links))
    links = [f"http://www.ultramanclub.com/allultraman/{i.split('/')[1]}/" for i in set(links)]
    print(links)
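To see what the splicing does in isolation, here is a sketch with a hypothetical relative href; the "./name/" shape is assumed from the split('/')[1] usage, since the real values come from the crawl:

```python
# Hypothetical relative href as the list-page regex might capture it
href = "./tiga/"

# split('/') yields ['.', 'tiga', ''], so index 1 is the page name,
# which is spliced into the absolute details-page address
full = f"http://www.ultramanclub.com/allultraman/{href.split('/')[1]}/"
print(full)  # http://www.ultramanclub.com/allultraman/tiga/
```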

Crawl all the full-size Ultraman pictures

This step first grabs the page title, then uses the title to name the downloaded picture.

The crawling logic is simple: loop over the details-page addresses captured above and match the data with regular expressions.

The modified code is shown below; the key points are explained in the comments.

import requests
import re
import time
import os

# Declare the User-Agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"
}
# Store failed URLs so failed pages can be crawled again
error_list = []

# Crawler entry point
def run():
    url = "http://www.ultramanclub.com/allultraman/"
    try:
        # The site responds slowly, so a timeout is required
        res = requests.get(url=url, headers=headers, timeout=10)
        res.encoding = "gb2312"
        html = res.text
        return get_detail_list(html)

    except Exception as e:
        print("Request error", e)


# Get all Ultraman details pages
def get_detail_list(html):
    start_index = '<ul class="lists">'
    start = html.find(start_index)
    html = html[start:]
    links = re.findall('<li class="item"><a href="(.*)">', html)
    # links = list(set(links))
    links = [
        f"http://www.ultramanclub.com/allultraman/{i.split('/')[1]}/" for i in set(links)]
    return links


def get_image(url):
    try:
        # The site responds slowly, so a timeout is required
        res = requests.get(url=url, headers=headers, timeout=15)
        res.encoding = "gb2312"
        html = res.text
        print(url)
        # Grab the details-page title to use as the image file name
        title = re.search(r'<title>(.*?)\[', html).group(1)
        # Grab the short (relative) image address
        image_short = re.search(
            r'<figure class="image tile">[.\s]*?<img src="(.*?)"', html).group(1)

        # Splice the full image address
        img_url = "http://www.ultramanclub.com/allultraman/" + image_short[3:]
        # Fetch the image data
        img_data = requests.get(img_url).content
        print(f"Crawling {title}")
        if title is not None and image_short is not None:
            with open(f"images/{title}.png", "wb") as f:
                f.write(img_data)

    except Exception as e:
        print("*" * 100)
        print(url)
        print("Request error", e)

        error_list.append(url)


if __name__ == '__main__':
    # Make sure the output directory exists before writing images
    os.makedirs("images", exist_ok=True)
    # run() returns None on a failed request, so fall back to an empty list
    details = run() or []
    for detail in details:
        get_image(detail)

    while len(error_list) > 0:
        print("Crawling failed pages again")
        detail = error_list.pop()
        get_image(detail)

    print("Ultraman image crawl finished")

Run the code and a series of pictures is stored in the local images directory.

(Figure: the downloaded pictures in the images directory)


Code description:

In the main block, the code loops over the details-page addresses captured from the list page, i.e. this part:

for detail in details:
    get_image(detail)

Because the site responds slowly, the get request inside the get_image function is given timeout=15.

The regular matching and splicing of the image address uses the following code:

# Grab the details-page title to use as the image file name
title = re.search(r'<title>(.*?)\[', html).group(1)
# Grab the short (relative) image address
image_short = re.search(
    r'<figure class="image tile">[.\s]*?<img src="(.*?)"', html).group(1)

# Splice the full image address
img_url = "http://www.ultramanclub.com/allultraman/" + image_short[3:]
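The image_short[3:] slice drops the first three characters, which assumes the short link begins with "../". A hypothetical value shows the effect; the real values come from the details pages:

```python
# Hypothetical short link as the details-page regex might capture it;
# the leading "../" (3 characters) is assumed from the [3:] slice
image_short = "../img/ultraman-tiga.jpg"

# Stripping "../" leaves a path relative to /allultraman/
img_url = "http://www.ultramanclub.com/allultraman/" + image_short[3:]
print(img_url)  # http://www.ultramanclub.com/allultraman/img/ultraman-tiga.jpg
```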
Hey, these Ultramen really do look different from one another.

The complete code download address: https://codechina.csdn.net/hihell/python120

If you don't want to run the code and just want the pictures, you can buy a copy: https://download.csdn.net/download/hihell/19543243

Lottery time (4 copies have been given out so far)

Unfortunately, comments on the previous article did not exceed 50, so only 2 copies are given away for this blog~

As long as the number of comments exceeds 50, a lucky reader will be randomly selected. The prize: a discount coupon for the 39.9-yuan "100 Crawlers, 100 Cases" column, bringing it down to just 3.99 yuan.

Today is day 164 of 200 days of continuous writing. Likes, comments, and favorites are all welcome.