Learn Scrapy in one article, with a case study: crawling Dangdang books

Scrapy framework

Introduction

The five core components of Scrapy (plus two kinds of middleware)

Spiders:

Responsible for processing all Responses, parsing and extracting data from them to fill the Item fields, and submitting any follow-up URLs to the engine so that they enter the Scheduler again.

Engine:

Responsible for communication, signalling, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.

Scheduler:

Responsible for accepting the Requests sent by the engine, sorting and enqueuing them in a certain way, and returning them to the engine when needed.

Downloader:

Responsible for downloading all the Requests sent by the Scrapy engine and returning the obtained Responses to the engine, which hands them over to the Spider for processing.

Item Pipeline:

Responsible for processing the Items obtained from the Spider and performing post-processing (detailed analysis, filtering, storage, and so on).

Downloader Middlewares:

A component for customizing and extending the download functionality.

Spider Middlewares:

Functional components for customizing and extending the communication between the engine and the Spider (for example, the Responses going into the Spider and the Requests coming out of the Spider).

Data flow diagram of Scrapy:

The green lines represent the data flow.

[Figure: Scrapy data flow diagram]

Installation:

   pip install scrapy

A few Scrapy commands

Create a project: scrapy startproject xxx
Enter the project: cd xxx
Basic spider: scrapy genspider xxx (spider name) xxx.com (domain to crawl)
There is one more genspider variant for a rule-based crawler (CrawlSpider); only the template option changes, the first two commands stay the same.
Rule-based crawler: scrapy genspider -t crawl xxx (spider name) xxx.com (domain to crawl)
Run a spider: scrapy crawl xxx

Create a project

(1) Create a folder named scrapyDemo1.
(2) Open a command window in that folder and run scrapy startproject demo1; this creates a Scrapy project folder demo1 under scrapyDemo1, containing subfolders for the various Scrapy components.

[Figure: generated project directory structure]

(3) Enter the project folder: cd demo1
(4) scrapy genspider <spider name> <domain>, e.g. scrapy genspider demo1spider baidu.com
(5) scrapy crawl <spider name> runs the spider. Running it on the command line makes the large amount of output inconvenient to handle, so I write a dedicated run.py to launch the program; its content is almost always the same, and it sits at the same level as scrapy.cfg.

run.py

from scrapy import cmdline
cmdline.execute('scrapy crawl demo1spider --nolog'.split())  # --nolog: do not output logs to the console
# the command executed here is equivalent to: scrapy crawl demo1spider --nolog

After steps (3) and (4), the crawler project is set up and the spider is created; the actual code writing and configuration changes generally come after that.


Basic configuration and use of the Scrapy project files

settings.py: global configuration

It configures the project name, User-Agent, robots.txt rule, maximum concurrency, download delay, cookies, default request headers, item pipeline priorities, and so on.

Pay attention mainly to the commented-out settings; the commonly used ones are annotated below.

# Scrapy settings for demo1 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'demo1'  # project name

SPIDER_MODULES = ['demo1.spiders']
NEWSPIDER_MODULE = 'demo1.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'demo1 (+http://www.yourdomain.com)'    # important; usually set this to a real browser User-Agent captured from the browser's developer tools

# Obey robots.txt rules
ROBOTSTXT_OBEY = False   # robots.txt rule; the generated default is True and usually has to be changed to False, otherwise very little can be crawled

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32  # maximum number of concurrent requests (crawler threads)

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 1   # download delay between requests in seconds (default is 0); 1 second is a good trade-off, and a fraction of a second also works when there are many pages to crawl
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False   # whether cookies are handled (enabled by default); keeping cookies during the crawl is a very handy feature

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}     # default request headers; the USER_AGENT above is actually sent as one of these headers; adjust them according to the site you crawl

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'demo1.middlewares.Demo1SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'demo1.middlewares.Demo1DownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'demo1.pipelines.Demo1Pipeline': 300,
#    'demo1.pipelines.Demo1MySqlPipeline' : 200,
#}  # item pipelines; the number (e.g. 300) is the priority, and a lower number means the pipeline runs earlier. pipelines.py can define several pipelines, e.g. one that processes the scraped pages and one that stores data in the database; adjusting the priorities makes the database pipeline run first when crawled data arrives.



# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Log configuration
LOG_LEVEL= ""

LOG_FILE="日志名.log"

Add --nolog at the end of the run command and the console will not output any log information:

scrapy crawl demo1spider --nolog
Log levels

1. DEBUG: debugging information

2. INFO: general information

3. WARNING: warnings

4. ERROR: ordinary errors

5. CRITICAL: serious errors

If you set LOG_LEVEL = "WARNING", only WARNING, ERROR, and CRITICAL messages are logged.
The default level is DEBUG (level 1 above).
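
For example, a minimal logging setup in settings.py might look like this (the file name demo1.log is just an illustration):

LOG_LEVEL = "WARNING"   # record only WARNING, ERROR and CRITICAL messages
LOG_FILE = "demo1.log"  # write the log to a file instead of the console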

Export to several formats

Add the -o option when running the crawl command.

JSON format (non-ASCII characters are Unicode-escaped by default):

scrapy crawl spidername -o spidername.json

JSON Lines format (non-ASCII characters are Unicode-escaped by default):

scrapy crawl spidername -o spidername.jsonlines

CSV (comma-separated values, can be opened with Excel):

scrapy crawl spidername -o spidername.csv

XML format:

scrapy crawl spidername -o spidername.xml

For JSON output, add the following to settings.py to set the export encoding, otherwise non-ASCII text will be garbled:

FEED_EXPORT_ENCODING = 'utf-8'

XPath

The Selector class

1. from scrapy.selector import Selector: import the Selector class.
2. selector = Selector(text=htmlText): load the HTML document into a Selector object, on which the xpath() method can be used.
3. xpath() calls can be chained: each call returns a SelectorList, on which xpath() can be called again (see the sketch below).
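
As a quick illustration of points 2 and 3, here is a minimal sketch (the HTML string is made up for the example):

from scrapy.selector import Selector

htmlText = "<html><body><book><title>Python</title></book></body></html>"
selector = Selector(text=htmlText)                          # build a Selector from the HTML text
titles = selector.xpath("//book").xpath("./title/text()")   # chained xpath calls; each returns a SelectorList
print(titles.extract())                                     # ['Python']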

Finding HTML elements with XPath

1. "//" selects node elements anywhere under the document, "/" selects child (next-level) node elements of the current node, and "." refers to the current node.
2. If an XPath query returns Selector objects, calling extract() returns a list with the text of those elements; extract_first() returns the first element of that list, or None if the list is empty. A single Selector object has no extract_first() function.
3. "/@attrName" selects the attrName attribute node of an element; the result is also a Selector object.
4. "/text()" selects the text-value nodes contained in an element; a text node is also a Selector object, and its value is obtained with the extract() function.
5. "tag[condition1 and condition2 ...]" selects tag elements that satisfy the conditions, which usually test the tag's attributes.
6. position() can be used to restrict the selection to a particular element; numbering starts at 1. For example:

s = selector.xpath("//book[position()=1]/title").extract_first()

7. "*" matches any element node, excluding text and comment nodes.
8. "@*" matches any attribute.
9. "element/parent::*" selects the parent node of element; there is only one such node.
10. "element/following-sibling::*" selects all following siblings of element at the same level; "element/following-sibling::*[position()=1]" selects the first following sibling.
11. "element/preceding-sibling::*" selects all preceding siblings of element at the same level (a short combined example follows this list).
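
A sketch exercising several of the rules above on a made-up HTML fragment:

from scrapy.selector import Selector

html = """
<div>
  <book id="b1"><title>First</title><price>10</price></book>
  <book id="b2"><title>Second</title><price>20</price></book>
</div>
"""
selector = Selector(text=html)

first_title = selector.xpath("//book[position()=1]/title/text()").extract_first()   # 'First'
ids = selector.xpath("//book/@id").extract()                                        # ['b1', 'b2']
next_id = selector.xpath("//book[@id='b1']/following-sibling::*[position()=1]/@id").extract_first()  # 'b2'
print(first_title, ids, next_id)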

yield

1. yield is similar to return, but it is part of a generator.

Generators in detail
If you have never met yield before, start by thinking of it as a "return": that is the intuitive picture. An ordinary return hands a value back to the caller, and the function stops running there. Then refine the picture: yield makes the function part of a generator (a function containing yield is really an iterator).

2. yield versus return

A function containing yield is a generator, not an ordinary function. A generator supports the next() function, and each call to next() produces the "next" value. Execution does not restart from the beginning of the function (for example foo()): it resumes from where the previous next() stopped, runs until it meets the next yield, returns the value to be produced, and pauses again.
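
A tiny runnable example of this behaviour, using the foo function mentioned above:

def foo():
    print("starting")
    while True:
        res = yield 4          # pause here and hand 4 back to the caller
        print("res:", res)

g = foo()          # calling foo() only builds the generator; nothing runs yet
print(next(g))     # runs up to the first yield: prints "starting", then 4
print(next(g))     # resumes after the yield (res is None), runs to the next yield, prints 4 again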

Example

Crawl book data from the Dangdang website and save it to MySQL.
The work builds on the demo1 project created above.

Observe the page

Dangdang book search: http://search.dangdang.com
Searching for python changes the URL to: http://search.dangdang.com/?key=python&act=input
Turning to the next (second) page: http://search.dangdang.com/?key=python&act=input&page_index=2
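
Based on this URL pattern, the page URLs could also be generated directly; a minimal sketch (the spider below follows the site's "next page" link instead):

key = "python"
base = "http://search.dangdang.com/?key={}&act=input".format(key)
page_urls = [base] + ["{}&page_index={}".format(base, i) for i in range(2, 6)]  # first five pages
for url in page_urls:
    print(url)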

[Figure: Dangdang search results page]

Determine the information to crawl:

title — book title
author — author
date — publication date
publisher — publisher
detail — description
price — price

[Figure: the corresponding fields on the Dangdang results page]

Create the corresponding database and table in MySQL:

show databases ;


create database ddbookdb;



use ddbookdb;

create table books(
    btitle varchar(512) primary key ,
    bauthor varchar(256),
    bpublisher varchar(256),
    bdate varchar(32),
    bprice varchar(16),
    bdetail text
);
select * from books;
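
Optionally, a quick pymysql sanity check confirms that the table exists; the connection parameters mirror the ones used in the pipeline below, so adjust the user and password to your own setup:

import pymysql

con = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                      passwd="lzyft1030", db="ddbookdb", charset="utf8")
cursor = con.cursor()
cursor.execute("show tables")
print(cursor.fetchall())   # should include ('books',)
con.close()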

Write and run the run.py file:

from scrapy import cmdline
cmdline.execute('scrapy crawl demo1spider --nolog'.split())  # --nolog: do not output logs to the console

Write the data item class Demo1Item in items.py:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class Demo1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()      # book title
    author = scrapy.Field()     # author
    date = scrapy.Field()       # publication date
    publisher = scrapy.Field()  # publisher
    detail = scrapy.Field()     # description
    price = scrapy.Field()      # price

Write pipelines.py: implement open_spider() and close_spider() to connect to and close the MySQL database, store the data passed in through the items into MySQL, and use a count variable to count how many books were crawled:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql

class Demo1Pipeline:
    def open_spider(self, spider):
        print("opened")

        try:
            self.con = pymysql.connect(host="127.0.0.1", port=3306, user="root", passwd="lzyft1030", db="ddbookdb", charset="utf8")
            self.cursor = self.con.cursor(pymysql.cursors.DictCursor)  # create a cursor
            self.cursor.execute("delete from books")
            self.opened = True
            self.count = 0
        except Exception as err:
            print(err)
            self.opened = False

    def close_spider(self, spider):
        if self.opened:
            self.con.commit()  # commit the transaction
            self.con.close()   # close the connection
            self.opened = False
        print("closed")
        print("Crawled", self.count, "books in total")

    def process_item(self, item, spider):
        # handle the data passed in from the spider
        try:
            # store the data in MySQL
            if self.opened:
                self.cursor.execute("insert into books(btitle, bauthor, bpublisher, bdate, bprice, bdetail) values(%s, %s, %s, %s, %s, %s)",
                                    (item["title"], item["author"], item["publisher"], item["date"], item["price"], item["detail"]))
                # count the crawled books
                self.count += 1
        except Exception as err:
            print(err)

        return item


Modify settings.py [set the robots rule, add a real browser User-Agent if needed, set the download delay, and enable ITEM_PIPELINES so that items are handed to the Demo1Pipeline class and saved to MySQL]. The rest of the file is the same as the default listing shown earlier; the settings that matter here are:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False   # do not obey robots.txt (the generated default is True), otherwise very little can be crawled

#USER_AGENT = 'demo1 (+http://www.yourdomain.com)'    # replace with a real browser User-Agent if the site requires it (left commented out here)

DOWNLOAD_DELAY = 1   # wait 1 second between requests

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'demo1.pipelines.Demo1Pipeline': 300,
    #'demo1.pipelines.Demo1MySqlPipeline' : 200,
}  # 300 is the priority; a lower number means the pipeline runs earlier

Modify the demo1spider.py file to perform specific crawler operations:

import scrapy
from bs4 import UnicodeDammit
from bs4 import BeautifulSoup
from ..items import Demo1Item


# override the start_requests method instead of relying on start_urls

class Demo1spiderSpider(scrapy.Spider):
    name = 'demo1spider'
    #allowed_domains = ['baidu.com']
    #start_urls = ['http://baidu.com/']  # entry URL(s)
    key = "python"
    source_url = "http://search.dangdang.com/"

    def start_requests(self):  # entry function; start_urls can be used instead (it is a list and may hold several entry URLs)
        url = Demo1spiderSpider.source_url + "?key=" + Demo1spiderSpider.key
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):  # callback
        # response.body is raw bytes; response.body.decode() would turn it into text
        try:
            # use bs4's UnicodeDammit to handle the page encoding
            dammit = UnicodeDammit(response.body, ['utf-8', 'gbk'])
            data = dammit.unicode_markup
            # build a Selector object and query it with xpath
            selector = scrapy.Selector(text=data)
            lis = selector.xpath("//li['@ddt-pit'][starts-with(@class,'line')]")
            #print(lis)
            for li in lis:
                title = li.xpath("./a[position()=1]/@title").extract_first()
                price = li.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()
                author = li.xpath("./p[@class='search_book_author']/span/a/@title").extract_first()
                date = li.xpath("./p[@class='search_book_author']/span[position()=2]/text()").extract_first()
                publisher = li.xpath("./p[@class='search_book_author']/span[position()=3]/a/@title").extract_first()
                detail = li.xpath("./p[@class='detail']/text()").extract_first()  # sometimes missing (None)

                item = Demo1Item()
                item['title'] = title.strip() if title else ""
                item['author'] = author.strip() if author else ""
                item['date'] = date.strip()[1:] if date else ""
                item['publisher'] = publisher.strip() if publisher else ""
                item['price'] = price.strip() if price else ""
                item['detail'] = detail.strip() if detail else ""
                yield item

            # on the last page the next link is None
            link = selector.xpath("//div[@class='paging']/ul[@name='Fy']/li[@class='next']/a/@href").extract_first()
            if link:
                url = response.urljoin(link)
                yield scrapy.Request(url=url, callback=self.parse)
        except Exception as err:
            print(err)
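
A small aside: recent Scrapy versions let you query the response directly with response.xpath() and the shorter .get()/.getall() accessors, so the encoding handling via UnicodeDammit can usually be skipped. A sketch of an alternative parse() body under that assumption (a drop-in for the method above, same class and imports):

    def parse(self, response):
        # response.xpath() works on the decoded body directly;
        # .get()/.getall() are the newer names for extract_first()/extract()
        for li in response.xpath("//li[starts-with(@class,'line')]"):
            item = Demo1Item()
            item['title'] = (li.xpath("./a[position()=1]/@title").get() or "").strip()
            item['price'] = (li.xpath(".//span[@class='search_now_price']/text()").get() or "").strip()
            item['author'] = (li.xpath("./p[@class='search_book_author']/span/a/@title").get() or "").strip()
            item['date'] = (li.xpath("./p[@class='search_book_author']/span[position()=2]/text()").get() or "").strip()[1:]
            item['publisher'] = (li.xpath("./p[@class='search_book_author']/span[position()=3]/a/@title").get() or "").strip()
            item['detail'] = (li.xpath("./p[@class='detail']/text()").get() or "").strip()
            yield item
        link = response.xpath("//div[@class='paging']//li[@class='next']/a/@href").get()
        if link:
            yield scrapy.Request(url=response.urljoin(link), callback=self.parse)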

If the relative import from ..items import Demo1Item cannot be resolved by the IDE, mark the root directory of the project as the sources root: the project root is the outer demo1 folder. Select it and follow the operation shown in the figure, and the import statement above will work.

[Figure: marking the project root directory as the sources root]