Introduction to Scrapy

Scrapy HomePage

Overview

Scrapy is a Python web-crawling framework for extracting data from websites.

An example spider

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

This spider scrapes quotes from the example site.

Save the code above to a file, e.g. quotes_spider.py, then run the spider with the runspider command:

scrapy runspider quotes_spider.py -o quotes.json

When the run finishes, it produces a JSON file, quotes.json, containing the scraped data:

[{
    "author": "Jane Austen",
    "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
},
{
    "author": "Groucho Marx",
    "text": "\u201cOutside of a dog, a book is man's best friend. Inside of a dog it's too dark to read.\u201d"
},
{
    "author": "Steve Martin",
    "text": "\u201cA day without sunshine is like, you know, night.\u201d"
},
...]

What happened?

When you run scrapy runspider quotes_spider.py, Scrapy looks for a spider defined in the file (QuotesSpider is a spider class and must inherit from Spider) and runs it through its crawler engine.
The crawl starts by sending requests to the URLs listed in start_urls, and each response is handled by the default callback method, parse(response).
Inside parse we use CSS and XPath selectors to extract the data we are interested in from the response and yield it (we could also simply print it).
In this example, the spider then looks for a link to the next page (next_page = response.css('li.next a::attr("href")').get()) and repeats the whole process for it.
Note that requests are scheduled and processed asynchronously: with two requests pending, Scrapy does not wait for the first one to finish before sending the second.
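
If you need to limit that concurrency, one option is to set per-spider settings; below is a minimal sketch using Scrapy's CONCURRENT_REQUESTS and DOWNLOAD_DELAY settings (the spider name 'polite' is just for illustration):

import scrapy


class PoliteSpider(scrapy.Spider):
    name = 'polite'
    start_urls = ['http://quotes.toscrape.com/']
    # Per-spider settings: send one request at a time,
    # and wait two seconds between requests.
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'DOWNLOAD_DELAY': 2,
    }

    def parse(self, response):
        yield {'title': response.css('title::text').get()}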

Scrapy features

  • Select the data you care about with CSS and XPath selectors
  • An interactive shell console for debugging
  • Export the scraped data as JSON, XML, or CSV (see the commands below)
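
The export format is inferred from the output file's extension, so the earlier runspider command works unchanged for the other formats:

scrapy runspider quotes_spider.py -o quotes.csv
scrapy runspider quotes_spider.py -o quotes.xml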

Installing Scrapy

You can install it with pip:

pip install scrapy

Sometimes a few dependencies need to be installed first; check the installation log to work out which packages are missing.
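
On RHEL/CentOS, for example, the compiled dependencies (lxml, cryptography) usually need a compiler and a few -devel packages; treat this as a starting point, not an exhaustive list:

yum install -y gcc libffi-devel openssl-devel libxml2-devel libxslt-devel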

The official documentation recommends installing Scrapy into a dedicated virtual Python environment, because installing it straight into the system environment can create conflicts between package dependencies.
For background on virtual environments, see this link: Virtual Environment and Packages
Below is my install script (only tested on RHEL):

#!/bin/bash

if [ ! -e /usr/bin/python ];then
    if [ -e "/usr/libexec/platform-python" ];then
        ln -s /usr/libexec/platform-python /usr/bin/python
    elif [ -e "/usr/bin/python2" ];then
        ln -s /usr/bin/python2 /usr/bin/python
    fi
fi

python -V >  python_version.txt 2>&1
python_version=$(cat python_version.txt|grep "Python [0-9]"|awk '{print $2}'|awk -F'.' '{print $1}')

# install pip
#if ! scrapy version;then
        if [ $python_version -eq 3 ];then
            yum install -y python3-devel
            yum install -y platform-python-devel
            yum install -y python3-pip
            ln -s /usr/bin/pip3 /usr/bin/pip
        elif [ $python_version -eq 2 ];then
            yum install -y python2-devel
            wget https://bootstrap.pypa.io/get-pip.py
            python get-pip.py
            ln -s /usr/bin/pip2 /usr/bin/pip
        fi
#fi

# install virtualenv,virtualenvwrapper
pip install virtualenv virtualenvwrapper
s_dir=$(find / -name virtualenvwrapper.sh | head -n 1)   # keep only the first match
cat >> ~/.bashrc <<-EOF
export WORKON_HOME=$HOME/.virtualenvs
export PROJECT_HOME=$HOME/virtualprj
source $s_dir
EOF
mkdir $HOME/virtualprj
. ~/.bashrc

# create virtualenv 'VENV'
mkvirtualenv VENV
# install scrapy in 'VENV'
pip install twisted
pip install pyopenssl
pip install scrapy
deactivate

If Scrapy was installed into a virtual environment, switch into that environment before using it:

workon VENV    # activate VENV
scrapy runspider kernel-spider.py --nolog -a family=rhel7
deactivate     # leave VENV

Getting started

Create a project

scrapy startproject tutorial

The command above creates the following directory layout:

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

Create a spider

A spider is a class that must inherit from Spider.
Save the following code to tutorial/spiders/quotes_spider.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

  • name: the spider's name; it must be unique within a project
  • start_requests(): issues the initial requests
  • parse(): handles the responses

Run the spider

From the project's root directory, run:

scrapy crawl quotes

start_urls

Alternatively, you can define a start_urls list; Scrapy will send requests to those URLs by default (with parse() as the callback), so start_requests() can be omitted:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

Extracting data

You can extract data from a response with CSS and XPath selectors. Useful references:
CSS tutorial (Runoob)
CSS reference (official)
XPath tutorial (Runoob)
XPath reference (official)

CSS basics

scrapy shell "http://quotes.toscrape.com/page/1/"

# select the div elements whose class is 'container'
>>> response.css("div.container")
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' container ')]" data='<div class="container">\n        <div cla'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' container ')]" data='<div class="container">\n            <p c'>]

# get the first a element
>>> response.css("a").extract_first()
'<a href="/" style="text-decoration: none">Quotes to Scrape</a>'
>>> response.css("a").get()
'<a href="/" style="text-decoration: none">Quotes to Scrape</a>'

# get the text of the first a element
>>> response.css("a::text").get()
'Quotes to Scrape'

# get the div elements that have an itemtype attribute
>>> response.css("div[itemtype]")
[<Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>]

# get the div elements with itemtype='http://schema.org/CreativeWork'
>>> response.css("div[itemtype='http://schema.org/CreativeWork']")
[<Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>]

XPath basics

# get the class attribute of the divs with itemtype='http://schema.org/CreativeWork'
>>> response.xpath("//div[@itemtype='http://schema.org/CreativeWork']/@class").extract_first()
'quote'

# get the itemtype value and trim it with a regex, keeping everything from 'schema' onward
>>> response.xpath("//div[@class='quote']/@itemtype").re_first(r'schema.*')
'schema.org/CreativeWork'

# get the link elements whose href contains 'Albert'
>>> response.xpath("//a[re:test(@href,'Albert')]").extract_first()
'<a href="/author/Albert-Einstein">(about)</a>'
>>> response.xpath("//a[contains(@href,'Albert')]").extract_first()
'<a href="/author/Albert-Einstein">(about)</a>'

# get the text with text()
>>> response.xpath("//a[contains(@href,'Albert')]//text()").extract_first()
'(about)'

# some more complex examples
url = response.css('.blue-text.dellmetrics-driverdownloads.DriverDetailDownload').xpath('//a[re:test(@href,"^https.*BIN$")]//@href').get()
url = response.xpath('//table[@class="sf_oem_table"]//tr//td[@class="oem_body"]//div[@class="oem_links"]//a[contains(@href,"RPM")]/@href').re_first(r'.*Solarflare.*driver.*source.*RPM\.zip.*')

A worked example

scrapy shell "http://quotes.toscrape.com/page/1/"

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser


>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

response.css('title') returns a list of Selector objects. A Selector wraps an XML/HTML element and lets you refine the selection with further css()/xpath() calls.
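
For example, selectors can be chained to drill into a single quote block:

>>> response.css('div.quote')[0].css('span.text::text').get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'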

# title::text selects the text inside the title element
>>> response.css('title::text').getall()
['Quotes to Scrape']

>>> response.css('title::text').get()
'Quotes to Scrape'

>>> response.css('title::text')[0].get()
'Quotes to Scrape'

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'

>>> response.css("span.text::text").get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> response.css("span.text").get()
'<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>'

Saving the data

scrapy crawl quotes -o quotes.json

scrapy crawl quotes -o quotes.jl
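
Note that -o appends to an existing file, which can leave a .json feed as invalid JSON across runs; the JSON Lines (.jl) format, one JSON object per line, avoids this and is easy to process incrementally. A minimal sketch of reading it back:

import json

# Read the feed produced by `scrapy crawl quotes -o quotes.jl`
with open('quotes.jl') as f:
    quotes = [json.loads(line) for line in f]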

Using different callbacks

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
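
This spider follows author links with one callback (parse_author) and pagination links with another (parse); run it like any other spider:

scrapy crawl author -o authors.json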

Passing arguments on the command line

scrapy crawl quotes -o quotes-humor.json -a tag=humor

# script
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
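
Arguments passed with -a are handed to the spider's __init__ and become attributes by default, which is why getattr(self, 'tag', None) works above. An equivalent sketch that captures the argument explicitly:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def __init__(self, tag=None, *args, **kwargs):
        # -a tag=humor arrives here as the keyword argument tag='humor'
        super().__init__(*args, **kwargs)
        self.tag = tag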