Overview
Scrapy is a Python crawling framework for extracting data from websites.
An example spider
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
This spider scrapes quotes and their authors from the example site, following the pagination links as it goes.
Save the code above to a file, for example quotes_spider.py, then run the spider with the runspider command:
scrapy runspider quotes_spider.py -o quotes.json
When the run finishes, you get a JSON file, quotes.json, containing the scraped data:
[{
"author": "Jane Austen",
"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
},
{
"author": "Groucho Marx",
"text": "\u201cOutside of a dog, a book is man's best friend. Inside of a dog it's too dark to read.\u201d"
},
{
"author": "Steve Martin",
"text": "\u201cA day without sunshine is like, you know, night.\u201d"
},
...]
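A caveat: running the command a second time with -o appends to quotes.json, leaving an invalid JSON file. Newer Scrapy releases also accept -O (capital O) to overwrite the output file instead; a minimal example, assuming a version that supports the flag:
scrapy runspider quotes_spider.py -O quotes.json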
What happened
When you run scrapy runspider quotes_spider.py, Scrapy looks for a spider defined inside the file (QuotesSpider is a spider class and must subclass Spider) and runs it through its crawler engine.
The crawl starts by sending requests to the addresses in start_urls, then handles each response with the default callback, parse(response).
Inside parse we can use CSS and XPath selectors to extract the data we are interested in from the response and save it (or simply print it out).
In this example, the spider also looks for the link to the next page (next_page = response.css('li.next a::attr("href")').get()) and repeats the same process for it.
Note that requests are processed asynchronously: given two requests, Scrapy does not wait for the first to finish before sending the second; the second is dispatched right away.
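How aggressively those asynchronous requests are dispatched can be tuned in the project settings. A minimal settings.py sketch using two standard Scrapy settings, CONCURRENT_REQUESTS and DOWNLOAD_DELAY (the values shown are illustrative):
# settings.py: throttle the asynchronous request scheduling
CONCURRENT_REQUESTS = 16   # maximum requests in flight at once (Scrapy's default)
DOWNLOAD_DELAY = 0.5       # wait 0.5 seconds between consecutive requests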
What Scrapy offers
- Selecting the data you are interested in with CSS and XPath selectors
- An interactive shell console for debugging your scraping code
- Built-in export of the scraped data as JSON, XML, or CSV (see the commands right after this list)
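For example, the same quotes spider can emit any of the three formats; Scrapy picks the exporter from the output file extension:
scrapy runspider quotes_spider.py -o quotes.json
scrapy runspider quotes_spider.py -o quotes.csv
scrapy runspider quotes_spider.py -o quotes.xml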
Installing Scrapy
You can install it with pip:
pip install scrapy
Sometimes a few dependency packages must be installed first; the error log tells you which ones are missing.
The official documentation recommends installing Scrapy into a dedicated virtual Python environment, because installing it straight into the system environment can create conflicts between package dependencies.
For background on virtual environments, see: Virtual Environment and Packages
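If you only need a basic isolated environment, the standard library's venv module is a lighter alternative to the virtualenvwrapper setup used below; a minimal sketch (the environment name scrapy-env is arbitrary):
python3 -m venv scrapy-env          # create the environment
source scrapy-env/bin/activate      # activate it
pip install scrapy                  # install Scrapy inside it
deactivate                          # leave the environment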
Below is my install script (only tested on RHEL):
#!/bin/bash
# make sure /usr/bin/python exists, pointing at platform-python or python2
if [ ! -e /usr/bin/python ]; then
    if [ -e "/usr/libexec/platform-python" ]; then
        ln -s /usr/libexec/platform-python /usr/bin/python
    elif [ -e "/usr/bin/python2" ]; then
        ln -s /usr/bin/python2 /usr/bin/python
    fi
fi
python -V > python_version.txt 2>&1
python_version=$(cat python_version.txt | grep "Python [0-9]" | awk '{print $2}' | awk -F'.' '{print $1}')
# install pip
if [ $python_version -eq 3 ]; then
    yum install -y python3-devel
    yum install -y platform-python-devel
    yum install -y python3-pip
    ln -s /usr/bin/pip3 /usr/bin/pip
elif [ $python_version -eq 2 ]; then
    yum install -y python2-devel
    wget https://bootstrap.pypa.io/get-pip.py
    python get-pip.py
    ln -s /usr/bin/pip2 /usr/bin/pip
fi
# install virtualenv and virtualenvwrapper
pip install virtualenv virtualenvwrapper
s_dir=$(find / -name virtualenvwrapper.sh)
cat >> ~/.bashrc <<EOF
export WORKON_HOME=$HOME/.virtualenvs
export PROJECT_HOME=$HOME/virtualprj
source $s_dir
EOF
mkdir $HOME/virtualprj
. ~/.bashrc
# create virtualenv 'VENV'
mkvirtualenv VENV
# install scrapy in 'VENV'
pip install twisted
pip install pyopenssl
pip install scrapy
deactivate
If Scrapy was installed into a virtual environment, you have to switch into that environment before using it:
workon VENV        # activate VENV
scrapy runspider kernel-spider.py --nolog -a family=rhel7
deactivate         # leave VENV
Getting started
Create a project
scrapy startproject tutorial
The command above creates the following directories and files:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Create a spider
A spider is a class that must subclass Spider.
Save the following code to tutorial/spiders/quotes_spider.py:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
- name: the spider's name; it must be unique within a project
- start_requests(): yields the initial requests
- parse(): handles the responses
Run the spider
From the project's root directory, run:
scrapy crawl quotes
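If the run succeeds, two files appear in the current directory, quotes-1.html and quotes-2.html, one per URL in start_requests(); the page number in each filename comes from response.url.split("/")[-2].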
start_urls
Instead of implementing start_requests(), you can define a start_urls list on the class; Scrapy then sends requests to those addresses by default and passes each response to parse():
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
Extracting data
You can extract data from a response with CSS and XPath selectors. Some references:
CSS tutorial (Runoob)
CSS official reference
XPath tutorial (Runoob)
XPath official reference
Basic CSS usage
scrapy shell "http://quotes.toscrape.com/page/1/"
# select div elements whose class is container
>>> response.css("div.container")
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' container ')]" data='<div class="container">\n <div cla'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' container ')]" data='<div class="container">\n <p c'>]
# get the first a element
>>> response.css("a").extract_first()
'<a href="/" style="text-decoration: none">Quotes to Scrape</a>'
>>> response.css("a").get()
'<a href="/" style="text-decoration: none">Quotes to Scrape</a>'
# get the text of the first a element
>>> response.css("a::text").get()
'Quotes to Scrape'
# select div elements that have an itemtype attribute
>>> response.css("div[itemtype]")
[<Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>, <Selector xpath='descendant-or-self::div[@itemtype]' data='<div class="quote" itemscope itemtype="h'>]
# select div elements with itemtype='http://schema.org/CreativeWork'
>>> response.css("div[itemtype='http://schema.org/CreativeWork']")
[<Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>, <Selector xpath="descendant-or-self::div[@itemtype = 'http://schema.org/CreativeWork']" data='<div class="quote" itemscope itemtype="h'>]
Basic XPath usage
# get the class attribute of the div with itemtype='http://schema.org/CreativeWork'
>>> response.xpath("//div[@itemtype='http://schema.org/CreativeWork']/@class").extract_first()
'quote'
# get the itemtype value and trim it with a regex, keeping everything from 'schema' onward
>>> response.xpath("//div[@class='quote']/@itemtype").re_first(r'schema.*')
'schema.org/CreativeWork'
# select link elements whose href contains Albert (re:test comes from the EXSLT regex extensions that Scrapy selectors support)
>>> response.xpath("//a[re:test(@href,'Albert')]").extract_first()
'<a href="/author/Albert-Einstein">(about)</a>'
>>> response.xpath("//a[contains(@href,'Albert')]").extract_first()
'<a href="/author/Albert-Einstein">(about)</a>'
# get the text with text()
>>> response.xpath("//a[contains(@href,'Albert')]//text()").extract_first()
'(about)'
# some more involved, real-world examples
url = response.css('.blue-text.dellmetrics-driverdownloads.DriverDetailDownload').xpath('//a[re:test(@href,"^https.*BIN$")]//@href').get()
url = response.xpath('//table[@class="sf_oem_table"]//tr//td[@class="oem_body"]//div[@class="oem_links"]//a[contains(@href,"RPM")]/@href').re_first(r'.*Solarflare.*driver.*source.*RPM\.zip.*')
An example session
scrapy shell "http://quotes.toscrape.com/page/1/"
[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s] item {}
[s] request <GET http://quotes.toscrape.com/page/1/>
[s] response <200 http://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x7fa91d888c10>
[s] spider <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
response.css('title') returns a list of Selector objects. A Selector wraps an XML/HTML element and lets you run further css/xpath queries on it.
# title::text extracts the text inside the title element
>>> response.css('title::text').getall()
['Quotes to Scrape']
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> response.css('title::text')[0].get()
'Quotes to Scrape'
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'
>>> response.css("span.text::text").get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> response.css("span.text").get()
'<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>'
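Because every Selector can be queried again, extraction is usually done in two steps: select a block, then pull individual fields out of it. A short sketch in the same shell session, using the div.quote blocks on this page (output taken from the first quote shown above):
>>> quote = response.css('div.quote')[0]
>>> quote.css('span.text::text').get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> quote.css('small.author::text').get()
'Albert Einstein'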
Saving the data
The -o option exports the scraped items, and the file extension selects the format (quotes.jl is JSON Lines: one JSON object per line):
scrapy crawl quotes -o quotes.json
scrapy crawl quotes -o quotes.jl
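Instead of passing -o on every run, the export can be configured once in the project settings; a minimal sketch, assuming a Scrapy version (2.1 or later) that supports the FEEDS setting:
# settings.py: always export scraped items as JSON
FEEDS = {
    'quotes.json': {'format': 'json'},
}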
Using different callbacks
import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow the links to author pages with a dedicated callback
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        # follow the pagination links with this same callback
        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
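Run it like any other spider. Author pages that are linked from several quotes are fetched only once, because Scrapy deduplicates requests to already-visited URLs by default:
scrapy crawl author -o authors.json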
Passing arguments on the command line
scrapy crawl quotes -o quotes-humor.json -a tag=humor
Arguments passed with -a become attributes on the spider instance, which is why the spider below reads the tag with getattr:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
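For example, with a hypothetical tag value of books (any tag that exists on the site works the same way):
scrapy crawl quotes -o quotes-books.json -a tag=books
Inside the spider, self.tag is 'books', so the start URL becomes http://quotes.toscrape.com/tag/books.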