

A Summary of Image Scraping Methods in Python

数据皮皮侠

2020-10-18

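1. The most common ways to download images

For downloading images, the first options that come to mind are the urllib library and the requests library. Both approaches are shown below: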

1.1 urllib

The urllib.request.urlretrieve function downloads an image given its URL and the name of the file to save it under.

'''
Signature: request.urlretrieve(url, filename=None, reporthook=None, data=None)
Docstring:
Retrieve a URL into a temporary location on disk.

Requires a URL argument. If a filename is passed, it is used as
the temporary file location. The reporthook argument should be
a callable that accepts a block number, a read size, and the
total file size of the URL target. The data argument should be
valid URL encoded data.

If a filename is passed and the URL points to a local resource,
the result is a copy from local file to new file.

Returns a tuple containing the path to the newly created
data file as well as the resulting HTTPMessage object.
File:      ~/anaconda/lib/python3.6/urllib/request.py
Type:      function
'''

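The filename parameter specifies the local path to save to (if it is omitted, urllib creates a temporary file to hold the data).
The reporthook parameter is a callback that is invoked once when the connection to the server is established and again after each block of data has been transferred; it can be used to display download progress.
The data parameter is the data to POST to the server. The function returns a two-element tuple (filename, headers), where filename is the local path the data was saved to and headers holds the server's response headers.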
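
The reporthook parameter described above can be sketched as follows. This is a minimal illustration rather than anything from the original article: make_progress_hook, download_with_progress, and the log list are hypothetical names, and a percentage is only recorded when the total size is known (urlretrieve passes -1 otherwise).

```python
from urllib import request


def make_progress_hook(log):
    """Return a reporthook that appends the percentage downloaded to `log`.

    `make_progress_hook` and `log` are illustrative names, not urllib APIs.
    """
    def hook(block_num, block_size, total_size):
        # urlretrieve calls the hook once before the first read (block_num == 0)
        # and once after every block; total_size is -1 when it is unknown.
        if total_size > 0:
            done = min(block_num * block_size, total_size)
            log.append(100.0 * done / total_size)
    return hook


def download_with_progress(url, filename):
    """Download `url` to `filename` and return (saved path, progress log)."""
    progress = []
    saved_path, _headers = request.urlretrieve(
        url, filename, reporthook=make_progress_hook(progress))
    return saved_path, progress
```

A call such as download_with_progress(image_url, 'kids.jpg') would return the saved path together with the list of percentages, which a caller could print or feed to a progress bar.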

Example:

from urllib import request
request.urlretrieve('https://img3.doubanio.com/view/photo/photo/public/p454345512.jpg', 'kids.jpg')

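However, the call may well fail with a 403 (Forbidden) error, as it does for http://www.qnong.com.cn/uploa... Stack Overflow gives the reason: This website is blocking the user-agent used by urllib, so you need to change it in your request.

Adding a User-Agent to urlretrieve is somewhat cumbersome. One way to do it: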


  
from urllib import request

# Install a global opener so that urlretrieve sends a browser-like User-Agent
opener = request.build_opener()
headers = ('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0')
opener.addheaders = [headers]
request.install_opener(opener)
request.urlretrieve('http://www.qnong.com.cn/uploadfile/2016/0416/20160416101815887.jpg', './dog.jpg')
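
Because install_opener changes the opener used by every later urllib call in the process, a per-request alternative may sometimes be preferable. The sketch below is such an alternative, not the article's method: it swaps urlretrieve for urllib.request.Request plus urlopen, and build_request and fetch_to_file are hypothetical helper names.

```python
import shutil
from urllib import request

USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) '
              'Gecko/20100101 Firefox/53.0')


def build_request(url, user_agent=USER_AGENT):
    """Create a Request that carries its own User-Agent header."""
    return request.Request(url, headers={'User-Agent': user_agent})


def fetch_to_file(url, path, user_agent=USER_AGENT):
    """Copy the response body of `url` into the local file `path`."""
    with request.urlopen(build_request(url, user_agent)) as resp, \
            open(path, 'wb') as fh:
        shutil.copyfileobj(resp, fh)
    return path
```

With this approach the header applies only to the request it is attached to, so other code that uses urllib in the same process is unaffected.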

1.2 requests

Fetch the image with requests.get(), setting the stream parameter to True so that the response body is downloaded in chunks.

 
import requests

# stream=True defers downloading the body until it is iterated over
req = requests.get('http://www.qnong.com.cn/uploadfile/2016/0416/20160416101815887.jpg', stream=True)
with open('dog.jpg', 'wb') as wr:
    for chunk in req.iter_content(chunk_size=1024):
        if chunk:  # skip empty keep-alive chunks
            wr.write(chunk)
            wr.flush()

Adding a User-Agent with requests is also straightforward: just pass the headers parameter.
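
As a sketch of the headers parameter mentioned above: the helper name download_image, the optional session argument, and the timeout value are assumptions for illustration, not part of the article.

```python
import requests

# Browser-like User-Agent, the same string used in the urllib example
DEFAULT_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) "
                   "Gecko/20100101 Firefox/53.0"),
}


def download_image(url, path, headers=None, session=None, timeout=10):
    """Stream `url` into `path`, sending custom request headers.

    `download_image` is a hypothetical helper. `session` may be a
    requests.Session or any object with a compatible get method.
    """
    session = session or requests.Session()
    with session.get(url, headers=headers or DEFAULT_HEADERS,
                     stream=True, timeout=timeout) as resp:
        # Raise on 403/404 instead of saving an error page as an image
        resp.raise_for_status()
        with open(path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=1024):
                if chunk:
                    fh.write(chunk)
    return path
```

Passing one shared requests.Session when downloading many images lets the connections be reused between requests.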

2. Methods supported by Scrapy

2.1 ImagesPipeline

Scrapy ships with ImagesPipeline and FilesPipeline for downloading images and files. In the simplest case, using ImagesPipeline only requires a few entries in settings.


  
# settings.py

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 500,
}
IMAGES_STORE = 'pictures'  # directory where images are stored
IMAGES_MIN_HEIGHT = 400  # filter out images smaller than 600x400
IMAGES_MIN_WIDTH = 600


  
# items.py

import scrapy


class PictureItem(scrapy.Item):
    image_urls = scrapy.Field()


  
# myspider.py

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import PictureItem


class PicSpider(CrawlSpider):
    name = 'pic'
    allowed_domains = ['qnong.com.cn']
    start_urls = ['http://www.qnong.com.cn/']
    rules = (
        # Follow every link on the site and pass each page to parse_item
        Rule(LinkExtractor(allow=r'.*?', restrict_xpaths=('//a[@href]')), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Yield one item per <img> tag, resolving relative URLs
        for img_url in response.xpath('//img/@src').extract():
            item = PictureItem()
            item['image_urls'] = [response.urljoin(img_url)]
            yield item

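2.2 Custom Pipeline

By default, when images are downloaded with the ImagesPipeline component, each image is saved under a name derived from the SHA1 hash of its URL.

For example:
Image URL: http://www.example.com/image.jpg
SHA1 hash: 3afec3b4765f8f0a07b78f98c07b83f013567a0a
Resulting file name: 3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg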
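
The default naming rule above can be approximated with the standard library alone. This is a sketch of the behaviour, not Scrapy's source code: default_image_path is a hypothetical name, and the full/ prefix and .jpg extension are assumed here to mirror where ImagesPipeline stores full-size images.

```python
import hashlib


def default_image_path(url):
    """Approximate the default storage path: full/<SHA1 of the URL>.jpg."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/%s.jpg" % image_guid
```

Because the name depends only on the URL, the same image URL always maps to the same file, but the name says nothing about the image's content or origin, which is the motivation for customizing it.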

To use custom image file names, override the file_path method of ImagesPipeline. Reference: https://doc.scrapy.org/en/lat...


  
# settings.py

ITEM_PIPELINES = {
    'qnong.pipelines.MyImagesPipeline': 500,
}

 
# items.py

import scrapy


class PictureItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()

# myspider.py

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import PictureItem


class PicSpider(CrawlSpider):
    name = 'pic'
    allowed_domains = ['qnong.com.cn']
    start_urls = ['http://www.qnong.com.cn/']
    rules = (
        # Follow every link on the site and pass each page to parse_item
        Rule(LinkExtractor(allow=r'.*?', restrict_xpaths=('//a[@href]')), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Yield one item per <img> tag, resolving relative URLs
        for img_url in response.xpath('//img/@src').extract():
            item = PictureItem()
            item['image_urls'] = [response.urljoin(img_url)]
            yield item

# pipelines.py

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Schedule one download request per image URL
        for img_url in item['image_urls']:
            yield scrapy.Request(img_url)

    def item_completed(self, results, item, info):
        # Keep the storage paths of successful downloads only
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Item contains no images')
        item['image_paths'] = image_paths
        return item

    # Newer Scrapy versions also pass the item as a keyword argument
    def file_path(self, request, response=None, info=None, *, item=None):
        # Name the file after the last segment of the image URL
        image_guid = request.url.split('/')[-1]
        return 'full/%s' % (image_guid)
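
One caveat with request.url.split('/')[-1] is that any query string stays in the file name, and a URL ending in a slash yields an empty name. A possible refinement is sketched below with the standard library only; image_name_from_url and the SHA1 fallback are assumptions for illustration, not part of the article's pipeline.

```python
import hashlib
import posixpath
from urllib.parse import unquote, urlparse


def image_name_from_url(url):
    """Return a storage path such as 'full/<name>' for an image URL."""
    # Use only the path component, so '?width=200' style queries are dropped
    name = posixpath.basename(unquote(urlparse(url).path))
    if not name:
        # No usable last segment: fall back to a hash of the whole URL
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".jpg"
    return "full/%s" % name
```

Inside MyImagesPipeline.file_path this could be called as image_name_from_url(request.url). Note that, as with the original version, two images that share a file name but live in different directories would still overwrite each other.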

 

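2.3 How FilesPipeline and ImagesPipeline work

FilesPipeline

In a spider, you scrape an item and put the URLs of the files to download into its file_urls field.
The item is returned from the spider and enters the item pipeline.
When the item reaches FilesPipeline, the URLs in the file_urls field are scheduled for download by Scrapy's standard scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, so they are processed before other pages are scraped. The item stays "locked" at this pipeline stage until the files have finished downloading (or have failed for some reason).
Once the files are downloaded, another field (files) is populated with the results. This field contains a list of dicts with information about the downloaded files, such as the download path, the original scraped URL (taken from the file_urls field), and the file checksum. The files in the files list keep the same order as in the original file_urls field. If a file fails to download, an error is logged and the file does not appear in the files field.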
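
The last step can be illustrated without running a crawl. In the sketch below, sample_results mimics the list of (success, info) pairs that the pipeline passes to item_completed, the same shape the custom pipeline unpacks with `for ok, x in results`. The sample URLs, paths, and checksums are made up for illustration, and collect_files is a hypothetical helper.

```python
def collect_files(results):
    """Keep the info dicts of successful downloads, preserving their order."""
    return [info for ok, info in results if ok]


# Made-up sample: the second download failed, so it carries an error
# object instead of an info dict.
sample_results = [
    (True, {"url": "http://www.example.com/a.pdf",
            "path": "full/aaa.pdf", "checksum": "checksum-a"}),
    (False, Exception("download failed")),
    (True, {"url": "http://www.example.com/c.pdf",
            "path": "full/ccc.pdf", "checksum": "checksum-c"}),
]

item = {"file_urls": ["http://www.example.com/a.pdf",
                      "http://www.example.com/b.pdf",
                      "http://www.example.com/c.pdf"]}
item["files"] = collect_files(sample_results)
```

After this runs, item["files"] holds the entries for a.pdf and c.pdf in their original order, and the failed b.pdf is simply absent.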

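ImagesPipeline

In a spider, you scrape an item and put the URLs of its images into the image_urls field.
The item is returned from the spider and enters the item pipeline.
When the item reaches ImagesPipeline, the URLs in the image_urls field are scheduled for download by Scrapy's standard scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, so they are processed before other pages are scraped. The item stays "locked" at this pipeline stage until the images have finished downloading (or have failed for some reason).
Once the images are downloaded, another field (images) is populated with the results. This field contains a list of dicts with information about the downloaded images, such as the download path, the original scraped URL (taken from the image_urls field), and the image checksum. The images in the images list keep the same order as in the original image_urls field. If an image fails to download, an error is logged and the image does not appear in the images field.

Scrapy can not only download images but also generate thumbnails of specified sizes.
Pillow is used to generate the thumbnails and to normalize images to JPEG/RGB format, so this library must be installed in order to use the images pipeline.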
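
Thumbnail generation is switched on through settings. A minimal sketch, assuming the IMAGES_THUMBS setting described in Scrapy's media pipeline documentation; the size names small and big and their dimensions are arbitrary example values.

```python
# settings.py

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 500,
}
IMAGES_STORE = 'pictures'

# Map a thumbnail name to its (width, height); one thumbnail per entry
# is generated for every downloaded image.
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
```

With settings like these, the thumbnails are expected to be written under IMAGES_STORE in a thumbs/ directory with one subdirectory per size name, alongside the full-size images in full/.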
