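Up to now my spider code has been working well, but when I try to run a batch of these spiders, everything works except that Scrapy downloads the images for only some of the spiders and nothing for the rest. All the spiders are identical except for start_urls. Any help is appreciated!
Here is my pipeline: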
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class DmozPipeline(object):
    def process_item(self, item, spider):
        return item

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)
        for nlabel in item['nlabel']:
            yield Request(nlabel)
        print item['image_urls']

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
settings.py:
BOT_NAME = 'dmoz2'
BOT_VERSION = '1.0'
SPIDER_MODULES = ['dmoz2.spiders']
NEWSPIDER_MODULE = 'dmoz2.spiders'
DEFAULT_ITEM_CLASS = 'dmoz2.items.DmozItem'
ITEM_PIPELINES = ['dmoz2.pipelines.MyImagesPipeline']
IMAGES_STORE = '/ps/dmoz2/images'
IMAGES_THUMBS = {
    # letting height be variable
    #'small': ('', 120),
    'small': (120, ''),
    #'big': ('', 240),
    'big': (300, ''),
}
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
items.py:
from scrapy.item import Item, Field
from scrapy.utils.python import unicode_to_str
def u_to_str(text):
    unicode_to_str(text,'latin-1','ignore')

class DmozItem(Item):
    category_ids = Field()
    ....
    image_urls = Field()
    image_paths = Field()
    pass
myspider.py:
from scrapy.spider import BaseSpider
from scrapy.spider import Spider
from scrapy.selector import HtmlXPathSelector
from scrapy import Selector
from scrapy.utils.url import urljoin_rfc
from scrapy.utils.response import get_base_url
from dmoz2.items import DmozItem
class DmozSpider(Spider):
    name = "fritos_jun2015"
    allowed_domains = ["walmart.com"]
    start_urls = [
        "http://www.walmart.com/ip/Fritos-Bar-B-Q-Flavored-Corn-Chips-9.75- oz/36915853",
        "http://www.walmart.com/ip/Fritos-Corn-Chips-1-oz-6-count/10900088",
    ]

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('/html/body/div[1]/section/section[4]/div[2]')
        items = []
        for site in sites:
            item = DmozItem()
            item['category_ids'] = ''
            .....
            item['image_urls'] = site.xpath('div[1]/div[3]/div[1]/div/div/div[2]/div/div/div[1]/div/div/img[2]/@src').extract()
            items.append(item)
        return items
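I would really like to know why this spider sometimes fetches the images and sometimes does not. All the spiders are identical except for the start_urls, which all come from the same allowed_domain. The image URLs are all absolute, and they are correct.
Thanks in advance. -TM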
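A common problem when screen scraping is that the server cuts off your connection because you are hitting it too often (sites do this to keep scrapers from accidentally taking the site down, to keep costs from climbing when someone pings the site every millisecond, and so on).
Try adding a sleep() call between requests. That way the server should not block you.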
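In Scrapy specifically, a blocking time.sleep() would pause the whole crawler, because Scrapy runs on Twisted's event loop, so the built-in throttling settings are probably a better place to start. Below is a minimal sketch for settings.py. It assumes rate limiting really is the cause, which the question does not confirm, and the delay values are picked only for illustration:

```python
# settings.py (sketch): throttle the crawl instead of calling sleep() by hand.
# All numbers below are illustrative assumptions, not tuned values.

# Wait about 2 seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 2

# Vary the actual wait between 0.5x and 1.5x of DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True

# Limit how many requests run in parallel against a single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy adjust the delay based on the latency it observes.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
```

If the missing images come back after slowing down, throttling was likely the cause. If not, the log lines for the failed image requests are the next thing to check.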