Asked by: 小点点

For loop to scrape TripAdvisor restaurant data


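I'm trying to list all the restaurants in Hong Kong along with their URLs. With the code below, I can currently scrape the first and second pages. However, I'd like the for loop at the bottom to be more dynamic and keep scraping until it reaches the number of entries I specify in range().

I'm still new to this, so any help would be great.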

#import libraries
import requests
from bs4 import BeautifulSoup
import csv


#scrape the first page separately because its URL differs from the URLs of the later pages
url0 = 'https://www.tripadvisor.com/Restaurants-g294217-Hong_Kong.html#EATERY_LIST_CONTENTS'
r = requests.get(url0)
data = r.text
soup = BeautifulSoup(r.text, "html.parser")
for link in soup.findAll('a', {'property_title'}):
    print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
    print link.string

#loop to move into the next pages. entries are in increments of 30 per page
for i in range(0, 120, 30):
    entries = str(30)
    #url format offsets the restaurants in increments of 30 after the oa; hence entries as variable
    url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + entries + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', {'property_title'}):
        print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
        print link.string
    break
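
One likely reason the loop above never gets past the second page: `entries` is always the string '30', regardless of `i`, and the unconditional `break` leaves the loop after its first pass. The sketch below (Python 3, no network calls) reproduces only that control flow; `offsets_requested` is a hypothetical helper name used for illustration.

```python
def offsets_requested(start, stop, step):
    """Mimic the loop above and record which 'oa' offsets it would request."""
    requested = []
    for i in range(start, stop, step):
        entries = str(30)           # constant: the loop counter i is never used
        requested.append(entries)
        break                       # exits the for loop after the first pass
    return requested


print(offsets_requested(0, 120, 30))   # → ['30']
print(list(range(0, 120, 30)))         # → [0, 30, 60, 90]
```

The second line shows the offsets that range() itself produces, which is what the URL would need to receive for the loop to move through the pages.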

1 answer

Anonymous user

I ended up adding a while loop to get it to loop the way I wanted. Hopefully this helps people in the future.

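#same imports as in the question
import requests
from bs4 import BeautifulSoup

max_offset = 120  #stop value for range(): offsets 30, 60 and 90 are fetched
for i in range(30, max_offset, 30):
    #compare against the numeric limit, not the built-in range function
    while i <= max_offset:
        #url format offsets the restaurants in increments of 30 after the oa; the loop counter supplies the offset
        url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + str(i) + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
        r1 = requests.get(url1)
        data1 = r1.text
        soup1 = BeautifulSoup(data1, "html.parser")
        for link in soup1.findAll('a', {'property_title'}):
            print('https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href'))
            print(link.string)
        #the break ends the while after one pass, so each offset is fetched exactly once
        break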
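
As a rough sketch of the "more dynamic" loop the question asks for, the Python 3 snippet below derives every listing URL from a target number of entries, so a single for loop covers the first page and the offset pages. The names `listing_urls` and `absolute_url` and the sample href are hypothetical. The snippet also assumes 30 entries per page (as stated in the question) and that each `href` is a site-relative path, which the thread does not confirm.

```python
from urllib.parse import urljoin

SITE = 'https://www.tripadvisor.com'
FIRST_PAGE = SITE + '/Restaurants-g294217-Hong_Kong.html#EATERY_LIST_CONTENTS'
PAGE_TEMPLATE = SITE + '/Restaurants-g294217-oa{offset}-Hong_Kong.html#EATERY_LIST_CONTENTS'


def listing_urls(max_entries, page_size=30):
    """Yield the listing-page URLs that cover the first max_entries restaurants."""
    for offset in range(0, max_entries, page_size):
        if offset == 0:
            # the first page has no 'oa' segment in its URL
            yield FIRST_PAGE
        else:
            yield PAGE_TEMPLATE.format(offset=offset)


def absolute_url(href):
    """Resolve an href against the site root (assumes a site-relative href)."""
    return urljoin(SITE, href)


for url in listing_urls(120):
    print(url)

# hypothetical href, for illustration only
print(absolute_url('/Restaurant_Review-g294217-d0000000-Reviews-Example.html'))
```

Each URL yielded by `listing_urls` could then be passed to `requests.get` and parsed with BeautifulSoup in the same way as in the code above.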