Asked by: 小点点

Python with Selenium/BeautifulSoup


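I am trying to extract real estate listing information from a site using Selenium and Beautiful Soup, following this tutorial: https://medium.com/@ben.sturm/scraping-house-listing-data-using-Selenium-and-Beautiful Soup-1CBB94BA9492

The goal is to collect all the href links on the first page, find the "next page" button, navigate to the next page, collect all the links on that page, and so on.

I tried to do this with a single function that repeats on every page, but I can't figure out why it isn't working. I'm new to coding, and the problem seems too trivial for me to have found an answer anywhere. Any help would be much appreciated.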

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time
import sys
import numpy as np
import pandas as pd
import regex as re


driver = webdriver.Chrome()
url = "http://property.shw.co.uk/searchproperties/Level2-0/Level1-0-181-236-167-165/Units/Development-or-House-and-Flat-or-Investment-or-Land-or-Office-or-Other/UnitIds-0/For-Sale"
driver.get(url)
try:
    wait = WebDriverWait(driver, 3)
    wait.until(EC.presence_of_element_located((By.ID, "body1")))
    print("Page is Ready!")
except TimeoutException:
    print("page took too long to load")


def get_house_links(url, driver, pages=3):
    house_links = []
    driver.get(url)
    for i in range(pages):
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        listings = soup.find_all("a", class_="L")
        page_data = [row['href'] for row in listings]
        house_links.append(page_data)
        time.sleep(np.random.lognormal(0, 1))
        next_button = soup.find_all("a", class_="pageingBlock darkBorder")
        next_button_link = ['http://property.shw.co.uk'+row['href'] for row in next_button]
        if i < 3:
            driver.get(next_button_link[0])
    return house_links
get_house_links(url, driver)

1 answer

Anonymous user

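class_="pageingBlock darkBorder" also matches the previous-page button, so on every page after the first, next_button_link[0] sends you back to the previous page. You need a more precise locator: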

next_button = soup.select('img[src*="propNext"]')
if next_button:
    next_button = next_button[0].find_parent('a')
    next_button_link = 'http://property.shw.co.uk' + next_button['href']
    driver.get(next_button_link)
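
For reference, here is a minimal sketch of how that locator could fit into the whole paging loop. The class "L" and the "propNext" image locator come from the code above; the page markup itself is assumed. The helper names (extract_listing_links, find_next_page_url), the delay parameter and the BASE_URL constant are hypothetical, introduced only for this illustration. It uses urljoin instead of string concatenation and extend instead of append, so the result is one flat list; neither is required by the fix.

```python
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: relative hrefs on the site resolve against this host.
BASE_URL = "http://property.shw.co.uk"


def extract_listing_links(html):
    """Return the href of every listing anchor (class "L") on one page."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", class_="L", href=True)]


def find_next_page_url(html, base_url=BASE_URL):
    """Return the absolute URL of the next page, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # Only the "next" arrow image has "propNext" in its src, so this
    # cannot match the previous-page button.
    next_img = soup.select_one('img[src*="propNext"]')
    if next_img is None:
        return None
    anchor = next_img.find_parent("a")
    if anchor is None or not anchor.get("href"):
        return None
    return urljoin(base_url, anchor["href"])


def get_house_links(driver, url, pages=3, delay=1.0):
    """Collect listing links from up to `pages` consecutive result pages."""
    house_links = []
    driver.get(url)
    for page in range(pages):
        html = driver.page_source
        house_links.extend(extract_listing_links(html))
        next_url = find_next_page_url(html)
        # Stop on the last requested page or when no next button exists.
        if next_url is None or page == pages - 1:
            break
        time.sleep(delay)  # small pause between page loads
        driver.get(next_url)
    return house_links
```

With a real browser this would be called as get_house_links(webdriver.Chrome(), url). Because the driver is passed in, the parsing helpers can also be checked against saved HTML without opening a browser.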