Asked by: 小点点

Selenium returns an empty list for span elements located via XPath


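I am trying to scrape some web elements by inspecting the page and identifying the XPath of the content I want to extract. For some elements I get the expected result; for others I do not. See the reproducible example below.

Loading the page I want to analyze: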

import pandas as pd
import time
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

options = Options()
options.set_preference("dom.push.enabled", False)
browser = webdriver.Firefox(options=options)
browser.get("https://medium.com/search")
browser.find_element_by_xpath("//input[@type='search']").send_keys("international development",Keys.ENTER)

Then I use XPath to identify the elements I want to look at:

article_2016_t_xpath = '//div[contains(@class,"postArticle--short")][.//time[contains(@datetime, "2016")]][//span[@class="readingTime"]]'
article_element_list_t_1 = browser.find_elements_by_xpath(article_2016_t_xpath)

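To extract the values I want, I now look up the time and span elements inside each article in the list. The result is a populated list of dates but an empty list of reading times. I also tried alternatives to article.find_element_by_tag_name, for example [article.text for node in WebDriverWait(browser, 10).until(EC.presence_of_all_elements_located((By.XPATH, ".//span[@class='readingTime']")))], but did not get the expected result.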

lista =[]
timelines=[]    
for article in article_element_list_t_1:
    readingtime = article.find_element_by_tag_name("span").text
    Timelines = article.find_element_by_tag_name("time").text
    timelines.append(Timelines)
    lista.append(readingtime)  
lista

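I would like to know 1) how to get the reading time of each article, which is held in a span on the page, and 2) why time is returned as text while span is not. When retrieving elements by tag_name from an element located via XPath, what are the key criteria to consider regarding the element's position inside the main div wrapper?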


1 answer

Anonymous user

1. You used the wrong kind of selector for the child elements.

2. You did not use any waits, so the page had not fully loaded.
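
To illustrate the first point, here is a minimal sketch that uses Python's standard xml.etree.ElementTree instead of Selenium, run against a hypothetical and heavily simplified stand-in for one article card. The followState span, the datetime value, and the overall layout are assumptions for illustration, not Medium's actual markup. It suggests why a lookup by bare tag name can miss the target: the first span in document order is returned, and that need not be the one carrying the reading time.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified article card; the real page markup is more complex.
CARD = """
<div class="postArticle postArticle--short">
  <span class="followState"></span>
  <time datetime="2016-02-29T10:00:00.000Z">Feb 29, 2016</time>
  <span class="readingTime" title="8 min read"></span>
</div>
"""

article = ET.fromstring(CARD)

# Lookup by tag name only: the first <span> in document order is returned.
first_span = article.find(".//span")
print(first_span.get("class"))

# Relative path plus a class test: selects the intended <span>.
reading_span = article.find(".//span[@class='readingTime']")
print(reading_span.get("class"))
```

The same idea carries over to Selenium: a relative XPath that starts with ".//" and tests the class narrows the search to the intended descendant of each article element.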

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

browser = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
browser.implicitly_wait(10)
browser.get("https://medium.com/search")
browser.find_element_by_xpath("//input[@type='search']").send_keys("international development", Keys.ENTER)

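# The last predicate uses .//span so it is scoped to each article;
# a bare //span would be satisfied by any matching span anywhere in the document.
article_2016_t_xpath = '//div[contains(@class,"postArticle--short")][.//time[contains(@datetime, "2016")]][.//span[@class="readingTime"]]'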
article_element_list_t_1 = browser.find_elements_by_xpath(article_2016_t_xpath)

lista = []
timelines = []
wait = WebDriverWait(browser, 20)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.readingTime')))
for article in article_element_list_t_1:
    readingtime = article.find_element_by_xpath(".//span[contains(@class, 'readingTime')]").get_attribute("title")
    Timelines = article.find_element_by_xpath(".//time").text
    timelines.append(Timelines)
    lista.append(readingtime)

for i in lista:
    print(i)
for i in timelines:
    print(i)

browser.close()
browser.quit()

readingtime is not text but the title attribute. timelines is text, but it needs to be located with XPath here. I also added an explicit wait using a CSS selector: wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.readingTime')))

8 min read
5 min read
3 min read
4 min read
6 min read
3 min read
5 min read
Feb 29, 2016
Sep 5, 2016
Mar 30, 2016
Jan 13, 2016
May 25, 2016
Aug 4, 2016
Mar 21, 2016
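
The difference between an element's text and its title attribute, noted in the answer, can be sketched without a browser. The snippet below again uses xml.etree.ElementTree on hypothetical one-element fragments (the exact markup and the datetime value are assumptions): the span has no text node and keeps its value in title, while time holds its value as text.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments modelled on the elements discussed above.
span = ET.fromstring('<span class="readingTime" title="8 min read"></span>')
time_el = ET.fromstring(
    '<time datetime="2016-02-29T10:00:00.000Z">Feb 29, 2016</time>'
)

# The span has no text content. ElementTree reports None here;
# Selenium's .text would give an empty string in the same situation.
print(repr(span.text))

# The value lives in the title attribute instead.
print(span.get("title"))

# The time element carries its value as ordinary text.
print(time_el.text)
```

This is why the answer reads the span with get_attribute("title") and the time element with .text.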