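I am trying to scrape some web elements by inspecting a web page and identifying the XPath of the content I want to extract. For some elements I get the expected result, and for others I do not. Please see the reproducible example below:
Loading the page I want to analyse: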
import pandas as pd
import time
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
options = Options()
options.set_preference("dom.push.enabled", False)
browser = webdriver.Firefox(options=options)
browser.get("https://medium.com/search")
browser.find_element_by_xpath("//input[@type='search']").send_keys("international development",Keys.ENTER)
Then I use XPath to identify the path to the elements I want to look at:
article_2016_t_xpath = '//div[contains(@class,"postArticle--short")][.//time[contains(@datetime, "2016")]][//span[@class="readingTime"]]'
article_element_list_t_1 = browser.find_elements_by_xpath(article_2016_t_xpath)
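To extract the values I want, I now look up the time and span elements inside each article in the list. The end result is a list of timelines, but also an empty list of reading times. I tried different variants in place of article.find_element_by_tag_name, for example [article.text for node in WebDriverWait(browser, 10).until(EC.presence_of_all_elements_located((By.XPATH, ".//span[@class='readingTime']")))], but did not get the expected result.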
lista =[]
timelines=[]
for article in article_element_list_t_1:
    readingtime = article.find_element_by_tag_name("span").text
    Timelines = article.find_element_by_tag_name("time").text
    timelines.append(Timelines)
    lista.append(readingtime)
lista
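I would like to know 1) how to get the reading time of each article, which is held in a span on the page, and 2) why time is returned as text while span is not. When looking up an element by tag_name from an element found via XPath, what are the key criteria to consider regarding the element's position inside the main div wrapper?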
1. You used the wrong type of selector for the child elements.
2. You are not using any waits, so your page has not fully loaded.
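As a rough illustration of point 1, the sketch below uses Python's standard-library xml.etree.ElementTree on a made-up, simplified fragment. The SAMPLE markup and the "avatar" span are assumptions for illustration, not Medium's actual HTML. It shows that a bare tag lookup returns the first span in document order, while a class-specific relative path reaches the intended span, whose value sits in an attribute rather than in its text. Selenium's find_element likewise returns the first match, although it reports missing text as an empty string rather than None.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page structure differs.
SAMPLE = """
<div class="postArticle--short">
  <span class="avatar"></span>
  <time datetime="2016-02-29T10:00:00Z">Feb 29, 2016</time>
  <span class="readingTime" title="8 min read"></span>
</div>
"""

article = ET.fromstring(SAMPLE)

# A bare tag lookup returns the first <span> in document order,
# which here is the unrelated "avatar" span with no text.
first_span = article.find(".//span")
print(repr(first_span.text))

# A relative path with a class predicate targets the intended span...
reading_span = article.find(".//span[@class='readingTime']")
# ...but that span has no text either; the value is in its title attribute.
print(repr(reading_span.text))
print(reading_span.get("title"))

# <time> carries its value as element text, so .text works there.
time_text = article.find(".//time").text
print(time_text)
```

The leading dot in ".//span" keeps the search inside the current article element; without it, an XPath starting with "//" is evaluated against the whole document.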
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
browser = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
browser.implicitly_wait(10)
browser.get("https://medium.com/search")
browser.find_element_by_xpath("//input[@type='search']").send_keys("international development", Keys.ENTER)
article_2016_t_xpath = '//div[contains(@class,"postArticle--short")][.//time[contains(@datetime, "2016")]][//span[@class="readingTime"]]'
article_element_list_t_1 = browser.find_elements_by_xpath(article_2016_t_xpath)
lista = []
timelines = []
wait = WebDriverWait(browser, 20)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.readingTime')))
for article in article_element_list_t_1:
    # The reading time is stored in the span's title attribute, not in its text
    readingtime = article.find_element_by_xpath(".//span[contains(@class, 'readingTime')]").get_attribute("title")
    # The leading dot keeps the search relative to the current article
    Timelines = article.find_element_by_xpath(".//time").text
    timelines.append(Timelines)
    lista.append(readingtime)
for i in lista:
    print(i)
for i in timelines:
    print(i)
browser.close()
browser.quit()
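readingtime is not text but the title attribute, so it is read with get_attribute("title"). timelines is text, but it needs to be located with a relative XPath here. In addition, I added an explicit wait via a CSS selector: wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.readingTime')))
Output: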
8 min read
5 min read
3 min read
4 min read
6 min read
3 min read
5 min read
Feb 29, 2016
Sep 5, 2016
Mar 30, 2016
Jan 13, 2016
May 25, 2016
Aug 4, 2016
Mar 21, 2016
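For readers unfamiliar with explicit waits (point 2 above), the idea can be sketched as a polling loop: call a condition repeatedly until it returns something truthy or a timeout expires. The code below is a minimal standard-library sketch of that idea only, not Selenium's implementation. wait_until and fake_lookup are hypothetical names, and fake_lookup simulates an element that only "appears" on the third check.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.1):
    """Call condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Simulated lookup: returns None twice, then a value on the third call.
attempts = {"count": 0}


def fake_lookup():
    attempts["count"] += 1
    return "readingTime span" if attempts["count"] >= 3 else None


found = wait_until(fake_lookup, timeout=2.0, poll=0.01)
print(found, attempts["count"])
```

In the answer's code, WebDriverWait(browser, 20).until(...) plays this role, with the expected condition standing in for fake_lookup.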