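I am trying to scrape everything inside the html tag.
Basically, it reaches the GoToUrl line and opens the page in the browser, but execution never gets past that line.
It just times out after 60 seconds.
This is the error: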
fail: Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware[1]
An unhandled exception has occurred while executing the request.
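Update: edited for privacy reasons.
I made an example for your scenario.
Let's say we want to scrape the posts on the home page, so we need a model to store our data: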
using System.Text.Json;

public class Post
{
    public string ImageSrc { get; set; }
    public string Category { get; set; }
    public string Title { get; set; }
    public string Description { get; set; }
    public string Date { get; set; }

    // Serialize the post as indented JSON so it prints readably
    public override string ToString()
    {
        return JsonSerializer.Serialize(this,
            new JsonSerializerOptions { WriteIndented = true });
    }
}
Next we need to initialize the Selenium WebDriver:
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
using var driver = new ChromeDriver(options);
// Here we setup a fluent wait
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(20))
{
    PollingInterval = TimeSpan.FromMilliseconds(250)
};
wait.IgnoreExceptionTypes(typeof(NoSuchElementException), typeof(StaleElementReferenceException));
// Navigate to the target url
driver.Navigate().GoToUrl("https://www.rtlnieuws.nl/zoeken?q=Philips+fraude");
// Accept cookies
var cookieBtn = wait.Until(d => d.FindElement(By.Id("onetrust-accept-btn-handler")));
cookieBtn.Click();
// Scroll to end
int count = 0;
await driver.ScrollToEndAsync(d =>
{
    // Determine when we are at the end of the page
    var tempCount = d.FindElements(By.XPath("//a[@class = 'search-item search-item--artikel']")).Count;
    if (tempCount != count)
    {
        count = tempCount;
        return false;
    }
    return true;
});
// List of post elements
var elements = wait.Until(d =>
{
    return d.FindElements(By.XPath("//div[@class = 'search-items']//a[contains(@class, 'search-item')]"));
});
// Print Posts in json format
foreach (var e in elements)
{
    var post = new Post
    {
        ImageSrc = e.FindElement(By.XPath(".//img")).GetAttribute("src"),
        Category = e.FindElement(By.XPath(".//div/span")).Text,
        Title = e.FindElement(By.XPath(".//div/h2")).Text,
        Description = e.FindElement(By.XPath(".//div[@class = 'search-item__content']/p[@class = 'search-item__description']")).Text,
        Date = e.FindElement(By.XPath(".//div[@class = 'search-item__content']//span[@class = 'search-item__date']")).Text,
    };
    Console.WriteLine(post);
}
// Just for this sample in order to wait to see our results
Console.ReadLine();
To use ScrollToEndAsync as shown above, you have to create an extension method:
using System;
using System.Threading.Tasks;
using OpenQA.Selenium;

public static class WebDriverExtensions
{
    // Scrolls to the bottom of the page repeatedly until pageEnd returns true
    public static async Task ScrollToEndAsync(this IWebDriver driver, Func<IWebDriver, bool> pageEnd)
    {
        while (!pageEnd.Invoke(driver))
        {
            var js = (IJavaScriptExecutor)driver;
            js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
            // Arbitrary delay between scrolling
            await Task.Delay(200);
        }
    }
}
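The end-of-page check passed to ScrollToEndAsync boils down to "the item count did not change since the last check". Here is a minimal sketch of that logic on its own, so it can be tried without a browser. StableCountDetector and IsStable are hypothetical names I made up for illustration; they are not part of Selenium, and the counts in the demo are simulated.

```csharp
using System;

// Simulated item counts observed after each scroll (made-up values).
var detector = new StableCountDetector();
foreach (var observed in new[] { 10, 20, 30, 30 })
{
    Console.WriteLine($"{observed}: {detector.IsStable(observed)}");
}
// Prints:
// 10: False
// 20: False
// 30: False
// 30: True

// Hypothetical helper (not a Selenium type): reports "end reached" once the
// observed count equals the count seen on the previous check.
public sealed class StableCountDetector
{
    // Starts at 0 like the `count` variable in the answer, so a page with
    // zero matching items is treated as finished on the first check.
    private int _lastCount = 0;

    public bool IsStable(int currentCount)
    {
        if (currentCount != _lastCount)
        {
            _lastCount = currentCount;
            return false;
        }
        return true;
    }
}
```

With a helper like this, the predicate could be written as `d => detector.IsStable(d.FindElements(By.XPath("...")).Count)`. Keep in mind that the check only compares two consecutive counts, so if the site needs longer than the 200 ms delay to load the next batch, the scroll may stop early; increasing the delay is probably the simplest remedy.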