Asked by: 小点点

Trying to scrape a page with Selenium and ChromeDriver. It loads the page but then times out


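I am trying to scrape everything inside the html tag.

Basically, execution reaches the GoToUrl line and the page opens in the browser, but the code never gets past that line.

It simply times out after 60 seconds.

This is the error: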

fail: Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware[1]
      An unhandled exception has occurred while executing the request.

Update: edited for privacy reasons.


1 answer

Anonymous user

I put together an example for your scenario.

Say we want to scrape the posts on the page, so we need a model to hold our data:

using System.Text.Json; // JsonSerializer, JsonSerializerOptions

public class Post
{
    public string ImageSrc { get; set; }
    public string Category { get; set; }
    public string Title { get; set; }
    public string Description { get; set; }
    public string Date { get; set; }

    public override string ToString()
    {
        return JsonSerializer.Serialize(this, 
              new JsonSerializerOptions { WriteIndented = true });
    }
}

Next, we need to initialize the Selenium WebDriver:

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
using var driver = new ChromeDriver(options);

// Set up a fluent wait: poll every 250 ms for up to 20 s,
// ignoring elements that are missing or stale while the page loads
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(20))
{
    PollingInterval = TimeSpan.FromMilliseconds(250)
};
wait.IgnoreExceptionTypes(typeof(NoSuchElementException), typeof(StaleElementReferenceException));

// Navigate to the target url
driver.Navigate().GoToUrl("https://www.rtlnieuws.nl/zoeken?q=Philips+fraude");

// Accept cookies
var cookieBtn = wait.Until(d => d.FindElement(By.Id("onetrust-accept-btn-handler")));
cookieBtn.Click();

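// Scroll until the number of loaded search items stops changing
int count = 0;
await driver.ScrollToEndAsync(d =>
{
    // If the item count changed since the last check, more items were loaded: keep scrolling
    var tempCount = d.FindElements(By.XPath("//a[@class = 'search-item search-item--artikel']")).Count;
    if (tempCount != count)
    {
        count = tempCount;
        return false;
    }

    // Count unchanged: assume we reached the end of the page
    return true;
});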

// List of post elements
var elements = wait.Until(d =>
{
    return d.FindElements(By.XPath("//div[@class = 'search-items']//a[contains(@class, 'search-item')]"));
});

// Print each post as indented JSON (see Post.ToString)
foreach (var e in elements)
{
    var post = new Post
    {
        ImageSrc = e.FindElement(By.XPath(".//img")).GetAttribute("src"),
        Category = e.FindElement(By.XPath(".//div/span")).Text,
        Title = e.FindElement(By.XPath(".//div/h2")).Text,
        Description = e.FindElement(By.XPath(".//div[@class = 'search-item__content']/p[@class = 'search-item__description']")).Text,
        Date = e.FindElement(By.XPath(".//div[@class = 'search-item__content']//span[@class = 'search-item__date']")).Text,
    };
    Console.WriteLine(post);
}

// Sample only: keep the console open so the results can be read
Console.ReadLine();

To use ScrollToEndAsync as shown above, you have to create an extension method:

using System;
using System.Threading.Tasks;
using OpenQA.Selenium;

public static class WebDriverExtensions
{
    // Scrolls to the bottom of the page repeatedly until pageEnd returns true
    public static async Task ScrollToEndAsync(this IWebDriver driver, Func<IWebDriver, bool> pageEnd)
    {
        var js = (IJavaScriptExecutor)driver;
        while (!pageEnd(driver))
        {
            js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");

            // Arbitrary delay between scrolls so new items have time to load
            await Task.Delay(200);
        }
    }
}
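
One caveat with the scroll loop: it stops the first time the item count comes back unchanged, so if the site needs longer than the 200 ms delay to load the next batch, scrolling may end early. Below is a minimal sketch of a more tolerant variant. It is not part of the original example, and ScrollHelper, ScrollUntilStableAsync, stableRounds, maxRounds and delayMs are hypothetical names made up for illustration. It takes plain delegates instead of an IWebDriver, so it can be tried without a browser.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper (not part of Selenium): an illustrative sketch only.
public static class ScrollHelper
{
    // Calls scroll() repeatedly until getCount() returns the same value
    // stableRounds times in a row, or until maxRounds scrolls have been made.
    // Returns the last observed count.
    public static async Task<int> ScrollUntilStableAsync(
        Func<int> getCount,
        Action scroll,
        int stableRounds = 3,
        int maxRounds = 100,
        int delayMs = 200)
    {
        int last = getCount();
        int stable = 0;

        for (int round = 0; round < maxRounds && stable < stableRounds; round++)
        {
            scroll();

            // Give the page some time to load more items
            await Task.Delay(delayMs);

            int current = getCount();
            if (current == last)
            {
                // Nothing new appeared in this round
                stable++;
            }
            else
            {
                // New items appeared: reset the stability counter
                stable = 0;
                last = current;
            }
        }

        return last;
    }
}
```

With Selenium, getCount could be something like () => driver.FindElements(By.XPath("//a[@class = 'search-item search-item--artikel']")).Count, and scroll could run the same window.scrollTo script through IJavaScriptExecutor. The maxRounds cap is there so the loop cannot run forever on a page that keeps loading content.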