I've searched the web and read and watched several online guides about my problem, but I'm stuck and would appreciate some input. I'm trying to build a web scraper that pulls from the Mergers & Acquisitions section of Reuters, and I've successfully written a program that scrapes the headline, summary, date, and link for each article. The problem I'm trying to solve is that I only want the program to scrape summaries from the headlines/articles that sit directly under the Mergers & Acquisitions column. Right now it scrapes every headline it sees with an article tag and a story class, so it picks up headlines not only from the Mergers & Acquisitions column but also from the Market News column.
When the bot starts scraping headlines from the Market News column, I keep getting AttributeErrors, because the Market News entries don't have any summaries, so there is no text to pull, and my code dies. I tried to work around this with try/except logic, thinking it would stop pulling headlines from the Market News column, but the code keeps pulling them anyway.
I also tried writing a new piece of code that, instead of finding every tag with article in it, narrows the search down first, figuring that if I gave the bot a more direct path it would scrape the articles top-down. That failed too, and now my head just hurts. Thanks in advance, everyone!
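One way around the AttributeError described above is to check for None instead of relying on try/except: find() returns None when a tag is missing, so you can guard before touching .p or .text. This is just a sketch, not the full scraper; the toy HTML below is an invented stand-in for a Market News entry that has no summary paragraph.

```python
from bs4 import BeautifulSoup

# A Market News style entry: story-content div present, but no <p> summary.
html = '<article><div class="story-content"><a href="/x"><h3>Headline</h3></a></div></article>'
article = BeautifulSoup(html, "html.parser").article

# Guard each step: find() gives None for a missing tag, and so does .p
# on a tag with no <p> child, so check before calling .text on it.
content = article.find("div", class_="story-content")
summary = content.p.text if content and content.p else None
print(summary)  # None, with no AttributeError, so a loop could keep going
```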
Here is my code:
from bs4 import BeautifulSoup
import requests

website = 'https://www.reuters.com/finance/deals/mergers'
source = requests.get(website).text
soup = BeautifulSoup(source, 'lxml')

for article in soup.find_all('article'):
    headline = article.div.a.h3.text.strip()
    # threw in strip() to fix the issue of a bunch of space being printed before the headline title.
    print(headline + "\n")

    date = article.find("span", class_='timestamp').text
    print(date)

    try:  # Put in Try/Except logic to keep the code going
        summary = article.find("div", class_="story-content").p.text
        print(summary + "\n")

        link = article.find('div', class_='story-content').a['href']
        # this bit [href] is the syntax needed for me to pull out the URL from the html code
        origin = "https://www.reuters.com/finance/deals/mergers"
        print(origin + link + "\n")
    except Exception as e:
        summary = None
        link = None

# This section here is another part I'm working on to get the scraper to go to
# the next page and continue scraping for headlines, dates, summaries, and links
next_page = soup.find('a', class_='control-nav-next')["href"]
source = requests.get(website + next_page).text
soup = BeautifulSoup(source, 'lxml')
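The fetch, parse, follow-the-next-link steps at the end of the code can be sketched as a loop. In the sketch below, a fake fetch() and a PAGES dict stand in for requests.get so it runs offline; the control-nav-next class name is copied from the snippet above, and the live Reuters markup may well differ.

```python
from bs4 import BeautifulSoup

BASE = "https://www.reuters.com/finance/deals/mergers"

# Fake two-page site: page 1 links to page 2, page 2 has no next link.
PAGES = {
    BASE: '<article><h3>A</h3></article><a class="control-nav-next" href="?page=2"></a>',
    BASE + "?page=2": '<article><h3>B</h3></article>',
}

def fetch(url):
    # Swap in requests.get(url).text here to run against the real site.
    return PAGES[url]

headlines, url = [], BASE
while url:
    soup = BeautifulSoup(fetch(url), "html.parser")
    headlines += [h.text for h in soup.find_all("h3")]
    nxt = soup.find("a", class_="control-nav-next")
    url = BASE + nxt["href"] if nxt else None  # stop when there is no next link

print(headlines)  # ['A', 'B']
```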
Change only this line:
for article in soup.select('div[class="column1 col col-10"] article'):
With this syntax, .select() finds every article tag underneath the div with class "column1 col col-10", which contains the headlines you are interested in and none of the headlines from the other columns.
Here are the docs: https://www.crummy.com/software/Beautifulsoup/bs4/doc/index.html?highlight=selectcss-selectors
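To see the scoped selector in action without hitting the live site, here is a minimal sketch. The HTML below is an invented stand-in for the page's two columns, with the class names copied from the selector above, so the real markup may have changed since.

```python
from bs4 import BeautifulSoup

# Two columns: only the first should be scraped.
html = """
<div class="column1 col col-10">
  <article><div class="story-content"><a href="/a"><h3> Deal A </h3></a>
    <p>Summary A</p></div></article>
</div>
<div class="column2 col col-4">
  <article><div class="story-content"><a href="/b"><h3> Market B </h3></a></div></article>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

headlines = []
# The [class="..."] attribute selector matches the class string exactly,
# so only article tags inside the first div are returned.
for article in soup.select('div[class="column1 col col-10"] article'):
    headlines.append(article.div.a.h3.text.strip())

print(headlines)  # ['Deal A'] -- the second column's headline is skipped
```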