Asked by: 小点点

When web scraping with BeautifulSoup, how do I reference a particular `<span>` tag that has no class or ID?


I am new to web scraping with Python and BeautifulSoup, and I am trying to collect news articles from the Sudan Tribune for a human rights project. The body text is contained in 'span' tags, and my end goal is to filter out any article that does not contain news of killings or human rights abuses. My problem is how to reference a specific body of text when each one is wrapped in a 'span' tag with no class or id to distinguish it.

So far my code gets the link and body text of every article, but I don't know how to call up a specific one; I can only get all of the links and body text at once. Ideally, I would like to be able to quickly reference the body text of a specific article and tell Python whether to include it based on my own criteria.

import requests
from bs4 import BeautifulSoup

URL = 'https://www.sudantribune.com/spip.php?rubrique1'
Source = requests.get(URL)
Soup = BeautifulSoup(Source.content, 'html.parser')
print("You are currently crawling the website -> " + URL)
links = []
for link in Soup.find_all('a'):
    links.append(link.get('href'))
print("The links to the articles from " + URL + " are:")
for link in links[45:55]:
    print("https://www.sudantribune.com/" + link)
Descriptions = Soup.find_all('span')
print(Descriptions)

I have only been using Python for a week, so any advice is greatly appreciated.


2 Answers

Anonymous user

Do you want to retrieve spans from different pages located at different URLs? If so, then for each URL you need to repeat the initial process of "getting" the data from that page and examining it.

import requests
from bs4 import BeautifulSoup

URL = 'https://www.sudantribune.com/spip.php?rubrique1'
Source = requests.get(URL)
Soup = BeautifulSoup(Source.content, 'html.parser')
print("You are currently crawling the website -> " + URL)
links = []
for link in Soup.find_all('a'):
    href = link.get('href')
    if href is None:  # skip anchors with no href attribute
        continue
    SubSource = requests.get("https://www.sudantribune.com/" + href)
    SubSoup = BeautifulSoup(SubSource.content, 'html.parser')
    Descriptions = SubSoup.find_all('span')  # spans from the article page, not the index page
    print(Descriptions)
    if SOME_CONDITION_YOU_SPECIFY:
        links.append(href)  # only append if it meets your criteria
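`SOME_CONDITION_YOU_SPECIFY` is a placeholder you would fill in yourself. As a minimal sketch of one possible condition, here is a hypothetical keyword filter (the keyword list is an example, not part of the original answer):

```python
# Hypothetical watch list of terms relevant to the project; adjust freely.
KEYWORDS = ["killing", "violence", "human rights", "attack"]

def contains_keywords(text, keywords=KEYWORDS):
    """Return True if any keyword appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)
```

You could then use `contains_keywords(SubSoup.get_text())` as the condition, so a page is kept only when its text mentions at least one term from the list.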

Anonymous user

If I were doing this, I would do something like:

for story in Soup.find_all("li"):
    span = story.find("span")  # might even be able to do "story.span"
    if is_this_interesting(span.text):
        store_interesting_article(....)
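Since the spans carry no class or id, a specific one can also be picked by position (`find_all` returns a list that can be indexed) or reached through its parent element, which keeps each span tied to its article. A minimal runnable sketch on hypothetical markup (the tag layout below is an assumption for illustration, not the real Sudan Tribune page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: several <span> tags with no class or id,
# each sitting inside an <li> next to the article's link.
html = """
<ul>
  <li><a href="article1.html">One</a><span>First body text</span></li>
  <li><a href="article2.html">Two</a><span>Second body text</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list, so a specific span can be taken by index...
spans = soup.find_all("span")
print(spans[1].text)  # -> Second body text

# ...or reached through its parent, pairing each span with its link.
for item in soup.find_all("li"):
    href = item.find("a").get("href")
    body = item.find("span").text
    print(href, "->", body)
```

Navigating from the parent is usually the more robust choice here, because it does not depend on the spans keeping a fixed position in the page.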