I'm new to web scraping with Python and BeautifulSoup, and I'm trying to collect news articles from the Sudan Tribune for a human rights project. The body text is contained in 'span' tags, and my end goal is to filter out every article that does not report killings or human rights abuses. My problem is: when each body of text is wrapped in a tag called 'span' with no class or id to distinguish them, how do I refer to one specific text body?
So far my code gets the link and body text of every article, but I don't know how to call up a specific one; I can only fetch all the links and body texts at once. Ideally, I'd like to be able to quickly reference a specific article's body text and tell Python whether to include it based on my own criteria.
import requests
from bs4 import BeautifulSoup

URL = 'https://www.sudantribune.com/spip.php?rubrique1'
Source = requests.get(URL)
Soup = BeautifulSoup(Source.content, 'html.parser')
print("You are currently crawling the website -> " + URL)

links = []
for link in Soup.find_all('a'):
    links.append(link.get('href'))

print("The links to the articles from " + URL + " are:")
for href in links[45:55]:
    print("https://www.sudantribune.com/" + href)

Descriptions = Soup.find_all('span')
print(Descriptions)
I have only been using Python for a week, so any advice would be greatly appreciated.
Do you want to retrieve spans from different pages located at different URLs? If so, for each URL you need to repeat the initial process of fetching that page's data and parsing it.
URL = 'https://www.sudantribune.com/spip.php?rubrique1'
Source = requests.get(URL)
Soup = BeautifulSoup(Source.content, 'html.parser')
print("You are currently crawling the website -> " + URL)

links = []
for link in Soup.find_all('a'):
    href = link.get('href')
    SubPage = requests.get("https://www.sudantribune.com/" + href)
    SubSoup = BeautifulSoup(SubPage.content, 'html.parser')
    Descriptions = SubSoup.find_all('span')
    print(Descriptions)
    if SOME_CONDITION_YOU_SPECIFY:
        links.append(href)  # Only append if it meets your criteria
If I were doing this, I would do something like:
for story in Soup.find_all("li"):
    span = story.find("span")  # might even be able to do "story.span"
    if is_this_interesting(span.text):
        store_interesting_article(....)