Asked by: 小点点

Extracting the heading and strong tag with Beautiful Soup


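I want to extract the text string from the heading inside a `div`, and the text inside the `strong` tag within it.

I can get the heading with `soup.h1`, but I want the `h1` that belongs to that specific `div`.

HTML:

So I would like to get "Here is the title" (and "( And a bit more! )"). Can anyone help?

Thanks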


2 answers

Anonymous user

You can use the `attrs` argument of `find`, for example:

soup.find('div', attrs={'class': 'site-content'}).h1
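
To see why scoping to the `div` matters, here is a minimal sketch. The markup is an assumption: the question's HTML is not shown, so this uses the string from the other answer plus a hypothetical page-level `<h1>Site name</h1>` added for contrast. It also shows `select_one` with a CSS selector as an equivalent way to write the same lookup.

```python
from bs4 import BeautifulSoup

# Assumed markup: a hypothetical page-level <h1> plus the <h1> inside div.site-content
markup = """
<h1>Site name</h1>
<div class="site-content">
  <h1>Here is the title<strong>( And a bit more! )</strong></h1>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")

first_h1 = soup.h1  # first <h1> anywhere in the document
content_div = soup.find('div', attrs={'class': 'site-content'})
scoped_h1 = content_div.h1  # first <h1> inside that div only

# The same scoped lookup written as a CSS selector
selected_h1 = soup.select_one('div.site-content h1')

print(first_h1.get_text())          # Site name
print(scoped_h1.get_text())         # Here is the title( And a bit more! )
print(scoped_h1.strong.get_text())  # ( And a bit more! )
print(selected_h1.get_text())       # Here is the title( And a bit more! )
```

Note that `find` returns `None` when no matching `div` exists, so real code would probably want to check `content_div` before touching `.h1`.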

Edit: to get only the direct text:

import bs4

for div in soup.find_all('div', attrs={'class': 'site-content'}):
    # Keep only the h1's own text nodes; nested tags such as <strong> are skipped
    print(''.join(x for x in div.h1.contents
                  if isinstance(x, bs4.element.NavigableString)))

With lxml and XPath, life is easier:

>>> from lxml import html
>>> root = html.parse('x.html')
>>> print(root.xpath('//div[@class="site-content"]/h1/text()'))
['Here is the title']
>>> print(root.xpath('//div[@class="site-content"]/h1//text()'))
['Here is the title', '( And a bit more! )']
>>> print(root.xpath('//div[@class="site-content"]/h1/strong/text()'))
['( And a bit more! )']
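
If there is no `x.html` file to hand, the same XPath expressions can be tried on an inline string. This is only a sketch: it assumes the markup looks like the `data` string in the other answer (with the closing `</div>` added), and it swaps `html.parse` for `lxml.html.fromstring`.

```python
from lxml import html

# Assumed markup, based on the string used in the other answer
markup = ('<div class="site-content">'
          '<h1>Here is the title<strong>( And a bit more! )</strong></h1>'
          '</div>')
root = html.fromstring(markup)

# /text() selects only text nodes that are direct children of the h1
direct = root.xpath('//div[@class="site-content"]/h1/text()')
# //text() selects text nodes at any depth below the h1
all_text = root.xpath('//div[@class="site-content"]/h1//text()')
# text of the nested <strong> only
strong = root.xpath('//div[@class="site-content"]/h1/strong/text()')

print(direct)    # ['Here is the title']
print(all_text)  # ['Here is the title', '( And a bit more! )']
print(strong)    # ['( And a bit more! )']
```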

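Anonymous user

Here is code that uses BeautifulSoup to extract the text string from the heading inside the div, and the text inside the `strong` tag.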

>>> from bs4 import BeautifulSoup
>>> data = """<div class="site-content"><h1>Here is the title<strong>( And a bit more! )</strong></h1>"""
>>> soup = BeautifulSoup(data, "html.parser")
>>> reqText = soup.find('h1').text
>>> print(reqText)
Here is the title( And a bit more! )
>>> reqText1 = soup.find('strong').text
>>> print(reqText1)
( And a bit more! )
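
Note that `.text` joins the heading's own text and the `<strong>` text with nothing in between. If a separator is wanted, `get_text` accepts one, and `stripped_strings` yields the pieces individually. A small sketch, assuming the same `data` string as above with the closing `</div>` added:

```python
from bs4 import BeautifulSoup

data = ('<div class="site-content">'
        '<h1>Here is the title<strong>( And a bit more! )</strong></h1>'
        '</div>')
soup = BeautifulSoup(data, "html.parser")
h1 = soup.find('h1')

# Join the text pieces with a single space, stripping each piece first
joined = h1.get_text(" ", strip=True)
# Or keep the pieces separate
parts = list(h1.stripped_strings)

print(joined)  # Here is the title ( And a bit more! )
print(parts)   # ['Here is the title', '( And a bit more! )']
```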

Or:

>>> data = """<div class="site-content"><h1>Here is the title<strong>( And a bit more! )</strong></h1>"""
>>> soup = BeautifulSoup(data, "html.parser")
>>> soup.find('strong').text
'( And a bit more! )'
>>> h1 = soup.find('h1')
>>> h1.strong.decompose()  # remove <strong> from the tree so only the h1's own text is left
>>> h1.get_text()
'Here is the title'
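
One thing to be aware of: `decompose()` removes the `<strong>` tag from `soup` itself, so `soup.strong` is `None` afterwards. If the original tree is still needed, one option is to decompose a copy of the heading instead. A minimal sketch, assuming the same `data` string with the closing `</div>` added, and relying on Beautiful Soup's support for `copy.copy()` on tags:

```python
import copy
from bs4 import BeautifulSoup

data = ('<div class="site-content">'
        '<h1>Here is the title<strong>( And a bit more! )</strong></h1>'
        '</div>')
soup = BeautifulSoup(data, "html.parser")

# Work on a detached copy of the <h1> so the parsed tree is left untouched
h1_copy = copy.copy(soup.h1)
h1_copy.strong.decompose()

title_only = h1_copy.get_text()
strong_text = soup.strong.get_text()  # still available in the original tree

print(title_only)   # Here is the title
print(strong_text)  # ( And a bit more! )
```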