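I want to extract the text string from the heading inside a <code>div</code>, and the text inside the <code>strong</code> tag within that heading.
I can get the heading with <code>soup.h1</code>, but I want the <code>h1</code> that is specific to the <code>div</code> with class <code>site-content</code>.
HTML:
<div class="site-content"><h1>Here is the title<strong>( And a bit more! )</strong></h1></div>
So I would like to get "Here is the title"
and "( And a bit more! )".
Can anyone help?
Thanks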
You can use the <code>attrs</code> argument of <code>find</code>, for example:
soup.find('div', attrs={'class': 'site-content'}).h1
Edit: to get only the direct text:
import bs4
for div in soup.find_all('div', attrs={'class': 'site-content'}):
    # Join only the h1's own text nodes, skipping nested tags such as <strong>.
    print(''.join(x for x in div.h1.contents
                  if isinstance(x, bs4.element.NavigableString)))
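To illustrate the scoping idea above, here is a minimal, self-contained sketch. The markup is assumed from the question, and the extra top-level <code>h1</code> is hypothetical, added only to show that the lookup stays inside the <code>div</code>. It uses the <code>class_</code> keyword and a CSS selector as alternatives to <code>attrs</code>; treat it as one possible approach rather than the only one.

```python
from bs4 import BeautifulSoup

# Markup assumed from the question; the first <h1> is a hypothetical
# extra heading, included to show that the search is scoped to the div.
HTML = """
<h1>Another heading</h1>
<div class="site-content">
  <h1>Here is the title<strong>( And a bit more! )</strong></h1>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Find the div first, then take the h1 nested inside it.
div = soup.find("div", class_="site-content")
h1 = div.h1

# The same element, located with a CSS selector instead.
h1_css = soup.select_one("div.site-content h1")

title_and_more = h1.get_text()   # all text under the h1, nested tags included
more = h1.strong.get_text()      # only the text inside <strong>

print(title_and_more)
print(more)
```

Note that <code>soup.h1</code> on this markup would return the first heading in the document, not the one inside the <code>div</code>, which is why the search starts from the <code>div</code>.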
Life is easier with lxml and XPath:
>>> from lxml import html
>>> root = html.parse('x.html')
>>> print(root.xpath('//div[@class="site-content"]/h1/text()'))
['Here is the title']
>>> print(root.xpath('//div[@class="site-content"]/h1//text()'))
['Here is the title', '( And a bit more! )']
>>> print(root.xpath('//div[@class="site-content"]/h1/strong/text()'))
['( And a bit more! )']
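If lxml is not installed, a similar path-based lookup can be sketched with the standard library's <code>xml.etree.ElementTree</code>. This is a different tool from the one used above: it supports only a subset of XPath and requires well-formed markup, so it is an approximation for simple cases, not a replacement for lxml on real-world HTML. The <code>body</code> wrapper and the inline markup are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Well-formed markup assumed from the question; the <body> wrapper is
# hypothetical, added so the div is a descendant of the parsed root.
MARKUP = (
    '<body><div class="site-content">'
    '<h1>Here is the title<strong>( And a bit more! )</strong></h1>'
    '</div></body>'
)

root = ET.fromstring(MARKUP)

# ElementTree's limited XPath supports attribute predicates like this one.
h1 = root.find(".//div[@class='site-content']/h1")

direct_text = h1.text                  # text before the first child element
strong_text = h1.find("strong").text   # text inside <strong>
all_text = "".join(h1.itertext())      # every text node under the h1

print(direct_text)
print(strong_text)
print(all_text)
```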
Code that uses BeautifulSoup to extract the text string from the heading inside the div and the text inside the <code>strong</code> tag:
>>> from bs4 import BeautifulSoup
>>> data = """<div class="site-content"><h1>Here is the title<strong>( And a bit more! )</strong></h1>"""
>>> soup = BeautifulSoup(data, "html.parser")
>>> reqText = soup.find('h1').text
>>> print(reqText)
Here is the title( And a bit more! )
>>> reqText1 = soup.find('strong').text
>>> print(reqText1)
( And a bit more! )
Or:
>>> data = """<div class="site-content"><h1>Here is the title<strong>( And a bit more! )</strong></h1>"""
>>> soup = BeautifulSoup(data, "html.parser")
>>> soup.find('strong').text
'( And a bit more! )'
>>> h1_tag = soup.find('h1')
>>> # Remove the nested <strong> from the tree, leaving only the heading's own text.
>>> h1_tag.strong.decompose()
>>> h1_tag.get_text()
'Here is the title'
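Because <code>decompose()</code> permanently removes the <code>strong</code> tag from the parsed tree, it may help to have a variant that leaves the tree untouched. The sketch below is one way to do that, assuming the same markup as above: it asks <code>find_all</code> for text nodes only (<code>string=True</code>) and limits the search to direct children (<code>recursive=False</code>).

```python
from bs4 import BeautifulSoup

# Same markup as in the answer above, with the closing </div> added.
data = (
    '<div class="site-content">'
    '<h1>Here is the title<strong>( And a bit more! )</strong></h1>'
    '</div>'
)
soup = BeautifulSoup(data, "html.parser")

h1_tag = soup.find("div", class_="site-content").h1

# Collect only the text nodes that are direct children of the h1;
# text nested inside <strong> is skipped, and nothing is removed.
direct_text = "".join(h1_tag.find_all(string=True, recursive=False))
print(direct_text)

# The <strong> tag is still in the tree and can be read afterwards.
strong_text = soup.strong.get_text()
print(strong_text)
```

With this approach the order of the two lookups no longer matters, since neither one modifies the document.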