我正在尝试刮网页,在那里我需要解码整个表到一个数据帧。我正为此使用漂亮的汤。在某些td
标记中,有一些span
标记没有任何文本。但这些值会显示在网页上的特定span标记中。
下面的HTML
代码对应于该网页,
<td>
<span class="nttu">::after</span>
<span class="ntbb">::after</span>
<span class="ntyc">::after</span>
<span class="nttu">::after</span>
</td>
但是,这个td
标记中显示的值是23.8
。我试着删掉它,但我收到的是空短信。
如何刮这个价值使用美丽的汤。
URL:https://en.tutiempo.net/climate/ws-432950.html
下面给出了我的用于报废表的代码,
http_url = "https://en.tutiempo.net/climate/01-2013/ws-432950.html"
retreived_data = requests.get(http_url).text
soup = BeautifulSoup(retreived_data, "lxml")
climate_table = soup.find("table", attrs={"class": "medias mensuales numspan"})
climate_data = climate_table.find_all("tr")
for data in climate_data[1:-2]:
table_data = data.find_all("td")
row_data = []
for row in table_data:
row_data.append(row.get_text())
climate_df.loc[len(climate_df)] = row_data
误解了您的问题,因为您引用了两个不同的URL。我现在明白你的意思了。
奇怪的是,在第二个表中,他们使用CSS来填充一些 输出:标记的内容。您需要做的是从 标记中取出那些特殊情况。一旦有了这些,就可以替换html源中的元素,并最终将其解析为DataFrame。我使用pandas,因为它在引擎盖下使用BeautifulSoup解析
标记。但我相信这会让你如愿以偿:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
http_url = "https://en.tutiempo.net/climate/01-2013/ws-432950.html"
retreived_data = requests.get(http_url).text
soup = BeautifulSoup(retreived_data, "lxml")
hiddenData = str(soup.find_all('style')[1])
hiddenSpan = {}
for group in re.findall(r'span\.(.+?)}',hiddenData):
class_attr = group.split('span.')[-1].split('::')[0]
content = group.split('"')[1]
hiddenSpan[class_attr] = content
climate_table = str(soup.find("table", attrs={"class": "medias mensuales numspan"}))
for k, v in hiddenSpan.items():
climate_table = climate_table.replace('<span class="%s"></span>' %(k), hiddenSpan[k])
df = pd.read_html(climate_table)[0]
print (df.to_string())
Day T TM Tm SLP H PP VV V VM VG RA SN TS FG
0 1 23.4 30.3 19 - 59 0 6.3 4.3 5.4 - NaN NaN NaN NaN
1 2 22.4 30.3 16.9 - 57 0 6.9 3.3 7.6 - NaN NaN NaN NaN
2 3 24 31.8 16.9 - 51 0 6.9 2.8 5.4 - NaN NaN NaN NaN
3 4 24.2 32 17.4 - 53 0 6 3.3 5.4 - NaN NaN NaN NaN
4 5 23.8 32 18 - 58 0 6.9 3.1 7.6 - NaN NaN NaN NaN
5 6 23.3 31 18.3 - 60 0 6.9 5 9.4 - NaN NaN NaN NaN
6 7 22.8 30.2 17.6 - 55 0 7.7 3.7 7.6 - NaN NaN NaN NaN
7 8 23.1 30.6 17.4 - 46 0 6.9 3.3 5.4 - NaN NaN NaN NaN
8 9 22.9 30.6 17.4 - 51 0 6.9 3.5 3.5 - NaN NaN NaN NaN
9 10 22.3 30 17 - 56 0 6.3 3.3 7.6 - NaN NaN NaN NaN
10 11 22.3 29.4 17 - 53 0 6.9 4.3 7.6 - NaN NaN NaN NaN
11 12 21.8 29.4 15.7 - 54 0 6.9 2.8 3.5 - NaN NaN NaN NaN
12 13 22.3 30.1 15.7 - 43 0 6.9 2.8 5.4 - NaN NaN NaN NaN
13 14 21.8 30.6 14.8 - 41 0 6.9 1.9 5.4 - NaN NaN NaN NaN
14 15 21.6 30.6 14.2 - 43 0 6.9 3.1 7.6 - NaN NaN NaN NaN
15 16 21.1 29.9 15.4 - 55 0 6.9 4.1 7.6 - NaN NaN NaN NaN
16 17 20.4 28.1 15.4 - 59 0 6.9 5 11.1 - NaN NaN NaN NaN
17 18 21.2 28.3 14.5 - 53 0 6.9 3.1 7.6 - NaN NaN NaN NaN
18 19 21.6 29.6 16.4 - 58 0 6.9 2.2 3.5 - NaN NaN NaN NaN
19 20 21.9 29.6 16.6 - 58 0 6.9 2.4 5.4 - NaN NaN NaN NaN
20 21 22.3 29.9 17.5 - 55 0 6.9 3.1 5.4 - NaN NaN NaN NaN
21 22 21.9 29.9 15.1 - 46 0 6.9 4.3 7.6 - NaN NaN NaN NaN
22 23 21.3 29 15.2 - 50 0 6.9 3.3 5.4 - NaN NaN NaN NaN
23 24 21.3 28.8 14.6 - 45 0 6.9 3 5.4 - NaN NaN NaN NaN
24 25 21.6 29.1 15.5 - 47 0 7.7 4.8 7.6 - NaN NaN NaN NaN
25 26 21.8 29.2 14.6 - 41 0 6.9 2.8 3.5 - NaN NaN NaN NaN
26 27 22.3 30.1 15.6 - 40 0 6.9 2.4 5.4 - NaN NaN NaN NaN
27 28 22.4 30.3 16 - 51 0 6.9 2.8 3.5 - NaN NaN NaN NaN
28 29 23 30.3 16.9 - 53 0 6.6 2.8 5.4 - NaN NaN NaN o
29 30 23.1 30 17.8 - 54 0 6.9 5.4 7.6 - NaN NaN NaN NaN
30 31 22.1 29.8 17.3 - 54 0 6.9 5.2 9.4 - NaN NaN NaN NaN
31 Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals:
32 NaN 22.3 30 16.4 - 51.6 0 6.9 3.5 6.3 NaN 0 0 0 1
相关问题