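I am trying to scrape the following website: https://www.basketball-reference.com/players/a/
My end goal is to build a DataFrame from that table, plus a new column containing each player's index (ID). For example, for the first player in the table, this would be abdelal01.
My current attempt: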
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.basketball-reference.com/players/a"
# this is the HTML from the given URL
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")

# column names come from the <th> cells of the first row
headers = [th.getText() for th in soup.findAll('tr')[0].findAll('th')]
rows = soup.findAll('tr')

# player names live in the <th> cell of each row
player_names = [[td.getText() for td in rows[i].findAll('th')]
                for i in range(len(rows))]
names = pd.DataFrame(player_names, columns=headers)
names.head(10)

# the remaining columns live in the <td> cells of each row
player_stats = [[td.getText() for td in rows[i].findAll('td')]
                for i in range(len(rows))]
stats = pd.DataFrame(player_stats, columns=headers[1:])
stats['Player'] = names['Player']
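This does rebuild the table in full, but without the URLs that point to each player. Is there a simpler way to do this than building two DataFrames, which I only do because the name and the other columns sit in different elements (th vs. td) in the HTML?
And what would be the best way to collect the index for each player?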
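The easiest way to extract the table data is with the pandas package, which also makes the data easy to manipulate afterwards.
The read_html() function scrapes every table it finds on the page and returns them as a list of DataFrames, so [0] selects the first table.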
import pandas as pd
df = pd.read_html('https://www.basketball-reference.com/players/a/')[0]
df
                   Player  From    To  Pos    Ht   Wt         Birth Date                  Colleges
  0        Alaa Abdelnaby  1991  1995  F-C  6-10  240      June 24, 1968                      Duke
  1       Zaid Abdul-Aziz  1969  1978  C-F   6-9  235      April 7, 1946                Iowa State
  2  Kareem Abdul-Jabbar*  1970  1989    C   7-2  225     April 16, 1947                      UCLA
  3    Mahmoud Abdul-Rauf  1991  2001    G   6-1  162      March 9, 1969                       LSU
  4     Tariq Abdul-Wahad  1998  2003    F   6-6  223   November 3, 1974  Michigan, San Jose State
...                   ...   ...   ...  ...   ...  ...                ...                       ...
161         Dennis Awtrey  1971  1982    C  6-10  235  February 22, 1948               Santa Clara
162          Gustavo Ayón  2012  2014    C  6-10  250      April 1, 1985                       NaN
163            Jeff Ayres  2010  2016    F   6-9  240     April 29, 1987             Arizona State
164         Deandre Ayton  2019  2020    C  6-11  250      July 23, 1998                   Arizona
165      Kelenna Azubuike  2007  2012    G   6-5  220  December 16, 1983                  Kentucky
df['Player']
0 Alaa Abdelnaby
1 Zaid Abdul-Aziz
2 Kareem Abdul-Jabbar*
3 Mahmoud Abdul-Rauf
4 Tariq Abdul-Wahad
...
161 Dennis Awtrey
162 Gustavo Ayón
163 Jeff Ayres
164 Deandre Ayton
165 Kelenna Azubuike
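
read_html() keeps only the text of each cell, so the player ID asked about in the question (for example abdelal01) does not end up in df. One possible way to get both in a single pass is to walk the rows with BeautifulSoup and take the ID from the link in the first cell. The sketch below is not tested against the live page: SAMPLE_HTML is a hand-written stand-in for the table markup, and parse_players, player_id_from_href and the player_id column are names made up for this example. It assumes each player row starts with a cell containing a link of the form /players/a/<id>.html.

```python
import posixpath

import pandas as pd
from bs4 import BeautifulSoup

# Hand-written stand-in for the page's table markup (an assumption,
# not a copy of the live page), trimmed to three columns.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>Player</th><th>From</th><th>To</th></tr>
  </thead>
  <tbody>
    <tr><th><a href="/players/a/abdelal01.html">Alaa Abdelnaby</a></th><td>1991</td><td>1995</td></tr>
    <tr><th><a href="/players/a/abdulka01.html">Kareem Abdul-Jabbar*</a></th><td>1970</td><td>1989</td></tr>
  </tbody>
</table>
"""


def player_id_from_href(href):
    """Turn '/players/a/abdelal01.html' into 'abdelal01'."""
    return posixpath.splitext(posixpath.basename(href))[0]


def parse_players(html):
    """Build one DataFrame with the table columns plus a player_id column."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]

    records = []
    for tr in table.find("tbody").find_all("tr"):
        # Names sit in <th> and the other values in <td>; collect both in order.
        cells = tr.find_all(["th", "td"])
        if len(cells) != len(headers):
            continue  # skip rows that do not match the header layout
        link = cells[0].find("a", href=True)
        if link is None:
            continue  # skip rows without a player link
        record = {h: c.get_text(strip=True) for h, c in zip(headers, cells)}
        record["player_id"] = player_id_from_href(link["href"])
        records.append(record)
    return pd.DataFrame(records)


players = parse_players(SAMPLE_HTML)
print(players)
```

For the real page, the same function could presumably be fed the downloaded HTML, e.g. parse_players(urlopen(url).read()), as long as the markup matches the assumptions above. All values come back as strings, so numeric columns such as From and To would still need converting, for example with pd.to_numeric.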