I'm working on my Supply Chain Management school project and want to analyze the daily postings on a website to analyze and record the industry's demand for services/products. The specific page changes every day, with a varying number of containers and pages:

https://buyandsell.gc.ca/procurement-data/search/site?f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice

The code generates a CSV file by stripping the HTML tags and recording the data points (don't mind the headers). I tried using a "for" loop, but the code still only scans the first page.

Python knowledge level: beginner, learning "the hard way" through YouTube and Google. I found an example that worked at my level of understanding, but ran into trouble combining people's different solutions.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

for page in range(1, 3):
    my_url = 'https://buyandsell.gc.ca/procurement-data/search/site?f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today'
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll("div", {"class": "rc"})

    filename = "BuyandSell.csv"
    f = open(filename, "w")
    headers = "Title, Publication Date, Closing Date, GSIN, Notice Type, Procurement Entity\n"
    f.write(headers)

    for container in containers:
        Title = container.h2.text
        publication_container = container.findAll("dd", {"class": "data publication-date"})
        Publication_date = publication_container[0].text
        closing_container = container.findAll("dd", {"class": "data date-closing"})
        Closing_date = closing_container[0].text
        gsin_container = container.findAll("li", {"class": "first"})
        Gsin = gsin_container[0].text
        notice_container = container.findAll("dd", {"class": "data php"})
        Notice_type = notice_container[0].text
        entity_container = container.findAll("dd", {"class": "data procurement-entity"})
        Entity = entity_container[0].text

        print("Title: " + Title)
        print("Publication_date: " + Publication_date)
        print("Closing_date: " + Closing_date)
        print("Gsin: " + Gsin)
        print("Notice: " + Notice_type)
        print("Entity: " + Entity)

        f.write(Title + "," + Publication_date + "," + Closing_date + "," + Gsin + "," + Notice_type + "," + Entity + "\n")

    f.close()
Actual results:

The code only generates a CSV file for the first page.

The code also does not write on top of what was already scraped (from day to day).

Expected results:

The code scans the next page and recognizes when there are no more pages to go through.

The CSV file will contain 10 rows per page (except for whatever the last page holds, since the number isn't always 10).

The code will write on top of what was already scraped (for more advanced analysis using Excel tools and historical data).
Some might say using pandas is overkill, but personally I like working with it, for example for creating tables and writing to file.

There is probably also a more robust way to go page by page, but I just wanted to get this to you so you can work with it.

For now I just hard-coded the next-page value (I arbitrarily picked 20 pages as a maximum), so it starts at page 1 and then iterates through 20 pages (or stops when it reaches an invalid page).
import pandas as pd
from bs4 import BeautifulSoup
import requests
import os

filename = "BuyandSell.csv"

# Initialize an empty 'results' dataframe
results = pd.DataFrame()

# Iterate through the pages
for page in range(0, 20):
    url = 'https://buyandsell.gc.ca/procurement-data/search/site?page=' + str(page) + '&f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today'

    page_html = requests.get(url).text
    page_soup = BeautifulSoup(page_html, "html.parser")
    containers = page_soup.findAll("div", {"class": "rc"})

    # Get data from each container
    if containers:
        for each in containers:
            title = each.find('h2').text.strip()
            publication_date = each.find('dd', {'class': 'data publication-date'}).text.strip()
            closing_date = each.find('dd', {'class': 'data date-closing'}).text.strip()
            gsin = each.find('dd', {'class': 'data gsin'}).text.strip()
            notice_type = each.find('dd', {'class': 'data php'}).text.strip()
            procurement_entity = each.find('dd', {'class': 'data procurement-entity'}).text.strip()

            # Create a 1-row dataframe
            temp_df = pd.DataFrame([[title, publication_date, closing_date, gsin, notice_type, procurement_entity]],
                                   columns=['Title', 'Publication Date', 'Closing Date', 'GSIN', 'Notice Type', 'Procurement Entity'])

            # Append that row to the 'results' dataframe
            results = pd.concat([results, temp_df], ignore_index=True)

        print('Acquired page ' + str(page + 1))

    else:
        print('No more pages')
        break

# If there is already a file saved
if os.path.isfile(filename):
    # Read in the previously saved file
    df = pd.read_csv(filename)
    # Append the newest results
    df = pd.concat([df, results], ignore_index=True)
    # Drop any duplicates (in case the newest results aren't really new)
    df = df.drop_duplicates()
    # Save the previous file, with the appended results
    df.to_csv(filename, index=False)
else:
    # If a previous file was not already saved, save a new one
    df = results.copy()
    df.to_csv(filename, index=False)
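The key differences from your original script are that the page number is passed through the URL's page= query parameter on every iteration, the loop breaks as soon as a page returns no result containers (that is how "no more pages" is recognized), and each run is appended to the existing CSV with drop_duplicates() guarding against re-adding notices that were already captured.

Once a few days of data have accumulated, you can pull the same file back into pandas for the demand analysis you mentioned. This is only a minimal sketch, assuming the BuyandSell.csv produced above and its column names; the groupby/value_counts calls are just illustrative examples, not part of the scraper itself.

import pandas as pd

# Load the accumulated history written by the scraper above (assumed filename/columns)
history = pd.read_csv("BuyandSell.csv")

# Example: how many notices were published per day
notices_per_day = history.groupby("Publication Date").size()
print(notices_per_day)

# Example: which procurement entities post the most notices
top_entities = history["Procurement Entity"].value_counts().head(10)
print(top_entities)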