Asked by: 小点点

Unable to scrape names from the next page using requests


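I'm trying to use a Python script to parse names that span multiple pages of a website. With my current attempt I can get the names from the landing page. However, I can't find any way to get the names from the following pages using requests and BeautifulSoup.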

Website link

What I've tried so far:

import requests
from bs4 import BeautifulSoup

url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"

with requests.Session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for elem in soup.select("table#gvContractors tr:has([id*='_lblName'])"):
        name = elem.select_one("span[id*='_lblName']").get_text(strip=True)
        print(name)

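I then modified my script to fetch only the second page, to check that it works once the next-page button is involved. Unfortunately, it still returns the data from the first page: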

import requests
from bs4 import BeautifulSoup

url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"

with requests.Session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['__EVENTARGUMENT'] = 'Page$Next'
    payload.pop('btnClose')
    payload.pop('btnMapClose')
    res = s.post(url,data=payload,headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
        'X-Requested-With':'XMLHttpRequest',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Referer': 'https://proximity.niceic.com/mainform.aspx?PostCode=YO95',
        })
    sauce = BeautifulSoup(res.text,"lxml")
    for elem in sauce.select("table#gvContractors tr:has([id*='_lblName'])"):
        name = elem.select_one("span[id*='_lblName']").get_text(strip=True)
        print(name)

1 answer

Anonymous user

Navigation to the next page is performed via a POST request that carries a __VIEWSTATE cursor.

How you can do it with requests:

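  • Send a GET request to the first page.

  • Parse the data you need and the __VIEWSTATE cursor.

  • Prepare a POST request for the next page using the cursor you received.

  • Run it, then parse the next page's data and its new cursor.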

I won't provide any code, because that would mean writing almost the entire crawler.
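Short of a full crawler, the token-parsing step on its own might look like the sketch below. It uses only the standard library and collects every hidden `<input>` from a page. The names `HiddenInputCollector` and `extract_hidden_fields`, and the shortened sample markup with its token values, are made up for illustration and are not taken from the real site.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and "name" in attr:
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of all hidden form fields found in the given HTML."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields


# Hypothetical, shortened page source; real tokens are far longer.
sample_html = """
<form id="form1">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTA4==" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" value="CA0B0334" />
  <input type="hidden" name="__EVENTVALIDATION" value="AAECAwQF==" />
  <input type="text" name="txtPostCode" value="YO95" />
</form>
"""

tokens = extract_hidden_fields(sample_html)
print(sorted(tokens))
```

The resulting dict can serve as the starting point for the POST payload; only the hidden fields are kept, so visible inputs such as the text box are left out.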

==== Added ====

You almost got it, but you missed two important things.

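  • Headers must be sent with the first GET request. Without them we receive broken tokens (easy to spot visually: they don't end with ==).

  • We need to add __ASYNCPOST to the payload we send. (Interestingly, it is not the Boolean True but the string 'true'.)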
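To see why the string matters, here is a minimal sketch using the standard library's `urlencode`. As far as I can tell, requests form-encodes a dict passed as `data=` the same way for this case, but treat that as an assumption. The variable names are illustrative only.

```python
from urllib.parse import urlencode

# A Python boolean is stringified with a capital "T" ...
as_bool = urlencode({"__ASYNCPOST": True})
# ... whereas the string is sent exactly as written.
as_str = urlencode({"__ASYNCPOST": "true"})

print(as_bool)  # __ASYNCPOST=True
print(as_str)   # __ASYNCPOST=true
```

Passing the lowercase string avoids relying on how the server treats other spellings, which the observations above do not cover.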

Here is the code. I removed bs4 and added lxml (I don't like bs4, it is very slow). We know exactly which data we need to send, so let's parse just a few inputs.

    import re
    import requests
    from lxml import etree
    
    
    def get_nextpage_tokens(response_body):
        """ Parse tokens from XMLHttpRequest response for making next request to next page and create payload """
        try:
            payload = dict()
            payload['ToolkitScriptManager1'] = 'UpdatePanel1|gvContractors'
            payload['__EVENTTARGET'] = 'gvContractors'
            payload['__EVENTARGUMENT'] = 'Page$Next'
            payload['__VIEWSTATEENCRYPTED'] = ''
            payload['__VIEWSTATE'] = re.search(r'__VIEWSTATE\|([^\|]+)', response_body).group(1)
            payload['__VIEWSTATEGENERATOR'] = re.search(r'__VIEWSTATEGENERATOR\|([^\|]+)', response_body).group(1)
            payload['__EVENTVALIDATION'] = re.search(r'__EVENTVALIDATION\|([^\|]+)', response_body).group(1)
            payload['__ASYNCPOST'] = 'true'
            return payload
        except AttributeError:  # re.search() returned None: a token is missing
            return None
    
    
    if __name__ == '__main__':
        url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"
    
        headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
                'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
                'Referer': 'https://proximity.niceic.com/mainform.aspx?PostCode=YO95',
                }
    
        with requests.Session() as s:
            page_num = 1
            r = s.get(url, headers=headers)
            parser = etree.HTMLParser()
            tree = etree.fromstring(r.text, parser)
    
            # Creating payload
            payload = dict()
            payload['ToolkitScriptManager1'] = 'UpdatePanel1|gvContractors'
            payload['__EVENTTARGET'] = 'gvContractors'
            payload['__EVENTARGUMENT'] = 'Page$Next'
            payload['__VIEWSTATE'] = tree.xpath("//input[@name='__VIEWSTATE']/@value")[0]
            payload['__VIEWSTATEENCRYPTED'] = ''
            payload['__VIEWSTATEGENERATOR'] = tree.xpath("//input[@name='__VIEWSTATEGENERATOR']/@value")[0]
            payload['__EVENTVALIDATION'] = tree.xpath("//input[@name='__EVENTVALIDATION']/@value")[0]
            payload['__ASYNCPOST'] = 'true'
            headers['X-Requested-With'] = 'XMLHttpRequest'
    
            while True:
                page_num += 1
                res = s.post(url, data=payload, headers=headers)
    
                print(f'page {page_num} data: {res.text}')  # FIXME: Parse data
    
                payload = get_nextpage_tokens(res.text)  # Creating payload for next page
                if not payload:
                    # Break if we got no tokens - maybe it was last page (it must be checked)
                    break
    

Important

The response is not well-formed HTML, so you will have to process it yourself: cut out the table or something similar. Good luck!
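One way to cut out the table is sketched below. It assumes the async response uses the usual ASP.NET partial-postback layout of repeated `length|type|id|content|` segments, which is what the `__VIEWSTATE\|...` regexes above rely on. It also assumes the length counts characters. The helper names, the sample names and the sample markup are made up, and the regex for the name spans is only a rough stand-in for a real HTML parser.

```python
import re


def make_segment(kind, ident, content):
    """Build one segment in the assumed layout (used only for the sample)."""
    return f"{len(content)}|{kind}|{ident}|{content}|"


def parse_delta(body):
    """Split a 'length|type|id|content|' stream into (type, id, content) tuples."""
    segments = []
    pos = 0
    while pos < len(body):
        length_end = body.index("|", pos)
        length = int(body[pos:length_end])
        kind_end = body.index("|", length_end + 1)
        ident_end = body.index("|", kind_end + 1)
        content_start = ident_end + 1
        # Slice by the declared length, so '|' inside the content is harmless.
        content = body[content_start:content_start + length]
        segments.append((body[length_end + 1:kind_end],
                         body[kind_end + 1:ident_end],
                         content))
        pos = content_start + length + 1  # skip the '|' that closes the segment
    return segments


# Hypothetical response body; real ones are much longer.
panel_html = (
    '<table id="gvContractors">'
    '<tr><td><span id="gvContractors_ctl02_lblName">Smith | Sons Ltd</span></td></tr>'
    '<tr><td><span id="gvContractors_ctl03_lblName">Jones Electrical</span></td></tr>'
    '</table>'
)
sample_body = (
    make_segment("updatePanel", "UpdatePanel1", panel_html)
    + make_segment("hiddenField", "__VIEWSTATE", "dDwtMTA4==")
)

segments = parse_delta(sample_body)
# Keep only the HTML fragments meant for update panels.
panels = [content for kind, _, content in segments if kind == "updatePanel"]
names = re.findall(r'<span[^>]*id="[^"]*_lblName"[^>]*>(.*?)</span>', panels[0])
print(names)
```

Each extracted panel is an ordinary HTML fragment, so it can also be handed to lxml or bs4 in place of the regex.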