Asked by: 小点点

Scraping Cyrillic text from a <td> using beautifulsoup4


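I am scraping a website with car offers, where each listing has a model, price, mileage and so on. I am trying to extract the mileage, which is just Cyrillic text inside a tag. However, whatever I do, it says there is no text there.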

Here is the HTML:

<tr class="odd">
    <td style="padding-right:0px;" valign="top" width="200">
        <a href="offer/5ef37bc31cd64405946eff12">
            <img align="left" border="0" onmouseout="UnTip()"
                 onmouseover="Tip('&lt;div &gt;&lt;div style=\'float:left;\' class=\'ver15black\'&gt;&lt;b&gt;Citroen Xsara Picasso 1.6 hdi&lt;/b&gt;&lt;/div&gt;&lt;div style=\'float:right;text-align:right;\'&gt;&lt;span class=\'ver20black\'&gt;&lt;strong&gt;3,500&lt;/strong&gt;&lt;/span&gt;&lt;br&gt;ЛЕВА&lt;/strong&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style=\'clear:both;padding-top:5px;\'&gt;&lt;/div&gt;&lt;center&gt;&lt;img width=448 height=336 src=\'https://g1-bg.cars.bg/2020-06-24_2/5ef37adfca1c397c9015b753o.jpg\'&gt;&lt;/center&gt;&lt;div style=\'clear:both\'&gt;&lt;/div&gt;&lt;div class=\'ver13black\' style=\'padding-top:5px;\'&gt;дизел,  2007 (нов внос) , 170011 км, Гоце Делчев&lt;/div&gt;')"
                 src="https://g1-bg.cars.bg/2020-06-24_2/5ef37adfca1c397c9015b753b.jpg" style="padding-right:10px;"
                 width="200"/>
        </a></td>
    <td align="left" style="border-left:0px;padding-left:0px;" valign="top" width="360">
        <span style="color:#808080;font-size: 0.85em;"><i>днес, 21:12</i></span><br/>
        <a class="ver15black" href="offer/5ef37bc31cd64405946eff12"><span class="ver15black"><b>Citroen Xsara Picasso 1.6 hdi</b></span></a>
        <br>
        дизел, 170,011 км
        <div style="word-break: break-all;margin-top: 10px; font-style: italic; font-size: 0.9em; /*line-height: 1.5em;*/ color:#666666;">
            ЛИЗИНГ БЕЗ ДОКАЗВАНЕ НА ДОХОДИ С ИЗКЛЮЧИТЕЛНО ГЪВКАВИ УСЛОВИЯ Aвтомобила е нов внос от ИТАЛИЯ на реални
            километри перфектен мотор, скорости, ходова ча...
        </div>
        </br></td>
    <td align="center" valign="top" width="80"><span class="year">2007</span><br/>
        нов внос
    </td>
    <td align="right" valign="top" width="120">
        <span class="ver20black"><strong>3,500</strong></span><br/>
        ЛЕВА
    </td>
    <td align="center" valign="top" width="150">
        <span class="ownerName">частно лице</span>
        <br/>
        <img height="5" src="https://assets.cars.bg/desktop/images/px.gif" width="1"/>
        <br/>
        <span class="cityName">Гоце Делчев</span>
    </td>
    <td>
        <div class="iconed notepadlist icon-star" id="5ef37bc31cd64405946eff12"
             style="font-size: 1.8em; position: relative; float: right; cursor: pointer;" title="Запази"></div>
    </td>
</tr>

The text I need to extract is this: дизел, 170,011 км

I tried two different ways of fetching the data (requests' get and urllib.request):

from bs4 import BeautifulSoup as soup
from requests import get
import json

my_url='https://www.cars.bg/?go=cars&search=1&fromhomeu=1&currencyId=1&autotype=1&stateId=1&offersForD=1&offersForA=1&filterOrderBy=1&radius=1'

#headers = {"Accept-Language": "bg-BG, bg;q=0.5"}
response = get(my_url)

html_soup = soup(response.text, 'html.parser')
#type(html_soup)

page_content = html_soup.find_all('tr', class_='odd')
for container in page_content:
 name = container.td.text
 print(name)

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url='https://www.cars.bg/?go=cars&search=1&fromhomeu=1&currencyId=1&autotype=1&stateId=1&offersForD=1&offersForA=1&filterOrderBy=1&radius=1'

#opening up connection
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, "html.parser")

#grab each product
odd = page_soup.findAll("tr",{"class":"odd"})

for od in odd:
  mileage=od.td.text
  print(mileage)

So my question is: how do I extract the text?


1 answer

Anonymous user

This script does not extract the text exactly as you asked, but it gets close.

from bs4 import BeautifulSoup as soup
from requests import get
import json

my_url='https://www.cars.bg/?go=cars&search=1&fromhomeu=1&currencyId=1&autotype=1&stateId=1&offersForD=1&offersForA=1&filterOrderBy=1&radius=1'

#headers = {"Accept-Language": "bg-BG, bg;q=0.5"}
response = get(my_url)

html_soup = soup(response.text, 'html.parser')
#type(html_soup)

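# The second <td> of each row (width="360") holds the title, fuel type and mileage
page_content = html_soup.find_all('td', {"width": 360})
for container in page_content:
    # Split on spaces and drop the empty strings left by runs of spaces
    name = container.text.split(" ")
    name = [ele for ele in name if ele != '']
    # Keep the tokens ending at 'км': start three tokens back when that token is
    # on the same line (e.g. a fuel type followed by 'автоматик,'), otherwise two
    km = name.index('км')
    start = km - 3 if '\n' not in name[km - 3] else km - 2
    name = name[start:km + 1]
    print(name)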

Output:

['дизел,', '212,000', 'км']
['дизел,', 'автоматик,', '182,000', 'км']
['дизел,', '129,000', 'км']
['бензин,', '42,000', 'км']
['дизел,', '202,000', 'км']
['бензин,', '14,000', 'км']
['бензин,', 'автоматик,', '200,000', 'км']
['дизел,', '186,000', 'км']
['бензин,', '118,000', 'км']
['дизел,', 'автоматик,', '200,000', 'км']
['дизел,', '204,000', 'км']
['дизел,', '163,000', 'км']
['дизел,', '212,000', 'км']
['бензин,', 'автоматик,', '144,000', 'км']
['дизел,', '183,000', 'км']
['дизел,', '152,000', 'км']
['бензин,', '103,000', 'км']
['дизел,', '190,000', 'км']
['дизел,', '166,000', 'км']
['дизел,', '192,000', 'км']
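
As for why the code in the question prints nothing: `container.td` returns the first `<td>` of the row, and in the HTML shown that cell contains only the thumbnail link, so its `.text` is empty. The mileage sits in the second cell (`width="360"`) as a bare text node directly under the `<td>`. The following is a minimal sketch of reading that node and converting it to a number. It runs against a trimmed copy of the row from the question rather than the live site, so it assumes the live markup still matches. The names `SAMPLE_ROW`, `MILEAGE_RE` and `extract_mileage` are hypothetical and exist only for this illustration.

```python
import re
from bs4 import BeautifulSoup

# Trimmed, simplified copy of the row from the question (assumption: the live
# page still uses this layout; the description text is shortened here).
SAMPLE_ROW = """
<table><tr class="odd">
  <td width="200"><a href="offer/5ef37bc31cd64405946eff12"><img src="https://g1-bg.cars.bg/2020-06-24_2/5ef37adfca1c397c9015b753b.jpg" width="200"/></a></td>
  <td width="360">
    <span><i>днес, 21:12</i></span><br/>
    <a class="ver15black" href="offer/5ef37bc31cd64405946eff12"><span class="ver15black"><b>Citroen Xsara Picasso 1.6 hdi</b></span></a>
    <br/>
    дизел, 170,011 км
    <div>ЛИЗИНГ БЕЗ ДОКАЗВАНЕ НА ДОХОДИ</div>
  </td>
</tr></table>
"""

# Digits with optional thousands separators, followed by the unit "км".
MILEAGE_RE = re.compile(r"(\d[\d,]*)\s*км\b")


def extract_mileage(row):
    """Return (raw_text, kilometres) for one <tr>, or None if nothing matches."""
    cell = row.find("td", attrs={"width": "360"})
    if cell is None:
        return None
    # Look only at text nodes that are direct children of the <td>,
    # which skips the date, the title and the description.
    for node in cell.find_all(string=True, recursive=False):
        text = node.strip()
        match = MILEAGE_RE.search(text)
        if match:
            return text, int(match.group(1).replace(",", ""))
    return None


soup = BeautifulSoup(SAMPLE_ROW, "html.parser")
row = soup.find("tr", class_="odd")

# What the code in the question reads: the first <td>, which has no text.
first_cell_text = row.td.get_text(strip=True)
result = extract_mileage(row)

print(repr(first_cell_text))
print(result)
```

To apply this to the real page, the same function could be called on each row returned by `html_soup.find_all('tr', class_='odd')`. Rows without a mileage line yield `None` instead of raising an error.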