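Can anyone tell me how to access the underlying URL to view a given user's Instagram followers? I am able to do this with the Instagram API, but given the pending changes to the approval process, I have decided to switch to scraping.
The Instagram web interface lets you view the follower list of any given public user. For example, to view Instagram's followers, visit "https://www.instagram.com/instagram" and click the followers link to open a window that paginates through the followers (note: you must be logged in to your account to view this).
I noticed that the URL changes to "https://www.instagram.com/instagram/followers" when this window pops up, but I can't seem to view the underlying page source for this URL.
Since it appears in my browser window, I assume I will be able to scrape it. But do I have to use a package like Selenium? Does anyone know what the underlying URL is, so that I don't have to use Selenium?
As an example, I am able to access the underlying feed data directly by going to "instagram.com/instagram/media/", from which I can scrape and paginate through all iterations. I would like to do something similar with the list of followers and access this data directly (rather than using Selenium).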
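EDIT: December 2018 update:
Things have changed in Insta-land since this was posted. Here is an updated script that is a bit more Pythonic and makes better use of XPATH/CSS paths.
Note that to use this updated script, you must install the explicit package (pip install explicit), or convert each line that uses waiter to a pure Selenium explicit wait.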
import itertools

from explicit import waiter, XPATH
from selenium import webdriver


def login(driver):
    username = ""  # <username here>
    password = ""  # <password here>

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")

    # Login
    waiter.find_write(driver, "//div/input[@name='username']", username, by=XPATH)
    waiter.find_write(driver, "//div/input[@name='password']", password, by=XPATH)
    waiter.find_element(driver, "//div/button[@type='submit']", by=XPATH).click()

    # Wait for the user dashboard page to load
    waiter.find_element(driver, "//a/span[@aria-label='Find People']", by=XPATH)


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link (the href is built from the account name
    # so the script also works for accounts other than "instagram")
    # driver.find_element_by_partial_link_text("follower").click()
    followers_link = "//a[@href='/{0}/followers/']".format(account)
    waiter.find_element(driver, followers_link, by=XPATH).click()

    # Wait for the followers modal to load
    waiter.find_element(driver, "//div[@role='dialog']", by=XPATH)

    # At this point a Followers modal pops open. If you immediately scroll to the bottom,
    # you hit a stopping point and a "See All Suggestions" link. If you fiddle with the
    # modal by scrolling up and down, you can force it to load additional followers for
    # that person.

    # Now the modal will begin loading followers every time you scroll to the bottom.
    # Keep scrolling in a loop until you've hit the desired number of followers.
    # In this instance, I'm using a generator to return followers one-by-one
    follower_css = "ul div li:nth-child({}) a.notranslate"  # Taking advantage of CSS's nth-child functionality
    for group in itertools.count(start=1, step=12):
        for follower_index in range(group, group + 12):
            yield waiter.find_element(driver, follower_css.format(follower_index)).text

        # Instagram loads followers 12 at a time. Find the last follower element
        # and scroll it into view, forcing Instagram to load another 12.
        # Even though we just found this elem in the previous for loop, there can
        # potentially be a large amount of time between that call and this one,
        # and the element might have gone stale. Let's just re-acquire it to avoid
        # that.
        last_follower = waiter.find_element(driver, follower_css.format(follower_index))
        driver.execute_script("arguments[0].scrollIntoView();", last_follower)


if __name__ == "__main__":
    account = 'instagram'
    driver = webdriver.Chrome()
    try:
        login(driver)
        # Print the first 75 followers for the "instagram" account
        print('Followers of the "{}" account'.format(account))
        for count, follower in enumerate(scrape_followers(driver, account=account), 1):
            print("\t{:>3}: {}".format(count, follower))
            if count >= 75:
                break
    finally:
        driver.quit()
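The batching logic in scrape_followers can be tried out without a browser. The sketch below is only a stand-in, not the script itself: `fake_lookup` is a hypothetical replacement for `waiter.find_element(...).text`, `iter_followers` and `on_batch_end` are names invented for illustration, and `BATCH_SIZE = 12` simply mirrors the batch size mentioned in the comments above. It also caps the generator with `itertools.islice`, one possible alternative to the `enumerate`/`break` loop.

```python
import itertools

BATCH_SIZE = 12  # batch size taken from the comments in the script above


def fake_lookup(index):
    """Hypothetical stand-in for waiter.find_element(...).text."""
    return "user_{:03d}".format(index)


def iter_followers(lookup, on_batch_end=None):
    """Yield followers one by one, walking 1-based indexes in batches."""
    for group in itertools.count(start=1, step=BATCH_SIZE):
        last_index = group + BATCH_SIZE - 1
        for follower_index in range(group, last_index + 1):
            yield lookup(follower_index)
        # In the real script, this is where the last element of the batch
        # is scrolled into view to trigger loading of the next batch.
        if on_batch_end is not None:
            on_batch_end(last_index)


scrolled_to = []
first_30 = list(itertools.islice(iter_followers(fake_lookup, scrolled_to.append), 30))
print(first_30[0], first_30[-1], scrolled_to)
```

Because the generator is lazy, the "scroll" step only runs for batches that were fully consumed, so asking for 30 followers triggers it after index 12 and index 24 but not after the partial third batch.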
I did a quick benchmark to show how performance degrades, growing worse than linearly, the more followers you try to scrape this way:
$ python example.py
Followers of the "instagram" account
Found 100 followers in 11 seconds
Found 200 followers in 19 seconds
Found 300 followers in 29 seconds
Found 400 followers in 47 seconds
Found 500 followers in 71 seconds
Found 600 followers in 106 seconds
Found 700 followers in 157 seconds
Found 800 followers in 213 seconds
Found 900 followers in 284 seconds
Found 1000 followers in 375 seconds
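One rough way to read these numbers is to look at how long each successive block of 100 followers took, and at the slope between the first and last data points on a log-log scale. The sketch below uses only the figures printed above; the interpretation of the slope (about 1 for linear growth, about 2 for quadratic growth) is a rule of thumb, not a rigorous fit.

```python
import math

# (followers found, elapsed seconds) copied from the benchmark output above
benchmark = [
    (100, 11), (200, 19), (300, 29), (400, 47), (500, 71),
    (600, 106), (700, 157), (800, 213), (900, 284), (1000, 375),
]

# Seconds spent on each successive block of 100 followers
increments = [benchmark[0][1]] + [
    later[1] - earlier[1] for earlier, later in zip(benchmark, benchmark[1:])
]

# Slope on a log-log scale between the first and last points:
# about 1 suggests linear growth, about 2 suggests quadratic growth.
(n_first, t_first), (n_last, t_last) = benchmark[0], benchmark[-1]
loglog_slope = math.log(t_last / t_first) / math.log(n_last / n_first)

print("seconds per 100 followers:", increments)
print("log-log slope: {:.2f}".format(loglog_slope))
```

The last block of 100 followers took 91 seconds against 11 seconds for the first, so the cost per follower keeps climbing as the modal fills up.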
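Original post:
Your question is a little confusing. For instance, I'm not really sure what "from which I can scrape and paginate through all iterations" actually means. What are you currently using to scrape and paginate?
Regardless, instagram.com/instagram/media/ is not the same type of endpoint as instagram.com/instagram/followers. The media endpoint appears to be a REST API, configured to return an easily parseable JSON object.
As far as I can tell, the followers endpoint isn't really a RESTful endpoint. Rather, after you click the Followers button, Instagram loads the information into the page via AJAX (using React?). I don't think you will be able to get that information without using something like Selenium, which can load/render the JavaScript that displays the followers to the user.
This example code will work: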
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def login(driver):
    username = ""  # <username here>
    password = ""  # <password here>

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")

    # Login
    driver.find_element_by_xpath("//div/input[@name='username']").send_keys(username)
    driver.find_element_by_xpath("//div/input[@name='password']").send_keys(password)
    driver.find_element_by_xpath("//span/button").click()

    # Wait for the login page to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "See All")))


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link
    driver.find_element_by_partial_link_text("follower").click()

    # Wait for the followers modal to load
    xpath = "//div[@style='position: relative; z-index: 1;']/div/div[2]/div/div[1]"
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, xpath)))

    # You'll need to figure out some scrolling magic here. Something that can
    # scroll to the bottom of the followers modal, and know when it has reached
    # the bottom. This is pretty impractical for people with a lot of followers

    # Finally, scrape the followers
    xpath = "//div[@style='position: relative; z-index: 1;']//ul/li/div/div/div/div/a"
    followers_elems = driver.find_elements_by_xpath(xpath)

    return [e.text for e in followers_elems]


if __name__ == "__main__":
    driver = webdriver.Chrome()
    try:
        login(driver)
        followers = scrape_followers(driver, "instagram")
        print(followers)
    finally:
        driver.quit()
There are a number of problems with this approach, chief among them being how slow it is relative to the API.
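The explicit waits in the script above (`WebDriverWait(driver, 10).until(...)`) come down to polling a condition until it holds or a timeout expires. The following is a plain-Python sketch of that idea, not Selenium's implementation; `wait_until`, `WaitTimeout`, and `modal_loaded` are names made up for illustration, and the fake condition stands in for a real element lookup.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within {} seconds".format(timeout))
        time.sleep(poll_interval)


# Usage with a fake condition that succeeds on the third poll.
attempts = []


def modal_loaded():
    attempts.append(1)
    return "modal" if len(attempts) >= 3 else None


found = wait_until(modal_loaded, timeout=2.0, poll_interval=0.01)
print(found, len(attempts))
```

Every poll costs at least one round trip to the browser in the real setting, which is part of why this style of scraping is slow compared with a single API call.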
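Update: March 2020
This is just Levi's answer with small updates in a few places, because as it stands it did not quit the driver successfully. By default it also gets all the followers; as others have said, it is not suitable for accounts with a lot of followers.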
import itertools
from time import sleep

from explicit import waiter, XPATH
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def login(driver):
    username = ""  # <username here>
    password = ""  # <password here>

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")
    sleep(3)

    # Login
    driver.find_element_by_name("username").send_keys(username)
    driver.find_element_by_name("password").send_keys(password)
    submit = driver.find_element_by_tag_name('form')
    submit.submit()

    # Wait for the user dashboard page to load
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.LINK_TEXT, "See All")))


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link
    sleep(2)
    driver.find_element_by_partial_link_text("follower").click()

    # Wait for the followers modal to load
    waiter.find_element(driver, "//div[@role='dialog']", by=XPATH)

    # Total follower count shown on the profile. int() only accepts plain
    # digits here, so a count displayed as "1,234" would raise ValueError.
    allfoll = int(driver.find_element_by_xpath("//li[2]/a/span").text)

    # At this point a Followers modal pops open. If you immediately scroll to the bottom,
    # you hit a stopping point and a "See All Suggestions" link. If you fiddle with the
    # modal by scrolling up and down, you can force it to load additional followers for
    # that person.

    # Now the modal will begin loading followers every time you scroll to the bottom.
    # Keep scrolling in a loop until you've hit the desired number of followers.
    # In this instance, I'm using a generator to return followers one-by-one
    follower_css = "ul div li:nth-child({}) a.notranslate"  # Taking advantage of CSS's nth-child functionality
    for group in itertools.count(start=1, step=12):
        for follower_index in range(group, group + 12):
            if follower_index > allfoll:
                # End the generator. Raising StopIteration inside a generator
                # is turned into a RuntimeError since Python 3.7 (PEP 479).
                return
            yield waiter.find_element(driver, follower_css.format(follower_index)).text

        # Instagram loads followers 12 at a time. Find the last follower element
        # and scroll it into view, forcing Instagram to load another 12.
        # Even though we just found this elem in the previous for loop, there can
        # potentially be a large amount of time between that call and this one,
        # and the element might have gone stale. Let's just re-acquire it to avoid
        # that.
        last_follower = waiter.find_element(driver, follower_css.format(group + 11))
        driver.execute_script("arguments[0].scrollIntoView();", last_follower)


if __name__ == "__main__":
    account = ""  # <account to check>
    driver = webdriver.Firefox(executable_path="./geckodriver")
    try:
        login(driver)
        print('Followers of the "{}" account'.format(account))
        for count, follower in enumerate(scrape_followers(driver, account=account), 1):
            print("\t{:>3}: {}".format(count, follower))
    finally:
        driver.quit()
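One fragile spot in the script above is the `int(...)` call on the follower count text, which raises ValueError if the count is displayed with separators or a suffix. A minimal sketch of a more forgiving parser follows. The formats it handles ("1,234", "10.5k", "2m") are assumptions about how a count might be rendered, not something confirmed in this thread, and `parse_count` is a hypothetical helper.

```python
def parse_count(text):
    """Convert a displayed count such as '1,234' or '10.5k' to an int.

    The accepted formats are assumptions, not documented Instagram behaviour.
    """
    multipliers = {"k": 1_000, "m": 1_000_000}
    cleaned = text.strip().lower().replace(",", "")
    if cleaned and cleaned[-1] in multipliers:
        return int(round(float(cleaned[:-1]) * multipliers[cleaned[-1]]))
    return int(cleaned)


for sample in ["123", "1,234", "10.5k", "2m"]:
    print(sample, "->", parse_count(sample))
```

An abbreviated count such as "10.5k" is only approximate, so a stop condition based on it may end a little early or late; treating it as an upper bound and also stopping when no further element can be found would be safer.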
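I noticed that the previous answer no longer works, so I made an updated version based on it that includes scrolling (to get all the users in the list, not just those loaded initially). In addition, this scrapes both followers and following. (You will also need to download chromedriver.)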
import time

from selenium import webdriver as wd
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# The account you want to check
account = ""

# Chrome executable
chrome_binary = r"chrome.exe"  # Add your path here


def login(driver):
    username = ""  # Your username
    password = ""  # Your password

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")

    # Login
    driver.find_element_by_xpath("//div/input[@name='username']").send_keys(username)
    driver.find_element_by_xpath("//div/input[@name='password']").send_keys(password)
    driver.find_element_by_xpath("//span/button").click()

    # Wait for the login page to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "See All")))


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link
    driver.find_element_by_partial_link_text("follower").click()

    # Wait for the followers modal to load
    xpath = "/html/body/div[4]/div/div/div[2]/div/div[2]"
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, xpath)))

    SCROLL_PAUSE = 0.5  # Pause to allow loading of content
    driver.execute_script("followersbox = document.getElementsByClassName('_gs38e')[0];")
    last_height = driver.execute_script("return followersbox.scrollHeight;")

    # We need to scroll the followers modal to ensure that all followers are loaded
    while True:
        driver.execute_script("followersbox.scrollTo(0, followersbox.scrollHeight);")

        # Wait for page to load
        time.sleep(SCROLL_PAUSE)

        # Calculate new scrollHeight and compare with the previous
        new_height = driver.execute_script("return followersbox.scrollHeight;")
        if new_height == last_height:
            break
        last_height = new_height

    # Finally, scrape the followers
    xpath = "/html/body/div[4]/div/div/div[2]/div/div[2]/ul/li"
    followers_elems = driver.find_elements_by_xpath(xpath)

    followers_temp = [e.text for e in followers_elems]  # List of followers (username, full name, follow text)
    followers = []  # List of followers (usernames only)

    # Go through each entry in the list, append the username to the followers list
    for i in followers_temp:
        username, sep, name = i.partition('\n')
        followers.append(username)

    print("______________________________________")
    print("FOLLOWERS")

    return followers


def scrape_following(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Following' link
    driver.find_element_by_partial_link_text("following").click()

    # Wait for the following modal to load
    xpath = "/html/body/div[4]/div/div/div[2]/div/div[2]"
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, xpath)))

    SCROLL_PAUSE = 0.5  # Pause to allow loading of content
    driver.execute_script("followingbox = document.getElementsByClassName('_gs38e')[0];")
    last_height = driver.execute_script("return followingbox.scrollHeight;")

    # We need to scroll the following modal to ensure that all following are loaded
    while True:
        driver.execute_script("followingbox.scrollTo(0, followingbox.scrollHeight);")

        # Wait for page to load
        time.sleep(SCROLL_PAUSE)

        # Calculate new scrollHeight and compare with the previous
        new_height = driver.execute_script("return followingbox.scrollHeight;")
        if new_height == last_height:
            break
        last_height = new_height

    # Finally, scrape the following
    xpath = "/html/body/div[4]/div/div/div[2]/div/div[2]/ul/li"
    following_elems = driver.find_elements_by_xpath(xpath)

    following_temp = [e.text for e in following_elems]  # List of following (username, full name, follow text)
    following = []  # List of following (usernames only)

    # Go through each entry in the list, append the username to the following list
    for i in following_temp:
        username, sep, name = i.partition('\n')
        following.append(username)

    print("\n______________________________________")
    print("FOLLOWING")

    return following


if __name__ == "__main__":
    options = wd.ChromeOptions()
    options.binary_location = chrome_binary  # chrome.exe
    driver_binary = r"chromedriver.exe"
    driver = wd.Chrome(driver_binary, chrome_options=options)
    try:
        login(driver)
        followers = scrape_followers(driver, account)
        print(followers)
        following = scrape_following(driver, account)
        print(following)
    finally:
        driver.quit()
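The two scrape functions above share the same two steps: scroll until the height stops growing, then keep the first line of each entry's text. As a sketch of how those steps could be factored out and tried without a browser, the block below takes the height lookup and the scroll action as plain callables. `scroll_until_stable`, `extract_usernames`, `FakeBox`, and the `max_rounds` safety cap are all names and additions made up for illustration; they are not part of the script above.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, pause=0.5, max_rounds=1000):
    """Scroll until the reported height stops growing; return the final height."""
    last_height = get_height()
    for _ in range(max_rounds):  # cap added here as a guard against endless loops
        scroll_to_bottom()
        time.sleep(pause)  # give the page time to load more entries
        new_height = get_height()
        if new_height == last_height:
            break
        last_height = new_height
    return last_height


def extract_usernames(entries):
    """Keep the first line of each entry's text, which the script treats as the username."""
    return [entry.partition("\n")[0] for entry in entries]


class FakeBox:
    """Fake scrollable box for demonstration: grows by 100 per scroll, up to 300."""

    def __init__(self):
        self.height = 100

    def scroll(self):
        self.height = min(self.height + 100, 300)


box = FakeBox()
final_height = scroll_until_stable(lambda: box.height, box.scroll, pause=0)
print(final_height)
print(extract_usernames(["alice\nAlice A.\nFollow", "bob\nBob B.\nFollow"]))
```

With Selenium, the two callables would presumably wrap the same `driver.execute_script(...)` calls the script already makes, one returning `scrollHeight` and one calling `scrollTo`. Note that a fixed pause can end the loop early if a batch takes longer than the pause to load.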