Reading & Interacting With an HTML Table Using Python
I'm trying to web scrape information from an HTML table with interactive controls for sifting through various time periods. An example table is located at this URL: /dl/frt/M?IM=quotes&type=Time%26Sales&SA=quotes&symbol=IBM&qm_page=45750.
I'd like to start at the time of 9:30 and then interact with the table by jumping forward 1 min. I want to export all of the data to a DataFrame. I've tried using pandas.read_html() and also tried using BeautifulSoup, although I am inexperienced with BeautifulSoup. Neither of these is working for me. Is my request possible, or has the website protected this information from web scraping? Any help would be appreciated!
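For context on why pandas.read_html() comes up empty here: it only parses table markup that is already present in the HTML it is given, so a page that builds its table with JavaScript yields no tables from the raw response. A minimal sketch (the inline HTML is made up for illustration):

```python
from io import StringIO

import pandas as pd

# read_html works when the <table> markup is in the source itself...
static_html = """
<table>
  <tr><th>Time</th><th>Price</th></tr>
  <tr><td>09:30:00</td><td>167.50</td></tr>
</table>
"""
tables = pd.read_html(StringIO(static_html))
print(tables[0])

# ...but fetching a JavaScript-rendered page with requests returns
# HTML that contains no <table> yet, so read_html raises
# "No tables found" on it.
```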
Original: /questions/41581616
Updated: -02-22 17:08
Accepted answer
The page is quite dynamic (and terribly slow, at least on my side), and involves JavaScript and multiple asynchronous requests to get the data. Approaching it with requests alone would not be easy, and you may need to fall back on browser automation via, for example, selenium.
Here is something for you to get started. Note the use of Explicit Waits here and there:
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.maximize_window()
driver.get("/dl/frt/M?IM=quotes&type=Time%26Sales&SA=quotes&symbol=IBM&qm_page=45750")

wait = WebDriverWait(driver, 400)  # 400 seconds timeout

# wait for the time select element to be visible
time_select = Select(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select[name=time]"))))

# select 9:30 and click "go"
time_select.select_by_visible_text("09:30")
driver.execute_script("arguments[0].click();", driver.find_element(By.ID, "go"))
time.sleep(2)

while True:
    # wait for the table to appear and load it into a pandas DataFrame
    table = wait.until(EC.presence_of_element_located((By.ID, "qmmt-time-and-sales-data-table")))
    df = pd.read_html(table.get_attribute("outerHTML"))
    print(df[0])

    # wait for the offset select to be present and advance by 1 minute
    offset_select = Select(wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "select[name=timeOffset]"))))
    offset_select.select_by_value("1")
    time.sleep(2)
    # TODO: think of a break condition
Note that this runs very slowly on my machine, and I am not sure how well it will run on yours, but it continuously advances 1 minute forward in an endless loop (you will probably need to stop it at some point).
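One way to handle the TODO above is to accumulate each minute's DataFrame and stop once a row at or past a target time appears. The sketch below is detached from Selenium: the frames stand in for the per-minute DataFrames the loop produces, and the cutoff time and column name are assumptions about the table's layout.

```python
import pandas as pd

def collect_until(frames_iter, cutoff="10:00:00", time_col="Time"):
    """Concatenate per-minute frames, stopping once a row at or past
    the cutoff time appears (a possible break condition for the loop).
    Relies on HH:MM:SS strings comparing in chronological order."""
    collected = []
    for df in frames_iter:
        collected.append(df)
        if (df[time_col] >= cutoff).any():
            break
    return pd.concat(collected, ignore_index=True)

# Stand-ins for two iterations of the scraping loop:
minute1 = pd.DataFrame({"Time": ["09:30:05", "09:30:40"], "Price": [167.5, 167.6]})
minute2 = pd.DataFrame({"Time": ["10:00:01"], "Price": [168.0]})
result = collect_until(iter([minute1, minute2]))
print(result)
```

Inside the while loop, you would call something equivalent after each df[0] is read, or simply inline the comparison and break.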