This is the input CSV file, which I read with pd.read_csv():
ProductCode,Date,Receipt,Total
x1,07/29/15,101790,17.35
x2,07/29/15,103601,8.89
x3,07/29/15,103601,8.58
x4,07/30/15,101425,11.95
x5,07/29/15,101422,1.09
x6,07/29/15,101422,0.99
x7,07/29/15,101422,3
y7,08/05/15,100358,7.29
x8,08/05/15,100358,2.6
z3,08/05/15,100358,2.99
import pandas as pd
df = pd.read_csv('product.csv')
#I have to add some columns to the data:
df['Receipt_Count'] = df.groupby(['Date','Receipt'])['Receipt'].transform('count')
df['Day_of_Week'] = pd.to_datetime(df['Date']).dt.weekday_name
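My CSV file has about 800K rows. The line that converts the date to weekday_name takes about 2 minutes to run. I understand that my 'Date' column is first converted to datetime, because it is read from the CSV as a string, and is then converted to its weekday equivalent. Is there any way to shorten the conversion time?
I'm fairly new to pandas/Python, so I'm not sure whether I'm missing something here.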
Specifying the format of the date string speeds up the conversion considerably:
df['Day_of_Week'] = pd.to_datetime(df['Date'], format='%m/%d/%y').dt.weekday_name
Here are some benchmarks:
import io
import pandas as pd
data = io.StringIO('''\
ProductCode,Date,Receipt,Total
x1,07/29/15,101790,17.35
x2,07/29/15,103601,8.89
x3,07/29/15,103601,8.58
x4,07/30/15,101425,11.95
x5,07/29/15,101422,1.09
x6,07/29/15,101422,0.99
x7,07/29/15,101422,3
y7,08/05/15,100358,7.29
x8,08/05/15,100358,2.6
z3,08/05/15,100358,2.99
''')
df = pd.read_csv(data)
%timeit pd.to_datetime(df['Date']).dt.weekday_name
# => 100 loops, best of 3: 2.48 ms per loop
%timeit pd.to_datetime(df['Date'], format='%m/%d/%y').dt.weekday_name
# => 1000 loops, best of 3: 507 µs per loop
large_df = pd.concat([df] * 1000)
%timeit pd.to_datetime(large_df['Date']).dt.weekday_name
# => 1 loop, best of 3: 1.62 s per loop
%timeit pd.to_datetime(large_df['Date'], format='%m/%d/%y').dt.weekday_name
# => 10 loops, best of 3: 45.9 ms per loop
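Even for the small sample provided in the OP, this is about a 5x speedup (2.48 ms vs. 507 µs); for the larger 10,000-row dataframe the gain is much bigger, roughly 35x (1.62 s vs. 45.9 ms).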
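The benchmarks above use `dt.weekday_name`, which was removed in pandas 1.0. On newer releases the same idea can be sketched with `dt.day_name()`. This is only a minimal illustration on a three-row subset of the sample data; the variable names `csv_text` and `dates` are made up for the example.

```python
import io
import pandas as pd

# Three rows taken from the sample data in the question.
csv_text = """ProductCode,Date,Receipt,Total
x1,07/29/15,101790,17.35
x4,07/30/15,101425,11.95
y7,08/05/15,100358,7.29
"""
df = pd.read_csv(io.StringIO(csv_text))

# Passing an explicit format means pandas does not have to infer it.
dates = pd.to_datetime(df['Date'], format='%m/%d/%y')

# dt.day_name() replaces the dt.weekday_name attribute on pandas >= 1.0.
df['Day_of_Week'] = dates.dt.day_name()
print(df[['Date', 'Day_of_Week']])
```

The `format` argument is what matters for speed; switching to `day_name()` only keeps the snippet runnable on current pandas.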
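Another approach is to parse the dates while loading the CSV, especially if you need this date column often. Unfortunately, there does not seem to be a way to pass the date format to read_csv, and its infer_datetime_format parameter does not seem to make a difference: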
import timeit
repeat = 3
numbers = 100
setup = """import pandas as pd
import io
data = io.StringIO('''\
ProductCode,Date,Receipt,Total
''' + '''\
x1,07/29/15,101790,17.35
x2,07/29/15,103601,8.89
x3,07/29/15,103601,8.58
x4,07/30/15,101425,11.95
x5,07/29/15,101422,1.09
x6,07/29/15,101422,0.99
x7,07/29/15,101422,3
y7,08/05/15,100358,7.29
x8,08/05/15,100358,2.6
z3,08/05/15,100358,2.99
''' * 100)"""
def time(statement, _setup=None):
    print(min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers)))
time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
'infer_datetime_format=True); data.seek(0)')
Prints:
0.5536041843652657
25.298157679942697
25.34556727133409
But if you expect to use the Date column often, it is still worth converting it once, up front.
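One way to convert up front without paying the `parse_dates` cost might be to load the column as plain strings and convert it once with an explicit format right after reading. The sketch below assumes that workflow and uses a three-row subset of the sample data; `csv_text` is a made-up name, and `dt.day_name()` stands in for `weekday_name` on newer pandas.

```python
import io
import pandas as pd

# Three rows taken from the sample data in the question.
csv_text = """ProductCode,Date,Receipt,Total
x2,07/29/15,103601,8.89
x3,07/29/15,103601,8.58
x4,07/30/15,101425,11.95
"""
df = pd.read_csv(io.StringIO(csv_text))

# Convert once, with an explicit format, and overwrite the string column.
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')

# Later date-based columns reuse the parsed values without re-parsing strings.
df['Day_of_Week'] = df['Date'].dt.day_name()
df['Receipt_Count'] = df.groupby(['Date', 'Receipt'])['Receipt'].transform('count')
print(df)
```

After this, `df['Date']` holds datetime values, so any further date logic (grouping by day, filtering by range) works on it directly.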