I am reading a CSV file that contains dates in this format:
date
01/05/2014
01/05/2014
01/05/2014
01/05/2014
01/05/2014
01/05/2014
01/05/2014
01/05/2014
01/05/2014
I can't use the dates in string form; I need to convert them to numeric timestamps.
So I wrote this code:
Train = pd.read_csv("train.tsv", sep='\t')
Train['timestamp'] = pd.to_datetime(Train['date']).apply(lambda a: a.timestamp())
This gives me:
Train['timestamp'] = pd.to_datetime(Train['date']).apply(lambda a: a.timestamp())
AttributeError: 'Timestamp' object has no attribute 'timestamp'
Can you tell me how to get the timestamp inside the lambda?
Edited code:
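import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# parse the date column while reading the tab-separated file
Train = pd.read_csv("data_scientist_assignment.tsv", sep='\t', parse_dates=['date'])
#print df.head()
# Train['timestamp'] = pd.to_datetime(Train['date']).apply(lambda a: a.timestamp())
# integer view of the parsed dates (nanoseconds since the Unix epoch for datetime64[ns])
Train['timestamp'] = Train.date.values.astype(np.int64)
x1 = ["timestamp", "hr_of_day"]
test = pd.read_csv("test.csv")
print(Train.columns)
print(test.columns)
model = LogisticRegression()
model.fit(Train[x1], Train["vals"])
print(model)
print(model.score(Train[x1], Train["vals"]))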
You need to add the parse_dates parameter to read_csv, passing the name of the column to convert to datetime:
import io
import numpy as np
import pandas as pd
temp=u"""date
01/05/2014
01/05/2014
01/05/2014
01/05/2014
01/05/2014
01/05/2014
01/05/2014
01/05/2014
01/05/2014"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep='\t', parse_dates=['date'])
print (df)
date
0 2014-01-05
1 2014-01-05
2 2014-01-05
3 2014-01-05
4 2014-01-05
5 2014-01-05
6 2014-01-05
7 2014-01-05
8 2014-01-05
print (df.dtypes)
date datetime64[ns]
dtype: object
Another solution is to pass the position of the date column instead of its name. In the sample it is the first column, so pass 0 (Python counts from 0):
df = pd.read_csv(io.StringIO(temp), sep='\t', parse_dates=[0])
print (df)
date
0 2014-01-05
1 2014-01-05
2 2014-01-05
3 2014-01-05
4 2014-01-05
5 2014-01-05
6 2014-01-05
7 2014-01-05
8 2014-01-05
print (df.dtypes)
date datetime64[ns]
dtype: object
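As a quick check that the two parse_dates forms are interchangeable for this layout, the sketch below parses a small sample both ways and compares the results. The sample string and the variable names are made up for illustration and are not from the original data.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout as the question's file.
sample = "date\n01/05/2014\n01/06/2014\n"

# Select the column to parse by name ...
by_name = pd.read_csv(io.StringIO(sample), sep="\t", parse_dates=["date"])
# ... or by zero-based position.
by_position = pd.read_csv(io.StringIO(sample), sep="\t", parse_dates=[0])

# Both calls should yield the same frame with a datetime column.
same = by_name.equals(by_position)
is_datetime = pd.api.types.is_datetime64_any_dtype(by_name["date"])
print(same, is_datetime)
```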
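Then you need to convert the column to a NumPy array with values and cast it to int. Each snippet below assumes df.date is still datetime64[ns], so apply only one of them to the parsed column: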
#unix time in ns
df.date = df.date.values.astype(np.int64)
print (df)
date
0 1388880000000000000
1 1388880000000000000
2 1388880000000000000
3 1388880000000000000
4 1388880000000000000
5 1388880000000000000
6 1388880000000000000
7 1388880000000000000
8 1388880000000000000
#unix time in us
df.date = df.date.values.astype(np.int64) // 1000
print (df)
date
0 1388880000000000
1 1388880000000000
2 1388880000000000
3 1388880000000000
4 1388880000000000
5 1388880000000000
6 1388880000000000
7 1388880000000000
8 1388880000000000
#unix time in ms
df.date = df.date.values.astype(np.int64) // 1000000
#df.date = pd.to_datetime(df.date, unit='ms')
print (df)
date
0 1388880000000
1 1388880000000
2 1388880000000
3 1388880000000
4 1388880000000
5 1388880000000
6 1388880000000
7 1388880000000
8 1388880000000
#unix time in s
df.date = df.date.values.astype(np.int64) // 1000000000
print (df)
date
0 1388880000
1 1388880000
2 1388880000
3 1388880000
4 1388880000
5 1388880000
6 1388880000
7 1388880000
8 1388880000
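A possible alternative to casting with astype is to subtract the Unix epoch and floor-divide by a Timedelta of the desired unit. This is only a sketch: the two sample dates and the variable names are assumptions for illustration. It leaves the datetime column untouched and does not rely on the column being stored in nanoseconds.

```python
import pandas as pd

# Hypothetical sample; format="%m/%d/%Y" assumes month-first dates.
dates = pd.Series(pd.to_datetime(["01/05/2014", "01/06/2014"], format="%m/%d/%Y"))

# Time elapsed since the Unix epoch, as a timedelta Series.
epoch = pd.Timestamp("1970-01-01")
elapsed = dates - epoch

# Floor-dividing by a Timedelta yields integers in that unit.
seconds = elapsed // pd.Timedelta("1s")
millis = elapsed // pd.Timedelta("1ms")
print(seconds.tolist())
print(millis.tolist())
```

Because the results go into new variables, the different units can be computed side by side without overwriting the parsed column.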
Another short way is to use to_datetime():
In [209]: df['date']
Out[209]:
0 01/05/2014
1 01/05/2014
2 01/05/2014
3 01/05/2014
4 01/05/2014
5 01/05/2014
6 01/05/2014
7 01/05/2014
8 01/05/2014
Name: date, dtype: object
In [210]: df['date'] = pd.to_datetime(df['date'])
In [211]: df['date']
Out[211]:
0 2014-01-05
1 2014-01-05
2 2014-01-05
3 2014-01-05
4 2014-01-05
5 2014-01-05
6 2014-01-05
7 2014-01-05
8 2014-01-05
Name: date, dtype: datetime64[ns]
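The output above shows 01/05/2014 being read month-first, as January 5. If the file actually stores day-first dates, that would be wrong, so it may be safer to state the layout explicitly with the format argument. The sketch below uses made-up sample values, and which layout the real file uses is an assumption you would need to confirm.

```python
import pandas as pd

# Hypothetical sample values that are valid under both layouts.
raw = pd.Series(["01/05/2014", "02/05/2014"], name="date")

# Month-first reading: January 5 and February 5.
month_first = pd.to_datetime(raw, format="%m/%d/%Y")
# Day-first reading: May 1 and May 2.
day_first = pd.to_datetime(raw, format="%d/%m/%Y")

print(month_first.dt.month.tolist())
print(day_first.dt.month.tolist())
```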
You can also get the seconds like this:
In [232]: df['date'].astype('int64') // 10**9
Out[232]:
0 1388880000
1 1388880000
2 1388880000
3 1388880000
4 1388880000
5 1388880000
6 1388880000
7 1388880000
8 1388880000
Name: date, dtype: int64
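As for the lambda in the question: the AttributeError suggests the asker's environment did not provide Timestamp.timestamp(). Assuming a pandas version where that method is available, a per-element call can look like the sketch below. The sample date and variable names are made up, and localizing to UTC first is a choice made here so the result does not depend on how naive timestamps are interpreted.

```python
import pandas as pd

# Hypothetical one-row sample; format="%m/%d/%Y" assumes month-first dates.
dates = pd.Series(pd.to_datetime(["01/05/2014"], format="%m/%d/%Y"))

# Attach UTC explicitly, then call timestamp() on each element.
utc_dates = dates.dt.tz_localize("UTC")
per_element = utc_dates.apply(lambda ts: ts.timestamp())
print(per_element.tolist())
```

This returns float seconds and runs element by element, so the vectorized integer conversions shown in the answers are likely the better fit for a large column.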