我有数据,是在3列(名称,问题,回应),从判断学生研究研讨会的结果。水平(研究生/本科生)有2个可能的选择,大学(科学、教育等)有5个可能的选择。)
我需要做的是取一个给定名字的数字响应的平均值,并根据该人的平均数字分数进行排序,以输出一个包含以下内容的表:
College Level Rank science graduate 1st science graduate 2nd science graduate 3rd science undergrad 1st ... education graduate 1st ... education undergrad 3rd
下面是一个示例数据表:
name question response Smith, John Q1 10 Smith, John Q2 7 Smith, John Q3 10 Smith, John Q4 8 Smith, John Q5 10 Smith, John Q8 8 Smith, John level graduate Smith, John colleg science Smith, John Q9 4 Jones, Mary Q3 10 Jones, Mary Q2 10 Jones, Mary Q1 10 Jones, Mary Q4 10 Jones, Mary level undergraduate Jones, Mary college education Jones, Mary Q6 10 Jones, Mary Q7 10 Jones, Mary Q9 10
一个有才华的学生用pivot表在excel中为我们做了这件事,但我确信这可以用pandas来完成,我很好奇怎么做(我对pandas很陌生)。棘手的是,所有“有用”的信息都在第三栏。
将响应列转换为数字,字符串将是na然后Groupby和聚合
import pandas as pd
import numpy as np
import StringIO
data = '''name;question;response
Smith, John;Q1;10
Smith, John;Q2;7
Smith, John;Q3;10
Smith, John;Q4;8
Smith, John;Q5;10
Smith, John;Q8;8
Smith, John;level;graduate
Smith, John;colleg;science
Smith, John;Q9;4
Jones, Mary;Q3;10
Jones, Mary;Q2;10
Jones, Mary;Q1;10
Jones, Mary;Q4;10
Jones, Mary;level;undergraduate
Jones, Mary;college;education
Jones, Mary;Q6;10
Jones, Mary;Q7;10
Jones, Mary;Q9;10'''
df = pd.read_csv(StringIO.StringIO(data), delimiter=';')
df['response'] = pd.to_numeric(df['response'], errors='coerce')
df.groupby('name').agg(np.mean).reset_index().sort_values(by='response')
输出
name response
1 Smith, John 8.142857
0 Jones, Mary 10.000000
您可以开始旋转数据框,以获取学院和级别的信息以及不同数据框中的问号:
pivoted = df.pivot_table(index='name',
columns='question',
values='response',
fill_value=np.nan, # Changing this value to 0 would consider
# unanswered questions as 0 score
aggfunc=lambda x: x)
categories = pivoted[['college','level']]
questions = pivoted.drop(['college','level'],axis=1)
并在类别数据框中为每个学生设置问号平均值:
categories['points'] = questions.astype(float).mean(axis=1,skipna=True)
skipna=True
与fill\u值=np组合。nan
使得未回答的问题不会计算在平均值中,因此,一个学生只有一个答案是10,则平均值为10。如前所述,fill\u value=0
修改此行为。
最终,可以使用sort_values
对值进行排序,以便为每个类别进行排名:
categories.sort_values(['college','level','points'],ascending=False)