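Several pandas functions deal with how often elements occur in the data. Calling unique() and nunique() on a Series returns, respectively, an array of its unique values and the number of unique values: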
In [50]: df['School'].unique()
Out[50]: array(['A', 'B', 'C', 'D'], dtype=object)
In [51]: df['School'].nunique()
Out[51]: 4
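The two calls treat missing values differently, which a small sketch can show. The Series `s` below is hypothetical toy data, not the `df` used in this section:

```python
import pandas as pd

# Hypothetical toy data (not the df from this section); one entry is missing.
s = pd.Series(['A', 'B', 'A', None, 'C', 'B'])

# unique() returns a NumPy array in order of first appearance,
# and the missing value is kept as one of the entries.
uniques = s.unique()

# nunique() excludes missing values by default;
# dropna=False counts the missing value as one more distinct value.
n_unique = s.nunique()
n_unique_with_na = s.nunique(dropna=False)

print(uniques, n_unique, n_unique_with_na)
```

So `len(s.unique())` and `s.nunique()` can differ when the Series contains missing values.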
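value_counts() returns the number of times each value occurs in the Series, sorted by count in descending order. Setting normalize=True returns relative frequencies (proportions that sum to 1) instead of raw counts: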
In [52]: df['School'].value_counts()
Out[52]:
D    69
A    57
C    40
B    34
Name: School, dtype: int64
In [53]: df['School'].value_counts(normalize=True)
Out[53]:
D    0.345
A    0.285
C    0.200
B    0.170
Name: School, dtype: float64
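As a minimal sketch of what normalization means here, the proportions are the counts divided by their total. The Series `s` is hypothetical toy data and contains no missing values:

```python
import pandas as pd

# Hypothetical toy data (not the df from this section), no missing values.
s = pd.Series(['D', 'A', 'D', 'C', 'D', 'A'])

counts = s.value_counts()                # absolute frequency of each value
props = s.value_counts(normalize=True)   # relative frequency of each value

# Dividing the counts by their total reproduces the normalized result.
manual = counts / counts.sum()

print(counts)
print(props)
```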
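To inspect the unique combinations of values across several columns, use drop_duplicates(). Its key parameter is keep: the default 'first' keeps the row where each combination first appears, 'last' keeps the row where each combination last appears, and False drops every row whose combination appears more than once.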
In [54]: df_demo = df[['Gender','Transfer','Name']]
df_demo.drop_duplicates(['Gender', 'Transfer'])
Out[54]:
    Gender Transfer            Name
0   Female        N    Gaopeng Yang
1     Male        N  Changqiang You
12  Female      NaN        Peng You
21    Male      NaN   Xiaopeng Shen
36    Male        Y    Xiaojuan Qin
43  Female        Y      Gaoli Feng
In [55]: df_demo.drop_duplicates(['Gender', 'Transfer'], keep='last')
Out[55]:
     Gender Transfer            Name
147    Male      NaN        Juan You
150    Male        Y   Chengpeng You
169  Female        Y   Chengquan Qin
194  Female      NaN     Yanmei Qian
197  Female        N  Chengqiang Chu
199    Male        N     Chunpeng Lv
Setting keep to False keeps only the rows whose value, or combination of values, occurs exactly once:
In [56]: df_demo.drop_duplicates(['Name', 'Gender'], keep=False).head()
Out[56]:
   Gender Transfer            Name
0  Female        N    Gaopeng Yang
1    Male        N  Changqiang You
2    Male        N         Mei Sun
4    Male        N      Gaojuan You
5  Female        N     Xiaoli Qian
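The three keep settings are easier to compare on a frame small enough to check by eye. The following is a sketch on hypothetical toy data (`toy` and `cols` are names introduced only for this illustration):

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 share the combination (F, N).
toy = pd.DataFrame({
    'Gender':   ['F', 'M', 'F', 'M', 'F'],
    'Transfer': ['N', 'N', 'N', 'Y', 'Y'],
})
cols = ['Gender', 'Transfer']

first = toy.drop_duplicates(cols)               # keeps row 0, drops row 2
last = toy.drop_duplicates(cols, keep='last')   # keeps row 2, drops row 0
none = toy.drop_duplicates(cols, keep=False)    # drops both rows 0 and 2

print(list(first.index), list(last.index), list(none.index))
```

Note that the original row labels are preserved in every case, which is why the outputs in this section show non-consecutive indices.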
drop_duplicates() can also be used on a Series:
In [57]: df['School'].drop_duplicates()
Out[57]:
0    A
1    B
3    C
5    D
Name: School, dtype: object
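duplicated() works much like drop_duplicates(), but it returns a boolean Series indicating whether each element is a duplicate; its keep parameter has the same meaning. Duplicated elements are marked True and all others False. With the default keep='first', the first occurrence of each value is marked False and only its later repeats are marked True. drop_duplicates() is therefore equivalent to dropping the rows for which duplicated() returns True.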
In [58]: df_demo.duplicated(['Gender', 'Transfer']).head()
Out[58]:
0    False
1    False
2     True
3     True
4     True
dtype: bool
In [59]: df['School'].duplicated().head()
Out[59]:
0    False
1    False
2     True
3    False
4     True
Name: School, dtype: bool
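The stated equivalence can be checked directly by filtering with the negated mask. This is a sketch on hypothetical toy data (`toy`, `cols`, and `results` are illustrative names, not part of the section's `df`):

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 share the combination (F, N).
toy = pd.DataFrame({
    'Gender':   ['F', 'M', 'F', 'M', 'F'],
    'Transfer': ['N', 'N', 'N', 'Y', 'Y'],
})
cols = ['Gender', 'Transfer']

results = {}
for keep in ['first', 'last', False]:
    mask = toy.duplicated(cols, keep=keep)    # True marks rows to be dropped
    via_mask = toy[~mask]                     # keep the rows marked False
    via_drop = toy.drop_duplicates(cols, keep=keep)
    results[keep] = via_mask.equals(via_drop)

print(results)
```

Working with the mask directly is useful when the goal is to inspect or count the duplicated rows (for example with `mask.sum()`) rather than to discard them.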