The two basic pandas data structures: Series and DataFrame

February 16, 2023, 10:54:06

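pandas has two basic data structures: Series, which stores one-dimensional values, and DataFrame, which stores two-dimensional values. In both cases the underlying data is exposed through the values attribute. Many attributes and methods are defined on these two structures, and the vast majority of data processing operations in pandas are built on them.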

1　Series

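A Series object has four important components: the values of the sequence (data), the index (index), the storage type (dtype) and the name of the sequence (name). The index can also be given a name of its own.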

In [17]:   import pandas as pd  # assumed to have been imported in an earlier cell
           s = pd.Series(data = [100, 'a', {'dic1':5}],
                         index = pd.Index(['id1', 20, 'third'], name='my_idx'),
                         dtype = 'object', # other common dtypes include int, float, string and category
                         name = 'my_name')
           s

Out[17]:   my_idx
           id1              100
           20                 a
           third    {'dic1': 5}
           Name: my_name, dtype: object

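Note

object denotes a mixed type: the example above stores an integer, a string and a Python dict in the same Series. In addition, by default pandas treats a sequence made up purely of strings as an object-dtype sequence, but string can also be specified explicitly as its dtype.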
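
The dtype behavior described in this note can be checked with a short sketch. The variable names mixed and text are illustrative and do not come from the original text. The string check goes through pd.api.types.is_string_dtype rather than a printed dtype name, on the assumption that the exact dtype repr may differ between pandas versions.

```python
import pandas as pd

# Mixed contents (int, str, dict) can only be stored as object dtype.
mixed = pd.Series([100, 'a', {'dic1': 5}])

# Explicitly request the dedicated string dtype for a pure-string sequence.
text = pd.Series(['a', 'b', 'c'], dtype='string')

print(mixed.dtype)                         # object
print(pd.api.types.is_string_dtype(text))  # True
```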

These attributes can be accessed with the "." operator:

In [18]:   s.values

Out[18]:   array([100, 'a', {'dic1': 5}], dtype=object)

In [19]:   s.index

Out[19]:   Index(['id1', 20, 'third'], dtype='object', name='my_idx')

In [20]:   s.dtype

Out[20]:   dtype('O')

In [21]:   s.name

Out[21]:   'my_name'

The .shape attribute gives the length of the Series, returned as a one-element tuple:

In [22]:   s.shape

Out[22]:   (3,)

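The index is one of the most important concepts in pandas and is discussed in detail in Chapter 3. To retrieve the value for a single index entry, use [index_item], where index_item is the label of that entry.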

In [23]:   s['third']

Out[23]:   {'dic1': 5}
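
As a minimal sketch of label lookup, the snippet below rebuilds the same Series and reads one entry in two ways. The .loc accessor is not used in the text above; it is shown here only as the explicit label-based equivalent of [], and the label 'missing' is a made-up example of a label that does not exist.

```python
import pandas as pd

s = pd.Series([100, 'a', {'dic1': 5}],
              index=pd.Index(['id1', 20, 'third'], name='my_idx'))

first = s['id1']      # lookup by label with []
same = s.loc['id1']   # .loc is the explicit label-based accessor

# A label that is not in the index raises KeyError.
try:
    s['missing']
    found = True
except KeyError:
    found = False

print(first, same, found)  # 100 100 False
```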

2　DataFrame

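A DataFrame adds a column index on top of Series. It can be thought of as a data structure obtained by joining together a group of Series that share a common index. A DataFrame can be constructed from two-dimensional data together with row and column indexes: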

In [24]:   data = [[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.2]]
           df = pd.DataFrame(data=data,
                             index=['row_%d'%i for i in range(3)],  # row index
                             columns=['col_0', 'col_1', 'col_2'])   # column index
           df

Out[24]:         col_0 col_1   col_2
           row_0     1     a     1.2
           row_1     2     b     2.2
           row_2     3     c     3.2
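
The view of a DataFrame as several Series sharing one index can be made concrete with a small sketch. It uses pd.concat with axis=1, which is one possible way to do the joining and is not necessarily how the original author would build the table; the variable names are illustrative.

```python
import pandas as pd

idx = ['row_0', 'row_1', 'row_2']
col_0 = pd.Series([1, 2, 3], index=idx, name='col_0')
col_1 = pd.Series(['a', 'b', 'c'], index=idx, name='col_1')
col_2 = pd.Series([1.2, 2.2, 3.2], index=idx, name='col_2')

# Place the Series side by side; their names become the column labels
# and the shared index becomes the row index.
df = pd.concat([col_0, col_1, col_2], axis=1)
print(df)
```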

More often, however, a DataFrame is constructed from a mapping of column names to data, with the row index supplied alongside:

In [25]:   df = pd.DataFrame(data = {'col_0': [1,2,3], 'col_1':list('abc'),
                                     'col_2': [1.2, 2.2, 3.2]},
                             index = ['row_%d'%i for i in range(3)])
           df

Out[25]:         col_0 col_1   col_2
           row_0     1     a     1.2
           row_1     2     b     2.2
           row_2     3     c     3.2

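Because of this mapping, [col_name] and [col_list] can be used on a DataFrame to extract a single column or to build a new DataFrame from several columns. The results are a Series and a DataFrame respectively: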

In [26]:   df['col_0']

Out[26]:   row_0    1
           row_1    2
           row_2    3
           Name: col_0, dtype: int64

In [27]:   df[['col_0', 'col_1']]

Out[27]:         col_0 col_1
           row_0     1     a
           row_1     2     b
           row_2     3     c
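
The difference between the two return types is easy to confirm. In this sketch (variable names are illustrative), a one-element list still produces a DataFrame, which is the point worth noticing.

```python
import pandas as pd

df = pd.DataFrame({'col_0': [1, 2, 3], 'col_1': list('abc')},
                  index=['row_0', 'row_1', 'row_2'])

one_col = df['col_0']        # a single label returns a Series
one_col_df = df[['col_0']]   # a one-element list returns a DataFrame

print(type(one_col).__name__)     # Series
print(type(one_col_df).__name__)  # DataFrame
print(one_col_df.shape)           # (3, 1)
```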

The to_frame() method converts a Series into a DataFrame with a single column:

In [28]:   df['col_0'].to_frame()

Out[28]:        col_0
           row_0    1
           row_1    2
           row_2    3

As with Series, the corresponding attributes can be read from a DataFrame:

In [29]:   df.values

Out[29]:   array([[1, 'a', 1.2],
                  [2, 'b', 2.2],
                  [3, 'c', 3.2]], dtype=object)

In [30]:   df.index

Out[30]:   Index(['row_0', 'row_1', 'row_2'], dtype='object')

In [31]:   df.columns

Out[31]:   Index(['col_0', 'col_1', 'col_2'], dtype='object')

In [32]:   df.dtypes # returns a Series whose values are the dtypes of the corresponding columns

Out[32]:   col_0      int64
           col_1     object
           col_2    float64
           dtype: object

In [33]:   df.shape # returns a tuple

Out[33]:   (3, 3)
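
Out[29] shows dtype=object because a single NumPy array has to hold the int, str and float columns together. As a small sketch with illustrative variable names, selecting only the numeric columns yields a float64 array instead:

```python
import pandas as pd

df = pd.DataFrame({'col_0': [1, 2, 3], 'col_1': list('abc'),
                   'col_2': [1.2, 2.2, 3.2]})

all_values = df.values                          # int, str and float columns
numeric_values = df[['col_0', 'col_2']].values  # int and float columns only

print(all_values.dtype)      # object
print(numeric_values.dtype)  # float64
```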

The ".T" attribute transposes the rows and columns of a DataFrame:

In [34]:   df.T

Out[34]:          row_0  row_1  row_2
           col_0      1      2      3
           col_1      a      b      c
           col_2    1.2    2.2    3.2
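
One side effect of transposing a table with mixed column types is worth a quick check. In the sketch below (the variable t is illustrative), each transposed column holds an int, a str and a float, so every column ends up as object dtype.

```python
import pandas as pd

df = pd.DataFrame({'col_0': [1, 2, 3], 'col_1': list('abc'),
                   'col_2': [1.2, 2.2, 3.2]},
                  index=['row_0', 'row_1', 'row_2'])

t = df.T
# Row and column labels swap places.
print(list(t.index))    # ['col_0', 'col_1', 'col_2']
print(list(t.columns))  # ['row_0', 'row_1', 'row_2']
# Each transposed column now mixes int, str and float values.
print((t.dtypes == object).all())  # True
```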

To modify a column or add a new one, assign directly to df[col_name]:

In [35]:   df["col_0"] = df['col_0'].values[::-1] # reverse the order
           df["col_2"] *= 2
           df["col_3"] = ["apple", "banana", "cat"]
           df

Out[35]:         col_0 col_1   col_2   col_3
           row_0     3     a     2.4   apple
           row_1     2     b     4.4  banana
           row_2     1     c     6.4     cat
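
The use of .values in In [35] matters. As a minimal sketch (the names aligned and positional are illustrative), assigning a reversed Series leaves the column unchanged because pandas aligns it on the index labels, while assigning the underlying array is positional:

```python
import pandas as pd

df = pd.DataFrame({'col_0': [1, 2, 3]},
                  index=['row_0', 'row_1', 'row_2'])

# Assigning a Series aligns on index labels, so reversing it has no effect.
aligned = df.copy()
aligned['col_0'] = aligned['col_0'][::-1]

# Assigning the underlying array is positional, so the order really flips.
positional = df.copy()
positional['col_0'] = positional['col_0'].values[::-1]

print(aligned['col_0'].tolist())     # [1, 2, 3]
print(positional['col_0'].tolist())  # [3, 2, 1]
```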

To delete a column, use the drop method:

In [36]:   df.drop(["col_3"], axis=1)

Out[36]:         col_0 col_1   col_2 
           row_0     3     a     2.4 
           row_1     2     b     4.4 
           row_2     1     c     6.4

With axis=1, drop removes columns; with axis=0, it removes rows:

In [37]:   df.drop(["row_1"], axis=0)

Out[37]:         col_0 col_1   col_2   col_3
           row_0     3     a     2.4   apple
           row_2     1     c     6.4     cat

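Note

With their default arguments, the vast majority of Series and DataFrame methods do not modify the original table; they return a copy instead. This is why col_3 is still present in Out[37] after the earlier drop. To actually delete something from df, reassign the result: df = df.drop(...).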
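
The behavior described in this note can be verified directly. The sketch below uses a small stand-in table and an illustrative variable named dropped:

```python
import pandas as pd

df = pd.DataFrame({'col_0': [3, 2, 1], 'col_3': ['apple', 'banana', 'cat']},
                  index=['row_0', 'row_1', 'row_2'])

dropped = df.drop(['col_3'], axis=1)  # returns a new DataFrame
print('col_3' in df.columns)          # True: df itself is unchanged
print('col_3' in dropped.columns)     # False

df = df.drop(['col_3'], axis=1)       # rebind df to make the change stick
print(list(df.columns))               # ['col_0']
```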

Equivalently, selecting the wanted columns with [col_list] achieves the same filtering as the drop above:

In [38]:   df = df[df.columns[:-1]]
           df

Out[38]:         col_0 col_1   col_2 
           row_0     3     a     2.4 
           row_1     2     b     4.4 
           row_2     1     c     6.4
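
The equivalence of the two approaches can be checked with DataFrame.equals. This is a sketch on a stand-in table; the names by_drop and by_selection are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col_0': [3, 2, 1], 'col_1': list('abc'),
                   'col_3': ['apple', 'banana', 'cat']},
                  index=['row_0', 'row_1', 'row_2'])

by_drop = df.drop(['col_3'], axis=1)
by_selection = df[df.columns[:-1]]   # every column except the last

print(by_drop.equals(by_selection))  # True
```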
