
Abstract: Study notes on pandas data wrangling, recorded while working through "Python for Data Analysis".

Cleaning, transforming, merging, and reshaping.

1 Merging Datasets

pandas has several built-in methods for combining datasets:

1.1 Database-Style DataFrame Merges

A database-style merge (merge) or join (join) links rows together using one or more keys. An example:

In [1]: import pandas as pd

In [2]: from pandas import Series,DataFrame

# numpy is assumed to be imported throughout these notes: import numpy as np

In [3]: df1 = DataFrame({'key':list('bbacaab'),'data1':range(7)})

In [4]: df2 = DataFrame({'key':list('abd'),'data2':range(3)})

In [5]: df1
Out[5]:
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b

In [6]: df2
Out[6]:
   data2 key
0      0   a
1      1   b
2      2   d

# merge combines datasets on the common values of a shared column.
In [7]: pd.merge(df1, df2)    # with no column given, merge defaults to the overlapping column name, key
Out[7]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

In [8]: pd.merge(df1, df2, on='key') # name the key column explicitly
Out[8]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

# Columns with different names can also be joined.

In [9]: df3 = DataFrame({'lkey':list('bbacaab'),'data1':range(7)})

In [10]: df4 = DataFrame({'rkey':list('abd'),'data2':range(3)})

In [11]: pd.merge(df3,df4,left_on='lkey',right_on='rkey')
Out[11]:
   data1 lkey  data2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a

# the default is an inner join; how='outer' takes the union of the keys
In [12]: pd.merge(df1,df2,how='outer')
Out[12]:
   data1 key  data2
0    0.0   b    1.0
1    1.0   b    1.0
2    6.0   b    1.0
3    2.0   a    0.0
4    4.0   a    0.0
5    5.0   a    0.0
6    3.0   c    NaN
7    NaN   d    2.0

In [13]: pd.merge(df1,df2,how='left')
Out[13]:
   data1 key  data2
0      0   b    1.0
1      1   b    1.0
2      2   a    0.0
3      3   c    NaN
4      4   a    0.0
5      5   a    0.0
6      6   b    1.0

In [14]: pd.merge(df1,df2,how='right',on='key')
Out[14]:
   data1 key  data2
0    0.0   b      1
1    1.0   b      1
2    6.0   b      1
3    2.0   a      0
4    4.0   a      0
5    5.0   a      0
6    NaN   d      2

# merging on multiple keys

In [15]: left = DataFrame({'key1':['foo','foo','bar'],'key2':['one','two','one']
    ...: ,'lval':[1,2,3]})

In [16]: right = DataFrame({'key1':['foo','foo','bar','bar'],'key2':['one','one'
    ...: ,'one','two'],'rval':[4,5,6,7]})

In [17]: pd.merge(left,right,on=['key1','key2'],how='outer')
Out[17]:
  key1 key2  lval  rval
0  foo  one   1.0   4.0
1  foo  one   1.0   5.0
2  foo  two   2.0   NaN
3  bar  one   3.0   6.0
4  bar  two   NaN   7.0

# What if, even after using on, other columns share the same name?
# merge's suffixes option handles overlapping column names.

In [18]: pd.merge(left,right,on='key1')
Out[18]:
  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7

In [19]: pd.merge(left,right,on='key1',suffixes=('_left','_right'))
Out[19]:
  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7

The main arguments to merge: how (join type: 'inner', 'outer', 'left', 'right'), on / left_on / right_on (key columns), left_index / right_index (use the index as the key), suffixes (suffixes for overlapping column names, defaulting to ('_x', '_y')), and sort (sort the result by the join keys).
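One argument the examples above don't show: indicator (assuming pandas 0.17 or later) tags each result row with its origin. A quick sketch:

# indicator=True adds a _merge column: 'left_only', 'right_only', or 'both'
pd.merge(df1, df2, on='key', how='outer', indicator=True)
#    data1 key  data2      _merge
# 0    0.0   b    1.0        both
# ...
# 6    3.0   c    NaN   left_only
# 7    NaN   d    2.0  right_only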

1.2 Merging on Index

Pass left_index=True or right_index=True to use the index as the join key.

In [20]: left1 = DataFrame({'key':list('abaabc'),'value':range(6)})

In [21]: right1 = DataFrame({'group_val':[3.5,7]},index=['a','b'])

In [22]: left1
Out[22]:
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

In [23]: right1
Out[23]:
   group_val
a        3.5
b        7.0

# use the right DataFrame's index as the join key
In [24]: pd.merge(left1, right1, left_on='key', right_index=True)
Out[24]:
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0

In [25]: pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
Out[25]:
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
5   c      5        NaN

# Hierarchically indexed data is more involved:
In [28]: lefth = DataFrame({'key1':['Ohio','Ohio','Ohio','Nevada','Nevada'],'key
    ...: 2':[2000,2001,2002,2001,2002],'data':np.arange(5.)})

In [29]: righth = DataFrame(np.arange(12).reshape(6,2),index=[['Nevada','Nevada'
    ...: ,'Ohio','Ohio','Ohio','Ohio'],[2001,2000,2000,2000,2001,2002]],columns=
    ...: ['event1','event2'])

In [30]: lefth
Out[30]:
   data    key1  key2
0   0.0    Ohio  2000
1   1.0    Ohio  2001
2   2.0    Ohio  2002
3   3.0  Nevada  2001
4   4.0  Nevada  2002

In [31]: righth
Out[31]:
             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11

# pass a list to use multiple columns as the merge key
In [32]: pd.merge(lefth, righth, left_on=['key1','key2'],right_index=True)
Out[32]:
   data    key1  key2  event1  event2
0   0.0    Ohio  2000       4       5
0   0.0    Ohio  2000       6       7
1   1.0    Ohio  2001       8       9
2   2.0    Ohio  2002      10      11
3   3.0  Nevada  2001       0       1

In [33]: pd.merge(lefth, righth, left_on=['key1','key2'],right_index=True, how='
    ...: outer')
Out[33]:
   data    key1    key2  event1  event2
0   0.0    Ohio  2000.0     4.0     5.0
0   0.0    Ohio  2000.0     6.0     7.0
1   1.0    Ohio  2001.0     8.0     9.0
2   2.0    Ohio  2002.0    10.0    11.0
3   3.0  Nevada  2001.0     0.0     1.0
4   4.0  Nevada  2002.0     NaN     NaN
4   NaN  Nevada  2000.0     2.0     3.0

In [34]: left2 = DataFrame([[1,2],[3,4],[5,6]],index=['a','c','e'],columns=['Ohi
    ...: o','Nevada'])

In [35]: right2 = DataFrame([[7,8],[9,10],[11,12],[13,14]],index=list('bcde'),co
    ...: lumns=['Missouri','Alabama'])

In [36]: left2
Out[36]:
   Ohio  Nevada
a     1       2
c     3       4
e     5       6

In [37]: right2
Out[37]:
   Missouri  Alabama
b         7        8
c         9       10
d        11       12
e        13       14

# merge on the indexes of both sides at once
In [38]: pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
    ...:
Out[38]:
   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0

DataFrame also has a join method, which is more convenient for merging on index.

In [39]: left2.join(right2, how='outer')
Out[39]:
   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
In [44]: right1
Out[44]:
   group_val
a        3.5
b        7.0

In [45]: left1
Out[45]:
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

# left1 joins on its key column; right1 joins on its index
In [46]: left1.join(right1, on='key')
Out[46]:
  key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
5   c      5        NaN

# For simple index-on-index merges, a list of DataFrames can be passed to join.

In [47]: another = DataFrame([[7,8],[9,10],[11,12],[16,17]],index=list('acef'),c
    ...: olumns=['NewYork','Oregon'])

In [48]: left2.join([right2, another])  # pass a list
Out[48]:
   Ohio  Nevada  Missouri  Alabama  NewYork  Oregon
a     1       2       NaN      NaN        7       8
c     3       4       9.0     10.0        9      10
e     5       6      13.0     14.0       11      12
In [50]: left2.join([right2, another],how='outer')
Out[50]:
   Ohio  Nevada  Missouri  Alabama  NewYork  Oregon
a   1.0     2.0       NaN      NaN      7.0     8.0
b   NaN     NaN       7.0      8.0      NaN     NaN
c   3.0     4.0       9.0     10.0      9.0    10.0
d   NaN     NaN      11.0     12.0      NaN     NaN
e   5.0     6.0      13.0     14.0     11.0    12.0
f   NaN     NaN       NaN      NaN     16.0    17.0

The concat function can do something similar, described next.

1.3 Concatenating Along an Axis

This is another kind of data combination, variously referred to as concatenation, binding, or stacking.

# NumPy has a concatenate function:
In [51]: arr = np.arange(12).reshape(3,4)

In [52]: np.concatenate([arr,arr],axis=1)
Out[52]:
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In [53]: np.concatenate([arr,arr],axis=0)
Out[53]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

The same idea generalizes to pandas:

In [54]: s1 = Series([0,1],index=list('ab'))

In [55]: s2 = Series([2,3,4],index=list('cde'))

In [56]: s3 = Series([5,6],index=list('fg'))

In [57]: pd.concat([s1,s2,s3])
Out[57]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

# concatenating along axis=1 produces a DataFrame
In [58]: pd.concat([s1,s2,s3],axis=1)
Out[58]:
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0


# build an overlapping Series; join='inner' below takes the intersection of the indexes
In [59]: s4 = pd.concat([s1*5,s3])

In [60]: s4
Out[60]:
a    0
b    5
f    5
g    6
dtype: int64

In [61]: pd.concat([s1,s4],axis=1)
Out[61]:
     0  1
a  0.0  0
b  1.0  5
f  NaN  5
g  NaN  6

In [62]: pd.concat([s1,s4],axis=1,join='inner')
Out[62]:
   0  1
a  0  0
b  1  5


# join_axes pins the result to exactly these index labels
In [63]: pd.concat([s1,s4],axis=1,join_axes=[list('acbe')])
Out[63]:
     0    1
a  0.0  0.0
c  NaN  NaN
b  1.0  5.0
e  NaN  NaN

In [64]: pd.concat([s1,s4],axis=1,join_axes=[list('acbef')])
Out[64]:
     0    1
a  0.0  0.0
c  NaN  NaN
b  1.0  5.0
e  NaN  NaN
f  NaN  5.0
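join_axes was later removed from pandas (it is gone as of pandas 1.0), so in newer versions the same result comes from reindexing after the concat. A sketch:

# equivalent without join_axes
pd.concat([s1, s4], axis=1).reindex(list('acbe'))
#      0    1
# a  0.0  0.0
# c  NaN  NaN
# b  1.0  5.0
# e  NaN  NaN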

# keys builds a hierarchical index identifying each piece
In [65]: result = pd.concat([s1,s2,s3],keys=['one','two','three'])

In [66]: result
Out[66]:
one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64

# when concatenating along axis=1, the keys become the column headers
In [70]: result = pd.concat([s1,s2,s3],keys=['one','two','three'],axis=1)

In [71]: result
Out[71]:
   one  two  three
a  0.0  NaN    NaN
b  1.0  NaN    NaN
c  NaN  2.0    NaN
d  NaN  3.0    NaN
e  NaN  4.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0

The same logic applies to DataFrames:

In [72]: df1 = DataFrame(np.arange(6).reshape(3,2),index=list('abc'),columns=['o
    ...: ne','two'])

In [73]: df2 = DataFrame(np.arange(4).reshape(2,2),index=list('ac'),columns=['th
    ...: ree','four'])

In [74]: pd.concat([df1,df2],axis=1,keys=['level1','level2'])
Out[74]:
  level1     level2     
     one two  three four
a      0   1    0.0  1.0
b      2   3    NaN  NaN
c      4   5    2.0  3.0

In [75]: pd.concat([df1,df2],axis=1)
Out[75]:
   one  two  three  four
a    0    1    0.0   1.0
b    2    3    NaN   NaN
c    4    5    2.0   3.0

In [76]: pd.concat([df1,df2])
Out[76]:
   four  one  three  two
a   NaN  0.0    NaN  1.0
b   NaN  2.0    NaN  3.0
c   NaN  4.0    NaN  5.0
a   1.0  NaN    0.0  NaN
c   3.0  NaN    2.0  NaN


# a dict works too; its keys serve as the keys option
In [77]: pd.concat({'level1':df1,'level2':df2},axis=1)
Out[77]:
  level1     level2     
     one two  three four
a      0   1    0.0  1.0
b      2   3    NaN  NaN
c      4   5    2.0  3.0

# What about row indexes that carry no useful information?
In [78]: df3 = DataFrame(np.random.randn(3,4),columns=list('abcd'))

In [79]: df4 = DataFrame(np.random.randn(2,3),columns=list('bda'))

# ignore_index=True discards the original indexes and numbers the rows from 0
In [80]: pd.concat([df3,df4],ignore_index=True)
Out[80]:
          a         b         c         d
0  0.322805 -1.495024  0.971860 -0.904791
1  0.332766 -2.320841 -0.903456  0.026713
2 -0.163670  0.387144 -0.746927 -0.360291
3 -0.074411 -0.557447       NaN  0.704929
4  2.673221 -1.143143       NaN  0.592930

1.4 Combining Data with Overlap

Combining overlapping data is not something merge or concat can express; use the np.where function (or combine_first, shown below).

In [81]: a = Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index=list('fedcba'))

In [85]: b = Series([0,1,2,3,4,np.nan],index=list('fedcba'))

In [86]: a
Out[86]:
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [87]: b
Out[87]:
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

# where a is null, take the value from b
In [88]: np.where(pd.isnull(a), b, a)
Out[88]: array([ 0. ,  2.5,  2. ,  3.5,  4.5,  nan])

# combine_first reaches the same goal, with index alignment
In [95]: a.combine_first(b)
Out[95]:
f    0.0
e    2.5
d    2.0
c    3.5
b    4.5
a    NaN
dtype: float64

For DataFrames:

In [96]: df1 = DataFrame({'a':[1,np.nan,5,np.nan],'b':[np.nan,2,np.nan,6],'c':ra
    ...: nge(2,18,4)})

In [97]: df2 = DataFrame({'a':[5,4,np.nan,3,7],'b':[np.nan,3,4,6,8]})

In [98]: df1.combine_first(df2)
Out[98]:
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN

2 Reshaping and Pivoting

2.1 Reshaping with Hierarchical Indexing

DataFrames have stack and unstack: stack "rotates" the columns into rows, and unstack does the opposite.

In [6]: data = DataFrame(np.arange(6).reshape(2,3),index=pd.Index(['Ohio','Color
   ...: ado'], name='state'),columns=pd.Index(['one','two','three'],name='number
   ...: '))

In [7]: data
Out[7]:
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

In [8]: data.stack()  # rotate the columns into rows, giving a Series with a hierarchical index
Out[8]:
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

In [9]: data.stack().unstack()  # turn the hierarchically indexed Series back into a plain DataFrame
Out[9]:
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5
In [10]: data.stack().unstack('number')
Out[10]:
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

In [11]: data.stack().unstack(1)
Out[11]:
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

In [12]: data.stack().unstack(0)
Out[12]:
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5

In [13]: data.stack().unstack('state')  # name the level to rotate
Out[13]:
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5

The same works for Series:

In [14]: s1 = Series([0,1,2,3],index=list('abcd'))

In [15]: s2 = Series([4,5,6],index=list('cde'))

In [16]: data2 = pd.concat([s1,s2],keys=['one','two'])

In [17]: data2
Out[17]:
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

In [18]: data2.unstack()
Out[18]:
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0

In [19]: data2.unstack().stack()  # NaN values are dropped by default
Out[19]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

In [20]: data2.unstack().stack(dropna=False) # keep the NaN values
Out[20]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64

2.2 Pivoting "Long" to "Wide" Format

Time series data is commonly stored in databases or CSV files in so-called "long" or "stacked" format.

In [28]: df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
    ...: 'B' : ['A', 'B', 'C'] * 4,'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'ba
    ...: r'] * 2,
    ...: 'D' : np.random.randn(12),'E' : np.random.randn(12)})

In [29]: df
Out[29]:
        A  B    C         D         E
0     one  A  foo -1.130142 -0.800142
1     one  B  foo  1.453914 -1.344615
2     two  C  foo  0.452666  0.103939
3   three  A  bar  0.053393  0.284834
4     one  B  bar -0.435699  0.885828
5     one  C  bar  0.492411 -1.280743
6     two  A  foo -1.819792 -1.654636
7   three  B  foo  0.364982  1.654728
8     one  C  foo -1.205436  0.712150
9     one  A  bar  2.696329  1.357930
10    two  B  bar -0.561867  1.326672
11  three  C  bar  0.782410 -0.600689

# a view of df with D as the value field, A and B as the rows, and C as the columns

In [30]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[30]:
C             bar       foo
A     B                    
one   A  2.696329 -1.130142
      B -0.435699  1.453914
      C  0.492411 -1.205436
three A  0.053393       NaN
      B       NaN  0.364982
      C  0.782410       NaN
two   A       NaN -1.819792
      B -0.561867       NaN
      C       NaN  0.452666
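Because every (A, B, C) combination above happens to be unique, the same view can be built by hand with set_index and unstack; pivot_table differs only in that it aggregates duplicates (by mean, by default). A sketch:

# equivalent to the pivot_table above when the key combinations are unique
df.set_index(['A', 'B', 'C'])['D'].unstack('C')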

3 Data Transformation

The previous two parts covered rearranging data; another class of operations is filtering, cleaning, and transforming.

3.1 Removing Duplicates

The duplicated method returns a boolean Series indicating whether each row is a duplicate. The drop_duplicates method returns a DataFrame with the duplicate rows removed.

In [31]: data = DataFrame({'k1':['one']*3+['two']*4,'k2':[1,1,2,3,3,4,4]})

In [32]: data
Out[32]:
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4

In [33]: data.duplicated()
Out[33]:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

In [34]: data.duplicated('k2')  # judge duplicates by a single column
Out[34]:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

In [35]: data.drop_duplicates  # note the missing parentheses: this only echoes the bound method
Out[35]:
<bound method DataFrame.drop_duplicates of     k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4>

In [36]: data.drop_duplicates()  # returns a new, deduplicated object
Out[36]:
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4

In [37]: data  # the original object is unchanged
Out[37]:
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4

# add a values column first (the outputs below include it)
In [40]: data['v1'] = range(7)

# by default the first occurrence of each duplicate is kept
In [41]: data.drop_duplicates(['k1','k2'])
Out[41]:
    k1  k2  v1
0  one   1   0
2  one   2   2
3  two   3   3
5  two   4   5

In [42]: data.drop_duplicates(['k1','k2'],take_last=True)
FutureWarning: the take_last=True keyword is deprecated, use keep='last' instead
Out[42]:
    k1  k2  v1
1  one   1   1
2  one   2   2
4  two   3   4
6  two   4   6

# keep='last' retains the last occurrence instead
In [43]: data.drop_duplicates(['k1','k2'],keep='last')
Out[43]:
    k1  k2  v1
1  one   1   1
2  one   2   2
4  two   3   4
6  two   4   6

In [44]: data.drop_duplicates(['k1','k2'],keep='first')
Out[44]:
    k1  k2  v1
0  one   1   0
2  one   2   2
3  two   3   3
5  two   4   5

3.2 Transforming Data with a Function or Mapping

A mapping can be a dict whose keys are the source values and whose values are the targets. How is it applied? A DataFrame uses apply or applymap; a Series uses map.
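The examples below only exercise Series.map, so here is a minimal sketch of the DataFrame side, where applymap applies a function element-wise:

frame = DataFrame(np.arange(6).reshape(2, 3), columns=list('abc'))
frame.applymap(lambda x: x ** 2)  # element-wise transformation
#    a   b   c
# 0  0   1   4
# 1  9  16  25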

In [45]: data = DataFrame({'food':['bacon','pulled pork','bacon','Pastrami','cor
    ...: ned beef','Bacon','pastrami','honey ham','nova lox'],'ounces':[4,3,12,6
    ...: ,7.5,8,3,5,6]})

In [46]: data
Out[46]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

In [47]: meat_to_animal = {'bacon':'pig','pulled pork':'pig','pastrami':'cow','c
    ...: orned beef':'cow','honey ham':'pig','nova lox':'salmon'}

In [49]: data['food']
Out[49]:
0          bacon
1    pulled pork
2          bacon
3       Pastrami
4    corned beef
5          Bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

# some values in the food column are capitalized, so normalize the case first
In [50]: data['food'].map(str.lower).map(meat_to_animal)
Out[50]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object


# mapping through a dict
In [51]: data['animal']=data['food'].map(str.lower).map(meat_to_animal)

In [52]: data
Out[52]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon


# mapping through a function
In [53]: data['animal']=data['food'].map(lambda x:meat_to_animal[(x.lower())])

In [54]: data
Out[54]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

3.3 Replacing Values

fillna fills in NaN values and can be seen as a special case of value replacement. The replace method offers a simpler, more general way to replace values.
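That special-case claim is easy to check directly, a quick sketch:

s = Series([1., np.nan, 3.])
s.fillna(0)           # -> 1.0, 0.0, 3.0
s.replace(np.nan, 0)  # the same result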

In [55]: data = Series([1,-999,2,-999,-1000,3])

In [56]: data
Out[56]:
0       1
1    -999
2       2
3    -999
4   -1000
5       3
dtype: int64

In [57]: data.replace(-999,np.nan)
Out[57]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [58]: data.replace([-999,-1000],np.nan) # a list of values, one replacement
Out[58]:
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [59]: data.replace([-999,-1000],[np.nan,0])  # two parallel lists
Out[59]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [60]: data.replace({-999:np.nan,-1000:0})  # dict form
Out[60]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

3.4 Renaming Axis Indexes

# modifying the data in place
In [62]: data
Out[62]:
0       1
1    -999
2       2
3    -999
4   -1000
5       3
dtype: int64

In [63]: data.index = range(2,8)

In [64]: data
Out[64]:
2       1
3    -999
4       2
5    -999
6   -1000
7       3
dtype: int64

In [65]: data.index.map(lambda x:x+100)
Out[65]: array([102, 103, 104, 105, 106, 107])

In [67]: data.index = data.index.map(lambda x:x+100)

In [68]: data
Out[68]:
102       1
103    -999
104       2
105    -999
106   -1000
107       3
dtype: int64

# rename creates a new object instead
In [73]: data2 = DataFrame(np.arange(12).reshape(3,4),columns=list('abcd'),index
    ...: =range(3))

In [74]: data2
Out[74]:
   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
In [77]: data2.rename(index={0:100})
Out[77]:
     a  b   c   d
100  0  1   2   3
1    4  5   6   7
2    8  9  10  11

In [78]: data2.rename(index={0:100},columns={'a':'ff'})
Out[78]:
     ff  b   c   d
100   0  1   2   3
1     4  5   6   7
2     8  9  10  11
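rename also accepts a function applied to each label, not only a dict, and inplace=True modifies the object itself. A sketch:

data2.rename(columns=str.upper)  # functions work as well as dicts
#    A  B   C   D
# 0  0  1   2   3
# 1  4  5   6   7
# 2  8  9  10  11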

3.5 Discretization and Binning

Continuous data often needs to be split into discrete bins, which is what the cut function does.

In [79]: ages = range(100)

In [80]: bins = [18,25,35,60,100]

In [81]: pd.cut(ages,bins)
Out[81]:
[NaN, NaN, NaN, NaN, NaN, ..., (60, 100], (60, 100], (60, 100], (60, 100], (60, 100]]
Length: 100
Categories (4, object): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [82]: cuts = pd.cut(ages,bins)

In [83]: cuts[45]
Out[83]: '(35, 60]'

In [85]: cuts.dtype
Out[85]: category

At this point, cut has returned a Categorical object:

In [86]: cuts.labels
FutureWarning: 'labels' is deprecated. Use 'codes' instead
Out[86]:
array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
        2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3], dtype=int8)

In [87]: cuts.levels
FutureWarning: Accessing 'levels' is deprecated, use 'categories'
Out[87]: Index([u'(18, 25]', u'(25, 35]', u'(35, 60]', u'(60, 100]'], dtype='object')

In [88]: pd.value_counts(cuts) # count the observations per category
Out[88]:
(60, 100]    39
(35, 60]     25
(25, 35]     10
(18, 25]      7
dtype: int64

# right=False switches which side of each interval is closed

In [89]: cuts = pd.cut(ages,bins,right=False)

In [90]: cuts
Out[90]:
[NaN, NaN, NaN, NaN, NaN, ..., [60, 100), [60, 100), [60, 100), [60, 100), [60, 100)]
Length: 100
Categories (4, object): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]

In [91]: cuts = pd.cut(ages,bins,right=False,labels=list('abcd'))

In [92]: cuts
Out[92]:
[NaN, NaN, NaN, NaN, NaN, ..., d, d, d, d, d]
Length: 100
Categories (4, object): [a < b < c < d]

# passing an integer n to cut splits the data into n equal-width bins
In [94]: pd.cut(ages,5)
Out[94]:
[(-0.099, 19.8], (-0.099, 19.8], (-0.099, 19.8], (-0.099, 19.8], (-0.099, 19.8], ..., (79.2, 99], (79.2, 99], (79.2, 99], (79.2, 99], (79.2, 99]]
Length: 100
Categories (5, object): [(-0.099, 19.8] < (19.8, 39.6] < (39.6, 59.4] < (59.4, 79.2] < (79.2, 99]]

# qcut bins by sample quantiles, so unlike cut it yields bins with roughly equal numbers of observations
In [95]: pd.qcut(ages,5)
Out[95]:
[[0, 19.8], [0, 19.8], [0, 19.8], [0, 19.8], [0, 19.8], ..., (79.2, 99], (79.2, 99], (79.2, 99], (79.2, 99], (79.2, 99]]
Length: 100
Categories (5, object): [[0, 19.8] < (19.8, 39.6] < (39.6, 59.4] < (59.4, 79.2] < (79.2, 99]]
In [96]: pd.value_counts(pd.qcut(ages,5))
Out[96]:
(79.2, 99]      20
(59.4, 79.2]    20
(39.6, 59.4]    20
(19.8, 39.6]    20
[0, 19.8]       20
dtype: int64

In [98]: ages = np.random.randn(1000)

In [99]: pd.value_counts(pd.qcut(ages,5))
Out[99]:
(0.883, 3.534]      200
(0.25, 0.883]       200
(-0.266, 0.25]      200
(-0.806, -0.266]    200
[-3.172, -0.806]    200
dtype: int64

In [100]: pd.value_counts(pd.cut(ages,5))
Out[100]:
(-0.489, 0.852]     481
(-1.831, -0.489]    283
(0.852, 2.193]      198
(-3.179, -1.831]     23
(2.193, 3.534]       15
dtype: int64


3.6 Detecting and Filtering Outliers

Just one example: cap every value whose absolute value exceeds 3 at 3 or -3.

In [101]: data = DataFrame(np.random.randn(1000,4))

In [102]: data
Out[102]:
            0         1         2         3
0    0.746898 -0.657217  0.358517  1.075503
1   -0.821582  1.408816  0.937318 -0.548889
2   -0.527847  0.749256 -0.524283 -0.977338
3    0.800118 -0.544609  0.481839  0.555546
...
996 -0.081920  0.100507 -0.431551 -1.832261
997 -1.069475  0.205978  0.918592  0.881498
998  1.749007  2.658384 -1.079657 -0.023890
999  1.488167 -0.061146  0.006189  0.070564

[1000 rows x 4 columns]

In [103]: data.describe()
Out[103]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.018474     0.030781     0.025966     0.041950
std       1.020634     0.982547     1.010164     0.948116
min      -3.150278    -3.422772    -3.406069    -2.575391
25%      -0.719646    -0.621461    -0.644949    -0.572558
50%      -0.021287     0.043128     0.027890     0.050492
75%       0.701195     0.711455     0.678644     0.665413
max       3.270054     2.983500     3.323021     2.925822


In [105]: data[(np.abs(data) > 3).any(1)]  # rows containing at least one |value| > 3
Out[105]:
            0         1         2         3
16   1.450842  2.222636 -3.289273  1.265199
353 -3.067571  0.490152  0.536051  1.252143
531  1.518928  0.289636  3.323021 -0.695272
688  1.776682 -0.799736 -3.406069  0.632751
702  0.807953 -0.674675 -3.045511 -0.267613
704 -0.469205 -3.422772 -2.669146 -0.712407
772 -0.861015 -0.573432  3.012727  0.325476
808  3.270054 -0.035521  0.061092  1.263363
812 -3.150278 -0.734308  1.376565  0.671424

In [108]: data[np.abs(data) > 3] = 100  # boolean-mask assignment (an exaggerated cap, just to see the effect)

In [110]: data.describe()
Out[110]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.284474     0.134204     0.529371     0.041950
std       5.564261     3.308565     7.123116     0.948116
min      -2.746446    -2.983108    -2.710984    -2.575391
25%      -0.714890    -0.619551    -0.635457    -0.572558
50%      -0.018859     0.048775     0.036321     0.050492
75%       0.710050     0.712375     0.686314     0.665413
max     100.000000   100.000000   100.000000     2.925822

In [111]: data[np.abs(data) > 3] = np.sign(data) * 3  # np.sign is -1, 0, or 1, so this caps at ±3

In [113]: data.describe()
Out[113]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.006526     0.037204     0.044371     0.041950
std       1.019295     0.980940     1.006138     0.948116
min      -2.746446    -2.983108    -2.710984    -2.575391
25%      -0.714890    -0.619551    -0.635457    -0.572558
50%      -0.018859     0.048775     0.036321     0.050492
75%       0.710050     0.712375     0.686314     0.665413
max       3.000000     3.000000     3.000000     2.925822

3.7 Permutation and Random Sampling

# random row reordering (note: .ix was removed in newer pandas; use .iloc there)
In [114]: df = DataFrame(np.arange(20).reshape(5,4))

In [115]: df.ix[np.random.permutation(5)]
Out[115]:
    0   1   2   3
0   0   1   2   3
4  16  17  18  19
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15

# take a random sample of three rows
In [116]: df.ix[np.random.permutation(len(df))[:3]]
Out[116]:
    0   1   2   3
0   0   1   2   3
2   8   9  10  11
4  16  17  18  19

# draw 10 values with replacement, each taken at random from bag
In [118]: bag = np.array([5,7,-1,4,6])

In [119]: bag.take(np.random.randint(0,len(bag),size=10))
Out[119]: array([ 7,  7,  7,  5,  7,  6,  4, -1,  6,  4])
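Newer pandas (0.16.1 and later, an assumption about the version in use) wraps both sampling patterns in a sample method, a sketch:

df.sample(n=3)                 # three rows without replacement
df.sample(n=10, replace=True)  # with replacement, like the take above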

3.8 Computing Indicator/Dummy Variables

Another common transformation converts a categorical variable into a dummy or indicator matrix.

# In effect, a matrix of 0s and 1s recording which category each element of the Series falls in.
In [120]: s1 = Series(list('bbacab'))

In [121]: pd.get_dummies(s1)
Out[121]:
     a    b    c
0  0.0  1.0  0.0
1  0.0  1.0  0.0
2  1.0  0.0  0.0
3  0.0  0.0  1.0
4  1.0  0.0  0.0
5  0.0  1.0  0.0

Combining cut with get_dummies is useful for statistical work:

In [122]: values = np.random.rand(10)

In [123]: values
Out[123]:
array([ 0.96294606,  0.15142481,  0.38260636,  0.01174038,  0.44699216,
        0.3726047 ,  0.50965999,  0.61532174,  0.27740518,  0.28994142])

In [126]: bins = [0,0.2,0.4,0.6,0.8,1.]

In [127]: pd.cut(values,bins)
Out[127]:
[(0.8, 1], (0, 0.2], (0.2, 0.4], (0, 0.2], (0.4, 0.6], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.2, 0.4], (0.2, 0.4]]
Categories (5, object): [(0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1]]

In [128]: pd.get_dummies(pd.cut(values, bins)) # the indicator matrix for the bins
Out[128]:
   (0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1]
0       0.0         0.0         0.0         0.0       1.0
1       1.0         0.0         0.0         0.0       0.0
2       0.0         1.0         0.0         0.0       0.0
3       1.0         0.0         0.0         0.0       0.0
4       0.0         0.0         1.0         0.0       0.0
5       0.0         1.0         0.0         0.0       0.0
6       0.0         0.0         1.0         0.0       0.0
7       0.0         0.0         0.0         1.0       0.0
8       0.0         1.0         0.0         0.0       0.0
9       0.0         1.0         0.0         0.0       0.0

In [129]: a =pd.get_dummies(pd.cut(values, bins))

In [131]: a.sum(axis=0)  # count the observations per bin
Out[131]:
(0, 0.2]      2.0
(0.2, 0.4]    4.0
(0.4, 0.6]    2.0
(0.6, 0.8]    1.0
(0.8, 1]      1.0
dtype: float64

4 String Manipulation

4.1 String Object Methods

For most string handling, the built-in methods suffice: split combined with strip, '+', join, and so on.

In [1]: a = 'a,b, guide'

In [2]: a.split(',')
Out[2]: ['a', 'b', ' guide']

In [4]: [x.strip() for x in a.split(',')]
Out[4]: ['a', 'b', 'guide']

In [5]: first,second,third = [x.strip() for x in a.split(',')]

In [6]: first
Out[6]: 'a'

In [7]: second
Out[7]: 'b'

In [8]: third
Out[8]: 'guide'

In [9]: first + third
Out[9]: 'aguide'

In [10]: first.join(third)  # join uses first as the separator between the characters of third
Out[10]: 'gauaiadae'

In [12]: ''.join([third,first])
Out[12]: 'guidea'

In [13]: 'a' in a
Out[13]: True

In [14]: a.find(':')
Out[14]: -1

In [15]: a.index(':')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-e54927e30300> in <module>()
----> 1 a.index(':')

ValueError: substring not found

In [16]: a.count(',')
Out[16]: 2

Commonly used built-in string methods include count, endswith / startswith, join, index / find, replace, strip / rstrip / lstrip, split, lower / upper, and ljust / rjust.
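A few of these applied to the string from above:

a.replace(',', ';')  # 'a;b; guide'
a.startswith('a')    # True
a.upper()            # 'A,B, GUIDE'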

4.2 Regular Expressions

A quick regular-expression refresher:

In [17]: import re

In [18]: text = 'foo    bar\t baz  \tqux'

In [19]: re.split('\s+', text)  # \s+ matches one or more whitespace characters
Out[19]: ['foo', 'bar', 'baz', 'qux']

In [20]: regex = re.compile('\s+')  # compile a regex object for reuse; this saves CPU time across many calls

In [21]: regex.split(text)
Out[21]: ['foo', 'bar', 'baz', 'qux']

In [22]: regex.findall(text)  # every match of the pattern in text
Out[22]: ['    ', '\t ', '  \t']

# match: matches only at the start of the string
# search: returns the first match anywhere
# findall: returns all matches

In [23]: text = '''
    ...: Dave dave@google.com
    ...: Steve steve@gmail.com
    ...: Rob rob@gmail.com
    ...: Ryan ryan@yahoo.com
    ...: '''

In [24]: pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# flags=re.IGNORECASE makes the match case-insensitive
In [25]: regex = re.compile(pattern, flags=re.IGNORECASE)

In [26]: regex.findall(text)
Out[26]: ['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']


In [27]: regex.search(text)
Out[27]: <_sre.SRE_Match at 0x104227b28>

In [29]: regex.search(text).group(0)
Out[29]: 'dave@google.com'

# sub replaces the matches
In [30]: regex.sub('HAHAHA', text)
Out[30]: '\nDave HAHAHA\nSteve HAHAHA\nRob HAHAHA\nRyan HAHAHA\n'

In [31]: print(regex.sub("HAHAHAAHA", text))

Dave HAHAHAAHA
Steve HAHAHAAHA
Rob HAHAHAAHA
Ryan HAHAHAAHA

# grouping
In [33]: pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [34]: regex = re.compile(pattern, flags=re.IGNORECASE)

In [35]: m=regex.match('wesm@bright.net')

In [36]: m.groups()
Out[36]: ('wesm', 'bright', 'net')

In [37]: m.group(0)
Out[37]: 'wesm@bright.net'

In [38]: m.group(1)
Out[38]: 'wesm'

In [39]: m.group(2)
Out[39]: 'bright'

In [40]: m.group(3)
Out[40]: 'net'

# with groups, findall returns a list of tuples
In [41]: regex.findall(text)
Out[41]:
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

# \1, \2, ... reference the matched groups in the replacement
In [48]: regex.sub(r'Username:\1,Domain:\2,Suffix:\3',text)
Out[48]: '\nDave Username:dave,Domain:google,Suffix:com\nSteve Username:steve,Domain:gmail,Suffix:com\nRob Username:rob,Domain:gmail,Suffix:com\nRyan Username:ryan,Domain:yahoo,Suffix:com\n'
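Named groups make the group references self-documenting; a sketch (kept in a separate variable so the pattern above stays intact):

named = re.compile(r'(?P<username>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})', re.IGNORECASE)
named.match('wesm@bright.net').groupdict()
# {'username': 'wesm', 'domain': 'bright', 'suffix': 'net'}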

4.3 Vectorized String Functions in pandas

In [61]: data = {
    ...: 'Dave':'dave@google.com',
    ...: 'Steve':'steve@gmail.com',
    ...: 'Rob':'rob@gmail.com',
    ...: 'Ryan':'ryan@yahoo.com',
    ...: 'Wes':np.nan}

In [62]: data = Series(data)

In [63]: data
Out[63]:
Dave     dave@google.com
Rob        rob@gmail.com
Ryan      ryan@yahoo.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object

In [64]: data.isnull()
Out[64]:
Dave     False
Rob      False
Ryan     False
Steve    False
Wes       True
dtype: bool

# test for substring containment; returns a boolean Series, with NaN passed through
In [65]: data.str.contains('gmail')
Out[65]:
Dave     False
Rob       True
Ryan     False
Steve     True
Wes        NaN
dtype: object


In [66]: pattern
Out[66]: '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
# returns a Series of the matches
In [67]: data.str.findall(pattern, flags=re.IGNORECASE)
Out[67]:
Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Ryan      [(ryan, yahoo, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object


In [68]: matches = data.str.match(pattern, flags=re.IGNORECASE)
FutureWarning: In future versions of pandas, match will change to always return a bool indexer.

In [69]: matches
Out[69]:
Dave     (dave, google, com)
Rob        (rob, gmail, com)
Ryan      (ryan, yahoo, com)
Steve    (steve, gmail, com)
Wes                      NaN
dtype: object

In [70]: matches.str.get(1)
Out[70]:
Dave     google
Rob       gmail
Ryan      yahoo
Steve     gmail
Wes         NaN
dtype: object

In [71]: matches.str.get(2)
Out[71]:
Dave     com
Rob      com
Ryan     com
Steve    com
Wes      NaN
dtype: object

In [72]: matches.str[0]
Out[72]:
Dave      dave
Rob        rob
Ryan      ryan
Steve    steve
Wes        NaN
dtype: object

In [73]: data.str[:5]
Out[73]:
Dave     dave@
Rob      rob@g
Ryan     ryan@
Steve    steve
Wes        NaN
dtype: object


5 Example: USDA Food Database

End of this part.