Essential Pandas for Data Science: A Detailed Guide to Working with Datasets

Tech · 06-18 · Source: Mr数据杨

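Python's Pandas library helps with processing complex data. Whether a dataset is large or small, it can be split into manageable parts and processed to get the result you need.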

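All data used throughout this self-study tutorial series comes from the game series 《三國志》 (Romance of the Three Kingdoms) and 《真·三國無雙》 (Dynasty Warriors).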

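Environment setup

Familiarity with Python's built-in data structures, especially lists and dictionaries, is assumed. An earlier article in this series covers them:

Mr数据杨: Python基础必掌握的列表List和元组Tuple的使用 (using lists and tuples, an essential Python basic)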

Previewing data with Pandas

Read the data directly from an Excel file and work with it.

Check basic information about the data.

import pandas as pd
# Read the data
df = pd.read_excel("Romance of the Three Kingdoms 13/人物详情数据.xlsx")

# Type of the object
type(df)
<class 'pandas.core.frame.DataFrame'>

# Number of rows
len(df)
857

# Number of rows and columns
df.shape
(857, 45)

Previewing the data.

# Show the first 5 rows
df.head()

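A column of ellipses (...) in the output stands for columns that are hidden to fit the display width, not for missing data. Setting the option below shows every column (in a notebook, with a horizontal scrollbar).

pd.set_option("display.max_columns", None)

# Show the last 5 rows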
df.tail()
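
The preview calls above can be tried without the Excel file. The following is a minimal sketch on a small in-memory table; the rows and the placeholder names are made up for illustration, and only the column names 名前, 生年 and 分類 are taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data comes from the Excel file.
sample = pd.DataFrame({
    "名前": ["A", "B", "C", "D", "E", "F", "G"],
    "生年": [155, 161, 182, 181, 175, 179, 207],
    "分類": ["武官", "武官", "武官", "文官", "文官", "文官", "文官"],
})

n_rows, n_cols = sample.shape   # (rows, columns)
first_three = sample.head(3)    # head() and tail() accept a row count
last_two = sample.tail(2)

# option_context changes the display setting only inside the with block
with pd.option_context("display.max_columns", None):
    print(first_three)
    print(last_two)
print(n_rows, n_cols)
```

Using pd.option_context instead of pd.set_option keeps the setting local, which is convenient when the wider display is only needed for one preview.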

Inspecting data with Pandas

Displaying data types

Show all columns and their data types.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857 entries, 0 to 856
Data columns (total 45 columns):
名前       857 non-null object
字        857 non-null object
読み       857 non-null object
性別       857 non-null object
生年       857 non-null int64
登場       857 non-null int64
没年       857 non-null int64
寿命       857 non-null int64
死因       857 non-null object
父親       857 non-null object
母親       857 non-null object
相性       857 non-null object
列伝       857 non-null object
商業       857 non-null int64
農業       857 non-null int64
文化       857 non-null int64
訓練       857 non-null int64
巡察       857 non-null int64
説破       857 non-null int64
交渉       857 non-null int64
弁舌       857 non-null int64
人徳       857 non-null int64
威風       857 non-null int64
神速       857 non-null int64
奮戦       857 non-null int64
連戦       857 non-null int64
攻城       857 non-null int64
兵器       857 non-null int64
堅守       857 non-null int64
水連       857 non-null int64
一騎       857 non-null int64
豪傑       857 non-null int64
鬼謀       857 non-null int64
音声       857 non-null object
武器       857 non-null object
性格       857 non-null object
義理       857 non-null object
勇愛       857 non-null object
才愛       857 non-null object
分類       857 non-null object
武具
興味    857 non-null object
書物
興味    857 non-null object
宝物
興味    857 non-null object
酒
興味     857 non-null object
物欲       857 non-null object
dtypes: int64(24), object(21)
memory usage: 301.4+ KB

The output shows two data types, int64 and object. Pandas uses the NumPy library to work with these types.

Displaying basic statistics

Show basic descriptive statistics for all numeric columns.

df.describe()

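Use the include parameter to look at other data types. For object columns no mean or standard deviation is computed; only the count, the number of unique values, the most frequent value (top) and its frequency (freq) are shown.

# include=object selects the object (string) columns; no NumPy import is needed
df.describe(include=object)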
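
As a small sketch of the difference between the two describe() calls, the example below uses a made-up four-row table. The values are illustrative only, and the text column is created with dtype=object explicitly so that include=object selects it regardless of the pandas version.

```python
import pandas as pd

# Hypothetical values for illustration only.
sample = pd.DataFrame({
    "寿命": [40, 60, 50, 70],
    "性格": pd.Series(["冷静", "豪胆", "冷静", "小心"], dtype=object),
})

numeric_stats = sample.describe()               # numeric columns only
object_stats = sample.describe(include=object)  # object columns only

print(numeric_stats.loc["mean", "寿命"])  # arithmetic mean of 寿命
print(object_stats.loc["top", "性格"])    # most frequent value
print(object_stats.loc["freq", "性格"])   # how often that value occurs
```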

Exploring the dataset

Exploratory data analysis helps answer questions about a dataset.

How often each value occurs in a column.

df["性格"].value_counts()

冷静    290
豪胆    223
小心    178
猪突    165
？       1
Name: 性格, dtype: int64


df["分類"].value_counts()

武官    520
文官    336
？       1
Name: 分類, dtype: int64
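
value_counts() can also report shares instead of absolute counts. A minimal sketch on made-up values (the six entries below are not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample of the 性格 column.
traits = pd.Series(["冷静", "豪胆", "冷静", "小心", "冷静", "豪胆"])

counts = traits.value_counts()                # absolute frequencies, most common first
shares = traits.value_counts(normalize=True)  # relative frequencies that sum to 1

print(counts)
print(shares.round(2))
```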

Pandas data structures

Series objects

Python's most basic data structure is the list, and a new Series object can be created from a list.

revenues = pd.Series([1, 2, 3])
revenues
0    1
1    2
2    3
dtype: int64

type(revenues.values)
<class 'numpy.ndarray'>

A Series behaves much like a list. The difference is that it can carry an explicit index.

data = pd.Series(
    [1, 2, 3],
    index=["A", "B", "C"]
)
data
A    1
B    2
C    3
dtype: int64

A Series can also be built from a dictionary.

data = pd.Series({"A": 1, "B": 2})
data
A    1
B    2
dtype: int64

A Series also supports .keys() and the in keyword.

data.keys()
Index(['A', 'B'], dtype='object')
"A" in data 
True
"C" in data 
False
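
One detail worth a short sketch: on a Series, the in keyword tests the index labels, not the stored values. The variable names below are illustrative.

```python
import pandas as pd

data = pd.Series({"A": 1, "B": 2})

label_found = "A" in data        # True: "A" is an index label
value_as_label = 1 in data       # False: 1 is a value, not a label
value_found = 1 in data.values   # True: search the underlying NumPy array

print(label_found, value_as_label, value_found)
```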

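DataFrame objects

A DataFrame can combine such objects into one table when a dictionary is passed to its constructor. The dictionary keys become the column names, and the values should be Series objects.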

A_ = pd.Series([1, 2, 3],index=["A", "B", "C"])
B_ = pd.Series({"A": 11, "B": 22})
data = pd.DataFrame({
    "A": A_,
    "B": B_
})

data
   A     B
A  1  11.0
B  2  22.0
C  3   NaN

The index of the new DataFrame is the union of the two Series indexes.

data.index
Index(['A', 'B', 'C'], dtype='object')

A DataFrame also stores its values in a NumPy array.

data.values
array([[ 1., 11.],
       [ 2., 22.],
       [ 3., nan]])

The axes of a DataFrame.

data.axes
[Index(['A', 'B', 'C'], dtype='object'), Index(['A', 'B'], dtype='object')]

data.axes[0]
Index(['A', 'B', 'C'], dtype='object')

data.axes[1]
Index(['A', 'B'], dtype='object')

A DataFrame is also a dictionary-like data structure, so it supports .keys() and the in keyword as well.

data.keys()
Index(['A', 'B'], dtype='object')
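
A short sketch of what in means for a DataFrame, rebuilt from the same two Series: membership is tested against the column names, so a row label has to be looked up in .index explicitly. The result variable names are illustrative.

```python
import pandas as pd

A_ = pd.Series([1, 2, 3], index=["A", "B", "C"])
B_ = pd.Series({"A": 11, "B": 22})
data = pd.DataFrame({"A": A_, "B": B_})

has_column = "B" in data          # True: membership tests the column names
has_row_label = "C" in data       # False, although "C" is a row label
row_in_index = "C" in data.index  # True: test the row index explicitly
missing = data["B"].isna().sum()  # row "C" has no value in B_, so it holds NaN

print(has_column, has_row_label, row_in_index, missing)
print(data.dtypes)                # column B became float64 to hold the NaN
```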

Accessing Series elements

Using the indexing operator

Access by label and by integer position are both supported.

A_
A    1
B    2
C    3
dtype: int64

A_['A']
1

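A_[0]  # positional access on a labelled index; recent pandas versions warn about this, and A_.iloc[0] is the explicit form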
1

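List-style negative indexing and slicing are supported as well.

A_[-1]  # also positional; A_.iloc[-1] is the explicit form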
3

A_[1:]
B    2
C    3
dtype: int64

.loc and .iloc

These accessors handle the case where the label index is itself numeric.

colors = pd.Series(
    ["red", "purple", "blue", "green", "yellow"],
    index=[1, 2, 3, 5, 8]
)

colors
1       red
2    purple
3      blue
5     green
8    yellow
dtype: object

.loc refers to the label index.
.iloc refers to the positional index.

colors.loc[1]
'red'

colors.iloc[1]
'purple'

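In the output above, .loc refers to the label index (1, 2, 3, 5, 8), while .iloc refers to the positional index (0 through 4).

.iloc selects elements by their implicit (positional) index, and the end of the slice is excluded.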

colors.iloc[1:3]
2    purple
3      blue
dtype: object

.loc returns the elements whose explicit index lies between 3 and 8, with both ends included.

colors.loc[3:8]
3      blue
5     green
8    yellow
dtype: object

.iloc also accepts negative positions.

colors.iloc[-2]
'green'
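
To make the contrast concrete, the sketch below applies the same slice bounds 1:3 through both accessors on the colors Series. The result variable names are illustrative.

```python
import pandas as pd

colors = pd.Series(
    ["red", "purple", "blue", "green", "yellow"],
    index=[1, 2, 3, 5, 8],
)

by_position = colors.iloc[1:3]  # positions 1 and 2; position 3 is excluded
by_label = colors.loc[1:3]      # labels 1, 2 and 3; label 3 is included

print(list(by_position))
print(list(by_label))
```

The same bounds return two elements through .iloc and three through .loc, because label slices include their end point.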

Accessing DataFrame elements

Using the indexing operator

Think of a DataFrame as a dictionary whose values are Series.

import pandas as pd

A_ = pd.Series([1, 2, 3],index=["A", "B", "C"])
B_ = pd.Series({"A": 11, "B": 22})
data = pd.DataFrame({
    "A": A_,
    "B": B_
})

data["A"]
A    1
B    2
C    3
Name: A, dtype: int64


type(data["A"])
<class 'pandas.core.series.Series'>

Dot notation works for column access as well.

data.A
A    1
B    2
C    3
Name: A, dtype: int64

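A column name can collide with an existing DataFrame attribute or method. In that case dot notation returns the attribute, and only the indexing operator reaches the column.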

toys = pd.DataFrame([
    {"name": "ball", "shape": "sphere"},
    {"name": "Rubik's cube", "shape": "cube"}
])

toys["shape"]
0    sphere
1      cube
Name: shape, dtype: object

toys.shape
(2, 2)

.loc and .iloc

A DataFrame also provides the .loc and .iloc accessors.

data.loc["A"]
A     1.0
B    11.0
Name: A, dtype: float64

data.loc["A": "B"]
   A     B
A  1  11.0
B  2  22.0

data.iloc[1]
A     2.0
B    22.0
Name: B, dtype: float64
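
Both accessors also take a row selector and a column selector together. A minimal sketch on the same small DataFrame, with illustrative variable names:

```python
import pandas as pd

A_ = pd.Series([1, 2, 3], index=["A", "B", "C"])
B_ = pd.Series({"A": 11, "B": 22})
data = pd.DataFrame({"A": A_, "B": B_})

cell = data.loc["B", "A"]          # row label "B", column "A"
block = data.loc["A":"B", ["B"]]   # inclusive label slice plus a list of columns
same_cell = data.iloc[1, 0]        # second row, first column

print(cell, same_cell)
print(block)
```

Passing a list of columns, as in ["B"], keeps the result a DataFrame, while a single column label would return a Series.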

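Querying the dataset

A large dataset can be narrowed down to a subset through indexing, which means the data can be queried by placing a condition inside the indexing operator.

Select the characters born after the year 200.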

born_date = df[df["生年"] > 200]
born_date.shape

(189, 45)

Select the characters whose courtesy name (字) is not the placeholder "-".

not_null_data = df[df["字"]!="-"]
not_null_data.shape

(381, 45)

Select the characters whose family name is 曹.

people = df[df["名前"].str.startswith("曹")]
people.shape

(25, 45)

Combine several conditions with the & operator, wrapping each condition in parentheses.

df[
    (df["名前"].str.startswith("曹")) &
    (df["生年"] > 200) &
    (df["分類"] == "文官")
]
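
Besides & for AND, Boolean masks can be combined with | for OR and inverted with ~. The following sketch uses four made-up rows; the names and years are hypothetical and only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only.
sample = pd.DataFrame({
    "名前": ["曹甲", "曹乙", "劉丙", "孫丁"],
    "生年": [155, 205, 161, 210],
    "分類": ["武官", "文官", "武官", "文官"],
})

is_cao = sample["名前"].str.startswith("曹")  # Boolean Series, one entry per row
born_late = sample["生年"] > 200

both = sample[is_cao & born_late]    # AND: both conditions hold
either = sample[is_cao | born_late]  # OR: at least one condition holds
not_cao = sample[~is_cao]            # NOT: invert the mask

print(both)
print(either)
print(not_cao)
```

Storing each mask in a named variable keeps longer filters readable and avoids missing parentheses around the comparisons.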

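Grouping and aggregating data

Pandas provides grouping and aggregation functions for all kinds of statistics.

A Series has more than 20 methods for computing descriptive statistics.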

A_.sum()
6

A_.min()
1

A DataFrame has several columns, so it supports grouped aggregation.

Group by 『分類』 and compute the mean of the 『寿命』 column.

df.groupby("分類", sort=False)["寿命"].mean()

分類
武官     51.646154
文官     58.220238
？     104.000000
Name: 寿命, dtype: float64
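
Several statistics per group can be computed in one pass with .agg(). A minimal sketch on made-up values (the five rows below are not from the dataset):

```python
import pandas as pd

# Hypothetical rows for illustration only.
sample = pd.DataFrame({
    "分類": ["武官", "文官", "武官", "文官", "武官"],
    "寿命": [40, 60, 50, 70, 60],
})

grouped = sample.groupby("分類", sort=False)["寿命"]

mean_by_class = grouped.mean()                 # one value per group
summary = grouped.agg(["count", "mean", "max"])  # one column per statistic

print(mean_by_class)
print(summary)
```

With sort=False the groups appear in order of first occurrence rather than sorted by key.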

Column operations

Create a copy of the original df to work with.

df_ = df.copy()
df_.shape

(857, 45)

Compute the lifespan by subtracting one column from another.

df["life"] = df["没年"] - df["生年"]
df["life"]

0      35
1      69
2      64
3      66
4      59
       ..
852    99
853    98
854    98
855    98
856    98
Name: life, Length: 857, dtype: int64

df["life"].max()

Columns can be renamed.

renamed_df = df.rename(
    columns={"生年": "born", "没年": "death"}
)

Rows or columns that are not needed can be dropped.

renamed_df.shape
(857, 46)

del_columns = ["寿命"]
# axis=1 drops columns; axis=0 drops rows
renamed_df.drop(del_columns, inplace=True, axis=1)
renamed_df.shape
(857, 45)
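
The three column operations can also be chained on a copy without inplace=True, so the source table stays untouched. The sketch below uses three made-up rows; the numbers are hypothetical.

```python
import pandas as pd

# Hypothetical rows for illustration only.
sample = pd.DataFrame({
    "生年": [155, 161, 182],
    "没年": [220, 223, 252],
    "寿命": [66, 63, 71],
})

work = sample.copy()
work["life"] = work["没年"] - work["生年"]   # derived column
work = work.rename(columns={"生年": "born", "没年": "death"})
work = work.drop(columns=["寿命"])           # returns a new DataFrame

print(work)
print(sample.columns.tolist())               # the original is unchanged
```

drop(columns=[...]) is equivalent to passing the names with axis=1 and states the intent without the axis argument.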

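Redefining data types

A type conversion here applies to an entire column at once.

To convert a column to object, call .astype() and assign the result back to the column.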

df["寿命"] = df["寿命"].astype('object')
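
.astype() returns a converted copy and leaves the source untouched, which is why the result has to be assigned back. A minimal sketch with made-up values:

```python
import pandas as pd

ages = pd.Series([35, 69, 64], name="寿命")  # hypothetical values

as_object = ages.astype("object")   # generic Python objects
as_float = ages.astype("float64")   # floating point numbers
as_text = ages.astype(str)          # text representation of each value

print(ages.dtype, as_object.dtype, as_float.dtype)
print(as_text.iloc[0])
```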

Cleaning data

Missing values

The simplest way to handle records with missing values is to ignore or drop them.

df_drop = df.dropna()

Missing values can also be filled in.

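df_filled = df.copy()
# Assign the result back; fillna(inplace=True) on a selected column may not update the DataFrame in recent pandas versions
df_filled["字"] = df_filled["字"].fillna("-")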
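
The two strategies can be compared side by side on a tiny made-up table. The rows are hypothetical; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical rows; None marks a missing courtesy name.
sample = pd.DataFrame({
    "名前": ["甲", "乙", "丙"],
    "字": ["X", None, "Y"],
})

n_missing = sample["字"].isna().sum()   # count the missing entries
dropped = sample.dropna(subset=["字"])  # remove rows where 字 is missing
filled = sample.fillna({"字": "-"})     # replace missing 字 with "-"

print(n_missing)
print(dropped)
print(filled)
```

Passing subset to dropna() limits the check to the listed columns, so a gap in an unrelated column does not remove the row.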

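Invalid values

Invalid values can be harder to handle than missing ones, and they can cause all sorts of unforeseen trouble in later analysis.

Removing unreasonable or anomalous records requires an understanding of the business domain.

Inconsistent values

Define query conditions that should be mutually exclusive, and verify that they never occur together.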

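Check that the birth year is earlier than the death year.

df[df["没年"] > df["生年"]].empty
False

False means the selection is not empty: these rows have a death year later than the birth year, as expected. To show that no inconsistent records exist, the opposite selection, df[df["没年"] < df["生年"]], is the one that must be empty.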
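
A minimal sketch of such a consistency check, using made-up rows in which the last record is deliberately wrong:

```python
import pandas as pd

# Hypothetical rows; the last one has a death year before the birth year.
sample = pd.DataFrame({
    "名前": ["甲", "乙", "丙"],
    "生年": [155, 161, 230],
    "没年": [220, 223, 200],
})

# Select the rows that violate the rule; a clean dataset gives an empty result.
bad_order = sample[sample["没年"] < sample["生年"]]
has_problem = not bad_order.empty

print(has_problem)
print(bad_order)
```

Selecting the violating rows, rather than the valid ones, has the advantage that a non-empty result already lists the records to inspect.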

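Combining datasets

The various ways of combining datasets are covered in a separate article:

Mr数据杨: 数据科学必备Pandas实操数据各种拼接操作汇总 (a summary of hands-on Pandas data-combining operations)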

Visualizing a DataFrame

Series and DataFrame objects both have a .plot() method for drawing charts.

"""
# Available chart kinds include
    - 'line' : line chart
    - 'bar' : bar chart
    - 'hist' : histogram
    - 'box' : box plot
    .....
"""

data = df["性格"].value_counts()
data.plot(kind="bar")  # Series.plot() takes the chart kind as a keyword argument