Summary query of Pandas function methods (continuous addition and improvement)

Summary query of Pandas function methods (continuous addition and improvement)

Article Directory

Original intention

NumPy, Pandas, Matplotlib, SciPy, etc. can be said to be the most commonly used Python libraries. When we use Python libraries, we usually encounter two situations. Take Pandas as an example.

  • I want to implement certain operations on the data of the Pandas data structure, but I don’t know or do not seem to remember whether there is such a function method in my impression. If so, which method should I use?
  • I want to implement some kind of data manipulation. I remember that I have used or seen a function that can implement this function, but I can't remember the name of that function. Or, I think of which function can achieve this function, but I want to know if there is a better option.

At this time, everyone will start to conduct key searches with Baidu, Zhihu, Google, and CSDN. Of course it is possible to do this, and in the end you can get the results you want, but you will face two small problems.

Sometimes I want to do such an operation on data. I know what it means, but I don’t know how to describe it. The keywords are not accurate, which leads to deviations in search results. The search technology is not good, and I have taken a lot of detours.

I searched for the results provided by others, but, either the layout is messy, or it is a long story. Starting from the function interface, a lot of information you don’t want makes you lose the point. It can be solved in a second. Questions, but it took you a minute to read other people’s explanations, get the key points, and wasted a lot of time.

Based on the above, I was thinking about how to solve such a problem. The solution is as follows: If you know but just forget that a certain function can achieve this function, then when you see the function name, you will remember. So, I want some of the most commonly used method and function names directly listed here, then you sweep in the past or simply ctrl+fsearch for Chinese, you can easily jog your memory once. If you don't know whether there is a function that has the function you want, I still list the function name, and then make a supplementary explanation in Chinese. If you scan these functions and their supplementary descriptions at a glance, you will be able to quickly determine whether there are functions that meet your needs.

The following content is organized like this. In the second part, I list the commonly used Pandas functions and their supplementary explanations. In the third part, I give examples of the usage of these functions. You can copy them into the code and modify them and you can use them directly. It's a lot easier than starting from the interface template. I believe this is what most good programmers want,Use the shortest time to complete what you want to do. Of course, when the following content does not meet your needs, you can search further.

The following function methods cover more than 90% of the use cases and are worth collecting as a small dictionary query.

Unless you often use a certain language and are so familiar with the functions and methods that you don't need to query at all, most people can't remember these things. I may only remember that there is such a thing, it is to check other people's code now, or copy it and change it, or copy it by hand. Especially like me, I have come into contact with a lot of languages, so I often engage in string usage. Most of the time, I check it out and sell it now.

List of the most commonly used functions in Pandas

## 读写
pd.Series #定义一维标记数组
pd.DataFrame #定义数据框
pd.read_csv #读取逗号分隔符文件
pd.read_excel #读取 excel 表格
pd.to_excel #写入 excel 表格
pd.read_sql #读取 SQL 数据
pd.read_table #读取 table
pd.read_json #读取 json 文件
pd.read_html #读取 html
pd.read_clipboard() #从剪切板读入数据
df.to_csv #写入 csv 文件
df.to_excel #写入 excel 文件
df.to_sql #写入 SQL 表
df.to_json #写入 JSON 文件
df.to_html #写入 HTML 表格
df.to_clipboard() #写入剪切板

## 数据展示和统计
df.info() #统计数据信息
df.shape() #统计行数和列数
df.index() #显示索引总数
df.columns() #显示数据框有哪些列
df.count() #显示有多少个记录
df.head(n) #返回前 n 个,默认 5
df.tail(n) #返回后 n 个
df.sample(n) #随机选取 n 行
df.sample(frac = 0.8) #百分比为 0.8 的选取
df.dtypes #查看每一列的数据类型
df.sum() #数据框按列求和
df.cumsum() #数据框累计求和
df.min() #给出每列的最小值
df.max() #给出每列的最大值
df['列名'].idxmin() #获取数据框某一列的最小值
mySeries.idxmin() #获取 Series 的最小值
df['列名'].idxmax() #获取数据框某一列的最大值
mySeries.idxmax() #获取 Series 的最大值
df.describe() #关数据的基本统计信息描述
df.mean() #给出数据框每一列的均值
df.median() #给出数据框每一列的中位数
df.quantile #给出分位数
df.var() #统计每一列的方差
df.std() #统计每一列的标准差
df.cummax() #寻找累计最大值,即已出现中最大的一个
df.cummin() #累计最小值
df['列名'].cumproad() #计算累积连乘
len(df) #统计数据框长度
df.isnull #返回数据框是否包含 null 值
df.corr() #返回列之间的相关系数,以矩阵形式展示
df['列名'].value_counts() #列去重后给每个值计数

## 数据选择
mySeries['列名'] #用中括号获取列
df['列名'] #选取指定列
df.列名 #同上
df[n0:n1] #返回 n0 到 n1 行之间的数据框
df.iloc[[m],[n]] #iloc按行号来索引,两层中括号,取第 m 行第 n 列
df.loc[m:n] #loc 按标签来索引,返回索引 m 到 n 的数据框,loc、iloc 主要针对行来说的
df.loc[:,"列1":"列2"] #返回连续列的所有行
df.loc[m:n,"列1":"列2"] #返回连续列的固定行
df['列名'][n] #选取指定列的第 n 行
df[['列1','列2']] #返回多个指定的列

## 数据筛选和排序
df[df.列名 < n] #筛选,单中括号用于 bool 值筛选
df.filter(regex = 'code') #过滤器,按正则表达式筛选
df.sort_values #按某一列进行排序
df.sort_index() #按照索引升序排列
df['列名'].unique() #列去重
df['列名'].nunique() #列去重后的计数
df.nlargest(n,'列名') #返回 n 个最大值构成的数据框
df.nsmallest(n,'列名') #返回 n 个最小的数据框
df.rank #给出排名,即为第几名

## 数据增加删除修改
df["新列"] = xxx #定义新列
df.rename #给列重命名
df.index.name = "index_name" #设定或者修改索引名称
df.drop #删除行或者列
df.列名 = df.列名.astype('category') #列类型强制转化
df.append #在末尾追加一行
del df['删除的列'] #直接删除一列

## 特别的
df.列名.apply #按列的函数操作
pd.melt #将宽数据转化为长数据(拆分拉长),run 一下下面例子就知道什么意思了
pd.merge #两个数据表间的横向连接(内连接,外连接等)
pd.concat #横向或者纵向拼接

Pandas function usage example

mySeries = pd.Series([1,2,3,4], index=['a','b','c','d'])

data = {'Country' : ['Belgium', 'India', 'Brazil' ],
        'Capital': ['Brussels', 'New Delhi', 'Brassilia'],
        'Population': [1234,1234,1234]}
df = pd.DataFrame(data, columns=['Country','Capital','Population'])

pd.DataFrame(np.random.rand(20,5))

df = pd.read_csv('data.csv')

pd.read_excel('filename')
pd.to_excel('filename.xlsx', sheet_name='Sheet1')

df.quantile([0.25, 0.75]) # 给出每一列中的25%和75%的分位数

filters = df.Date > '2021-06-1'
df[filters] #选出日期在某个日期之后的所有行

df.filter(regex='^L') #选出 L 开头的列

df.sort_values('列名', ascending= False) #按指定列的值大小升序排列

df.rename(columns= {'老列名' : '新列名'}) #修改某个列名

df["新列"] = df.a- df.b #定义一个新的列表示为两个的差

df.columns = map(str.lower(), df.columns) #所有列名变为小写字母

df.columns = map(str.upper(), df.columns) #所有列名变为大写字母

df.drop(columns=['列名']) #删除某一列
df.drop(['列1', '列2'], axis=1) #含义同上,删除两列
mySeries.drop(['a']) #删除 Series 指定值
df.drop([0, 1]) #根据索引删除,双闭区间

def fun(x):
    return x*3
df.列名.apply(fun)  #把某一列乘 3 倍

df.列名.apply(lambda x: x*3) #匿名表达式的写法

df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},'B': {0: 1, 1: 3, 2: 5}, 'C': {0: 2, 1: 4, 2: 6}})
pd.melt(df, id_vars=['A'], value_vars=['B','C']) #melt的使用

new=pd.DataFrame({'name':'lisa', 'gender':'F', 'city':'北京'},index=[1])
df = new
df=df.append(new) #增加一行数据

frame = pd.DataFrame({'a':[2.3,-1.7,5,3],'b':[6,2.9,-3.1,8]},index=['one','two','three','four'])
frame.rank(method="min",ascending=False)#对每一列的数据,根据大小给个排名


#merge 表示横向连接
df3 = pd.merge(df1,df2,how='inner',on='股票简称') #on表示连接列,how选择连接方式
pd.merge(df1,df2,left_on='lkey',right_on='rkey',how='left') #当连接列名不同分别指定
#concat 拼接
pd.concat([df1,df1])  #纵向连接,当s1和s2索引不重叠时,可以直接拼接
pd.concat([df1,df1],axis = 1)  #横向连接,默认外连接,以行索引为连接字段