Source: DeepHub IMBA, compiled and edited by 数据派THU. About 2,000 words; suggested reading time: 5 minutes.
This article introduces 10 handy Pandas tricks.
1. The Categorical type
# raw data
+----------+------------------------+
| class    | filename               |
+----------+------------------------+
| Bathroom | Bathroom\bath_1.jpg    |
| Bathroom | Bathroom\bath_100.jpg  |
| Bathroom | Bathroom\bath_1003.jpg |
| Bathroom | Bathroom\bath_1004.jpg |
| Bathroom | Bathroom\bath_1005.jpg |
+----------+------------------------+

# target
+------------------------+------------------------+-----------------------------+
| anchor                 | positive               | negative                    |
+------------------------+------------------------+-----------------------------+
| Bathroom\bath_1.jpg    | Bathroom\bath_100.jpg  | Dinning\din_540.jpg         |
| Bathroom\bath_100.jpg  | Bathroom\bath_1003.jpg | Dinning\din_1593.jpg        |
| Bathroom\bath_1003.jpg | Bathroom\bath_1004.jpg | Bedroom\bed_329.jpg         |
| Bathroom\bath_1004.jpg | Bathroom\bath_1005.jpg | Livingroom\living_1030.jpg  |
| Bathroom\bath_1005.jpg | Bathroom\bath_1007.jpg | Bedroom\bed_1240.jpg        |
+------------------------+------------------------+-----------------------------+
triplets.info(memory_usage="deep")
#  #   Column    Non-Null Count   Dtype
# ---  ------    --------------   -----
#  0   anchor    525000 non-null  category
#  1   positive  525000 non-null  category
#  2   negative  525000 non-null  category
# dtypes: category(3)
# memory usage: 4.6 MB

# without categories
triplets_raw.info(memory_usage="deep")
#  #   Column    Non-Null Count   Dtype
# ---  ------    --------------   -----
#  0   anchor    525000 non-null  object
#  1   positive  525000 non-null  object
#  2   negative  525000 non-null  object
# dtypes: object(3)
# memory usage: 118.1 MB
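The 25x memory saving comes from converting the repeated path strings to the category dtype, which stores each unique string once and references it by an integer code. A minimal sketch of that conversion, assuming triplets_raw is the object-dtype DataFrame shown above:

import pandas as pd

# assumption: triplets_raw holds the object-dtype columns from the example
triplets = triplets_raw.copy()
for col in ["anchor", "positive", "negative"]:
    # each unique path is stored once; rows hold small integer codes
    triplets[col] = triplets[col].astype("category")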
2. Reshaping rows and columns
cols = sorted([col for col in original_df.columns
               if col.startswith("pct_bb")])
df = original_df[["cfips"] + cols]
df = df.melt(id_vars="cfips",
             value_vars=cols,
             var_name="year",
             value_name="feature").sort_values(by=["cfips", "year"])
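For context, since original_df is not defined in the snippet above, here is a minimal, self-contained sketch of what melt does on a toy wide-format frame (the column names and values here are illustrative only):

import pandas as pd

wide = pd.DataFrame({"cfips": [1001, 1003],
                     "pct_bb_2019": [0.70, 0.60],
                     "pct_bb_2020": [0.80, 0.65]})

# each pct_bb_* column becomes a row: one (cfips, year, feature) triple per value
long = wide.melt(id_vars="cfips",
                 value_vars=["pct_bb_2019", "pct_bb_2020"],
                 var_name="year",
                 value_name="feature")
print(long)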
3. apply() is slow
import pandas as pd
import swifter

def target_function(row):
    return row * 10

def traditional_way(data):
    data['out'] = data['in'].apply(target_function)

def swifter_way(data):
    data['out'] = data['in'].swifter.apply(target_function)
import pandas as pd
from pandarallel import pandarallel

def target_function(row):
    return row * 10

def traditional_way(data):
    data['out'] = data['in'].apply(target_function)

def pandarallel_way(data):
    pandarallel.initialize()
    data['out'] = data['in'].parallel_apply(target_function)
4. Null values: int vs. Int64
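A plain int64 column cannot hold missing values: as soon as a NaN appears, pandas silently upcasts the column to float64. The nullable Int64 extension dtype keeps the values as integers and represents the gap as pd.NA. A minimal sketch:

import pandas as pd

s = pd.Series([1, 2, None])
print(s.dtype)      # float64 - the NaN forced an upcast

s_int = pd.Series([1, 2, None], dtype="Int64")
print(s_int.dtype)  # Int64 - stays integer; the missing value prints as <NA>
print(s_int)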
5. CSV, compressed CSV, or Parquet?
+------------------------+---------+
| file                   | size    |
+------------------------+---------+
| triplets_525k.csv      | 38.4 MB |
| triplets_525k.csv.gzip | 4.3 MB  |
| triplets_525k.csv.zip  | 4.5 MB  |
| triplets_525k.parquet  | 1.9 MB  |
+------------------------+---------+
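A sketch of how files like those in the table might be produced (the file names mirror the table; to_parquet requires pyarrow or fastparquet to be installed):

# plain CSV
triplets.to_csv("triplets_525k.csv", index=False)
# compressed CSV - codec passed explicitly, since ".gzip" is not an
# extension pandas infers on its own (".gz" and ".zip" are)
triplets.to_csv("triplets_525k.csv.gzip", index=False, compression="gzip")
triplets.to_csv("triplets_525k.csv.zip", index=False, compression="zip")
# columnar Parquet - usually the smallest on disk and fastest to read back
triplets.to_parquet("triplets_525k.parquet", index=False)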
6. value_counts()
df = pd.DataFrame({"a": [1, 2, None], "b": [4., 5.1, 14.02]})
df["a"] = df["a"].astype("Int64")
print(df.info())
print(df["a"].value_counts(normalize=True, dropna=False),
      df["a"].value_counts(normalize=True, dropna=True),
      sep="\n\n")
7. Modin
!pip install modin[all]

import modin.pandas as pd

df = pd.read_csv("my_dataset.csv")
8. extract()
import pandas as pd

regex = (r'(?P<title>[A-Za-z\'\s]+),'
         r'(?P<author>[A-Za-z\s\']+),'
         r'(?P<isbn>[\d-]+),'
         r'(?P<year>\d{4}),'
         r'(?P<publisher>.+)')

addr = pd.Series([
    "The Lost City of Amara,Olivia Garcia,978-1-234567-89-0,2023,HarperCollins",
    "The Alchemist's Daughter,Maxwell Greene,978-0-987654-32-1,2022,Penguin Random House",
    "The Last Voyage of the HMS Endeavour,Jessica Kim,978-5-432109-87-6,2021,Simon & Schuster",
    "The Ghosts of Summer House,Isabella Lee,978-3-456789-12-3,2000,Macmillan Publishers",
    "The Secret of the Blackthorn Manor,Emma Chen,978-9-876543-21-0,2023,Random House Children's Books"
])
addr.str.extract(regex)
9. Reading and writing the clipboard
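pandas can exchange tables with the system clipboard directly, which is handy for moving small tables between a spreadsheet and a notebook without saving intermediate files. A minimal sketch:

import pandas as pd

# copy a range of cells in Excel / Google Sheets, then:
df = pd.read_clipboard()      # parses the clipboard text like read_csv

# ...and push a result back, ready to paste into a spreadsheet
df.to_clipboard(index=False)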
10. Splitting an array column into multiple columns
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3],
                   "b": [4, 5, 6],
                   "category": [["foo", "bar"], ["foo"], ["qux"]]})

# let's increase the number of rows in the dataframe
df = pd.concat([df] * 10000, ignore_index=True)
def dummies_series_apply(df):
    return df.join(df['category'].apply(pd.Series)
                                 .stack()
                                 .str.get_dummies()
                                 .groupby(level=0)
                                 .sum()) \
             .drop("category", axis=1)

%timeit dummies_series_apply(df.copy())
# 5.96 s ± 66.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
from sklearn.preprocessing import MultiLabelBinarizer

def sklearn_mlb(df):
    mlb = MultiLabelBinarizer()
    return df.join(pd.DataFrame(mlb.fit_transform(df['category']),
                                columns=mlb.classes_)) \
             .drop("category", axis=1)

%timeit sklearn_mlb(df.copy())
# 35.1 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
def dummies_vectorized(df):
    return pd.get_dummies(df.explode("category"), prefix="cat") \
             .groupby(["a", "b"]) \
             .sum() \
             .reset_index()

%timeit dummies_vectorized(df.copy())
# 29.3 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)