Polars Notes

Preface

I. Basic Concepts in Polars

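1. Data types

Polars supports a range of data types, which fall into four groups: numeric types (signed integers, unsigned integers, and floating-point numbers), nested types (lists, structs, and arrays), temporal types (dates and times), and miscellaneous types (strings, binary data, Booleans, categoricals, enums, and objects).

The table below lists the data types covered in these notes. The most commonly used are Boolean, Int, UInt, Float, Decimal, String, Date, Time, Array, List, Categorical, Struct, and Null.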

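Type	Description
Boolean	Boolean value
Int8, Int16, Int32, Int64	Signed integers of 8, 16, 32, or 64 bits
UInt8, UInt16, UInt32, UInt64	Unsigned integers of 8, 16, 32, or 64 bits
Float32, Float64	32-bit and 64-bit floating-point numbers
Decimal	128-bit decimal type with optional precision and a non-negative scale; use it when exact decimal precision is required
String	Variable-length UTF-8 encoded string
Binary	Binary data
Date	Calendar date
Time	Time of day
Array	Fixed-shape array in each row, similar to a NumPy array
List	Homogeneous 1D container of variable length; unlike Array, rows may differ in length
Object	Arbitrary Python object
Categorical	Efficient encoding of strings
Enum	Efficient ordered encoding of a predetermined set of string categories
Struct	Composite type that stores multiple fields
Null	Represents null values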

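2. Series

The core data structures in Polars are the Series and the DataFrame. A Series is a one-dimensional, homogeneous data structure: all of its elements have the same data type.

2.1 Basic Series operations

# Create a Series. Polars infers the data type automatically, but the dtype parameter can set it explicitly.
import polars as pl
s = pl.Series("ints", [1, 2, 3, 4, 5])
s2 = pl.Series("uints", [1, 2, 3, 4, 5], dtype=pl.UInt64)
print(s)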

3. DataFrame

A DataFrame is a two-dimensional, heterogeneous data structure made up of uniquely named Series.

3.1 Basic DataFrame operations

# Create a DataFrame
from datetime import date
import polars as pl

df = pl.DataFrame(
    {
        "name": ["Alice", "Ben", "Chloe"],
        "birthdate": [
            date(2000, 1, 2),
            date(2001, 2, 22),
            date(2003, 3, 4),
        ],
        "weight": [57.6, 72.1, 65.0],
        "height": [1.68, 1.77, 1.82],
    }
)

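3.2 Inspecting a DataFrame

Head

The .head() function shows the first rows of a DataFrame, which is handy for previewing the data.
print(df.head()) # 5 rows by default
print(df.head(5)) # explicit number of rows

Glimpse

The .glimpse() function also shows the values of the first rows of a DataFrame, but in a different layout from .head(): the output looks transposed, with each column of the DataFrame printed as one line. .glimpse() is only available in the Python version of Polars.
print(df.glimpse(return_as_string=True))

Tail

The .tail() function shows the last rows of a DataFrame. The format matches .head(), and the default is also 5 rows.
print(df.tail(6)) # returns every row if the DataFrame has fewer than 6

Sample

The .sample() function returns a given number of randomly selected rows from the DataFrame. The rows are not necessarily returned in their original order.
import random
random.seed(42)
print(df.sample(2)) # without replacement, n cannot exceed the number of rows (3 here)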

Describe

The .describe() function shows summary statistics for the columns of a DataFrame.
print(df.describe())

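Schema

.schema is a property of a DataFrame. It describes the DataFrame's schema: a mapping from column (Series) names to data types.

print(df.schema) # Schema({'name': String, 'birthdate': Date, 'weight': Float64, 'height': Float64})

As with Series, Polars infers the schema when a DataFrame is created, but the schema can be overridden manually.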

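# Option 1: pass a schema mapping; it must list every column (None keeps the inferred type)
df = pl.DataFrame(
    {
        "name": ["Alice", "Ben", "Chloe"],
        "weight": [57.6, 72.1, 65.0],
    },
    schema={"name": None, "weight": pl.Float32},
)

# Option 2: schema_overrides may omit columns whose inferred type should be kept
df = pl.DataFrame(
    {
        "name": ["Alice", "Ben", "Chloe"],
        "weight": [57.6, 72.1, 65.0],
    },
    schema_overrides={"weight": pl.Float32},
)
print(df)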

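4. Expression

Polars expressions are lazy representations of a computation: without a context, nothing is computed and the expression remains only a description. Expressions are flexible and modular, so they can be combined to build more complex expressions.

import polars as pl
bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
print(bmi_expr)
# Because expressions are lazy, this prints the expression itself rather than its result:
# [(col("weight")) / (col("height").pow([dyn int: 2]))]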

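5. Context

Polars expressions need a context in which they are executed to produce a result. Depending on the context, the same expression can produce different results. The most common contexts are: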

select
with_columns
filter
group_by
group_by_dynamic
rolling

# Example DataFrame
# shape: (4, 4)
# ┌────────────────┬────────────┬────────┬────────┐
# │ name           ┆ birthdate  ┆ weight ┆ height │
# │ ---            ┆ ---        ┆ ---    ┆ ---    │
# │ str            ┆ date       ┆ f64    ┆ f64    │
# ╞════════════════╪════════════╪════════╪════════╡
# │ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
# │ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
# │ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
# │ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
# └────────────────┴────────────┴────────┴────────┘

5.1 select

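The select context applies expressions over columns and produces new columns that are aggregations, combinations of other columns, or literals.

result = df.select(
    bmi=bmi_expr,             # expression, evaluated within the context
    avg_bmi=bmi_expr.mean(),  # mean BMI
    ideal_max_bmi=25,         # constant column with the value 25
)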
print(result)
# shape: (4, 3)
# ┌───────────┬───────────┬───────────────┐
# │ bmi       ┆ avg_bmi   ┆ ideal_max_bmi │
# │ ---       ┆ ---       ┆ ---           │
# │ f64       ┆ f64       ┆ i32           │
# ╞═══════════╪═══════════╪═══════════════╡
# │ 23.791913 ┆ 23.438973 ┆ 25            │
# │ 23.141498 ┆ 23.438973 ┆ 25            │
# │ 19.687787 ┆ 23.438973 ┆ 25            │
# │ 27.134694 ┆ 23.438973 ┆ 25            │
# └───────────┴───────────┴───────────────┘

5.2 with_columns

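The with_columns context is similar to select. The difference is that with_columns creates a new DataFrame containing the columns of the original DataFrame plus the new columns produced by its input expressions, whereas select returns only the columns selected or created by its input expressions.

result = df.with_columns(
    bmi=bmi_expr,
    avg_bmi=bmi_expr.mean(),
    ideal_max_bmi=25,
)
print(result)
# shape: (4, 7)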
# ┌────────────────┬────────────┬────────┬────────┬───────────┬───────────┬───────────────┐
# │ name           ┆ birthdate  ┆ weight ┆ height ┆ bmi       ┆ avg_bmi   ┆ ideal_max_bmi │
# │ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---       ┆ ---       ┆ ---           │
# │ str            ┆ date       ┆ f64    ┆ f64    ┆ f64       ┆ f64       ┆ i32           │
# ╞════════════════╪════════════╪════════╪════════╪═══════════╪═══════════╪═══════════════╡
# │ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 23.791913 ┆ 23.438973 ┆ 25            │
# │ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 23.141498 ┆ 23.438973 ┆ 25            │
# │ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 19.687787 ┆ 23.438973 ┆ 25            │
# │ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 27.134694 ┆ 23.438973 ┆ 25            │
# └────────────────┴────────────┴────────┴────────┴───────────┴───────────┴───────────────┘

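Because of this difference, the expressions used in a with_columns context must produce Series with the same length as the original columns of the DataFrame (a scalar result, such as a mean, is broadcast to that length). The expressions in a select context only need to produce Series of the same length as one another; that length does not have to match the original DataFrame.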

5.3 filter

The filter context filters the rows of a DataFrame based on one or more expressions that evaluate to the Boolean data type.

result = df.filter(
    pl.col("birthdate").is_between(date(1982, 12, 31), date(1996, 1, 1)),
    pl.col("height") > 1.7,
)
print(result)
# shape: (1, 4)
# ┌───────────┬────────────┬────────┬────────┐
# │ name      ┆ birthdate  ┆ weight ┆ height │
# │ ---       ┆ ---        ┆ ---    ┆ ---    │
# │ str       ┆ date       ┆ f64    ┆ f64    │
# ╞═══════════╪════════════╪════════╪════════╡
# │ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
# └───────────┴────────────┴────────┴────────┘

5.4 group_by

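In the group_by context, rows are grouped according to the unique values of the grouping expressions. The expressions passed to agg are then applied to each group.

result = df.group_by(
    (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
).agg(pl.col("name"))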
print(result)
# shape: (2, 2)
# ┌────────┬─────────────────────────────────┐
# │ decade ┆ name                            │
# │ ---    ┆ ---                             │
# │ i32    ┆ list[str]                       │
# ╞════════╪═════════════════════════════════╡
# │ 1980   ┆ ["Ben Brown", "Chloe Cooper", … │
# │ 1990   ┆ ["Alice Archer"]                │
# └────────┴─────────────────────────────────┘

result = df.group_by(
    (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
    (pl.col("height") < 1.7).alias("short?"),
).agg(pl.col("name"))
print(result)
# shape: (3, 3)
# ┌────────┬────────┬─────────────────────────────────┐
# │ decade ┆ short? ┆ name                            │
# │ ---    ┆ ---    ┆ ---                             │
# │ i32    ┆ bool   ┆ list[str]                       │
# ╞════════╪════════╪═════════════════════════════════╡
# │ 1980   ┆ false  ┆ ["Ben Brown", "Daniel Donovan"… │
# │ 1990   ┆ true   ┆ ["Alice Archer"]                │
# │ 1980   ┆ true   ┆ ["Chloe Cooper"]                │
# └────────┴────────┴─────────────────────────────────┘
