This guide will help you start learning polars by showcasing analogous code snippets from pandas.

In recent years, polars[^1] has become increasingly popular in the data science community (more than 33k stars on GitHub as of May 2025[^2]). According to the author of polars, Ritchie Vink, the package’s API is “consistent and strict,” and its focus is on maximizing single machine performance,[^3] which perhaps explains some of the library’s appeal. From my experience, polars has been a major time saver, especially in data-intensive computations. However, I think that it is perfectly reasonable to prefer pandas for some tasks (like quick data visualization), and I am glad that this competition is pushing the field forward.

In this post, I wrote down some of the most common operations in pandas and their equivalents in polars to help you get acquainted with the package (and to help myself remember). Please note that this guide / cheat sheet may not be exhaustive, and in some cases there might be additional ways to achieve the same goal. Feel free to let me know in the comments.

This is a runnable Quarto document, so first, let’s load the packages.

```python
import pandas as pd
import numpy as np
import polars as pl
from datetime import date, timedelta, datetime
```
Load data
We’ll be working with my Wizard Shop Dataset, which was specifically crafted for introductory data analysis. It consists of three tables:

- `wizard_shop_inventory.csv`: A list of products with prices, item quality, and other attributes.
- `magical_items_info.csv`: A small table with typical price, quality, and where the item is typically found.
- `items_prices_timeline.csv`: Average daily prices of each product category.
Let’s load the data.
As we can see, the syntax is the same in both packages except for parsing dates.
= "https://raw.githubusercontent.com/rnd195/wizard-shop-dataset/refs/heads/main/data/"
data_url
= pd.read_csv(data_url + "wizard_shop_inventory.csv")
df_pd = pd.read_csv(data_url + "magical_items_info.csv")
info_pd = pd.read_csv(
prices_pd + "items_prices_timeline.csv",
data_url =["date"]
parse_dates )
And the same in polars:
= "https://raw.githubusercontent.com/rnd195/wizard-shop-dataset/refs/heads/main/data/"
data_url
= pl.read_csv(data_url + "wizard_shop_inventory.csv")
df_pl = pl.read_csv(data_url + "magical_items_info.csv")
info_pl = pl.read_csv(
prices_pl + "items_prices_timeline.csv",
data_url =True
try_parse_dates )
Take a peek
Sometimes, we want to take a quick look at the data. The methods `.sample()`, `.head()`, and `.tail()` all work in both packages.
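Beyond peeking at rows, both packages also ship a quick statistical summary via `.describe()`; a minimal sketch:

```python
# Count, mean, std, quantiles, ... for each column
df_pd.describe()
df_pl.describe()
```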
The `df` DataFrame contains all the products the wizard shopkeeper sells: items like potions, amulets, or cloaks.
```python
df_pd.sample(3)
```
|     | id  | item   | price   | magical_power | quality | in_stock | found_in |
| --- | --- | ---    | ---     | ---           | ---     | ---      | ---      |
| 374 | 375 | staff  | 2068.00 | 768.6380      | 10      | True     | dungeon  |
| 138 | 139 | potion | 54.95   | 162.0795      | 1       | True     | village  |
| 436 | 437 | scroll | 857.70  | 407.6550      | 6       | True     | dungeon  |
The `info` table contains information about the typical attributes of these items.

```python
info_pd.head(3)
```
|   | item   | typical_price | typical_quality | typically_found_in |
|---|--------|---------------|-----------------|--------------------|
| 0 | amulet | 1000          | 9               | dungeon            |
| 1 | potion | 50            | 7               | village            |
| 2 | cloak  | 500           | 4               | city               |
The `prices` DataFrame contains the daily average price of each item in the fantasy world’s economy in the magical year of 2025.

```python
prices_pd.tail(3)
```
|     | date       | amulet | potion | cloak  | staff   | scroll |
| --- | ---        | ---    | ---    | ---    | ---     | ---    |
| 362 | 2025-12-29 | 742.21 | 44.70  | 648.72 | 971.90  | 731.69 |
| 363 | 2025-12-30 | 802.06 | 48.99  | 446.10 | 1711.04 | 728.60 |
| 364 | 2025-12-31 | 957.94 | 64.08  | 503.88 | 2899.72 | 829.59 |
In polars, the same `df` DataFrame prints with a dtype row under the header:

```python
df_pl.sample(3)
```
id | item | price | magical_power | quality | in_stock | found_in |
---|---|---|---|---|---|---|
i64 | str | f64 | f64 | i64 | bool | str |
243 | "amulet" | 1286.0 | 508.896 | 8 | false | "dungeon" |
416 | "cloak" | 501.0 | 283.573 | 3 | false | null |
481 | "potion" | 62.3 | 44.2 | 5 | false | null |
The `info` table in polars:

```python
info_pl.head(3)
```
item | typical_price | typical_quality | typically_found_in |
---|---|---|---|
str | i64 | i64 | str |
"amulet" | 1000 | 9 | "dungeon" |
"potion" | 50 | 7 | "village" |
"cloak" | 500 | 4 | "city" |
And the `prices` DataFrame in polars:

```python
prices_pl.tail(3)
```
date | amulet | potion | cloak | staff | scroll |
---|---|---|---|---|---|
date | f64 | f64 | f64 | f64 | f64 |
2025-12-29 | 742.21 | 44.7 | 648.72 | 971.9 | 731.69 |
2025-12-30 | 802.06 | 48.99 | 446.1 | 1711.04 | 728.6 |
2025-12-31 | 957.94 | 64.08 | 503.88 | 2899.72 | 829.59 |
Subset a DataFrame
Columns
Select a column by name
There are several ways to select a single column in both pandas and polars. Note that some of the calls return a Series while others return a DataFrame.
"price"] # -> returns Series of shape (500,)
df_pd["price"]] # -> returns DataFrame of shape (500, 1)
df_pd[[# -> returns Series of shape (500,)
df_pd.price "price"] # -> returns Series of shape (500,) df_pd.loc[:,
"price"] # -> returns Series of shape (500,)
df_pl["price"] # -> returns Series of shape (500,)
df_pl[:, "price") # -> returns DataFrame of shape (500, 1)
df_pl.select("price")) # -> returns DataFrame of shape (500, 1) df_pl.select(pl.col(
Select multiple columns by name
Below are several alternatives for selecting columns.
"item", "price"]]
df_pd[["item", "price"]] df_pd.loc[:, [
"item", "price"]]
df_pl[["item", "price"]]
df_pl[:, ["item", "price"]
df_pl["item", "price"])
df_pl.select(["item", "price")) df_pl.select(pl.col(
Slice columns by range
Instead of selecting columns by name, we can write their positions in the DataFrame.
```python
df_pd.iloc[:, 5:7]
```
```python
df_pl[:, 5:7]
```
Slice columns by name
It’s also possible to select a range of columns by name. The resulting DataFrame will contain the first and the last selected column as well as any columns in-between.
"in_stock":"found_in"] df_pd.loc[:,
"in_stock":"found_in"] df_pl[:,
Filter columns using Bools
We can pass a list of True/False values to select specific columns. The length of this list needs to be the same as the number of columns in the DataFrame. For instance, `df_pd`/`df_pl` contains 7 columns. Thus, one possible list of True/False values may look like this: `[True, False, True, False, True, True, True]`.
```python
# Return all columns containing the substring "price"
df_pd.loc[:, ["price" in col for col in df_pd.columns]]
```
```python
# Return all columns containing the substring "price"
df_pl[:, ["price" in col for col in df_pl.columns]]
```
Rows
Select row by index label
In case the index is, for example, `datetime` or `str`, it is possible to select rows by the index label. However, this is not applicable in polars since polars treats the index differently than pandas.
```python
prices_pd = prices_pd.set_index("date")
prices_pd.loc["2025-01-05"]
```
```
amulet      996.24
potion       49.65
cloak       497.42
staff      2643.03
scroll     1096.03
Name: 2025-01-05 00:00:00, dtype: float64
```
```python
# Not applicable in polars
# Below is a call that outputs a similar result
prices_pl.filter(pl.col("date") == date(2025, 1, 5))
```
date | amulet | potion | cloak | staff | scroll |
---|---|---|---|---|---|
date | f64 | f64 | f64 | f64 | f64 |
2025-01-05 | 996.24 | 49.65 | 497.42 | 2643.03 | 1096.03 |
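If you do want something index-like in polars, you can materialize the row position as an ordinary column. A minimal sketch using `.with_row_index()` (available in recent polars versions; the `row_nr` name is arbitrary):

```python
# Add a 0-based row position column and filter on it
prices_pl.with_row_index("row_nr").filter(pl.col("row_nr") == 4)
```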
Select a single row by integer position
Both pandas and polars support selecting a single row using its integer position.
```python
df_pd.iloc[4]  # -> returns a Series
df_pd[4:5]     # -> returns a DataFrame
```
```python
df_pl[4]
df_pl[4:5]
df_pl[4, :]  # -> all three of these return a DataFrame
```
Slice rows by integer range
Likewise, both pandas and polars support selecting rows using a range of integers.
```python
df_pd[0:5]
df_pd.iloc[0:5]
```
```python
df_pl[0:5]
df_pl[0:5, :]
```
Filter rows using Bools
We can pass a Series (or a similar object) containing True/False values to subset the DataFrame.
```python
# Get products with price over 1000
df_pd[df_pd["price"] > 1000]
```
```python
# Get products with price over 1000
df_pl.filter(df_pl["price"] > 1000)
df_pl.filter(pl.col("price") > 1000)
```
Creating new columns
New empty column
Sometimes, it might make sense to create a new column in a DataFrame and fill it with NA values. I think of it as “reserving” the column for values that will be put into the column later. There are several ways to achieve this in both packages.
Missing values in pandas: depends on the datatype. Consider using `None` or `np.nan`. Note that `pd.NA` is still experimental.
```python
# Is any item in the wizard's shop cursed? We don't know => NA
df_pd["is_cursed"] = np.nan
df_pd = df_pd.assign(is_cursed=np.nan)
df_pd = df_pd.assign(**{"is_cursed": np.nan})
df_pd.loc[:, "is_cursed"] = np.nan
```
Missing values in polars: `None`, represented as `null`.
```python
# Is any item in the wizard's shop cursed? We don't know => NA
df_pl = df_pl.with_columns(is_cursed=None)
df_pl = df_pl.with_columns(is_cursed=pl.lit(None))
df_pl = df_pl.with_columns(pl.lit(None).alias("is_cursed"))
```
Transform an existing column
Apply any function to an existing column in a DataFrame and write it as a new column.
Some options for transforming columns in pandas: `.transform()`, `.apply()`, calling a numpy function on the Series…
```python
# Take the logarithm of the price column
df_pd["log_price"] = df_pd["price"].transform("log")
df_pd["log_price"] = df_pd["price"].apply(np.log)
df_pd["log_price"] = np.log(df_pd["price"])
```
Transforming columns in polars: use the `.with_columns(...)` method.
```python
# Take the logarithm of the price column
df_pl = df_pl.with_columns(pl.col("price").log().alias("log_price"))
df_pl = df_pl.with_columns(log_price=pl.col("price").log())
df_pl = df_pl.with_columns(df_pl["price"].log().alias("log_price"))
```
Boolean filters
Use Boolean operators to filter the DataFrame based on predefined conditions.
```python
# AND operator
df_pd[(df_pd["price"] > 500) & (df_pd["price"] < 2000)]
# OR operator
df_pd[(df_pd["price"] >= 500) | (df_pd["item"] != "scroll")]
# Inverse
df_pd[~df_pd["in_stock"]]
# True if a value matches any of the specified values, else False
df_pd[df_pd["item"].isin(["staff", "potion"])]
# Is not NA... there's also .dropna()
df_pd[~df_pd["found_in"].isna()]
```
```python
# AND operator
df_pl.filter((pl.col("price") > 500) & (pl.col("price") < 2000))
# OR operator
df_pl.filter((pl.col("price") >= 500) | (pl.col("item") != "scroll"))
# Inverse
df_pl.filter(~pl.col("in_stock"))
# True if a value matches any of the specified values, else False
df_pl.filter(pl.col("item").is_in(["staff", "potion"]))
# Is not NA... consider also .is_not_null()
df_pl.filter(~pl.col("found_in").is_null())
```
Replacing values in row slices
Let me explain this with an example relating to the dataset at hand. We are looking at the inventory of a particular wizard shop. In this magical universe, let’s suppose that we learn that every item with `magical_power` over 900 is cursed. There might be other reasons why an item is cursed, but these reasons are unknown to us.

What we can do is filter the DataFrame to display only items with `magical_power` over 900 and, using this filter, write `True` to the `is_cursed` column for every row satisfying this condition.
Label all items with `magical_power` over 900 as cursed.
```python
# A column full of NAs should be cast as "object" first
df_pd["is_cursed"] = df_pd["is_cursed"].astype("object")

df_pd.loc[df_pd["magical_power"] > 900, "is_cursed"] = True
df_pd.head(3)
```
|   | id | item   | price  | magical_power | quality | in_stock | found_in | is_cursed | log_price |
|---|----|--------|--------|---------------|---------|----------|----------|-----------|-----------|
| 0 | 1  | amulet | 915.0  | 402.961       | 9       | True     | dungeon  | NaN       | 6.818924  |
| 1 | 2  | staff  | 2550.0 | 933.978       | 2       | False    | dungeon  | True      | 7.843849  |
| 2 | 3  | potion | 62.2   | 129.897       | 5       | True     | city     | NaN       | 4.130355  |
In polars, the same labeling is done with a when/then/otherwise expression:
```python
df_pl = df_pl.with_columns(
    pl.when(pl.col("magical_power") > 900)
    .then(pl.lit(True))
    .otherwise(pl.col("is_cursed"))
    .alias("is_cursed")
)
df_pl.head(3)
```
id | item | price | magical_power | quality | in_stock | found_in | is_cursed | log_price |
---|---|---|---|---|---|---|---|---|
i64 | str | f64 | f64 | i64 | bool | str | bool | f64 |
1 | "amulet" | 915.0 | 402.961 | 9 | true | "dungeon" | null | 6.818924 |
2 | "staff" | 2550.0 | 933.978 | 2 | false | "dungeon" | true | 7.843849 |
3 | "potion" | 62.2 | 129.897 | 5 | true | "city" | null | 4.130355 |
Create a copy
```python
df_pd_temp = df_pd.copy()
```
In polars, copying or cloning (the `.clone()` method) may not be necessary.
```python
df_pl_temp = df_pl
```
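A plain rebinding tends to be enough because polars operations return new DataFrames rather than mutating in place. A small sketch of the behavior (the `zero` column is just a made-up example):

```python
df_pl_temp = df_pl
# Adding a column returns a new DataFrame; the original stays untouched
df_pl_temp = df_pl_temp.with_columns(pl.lit(0).alias("zero"))
"zero" in df_pl.columns  # False: df_pl is unchanged
```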
Joining data
Inner join
The `info_pd`/`info_pl` table contains typical information (e.g., typical price) about each item in the fantasy universe this dataset was sourced from. Naturally, the shopkeeper has no incentive to provide this information in the original data. However, we can add the information from the info table to the main table ourselves.

In our case, we can use an “inner join” and match the items using the `item` column as long as

- both tables contain this column,
- and the column itself serves the same purpose in both tables.

If there were an additional item in the shopkeeper’s inventory for which we didn’t have data in the `info` table, we might consider using an “outer join” (see the sketch at the end of this section). For a review of joins, consider reading the Wikipedia article on joins.
```python
info_pd
```
|   | item   | typical_price | typical_quality | typically_found_in |
|---|--------|---------------|-----------------|--------------------|
| 0 | amulet | 1000          | 9               | dungeon            |
| 1 | potion | 50            | 7               | village            |
| 2 | cloak  | 500           | 4               | city               |
| 3 | staff  | 2000          | 5               | dungeon            |
| 4 | scroll | 900           | 2               | dungeon            |
```python
df_full_pd = pd.merge(left=df_pd, right=info_pd, how="inner", on="item")
df_full_pd.head(3)
```
|   | id | item   | price  | magical_power | quality | in_stock | found_in | is_cursed | log_price | typical_price | typical_quality | typically_found_in |
|---|----|--------|--------|---------------|---------|----------|----------|-----------|-----------|---------------|-----------------|--------------------|
| 0 | 1  | amulet | 915.0  | 402.961       | 9       | True     | dungeon  | NaN       | 6.818924  | 1000          | 9               | dungeon            |
| 1 | 2  | staff  | 2550.0 | 933.978       | 2       | False    | dungeon  | True      | 7.843849  | 2000          | 5               | dungeon            |
| 2 | 3  | potion | 62.2   | 129.897       | 5       | True     | city     | NaN       | 4.130355  | 50            | 7               | village            |
```python
info_pl
```
item | typical_price | typical_quality | typically_found_in |
---|---|---|---|
str | i64 | i64 | str |
"amulet" | 1000 | 9 | "dungeon" |
"potion" | 50 | 7 | "village" |
"cloak" | 500 | 4 | "city" |
"staff" | 2000 | 5 | "dungeon" |
"scroll" | 900 | 2 | "dungeon" |
```python
df_full_pl = df_pl.join(other=info_pl, on="item", how="inner")
df_full_pl.head(3)
```
id | item | price | magical_power | quality | in_stock | found_in | is_cursed | log_price | typical_price | typical_quality | typically_found_in |
---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | f64 | f64 | i64 | bool | str | bool | f64 | i64 | i64 | str |
1 | "amulet" | 915.0 | 402.961 | 9 | true | "dungeon" | null | 6.818924 | 1000 | 9 | "dungeon" |
2 | "staff" | 2550.0 | 933.978 | 2 | false | "dungeon" | true | 7.843849 | 2000 | 5 | "dungeon" |
3 | "potion" | 62.2 | 129.897 | 5 | true | "city" | null | 4.130355 | 50 | 7 | "village" |
Concatenate by rows
Concatenating by rows stacks two tables on top of each other like bricks.
```python
new_price = pd.DataFrame(
    {
        "amulet": [1005.1],
        "potion": [55.32],
        "cloak": [550.06],
        "staff": [1500.15],
        "scroll": [1123.06]
    },
    index=[datetime(2026, 1, 1)]
)
pd.concat([prices_pd, new_price]).tail(2)
```
|            | amulet  | potion | cloak  | staff   | scroll  |
|------------|---------|--------|--------|---------|---------|
| 2025-12-31 | 957.94  | 64.08  | 503.88 | 2899.72 | 829.59  |
| 2026-01-01 | 1005.10 | 55.32  | 550.06 | 1500.15 | 1123.06 |
```python
new_price = pl.DataFrame({
    "date": date(2026, 1, 1),
    "amulet": 1005.1,
    "potion": 55.32,
    "cloak": 550.06,
    "staff": 1500.15,
    "scroll": 1123.06
})
pl.concat([prices_pl, new_price]).tail(2)
```
date | amulet | potion | cloak | staff | scroll |
---|---|---|---|---|---|
date | f64 | f64 | f64 | f64 | f64 |
2025-12-31 | 957.94 | 64.08 | 503.88 | 2899.72 | 829.59 |
2026-01-01 | 1005.1 | 55.32 | 550.06 | 1500.15 | 1123.06 |
Quick plotting
DataFrames in pandas
can be quickly plotted using the .plot()
method. While polars
also contains quick plotting capabilities, I prefer converting the DataFrame to pandas
. For more complex plots, consider using seaborn
, bokeh
, plotly
, altair
, or any other data visualization library.
Display the mean quality of the shopkeeper’s items vs the typical quality of each item.
```python
quality_pd = df_full_pd[["item", "quality", "typical_quality"]]
quality_pd.groupby("item").mean().plot(kind="bar")
```
The same plot from the polars DataFrame, converting to pandas first:
```python
quality_pl = df_full_pl.to_pandas()[["item", "quality", "typical_quality"]]
quality_pl.groupby("item").mean().plot(kind="bar")
```
```python
# Similar plot in polars via Altair (not displayed)
(
    df_full_pl.group_by("item")
    .mean()
    .select("item", "quality", "typical_quality")
    .unpivot(index="item")
    .plot.bar(x="item", y="value", color="variable", column="variable")
)
```
Dates
Create a range of dates
The pandas package is great for time series data. It offers valuable functions like `date_range()` or `to_datetime()`. polars also has special capabilities for handling time series data, though it seems to me that it expects the user to utilize the `datetime` module more than pandas does.

Another difference is that polars may return an expression instead of the actual Series of dates by default. This is why you’ll see me using `eager=True` to get a Series instead of an expression.
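To make that concrete, here is a minimal sketch: with the default `eager=False`, `pl.date_range()` returns an expression that only materializes inside a context such as `pl.select()`.

```python
# An expression, not data: nothing is computed yet
expr = pl.date_range(date(2025, 1, 1), date(2025, 1, 10), interval="1d")
# Evaluating the expression yields a one-column DataFrame of dates
pl.select(expr)
```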
Let’s take a look at some of the basic operations with dates.
Make a range of dates from a specific date (and time) to a specific date (and time).
```python
pd.date_range(
    start="2025-01-01 00:00",
    end="2025-01-01 23:00",
    freq="h",
    tz="Europe/Prague"
)
```
The same range in polars:
```python
pl.datetime_range(
    start=datetime(2025, 1, 1, 0, 0),
    end=datetime(2025, 1, 1, 23, 0),
    interval="1h",
    eager=True,
    time_zone="Europe/Prague"
)
# Alternative start/end: datetime.fromisoformat("2025-01-01 00:00")
```
Create a date range with a specific number of periods. For instance, start at `2025-01-01` and continue 100 days into the future.
```python
out_periods = 100

pd.date_range(
    start="2025-01-01",
    periods=out_periods,
    freq="d"
)
```
polars’ `date_range()` has no `periods` argument, so we compute the end date from the number of periods instead:
```python
out_periods = 100

pl.date_range(
    start=date(2025, 1, 1),
    end=date(2025, 1, 1) + timedelta(days=out_periods - 1),
    interval="1d",
    # Returns a Series if True, otherwise returns an expression
    eager=True
)
```
Subset a specific time interval
Suppose we want to take a look at February data in our price table.
```python
# Prices in February
prices_pd[prices_pd.index.month == 2].head()
```
| date       | amulet  | potion | cloak  | staff   | scroll  |
|------------|---------|--------|--------|---------|---------|
| 2025-02-01 | 1200.15 | 45.70  | 441.72 | 1315.41 | 604.27  |
| 2025-02-02 | 828.88  | 55.32  | 526.56 | 1591.24 | 628.25  |
| 2025-02-03 | 774.96  | 47.35  | 515.86 | 1924.56 | 969.40  |
| 2025-02-04 | 842.21  | 53.05  | 522.74 | 2305.17 | 991.64  |
| 2025-02-05 | 1147.03 | 53.27  | 583.07 | 2222.09 | 1529.96 |
```python
# Prices in February
prices_pl.filter(pl.col("date").dt.month() == 2).head()
```
date | amulet | potion | cloak | staff | scroll |
---|---|---|---|---|---|
date | f64 | f64 | f64 | f64 | f64 |
2025-02-01 | 1200.15 | 45.7 | 441.72 | 1315.41 | 604.27 |
2025-02-02 | 828.88 | 55.32 | 526.56 | 1591.24 | 628.25 |
2025-02-03 | 774.96 | 47.35 | 515.86 | 1924.56 | 969.4 |
2025-02-04 | 842.21 | 53.05 | 522.74 | 2305.17 | 991.64 |
2025-02-05 | 1147.03 | 53.27 | 583.07 | 2222.09 | 1529.96 |
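Two more ways to express the same subset; a hedged sketch using pandas’ partial string indexing on the `DatetimeIndex` and polars’ `.is_between()`:

```python
# pandas: partial string indexing selects all of February 2025
prices_pd.loc["2025-02"]

# polars: an explicit date window (inclusive on both ends by default)
prices_pl.filter(
    pl.col("date").is_between(date(2025, 2, 1), date(2025, 2, 28))
)
```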
Resampling
In time series analysis, we often need to change the frequency of our data. For instance, if we have daily data, we may want to look at monthly averages, yearly max values, and so on.
Conversely, suppose we want to join a daily and an hourly dataset. In this case, we may need to convert the daily time series to hourly. We can do this by, for example, repeating the daily value for each hour or by using interpolation.
```python
# Monthly mean from daily data
prices_pd.resample("ME").mean()

# Upsample daily data to hourly by repeating values
prices_pd.resample("h").ffill()

# Upsample daily data to hourly by interpolating (linearly)
prices_pd.resample("h").interpolate("linear")
```
```python
# Monthly mean from daily data
prices_pl.group_by_dynamic(
    index_column="date",
    every="1mo"
).agg(pl.selectors.numeric().mean())

# Upsample daily data to hourly by repeating values
prices_pl.upsample(
    time_column="date",
    every="1h"
).select(pl.all().forward_fill())

# Upsample daily data to hourly by interpolating (linearly)
prices_pl.upsample(
    time_column="date",
    every="1h"
).interpolate()
```
Group by time intervals
We can also group data by specific time intervals. For instance, we can calculate the median price of each item in 3-month windows.
="3MS")).median() prices_pd.groupby(pd.Grouper(freq
| date       | amulet   | potion | cloak   | staff    | scroll  |
|------------|----------|--------|---------|----------|---------|
| 2025-01-01 | 973.755  | 49.995 | 503.050 | 2043.170 | 966.495 |
| 2025-04-01 | 990.040  | 49.340 | 505.000 | 1931.890 | 870.100 |
| 2025-07-01 | 1051.605 | 50.345 | 499.150 | 2005.865 | 850.470 |
| 2025-10-01 | 991.855  | 49.620 | 509.435 | 1997.755 | 851.620 |
Some people (example in the official docs) format longer polars code like this:
```python
(
    prices_pl
    .group_by_dynamic(index_column="date", every="3mo")
    .agg(pl.selectors.numeric().median())
)
```
date | amulet | potion | cloak | staff | scroll |
---|---|---|---|---|---|
date | f64 | f64 | f64 | f64 | f64 |
2025-01-01 | 973.755 | 49.995 | 503.05 | 2043.17 | 966.495 |
2025-04-01 | 990.04 | 49.34 | 505.0 | 1931.89 | 870.1 |
2025-07-01 | 1051.605 | 50.345 | 499.15 | 2005.865 | 850.47 |
2025-10-01 | 991.855 | 49.62 | 509.435 | 1997.755 | 851.62 |
Saving data
Both packages support a large number of output formats and the syntax is similar.
"file.csv") df_pd.to_csv(
"file.csv") df_pl.write_csv(
Sources and further reading

- This guide is mainly based on information from the official pandas and polars documentation.
- It’s also loosely inspired by Keith Galli’s incredible pandas tutorial on YouTube.
- Let me also mention a neat cheat sheet on going from pandas to polars on Rho Signal.
- Finally, for more advanced and extensive coverage of polars x pandas, I would highly recommend “Modern Polars”.
Footnotes

[^1]: For stylistic purposes, let me refer to the Polars package as polars.
[^2]: 33.4k stars in early May 2025, https://github.com/pola-rs/polars.
[^3]: JetBrains (2023). “What is Polars?” on YouTube. https://www.youtube.com/watch?v=QfLzEp-yt_U.