A Beginner’s Guide to pandas & polars

python

Author: Martin Řanda

Published: May 5, 2025

This guide will help you start learning polars by showcasing analogous code snippets from pandas.

In recent years, polars¹ has become increasingly popular in the data science community (more than 33k stars on GitHub as of May 2025²). According to the author of polars, Ritchie Vink, the package’s API is “consistent and strict,” and its focus is on maximizing single-machine performance³, which perhaps explains some of the library’s appeal. From my experience, polars has been a major time saver, especially in data-intensive computations. However, I think that it is perfectly reasonable to prefer pandas for some tasks (like quick data visualization), and I am glad that this competition is pushing the field forward.

In this post, I wrote down some of the most common operations in pandas and their equivalents in polars to help you get acquainted with the package (and to help myself remember). Please note that this guide / cheat sheet may not be exhaustive, and in some cases, there might be additional ways to achieve the same goal. Feel free to let me know in the comments.

This is a runnable Quarto document, so first, let’s load the packages.

import pandas as pd
import numpy as np
import polars as pl
from datetime import date, timedelta, datetime

Load data

We’ll be working with my Wizard Shop Dataset⁴, which was specifically crafted for introductory data analysis. It consists of three tables:

  • wizard_shop_inventory.csv: A list of products with prices, item quality, and other attributes.

  • magical_items_info.csv: A small table with each item’s typical price, typical quality, and where it is typically found.

  • items_prices_timeline.csv: Average daily prices of each product category.

Let’s load the data.

As you can see below, the syntax is the same in both packages except for parsing dates.

data_url = "https://raw.githubusercontent.com/rnd195/wizard-shop-dataset/refs/heads/main/data/"

df_pd = pd.read_csv(data_url + "wizard_shop_inventory.csv")
info_pd = pd.read_csv(data_url + "magical_items_info.csv")
prices_pd = pd.read_csv(
    data_url + "items_prices_timeline.csv", 
    parse_dates=["date"]
)


df_pl = pl.read_csv(data_url + "wizard_shop_inventory.csv")
info_pl = pl.read_csv(data_url + "magical_items_info.csv")
prices_pl = pl.read_csv(
    data_url + "items_prices_timeline.csv", 
    try_parse_dates=True
)

Take a peek

Sometimes, we want to take a quick look at the data. The methods .sample(), .head(), and .tail() all work in both packages.

The df DataFrame contains all the products the wizard shopkeeper sells—items like potions, amulets, or cloaks.

df_pd.sample(3)
id item price magical_power quality in_stock found_in
374 375 staff 2068.00 768.6380 10 True dungeon
138 139 potion 54.95 162.0795 1 True village
436 437 scroll 857.70 407.6550 6 True dungeon

The info table contains information about the typical attributes of these items.

info_pd.head(3)
item typical_price typical_quality typically_found_in
0 amulet 1000 9 dungeon
1 potion 50 7 village
2 cloak 500 4 city

The prices DataFrame contains the daily average price of each item in the fantasy world’s economy in the magical year of 2025.

prices_pd.tail(3)
date amulet potion cloak staff scroll
362 2025-12-29 742.21 44.70 648.72 971.90 731.69
363 2025-12-30 802.06 48.99 446.10 1711.04 728.60
364 2025-12-31 957.94 64.08 503.88 2899.72 829.59


df_pl.sample(3)
shape: (3, 7)
id item price magical_power quality in_stock found_in
i64 str f64 f64 i64 bool str
243 "amulet" 1286.0 508.896 8 false "dungeon"
416 "cloak" 501.0 283.573 3 false null
481 "potion" 62.3 44.2 5 false null


info_pl.head(3)
shape: (3, 4)
item typical_price typical_quality typically_found_in
str i64 i64 str
"amulet" 1000 9 "dungeon"
"potion" 50 7 "village"
"cloak" 500 4 "city"


prices_pl.tail(3)
shape: (3, 6)
date amulet potion cloak staff scroll
date f64 f64 f64 f64 f64
2025-12-29 742.21 44.7 648.72 971.9 731.69
2025-12-30 802.06 48.99 446.1 1711.04 728.6
2025-12-31 957.94 64.08 503.88 2899.72 829.59

Subset a DataFrame

Columns

Select a column by name

There are several ways to select a single column in both pandas and polars. Note that some of the calls return a Series while others return a DataFrame.

df_pd["price"]        # -> returns Series of shape (500,)
df_pd[["price"]]      # -> returns DataFrame of shape (500, 1)
df_pd.price           # -> returns Series of shape (500,)
df_pd.loc[:, "price"] # -> returns Series of shape (500,)
df_pl["price"]                # -> returns Series of shape (500,)
df_pl[:, "price"]             # -> returns Series of shape (500,)
df_pl.select("price")         # -> returns DataFrame of shape (500, 1)
df_pl.select(pl.col("price")) # -> returns DataFrame of shape (500, 1)

Select multiple columns by name

Below are several alternatives for selecting columns.

df_pd[["item", "price"]]
df_pd.loc[:, ["item", "price"]]
df_pl[["item", "price"]]
df_pl[:, ["item", "price"]]
df_pl["item", "price"]
df_pl.select(["item", "price"])
df_pl.select(pl.col("item", "price"))
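
As a brief aside, polars also ships with column selectors (we’ll use pl.selectors again in the resampling section), which can match columns by dtype or by name pattern. A small sketch:

import polars.selectors as cs

df_pl.select(cs.numeric())          # all numeric columns
df_pl.select(cs.contains("price"))  # columns whose name contains "price"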

Slice columns by range

Instead of selecting columns by name, we can refer to their integer positions in the DataFrame.

df_pd.iloc[:, 5:7]
df_pl[:, 5:7]

Slice columns by name

It’s also possible to select a range of columns by name. The resulting DataFrame will contain the first and the last selected column as well as any columns in between.

df_pd.loc[:, "in_stock":"found_in"]
df_pl[:, "in_stock":"found_in"]

Filter columns using Bools

We can pass a list of True/False values to select specific columns. The length of this list needs to be the same as the number of columns in the DataFrame. For instance, df_pd/df_pl contains 7 columns. Thus, one possible list of True/False values may look like this: [True, False, True, False, True, True, True].

# Return all columns containing the substring "price"
df_pd.loc[:, ["price" in col for col in df_pd.columns]]
# Return all columns containing the substring "price"
df_pl[:, ["price" in col for col in df_pl.columns]]

Rows

Select row by index label

If the index is, for example, a datetime or str, it is possible to select rows by the index label. However, this is not applicable in polars, as polars intentionally has no row index.

prices_pd = prices_pd.set_index("date")
prices_pd.loc["2025-01-05"]
amulet     996.24
potion      49.65
cloak      497.42
staff     2643.03
scroll    1096.03
Name: 2025-01-05 00:00:00, dtype: float64
# Not applicable in polars
# Below is a call that outputs a similar result
prices_pl.filter(pl.col("date") == date(2025, 1, 5))
shape: (1, 6)
date amulet potion cloak staff scroll
date f64 f64 f64 f64 f64
2025-01-05 996.24 49.65 497.42 2643.03 1096.03

Select a single row by integer position

Both pandas and polars support selecting a single row using its integer position.

df_pd.iloc[4] # -> returns a Series
df_pd[4:5]    # -> returns a DataFrame
df_pl[4] 
df_pl[4:5]
df_pl[4, :] # -> all three of these return a DataFrame
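
polars additionally offers the .row() method for extracting a single row:

df_pl.row(4)             # -> returns a tuple
df_pl.row(4, named=True) # -> returns a dict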

Slice rows by integer range

Likewise, both pandas and polars support selecting rows using a range of integers.

df_pd.iloc[0:5]
df_pd[0:5]
df_pl[0:5]
df_pl[0:5, :]

Filter rows using Bools

We can pass a Series (or a similar object) containing True/False values to subset the DataFrame.

# Get products with price over 1000
df_pd[df_pd["price"] > 1000]
# Get products with price over 1000
df_pl.filter(df_pl["price"] > 1000)
df_pl.filter(pl.col("price") > 1000)

Creating new columns

New empty column

Sometimes, it might make sense to create a new column in a DataFrame and fill it with NA values. I think of it as “reserving” the column for values that will be filled in later. There are several ways to achieve this in both packages.

Missing values in pandas: the representation depends on the datatype. Consider using None or np.nan. Note that pd.NA is still experimental.

# Is any item in the wizard's shop cursed? We don't know => NA
df_pd["is_cursed"] = np.nan
df_pd = df_pd.assign(is_cursed=np.nan)
df_pd = df_pd.assign(**{"is_cursed": np.nan})
df_pd.loc[:, "is_cursed"] = np.nan

Missing values in polars: None, represented as null.

# Is any item in the wizard's shop cursed? We don't know => NA
df_pl = df_pl.with_columns(is_cursed=None)
df_pl = df_pl.with_columns(is_cursed=pl.lit(None))
df_pl = df_pl.with_columns(pl.lit(None).alias("is_cursed"))
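
Since is_cursed will eventually hold True/False values, it may be worth typing the empty column up front (a sketch; the dtype argument is optional):

# Create a typed empty column instead of an untyped null one
df_pl = df_pl.with_columns(pl.lit(None, dtype=pl.Boolean).alias("is_cursed"))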

Transform an existing column

Apply any function to an existing column in a DataFrame and write it as a new column.

Some options for transforming columns in pandas: .transform(), .apply(), calling a numpy function on the Series…

# Take the logarithm of the price column
df_pd["log_price"] = df_pd["price"].transform("log")
df_pd["log_price"] = df_pd["price"].apply(np.log)
df_pd["log_price"] = np.log(df_pd["price"])

Transforming columns in polars: use the .with_columns(...) method.

# Take the logarithm of the price column
df_pl = df_pl.with_columns(pl.col("price").log().alias("log_price"))
df_pl = df_pl.with_columns(log_price=pl.col("price").log())
df_pl = df_pl.with_columns(df_pl["price"].log().alias("log_price"))
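
If a transformation isn’t available as a native polars expression, an arbitrary Python function can be applied element-wise via .map_elements(), though this is typically much slower than built-in expressions. A minimal sketch:

import math

# Fallback: run a Python function on each element (slow)
df_pl = df_pl.with_columns(
    pl.col("price").map_elements(math.log, return_dtype=pl.Float64).alias("log_price")
)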

Boolean filters

Use boolean operators to filter the DataFrame based on predefined conditions.

# AND operator
df_pd[(df_pd["price"] > 500) & (df_pd["price"] < 2000)]
# OR operator
df_pd[(df_pd["price"] >= 500) | (df_pd["item"] != "scroll")]
# Inverse
df_pd[~df_pd["in_stock"]]
# True if a value matches any of the specified values, else False
df_pd[df_pd["item"].isin(["staff", "potion"])]
# Is not NA... there's also .dropna() 
df_pd[~df_pd["found_in"].isna()]
# AND operator
df_pl.filter((pl.col("price") > 500) & (pl.col("price") < 2000))
# OR operator
df_pl.filter((pl.col("price") >= 500) | (pl.col("item") != "scroll"))
# Inverse
df_pl.filter(~pl.col("in_stock"))
# True if a value matches any of the specified values, else False
df_pl.filter(pl.col("item").is_in(["staff", "potion"]))
# Is not NA... consider also .is_not_null()
df_pl.filter(~pl.col("found_in").is_null())

Replacing values in row slices

Let me explain this with an example relating to the dataset at hand. We are looking at the inventory of a particular wizard shop. In this magical universe, let’s suppose that we learn that every item with magical_power over 900 is cursed. There might be other reasons why an item is cursed, but these reasons are unknown to us.

What we can do is filter the DataFrame to display only items with magical_power over 900 and, using this filter, write True to the is_cursed column for every row satisfying the condition.

Label all items with magical_power over 900 as cursed.

# A column full of NAs should be cast as "object" first
df_pd["is_cursed"] = df_pd["is_cursed"].astype("object")

df_pd.loc[df_pd["magical_power"] > 900, "is_cursed"] = True
df_pd.head(3)
id item price magical_power quality in_stock found_in is_cursed log_price
0 1 amulet 915.0 402.961 9 True dungeon NaN 6.818924
1 2 staff 2550.0 933.978 2 False dungeon True 7.843849
2 3 potion 62.2 129.897 5 True city NaN 4.130355


df_pl = df_pl.with_columns(
    pl.when(pl.col("magical_power") > 900)
    .then(pl.lit(True))
    .otherwise(pl.col("is_cursed"))
    .alias("is_cursed")
)
df_pl.head(3)
shape: (3, 9)
id item price magical_power quality in_stock found_in is_cursed log_price
i64 str f64 f64 i64 bool str bool f64
1 "amulet" 915.0 402.961 9 true "dungeon" null 6.818924
2 "staff" 2550.0 933.978 2 false "dungeon" true 7.843849
3 "potion" 62.2 129.897 5 true "city" null 4.130355

Create a copy

df_pd_temp = df_pd.copy()

In polars, copying or cloning (the .clone() method) may not be necessary.⁵

df_pl_temp = df_pl
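
This is because polars operations return new DataFrames rather than mutating in place, so plain assignment is usually safe. A quick sanity check (a sketch):

# with_columns returns a new DataFrame; df_pl itself is untouched
df_pl_temp = df_pl_temp.with_columns(pl.lit(0).alias("zero"))
print("zero" in df_pl.columns)  # False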

Joining data

Inner join

The info_pd / info_pl table contains typical information (e.g., typical price) about each item in the fantasy universe this dataset was sourced from. Naturally, the shopkeeper has no incentive to provide this information in the original data. However, we can add the information from the info table to the main table ourselves.

In our case, we can use an “inner join” and match the items using the item column as long as

  • both tables contain this column,

  • and the column itself serves the same purpose in both tables.

If there were an additional item in the shopkeeper’s inventory for which we had no data in the info table, we might consider using an “outer join” (see the sketch at the end of this section). For a review of joins, consider reading the Wikipedia article on SQL joins.

info_pd
item typical_price typical_quality typically_found_in
0 amulet 1000 9 dungeon
1 potion 50 7 village
2 cloak 500 4 city
3 staff 2000 5 dungeon
4 scroll 900 2 dungeon
df_full_pd = pd.merge(left=df_pd, right=info_pd, how="inner", on="item")
df_full_pd.head(3)
id item price magical_power quality in_stock found_in is_cursed log_price typical_price typical_quality typically_found_in
0 1 amulet 915.0 402.961 9 True dungeon NaN 6.818924 1000 9 dungeon
1 2 staff 2550.0 933.978 2 False dungeon True 7.843849 2000 5 dungeon
2 3 potion 62.2 129.897 5 True city NaN 4.130355 50 7 village
info_pl
shape: (5, 4)
item typical_price typical_quality typically_found_in
str i64 i64 str
"amulet" 1000 9 "dungeon"
"potion" 50 7 "village"
"cloak" 500 4 "city"
"staff" 2000 5 "dungeon"
"scroll" 900 2 "dungeon"
df_full_pl = df_pl.join(other=info_pl, on="item", how="inner")
df_full_pl.head(3)
shape: (3, 12)
id item price magical_power quality in_stock found_in is_cursed log_price typical_price typical_quality typically_found_in
i64 str f64 f64 i64 bool str bool f64 i64 i64 str
1 "amulet" 915.0 402.961 9 true "dungeon" null 6.818924 1000 9 "dungeon"
2 "staff" 2550.0 933.978 2 false "dungeon" true 7.843849 2000 5 "dungeon"
3 "potion" 62.2 129.897 5 true "city" null 4.130355 50 7 "village"

Concatenate by rows

Concatenating by rows stacks two tables on top of each other like bricks.

new_price = pd.DataFrame(
    {
        "amulet": [1005.1],
        "potion": [55.32],
        "cloak": [550.06],
        "staff": [1500.15],
        "scroll": [1123.06]
    },
    index=[datetime(2026, 1, 1)]
)
pd.concat([prices_pd, new_price]).tail(2)
amulet potion cloak staff scroll
2025-12-31 957.94 64.08 503.88 2899.72 829.59
2026-01-01 1005.10 55.32 550.06 1500.15 1123.06
new_price = pl.DataFrame({
    "date": date(2026, 1, 1),
    "amulet": 1005.1,
    "potion": 55.32,
    "cloak": 550.06,
    "staff": 1500.15,
    "scroll": 1123.06
})
pl.concat([prices_pl, new_price]).tail(2)
shape: (2, 6)
date amulet potion cloak staff scroll
date f64 f64 f64 f64 f64
2025-12-31 957.94 64.08 503.88 2899.72 829.59
2026-01-01 1005.1 55.32 550.06 1500.15 1123.06

Quick plotting

DataFrames in pandas can be quickly plotted using the .plot() method. While polars also contains quick plotting capabilities, I prefer converting the DataFrame to pandas. For more complex plots, consider using seaborn, bokeh, plotly, altair, or any other data visualization library.

Display the mean quality of the shopkeeper’s items vs the typical quality of each item.

quality_pd = df_full_pd[["item", "quality", "typical_quality"]]
quality_pd.groupby("item").mean().plot(kind="bar")


quality_pl = df_full_pl.to_pandas()[["item", "quality", "typical_quality"]]
quality_pl.groupby("item").mean().plot(kind="bar")

# Similar plot in polars via Altair (not displayed)
(
    df_full_pl.group_by("item")
    .mean()
    .select("item", "quality", "typical_quality")
    .unpivot(index="item")
    .plot.bar(x="item", y="value", color="variable", column="variable")
)

Dates

Create a range of dates

The pandas package is great for time series data. It offers valuable functions like date_range() or to_datetime(). polars also has special capabilities for handling time series data, though it seems to me that it expects the user to utilize the datetime module more than pandas does.

Another difference is that polars may return an expression instead of the actual Series of dates by default. This is why you’ll see me using eager=True to get a Series instead of an expression.

Let’s take a look at some of the basic operations with dates.

Make a range of dates from a specific date (and time) to a specific date (and time).

pd.date_range(
    start="2025-01-01 00:00",
    end="2025-01-01 23:00", 
    freq="h",
    tz="Europe/Prague"
)


pl.datetime_range(
    start=datetime(2025, 1, 1, 0, 0), 
    end=datetime(2025, 1, 1, 23, 0),
    interval="1h",
    eager=True,
    time_zone="Europe/Prague"
)
# Alternative start/end: datetime.fromisoformat("2025-01-01 00:00")
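
The to_datetime() function mentioned above also has a rough polars counterpart in the .str.to_datetime() expression; a quick sketch of parsing strings into datetimes:

pd.to_datetime(pd.Series(["2025-01-01 00:00"]))
pl.Series(["2025-01-01 00:00"]).str.to_datetime()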

Create a date range with a specific number of periods. For instance, start at 2025-01-01 and continue 100 days into the future.

out_periods = 100

pd.date_range(
    start="2025-01-01", 
    periods=out_periods, 
    freq="d"
)


pl.date_range(
    start=date(2025, 1, 1),
    end=date(2025, 1, 1) + timedelta(days=out_periods - 1),
    interval="1d",
    # Returns a Series if True, otherwise returns an expression
    eager=True
)

Subset a specific time interval

Suppose we want to take a look at February data in our price table.

# Prices in February
prices_pd[prices_pd.index.month == 2].head()
amulet potion cloak staff scroll
date
2025-02-01 1200.15 45.70 441.72 1315.41 604.27
2025-02-02 828.88 55.32 526.56 1591.24 628.25
2025-02-03 774.96 47.35 515.86 1924.56 969.40
2025-02-04 842.21 53.05 522.74 2305.17 991.64
2025-02-05 1147.03 53.27 583.07 2222.09 1529.96
# Prices in February
prices_pl.filter(pl.col("date").dt.month() == 2).head()
shape: (5, 6)
date amulet potion cloak staff scroll
date f64 f64 f64 f64 f64
2025-02-01 1200.15 45.7 441.72 1315.41 604.27
2025-02-02 828.88 55.32 526.56 1591.24 628.25
2025-02-03 774.96 47.35 515.86 1924.56 969.4
2025-02-04 842.21 53.05 522.74 2305.17 991.64
2025-02-05 1147.03 53.27 583.07 2222.09 1529.96
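
Two more idiomatic alternatives for the same subset: pandas supports partial string indexing on a DatetimeIndex, and polars offers the .is_between() expression (inclusive on both ends by default). A sketch:

# pandas: all rows from February 2025
prices_pd.loc["2025-02"]

# polars: filter on an explicit date range
prices_pl.filter(
    pl.col("date").is_between(date(2025, 2, 1), date(2025, 2, 28))
)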

Resampling

In time series analysis, we often need to change the frequency of our data. For instance, if we have daily data, we may want to look at monthly averages, yearly max values, and so on.

Conversely, suppose we want to join a daily and an hourly dataset. In this case, we may need to convert the daily time series to hourly. We can do this by, for example, repeating the daily value for each hour or by using interpolation.

# Monthly mean from daily data
prices_pd.resample("ME").mean()

# Upsample daily data to hourly by repeating values
prices_pd.resample("h").ffill()

# Upsample daily data to hourly by interpolating (linearly)
prices_pd.resample("h").interpolate("linear")
# Monthly mean from daily data
prices_pl.group_by_dynamic(
    index_column="date",
    every="1mo"
).agg(pl.selectors.numeric().mean())

# Upsample daily data to hourly by repeating values
prices_pl.upsample(
    time_column="date",
    every="1h"
).select(pl.all().forward_fill())

# Upsample daily data to hourly by interpolating (linearly)
prices_pl.upsample(
    time_column="date",
    every="1h"
).interpolate()

Group by time intervals

We can also group data by specific time intervals. For instance, we can calculate the median price of each item in 3-month windows.

prices_pd.groupby(pd.Grouper(freq="3MS")).median()
amulet potion cloak staff scroll
date
2025-01-01 973.755 49.995 503.050 2043.170 966.495
2025-04-01 990.040 49.340 505.000 1931.890 870.100
2025-07-01 1051.605 50.345 499.150 2005.865 850.470
2025-10-01 991.855 49.620 509.435 1997.755 851.620

Some people (example in the official docs) format longer polars code like this:

(
    prices_pl
    .group_by_dynamic(index_column="date", every="3mo")
    .agg(pl.selectors.numeric().median())
)
shape: (4, 6)
date amulet potion cloak staff scroll
date f64 f64 f64 f64 f64
2025-01-01 973.755 49.995 503.05 2043.17 966.495
2025-04-01 990.04 49.34 505.0 1931.89 870.1
2025-07-01 1051.605 50.345 499.15 2005.865 850.47
2025-10-01 991.855 49.62 509.435 1997.755 851.62

Saving data

Both packages support a large number of output formats and the syntax is similar.

df_pd.to_csv("file.csv")
df_pl.write_csv("file.csv")
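
For instance, both packages can also write Parquet files (in pandas, this relies on pyarrow or fastparquet being installed):

df_pd.to_parquet("file.parquet")
df_pl.write_parquet("file.parquet")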

Sources and further reading

This guide is mainly based on information from the official pandas and polars documentation.

It’s also loosely inspired by Keith Galli’s incredible pandas tutorial on YouTube.

Let me also mention a neat cheat sheet on going from pandas to polars on Rho Signal.

Finally, for more advanced and extensive coverage of polars x pandas, I would highly recommend “Modern Polars”.



Footnotes

  1. For stylistic purposes, let me refer to the Polars package as polars.

  2. 33.4k stars in early May 2025: https://github.com/pola-rs/polars.

  3. JetBrains (2023). “What is Polars?” on YouTube. https://www.youtube.com/watch?v=QfLzEp-yt_U.

  4. See https://github.com/rnd195/wizard-shop-dataset.

  5. See https://stackoverflow.com/a/78136839.

Reuse

Content of this blogpost is licensed under Creative Commons Attribution CC BY 4.0. Any content used from other sources is indicated and is not covered by this license.