Let’s say there are two data frames orders and order_products:
import polars as pl
# Load the Orders dataset into a Polars DataFrame
orders_df = pl.read_csv("orders.csv")
# Load the Order Products dataset into a Polars DataFrame
order_products_df = pl.read_csv("order_products.csv")
orders.csv:
order_id | user_id | order_number | order_dow | order_hour_of_day | days_since_prior_order |
---|---|---|---|---|---|
1 | 112108 | 4 | 4 | 10 | NaN |
2 | 79431 | 8 | 1 | 9 | 15.0 |
3 | 42756 | 6 | 6 | 17 | 21.0 |
4 | 17227 | 2 | 1 | 9 | 29.0 |
5 | 56463 | 7 | 6 | 14 | 28.0 |
order_products.csv:
order_id | product_id | add_to_cart_order | reordered |
---|---|---|---|
1 | 49302 | 1 | 1 |
1 | 11109 | 2 | 1 |
1 | 10246 | 3 | 0 |
1 | 49683 | 4 | 0 |
1 | 43633 | 5 | 1 |
Inner join:
#Inner join example
orders_products_inner_join = orders_df.inner_join(order_products_df, left_on='order_id', right_on='order_id')
order_id | user_id | order_number | order_date | product_id | quantity |
---|---|---|---|---|---|
101 | 1001 | 1 | 2022-03-15 | 2001 | 2 |
101 | 1001 | 1 | 2022-03-15 | 2002 | 3 |
102 | 1001 | 2 | 2022-03-17 | 2003 | 1 |
102 | 1001 | 2 | 2022-03-17 | 2004 | 2 |
103 | 1002 | 1 | 2022-03-18 | 2002 | 1 |
103 | 1002 | 1 | 2022-03-18 | 2003 | 3 |
103 | 1002 | 1 | 2022-03-18 | 2004 | 2 |
Left join:
# Left join the two datasets on the 'order_id' column
orders_products_left_join = orders_df.left_join(order_products_df, left_on='order_id', right_on='order_id')
Output for left join:
order_id | user_id | order_number | order_date | product_id | quantity |
---|---|---|---|---|---|
101 | 1001 | 1 | 2022-03-15 | 2001 | 2 |
101 | 1001 | 1 | 2022-03-15 | 2002 | 3 |
102 | 1001 | 2 | 2022-03-17 | 2003 | 1 |
102 | 1001 | 2 | 2022-03-17 | 2004 | 2 |
103 | 1002 | 1 | 2022-03-18 | 2002 | 1 |
103 | 1002 | 1 | 2022-03-18 | 2003 | 3 |
103 | 1002 | 1 | 2022-03-18 | 2004 | 2 |
104 | 1003 | 1 | 2022-03-20 | None | None |
Polars is a powerful DataFrame library in Rust programming language, which can handle large amounts of data and process them very efficiently. Polars supports various types of joins like inner join, left join, right join, and outer join. In this section, we will discuss some advanced techniques to perform joins in Polars.
- Joining on Multiple Columns:
Polars supports joining on multiple columns simultaneously. To join on multiple columns, we can pass a list of column names to the left_on
and right_on
arguments instead of a single column name. For example, to join two dataframes on columns col1
and col2
, we can use the following code:
import polars as pl
df1 = pl.DataFrame({'col1': [1, 2, 3, 4], 'col2': ['a', 'b', 'c', 'd']})
df2 = pl.DataFrame({'col1': [1, 2, 4, 5], 'col2': ['a', 'b', 'd', 'e']})
joined_df = df1.join(df2, left_on=['col1', 'col2'], right_on=['col1', 'col2'])
This will perform an inner join on columns col1
and col2
of both dataframes.
- Joining with Different Join Types:
By default, Polars performs an inner join when we call the join()
method. However, we can perform different types of joins by passing the how
argument. The different types of joins that Polars supports are inner
, left
, right
, and outer
.
import polars as pl
df1 = pl.DataFrame({'col1': [1, 2, 3, 4], 'col2': ['a', 'b', 'c', 'd']})
df2 = pl.DataFrame({'col1': [1, 2, 4, 5], 'col3': ['x', 'y', 'z', 'w']})
# Inner Join
joined_df = df1.join(df2, left_on='col1', right_on='col1', how='inner')
# Left Join
joined_df = df1.join(df2, left_on='col1', right_on='col1', how='left')
# Right Join
joined_df = df1.join(df2, left_on='col1', right_on='col1', how='right')
# Outer Join
joined_df = df1.join(df2, left_on='col1', right_on='col1', how='outer')