Combining Polars and Tidyverse for Python

Note

This site is still under construction, but full documentation can be found in the package docstrings and the API Reference.

tidypolars $^{4sci}$ provides functions that match R’s Tidyverse functions as closely as possible for manipulating data frames and conducting data analysis in Python, using the fast Polars library as its backend.

Key features

Fast: Uses Polars as the backend for data manipulation, so it inherits many of that library's advantages, such as speed, parallel execution, and GPU support.
Tidy: Keeps the data in a tidy (rectangular table) format (no multi-indexes).
Syntax: While Polars is fast, its syntax is not always the most intuitive. The package provides frontend methods that match R’s Tidyverse functions, making it easier for users familiar with that ecosystem to transition to this library.
Extended functionality: Polars is extended to facilitate data manipulation and analysis for academic research.
Research: The package is designed to facilitate academic research, data analysis, and reporting of results. It provides functions that produce tables with minimal code, in the formats commonly used in academic publications. Output formats include LaTeX, Excel, CSV, and others.

Syntax

The main motivation for tidypolars $^{4sci}$ was to provide a more readable and elegant syntax for Polars in Python, similar to R’s Tidyverse, while (1) extending Polars' functionality to facilitate data manipulation and (2) keeping the speed and efficiency in data processing that Polars provides. Here are some examples of syntax differences:

The same pipeline in tidypolars4sci, Tidyverse (R), Polars, and Pandas, in that order:

tab = (df
       .filter(tp.col("carb")<8)
       .filter(tp.col("name").str.contains("Mazda|Toyota|Merc"))
       .mutate(cyl_squared = tp.col("cyl")**2,
               cyl_group = tp.case_when(tp.col("cyl")<tp.col("cyl").mean(), "Low cyl",
                                        tp.col("cyl")>tp.col("cyl").mean(), "High cyl",
                                        True, 'Average cyl'),
               am = tp.as_factor("am")
               )
       .select("name", "am")
       .pivot_wider(values_from="name", names_from="am",
                    values_fn=tp.element().sort().str.concat("; "))
       )

tab = (df
    %>% filter(carb < 8)
    %>% filter(str_detect(name, "Mazda|Toyota|Merc"))
    %>% mutate(cyl_squared = cyl^2,
               cyl_group = case_when(cyl < mean(cyl) ~ "Low cyl",
                                     cyl > mean(cyl) ~ "High cyl",
                                     TRUE ~ "Average cyl"),
               am = as.factor(am)
               )
    %>% select(name, am)
    %>% pivot_wider(names_from = am, values_from = name,
                    values_fn = list(name = ~ paste(sort(.), collapse = "; ")))
)

tab = (df.to_polars()
       .filter(pl.col("carb") < 8)
       .filter(pl.col("name").str.contains("Mazda|Toyota|Merc"))
       .with_columns([
           (pl.col("cyl") ** 2).alias("cyl_squared"),
           (pl
            .when(pl.col("cyl") < pl.col("cyl").mean()).then(pl.lit("Low cyl"))
            .when(pl.col("cyl") > pl.col("cyl").mean()).then(pl.lit("High cyl"))
            .otherwise(pl.lit("Average cyl")).alias("cyl_group")),
           (pl.col("am").cast(pl.String).cast(pl.Categorical).alias("am"))
       ])
       .select(["name", "am"])
       .with_columns(idx=0)
       # pivot-wide
       .pivot(index='idx', on="am", values="name",
              # str.join replaces the deprecated str.concat in Polars >= 1.0
              aggregate_function=pl.element().sort().str.join("; ")
              )
       .drop('idx')
       )

tab = (df
       .query("carb < 8")
       # engine="python" lets query() evaluate the .str accessor
       .query("name.str.contains('Mazda|Toyota|Merc')", engine="python")
       .assign(cyl_squared = lambda col: col["cyl"]**2,
               cyl_group = lambda col: pd.cut(col["cyl"],
                                              bins=[-float("inf"), col["cyl"].mean(),
                                                    float("inf")],
                                              labels=["Low cyl", "High cyl"]),
               am = lambda col: col["am"].astype("str"))
       .filter(["name", "am"])
       # sort the names before joining, as the other versions do
       .pivot_table(columns="am", values="name",
                    aggfunc = lambda x: "; ".join(sorted(x))))
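
To make the cyl_group logic in the examples above concrete, the following is a minimal plain-Python sketch of a conditional ladder in which the first matching branch wins. It uses only the standard library and is not the tidypolars4sci or Polars API: the `case_when` helper defined here and the sample cylinder values are hypothetical and exist only for illustration.

```python
from statistics import mean


def case_when(value, *branches):
    """Return the result of the first branch whose predicate holds.

    Hypothetical helper for illustration only; each branch is a
    (predicate, result) pair, checked in the order given.
    """
    for predicate, result in branches:
        if predicate(value):
            return result
    return None  # no branch matched


# Made-up cylinder counts, not the real dataset.
cyl = [4, 6, 8, 6]
cyl_mean = mean(cyl)

cyl_group = [
    case_when(
        c,
        (lambda v: v < cyl_mean, "Low cyl"),
        (lambda v: v > cyl_mean, "High cyl"),
        (lambda v: True, "Average cyl"),  # default branch
    )
    for c in cyl
]
print(cyl_group)
```

A value exactly equal to the mean falls through to the default branch and is labelled "Average cyl". The Pandas version above uses two `pd.cut` bins, so it assigns such rows to "Low cyl" instead.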

Performance

In most cases, the performance of tidypolars $^{4sci}$ is comparable to that of Polars. In some instances, it may be slightly slower because of the additional functionality the module provides. See the Performance section for details.
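
One way to check such a comparison on your own data is to time each pipeline as a callable. The sketch below is a generic timing harness that uses only the standard library. The `time_pipeline` helper and the placeholder workload are hypothetical and not part of tidypolars4sci. In practice, the placeholder would be replaced by functions that run the tidypolars4sci and Polars versions of the same pipeline.

```python
import time
from statistics import median


def time_pipeline(fn, repeats=5):
    """Run fn() `repeats` times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return median(timings)


def placeholder_pipeline():
    # Stand-in workload; swap in a function that runs the pipeline to be timed.
    return sum(i * i for i in range(100_000))


elapsed = time_pipeline(placeholder_pipeline)
print(f"median time: {elapsed:.6f} s")
```

Taking the median over several runs reduces the influence of one-off delays such as a cold cache on the first call.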