Jul 17, 2024 9:41:51 AM

Your R code is probably 100x slower than it should (Part 2)

This is the second part of the “Your R code probably is 100x slower than it should” blog post. If you have not read the first part, please do so before continuing, as it provides important context for this post.

The community gave us feedback on the previous post, raising some concerns and opinions, and we are going to put them to the test! Just to clarify, the idea of the previous post was to show that the effort required to implement an extremely fast version of a function in a language like Rust can be only marginally greater than implementing a much slower version in R (with libraries). It was meant to show how interoperable these two amazing languages are in 2024 and why you should consider using them together to speed up certain operations.

In this post we will add a couple of implementations that the community suggested as feedback on the previous post. Thanks to Mickaël Canouil and Panagiotis Togias for their contributions.

Implementations

In this post we will compare a few more implementations to see whether it would have been worth the effort to implement the same operation using various libraries and methods instead of writing our function in Rust from the start.

To add some spice to the test, we added a second year that the implementations do not cover. For those rows we expect an NA value.

We will also be using a data.table instead of a data.frame.

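library(data.table)

# year is recycled, so rows alternate between 2024 and the uncovered year 2025.
# as.integer() truncates, so the generated weeks range from 1 to 51.
mock_dataset <- data.table(
  year = c(2024L, 2025L),
  week = as.integer(runif(50000000, min = 1, max = 52))
)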

A small note: we tried using setkey to make data.table faster. However, it was not trivial to benchmark the additional time this pre-processing takes, so the rule for this benchmark is no pre-processing of the data. This is not an issue for data.table, which ran really well either way, but keep it in mind when reading the results.

Parallel Rust

The first implementation is practically the same as before. We match on the year and the week and return NA for any combination we have not covered.

rextendr::rust_function(r"{
  fn get_iso_month_rs_par(years: &[i32], weeks: &[i32]) -> Vec<i32> {
    use rayon::prelude::*;
    years.into_par_iter().zip(weeks.into_par_iter()).map(|(&year, &week)| {
        match (year, week) {
          (2024,     1..=4 ) =>  1,
          (2024,     5..=8 ) =>  2,
          (2024,     9..=13 ) =>  3,
          (2024,     14..=17) =>  4,
          (2024,     18..=22) =>  5,
          (2024,     23..=26) =>  6,
          (2024,     27..=30) =>  7,
          (2024,     31..=35) =>  8,
          (2024,     36..=39) =>  9,
          (2024,     40..=44) =>  10,
          (2024,     45..=48) =>  11,
          (2024,     49..=52) =>  12,
          _ => i32::MIN
        }
      }).collect()
    }
  }",
  dependencies = list(rayon = "1.10"),
  profile = "release"
)
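
As a rough illustration of the matching logic without the rayon dependency, here is a minimal sequential sketch. The names NA_INTEGER, iso_month and get_iso_month_seq are ours, chosen for illustration, and are not part of the benchmark. The sketch relies on the fact that R stores NA_integer_ as the smallest 32-bit integer, which is why returning i32::MIN shows up as NA on the R side.

```rust
// R stores NA_integer_ as the smallest 32-bit integer, so an i32::MIN
// written into an integer vector is read back as NA on the R side.
const NA_INTEGER: i32 = i32::MIN;

// Hypothetical helper (not in the original post): month for one (year, week) pair.
fn iso_month(year: i32, week: i32) -> i32 {
    match (year, week) {
        (2024, 1..=4) => 1,
        (2024, 5..=8) => 2,
        (2024, 9..=13) => 3,
        (2024, 14..=17) => 4,
        (2024, 18..=22) => 5,
        (2024, 23..=26) => 6,
        (2024, 27..=30) => 7,
        (2024, 31..=35) => 8,
        (2024, 36..=39) => 9,
        (2024, 40..=44) => 10,
        (2024, 45..=48) => 11,
        (2024, 49..=52) => 12,
        _ => NA_INTEGER, // uncovered year or week
    }
}

// Sequential counterpart of get_iso_month_rs_par (illustrative name).
fn get_iso_month_seq(years: &[i32], weeks: &[i32]) -> Vec<i32> {
    years
        .iter()
        .zip(weeks.iter())
        .map(|(&year, &week)| iso_month(year, week))
        .collect()
}
```

Calling get_iso_month_seq(&[2024, 2025], &[1, 1]) gives 1 for the first row and i32::MIN for the second, which R would display as NA. The parallel version differs only in swapping the sequential iterators for rayon's parallel ones.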

R Base + cut

Huge thanks to Mickaël Canouil for taking the time to write this implementation. It uses base R's cut function to do the same matching. However, the function only inspects the first year it receives, so it has to be applied once per group. In our benchmark that means adding a grouping clause.

get_iso_month_r_base_cut <- function(years, weeks) {
  if (years[1] == 2024) return(
    as.integer(
      cut(
        weeks,
        c(0, 4, 8, 13, 17, 22, 26, 30, 35, 39, 44, 48, 52),
        labels = 1:12
      )
    )
  )
  return(NA_integer_)  # typed NA keeps the result an integer for uncovered years
}
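
To make the role of the break points concrete, here is a minimal Rust sketch of the lookup that cut performs with its default right-closed intervals, (0, 4], (4, 8], and so on. The names UPPER_BOUNDS and iso_month_by_breaks are hypothetical, and None stands in for the NA that cut returns for values outside the breaks. This is only an illustration of the idea, not how cut is implemented internally.

```rust
// Upper bound of each right-closed interval, taken from the breaks passed to cut
// (the leading 0 is the open lower bound of the first interval).
const UPPER_BOUNDS: [i32; 12] = [4, 8, 13, 17, 22, 26, 30, 35, 39, 44, 48, 52];

// Hypothetical helper: returns the 1-based interval index for a week,
// or None when the week falls outside (0, 52].
fn iso_month_by_breaks(week: i32) -> Option<i32> {
    let last = UPPER_BOUNDS[UPPER_BOUNDS.len() - 1];
    if week <= 0 || week > last {
        return None;
    }
    // Binary search: count how many upper bounds lie strictly below `week`.
    let idx = UPPER_BOUNDS.partition_point(|&upper| upper < week);
    Some(idx as i32 + 1)
}
```

For example, iso_month_by_breaks(13) yields Some(3) and iso_month_by_breaks(14) yields Some(4), matching the 9..=13 and 14..=17 arms of the Rust match shown earlier.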

dplyr case_when

This implementation looks very similar to the one in the last blog post, but now we use comparisons instead of %in%. Using %in% was our first thought for making the code readable. However, it requires additional allocations, so we will just do comparisons.

mock_dataset |>
  dplyr::mutate(
    isomonth = dplyr::case_when(
      year == 2024 & dplyr::between(week, 1, 4) ~  1,
      year == 2024 & dplyr::between(week, 5, 8) ~  2,
      year == 2024 & dplyr::between(week, 9, 13)  ~ 3,
      year == 2024 & dplyr::between(week, 14, 17) ~ 4,
      year == 2024 & dplyr::between(week, 18, 22) ~ 5,
      year == 2024 & dplyr::between(week, 23, 26) ~ 6,
      year == 2024 & dplyr::between(week, 27, 30) ~ 7,
      year == 2024 & dplyr::between(week, 31, 35) ~ 8,
      year == 2024 & dplyr::between(week, 36, 39) ~ 9,
      year == 2024 & dplyr::between(week, 40, 44) ~ 10,
      year == 2024 & dplyr::between(week, 45, 48) ~ 11,
      year == 2024 & dplyr::between(week, 49, 52) ~ 12
    )
  )

data.table fcase

Thanks to Panagiotis Togias for suggesting data.table’s fcase function.

mock_dataset[, isomonth := data.table::fcase(
  year == 2024 & data.table::between(week, 1,  4),  1,
  year == 2024 & data.table::between(week, 5,  8),  2,
  year == 2024 & data.table::between(week, 9,  13), 3,
  year == 2024 & data.table::between(week, 14, 17), 4,
  year == 2024 & data.table::between(week, 18, 22), 5,
  year == 2024 & data.table::between(week, 23, 26), 6,
  year == 2024 & data.table::between(week, 27, 30), 7,
  year == 2024 & data.table::between(week, 31, 35), 8,
  year == 2024 & data.table::between(week, 36, 39), 9,
  year == 2024 & data.table::between(week, 40, 44), 10,
  year == 2024 & data.table::between(week, 45, 48), 11,
  year == 2024 & data.table::between(week, 49, 52), 12
), by = c("year", "week")]

duckdb + SQL

Finally we will also compare using duckdb and writing the query directly in SQL.

SELECT
  year,
  week,
  CASE
    WHEN year = 2024 AND week BETWEEN 1 AND 4   THEN  1
    WHEN year = 2024 AND week BETWEEN 5 AND 8   THEN  2
    WHEN year = 2024 AND week BETWEEN 9 AND 13  THEN  3
    WHEN year = 2024 AND week BETWEEN 14 AND 17 THEN  4
    WHEN year = 2024 AND week BETWEEN 18 AND 22 THEN  5
    WHEN year = 2024 AND week BETWEEN 23 AND 26 THEN  6
    WHEN year = 2024 AND week BETWEEN 27 AND 30 THEN  7
    WHEN year = 2024 AND week BETWEEN 31 AND 35 THEN  8
    WHEN year = 2024 AND week BETWEEN 36 AND 39 THEN  9
    WHEN year = 2024 AND week BETWEEN 40 AND 44 THEN  10
    WHEN year = 2024 AND week BETWEEN 45 AND 48 THEN  11
    WHEN year = 2024 AND week BETWEEN 49 AND 52 THEN  12
    ELSE null
  END as isomonth
FROM duckdb_mock_dataset

Benchmarking

Now that we have our implementations let’s do some benchmarking. The exact code for the benchmark can be found at the end of this blog post (it’s a little long).

[Figure: benchmark results for all implementations]

The results are once again quite amazing. As we can see, the one-to-one implementations in dplyr, data.table, and duckdb differ significantly in performance. So yes, switching dataframe libraries does make the code faster. Note that the fastest fcase solution, which groups by both year and week, took a couple of minutes of experimentation to find. It was not the first solution we thought of. The implementation suggested by Mickaël, however, shows a massive performance improvement and even beats duckdb.

Once again, our Rust implementation from the last blog post (which is not much longer than the dplyr::case_when solution) beats every other implementation by orders of magnitude, whether data.table or dplyr is the underlying dataframe library.

For a clearer understanding of the results, here is the plot without dplyr.

[Figure: benchmark results without dplyr]

Conclusion

The purpose of this blog post is not to show that R is slow. Quite the opposite: it is meant to show that R can be really fast when certain operations are handled by compiled, statically typed languages. Easy-to-use interfaces like rextendr and Rcpp make this possible for anyone willing to learn them.

Benchmark Code

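# Assumes `ddb_mem` is an open DuckDB connection and `duckdb_mock_dataset` is a
# table in it holding the same data as `mock_dataset`; that setup is not shown here.
results <- microbenchmark::microbenchmark(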
  times = 10,
  "get_iso_month_rs_par" = {
    mock_dataset |>
      dplyr::mutate(isomonth = get_iso_month_rs_par(year, week))
  },
  "get_iso_month_rs_par_with_datatable" = {
    mock_dataset[, isomonth := get_iso_month_rs_par(year, week)]
  },
  "get_iso_month_dplyr_case_when" = {
    mock_dataset |>
      dplyr::mutate(
        isomonth = dplyr::case_when(
          year == 2024 & dplyr::between(week, 1, 4) ~  1,
          year == 2024 & dplyr::between(week, 5, 8) ~  2,
          year == 2024 & dplyr::between(week, 9, 13)  ~ 3,
          year == 2024 & dplyr::between(week, 14, 17) ~ 4,
          year == 2024 & dplyr::between(week, 18, 22) ~ 5,
          year == 2024 & dplyr::between(week, 23, 26) ~ 6,
          year == 2024 & dplyr::between(week, 27, 30) ~ 7,
          year == 2024 & dplyr::between(week, 31, 35) ~ 8,
          year == 2024 & dplyr::between(week, 36, 39) ~ 9,
          year == 2024 & dplyr::between(week, 40, 44) ~ 10,
          year == 2024 & dplyr::between(week, 45, 48) ~ 11,
          year == 2024 & dplyr::between(week, 49, 52) ~ 12
        )
      )
  },
  "get_iso_month_datatable_fcase" = {
    mock_dataset[, isomonth := data.table::fcase(
      year == 2024 & data.table::between(week, 1,  4),  1,
      year == 2024 & data.table::between(week, 5,  8),  2,
      year == 2024 & data.table::between(week, 9,  13), 3,
      year == 2024 & data.table::between(week, 14, 17), 4,
      year == 2024 & data.table::between(week, 18, 22), 5,
      year == 2024 & data.table::between(week, 23, 26), 6,
      year == 2024 & data.table::between(week, 27, 30), 7,
      year == 2024 & data.table::between(week, 31, 35), 8,
      year == 2024 & data.table::between(week, 36, 39), 9,
      year == 2024 & data.table::between(week, 40, 44), 10,
      year == 2024 & data.table::between(week, 45, 48), 11,
      year == 2024 & data.table::between(week, 49, 52), 12
    ), by = c("year", "week")]
  },
  "get_iso_month_r_base_cut" = {
    mock_dataset |>
      dplyr::group_by(year) |>
      dplyr::mutate(isomonth = get_iso_month_r_base_cut(year, week))
  },
  "get_iso_month_r_base_cut_with_datatable" = {
    mock_dataset[, isomonth := get_iso_month_r_base_cut(year, week), by = c("year", "week")]
  },
  "get_iso_month_duckdb" = {
    DBI::dbGetQuery(
      ddb_mem,
      r"(
      SELECT
        year,
        week,
        CASE
          WHEN year = 2024 AND week BETWEEN 1 AND 4   THEN  1
          WHEN year = 2024 AND week BETWEEN 5 AND 8   THEN  2
          WHEN year = 2024 AND week BETWEEN 9 AND 13  THEN  3
          WHEN year = 2024 AND week BETWEEN 14 AND 17 THEN  4
          WHEN year = 2024 AND week BETWEEN 18 AND 22 THEN  5
          WHEN year = 2024 AND week BETWEEN 23 AND 26 THEN  6
          WHEN year = 2024 AND week BETWEEN 27 AND 30 THEN  7
          WHEN year = 2024 AND week BETWEEN 31 AND 35 THEN  8
          WHEN year = 2024 AND week BETWEEN 36 AND 39 THEN  9
          WHEN year = 2024 AND week BETWEEN 40 AND 44 THEN  10
          WHEN year = 2024 AND week BETWEEN 45 AND 48 THEN  11
          WHEN year = 2024 AND week BETWEEN 49 AND 52 THEN  12
          ELSE null
        END as isomonth
      FROM duckdb_mock_dataset
      )"
    )
  }
)

R, Big data, Best Practices, Data Engineering, Rust