Your R code is probably 100x slower than it should (Part 2)
This is the second part of the “Your R code is probably 100x slower than it should” blog post. If you have not read it, please do so first, as it provides important context for this post.
Reading feedback from the community, several concerns and opinions were raised, and we are going to put them to the test! To clarify, the idea of the previous post was to show that the effort required to implement an extremely fast version of a function in a language like Rust can be only marginally greater than implementing a much slower version in R (with libraries). It was meant to show how interoperable these two amazing languages are in 2024 and why you should consider using Rust to speed up certain operations.
In this post we will add a couple of implementations by the community that were given as feedback to the previous post. Thanks to Mickaël Canouil and Panagiotis Togias for your contributions.
Implementations
In this post we will compare a couple more implementations to see if it would have been worth the effort to try and implement the same operation using various libraries and methods instead of writing our function in Rust out of the gate.
To add some spice to the test, we added rows for an additional year (2025), for which we expect an NA value.
We will also be using a data.table instead of a data.frame.
library(data.table)

# 50 million rows; the two years are recycled across rows
mock_dataset <- data.table(
  year = c(2024L, 2025L),
  week = as.integer(runif(50000000, min = 1, max = 52))
)
A small note: we tried using setkey to make data.table faster. However, it was not trivial to benchmark the additional time that this pre-processing takes, so the rule here is: no pre-processing of the data. For data.table this was no issue, it ran really well either way, but take this into account.
Parallel Rust
The first implementation is practically the same as before. We match the cases and return i32::MIN, which R interprets as NA_integer_, for any (year, week) combination we have not covered.
rextendr::rust_function(r"{
  fn get_iso_month_rs_par(years: &[i32], weeks: &[i32]) -> Vec<i32> {
    use rayon::prelude::*;
    years.into_par_iter().zip(weeks.into_par_iter()).map(|(&year, &week)| {
      match (year, week) {
        (2024, 1..=4 ) => 1,
        (2024, 5..=8 ) => 2,
        (2024, 9..=13 ) => 3,
        (2024, 14..=17) => 4,
        (2024, 18..=22) => 5,
        (2024, 23..=26) => 6,
        (2024, 27..=30) => 7,
        (2024, 31..=35) => 8,
        (2024, 36..=39) => 9,
        (2024, 40..=44) => 10,
        (2024, 45..=48) => 11,
        (2024, 49..=52) => 12,
        _ => i32::MIN // i32::MIN is R's internal NA_integer_ sentinel
      }
    }).collect()
  }
}",
dependencies = list(rayon = "1.10"),
profile = "release"
)
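As a standalone illustration (a hypothetical function name, outside the rextendr wrapper), the same match logic can be unit-tested in plain Rust; i32::MIN round-trips to R as NA_integer_ because that is the bit pattern R uses internally:

```rust
// Standalone sketch (not the rextendr-generated function) of the same
// match logic, so the sentinel behaviour can be tested in plain Rust.
fn iso_month(year: i32, week: i32) -> i32 {
    match (year, week) {
        (2024, 1..=4) => 1,
        (2024, 5..=8) => 2,
        (2024, 9..=13) => 3,
        (2024, 14..=17) => 4,
        (2024, 18..=22) => 5,
        (2024, 23..=26) => 6,
        (2024, 27..=30) => 7,
        (2024, 31..=35) => 8,
        (2024, 36..=39) => 9,
        (2024, 40..=44) => 10,
        (2024, 45..=48) => 11,
        (2024, 49..=52) => 12,
        _ => i32::MIN, // becomes NA_integer_ on the R side
    }
}

fn main() {
    assert_eq!(iso_month(2024, 13), 3);
    assert_eq!(iso_month(2025, 13), i32::MIN); // uncovered year -> NA
}
```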
R Base + cut
Huge thanks to Mickaël Canouil for taking the time to write this implementation. It uses base R and the cut function to do the same matching. However, the function only checks the first element of years, so in our benchmark we will need a grouping clause that guarantees each group contains a single year.
get_iso_month_r_base_cut <- function(years, weeks) {
  # Assumes all values in `years` are identical, hence the grouping clause
  if (years[1] == 2024) {
    return(as.integer(
      cut(
        weeks,
        c(0, 4, 8, 13, 17, 22, 26, 30, 35, 39, 44, 48, 52),
        labels = 1:12
      )
    ))
  }
  NA_integer_ # typed NA so both branches return integers
}
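Under the hood, cut with these breaks is just interval binning: each week lands in the first right-closed interval (breaks[i], breaks[i+1]] that contains it. A sketch of that binning in Rust (our own names, not base R's API):

```rust
// Sketch of the binning `cut(weeks, breaks, labels = 1:12)` performs.
// Intervals are right-closed, (breaks[i], breaks[i+1]], as in cut's default.
const BREAKS: [i32; 13] = [0, 4, 8, 13, 17, 22, 26, 30, 35, 39, 44, 48, 52];

fn bin_week(week: i32) -> Option<i32> {
    if week <= BREAKS[0] || week > BREAKS[12] {
        return None; // outside every interval -> NA in R
    }
    // The count of breaks strictly below `week` is the 1-based month label.
    Some(BREAKS.partition_point(|&b| b < week) as i32)
}

fn main() {
    assert_eq!(bin_week(4), Some(1)); // weeks 1..=4 -> month 1
    assert_eq!(bin_week(5), Some(2)); // weeks 5..=8 -> month 2
    assert_eq!(bin_week(53), None);
}
```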
dplyr case_when
This implementation looks very similar to the one in the last blog post; however, we now use comparisons instead of %in%. %in% was our first choice for readability, but it requires additional allocations, so we stick to plain comparisons.
mock_dataset |>
  dplyr::mutate(
    isomonth = dplyr::case_when(
      year == 2024 & dplyr::between(week, 1, 4) ~ 1,
      year == 2024 & dplyr::between(week, 5, 8) ~ 2,
      year == 2024 & dplyr::between(week, 9, 13) ~ 3,
      year == 2024 & dplyr::between(week, 14, 17) ~ 4,
      year == 2024 & dplyr::between(week, 18, 22) ~ 5,
      year == 2024 & dplyr::between(week, 23, 26) ~ 6,
      year == 2024 & dplyr::between(week, 27, 30) ~ 7,
      year == 2024 & dplyr::between(week, 31, 35) ~ 8,
      year == 2024 & dplyr::between(week, 36, 39) ~ 9,
      year == 2024 & dplyr::between(week, 40, 44) ~ 10,
      year == 2024 & dplyr::between(week, 45, 48) ~ 11,
      year == 2024 & dplyr::between(week, 49, 52) ~ 12
    )
  )
data.table fcase
Thanks to Panagiotis Togias for suggesting the use of data.table’s fcase function.
mock_dataset[, isomonth := data.table::fcase(
  year == 2024 & data.table::between(week, 1, 4), 1,
  year == 2024 & data.table::between(week, 5, 8), 2,
  year == 2024 & data.table::between(week, 9, 13), 3,
  year == 2024 & data.table::between(week, 14, 17), 4,
  year == 2024 & data.table::between(week, 18, 22), 5,
  year == 2024 & data.table::between(week, 23, 26), 6,
  year == 2024 & data.table::between(week, 27, 30), 7,
  year == 2024 & data.table::between(week, 31, 35), 8,
  year == 2024 & data.table::between(week, 36, 39), 9,
  year == 2024 & data.table::between(week, 40, 44), 10,
  year == 2024 & data.table::between(week, 45, 48), 11,
  year == 2024 & data.table::between(week, 49, 52), 12
), by = c("year", "week")]
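The by = c("year", "week") clause means the fcase expression is evaluated once per distinct (year, week) pair (of which there are only around a hundred) and the result is broadcast to every row in that group. The same memoization idea, sketched in Rust with hypothetical names:

```rust
use std::collections::HashMap;

// Sketch of the `by = c("year", "week")` trick: the per-row expression `f`
// runs once per distinct (year, week) pair and the cached result is
// broadcast to every row in that group.
fn memoized_apply<F>(years: &[i32], weeks: &[i32], f: F) -> Vec<i32>
where
    F: Fn(i32, i32) -> i32,
{
    let mut cache: HashMap<(i32, i32), i32> = HashMap::new();
    years
        .iter()
        .zip(weeks)
        .map(|(&y, &w)| *cache.entry((y, w)).or_insert_with(|| f(y, w)))
        .collect()
}

fn main() {
    // `f` here is a stand-in for fcase's per-group evaluation.
    let out = memoized_apply(&[2024, 2024, 2025], &[1, 2, 1], |y, w| (y - 2000) * 100 + w);
    assert_eq!(out, vec![2401, 2402, 2501]);
}
```

With 50 million rows but only ~100 distinct pairs, the expensive expression runs about 500,000x less often, which is where the speed-up comes from.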
duckdb + SQL
Finally, we also compare duckdb, writing the query directly in SQL.
SELECT
  year,
  week,
  CASE
    WHEN year = 2024 AND week BETWEEN 1 AND 4 THEN 1
    WHEN year = 2024 AND week BETWEEN 5 AND 8 THEN 2
    WHEN year = 2024 AND week BETWEEN 9 AND 13 THEN 3
    WHEN year = 2024 AND week BETWEEN 14 AND 17 THEN 4
    WHEN year = 2024 AND week BETWEEN 18 AND 22 THEN 5
    WHEN year = 2024 AND week BETWEEN 23 AND 26 THEN 6
    WHEN year = 2024 AND week BETWEEN 27 AND 30 THEN 7
    WHEN year = 2024 AND week BETWEEN 31 AND 35 THEN 8
    WHEN year = 2024 AND week BETWEEN 36 AND 39 THEN 9
    WHEN year = 2024 AND week BETWEEN 40 AND 44 THEN 10
    WHEN year = 2024 AND week BETWEEN 45 AND 48 THEN 11
    WHEN year = 2024 AND week BETWEEN 49 AND 52 THEN 12
    ELSE NULL
  END AS isomonth
FROM duckdb_mock_dataset
Benchmarking
Now that we have our implementations let’s do some benchmarking. The exact code for the benchmark can be found at the end of this blog post (it’s a little long).
The results once again are quite amazing. As we can see, one-to-one implementations in dplyr, data.table, and duckdb differ significantly in performance, so yes, using a different dataframe library does make the code faster. We must note that the optimal fcase solution with the double by statement was the result of a couple of minutes of experimentation; it was not the first solution we thought of. The implementation suggested by Mickaël, however, shows a massive performance improvement, beating out duckdb.
Once again, our Rust implementation from the last blog post (which is not much longer than the dplyr::case_when solution) still beats every other implementation by orders of magnitude, with either data.table or dplyr as the underlying dataframe library.
For a clearer understanding of the results, here is the plot without dplyr.
Conclusion
The purpose of this blog post is not to show that R is slow; quite the opposite, it is meant to show that R can be really fast when certain operations are handed off to compiled, statically typed languages. R’s easy-to-use interfaces like rextendr or Rcpp make this accessible to anyone willing to learn them.
Benchmark Code
results <- microbenchmark::microbenchmark(
  times = 10,
  "get_iso_month_rs_par" = {
    mock_dataset |>
      dplyr::mutate(isomonth = get_iso_month_rs_par(year, week))
  },
  "get_iso_month_rs_par_with_datatable" = {
    mock_dataset[, isomonth := get_iso_month_rs_par(year, week)]
  },
  "get_iso_month_dplyr_case_when" = {
    mock_dataset |>
      dplyr::mutate(
        isomonth = dplyr::case_when(
          year == 2024 & dplyr::between(week, 1, 4) ~ 1,
          year == 2024 & dplyr::between(week, 5, 8) ~ 2,
          year == 2024 & dplyr::between(week, 9, 13) ~ 3,
          year == 2024 & dplyr::between(week, 14, 17) ~ 4,
          year == 2024 & dplyr::between(week, 18, 22) ~ 5,
          year == 2024 & dplyr::between(week, 23, 26) ~ 6,
          year == 2024 & dplyr::between(week, 27, 30) ~ 7,
          year == 2024 & dplyr::between(week, 31, 35) ~ 8,
          year == 2024 & dplyr::between(week, 36, 39) ~ 9,
          year == 2024 & dplyr::between(week, 40, 44) ~ 10,
          year == 2024 & dplyr::between(week, 45, 48) ~ 11,
          year == 2024 & dplyr::between(week, 49, 52) ~ 12
        )
      )
  },
  "get_iso_month_datatable_fcase" = {
    mock_dataset[, isomonth := data.table::fcase(
      year == 2024 & data.table::between(week, 1, 4), 1,
      year == 2024 & data.table::between(week, 5, 8), 2,
      year == 2024 & data.table::between(week, 9, 13), 3,
      year == 2024 & data.table::between(week, 14, 17), 4,
      year == 2024 & data.table::between(week, 18, 22), 5,
      year == 2024 & data.table::between(week, 23, 26), 6,
      year == 2024 & data.table::between(week, 27, 30), 7,
      year == 2024 & data.table::between(week, 31, 35), 8,
      year == 2024 & data.table::between(week, 36, 39), 9,
      year == 2024 & data.table::between(week, 40, 44), 10,
      year == 2024 & data.table::between(week, 45, 48), 11,
      year == 2024 & data.table::between(week, 49, 52), 12
    ), by = c("year", "week")]
  },
  "get_iso_month_r_base_cut" = {
    mock_dataset |>
      dplyr::group_by(year) |>
      dplyr::mutate(isomonth = get_iso_month_r_base_cut(year, week))
  },
  "get_iso_month_r_base_cut_with_datatable" = {
    mock_dataset[, isomonth := get_iso_month_r_base_cut(year, week), by = c("year", "week")]
  },
  "get_iso_month_duckdb" = {
    DBI::dbGetQuery(
      ddb_mem,
      r"(
        SELECT
          year,
          week,
          CASE
            WHEN year = 2024 AND week BETWEEN 1 AND 4 THEN 1
            WHEN year = 2024 AND week BETWEEN 5 AND 8 THEN 2
            WHEN year = 2024 AND week BETWEEN 9 AND 13 THEN 3
            WHEN year = 2024 AND week BETWEEN 14 AND 17 THEN 4
            WHEN year = 2024 AND week BETWEEN 18 AND 22 THEN 5
            WHEN year = 2024 AND week BETWEEN 23 AND 26 THEN 6
            WHEN year = 2024 AND week BETWEEN 27 AND 30 THEN 7
            WHEN year = 2024 AND week BETWEEN 31 AND 35 THEN 8
            WHEN year = 2024 AND week BETWEEN 36 AND 39 THEN 9
            WHEN year = 2024 AND week BETWEEN 40 AND 44 THEN 10
            WHEN year = 2024 AND week BETWEEN 45 AND 48 THEN 11
            WHEN year = 2024 AND week BETWEEN 49 AND 52 THEN 12
            ELSE NULL
          END AS isomonth
        FROM duckdb_mock_dataset
      )"
    )
  }
)