library(R.utils)
# create a 1000 x 1000 matrix of random numbers
<- matrix(rnorm(2000000), nrow = 2000, ncol = 2000) |>
df as.data.frame()
# Make and save gzipped file
write.csv(df, "df.csv")
gzip(
filename = "df.csv",
destname = "df.csv.gz",
overwrite = FALSE,
remove = TRUE
)
Introduction
Yesterday I posted on performing a benchmark on reading in a compressed .csv.gz
file of a 2,000 by 2,000 data.frame
. It was brought to my attention by someone on Mastadon (@mariviere@fediscience.org - https://fediscience.org/@mariviere) that I should also use {duckdb}
and {arrow}
so I will perform the same analysis as yesterday but I will also add in the two aforementioned packages.
Function
The functions that we will be using this time around are as follows:
fread()
- data.tableduckdb_read_csv()
- duckdbread_csv()
- readrread.csv()
- baseread.table()
- basevroom()
with altrep =FALSE
- vroomvroom()
with altrep =TRUE
- vroomoprn_csv_dataset()
- arrow
Example
Make the Data
Let’s make that dataset again:
Benchmarking
Time to benchmark
library(rbenchmark)
library(data.table)
library(readr)
library(duckdb)
library(arrow)
library(vroom)
library(dplyr)
library(DBI)
<- 30
n
benchmark(
# Base R
"read.table" = {
<- read.table(
a "df.csv.gz",
sep = ",",
colClasses = list(numeric = 1:2000)
)
},"read.csv" = {
<- read.csv(
b "df.csv.gz",
sep = ",",
colClasses = list(numeric = 1:2000)
)
},
# data.table
"fread" = {
<- fread(
c "df.csv.gz",
sep = ",",
colClasses = list(numeric = 1:2000)
)
},
# vroom
"vroom alltrep false" = {
<- vroom("df.csv.gz", delim = ",", col_types = "d")
d
},"vroom alltrep true" = {
<- vroom("df.csv.gz", delim = ",", altrep = TRUE, col_types = "d")
e
},
# readr
"readr" = {
<- read_csv("df.csv.gz", col_types = "d")
f
},
# Arrow
"arrow" = {
<- open_csv_dataset("df.csv.gz")
g
},
# DuckDB
"duckdb" = {
<- dbConnect(duckdb())
con <- duckdb_read_csv(
h conn = con,
name = "df",
files = "C:\\Users\\ssanders\\Documents\\GitHub\\steveondata\\posts\\rtip-2023-03-28\\df.csv.gz"
)dbDisconnect(con)
},
# Replications
replications = n,
# Columns
columns = c(
"test","replications","elapsed","relative","user.self","sys.self")
|>
) arrange(relative)
test replications elapsed relative user.self sys.self
1 arrow 30 3.01 1.000 5.04 0.25
2 fread 30 28.28 9.395 19.56 4.30
3 vroom alltrep false 30 31.89 10.595 26.25 10.75
4 vroom alltrep true 30 33.72 11.203 25.75 10.67
5 duckdb 30 94.09 31.259 90.70 2.77
6 readr 30 98.28 32.651 113.05 45.12
7 read.table 30 109.97 36.535 107.78 1.24
8 read.csv 30 153.79 51.093 152.44 0.56
Important note is the session info on the pc I am using to write this:
sessionInfo()
R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.1.1 vroom_1.6.1 arrow_11.0.0.3 duckdb_0.7.1-1
[5] DBI_1.1.3 readr_2.1.4 data.table_1.14.8 rbenchmark_1.0.0
[9] R.utils_2.12.2 R.oo_1.25.0 R.methodsS3_1.8.2
loaded via a namespace (and not attached):
[1] pillar_1.9.0 compiler_4.2.3 tools_4.2.3 digest_0.6.31
[5] bit_4.0.5 jsonlite_1.8.4 evaluate_0.20 lifecycle_1.0.3
[9] tibble_3.2.1 pkgconfig_2.0.3 rlang_1.1.0 cli_3.6.1
[13] rstudioapi_0.14 parallel_4.2.3 yaml_2.3.7 xfun_0.38
[17] fastmap_1.1.1 knitr_1.42 generics_0.1.3 vctrs_0.6.1
[21] htmlwidgets_1.6.2 hms_1.1.3 bit64_4.0.5 tidyselect_1.2.0
[25] glue_1.6.2 R6_2.5.1 fansi_1.0.4 rmarkdown_2.21
[29] tzdb_0.3.0 purrr_1.0.1 magrittr_2.0.3 htmltools_0.5.5
[33] assertthat_0.2.1 utf8_1.2.3 crayon_1.5.2
Sys.info() |>
as.data.frame() |>
::rownames_to_column() |>
tibbleas_tibble() |>
slice(1,2,3,5)
# A tibble: 4 × 2
rowname `Sys.info()`
<chr> <chr>
1 sysname Windows
2 release 10 x64
3 version build 19045
4 machine x86-64
memory.profile() |>
as.data.frame()
memory.profile()
NULL 1
symbol 24303
pairlist 642504
closure 11189
environment 4009
promise 22963
language 189766
special 47
builtin 701
char 2039073
logical 18866
integer 108132
double 20060
complex 5
character 160381
... 21
any 0
list 58500
expression 5
bytecode 41555
externalptr 12382
weakref 13860
raw 10113
S4 1362
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 3363479 179.7 5830931 311.5 5830931 311.5
Vcells 32950395 251.4 81254422 620.0 81254324 620.0