ClickHouse © is a high-performance relational column-store database to enable big data exploration and 'analytics' scaling to petabytes of data. Methods are provided that enable working with 'Yandex Clickhouse' databases via 'DBI' methods and using 'dplyr'/'dbplyr' idioms.
This R package is a DBI interface for the Yandex Clickhouse database. It provides basic dplyr support by auto-generating SQL-commands using dbplyr and is based on the official C++ Clickhouse Client.
To cite this library, please use the BibTeX entry provided in inst/CITATION.
ClickHouse can be reached over several protocols, and different R packages target different ones. Pick the package that matches the interface you want to use:
- Native TCP interface — this package,
RClickhouse. It talks to ClickHouse over the native binary protocol (default port9000) via the official C++ ClickHouse client. This is not the HTTP interface, so connect to the native port (9000) rather than the HTTP port (8123). - HTTP(S) interface — use
ClickHouseHTTP(CRAN), a DBI backend built on the ClickHouse HTTP interface (default port8123) with HTTPS/SSL support. - Arrow Flight SQL interface (experimental) — an experimental rewrite of
RClickhouseon top of Arrow Flight SQL is being developed on thev2-flight-sqlbranch. Both the rewrite itself and the Arrow Flight SQL interface are experimental, and Arrow Flight is not exposed by default on ClickHouse Cloud. This branch also provides an HTTP fallback, but it is only a fallback — if you want to use the HTTP interface, preferClickHouseHTTP(see above).
This package is available on CRAN, and thus installable by running:
install.packages("RClickhouse")You can also install the latest development version directly from github using devtools:
devtools::install_github("IMSMWU/RClickhouse")Note: please be aware that {RClickhouse} doesn't use a HTTP interface in order to communicate with Clickhouse. Thus, You may use the native interface port (by default 9000) instead of the HTTP interface (8123).
con <- DBI::dbConnect(RClickhouse::clickhouse(), host="example-db.com")DBI::dbWriteTable(con, "mtcars", mtcars)
dbListTables(con)
dbListFields(con, "mtcars") Query a database using dplyr:
library(dplyr)
tbl(con, "mtcars") %>%
group_by(cyl) %>%
summarise(smpg=sum(mpg))
tbl(con, "mtcars") %>%
filter(cyl == 8, vs == 0) %>%
group_by(am) %>%
summarise(mean(qsec))
# Close the connection
dbDisconnect(con)Query a database using SQL-style commands with DBI::dbGetQuery:
DBI::dbGetQuery(con, "SELECT
vs
,COUNT(*) AS 'number of cases'
,AVG(qsec) AS 'average qsec'
FROM mtcars
GROUP BY vs")
# Save results of querying:
res <- DBI::dbGetQuery(con, "SELECT (*)
FROM mtcars
WHERE am = 1")
# Or save the whole set of data (only useful for smaller datasets, for better performance and for larger datasets always use remote servers):
mtcars <- dbReadTable(con, mtcars)
# Close the connection
dbDisconnect(con)Query a database using ClickHouse functions
# Get the names of all the avaliable databases
DBI::dbGetQuery(con, "SHOW DATABASES")
# Get information about the variable names and types
DBI::dbGetQuery(con, "DESCRIBE TABLE mtcars")
# Compact CASE - WHEN - THEN conditionals
DBI::dbGetQuery(con, "SELECT multiIf(am='1', 'automatic', 'manual') AS 'transmission'
,multiIf(vs='1', 'straight', 'V-shaped') AS 'engine'
FROM mtcars")
# Close the connection
dbDisconnect(con)You may use a config file that is looked up for automatic initialization of the dbConnect parameters.
To do so, create a yaml file (default RClickhouse.yaml), in at least one directory (default lookup paths of parameter config_paths: ./RClickhouse.yaml, ~/.R/RClickhouse.yaml, /etc/RClickhouse.yaml), e.g. ~/.R/configs/RClickhouse.yaml and pass a vector of the corresponding file paths to dbConnect as config_paths parameter.
In RClickhouse.yaml, you may specify a variable number of parameters (host, port, db, user, password, compression) to be initialized using the following format (example):
host: example-db.com
port: 1111The actual initialization of the parameters of dbConnect follows a hierarchical structure with varying priorities (1 to 3, where 1 is highest):
- Specified input parameters when calling
dbConnect. If parameters are unspecified, fall back to (2) - Parameters specified in
RClickhouse.yaml, where the level of priority depends on the position of the path in the config_path input vector (first position, highest priority etc.). If parameters are unspecified, fall back to (3). - Default parameters (
host="localhost", port = 9000, db = "default", user = "default", password = "", compression = "lz4").
Big thanks to Kirill Müller, Maxwell Peterson, Artemkin Pavel and Hannes Mühleisen.