Creating opportunities that empower rural people to thrive in the tech economy.
Website: https://ruralinnovation.us/
Data team blog: https://ruralinnovation.github.io/blog/about.html
(Sorry
`,`(no
!(abstract syntax tree))
)
Probably not valid lisp or scheme code
Do not fear permanent storage / database
Example of data packages (and data)
A new way to use data via an R package
#!/bin/bash
db_set () {
    echo "$1,$2" >> database
}

db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
Very simple implementation1 with a lot of problems but… it saves data and retrieves it!
Data Isolation: isolate both the OS and user/applications from data representations.
Tables (relations) as data structures
DBMS manages storage
DBMS provides a declarative language to access it
Your applications are happy 😀
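As a minimal sketch of that declarative access (my own illustration, using DBI and an in-memory DuckDB; the table and column names are made up): the application states what it wants, and the DBMS decides how to read the storage.

```r
library(DBI)

# Illustrative only: a throwaway in-memory database and a made-up table
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "blocks", data.frame(geoid = c("47051", "50007"),
                                       pop   = c(120L, 85L)))

# Declarative: describe the result, not the access path
dbGetQuery(con, "SELECT geoid, pop FROM blocks WHERE pop > 100")

dbDisconnect(con)
```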
flowchart LR A[Database Storage] --> B[Physical Schema] --> D[Logical Schema] --> E[Applications]
No more ETL, just database migrations!
%%{init: {'theme': 'dagre', 'themeVariables': { 'fontSize': '20px', 'fontFamily': 'Bitter'}}}%% flowchart LR S["Upstream"] -->|ETL| A[Database Storage] --> B[Physical Schema] --> D[Logical Schema] --> E[Applications]
Upstream: lots of different objects, usually represented as files, that you have no control over
“datasets for demonstrating, benchmarking and teaching spatial data analysis”3
flowchart LR S["Upstream"] A[Database Storage] B[Physical Schema] D[Logical Schema] E[Applications] S -->|ETL| A --> B --> D --> E subgraph zero["spData"] direction LR subgraph data_raw S end subgraph one["file systems"] A end subgraph two["files: rda, gpkg"] B end subgraph three["None"] D end subgraph four["Minimal"] E end end style data_raw fill:#fff style one fill:#fff style two fill:#fff style three fill:#fff style four fill:#fff style zero fill:#fff
data/ and inst/ store the data (moved to .libPaths()/spData/ on installation)
Processing in data_raw/
Documentation in R/ and the package website
“directly download and use TIGER/Line shapefiles”4
%%{init: {'theme': 'dagre', 'themeVariables': { 'fontSize': '20px', 'fontFamily': 'Bitter'}}}%% flowchart LR S["Upstream"] A[Database Storage] B[Physical Schema] D[Logical Schema] E[Applications] S --> A --> B --> D --> E subgraph one["FTP server"] S A end subgraph two["files shapefiles"] B end subgraph three["None"] D end subgraph tigris E end style one fill:#fff style two fill:#fff style three fill:#fff style tigris fill:#fff
# Download Vermont Census blocks, last release
vt_blocks <- tigris::blocks("VT", progress_bar = FALSE)
# Do interesting stuff with it!
The US Census FTP is still amazing ❤️, but tigris does a lot of the heavy lifting!
“interface with a select number of the US Census Bureau’s data APIs”5
%%{init: {'theme': 'dagre', 'themeVariables': { 'fontSize': '20px', 'fontFamily': 'Bitter'}}}%% flowchart LR A[Database Storage] B[Physical Schema] D[Logical Schema] E[Applications] A --> B --> D --> E subgraph one["Census API, Upstream"] A B D end subgraph two["Tidycensus"] E end style one fill:#fff style two fill:#fff
library(tidycensus)
census_api_key("MY_SECRET_API_KEY") # keep it out of git: .Renviron or OS keyring
# Median household income per block group, 2020, 90% MOE, for VT
vt <- get_acs(geography = "block group",
              variables = c(med_income = "B19013_001"),
              state = "VT", year = 2020)
Also provides great “convenience” functions and caching
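For instance (a hedged sketch: geometry = TRUE and the tigris cache option are documented features of tidycensus/tigris; the variable choice is just an example):

```r
library(tidycensus)

options(tigris_use_cache = TRUE)  # cache downloaded TIGER geometries between runs

# geometry = TRUE joins the estimates to their boundaries and returns an sf object
vt_counties <- get_acs(geography = "county",
                       variables = c(med_income = "B19013_001"),
                       state = "VT", year = 2020,
                       geometry = TRUE)
```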
They demonstrate how powerful and versatile R packages are at encapsulating complexity.
You can use them to abstract data as code in your workflow
But:
How do we deal with bigger datasets?
Should we use logical schemas?
I heard modularity is still cool…
| | Form 477 | National Broadband Map |
|---|---|---|
| US Census Boundaries | 2010 | 2020 |
| Granularity | Census blocks | Locations |
| Timeframe | 2014-2021 | 2022-ongoing |
| Releases | Twice a year | Twice a year |
| Records | 416,447,807 | 3,488,191,994 |
| Size | 400 MB/year | 22 GB/year |
To get a full NBM release, you need to click around 440 links, spread across multiple menus6.
“Sneaky updates”: after some time, the FCC started using a file-name convention to communicate updates (but no changelog…).
A few functions packaged:
get_nbm_available()
dl_nbm()
%%{init: {'theme': 'dagre', 'themeVariables': { 'fontSize': '20px', 'fontFamily': 'Bitter'}}}%% flowchart LR A1[FCC Website] --> B2[source DB] A2[TIGER Census block] --> C3 subgraph one["RDBMS (RDS/PG)"] direction TB B2 --> C3[Blocks DB] end subgraph Serving["External Schemas"] direction TB D1[Climate Risk Mapping] D3[Analytic Tables] D2[Rural Broadband Map] end subgraph Applications E1 E2 E4 end C3 --> D1 C3 --> D2 C3 --> D3 D1--> E1[Ap. CH] D2--> E2[Ap. RBM] D3 --> E4[Data team <BR> Broadband team] click E2 "https://rural-broadband-map.ruralinnovation.us/" click E1 "https://broadband-risk.ruralinnovation.us/" style one fill:#fff style Serving fill:#fff style Applications fill: #fff
Transform all the CSVs into a hive-partitioned parquet dataset
One partition key is “release”, which makes updates easy
We can freely convert the metadata contained in the file names into columns
DuckDB was used here (nanoparquet is another option):
# [...]
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = paste0(duck_dir, "/nbm.duckdb"))
DBI::dbExecute(con, "PRAGMA max_temp_directory_size='10GiB'")
copy_stat <- paste0(
  "COPY (SELECT frn,
         # [...]
         strptime(split_part(filename, '_', 7), '%B%Y')::DATE as release
   FROM read_csv('", raw_dta_dir, "/*.csv',
        types = { 'frn' : 'VARCHAR(10)',
                  # [...]
                }, ignore_errors = true,
        # [...]
        ))
   TO '", data_dir, "/nbm_raw' (FORMAT 'parquet', PARTITION_BY(release, state_usps, technology), OVERWRITE true);")
DBI::dbExecute(con, copy_stat); DBI::dbDisconnect(con)
system("aws s3 sync nbm_raw s3://cori.data.fcc/nbm_raw")
Source: https://github.com/ruralinnovation/cori.data.fcc/blob/main/data-raw/nbm_raw.R
We already used it (S3) to store all the CSVs
We can now use it as “Database Storage” for our parquet files and save money!
Since it is not expensive, we can host the data for others
DuckDB has an extension to interact with it!
DBI::dbExecute(con, "INSTALL httpfs;LOAD httpfs")
# yes, this is dynamically loaded
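For example, once httpfs is loaded you can query the hosted files in place. A hedged sketch (it assumes the release/state_usps/technology hive layout produced by the COPY above, and that the bucket allows anonymous reads with your default S3 region settings):

```r
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbExecute(con, "INSTALL httpfs;")
DBI::dbExecute(con, "LOAD httpfs;")

# The partition keys (release, state_usps, technology) are exposed as columns
DBI::dbGetQuery(con, "
  SELECT release, count(*) AS n_rows
  FROM read_parquet('s3://cori.data.fcc/nbm_raw/*/*/*/*.parquet', hive_partitioning = true)
  WHERE state_usps = 'VT'
  GROUP BY release
  ORDER BY release
")

DBI::dbDisconnect(con)
```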
Since DuckDB was built to be “a nice R package”7, why not fully embrace it in an R package?
Moves part of the documentation from other repos to the package
Provides convenience functions
Wraps queries in R functions
Using an R package allows you to leverage all the tools developed by the community!
You can “spy” and look at how people implement their ideas
Documentation within R (vignette(), help()) or using pkgdown.
R CMD check / tests and more
get_nbm_bl <- function(geoid_co, release = "latest") {
  # [...]
  con <- DBI::dbConnect(duckdb::duckdb())
  DBI::dbExecute(con, sprintf("SET temp_directory ='%s';", tempdir()))
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbExecute(con, "INSTALL httpfs;LOAD httpfs")
  statement <- sprintf(
    paste0("select *
            from read_parquet('s3://cori.data.fcc/nbm_block", release_target, "/*/*.parquet')
            where geoid_co = '%s';"), geoid_co)
  DBI::dbGetQuery(con, statement)
}
nbm_bl <- get_nbm_bl(geoid_co = "47051")
Be careful, you can still fetch a lot of data! (OOM)
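One way to guard against that (a hedged sketch, my own illustration: the county GEOID and the nbm_block* glob are assumptions): aggregate inside DuckDB and only pull the summary back into R.

```r
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbExecute(con, "INSTALL httpfs;")
DBI::dbExecute(con, "LOAD httpfs;")

# Return a single count instead of materializing millions of rows in R
DBI::dbGetQuery(con, "
  SELECT count(*) AS n_rows
  FROM read_parquet('s3://cori.data.fcc/nbm_block*/*/*.parquet')
  WHERE geoid_co = '47051'
")

DBI::dbDisconnect(con)
```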
read_only argument: con <- DBI::dbConnect(duckdb::duckdb(read_only = TRUE))
Parameterized queries (bind values instead of interpolating them):
statement <- paste0(
  "select *
   from read_parquet('s3://cori.data.fcc/nbm_block", release_target, "/*/*.parquet')
   where geoid_co = ?;")
DBI::dbGetQuery(con, statement, params = list(geoid_co))
%%{init: {'theme': 'dagre', 'themeVariables': { 'fontSize': '20px', 'fontFamily': 'Bitter'}}}%% flowchart LR subgraph data_raw["data_raw"] S["Upstream"] end subgraph subGraph1["S3 Object Storage"] A["Database Storage"] end subgraph subGraph2["parquet files"] B["Physical Schema"] end subgraph R["R/"] direction LR E subgraph DuckDB["DuckDB"] direction LR D["Logical Schema"] data_raw subGraph1 subGraph2 end end subgraph s1["cori.data.fcc"] direction LR E["Applications"] data_raw subGraph1 subGraph2 DuckDB R end S --> A A --> B B --> D D --> E style DuckDB fill:#fff style data_raw fill:#fff style subGraph1 fill: #fff style subGraph2 fill: #fff style R fill: #fff style s1 fill: #fff
From the ETL to the application, every piece of code can be kept and organized in a version-controlled R package!
Don’t want our custom functions or need a custom query? No problem… You can use DuckDB (CLI, Python API, duckplyr)!
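For example, a hedged dplyr-flavoured sketch through dbplyr (not the package's own helpers; the S3 glob and the geoid_co column are taken from the examples above):

```r
library(dplyr)

con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbExecute(con, "INSTALL httpfs;")
DBI::dbExecute(con, "LOAD httpfs;")

# Lazy table over the remote parquet files; dbplyr translates the pipeline to SQL
nbm <- tbl(con, sql("SELECT * FROM read_parquet('s3://cori.data.fcc/nbm_block*/*/*.parquet')"))

nbm |>
  filter(geoid_co == "47051") |>
  count() |>
  collect()   # only the aggregated result travels back to R

DBI::dbDisconnect(con)
```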
You still need to respect conventions! Updates to the code or the ETL should be communicated through versioning and NEWS.md
Good solution for “Big” data with rare updates
Not a good solution for “fact” data: use Data Warehouses (and/or DuckLake, Iceberg) instead
For Slowly Changing Dimensions: “it depends” ¯\_(ツ)_/¯
Still considered experimental, so try it! The cost is low and benefits are high!
You still need to rely on an external storage
This presentation was built with litedown and is hosted on Codeberg!
Codeberg repository: https://codeberg.org/defuneste/useR_2025/
Email me: leroy.olivier@pm.me My Website: https://branchtwigleaf.com/
Kleppmann, Martin (2017). Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Sebastopol, CA: O'Reilly Media. ↩
CMU Database Group, Relational Model & Algebra (CMU Intro to Database Systems). ↩
Bivand R, Nowosad J, Lovelace R (2025). spData: Datasets for Spatial Analysis. R package version 2.3.4, https://jakubnowosad.com/spData/. ↩
Walker K (2016). tigris: Load Census TIGER/Line Shapefiles. https://cran.r-project.org/web/packages/tigris/index.html ↩
Walker K, Herman M (2025). tidycensus: Load US Census Boundary and Attribute Data as ‘tidyverse’ and ‘sf’-Ready Data Frames. R package version 1.7.3. ↩
Great talk on this problem: https://talks.osgeo.org/foss4g-europe-2024/talk/P78AK7/ ↩