Data as Code

Packaging data as code with DuckDB and S3

John Hall and Olivier Leroy

2025-09-09

1 Introduction


1.1 About us!


1.2 Center On Rural Innovation: Unlocking Rural Potential

Creating opportunities that empower rural people to thrive in the tech economy.

Website: https://ruralinnovation.us/

Data team blog: https://ruralinnovation.github.io/blog/about.html


1.3 Data as code?

(Sorry 
  `,`(no 
     !(abstract syntax tree))
)

Probably not valid Lisp or Scheme code


1.4 Key ideas: Packaging data as code with DuckDB and S3

  1. Do not fear permanent storage / databases

  2. Examples of data packages (and data)

  3. A new way to use data via an R package


1.5 An easy example

#!/bin/bash

# append a "key,value" line to the database file
db_set () {
  echo "$1,$2" >> database
}

# return the most recent value stored for a key
db_get () {
  grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
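# usage (hypothetical key/value pair):
#   db_set 42 '{"city": "San Francisco"}'
#   db_get 42    # -> {"city": "San Francisco"}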

A very simple implementation[1] with a lot of problems, but… it saves data and retrieves it!


1.6 The Relational Database Management System (RDBMS)[2]

Data isolation: insulate both the OS and user applications from the underlying data representation.

flowchart LR
    A[Database Storage] --> B[Physical Schema] --> D[Logical Schema] --> E[Applications]

1.7 If only data was already in databases!

No more ETL, just database migrations!

%%{init: {'theme': 'dagre', 'themeVariables': { 'fontSize': '20px', 'fontFamily': 'Bitter'}}}%%
flowchart LR
    S["Upstream"] -->|ETL| A[Database Storage] --> B[Physical Schema] --> D[Logical Schema] --> E[Applications]

Upstream: lots of different objects, usually represented as files, over which you have no control.


2 Data Package Examples


2.1 spData

“datasets for demonstrating, benchmarking and teaching spatial data analysis”[3]

flowchart LR
    S["Upstream"]
    A[Database Storage] 
    B[Physical Schema]
    D[Logical Schema] 
    E[Applications]
    S -->|ETL| A --> B --> D --> E
    subgraph zero["spData"]
      direction LR
      subgraph data_raw
      S
      end
      subgraph one["file systems"]
      A 
      end
      subgraph two["files: rda, gpkg"]
      B
      end
      subgraph three["None"]
      D
      end
      subgraph four["Minimal"]
      E
      end
    end
    style data_raw fill:#fff
    style one fill:#fff
    style two fill:#fff
    style three fill:#fff
    style four fill:#fff
    style zero fill:#fff
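Using it is one attach away, since the datasets ship as plain files inside the installed package; a tiny sketch with the bundled `world` dataset:

library(spData)
# `world` is an sf data frame shipped as an .rda file in the package
head(world[, c("name_long", "pop")])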

2.2 tigris

“directly download and use TIGER/Line shapefiles”[4]

%%{init: {'theme': 'dagre', 'themeVariables': { 'fontSize': '20px', 'fontFamily': 'Bitter'}}}%%
flowchart LR
    S["Upstream"]
    A[Database Storage] 
    B[Physical Schema]
    D[Logical Schema] 
    E[Applications]
    S --> A --> B --> D --> E
    subgraph one["FTP server"]
    S
    A 
    end 
    subgraph two["files shapefiles"]
    B
    end 
    subgraph three["None"]
    D
    end 
    subgraph tigris
    E 
    end
    style one fill:#fff
    style two fill:#fff 
    style three fill:#fff 
    style tigris fill:#fff  
# Download Vermont Census blocks, latest release
vt_blocks <- tigris::blocks("VT", progress_bar = FALSE)
# Do interesting stuff with it!

The US Census FTP is still amazing ❤️, but tigris does a lot of the heavy lifting!


2.3 tidycensus

“interface with a select number of the US Census Bureau’s data APIs”[5]

%%{init: {'theme': 'dagre', 'themeVariables': { 'fontSize': '20px', 'fontFamily': 'Bitter'}}}%%
flowchart LR
    A[Database Storage] 
    B[Physical Schema]
    D[Logical Schema] 
    E[Applications]
    A --> B --> D --> E
    subgraph one["Census API, Upstream"]
    A 
    B 
    D 
    end
    subgraph two["Tidycensus"]
    E 
    end
    style one fill:#fff
    style two fill:#fff
library(tidycensus)
census_api_key("MY_SECRET_KEY") # keep it out of git: use .Renviron or the OS keyring
# median household income per block group, 2020, 90% MOE, for VT
# (ACS data are not published at the block level; block group is the finest)
vt <- get_acs(geography = "block group",
              variables = c(med_income = "B19013_001"),
              state = "VT", year = 2020)

Also provides great “convenience” functions and caching
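For instance, one option enables on-disk caching of the geometries that tigris downloads under the hood; a sketch (the variable and geography choices are ours):

library(tidycensus)
options(tigris_use_cache = TRUE)  # cache downloaded geometries across sessions
vt_county <- get_acs(geography = "county",
                     variables = c(med_income = "B19013_001"),
                     state = "VT", year = 2020, geometry = TRUE)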


2.4 All are valid approaches!

  1. They show how powerful and versatile R packages are at encapsulating complexity.

  2. You can use them to abstract data as code in your workflow.

But: what happens when the data is too big for an API and there is no FTP server?


3 FCC Datasets and cori.data.fcc


3.1 Two big datasets, no FTP and an API rate limit

|                      | Form 477      | National Broadband Map |
|----------------------|---------------|------------------------|
| US Census Boundaries | 2010          | 2020                   |
| Granularity          | Census blocks | Locations              |
| Timeframe            | 2014-2021     | 2022-ongoing           |
| Releases             | twice a year  | twice a year           |
| Records              | 416,447,807   | 3,488,191,994          |
| Size                 | 400 MB/year   | 22 GB/year             |

3.2 First iteration of cori.data.fcc

Only a few functions were packaged:

%%{init: {'theme': 'dagre', 'themeVariables': { 'fontSize': '20px', 'fontFamily': 'Bitter'}}}%%
flowchart LR
    A1[FCC Website] --> B2[source DB]
    A2[TIGER Census block] --> C3
    subgraph one["RDBMS (RDS/PG)"]
    direction TB
    B2 --> C3[Blocks DB]
    end
    subgraph Serving["External Schemas"]
    direction TB
    D1[Climate Risk Mapping]
    D3[Analytic Tables]
    D2[Rural Broadband Map]
    end
    subgraph Applications
    E1
    E2
    E4
    end 
    C3 --> D1
    C3 --> D2
    C3 --> D3
    D1--> E1[Ap. CH]
    D2--> E2[Ap. RBM]
    D3 --> E4[Data team <BR> Broadband team]
    click E2 "https://rural-broadband-map.ruralinnovation.us/"
    click E1 "ttps://broadband-risk.ruralinnovation.us/"
    style one fill:#fff
    style Serving fill:#fff
    style Applications fill: #fff

3.3 Instead of drowning in CSVs, use Parquet


# [...]
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = paste0(duck_dir, "/nbm.duckdb"))
DBI::dbExecute(con, "PRAGMA max_temp_directory_size='10GiB'")
copy_stat <- paste0("COPY (SELECT frn, 
# [...]
      strptime(split_part(filename, '_', 7), '%B%Y')::DATE as release 
    FROM read_csv('", raw_dta_dir, "/*.csv',
              types = { 'frn'        : 'VARCHAR(10)',
# [...]
                      }, ignore_errors = true,
# [...]
              ))
    TO '", data_dir, "/nbm_raw' (FORMAT 'parquet', PARTITION_BY(release, state_usps, technology), OVERWRITE true);")
DBI::dbExecute(con, copy_stat); DBI::dbDisconnect(con)
system("aws s3 sync nbm_raw s3://cori.data.fcc/nbm_raw")

Source: https://github.com/ruralinnovation/cori.data.fcc/blob/main/data-raw/nbm_raw.R
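Reading the partitioned output back is a single query; a quick sanity check, assuming the same `data_dir` and the release/state_usps/technology layout written above:

con <- DBI::dbConnect(duckdb::duckdb())
# hive_partitioning exposes release/state_usps/technology as columns
DBI::dbGetQuery(con, paste0(
  "SELECT release, state_usps, count(*) AS n
   FROM read_parquet('", data_dir, "/nbm_raw/*/*/*/*.parquet', hive_partitioning = true)
   GROUP BY release, state_usps
   ORDER BY release, state_usps;"))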


3.4 Like everyone else, use S3! (Object Storage)

DuckDB has an extension to interact with it!

DBI::dbExecute(con, "INSTALL httpfs;LOAD httpfs")
# yes, this is dynamically loaded
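A minimal smoke test, assuming the bucket is publicly readable and that the region and partition layout below are right (both are our assumptions):

con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbExecute(con, "INSTALL httpfs; LOAD httpfs")
DBI::dbExecute(con, "SET s3_region = 'us-east-1';")  # assumption: adjust to the bucket's region
DBI::dbGetQuery(con,
  "SELECT count(*) FROM read_parquet('s3://cori.data.fcc/nbm_raw/*/*/*/*.parquet');")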

3.5 Fully Embrace R packages

Since DuckDB was built to be “a nice R package”[7], why not fully embrace it in an R package?
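The whole “database setup” then collapses into dependency declarations; a sketch of the relevant DESCRIPTION excerpt (the exact dependency list is our assumption):

# DESCRIPTION (excerpt, assumed)
Imports:
    DBI,
    duckdb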


3.6 Vignettes/articles and more!

Using an R package allows you to leverage all the tools developed by the community!


3.7 Simple example of functions:

get_nbm_bl <- function(geoid_co, release = "latest") {
[...]
  # (the elided code presumably derives `release_target` from `release`)
  con <- DBI::dbConnect(duckdb::duckdb())
  DBI::dbExecute(con, sprintf("SET temp_directory = '%s';", tempdir()))
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbExecute(con, "INSTALL httpfs; LOAD httpfs")
  statement <- sprintf(
    paste0("SELECT *
            FROM read_parquet('s3://cori.data.fcc/nbm_block", release_target, "/*/*.parquet')
            WHERE geoid_co = '%s';"), geoid_co)
  DBI::dbGetQuery(con, statement)
}

nbm_bl <- get_nbm_bl(geoid_co = "47051")

3.8 Potential Improvements:

Be careful: you can still fetch a lot of data! (OOM)

con <- DBI::dbConnect(duckdb::duckdb())
# bind user input as a parameter instead of pasting it into the SQL string
DBI::dbGetQuery(con, paste0(
  "SELECT * FROM read_parquet('s3://cori.data.fcc/nbm_block", release_target, "/*/*.parquet')
   WHERE geoid_co = ?;"), params = list(geoid_co))
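Another way to dodge the OOM risk is to push the work into DuckDB and fetch only the summary; a minimal sketch using the connection above (the `technology` column is an assumption, the path follows `get_nbm_bl()`):

# aggregate remotely; only the small result crosses the wire
DBI::dbGetQuery(con, paste0(
  "SELECT technology, count(*) AS n
   FROM read_parquet('s3://cori.data.fcc/nbm_block", release_target, "/*/*.parquet')
   WHERE geoid_co = ?
   GROUP BY technology;"), params = list(geoid_co))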

3.9 cori.data.fcc: a model to improve

%%{init: {'theme': 'dagre', 'themeVariables': { 'fontSize': '20px', 'fontFamily': 'Bitter'}}}%%
flowchart LR
 subgraph data_raw["data_raw"]
        S["Upstream"]
  end
 subgraph subGraph1["S3 Object Storage"]
        A["Database Storage"]
  end
 subgraph subGraph2["parquet files"]
        B["Physical Schema"]
  end

  subgraph R["R/"]
    direction LR
  E  
  subgraph DuckDB["DuckDB"]
    direction LR
        D["Logical Schema"]
        data_raw
        subGraph1
        subGraph2
  end  
  end
 subgraph s1["cori.data.fcc"]
    direction LR
        E["Applications"]
        data_raw
        subGraph1
        subGraph2
        DuckDB
        R
  end

    S --> A
    A --> B
    B --> D
    D --> E
    style DuckDB fill:#fff
    style data_raw fill:#fff
    style subGraph1 fill: #fff 
    style subGraph2 fill: #fff
    style R fill: #fff 
    style s1 fill: #fff

From the ETL to the application, every piece of code can be kept and organized in a version-controlled R package!

Don’t want our custom functions, or need a custom query? No problem… you can use DuckDB directly (CLI, Python API, duckplyr)!


4 Discussions / Improvements

You still need to respect conventions! Updates to the code or the ETL should be relayed through version bumps and NEWS.md.

Good solution for “Big” data with rare updates

Not a good solution for fast-moving “fact” data: use a data warehouse (and/or DuckLake, Iceberg) instead.

For Slowly Changing Dimensions: “it depends” ¯\_(ツ)_/¯


5 Thank you!

This presentation was built with litedown and hosted on Codeberg!

Codeberg repository: https://codeberg.org/defuneste/useR_2025/

Email me: leroy.olivier@pm.me
My website: https://branchtwigleaf.com/

  1. Kleppmann, Martin (2017). Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Sebastopol, CA: O’Reilly Media.

  2. CMU Database Group, Relational Model & Algebra (CMU Intro to Database Systems).

  3. Bivand R, Nowosad J, Lovelace R (2025). spData: Datasets for Spatial Analysis. R package version 2.3.4. https://jakubnowosad.com/spData/.

  4. Walker K (2016). tigris: Load Census TIGER/Line Shapefiles. https://cran.r-project.org/web/packages/tigris/index.html.

  5. Walker K, Herman M (2025). tidycensus: Load US Census Boundary and Attribute Data as ‘tidyverse’ and ‘sf’-Ready Data Frames. R package version 1.7.3.

  6. Great talk on this problem: https://talks.osgeo.org/foss4g-europe-2024/talk/P78AK7/

  7. https://www.youtube.com/watch?v=zeonmOO9jm4