Data as Code

Packaging data as code with DuckDB and S3

John Hall and Olivier Leroy

2025-09-09

1 Introduction


1.1 About us!


1.2 Center On Rural Innovation: Unlocking Rural Potential

Creating opportunities that empower rural people to thrive in the tech economy.

Website: https://ruralinnovation.us/

Data team blog: https://ruralinnovation.github.io/blog/about.html


1.3 Data as code?

(Sorry 
  `,`(no 
     !(abstract syntax tree))
)

Probably not valid Lisp or Scheme code


1.4 Key ideas: Packaging data as code with DuckDB and S3

  1. Do not fear permanent storage / databases

  2. Examples of data packages (and data)

  3. A new way to use data via an R package


1.5 An easy example

#!/bin/bash

# append a "key,value" line to the database file
db_set () {
  echo "$1,$2" >> database
}

# return the most recent value stored for a key
db_get () {
  grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
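# usage (hypothetical key/value pair):
#   db_set 42 '{"city": "San Francisco"}'
#   db_get 42    # -> {"city": "San Francisco"}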

A very simple implementation[1] with a lot of problems, but… it saves data and retrieves it!


1.6 The Relational Database Management System (RDBMS)[2]

Data isolation: insulate both the OS and user applications from the underlying data representation.

flowchart LR
    A[Database Storage] --> B[Physical Schema] --> D[Logical Schema] --> E[Applications]

1.7 If only data was already in databases!

No more ETL, just database migrations!

%%{init: {'theme': 'dagre', 'themeVariables': { 'fontSize': '20px', 'fontFamily': 'Bitter'}}}%%
flowchart LR
    S["Upstream"] -->|ETL| A[Database Storage] --> B[Physical Schema] --> D[Logical Schema] --> E[Applications]

Upstream: lots of different objects, usually represented as files, over which you have no control.


2 Data Package Examples


2.1 spData

“datasets for demonstrating, benchmarking and teaching spatial data analysis”[3]

flowchart LR
    S["Upstream"]
    A[Database Storage] 
    B[Physical Schema]
    D[Logical Schema] 
    E[Applications]
    S -->|ETL| A --> B --> D --> E
    subgraph zero["spData"]
      direction LR
      subgraph data_raw
      S
      end
      subgraph one["file systems"]
      A 
      end
      subgraph two["files: rda, gpkg"]
      B
      end
      subgraph three["None"]
      D
      end
      subgraph four["Minimal"]
      E
      end
    end
    style data_raw fill:#fff
    style one fill:#fff
    style two fill:#fff
    style three fill:#fff
    style four fill:#fff
    style zero fill:#fff
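Using it is one attach away, since the datasets ship as plain files inside the installed package; a tiny sketch with the bundled `world` dataset:

library(spData)
# `world` is an sf data frame shipped as an .rda file in the package
head(world[, c("name_long", "pop")])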

2.2 tigris

“directly download and use TIGER/Line shapefiles”[4]

%%{init: {'theme': 'dagre', 'themeVariables': { 'fontSize': '20px', 'fontFamily': 'Bitter'}}}%%
flowchart LR
    S["Upstream"]
    A[Database Storage] 
    B[Physical Schema]
    D[Logical Schema] 
    E[Applications]
    S --> A --> B --> D --> E
    subgraph one["FTP server"]
    S
    A 
    end 
    subgraph two["files shapefiles"]
    B
    end 
    subgraph three["None"]
    D
    end 
    subgraph tigris
    E 
    end
    style one fill:#fff
    style two fill:#fff 
    style three fill:#fff 
    style tigris fill:#fff  
# Download Vermont Census blocks, latest release
vt_blocks <- tigris::blocks("VT", progress_bar = FALSE)
# Do interesting stuff with it!

The US Census FTP is still amazing ❤️, but tigris does a lot of the heavy lifting!


2.3 tidycensus

“interface with a select number of the US Census Bureau’s data APIs”[5]

%%{init: {'theme': 'dagre', 'themeVariables': { 'fontSize': '20px', 'fontFamily': 'Bitter'}}}%%
flowchart LR
    A[Database Storage] 
    B[Physical Schema]
    D[Logical Schema] 
    E[Applications]
    A --> B --> D --> E
    subgraph one["Census API, Upstream"]
    A 
    B 
    D 
    end
    subgraph two["Tidycensus"]
    E 
    end
    style one fill:#fff
    style two fill:#fff
library(tidycensus)
census_api_key("MY_SECRET_KEY") # keep it out of git: use .Renviron or the OS keyring
# median household income per block group, 2020, 90% MOE, for VT
# (ACS data are not published at the block level; block group is the finest)
vt <- get_acs(geography = "block group",
              variables = c(med_income = "B19013_001"),
              state = "VT", year = 2020)

Also provides great “convenience” functions and caching
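For instance, one option enables on-disk caching of the geometries that tigris downloads under the hood; a sketch (the variable and geography choices are ours):

library(tidycensus)
options(tigris_use_cache = TRUE)  # cache downloaded geometries across sessions
vt_county <- get_acs(geography = "county",
                     variables = c(med_income = "B19013_001"),
                     state = "VT", year = 2020, geometry = TRUE)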


2.4 All are valid approaches!

  1. They show how powerful and versatile R packages are at encapsulating complexity.

  2. You can use them to abstract data as code in your workflow.

But: what happens when the data is too big for an API and there is no FTP server?


3 FCC Datasets and cori.data.fcc


3.1 Two big datasets, no FTP and an API rate limit

|                      | Form 477      | National Broadband Map |
|----------------------|---------------|------------------------|
| US Census Boundaries | 2010          | 2020                   |
| Granularity          | Census blocks | Locations              |
| Timeframe            | 2014-2021     | 2022-ongoing           |
| Releases             | twice a year  | twice a year           |
| Records              | 416,447,807   | 3,488,191,994          |
| Size                 | 400 MB/year   | 22 GB/year             |

3.2 First iteration of cori.data.fcc

Only a few functions were packaged:

%%{init: {'theme': 'dagre', 'themeVariables': { 'fontSize': '20px', 'fontFamily': 'Bitter'}}}%%
flowchart LR
    A1[FCC Website] --> B2[source DB]
    A2[TIGER Census block] --> C3
    subgraph one["RDBMS (RDS/PG)"]
    direction TB
    B2 --> C3[Blocks DB]
    end
    subgraph Serving["External Schemas"]
    direction TB
    D1[Climate Risk Mapping]
    D3[Analytic Tables]
    D2[Rural Broadband Map]
    end
    subgraph Applications
    E1
    E2
    E4
    end 
    C3 --> D1
    C3 --> D2
    C3 --> D3
    D1--> E1[Ap. CH]
    D2--> E2[Ap. RBM]
    D3 --> E4[Data team <BR> Broadband team]
    click E2 "https://rural-broadband-map.ruralinnovation.us/"
    click E1 "ttps://broadband-risk.ruralinnovation.us/"
    style one fill:#fff
    style Serving fill:#fff
    style Applications fill: #fff

3.3 Instead of drowning in CSVs, use Parquet


# [...]
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = paste0(duck_dir, "/nbm.duckdb"))
DBI::dbExecute(con, "PRAGMA max_temp_directory_size='10GiB'")
copy_stat <- paste0("COPY (SELECT frn, 
# [...]
      strptime(split_part(filename, '_', 7), '%B%Y')::DATE as release 
    FROM read_csv('", raw_dta_dir, "/*.csv',
              types = { 'frn'        : 'VARCHAR(10)',
# [...]
                      }, ignore_errors = true,
# [...]
              ))
    TO '", data_dir, "/nbm_raw' (FORMAT 'parquet', PARTITION_BY(release, state_usps, technology), OVERWRITE true);")
DBI::dbExecute(con, copy_stat); DBI::dbDisconnect(con)
system("aws s3 sync nbm_raw s3://cori.data.fcc/nbm_raw")

Source: https://github.com/ruralinnovation/cori.data.fcc/blob/main/data-raw/nbm_raw.R
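Reading the partitioned output back is a single query; a quick sanity check, assuming the same `data_dir` and the release/state_usps/technology layout written above:

con <- DBI::dbConnect(duckdb::duckdb())
# hive_partitioning exposes release/state_usps/technology as columns
DBI::dbGetQuery(con, paste0(
  "SELECT release, state_usps, count(*) AS n
   FROM read_parquet('", data_dir, "/nbm_raw/*/*/*/*.parquet', hive_partitioning = true)
   GROUP BY release, state_usps
   ORDER BY release, state_usps;"))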


3.4 Like everyone else, use S3! (Object Storage)

DuckDB has an extension to interact with it!

DBI::dbExecute(con, "INSTALL httpfs;LOAD httpfs")
# yes, this is dynamically loaded
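A minimal smoke test, assuming the bucket is publicly readable and that the region and partition layout below are right (both are our assumptions):

con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbExecute(con, "INSTALL httpfs; LOAD httpfs")
DBI::dbExecute(con, "SET s3_region = 'us-east-1';")  # assumption: adjust to the bucket's region
DBI::dbGetQuery(con,
  "SELECT count(*) FROM read_parquet('s3://cori.data.fcc/nbm_raw/*/*/*/*.parquet');")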

3.5 Fully Embrace R packages

Since DuckDB was built to be “a nice R package”[7], why not fully embrace it in an R package?
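The whole “database setup” then collapses into dependency declarations; a sketch of the relevant DESCRIPTION excerpt (the exact dependency list is our assumption):

# DESCRIPTION (excerpt, assumed)
Imports:
    DBI,
    duckdb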


3.6 Vignettes/articles and more!

Using an R package allows you to leverage all the tools developed by the community!


3.7 Simple example of functions:

get_nbm_bl <- function(geoid_co, release = "latest") {
[...]
  # (the elided code presumably derives `release_target` from `release`)
  con <- DBI::dbConnect(duckdb::duckdb())
  DBI::dbExecute(con, sprintf("SET temp_directory = '%s';", tempdir()))
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbExecute(con, "INSTALL httpfs; LOAD httpfs")
  statement <- sprintf(
    paste0("SELECT *
            FROM read_parquet('s3://cori.data.fcc/nbm_block", release_target, "/*/*.parquet')
            WHERE geoid_co = '%s';"), geoid_co)
  DBI::dbGetQuery(con, statement)
}

nbm_bl <- get_nbm_bl(geoid_co = "47051")

3.8 Potential Improvements:

Be careful: you can still fetch a lot of data! (OOM)

con <- DBI::dbConnect(duckdb::duckdb())
# bind user input as a parameter instead of pasting it into the SQL string
DBI::dbGetQuery(con, paste0(
  "SELECT * FROM read_parquet('s3://cori.data.fcc/nbm_block", release_target, "/*/*.parquet')
   WHERE geoid_co = ?;"), params = list(geoid_co))
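Another way to dodge the OOM risk is to push the work into DuckDB and fetch only the summary; a minimal sketch using the connection above (the `technology` column is an assumption, the path follows `get_nbm_bl()`):

# aggregate remotely; only the small result crosses the wire
DBI::dbGetQuery(con, paste0(
  "SELECT technology, count(*) AS n
   FROM read_parquet('s3://cori.data.fcc/nbm_block", release_target, "/*/*.parquet')
   WHERE geoid_co = ?
   GROUP BY technology;"), params = list(geoid_co))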

3.9 cori.data.fcc: a model to improve

%%{init: {'theme': 'dagre', 'themeVariables': { 'fontSize': '20px', 'fontFamily': 'Bitter'}}}%%
flowchart LR
 subgraph data_raw["data_raw"]
        S["Upstream"]
  end
 subgraph subGraph1["S3 Object Storage"]
        A["Database Storage"]
  end
 subgraph subGraph2["parquet files"]
        B["Physical Schema"]
  end

  subgraph R["R/"]
    direction LR
  E  
  subgraph DuckDB["DuckDB"]
    direction LR
        D["Logical Schema"]
        data_raw
        subGraph1
        subGraph2
  end  
  end
 subgraph s1["cori.data.fcc"]
    direction LR
        E["Applications"]
        data_raw
        subGraph1
        subGraph2
        DuckDB
        R
  end

    S --> A
    A --> B
    B --> D
    D --> E
    style DuckDB fill:#fff
    style data_raw fill:#fff
    style subGraph1 fill: #fff 
    style subGraph2 fill: #fff
    style R fill: #fff 
    style s1 fill: #fff

From the ETL to the application, every piece of code can be kept and organized in a version-controlled R package!

Don’t want our custom functions, or need a custom query? No problem… you can use DuckDB directly (CLI, Python API, duckplyr)!


4 Discussions / Improvements

You still need to respect conventions! Updates to the code or the ETL should be relayed through version bumps and NEWS.md.

Good solution for “Big” data with rare updates

Not a good solution for fast-moving “fact” data: use a data warehouse (and/or DuckLake, Iceberg) instead.

For Slowly Changing Dimensions: “it depends” ¯\_(ツ)_/¯


5 Thank you!

This presentation was built with litedown and hosted on Codeberg!

Codeberg repository: https://codeberg.org/defuneste/useR_2025/

Email me: leroy.olivier@pm.me
My website: https://branchtwigleaf.com/

  1. Kleppmann, Martin (2017). Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Sebastopol, CA: O’Reilly Media.

  2. CMU Database Group, Relational Model & Algebra (CMU Intro to Database Systems).

  3. Bivand R, Nowosad J, Lovelace R (2025). spData: Datasets for Spatial Analysis. R package version 2.3.4. https://jakubnowosad.com/spData/.

  4. Walker K (2016). tigris: Load Census TIGER/Line Shapefiles. https://cran.r-project.org/web/packages/tigris/index.html.

  5. Walker K, Herman M (2025). tidycensus: Load US Census Boundary and Attribute Data as ‘tidyverse’ and ‘sf’-Ready Data Frames. R package version 1.7.3.

  6. Great talk on this problem: https://talks.osgeo.org/foss4g-europe-2024/talk/P78AK7/

  7. https://www.youtube.com/watch?v=zeonmOO9jm4