SQLite for Research Artifact Storage
I went to relocate 25 GB of 31 kB files over NFS the other day…
9-12 hours later, I can finally close my laptop lid.
Over a (theoretical) 100 Gb connection (but without any careful tuning, iperf3 results of 90Gb-ish), this should have take seconds. But damn if the network doesn’t throw a wrench into the works.
A Persistent Issue
So exactly what was going on here isn’t important, but for the record, my best guess is that untuned, unoptimized NFS mounts are single-threading an extreme amount of per-file metadata (relative to file size) that ultimately caps my file-transfer throughput at ~10 Mb; that is many orders of magnitude less than it could be.
The bigger issue though is that we often have to deal with an enormous amount of relatively microscopic files. They are all associated with the same experiment or topic, but need to also store binary data (think like a JPG or a PNG, or a numpy array). Our general approach has been “the filesystem” and carefully argued/constructed/agreed upon directory hierarchies. My sinking frustration and suspicion is that the hierarchy often fails to correctly model the data we’re trying to represent or share.
Hierarchies are shockingly good at modeling a lot of things (maybe that’s why human organizational structures often default back to this), but they don’t represent everything. From a usability standpoint though, the filesystem is hyper-convenient when quickly scanning through a lot of files at the OS-level, but has so many bad tradeoffs from a long-term “management” perspective. FWIW, maybe the only feature I still miss from Mac OS is the space-bar file preview (“quick look,” I think?) that performed a quick, often pretty capable, render of whatever the currently selected file in the Finder was.
I digress.
One of those bad side effects of modeling results at the “file” level is the file transfer issue. Transferring a single, massive file is radically more efficient than my attempt to tranfer large amounts of relatively small files. And of course, I could elect to spend 10 hours optimizing NFS for the current network architecture that could of course make my transfer “mode” much more viable, but sheesh, who has the time?
No seriously: who has the time??
Towards self-sufficiency
I manage my own systems. I need most things to work out of the box. My current mantra is “less.” That’s it: “less.” I need less of everything, and configuration is part of that.
I’m also willing to adapt myself, and one thing we can do is adapt our methods to better suit the “general” use case (like slamming 25 GB of garbage over our NFS mount). Software tooling is so much more abstract than anything physical, and it can be difficult to know what merits solving versus what merits just toughing it out & babysitting. In this case, the short answer is that if that 25GB can be a single file, it’ll go way better for us these days in like 99% of use cases. Not even worth bothering to solve smaller edge cases, IMO.
In my case, I’m trying to wrangle the output of a Monte Carlo simulation where each file is an annotated plot ~30 kB in size. This is so frustratingly common among medical imaging (e.g., volumetric DICOM studies as individual files per “slice” rather than a single, self-describing file… gimme a break!)
The value at the DICOM-level is standardization and cross-compatibility, but in pretty much every use case we have for research these days, that standard is a massive hindrance. We’re pro-standard but anti-DICOM (outside of the proven, tested and highly-valuable clinical use cases for radiology i.e., PACS).
Databases are optimized for different things
Our experimental results don’t need to be PACS-compliant, but they do need to be comprehensible at a later date in time, at least to us and preferably to a broader community (and ultimately back to PACS in the name of translation). Traceability and auditability of results are increasingly important, and one place I still see stuff getting lost is the filesystem.
SQLite has been a huge inspiration in design decisions for core. In the back of my mind, I’ve been looking for more ways to use SQLite in our research, but the fact that a lot of our output is file-centric has been an exclusion criterion.
… but why?
Core is designed to solve the problem of data-oriented sharing among independent processes. But really it’s just trying to find a shared, intuitive language that we as programmers and developers can use to push information to each other from any process. It’s a spec more than a tool. It’s essentially just a schema that can work in like 99% of use cases. And also that’s the exact same problem SQLite solved (among many other things).
The binary-data use case falls into the fuzzy overlap between the two solutions.
A tool like core that’s designed to handle big data blobs vs a SQL-oriented
DB tied to strong-ish typing, both trying to force the “simplicity” hand
by limiting the accepted types into narrow-er formats.
Getting to a single file
I confess that I’m not a SQLite expert, but my general understanding is that although it can handle blobs up to 1GB, it’s really not intended to just “slam in” data to your main table (ever, but especially) > 1MB.
But here’s the thing no one seems to realize: computers have gotten insanely good over the last 100 years. Even if it’s not “meant for” something, it may still be way better than whatever you’re currently doing.
So here’s the honest truth: if your data “artifacts” are < 1MB, YOLO.
If your data is > 1 MB, YOLO… until you have evidence it doesn’t work for you.
Below is the sketch of a wrapper that I think most science folks would enjoy working with and consolidates a TON of functionality and reliability/reproducibility with essentially no real overhead. It solves the single-file consolidation and builds on the single most widely deployed DB in existence It’s so simple you don’t even really need a library. Just copy/paste the code at the end of the section.
You’ll end up writing code like:
schema = {
"label": "TEXT",
"variable1": "REAL",
"observation1": "REAL",
"image": "BLOB",
}
r = ResultStore("data.sqlite", "experiment-1", schema)
for config in experiments:
# take measurement
s = sensor.Read(config.setting)
img = s.array
# convert to bytes
data = Serialize.numpy(img)
# record measurement
r.record(variable1=config.setting, observation1=s, label=config.description, image=data)
The code for the ResultStore type:
import sqlite3
import pandas as pd
import datetime
from typing import Dict
class ResultStore:
def __init__(
self, db_path: str, experiment_name: str, schema: Dict[str, str] | None = None
):
"""
db_path: Path to the .db file
experiment_name: Name of the experiment (used to name the table)
schema: Dictionary mapping field names to SQLite types (e.g., {"temp": "REAL"})
"""
self.db_path = db_path
self.table_name = experiment_name
self.schema = schema
# Ensure the database and table exist
if schema is not None:
self._initialize_db()
@classmethod
def load_dataframe(
cls, db_path: str, table_name: str, load_blob: bool = False
) -> pd.DataFrame:
"""
load_dataframe is a convenience method intended to streamline
dealing with result store when you don't care about SQLite.
load_blob: If False, columns of type BLOB will be excluded from the query.
"""
with sqlite3.connect(db_path) as conn:
if not load_blob:
# Fetch metadata for the specific table
# PRAGMA table_info returns (id, name, type, notnull, default_value, pk)
cursor = conn.cursor()
cursor.execute(f"PRAGMA table_info({table_name})")
columns_info = cursor.fetchall()
# Filter out columns where the type is "BLOB"
# index 1 is the column name, index 2 is the type
allowed_cols = [
row[1] for row in columns_info if row[2].upper() != "BLOB"
]
# Construct query with only non-blob columns
query = f"SELECT {', '.join(allowed_cols)} FROM {table_name}"
else:
query = f"SELECT * FROM {table_name}"
df = pd.read_sql_query(query, conn)
return df
@classmethod
def list_experiments(cls, db_path: str) -> list[str]:
"""
Opens the database and returns a list of all user-defined
experiment names (table names) found in the file.
db_path: Path to the sqlite file
A list of strings representing the experiment names.
"""
with sqlite3.connect(db_path) as conn:
# Query sqlite_master to find all table names.
# exclude sqlite_ tables
query = """
SELECT name
FROM sqlite_master
WHERE type = 'table'
AND name NOT LIKE 'sqlite_%';
"""
cursor = conn.cursor()
cursor.execute(query)
# Extract the first element of each tuple returned by the cursor
tables = [row[0] for row in cursor.fetchall()]
return tables
def _initialize_db(self):
if self.schema is None:
raise RuntimeError("cannot initialize db without schema")
with sqlite3.connect(self.db_path) as conn:
cols = ["id INTEGER PRIMARY KEY AUTOINCREMENT", "timestamp TEXT"]
for name, dtype in self.schema.items():
cols.append(f"{name} {dtype}")
query = f"CREATE TABLE IF NOT EXISTS {self.table_name} ({', '.join(cols)})"
conn.execute(query)
conn.commit()
def record(self, **kwargs) -> int:
"""
Records a single result.
kwargs should match the keys in the schema.
"""
kwargs["timestamp"] = datetime.datetime.now().isoformat()
# Build the query
query = f"INSERT INTO {self.table_name} ({', '.join(kwargs.keys())}) VALUES ({', '.join(['?']*len(kwargs))})"
with sqlite3.connect(self.db_path) as conn:
cursor = conn.cursor()
cursor.execute(query, list(kwargs.values()))
conn.commit()
if cursor.lastrowid is None:
raise RuntimeError("DB update failed")
return cursor.lastrowid
Standards FTW
An additional key benefit here is the ability of a downstream user to read the DB. Even if we document nothing, it’s heading into the most widely deployed DB format in existence. There are CLI tools to parse SQLite and there are GUI tools to navigate SQLite. SQLite plugs in nicely to Pandas. SQLite is consolidated into a single file on disk which makes for nice transfers and organization. Backups get easier.
Extracting data can be slightly more complex, but this is honestly maybe a
great use case for non-expert users to enlist AI tools to write a simple
parsing script. “I have a sqlite database with a table called
Reproducibility is a crisis right now. There are valid use cases for data obfuscation, but that should be a conscious choice rather than the default option purely because we can’t standardize on anything. Henceforth, I’m going to be using this approach, and making these full artifacts available where I can. Efficient transfer formats make this choice way easier and more manageable.
The Other Key Benefits
The database vs files provides a better linkage between metadata and results. I can’t tell you how many parsing scripts I’ve had to write to extract the exeriment configuration out of a filename.
Other great formats (HDF5, Zarr) can work for similar applications. HDF5 and SQLite both have a slightly abnormal “open source” models, so the choice is kind of wash based on preference. I find the SQLite model more appealing at the moment. Zarr is awesome if you’re working in a cloud/network heavy model. We are for certain things, but not others. I feel confident asserting that SQLite is proven to work in more environments that anything else and that counts for me right now when I need guarantees, not fun promises.
Of course there are many applications for which this isn’t the best choice, but for many of the imaging experiments I’m working on, this model works perfectly. I think a lot of problems can be sliced into components that would fit even better with the SQLite model, but that’s left to the reader as an exercise.
<3
2026-06-19 11:11:11.11