
0001 - Direct Storage Architecture

Build a new Graft client library (called graft-kernel) which interfaces directly with object storage, eliminating the need for the MetaStore and PageStore and setting Graft up as a viable replacement for systems like Litestream. Graft should focus on providing best-in-class PITR, branching, and sparse replication for page-based workloads.

An overview of why we should consider making this major change to Graft’s architecture.

Currently Graft requires a MetaStore and PageStore to support replication to and from object storage. This architecture has the following advantages and disadvantages:

Advantages

  • The MetaStore can efficiently roll up commits to fast forward clients, increasing performance and enabling instant read replicas.
  • The PageStore acts as a smart cache, allowing clients to pull only the pages they need at the edge.
  • The PageStore is able to colocate writes to multiple Volumes in the same Segment, which can reduce the cost and overhead of small transactions.

Disadvantages

  • There is little isolation between data in different Volumes. Graft will need to roll out a comprehensive encryption + authorization layer to work for production workloads. This is a huge cost in terms of testing and engineering.
  • Users must run two services to take full advantage of Graft, which makes Graft much harder to use.

In a discussion with Simon Willison and Alex Garcia, we talked about some of their dream features for SQLite + Graft:

Rollback database to earlier version: The ability to cheaply roll back a database would make risky features, like giving an LLM read/write access to your database, much safer. Additionally, the ability to branch a database at a particular version may enable risk-free experimentation and testing.

Read-only replication: Cheap and fast read-only replication to horizontally scale a heavy query workload over multiple machines, or simply to expose data to less-trusted users.

Composability with IAM permissions: Currently, Datasette uses IAM keys limited to a single S3 prefix to restrict Litestream’s access to a single tenant’s data. This ensures that a bug in Litestream can affect at most a single tenant.

This feature implies that data does not cross “tenant” boundaries (or in this case, the configured S3 prefix).

In a discussion with a potential user, they expressed reservations due to the layer of indirection between the Graft client and object storage. Their main argument is that S3 is already proven to handle extremely high scale. They would be more comfortable using Graft if clients connected directly to object storage to pull changes. In some cases, this may also reduce costs due to free bandwidth between compute and S3.

Graft should work out of the box without requiring additional services to run. By supporting direct access to object storage, it will be easier to get started and embed Graft in an application.

A high-level explanation of how this feature works and how it would change the behavior of existing Graft clients such as graft-sqlite and graft-fuse.

graft-kernel implements the Graft Kernel which supersedes the functionality of the graft-client and graft-server crates. The Kernel provides access to a remote Volume Catalog, local storage, and client functionality to downstream crates like graft-sqlite and graft-fuse.

Client functionality: Clients such as graft-sqlite and graft-fuse will use the graft-kernel to operate on Volumes. The Kernel is designed to be embedded in the application, performing I/O in a small set of background threads. The Kernel will be implemented as an async core wrapped with both async and sync APIs.
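
For illustration, here is a minimal sketch of the async-core-with-sync-facade pattern described above. The type and method names (and the use of tokio) are assumptions, not the Kernel’s actual API:

use tokio::runtime::Runtime;

pub struct Kernel {
    // background runtime that owns the Kernel's I/O threads
    rt: Runtime,
}

impl Kernel {
    pub fn open() -> std::io::Result<Self> {
        // a small multi-threaded runtime drives all background I/O
        let rt = tokio::runtime::Builder::new_multi_thread()
            .worker_threads(2)
            .enable_all()
            .build()?;
        Ok(Kernel { rt })
    }

    // async API: callers already on an async runtime await directly
    pub async fn pull(&self) { /* ... */ }

    // sync API: a thin facade that blocks on the async core
    pub fn pull_blocking(&self) {
        self.rt.block_on(self.pull())
    }
}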

The Graft Proxy is an optional stateless edge service and caching layer which makes it easier for Graft to replicate to & from devices.

Graft Proxy exposes a simple API to consumers, enabling two key performance features:

  1. Graft Proxy caches reads from object storage.
  2. Graft Proxy maintains a virtual overlay of commits and segments: rather than passing requests straight through to the object store, the proxy is able to coalesce multiple commits into one larger commit, allowing clients to efficiently fast forward and download only the interesting portions of segments.

Eventually Graft Proxy will enhance Graft with these features:

  • Volume subscriptions, eliminating the need to poll for changes.
  • Granular authorization.
  • Direct byte-range puts and gets against Volumes, enabling “dumb clients”.

A detailed technical breakdown covering APIs, algorithms, data structures, formats, edge cases, and performance implications.

  • Volume Handle: A reference to a local-remote Volume pair. Also responsible for tracking synchronization between a local and remote Volume. Documented more in the Volume Handle section.
  • vid: A 16 byte Volume ID using GID encoding.
  • sid: A 16 byte Segment ID using GID encoding.
  • Snapshot: A frozen point-in-time view of a Volume.
  • lsn: Documented in the Volume log section.
  • pageidx: A 4 byte page index, representing the index of a page within a Volume. Valid range: [1, 2^32).
  • Graft: A compressed bitset (based on Splinter), used to keep track of which pageidxs appear in a Segment.
  • Segment: A compressed sequence of pages, sorted by pageidx. Documented in the Segment Encoding section.
  • VolumeRef: A (vid, lsn) tuple, representing a fixed point in a Volume’s history.
  • CommitHash: Documented in the Commit hash section.

A volume’s durable state consists of a Checkpoint and a Log.

  • Checkpoint — a point-in-time mapping from each non-empty pageidx to the version (by LSN) that was current when the checkpoint was taken.
  • Log — an append-only sequence of log records. Each record contains the set of pages modified since the previous LSN.

An LSN has the following properties:

  • Domain: Unsigned 64-bit integer in the range [1, 2^64). Zero is invalid.
  • Ordering: Strictly increasing, gap-free, and scoped per volume.
  • Encoding: Canonical representation is CBE64: one's-complement, big-endian.

Because 0 is never a valid LSN, it’s available as a sentinel value.
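
For example (a sketch assuming an LSN newtype over u64, not Graft’s confirmed definition), Rust can exploit the reserved zero so that an optional LSN costs no extra space:

use std::num::NonZeroU64;

// Zero is reserved, so NonZeroU64 models the valid LSN domain exactly.
type LSN = NonZeroU64;

fn main() {
    // The niche left by the forbidden zero lets Option<LSN> stay 8 bytes,
    // with None encoded as the 0 sentinel.
    assert_eq!(std::mem::size_of::<Option<LSN>>(), 8);
}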

CBE64 stands for one's-Complement Big-Endian.

  • Binary form — 8-byte array. Bytewise comparison yields descending numeric order. Used for space-efficient storage, such as in embedded key-value stores like Fjall.
  • Hex form — 16-character, zero-padded, uppercase hexadecimal string. Lexicographically sorts in the same order as the binary form. Used where human readability is preferred, such as object store keys.

The CBE64 encoding allows both local key-value stores and object stores to perform forward iteration over keys to process log records in descending LSN order, without additional index structures.
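
A minimal sketch of the encoding (the function names are illustrative, not Graft’s API):

fn cbe64_encode(lsn: u64) -> [u8; 8] {
    // One's-complement flips every bit, so larger LSNs produce
    // lexicographically smaller byte strings; big-endian keeps bytewise
    // comparison aligned with numeric comparison.
    (!lsn).to_be_bytes()
}

fn cbe64_hex(lsn: u64) -> String {
    // 16-char, zero-padded, uppercase hex sorts the same as the binary form
    format!("{:016X}", !lsn)
}

fn main() {
    assert!(cbe64_encode(2) < cbe64_encode(1)); // higher LSN sorts first
    assert_eq!(cbe64_hex(1), "FFFFFFFFFFFFFFFE");
}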

To verify data integrity, we attach a blake3 hash to each Commit.

def commit_hash(vid, lsn, page_count, pages):
    hasher = blake3::new()
    # unique 4 byte magic number identifying commits
    hasher.write(COMMIT_MAGIC)
    hasher.write(vid)
    hasher.write(lsn)
    hasher.write(page_count)
    # pages must be in order by pageidx
    for (pageidx, page) in pages:
        hasher.write(pageidx)
        hasher.write(page)
    return hasher.hash()

Note that the Commit’s snapshot fields are passed in. This ensures the hash’s uniqueness incorporates the Volume ID, LSN, and page count.
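
In Rust this might look like the following sketch using the blake3 crate. The COMMIT_MAGIC value and the exact field serialization are assumptions:

use blake3::Hasher;

const COMMIT_MAGIC: [u8; 4] = [0x67, 0x72, 0x66, 0x74]; // hypothetical value

fn commit_hash(vid: &[u8; 16], lsn: u64, page_count: u32, pages: &[(u32, Vec<u8>)]) -> blake3::Hash {
    let mut hasher = Hasher::new();
    // domain-separate commit hashes from other blake3 uses
    hasher.update(&COMMIT_MAGIC);
    hasher.update(vid);
    hasher.update(&lsn.to_be_bytes());
    hasher.update(&page_count.to_be_bytes());
    // pages must be sorted by pageidx
    for (pageidx, page) in pages {
        hasher.update(&pageidx.to_be_bytes());
        hasher.update(page);
    }
    hasher.finalize()
}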

Rather than forcing users to reference Volumes by ID, Graft exposes Volume Handles. Volume Handles are pointers to a local and remote Volume. Volume Handles only exist on a single client, and are not shared between clients or pushed to a remote.

Each Volume Handle has an id given to it at creation time. The id must match the regex ^[-_a-zA-Z0-9]{0,128}$ (alphanumeric plus underscore and dash, max length 128 chars) and be unique on the client.
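
A sketch of that validation without a regex engine (the function name is illustrative):

fn is_valid_handle_id(id: &str) -> bool {
    // mirrors ^[-_a-zA-Z0-9]{0,128}$; all accepted chars are ASCII, so
    // byte length equals char count
    id.len() <= 128
        && id.chars().all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
}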

A Volume Handle always has a local-only Volume associated with it. This Volume is used for all local reads and writes.

A Volume Handle may be linked to a remote Volume. In this case, the local and remote Volumes will be kept in sync by the sync subsystem.

Pages are stored in Segments, which provide seekable compression over ranges of pages. Internally Segments are sequences of compressed Frames. All of the pages in a Segment are stored in order by PageIdx.

Segments currently compress each Frame using zstd with the trailing checksum enabled. Graft may add support for other compression methods in the future.

To read a page from a Segment, the client must first retrieve the relevant SegmentRef from the Commit. Then the client may use the frame index to search for the correct frame. This allows Clients to download only the relevant byte offsets from a Segment.
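
For illustration, a sketch of locating the compressed frame containing a pageidx using the frame index entries (field names follow the Bilrost schema below; the function itself is hypothetical):

struct SegmentFrameIdx {
    frame_size: u64,
    last_pageidx: u32,
}

/// Returns the byte range of the frame holding `pageidx`, or None if the
/// page lies beyond the segment.
fn frame_byte_range(frames: &[SegmentFrameIdx], pageidx: u32) -> Option<std::ops::Range<u64>> {
    let mut offset = 0u64;
    for frame in frames {
        let end = offset + frame.frame_size;
        // frames are sorted by pageidx, so the first frame whose
        // last_pageidx covers the target is the one to fetch
        if pageidx <= frame.last_pageidx {
            return Some(offset..end);
        }
        offset = end;
    }
    None
}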

Graft will store all of a Volume’s data in Object Storage using the following keyspace layout:

{prefix}/
  {vid}/
    control: Control
    forks/
      {fork-vid}: Fork
    checkpoints: CheckpointSet
    log/
      {lsn}: Commit
    segments/
      {sid}: Segment

This flexible layout allows users to isolate tenants from one another by simply providing a unique {prefix}. This can be helpful, for example, when using AWS IAM to scope access keys to particular S3 prefixes.

All of the files aside from Segments (Segment Encoding) are encoded using Bilrost and prefixed with a GraftFileHeader:

const GRAFT_FILE_MAGIC: [u8; 4] = [0xB7, 0x77, 0xDD, 0x13];

enum MessageType {
    VolumeControlV1 = 1,
    VolumeForkV1 = 2,
    CheckpointSetV1 = 3,
    CommitV1 = 4,
}

struct GraftFileHeader {
    magic: [u8; 4],
    message: MessageType,
    length: u32,
}
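
For illustration, a sketch of decoding this header from a byte buffer. It assumes the fields are laid out in order with a 1-byte message type and a little-endian length; the RFC does not pin down the exact wire layout:

const GRAFT_FILE_MAGIC: [u8; 4] = [0xB7, 0x77, 0xDD, 0x13];

/// Returns the raw message type and the message length, or None if the
/// buffer is too short or the magic doesn't match.
fn parse_header(buf: &[u8]) -> Option<(u8, u32)> {
    // reject files that don't start with the Graft magic number
    if buf.get(0..4)? != &GRAFT_FILE_MAGIC[..] {
        return None;
    }
    let message_type = *buf.get(4)?;
    let length = u32::from_le_bytes(buf.get(5..9)?.try_into().ok()?);
    Some((message_type, length))
}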

Bilrost messages:

#[derive(Message)]
pub struct VolumeRef {
    #[bilrost(1)]
    vid: VolumeId,
    #[bilrost(2)]
    lsn: LSN,
}

/// Object store path: {prefix}/{vid}/control
#[derive(Message)]
pub struct VolumeControl {
    #[bilrost(1)]
    vid: VolumeId,
    #[bilrost(2)]
    parent: Option<VolumeRef>,
    #[bilrost(3)]
    created_at: SystemTime,
}

/// Object store path: {prefix}/{parent-vid}/forks/{fork-vid}
#[derive(Message)]
pub struct VolumeFork {
    #[bilrost(1)]
    vid: VolumeId,
    #[bilrost(2)]
    parent: VolumeRef,
}

/// Object store path: {prefix}/{vid}/checkpoints
#[derive(Message)]
pub struct CheckpointSet {
    #[bilrost(1)]
    lsns: SmallVec<[LSN; 2]>, // sorted in ascending order
}

/// Object store path: {prefix}/{vid}/log/{lsn}
#[derive(Message)]
pub struct Commit {
    #[bilrost(1)]
    vid: VolumeId,
    #[bilrost(2)]
    lsn: LSN,
    #[bilrost(3)]
    page_count: PageCount,
    #[bilrost(4)]
    commit_hash: Option<CommitHash>,
    #[bilrost(5)]
    segment_idx: Option<SegmentIdx>,
    #[bilrost(6)]
    checkpointed_at: Option<SystemTime>,
}

#[derive(Message)]
pub struct SegmentIdx {
    #[bilrost(1)]
    sid: SegmentId,
    #[bilrost(2)]
    graft: Graft,
    #[bilrost(3)]
    frames: SmallVec<[SegmentFrameIdx; 2]>,
}

#[derive(Message)]
pub struct SegmentFrameIdx {
    #[bilrost(1)]
    frame_size: usize,
    #[bilrost(2)]
    last_pageidx: PageIdx,
}

This section documents how clients store data.

Local storage uses Fjall, a partitioned k/v store. In the following keyspace, the top level keys are independent partitions. The remainder of the keys and the values are encoded using types in the following section.

handles / {handle_id} -> VolumeHandle
volumes / {vid} -> VolumeMeta
log / {vid} / {lsn} -> Commit
pages / {sid} / {pageidx} -> Page

Keys stored in the local keyspace are encoded using zerocopy types and a custom FjallRepr trait. lsn values are stored using CBE64, which ensures they naturally sort in descending order. This allows us to use a forward iterator to quickly find the most recent LSN, which is much more efficient in most k/v stores (including Fjall).
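
To illustrate the trick with a std BTreeMap standing in for Fjall (the behavior is identical in any ordered key-value store):

use std::collections::BTreeMap;

fn main() {
    let mut log: BTreeMap<[u8; 8], &str> = BTreeMap::new();
    for lsn in 1u64..=3 {
        log.insert((!lsn).to_be_bytes(), "commit");
    }
    // CBE64 keys sort in descending LSN order, so a forward iterator's
    // first entry is always the most recent commit.
    let (key, _) = log.iter().next().unwrap();
    assert_eq!(!u64::from_be_bytes(*key), 3);
}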

The log and pages partition use the following zerocopy types as keys:

/// Key for the `log` partition
struct CommitKey {
    vid: VolumeId,
    lsn: CBE64,
}

/// Key for the `pages` partition
struct PageKey {
    sid: SegmentId,
    pageidx: U32<BigEndian>,
}

Values stored locally are encoded with Bilrost using the following additional schema messages:

#[derive(Message)]
pub struct VolumeHandle {
    /// The Handle ID
    #[bilrost(1)]
    id: HandleId,
    /// Reference to the latest synchronization point for the local Volume.
    #[bilrost(2)]
    local: VolumeRef,
    /// Reference to the latest synchronization point for the remote Volume.
    #[bilrost(3)]
    remote: Option<VolumeRef>,
    /// Presence of the `pending_commit` field means that the Push operation is
    /// in the process of committing to the remote. If no such Push job is
    /// currently running (i.e. it was interrupted), this field must be used to
    /// resume or abort the commit process.
    #[bilrost(4)]
    pending_commit: Option<PendingCommit>,
}

#[derive(Message)]
pub struct PendingCommit {
    /// The resulting remote LSN that the push job is attempting to create.
    #[bilrost(1)]
    remote_lsn: LSN,
    /// The associated commit hash. This is used to determine whether or not the
    /// commit has landed in the remote, in the case that we are interrupted
    /// while attempting to push.
    #[bilrost(2)]
    commit_hash: CommitHash,
}

#[derive(Message)]
pub struct VolumeMeta {
    /// The Volume's ID
    #[bilrost(1)]
    vid: VolumeId,
    /// The parent reference if this Volume is a fork.
    #[bilrost(2)]
    parent: Option<VolumeRef>,
    /// The etag from the last time we pulled the `CheckpointSet`, used to only
    /// pull changed `CheckpointSets`
    #[bilrost(3)]
    etag: Option<Bytes>,
    /// The set of checkpoint LSNs.
    #[bilrost(4)]
    checkpoints: CheckpointSet,
}

This section details the various key algorithms powering Graft’s new direct storage architecture.

Similar to the existing Graft architecture, we will start with one coarse storage_lock which must be held when we are doing any read-modify-update transaction on storage.

Lock rules:

  • never hold the lock while performing I/O other than local storage operations
  • hold the lock for the smallest time possible

When to hold the lock:

  • Modifying the Volume Handle
  • Committing to a Volume

When not to hold the lock:

  • When the operation is idempotent, for example when we are writing to the pages partition or updating a Volume’s checkpoint set
  • When performing read operations; use a Fjall instant instead
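
As a sketch of this discipline (illustrative types, not the Kernel’s actual API), a read-modify-update commit might look like:

use std::sync::Mutex;

struct Storage {
    storage_lock: Mutex<()>,
}

impl Storage {
    /// Apply a prepared batch only if the volume's latest LSN still matches
    /// what the caller observed. All expensive work (building segments,
    /// network I/O) must happen before calling this.
    fn commit_if_latest(&self, expected_lsn: u64, apply: impl FnOnce()) -> Result<(), &'static str> {
        let _guard = self.storage_lock.lock().unwrap();
        // re-validate under the lock, then apply; the guard drops at the end
        // of this scope, keeping the critical section minimal
        if self.latest_lsn() != expected_lsn {
            return Err("concurrent write");
        }
        apply();
        Ok(())
    }

    fn latest_lsn(&self) -> u64 {
        // stand-in for a forward scan of the log partition
        0
    }
}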

Reading from a Volume requires creating a VolumeReader from a VolumeHandle at either the latest or a specific snapshot.

def visibility_path(snapshot):
    cursor = VolumeRef { vid: snapshot.vid, lsn: snapshot.lsn }
    path = []
    while cursor:
        if checkpoints = read(f"volumes/{cursor.vid}/checkpoints"):
            if checkpoint = checkpoints.for(cursor.lsn):
                # found checkpoint, we can terminate the path here
                path.push((cursor.vid, (cursor.lsn)..=(checkpoint.lsn)))
                return path
        # no checkpoint, so scan to the beginning
        path.push((cursor.vid, (cursor.lsn)..=1))
        # and iterate to the parent
        cursor = read(f"volumes/{cursor.vid}/control").parent
    return path

class Snapshot:
    vref: VolumeRef
    path: SearchPath

class VolumeReader:
    def new(snapshot):
        self.snapshot = snapshot

    def read_page(self, pageidx):
        if not self.snapshot.page_count.contains(pageidx):
            return None
        for key, commit in iter_commits(self.snapshot.path):
            { snapshot, segment_ref } = commit
            if not (
                # handle truncate+extend
                snapshot.page_count.contains(pageidx)
                and segment_ref.graft.contains(pageidx)):
                continue
            page = read(f"pages/{key.sid}/{pageidx}")
            if page:
                return page
            return remote_read_page(snapshot.vid, key.sid, segment_ref, pageidx)
        return None

def iter_commits(path):
    result = []
    for (vid, scan) in path:
        top = f"log/{vid}/{scan.start}"
        bottom = f"log/{vid}/{scan.end}"
        result = chain(result, iter(top..=bottom))
    return result
def remote_read_page(vid, sid, segment_ref, pageidx):
    # first we need to determine which frame in the segment contains the
    # relevant page
    bytes = (0, 0)
    pages = (0, 0)
    for frame in segment_ref.frames:
        bytes = (bytes.end, bytes.end + frame.frame_size)
        pages = (pages.end + 1, frame.last_pageidx)
        if pageidx <= frame.last_pageidx:
            break
    # fetch the frame from object storage, loading the pages into
    # `pages/{sid}/{pageidx}`
    frame = object_store.fetch(f"{PREFIX}/{vid}/segments/{sid}", bytes)
    frame = zstd.decompress(frame)
    for (pi, page) in zip(
        segment_ref.graft.iter_range(pages), frame.split(PAGESIZE)
    ):
        write(f"pages/{sid}/{pi}", page)
    return read(f"pages/{sid}/{pageidx}")
class VolumeWriter:
    def new(snapshot):
        self.reader = VolumeReader::new(snapshot)
        self.page_count = snapshot.page_count
        self.sid = SegmentId::random()
        self.graft = Graft::new()

    def read(self, pageidx):
        if self.graft.contains(pageidx):
            return read(f"pages/{self.sid}/{pageidx}")
        else:
            return self.reader.read_page(pageidx)

    def write(self, pageidx, page):
        self.graft.insert(pageidx)
        self.page_count = max(self.page_count, pageidx.pages())
        write(f"pages/{self.sid}/{pageidx}", page)

    def truncate(self, page_count):
        self.page_count = page_count
        delete_range(f"pages/{self.sid}/{page_count}"..)

    # also triggered on drop
    def rollback(self):
        delete_prefix(f"pages/{self.sid}")

    def commit(self):
        snapshot = self.reader.snapshot
        vid = snapshot.vid
        commit_lsn = snapshot.lsn.next()
        with storage_lock:
            # verify we are the latest snapshot
            latest_snapshot = first(f"log/{vid}").snapshot
            if snapshot != latest_snapshot:
                raise "concurrent write"
            write(f"log/{vid}/{commit_lsn}", Commit {
                snapshot = Snapshot {
                    vid,
                    lsn = commit_lsn,
                    page_count = self.page_count
                },
                # no hash for commits to a local volume
                segment = SegmentRef {
                    sid: self.sid,
                    graft: self.graft,
                    # no frame info for commits to a local volume
                }
            })
def push_volume(handle_id):
    handle = read(f"handles/{handle_id}")
    (local_lsn, commit) = prepare_commit(handle)
    match remote_commit(commit):
        Ok() => push_success(handle, commit, local_lsn)
        Err() => push_failure(handle, commit)

def prepare_commit(handle):
    # trigger recovery
    if handle.pending_commit:
        raise InterruptedPush
    { local, remote } = handle
    remote_snapshot = first(f"log/{remote.vid}")
    local_snapshot = first(f"log/{local.vid}")
    # we can only push if the remote has not changed since the last time we
    # synced
    if remote_snapshot.lsn != remote.lsn:
        # in normal operation this situation can only occur if the local
        # volume has diverged from the remote, i.e. a remote commit happened
        # concurrently with a local commit, preventing fast forward
        raise Diverged
    top_lsn = local_snapshot.lsn
    bottom_lsn = local.lsn
    sync_range = top_lsn..bottom_lsn
    if sync_range.is_empty():
        raise NothingToCommit
    # build and push segments from commits
    (segment_idx, commit_hash) = build_and_push_segments(local.vid, sync_range, remote.vid)
    commit_lsn = remote_snapshot.lsn.next()
    # write out the pending commit
    with storage_lock:
        # abort the push if the handle changed since we started the push
        if handle != read(f"handles/{handle.id}"):
            raise Retry
        # abort the push if the remote snapshot changed since we started
        if remote_snapshot != first(f"log/{remote.vid}"):
            raise Retry
        handle.pending_commit = PendingCommit {
            remote_lsn = commit_lsn,
            commit_hash
        }
        write(f"handles/{handle.id}", handle)
    return (local_snapshot.lsn, Commit {
        vid: remote.vid,
        lsn: commit_lsn,
        page_count: local_snapshot.page_count,
        commit_hash,
        segment_idx
    })

def build_and_push_segments(local_vid, lsn_range, remote_vid):
    # merge segments from all local commits in the lsn_range into one
    # segment which is uploaded to the remote
    #
    # if we expect to be querying the remote volume anytime soon, we can
    # optionally write out segments to our local page store
    return (segment_idx, commit_hash)

def remote_commit(commit):
    path = f"{PREFIX}/{commit.vid}/log/{commit.lsn}"
    object_store.write_if_not_exists(path, commit)

def push_success(handle, commit, local_lsn):
    {vid, lsn} = commit
    batch = storage.batch()
    batch.write(f"log/{vid}/{lsn}", commit)
    new_handle = VolumeHandle {
        pending_commit = None,
        local = VolumeRef {
            vid: handle.local.vid,
            lsn: local_lsn
        },
        remote = VolumeRef {
            vid: handle.remote.vid,
            lsn
        },
        ...handle
    }
    batch.write(f"handles/{handle.id}", new_handle)
    with storage_lock:
        # fail if handle has changed
        assert(handle == read(f"handles/{handle.id}"))
        # fail if the remote lsn already exists
        assert(not read(f"log/{vid}/{lsn}"))
        batch.commit()

def push_failure(handle, commit):
    # push failed, clear pending commit
    with storage_lock:
        # panic if handle has changed
        assert(handle == read(f"handles/{handle.id}"))
        handle.pending_commit = None
        write(f"handles/{handle.id}", handle)

The Pull and Fetch Volume operations support incrementally pulling or fully fetching a Volume, respectively.

def fetch_visibility_path(vid, lsn):
    cursor = VolumeRef { vid, lsn }
    path = []
    while cursor:
        { vid, lsn } = cursor
        # load the control if it doesn't exist
        if not read(f"volumes/{vid}/control"):
            control = object_store.fetch(f"{PREFIX}/{vid}/control")
            write(f"volumes/{vid}/control", control)
        # update checkpoints if they don't exist or have changed
        prev_checkpoints = read(f"volumes/{vid}/checkpoints")
        checkpoints = object_store.fetch(f"{PREFIX}/{vid}/checkpoints", prev_checkpoints.etag)
        if checkpoints:
            update_checkpoints(vid, prev_checkpoints, checkpoints)
        if checkpoints = read(f"volumes/{vid}/checkpoints"):
            if checkpoint_lsn = checkpoints.for(lsn):
                # found checkpoint, we can terminate the path here
                path.push((vid, (lsn)..=(checkpoint_lsn)))
                return path
        # no checkpoint, so scan to the beginning
        path.push((vid, (lsn)..=1))
        # and iterate to the parent
        cursor = read(f"volumes/{vid}/control").parent
    return path

def update_checkpoints(vid, old_checkpoints, new_checkpoints):
    new_lsns = new_checkpoints.lsns - old_checkpoints.lsns
    fetch_commits(vid, new_lsns, replace=True)
    write(f"volumes/{vid}/checkpoints", new_checkpoints.into())

def fetch_volume(vid, max_lsn=LSN::MAX):
    # retrieve the latest snapshot <= max_lsn
    snapshot = first(f"log/{vid}/{max_lsn}"..)
    # refresh the visibility path
    path = fetch_visibility_path(vid, snapshot.lsn)
    # fetch all commits in path
    for (vid, scan) in path:
        fetch_commits(vid, scan)

def pull_volume(vid):
    snapshot = first(f"log/{vid}")
    # refresh the visibility path, to update any checkpoints
    fetch_visibility_path(vid, snapshot.lsn)
    fetch_commits(vid, snapshot.lsn..)

# lsns may be a range of lsns (possibly unbounded) or a set of lsns.
# in the unbounded range case, fetch_all will stop fetching once it
# discovers the end of the range.
# if replace is True, this function will refetch all commits in range.
def fetch_commits(vid, lsns, replace=False):
    lsns = lsns if replace else remove_fetched_lsns(lsns)
    for commit in fetch_all(f"{PREFIX}/{vid}/log/{lsns}"):
        lsn = commit.lsn
        with storage_lock:
            # we take the lock here to ensure that we serialize with other
            # read-modify-update operations on the log. in theory we don't
            # need to worry about conflicting with other operations since
            # pulling commits from a remote is idempotent
            write(f"log/{vid}/{lsn}", commit)

def remove_fetched_lsns(lsns):
    # return a new lsn set that only contains unfetched lsns
    return lsns

In this new architecture, pulling the remote volume and syncing it into the local volume are two separate steps.

def sync_remote_to_local(handle_id):
    handle = read(f"handles/{handle_id}")
    # we can safely sync the latest remote snapshot into the local volume
    # only if the local volume has no outstanding local changes
    local_snapshot = first(f"log/{handle.local.vid}")
    if local_snapshot.lsn != handle.local.lsn:
        raise OutstandingLocalChanges
    # check to see if we have any changes to sync
    remote_snapshot = first(f"log/{handle.remote.vid}")
    lsn_range = remote_snapshot.lsn..handle.remote.lsn
    if lsn_range.is_empty():
        raise NothingToSync
    # sync lsn range from remote to local; this requires copying each commit
    # and mapping the LSNs to the local volume's LSN space
    commits = prepare_commits(local_snapshot, handle.remote.vid, lsn_range)
    batch = storage.batch()
    for commit in commits:
        {vid, lsn} = commit.snapshot
        batch.write(f"log/{vid}/{lsn}", commit)
    with storage_lock:
        if local_snapshot != first(f"log/{handle.local.vid}"):
            raise Retry
        batch.commit()

def prepare_commits(local_snapshot, remote_vid, lsn_range):
    remote_commits = iter_range(f"log/{remote_vid}/{lsn_range}")
    # build up an array of local commits by mapping each remote commit to
    # the next local lsn
    return commits
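
To make the LSN mapping concrete, here is a sketch (illustrative, not the actual implementation) of assigning local LSNs to a run of remote commits:

fn map_lsns(next_local_lsn: u64, remote_lsns: &[u64]) -> Vec<(u64, u64)> {
    // each remote commit, in ascending order, is assigned the next
    // contiguous LSN in the local volume's gap-free log
    remote_lsns
        .iter()
        .copied()
        .zip(next_local_lsn..)
        .collect()
}

fn main() {
    // remote commits 7..=9 land as local commits 4..=6
    assert_eq!(map_lsns(4, &[7, 8, 9]), vec![(7, 4), (8, 5), (9, 6)]);
}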

It’s possible to reset the local volume for a Volume Handle by creating a new Local Volume and potentially copying over unsynced local commits. GC will be able to clean up the orphaned local volume once all snapshots have closed.

We should automatically reset local volumes when their log exceeds some configurable length and they’ve synced with the remote recently.

We can also use a reset to throw away local changes.

To perform GC locally, we will need to keep track of all open Snapshots and Handles. For now GC will focus on eliminating inaccessible data:

  • unreferenced Volumes
  • unreferenced Segments
    • taking care not to eliminate in-progress writes

An interesting aspect of GC is whether or not it will delete portions of a Volume’s commit log which are no longer accessible. For remote volumes, since we can always redownload a log, we just need to ensure that only unreferenced portions of the log are removed, i.e. portions not directly visible in the visibility_path computation from any snapshot or handle. For local volumes, we will simply wait until they are reset and then clean up the orphaned local volume id.

The checkpointing process involves picking a VID/LSN based on some checkpoint heuristics, rewriting the commit to reference a new segment that contains all non-empty pages of the entire Volume, and then recording the checkpoint LSN in the Volume’s checkpoint set.

The Checkpoint algorithm is made crash safe by first scanning for checkpoints that have been written but are not yet present in the CheckpointSet.

Remote GC must take care to not truncate a Volume’s history which is referenced by a Fork. So, when GC decides based on heuristics to checkpoint a volume, it will first check the Volume’s forks to determine if there are any references that would be truncated. For each such reference, GC must first checkpoint the fork.

GC will use the checkpointed_at timestamp associated with commit checkpoints to ensure that relevant checkpoints have lived “long enough” to shadow truncated data, i.e. long enough that hopefully all clients have picked them up.

This RFC depends on the following new Splinter features:

  • iter_range(keys: Range<u32>) -> impl Iterator<Item=u32>
    • returns an iterator over keys contained by the range which are present in the Splinter.

All new methods should be implemented on both Splinter and SplinterRef.

It’s a huge amount of work. Architecturally, though, I think this is much better than the current state of Graft, and it opens up more opportunities to work with new customers. It also makes Graft more standalone and simpler to run, which will make new users happier.

I think the biggest drawback is that there won’t be a MetaStore to perform client fast forwarding. This will make fetches after long periods of being offline a bit less optimal. However, I think it’s fine to solve this with an optional service (graft-proxy) rather than a required one. Graft Proxy can provide its own virtual overlay of commits and segments, allowing all the same optimizations the MetaStore and PageStore are able to perform today.