
0001 - Direct Storage Architecture

Build a new Graft client library (called graft-kernel) which interfaces directly with object storage, eliminating the need for the MetaStore and PageStore and setting Graft up as a viable replacement for systems like Litestream. Graft should focus on providing best-in-class PITR, branching, and sparse replication for page-based workloads.

An overview of why we should consider making this major change to Graft’s architecture.

Currently Graft requires a MetaStore and PageStore to support replication to and from object storage. This architecture has the following advantages and disadvantages:

Advantages

  • The MetaStore can efficiently roll up commits to fast-forward clients, increasing performance and enabling instant read replicas.
  • The PageStore acts as a smart cache, allowing clients to pull only the pages they need at the edge.
  • The PageStore is able to colocate writes to multiple Volumes in the same Segment, which can reduce the cost and overhead of small transactions.

Disadvantages

  • There is little isolation between data in different Volumes. Graft will need to roll out a comprehensive encryption + authorization layer to work for production workloads. This is a huge cost in terms of testing and engineering.
  • Users must run two services to take full advantage of Graft, which makes Graft much harder to use.

In a discussion with Simon Willison and Alex Garcia, we talked about some of their dream features for SQLite + Graft:

Rollback database to earlier version: The ability to cheaply roll back a database would make risky features, like giving an LLM read/write access to your database, much safer. Additionally, the ability to branch a database at a particular version may enable risk-free experimentation and testing.

Read-only replication: Cheap and fast read-only replication to horizontally scale a heavy query workload over multiple machines, or simply to expose data to less-trusted users.

Composability with IAM permissions: Currently, Datasette uses IAM keys limited to a single S3 prefix to restrict Litestream’s access to a single tenant’s data. This ensures that a bug in Litestream can affect at most a single tenant.

This feature implies that data does not cross “tenant” boundaries (or in this case, the configured S3 prefix).

In a discussion with a potential user, they expressed reservations due to the layer of indirection between the Graft client and object storage. Their main argument is that S3 is already proven to handle extremely high scale. They would be more comfortable using Graft if clients connected directly to object storage to pull changes. In some cases, this may also reduce costs due to free bandwidth between compute and S3.

Graft should work out of the box without requiring additional services to run. By supporting direct access to object storage, it will be easier to get started and embed Graft in an application.

A high-level explanation of how this feature works and how it would change the behavior of existing Graft clients such as graft-sqlite and graft-fuse.

graft-kernel implements the Graft Kernel which supersedes the functionality of the graft-client and graft-server crates. The Kernel provides access to a remote Volume Catalog, local storage, and client functionality to downstream crates like graft-sqlite and graft-fuse.

Client functionality: Clients such as graft-sqlite and graft-fuse will use the graft-kernel to operate on Volumes. The Kernel is designed to be embedded in the application, performing I/O in a small set of background threads. The Kernel will be implemented as an async core exposed through both async and sync APIs.

The Graft Proxy is an optional stateless edge service and caching layer which makes it easier for Graft to replicate to and from devices.

Graft Proxy exposes a simple API to consumers, enabling two key performance features:

  1. Graft Proxy caches reads from object storage.
  2. Virtual overlay of commits and segments
  • Rather than passing requests straight through to object storage, the proxy can coalesce multiple commits into one larger commit, letting clients efficiently fast-forward and download only the interesting portions of segments.

Eventually Graft Proxy will enhance Graft with these features:

  • Volume subscriptions, eliminating the need to poll for changes
  • Granular authorization
  • Direct byte-range puts and gets against Volumes, enabling “dumb clients”

A detailed technical breakdown covering APIs, algorithms, data structures, formats, edge cases, and performance implications.

  • Volume Handle: A reference to a local-remote Volume pair. Also responsible for tracking synchronization between a local and remote Volume. Documented more in the Volume Handle section.
  • vid: A 16 byte Volume ID using GID encoding.
  • sid: A 16 byte Segment ID using GID encoding.
  • Snapshot: A frozen point-in-time view of a Volume.
  • lsn: Documented in the Volume log section.
  • pageidx: A 4 byte page index, representing the index of a page within a Volume. Valid range: [1, 2^32).
  • Splinter: A compressed bitset, used to keep track of which pageidxs appear in a Segment.
  • Segment: A compressed sequence of pages, sorted by pageidx. Documented in the Segment Encoding section.
  • VolumeRef: A (vid, lsn) tuple, representing a fixed point in a Volume’s history.
  • CommitHash: Documented in the Commit hash section.

A volume’s durable state consists of a Checkpoint and a Log.

  • Checkpoint — a point-in-time mapping from each non-empty pageidx to the version (by LSN) that was current when the checkpoint was taken.
  • Log — an append-only sequence of log records. Each record contains the set of pages modified since the previous LSN.
An LSN has the following properties:

  • Domain: Unsigned 64-bit integer in the range [1, 2^64). Zero is invalid.
  • Ordering: Strictly increasing, gap-free, and scoped per volume.
  • Encoding: The canonical representation is CBE64 (one’s-complement, big-endian).

Because 0 is never a valid LSN, it’s available as a sentinel value.

CBE64 stands for One’s-Complement Big-Endian (64-bit). It has two canonical forms:

  • Binary form — 8-byte array. Bytewise comparison yields descending numeric order. Used for space-efficient storage, such as in embedded key-value stores like Fjall.
  • Hex form — 16-character, zero-padded, uppercase hexadecimal string. Lexicographically sorts in the same order as the binary form. Used where human readability is preferred, such as object store keys.

The CBE64 encoding allows both local key-value stores and object stores to perform forward iteration over keys to process log records in descending LSN order, without additional index structures.
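For illustration, a CBE64 helper can be sketched in a few lines of Python (names here are hypothetical, not part of the Graft API):

def cbe64_encode(lsn: int) -> bytes:
    # one's complement of the LSN, stored big-endian; larger LSNs sort earlier
    assert 1 <= lsn < 2**64
    return (lsn ^ 0xFFFF_FFFF_FFFF_FFFF).to_bytes(8, "big")

def cbe64_decode(buf: bytes) -> int:
    return int.from_bytes(buf, "big") ^ 0xFFFF_FFFF_FFFF_FFFF

def cbe64_hex(lsn: int) -> str:
    # 16-character, zero-padded, uppercase hex form; sorts the same as the binary form
    return cbe64_encode(lsn).hex().upper()

# bytewise (and lexicographic) order is descending LSN order
assert sorted(map(cbe64_encode, [1, 7, 42])) == [cbe64_encode(lsn) for lsn in (42, 7, 1)]

A forward scan over keys encoded this way therefore visits the newest LSN first.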

To verify data integrity, we attach a blake3 hash to each Commit.

def commit_hash(snapshot, pages):
    hasher = blake3::new()
    # unique 4 byte magic number identifying commits
    hasher.write(COMMIT_MAGIC)
    hasher.write(snapshot.vid)
    hasher.write(snapshot.lsn)
    hasher.write(snapshot.page_count)
    # pages must be in order by pageidx
    for page in pages:
        hasher.write(page)
    return hasher.hash()

Note that the Commit’s snapshot is passed in. This ensures the Hash’s uniqueness incorporates the Volume ID, LSN, and page count.

Rather than forcing users to reference Volumes by ID, Graft exposes Volume Handles. A Volume Handle is a pointer to a local Volume and, optionally, a remote Volume. Volume Handles only exist on a single client, and are not shared between clients or pushed to a remote.

Each Volume Handle has a name given to it at creation time. The name must match the regex ^[a-zA-Z_][-_a-zA-Z0-9]{0,127}$ (starts with a letter or underscore, followed by letters, digits, dashes, or underscores; max length 128 characters) and be unique on the client.

A Volume Handle always has a local-only Volume associated with it. This Volume is used for all local reads and writes.

A Volume Handle may be linked to a remote Volume. In this case, the local and remote Volumes will be kept in sync by the sync subsystem.

Pages are stored in Segments, which provide seekable compression over ranges of pages. Internally Segments are sequences of compressed Frames. All of the pages in a Segment are stored in order by PageIdx.

Segments currently compress each Frame using zstd with the trailing checksum enabled. Graft may add support for other compression methods in the future.

To read a page from a Segment, the client must first retrieve the relevant SegmentRef from the Commit. Then the client may use the frame index to search for the correct frame. This allows Clients to download only the relevant byte offsets from a Segment.
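As a sketch of that lookup (a hypothetical helper; it assumes the per-frame metadata described in the SegmentRef message below), the byte range of the frame containing a pageidx falls out of a running sum over frame sizes:

def frame_byte_range(frames, pageidx):
    # frames: ordered (frame_size, last_pageidx) entries from the Segment's frame index
    offset = 0
    for frame_size, last_pageidx in frames:
        if pageidx <= last_pageidx:
            # this frame holds the page; fetch bytes [offset, offset + frame_size)
            return (offset, offset + frame_size)
        offset += frame_size
    return None  # pageidx is not covered by this Segment

The returned range can then be issued as a ranged GET against the Segment object.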

Graft will store all of a Volume’s data in Object Storage using the following keyspace layout:

{prefix} /
  {vid} /
    control: Control
    forks /
      {fork-vid}: Fork
    checkpoints: CheckpointSet
    log /
      {lsn}: Commit
    segments /
      {sid}: Segment

This flexible layout allows users to isolate tenants from one another by simply providing a unique {prefix}. This can be helpful, for example, when using AWS IAM to scope access keys to particular S3 prefixes.
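A small helper illustrating how keys could be built under this layout (hypothetical names; it reuses the cbe64_hex sketch from the LSN section for the {lsn} component):

def commit_key(prefix: str, vid: str, lsn: int) -> str:
    # commits live at {prefix}/{vid}/log/{lsn}, with the LSN in its hex form
    return f"{prefix}/{vid}/log/{cbe64_hex(lsn)}"

def segment_key(prefix: str, vid: str, sid: str) -> str:
    return f"{prefix}/{vid}/segments/{sid}"

# tenant isolation falls out of the prefix:
# commit_key("tenants/acme", vid, 42) never collides with commit_key("tenants/globex", vid, 42)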

All of the files aside from Segments are encoded using Protobuf. Serialized Protobuf is wrapped with a zerocopy ProtobufEnvelope:

enum ProtobufMessage {
  GraftRemoteV1Control = 1,
  GraftRemoteV1Fork = 2,
  GraftRemoteV1CheckpointSet = 3,
  GraftRemoteV1Commit = 4,
}

struct ProtobufEnvelope {
  magic: [u8; 4],
  _padding: [u8; 3],
  message: ProtobufMessage,
  data: [u8]
}

Segments are compressed sets of pages. Their encoding is documented in the Segment Encoding section, and their metadata is stored in the SegmentRef message.

syntax = "proto3";

package graft.remote.v1;

import "google/protobuf/timestamp.proto";

// A Volume has a top level control file stored at
// `{prefix}/{vid}/control`
// Control files are immutable.
message Control {
  // The Volume ID stored as a 16 byte GID.
  bytes vid = 1;
  // The parent reference if this Volume is a fork.
  optional VolumeRef parent = 2;
  // The creation timestamp of this Volume.
  google.protobuf.Timestamp created_at = 3;
}

// When a Volume is forked, a ref is first written to the parent Volume:
// `{prefix}/{parent-vid}/forks/{fork-vid}`
// Forks are immutable.
message Fork {
  // The VID of the fork.
  bytes fork_vid = 1;
  // The fork point. Must match the parent field in the Fork's Control file.
  VolumeRef parent = 2;
}

// A reference to a Volume at a particular LSN.
message VolumeRef {
  // The Volume ID stored as a 16 byte GID.
  bytes vid = 1;
  // The referenced LSN.
  uint64 lsn = 2;
}

// A Volume's CheckpointSet is stored at `{prefix}/{vid}/checkpoints`.
// CheckpointSets are updated by the checkpointer via compare-and-swap.
message CheckpointSet {
  // The Volume ID stored as a 16 byte GID.
  bytes vid = 1;
  // The list of checkpoint LSNs.
  repeated uint64 lsns = 2;
}

// A snapshot of a Volume.
message Snapshot {
  // The Volume ID stored as a 16 byte GID.
  bytes vid = 1;
  // The LSN of the Volume at this Snapshot.
  uint64 lsn = 2;
  // The Volume's page count at this Snapshot.
  uint32 page_count = 3;
}

// Commits are stored at `{prefix}/{vid}/log/{lsn}`.
// A commit may not include a SegmentRef if only the Volume's page count has
// changed. This happens when the Volume is extended or truncated without
// additional writes.
// Commits are immutable.
message Commit {
  // The Volume Snapshot at this Commit.
  Snapshot snapshot = 1;
  // An optional 256 bit CommitHash of this Commit.
  // Always present on Remote Volume commits.
  // May be omitted on Local commits.
  optional bytes hash = 2;
  // If this Commit contains any pages, `segment_ref` records details on the
  // relevant Segment.
  optional SegmentRef segment_ref = 3;
  // If this commit is a checkpoint, this timestamp is set and records the time
  // the commit was made a checkpoint.
  optional google.protobuf.Timestamp checkpoint_ts = 4;
}

message SegmentRef {
  // The 16 byte Segment ID.
  bytes sid = 1;
  // The set of pageidxs stored in this Segment.
  // Serialized using Splinter encoding.
  bytes splinter = 2;
  // An index of frames contained by the Segment.
  // Empty on Local Segments which have not been encoded and uploaded to object
  // storage.
  repeated SegmentFrame frames = 3;
}

message SegmentFrame {
  // The length of the compressed frame in bytes.
  uint32 frame_size = 1;
  // The last pageidx stored in the frame.
  uint32 last_pageidx = 2;
}

This section documents how clients store data.

Local storage uses Fjall, a partitioned k/v store. In the following keyspace, the top level keys are independent partitions. The remainder of the keys and the values are encoded using types in the following section.

handles / {name} -> VolumeHandle
volumes / {vid} /
  // The Volume's Control, includes its parent reference
  control -> Control
  // the latest LocalCheckpointSet for this Volume
  checkpoints -> LocalCheckpointSet
log / {vid} / {lsn} -> Commit
pages / {sid} / {pageidx} -> Page

Keys which are stored in the local keyspace are encoded using zerocopy types. lsn values are stored using CBE64, which ensures they naturally sort in descending order. This allows us to use a forward iterator to quickly find the most recent LSN, which is much more efficient in most k/v stores (including Fjall).

The handles partition is unique in that it is keyed directly by the VolumeHandle’s name rather than one of the following zerocopy types.

enum VolumeProperty {
  Control = 1,
  Checkpoints = 2,
}

/// Key for the `volumes` partition
struct VolumeKey {
  vid: VolumeId,
  property: VolumeProperty,
}

/// Key for the `log` partition
struct CommitKey {
  vid: VolumeId,
  lsn: CBE64,
}

/// Key for the `pages` partition
struct PageKey {
  sid: SegmentId,
  pageidx: PageIdx,
}

Values stored locally are encoded using protobuf, using a combination of the remote storage schema and the following additional message types. No envelope is needed for local values as the upgrade process can migrate local data.

syntax = "proto3";

package graft.local.v1;

import "graft/remote/v1/index.proto";

message LocalCheckpointSet {
  // The etag from the last time we pulled the CheckpointSet, used to only pull
  // changed CheckpointSets.
  bytes etag = 1;
  // The list of checkpoint LSNs.
  repeated uint64 lsns = 2;
}

message VolumeHandle {
  // The name of the Volume Handle.
  string name = 1;
  // References to the local and remote Volumes, along with LSNs representing
  // their latest successful synchronization.
  graft.remote.v1.VolumeRef local = 2;
  optional graft.remote.v1.VolumeRef remote = 3;
  // Presence of the pending_commit field means that the Push operation is in
  // the process of committing to the remote. If no such Push job is currently
  // running (i.e. it was interrupted), this field must be used to resume or
  // abort the commit process.
  optional PendingCommit pending_commit = 4;
}

message PendingCommit {
  // The resulting remote LSN that the push job is attempting to create.
  uint64 remote_lsn = 1;
  // The associated 256 bit blake3 commit hash. This is used to determine
  // whether or not the commit has landed in the remote, in the case that we are
  // interrupted while attempting to push.
  bytes commit_hash = 2;
}
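A sketch of the recovery decision implied by pending_commit (helper names like object_store_get are hypothetical; the real logic belongs to the push job):

def recover_pending_commit(handle):
    # decide whether an interrupted push actually landed on the remote
    pending = handle.pending_commit
    if pending is None:
        return "nothing-to-recover"
    remote_commit = object_store_get(
        f"{PREFIX}/{handle.remote.vid}/log/{pending.remote_lsn}")
    if remote_commit is not None and remote_commit.hash == pending.commit_hash:
        # our commit was written before the interruption: finish the push bookkeeping
        return "resume"
    # either nothing was written, or another writer claimed that LSN: roll back
    return "abort"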

This section details the various key algorithms powering Graft’s new direct storage architecture.

Similar to the existing Graft architecture, we will start with one coarse storage_lock which must be held when we are doing any read-modify-update transaction on storage.

Lock rules:

  • never hold the lock while performing IO other than storage operations
  • hold the lock for the smallest time possible

When to hold the lock:

  • Modifying the Volume Handle
  • Committing to a Volume

When not to hold the lock:

  • When the operation is idempotent, for example when we are writing to the pages partition or updating a Volume’s checkpoint set
  • When performing read operations; use a Fjall instant instead

Reading from a Volume requires creating a VolumeReader from a VolumeHandle at either the latest or a specific snapshot.

def visibility_path(snapshot):
    cursor = VolumeRef { vid: snapshot.vid, lsn: snapshot.lsn }
    path = []
    while cursor:
        if checkpoints = read(f"volumes/{cursor.vid}/checkpoints"):
            if checkpoint = checkpoints.for(cursor.lsn):
                # found checkpoint, we can terminate the path here
                path.push((cursor.vid, (cursor.lsn)..=(checkpoint.lsn)))
                return path
        # no checkpoint, so scan to the beginning
        path.push((cursor.vid, (cursor.lsn)..=1))
        # and iterate to the parent
        cursor = read(f"volumes/{cursor.vid}/control").parent
    return path

class VolumeReader:
    def new(snapshot):
        self.snapshot = snapshot
        self.path = visibility_path(snapshot)

    def read_page(self, pageidx):
        if not self.snapshot.page_count.contains(pageidx):
            return None
        for key, commit in iter_commits(self.path):
            { snapshot, segment_ref } = commit
            if not (
                # handle truncate+extend
                snapshot.page_count.contains(pageidx)
                and segment_ref.splinter.contains(pageidx)):
                continue
            page = read(f"pages/{segment_ref.sid}/{pageidx}")
            if page:
                return page
            return remote_read_page(snapshot.vid, segment_ref.sid, segment_ref, pageidx)
        return None

def iter_commits(path):
    result = []
    for (vid, scan) in path:
        top = f"log/{vid}/{scan.start}"
        bottom = f"log/{vid}/{scan.end}"
        result = chain(result, iter(top..=bottom))
    return result
def remote_read_page(vid, sid, segment_ref, pageidx):
    # first we need to determine which frame in the segment contains the
    # relevant page
    byte_range = (0, 0)
    page_range = (0, 0)
    for frame in segment_ref.frames:
        byte_range = (byte_range.end, byte_range.end + frame.frame_size)
        page_range = (page_range.end + 1, frame.last_pageidx)
        if pageidx <= frame.last_pageidx:
            # found the frame containing pageidx
            break
    # fetch the frame from object storage, loading the pages into `pages/{sid}/{pageidx}`
    frame = object_store.fetch(
        f"{PREFIX}/{vid}/segments/{sid}", byte_range)
    frame = zstd.decompress(frame)
    for (pi, page) in zip(
        segment_ref.splinter.iter_range(page_range), frame.split(PAGESIZE)
    ):
        write(f"pages/{sid}/{pi}", page)
    return read(f"pages/{sid}/{pageidx}")
class VolumeWriter:
    def new(snapshot):
        self.reader = VolumeReader::new(snapshot)
        self.page_count = snapshot.page_count
        self.sid = SegmentId::random()
        self.splinter = Splinter::new()

    def read(pageidx):
        if self.splinter.contains(pageidx):
            return read(f"pages/{self.sid}/{pageidx}")
        else:
            return self.reader.read_page(pageidx)

    def write(pageidx, page):
        self.splinter.insert(pageidx)
        self.page_count = max(self.page_count, pageidx.pages())
        write(f"pages/{self.sid}/{pageidx}", page)

    def truncate(page_count):
        self.page_count = page_count
        # remove any staged pages beyond the new page count
        delete_range(f"pages/{self.sid}/{page_count.next()}"..)

    # also triggered on drop
    def rollback():
        delete_prefix(f"pages/{self.sid}")

    def commit():
        snapshot = self.reader.snapshot
        vid = snapshot.vid
        commit_lsn = snapshot.lsn.next()
        with storage_lock:
            # verify we are the latest snapshot
            latest_snapshot = first(f"log/{vid}").snapshot
            if snapshot != latest_snapshot:
                raise "concurrent write"
            write(f"log/{vid}/{commit_lsn}", Commit {
                snapshot = Snapshot {
                    vid,
                    lsn = commit_lsn,
                    page_count = self.page_count
                },
                # no hash for commits to a local volume
                segment_ref = SegmentRef {
                    sid: self.sid,
                    splinter: self.splinter,
                    # no frame info for commits to a local volume
                }
            })
def push_volume(handle_name):
    handle = read(f"handles/{handle_name}")
    (local_lsn, commit) = prepare_commit(handle)
    match remote_commit(commit):
        Ok() => push_success(handle, commit, local_lsn)
        Err() => push_failure(handle, commit)

def prepare_commit(handle):
    # trigger recovery
    if handle.pending_commit:
        raise InterruptedPush
    { local, remote } = handle
    remote_snapshot = first(f"log/{remote.vid}").snapshot
    local_snapshot = first(f"log/{local.vid}").snapshot
    # we can only push if the remote has not changed since the last time we synced
    if remote_snapshot.lsn != remote.lsn:
        # this situation can only occur in normal operation if the local volume
        # has diverged from the remote, i.e. a remote commit happened
        # concurrently with a local commit, preventing fast forward
        raise Diverged
    top_lsn = local_snapshot.lsn
    bottom_lsn = local.lsn
    sync_range = top_lsn..bottom_lsn
    if sync_range.is_empty():
        raise NothingToCommit
    # build and push segments from commits
    (segment_ref, commit_hash) = build_and_push_segments(local.vid, sync_range, remote.vid)
    commit_lsn = remote_snapshot.lsn.next()
    # write out the pending commit
    with storage_lock:
        # abort the push if the handle changed since we started the push process
        if handle != read(f"handles/{handle.name}"):
            raise Retry
        # abort the push if the remote snapshot changed since we started the push process
        if remote_snapshot != first(f"log/{remote.vid}").snapshot:
            raise Retry
        handle.pending_commit = PendingCommit {
            remote_lsn = commit_lsn,
            commit_hash
        }
        write(f"handles/{handle.name}", handle)
    return (local_snapshot.lsn, Commit {
        snapshot: Snapshot {
            vid: remote.vid,
            lsn: commit_lsn,
            page_count: local_snapshot.page_count
        },
        hash: commit_hash,
        segment_ref
    })

def build_and_push_segments(local_vid, lsn_range, remote_vid):
    # merge segments from all local commits in the lsn_range into one segment
    # which is uploaded to the remote
    #
    # IF we expect to be querying the remote volume anytime soon, we can
    # optionally write out segments to our local page store.
    return (segment_ref, commit_hash)

def remote_commit(commit):
    path = f"{PREFIX}/{commit.snapshot.vid}/log/{commit.snapshot.lsn}"
    object_store.write_if_not_exists(path, commit)

def push_success(handle, commit, local_lsn):
    {vid, lsn} = commit.snapshot
    batch = storage.batch()
    batch.write(f"log/{vid}/{lsn}", commit)
    new_handle = VolumeHandle {
        pending_commit = None,
        local = VolumeRef {
            vid: handle.local.vid,
            lsn: local_lsn
        },
        remote = VolumeRef {
            vid: handle.remote.vid,
            lsn
        },
        ...handle
    }
    batch.write(f"handles/{handle.name}", new_handle)
    with storage_lock:
        # fail if handle has changed
        assert(handle == read(f"handles/{handle.name}"))
        # fail if the remote lsn already exists
        assert(not read(f"log/{vid}/{lsn}"))
        batch.commit()

def push_failure(handle, commit):
    # push failed, clear pending commit
    with storage_lock:
        # panic if handle has changed
        assert(handle == read(f"handles/{handle.name}"))
        handle.pending_commit = None
        write(f"handles/{handle.name}", handle)

The Pull and Fetch Volume operations support incrementally pulling or fully fetching a Volume respectively.

def fetch_visibility_path(vid, lsn):
    cursor = VolumeRef { vid, lsn }
    path = []
    while cursor:
        { vid, lsn } = cursor
        # load the control if it doesn't exist
        if not read(f"volumes/{vid}/control"):
            control = object_store.fetch(f"{PREFIX}/{vid}/control")
            write(f"volumes/{vid}/control", control)
        # update checkpoints if it doesn't exist or has changed
        prev_checkpoints = read(f"volumes/{vid}/checkpoints")
        checkpoints = object_store.fetch(
            f"{PREFIX}/{vid}/checkpoints",
            prev_checkpoints.etag if prev_checkpoints else None)
        if checkpoints:
            update_checkpoints(vid, prev_checkpoints, checkpoints)
        if checkpoints = read(f"volumes/{vid}/checkpoints"):
            if checkpoint_lsn = checkpoints.for(lsn):
                # found checkpoint, we can terminate the path here
                path.push((vid, (lsn)..=(checkpoint_lsn)))
                return path
        # no checkpoint, so scan to the beginning
        path.push((vid, (lsn)..=1))
        # and iterate to the parent
        cursor = read(f"volumes/{vid}/control").parent
    return path

def update_checkpoints(vid, old_checkpoints, new_checkpoints):
    new_lsns = new_checkpoints.lsns - old_checkpoints.lsns
    fetch_commits(vid, new_lsns, replace=True)
    write(f"volumes/{vid}/checkpoints", new_checkpoints.into())

def fetch_volume(vid, max_lsn=LSN::MAX):
    # retrieve the latest snapshot <= max_lsn
    snapshot = first(f"log/{vid}/{max_lsn}"..).snapshot
    # refresh the visibility path
    path = fetch_visibility_path(vid, snapshot.lsn)
    # fetch all commits in path
    for (vid, scan) in path:
        fetch_commits(vid, scan)

def pull_volume(vid):
    snapshot = first(f"log/{vid}").snapshot
    # refresh the visibility path, to update any checkpoints
    fetch_visibility_path(vid, snapshot.lsn)
    fetch_commits(vid, snapshot.lsn..)

# lsns may be a range of lsns (possibly unbounded) or a set of lsns.
# in the unbounded range case, fetch_all will stop fetching once it discovers
# the end of the range.
# if replace is True, this function will refetch all commits in range.
def fetch_commits(vid, lsns, replace=False):
    lsns = lsns if replace else remove_fetched_lsns(lsns)
    for commit in fetch_all(f"{PREFIX}/{vid}/log/{lsns}"):
        lsn = commit.snapshot.lsn
        with storage_lock:
            # we take the lock here to ensure that we serialize with other
            # read-modify-update operations on the log. in theory we don't need
            # to worry about conflicting with other operations since pulling
            # commits from a remote is idempotent
            write(f"log/{vid}/{lsn}", commit)

def remove_fetched_lsns(lsns):
    # return a new lsn set that only contains unfetched lsns
    return lsns

In this new architecture we split up pulling the remote volume and syncing it into the local volume.

def sync_remote_to_local(handle_name):
    handle = read(f"handles/{handle_name}")
    # we can safely sync the latest remote snapshot into the local volume if
    # the local volume has no outstanding local changes.
    local_snapshot = first(f"log/{handle.local.vid}").snapshot
    if local_snapshot.lsn != handle.local.lsn:
        raise OutstandingLocalChanges
    # check to see if we have any changes to sync
    remote_snapshot = first(f"log/{handle.remote.vid}").snapshot
    lsn_range = remote_snapshot.lsn..handle.remote.lsn
    if lsn_range.is_empty():
        raise NothingToSync
    # sync lsn range from remote to local, this requires copying each commit
    # and mapping the LSNs to the local volume's LSN space
    commits = prepare_commits(local_snapshot, handle.remote.vid, lsn_range)
    batch = storage.batch()
    for commit in commits:
        {vid, lsn} = commit.snapshot
        batch.write(f"log/{vid}/{lsn}", commit)
    with storage_lock:
        if local_snapshot != first(f"log/{handle.local.vid}").snapshot:
            raise Retry
        batch.commit()

def prepare_commits(local_snapshot, remote_vid, lsn_range):
    remote_commits = iter_range(f"log/{remote_vid}/{lsn_range}")
    # build up an array of local commits by mapping each remote commit to the
    # next local lsn
    return commits
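For example, a read replica refreshing from the remote might run (illustrative only; the handle name is made up):

handle = read("handles/app-db")

# 1. pull new remote commits into local storage (idempotent)
pull_volume(handle.remote.vid)

# 2. fold the pulled commits into the local volume so readers see them
try:
    sync_remote_to_local("app-db")
except OutstandingLocalChanges:
    # local writes exist; push (or reset) them before syncing down
    pass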

It’s possible to reset the local volume for a Volume Handle by creating a new Local Volume and potentially copying over unsynced local commits. GC will be able to clean up the orphaned local volume once all snapshots have closed.

We should automatically reset local volumes when their log grows past a configurable threshold and they have synced with the remote recently.

We can also use a reset to throw away local changes.
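A sketch of what such a reset could look like (helper names like copy_unsynced_commits and latest_lsn are hypothetical):

def reset_local_volume(handle_name, keep_unsynced=True):
    handle = read(f"handles/{handle_name}")
    new_vid = VolumeId::random()
    with storage_lock:
        if keep_unsynced:
            # copy local commits newer than the last synced LSN into the new
            # volume, remapping them onto its fresh LSN space
            copy_unsynced_commits(handle.local.vid, handle.local.lsn, new_vid)
        handle.local = VolumeRef { vid: new_vid, lsn: latest_lsn(new_vid) }
        write(f"handles/{handle_name}", handle)
    # the old local volume is now orphaned; GC reclaims it once all snapshots close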

To perform GC locally, we will need to keep track of all open Snapshots and Handles. For now GC will focus on eliminating inaccessible data:

  • unreferenced Volumes
  • unreferenced Segments
    • taking care not to eliminate in-progress writes

An interesting aspect of GC is whether or not it will delete portions of a Volume’s commit log which are no longer accessible. For remote volumes, since we can always redownload the log, we just need to ensure that only unreferenced portions of the log are removed - i.e. portions not directly visible in the visibility_path computation from any snapshot or handle. For local volumes, we will simply wait until they are Reset and then clean up the orphaned local volume ID.

The checkpointing process involves picking a VID/LSN based on some checkpoint heuristics and rewriting that commit to reference a new segment containing all non-empty pages of the entire Volume. The checkpoint LSN is then recorded in the Volume’s CheckpointSet.

The checkpoint algorithm is made crash-safe by first scanning for checkpoints that have been written but are not yet present in the CheckpointSet.
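A compact sketch of the checkpointer (helper names are illustrative; the CheckpointSet update uses the compare-and-swap described above):

def checkpoint_volume(vid, lsn):
    # 1. build a segment containing every non-empty page of the volume as of `lsn`
    segment_ref = build_full_segment(vid, lsn)
    # 2. rewrite the commit at `lsn` to reference the new segment and mark it as a checkpoint
    commit = object_store_get(f"{PREFIX}/{vid}/log/{lsn}")
    commit.segment_ref = segment_ref
    commit.checkpoint_ts = now()
    object_store_put(f"{PREFIX}/{vid}/log/{lsn}", commit)
    # 3. record the checkpoint LSN in the CheckpointSet via compare-and-swap;
    #    on restart, a scan for checkpointed commits missing from the set retries this step
    checkpoints, etag = object_store_get_with_etag(f"{PREFIX}/{vid}/checkpoints")
    checkpoints.lsns.append(lsn)
    object_store_put_if_match(f"{PREFIX}/{vid}/checkpoints", checkpoints, etag)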

Remote GC must take care not to truncate a Volume’s history that is referenced by a Fork. So, when GC decides based on heuristics to checkpoint a volume, it will first check the Volume’s forks to determine whether there are any references that would be truncated. For each such reference, GC must first checkpoint the fork.

GC will use the checkpoint_ts associated with commit checkpoints to ensure that relevant checkpoints have lived “long enough” to shadow truncated data. I.e. long enough that hopefully all the clients have picked them up.

This RFC depends on the following new Splinter features:

  • iter_range(keys: Range<u32>) -> impl Iterator<Item=u32>
    • returns an iterator over keys contained by the range which are present in the Splinter.

All new methods should be implemented on both Splinter and SplinterRef.
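The intended semantics of iter_range, sketched against a plain Python set standing in for the compressed bitset:

def iter_range(splinter: set[int], keys: range):
    # yield the keys in `keys` that are present in the splinter, in ascending order
    for key in keys:
        if key in splinter:
            yield key

assert list(iter_range({1, 4, 9}, range(2, 10))) == [4, 9]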

It is a huge amount of work. Architecturally, though, I think this is much better than the current state of Graft, and it opens up more opportunities to work with new customers. It also makes Graft more standalone and simpler to run, which will make new users happier.

I think the biggest drawback is that there won’t be a metastore to perform client fast forwarding. This will make fetches after long periods of being offline a bit less optimal. However, I think it’s ok to solve this with a optional service (graft-proxy) rather than requiring it. Graft proxy can provide its own virtual overlay of commits and segments - allowing all the same optimizations the metastore and pagestore is able to do now.