
0001 - Direct Storage Architecture

Build a new Graft client library (called graft-kernel) which interfaces directly with object storage, eliminating the need for the MetaStore and PageStore and setting Graft up as a viable replacement for systems like Litestream. Graft should focus on providing best-in-class PITR, branching, and sparse replication for page-based workloads.

An overview of why we should consider making this major change to Graft’s architecture.

Currently Graft requires a MetaStore and PageStore to support replication to and from object storage. This architecture has the following advantages and disadvantages:

Advantages

  • The MetaStore can efficiently roll up commits to fast-forward clients, increasing performance and enabling instant read replicas.
  • The PageStore acts as a smart cache, allowing clients to pull only the pages they need at the edge.
  • The PageStore is able to colocate writes to multiple Volumes in the same Segment, which can reduce the cost and overhead of small transactions.

Disadvantages

  • There is little isolation between data in different Volumes. Graft will need to roll out a comprehensive encryption + authorization layer to work for production workloads. This is a huge cost in terms of testing and engineering.
  • Users must run two services to take full advantage of Graft, which makes Graft much harder to use.

In a discussion with Simon Willison and Alex Garcia, we talked about some of their dream features for SQLite + Graft:

Rollback database to earlier version: The ability to cheaply roll back a database would make risky features, like giving an LLM read/write access to your database, much safer. Additionally, the ability to branch a database at a particular version may enable risk-free experimentation and testing.

Read-only replication: Cheap and fast read-only replication to horizontally scale a heavy query workload over multiple machines, or simply to expose data to less-trusted users.

Composability with IAM permissions: Currently, Datasette uses IAM keys limited to a single S3 prefix to restrict Litestream’s access to a single tenant’s data. This ensures that a bug in Litestream can affect at most a single tenant.

This feature implies that data does not cross “tenant” boundaries (or in this case, the configured S3 prefix).

In a discussion with a potential user, they expressed reservations due to the layer of indirection between the Graft client and object storage. Their main argument is that S3 is already proven to handle extremely high scale. They would be more comfortable using Graft if clients connected directly to object storage to pull changes. In some cases, this may also reduce costs due to free bandwidth between compute and S3.

Graft should work out of the box without requiring additional services to run. By supporting direct access to object storage, it will be easier to get started and embed Graft in an application.

A high-level explanation of how this feature works and how it would change the behavior of existing Graft clients such as graft-sqlite and graft-fuse.

graft-kernel implements the Graft Kernel which supersedes the functionality of the graft-client and graft-server crates. The Kernel provides access to a remote Volume Catalog, local storage, and client functionality to downstream crates like graft-sqlite and graft-fuse.

Client functionality: Clients such as graft-sqlite and graft-fuse will use the graft-kernel to operate on Volumes. The Kernel is designed to be embedded in the application, performing I/O in a small set of background threads. The Kernel will be implemented as an async core exposed through both async and sync APIs.

The Graft Proxy is an optional stateless edge service and caching layer which makes it easier for Graft to replicate to and from devices.

Graft Proxy exposes a simple API to consumers, enabling two key performance features:

  1. Graft Proxy caches reads from object storage.
  2. Virtual overlay of commits and segments
  • Rather than passing requests straight through to object storage, the proxy can coalesce multiple commits into one larger commit, letting clients efficiently fast-forward and download only the interesting portions of segments.

Eventually Graft Proxy will enhance Graft with these features:

  • Volume subscriptions, eliminating the need to poll for changes
  • Granular authorization
  • Direct byte-range puts and gets against Volumes, enabling “dumb clients”

A detailed technical breakdown covering APIs, algorithms, data structures, formats, edge cases, and performance implications.

  • Volume Handle: A reference to a local-remote Volume pair. Also responsible for tracking synchronization between a local and remote Volume. Documented more in the Volume Handle section.
  • vid: A 16 byte Volume ID using GID encoding.
  • sid: A 16 byte Segment ID using GID encoding.
  • Snapshot: A frozen point-in-time view of a Volume.
  • lsn: Documented in the Volume log section.
  • pageidx: A 4 byte page index, representing the index of a page within a Volume. Valid range: [1, 2^32).
  • Splinter: A compressed bitset, used to keep track of which pageidxs appear in a Segment.
  • Segment: A compressed sequence of pages, sorted by pageidx. Documented in the Segment Encoding section.
  • VolumeRef: A (vid, lsn) tuple, representing a fixed point in a Volume’s history.
  • CommitHash: Documented in the Commit hash section.

A volume’s durable state consists of a Checkpoint and a Log.

  • Checkpoint — a point-in-time mapping from each non-empty pageidx to the version (by LSN) that was current when the checkpoint was taken.
  • Log — an append-only sequence of log records. Each record contains the set of pages modified since the previous LSN.
An LSN has the following properties:

  • Domain: Unsigned 64-bit integer in the range [1, 2^64). Zero is invalid.
  • Ordering: Strictly increasing, gap-free, and scoped per volume.
  • Encoding: The canonical representation is CBE64 (one’s-complement, big-endian).

Because 0 is never a valid LSN, it’s available as a sentinel value.

CBE64 stands for One’s-Complement Big-Endian (64-bit). It has two canonical forms:

  • Binary form — 8-byte array. Bytewise comparison yields descending numeric order. Used for space-efficient storage, such as in embedded key-value stores like Fjall.
  • Hex form — 16-character, zero-padded, uppercase hexadecimal string. Lexicographically sorts in the same order as the binary form. Used where human readability is preferred, such as object store keys.

The CBE64 encoding allows both local key-value stores and object stores to perform forward iteration over keys to process log records in descending LSN order, without additional index structures.
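For illustration, a CBE64 helper can be sketched in a few lines of Python (names here are hypothetical, not part of the Graft API):

def cbe64_encode(lsn: int) -> bytes:
    # one's complement of the LSN, stored big-endian; larger LSNs sort earlier
    assert 1 <= lsn < 2**64
    return (lsn ^ 0xFFFF_FFFF_FFFF_FFFF).to_bytes(8, "big")

def cbe64_decode(buf: bytes) -> int:
    return int.from_bytes(buf, "big") ^ 0xFFFF_FFFF_FFFF_FFFF

def cbe64_hex(lsn: int) -> str:
    # 16-character, zero-padded, uppercase hex form; sorts the same as the binary form
    return cbe64_encode(lsn).hex().upper()

# bytewise (and lexicographic) order is descending LSN order
assert sorted(map(cbe64_encode, [1, 7, 42])) == [cbe64_encode(lsn) for lsn in (42, 7, 1)]

A forward scan over keys encoded this way therefore visits the newest LSN first.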

To verify data integrity, we attach a blake3 hash to each Commit.

def commit_hash(snapshot, pages):
    hasher = blake3::new()
    # unique 4 byte magic number identifying commits
    hasher.write(COMMIT_MAGIC)
    hasher.write(snapshot.vid)
    hasher.write(snapshot.lsn)
    hasher.write(snapshot.page_count)
    # pages must be in order by pageidx
    for page in pages:
        hasher.write(page)
    return hasher.hash()

Note that the Commit’s snapshot is passed in. This ensures the Hash’s uniqueness incorporates the Volume ID, LSN, and page count.

Rather than forcing users to reference Volumes by ID, Graft exposes Volume Handles. A Volume Handle is a pointer to a local Volume and, optionally, a remote Volume. Volume Handles only exist on a single client, and are not shared between clients or pushed to a remote.

Each Volume Handle has a name given to it at creation time. The name must match the regex ^[a-zA-Z_][-_a-zA-Z0-9]{0,127}$ (starts with a letter or underscore, followed by letters, digits, dashes, or underscores; max length 128 characters) and be unique on the client.

A Volume Handle always has a local-only Volume associated with it. This Volume is used for all local reads and writes.

A Volume Handle may be linked to a remote Volume. In this case, the local and remote Volumes will be kept in sync by the sync subsystem.

Pages are stored in Segments, which provide seekable compression over ranges of pages. Internally Segments are sequences of compressed Frames. All of the pages in a Segment are stored in order by PageIdx.

Segments currently compress each Frame using zstd with the trailing checksum enabled. Graft may add support for other compression methods in the future.

To read a page from a Segment, the client must first retrieve the relevant SegmentRef from the Commit. Then the client may use the frame index to search for the correct frame. This allows Clients to download only the relevant byte offsets from a Segment.
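As a sketch of that lookup (a hypothetical helper; it assumes the per-frame metadata described in the SegmentRef message below), the byte range of the frame containing a pageidx falls out of a running sum over frame sizes:

def frame_byte_range(frames, pageidx):
    # frames: ordered (frame_size, last_pageidx) entries from the Segment's frame index
    offset = 0
    for frame_size, last_pageidx in frames:
        if pageidx <= last_pageidx:
            # this frame holds the page; fetch bytes [offset, offset + frame_size)
            return (offset, offset + frame_size)
        offset += frame_size
    return None  # pageidx is not covered by this Segment

The returned range can then be issued as a ranged GET against the Segment object.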

Graft will store all of a Volume’s data in Object Storage using the following keyspace layout:

{prefix} /
  {vid} /
    control: Control
    forks /
      {fork-vid}: Fork
    checkpoints: CheckpointSet
    log /
      {lsn}: Commit
    segments /
      {sid}: Segment

This flexible layout allows users to isolate tenants from one another by simply providing a unique {prefix}. This can be helpful, for example, when using AWS IAM to scope access keys to particular S3 prefixes.
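A small helper illustrating how keys could be built under this layout (hypothetical names; it reuses the cbe64_hex sketch from the LSN section for the {lsn} component):

def commit_key(prefix: str, vid: str, lsn: int) -> str:
    # commits live at {prefix}/{vid}/log/{lsn}, with the LSN in its hex form
    return f"{prefix}/{vid}/log/{cbe64_hex(lsn)}"

def segment_key(prefix: str, vid: str, sid: str) -> str:
    return f"{prefix}/{vid}/segments/{sid}"

# tenant isolation falls out of the prefix:
# commit_key("tenants/acme", vid, 42) never collides with commit_key("tenants/globex", vid, 42)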

All of the files aside from Segments are encoded using Protobuf. Serialized Protobuf is wrapped with a zerocopy ProtobufEnvelope:

enum ProtobufMessage {
  GraftRemoteV1Control = 1,
  GraftRemoteV1Fork = 2,
  GraftRemoteV1CheckpointSet = 3,
  GraftRemoteV1Commit = 4,
}

struct ProtobufEnvelope {
  magic: [u8; 4],
  _padding: [u8; 3],
  message: ProtobufMessage,
  data: [u8]
}

Segments are compressed sets of pages. Their encoding is documented in the Segment Encoding section, and their metadata is stored in the SegmentRef message.

syntax = "proto3";

package graft.remote.v1;

import "google/protobuf/timestamp.proto";

// A Volume has a top level control file stored at
// `{prefix}/{vid}/control`
// Control files are immutable.
message Control {
  // The Volume ID stored as a 16 byte GID.
  bytes vid = 1;
  // The parent reference if this Volume is a fork.
  optional VolumeRef parent = 2;
  // The creation timestamp of this Volume.
  google.protobuf.Timestamp created_at = 3;
}

// When a Volume is forked, a ref is first written to the parent Volume:
// `{prefix}/{parent-vid}/forks/{fork-vid}`
// Forks are immutable.
message Fork {
  // The VID of the fork.
  bytes fork_vid = 1;
  // The fork point. Must match the parent field in the Fork's Control file.
  VolumeRef parent = 2;
}

// A reference to a Volume at a particular LSN.
message VolumeRef {
  // The Volume ID stored as a 16 byte GID.
  bytes vid = 1;
  // The referenced LSN.
  uint64 lsn = 2;
}

// A Volume's CheckpointSet is stored at `{prefix}/{vid}/checkpoints`.
// CheckpointSets are updated by the checkpointer via compare-and-swap.
message CheckpointSet {
  // The Volume ID stored as a 16 byte GID.
  bytes vid = 1;
  // The list of checkpoint LSNs.
  repeated uint64 lsns = 2;
}

// A snapshot of a Volume.
message Snapshot {
  // The Volume ID stored as a 16 byte GID.
  bytes vid = 1;
  // The LSN of the Volume at this Snapshot.
  uint64 lsn = 2;
  // The Volume's page count at this Snapshot.
  uint32 page_count = 3;
}

// Commits are stored at `{prefix}/{vid}/log/{lsn}`.
// A commit may not include a SegmentRef if only the Volume's page count has
// changed. This happens when the Volume is extended or truncated without
// additional writes.
// Commits are immutable.
message Commit {
  // The Volume Snapshot at this Commit.
  Snapshot snapshot = 1;
  // An optional 256 bit CommitHash of this Commit.
  // Always present on Remote Volume commits.
  // May be omitted on Local commits.
  optional bytes hash = 2;
  // If this Commit contains any pages, `segment_ref` records details on the
  // relevant Segment.
  optional SegmentRef segment_ref = 3;
  // If this commit is a checkpoint, this timestamp is set and records the time
  // the commit was made a checkpoint.
  optional google.protobuf.Timestamp checkpoint_ts = 4;
}

message SegmentRef {
  // The 16 byte Segment ID.
  bytes sid = 1;
  // The set of pageidxs stored in this Segment.
  // Serialized using Splinter encoding.
  bytes splinter = 2;
  // An index of frames contained by the Segment.
  // Empty on Local Segments which have not been encoded and uploaded to object
  // storage.
  repeated SegmentFrame frames = 3;
}

message SegmentFrame {
  // The length of the compressed frame in bytes.
  uint32 frame_size = 1;
  // The last pageidx stored in the frame.
  uint32 last_pageidx = 2;
}

This section documents how clients store data.

Local storage uses Fjall, a partitioned k/v store. In the following keyspace, the top level keys are independent partitions. The remainder of the keys and the values are encoded using types in the following section.

handles / {name} -> VolumeHandle
volumes / {vid} /
  // The Volume's Control, includes its parent reference
  control -> Control
  // the latest LocalCheckpointSet for this Volume
  checkpoints -> LocalCheckpointSet
log / {vid} / {lsn} -> Commit
pages / {sid} / {pageidx} -> Page

Keys which are stored in the local keyspace are encoded using zerocopy types. lsn values are stored using CBE64, which ensures they naturally sort in descending order. This allows us to use a forward iterator to quickly find the most recent LSN, which is much more efficient in most k/v stores (including Fjall).

The handles partition is unique in that it is keyed directly by the VolumeHandle’s name rather than one of the following zerocopy types.

enum VolumeProperty {
  Control = 1,
  Checkpoints = 2,
}

/// Key for the `volumes` partition
struct VolumeKey {
  vid: VolumeId,
  property: VolumeProperty,
}

/// Key for the `log` partition
struct CommitKey {
  vid: VolumeId,
  lsn: CBE64,
}

/// Key for the `pages` partition
struct PageKey {
  sid: SegmentId,
  pageidx: PageIdx,
}

Values stored locally are encoded using protobuf, using a combination of the remote storage schema and the following additional message types. No envelope is needed for local values as the upgrade process can migrate local data.

syntax = "proto3";

package graft.local.v1;

import "graft/remote/v1/index.proto";

message LocalCheckpointSet {
  // The etag from the last time we pulled the CheckpointSet, used to only pull
  // changed CheckpointSets.
  bytes etag = 1;
  // The list of checkpoint LSNs.
  repeated uint64 lsns = 2;
}

message VolumeHandle {
  // The name of the Volume Handle.
  string name = 1;
  // References to the local and remote Volumes, along with LSNs representing
  // their latest successful synchronization.
  graft.remote.v1.VolumeRef local = 2;
  optional graft.remote.v1.VolumeRef remote = 3;
  // Presence of the pending_commit field means that the Push operation is in
  // the process of committing to the remote. If no such Push job is currently
  // running (i.e. it was interrupted), this field must be used to resume or
  // abort the commit process.
  optional PendingCommit pending_commit = 4;
}

message PendingCommit {
  // The resulting remote LSN that the push job is attempting to create.
  uint64 remote_lsn = 1;
  // The associated 256 bit blake3 commit hash. This is used to determine
  // whether or not the commit has landed in the remote, in the case that we are
  // interrupted while attempting to push.
  bytes commit_hash = 2;
}
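A sketch of the recovery decision implied by pending_commit (helper names like object_store_get are hypothetical; the real logic belongs to the push job):

def recover_pending_commit(handle):
    # decide whether an interrupted push actually landed on the remote
    pending = handle.pending_commit
    if pending is None:
        return "nothing-to-recover"
    remote_commit = object_store_get(
        f"{PREFIX}/{handle.remote.vid}/log/{pending.remote_lsn}")
    if remote_commit is not None and remote_commit.hash == pending.commit_hash:
        # our commit was written before the interruption: finish the push bookkeeping
        return "resume"
    # either nothing was written, or another writer claimed that LSN: roll back
    return "abort"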

This section details the various key algorithms powering Graft’s new direct storage architecture.

Similar to the existing Graft architecture, we will start with one coarse storage_lock which must be held when we are doing any read-modify-update transaction on storage.

Lock rules:

  • never hold the lock while performing IO other than storage operations
  • hold the lock for the smallest time possible

When to hold the lock:

  • Modifying the Volume Handle
  • Committing to a Volume

When not to hold the lock:

  • When the operation is idempotent, for example when we are writing to the pages partition or updating a Volume’s checkpoint set
  • When performing read operations; use a Fjall instant instead

Reading from a Volume requires creating a VolumeReader from a VolumeHandle at either the latest or a specific snapshot.

def visibility_path(snapshot):
    cursor = VolumeRef { vid: snapshot.vid, lsn: snapshot.lsn }
    path = []
    while cursor:
        if checkpoints = read(f"volumes/{cursor.vid}/checkpoints"):
            if checkpoint = checkpoints.for(cursor.lsn):
                # found checkpoint, we can terminate the path here
                path.push((cursor.vid, (cursor.lsn)..=(checkpoint.lsn)))
                return path
        # no checkpoint, so scan to the beginning
        path.push((cursor.vid, (cursor.lsn)..=1))
        # and iterate to the parent
        cursor = read(f"volumes/{cursor.vid}/control").parent
    return path

class VolumeReader:
    def new(snapshot):
        self.snapshot = snapshot
        self.path = visibility_path(snapshot)

    def read_page(self, pageidx):
        if not self.snapshot.page_count.contains(pageidx):
            return None
        for key, commit in iter_commits(self.path):
            { snapshot, segment_ref } = commit
            if not (
                # handle truncate+extend
                snapshot.page_count.contains(pageidx)
                and segment_ref.splinter.contains(pageidx)):
                continue
            page = read(f"pages/{segment_ref.sid}/{pageidx}")
            if page:
                return page
            return remote_read_page(snapshot.vid, segment_ref.sid, segment_ref, pageidx)
        return None

def iter_commits(path):
    result = []
    for (vid, scan) in path:
        top = f"log/{vid}/{scan.start}"
        bottom = f"log/{vid}/{scan.end}"
        result = chain(result, iter(top..=bottom))
    return result
def remote_read_page(vid, sid, segment_ref, pageidx):
    # first we need to determine which frame in the segment contains the
    # relevant page
    byte_range = (0, 0)
    page_range = (0, 0)
    for frame in segment_ref.frames:
        byte_range = (byte_range.end, byte_range.end + frame.frame_size)
        page_range = (page_range.end + 1, frame.last_pageidx)
        if pageidx <= frame.last_pageidx:
            # found the frame containing pageidx
            break
    # fetch the frame from object storage, loading the pages into `pages/{sid}/{pageidx}`
    frame = object_store.fetch(
        f"{PREFIX}/{vid}/segments/{sid}", byte_range)
    frame = zstd.decompress(frame)
    for (pi, page) in zip(
        segment_ref.splinter.iter_range(page_range), frame.split(PAGESIZE)
    ):
        write(f"pages/{sid}/{pi}", page)
    return read(f"pages/{sid}/{pageidx}")
class VolumeWriter:
    def new(snapshot):
        self.reader = VolumeReader::new(snapshot)
        self.page_count = snapshot.page_count
        self.sid = SegmentId::random()
        self.splinter = Splinter::new()

    def read(pageidx):
        if self.splinter.contains(pageidx):
            return read(f"pages/{self.sid}/{pageidx}")
        else:
            return self.reader.read_page(pageidx)

    def write(pageidx, page):
        self.splinter.insert(pageidx)
        self.page_count = max(self.page_count, pageidx.pages())
        write(f"pages/{self.sid}/{pageidx}", page)

    def truncate(page_count):
        self.page_count = page_count
        # remove any staged pages beyond the new page count
        delete_range(f"pages/{self.sid}/{page_count.next()}"..)

    # also triggered on drop
    def rollback():
        delete_prefix(f"pages/{self.sid}")

    def commit():
        snapshot = self.reader.snapshot
        vid = snapshot.vid
        commit_lsn = snapshot.lsn.next()
        with storage_lock:
            # verify we are the latest snapshot
            latest_snapshot = first(f"log/{vid}").snapshot
            if snapshot != latest_snapshot:
                raise "concurrent write"
            write(f"log/{vid}/{commit_lsn}", Commit {
                snapshot = Snapshot {
                    vid,
                    lsn = commit_lsn,
                    page_count = self.page_count
                },
                # no hash for commits to a local volume
                segment_ref = SegmentRef {
                    sid: self.sid,
                    splinter: self.splinter,
                    # no frame info for commits to a local volume
                }
            })
def push_volume(handle_name):
    handle = read(f"handles/{handle_name}")
    (local_lsn, commit) = prepare_commit(handle)
    match remote_commit(commit):
        Ok() => push_success(handle, commit, local_lsn)
        Err() => push_failure(handle, commit)

def prepare_commit(handle):
    # trigger recovery
    if handle.pending_commit:
        raise InterruptedPush
    { local, remote } = handle
    remote_snapshot = first(f"log/{remote.vid}").snapshot
    local_snapshot = first(f"log/{local.vid}").snapshot
    # we can only push if the remote has not changed since the last time we synced
    if remote_snapshot.lsn != remote.lsn:
        # this situation can only occur in normal operation if the local volume
        # has diverged from the remote, i.e. a remote commit happened
        # concurrently with a local commit, preventing fast forward
        raise Diverged
    top_lsn = local_snapshot.lsn
    bottom_lsn = local.lsn
    sync_range = top_lsn..bottom_lsn
    if sync_range.is_empty():
        raise NothingToCommit
    # build and push segments from commits
    (segment_ref, commit_hash) = build_and_push_segments(local.vid, sync_range, remote.vid)
    commit_lsn = remote_snapshot.lsn.next()
    # write out the pending commit
    with storage_lock:
        # abort the push if the handle changed since we started the push process
        if handle != read(f"handles/{handle.name}"):
            raise Retry
        # abort the push if the remote snapshot changed since we started the push process
        if remote_snapshot != first(f"log/{remote.vid}").snapshot:
            raise Retry
        handle.pending_commit = PendingCommit {
            remote_lsn = commit_lsn,
            commit_hash
        }
        write(f"handles/{handle.name}", handle)
    return (local_snapshot.lsn, Commit {
        snapshot: Snapshot {
            vid: remote.vid,
            lsn: commit_lsn,
            page_count: local_snapshot.page_count
        },
        hash: commit_hash,
        segment_ref
    })

def build_and_push_segments(local_vid, lsn_range, remote_vid):
    # merge segments from all local commits in the lsn_range into one segment
    # which is uploaded to the remote
    #
    # IF we expect to be querying the remote volume anytime soon, we can
    # optionally write out segments to our local page store.
    return (segment_ref, commit_hash)

def remote_commit(commit):
    path = f"{PREFIX}/{commit.snapshot.vid}/log/{commit.snapshot.lsn}"
    object_store.write_if_not_exists(path, commit)

def push_success(handle, commit, local_lsn):
    {vid, lsn} = commit.snapshot
    batch = storage.batch()
    batch.write(f"log/{vid}/{lsn}", commit)
    new_handle = VolumeHandle {
        pending_commit = None,
        local = VolumeRef {
            vid: handle.local.vid,
            lsn: local_lsn
        },
        remote = VolumeRef {
            vid: handle.remote.vid,
            lsn
        },
        ...handle
    }
    batch.write(f"handles/{handle.name}", new_handle)
    with storage_lock:
        # fail if handle has changed
        assert(handle == read(f"handles/{handle.name}"))
        # fail if the remote lsn already exists
        assert(not read(f"log/{vid}/{lsn}"))
        batch.commit()

def push_failure(handle, commit):
    # push failed, clear pending commit
    with storage_lock:
        # panic if handle has changed
        assert(handle == read(f"handles/{handle.name}"))
        handle.pending_commit = None
        write(f"handles/{handle.name}", handle)

The Pull and Fetch Volume operations support incrementally pulling or fully fetching a Volume respectively.

def fetch_visibility_path(vid, lsn):
    cursor = VolumeRef { vid, lsn }
    path = []
    while cursor:
        { vid, lsn } = cursor
        # load the control if it doesn't exist
        if not read(f"volumes/{vid}/control"):
            control = object_store.fetch(f"{PREFIX}/{vid}/control")
            write(f"volumes/{vid}/control", control)
        # update checkpoints if it doesn't exist or has changed
        prev_checkpoints = read(f"volumes/{vid}/checkpoints")
        checkpoints = object_store.fetch(
            f"{PREFIX}/{vid}/checkpoints",
            prev_checkpoints.etag if prev_checkpoints else None)
        if checkpoints:
            update_checkpoints(vid, prev_checkpoints, checkpoints)
        if checkpoints = read(f"volumes/{vid}/checkpoints"):
            if checkpoint_lsn = checkpoints.for(lsn):
                # found checkpoint, we can terminate the path here
                path.push((vid, (lsn)..=(checkpoint_lsn)))
                return path
        # no checkpoint, so scan to the beginning
        path.push((vid, (lsn)..=1))
        # and iterate to the parent
        cursor = read(f"volumes/{vid}/control").parent
    return path

def update_checkpoints(vid, old_checkpoints, new_checkpoints):
    new_lsns = new_checkpoints.lsns - old_checkpoints.lsns
    fetch_commits(vid, new_lsns, replace=True)
    write(f"volumes/{vid}/checkpoints", new_checkpoints.into())

def fetch_volume(vid, max_lsn=LSN::MAX):
    # retrieve the latest snapshot <= max_lsn
    snapshot = first(f"log/{vid}/{max_lsn}"..).snapshot
    # refresh the visibility path
    path = fetch_visibility_path(vid, snapshot.lsn)
    # fetch all commits in path
    for (vid, scan) in path:
        fetch_commits(vid, scan)

def pull_volume(vid):
    snapshot = first(f"log/{vid}").snapshot
    # refresh the visibility path, to update any checkpoints
    fetch_visibility_path(vid, snapshot.lsn)
    fetch_commits(vid, snapshot.lsn..)

# lsns may be a range of lsns (possibly unbounded) or a set of lsns.
# in the unbounded range case, fetch_all will stop fetching once it discovers
# the end of the range.
# if replace is True, this function will refetch all commits in range.
def fetch_commits(vid, lsns, replace=False):
    lsns = lsns if replace else remove_fetched_lsns(lsns)
    for commit in fetch_all(f"{PREFIX}/{vid}/log/{lsns}"):
        lsn = commit.snapshot.lsn
        with storage_lock:
            # we take the lock here to ensure that we serialize with other
            # read-modify-update operations on the log. in theory we don't need
            # to worry about conflicting with other operations since pulling
            # commits from a remote is idempotent
            write(f"log/{vid}/{lsn}", commit)

def remove_fetched_lsns(lsns):
    # return a new lsn set that only contains unfetched lsns
    return lsns

In this new architecture we split up pulling the remote volume and syncing it into the local volume.

def sync_remote_to_local(handle_name):
    handle = read(f"handles/{handle_name}")
    # we can safely sync the latest remote snapshot into the local volume if
    # the local volume has no outstanding local changes.
    local_snapshot = first(f"log/{handle.local.vid}").snapshot
    if local_snapshot.lsn != handle.local.lsn:
        raise OutstandingLocalChanges
    # check to see if we have any changes to sync
    remote_snapshot = first(f"log/{handle.remote.vid}").snapshot
    lsn_range = remote_snapshot.lsn..handle.remote.lsn
    if lsn_range.is_empty():
        raise NothingToSync
    # sync lsn range from remote to local, this requires copying each commit
    # and mapping the LSNs to the local volume's LSN space
    commits = prepare_commits(local_snapshot, handle.remote.vid, lsn_range)
    batch = storage.batch()
    for commit in commits:
        {vid, lsn} = commit.snapshot
        batch.write(f"log/{vid}/{lsn}", commit)
    with storage_lock:
        if local_snapshot != first(f"log/{handle.local.vid}").snapshot:
            raise Retry
        batch.commit()

def prepare_commits(local_snapshot, remote_vid, lsn_range):
    remote_commits = iter_range(f"log/{remote_vid}/{lsn_range}")
    # build up an array of local commits by mapping each remote commit to the
    # next local lsn
    return commits
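For example, a read replica refreshing from the remote might run (illustrative only; the handle name is made up):

handle = read("handles/app-db")

# 1. pull new remote commits into local storage (idempotent)
pull_volume(handle.remote.vid)

# 2. fold the pulled commits into the local volume so readers see them
try:
    sync_remote_to_local("app-db")
except OutstandingLocalChanges:
    # local writes exist; push (or reset) them before syncing down
    pass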

It’s possible to reset the local volume for a Volume Handle by creating a new Local Volume and potentially copying over unsynced local commits. GC will be able to clean up the orphaned local volume once all snapshots have closed.

We should automatically reset local volumes when their log grows past a configurable threshold and they have synced with the remote recently.

We can also use a reset to throw away local changes.
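A sketch of what such a reset could look like (helper names like copy_unsynced_commits and latest_lsn are hypothetical):

def reset_local_volume(handle_name, keep_unsynced=True):
    handle = read(f"handles/{handle_name}")
    new_vid = VolumeId::random()
    with storage_lock:
        if keep_unsynced:
            # copy local commits newer than the last synced LSN into the new
            # volume, remapping them onto its fresh LSN space
            copy_unsynced_commits(handle.local.vid, handle.local.lsn, new_vid)
        handle.local = VolumeRef { vid: new_vid, lsn: latest_lsn(new_vid) }
        write(f"handles/{handle_name}", handle)
    # the old local volume is now orphaned; GC reclaims it once all snapshots close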

To perform GC locally, we will need to keep track of all open Snapshots and Handles. For now GC will focus on eliminating inaccessible data:

  • unreferenced Volumes
  • unreferenced Segments
    • taking care not to eliminate in-progress writes

An interesting aspect of GC is whether or not it will delete portions of a Volume’s commit log which are no longer accessible. For remote volumes, since we can always redownload the log, we just need to ensure that only unreferenced portions of the log are removed - i.e. portions not directly visible in the visibility_path computation from any snapshot or handle. For local volumes, we will simply wait until they are Reset and then clean up the orphaned local volume ID.

The checkpointing process involves picking a VID/LSN based on some checkpoint heuristics and rewriting that commit to reference a new segment containing all non-empty pages of the entire Volume. The checkpoint LSN is then recorded in the Volume’s CheckpointSet.

The checkpoint algorithm is made crash-safe by first scanning for checkpoints that have been written but are not yet present in the CheckpointSet.
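A compact sketch of the checkpointer (helper names are illustrative; the CheckpointSet update uses the compare-and-swap described above):

def checkpoint_volume(vid, lsn):
    # 1. build a segment containing every non-empty page of the volume as of `lsn`
    segment_ref = build_full_segment(vid, lsn)
    # 2. rewrite the commit at `lsn` to reference the new segment and mark it as a checkpoint
    commit = object_store_get(f"{PREFIX}/{vid}/log/{lsn}")
    commit.segment_ref = segment_ref
    commit.checkpoint_ts = now()
    object_store_put(f"{PREFIX}/{vid}/log/{lsn}", commit)
    # 3. record the checkpoint LSN in the CheckpointSet via compare-and-swap;
    #    on restart, a scan for checkpointed commits missing from the set retries this step
    checkpoints, etag = object_store_get_with_etag(f"{PREFIX}/{vid}/checkpoints")
    checkpoints.lsns.append(lsn)
    object_store_put_if_match(f"{PREFIX}/{vid}/checkpoints", checkpoints, etag)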

Remote GC must take care not to truncate a Volume’s history that is referenced by a Fork. So, when GC decides based on heuristics to checkpoint a volume, it will first check the Volume’s forks to determine whether there are any references that would be truncated. For each such reference, GC must first checkpoint the fork.

GC will use the checkpoint_ts associated with commit checkpoints to ensure that relevant checkpoints have lived “long enough” to shadow truncated data. I.e. long enough that hopefully all the clients have picked them up.

This RFC depends on the following new Splinter features:

  • iter_range(keys: Range<u32>) -> impl Iterator<Item=u32>
    • returns an iterator over keys contained by the range which are present in the Splinter.

All new methods should be implemented on both Splinter and SplinterRef.
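The intended semantics of iter_range, sketched against a plain Python set standing in for the compressed bitset:

def iter_range(splinter: set[int], keys: range):
    # yield the keys in `keys` that are present in the splinter, in ascending order
    for key in keys:
        if key in splinter:
            yield key

assert list(iter_range({1, 4, 9}, range(2, 10))) == [4, 9]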

It is a huge amount of work. Architecturally, though, I think this is much better than the current state of Graft, and it opens up more opportunities to work with new customers. It also makes Graft more standalone and simpler to run, which will make new users happier.

I think the biggest drawback is that there won’t be a metastore to perform client fast forwarding. This will make fetches after long periods of being offline a bit less optimal. However, I think it’s ok to solve this with a optional service (graft-proxy) rather than requiring it. Graft proxy can provide its own virtual overlay of commits and segments - allowing all the same optimizations the metastore and pagestore is able to do now.