0001 - Direct Storage Architecture
Build a new Graft client library (called `graft-kernel`) which directly interfaces with object storage, eliminating the need for the MetaStore and PageStore, and setting Graft up as a viable replacement for systems like Litestream. Graft should focus on providing best-in-class PITR, branching, and sparse replication for page-based workloads.
Motivation
An overview of why we should consider making this major change to Graft’s architecture.
Existing state of Graft
Currently Graft requires a MetaStore and PageStore to support replication to and from object storage. This architecture has the following advantages and disadvantages:
Advantages
- The MetaStore can efficiently roll up commits to fast-forward clients, increasing performance and enabling instant read replicas.
- The PageStore acts as a smart cache, allowing clients to pull only the pages they need at the edge.
- The PageStore is able to colocate writes to multiple Volumes in the same Segment, which can reduce the cost and overhead of small transactions.
Disadvantages
- There is little isolation between data in different Volumes. Graft will need to roll out a comprehensive encryption + authorization layer to work for production workloads. This is a huge cost in terms of testing and engineering.
- Users must run two services to take full advantage of Graft, which makes Graft much harder to use.
In a discussion with Simon Willison and Alex Garcia, we talked about some of their dream features for SQLite + Graft:
Rollback database to earlier version: The ability to cheaply roll back a database would make risky features, like giving an LLM read/write access to your database, much safer. Additionally, the ability to branch a database at a particular version may enable risk-free experimentation and testing.
Read-only replication: Cheap and fast read-only replication to horizontally scale a heavy query workload over multiple machines, or simply to expose data to less-trusted users.
Composability with IAM permissions: Currently, Datasette uses IAM keys limited to a single S3 prefix to restrict Litestream’s access to a single tenant’s data. This ensures that a bug in Litestream can affect at most a single tenant. This feature implies that data does not cross “tenant” boundaries (or in this case, the configured S3 prefix).
Object storage scalability
In a discussion, a potential user expressed reservations about the layer of indirection between the Graft client and object storage. Their main argument is that S3 is already proven to handle extremely high scale. They would be more comfortable using Graft if clients connected directly to object storage to pull changes. In some cases, this may also reduce costs due to free bandwidth between compute and S3.
New user experience
Graft should work out of the box without requiring additional services to run. By supporting direct access to object storage, it will be easier to get started and embed Graft in an application.
Guide-level Explanation
A high-level explanation of how this feature works, and how it would change the behavior of existing Graft clients such as `graft-sqlite` and `graft-fuse`.
graft-kernel
`graft-kernel` implements the Graft Kernel, which supersedes the functionality of the `graft-client` and `graft-server` crates. The Kernel provides access to a remote Volume Catalog, local storage, and client functionality to downstream crates like `graft-sqlite` and `graft-fuse`.
Client functionality
Clients such as `graft-sqlite` and `graft-fuse` will use the `graft-kernel` to operate on Volumes. The Kernel is designed to be embedded in the application, performing I/O in a small set of background threads. The Kernel will be implemented as an async core wrapped with an async and sync API.
graft-proxy
The Graft Proxy is an optional stateless edge service and caching layer which makes it easier for Graft to replicate to & from devices.
Graft Proxy exposes a simple API to consumers, enabling two key performance features:
- Graft Proxy caches reads from object storage.
- Virtual overlay of commits and segments: rather than passing reads straight through to object storage, the proxy can coalesce multiple commits into one larger commit, letting clients efficiently fast forward and download only the interesting portions of segments.
Eventually Graft Proxy will enhance Graft with these features:
- Volume subscriptions, eliminating the need to poll for changes
- Granular authorization
- Direct byte-range puts and gets against Volumes, enabling “dumb clients”.
Reference-level Explanation
A detailed technical breakdown covering APIs, algorithms, data structures, formats, edge cases, and performance implications.
Glossary
- `Volume Handle`: A reference to a local-remote Volume pair. Also responsible for tracking synchronization between a local and remote Volume. Documented more in the Volume Handle section.
- `vid`: A 16-byte Volume ID using `GID` encoding.
- `sid`: A 16-byte Segment ID using `GID` encoding.
- `Snapshot`: A frozen point-in-time view of a Volume.
- `lsn`: Documented in the Volume log section.
- `pageidx`: A 4-byte page index, representing the index of a page within a Volume. Valid range: `[1, 2^32)`.
- `Splinter`: A compressed bitset, used to keep track of which `pageidx`s appear in a Segment.
- `Segment`: A compressed sequence of pages, sorted by `pageidx`. Documented in the Segment Encoding section.
- `VolumeRef`: A `(vid, lsn)` tuple, representing a fixed point in a Volume’s history.
- `CommitHash`: Documented in the Commit hash section.
Volume log
A Volume’s durable state consists of a Checkpoint and a Log.
- Checkpoint — a point-in-time mapping from each non-empty `pageidx` to the version (by LSN) that was current when the checkpoint was taken.
- Log — an append-only sequence of log records. Each record contains the set of pages modified since the previous LSN.
Log Sequence Numbers (LSNs)
| Property | Definition |
| --- | --- |
| Domain | Unsigned 64-bit integer in the range `[1, 2^64)`. Zero is invalid. |
| Ordering | Strictly increasing, gap-free, and scoped per Volume. |
| Encoding | Canonical representation is `CBE64`: ones’-complement, big-endian. |
Because `0` is never a valid LSN, it is available as a sentinel value.
CBE64 Encoding
`CBE64` stands for Ones’-Complement Big-Endian.
- Binary form — 8-byte array. Bytewise comparison yields descending numeric order. Used for space-efficient storage, such as in embedded key-value stores like Fjall.
- Hex form — 16-character, zero-padded, uppercase hexadecimal string. Lexicographically sorts in the same order as the binary form. Used where human readability is preferred, such as object store keys.
The CBE64 encoding allows both local key-value stores and object stores to perform forward iteration over keys to process log records in descending LSN order, without additional index structures.
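To make the encoding concrete, here is a minimal sketch of CBE64 in Python. The helper names are illustrative rather than part of the Graft API; the key property is that plain bytewise (or lexicographic) comparison of the encoded forms yields descending LSN order.

```python
# Minimal sketch of CBE64: ones'-complement of the LSN, big-endian.
MASK = 0xFFFF_FFFF_FFFF_FFFF

def cbe64_encode(lsn: int) -> bytes:
    assert 1 <= lsn < 2**64, "0 is reserved as a sentinel"
    return (lsn ^ MASK).to_bytes(8, "big")

def cbe64_decode(raw: bytes) -> int:
    return int.from_bytes(raw, "big") ^ MASK

def cbe64_hex(lsn: int) -> str:
    # 16-character, zero-padded, uppercase hex form for object store keys
    return cbe64_encode(lsn).hex().upper()

# Larger LSNs sort first, so a forward scan finds the most recent commit.
lsns = [1, 7, 1_000_000]
assert sorted(cbe64_encode(l) for l in lsns) == [cbe64_encode(l) for l in reversed(lsns)]
assert sorted(cbe64_hex(l) for l in lsns) == [cbe64_hex(l) for l in reversed(lsns)]
```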
Commit hash
To verify data integrity, we attach a blake3 hash to each `Commit`.
```
def commit_hash(snapshot, pages):
    hasher = blake3::new()
    # unique 4 byte magic number identifying commits
    hasher.write(COMMIT_MAGIC)
    hasher.write(snapshot.vid)
    hasher.write(snapshot.lsn)
    hasher.write(snapshot.page_count)
    # pages must be in order by pageidx
    for page in pages:
        hasher.write(page)
    return hasher.hash()
```
Note that the Commit’s snapshot is passed in. This ensures the hash incorporates the Volume ID, LSN, and page count.
Volume Handle
Rather than forcing users to reference Volumes by ID, Graft exposes Volume Handles. Volume Handles are pointers to a local and remote Volume. Volume Handles only exist on a single client, and are not shared between clients or pushed to a remote.
Each Volume Handle has a name given to it at creation time. The name must match the regex `^[-_a-zA-Z0-9]{0,128}$` (alphanumeric characters, hyphens, and underscores; max length 128) and be unique on the client.
A Volume Handle always has a local-only Volume associated with it. This Volume is used for all local reads and writes.
A Volume Handle may be linked to a remote Volume. In this case, the local and remote Volumes will be kept in sync by the sync subsystem.
Segment Encoding
Pages are stored in Segments, which provide seekable compression over ranges of pages. Internally, Segments are sequences of compressed Frames. All of the pages in a Segment are stored in order by `PageIdx`.
Segments currently compress each Frame using zstd with the trailing checksum enabled. Graft may add support for other compression methods in the future.
To read a page from a Segment, the client must first retrieve the relevant `SegmentRef` from the `Commit`. Then the client may use the frame index to search for the correct frame. This allows clients to download only the relevant byte offsets from a Segment.
Remote storage
Section titled “Remote storage”Object storage keyspace
Graft will store all of a Volume’s data in Object Storage using the following keyspace layout:
```
{prefix} / {vid} /
  control: Control
  forks / {fork-vid}: Fork
  checkpoints: CheckpointSet
  log / {lsn}: Commit
  segments / {sid}: Segment
```
This flexible layout allows users to isolate tenants from one another by simply providing a unique `{prefix}`. This can be helpful when using AWS IAM to scope access keys to particular S3 prefixes, for example.
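As an illustration, a client could derive fully-qualified object keys like this. This is a hedged sketch: it assumes the CBE64 hex form from the earlier sketch is used for the `{lsn}` path component, and the helper names are not part of any real API.

```python
def control_key(prefix: str, vid: str) -> str:
    return f"{prefix}/{vid}/control"

def fork_key(prefix: str, parent_vid: str, fork_vid: str) -> str:
    return f"{prefix}/{parent_vid}/forks/{fork_vid}"

def checkpoints_key(prefix: str, vid: str) -> str:
    return f"{prefix}/{vid}/checkpoints"

def commit_key(prefix: str, vid: str, lsn: int) -> str:
    # CBE64 hex keeps newer commits first under a forward key listing
    return f"{prefix}/{vid}/log/{cbe64_hex(lsn)}"

def segment_key(prefix: str, vid: str, sid: str) -> str:
    return f"{prefix}/{vid}/segments/{sid}"
```

Scoping an IAM policy to `{prefix}/*` then restricts a client to a single tenant’s Volumes.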
Object storage schemas
All of the files aside from Segments are encoded using Protobuf. Serialized Protobuf is wrapped with a zerocopy `ProtobufEnvelope`:
```rust
enum ProtobufMessage {
    GraftRemoteV1Control = 1,
    GraftRemoteV1Fork = 2,
    GraftRemoteV1CheckpointSet = 3,
    GraftRemoteV1Commit = 4,
}

struct ProtobufEnvelope {
    magic: [u8; 4],
    _padding: [u8; 3],
    message: ProtobufMessage,
    data: [u8],
}
```
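For illustration, here is a hedged sketch of packing and unpacking the envelope in Python. The magic constant, the one-byte width of the `message` tag, and the helper names are assumptions made for the example; `payload` is an already-serialized protobuf message.

```python
import struct

ENVELOPE_MAGIC = b"GRFT"  # hypothetical 4-byte magic number

def pack_envelope(message_type: int, payload: bytes) -> bytes:
    # 4-byte magic, 3 bytes of padding, 1-byte message tag, then the payload
    return ENVELOPE_MAGIC + struct.pack("<3xB", message_type) + payload

def unpack_envelope(raw: bytes) -> tuple[int, bytes]:
    if raw[:4] != ENVELOPE_MAGIC:
        raise ValueError("not a Graft envelope")
    (message_type,) = struct.unpack_from("<3xB", raw, 4)
    return message_type, raw[8:]
```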
Segments are compressed sets of pages. Their encoding is documented in the Segment Encoding section, and their metadata is stored in the `SegmentRef` message.
```proto
syntax = "proto3";

package graft.remote.v1;

import "google/protobuf/timestamp.proto";

// A Volume has a top level control file stored at
// `{prefix}/{vid}/control`
// Control files are immutable.
message Control {
  // The Volume ID stored as a 16 byte GID.
  bytes vid = 1;

  // The parent reference if this Volume is a fork.
  optional VolumeRef parent = 2;

  // The creation timestamp of this Volume.
  google.protobuf.Timestamp created_at = 3;
}

// When a Volume is forked, a ref is first written to the parent Volume:
// `{prefix}/{parent-vid}/forks/{fork-vid}`
// Forks are immutable.
message Fork {
  // The VID of the fork.
  bytes fork_vid = 1;

  // The fork point. Must match the parent field in the Fork's Control file.
  VolumeRef parent = 2;
}

// A reference to a Volume at a particular LSN.
message VolumeRef {
  // The Volume ID stored as a 16 byte GID.
  bytes vid = 1;

  // The referenced LSN.
  uint64 lsn = 2;
}

// A Volume's CheckpointSet is stored at `{prefix}/{vid}/checkpoints`.
// CheckpointSets are updated by the checkpointer via compare-and-swap.
message CheckpointSet {
  // The Volume ID stored as a 16 byte GID.
  bytes vid = 1;

  // The list of checkpoint LSNs.
  repeated uint64 lsns = 2;
}

// A snapshot of a Volume.
message Snapshot {
  // The Volume ID stored as a 16 byte GID.
  bytes vid = 1;

  // The LSN of the Volume at this Snapshot.
  uint64 lsn = 2;

  // The Volume's page count at this Snapshot.
  uint32 page_count = 3;
}

// Commits are stored at `{prefix}/{vid}/log/{lsn}`.
// A commit may not include a SegmentRef if only the Volume's page count has
// changed. This happens when the Volume is extended or truncated without
// additional writes.
// Commits are immutable.
message Commit {
  // The Volume Snapshot at this Commit.
  Snapshot snapshot = 1;

  // An optional 256 bit CommitHash of this Commit.
  // Always present on Remote Volume commits.
  // May be omitted on Local commits.
  optional bytes hash = 2;

  // If this Commit contains any pages, `segment_ref` records details on the
  // relevant Segment.
  optional SegmentRef segment_ref = 3;

  // If this commit is a checkpoint, this timestamp is set and records the time
  // the commit was made a checkpoint.
  optional google.protobuf.Timestamp checkpoint_ts = 4;
}

message SegmentRef {
  // The 16 byte Segment ID.
  bytes sid = 1;

  // The set of pageidxs stored in this Segment.
  // Serialized using Splinter encoding.
  bytes splinter = 2;

  // An index of frames contained by the Segment.
  // Empty on Local Segments which have not been encoded and uploaded to object
  // storage.
  repeated SegmentFrame frames = 3;
}

message SegmentFrame {
  // The length of the compressed frame in bytes.
  uint32 frame_size = 1;

  // The last pageidx stored in the frame.
  uint32 last_pageidx = 2;
}
```
Local storage
This section documents how clients store data.
Local keyspace
Local storage uses Fjall, a partitioned k/v store. In the following keyspace, the top level keys are independent partitions. The remainder of the keys and the values are encoded using types in the following section.
```
handles / {name} -> VolumeHandle

volumes / {vid} /
  // The Volume's Control, includes its parent reference
  control -> Control
  // the latest LocalCheckpointSet for this Volume
  checkpoints -> LocalCheckpointSet

log / {vid} / {lsn} -> Commit

pages / {sid} / {pageidx} -> Page
```
Local schemas
Keys which are stored in the local keyspace are encoded using `zerocopy` types. `lsn` values are stored using `CBE64`, which ensures they naturally sort in descending order. This allows us to use a forward iterator to quickly find the most recent LSN; forward iteration is much more efficient than reverse iteration in most k/v stores (including Fjall).
The `handles` partition is unique in that it is keyed directly by the `VolumeHandle`’s name rather than one of the following zerocopy types.
```rust
enum VolumeProperty {
    Control = 1,
    Checkpoints = 2,
}

/// Key for the `volumes` partition
struct VolumeKey {
    vid: VolumeId,
    property: VolumeProperty,
}

/// Key for the `log` partition
struct CommitKey {
    vid: VolumeId,
    lsn: CBE64,
}

/// Key for the `pages` partition
struct PageKey {
    sid: SegmentId,
    pageidx: PageIdx,
}
```
Values stored locally are encoded using protobuf, combining the remote storage schema with the following additional message types. No envelope is needed for local values, as the upgrade process can migrate local data.
```proto
syntax = "proto3";

package graft.local.v1;

import "graft/remote/v1/index.proto";

message LocalCheckpointSet {
  // The etag from the last time we pulled the CheckpointSet, used to only pull
  // changed CheckpointSets.
  bytes etag = 1;

  // The list of checkpoint LSNs.
  repeated uint64 lsns = 2;
}

message VolumeHandle {
  // The name of the Volume Handle.
  string name = 1;

  // References to the local and remote Volumes, along with LSNs representing
  // their latest successful synchronization.
  graft.remote.v1.VolumeRef local = 2;
  optional graft.remote.v1.VolumeRef remote = 3;

  // Presence of the pending_commit field means that the Push operation is in
  // the process of committing to the remote. If no such Push job is currently
  // running (i.e. it was interrupted), this field must be used to resume or
  // abort the commit process.
  optional PendingCommit pending_commit = 4;
}

message PendingCommit {
  // The resulting remote LSN that the push job is attempting to create.
  uint64 remote_lsn = 1;

  // The associated 256 bit blake3 commit hash. This is used to determine
  // whether or not the commit has landed in the remote, in the case that we
  // are interrupted while attempting to push.
  bytes commit_hash = 2;
}
```
Algorithms
This section details the various key algorithms powering Graft’s new direct storage architecture.
Similar to the existing Graft architecture, we will start with one coarse `storage_lock` which must be held when we are doing any read-modify-update transaction on storage.
Lock rules:
- never hold the lock while performing IO other than storage operations
- hold the lock for the smallest time possible
When to hold the lock:
- Modifying the Volume Handle
- Committing to a Volume
When not to hold the lock:
- When the operation is idempotent, for example when we are writing to the pages partition or updating a Volume’s checkpoint set
- When performing read operations; use a Fjall instant instead
Volume Reader
Reading from a Volume requires creating a `VolumeReader` from a `VolumeHandle` at either the latest or a specific snapshot.
```
def visibility_path(snapshot):
    cursor = VolumeRef { vid: snapshot.vid, lsn: snapshot.lsn }
    path = []
    while cursor:
        if checkpoints = read(f"volumes/{cursor.vid}/checkpoints"):
            if checkpoint = checkpoints.for(cursor.lsn):
                # found a checkpoint, we can terminate the path here
                path.push((cursor.vid, (cursor.lsn)..=(checkpoint.lsn)))
                return path

        # no checkpoint, so scan to the beginning
        path.push((cursor.vid, (cursor.lsn)..=1))
        # and iterate to the parent
        cursor = read(f"volumes/{cursor.vid}/control").parent

    return path
```
```
class VolumeReader:
    def new(snapshot):
        self.snapshot = snapshot
        self.path = visibility_path(snapshot)

    def read_page(self, pageidx):
        if not self.snapshot.page_count.contains(pageidx):
            return None
        for key, commit in iter_commits(self.path):
            { snapshot, segment_ref } = commit
            if not (
                # handle truncate+extend
                snapshot.page_count.contains(pageidx)
                # skip page-count-only commits without a segment
                and segment_ref
                and segment_ref.splinter.contains(pageidx)
            ):
                continue
            page = read(f"pages/{segment_ref.sid}/{pageidx}")
            if page:
                return page
            return remote_read_page(snapshot.vid, segment_ref.sid, segment_ref, pageidx)
        return None

def iter_commits(path):
    result = []
    for (vid, scan) in path:
        top = f"log/{vid}/{scan.start}"
        bottom = f"log/{vid}/{scan.end}"
        result = chain(result, iter(top..=bottom))
    return result
```
```
def remote_read_page(vid, sid, segment_ref, pageidx):
    # first we need to determine which frame in the segment contains the
    # relevant page, accumulating byte offsets of the preceding frames
    bytes = (0, 0)
    pages = (0, 0)
    for frame in segment_ref.frames:
        bytes = (bytes.end, bytes.end + frame.frame_size)
        pages = (pages.end + 1, frame.last_pageidx)
        if pageidx <= frame.last_pageidx:
            break

    # fetch the frame from object storage, loading the pages into
    # `pages/{sid}/{pageidx}`
    frame = object_store.fetch(f"{PREFIX}/{vid}/segments/{sid}", bytes)
    frame = zstd.decompress(frame)
    for (pi, page) in zip(
        segment_ref.splinter.iter_range(pages),
        frame.split(PAGESIZE)
    ):
        write(f"pages/{sid}/{pi}", page)

    return read(f"pages/{sid}/{pageidx}")
```
Volume Writer
```
class VolumeWriter:
    def new(snapshot):
        self.reader = VolumeReader::new(snapshot)
        self.page_count = snapshot.page_count
        self.sid = SegmentId::random()
        self.splinter = Splinter::new()

    def read(pageidx):
        if self.splinter.contains(pageidx):
            return read(f"pages/{self.sid}/{pageidx}")
        else:
            return self.reader.read_page(pageidx)

    def write(pageidx, page):
        self.splinter.insert(pageidx)
        self.page_count = max(self.page_count, pageidx.pages())
        write(f"pages/{self.sid}/{pageidx}", page)

    def truncate(page_count):
        self.page_count = page_count
        delete_range(f"pages/{self.sid}/{page_count}"..)

    # also triggered on drop
    def rollback():
        delete_prefix(f"pages/{self.sid}")

    def commit():
        snapshot = self.reader.snapshot
        vid = snapshot.vid
        commit_lsn = snapshot.lsn.next()

        with storage_lock:
            # verify we are the latest snapshot
            latest_snapshot = first(f"log/{vid}").snapshot
            if snapshot != latest_snapshot:
                raise "concurrent write"

            write(f"log/{vid}/{commit_lsn}", Commit {
                snapshot = Snapshot {
                    vid,
                    lsn = commit_lsn,
                    page_count = self.page_count
                },
                # no hash for commits to a local volume
                segment_ref = SegmentRef {
                    sid: self.sid,
                    splinter: self.splinter,
                    # no frame info for commits to a local volume
                }
            })
```
Push Volume
```
def push_volume(handle_name):
    handle = read(f"handles/{handle_name}")
    (local_lsn, commit) = prepare_commit(handle)
    match remote_commit(commit):
        Ok() => push_success(handle, commit, local_lsn)
        Err() => push_failure(handle, commit)

def prepare_commit(handle):
    # trigger recovery
    if handle.pending_commit:
        raise InterruptedPush

    { local, remote } = handle

    remote_snapshot = first(f"log/{remote.vid}")
    local_snapshot = first(f"log/{local.vid}")

    # we can only push if the remote has not changed since the last time we synced
    if remote_snapshot.lsn != remote.lsn:
        # this situation can only occur in normal operation if the local volume
        # has diverged from the remote, i.e. a remote commit happened
        # concurrently with a local commit, preventing fast forward
        raise Diverged

    top_lsn = local_snapshot.lsn
    bottom_lsn = local.lsn
    sync_range = top_lsn..bottom_lsn

    if sync_range.is_empty():
        raise NothingToCommit

    # build and push segments from commits
    (segment_ref, commit_hash) = build_and_push_segments(local.vid, sync_range, remote.vid)

    commit_lsn = remote_snapshot.lsn.next()

    # write out the pending commit
    with storage_lock:
        # abort the push if the handle changed since we started the push process
        if handle != read(f"handles/{handle.name}"):
            raise Retry

        # abort the push if the remote snapshot changed since we started the push process
        if remote_snapshot != first(f"log/{remote.vid}"):
            raise Retry

        handle.pending_commit = PendingCommit { remote_lsn = commit_lsn, commit_hash }
        write(f"handles/{handle.name}", handle)

    return (local_snapshot.lsn, Commit {
        snapshot: Snapshot {
            vid: remote.vid,
            lsn: commit_lsn,
            page_count: local_snapshot.page_count
        },
        hash: commit_hash,
        segment_ref
    })

def build_and_push_segments(local_vid, lsn_range, remote_vid):
    # merge segments from all local commits in the lsn_range into one segment
    # which is uploaded to the remote
    #
    # IF we expect to be querying the remote volume anytime soon, we can
    # optionally write out segments to our local page store.
    return (segment_ref, commit_hash)

def remote_commit(commit):
    path = f"{PREFIX}/{commit.snapshot.vid}/log/{commit.snapshot.lsn}"
    object_store.write_if_not_exists(path, commit)

def push_success(handle, commit, local_lsn):
    {vid, lsn} = commit.snapshot

    batch = storage.batch()

    batch.write(f"log/{vid}/{lsn}", commit)

    new_handle = VolumeHandle {
        pending_commit = None,
        local = VolumeRef { vid: handle.local.vid, lsn: local_lsn },
        remote = VolumeRef { vid: handle.remote.vid, lsn },
        ...handle
    }
    batch.write(f"handles/{handle.name}", new_handle)

    with storage_lock:
        # fail if handle has changed
        assert(handle == read(f"handles/{handle.name}"))

        # fail if the remote lsn already exists
        assert(not read(f"log/{vid}/{lsn}"))

        batch.commit()

def push_failure(handle, commit):
    # push failed, clear pending commit
    with storage_lock:
        # panic if handle has changed
        assert(handle == read(f"handles/{handle.name}"))
        handle.pending_commit = None
        write(f"handles/{handle.name}", handle)
```
Pull & Fetch Volume
The Pull and Fetch Volume operations support incrementally pulling or fully fetching a Volume respectively.
```
def fetch_visibility_path(vid, lsn):
    cursor = VolumeRef { vid, lsn }
    path = []
    while cursor:
        { vid, lsn } = cursor

        # load the control if it doesn't exist
        if not read(f"volumes/{vid}/control"):
            control = object_store.fetch(f"{PREFIX}/{vid}/control")
            write(f"volumes/{vid}/control", control)

        # update checkpoints if it doesn't exist or has changed
        prev_checkpoints = read(f"volumes/{vid}/checkpoints")
        checkpoints = object_store.fetch(f"{PREFIX}/{vid}/checkpoints", prev_checkpoints.etag)
        if checkpoints:
            update_checkpoints(vid, prev_checkpoints, checkpoints)

        if checkpoints = read(f"volumes/{vid}/checkpoints"):
            if checkpoint_lsn = checkpoints.for(lsn):
                # found a checkpoint, we can terminate the path here
                path.push((vid, (lsn)..=(checkpoint_lsn)))
                return path

        # no checkpoint, so scan to the beginning
        path.push((vid, (lsn)..=1))
        # and iterate to the parent
        cursor = read(f"volumes/{vid}/control").parent

    return path

def update_checkpoints(vid, old_checkpoints, new_checkpoints):
    new_lsns = new_checkpoints.lsns - old_checkpoints.lsns
    fetch_commits(vid, new_lsns, replace=True)
    write(f"volumes/{vid}/checkpoints", new_checkpoints.into())

def fetch_volume(vid, max_lsn=LSN::MAX):
    # retrieve the latest snapshot <= max_lsn
    snapshot = first(f"log/{vid}/{max_lsn}"..)
    # refresh the visibility path
    path = fetch_visibility_path(vid, snapshot.lsn)

    # fetch all commits in path
    for (vid, scan) in path:
        fetch_commits(vid, scan)

def pull_volume(vid):
    snapshot = first(f"log/{vid}")
    # refresh the visibility path, to update any checkpoints
    fetch_visibility_path(vid, snapshot.lsn)
    fetch_commits(vid, snapshot.lsn..)

# lsns may be a range of lsns (possibly unbounded) or a set of lsns.
# in the unbounded range case, fetch_all will stop fetching once it discovers
# the end of the range.
# if replace is True, this function will refetch all commits in range.
def fetch_commits(vid, lsns, replace=False):
    lsns = lsns if replace else remove_fetched_lsns(lsns)
    for commit in fetch_all(f"{PREFIX}/{vid}/log/{lsns}"):
        lsn = commit.snapshot.lsn
        with storage_lock:
            # we take the lock here to ensure that we serialize with other
            # read-modify-update operations on the log.
            # in theory we don't need to worry about conflicting with other
            # operations since pulling commits from a remote is idempotent
            write(f"log/{vid}/{lsn}", commit)

def remove_fetched_lsns(lsns):
    # return a new lsn set that only contains unfetched lsns
    return lsns
```
Sync remote volume to local
In this new architecture we split up pulling the remote volume from syncing it into the local volume.
```
def sync_remote_to_local(handle_name):
    handle = read(f"handles/{handle_name}")

    # we can safely sync the latest remote snapshot into the local volume if
    # the local volume has no outstanding local changes.
    local_snapshot = first(f"log/{handle.local.vid}")
    if local_snapshot.lsn != handle.local.lsn:
        raise OutstandingLocalChanges

    # check to see if we have any changes to sync
    remote_snapshot = first(f"log/{handle.remote.vid}")
    lsn_range = remote_snapshot.lsn..handle.remote.lsn

    if lsn_range.is_empty():
        raise NothingToSync

    # sync lsn range from remote to local, this requires copying each commit
    # and mapping the LSNs to the local volume's LSN space
    commits = prepare_commits(local_snapshot, handle.remote.vid, lsn_range)

    batch = storage.batch()
    for commit in commits:
        {vid, lsn} = commit.snapshot
        batch.write(f"log/{vid}/{lsn}", commit)

    with storage_lock:
        if local_snapshot != first(f"log/{handle.local.vid}"):
            raise Retry

        batch.commit()

def prepare_commits(local_snapshot, remote_vid, lsn_range):
    remote_commits = iter_range(f"log/{remote_vid}/{lsn_range}")
    # build up an array of local commits by mapping each remote commit to the
    # next local lsn
    return commits
```
Reset volume
It’s possible to reset the local volume for a Volume Handle by creating a new Local Volume and potentially copying over unsynced local commits. GC will be able to clean up the orphaned local volume once all snapshots have closed.
We should automatically reset local volumes when their log grows past some configurable threshold and they have recently synced with the remote.
We can also use a reset to throw away local changes.
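A hedged sketch of the Reset operation, using the same pseudocode conventions as above. The helpers `copy_unsynced_commits` and `remap_local_ref` are hypothetical, and remapping the handle’s sync point into the new Volume’s LSN space is elided.

```
def reset_volume(handle_name, discard_local_changes=False):
    with storage_lock:
        handle = read(f"handles/{handle_name}")
        old_vid = handle.local.vid
        new_vid = VolumeId::random()

        if not discard_local_changes:
            # carry unsynced local commits (lsn > handle.local.lsn) over to the
            # new local Volume; skipping this step throws local changes away
            copy_unsynced_commits(old_vid, new_vid, after=handle.local.lsn)

        # point the handle at the new local Volume
        handle.local = remap_local_ref(handle.local, new_vid)
        write(f"handles/{handle_name}", handle)

    # the old local Volume is now unreferenced; local GC removes it once all
    # open snapshots against it have closed
```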
GC and Checkpointing
Section titled “GC and Checkpointing”Local garbage collection
To perform GC locally, we will need to keep track of all open Snapshots and Handles. For now GC will focus on eliminating inaccessible data:
- unreferenced Volumes
- unreferenced Segments
- while taking care not to eliminate in-progress writes
An interesting aspect of GC is whether or not it will delete portions of a Volume’s commit log which are no longer accessible. For remote volumes, since we can always redownload the log, we just need to ensure that only unreferenced portions of the log are removed - i.e. portions not directly visible in the visibility_path computation from any snapshot or handle. For local volumes, we will just wait until they are Reset and then clean up the orphaned local volume id.
Checkpointing
The checkpointing process involves picking a VID/LSN based on some checkpoint heuristics, then rewriting the commit to reference a new segment that contains all non-empty pages of the entire Volume. Finally, the checkpoint LSN is recorded in the Volume’s checkpoint set.
The Checkpoint algorithm is made crash safe by first scanning for checkpoints that have been written but are not also present in the CheckpointSet.
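A hedged sketch of the checkpointer, again in pseudocode. The heuristics, the exact recovery scan, and helpers like `build_full_segment`, `fetch_with_etag`, and `put_if_match` are placeholders for this illustration.

```
def checkpoint_volume(vid):
    # crash safety: adopt any checkpoint commit that was rewritten before a
    # crash but never recorded in the CheckpointSet
    recover_unrecorded_checkpoints(vid)

    snapshot = pick_checkpoint_candidate(vid)   # VID/LSN chosen by heuristics
    if not snapshot:
        return

    # build one Segment containing every non-empty page visible at snapshot.lsn
    segment_ref = build_full_segment(vid, snapshot)

    # rewrite the commit so it references the new Segment and carries a
    # checkpoint_ts recording when it became a checkpoint
    commit = object_store.fetch(f"{PREFIX}/{vid}/log/{snapshot.lsn}")
    commit.segment_ref = segment_ref
    commit.checkpoint_ts = now()
    object_store.write(f"{PREFIX}/{vid}/log/{snapshot.lsn}", commit)

    # finally, record the LSN in the CheckpointSet via compare-and-swap
    while True:
        (checkpoints, etag) = object_store.fetch_with_etag(f"{PREFIX}/{vid}/checkpoints")
        checkpoints.lsns.push(snapshot.lsn)
        if object_store.put_if_match(f"{PREFIX}/{vid}/checkpoints", checkpoints, etag):
            break
```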
Remote garbage collection
Remote GC must take care not to truncate a Volume’s history which is referenced by a Fork. So, when GC decides based on heuristics to checkpoint a volume, it will first check the Volume’s forks to determine if there are any references that would be truncated. For each such reference, GC must first checkpoint the fork.
GC will use the `checkpoint_ts` associated with commit checkpoints to ensure that relevant checkpoints have lived “long enough” to shadow truncated data, i.e. long enough that hopefully all clients have picked them up.
Splinter improvements
This RFC depends on the following new Splinter features:
- `iter_range(keys: Range<u32>) -> impl Iterator<Item = u32>`: returns an iterator over keys contained by the range which are present in the Splinter.
All new methods should be implemented on both Splinter and SplinterRef.
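A small sketch of the intended semantics (`Splinter::from_iter` is illustrative): the iterator yields only the keys inside the requested range that are actually present in the Splinter, which is what `remote_read_page` relies on to map a frame’s page range back to concrete `pageidx`s.

```
splinter = Splinter::from_iter([3, 7, 42, 1000])
assert splinter.iter_range(1..100).collect() == [3, 7, 42]
```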
Drawbacks
It’s a huge amount of work. Architecturally, I think this is much better than the current state of Graft, and it opens up more opportunities to work with new customers. It also makes Graft a bit more standalone and simpler to run, which will make new users happier.
I think the biggest drawback is that there won’t be a MetaStore to perform client fast forwarding. This will make fetches after long periods of being offline a bit less optimal. However, I think it’s ok to solve this with an optional service (graft-proxy) rather than requiring it. Graft Proxy can provide its own virtual overlay of commits and segments, allowing all the same optimizations the MetaStore and PageStore are able to perform now.