Client

Graft Clients support reading and writing to Volumes.

Local Storage

The Graft client uses Fjall, an embeddable Rust key-value store based on LSM trees, for local storage. Graft splits the data across three Fjall partitions with the following key layout and value types:

volumes:
  {vid}/config          -> VolumeConfig
  {vid}/status          -> VolumeStatus
  {vid}/snapshot        -> Snapshot
  {vid}/watermarks      -> Watermarks

pages:
  {vid}/{pageidx}/{LSN} -> PageValue

commits:
  {vid}/{LSN}           -> Graft

VolumeConfig:
  sync: Disabled | Push | Pull | Both

VolumeStatus: Ok | RejectedCommit | Conflict

Snapshot:
  local: LSN
  remote: RemoteMapping
  pages: PageCount

RemoteMapping:
  Unmapped
  Mapped {
    remote: LSN,
    local: LSN
  }

Watermarks:
  pending_sync: Option<LSN>
  checkpoint: Option<LSN>

PageValue:
  Pending
  Empty
  Available(Page)

Graft:
  Splinter of all PageIdxs changed in the commit
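
The layout above could be modelled in Rust roughly as follows. This is a minimal sketch, not the client's actual definitions: the integer widths, the string volume id, the BTreeSet standing in for a Splinter, and the page_key helper with its big-endian encoding are all assumptions made for illustration. The encoding is chosen so that byte-wise key order matches (pageidx, LSN) order within a volume, which is what a "latest page at or below an LSN" lookup needs.

```rust
use std::collections::BTreeSet;

/// Log sequence number. A plain u64 is an assumption of this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lsn(pub u64);

impl Lsn {
    pub fn next(self) -> Lsn {
        Lsn(self.0 + 1)
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SyncMode { Disabled, Push, Pull, Both }

pub struct VolumeConfig { pub sync: SyncMode }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum VolumeStatus { Ok, RejectedCommit, Conflict }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RemoteMapping {
    Unmapped,
    Mapped { remote: Lsn, local: Lsn },
}

pub struct Snapshot {
    pub local: Lsn,
    pub remote: RemoteMapping,
    pub pages: u32, // PageCount; the width is an assumption
}

pub struct Watermarks {
    pub pending_sync: Option<Lsn>,
    pub checkpoint: Option<Lsn>,
}

pub enum PageValue { Pending, Empty, Available(Vec<u8>) }

/// Stand-in for the Splinter of changed PageIdxs (hypothetical representation).
pub struct Graft(pub BTreeSet<u32>);

/// Hypothetical encoding of a `{vid}/{pageidx}/{LSN}` key.
/// Big-endian integers make byte order agree with numeric order.
pub fn page_key(vid: &str, pageidx: u32, lsn: Lsn) -> Vec<u8> {
    let mut key = Vec::with_capacity(vid.len() + 14);
    key.extend_from_slice(vid.as_bytes());
    key.push(b'/');
    key.extend_from_slice(&pageidx.to_be_bytes());
    key.push(b'/');
    key.extend_from_slice(&lsn.0.to_be_bytes());
    key
}
```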

Reading

To issue a local read against a Volume snapshot:

Look up the latest page in storage such that page.LSN <= snapshot.local
- If this page is either Available or Empty, return the page
If snapshot.remote is Unmapped, return an empty page
Request the page from the Pagestore
- This may be batched along with prefetches
Save the requested page into storage at page.LSN
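
The read steps above can be sketched against an in-memory stand-in for the pages partition. Everything named here is hypothetical: the BTreeMap replaces Fjall, the fetch closure replaces the Pagestore request, an empty Vec represents an empty page, and storing a fetched page that had no prior entry at the mapped local LSN is a choice made only for this sketch, since the steps above do not define page.LSN in that case.

```rust
use std::collections::BTreeMap;

type Lsn = u64;
type PageIdx = u32;
type Page = Vec<u8>;

#[derive(Debug, Clone, PartialEq)]
pub enum PageValue { Pending, Empty, Available(Page) }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RemoteMapping { Unmapped, Mapped { remote: Lsn, local: Lsn } }

pub struct Snapshot { pub local: Lsn, pub remote: RemoteMapping }

/// Stand-in for one volume's slice of the pages partition, keyed by (pageidx, LSN).
pub type Pages = BTreeMap<(PageIdx, Lsn), PageValue>;

pub fn read_page(
    pages: &mut Pages,
    snapshot: &Snapshot,
    idx: PageIdx,
    fetch: impl FnOnce(PageIdx) -> Page, // hypothetical Pagestore request
) -> Page {
    // Latest entry for this page with LSN <= snapshot.local.
    let found = pages
        .range((idx, 0)..=(idx, snapshot.local))
        .next_back()
        .map(|(&(_, lsn), value)| (lsn, value.clone()));

    // Available and Empty entries are served locally.
    match &found {
        Some((_, PageValue::Available(page))) => return page.clone(),
        Some((_, PageValue::Empty)) => return Page::new(),
        _ => {}
    }

    // Without a remote mapping there is nothing to fetch.
    let mapped_local = match snapshot.remote {
        RemoteMapping::Unmapped => return Page::new(),
        RemoteMapping::Mapped { local, .. } => local,
    };

    // Fetch, then cache at the LSN of the Pending entry
    // (or at the mapped local LSN if there was no entry: an assumption).
    let page = fetch(idx);
    let store_lsn = found.map(|(lsn, _)| lsn).unwrap_or(mapped_local);
    pages.insert((idx, store_lsn), PageValue::Available(page.clone()));
    page
}
```

Because the fetched page replaces the Pending entry at the same LSN, a second read of the same page at the same snapshot is served from local storage.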

Writing

Writes commit locally and then are asynchronously committed remotely. This section only deals with the local commit.

Writes go through a VolumeWriter, which buffers newly written pages in a memtable. Reads check the memtable first to provide read-your-own-writes (RYOW) semantics before falling back to the regular read algorithm. Each VolumeWriter is pinned to a Snapshot.

The commit process happens atomically via a Fjall batch.

Set commit_lsn = snapshot.local.next()
Persist the memtable at commit_lsn
Write out a Graft to the commits partition at commit_lsn
Take the local commit lock
Set latest to the latest volume Snapshot
Fail if latest.local != snapshot.local
Write out the new snapshot (without changing the remote mapping)
Commit the Fjall batch
Release the commit lock
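
A minimal sketch of this commit sequence, assuming an in-memory Store in place of the Fjall partitions and a Mutex in place of the local commit lock. The Store, CommitError, and commit names, the page_count parameter, and the Option-based remote mapping are all hypothetical. The batch is staged before the lock is taken and applied only after the snapshot check passes, so a stale writer leaves storage untouched.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::sync::Mutex;

type Lsn = u64;
type PageIdx = u32;
type Page = Vec<u8>;

#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct Snapshot {
    pub local: Lsn,
    pub remote: Option<(Lsn, Lsn)>, // (remote LSN, local LSN), None = Unmapped
    pub pages: u32,
}

/// In-memory stand-in for the volumes, pages, and commits partitions.
#[derive(Default)]
pub struct Store {
    pub snapshot: Snapshot,
    pub pages: BTreeMap<(PageIdx, Lsn), Page>,
    pub commits: BTreeMap<Lsn, BTreeSet<PageIdx>>,
}

#[derive(Debug, PartialEq)]
pub enum CommitError {
    Conflict { pinned: Lsn, latest: Lsn },
}

pub fn commit(
    store: &Mutex<Store>, // the Mutex plays the role of the commit lock
    pinned: Snapshot,     // snapshot the VolumeWriter is pinned to
    memtable: BTreeMap<PageIdx, Page>,
    page_count: u32,      // page count after this commit (assumed to be known by the writer)
) -> Result<Snapshot, CommitError> {
    // Stage the batch without holding the lock.
    let commit_lsn = pinned.local + 1;
    let graft: BTreeSet<PageIdx> = memtable.keys().copied().collect();
    let staged: Vec<((PageIdx, Lsn), Page)> = memtable
        .into_iter()
        .map(|(idx, page)| ((idx, commit_lsn), page))
        .collect();

    // Take the lock and verify that nobody committed since `pinned`.
    let mut guard = store.lock().unwrap();
    let latest = guard.snapshot;
    if latest.local != pinned.local {
        return Err(CommitError::Conflict { pinned: pinned.local, latest: latest.local });
    }

    // The new snapshot keeps the remote mapping unchanged.
    let next = Snapshot { local: commit_lsn, remote: latest.remote, pages: page_count };

    // Apply the staged batch in one step (stand-in for committing the Fjall batch).
    guard.pages.extend(staged);
    guard.commits.insert(commit_lsn, graft);
    guard.snapshot = next;
    Ok(next)
    // The lock is released when `guard` goes out of scope.
}
```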

Sync

The Graft Client runtime supports asynchronously pushing and pulling from the server. Since this process happens out of band, two writers committing to the same Volume will frequently conflict and will need to rebase or reset to continue.

Future work:

synchronous commit+push to make conflicts easier to detect
MVCC automatic conflict resolution
Rebase conflict resolution

Sync: Pull

The Graft runtime polls /metastore/v1/pull_graft for changes. When a change is detected, the runtime attempts to “accept” the change.

The pull process happens atomically via a Fjall batch.

Take the local commit lock
Read the latest Volume Snapshot and Watermarks
If remote_mapping.local < pending_sync: FAIL with VolumeNeedsRecovery
If remote_mapping.local < snapshot.local: FAIL with RemoteConflict
- Set Volume status to VolumeStatus::Conflict
Set commit_lsn = snapshot.local.next()
Update the snapshot
- local=commit_lsn, remote=(remote_lsn, commit_lsn), pages=remote_pages
Update the watermarks
- pending_sync=commit_lsn
For each changed pageidx in the remote commit, write PageValue::Pending into the pages partition at commit_lsn. This ensures that future reads know to fetch the page from the Pagestore.
Commit the Fjall batch
Release the commit lock
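
The accept logic above might look like the following sketch. It assumes the volume is already Mapped (the Unmapped case is not modelled), and it uses a single in-memory Volume struct where the real client would hold the commit lock and write through a Fjall batch. Volume, PullError, and accept_pull are hypothetical names.

```rust
use std::collections::BTreeMap;

type Lsn = u64;
type PageIdx = u32;

#[derive(Debug, Clone, PartialEq)]
pub enum PageValue { Pending, Empty, Available(Vec<u8>) }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum VolumeStatus { Ok, RejectedCommit, Conflict }

#[derive(Debug, PartialEq)]
pub enum PullError { VolumeNeedsRecovery, RemoteConflict }

/// Snapshot, watermark, and page state for one volume (in-memory stand-in).
pub struct Volume {
    pub status: VolumeStatus,
    pub local: Lsn,
    pub remote: (Lsn, Lsn), // (remote LSN, local LSN); assumes a Mapped volume
    pub page_count: u32,
    pub pending_sync: Option<Lsn>,
    pub pages: BTreeMap<(PageIdx, Lsn), PageValue>,
}

/// `&mut Volume` stands in for holding the commit lock plus an open batch.
pub fn accept_pull(
    vol: &mut Volume,
    remote_lsn: Lsn,
    remote_pages: u32,
    changed: &[PageIdx],
) -> Result<Lsn, PullError> {
    let (_, mapped_local) = vol.remote;

    // A push was started but never completed.
    if vol.pending_sync.map_or(false, |pending| mapped_local < pending) {
        return Err(PullError::VolumeNeedsRecovery);
    }
    // Local commits exist that the server has not seen.
    if mapped_local < vol.local {
        vol.status = VolumeStatus::Conflict;
        return Err(PullError::RemoteConflict);
    }

    let commit_lsn = vol.local + 1;
    vol.local = commit_lsn;
    vol.remote = (remote_lsn, commit_lsn);
    vol.page_count = remote_pages;
    vol.pending_sync = Some(commit_lsn);

    // Mark changed pages so later reads fetch them from the Pagestore.
    for &idx in changed {
        vol.pages.insert((idx, commit_lsn), PageValue::Pending);
    }
    Ok(commit_lsn)
}
```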

FAIL states:

VolumeNeedsRecovery: This means that we had previously crashed in the middle of pushing the Volume to the server. The client needs to recover or reset the volume before continuing.
RemoteConflict: This means that we have made local commits since the last successful sync. The client needs to reconcile with the server before continuing.

Sync: Push

When the Graft runtime detects that a local commit has occurred, it tries to push the commit to the server.

Take the local commit lock
Read the latest Volume Snapshot and Watermarks
If remote_mapping.local < pending_sync: FAIL with VolumeNeedsRecovery
Update watermarks.pending_sync to snapshot.local
Calculate the LSN range to push:
- start_lsn = remote_mapping.local.next()
- end_lsn = snapshot.local
Release the local commit lock
Iterate through the local commit splinters
- Send the most recent page for each pageidx to the Pagestore
- Collect new segments
Commit the segments to the Metastore
Take the local commit lock
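
The range calculation and the "most recent page for each pageidx" step can be illustrated with a small sketch. The function names are hypothetical, a BTreeSet stands in for each commit's Splinter, and returning None for an empty range is an assumption of this sketch rather than something the steps above specify.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Lsn = u64;
type PageIdx = u32;

/// Inclusive LSN range to push, or None when there is nothing unsynced.
pub fn push_range(mapped_local: Lsn, snapshot_local: Lsn) -> Option<(Lsn, Lsn)> {
    let start_lsn = mapped_local + 1;
    (start_lsn <= snapshot_local).then_some((start_lsn, snapshot_local))
}

/// For every page changed in [start_lsn, end_lsn], the LSN of its most recent version.
/// Callers are expected to obtain the range from `push_range` so that start <= end.
pub fn latest_pages(
    commits: &BTreeMap<Lsn, BTreeSet<PageIdx>>,
    start_lsn: Lsn,
    end_lsn: Lsn,
) -> BTreeMap<PageIdx, Lsn> {
    let mut latest = BTreeMap::new();
    // Commits are visited in ascending LSN order, so later commits overwrite earlier ones.
    for (&lsn, graft) in commits.range(start_lsn..=end_lsn) {
        for &idx in graft {
            latest.insert(idx, lsn);
        }
    }
    latest
}
```

Sending only the latest version of each page means a page rewritten in several local commits is uploaded once per push.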

On commit success:

Open a Fjall batch
Read the latest Volume Snapshot and Watermarks
Assert that the new remote LSN is larger than the last remote LSN
Assert that watermarks.pending_sync == snapshot.local
Update the snapshot’s remote mapping to (remote_lsn, snapshot.local)
Remove all successfully synced commit grafts
Commit the batch
Release the local commit lock

On commit failure:

Update watermarks.pending_sync = snapshot.remote_mapping.local
Set Volume status to VolumeStatus::RejectedCommit
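
The two outcomes can be sketched as a single function over an in-memory volume. As before, this assumes a Mapped volume, and Volume, PushOutcome, and finish_push are hypothetical names. The `&mut Volume` argument stands in for holding the commit lock and, on success, an open Fjall batch.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Lsn = u64;
type PageIdx = u32;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum VolumeStatus { Ok, RejectedCommit, Conflict }

pub enum PushOutcome {
    Accepted { remote_lsn: Lsn },
    Rejected,
}

#[derive(Clone)]
pub struct Volume {
    pub status: VolumeStatus,
    pub local: Lsn,
    pub remote: (Lsn, Lsn), // (remote LSN, local LSN); assumes a Mapped volume
    pub pending_sync: Option<Lsn>,
    pub commits: BTreeMap<Lsn, BTreeSet<PageIdx>>,
}

pub fn finish_push(vol: &mut Volume, outcome: PushOutcome) {
    match outcome {
        PushOutcome::Accepted { remote_lsn } => {
            let (last_remote, _) = vol.remote;
            assert!(remote_lsn > last_remote, "remote LSN must advance");
            assert_eq!(vol.pending_sync, Some(vol.local));

            // Everything up to snapshot.local is now synced.
            let synced = vol.local;
            vol.remote = (remote_lsn, synced);
            vol.commits.retain(|&lsn, _| lsn > synced);
        }
        PushOutcome::Rejected => {
            // Roll the watermark back to the last synced local LSN.
            vol.pending_sync = Some(vol.remote.1);
            vol.status = VolumeStatus::RejectedCommit;
        }
    }
}
```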

Crash recovery

The Graft client runtime must be able to crash at any point and recover. Fjall already has its own recovery mechanisms built in, so we just need to handle failed Pushes. Failed pushes can be detected when pending_sync is larger than remote_mapping.local and no concurrent Push job is running.

When a volume is in this failed push state, the client needs to determine whether the commit was accepted by the Metastore. It does so by retrying the commit process with the same idempotency token.
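
The detection rule can be written as a small predicate. This is only a sketch of the condition described above; the function name and the boolean flag for a running Push job are hypothetical, and the retry with the idempotency token is not shown.

```rust
/// True when a push was started (pending_sync moved ahead of the mapped
/// local LSN) but no Push job is currently running to finish it.
pub fn needs_push_recovery(
    pending_sync: Option<u64>,
    mapped_local: u64,
    push_in_flight: bool,
) -> bool {
    !push_in_flight && pending_sync.map_or(false, |pending| pending > mapped_local)
}
```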

Lite Client

In some cases, a Client may want to boot without any state and quickly read (and possibly write) to a particular Volume snapshot. In the most minimal case, if the client already knows the LSN of the snapshot it wants to access, it can read from the Pagestore immediately. To issue a write, it first needs to read the latest snapshot to get the page count and current remote LSN.

Supporting Lite Clients is desirable to help enable edge serverless workloads which want to optimize for latency and have no cached state.