read this skill for a token-efficient summary of the bucket subsystem
The bucket subsystem implements a log-structured merge tree (LSM-tree) data structure called the BucketList. It maintains a canonical, hash-verifiable representation of all ledger state. There are two BucketList instances: the LiveBucketList (current ledger state) and the HotArchiveBucketList (recently evicted entries). Each is organized into 11 temporal levels (level 0 being youngest/smallest, level 10 being oldest/largest), where older levels are exponentially larger and change less frequently.
The system is designed to:
BucketBase<BucketT, IndexT> — Abstract CRTP base for immutable, sorted, hashed containers of XDR entries. Holds a filename, hash, size, and optional index (shared_ptr<IndexT>). Provides the core merge() and mergeInternal() static methods. Buckets are designed to be held in shared_ptr and shared across threads.
LiveBucket — Stores BucketEntry (INITENTRY, LIVEENTRY, DEADENTRY, METAENTRY). Supports shadows, INIT/DEAD annihilation, and in-memory level-0 merges. Has an optional mEntries vector for in-memory-only buckets. Index type: LiveBucketIndex. Key methods:
fresh() — Creates a new bucket from init/live/dead entry vectors, sorts, hashes, writes to disk.freshInMemoryOnly() — Creates a bucket that only exists in memory (for level-0 "snap" that immediately merges).mergeInMemory() — Merges two in-memory buckets without FutureBucket, used only for level 0.merge() (inherited) — File-based merge of two buckets with optional shadows.maybePut() — Shadow-aware write: elides entries shadowed by newer buckets (pre-v12), preserves INIT/DEAD lifecycle entries (v11+).mergeCasesWithEqualKeys() — Handles INIT/DEAD annihilation and INIT+LIVE→INIT promotion.convertToBucketEntry() — Converts raw ledger entry vectors into sorted BucketEntry vector.isTombstoneEntry() — Returns true for DEADENTRY.HotArchiveBucket — Stores HotArchiveBucketEntry (HOT_ARCHIVE_ARCHIVED, HOT_ARCHIVE_LIVE, HOT_ARCHIVE_METAENTRY). No shadow support. Index type: HotArchiveBucketIndex. HOT_ARCHIVE_LIVE acts as tombstone (restored entries). Key methods:
fresh() — Creates bucket from archived entries and restored keys.maybePut() — Always writes (no shadow logic).isTombstoneEntry() — Returns true for HOT_ARCHIVE_LIVE.BucketListBase<BucketT> — Abstract templated base for BucketList data structure. Contains a vector of BucketLevel<BucketT>. Defines the temporal-leveling algorithm: level sizes are powers of 4 (levelSize(i) = 4^(i+1)), each split into curr and snap halves. Key methods:
addBatchInternal() — Main entry point: adds a batch of entries at a ledger close. Walks levels top-down, calling snap() and prepare() on levels that should spill. Level 0 uses prepareFirstLevel() for in-memory merges.levelShouldSpill() — Returns true when a level needs to snapshot curr→snap and merge snap into the next level.restartMerges() — Re-starts merges after deserialization (catchup or restart). For v12+ merges, reconstructs from current BucketList state; for older merges, uses serialized hashes.resolveAnyReadyFutures() — Non-blocking resolution of completed merges.getHash() — Returns concatenated hash of all level hashes (each level = hash of curr + snap).levelSize(), levelHalf(), sizeOfCurr(), sizeOfSnap(), oldestLedgerInCurr(), oldestLedgerInSnap(), keepTombstoneEntries(), bucketUpdatePeriod().BucketLevel<BucketT> — A single level in the BucketList. Holds mCurr, mSnap (both shared_ptr<BucketT>), and mNextCurr (a std::variant<FutureBucket<BucketT>, shared_ptr<BucketT>>). Key methods:
prepare() — Starts an async merge via FutureBucket (used for levels 1+).prepareFirstLevel() — Specialization for level 0: does an in-memory merge if possible (LiveBucket::mergeInMemory), falls back to prepare() otherwise.commit() — Resolves any pending merge and sets result as new curr.snap() — Moves curr to snap, resets curr to empty bucket.LiveBucketList — Extends BucketListBase<LiveBucket>. Adds eviction-related methods (updateStartingEvictionIterator, updateEvictionIterAndRecordStats, checkIfEvictionScanIsStuck) and addBatch() which calls addBatchInternal() with init/live/dead entry vectors. Also maybeInitializeCaches() for index random eviction caches.
HotArchiveBucketList — Extends BucketListBase<HotArchiveBucket>. Simpler addBatch() with archived/restored entry vectors.
BucketManager — Singleton owner of the BucketList instances, bucket files, and merge futures. Thread-safe for bucket file operations via mBucketMutex. Key responsibilities:
adoptFileAsBucket() moves temp files into the bucket directory, deduplicating by hash. forgetUnreferencedBuckets() GCs unreferenced buckets. cleanupStaleFiles() removes orphaned files.mLiveBucketFutures / mHotArchiveBucketFutures (hash maps of MergeKey → shared_future) track running merges. mFinishedMerges (BucketMergeMap) records completed merge input→output mappings for reattachment.addLiveBatch() and addHotArchiveBatch() feed new entries from ledger close into the BucketList.snapshotLedger() computes the bucketListHash for the LedgerHeader.startBackgroundEvictionScan() launches async eviction scan on a snapshot; resolveBackgroundEvictionScan() applies results.assumeState() loads BucketList from a HistoryArchiveState. loadCompleteLedgerState() materializes the full ledger into a map.maybeSetIndex() sets the index for a bucket, handling race conditions on startup.LiveBucketList, HotArchiveBucketList, BucketSnapshotManager, TmpDirManager, bucket maps (mSharedLiveBuckets, mSharedHotArchiveBuckets), merge future maps, BucketMergeMap.FutureBucket<BucketT> — Wraps a std::shared_future<shared_ptr<BucketT>> representing an in-progress or completed merge. Has 5 states: FB_CLEAR, FB_HASH_OUTPUT, FB_HASH_INPUTS, FB_LIVE_OUTPUT, FB_LIVE_INPUTS. Serializable via cereal (saves/loads hash strings). Key lifecycle:
FB_LIVE_INPUTS, immediately calls startMerge().startMerge() checks for existing merge via BucketManager::getMergeFuture() (reattachment). If none, creates a packaged_task that calls BucketT::merge() and posts it to a background worker thread.mergeComplete() polls the future. resolve() blocks to get result → FB_LIVE_OUTPUT.FB_HASH_INPUTS or FB_HASH_OUTPUT. makeLive() reconstitutes from hashes.MergeKey — Identifies a merge by its inputs: keepTombstoneEntries, inputCurrHash, inputSnapHash, shadowHashes. Used as key in merge future/finished maps.
BucketMergeMap — Bidirectional weak mapping of merge input→output and output→input. Stores MergeKey→Hash, Hash→Hash (input→output multimap), and Hash→MergeKey (output→input multimap). Used for merge reattachment: if a merge's output bucket still exists, we can synthesize a pre-resolved future instead of re-running the merge.
MergeInput<BucketT> (abstract), FileMergeInput<BucketT>, MemoryMergeInput<BucketT> — Adapters providing a uniform interface over either BucketInputIterator pairs (file-based merge) or in-memory vector<EntryT> pairs. Methods: isDone(), oldFirst(), newFirst(), equalKeys(), getOldEntry(), getNewEntry(), advanceOld(), advanceNew().
BucketInputIterator<BucketT> — Reads entries sequentially from a bucket file via XDRInputFileStream. Auto-extracts the leading METAENTRY. Provides operator*, operator++, pos(), seek(), size().
BucketOutputIterator<BucketT> — Writes entries to a temp file, computing a running SHA256 hash. put() buffers one entry to deduplicate adjacent same-key entries. getBucket() finalizes the file, calls BucketManager::adoptFileAsBucket(), and returns the new bucket. Respects keepTombstoneEntries to elide tombstones at the bottom level. Writes a METAENTRY at the start if protocol version ≥ 11.
BucketListSnapshotData<BucketT> — Immutable snapshot of a BucketList: a vector of Level{curr, snap} (shared_ptr to const buckets) plus a LedgerHeader. Thread-safe to share.
SearchableBucketListSnapshot<BucketT> — Provides lookup functionality over a snapshot. Each instance owns mutable file stream caches (mStreams) for I/O. Key methods:
load(LedgerKey) — Point lookup: iterates buckets newest-to-oldest, returns first match via index lookup + file read. Returns the LoadT (LedgerEntry for live, HotArchiveBucketEntry for hot archive).loadKeysFromBucket() — Bulk scan: uses index scan() iterator for sequential multi-key lookup within a bucket.loadKeysInternal() — Loads keys from all buckets, supports historical snapshots.loopAllBuckets() — Iterates all non-empty bucket (curr, snap) across levels, calling a function. Stops early on Loop::COMPLETE.getBucketEntry() — Single-key lookup via index: CACHE_HIT returns cached entry, FILE_OFFSET reads from disk, NOT_FOUND skips.SearchableLiveBucketListSnapshot — Extends the base with live-specific queries:
loadKeys() — Bulk load with timer.loadPoolShareTrustLinesByAccountAndAsset() — Two-step query: index lookup for PoolIDs, then bulk trustline load.loadInflationWinners() — Legacy inflation vote counting.scanForEviction() — Background eviction scan: iterates bucket region, collects expired entries.scanForEntriesOfType() — Iterates entries of a given LedgerEntryType using type range bounds.SearchableHotArchiveBucketListSnapshot — Hot archive queries: loadKeys(), scanAllEntries().
BucketSnapshotManager — Thread-safe boundary between main-thread BucketList mutations and read-only snapshots. Holds canonical snapshots behind a SharedMutex. Key methods:
updateCurrentSnapshot() — Called by main thread after BucketList changes. Takes exclusive lock, rotates historical snapshots.copySearchableLiveBucketListSnapshot() / copySearchableHotArchiveBucketListSnapshot() — Creates a new Searchable*Snapshot with fresh stream caches pointing to the current snapshot data.maybeCopySearchableBucketListSnapshot() — Refreshes a snapshot only if a newer one is available (shared lock).maybeCopyLiveAndHotArchiveSnapshots() — Atomically refreshes both live and hot archive snapshots for consistency.LiveBucketIndex — Wraps either an InMemoryIndex (small buckets) or DiskIndex<LiveBucket> (large buckets), selected based on config (BUCKETLIST_DB_INDEX_CUTOFF). Additionally owns an optional RandomEvictionCache for ACCOUNT entries. Key methods:
lookup(LedgerKey) — Returns IndexReturnT (CACHE_HIT, FILE_OFFSET, or NOT_FOUND).scan(IterT, LedgerKey) — Sequential scan for bulk loads.getPoolIDsByAsset() — Returns PoolIDs for asset-based trustline queries.maybeInitializeCache() — Lazily initializes the random eviction cache proportional to bucket's share of total accounts.typeNotSupported() — Returns true for OFFER type (offers are loaded from SQL during catchup, not BucketListDB).BUCKET_INDEX_VERSION = 6.HotArchiveBucketIndex — Always uses DiskIndex<HotArchiveBucket> (no in-memory index, no cache). Version: BUCKET_INDEX_VERSION = 0.
DiskIndex<BucketT> — Persisted range-based index. Contains:
RangeIndex (vector<pair<RangeEntry, streamoff>>) — Maps key ranges to file offsets (page boundaries).BinaryFuseFilter16 — Bloom-filter-like structure for quick negative lookups.AssetPoolIDMap — Asset→PoolID mapping (LiveBucket only).BucketEntryCounters — Per-type entry counts and sizes.typeRanges — Map of LedgerEntryType → (startOffset, endOffset) for type-specific scans.InMemoryIndex — For small buckets. Uses InMemoryBucketState (an unordered_set<InternalInMemoryBucketEntry>) to store all entries in memory. InternalInMemoryBucketEntry uses type-erasure to allow lookup by LedgerKey in a set of BucketEntry (C++20 heterogeneous lookup workaround).
IndexReturnT — Variant return type from index queries: IndexPtrT (cache hit), std::streamoff (file offset), or std::monostate (not found).
BucketIndexUtils — Free functions: createIndex() builds a new index from a bucket file; loadIndex() loads a persisted index from disk; getPageSizeFromConfig().
LedgerEntryIdCmp — Compares LedgerEntry or LedgerKey by identity (type, then type-specific key fields). Used for sorted sets and merge ordering.
BucketEntryIdCmp<BucketT> — Compares BucketEntry or HotArchiveBucketEntry by their embedded ledger keys. METAENTRY sorts below all others. Handles cross-type comparisons (LIVEENTRY vs DEADENTRY, ARCHIVED vs LIVE).
BucketApplicator — Applies a LiveBucket to the database during history catchup. Processes entries in scheduler-friendly batches (LEDGER_ENTRY_BATCH_COMMIT_SIZE). Only applies offers (seeks to offer range using type index). Tracks seenKeys to avoid applying shadowed entries. Handles pre-v11 entries that lack INITENTRY.EvictionResultEntry — A single eviction candidate: the LedgerEntry, its EvictionIterator position, and liveUntilLedger.EvictionResultCandidates — Collection of eviction candidates from a background scan, with validity checks against archival settings.EvictedStateVectors — Final eviction output: deletedKeys (temp entries + TTLs) and archivedEntries (persistent entries).EvictionStatistics — Tracks eviction cycle metrics (entry age, cycle period).EvictionMetrics — Medida metrics for eviction (entries evicted, bytes scanned, blocking/background time).MergeCounters — Fine-grained counters for merge operations (entry types processed, shadow elisions, reattachments, annihilations). Not published via medida; used for internal tracking and testing.BucketEntryCounters — Per-LedgerEntryTypeAndDurability counts and sizes. Stored in indexes, aggregated across buckets.LedgerEntryTypeAndDurability — Finer-grained enum distinguishing TEMPORARY vs PERSISTENT CONTRACT_DATA.BucketManager::addLiveBatch() → LiveBucketList::addBatch() → addBatchInternal().levelShouldSpill(ledger, i-1), then levels[i-1].snap() + levels[i].commit() + levels[i].prepare().prepareFirstLevel() — creates fresh in-memory bucket, merges with curr in-memory via LiveBucket::mergeInMemory(), commits immediately (synchronous, no background thread).prepare() creates a FutureBucket which launches a background merge task via app.postOnBackgroundThread().resolveAnyReadyFutures() non-blockingly resolves any completed merges.BucketSnapshotManager::updateCurrentSnapshot() creates new immutable snapshot data.MergeKey from inputs. Checks BucketManager::getMergeFuture() for reattachment.packaged_task that calls BucketT::merge().merge() opens BucketInputIterators on old and new buckets, creates BucketOutputIterator, then calls mergeInternal().mergeInternal() dispatches: if entries have different keys, the lesser key is accepted via maybePut(); if keys are equal, mergeCasesWithEqualKeys() handles INIT/DEAD annihilation, lifecycle promotion, and shadow checks.BucketOutputIterator::getBucket() finalizes → BucketManager::adoptFileAsBucket() → file rename into bucket dir, index construction, merge tracking.SearchableLiveBucketListSnapshot from BucketSnapshotManager.load(LedgerKey) iterates all buckets via loopAllBuckets().getBucketEntry() calls index.lookup() → returns CACHE_HIT (cached BucketEntry), FILE_OFFSET (seek + read page), or NOT_FOUND (skip bucket).BucketManager::startBackgroundEvictionScan() posts a task to eviction background thread.SearchableLiveBucketListSnapshot::scanForEviction(), which iterates through a region of the bucket file, collecting candidates whose TTL entries are expired.resolveBackgroundEvictionScan(): validates candidates against current ledger state, evicts up to maxEntriesToArchive, updates eviction iterator in network config.BucketManager, LiveBucketList, HotArchiveBucketList. Calls addBatch(), snapshotLedger(), forgetUnreferencedBuckets(). Updates canonical snapshots in BucketSnapshotManager.app.postOnBackgroundThread()): Run FutureBucket merges. Call BucketT::merge(), adoptFileAsBucket(). Access mBucketMutex for file operations and future maps.scanForEviction() on a snapshot.SearchableBucketListSnapshot copies from BucketSnapshotManager. Each snapshot has its own file stream cache. Snapshot data is immutable and shared via shared_ptr.mBucketMutex (RecursiveMutex): Guards bucket file maps, future maps, finished merge map. Must be acquired AFTER LedgerManagerImpl::mLedgerStateMutex.mSnapshotMutex (SharedMutex in BucketSnapshotManager): Exclusive for updateCurrentSnapshot(), shared for copying snapshots.mCacheMutex (shared_mutex in LiveBucketIndex): Guards the random eviction cache.BucketManager
├── LiveBucketList (unique_ptr)
│ └── vector<BucketLevel<LiveBucket>>
│ ├── mCurr: shared_ptr<LiveBucket>
│ │ ├── mFilename, mHash, mSize
│ │ ├── mIndex: shared_ptr<LiveBucketIndex const>
│ │ │ ├── DiskIndex<LiveBucket> (or InMemoryIndex)
│ │ │ └── RandomEvictionCache (optional)
│ │ └── mEntries: unique_ptr<vector<BucketEntry>> (level 0 only)
│ ├── mSnap: shared_ptr<LiveBucket>
│ └── mNextCurr: variant<FutureBucket<LiveBucket>, shared_ptr<LiveBucket>>
│ └── FutureBucket holds shared_future + input/output bucket refs
├── HotArchiveBucketList (unique_ptr)
│ └── vector<BucketLevel<HotArchiveBucket>> (same structure)
├── BucketSnapshotManager (unique_ptr)
│ ├── mCurrLiveSnapshot: shared_ptr<BucketListSnapshotData<LiveBucket>>
│ ├── mCurrHotArchiveSnapshot: shared_ptr<BucketListSnapshotData<HotArchiveBucket>>
│ └── historical snapshot maps
├── mSharedLiveBuckets: map<Hash, shared_ptr<LiveBucket>>
├── mSharedHotArchiveBuckets: map<Hash, shared_ptr<HotArchiveBucket>>
├── mLiveBucketFutures: map<MergeKey, shared_future<shared_ptr<LiveBucket>>>
├── mHotArchiveBucketFutures: map<MergeKey, shared_future<shared_ptr<HotArchiveBucket>>>
├── mFinishedMerges: BucketMergeMap
├── TmpDirManager (unique_ptr)
└── Config (copy, thread-safe)
Ledger close → BucketList: LedgerManager → BucketManager::addLiveBatch(initEntries, liveEntries, deadEntries) → LiveBucketList::addBatch() → spill cascade through levels → background merges.
BucketList → Snapshot: After addBatch(), main thread calls BucketSnapshotManager::updateCurrentSnapshot() → creates new immutable BucketListSnapshotData from current BucketList state.
Snapshot → Query: Background threads call BucketSnapshotManager::copySearchableLiveBucketListSnapshot() → gets a SearchableLiveBucketListSnapshot with fresh file streams → load() / loadKeys() for point and bulk queries.
Eviction flow: Main thread → startBackgroundEvictionScan() → eviction thread scans snapshot → resolveBackgroundEvictionScan() on main thread applies evictions to AbstractLedgerTxn → evicted persistent entries flow to addHotArchiveBatch() → HotArchiveBucketList.
Catchup flow: HistoryManager downloads bucket files → BucketManager::assumeState(HistoryArchiveState) → sets curr/snap on each level, restarts merges → BucketApplicator applies offers to database.
Merge reattachment: On restart, FutureBucket::makeLive() reconstitutes from hashes → startMerge() checks BucketManager::getMergeFuture() → if finished merge exists in BucketMergeMap, synthesizes a pre-resolved future; otherwise re-launches background merge.