mirror of
https://github.com/deepseek-ai/3FS
synced 2025-06-26 18:16:45 +00:00
chunk-engine
Design
- The entire Chunk Engine can be divided into two components:
- Allocator: Responsible for allocating/reclaiming chunks and modifying memory states.
- MetaStore: Responsible for persisting allocation/reclamation events.
- Workflow for writing a new chunk:
- The Allocator assigns a new chunk position, pointing to a disk space (purely in-memory operation).
- Write data to this chunk position. If a power failure or write failure occurs at this stage, no existing data is affected.
- Generate corresponding chunk metadata and persist it alongside the allocation event to the MetaStore. Using RocksDB's WriteBatch ensures atomic updates—the entire write operation either succeeds or fails, with no intermediate states.
- Maintaining the Allocator's in-memory state:
- At startup, the Allocator quickly loads all allocation information from RocksDB.
- Allocation is performed in-memory first, followed by persistence. If a failure occurs before persistence, the allocation event is lost.
- Reclamation first persists the event to disk, then modifies the memory state. Even if a chunk deletion event is persisted, the chunk remains readable as long as memory holds its reference.
- This ensures conflict-free read/write operations: a read operation acquires a chunk reference, guaranteeing the chunk's validity until the read completes.
- Use `Arc` to manage ownership of chunk position:
- For allocation, returns an `Arc<ChunkPos>`. If persistence fails, the position is automatically released when the `Arc` is dropped.
- Read operations also return an `Arc<ChunkPos>`, ensuring safe data access even during concurrent writes or deletions.
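
The ownership rules above can be illustrated with a minimal sketch. It is not the engine's actual code: `FreeList`, the fields of `ChunkPos`, and the `allocate` and `demo` functions are hypothetical names chosen for illustration, and the "disk" is just a list of free offsets. The point is that a position returns to the free pool only when the last `Arc<ChunkPos>` is dropped, which covers both a failed persistence step and a read that overlaps a deletion.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical pool of free slot offsets shared by all position handles.
type FreeList = Arc<Mutex<Vec<u64>>>;

/// Hypothetical handle to an on-disk slot. Dropping the last
/// `Arc<ChunkPos>` pushes the slot back onto the free list.
struct ChunkPos {
    offset: u64,
    free_list: FreeList,
}

impl Drop for ChunkPos {
    fn drop(&mut self) {
        self.free_list.lock().unwrap().push(self.offset);
    }
}

/// Purely in-memory allocation: pop a free slot and wrap it in an `Arc`.
fn allocate(free_list: &FreeList) -> Option<Arc<ChunkPos>> {
    let offset = free_list.lock().unwrap().pop()?;
    Some(Arc::new(ChunkPos { offset, free_list: Arc::clone(free_list) }))
}

/// Walks through both scenarios and records the number of free slots
/// observed after each step.
fn demo() -> Vec<usize> {
    let free_list: FreeList = Arc::new(Mutex::new(vec![0, 65536]));
    let free = |fl: &FreeList| fl.lock().unwrap().len();
    let mut counts = Vec::new();

    // Scenario 1: persistence fails, so the caller simply drops the handle.
    let pos = allocate(&free_list).unwrap();
    counts.push(free(&free_list)); // one slot in use
    drop(pos);
    counts.push(free(&free_list)); // slot released automatically

    // Scenario 2: a reader holds a reference while the chunk is deleted.
    let mut cache: HashMap<u64, Arc<ChunkPos>> = HashMap::new();
    cache.insert(1, allocate(&free_list).unwrap());
    let reader = Arc::clone(&cache[&1]);
    cache.remove(&1); // deletion drops the cache's reference only
    counts.push(free(&free_list)); // slot still held by the reader
    drop(reader);
    counts.push(free(&free_list)); // last reference gone, slot reclaimed
    counts
}
```

In this sketch the reclamation logic lives entirely in `Drop`, so no code path has to remember to release a position explicitly.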
Allocator
Storage hierarchy:
- Chunk: Basic data unit, currently proposed as 64KB, 512KB, and 4MB.
- Group: Each group contains 256 chunks (16MB, 128MB, or 1GB depending on chunk size).
- File: For 512KB chunks, a single file (~120GB) contains ~960 groups.
- Disk: Single disk capacity of 30TB, divided into 256 files per chunk size.
- Node: A single node contains 10–20 disks.
This configuration supports up to ~1.2 billion chunks and ~5 million groups per machine.
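
The figures above can be checked with a short calculation. The sketch below assumes binary units (1 KB = 1024 bytes), a node with 20 disks, and all space dedicated to 512KB chunks; these assumptions are not stated in the text, and the constant and function names are illustrative.

```rust
const KIB: u64 = 1024;
const GIB: u64 = 1024 * 1024 * 1024;
const TIB: u64 = 1024 * GIB;

const CHUNK_SIZE: u64 = 512 * KIB;
const CHUNKS_PER_GROUP: u64 = 256;
const GROUPS_PER_FILE: u64 = 960;
const FILES_PER_DISK: u64 = 256;
const DISKS_PER_NODE: u64 = 20; // assumed upper end of the 10–20 range

/// 256 chunks of 512 KiB each: 128 MiB.
fn group_size() -> u64 {
    CHUNK_SIZE * CHUNKS_PER_GROUP
}

/// 960 groups of 128 MiB each: 120 GiB.
fn file_size() -> u64 {
    group_size() * GROUPS_PER_FILE
}

/// 256 files of 120 GiB each: 30 TiB.
fn disk_size() -> u64 {
    file_size() * FILES_PER_DISK
}

/// 960 * 256 * 20 = 4,915,200 groups (about 5 million).
fn groups_per_node() -> u64 {
    GROUPS_PER_FILE * FILES_PER_DISK * DISKS_PER_NODE
}

/// 4,915,200 * 256 = 1,258,291,200 chunks (about 1.2 billion).
fn chunks_per_node() -> u64 {
    groups_per_node() * CHUNKS_PER_GROUP
}
```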
Implementation details:
- Each group uses a 256-bit bitset (4 `uint64_t`) to track allocation status.
- Maintain three in-memory structures:
- `allocated_groups`: Groups with allocated space but no chunks assigned.
- `unallocated_groups`: Groups without allocated space.
- `active_groups`: Map of `<group_id, group_state>` tracking allocation status.
- Chunk allocation workflow:
- Prioritize finding free slots in `active_groups` using `__builtin_ctz` for fast bitwise operations.
- If `active_groups` is empty, acquire a new group from `allocated_groups`.
- If `allocated_groups` is empty, fetch a group from `unallocated_groups` and allocate disk space synchronously.
- Background threads:
- `allocate_thread`: Maintains `active_groups` within a target size range to ensure in-memory allocation efficiency.
- `compact_thread`: Periodically scans `active_groups`, migrates all chunks from selected groups, releases space, and returns groups to `allocated_groups`.
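
The allocation path described above can be sketched as follows. This is an illustrative model, not the engine's implementation: the `GroupState` and `Allocator` layouts are assumptions, the background threads are omitted, and reserving disk space for an unallocated group is reduced to a comment. Rust's `u64::trailing_zeros` plays the role of the count-trailing-zeros builtin.

```rust
use std::collections::BTreeMap;

/// Hypothetical group state: bit `i` set means slot `i` is allocated.
#[derive(Default, Clone, Copy)]
struct GroupState {
    bits: [u64; 4], // 4 x 64 = 256 slots
}

impl GroupState {
    /// Claims the lowest free slot, or returns `None` if all 256 are taken.
    fn allocate(&mut self) -> Option<u32> {
        for (i, word) in self.bits.iter_mut().enumerate() {
            if *word != u64::MAX {
                // Trailing zeros of the inverted word locate the first 0 bit.
                let bit = (!*word).trailing_zeros();
                *word |= 1u64 << bit;
                return Some(i as u32 * 64 + bit);
            }
        }
        None
    }

    /// Marks a slot as free again.
    fn release(&mut self, slot: u32) {
        self.bits[(slot / 64) as usize] &= !(1u64 << (slot % 64));
    }
}

/// Hypothetical allocator holding the three in-memory structures.
struct Allocator {
    active_groups: BTreeMap<u64, GroupState>,
    allocated_groups: Vec<u64>,
    unallocated_groups: Vec<u64>,
}

impl Allocator {
    /// Returns `(group_id, slot)` for a newly claimed chunk position.
    fn allocate(&mut self) -> Option<(u64, u32)> {
        // 1. Prefer a free slot in a group that is already active.
        for (id, state) in self.active_groups.iter_mut() {
            if let Some(slot) = state.allocate() {
                return Some((*id, slot));
            }
        }
        // 2. Otherwise take a group whose disk space is already reserved.
        // 3. Failing that, take an unallocated group; a real engine would
        //    reserve its disk space synchronously at this point.
        let id = self
            .allocated_groups
            .pop()
            .or_else(|| self.unallocated_groups.pop())?;
        let mut state = GroupState::default();
        let slot = state.allocate()?;
        self.active_groups.insert(id, state);
        Some((id, slot))
    }
}
```

As a simplification, full groups stay in `active_groups` here and are skipped during the scan; an engine tuned for throughput would likely keep only groups with free slots in that map.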
MetaStore
Persists three mappings:
- `chunk_id -> chunk_meta`: Metadata includes chunk location, length, hash, version, etc., serialized using `derse`.
- `group_id -> group_state`: Tracks chunk allocation status within groups, leveraging RocksDB's MergeOp for atomic updates.
- `chunk_pos -> chunk_id`: Maps physical positions to chunk IDs, used by `compact_thread` during chunk migration.
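
To show why a merge operation suits the `group_id -> group_state` mapping, here is a conceptual sketch of folding queued bit updates into a stored 256-bit state. It does not use the RocksDB API, and the operand encoding (`BitOp`) and the function name are assumptions made for illustration; the real operand format is not described here.

```rust
/// Hypothetical merge operand: set or clear one slot of a group's state.
#[derive(Clone, Copy)]
enum BitOp {
    Allocate(u8),
    Release(u8),
}

/// Folds pending operands into the stored value, in order, the way a
/// merge operator combines an existing value with queued updates.
/// A missing value is treated as an empty group.
fn merge_group_state(existing: Option<[u64; 4]>, operands: &[BitOp]) -> [u64; 4] {
    let mut bits = existing.unwrap_or([0; 4]);
    for op in operands {
        match *op {
            BitOp::Allocate(slot) => bits[(slot / 64) as usize] |= 1u64 << (slot % 64),
            BitOp::Release(slot) => bits[(slot / 64) as usize] &= !(1u64 << (slot % 64)),
        }
    }
    bits
}
```

Because each operand touches a single bit, writers can record an allocation or release without first reading the group's current state.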
Chunk Engine
- MetaCache: Maintains an in-memory `chunk_id -> chunk_info` mapping, where `chunk_info` includes `chunk_meta` and `Arc<ChunkPos>`.
- Read operation: Returns `chunk_info`. The `Arc<ChunkPos>` ensures safe data access until the read completes.
- Write operation workflow:
- Query `MetaCache` to retrieve the current `chunk_info`.
- Invoke `Allocator::allocate()` to obtain a new chunk position.
- Read existing chunk data, write it to the new chunk position, append the new write request, and generate `new_chunk_info`.
- Persist `new_chunk_info` to the MetaStore along with a release record for the original chunk position.
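
The write workflow above is a copy-on-write update, which can be sketched as follows. Everything here is a simplified stand-in: `Engine`, `Batch`, and the field names are hypothetical, the disk and the MetaStore are in-memory collections, the allocator is a counter, and the old position's data is never reclaimed.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical position handle; a real one would reclaim space on drop.
struct ChunkPos(u64);

#[derive(Clone)]
struct ChunkInfo {
    len: usize,
    version: u64,
    pos: Arc<ChunkPos>,
}

/// One atomic MetaStore batch: the new metadata plus a release record
/// for the position being replaced.
struct Batch {
    chunk_id: u64,
    new_pos: u64,
    released_pos: Option<u64>,
}

#[derive(Default)]
struct Engine {
    meta_cache: HashMap<u64, ChunkInfo>,
    disk: HashMap<u64, Vec<u8>>, // stand-in for on-disk chunk slots
    meta_store: Vec<Batch>,      // stand-in for persisted batches
    next_pos: u64,               // stand-in for Allocator::allocate()
}

impl Engine {
    fn write(&mut self, chunk_id: u64, offset: usize, data: &[u8]) {
        // 1. Query the cache for the current chunk_info, if any.
        let old = self.meta_cache.get(&chunk_id).cloned();
        // 2. Obtain a new chunk position (in-memory only).
        let pos_id = self.next_pos;
        self.next_pos += 1;
        // 3. Copy existing data, apply the new write, store at the new position.
        let mut buf = match &old {
            Some(info) => self.disk[&info.pos.0][..info.len].to_vec(),
            None => Vec::new(),
        };
        if buf.len() < offset + data.len() {
            buf.resize(offset + data.len(), 0);
        }
        buf[offset..offset + data.len()].copy_from_slice(data);
        let new_info = ChunkInfo {
            len: buf.len(),
            version: old.as_ref().map_or(1, |info| info.version + 1),
            pos: Arc::new(ChunkPos(pos_id)),
        };
        self.disk.insert(pos_id, buf);
        // 4. Persist new metadata and the release record together, then
        //    publish the new chunk_info to readers.
        self.meta_store.push(Batch {
            chunk_id,
            new_pos: pos_id,
            released_pos: old.map(|info| info.pos.0),
        });
        self.meta_cache.insert(chunk_id, new_info);
    }

    fn read(&self, chunk_id: u64) -> Option<Vec<u8>> {
        let info = self.meta_cache.get(&chunk_id)?;
        Some(self.disk[&info.pos.0][..info.len].to_vec())
    }
}
```

Since the old position is untouched until the batch is recorded, a failure before step 4 leaves the previous version of the chunk fully intact.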