This commit is contained in:
Runji Wang
2025-02-25 18:16:31 +08:00
commit 770aa417d5
77 changed files with 18785 additions and 0 deletions

37
docs/source/internals.rst Normal file
View File

@@ -0,0 +1,37 @@
Internals
=========
Data Root
---------
Smallpond stores all data in a single directory called data root.
This directory has the following structure:
.. code-block:: bash
data_root
└── 2024-12-11-12-00-28.2cc39990-296f-48a3-8063-78cf6dca460b # job_time.job_id
├── config # configuration and state
│ ├── exec_plan.pickle
│ ├── logical_plan.pickle
│ └── runtime_ctx.pickle
├── log # logs
│ ├── graph.png
│ └── scheduler.log
├── queue # message queue between scheduler and workers
├── output # output data
├── staging # intermediate data
│ ├── DataSourceTask.000001
│ ├── EvenlyDistributedPartitionProducerTask.000002
│ ├── completed_tasks # output dataset of completed tasks
│ └── started_tasks # used for checkpoint
└── temp # temporary data
├── DataSourceTask.000001
└── EvenlyDistributedPartitionProducerTask.000002
Failure Recovery
----------------
Smallpond can recover from failure and resume execution from the last checkpoint.
Checkpoint is task-level. A few tasks, such as `ArrowBatchTask`, support checkpointing at the batch level.