smallpond/docs/source/internals.rst
Runji Wang 770aa417d5 init
2025-02-27 17:23:53 +08:00

Internals
=========

Data Root
---------

Smallpond stores all data in a single directory called the data root.
Each job gets its own subdirectory, named after the job time and job ID, with the following structure:

.. code-block:: bash

   data_root
   └── 2024-12-11-12-00-28.2cc39990-296f-48a3-8063-78cf6dca460b  # job_time.job_id
       ├── config      # configuration and state
       │   ├── exec_plan.pickle
       │   ├── logical_plan.pickle
       │   └── runtime_ctx.pickle
       ├── log         # logs
       │   ├── graph.png
       │   └── scheduler.log
       ├── queue       # message queue between scheduler and workers
       ├── output      # output data
       ├── staging     # intermediate data
       │   ├── DataSourceTask.000001
       │   ├── EvenlyDistributedPartitionProducerTask.000002
       │   ├── completed_tasks  # output dataset of completed tasks
       │   └── started_tasks    # used for checkpoint
       └── temp        # temporary data
           ├── DataSourceTask.000001
           └── EvenlyDistributedPartitionProducerTask.000002

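
When inspecting a data root by hand, it can help to split a job directory name into its two parts and to build the paths of its subdirectories. The following is a minimal sketch, not part of Smallpond's API: ``parse_job_dir_name`` and ``job_paths`` are hypothetical helpers, and the timestamp format ``%Y-%m-%d-%H-%M-%S`` is an assumption inferred from the example name above.

```python
from datetime import datetime
from pathlib import Path
import uuid

# Subdirectories of a job directory, as listed in the layout above.
JOB_SUBDIRS = ("config", "log", "queue", "output", "staging", "temp")


def parse_job_dir_name(name):
    """Split '<job_time>.<job_id>' into a datetime and a UUID.

    The timestamp format is an assumption based on the example name;
    it is not taken from Smallpond's source code.
    """
    job_time_str, job_id_str = name.split(".", 1)
    job_time = datetime.strptime(job_time_str, "%Y-%m-%d-%H-%M-%S")
    return job_time, uuid.UUID(job_id_str)


def job_paths(data_root, job_dir_name):
    """Map each subdirectory name to its path under the job directory."""
    job_root = Path(data_root) / job_dir_name
    return {sub: job_root / sub for sub in JOB_SUBDIRS}
```

For example, ``job_paths("data_root", name)["log"] / "scheduler.log"`` gives the scheduler log location for the job directory ``name``, under the layout shown above.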

Failure Recovery
----------------

Smallpond can recover from a failure and resume execution from the last checkpoint.
Checkpointing is done at the task level. A few tasks, such as ``ArrowBatchTask``, also support checkpointing at the batch level.
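
The general idea of task-level checkpointing can be illustrated with a small, generic sketch. This is not Smallpond's implementation: ``run_with_checkpoints`` and the ``.done`` marker files are hypothetical and only show how a rerun can skip tasks that finished before a failure.

```python
from pathlib import Path


def run_with_checkpoints(tasks, checkpoint_dir):
    """Run (name, fn) pairs in order, skipping tasks already marked as done.

    Hypothetical helper: one empty marker file per completed task stands in
    for whatever state a real scheduler would persist.
    """
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, fn in tasks:
        marker = checkpoint_dir / f"{name}.done"
        if marker.exists():
            continue  # completed in an earlier run, so do not repeat it
        fn()  # an exception here aborts the run before the marker is written
        marker.touch()  # record completion only after the task succeeded
        executed.append(name)
    return executed
```

If the second of two tasks raises an exception, calling ``run_with_checkpoints`` again with the same ``checkpoint_dir`` runs only the second task, because the first one already has a marker. Batch-level checkpointing would apply the same idea to the batches inside a single task.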