Runji Wang
2025-02-25 18:16:31 +08:00
commit 770aa417d5
77 changed files with 18785 additions and 0 deletions


@@ -0,0 +1,104 @@
.. _dataframe:

DataFrame
=========

DataFrame is the main class in smallpond. It represents a lazily computed, partitioned data set.

A typical workflow looks like this:

.. code-block:: python

   import smallpond

   sp = smallpond.init()

   df = sp.read_parquet("path/to/dataset/*.parquet")
   df = df.repartition(10)
   df = df.map("x + 1")
   df.write_parquet("path/to/output")

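
The lazy, partitioned behaviour described above can be illustrated with a small, self-contained sketch. This is not smallpond code: ``ToyLazyFrame`` and its internals are hypothetical, and only the method names are borrowed from the API listed on this page. The sketch assumes that a transformation merely records a recipe and that work happens only when the data is consumed.

.. code-block:: python

   from typing import Callable, List, Optional


   class ToyLazyFrame:
       """Toy stand-in for a lazily computed, partitioned data set."""

       def __init__(self, make_partitions: Callable[[], List[List[int]]]):
           self._make_partitions = make_partitions  # deferred recipe
           self._cache: Optional[List[List[int]]] = None  # filled on first compute

       def repartition(self, n: int) -> "ToyLazyFrame":
           def recipe() -> List[List[int]]:
               rows = [row for part in self.compute() for row in part]
               # Round-robin split into n partitions.
               return [rows[i::n] for i in range(n)]

           return ToyLazyFrame(recipe)

       def map(self, fn: Callable[[int], int]) -> "ToyLazyFrame":
           def recipe() -> List[List[int]]:
               return [[fn(row) for row in part] for part in self.compute()]

           return ToyLazyFrame(recipe)

       def is_computed(self) -> bool:
           return self._cache is not None

       def compute(self) -> List[List[int]]:
           # Run the recipe once and remember the result.
           if self._cache is None:
               self._cache = self._make_partitions()
           return self._cache

       def take_all(self) -> List[int]:
           return [row for part in self.compute() for row in part]


   source = ToyLazyFrame(lambda: [[1, 2, 3, 4, 5, 6]])
   df = source.repartition(3).map(lambda x: x + 1)
   assert not df.is_computed()      # building the chain runs nothing
   result = sorted(df.take_all())   # consuming the data triggers the chain
   assert df.is_computed()

Each transformation returns a new object and leaves its input untouched, which is what allows the chained style used in the workflow above.
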
Initialization
--------------

.. autosummary::
   :toctree: ../generated

   smallpond.init

.. currentmodule:: smallpond.dataframe

.. _loading_data:

Loading Data
------------

.. autosummary::
   :toctree: ../generated

   Session.from_items
   Session.from_arrow
   Session.from_pandas
   Session.read_csv
   Session.read_json
   Session.read_parquet

.. _partitioning_data:

Partitioning Data
-----------------

.. autosummary::
   :toctree: ../generated

   DataFrame.repartition

.. _transformations:

Transformations
---------------

Apply transformations and return a new DataFrame.

.. autosummary::
   :toctree: ../generated

   Session.partial_sql
   DataFrame.map
   DataFrame.map_batches
   DataFrame.flat_map
   DataFrame.filter
   DataFrame.limit
   DataFrame.partial_sort
   DataFrame.random_shuffle

.. _consuming_data:

Consuming Data
--------------

These operations will trigger execution of the lazy transformations performed on this DataFrame.

.. autosummary::
   :toctree: ../generated

   DataFrame.count
   DataFrame.take
   DataFrame.take_all
   DataFrame.to_arrow
   DataFrame.to_pandas
   DataFrame.write_parquet
   DataFrame.write_parquet_lazy

Execution
---------

DataFrames are lazily computed. You can use these methods to manually trigger computation.

.. autosummary::
   :toctree: ../generated

   DataFrame.compute
   DataFrame.is_computed
   DataFrame.recompute
   Session.wait


@@ -0,0 +1,28 @@
.. currentmodule:: smallpond.logical.dataset

Dataset
=======

A DataSet represents a collection of files.

To create a dataset:

.. code-block:: python

   from smallpond.logical.dataset import ParquetDataSet
   dataset = ParquetDataSet("path/to/dataset/*.parquet")

DataSets
--------

.. autosummary::
   :toctree: ../generated

   DataSet
   FileSet
   ParquetDataSet
   CsvDataSet
   JsonDataSet
   ArrowTableDataSet
   PandasDataSet
   PartitionedDataSet
   SqlQueryDataSet


@@ -0,0 +1,85 @@
.. currentmodule:: smallpond.execution

.. _execution:

Execution
=========

Submit a Job
------------

After constructing a LogicalPlan, you can use the JobManager to create a Job in the cluster that executes it. In most cases, however, it is enough to use the Driver as the entry point of your script and submit the plan through it. The Driver is a simple wrapper around the JobManager: it reads the configuration from the command-line arguments and passes it to the JobManager.

.. code-block:: python

   from smallpond.execution.driver import Driver

   if __name__ == "__main__":
       driver = Driver()
       # add your own arguments
       driver.add_argument("-i", "--input_paths", nargs="+")
       driver.add_argument("-n", "--npartitions", type=int, default=10)
       # build and run logical plan
       # (my_pipeline is your own function that returns a LogicalPlan)
       plan = my_pipeline(**driver.get_arguments())
       driver.run(plan)

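
The wrapper pattern described above can be sketched with the standard library alone. The following is a hypothetical ``ToyDriver`` built on ``argparse``. It is not the smallpond ``Driver``, and the way it separates user-defined arguments from its own ``--platform`` option is an assumption made for illustration.

.. code-block:: python

   import argparse
   from typing import Any, Callable, Dict, List, Optional


   class ToyDriver:
       """Hypothetical thin wrapper: parses CLI options and forwards them."""

       def __init__(self, runner: Callable[..., Any]):
           self._parser = argparse.ArgumentParser()
           # An option owned by the wrapper itself (assumed for illustration).
           self._parser.add_argument("--platform", default=None)
           self._user_keys: List[str] = []
           self._runner = runner  # stands in for the job manager

       def add_argument(self, *flags: str, **kwargs: Any) -> None:
           action = self._parser.add_argument(*flags, **kwargs)
           self._user_keys.append(action.dest)

       def get_arguments(self, argv: Optional[List[str]] = None) -> Dict[str, Any]:
           # Only the user-defined arguments are returned to the pipeline.
           ns = self._parser.parse_args(argv)
           return {key: getattr(ns, key) for key in self._user_keys}

       def run(self, plan: Any, argv: Optional[List[str]] = None) -> Any:
           # The wrapper's own options are passed on to the runner.
           ns = self._parser.parse_args(argv)
           return self._runner(plan, platform=ns.platform)


   def fake_runner(plan: str, platform: Optional[str]) -> str:
       return f"ran {plan} on {platform}"


   driver = ToyDriver(fake_runner)
   driver.add_argument("-i", "--input_paths", nargs="+")
   driver.add_argument("-n", "--npartitions", type=int, default=10)
   argv = ["-i", "a.parquet", "b.parquet", "--platform", "mpi"]
   user_args = driver.get_arguments(argv)
   outcome = driver.run("plan", argv)

Passing ``argv`` explicitly keeps the sketch runnable outside a real command line; a script would normally leave it as ``None`` so that ``argparse`` reads ``sys.argv``.
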
.. autosummary::
   :toctree: ../generated

   ~driver.Driver
   ~manager.JobManager

Scheduler and Executor
----------------------

Scheduler and Executor are lower-level APIs, directly responsible for scheduling and executing tasks, respectively. Users generally do not need to use them directly.

.. autosummary::
   :toctree: ../generated

   ~scheduler.Scheduler
   ~executor.Executor

.. _platform:

Customize Platform
------------------

Smallpond supports user-defined task execution platforms. A Platform includes methods for submitting jobs and a series of default configurations. By default, smallpond automatically detects the current environment and selects the most suitable platform. If it cannot detect one, it uses the default platform.

You can specify a built-in platform via a command-line parameter:

.. code-block:: bash

   # run with a built-in platform
   python script.py --platform mpi

Or implement your own Platform class:

.. code-block:: python

   # path/to/my/platform.py
   from typing import List

   from smallpond.platform import Platform

   class MyPlatform(Platform):
       # parameters are elided here; see the Platform reference
       def start_job(self, ...) -> List[str]:
           ...

.. code-block:: bash

   # run with your platform
   # if using Driver
   python script.py --platform path.to.my.platform
   # if using smallpond.init
   SP_PLATFORM=path.to.my.platform python script.py

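
Selecting a platform by dotted module path, as in the commands above, can be approximated with ``importlib``. The sketch below is a guess at the general technique, not a description of how smallpond resolves platforms: ``ToyPlatform``, ``load_platform`` and the rule "instantiate the first subclass found in the module" are all hypothetical.

.. code-block:: python

   import importlib
   import inspect
   import sys
   import types
   from typing import List


   class ToyPlatform:
       """Hypothetical base class standing in for a platform interface."""

       def start_job(self, num_workers: int) -> List[str]:
           raise NotImplementedError


   def load_platform(module_path: str) -> ToyPlatform:
       """Import ``module_path`` and instantiate the first ToyPlatform subclass."""
       module = importlib.import_module(module_path)
       for _, obj in inspect.getmembers(module, inspect.isclass):
           if issubclass(obj, ToyPlatform) and obj is not ToyPlatform:
               return obj()
       raise ValueError(f"no platform class found in {module_path!r}")


   class DemoPlatform(ToyPlatform):
       def start_job(self, num_workers: int) -> List[str]:
           return [f"worker-{i}" for i in range(num_workers)]


   # Register an in-memory module so the example runs without extra files.
   demo = types.ModuleType("demo_platform")
   demo.DemoPlatform = DemoPlatform
   sys.modules["demo_platform"] = demo

   platform = load_platform("demo_platform")
   job_ids = platform.start_job(2)

In a real project the module would live on disk (for example at ``path/to/my/platform.py``) and would need to be importable from the working directory or ``PYTHONPATH``.
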
.. currentmodule:: smallpond

.. autosummary::
   :toctree: ../generated

   ~platform.Platform
   ~platform.MPI

docs/source/api/nodes.rst Normal file

@@ -0,0 +1,89 @@
.. currentmodule:: smallpond.logical.node

.. _nodes:

Nodes
=====

Nodes represent the fundamental building blocks of a data processing pipeline. Each node encapsulates a specific operation or transformation that can be applied to a dataset.
Nodes can be chained together to form a logical plan, which is a directed acyclic graph (DAG) of nodes that represents the overall data processing workflow.

A typical workflow to create a logical plan is as follows:

.. code-block:: python

   from smallpond.logical.dataset import ParquetDataSet
   from smallpond.logical.node import (
       Context, DataSetPartitionNode, DataSourceNode, LogicalPlan, SqlEngineNode,
   )

   # Create a global context
   ctx = Context()
   # Create a dataset
   dataset = ParquetDataSet("path/to/dataset/*.parquet")
   # Create a data source node
   node = DataSourceNode(ctx, dataset)
   # Partition the data
   node = DataSetPartitionNode(ctx, (node,), npartitions=2)
   # Create a SQL engine node to transform the data
   node = SqlEngineNode(ctx, (node,), "SELECT * FROM {0}")
   # Create a logical plan from the root node
   plan = LogicalPlan(ctx, node)

You can then create tasks from the logical plan; see :ref:`tasks`.

Notable properties of Node:

1. Nodes are partitioned. Each Node generates a series of tasks, with each task processing one partition of data.
2. The input and output of a Node are a series of partitioned Datasets. A Node may write data to shared storage and return a new Dataset, or it may simply recombine the input Datasets.

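
The two properties above (nodes form a DAG, and each node yields one task per partition) can be illustrated with a minimal sketch that does not depend on smallpond. ``ToyNode``, ``topological_order`` and ``expand_tasks`` are hypothetical names, and the partition counts are chosen arbitrarily for the example.

.. code-block:: python

   from dataclasses import dataclass
   from typing import List, Tuple


   @dataclass(frozen=True)
   class ToyNode:
       name: str
       inputs: Tuple["ToyNode", ...] = ()
       npartitions: int = 1


   def topological_order(root: ToyNode) -> List[ToyNode]:
       """Return nodes so that every node appears after all of its inputs."""
       ordered: List[ToyNode] = []
       seen = set()

       def visit(node: ToyNode) -> None:
           if id(node) in seen:
               return
           seen.add(id(node))
           for dep in node.inputs:
               visit(dep)
           ordered.append(node)

       visit(root)
       return ordered


   def expand_tasks(root: ToyNode) -> List[str]:
       """Produce one task label per (node, partition) pair."""
       return [
           f"{node.name}[{i}]"
           for node in topological_order(root)
           for i in range(node.npartitions)
       ]


   source = ToyNode("source")
   partition = ToyNode("partition", (source,))
   sql = ToyNode("sql", (partition,), npartitions=2)
   tasks = expand_tasks(sql)

Here the root node is the last one created, mirroring how the workflow above passes the final node to ``LogicalPlan``.
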
Context
-------

.. autosummary::
   :toctree: ../generated

   Context
   NodeId

LogicalPlan
-----------

.. autosummary::
   :toctree: ../generated

   LogicalPlan
   LogicalPlanVisitor

.. Planner

Nodes
-----

.. autosummary::
   :toctree: ../generated

   Node
   DataSetPartitionNode
   ArrowBatchNode
   ArrowComputeNode
   ArrowStreamNode
   ConsolidateNode
   DataSinkNode
   DataSourceNode
   EvenlyDistributedPartitionNode
   HashPartitionNode
   LimitNode
   LoadPartitionedDataSetNode
   PandasBatchNode
   PandasComputeNode
   PartitionNode
   ProjectionNode
   PythonScriptNode
   RangePartitionNode
   RepeatPartitionNode
   RootNode
   ShuffleNode
   SqlEngineNode
   UnionNode
   UserDefinedPartitionNode
   UserPartitionedDataSourceNode

docs/source/api/tasks.rst Normal file

@@ -0,0 +1,76 @@
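.. currentmodule:: smallpond.execution.task

.. _tasks:

Tasks
=====

A typical workflow to create an execution plan from a logical plan is as follows:

.. code-block:: python

   import socket

   from smallpond.execution.task import JobId, RuntimeContext
   from smallpond.logical.planner import Planner  # module path assumed

   # create a runtime context (data_root: a directory path you choose)
   runtime_ctx = RuntimeContext(JobId.new(), data_root)
   runtime_ctx.initialize(socket.gethostname(), cleanup_root=True)
   # create a logical plan (create_logical_plan: your own function)
   plan = create_logical_plan()
   # create an execution plan
   planner = Planner(runtime_ctx)
   exec_plan = planner.create_exec_plan(plan)

You can then execute the tasks in a scheduler; see :ref:`execution`.
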
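
To make the idea of running tasks in dependency order concrete, here is a small sketch built on the standard-library ``graphlib`` module (Python 3.9+). It is only an illustration of the general technique. The function ``run_tasks`` and the task labels are hypothetical, and nothing here reflects how the smallpond Scheduler is implemented.

.. code-block:: python

   from graphlib import TopologicalSorter
   from typing import Callable, Dict, List


   def run_tasks(
       deps: Dict[str, List[str]],
       actions: Dict[str, Callable[[], None]],
   ) -> List[str]:
       """Run each task after all of its dependencies; return the order used."""
       sorter = TopologicalSorter(deps)
       sorter.prepare()
       order: List[str] = []
       while sorter.is_active():
           # get_ready() yields every task whose dependencies are finished;
           # a real scheduler could hand these to executors in parallel.
           for task in sorted(sorter.get_ready()):
               actions[task]()
               order.append(task)
               sorter.done(task)
       return order


   log: List[str] = []
   deps = {
       "source[0]": [],
       "sql[0]": ["source[0]"],
       "sql[1]": ["source[0]"],
       "sink[0]": ["sql[0]", "sql[1]"],
   }
   actions = {name: (lambda n=name: log.append(n)) for name in deps}
   order = run_tasks(deps, actions)

Sorting the ready set only makes the output deterministic for the example; tasks that become ready together have no ordering constraint between them.
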
RuntimeContext
--------------

.. autosummary::
   :toctree: ../generated

   RuntimeContext
   JobId
   TaskId
   TaskRuntimeId
   PartitionInfo
   PerfStats

ExecutionPlan
-------------

.. autosummary::
   :toctree: ../generated

   ExecutionPlan

Tasks
-----

.. autosummary::
   :toctree: ../generated

   Task
   ArrowBatchTask
   ArrowComputeTask
   ArrowStreamTask
   DataSinkTask
   DataSourceTask
   EvenlyDistributedPartitionProducerTask
   HashPartitionArrowTask
   HashPartitionDuckDbTask
   HashPartitionTask
   LoadPartitionedDataSetProducerTask
   MergeDataSetsTask
   PandasBatchTask
   PandasComputeTask
   PartitionConsumerTask
   PartitionProducerTask
   ProjectionTask
   PythonScriptTask
   RangePartitionTask
   RepeatPartitionProducerTask
   RootTask
   SplitDataSetTask
   SqlEngineTask
   UserDefinedPartitionProducerTask