smallpond/docs/source/api/nodes.rst
Runji Wang 770aa417d5 init
2025-02-27 17:23:53 +08:00

90 lines
2.1 KiB
ReStructuredText

.. currentmodule:: smallpond.logical.node
.. _nodes:
Nodes
=====
Nodes represent the fundamental building blocks of a data processing pipeline. Each node encapsulates a specific operation or transformation that can be applied to a dataset.
Nodes can be chained together to form a logical plan, which is a directed acyclic graph (DAG) of nodes that represent the overall data processing workflow.
A typical workflow to create a logical plan is as follows:
.. code-block:: python
# Create a global context
ctx = Context()
# Create a dataset
dataset = ParquetDataSet("path/to/dataset/*.parquet")
# Create a data source node
node = DataSourceNode(ctx, dataset)
# Partition the data
node = DataSetPartitionNode(ctx, (node,), npartitions=2)
# Create a SQL engine node to transform the data
node = SqlEngineNode(ctx, (node,), "SELECT * FROM {0}")
# Create a logical plan from the root node
plan = LogicalPlan(ctx, node)
You can then create tasks from the logical plan, see :ref:`tasks`.
Notable properties of Node:
1. Nodes are partitioned. Each Node generates a series of tasks, with each task processing one partition of data.
2. The input and output of a Node are a series of partitioned Datasets. A Node may write data to shared storage and return a new Dataset, or it may simply recombine the input Datasets.
Context
-------
.. autosummary::
:toctree: ../generated
Context
NodeId
LogicalPlan
-----------
.. autosummary::
:toctree: ../generated
LogicalPlan
LogicalPlanVisitor
.. Planner
Nodes
-----
.. autosummary::
:toctree: ../generated
Node
DataSetPartitionNode
ArrowBatchNode
ArrowComputeNode
ArrowStreamNode
ConsolidateNode
DataSinkNode
DataSourceNode
EvenlyDistributedPartitionNode
HashPartitionNode
LimitNode
LoadPartitionedDataSetNode
PandasBatchNode
PandasComputeNode
PartitionNode
ProjectionNode
PythonScriptNode
RangePartitionNode
RepeatPartitionNode
RootNode
ShuffleNode
SqlEngineNode
UnionNode
UserDefinedPartitionNode
UserPartitionedDataSourceNode