mirror of
https://github.com/deepseek-ai/smallpond
synced 2025-05-10 15:41:25 +00:00
90 lines
2.1 KiB
ReStructuredText
90 lines
2.1 KiB
ReStructuredText
.. currentmodule:: smallpond.logical.node
|
|
|
|
.. _nodes:
|
|
|
|
Nodes
|
|
=====
|
|
|
|
Nodes represent the fundamental building blocks of a data processing pipeline. Each node encapsulates a specific operation or transformation that can be applied to a dataset.
|
|
Nodes can be chained together to form a logical plan, which is a directed acyclic graph (DAG) of nodes that represent the overall data processing workflow.
|
|
|
|
A typical workflow to create a logical plan is as follows:
|
|
|
|
.. code-block:: python
|
|
|
|
# Create a global context
|
|
ctx = Context()
|
|
|
|
# Create a dataset
|
|
dataset = ParquetDataSet("path/to/dataset/*.parquet")
|
|
|
|
# Create a data source node
|
|
node = DataSourceNode(ctx, dataset)
|
|
|
|
# Partition the data
|
|
node = DataSetPartitionNode(ctx, (node,), npartitions=2)
|
|
|
|
# Create a SQL engine node to transform the data
|
|
node = SqlEngineNode(ctx, (node,), "SELECT * FROM {0}")
|
|
|
|
# Create a logical plan from the root node
|
|
plan = LogicalPlan(ctx, node)
|
|
|
|
You can then create tasks from the logical plan, see :ref:`tasks`.
|
|
|
|
Notable properties of Node:
|
|
|
|
1. Nodes are partitioned. Each Node generates a series of tasks, with each task processing one partition of data.
|
|
2. The input and output of a Node are a series of partitioned Datasets. A Node may write data to shared storage and return a new Dataset, or it may simply recombine the input Datasets.
|
|
|
|
Context
|
|
-------
|
|
|
|
.. autosummary::
|
|
:toctree: ../generated
|
|
|
|
Context
|
|
NodeId
|
|
|
|
LogicalPlan
|
|
-----------
|
|
|
|
.. autosummary::
|
|
:toctree: ../generated
|
|
|
|
LogicalPlan
|
|
LogicalPlanVisitor
|
|
.. Planner
|
|
|
|
Nodes
|
|
-----
|
|
|
|
.. autosummary::
|
|
:toctree: ../generated
|
|
|
|
Node
|
|
DataSetPartitionNode
|
|
ArrowBatchNode
|
|
ArrowComputeNode
|
|
ArrowStreamNode
|
|
ConsolidateNode
|
|
DataSinkNode
|
|
DataSourceNode
|
|
EvenlyDistributedPartitionNode
|
|
HashPartitionNode
|
|
LimitNode
|
|
LoadPartitionedDataSetNode
|
|
PandasBatchNode
|
|
PandasComputeNode
|
|
PartitionNode
|
|
ProjectionNode
|
|
PythonScriptNode
|
|
RangePartitionNode
|
|
RepeatPartitionNode
|
|
RootNode
|
|
ShuffleNode
|
|
SqlEngineNode
|
|
UnionNode
|
|
UserDefinedPartitionNode
|
|
UserPartitionedDataSourceNode
|