.. currentmodule:: smallpond.logical.node

.. _nodes:

Nodes
=====

Nodes are the fundamental building blocks of a data processing pipeline. Each node encapsulates a specific operation or transformation that can be applied to a dataset.
Nodes can be chained together to form a logical plan, which is a directed acyclic graph (DAG) of nodes that represents the overall data processing workflow.

A typical workflow to create a logical plan is as follows:

.. code-block:: python

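   # Imports; ParquetDataSet is assumed to live in smallpond.logical.dataset
   from smallpond.logical.dataset import ParquetDataSet
   from smallpond.logical.node import (
       Context,
       DataSetPartitionNode,
       DataSourceNode,
       LogicalPlan,
       SqlEngineNode,
   )

   # Create a global context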
   ctx = Context()

   # Create a dataset
   dataset = ParquetDataSet("path/to/dataset/*.parquet")

   # Create a data source node
   node = DataSourceNode(ctx, dataset)

   # Partition the data
   node = DataSetPartitionNode(ctx, (node,), npartitions=2)

   # Create a SQL engine node to transform the data
   node = SqlEngineNode(ctx, (node,), "SELECT * FROM {0}")

   # Create a logical plan from the root node
   plan = LogicalPlan(ctx, node)

You can then create tasks from the logical plan; see :ref:`tasks`.
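
To make the DAG structure concrete, the following is a minimal toy sketch rather than smallpond's implementation. ``ToyNode`` and ``execution_order`` are hypothetical names introduced only for illustration. The sketch assumes that each node keeps a tuple of its input nodes and that a plan can be walked from the root node back to its sources, so that every node is ordered after its inputs.

.. code-block:: python

   from dataclasses import dataclass
   from graphlib import TopologicalSorter  # standard library, Python 3.9+


   @dataclass(frozen=True)
   class ToyNode:
       """Hypothetical stand-in for a plan node; not a smallpond class."""

       name: str
       input_deps: tuple = ()


   def execution_order(root):
       """Return node names so that every node appears after its inputs."""
       graph = {}
       stack = [root]
       # Walk from the root towards the sources, recording each node's inputs.
       while stack:
           node = stack.pop()
           if node.name in graph:
               continue
           graph[node.name] = {dep.name for dep in node.input_deps}
           stack.extend(node.input_deps)
       return list(TopologicalSorter(graph).static_order())


   # Mirror the chain used in the workflow: source -> partition -> sql.
   source = ToyNode("source")
   partitioned = ToyNode("partition", (source,))
   transformed = ToyNode("sql", (partitioned,))

   order = execution_order(transformed)
   print(order)

Only the root node is passed to ``execution_order``, in the same way that the workflow passes only the final node to ``LogicalPlan``; the remaining nodes are reachable through their input links.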

Notable properties of a ``Node``:

1. Nodes are partitioned. Each Node generates a series of tasks, with each task processing one partition of data.
2. The input and output of a Node are a series of partitioned Datasets. A Node may write data to shared storage and return a new Dataset, or it may simply recombine the input Datasets.
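
The relationship between partitions and tasks described above can be sketched with plain Python. This is an illustrative toy, not smallpond code: ``split_round_robin`` and ``make_tasks`` are hypothetical helpers, and the round-robin split is an assumption chosen for simplicity rather than a description of how smallpond assigns data to partitions.

.. code-block:: python

   def split_round_robin(items, npartitions):
       """Assign items to partitions in round-robin order (illustrative only)."""
       return [items[i::npartitions] for i in range(npartitions)]


   def make_tasks(node_name, partitions):
       """Create one hypothetical task description per partition."""
       return [
           {"node": node_name, "partition": idx, "inputs": part}
           for idx, part in enumerate(partitions)
       ]


   files = ["a.parquet", "b.parquet", "c.parquet", "d.parquet", "e.parquet"]

   # Two partitions lead to two tasks, each seeing only its own slice of input.
   partitions = split_round_robin(files, npartitions=2)
   tasks = make_tasks("sql", partitions)

   for task in tasks:
       print(task)

In this toy model the number of tasks equals the number of partitions, which matches the first property: one task per partition of data.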

Context
-------

.. autosummary::
   :toctree: ../generated

   Context
   NodeId

LogicalPlan
-----------

.. autosummary::
   :toctree: ../generated

   LogicalPlan
   LogicalPlanVisitor
   .. Planner

Nodes
-----

.. autosummary::
   :toctree: ../generated

   Node
   DataSetPartitionNode
   ArrowBatchNode
   ArrowComputeNode
   ArrowStreamNode
   ConsolidateNode
   DataSinkNode
   DataSourceNode
   EvenlyDistributedPartitionNode
   HashPartitionNode
   LimitNode
   LoadPartitionedDataSetNode
   PandasBatchNode
   PandasComputeNode
   PartitionNode
   ProjectionNode
   PythonScriptNode
   RangePartitionNode
   RepeatPartitionNode
   RootNode
   ShuffleNode
   SqlEngineNode
   UnionNode
   UserDefinedPartitionNode
   UserPartitionedDataSourceNode