Mirror of https://github.com/deepseek-ai/smallpond (synced 2025-03-26 10:48:46 +00:00)
smallpond
A lightweight data processing framework built on DuckDB and 3FS.
Features
- 🚀 High-performance data processing powered by DuckDB
- 🌍 Scalable to handle PB-scale datasets
- 🛠️ Easy operations with no long-running services
Installation
Python 3.8 to 3.12 is supported.
pip install smallpond
Quick Start
# Download example data
wget https://duckdb.org/data/prices.parquet
import smallpond
# Initialize session
sp = smallpond.init()
# Load data
df = sp.read_parquet("prices.parquet")
# Process data
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)
# Save results
df.write_parquet("output/")
# Show results
print(df.to_pandas())
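The Quick Start repartitions by a hash of `ticker` before running the `GROUP BY` query. A minimal sketch of why that ordering matters, in plain Python with no smallpond or DuckDB calls: if every row for a given ticker lands in the same partition, a per-partition aggregate is already the final answer for that ticker, so the partition results can simply be concatenated. The sample rows, the helper names `hash_partition` and `min_max_by_ticker`, and the use of `zlib.crc32` as the hash are all assumptions made for illustration; this is not a description of how smallpond partitions data internally.

```python
import zlib


def hash_partition(rows, npartitions, key):
    """Place each row into a partition chosen by a stable hash of row[key]."""
    partitions = [[] for _ in range(npartitions)]
    for row in rows:
        # crc32 is chosen only because it is deterministic across runs;
        # it is an assumption, not the hash smallpond uses.
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % npartitions
        partitions[idx].append(row)
    return partitions


def min_max_by_ticker(rows):
    """Plain-Python stand-in for:
    SELECT ticker, min(price), max(price) FROM rows GROUP BY ticker
    """
    stats = {}
    for row in rows:
        lo, hi = stats.get(row["ticker"], (row["price"], row["price"]))
        stats[row["ticker"]] = (min(lo, row["price"]), max(hi, row["price"]))
    return stats


# Hypothetical sample data (not taken from prices.parquet).
rows = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": 7.5},
    {"ticker": "AAA", "price": 12.0},
    {"ticker": "CCC", "price": 3.0},
    {"ticker": "BBB", "price": 6.0},
    {"ticker": "CCC", "price": 4.5},
]

partitions = hash_partition(rows, 3, "ticker")

# Run the aggregate on each partition independently.
per_partition = [min_max_by_ticker(part) for part in partitions]

# Because rows were partitioned by ticker, no ticker appears in two
# partitions, so merging is a plain union with no re-aggregation.
merged = {}
for result in per_partition:
    assert not (merged.keys() & result.keys())
    merged.update(result)

# The merged per-partition results match a single global aggregate.
assert merged == min_max_by_ticker(rows)
print(merged)
```

Without the hash repartition, rows for one ticker could be spread over several partitions, and the per-partition minima and maxima would need a second aggregation step to be combined.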
Documentation
For detailed guides and API reference:
Performance
We evaluated smallpond using the GraySort benchmark (script) on a cluster comprising 50 compute nodes and 25 storage nodes running 3FS. The benchmark sorted 110.5 TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66 TiB/min.
Details can be found in 3FS - Gray Sort.
Development
pip install .[dev]
# run unit tests
pytest -v tests/test*.py
# build documentation
pip install .[docs]
cd docs
make html
python -m http.server --directory build/html
License
This project is licensed under the MIT License.