DeepSeek/smallpond

mirror of https://github.com/deepseek-ai/smallpond synced 2025-06-26 18:27:45 +00:00

smallpond

A lightweight data processing framework built on DuckDB and 3FS.

Features

🚀 High-performance data processing powered by DuckDB
🌍 Scalable to handle PB-scale datasets
🛠️ Easy operations with no long-running services

Installation

Python 3.8 to 3.12 is supported.

pip install smallpond
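
Before installing, it can help to confirm that the interpreter falls inside the supported range. The check below is a minimal sketch based only on the version range stated above; the helper name `is_supported` is hypothetical and not part of smallpond.

```python
import sys

# Supported range as stated above: Python 3.8 through 3.12, inclusive.
MIN_SUPPORTED = (3, 8)
MAX_SUPPORTED = (3, 12)


def is_supported(version=None):
    """Return True if the (major, minor) version lies in the supported range.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if __name__ == "__main__":
    status = "supported" if is_supported() else "outside the supported range"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```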

Quick Start

# Download example data (shell)
wget https://duckdb.org/data/prices.parquet

# Process the data (Python)
import smallpond

# Initialize session
sp = smallpond.init()

# Load data
df = sp.read_parquet("prices.parquet")

# Process data
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

# Save results
df.write_parquet("output/")

# Show results
print(df.to_pandas())
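
The repartition step matters for correctness, assuming `partial_sql` runs the query separately on each partition: hashing by `ticker` places all rows for a given ticker in the same partition, so a per-partition `GROUP BY ticker` yields the same groups as a query over the whole dataset. The sketch below illustrates the idea in plain Python without smallpond or DuckDB. The sample rows, the helper names `hash_partition` and `min_max_by_ticker`, and the use of CRC32 as the hash are all assumptions for illustration and do not reflect smallpond's internals.

```python
import zlib


def hash_partition(rows, npartitions, key):
    """Assign each row to a partition using a stable hash of row[key]."""
    partitions = [[] for _ in range(npartitions)]
    for row in rows:
        index = zlib.crc32(str(row[key]).encode("utf-8")) % npartitions
        partitions[index].append(row)
    return partitions


def min_max_by_ticker(rows):
    """Plain-Python stand-in for a min/max aggregation grouped by ticker."""
    result = {}
    for row in rows:
        ticker, price = row["ticker"], row["price"]
        low, high = result.get(ticker, (price, price))
        result[ticker] = (min(low, price), max(high, price))
    return result


# Made-up sample rows, not taken from prices.parquet.
rows = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": 7.5},
    {"ticker": "AAA", "price": 12.0},
    {"ticker": "CCC", "price": 3.0},
    {"ticker": "BBB", "price": 9.0},
    {"ticker": "CCC", "price": 2.5},
]

partitions = hash_partition(rows, npartitions=3, key="ticker")

# Aggregate each partition independently, then combine the partial results.
combined = {}
for partition in partitions:
    combined.update(min_max_by_ticker(partition))

# No ticker spans two partitions, so this matches a single-pass aggregation.
assert combined == min_max_by_ticker(rows)
print(combined)
```

If the rows were split without regard to `ticker`, the same ticker could appear in several partitions and the per-partition results would need a second merge step.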

Documentation

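For detailed guides and the API reference, see the documentation sources in the docs directory; the Development section describes how to build them.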

Performance

We evaluated smallpond using the GraySort benchmark (script) on a cluster of 50 compute nodes and 25 storage nodes running 3FS. The benchmark sorted 110.5 TiB of data in 30 minutes and 14 seconds, an average throughput of 3.66 TiB/min.
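
As a quick sanity check on the reported throughput, the arithmetic can be reproduced from the two quoted figures. This is only a sketch using the rounded values from the paragraph above; it gives roughly 3.65 TiB/min, which agrees with the reported 3.66 TiB/min to within the rounding of the quoted data size and elapsed time.

```python
# Figures quoted in the paragraph above (both already rounded).
data_sorted_tib = 110.5
elapsed_minutes = 30 + 14 / 60  # 30 minutes and 14 seconds

# Average throughput = data volume / elapsed time.
throughput_tib_per_min = data_sorted_tib / elapsed_minutes
print(f"{throughput_tib_per_min:.2f} TiB/min")
```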

Details can be found in 3FS - Gray Sort.

Development

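# install with development dependencies
# (quoting keeps shells such as zsh from expanding the brackets)
pip install ".[dev]"

# run unit tests
pytest -v tests/test*.py

# build documentation and serve it locally
pip install ".[docs]"
cd docs
make html
python -m http.server --directory build/html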

License

This project is licensed under the MIT License.