Mirror of https://github.com/deepseek-ai/smallpond, synced 2025-06-26 18:27:45 +00:00.

# smallpond

[CI](https://github.com/deepseek-ai/smallpond/actions/workflows/ci.yml)
[PyPI](https://pypi.org/project/smallpond/)
[Docs](https://deepseek-ai.github.io/smallpond/)
[License](LICENSE)

A lightweight data processing framework built on [DuckDB] and [3FS].

## Features

- 🚀 High-performance data processing powered by DuckDB
- 🌍 Scalable to handle PB-scale datasets
- 🛠️ Easy operations with no long-running services

## Installation

Python 3.8 to 3.12 is supported.

```bash
pip install smallpond
```

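To confirm up front that an interpreter falls inside the supported range stated above, a minimal sketch follows. The helper name `is_supported_python` is hypothetical and not part of smallpond; the bounds are simply the ones given in this section.

```python
import sys

# Supported range as stated in this README (inclusive bounds).
MIN_SUPPORTED = (3, 8)
MAX_SUPPORTED = (3, 12)


def is_supported_python(version=None):
    """Return True if the (major, minor) version lies in the supported range.

    Hypothetical helper for illustration only; not part of smallpond.
    Defaults to the running interpreter when no version tuple is given.
    """
    major_minor = tuple(version or sys.version_info)[:2]
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


status = "supported" if is_supported_python() else "outside the supported range"
print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```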
## Quick Start

```bash
# Download example data
wget https://duckdb.org/data/prices.parquet
```

```python
import smallpond

# Initialize session
sp = smallpond.init()

# Load data
df = sp.read_parquet("prices.parquet")

# Process data
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

# Save results
df.write_parquet("output/")
# Show results
print(df.to_pandas())
```

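The Quick Start hash-partitions by `ticker` before running the `GROUP BY`. Assuming the query in `partial_sql` runs independently on each partition, as the name suggests, hashing on the grouping key sends all rows for a given ticker to the same partition, so the per-partition results can be combined without a further merge step. The pure-Python sketch below illustrates that idea only. It does not use smallpond or reflect its internals: `hash_partition` and `min_max_by_ticker` are hypothetical names, the sample rows are made up, and CRC32 is an arbitrary choice of stable hash.

```python
import zlib


def hash_partition(rows, key, npartitions):
    """Assign each row to a partition by hashing the value of its key column."""
    partitions = [[] for _ in range(npartitions)]
    for row in rows:
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % npartitions
        partitions[idx].append(row)
    return partitions


def min_max_by_ticker(rows):
    """Plain-Python stand-in for the Quick Start query:
    SELECT ticker, min(price), max(price) ... GROUP BY ticker
    """
    stats = {}
    for row in rows:
        lo, hi = stats.get(row["ticker"], (row["price"], row["price"]))
        stats[row["ticker"]] = (min(lo, row["price"]), max(hi, row["price"]))
    return stats


# Made-up sample data, for illustration only.
rows = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": 7.5},
    {"ticker": "AAA", "price": 12.0},
    {"ticker": "CCC", "price": 3.0},
    {"ticker": "BBB", "price": 6.0},
    {"ticker": "CCC", "price": 4.5},
]

partitions = hash_partition(rows, "ticker", 3)
partials = [min_max_by_ticker(part) for part in partitions]

# Each ticker lives in exactly one partition, so the partial results
# never overlap and can simply be unioned.
combined = {}
for partial in partials:
    assert combined.keys().isdisjoint(partial.keys())
    combined.update(partial)

print(combined)
```

Had the rows been split without regard to `ticker`, the same ticker could appear in several partitions and the partial minima and maxima would need a second aggregation pass.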
## Documentation

For detailed guides and API reference:

- [Getting Started](docs/source/getstarted.rst)
- [API Reference](docs/source/api.rst)

## Performance

We evaluated smallpond using the [GraySort benchmark] ([script]) on a cluster comprising 50 compute nodes and 25 storage nodes running [3FS]. The benchmark sorted 110.5 TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66 TiB/min.

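As a quick check of the arithmetic, dividing the reported data volume by the reported wall-clock time gives roughly 3.65 TiB/min, in line with the quoted 3.66 TiB/min. The small gap presumably comes from rounding in the published inputs, which is an assumption and not something the benchmark report states.

```python
# Figures quoted in the Performance paragraph.
data_tib = 110.5
elapsed_min = 30 + 14 / 60  # 30 minutes and 14 seconds, expressed in minutes

# Average throughput in TiB per minute.
throughput = data_tib / elapsed_min
print(f"{throughput:.2f} TiB/min")
```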

Details can be found in [3FS - Gray Sort].

[DuckDB]: https://duckdb.org/
[3FS]: https://github.com/deepseek-ai/3FS
[GraySort benchmark]: https://sortbenchmark.org/
[script]: benchmarks/gray_sort_benchmark.py
[3FS - Gray Sort]: https://github.com/deepseek-ai/3FS?tab=readme-ov-file#2-graysort

## Development

```bash
# install with development dependencies
# (quoted so shells such as zsh do not expand the brackets)
pip install ".[dev]"

# run unit tests
pytest -v tests/test*.py

# build documentation
pip install ".[docs]"
cd docs
make html
python -m http.server --directory build/html
```

## License

This project is licensed under the [MIT License](LICENSE).