Initial commit

dev · 2025-02-27 21:53:53 +08:00 · commit 815e55e4c0
1291 changed files with 185445 additions and 0 deletions

deploy/README.md
# 3FS Setup Guide
This section provides a manual deployment guide for setting up a six-node cluster with the cluster ID `stage`.
## Installation prerequisites
### Hardware specifications
| Node | OS | IP | Memory | SSD | RDMA |
|----------|---------------|--------------|--------|------------|-------|
| meta | Ubuntu 22.04 | 192.168.1.1 | 128GB | - | RoCE |
| storage1 | Ubuntu 22.04 | 192.168.1.2 | 512GB | 14TB × 16 | RoCE |
| storage2 | Ubuntu 22.04 | 192.168.1.3 | 512GB | 14TB × 16 | RoCE |
| storage3 | Ubuntu 22.04 | 192.168.1.4 | 512GB | 14TB × 16 | RoCE |
| storage4 | Ubuntu 22.04 | 192.168.1.5 | 512GB | 14TB × 16 | RoCE |
| storage5 | Ubuntu 22.04 | 192.168.1.6 | 512GB | 14TB × 16 | RoCE |
> **RDMA Configuration**
> 1. Assign IP addresses to RDMA NICs. Multiple RDMA NICs (InfiniBand or RoCE) are supported on each node.
> 2. Check RDMA connectivity between nodes using `ib_write_bw`.
### Third-party dependencies
In a production environment, it is recommended to install FoundationDB and ClickHouse on dedicated nodes.
| Service | Node |
|------------|-------------------------|
| [ClickHouse](https://clickhouse.com/docs/install) | meta |
| [FoundationDB](https://apple.github.io/foundationdb/administration.html) | meta |
> **FoundationDB**
> 1. Ensure that the FoundationDB client version matches the server version, or copy the corresponding version of `libfdb_c.so` to maintain compatibility.
> 2. The `fdb.cluster` file and `libfdb_c.so` can be found at `/etc/foundationdb/fdb.cluster` and `/usr/lib/libfdb_c.so` on nodes with FoundationDB installed.
---
## Step 0: Build 3FS
Follow the [instructions](../README.md#build-3fs) to build 3FS. Binaries can be found in `build/bin`.
### Services and clients
The following steps show how to install 3FS services in `/opt/3fs/bin` and the config files in `/opt/3fs/etc`.
| Service | Binary | Config files | NodeID | Node |
|------------|-------------------------|-----------------------------------------------------------------------------|--------|---------------|
| monitor | monitor_collector_main | [monitor_collector_main.toml](../configs/monitor_collector_main.toml) | - | meta |
| admin_cli | admin_cli | [admin_cli.toml](../configs/admin_cli.toml)<br>fdb.cluster | - | meta<br>storage1<br>storage2<br>storage3<br>storage4<br>storage5 |
| mgmtd | mgmtd_main | [mgmtd_main_launcher.toml](../configs/mgmtd_main_launcher.toml)<br>[mgmtd_main.toml](../configs/mgmtd_main.toml)<br>[mgmtd_main_app.toml](../configs/mgmtd_main_app.toml)<br>fdb.cluster | 1 | meta |
| meta | meta_main | [meta_main_launcher.toml](../configs/meta_main_launcher.toml)<br>[meta_main.toml](../configs/meta_main.toml)<br>[meta_main_app.toml](../configs/meta_main_app.toml)<br>fdb.cluster | 100 | meta |
| storage | storage_main | [storage_main_launcher.toml](../configs/storage_main_launcher.toml)<br>[storage_main.toml](../configs/storage_main.toml)<br>[storage_main_app.toml](../configs/storage_main_app.toml) | 10001~10005 | storage1<br>storage2<br>storage3<br>storage4<br>storage5 |
| client | hf3fs_fuse_main | [hf3fs_fuse_main_launcher.toml](../configs/hf3fs_fuse_main_launcher.toml)<br>[hf3fs_fuse_main.toml](../configs/hf3fs_fuse_main.toml) | - | meta |
---
## Step 1: Create ClickHouse tables for metrics
Import the SQL file into ClickHouse:
```bash
clickhouse-client -n < ~/3fs/deploy/sql/3fs-monitor.sql
```
---
## Step 2: Monitor service
Install the `monitor_collector` service on the **meta** node.
1. Copy `monitor_collector_main` to `/opt/3fs/bin` and config files to `/opt/3fs/etc`, and create log directory `/var/log/3fs`.
```bash
mkdir -p /opt/3fs/{bin,etc}
mkdir -p /var/log/3fs
cp ~/3fs/build/bin/monitor_collector_main /opt/3fs/bin
cp ~/3fs/configs/monitor_collector_main.toml /opt/3fs/etc
```
2. Update [`monitor_collector_main.toml`](../configs/monitor_collector_main.toml) to add a ClickHouse connection:
```toml
[server.monitor_collector.reporter]
type = 'clickhouse'
[server.monitor_collector.reporter.clickhouse]
db = '3fs'
host = '<CH_HOST>'
passwd = '<CH_PASSWD>'
port = '<CH_PORT>'
user = '<CH_USER>'
```
3. Start monitor service:
```bash
cp ~/3fs/deploy/systemd/monitor_collector_main.service /usr/lib/systemd/system
systemctl start monitor_collector_main
```
> **Note**
> - Multiple instances of the monitor service can be deployed behind a virtual IP address to share the traffic.
> - Other services communicate with the monitor service over a TCP connection.
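The `systemctl` step above uses the prebuilt unit file shipped in `deploy/systemd`. For orientation only, a unit of roughly this shape would do the same; this is an illustrative sketch, not the bundled `monitor_collector_main.service`, and the `--cfg` flag is an assumption:

```ini
# Illustrative sketch; the shipped monitor_collector_main.service is authoritative.
[Unit]
Description=3FS monitor collector
After=network.target

[Service]
# The --cfg flag is an assumption; check the shipped unit file for the real command line.
ExecStart=/opt/3fs/bin/monitor_collector_main --cfg /opt/3fs/etc/monitor_collector_main.toml
Restart=on-failure

[Install]
WantedBy=multi-user.target
```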
---
## Step 3: Admin client
Install `admin_cli` on **all** nodes.
1. Copy `admin_cli` to `/opt/3fs/bin` and config files to `/opt/3fs/etc`.
```bash
mkdir -p /opt/3fs/{bin,etc}
rsync -avz meta:~/3fs/build/bin/admin_cli /opt/3fs/bin
rsync -avz meta:~/3fs/configs/admin_cli.toml /opt/3fs/etc
rsync -avz meta:/etc/foundationdb/fdb.cluster /opt/3fs/etc
```
2. Update [`admin_cli.toml`](../configs/admin_cli.toml) to set `cluster_id` and `clusterFile`:
```toml
cluster_id = "stage"
[fdb]
clusterFile = '/opt/3fs/etc/fdb.cluster'
```
The full help documentation for `admin_cli` can be displayed by running the following command:
```bash
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml help
```
---
## Step 4: Mgmtd service
Install the `mgmtd` service on the **meta** node.
1. Copy `mgmtd_main` to `/opt/3fs/bin` and config files to `/opt/3fs/etc`.
```bash
cp ~/3fs/build/bin/mgmtd_main /opt/3fs/bin
cp ~/3fs/configs/{mgmtd_main.toml,mgmtd_main_launcher.toml,mgmtd_main_app.toml} /opt/3fs/etc
```
2. Update config files:
- Set mgmtd `node_id = 1` in [`mgmtd_main_app.toml`](../configs/mgmtd_main_app.toml).
- Edit [`mgmtd_main_launcher.toml`](../configs/mgmtd_main_launcher.toml) to set the `cluster_id` and `clusterFile`:
```toml
cluster_id = "stage"
[fdb]
clusterFile = '/opt/3fs/etc/fdb.cluster'
```
- Set monitor address in [`mgmtd_main.toml`](../configs/mgmtd_main.toml):
```toml
[common.monitor.reporters.monitor_collector]
remote_ip = "192.168.1.1:10000"
```
3. Initialize the cluster:
```bash
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml "init-cluster --mgmtd /opt/3fs/etc/mgmtd_main.toml 1 1048576 16"
```
The positional parameters of `init-cluster`:
> - `1`: the chain table ID
> - `1048576`: the chunk size in bytes
> - `16`: the file stripe size

Run `help init-cluster` for the full documentation.
4. Start mgmtd service:
```bash
cp ~/3fs/deploy/systemd/mgmtd_main.service /usr/lib/systemd/system
systemctl start mgmtd_main
```
5. Run `list-nodes` command to check if the cluster has been successfully initialized:
```bash
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "list-nodes"
```
If multiple instances of the `mgmtd` service are deployed, one of them is elected as the primary; the others are secondaries. Automatic failover occurs when the primary fails.
---
## Step 5: Meta service
Install the `meta` service on the **meta** node.
1. Copy `meta_main` to `/opt/3fs/bin` and config files to `/opt/3fs/etc`.
```bash
cp ~/3fs/build/bin/meta_main /opt/3fs/bin
cp ~/3fs/configs/{meta_main_launcher.toml,meta_main.toml,meta_main_app.toml} /opt/3fs/etc
```
2. Update config files:
- Set meta `node_id = 100` in [`meta_main_app.toml`](../configs/meta_main_app.toml).
- Set `cluster_id`, `clusterFile` and mgmtd address in [`meta_main_launcher.toml`](../configs/meta_main_launcher.toml):
```toml
cluster_id = "stage"
[mgmtd_client]
mgmtd_server_addresses = ["RDMA://192.168.1.1:8000"]
```
- Set the mgmtd and monitor addresses in [`meta_main.toml`](../configs/meta_main.toml):
```toml
[server.mgmtd_client]
mgmtd_server_addresses = ["RDMA://192.168.1.1:8000"]
[common.monitor.reporters.monitor_collector]
remote_ip = "192.168.1.1:10000"
[server.fdb]
clusterFile = '/opt/3fs/etc/fdb.cluster'
```
3. The config file of the meta service is managed by the mgmtd service. Use `admin_cli` to upload it to mgmtd:
```bash
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "set-config --type META --file /opt/3fs/etc/meta_main.toml"
```
4. Start meta service:
```bash
cp ~/3fs/deploy/systemd/meta_main.service /usr/lib/systemd/system
systemctl start meta_main
```
5. Run `list-nodes` command to check if meta service has joined the cluster:
```bash
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "list-nodes"
```
If multiple instances of the `meta` service are deployed, meta requests will be evenly distributed across all instances.
---
## Step 6: Storage service
Install the `storage` service on every **storage** node.
1. Format the 16 attached SSDs as XFS and mount them at `/storage/data{1..16}`, then create the data directories `/storage/data{1..16}/3fs` and the log directory `/var/log/3fs`.
```bash
mkdir -p /storage/data{1..16}
mkdir -p /var/log/3fs
for i in {1..16}; do
  mkfs.xfs -L data${i} /dev/nvme${i}n1
  mount -o noatime,nodiratime -L data${i} /storage/data${i}
done
mkdir -p /storage/data{1..16}/3fs
```
2. Increase the maximum number of asynchronous I/O (AIO) requests:
```bash
sysctl -w fs.aio-max-nr=67108864
```
3. Copy `storage_main` to `/opt/3fs/bin` and config files to `/opt/3fs/etc`.
```bash
rsync -avz meta:~/3fs/build/bin/storage_main /opt/3fs/bin
rsync -avz meta:~/3fs/configs/{storage_main_launcher.toml,storage_main.toml,storage_main_app.toml} /opt/3fs/etc
```
4. Update config files:
- Set `node_id` in [`storage_main_app.toml`](../configs/storage_main_app.toml). Each storage service is assigned a unique id between `10001` and `10005`.
- Set `cluster_id` and mgmtd address in [`storage_main_launcher.toml`](../configs/storage_main_launcher.toml).
```toml
cluster_id = "stage"
[mgmtd_client]
mgmtd_server_addresses = ["RDMA://192.168.1.1:8000"]
```
- Set the mgmtd address, monitor address, and target paths in [`storage_main.toml`](../configs/storage_main.toml):
```toml
[server.mgmtd_client]
mgmtd_server_addresses = ["RDMA://192.168.1.1:8000"]
[common.monitor.reporters.monitor_collector]
remote_ip = "192.168.1.1:10000"
[server.targets]
target_paths = [
  "/storage/data1/3fs",
  "/storage/data2/3fs",
  "/storage/data3/3fs",
  "/storage/data4/3fs",
  "/storage/data5/3fs",
  "/storage/data6/3fs",
  "/storage/data7/3fs",
  "/storage/data8/3fs",
  "/storage/data9/3fs",
  "/storage/data10/3fs",
  "/storage/data11/3fs",
  "/storage/data12/3fs",
  "/storage/data13/3fs",
  "/storage/data14/3fs",
  "/storage/data15/3fs",
  "/storage/data16/3fs",
]
```
5. The config file of the storage service is managed by the mgmtd service. Use `admin_cli` to upload it to mgmtd:
```bash
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "set-config --type STORAGE --file /opt/3fs/etc/storage_main.toml"
```
6. Start storage service:
```bash
rsync -avz meta:~/3fs/deploy/systemd/storage_main.service /usr/lib/systemd/system
systemctl start storage_main
```
7. Run `list-nodes` command to check if storage service has joined the cluster:
```bash
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "list-nodes"
```
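Typing the 16 `target_paths` entries in step 4 by hand is error-prone. The array can be generated with a short shell loop; this sketch only prints the TOML line for pasting into `storage_main.toml`:

```bash
# Print a TOML target_paths array covering /storage/data{1..16}/3fs.
paths=$(printf '"/storage/data%d/3fs",' $(seq 1 16))
echo "target_paths = [${paths%,}]"
```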
---
## Step 7: Create admin user, storage targets and chain table
1. Create an admin user:
```bash
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "user-add --root --admin 0 root"
```
Save the admin token to `/opt/3fs/etc/token.txt`.
2. Generate `admin_cli` commands to create storage targets on the 5 storage nodes (16 SSDs per node, 6 targets per SSD).
   - Follow the [instructions](data_placement/README.md) to install the required Python packages.
```bash
python ~/3fs/deploy/data_placement/src/model/data_placement.py \
-ql -relax -type CR --num_nodes 5 --replication_factor 3 --min_targets_per_disk 6
python ~/3fs/deploy/data_placement/src/setup/gen_chain_table.py \
--chain_table_type CR --node_id_begin 10001 --node_id_end 10005 \
--num_disks_per_node 16 --num_targets_per_disk 6 \
--target_id_prefix 1 --chain_id_prefix 9 \
--incidence_matrix_path output/DataPlacementModel-v_5-b_10-r_6-k_3-λ_2-lb_1-ub_1/incidence_matrix.pickle
```
The following 3 files will be generated in the `output` directory: `create_target_cmd.txt`, `generated_chains.csv`, and `generated_chain_table.csv`.
3. Create storage targets:
```bash
/opt/3fs/bin/admin_cli --cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' --config.user_info.token $(<"/opt/3fs/etc/token.txt") < output/create_target_cmd.txt
```
4. Upload chains to mgmtd service:
```bash
/opt/3fs/bin/admin_cli --cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' --config.user_info.token $(<"/opt/3fs/etc/token.txt") "upload-chains output/generated_chains.csv"
```
5. Upload chain table to mgmtd service:
```bash
/opt/3fs/bin/admin_cli --cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' --config.user_info.token $(<"/opt/3fs/etc/token.txt") "upload-chain-table --desc stage 1 output/generated_chain_table.csv"
```
6. List chains and chain tables to check if they have been correctly uploaded:
```bash
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "list-chains"
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "list-chain-tables"
```
---
## Step 8: FUSE client
For simplicity, the FUSE client is deployed on the **meta** node in this guide. However, we strongly advise against deploying clients on service nodes in a production environment.
1. Copy `hf3fs_fuse_main` to `/opt/3fs/bin` and config files to `/opt/3fs/etc`.
```bash
cp ~/3fs/build/bin/hf3fs_fuse_main /opt/3fs/bin
cp ~/3fs/configs/{hf3fs_fuse_main_launcher.toml,hf3fs_fuse_main.toml,hf3fs_fuse_main_app.toml} /opt/3fs/etc
```
2. Create the mount point:
```bash
mkdir -p /3fs/stage
```
3. Set the cluster ID, mountpoint, token file, and mgmtd address in [`hf3fs_fuse_main_launcher.toml`](../configs/hf3fs_fuse_main_launcher.toml):
```toml
cluster_id = "stage"
mountpoint = '/3fs/stage'
token_file = '/opt/3fs/etc/token.txt'
[mgmtd_client]
mgmtd_server_addresses = ["RDMA://192.168.1.1:8000"]
```
4. Set the mgmtd and monitor addresses in [`hf3fs_fuse_main.toml`](../configs/hf3fs_fuse_main.toml):
```toml
[mgmtd]
mgmtd_server_addresses = ["RDMA://192.168.1.1:8000"]
[common.monitor.reporters.monitor_collector]
remote_ip = "192.168.1.1:10000"
```
5. The config file of the FUSE client is also managed by the mgmtd service. Use `admin_cli` to upload it to mgmtd:
```bash
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "set-config --type FUSE --file /opt/3fs/etc/hf3fs_fuse_main.toml"
```
6. Start FUSE client:
```bash
cp ~/3fs/deploy/systemd/hf3fs_fuse_main.service /usr/lib/systemd/system
systemctl start hf3fs_fuse_main
```
7. Check if 3FS has been mounted at `/3fs/stage`:
```bash
mount | grep '/3fs/stage'
```
## FAQ
<details>
<summary>How to troubleshoot <code>admin_cli init-cluster</code> error?</summary>
If mgmtd fails to start after running `init-cluster`, the most likely cause is an error in `mgmtd_main.toml`. Any change to this file requires clearing all FoundationDB data and re-running `init-cluster`.
</details>
---
<details>
<summary>How to build a single-node cluster?</summary>
A minimum of two storage services is required for data replication. If `--num_nodes=1` is set, the `gen_chain_table.py` script will fail. In a test environment, this limitation can be bypassed by deploying multiple storage services on a single machine.
</details>
---
<details>
<summary>How to update config files?</summary>
All config files are managed by mgmtd. If any `*_main.toml` is updated, such as `storage_main.toml`, the modified file should be uploaded using `admin_cli set-config`.
</details>
---
<details>
<summary>How to troubleshoot common deployment issues?</summary>
When encountering any error during deployment,
- Check the log messages in `stdout/stderr` using `journalctl`, especially during service startup.
- Check log files stored in `/var/log/3fs/` on service and client nodes.
- Ensure that the directory `/var/log/3fs/` exists before starting any service.
</details>

deploy/data_placement/.gitignore
__pycache__
.ipynb_checkpoints
.tmp/
dist/
build/
output/
*.egg-info/
test/scratch/
test/runtime/
*.log
*.pyc
*.xml
.tmp/
.idea
.coverage
.vscode/
.hypothesis/

# How to generate chain tables
Suppose we are going to set up a small 3FS cluster:
- 3 replicas for each chunk
- 5 storage nodes: `10001 ... 10005`
- 16 SSDs attached to each node
- 6 storage targets on each SSD
First, generate a solution to the data placement problem.
```bash
$ python src/model/data_placement.py -ql -relax -type CR --num_nodes 5 --replication_factor 3 --min_targets_per_disk 6 --init_timelimit 600
...
2025-02-24 14:25:13.623 | SUCCESS | __main__:solve:165 - optimal solution:
- Status: ok
Termination condition: optimal
Termination message: TerminationCondition.optimal
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 1,2: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 1,3: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 1,4: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 1,5: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 2,1: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 2,3: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 2,4: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 2,5: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 3,1: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 3,2: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 3,4: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 3,5: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 4,1: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 4,2: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 4,3: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 4,5: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 5,1: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 5,2: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 5,3: 1.5
2025-02-24 14:25:13.624 | DEBUG | __main__:check_solution:322 - 5,4: 1.5
2025-02-24 14:25:13.624 | INFO | __main__:check_solution:331 - min_peer_traffic=1.5 max_peer_traffic=1.5
2025-02-24 14:25:13.624 | INFO | __main__:check_solution:332 - total_traffic=30.0 max_total_traffic=30
2025-02-24 14:25:14.147 | SUCCESS | __main__:run:147 - saved solution to: output/DataPlacementModel-v_5-b_10-r_6-k_3-λ_2-lb_1-ub_1
```
Note that some combinations of `--num_nodes` and `--replication_factor` may have no solution.
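The feasibility condition can be checked by hand: `find_params` in `src/model/data_placement.py` requires `v * r % k == 0` and `r * (k - 1) >= v - 1` for `v` nodes, replication factor `k`, and `r` targets per disk. A quick shell check mirroring that search (not invoking the script):

```bash
# Smallest feasible r for v=5 nodes, k=3 replicas, starting at min_targets_per_disk=6.
v=5; k=3; min_r=6
for r in $(seq "$min_r" 100); do
  if (( v * r % k == 0 && r * (k - 1) >= v - 1 )); then
    echo "r=$r b=$((v * r / k))"   # b = number of groups (chains)
    break
  fi
done
```

For the cluster in this guide this prints `r=6 b=10`, consistent with the `DataPlacementModel-v_5-b_10-r_6-…` output directory shown above.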
Then generate the commands to create and remove storage targets.
```bash
$ python src/setup/gen_chain_table.py --chain_table_type CR --node_id_begin 10001 --node_id_end 10005 --num_disks_per_node 16 --num_targets_per_disk 6 --incidence_matrix_path output/DataPlacementModel-v_5-b_10-r_6-k_3-λ_2-lb_1-ub_1/incidence_matrix.pickle
$ ls -1 output/
DataPlacementModel-v_5-b_10-r_6-k_3-λ_2-lb_1-ub_1
appsi_highs.log
create_target_cmd.txt
generated_chain_table.csv
generated_chains.csv
remove_target_cmd.txt
```

```
psutil
pandas
plotly
loguru
highspy==1.8.0
pyomo==6.8.0
coverage~=7.4.4
pytest==8.2.1
pytest-cov==5.0.0
pytest-forked==1.6.0
pytest-xdist==3.6.1
pytest-timeout==2.3.1

import math
import pickle
import random
import time
import psutil
import os.path
import pandas as pd
import pyomo.environ as po
import plotly.express as px
from typing import Dict, Generator, Literal, Tuple
from loguru import logger
from pyomo.opt import SolverStatus, TerminationCondition
class InfeasibleModel(Exception):
pass
class SolverTimeout(Exception):
pass
class SolverError(Exception):
pass
class InvalidSolution(Exception):
pass
class DataPlacementModel(object):
def __init__(self, chain_table_type: Literal["EC", "CR"], num_nodes, group_size, num_groups=None, num_targets_per_disk=None, min_targets_per_disk=1, bibd_only=False, qlinearize=False, relax_lb=1, relax_ub=0):
if num_targets_per_disk is None:
num_nodes, num_groups, num_targets_per_disk, group_size = DataPlacementModel.find_params(num_nodes, group_size, min_r=min_targets_per_disk, bibd_only=bibd_only)
self.chain_table_type = chain_table_type
self.num_nodes = num_nodes
self.group_size = group_size
self.num_targets_per_disk = num_targets_per_disk
self.num_groups = num_groups or self.num_targets_total // self.group_size
self.bibd_only = bibd_only
self.qlinearize = qlinearize
self.relax_lb = relax_lb
self.relax_ub = relax_ub
def __repr__(self):
v, b, r, k, λ = self.v, self.b, self.r, self.k, self.λ
lb, ub = self.relax_lb, self.relax_ub
return f"{self.__class__.__name__}-{v=},{b=},{r=},{k=},{λ=},{lb=},{ub=}"
__str__ = __repr__
@property
def path_name(self):
return str(self).translate(str.maketrans(' ,:=', '---_'))
@property
def v(self):
return self.num_nodes
@property
def b(self):
return self.num_groups
@property
def r(self):
return self.num_targets_per_disk
@property
def k(self):
return self.group_size
@property
def λ(self):
return self.max_recovery_traffic_on_peer
@property
def num_targets_used(self):
return self.num_groups * self.group_size
@property
def num_targets_total(self):
return self.num_nodes * self.num_targets_per_disk
@property
def all_targets_used(self):
return self.num_targets_used == self.num_targets_total
@property
def balanced_peer_traffic(self):
return self.all_targets_used and self.sum_recovery_traffic_per_failure % (self.num_nodes-1) == 0
@property
def recovery_traffic_factor(self):
return (self.group_size - 1) if self.chain_table_type == "EC" else 1
@property
def sum_recovery_traffic_per_failure(self):
return self.num_targets_per_disk * self.recovery_traffic_factor
@property
def max_recovery_traffic_on_peer(self):
return math.ceil(self.sum_recovery_traffic_per_failure / (self.num_nodes-1))
@property
def balanced_incomplete_block_design(self):
return self.bibd_only and self.balanced_peer_traffic and self.relax_ub == 0
@staticmethod
def find_params(v, k, min_r=1, max_r=100, bibd_only=False):
if bibd_only: min_r = max(min_r, k)
for r in range(min_r, max_r):
if v * r % k == 0 and r * (k - 1) >= v - 1:
b = v * r // k
if not bibd_only or r * (k - 1) % (v - 1) == 0:
return v, b, r, k
raise ValueError(f"cannot find valid params: {v=}, {k=}")
def run(self, pyomo_solver=None, threads=psutil.cpu_count(logical=False), init_timelimit=1800, max_timelimit=3600*2, auto_relax=False, output_root="output", verbose=False, add_elapsed_time=None):
init_relax_lb = self.relax_lb
init_relax_ub = self.relax_ub
timelimit = 0
num_loops = self.max_recovery_traffic_on_peer*2
os.makedirs(output_root, exist_ok=True)
for loop in range(num_loops):
try:
logger.info(f"solving model with {pyomo_solver} #{loop}: {self}")
if add_elapsed_time is not None:
add_elapsed_time()
timelimit = min(timelimit + init_timelimit, max_timelimit)
instance = self.solve(pyomo_solver, threads, timelimit, output_root, verbose)
if add_elapsed_time is not None:
add_elapsed_time(f"solve model time (lb={self.relax_lb}, ub={self.relax_ub})")
except (InfeasibleModel, SolverTimeout) as ex:
logger.error(f"cannot find solution for current params: {ex}")
if auto_relax:
self.relax_lb = init_relax_lb + (loop+1) // 2
self.relax_ub = init_relax_ub + (loop+2) // 2
continue
elif loop + 1 == num_loops:
logger.critical(f"failed to find solution after {num_loops} attempts")
raise ex
else:
raise ex
else:
output_path = os.path.join(output_root, self.path_name)
os.makedirs(output_path, exist_ok=True)
self.save_solution(instance, output_path)
self.visualize_solution(instance, output_path)
logger.success(f"saved solution to: {output_path}")
return instance
logger.catch(reraise=True, message="failed to solve model")
def solve(self, pyomo_solver=None, threads=psutil.cpu_count(logical=False), timelimit=3600, output_path="output", verbose=False):
if "highs" in pyomo_solver:
self.qlinearize = True
instance = self.build_model()
if verbose: instance.pprint()
try:
results = self.solve_model(instance, pyomo_solver, threads, timelimit, output_path)
except RuntimeError as ex:
raise SolverError("unknown runtime error") from ex
if (results.solver.status == SolverStatus.ok) and (results.solver.termination_condition == TerminationCondition.optimal):
logger.success(f"optimal solution: {str(results.solver)}")
if pyomo_solver is not None: instance.solutions.load_from(results)
elif results.solver.termination_condition == TerminationCondition.infeasible:
raise InfeasibleModel(f"infeasible: {str(results.solver)}")
elif results.solver.termination_condition in (TerminationCondition.maxTimeLimit, TerminationCondition.maxIterations):
raise SolverTimeout(f"timeout: {str(results.solver)}")
else:
raise SolverError(f"error: {str(results.solver)}")
if verbose: self.print_solution(instance)
try:
self.check_solution(instance)
except AssertionError as ex:
raise InvalidSolution from ex
return instance
def build_model(self):
logger.info(f"{self.num_nodes=} {self.num_targets_per_disk=} {self.group_size=} {self.num_groups=} {self.qlinearize=} {self.relax_lb=} {self.relax_ub=}")
# v >= k
assert self.num_nodes >= self.group_size, f"{self.num_nodes=} < {self.group_size=}"
# Fisher's inequality
if self.balanced_incomplete_block_design:
# b >= v
assert self.num_groups >= self.num_nodes, f"{self.num_groups=} < {self.num_nodes=}"
# r >= k
assert self.num_targets_per_disk >= self.group_size, f"{self.num_targets_per_disk=} < {self.group_size=}"
logger.info(f"{self.sum_recovery_traffic_per_failure=} {self.max_recovery_traffic_on_peer=}")
if self.sum_recovery_traffic_per_failure < self.num_nodes - 1:
logger.warning(f"some disks do not share recovery traffic: {self.sum_recovery_traffic_per_failure=} < {self.num_nodes=} - 1")
logger.info(f"{self.all_targets_used=} {self.balanced_peer_traffic=}")
logger.info(f"{self.num_targets_used=} {self.num_targets_total=}")
if self.num_targets_used < self.num_targets_total:
logger.warning(f"some disks have unused targets: {self.num_targets_used=} < {self.num_targets_total=}")
else:
assert self.num_targets_used == self.num_targets_total, f"{self.num_targets_used=} > {self.num_targets_total=}"
model = po.ConcreteModel()
# index sets
model.disks = po.RangeSet(1, self.num_nodes)
model.target_idxs = po.RangeSet(1, self.num_targets_per_disk)
model.targets = model.disks * model.target_idxs
model.groups = po.RangeSet(1, self.num_groups)
def disk_pairs_init(model):
for disk in model.disks:
for peer in model.disks:
if peer > disk:
yield (disk, peer)
model.disk_pairs = po.Set(dimen=2, initialize=disk_pairs_init)
# variables
model.disk_used_by_group = po.Var(model.disks, model.groups, domain=po.Binary)
if self.qlinearize:
model.disk_in_same_group = po.Var(model.disk_pairs, model.groups, domain=po.Binary)
# constraints
def calc_disk_in_same_group(model, disk, peer, group):
return model.disk_used_by_group[disk,group] * model.disk_used_by_group[peer,group]
def define_disk_in_same_group_lower_bound(model, disk, peer, group):
return model.disk_used_by_group[disk,group] + model.disk_used_by_group[peer,group] <= model.disk_in_same_group[disk,peer,group] + 1
def define_disk_in_same_group_upper_bound1(model, disk, peer, group):
return model.disk_in_same_group[disk,peer,group] <= model.disk_used_by_group[disk,group]
def define_disk_in_same_group_upper_bound2(model, disk, peer, group):
return model.disk_in_same_group[disk,peer,group] <= model.disk_used_by_group[peer,group]
if self.qlinearize:
model.define_disk_in_same_group_lower_bound_eqn = po.Constraint(model.disk_pairs, model.groups, rule=define_disk_in_same_group_lower_bound)
model.define_disk_in_same_group_upper_bound1_eqn = po.Constraint(model.disk_pairs, model.groups, rule=define_disk_in_same_group_upper_bound1)
model.define_disk_in_same_group_upper_bound2_eqn = po.Constraint(model.disk_pairs, model.groups, rule=define_disk_in_same_group_upper_bound2)
def each_disk_has_limited_capacity(model, disk):
if self.all_targets_used:
return po.quicksum(model.disk_used_by_group[disk,group] for group in model.groups) == self.num_targets_per_disk
else:
return po.quicksum(model.disk_used_by_group[disk,group] for group in model.groups) <= self.num_targets_per_disk
model.each_disk_has_limited_capacity_eqn = po.Constraint(model.disks, rule=each_disk_has_limited_capacity)
def enough_disks_assigned_to_each_group(model, group):
return po.quicksum(model.disk_used_by_group[disk,group] for disk in model.disks) == self.group_size
model.enough_disks_assigned_to_each_group_eqn = po.Constraint(model.groups, rule=enough_disks_assigned_to_each_group)
def calc_peer_recovery_traffic(model, disk, peer):
if self.qlinearize:
return po.quicksum(model.disk_in_same_group[disk,peer,group] for group in model.groups)
else:
return po.quicksum(calc_disk_in_same_group(model, disk, peer, group) for group in model.groups)
def peer_recovery_traffic_upper_bound(model, disk, peer):
if self.balanced_incomplete_block_design:
return calc_peer_recovery_traffic(model, disk, peer) == self.max_recovery_traffic_on_peer
else:
return calc_peer_recovery_traffic(model, disk, peer) <= self.max_recovery_traffic_on_peer + self.relax_ub
model.peer_recovery_traffic_upper_bound_eqn = po.Constraint(model.disk_pairs, rule=peer_recovery_traffic_upper_bound)
def peer_recovery_traffic_lower_bound(model, disk, peer):
return calc_peer_recovery_traffic(model, disk, peer) >= max(0, self.max_recovery_traffic_on_peer - self.relax_lb)
if self.balanced_incomplete_block_design:
logger.info(f"lower bound not needed for balanced incomplete block design (BIBD)")
elif self.all_targets_used:
logger.info(f"lower bound imposed on peer traffic: {self.relax_lb=} {self.qlinearize=} {self.all_targets_used=}")
model.peer_recovery_traffic_lower_bound_eqn = po.Constraint(model.disk_pairs, rule=peer_recovery_traffic_lower_bound)
else:
logger.info(f"lower bound not imposed on peer traffic: {self.relax_lb=} {self.qlinearize=} {self.all_targets_used=}")
def total_recovery_traffic(model):
return po.summation(model.disk_in_same_group) * 2
# model.obj = po.Objective(rule=total_recovery_traffic, sense=po.minimize)
model.obj = po.Objective(expr=1) # dummy objective
return model
def solve_model(self, instance, pyomo_solver, threads, timelimit, output_path):
if pyomo_solver is not None:
solver = po.SolverFactory(pyomo_solver)
return solver.solve(instance, options={"threads": str(threads), "log_file": os.path.join(output_path, f"{pyomo_solver}.log")}, load_solutions=False, timelimit=timelimit, tee=True)
else:
raise ValueError(f"no solver specified")
def get_peer_traffic(self, instance) -> Dict[Tuple[int,int], int]:
peer_traffic_map = {}
for disk in instance.disks:
for peer in instance.disks:
if disk == peer: continue
peer_traffic_map[(disk, peer)] = sum(
po.value(instance.disk_used_by_group[disk,group]) *
po.value(instance.disk_used_by_group[peer,group])
for group in instance.groups) * self.recovery_traffic_factor / (self.group_size - 1)
return peer_traffic_map
def get_incidence_matrix(self, instance) -> Dict[Tuple[int, int], bool]:
incidence_matrix = {}
for disk in instance.disks:
for group in instance.groups:
val = instance.disk_used_by_group[disk,group]
if math.isclose(po.value(val), 1):
incidence_matrix[(disk,group)] = True
if self.all_targets_used:
assert len(incidence_matrix) % self.num_nodes == 0, f"{len(incidence_matrix)=} % {self.num_nodes=}"
assert len(incidence_matrix) % self.num_groups == 0, f"{len(incidence_matrix)=} % {self.num_groups=}"
return incidence_matrix
def check_solution(self, instance):
has_peer_traffic_lower_bound = False
for c in instance.component_objects(po.Constraint):
if "peer_recovery_traffic_lower_bound_eqn" in str(c):
has_peer_traffic_lower_bound = True
peer_traffic_map = self.get_peer_traffic(instance)
for (disk, peer), peer_traffic in peer_traffic_map.items():
logger.debug(f"{disk},{peer}: {peer_traffic:.1f}")
assert peer_traffic <= self.max_recovery_traffic_on_peer + self.relax_ub + 1e-5, f"{peer_traffic=} > {self.max_recovery_traffic_on_peer=} + {self.relax_ub=}"
if has_peer_traffic_lower_bound:
assert peer_traffic >= max(0, self.max_recovery_traffic_on_peer - self.relax_lb) - 1e-5, f"{peer_traffic=} < {self.max_recovery_traffic_on_peer=} - {self.relax_lb=}"
min_peer_traffic = min(peer_traffic_map.values())
max_peer_traffic = max(peer_traffic_map.values())
total_traffic = sum(peer_traffic_map.values())
max_total_traffic = self.num_nodes * self.sum_recovery_traffic_per_failure
logger.info(f"{min_peer_traffic=:.1f} {max_peer_traffic=:.1f}")
logger.info(f"{total_traffic=} {max_total_traffic=}")
peer_traffic_diff = max_peer_traffic - min_peer_traffic
if has_peer_traffic_lower_bound:
assert peer_traffic_diff <= self.relax_ub + self.relax_lb + 1e-5, f"{peer_traffic_diff=}"
if self.balanced_incomplete_block_design:
assert math.isclose(peer_traffic_diff, 0.0, abs_tol=1e-9), f"{peer_traffic_diff=}"
assert total_traffic <= max_total_traffic + 1e-5
return total_traffic, min_peer_traffic, max_peer_traffic
def print_solution(self, instance):
for disk in instance.disks:
for group in instance.groups:
val = instance.disk_used_by_group[disk,group]
if math.isclose(po.value(val), 1):
logger.info(f"{val}: {po.value(val)}")
def save_solution(self, instance, output_path: str="output"):
incidence_matrix = self.get_incidence_matrix(instance)
with open(os.path.join(output_path, "incidence_matrix.pickle"), "wb") as fout:
pickle.dump(incidence_matrix, fout)
peer_traffic_map = self.get_peer_traffic(instance)
with open(os.path.join(output_path, "peer_traffic_map.pickle"), "wb") as fout:
pickle.dump(peer_traffic_map, fout)
def visualize_solution(self, instance, output_path: str="output", write_html=True):
incidence_matrix = self.get_incidence_matrix(instance)
disks, groups = zip(*incidence_matrix.keys())
incidence_df = pd.DataFrame(zip(disks, groups), columns=["disk", "group"])
peer_traffic_map = self.get_peer_traffic(instance)
min_peer_traffic = min(peer_traffic_map.values())
max_peer_traffic = max(peer_traffic_map.values())
fig = px.scatter(
incidence_df,
x="disk",
y="group",
title=f"{self}, min/max peer traffic: {min_peer_traffic:.1f}/{max_peer_traffic:.1f}")
fig.update_layout(
xaxis_title="Nodes",
yaxis_title="Groups",
xaxis = dict(
tickmode = 'array',
tickvals = list(range(1, self.num_nodes+1)),
),
yaxis = dict(
tickmode = 'array',
tickvals = list(range(1, self.num_groups+1)),
),
)
if write_html:
fig.write_html(os.path.join(output_path, "data_placement.html"), include_plotlyjs=True)
return fig
class RebalanceTrafficModel(DataPlacementModel):
def __init__(self, existing_incidence_matrix, chain_table_type: Literal["EC", "CR"], num_nodes, group_size, num_groups=None, num_targets_per_disk=None, min_targets_per_disk=1, bibd_only=False, qlinearize=False, relax_lb=1, relax_ub=0):
self.existing_incidence_matrix = existing_incidence_matrix
self.existing_disks, self.existing_groups = zip(*existing_incidence_matrix.keys())
num_existing_targets_per_disk = math.ceil(self.total_existing_targets / self.num_existing_disk)
min_targets_per_disk = max(min_targets_per_disk, num_existing_targets_per_disk)
if num_targets_per_disk is None:
num_nodes, num_groups, num_targets_per_disk, group_size = DataPlacementModel.find_params(num_nodes, group_size, min_r=min_targets_per_disk, bibd_only=bibd_only)
else:
assert num_targets_per_disk >= min_targets_per_disk
super().__init__(chain_table_type, num_nodes, group_size, num_groups, num_targets_per_disk, min_targets_per_disk, bibd_only, qlinearize, relax_lb, relax_ub)
@property
def num_existing_disk(self):
return max(self.existing_disks)
@property
def num_existing_groups(self):
return max(self.existing_groups)
@property
def total_existing_targets(self):
return len(self.existing_disks)
@property
def existing_group_size(self):
assert self.total_existing_targets % self.num_existing_groups == 0, f"{self.total_existing_targets=} % {self.num_existing_groups=}"
return self.total_existing_targets // self.num_existing_groups
def build_model(self):
max_existing_targets_per_disk = math.ceil(self.total_existing_targets / self.num_nodes)
logger.info(f"{self.num_existing_disk=} {self.num_existing_groups=} {self.total_existing_targets=} {max_existing_targets_per_disk=}")
assert self.num_nodes >= self.num_existing_disk, f"{self.num_nodes=} < {self.num_existing_disk=}"
assert self.num_groups >= self.num_existing_groups, f"{self.num_groups=} < {self.num_existing_groups=}"
assert self.group_size == self.existing_group_size, f"{self.group_size=} != {self.existing_group_size=}"
assert self.num_targets_per_disk >= max_existing_targets_per_disk, f"{self.num_targets_per_disk=} < {max_existing_targets_per_disk=}"
model = super().build_model()
def existing_targets_evenly_distributed_to_disks(model, disk):
return po.quicksum(model.disk_used_by_group[disk,group] for group in model.groups if group <= self.num_existing_groups) <= max_existing_targets_per_disk
model.existing_targets_evenly_distributed_to_disks_eqn = po.Constraint(model.disks, rule=existing_targets_evenly_distributed_to_disks)
def num_existing_targets_not_moved(model):
return po.quicksum(model.disk_used_by_group[disk,group] for disk in model.disks for group in model.groups if (disk,group) in self.existing_incidence_matrix)
def total_rebalance_traffic(model):
return self.total_existing_targets - num_existing_targets_not_moved(model)
model.obj = po.Objective(expr=total_rebalance_traffic, sense=po.minimize)
return model
def visualize_solution(self, instance, output_path = "output", write_html=True):
incidence_matrix = self.get_incidence_matrix(instance)
disks, groups = zip(*incidence_matrix.keys())
incidence_df = pd.DataFrame(zip(disks, groups, [g > self.num_existing_groups for g in groups]), columns=["disk", "group", "new"])
peer_traffic_map = self.get_peer_traffic(instance)
min_peer_traffic = min(peer_traffic_map.values())
max_peer_traffic = max(peer_traffic_map.values())
fig = px.scatter(
incidence_df,
x="disk",
y="group",
color="new",
title=f"{self}, min/max peer traffic: {min_peer_traffic:.1f}/{max_peer_traffic:.1f}, rebalance traffic: {po.value(instance.obj.expr)}")
fig.update_layout(
xaxis_title="Nodes",
yaxis_title="Groups",
xaxis = dict(
tickmode = 'array',
tickvals = list(range(1, self.num_nodes+1)),
),
yaxis = dict(
tickmode = 'array',
tickvals = list(range(1, self.num_groups+1)),
),
)
if write_html:
fig.write_html(os.path.join(output_path, f"{self.path_name}.html"), include_plotlyjs=True)
return fig
def main():
import psutil
import argparse
parser = argparse.ArgumentParser(prog="data_placement.py", description="3FS data placement")
parser.add_argument("-pyomo", "--pyomo_solver", default="appsi_highs", choices=["appsi_highs", "cbc", "scip"], help="Solver used by Pyomo")
parser.add_argument("-type", "--chain_table_type", type=str, required=True, choices=["CR", "EC"], help="CR - Chain Replication; EC - Erasure Coding")
parser.add_argument("-j", "--solver_threads", type=int, default=max(1, (psutil.cpu_count(logical=False) or 2)//2), help="Number of solver threads")
parser.add_argument("-v", "--num_nodes", type=int, required=True, help="Number of storage nodes")
parser.add_argument("-r", "--num_targets_per_disk", type=int, default=None, help="Number of storage targets on each disk")
parser.add_argument("-min_r", "--min_targets_per_disk", type=int, default=1, help="Min number of storage targets on each disk")
parser.add_argument("-k", "--replication_factor", "--group_size", dest="group_size", type=int, default=3, help="Replication factor or erasure coding group size")
parser.add_argument("-b", "--num_groups", type=int, default=None, help="Number of chains or EC groups")
parser.add_argument("-ql", "--qlinearize", action="store_true", help="Enable linearization of quadratic equations")
parser.add_argument("-lb", "--relax_lb", type=int, default=1, help="Relax the lower bound of peer recovery traffic")
parser.add_argument("-ub", "--relax_ub", type=int, default=0, help="Relax the upper bound of peer recovery traffic")
parser.add_argument("-relax", "--auto_relax", action="store_true", help="Auto relax the lower/upper bound of peer recovery traffic when timeout")
parser.add_argument("-bibd", "--bibd_only", action="store_true", help="Only create balanced incomplete block design (BIBD)")
parser.add_argument("-t", "--init_timelimit", type=int, default=1800, help="Initial timeout for solver")
parser.add_argument("-T", "--max_timelimit", type=int, default=3600*2, help="Max timeout for solver")
parser.add_argument("-o", "--output_path", default="output", help="Path of output files")
parser.add_argument("-m", "--existing_incidence_matrix", default=None, help="Existing incidence matrix for rebalance traffic model")
parser.add_argument("-V", "--verbose", action="store_true", help="Show verbose output")
args = parser.parse_args()
if args.existing_incidence_matrix is None:
DataPlacementModel(
args.chain_table_type,
args.num_nodes,
args.group_size,
args.num_groups,
args.num_targets_per_disk,
args.min_targets_per_disk,
args.bibd_only,
args.qlinearize,
args.relax_lb,
args.relax_ub,
).run(
args.pyomo_solver,
args.solver_threads,
args.init_timelimit,
args.max_timelimit,
args.auto_relax,
args.output_path,
args.verbose)
else:
with open(args.existing_incidence_matrix, "rb") as fin:
existing_incidence_matrix = pickle.load(fin)
RebalanceTrafficModel(
existing_incidence_matrix,
args.chain_table_type,
args.num_nodes,
args.group_size,
args.num_groups,
args.num_targets_per_disk,
args.min_targets_per_disk,
args.bibd_only,
args.qlinearize,
args.relax_lb,
args.relax_ub,
).run(
args.pyomo_solver,
args.solver_threads,
args.init_timelimit,
args.max_timelimit,
args.auto_relax,
args.output_path,
args.verbose)
if __name__ == "__main__":
main()
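The peer traffic computed by `get_peer_traffic` above counts, for every ordered disk pair, the number of groups both disks host, scaled by `recovery_traffic_factor / (group_size - 1)`. A minimal, self-contained sketch of that calculation (assuming a hypothetical `recovery_traffic_factor = 1`; the real attribute is defined elsewhere in the model):

```python
def peer_traffic(incidence, num_disks, num_groups, group_size, recovery_traffic_factor=1):
    # incidence[(disk, group)] is True if the disk hosts a target of the group
    traffic = {}
    for d in range(1, num_disks + 1):
        for p in range(1, num_disks + 1):
            if d == p:
                continue
            # groups shared by the two disks
            shared = sum(1 for g in range(1, num_groups + 1)
                         if incidence.get((d, g)) and incidence.get((p, g)))
            traffic[(d, p)] = shared * recovery_traffic_factor / (group_size - 1)
    return traffic

# 3 disks, 3 groups of size 2: groups {1,2}, {2,3}, {1,3}
m = {(1, 1): True, (2, 1): True,
     (2, 2): True, (3, 2): True,
     (1, 3): True, (3, 3): True}
t = peer_traffic(m, num_disks=3, num_groups=3, group_size=2)
```

In this toy placement every disk pair shares exactly one group, so the traffic map is uniform; this is the balance that the model's lower/upper bound constraints on peer recovery traffic push toward.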


@@ -0,0 +1,108 @@
# local test
# pytest test/test_plan.py -v -x
# production setup
import functools
import socket
import sys
import os.path
import itertools
import pandas as pd
import pyarrow as arrow
from typing import List, Literal
from loguru import logger
from smallpond.common import pytest_running
from smallpond.logical.dataset import ArrowTableDataSet
from smallpond.logical.node import Context, ConsolidateNode, DataSetPartitionNode, DataSourceNode, ArrowComputeNode, LogicalPlan, SqlEngineNode
from smallpond.execution.driver import Driver
from smallpond.execution.task import RuntimeContext, ArrowComputeTask
def solve_model(runtime_task: ArrowComputeTask,
chain_table_type, num_nodes, group_size, min_targets_per_disk,
init_timelimit, max_timelimit,
pyomo_solver="appsi_highs"):
import logging
pyomo_logger = logging.getLogger('pyomo')
pyomo_logger.setLevel(logging.WARNING)
try:
from src.model.data_placement import DataPlacementModel
except ImportError:
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
from src.model.data_placement import DataPlacementModel
model = DataPlacementModel(chain_table_type, num_nodes, group_size, min_targets_per_disk=min_targets_per_disk, bibd_only=False, qlinearize=True, relax_lb=1, relax_ub=0)
runtime_task.add_elapsed_time("build model time")
instance = model.run(
pyomo_solver=pyomo_solver,
threads=runtime_task.cpu_limit,
init_timelimit=init_timelimit,
max_timelimit=max_timelimit,
auto_relax=True,
output_root=runtime_task.runtime_output_abspath,
add_elapsed_time=runtime_task.add_elapsed_time)
return model, instance
def solve_loop(runtime_ctx: RuntimeContext, input_tables: List[arrow.Table],
init_timelimit, max_timelimit,
pyomo_solver="appsi_highs") -> arrow.Table:
runtime_task = runtime_ctx.task
model_params, = input_tables
output_table = None
schema = arrow.schema([
arrow.field("chain_table_type", arrow.string()),
arrow.field("num_nodes", arrow.uint32()),
arrow.field("group_size", arrow.uint32()),
arrow.field("disks", arrow.list_(arrow.uint32())),
arrow.field("groups", arrow.list_(arrow.uint32())),
])
for chain_table_type, num_nodes, group_size, min_targets_per_disk in zip(*model_params.to_pydict().values()):
model, instance = solve_model(runtime_task, chain_table_type, num_nodes, group_size, min_targets_per_disk, init_timelimit, max_timelimit, pyomo_solver)
incidence_matrix = model.get_incidence_matrix(instance)
disks, groups = zip(*incidence_matrix.keys())
sol_table = arrow.Table.from_arrays([[chain_table_type], [num_nodes], [group_size], [disks], [groups]], schema=schema)
output_table = sol_table if output_table is None else arrow.concat_tables((output_table, sol_table))
return output_table
def search_data_placement_plans(
chain_table_type: Literal["EC", "CR"],
num_nodes: List[int], group_size: List[int], min_targets_per_disk=1,
init_timelimit=1800, max_timelimit=3600*3,
solver_threads: int=64,
pyomo_solver="appsi_highs"):
params = pd.DataFrame([(chain_table_type, v, k, min_targets_per_disk)
for v, k in itertools.product(num_nodes, group_size) if v >= k],
columns=["chain_table_type", "num_nodes", "group_size", "min_targets_per_disk"])
logger.warning(f"params: {params}")
ctx = Context()
params_source = DataSourceNode(ctx, ArrowTableDataSet(arrow.Table.from_pandas(params)))
params_partitions = DataSetPartitionNode(ctx, (params_source,), npartitions=len(params), partition_by_rows=True)
data_placement_sols = ArrowComputeNode(
ctx, (params_partitions,),
process_func=functools.partial(solve_loop, init_timelimit=init_timelimit, max_timelimit=max_timelimit, pyomo_solver=pyomo_solver),
cpu_limit=solver_threads)
return LogicalPlan(ctx, data_placement_sols)
def main():
driver = Driver()
driver.add_argument("-pyomo", "--pyomo_solver", default="appsi_highs", choices=["appsi_highs", "cbc", "scip"], help="Solver used by Pyomo")
driver.add_argument("-type", "--chain_table_type", type=str, required=True, choices=["EC", "CR"], help="CR - Chain Replication; EC - Erasure Coding")
driver.add_argument("-v", "--num_nodes", nargs="+", type=int, required=True, help="Number of storage nodes")
driver.add_argument("-k", "--replication_factor", "--group_size", dest="group_size", type=int, default=3, help="Replication factor or erasure coding group size")
driver.add_argument("-min_r", "--min_targets_per_disk", type=int, default=1, help="Min number of storage targets on each disk")
driver.add_argument("-j", "--solver_threads", type=int, default=32, help="Number of solver threads")
driver.add_argument("-t", "--init_timelimit", type=int, default=1800, help="Initial timeout for solver")
driver.add_argument("-T", "--max_timelimit", type=int, default=3600*3, help="Max timeout for solver")
plan = search_data_placement_plans(**driver.get_arguments())
driver.run(plan)
if __name__ == "__main__":
main()
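`search_data_placement_plans` partitions a parameter grid row-by-row so each solver run lands on its own partition. The grid itself is a filtered Cartesian product; a sketch with illustrative values (the `num_nodes`/`group_size` lists below are assumptions, not defaults):

```python
import itertools

num_nodes = [10, 20]
group_size = [5, 12]
# one row per feasible combination; drop pairs where group size exceeds node count
rows = [("EC", v, k, 1)
        for v, k in itertools.product(num_nodes, group_size)
        if v >= k]
```

Each row becomes one `DataPlacementModel` instance solved independently, which is why `DataSetPartitionNode` is created with `npartitions=len(params)`.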


@@ -0,0 +1,124 @@
import argparse
import os.path
from collections import Counter, defaultdict, namedtuple
import pickle
from typing import Dict, List, Literal, Tuple
Target = namedtuple("Target", ["target_id", "node_id", "disk_index"])
Chain = namedtuple("Chain", ["chain_id", "target_list"])
def calc_target_id(target_id_prefix: int, node_id: int, disk_index: int, target_index: int):
return ((target_id_prefix * 1_000_000 + node_id) * 1_000 + (disk_index+1)) * 100 + (target_index+1)
def generate_chains(
chain_table_type: Literal["EC", "CR"],
node_id_begin: int,
node_id_end: int,
num_disks_per_node: int,
num_targets_per_disk: int,
target_id_prefix: int,
chain_id_prefix: int,
incidence_matrix: Dict[Tuple[int, int], bool],
**kwargs):
num_nodes = node_id_end - node_id_begin + 1
nodes, groups = zip(*sorted(incidence_matrix.keys()))
group_sizes = list(Counter(groups).values())
assert max(nodes) == num_nodes, f"{max(nodes)=} != {num_nodes=}"
assert all(s == group_sizes[0] for s in group_sizes[1:]), f"not all group sizes the same: {group_sizes}"
assert len(incidence_matrix) % group_sizes[0] == 0, f"{len(incidence_matrix)=} % {group_sizes[0]=} != 0"
assert len(incidence_matrix) == num_nodes * num_targets_per_disk, f"{len(incidence_matrix)=} != {num_nodes=} * {num_targets_per_disk=}"
global_target_list = []
chain_target_list = defaultdict(list)
for disk_index in range(num_disks_per_node):
group_slot_idx = defaultdict(int)
for node_id in range(node_id_begin, node_id_end+1):
for target_index in range(num_targets_per_disk):
target_id = calc_target_id(target_id_prefix, node_id, disk_index, target_index)
target_pos = (node_id - node_id_begin) * num_targets_per_disk + target_index
if chain_table_type == "EC":
group_slot_idx[groups[target_pos]] += 1
chain_index = (groups[target_pos]-1) * group_sizes[0] + group_slot_idx[groups[target_pos]]
else:
chain_index = groups[target_pos]
assert chain_index < 1_00_000, f"{chain_index} >= {1_00_000}"
chain_id = (chain_id_prefix * 1_000 + (disk_index+1)) * 1_00_000 + chain_index
target = Target(target_id, node_id, disk_index)
global_target_list.append(target)
chain_target_list[chain_id].append(target)
num_targets_on_node = list(Counter(target.node_id for target in global_target_list).values())
num_targets_on_disk = list(Counter((target.node_id, target.disk_index) for target in global_target_list).values())
assert len(global_target_list) == len(set(global_target_list)) == num_nodes * num_disks_per_node * num_targets_per_disk
assert all(x == num_targets_on_node[0] for x in num_targets_on_node[1:])
assert all(x == num_targets_on_disk[0] for x in num_targets_on_disk[1:])
if chain_table_type == "EC":
assert all(len(target_ids) == 1 for target_ids in chain_target_list.values())
assert len(chain_target_list) == num_nodes * num_disks_per_node * num_targets_per_disk
else:
assert all(len(target_ids) == group_sizes[0] for target_ids in chain_target_list.values())
assert len(chain_target_list) == num_nodes * num_disks_per_node * num_targets_per_disk // group_sizes[0]
return [Chain(chain_id, target_list) for chain_id, target_list in sorted(chain_target_list.items())]
def main():
parser = argparse.ArgumentParser(prog="gen_chain_table.py", description="Generate 3FS create target commands")
parser.add_argument("-type", "--chain_table_type", type=str, required=True, choices=["EC", "CR"], help="CR - Chain Replication; EC - Erasure Coding")
parser.add_argument("-b", "--node_id_begin", type=int, required=True, help="The first node id")
parser.add_argument("-e", "--node_id_end", type=int, required=True, help="The last node id")
parser.add_argument("-d", "--num_disks_per_node", type=int, required=True, help="Number of disks on each storage node")
parser.add_argument("-r", "--num_targets_per_disk", type=int, required=True, help="Number of storage targets on each disk")
parser.add_argument("-tp", "--target_id_prefix", type=int, default=10, help="Prefix of generated target id")
parser.add_argument("-cp", "--chain_id_prefix", type=int, default=10, help="Prefix of generated chain id")
parser.add_argument("-cs", "--chunk_size", nargs="+", help="A list of supported file chunk sizes")
parser.add_argument("-mat", "--incidence_matrix_path", type=str, required=True, help="Incidence matrix generated by data placement model")
parser.add_argument("-o", "--output_path", default="output", help="Path of output files")
args = parser.parse_args()
with open(args.incidence_matrix_path, "rb") as fin:
incidence_matrix = pickle.load(fin)
assert len(incidence_matrix) < 1_00_000
assert args.node_id_end - args.node_id_begin < 1000
assert args.node_id_end < 1_000_000
assert args.node_id_begin < 1_000_000
assert args.num_disks_per_node < 1000
assert args.num_targets_per_disk < 100
assert args.target_id_prefix < 100
assert args.chain_id_prefix < 100
chain_list = generate_chains(**vars(args), incidence_matrix=incidence_matrix)
with open(os.path.join(args.output_path, "generated_chains.csv"), "w") as fout:
print(f"ChainId,{','.join(['TargetId']*len(chain_list[0].target_list))}", file=fout)
for chain in chain_list:
print(f"{chain.chain_id},{','.join(str(target.target_id) for target in chain.target_list)}", file=fout)
with open(os.path.join(args.output_path, "generated_chain_table.csv"), "w") as fout:
print("ChainId", file=fout)
for chain in chain_list:
print(f"{chain.chain_id}", file=fout)
with open(os.path.join(args.output_path, "create_target_cmd.txt"), "w") as fout:
chunk_size_opt = f"--chunk-size {' '.join(args.chunk_size)}" if args.chunk_size else ""
for chain in chain_list:
for target in chain.target_list:
print(f"create-target --node-id {target.node_id} --disk-index {target.disk_index} --target-id {target.target_id} --chain-id {chain.chain_id} {chunk_size_opt} --use-new-chunk-engine", file=fout)
with open(os.path.join(args.output_path, "remove_target_cmd.txt"), "w") as fout:
for chain in chain_list:
for target in chain.target_list:
print(f"offline-target --node-id {target.node_id} --target-id {target.target_id}", file=fout)
print(f"remove-target --node-id {target.node_id} --target-id {target.target_id}", file=fout)
if __name__ == "__main__":
main()
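The id layout produced by `calc_target_id` packs four fields into fixed decimal digit ranges, which is why `main()` asserts `node_id < 1_000_000`, `num_disks_per_node < 1000`, and `num_targets_per_disk < 100`. A worked example (the function is restated so the snippet is self-contained; the input values are illustrative):

```python
def calc_target_id(target_id_prefix, node_id, disk_index, target_index):
    # prefix | node_id (6 digits) | disk_index+1 (3 digits) | target_index+1 (2 digits)
    return ((target_id_prefix * 1_000_000 + node_id) * 1_000 + (disk_index + 1)) * 100 + (target_index + 1)

tid = calc_target_id(10, 10001, 15, 0)
# digit fields: "10" + "010001" + "016" + "01" -> 1001000101601
```

Because each field stays within its digit budget, node, disk, and target indices can be read back off a target id by eye, which helps when auditing the generated `create-target` commands.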


@@ -0,0 +1,94 @@
import copy
import glob
import os.path
import importlib
import shutil
import tempfile
import pytest
from src.model.data_placement import DataPlacementModel, RebalanceTrafficModel
placement_params = [
# simple cases for replication group
{
"chain_table_type": "EC",
"num_nodes": 5,
"num_targets_per_disk": 6,
"group_size": 2,
},
{
"chain_table_type": "EC",
"num_nodes": 5,
"num_targets_per_disk": 6,
"group_size": 3,
},
# not all targets used: num_nodes * num_targets_per_disk % group_size != 0
{
"chain_table_type": "EC",
"num_nodes": 7,
"num_targets_per_disk": 5,
"group_size": 4,
},
# always evenly distributed: num_targets_per_disk * (group_size-1) % (num_nodes-1) == 0
{
"chain_table_type": "EC",
"num_nodes": 8,
"num_targets_per_disk": 6,
"group_size": 5,
},
# all targets used & evenly distributed
{
"chain_table_type": "EC",
"num_nodes": 10,
"num_targets_per_disk": 9,
"group_size": 5,
},
]
qlinearize = [False, True]
relax_lb = [1, 2]
@pytest.mark.parametrize('qlinearize', qlinearize[1:])
@pytest.mark.parametrize('relax_lb', relax_lb)
@pytest.mark.parametrize('placement_params', placement_params)
@pytest.mark.skipif(importlib.util.find_spec("highspy") is None, reason="cannot find solver")
def test_solve_placement_model_with_highs(placement_params, qlinearize, relax_lb):
DataPlacementModel(
**placement_params,
qlinearize=qlinearize,
relax_lb=relax_lb,
).run(pyomo_solver="appsi_highs")
@pytest.mark.parametrize('chain_table_type, num_nodes, group_size', [("CR", 25, 3), ("EC", 25, 20)])
@pytest.mark.skipif(importlib.util.find_spec("highspy") is None, reason="cannot find solver")
def test_solve_placement_model_v25(chain_table_type, num_nodes, group_size):
model = DataPlacementModel(
chain_table_type=chain_table_type,
num_nodes=num_nodes,
group_size=group_size,
qlinearize=True,
relax_lb=1,
relax_ub=1,
)
model.run(pyomo_solver="appsi_highs", max_timelimit=30, auto_relax=True)
@pytest.mark.parametrize('placement_params', placement_params)
@pytest.mark.skipif(importlib.util.find_spec("highspy") is None, reason="cannot find solver")
def test_solve_rebalance_model(placement_params):
model = DataPlacementModel(
**placement_params,
qlinearize=True,
relax_lb=1,
relax_ub=1,
)
instance = model.run(pyomo_solver="appsi_highs")
placement_params = copy.deepcopy(placement_params)
placement_params["num_nodes"] *= 2
placement_params.pop("num_targets_per_disk")
RebalanceTrafficModel(
existing_incidence_matrix=model.get_incidence_matrix(instance),
**placement_params,
qlinearize=True,
relax_lb=2,
relax_ub=1,
).run(pyomo_solver="appsi_highs", max_timelimit=15, auto_relax=True)


@@ -0,0 +1,10 @@
from smallpond.test_fabric import TestFabric
from src.model.data_placement_job import search_data_placement_plans
class TestPlan(TestFabric):
def test_search_data_placement_plans(self):
for pyomo_solver in ["appsi_highs"]:
with self.subTest(pyomo_solver=pyomo_solver):
plan = search_data_placement_plans(chain_table_type="EC", num_nodes=[10], group_size=[5, 9], solver_threads=16, pyomo_solver=pyomo_solver)
self.execute_plan(plan, num_executors=1)


@@ -0,0 +1,55 @@
from collections import Counter
import glob
import os.path
import pytest
from src.model.data_placement import DataPlacementModel
from src.setup.gen_chain_table import generate_chains
@pytest.mark.parametrize('num_nodes, num_disks_per_node, num_targets_per_disk, num_replicas', [(5, 10, 6, 2), (10, 10, 9, 3)])
def test_generate_cr_chains(num_nodes: int, num_disks_per_node: int, num_targets_per_disk: int, num_replicas: int):
model = DataPlacementModel(
chain_table_type="CR",
num_nodes=num_nodes,
num_targets_per_disk=num_targets_per_disk,
group_size=num_replicas,
qlinearize=True,
relax_lb=1,
relax_ub=1,
)
instance = model.run(pyomo_solver="appsi_highs", max_timelimit=15, auto_relax=True)
generate_chains(
chain_table_type="CR",
node_id_begin=1,
node_id_end=num_nodes,
num_disks_per_node=num_disks_per_node,
num_targets_per_disk=num_targets_per_disk,
target_id_prefix=1,
chain_id_prefix=9,
incidence_matrix=model.get_incidence_matrix(instance))
@pytest.mark.parametrize('num_nodes, num_disks_per_node, num_targets_per_disk, ec_group_size', [(20, 10, 6, 12), (25, 10, 12, 20)])
def test_generate_ec_chains(num_nodes: int, num_disks_per_node: int, num_targets_per_disk: int, ec_group_size: int):
model = DataPlacementModel(
chain_table_type="EC",
num_nodes=num_nodes,
num_targets_per_disk=num_targets_per_disk,
group_size=ec_group_size,
qlinearize=True,
relax_lb=1,
relax_ub=1,
)
instance = model.run(pyomo_solver="appsi_highs", max_timelimit=15, auto_relax=True)
generate_chains(
chain_table_type="EC",
node_id_begin=1,
node_id_end=num_nodes,
num_disks_per_node=num_disks_per_node,
num_targets_per_disk=num_targets_per_disk,
target_id_prefix=1,
chain_id_prefix=9,
incidence_matrix=model.get_incidence_matrix(instance))


@@ -0,0 +1,51 @@
CREATE DATABASE IF NOT EXISTS 3fs;
CREATE TABLE IF NOT EXISTS 3fs.counters (
`TIMESTAMP` DateTime CODEC(DoubleDelta),
`metricName` LowCardinality(String) CODEC(ZSTD(1)),
`host` LowCardinality(String) CODEC(ZSTD(1)),
`tag` LowCardinality(String) CODEC(ZSTD(1)),
`val` Int64 CODEC(ZSTD(1)),
`mount_name` LowCardinality(String) CODEC(ZSTD(1)),
`instance` String CODEC(ZSTD(1)),
`io` LowCardinality(String) CODEC(ZSTD(1)),
`uid` LowCardinality(String) CODEC(ZSTD(1)),
`pod` String CODEC(ZSTD(1)),
`thread` LowCardinality(String) CODEC(ZSTD(1)),
`statusCode` LowCardinality(String) CODEC(ZSTD(1))
)
ENGINE = MergeTree
PRIMARY KEY (metricName, host, pod, instance, TIMESTAMP)
PARTITION BY toDate(TIMESTAMP)
ORDER BY (metricName, host, pod, instance, TIMESTAMP)
TTL TIMESTAMP + toIntervalMonth(1)
SETTINGS index_granularity = 8192;
CREATE TABLE IF NOT EXISTS 3fs.distributions (
`TIMESTAMP` DateTime CODEC(DoubleDelta),
`metricName` LowCardinality(String) CODEC(ZSTD(1)),
`host` LowCardinality(String) CODEC(ZSTD(1)),
`tag` LowCardinality(String) CODEC(ZSTD(1)),
`count` Float64 CODEC(ZSTD(1)),
`mean` Float64 CODEC(ZSTD(1)),
`min` Float64 CODEC(ZSTD(1)),
`max` Float64 CODEC(ZSTD(1)),
`p50` Float64 CODEC(ZSTD(1)),
`p90` Float64 CODEC(ZSTD(1)),
`p95` Float64 CODEC(ZSTD(1)),
`p99` Float64 CODEC(ZSTD(1)),
`mount_name` LowCardinality(String) CODEC(ZSTD(1)),
`instance` String CODEC(ZSTD(1)),
`io` LowCardinality(String) CODEC(ZSTD(1)),
`uid` LowCardinality(String) CODEC(ZSTD(1)),
`method` LowCardinality(String) CODEC(ZSTD(1)),
`pod` String CODEC(ZSTD(1)),
`thread` LowCardinality(String) CODEC(ZSTD(1)),
`statusCode` LowCardinality(String) CODEC(ZSTD(1))
)
ENGINE = MergeTree
PRIMARY KEY (metricName, host, pod, instance, TIMESTAMP)
PARTITION BY toDate(TIMESTAMP)
ORDER BY (metricName, host, pod, instance, TIMESTAMP)
TTL TIMESTAMP + toIntervalMonth(1)
SETTINGS index_granularity = 8192;


@@ -0,0 +1,12 @@
[Unit]
Description=fuse_main Server
Requires=network-online.target
After=network-online.target
[Service]
LimitNOFILE=1000000
ExecStart=/opt/3fs/bin/hf3fs_fuse_main --launcher_cfg /opt/3fs/etc/hf3fs_fuse_main_launcher.toml
Type=simple
[Install]
WantedBy=multi-user.target


@@ -0,0 +1,12 @@
[Unit]
Description=meta_main Server
Requires=network-online.target
After=network-online.target
[Service]
LimitNOFILE=1000000
ExecStart=/opt/3fs/bin/meta_main --launcher_cfg /opt/3fs/etc/meta_main_launcher.toml --app-cfg /opt/3fs/etc/meta_main_app.toml
Type=simple
[Install]
WantedBy=multi-user.target


@@ -0,0 +1,12 @@
[Unit]
Description=mgmtd_main Server
Requires=network-online.target
After=network-online.target
[Service]
LimitNOFILE=1000000
ExecStart=/opt/3fs/bin/mgmtd_main --launcher_cfg /opt/3fs/etc/mgmtd_main_launcher.toml --app-cfg /opt/3fs/etc/mgmtd_main_app.toml
Type=simple
[Install]
WantedBy=multi-user.target


@@ -0,0 +1,11 @@
[Unit]
Description=monitor_collector_main Server
Requires=network-online.target
After=network-online.target
[Service]
ExecStart=/opt/3fs/bin/monitor_collector_main --cfg /opt/3fs/etc/monitor_collector_main.toml
Type=simple
[Install]
WantedBy=multi-user.target


@@ -0,0 +1,14 @@
[Unit]
Description=storage_main Server
Requires=network-online.target
After=network-online.target
[Service]
LimitNOFILE=1000000
LimitMEMLOCK=infinity
TimeoutStopSec=5m
ExecStart=/opt/3fs/bin/storage_main --launcher_cfg /opt/3fs/etc/storage_main_launcher.toml --app-cfg /opt/3fs/etc/storage_main_app.toml
Type=simple
[Install]
WantedBy=multi-user.target