Mirror of https://github.com/deepseek-ai/3FS, synced 2025-06-26 18:16:45 +00:00.

Initial commit.

**deploy/README.md** (new file, 373 lines)
# 3FS Setup Guide

This guide walks through a manual deployment of a six-node cluster with the cluster ID `stage`.

## Installation prerequisites

### Hardware specifications
| Node     | OS            | IP           | Memory | SSD        | RDMA |
|----------|---------------|--------------|--------|------------|------|
| meta     | Ubuntu 22.04  | 192.168.1.1  | 128GB  | -          | RoCE |
| storage1 | Ubuntu 22.04  | 192.168.1.2  | 512GB  | 14TB × 16  | RoCE |
| storage2 | Ubuntu 22.04  | 192.168.1.3  | 512GB  | 14TB × 16  | RoCE |
| storage3 | Ubuntu 22.04  | 192.168.1.4  | 512GB  | 14TB × 16  | RoCE |
| storage4 | Ubuntu 22.04  | 192.168.1.5  | 512GB  | 14TB × 16  | RoCE |
| storage5 | Ubuntu 22.04  | 192.168.1.6  | 512GB  | 14TB × 16  | RoCE |
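As a back-of-the-envelope check on the hardware table, the raw and usable capacity of the five storage nodes can be estimated. This is an illustrative sketch only; it assumes the 3-way chain replication configured later in Step 7.

```python
# Estimate cluster capacity from the hardware table above.
# Assumption: 3-way replication, as used in Step 7 of this guide.
nodes = 5            # storage1 .. storage5
ssds_per_node = 16   # 14TB SSDs per node
ssd_tb = 14

raw_tb = nodes * ssds_per_node * ssd_tb   # total raw capacity in TB
usable_tb = raw_tb / 3                    # divided by the replication factor

print(raw_tb, round(usable_tb, 1))  # 1120 373.3
```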
> **RDMA configuration**
> 1. Assign IP addresses to RDMA NICs. Multiple RDMA NICs (InfiniBand or RoCE) are supported on each node.
> 2. Check RDMA connectivity between nodes using `ib_write_bw`.
### Third-party dependencies

In a production environment, it is recommended to install FoundationDB and ClickHouse on dedicated nodes.

| Service | Node |
|------------|-------------------------|
| [ClickHouse](https://clickhouse.com/docs/install) | meta |
| [FoundationDB](https://apple.github.io/foundationdb/administration.html) | meta |

> **FoundationDB**
> 1. Ensure that the FoundationDB client version matches the server version, or copy the corresponding version of `libfdb_c.so` to maintain compatibility.
> 2. On nodes with FoundationDB installed, find the `fdb.cluster` file at `/etc/foundationdb/fdb.cluster` and `libfdb_c.so` at `/usr/lib/libfdb_c.so`.
---
## Step 0: Build 3FS

Follow the [instructions](../README.md#build-3fs) to build 3FS. Binaries can be found in `build/bin`.

### Services and clients

The following steps show how to install the 3FS services in `/opt/3fs/bin` and their config files in `/opt/3fs/etc`.

| Service | Binary | Config files | NodeID | Node |
|------------|-------------------------|-----------------------------------------------------------------------------|--------|---------------|
| monitor | monitor_collector_main | [monitor_collector_main.toml](../configs/monitor_collector_main.toml) | - | meta |
| admin_cli | admin_cli | [admin_cli.toml](../configs/admin_cli.toml)<br>fdb.cluster | - | meta<br>storage1<br>storage2<br>storage3<br>storage4<br>storage5 |
| mgmtd | mgmtd_main | [mgmtd_main_launcher.toml](../configs/mgmtd_main_launcher.toml)<br>[mgmtd_main.toml](../configs/mgmtd_main.toml)<br>[mgmtd_main_app.toml](../configs/mgmtd_main_app.toml)<br>fdb.cluster | 1 | meta |
| meta | meta_main | [meta_main_launcher.toml](../configs/meta_main_launcher.toml)<br>[meta_main.toml](../configs/meta_main.toml)<br>[meta_main_app.toml](../configs/meta_main_app.toml)<br>fdb.cluster | 100 | meta |
| storage | storage_main | [storage_main_launcher.toml](../configs/storage_main_launcher.toml)<br>[storage_main.toml](../configs/storage_main.toml)<br>[storage_main_app.toml](../configs/storage_main_app.toml) | 10001~10005 | storage1<br>storage2<br>storage3<br>storage4<br>storage5 |
| client | hf3fs_fuse_main | [hf3fs_fuse_main_launcher.toml](../configs/hf3fs_fuse_main_launcher.toml)<br>[hf3fs_fuse_main.toml](../configs/hf3fs_fuse_main.toml) | - | meta |
---
## Step 1: Create ClickHouse tables for metrics

Import the SQL file into ClickHouse:

```bash
clickhouse-client -n < ~/3fs/deploy/sql/3fs-monitor.sql
```
---
## Step 2: Monitor service

Install the `monitor_collector` service on the **meta** node.

1. Copy `monitor_collector_main` to `/opt/3fs/bin` and config files to `/opt/3fs/etc`, and create the log directory `/var/log/3fs`.
   ```bash
   mkdir -p /opt/3fs/{bin,etc}
   mkdir -p /var/log/3fs
   cp ~/3fs/build/bin/monitor_collector_main /opt/3fs/bin
   cp ~/3fs/configs/monitor_collector_main.toml /opt/3fs/etc
   ```
2. Update [`monitor_collector_main.toml`](../configs/monitor_collector_main.toml) to add a ClickHouse connection:
   ```toml
   [server.monitor_collector.reporter]
   type = 'clickhouse'

   [server.monitor_collector.reporter.clickhouse]
   db = '3fs'
   host = '<CH_HOST>'
   passwd = '<CH_PASSWD>'
   port = '<CH_PORT>'
   user = '<CH_USER>'
   ```
3. Start the monitor service:
   ```bash
   cp ~/3fs/deploy/systemd/monitor_collector_main.service /usr/lib/systemd/system
   systemctl start monitor_collector_main
   ```

Note that:
> - Multiple instances of the monitor service can be deployed behind a virtual IP address to share the traffic.
> - Other services communicate with the monitor service over a TCP connection.
---
## Step 3: Admin client
Install `admin_cli` on **all** nodes.

1. Copy `admin_cli` to `/opt/3fs/bin` and config files to `/opt/3fs/etc`.
   ```bash
   mkdir -p /opt/3fs/{bin,etc}
   rsync -avz meta:~/3fs/build/bin/admin_cli /opt/3fs/bin
   rsync -avz meta:~/3fs/configs/admin_cli.toml /opt/3fs/etc
   rsync -avz meta:/etc/foundationdb/fdb.cluster /opt/3fs/etc
   ```
2. Update [`admin_cli.toml`](../configs/admin_cli.toml) to set `cluster_id` and `clusterFile`:
   ```toml
   cluster_id = "stage"

   [fdb]
   clusterFile = '/opt/3fs/etc/fdb.cluster'
   ```

The full help documentation for `admin_cli` can be displayed by running:

```bash
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml help
```
---
## Step 4: Mgmtd service
Install the `mgmtd` service on the **meta** node.

1. Copy `mgmtd_main` to `/opt/3fs/bin` and config files to `/opt/3fs/etc`.
   ```bash
   cp ~/3fs/build/bin/mgmtd_main /opt/3fs/bin
   cp ~/3fs/configs/{mgmtd_main.toml,mgmtd_main_launcher.toml,mgmtd_main_app.toml} /opt/3fs/etc
   ```
2. Update config files:
   - Set mgmtd `node_id = 1` in [`mgmtd_main_app.toml`](../configs/mgmtd_main_app.toml).
   - Edit [`mgmtd_main_launcher.toml`](../configs/mgmtd_main_launcher.toml) to set the `cluster_id` and `clusterFile`:
     ```toml
     cluster_id = "stage"

     [fdb]
     clusterFile = '/opt/3fs/etc/fdb.cluster'
     ```
   - Set the monitor address in [`mgmtd_main.toml`](../configs/mgmtd_main.toml):
     ```toml
     [common.monitor.reporters.monitor_collector]
     remote_ip = "192.168.1.1:10000"
     ```
3. Initialize the cluster:
   ```bash
   /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml "init-cluster --mgmtd /opt/3fs/etc/mgmtd_main.toml 1 1048576 16"
   ```

   The positional parameters of `init-cluster`:
   > - `1`: the chain table ID
   > - `1048576`: the chunk size in bytes
   > - `16`: the file stripe size

   Run `help init-cluster` for full documentation.
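To make the `init-cluster` arguments concrete, here is a small illustrative calculation. The 64 MiB file size below is a hypothetical example, not part of the guide:

```python
# init-cluster arguments from the command above
chain_table_id = 1
chunk_size = 1048576   # chunk size in bytes (1 MiB)
stripe_size = 16       # a file's chunks are spread over 16 chains

# Hypothetical example: how many chunks a 64 MiB file occupies
file_size = 64 * 1024 * 1024
num_chunks = (file_size + chunk_size - 1) // chunk_size  # ceiling division

print(chunk_size // (1024 * 1024), num_chunks)  # 1 64
```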
4. Start the mgmtd service:
   ```bash
   cp ~/3fs/deploy/systemd/mgmtd_main.service /usr/lib/systemd/system
   systemctl start mgmtd_main
   ```
5. Run the `list-nodes` command to check if the cluster has been successfully initialized:
   ```bash
   /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "list-nodes"
   ```

If multiple instances of the `mgmtd` service are deployed, one of them is elected as the primary; the others are secondaries. Automatic failover occurs when the primary fails.
---
## Step 5: Meta service
Install the `meta` service on the **meta** node.
1. Copy `meta_main` to `/opt/3fs/bin` and config files to `/opt/3fs/etc`.
   ```bash
   cp ~/3fs/build/bin/meta_main /opt/3fs/bin
   cp ~/3fs/configs/{meta_main_launcher.toml,meta_main.toml,meta_main_app.toml} /opt/3fs/etc
   ```
2. Update config files:
   - Set meta `node_id = 100` in [`meta_main_app.toml`](../configs/meta_main_app.toml).
   - Set `cluster_id`, `clusterFile` and the mgmtd address in [`meta_main_launcher.toml`](../configs/meta_main_launcher.toml):
     ```toml
     cluster_id = "stage"

     [mgmtd_client]
     mgmtd_server_addresses = ["RDMA://192.168.1.1:8000"]
     ```
   - Set the mgmtd and monitor addresses in [`meta_main.toml`](../configs/meta_main.toml):
     ```toml
     [server.mgmtd_client]
     mgmtd_server_addresses = ["RDMA://192.168.1.1:8000"]

     [common.monitor.reporters.monitor_collector]
     remote_ip = "192.168.1.1:10000"

     [server.fdb]
     clusterFile = '/opt/3fs/etc/fdb.cluster'
     ```
3. The config file of the meta service is managed by the mgmtd service. Use `admin_cli` to upload it to mgmtd:
   ```bash
   /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "set-config --type META --file /opt/3fs/etc/meta_main.toml"
   ```
4. Start the meta service:
   ```bash
   cp ~/3fs/deploy/systemd/meta_main.service /usr/lib/systemd/system
   systemctl start meta_main
   ```
5. Run the `list-nodes` command to check if the meta service has joined the cluster:
   ```bash
   /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "list-nodes"
   ```

If multiple instances of the `meta` service are deployed, meta requests will be evenly distributed across all instances.
---
## Step 6: Storage service
Install the `storage` service on every **storage** node.
1. Format the 16 attached SSDs as XFS and mount them at `/storage/data{1..16}`, then create the data directories `/storage/data{1..16}/3fs` and the log directory `/var/log/3fs`.
   ```bash
   mkdir -p /storage/data{1..16}
   mkdir -p /var/log/3fs
   for i in {1..16}; do
     mkfs.xfs -L data${i} /dev/nvme${i}n1
     mount -o noatime,nodiratime -L data${i} /storage/data${i}
   done
   mkdir -p /storage/data{1..16}/3fs
   ```
2. Increase the maximum number of asynchronous (aio) requests:
   ```bash
   sysctl -w fs.aio-max-nr=67108864
   ```
3. Copy `storage_main` to `/opt/3fs/bin` and config files to `/opt/3fs/etc`.
   ```bash
   rsync -avz meta:~/3fs/build/bin/storage_main /opt/3fs/bin
   rsync -avz meta:~/3fs/configs/{storage_main_launcher.toml,storage_main.toml,storage_main_app.toml} /opt/3fs/etc
   ```
4. Update config files:
   - Set `node_id` in [`storage_main_app.toml`](../configs/storage_main_app.toml). Each storage service is assigned a unique id between `10001` and `10005`.
   - Set `cluster_id` and the mgmtd address in [`storage_main_launcher.toml`](../configs/storage_main_launcher.toml):
     ```toml
     cluster_id = "stage"

     [mgmtd_client]
     mgmtd_server_addresses = ["RDMA://192.168.1.1:8000"]
     ```
   - Set the mgmtd and monitor addresses and add the target paths in [`storage_main.toml`](../configs/storage_main.toml):
     ```toml
     [server.mgmtd]
     mgmtd_server_address = ["RDMA://192.168.1.1:8000"]

     [common.monitor.reporters.monitor_collector]
     remote_ip = "192.168.1.1:10000"

     [server.targets]
     target_paths = ["/storage/data1/3fs","/storage/data2/3fs","/storage/data3/3fs","/storage/data4/3fs","/storage/data5/3fs","/storage/data6/3fs","/storage/data7/3fs","/storage/data8/3fs","/storage/data9/3fs","/storage/data10/3fs","/storage/data11/3fs","/storage/data12/3fs","/storage/data13/3fs","/storage/data14/3fs","/storage/data15/3fs","/storage/data16/3fs",]
     ```
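The long `target_paths` array above follows a regular pattern, so it can be generated rather than typed by hand. A small sketch that prints the line in the same format as the TOML above:

```python
# Generate the target_paths entry for storage_main.toml,
# one /storage/dataN/3fs path per mounted SSD.
paths = [f"/storage/data{i}/3fs" for i in range(1, 17)]
line = "target_paths = [" + ",".join(f'"{p}"' for p in paths) + ",]"
print(line)
```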
5. The config file of the storage service is managed by the mgmtd service. Use `admin_cli` to upload it to mgmtd:
   ```bash
   /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "set-config --type STORAGE --file /opt/3fs/etc/storage_main.toml"
   ```
6. Start the storage service:
   ```bash
   rsync -avz meta:~/3fs/deploy/systemd/storage_main.service /usr/lib/systemd/system
   systemctl start storage_main
   ```
7. Run the `list-nodes` command to check if the storage service has joined the cluster:
   ```bash
   /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "list-nodes"
   ```
---
## Step 7: Create admin user, storage targets and chain table
1. Create an admin user:
   ```bash
   /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "user-add --root --admin 0 root"
   ```
   Save the admin token to `/opt/3fs/etc/token.txt`.
2. Generate `admin_cli` commands to create storage targets on the 5 storage nodes (16 SSDs per node, 6 targets per SSD).
   - Follow the [instructions](data_placement/README.md) to install the required Python packages.
   ```bash
   python ~/3fs/deploy/data_placement/src/model/data_placement.py \
      -ql -relax -type CR --num_nodes 5 --replication_factor 3 --min_targets_per_disk 6
   python ~/3fs/deploy/data_placement/src/setup/gen_chain_table.py \
      --chain_table_type CR --node_id_begin 10001 --node_id_end 10005 \
      --num_disks_per_node 16 --num_targets_per_disk 6 \
      --target_id_prefix 1 --chain_id_prefix 9 \
      --incidence_matrix_path output/DataPlacementModel-v_5-b_10-r_6-k_3-λ_2-lb_1-ub_1/incidence_matrix.pickle
   ```
   The following 3 files will be generated in the `output` directory: `create_target_cmd.txt`, `generated_chains.csv`, and `generated_chain_table.csv`.
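As a sanity check on the numbers above: 5 nodes × 16 SSDs × 6 targets gives 480 storage targets, and with 3-way replication each chain groups 3 targets, yielding 160 chains:

```python
# Target and chain counts implied by the generation commands above.
num_nodes, disks_per_node, targets_per_disk = 5, 16, 6
replication_factor = 3

total_targets = num_nodes * disks_per_node * targets_per_disk
num_chains = total_targets // replication_factor

print(total_targets, num_chains)  # 480 160
```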
3. Create the storage targets:
   ```bash
   /opt/3fs/bin/admin_cli --cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' --config.user_info.token $(<"/opt/3fs/etc/token.txt") < output/create_target_cmd.txt
   ```
4. Upload the chains to the mgmtd service:
   ```bash
   /opt/3fs/bin/admin_cli --cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' --config.user_info.token $(<"/opt/3fs/etc/token.txt") "upload-chains output/generated_chains.csv"
   ```
5. Upload the chain table to the mgmtd service:
   ```bash
   /opt/3fs/bin/admin_cli --cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' --config.user_info.token $(<"/opt/3fs/etc/token.txt") "upload-chain-table --desc stage 1 output/generated_chain_table.csv"
   ```
6. List the chains and chain tables to check that they have been uploaded correctly:
   ```bash
   /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "list-chains"
   /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "list-chain-tables"
   ```
---
## Step 8: FUSE client
For simplicity, the FUSE client is deployed on the **meta** node in this guide. However, we strongly advise against deploying clients on service nodes in production environments.

1. Copy `hf3fs_fuse_main` to `/opt/3fs/bin` and config files to `/opt/3fs/etc`.
   ```bash
   cp ~/3fs/build/bin/hf3fs_fuse_main /opt/3fs/bin
   cp ~/3fs/configs/{hf3fs_fuse_main_launcher.toml,hf3fs_fuse_main.toml,hf3fs_fuse_main_app.toml} /opt/3fs/etc
   ```
2. Create the mount point:
   ```bash
   mkdir -p /3fs/stage
   ```
3. Set the cluster ID, mountpoint, token file and mgmtd address in [`hf3fs_fuse_main_launcher.toml`](../configs/hf3fs_fuse_main_launcher.toml):
   ```toml
   cluster_id = "stage"
   mountpoint = '/3fs/stage'
   token_file = '/opt/3fs/etc/token.txt'

   [mgmtd_client]
   mgmtd_server_addresses = ["RDMA://192.168.1.1:8000"]
   ```
4. Set the mgmtd and monitor addresses in [`hf3fs_fuse_main.toml`](../configs/hf3fs_fuse_main.toml):
   ```toml
   [mgmtd]
   mgmtd_server_addresses = ["RDMA://192.168.1.1:8000"]

   [common.monitor.reporters.monitor_collector]
   remote_ip = "192.168.1.1:10000"
   ```
5. The config file of the FUSE client is also managed by the mgmtd service. Use `admin_cli` to upload it to mgmtd:
   ```bash
   /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.1.1:8000"]' "set-config --type FUSE --file /opt/3fs/etc/hf3fs_fuse_main.toml"
   ```
6. Start the FUSE client:
   ```bash
   cp ~/3fs/deploy/systemd/hf3fs_fuse_main.service /usr/lib/systemd/system
   systemctl start hf3fs_fuse_main
   ```
7. Check that 3FS has been mounted at `/3fs/stage`:
   ```bash
   mount | grep '/3fs/stage'
   ```
## FAQ
<details>
<summary>How to troubleshoot <code>admin_cli init-cluster</code> errors?</summary>

If mgmtd fails to start after running `init-cluster`, the most likely cause is an error in `mgmtd_main.toml`. Any changes to this file require clearing all FoundationDB data and re-running `init-cluster`.
</details>

---
<details>
<summary>How to build a single-node cluster?</summary>

A minimum of two storage services is required for data replication. If `--num_nodes=1` is set, the `gen_chain_table.py` script will fail. In a test environment, this limitation can be bypassed by deploying multiple storage services on a single machine.
</details>

---
<details>
<summary>How to update config files?</summary>

All config files are managed by mgmtd. If any `*_main.toml` is updated, such as `storage_main.toml`, the modified file should be uploaded with `admin_cli set-config`.
</details>

---
<details>
<summary>How to troubleshoot common deployment issues?</summary>

When encountering any error during deployment:
- Check the log messages in `stdout/stderr` using `journalctl`, especially during service startup.
- Check the log files stored in `/var/log/3fs/` on service and client nodes.
- Ensure that the directory `/var/log/3fs/` exists before starting any service.
</details>
**deploy/data_placement/.gitignore** (vendored, new file, 17 lines)

```
__pycache__
.ipynb_checkpoints
.tmp/
dist/
build/
output/
*.egg-info/
test/scratch/
test/runtime/
*.log
*.pyc
*.xml
.tmp/
.idea
.coverage
.vscode/
.hypothesis/
```
**deploy/data_placement/README.md** (new file, 60 lines)

# How to generate chain tables

Suppose we are going to set up a small 3FS cluster:
- 3 replicas for each chunk
- 5 storage nodes: `10001 ... 10005`
- 16 SSDs attached to each node
- 6 storage targets on each SSD

First, generate a solution of the data placement problem.

```bash
$ python src/model/data_placement.py -ql -relax -type CR --num_nodes 5 --replication_factor 3 --min_targets_per_disk 6 --init_timelimit 600

...

2025-02-24 14:25:13.623 | SUCCESS  | __main__:solve:165 - optimal solution:
- Status: ok
  Termination condition: optimal
  Termination message: TerminationCondition.optimal

2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 1,2: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 1,3: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 1,4: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 1,5: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 2,1: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 2,3: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 2,4: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 2,5: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 3,1: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 3,2: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 3,4: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 3,5: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 4,1: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 4,2: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 4,3: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 4,5: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 5,1: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 5,2: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 5,3: 1.5
2025-02-24 14:25:13.624 | DEBUG    | __main__:check_solution:322 - 5,4: 1.5
2025-02-24 14:25:13.624 | INFO     | __main__:check_solution:331 - min_peer_traffic=1.5 max_peer_traffic=1.5
2025-02-24 14:25:13.624 | INFO     | __main__:check_solution:332 - total_traffic=30.0 max_total_traffic=30
2025-02-24 14:25:14.147 | SUCCESS  | __main__:run:147 - saved solution to: output/DataPlacementModel-v_5-b_10-r_6-k_3-λ_2-lb_1-ub_1
```

Note that some combinations of `--num_nodes` and `--replication_factor` may have no solution.
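The feasibility of a parameter combination can be checked programmatically. The sketch below mirrors the parameter search in `DataPlacementModel.find_params` (from `src/model/data_placement.py`): for `v` storage nodes and replication group size `k`, it looks for the smallest number of targets per disk `r` such that `v * r` is divisible by `k` and `r * (k - 1) >= v - 1`.

```python
# Mirrors DataPlacementModel.find_params in src/model/data_placement.py:
# find the smallest feasible number of targets per disk (r) for
# v storage nodes and replication group size k.
def find_params(v, k, min_r=1, max_r=100):
    for r in range(min_r, max_r):
        if v * r % k == 0 and r * (k - 1) >= v - 1:
            b = v * r // k  # number of replication groups (chains)
            return v, b, r, k
    raise ValueError(f"cannot find valid params: {v=}, {k=}")

# The 5-node, 3-replica example above (with --min_targets_per_disk 6)
# matches the v_5-b_10-r_6-k_3 solution directory name.
print(find_params(5, 3, min_r=6))  # (5, 10, 6, 3)
```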
Then generate commands to create/remove storage targets.

```bash
$ python src/setup/gen_chain_table.py --chain_table_type CR --node_id_begin 10001 --node_id_end 10005 --num_disks_per_node 16 --num_targets_per_disk 6 --incidence_matrix_path output/DataPlacementModel-v_5-b_10-r_6-k_3-λ_2-lb_1-ub_1/incidence_matrix.pickle

$ ls -1 output/
DataPlacementModel-v_5-b_10-r_6-k_3-λ_2-lb_1-ub_1
appsi_highs.log
create_target_cmd.txt
generated_chain_table.csv
generated_chains.csv
remove_target_cmd.txt
```
**deploy/data_placement/requirements.txt** (new file, 12 lines)

```
psutil
pandas
plotly
loguru
highspy==1.8.0
pyomo==6.8.0
coverage~=7.4.4
pytest==8.2.1
pytest-cov==5.0.0
pytest-forked==1.6.0
pytest-xdist==3.6.1
pytest-timeout==2.3.1
```
**deploy/data_placement/src/__init__.py** (new file, empty)
549
deploy/data_placement/src/model/data_placement.py
Normal file
549
deploy/data_placement/src/model/data_placement.py
Normal file
@@ -0,0 +1,549 @@
|
||||
import math
|
||||
import pickle
|
||||
import random
|
||||
import time
|
||||
import psutil
|
||||
import os.path
|
||||
import pandas as pd
|
||||
import pyomo.environ as po
|
||||
import plotly.express as px
|
||||
from typing import Dict, Generator, Literal, Tuple
|
||||
from loguru import logger
|
||||
from pyomo.opt import SolverStatus, TerminationCondition
|
||||
|
||||
|
||||
class InfeasibleModel(Exception):
|
||||
pass
|
||||
|
||||
class SolverTimeout(Exception):
|
||||
pass
|
||||
|
||||
class SolverError(Exception):
|
||||
pass
|
||||
|
||||
class InvalidSolution(Exception):
|
||||
pass
|
||||
|
||||
|
||||
class DataPlacementModel(object):
|
||||
|
||||
def __init__(self, chain_table_type: Literal["EC", "CR"], num_nodes, group_size, num_groups=None, num_targets_per_disk=None, min_targets_per_disk=1, bibd_only=False, qlinearize=False, relax_lb=1, relax_ub=0):
|
||||
if num_targets_per_disk is None:
|
||||
num_nodes, num_groups, num_targets_per_disk, group_size = DataPlacementModel.find_params(num_nodes, group_size, min_r=min_targets_per_disk, bibd_only=bibd_only)
|
||||
self.chain_table_type = chain_table_type
|
||||
self.num_nodes = num_nodes
|
||||
self.group_size = group_size
|
||||
self.num_targets_per_disk = num_targets_per_disk
|
||||
self.num_groups = num_groups or self.num_targets_total // self.group_size
|
||||
self.bibd_only = bibd_only
|
||||
self.qlinearize = qlinearize
|
||||
self.relax_lb = relax_lb
|
||||
self.relax_ub = relax_ub
|
||||
|
||||
def __repr__(self):
|
||||
v, b, r, k, λ = self.v, self.b, self.r, self.k, self.λ
|
||||
lb, ub = self.relax_lb, self.relax_ub
|
||||
return f"{self.__class__.__name__}-{v=},{b=},{r=},{k=},{λ=},{lb=},{ub=}"
|
||||
|
||||
__str__ = __repr__
|
||||
|
||||
@property
|
||||
def path_name(self):
|
||||
return str(self).translate(str.maketrans(' ,:=', '---_'))
|
||||
|
||||
@property
|
||||
def v(self):
|
||||
return self.num_nodes
|
||||
|
||||
@property
|
||||
def b(self):
|
||||
return self.num_groups
|
||||
|
||||
@property
|
||||
def r(self):
|
||||
return self.num_targets_per_disk
|
||||
|
||||
@property
|
||||
def k(self):
|
||||
return self.group_size
|
||||
|
||||
@property
|
||||
def λ(self):
|
||||
return self.max_recovery_traffic_on_peer
|
||||
|
||||
@property
|
||||
def num_targets_used(self):
|
||||
return self.num_groups * self.group_size
|
||||
|
||||
@property
|
||||
def num_targets_total(self):
|
||||
return self.num_nodes * self.num_targets_per_disk
|
||||
|
||||
@property
|
||||
def all_targets_used(self):
|
||||
return self.num_targets_used == self.num_targets_total
|
||||
|
||||
@property
|
||||
def balanced_peer_traffic(self):
|
||||
return self.all_targets_used and self.sum_recovery_traffic_per_failure % (self.num_nodes-1) == 0
|
||||
|
||||
@property
|
||||
def recovery_traffic_factor(self):
|
||||
return (self.group_size - 1) if self.chain_table_type == "EC" else 1
|
||||
|
||||
@property
|
||||
def sum_recovery_traffic_per_failure(self):
|
||||
return self.num_targets_per_disk * self.recovery_traffic_factor
|
||||
|
||||
@property
|
||||
def max_recovery_traffic_on_peer(self):
|
||||
return math.ceil(self.sum_recovery_traffic_per_failure / (self.num_nodes-1))
|
||||
|
||||
@property
|
||||
def balanced_incomplete_block_design(self):
|
||||
return self.bibd_only and self.balanced_peer_traffic and self.relax_ub == 0
|
||||
|
||||
@staticmethod
|
||||
def find_params(v, k, min_r=1, max_r=100, bibd_only=False):
|
||||
if bibd_only: min_r = max(min_r, k)
|
||||
for r in range(min_r, max_r):
|
||||
if v * r % k == 0 and r * (k - 1) >= v - 1:
|
||||
b = v * r // k
|
||||
if not bibd_only or r * (k - 1) % (v - 1) == 0:
|
||||
return v, b, r, k
|
||||
raise ValueError(f"cannot find valid params: {v=}, {k=}")
|
||||
|
||||
def run(self, pyomo_solver=None, threads=psutil.cpu_count(logical=False), init_timelimit=1800, max_timelimit=3600*2, auto_relax=False, output_root="output", verbose=False, add_elapsed_time=None):
|
||||
init_relax_lb = self.relax_lb
|
||||
init_relax_ub = self.relax_ub
|
||||
timelimit = 0
|
||||
num_loops = self.max_recovery_traffic_on_peer*2
|
||||
os.makedirs(output_root, exist_ok=True)
|
||||
|
||||
for loop in range(num_loops):
|
||||
try:
|
||||
logger.info(f"solving model with {pyomo_solver} #{loop}: {self}")
|
||||
if add_elapsed_time is not None:
|
||||
add_elapsed_time()
|
||||
timelimit = min(timelimit + init_timelimit, max_timelimit)
|
||||
instance = self.solve(pyomo_solver, threads, timelimit, output_root, verbose)
|
||||
if add_elapsed_time is not None:
|
||||
add_elapsed_time(f"solve model time (lb={self.relax_lb}, ub={self.relax_ub})")
|
||||
except (InfeasibleModel, SolverTimeout) as ex:
|
||||
logger.error(f"cannot find solution for current params: {ex}")
|
||||
if auto_relax:
|
||||
self.relax_lb = init_relax_lb + (loop+1) // 2
|
||||
self.relax_ub = init_relax_ub + (loop+2) // 2
|
||||
continue
|
||||
elif loop + 1 < num_loops:
|
||||
logger.critical(f"failed to find solution after {num_loops} attempts")
|
||||
raise ex
|
||||
else:
|
||||
raise ex
|
||||
else:
|
||||
output_path = os.path.join(output_root, self.path_name)
|
||||
os.makedirs(output_path, exist_ok=True)
|
||||
self.save_solution(instance, output_path)
|
||||
self.visualize_solution(instance, output_path)
|
||||
logger.success(f"saved solution to: {output_path}")
|
||||
return instance
|
||||
|
||||
logger.catch(reraise=True, message="failed to solve model")
|
||||
def solve(self, pyomo_solver=None, threads=psutil.cpu_count(logical=False), timelimit=3600, output_path="output", verbose=False):
|
||||
if "highs" in pyomo_solver:
|
||||
self.qlinearize = True
|
||||
|
||||
instance = self.build_model()
|
||||
if verbose: instance.pprint()
|
||||
|
||||
try:
|
||||
results = self.solve_model(instance, pyomo_solver, threads, timelimit, output_path)
|
||||
except RuntimeError as ex:
|
||||
raise SolverError("unknown runtime error") from ex
|
||||
|
||||
if (results.solver.status == SolverStatus.ok) and (results.solver.termination_condition == TerminationCondition.optimal):
|
||||
logger.success(f"optimal solution: {str(results.solver)}")
|
||||
if pyomo_solver is not None: instance.solutions.load_from(results)
|
||||
elif results.solver.termination_condition == TerminationCondition.infeasible:
|
||||
raise InfeasibleModel(f"infeasible: {str(results.solver)}")
|
||||
elif results.solver.termination_condition in (TerminationCondition.maxTimeLimit, TerminationCondition.maxIterations):
|
||||
raise SolverTimeout(f"timeout: {str(results.solver)}")
|
||||
else:
|
||||
raise SolverError(f"error: {str(results.solver)}")
|
||||
|
||||
if verbose: self.print_solution(instance)
|
||||
try:
|
||||
self.check_solution(instance)
|
||||
except AssertionError as ex:
|
||||
raise InvalidSolution from ex
|
||||
return instance
|
||||
|
||||
def build_model(self):
|
||||
logger.info(f"{self.num_nodes=} {self.num_targets_per_disk=} {self.group_size=} {self.num_groups=} {self.qlinearize=} {self.relax_lb=} {self.relax_ub=}")
|
||||
# v >= k
|
||||
assert self.num_nodes >= self.group_size, f"{self.num_nodes=} < {self.group_size=}"
|
||||
# Fisher's inequality
|
||||
if self.balanced_incomplete_block_design:
|
||||
# b >= v
|
||||
assert self.num_groups >= self.num_nodes, f"{self.num_groups=} < {self.num_nodes=}"
|
||||
# r >= k
|
||||
assert self.num_targets_per_disk >= self.group_size, f"{self.num_targets_per_disk=} < {self.group_size=}"
|
||||
|
||||
logger.info(f"{self.sum_recovery_traffic_per_failure=} {self.max_recovery_traffic_on_peer=}")
|
||||
if self.sum_recovery_traffic_per_failure < self.num_nodes - 1:
|
||||
logger.warning(f"some disks do not share recovery traffic: {self.sum_recovery_traffic_per_failure=} < {self.num_nodes=} - 1")
|
||||
|
||||
logger.info(f"{self.all_targets_used=} {self.balanced_peer_traffic=}")
|
||||
logger.info(f"{self.num_targets_used=} {self.num_targets_total=}")
|
||||
if self.num_targets_used < self.num_targets_total:
|
||||
logger.warning(f"some disks have unused targets: {self.num_targets_used=} < {self.num_targets_total=}")
|
||||
else:
|
||||
assert self.num_targets_used == self.num_targets_total, f"{self.num_targets_used=} > {self.num_targets_total=}"
|
||||
|
||||
        model = po.ConcreteModel()
        # index sets
        model.disks = po.RangeSet(1, self.num_nodes)
        model.target_idxs = po.RangeSet(1, self.num_targets_per_disk)
        model.targets = model.disks * model.target_idxs
        model.groups = po.RangeSet(1, self.num_groups)

        def disk_pairs_init(model):
            for disk in model.disks:
                for peer in model.disks:
                    if peer > disk:
                        yield (disk, peer)
        model.disk_pairs = po.Set(dimen=2, initialize=disk_pairs_init)

        # variables

        model.disk_used_by_group = po.Var(model.disks, model.groups, domain=po.Binary)
        if self.qlinearize:
            model.disk_in_same_group = po.Var(model.disk_pairs, model.groups, domain=po.Binary)

        # constraints

        def calc_disk_in_same_group(model, disk, peer, group):
            return model.disk_used_by_group[disk,group] * model.disk_used_by_group[peer,group]

        def define_disk_in_same_group_lower_bound(model, disk, peer, group):
            return model.disk_used_by_group[disk,group] + model.disk_used_by_group[peer,group] <= model.disk_in_same_group[disk,peer,group] + 1

        def define_disk_in_same_group_upper_bound1(model, disk, peer, group):
            return model.disk_in_same_group[disk,peer,group] <= model.disk_used_by_group[disk,group]

        def define_disk_in_same_group_upper_bound2(model, disk, peer, group):
            return model.disk_in_same_group[disk,peer,group] <= model.disk_used_by_group[peer,group]

        if self.qlinearize:
            model.define_disk_in_same_group_lower_bound_eqn = po.Constraint(model.disk_pairs, model.groups, rule=define_disk_in_same_group_lower_bound)
            model.define_disk_in_same_group_upper_bound1_eqn = po.Constraint(model.disk_pairs, model.groups, rule=define_disk_in_same_group_upper_bound1)
            model.define_disk_in_same_group_upper_bound2_eqn = po.Constraint(model.disk_pairs, model.groups, rule=define_disk_in_same_group_upper_bound2)

        def each_disk_has_limited_capacity(model, disk):
            if self.all_targets_used:
                return po.quicksum(model.disk_used_by_group[disk,group] for group in model.groups) == self.num_targets_per_disk
            else:
                return po.quicksum(model.disk_used_by_group[disk,group] for group in model.groups) <= self.num_targets_per_disk
        model.each_disk_has_limited_capacity_eqn = po.Constraint(model.disks, rule=each_disk_has_limited_capacity)

        def enough_disks_assigned_to_each_group(model, group):
            return po.quicksum(model.disk_used_by_group[disk,group] for disk in model.disks) == self.group_size
        model.enough_disks_assigned_to_each_group_eqn = po.Constraint(model.groups, rule=enough_disks_assigned_to_each_group)

        def calc_peer_recovery_traffic(model, disk, peer):
            if self.qlinearize:
                return po.quicksum(model.disk_in_same_group[disk,peer,group] for group in model.groups)
            else:
                return po.quicksum(calc_disk_in_same_group(model, disk, peer, group) for group in model.groups)

        def peer_recovery_traffic_upper_bound(model, disk, peer):
            if self.balanced_incomplete_block_design:
                return calc_peer_recovery_traffic(model, disk, peer) == self.max_recovery_traffic_on_peer
            else:
                return calc_peer_recovery_traffic(model, disk, peer) <= self.max_recovery_traffic_on_peer + self.relax_ub
        model.peer_recovery_traffic_upper_bound_eqn = po.Constraint(model.disk_pairs, rule=peer_recovery_traffic_upper_bound)

        def peer_recovery_traffic_lower_bound(model, disk, peer):
            return calc_peer_recovery_traffic(model, disk, peer) >= max(0, self.max_recovery_traffic_on_peer - self.relax_lb)

        if self.balanced_incomplete_block_design:
            logger.info("lower bound not needed for balanced incomplete block design (BIBD)")
        elif self.all_targets_used:
            logger.info(f"lower bound imposed on peer traffic: {self.relax_lb=} {self.qlinearize=} {self.all_targets_used=}")
            model.peer_recovery_traffic_lower_bound_eqn = po.Constraint(model.disk_pairs, rule=peer_recovery_traffic_lower_bound)
        else:
            logger.info(f"lower bound not imposed on peer traffic: {self.relax_lb=} {self.qlinearize=} {self.all_targets_used=}")

        def total_recovery_traffic(model):
            return po.summation(model.disk_in_same_group) * 2

        # model.obj = po.Objective(rule=total_recovery_traffic, sense=po.minimize)
        model.obj = po.Objective(expr=1)  # dummy objective
        return model

    def solve_model(self, instance, pyomo_solver, threads, timelimit, output_path):
        if pyomo_solver is not None:
            solver = po.SolverFactory(pyomo_solver)
            return solver.solve(instance, options={"threads": str(threads), "log_file": os.path.join(output_path, f"{pyomo_solver}.log")}, load_solutions=False, timelimit=timelimit, tee=True)
        else:
            raise ValueError("no solver specified")

    def get_peer_traffic(self, instance) -> Dict[Tuple[int,int], int]:
        peer_traffic_map = {}
        for disk in instance.disks:
            for peer in instance.disks:
                if disk == peer:
                    continue
                peer_traffic_map[(disk, peer)] = sum(
                    po.value(instance.disk_used_by_group[disk,group]) *
                    po.value(instance.disk_used_by_group[peer,group])
                    for group in instance.groups) * self.recovery_traffic_factor / (self.group_size - 1)
        return peer_traffic_map

    def get_incidence_matrix(self, instance) -> Dict[Tuple[int, int], bool]:
        incidence_matrix = {}
        for disk in instance.disks:
            for group in instance.groups:
                val = instance.disk_used_by_group[disk,group]
                if math.isclose(po.value(val), 1):
                    incidence_matrix[(disk,group)] = True
        if self.all_targets_used:
            assert len(incidence_matrix) % self.num_nodes == 0, f"{len(incidence_matrix)=} % {self.num_nodes=}"
            assert len(incidence_matrix) % self.num_groups == 0, f"{len(incidence_matrix)=} % {self.num_groups=}"
        return incidence_matrix

    def check_solution(self, instance):
        has_peer_traffic_lower_bound = False
        for c in instance.component_objects(po.Constraint):
            if "peer_recovery_traffic_lower_bound_eqn" in str(c):
                has_peer_traffic_lower_bound = True

        peer_traffic_map = self.get_peer_traffic(instance)
        for (disk, peer), peer_traffic in peer_traffic_map.items():
            logger.debug(f"{disk},{peer}: {peer_traffic:.1f}")
            assert peer_traffic <= self.max_recovery_traffic_on_peer + self.relax_ub + 1e-5, f"{peer_traffic=} > {self.max_recovery_traffic_on_peer=} + {self.relax_ub}"
            if has_peer_traffic_lower_bound:
                assert peer_traffic >= max(0, self.max_recovery_traffic_on_peer - self.relax_lb) - 1e-5, f"{peer_traffic=} < {self.max_recovery_traffic_on_peer=} - {self.relax_lb}"

        min_peer_traffic = min(peer_traffic_map.values())
        max_peer_traffic = max(peer_traffic_map.values())
        total_traffic = sum(peer_traffic_map.values())
        max_total_traffic = self.num_nodes * self.sum_recovery_traffic_per_failure
        logger.info(f"{min_peer_traffic=:.1f} {max_peer_traffic=:.1f}")
        logger.info(f"{total_traffic=} {max_total_traffic=}")

        peer_traffic_diff = max_peer_traffic - min_peer_traffic
        if has_peer_traffic_lower_bound:
            assert peer_traffic_diff <= self.relax_ub + self.relax_lb + 1e-5, f"{peer_traffic_diff=}"
        if self.balanced_incomplete_block_design:
            assert math.isclose(peer_traffic_diff, 0.0, abs_tol=1e-9), f"{peer_traffic_diff=}"

        assert total_traffic <= max_total_traffic + 1e-5
        return total_traffic, min_peer_traffic, max_peer_traffic

    def print_solution(self, instance):
        for disk in instance.disks:
            for group in instance.groups:
                val = instance.disk_used_by_group[disk,group]
                if math.isclose(po.value(val), 1):
                    logger.info(f"{val}: {po.value(val)}")

    def save_solution(self, instance, output_path: str="output"):
        incidence_matrix = self.get_incidence_matrix(instance)
        with open(os.path.join(output_path, "incidence_matrix.pickle"), "wb") as fout:
            pickle.dump(incidence_matrix, fout)

        peer_traffic_map = self.get_peer_traffic(instance)
        with open(os.path.join(output_path, "peer_traffic_map.pickle"), "wb") as fout:
            pickle.dump(peer_traffic_map, fout)

    def visualize_solution(self, instance, output_path: str="output", write_html=True):
        incidence_matrix = self.get_incidence_matrix(instance)
        disks, groups = zip(*incidence_matrix.keys())
        incidence_df = pd.DataFrame(zip(disks, groups), columns=["disk", "group"])

        peer_traffic_map = self.get_peer_traffic(instance)
        min_peer_traffic = min(peer_traffic_map.values())
        max_peer_traffic = max(peer_traffic_map.values())

        fig = px.scatter(
            incidence_df,
            x="disk",
            y="group",
            title=f"{self}, min/max peer traffic: {min_peer_traffic:.1f}/{max_peer_traffic:.1f}")
        fig.update_layout(
            xaxis_title="Nodes",
            yaxis_title="Groups",
            xaxis=dict(
                tickmode='array',
                tickvals=list(range(1, self.num_nodes+1)),
            ),
            yaxis=dict(
                tickmode='array',
                tickvals=list(range(1, self.num_groups+1)),
            ),
        )

        if write_html:
            fig.write_html(os.path.join(output_path, "data_placement.html"), include_plotlyjs=True)
        return fig


class RebalanceTrafficModel(DataPlacementModel):

    def __init__(self, existing_incidence_matrix, chain_table_type: Literal["EC", "CR"], num_nodes, group_size, num_groups=None, num_targets_per_disk=None, min_targets_per_disk=1, bibd_only=False, qlinearize=False, relax_lb=1, relax_ub=0):
        self.existing_incidence_matrix = existing_incidence_matrix
        self.existing_disks, self.existing_groups = zip(*existing_incidence_matrix.keys())
        num_existing_targets_per_disk = math.ceil(self.total_existing_targets / self.num_existing_disk)
        min_targets_per_disk = max(min_targets_per_disk, num_existing_targets_per_disk)
        if num_targets_per_disk is None:
            num_nodes, num_groups, num_targets_per_disk, group_size = DataPlacementModel.find_params(num_nodes, group_size, min_r=min_targets_per_disk, bibd_only=bibd_only)
        else:
            assert num_targets_per_disk >= min_targets_per_disk
        super().__init__(chain_table_type, num_nodes, group_size, num_groups, num_targets_per_disk, min_targets_per_disk, bibd_only, qlinearize, relax_lb, relax_ub)

    @property
    def num_existing_disk(self):
        return max(self.existing_disks)

    @property
    def num_existing_groups(self):
        return max(self.existing_groups)

    @property
    def total_existing_targets(self):
        return len(self.existing_disks)

    @property
    def existing_group_size(self):
        assert self.total_existing_targets % self.num_existing_groups == 0, f"{self.total_existing_targets=} % {self.num_existing_groups=}"
        return self.total_existing_targets // self.num_existing_groups

    def build_model(self):
        max_existing_targets_per_disk = math.ceil(self.total_existing_targets / self.num_nodes)
        logger.info(f"{self.num_existing_disk=} {self.num_existing_groups=} {self.total_existing_targets=} {max_existing_targets_per_disk=}")

        assert self.num_nodes >= self.num_existing_disk, f"{self.num_nodes=} < {self.num_existing_disk=}"
        assert self.num_groups >= self.num_existing_groups, f"{self.num_groups=} < {self.num_existing_groups=}"
        assert self.group_size == self.existing_group_size, f"{self.group_size=} != {self.existing_group_size=}"
        assert self.num_targets_per_disk >= max_existing_targets_per_disk, f"{self.num_targets_per_disk=} < {max_existing_targets_per_disk=}"

        model = super().build_model()

        def existing_targets_evenly_distributed_to_disks(model, disk):
            return po.quicksum(model.disk_used_by_group[disk,group] for group in model.groups if group <= self.num_existing_groups) <= max_existing_targets_per_disk
        model.existing_targets_evenly_distributed_to_disks_eqn = po.Constraint(model.disks, rule=existing_targets_evenly_distributed_to_disks)

        def num_existing_targets_not_moved(model):
            return po.quicksum(model.disk_used_by_group[disk,group] for disk in model.disks for group in model.groups if (disk,group) in self.existing_incidence_matrix)

        def total_rebalance_traffic(model):
            return self.total_existing_targets - num_existing_targets_not_moved(model)

        # replace the parent's dummy objective with the rebalance traffic objective
        model.del_component(model.obj)
        model.obj = po.Objective(rule=total_rebalance_traffic, sense=po.minimize)
        return model

    def visualize_solution(self, instance, output_path="output", write_html=True):
        incidence_matrix = self.get_incidence_matrix(instance)
        disks, groups = zip(*incidence_matrix.keys())
        incidence_df = pd.DataFrame(zip(disks, groups, [g > self.num_existing_groups for g in groups]), columns=["disk", "group", "new"])

        peer_traffic_map = self.get_peer_traffic(instance)
        min_peer_traffic = min(peer_traffic_map.values())
        max_peer_traffic = max(peer_traffic_map.values())

        fig = px.scatter(
            incidence_df,
            x="disk",
            y="group",
            color="new",
            title=f"{self}, min/max peer traffic: {min_peer_traffic:.1f}/{max_peer_traffic:.1f}, rebalance traffic: {po.value(instance.obj.expr)}")
        fig.update_layout(
            xaxis_title="Nodes",
            yaxis_title="Groups",
            xaxis=dict(
                tickmode='array',
                tickvals=list(range(1, self.num_nodes+1)),
            ),
            yaxis=dict(
                tickmode='array',
                tickvals=list(range(1, self.num_groups+1)),
            ),
        )

        if write_html:
            fig.write_html(os.path.join(output_path, f"{self.path_name}.html"), include_plotlyjs=True)
        return fig


def main():
    import psutil
    import argparse

    parser = argparse.ArgumentParser(prog="model.py", description="3FS data placement")
    parser.add_argument("-pyomo", "--pyomo_solver", default="appsi_highs", choices=["appsi_highs", "cbc", "scip"], help="Solver used by Pyomo")
    parser.add_argument("-type", "--chain_table_type", type=str, required=True, choices=["CR", "EC"], help="CR - Chain Replication; EC - Erasure Coding")
    parser.add_argument("-j", "--solver_threads", type=int, default=psutil.cpu_count(logical=False)//2, help="Number of solver threads")
    parser.add_argument("-v", "--num_nodes", type=int, required=True, help="Number of storage nodes")
    parser.add_argument("-r", "--num_targets_per_disk", type=int, default=None, help="Number of storage targets on each disk")
    parser.add_argument("-min_r", "--min_targets_per_disk", type=int, default=1, help="Min number of storage targets on each disk")
    parser.add_argument("-k", "--replication_factor", "--group_size", dest="group_size", type=int, default=3, help="Replication factor or erasure coding group size")
    parser.add_argument("-b", "--num_groups", type=int, default=None, help="Number of chains or EC groups")
    parser.add_argument("-ql", "--qlinearize", action="store_true", help="Enable linearization of quadratic equations")
    parser.add_argument("-lb", "--relax_lb", type=int, default=1, help="Relax the lower bound of peer recovery traffic")
    parser.add_argument("-ub", "--relax_ub", type=int, default=0, help="Relax the upper bound of peer recovery traffic")
    parser.add_argument("-relax", "--auto_relax", action="store_true", help="Automatically relax the lower/upper bounds of peer recovery traffic on timeout")
    parser.add_argument("-bibd", "--bibd_only", action="store_true", help="Only create balanced incomplete block designs (BIBD)")
    parser.add_argument("-t", "--init_timelimit", type=int, default=1800, help="Initial timeout for solver")
    parser.add_argument("-T", "--max_timelimit", type=int, default=3600*2, help="Max timeout for solver")
    parser.add_argument("-o", "--output_path", default="output", help="Path of output files")
    parser.add_argument("-m", "--existing_incidence_matrix", default=None, help="Existing incidence matrix for rebalance traffic model")
    parser.add_argument("-V", "--verbose", action="store_true", help="Show verbose output")
    args = parser.parse_args()

    if args.existing_incidence_matrix is None:
        DataPlacementModel(
            args.chain_table_type,
            args.num_nodes,
            args.group_size,
            args.num_groups,
            args.num_targets_per_disk,
            args.min_targets_per_disk,
            args.bibd_only,
            args.qlinearize,
            args.relax_lb,
            args.relax_ub,
        ).run(
            args.pyomo_solver,
            args.solver_threads,
            args.init_timelimit,
            args.max_timelimit,
            args.auto_relax,
            args.output_path,
            args.verbose)
    else:
        with open(args.existing_incidence_matrix, "rb") as fin:
            existing_incidence_matrix = pickle.load(fin)
        RebalanceTrafficModel(
            existing_incidence_matrix,
            args.chain_table_type,
            args.num_nodes,
            args.group_size,
            args.num_groups,
            args.num_targets_per_disk,
            args.min_targets_per_disk,
            args.bibd_only,
            args.qlinearize,
            args.relax_lb,
            args.relax_ub,
        ).run(
            args.pyomo_solver,
            args.solver_threads,
            args.init_timelimit,
            args.max_timelimit,
            args.auto_relax,
            args.output_path,
            args.verbose)


if __name__ == "__main__":
    main()
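The `qlinearize` option above replaces each quadratic term `disk_used_by_group[disk,group] * disk_used_by_group[peer,group]` with an auxiliary binary variable constrained by three linear inequalities. A minimal sketch in plain Python (outside Pyomo) can verify that those inequalities pin the auxiliary variable to the product for every binary input:

```python
from itertools import product

def feasible_z(x: int, y: int) -> list:
    # z values satisfying the three linearization constraints used in
    # build_model: x + y <= z + 1, z <= x, z <= y (all variables binary)
    return [z for z in (0, 1) if x + y <= z + 1 and z <= x and z <= y]

# For every binary (x, y) the only feasible z equals the product x * y.
for x, y in product((0, 1), repeat=2):
    assert feasible_z(x, y) == [x * y]
print("linearization matches x*y on all binary inputs")
```

This is the standard McCormick-style linearization of a product of binaries; it lets MILP solvers such as HiGHS handle the model without quadratic support.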
108
deploy/data_placement/src/model/data_placement_job.py
Normal file
@@ -0,0 +1,108 @@
|
||||
# local test
|
||||
# pytest test/test_plan.py -v -x
|
||||
# production setup
|
||||
import functools
|
||||
import socket
|
||||
import sys
|
||||
import os.path
|
||||
import itertools
|
||||
import pandas as pd
|
||||
import pyarrow as arrow
|
||||
from typing import List, Literal
|
||||
from loguru import logger
|
||||
from smallpond.common import pytest_running
|
||||
from smallpond.logical.dataset import ArrowTableDataSet
|
||||
from smallpond.logical.node import Context, ConsolidateNode, DataSetPartitionNode, DataSourceNode, ArrowComputeNode, LogicalPlan, SqlEngineNode
|
||||
from smallpond.execution.driver import Driver
|
||||
from smallpond.execution.task import RuntimeContext, ArrowComputeTask
|
||||
|
||||
|
||||
def solve_model(runtime_task: ArrowComputeTask,
                chain_table_type, num_nodes, group_size, min_targets_per_disk,
                init_timelimit, max_timelimit,
                pyomo_solver="appsi_highs"):
    import logging
    pyomo_logger = logging.getLogger('pyomo')
    pyomo_logger.setLevel(logging.WARNING)

    try:
        from src.model.data_placement import DataPlacementModel
    except ImportError:
        sys.path.append(os.path.dirname(os.path.abspath(__file__)))
        from src.model.data_placement import DataPlacementModel

    model = DataPlacementModel(chain_table_type, num_nodes, group_size, min_targets_per_disk=min_targets_per_disk, bibd_only=False, qlinearize=True, relax_lb=1, relax_ub=0)
    runtime_task.add_elapsed_time("build model time")

    instance = model.run(
        pyomo_solver=pyomo_solver,
        threads=runtime_task.cpu_limit,
        init_timelimit=init_timelimit,
        max_timelimit=max_timelimit,
        auto_relax=True,
        output_root=runtime_task.runtime_output_abspath,
        add_elapsed_time=runtime_task.add_elapsed_time)
    return model, instance

def solve_loop(runtime_ctx: RuntimeContext, input_tables: List[arrow.Table],
               init_timelimit, max_timelimit,
               pyomo_solver="appsi_highs") -> arrow.Table:
    runtime_task = runtime_ctx.task
    model_params, = input_tables

    output_table = None
    schema = arrow.schema([
        arrow.field("chain_table_type", arrow.string()),
        arrow.field("num_nodes", arrow.uint32()),
        arrow.field("group_size", arrow.uint32()),
        arrow.field("disks", arrow.list_(arrow.uint32())),
        arrow.field("groups", arrow.list_(arrow.uint32())),
    ])

    for chain_table_type, num_nodes, group_size, min_targets_per_disk in zip(*model_params.to_pydict().values()):
        model, instance = solve_model(runtime_task, chain_table_type, num_nodes, group_size, min_targets_per_disk, init_timelimit, max_timelimit, pyomo_solver)
        incidence_matrix = model.get_incidence_matrix(instance)
        disks, groups = zip(*incidence_matrix.keys())
        sol_table = arrow.Table.from_arrays([[chain_table_type], [num_nodes], [group_size], [disks], [groups]], schema=schema)
        output_table = sol_table if output_table is None else arrow.concat_tables((output_table, sol_table))
    return output_table


def search_data_placement_plans(
        chain_table_type: Literal["EC", "CR"],
        num_nodes: List[int], group_size: List[int], min_targets_per_disk=1,
        init_timelimit=1800, max_timelimit=3600*3,
        solver_threads: int=64,
        pyomo_solver="appsi_highs",
        **kwargs):  # absorbs driver-supplied arguments such as num_executors
    params = pd.DataFrame([(chain_table_type, v, k, min_targets_per_disk)
                           for v, k in itertools.product(num_nodes, group_size) if v >= k],
                          columns=["chain_table_type", "num_nodes", "group_size", "min_targets_per_disk"])
    logger.warning(f"params: {params}")

    ctx = Context()
    params_source = DataSourceNode(ctx, ArrowTableDataSet(arrow.Table.from_pandas(params)))
    params_partitions = DataSetPartitionNode(ctx, (params_source,), npartitions=len(params), partition_by_rows=True)

    data_placement_sols = ArrowComputeNode(
        ctx, (params_partitions,),
        process_func=functools.partial(solve_loop, init_timelimit=init_timelimit, max_timelimit=max_timelimit, pyomo_solver=pyomo_solver),
        cpu_limit=solver_threads)
    return LogicalPlan(ctx, data_placement_sols)


def main():
    driver = Driver()
    driver.add_argument("-pyomo", "--pyomo_solver", default="appsi_highs", choices=["appsi_highs", "cbc", "scip"], help="Solver used by Pyomo")
    driver.add_argument("-type", "--chain_table_type", type=str, required=True, choices=["EC", "CR"], help="CR - Chain Replication; EC - Erasure Coding")
    driver.add_argument("-v", "--num_nodes", nargs="+", type=int, required=True, help="Number of storage nodes")
    driver.add_argument("-k", "--replication_factor", "--group_size", dest="group_size", nargs="+", type=int, default=[3], help="Replication factor or erasure coding group size")
    driver.add_argument("-min_r", "--min_targets_per_disk", type=int, default=1, help="Min number of storage targets on each disk")
    driver.add_argument("-j", "--solver_threads", type=int, default=32, help="Number of solver threads")
    driver.add_argument("-t", "--init_timelimit", type=int, default=1800, help="Initial timeout for solver")
    driver.add_argument("-T", "--max_timelimit", type=int, default=3600*3, help="Max timeout for solver")
    plan = search_data_placement_plans(num_executors=driver.num_executors, **driver.get_arguments())
    driver.run(plan)


if __name__ == "__main__":
    main()
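`search_data_placement_plans` fans out one solver task per `(num_nodes, group_size)` combination; the parameter grid is simply a Cartesian product filtered by the feasibility condition `v >= k` (a group must fit on distinct nodes). A small standalone sketch with made-up sizes:

```python
import itertools

num_nodes = [5, 10]
group_size = [3, 6]
# keep only combinations where a group fits on distinct nodes (v >= k),
# mirroring the filter used when building the parameter DataFrame
params = [(v, k) for v, k in itertools.product(num_nodes, group_size) if v >= k]
print(params)  # [(5, 3), (10, 3), (10, 6)]
```

Each surviving tuple becomes one row of the parameter table and hence one partition solved independently by `solve_loop`.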
0
deploy/data_placement/src/setup/__init__.py
Normal file
124
deploy/data_placement/src/setup/gen_chain_table.py
Normal file
@@ -0,0 +1,124 @@
import argparse
import os.path
from collections import Counter, defaultdict, namedtuple
import pickle
from typing import Dict, List, Literal, Tuple


Target = namedtuple("Target", ["target_id", "node_id", "disk_index"])
Chain = namedtuple("Chain", ["chain_id", "target_list"])


def calc_target_id(target_id_prefix: int, node_id: int, disk_index: int, target_index: int):
    return ((target_id_prefix * 1_000_000 + node_id) * 1_000 + (disk_index+1)) * 100 + (target_index+1)


def generate_chains(
        chain_table_type: Literal["EC", "CR"],
        node_id_begin: int,
        node_id_end: int,
        num_disks_per_node: int,
        num_targets_per_disk: int,
        target_id_prefix: int,
        chain_id_prefix: int,
        incidence_matrix: Dict[Tuple[int, int], bool],
        **kwargs):
    num_nodes = node_id_end - node_id_begin + 1
    nodes, groups = zip(*sorted(incidence_matrix.keys()))
    group_sizes = list(Counter(groups).values())
    assert max(nodes) == num_nodes, f"{max(nodes)=} != {num_nodes=}"
    assert all(s == group_sizes[0] for s in group_sizes[1:]), f"not all group sizes are the same: {group_sizes}"
    assert len(incidence_matrix) % group_sizes[0] == 0, f"{len(incidence_matrix)=} % {group_sizes[0]=} != 0"
    assert len(incidence_matrix) == num_nodes * num_targets_per_disk, f"{len(incidence_matrix)=} != {num_nodes=} * {num_targets_per_disk=}"

    global_target_list = []
    chain_target_list = defaultdict(list)

    for disk_index in range(num_disks_per_node):
        group_slot_idx = defaultdict(int)
        for node_id in range(node_id_begin, node_id_end+1):
            for target_index in range(num_targets_per_disk):
                target_id = calc_target_id(target_id_prefix, node_id, disk_index, target_index)
                target_pos = (node_id - node_id_begin) * num_targets_per_disk + target_index

                if chain_table_type == "EC":
                    group_slot_idx[groups[target_pos]] += 1
                    chain_index = (groups[target_pos]-1) * group_sizes[0] + group_slot_idx[groups[target_pos]]
                else:
                    chain_index = groups[target_pos]

                assert chain_index < 100_000, f"{chain_index} >= {100_000}"
                chain_id = (chain_id_prefix * 1_000 + (disk_index+1)) * 100_000 + chain_index
                target = Target(target_id, node_id, disk_index)
                global_target_list.append(target)
                chain_target_list[chain_id].append(target)

    num_targets_on_node = list(Counter(target.node_id for target in global_target_list).values())
    num_targets_on_disk = list(Counter((target.node_id, target.disk_index) for target in global_target_list).values())
    assert len(global_target_list) == len(set(global_target_list)) == num_nodes * num_disks_per_node * num_targets_per_disk
    assert all(x == num_targets_on_node[0] for x in num_targets_on_node[1:])
    assert all(x == num_targets_on_disk[0] for x in num_targets_on_disk[1:])

    if chain_table_type == "EC":
        assert all(len(target_ids) == 1 for target_ids in chain_target_list.values())
        assert len(chain_target_list) == num_nodes * num_disks_per_node * num_targets_per_disk
    else:
        assert all(len(target_ids) == group_sizes[0] for target_ids in chain_target_list.values())
        assert len(chain_target_list) == num_nodes * num_disks_per_node * num_targets_per_disk // group_sizes[0]

    return [Chain(chain_id, target_list) for chain_id, target_list in sorted(chain_target_list.items())]


def main():
    parser = argparse.ArgumentParser(prog="gen_chain_table.py", description="Generate 3FS create target commands")
    parser.add_argument("-type", "--chain_table_type", type=str, required=True, choices=["EC", "CR"], help="CR - Chain Replication; EC - Erasure Coding")
    parser.add_argument("-b", "--node_id_begin", type=int, required=True, help="The first node id")
    parser.add_argument("-e", "--node_id_end", type=int, required=True, help="The last node id")
    parser.add_argument("-d", "--num_disks_per_node", type=int, required=True, help="Number of disks on each storage node")
    parser.add_argument("-r", "--num_targets_per_disk", type=int, required=True, help="Number of storage targets on each disk")
    parser.add_argument("-tp", "--target_id_prefix", type=int, default=10, help="Prefix of generated target ids")
    parser.add_argument("-cp", "--chain_id_prefix", type=int, default=10, help="Prefix of generated chain ids")
    parser.add_argument("-cs", "--chunk_size", nargs="+", help="A list of supported file chunk sizes")
    parser.add_argument("-mat", "--incidence_matrix_path", type=str, required=True, help="Incidence matrix generated by the data placement model")
    parser.add_argument("-o", "--output_path", default="output", help="Path of output files")
    args = parser.parse_args()

    with open(args.incidence_matrix_path, "rb") as fin:
        incidence_matrix = pickle.load(fin)

    assert len(incidence_matrix) < 100_000
    assert args.node_id_end - args.node_id_begin < 1000
    assert args.node_id_end < 1_000_000
    assert args.node_id_begin < 1_000_000
    assert args.num_disks_per_node < 1000
    assert args.num_targets_per_disk < 100
    assert args.target_id_prefix < 100
    assert args.chain_id_prefix < 100

    chain_list = generate_chains(**vars(args), incidence_matrix=incidence_matrix)

    with open(os.path.join(args.output_path, "generated_chains.csv"), "w") as fout:
        print(f"ChainId,{','.join(['TargetId']*len(chain_list[0].target_list))}", file=fout)
        for chain in chain_list:
            print(f"{chain.chain_id},{','.join(str(target.target_id) for target in chain.target_list)}", file=fout)

    with open(os.path.join(args.output_path, "generated_chain_table.csv"), "w") as fout:
        print("ChainId", file=fout)
        for chain in chain_list:
            print(f"{chain.chain_id}", file=fout)

    with open(os.path.join(args.output_path, "create_target_cmd.txt"), "w") as fout:
        chunk_size_opt = f"--chunk-size {' '.join(args.chunk_size)}" if args.chunk_size else ""
        for chain in chain_list:
            for target in chain.target_list:
                print(f"create-target --node-id {target.node_id} --disk-index {target.disk_index} --target-id {target.target_id} --chain-id {chain.chain_id} {chunk_size_opt} --use-new-chunk-engine", file=fout)

    with open(os.path.join(args.output_path, "remove_target_cmd.txt"), "w") as fout:
        for chain in chain_list:
            for target in chain.target_list:
                print(f"offline-target --node-id {target.node_id} --target-id {target.target_id}", file=fout)
                print(f"remove-target --node-id {target.node_id} --target-id {target.target_id}", file=fout)


if __name__ == "__main__":
    main()
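The target id encoding above packs the prefix, node id, 1-based disk index, and 1-based target index into fixed decimal fields, which is why `main` asserts the `< 1000` / `< 100` bounds before generating ids. A standalone sketch of the same arithmetic:

```python
def calc_target_id(target_id_prefix: int, node_id: int, disk_index: int, target_index: int) -> int:
    # same formula as gen_chain_table.py: prefix, then node id,
    # then a 3-digit disk field (1-based), then a 2-digit target field (1-based)
    return ((target_id_prefix * 1_000_000 + node_id) * 1_000 + (disk_index + 1)) * 100 + (target_index + 1)

# node 10001, first disk, first target under prefix 10
print(calc_target_id(10, 10001, 0, 0))  # 1001000100101
```

Because each field occupies a fixed number of decimal digits, the node, disk, and target of any id can be read off by eye, which makes operator commands like `create-target` easy to audit.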
94
deploy/data_placement/test/test_model.py
Normal file
@@ -0,0 +1,94 @@
import copy
import glob
import os.path
import importlib.util
import shutil
import tempfile
import pytest
from src.model.data_placement import DataPlacementModel, RebalanceTrafficModel


placement_params = [
    # simple cases for replication groups
    {
        "chain_table_type": "EC",
        "num_nodes": 5,
        "num_targets_per_disk": 6,
        "group_size": 2,
    },
    {
        "chain_table_type": "EC",
        "num_nodes": 5,
        "num_targets_per_disk": 6,
        "group_size": 3,
    },
    # not all targets used: num_nodes * num_targets_per_disk % group_size != 0
    {
        "chain_table_type": "EC",
        "num_nodes": 7,
        "num_targets_per_disk": 5,
        "group_size": 4,
    },
    # always evenly distributed: num_targets_per_disk * (group_size-1) % (num_nodes-1) == 0
    {
        "chain_table_type": "EC",
        "num_nodes": 8,
        "num_targets_per_disk": 6,
        "group_size": 5,
    },
    # all targets used & evenly distributed
    {
        "chain_table_type": "EC",
        "num_nodes": 10,
        "num_targets_per_disk": 9,
        "group_size": 5,
    },
]
qlinearize = [False, True]
relax_lb = [1, 2]

@pytest.mark.parametrize('qlinearize', qlinearize[1:])
|
||||
@pytest.mark.parametrize('relax_lb', relax_lb)
|
||||
@pytest.mark.parametrize('placement_params', placement_params)
|
||||
@pytest.mark.skipif(importlib.util.find_spec("highspy") is None, reason="cannot find solver")
|
||||
def test_solve_placement_model_with_highs(placement_params, qlinearize, relax_lb):
|
||||
DataPlacementModel(
|
||||
**placement_params,
|
||||
qlinearize=qlinearize,
|
||||
relax_lb=relax_lb,
|
||||
).run(pyomo_solver="appsi_highs")
|
||||
|
||||
@pytest.mark.parametrize('chain_table_type, num_nodes, group_size', [("CR", 25, 3), ("EC", 25, 20)])
|
||||
@pytest.mark.skipif(importlib.util.find_spec("highspy") is None, reason="cannot find solver")
|
||||
def test_solve_placement_model_v25(chain_table_type, num_nodes, group_size):
|
||||
model = DataPlacementModel(
|
||||
chain_table_type=chain_table_type,
|
||||
num_nodes=num_nodes,
|
||||
group_size=group_size,
|
||||
qlinearize=True,
|
||||
relax_lb=1,
|
||||
relax_ub=1,
|
||||
)
|
||||
model.run(pyomo_solver="appsi_highs", max_timelimit=30, auto_relax=True)
|
||||
|
||||
@pytest.mark.parametrize('placement_params', placement_params)
|
||||
@pytest.mark.skipif(importlib.util.find_spec("highspy") is None, reason="cannot find solver")
|
||||
def test_solve_rebalance_model(placement_params):
|
||||
model = DataPlacementModel(
|
||||
**placement_params,
|
||||
qlinearize=True,
|
||||
relax_lb=1,
|
||||
relax_ub=1,
|
||||
)
|
||||
instance = model.run(pyomo_solver="appsi_highs")
|
||||
|
||||
placement_params = copy.deepcopy(placement_params)
|
||||
placement_params["num_nodes"] *= 2
|
||||
placement_params.pop("num_targets_per_disk")
|
||||
RebalanceTrafficModel(
|
||||
existing_incidence_matrix=model.get_incidence_matrix(instance),
|
||||
**placement_params,
|
||||
qlinearize=True,
|
||||
relax_lb=2,
|
||||
relax_ub=1,
|
||||
).run(pyomo_solver="appsi_highs", max_timelimit=15, auto_relax=True)
|
||||
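The divisibility conditions called out in the parameter comments of `test_model.py` can be written out as standalone predicates; this is a sketch that follows the test parameter names, not part of the model's API:

```python
def all_targets_used(num_nodes: int, num_targets_per_disk: int, group_size: int) -> bool:
    # Every target joins a group only when the total target count divides evenly.
    return (num_nodes * num_targets_per_disk) % group_size == 0

def evenly_distributed(num_nodes: int, num_targets_per_disk: int, group_size: int) -> bool:
    # Each target pairs with group_size - 1 peer slots spread over the
    # num_nodes - 1 other nodes.
    return (num_targets_per_disk * (group_size - 1)) % (num_nodes - 1) == 0

# Cases drawn from placement_params above:
assert all_targets_used(5, 6, 2)        # 30 % 2 == 0
assert not all_targets_used(7, 5, 4)    # 35 % 4 != 0
assert evenly_distributed(10, 9, 5)     # 9 * 4 % 9 == 0
```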
**deploy/data_placement/test/test_plan.py** (new file, 10 lines)

```python
from smallpond.test_fabric import TestFabric
from src.model.data_placement_job import search_data_placement_plans


class TestPlan(TestFabric):

    def test_search_data_placement_plans(self):
        for pyomo_solver in ["appsi_highs"]:
            with self.subTest(pyomo_solver=pyomo_solver):
                plan = search_data_placement_plans(chain_table_type="EC", num_nodes=[10], group_size=[5, 9], solver_threads=16, pyomo_solver=pyomo_solver)
                self.execute_plan(plan, num_executors=1)
```
**deploy/data_placement/test/test_setup.py** (new file, 55 lines)

```python
from collections import Counter
import glob
import os.path
import pytest

from src.model.data_placement import DataPlacementModel
from src.setup.gen_chain_table import generate_chains


@pytest.mark.parametrize('num_nodes, num_disks_per_node, num_targets_per_disk, num_replicas', [(5, 10, 6, 2), (10, 10, 9, 3)])
def test_generate_cr_chains(num_nodes: int, num_disks_per_node: int, num_targets_per_disk: int, num_replicas: int):
    model = DataPlacementModel(
        chain_table_type="CR",
        num_nodes=num_nodes,
        num_targets_per_disk=num_targets_per_disk,
        group_size=num_replicas,
        qlinearize=True,
        relax_lb=1,
        relax_ub=1,
    )
    instance = model.run(pyomo_solver="appsi_highs", max_timelimit=15, auto_relax=True)

    generate_chains(
        chain_table_type="CR",
        node_id_begin=1,
        node_id_end=num_nodes,
        num_disks_per_node=num_disks_per_node,
        num_targets_per_disk=num_targets_per_disk,
        target_id_prefix=1,
        chain_id_prefix=9,
        incidence_matrix=model.get_incidence_matrix(instance))


@pytest.mark.parametrize('num_nodes, num_disks_per_node, num_targets_per_disk, ec_group_size', [(20, 10, 6, 12), (25, 10, 12, 20)])
def test_generate_ec_chains(num_nodes: int, num_disks_per_node: int, num_targets_per_disk: int, ec_group_size: int):
    model = DataPlacementModel(
        chain_table_type="EC",
        num_nodes=num_nodes,
        num_targets_per_disk=num_targets_per_disk,
        group_size=ec_group_size,
        qlinearize=True,
        relax_lb=1,
        relax_ub=1,
    )
    instance = model.run(pyomo_solver="appsi_highs", max_timelimit=15, auto_relax=True)

    generate_chains(
        chain_table_type="EC",
        node_id_begin=1,
        node_id_end=num_nodes,
        num_disks_per_node=num_disks_per_node,
        num_targets_per_disk=num_targets_per_disk,
        target_id_prefix=1,
        chain_id_prefix=9,
        incidence_matrix=model.get_incidence_matrix(instance))
```
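Both tests hand `generate_chains` an incidence matrix produced by the solved model. A toy illustration of the invariant it encodes — every group (chain) spans exactly `group_size` distinct nodes. The 0/1 list-of-lists representation here is an assumption for illustration, not the actual return type of `get_incidence_matrix`:

```python
import random

random.seed(0)
num_nodes, group_size, num_groups = 5, 2, 15

# Rows are nodes, columns are groups; entry (n, g) == 1 means node n
# hosts a target belonging to group g.
matrix = [[0] * num_groups for _ in range(num_nodes)]
for g in range(num_groups):
    for n in random.sample(range(num_nodes), group_size):
        matrix[n][g] = 1

# Every group spans exactly group_size distinct nodes.
assert all(sum(row[g] for row in matrix) == group_size for g in range(num_groups))
```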
**deploy/sql/3fs-monitor.sql** (new file, 51 lines)

```sql
CREATE DATABASE IF NOT EXISTS 3fs;

CREATE TABLE IF NOT EXISTS 3fs.counters (
    `TIMESTAMP` DateTime CODEC(DoubleDelta),
    `metricName` LowCardinality(String) CODEC(ZSTD(1)),
    `host` LowCardinality(String) CODEC(ZSTD(1)),
    `tag` LowCardinality(String) CODEC(ZSTD(1)),
    `val` Int64 CODEC(ZSTD(1)),
    `mount_name` LowCardinality(String) CODEC(ZSTD(1)),
    `instance` String CODEC(ZSTD(1)),
    `io` LowCardinality(String) CODEC(ZSTD(1)),
    `uid` LowCardinality(String) CODEC(ZSTD(1)),
    `pod` String CODEC(ZSTD(1)),
    `thread` LowCardinality(String) CODEC(ZSTD(1)),
    `statusCode` LowCardinality(String) CODEC(ZSTD(1))
)
ENGINE = MergeTree
PRIMARY KEY (metricName, host, pod, instance, TIMESTAMP)
PARTITION BY toDate(TIMESTAMP)
ORDER BY (metricName, host, pod, instance, TIMESTAMP)
TTL TIMESTAMP + toIntervalMonth(1)
SETTINGS index_granularity = 8192;

CREATE TABLE IF NOT EXISTS 3fs.distributions (
    `TIMESTAMP` DateTime CODEC(DoubleDelta),
    `metricName` LowCardinality(String) CODEC(ZSTD(1)),
    `host` LowCardinality(String) CODEC(ZSTD(1)),
    `tag` LowCardinality(String) CODEC(ZSTD(1)),
    `count` Float64 CODEC(ZSTD(1)),
    `mean` Float64 CODEC(ZSTD(1)),
    `min` Float64 CODEC(ZSTD(1)),
    `max` Float64 CODEC(ZSTD(1)),
    `p50` Float64 CODEC(ZSTD(1)),
    `p90` Float64 CODEC(ZSTD(1)),
    `p95` Float64 CODEC(ZSTD(1)),
    `p99` Float64 CODEC(ZSTD(1)),
    `mount_name` LowCardinality(String) CODEC(ZSTD(1)),
    `instance` String CODEC(ZSTD(1)),
    `io` LowCardinality(String) CODEC(ZSTD(1)),
    `uid` LowCardinality(String) CODEC(ZSTD(1)),
    `method` LowCardinality(String) CODEC(ZSTD(1)),
    `pod` String CODEC(ZSTD(1)),
    `thread` LowCardinality(String) CODEC(ZSTD(1)),
    `statusCode` LowCardinality(String) CODEC(ZSTD(1))
)
ENGINE = MergeTree
PRIMARY KEY (metricName, host, pod, instance, TIMESTAMP)
PARTITION BY toDate(TIMESTAMP)
ORDER BY (metricName, host, pod, instance, TIMESTAMP)
TTL TIMESTAMP + toIntervalMonth(1)
SETTINGS index_granularity = 8192;
```
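Once the monitor collector is writing into these tables, metrics can be inspected with ordinary ClickHouse queries. A sketch (the query is illustrative; substitute metric names your deployment actually emits):

```sql
-- Worst-case p99 per host over the last hour.
SELECT host, metricName, max(p99) AS p99_max
FROM 3fs.distributions
WHERE TIMESTAMP > now() - INTERVAL 1 HOUR
GROUP BY host, metricName
ORDER BY p99_max DESC
LIMIT 10;
```

Note that both tables carry a one-month TTL (`toIntervalMonth(1)`), so older rows are dropped automatically.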
**deploy/systemd/hf3fs_fuse_main.service** (new file, 12 lines)

```ini
[Unit]
Description=fuse_main Server
Requires=network-online.target
After=network-online.target

[Service]
LimitNOFILE=1000000
ExecStart=/opt/3fs/bin/hf3fs_fuse_main --launcher_cfg /opt/3fs/etc/hf3fs_fuse_main_launcher.toml
Type=simple

[Install]
WantedBy=multi-user.target
```
**deploy/systemd/meta_main.service** (new file, 12 lines)

```ini
[Unit]
Description=meta_main Server
Requires=network-online.target
After=network-online.target

[Service]
LimitNOFILE=1000000
ExecStart=/opt/3fs/bin/meta_main --launcher_cfg /opt/3fs/etc/meta_main_launcher.toml --app-cfg /opt/3fs/etc/meta_main_app.toml
Type=simple

[Install]
WantedBy=multi-user.target
```
**deploy/systemd/mgmtd_main.service** (new file, 12 lines)

```ini
[Unit]
Description=mgmtd_main Server
Requires=network-online.target
After=network-online.target

[Service]
LimitNOFILE=1000000
ExecStart=/opt/3fs/bin/mgmtd_main --launcher_cfg /opt/3fs/etc/mgmtd_main_launcher.toml --app-cfg /opt/3fs/etc/mgmtd_main_app.toml
Type=simple

[Install]
WantedBy=multi-user.target
```
**deploy/systemd/monitor_collector_main.service** (new file, 11 lines)

```ini
[Unit]
Description=monitor_collector_main Server
Requires=network-online.target
After=network-online.target

[Service]
ExecStart=/opt/3fs/bin/monitor_collector_main --cfg /opt/3fs/etc/monitor_collector_main.toml
Type=simple

[Install]
WantedBy=multi-user.target
```
**deploy/systemd/storage_main.service** (new file, 14 lines)

```ini
[Unit]
Description=storage_main Server
Requires=network-online.target
After=network-online.target

[Service]
LimitNOFILE=1000000
LimitMEMLOCK=infinity
TimeoutStopSec=5m
ExecStart=/opt/3fs/bin/storage_main --launcher_cfg /opt/3fs/etc/storage_main_launcher.toml --app-cfg /opt/3fs/etc/storage_main_app.toml
Type=simple

[Install]
WantedBy=multi-user.target
```