Add AWS NVIDIA GPU Support admonition

2025-06-26 18:17:44 +00:00 · 2025-06-11 11:53:06 +03:00 · 2025-06-11 11:53:06 +03:00 · 90a412d9db
commit 90a412d9db
parent 630c08d99e 9b0dd064b7
1 changed files with 39 additions and 13 deletions
--- a/docs/webapp/applications/apps_aws_autoscaler.md
+++ b/docs/webapp/applications/apps_aws_autoscaler.md
@ -19,6 +19,40 @@ each instance is spun up.

 For more information about how autoscalers work, see [Autoscalers Overview](../../cloud_autoscaling/autoscaling_overview.md#autoscaler-applications).

+:::info AWS NVIDIA GPU Support   
+* Recent NVIDIA AMIs only install the required drivers on initial user login. To make use of such AMIs, the autoscaler 
+  needs to mimic an initial user login. This can be accomplished by, adding the following script to the `Init script`
+  field in the app instance launch form:
+    
+  ```
+  apt-get update
+  DEBIAN_FRONTEND=noninteractive apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" upgrade
+  su -l ubuntu -c '/usr/bin/bash /home/ubuntu/.profile'
+  ```
+
+* Sometimes training jobs may fail to detect GPUs after autoscaler provisioning, due to Nvidia drivers loading slowly.
+  To ensure GPU detection, add the following to the `Init script` field in the application instance launch form:
+
+  ```bash
+  cat /etc/docker/daemon.json 
+
+  sed -i 's|"runtimes": {|"exec-opts": ["native.cgroupdriver=cgroupfs"], "runtimes": {|g' /etc/docker/daemon.json
+
+  cat /etc/docker/daemon.json 
+
+  sed -i 's|#no-cgroups = false|no-cgroups = false|g' /etc/nvidia-container-runtime/config.toml
+
+  cat /etc/nvidia-container-runtime/config.toml
+
+  systemctl daemon-reload
+  systemctl restart docker
+
+  echo "try nvidia"
+  docker run -t --rm --ipc=host --gpus all bitnami/minideb:bullseye bash -c "nvidia-smi && apt update && nvidia-smi" 
+  ```
+
+:::
+
 ## Autoscaler Instance Configuration

 When configuring a new AWS Autoscaler instance, you can fill in the required parameters or reuse the configuration of 
@ -69,19 +103,11 @@ to open the app's instance launch form.
    * Availability Zone - The [EC2 availability zone](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.RegionsAndAvailabilityZones.html#Concepts.RegionsAndAvailabilityZones.AvailabilityZones) 
      to launch this resource in
    * AMI ID - The AWS AMI to launch
-    :::note AMI prerequisites
-    The AMI used for the autoscaler must include docker runtime and virtualenv.
-    
-    Recent NVIDIA AMIs only install the required drivers on initial user login. To make use of such AMIs, the autoscaler 
-    needs to mimic an initial user login. This can be accomplished by, adding the following script to the `Init script`
-    field:
-    
-    ```
-    apt-get update
-    DEBIAN_FRONTEND=noninteractive apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" upgrade
-    su -l ubuntu -c '/usr/bin/bash /home/ubuntu/.profile'
-    ```
-    :::
+     
+      :::note AMI prerequisites
+      The AMI used for the autoscaler must include docker runtime and virtualenv.
+      :::
+  
    * Max Number of Instances - Maximum number of concurrent running instances of this type allowed
    * Monitored Queue - Queue associated with this instance type. The tasks enqueued to this queue will be executed on 
      instances of this type