mirror of
https://github.com/clearml/clearml-agent
synced 2025-06-26 18:16:15 +00:00
Compare commits
108 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
3273f76b46 | ||
|
|
9af0f9fe41 | ||
|
|
205cd47cb9 | ||
|
|
0ff428bb96 | ||
|
|
bf8d9c96e9 | ||
|
|
a88487ff25 | ||
|
|
785e22dc87 | ||
|
|
6a2b778d53 | ||
|
|
b2c3702830 | ||
|
|
6302d43990 | ||
|
|
760bbca74e | ||
|
|
e63fd31420 | ||
|
|
2ff9985db7 | ||
|
|
b8c762401b | ||
|
|
99e1e54f94 | ||
|
|
a4d3b5bad6 | ||
|
|
b21665ed6e | ||
|
|
f877aa96e2 | ||
|
|
f99344d194 | ||
|
|
d9f2a1999a | ||
|
|
79d0abe707 | ||
|
|
6213ef4c02 | ||
|
|
aef6aa9fc8 | ||
|
|
0bb267115b | ||
|
|
f89a92556f | ||
|
|
8ba4d75e80 | ||
|
|
edc333ba5f | ||
|
|
2f0553b873 | ||
|
|
b2a4bf08ac | ||
|
|
f18c6b809f | ||
|
|
cd5b4d2186 | ||
|
|
5f1bab6711 | ||
|
|
ab9b9db0c9 | ||
|
|
93df021108 | ||
|
|
700ae85de0 | ||
|
|
f367c5a571 | ||
|
|
ebc5944b44 | ||
|
|
8f41002845 | ||
|
|
7e8670d57f | ||
|
|
77de343863 | ||
|
|
6b31883e45 | ||
|
|
e48b4756fa | ||
|
|
47147e3237 | ||
|
|
41fc4ec646 | ||
|
|
441e5a73b2 | ||
|
|
27ed6821c4 | ||
|
|
10c6629982 | ||
|
|
6fb48a4c6e | ||
|
|
105ade31f1 | ||
|
|
502e266b6b | ||
|
|
cd9a3b9f4e | ||
|
|
4179ac5234 | ||
|
|
98cc0d86ba | ||
|
|
293cbc0ac6 | ||
|
|
4387ed73b6 | ||
|
|
43443ccf08 | ||
|
|
3d43240c8f | ||
|
|
fc58ba947b | ||
|
|
22672d2444 | ||
|
|
6a4fcda1bf | ||
|
|
a4ebf8293d | ||
|
|
10fb157d58 | ||
|
|
56058beec2 | ||
|
|
9f207d5155 | ||
|
|
8a2bea3c14 | ||
|
|
f1f9278928 | ||
|
|
2de1c926bf | ||
|
|
e1104e60bb | ||
|
|
8b2970350c | ||
|
|
a2758250b2 | ||
|
|
01e8ffd854 | ||
|
|
74edf6aa36 | ||
|
|
09c5ef99af | ||
|
|
17ae28a62f | ||
|
|
059a9385e9 | ||
|
|
9a321a410f | ||
|
|
919013d4fe | ||
|
|
05530b712b | ||
|
|
8d15fd8798 | ||
|
|
b34329934b | ||
|
|
85049d8705 | ||
|
|
6fbd70786e | ||
|
|
05a65548da | ||
|
|
6657003d65 | ||
|
|
95dde6ca0c | ||
|
|
c9fc092f4e | ||
|
|
432ee395e1 | ||
|
|
98fc4f0fb9 | ||
|
|
111e774c21 | ||
|
|
3dd8d783e1 | ||
|
|
7c3e420df4 | ||
|
|
55b065a114 | ||
|
|
faa97b6cc2 | ||
|
|
f5861b1e4a | ||
|
|
030cbb69f1 | ||
|
|
564f769ff7 | ||
|
|
2c7f091e57 | ||
|
|
dd5d24b0ca | ||
|
|
996bb797c3 | ||
|
|
9ad49a0d21 | ||
|
|
ba4fee7b19 | ||
|
|
0131db8b7d | ||
|
|
d2384a9a95 | ||
|
|
5b86c230c1 | ||
|
|
21e4be966f | ||
|
|
9c6cb421b3 | ||
|
|
52405c343d | ||
|
|
46f0c991c8 |
2
.gitignore
vendored
2
.gitignore
vendored
@@ -14,3 +14,5 @@ dist/
|
||||
# VSCode
|
||||
.vscode
|
||||
|
||||
# MirrorD
|
||||
.mirrord
|
||||
|
||||
58
README.md
58
README.md
@@ -2,14 +2,17 @@
|
||||
|
||||
<img src="https://github.com/allegroai/clearml-agent/blob/master/docs/clearml_agent_logo.png?raw=true" width="250px">
|
||||
|
||||
**ClearML Agent - ML-Ops made easy
|
||||
ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**
|
||||
**ClearML Agent - MLOps/LLMOps made easy
|
||||
MLOps/LLMOps scheduler & orchestration solution supporting Linux, macOS and Windows**
|
||||
|
||||
[](https://img.shields.io/github/license/allegroai/clearml-agent.svg)
|
||||
[](https://img.shields.io/pypi/pyversions/clearml-agent.svg)
|
||||
[](https://img.shields.io/pypi/v/clearml-agent.svg)
|
||||
[](https://pypi.org/project/clearml-agent/)
|
||||
[](https://artifacthub.io/packages/search?repo=allegroai)
|
||||
|
||||
`🌟 ClearML is open-source - Leave a star to support the project! 🌟`
|
||||
|
||||
</div>
|
||||
|
||||
---
|
||||
@@ -65,36 +68,39 @@ or [Free tier Hosting](https://app.clear.ml)
|
||||
|
||||
### Kubernetes Integration (Optional)
|
||||
|
||||
We think Kubernetes is awesome, but it should be a choice. We designed `clearml-agent` so you can run bare-metal or
|
||||
inside a pod with any mix that fits your environment.
|
||||
We think Kubernetes is awesome, but it is not a must to get started with remote execution agents and cluster management.
|
||||
We designed `clearml-agent` so you can run both bare-metal and on top of Kubernetes, in any combination that fits your environment.
|
||||
|
||||
Find Dockerfiles in the [docker](./docker) dir and a helm Chart in https://github.com/allegroai/clearml-helm-charts
|
||||
You can find the Dockerfiles in the [docker folder](./docker) and the helm Chart in https://github.com/allegroai/clearml-helm-charts
|
||||
|
||||
#### Benefits of integrating existing K8s with ClearML-Agent
|
||||
#### Benefits of integrating existing Kubernetes cluster with ClearML
|
||||
|
||||
- ClearML-Agent adds the missing scheduling capabilities to K8s
|
||||
- Allowing for more flexible automation from code
|
||||
- A programmatic interface for easier learning curve (and debugging)
|
||||
- Seamless integration with ML/DL experiment manager
|
||||
- ClearML-Agent adds the missing scheduling capabilities to your Kubernetes cluster
|
||||
- Users do not need to have direct Kubernetes access!
|
||||
- Easy learning curve with UI and CLI requiring no DevOps knowledge from end users
|
||||
- Unlike other solutions, ClearML-Agents work in tandem with other customers of your Kubernetes cluster
|
||||
- Allows for more flexible automation from code, building pipelines and visibility
|
||||
- A programmatic interface for easy CI/CD workflows, enabling GitOps to trigger jobs inside your cluster
|
||||
- Seamless integration with the ClearML ML/DL/GenAI experiment manager
|
||||
- Web UI for customization, scheduling & prioritization of jobs
|
||||
- **Enterprise Features**: RBAC, vault, multi-tenancy, scheduler, quota management, fractional GPU support
|
||||
|
||||
**Two K8s integration flavours**
|
||||
**Run the agent in Kubernetes Glue mode an map ClearML jobs directly to K8s jobs:**
|
||||
- Use the [ClearML Agent Helm Chart](https://github.com/allegroai/clearml-helm-charts/tree/main/charts/clearml-agent) to spin an agent pod acting as a controller
|
||||
- Or run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on
|
||||
a Kubernetes cpu node
|
||||
- The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a Kubernetes job (based on provided
|
||||
yaml template)
|
||||
- Inside each pod the clearml-agent will install the job (experiment) environment and spin and monitor the
|
||||
experiment's process, fully visible in the clearml UI
|
||||
- Benefits: Kubernetes full view of all running jobs in the system
|
||||
- **Enterprise Features**
|
||||
- Full scheduler features added on Top of Kubernetes, with quota/over-quota management, priorities and order.
|
||||
- Fractional GPU support, allowing multiple isolated containers sharing the same GPU with memory/compute limit per container
|
||||
|
||||
- Spin ClearML-Agent as a long-lasting service pod:
|
||||
- Use [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
|
||||
- map docker socket into the pod (soon replaced by [podman](https://github.com/containers/podman))
|
||||
- Allow the clearml-agent to manage sibling dockers
|
||||
- Benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
|
||||
- Downside: sibling containers
|
||||
- Kubernetes Glue, map ClearML jobs directly to K8s jobs:
|
||||
- Run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on
|
||||
a K8s cpu node
|
||||
- The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on provided
|
||||
yaml template)
|
||||
- Inside the pod itself the clearml-agent will install the job (experiment) environment and spin and monitor the
|
||||
experiment's process
|
||||
- Benefits: Kubernetes full view of all running jobs in the system
|
||||
- Downside: No real scheduling (k8s scheduler), no docker image verification (post-mortem only)
|
||||
### SLURM (Optional)
|
||||
|
||||
Yes! Slurm integration is available, check the [documentation](https://clear.ml/docs/latest/docs/clearml_agent/#slurm) for further details
|
||||
|
||||
### Using the ClearML Agent
|
||||
|
||||
|
||||
@@ -45,8 +45,8 @@
|
||||
# it solves passing user/token to git submodules.
|
||||
# this is a safer way to ensure multiple users using the same repository will
|
||||
# not accidentally leak credentials
|
||||
# Only supported on Linux systems, it will be the default in future releases
|
||||
# enable_git_ask_pass: false
|
||||
# Note: this is only supported on Linux systems
|
||||
# enable_git_ask_pass: true
|
||||
|
||||
# in docker mode, if container's entrypoint automatically activated a virtual environment
|
||||
# use the activated virtual environment and install everything there
|
||||
@@ -66,7 +66,7 @@
|
||||
type: pip,
|
||||
|
||||
# specify pip version to use (examples "<20.2", "==19.3.1", "", empty string will install the latest version)
|
||||
pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"],
|
||||
pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10' and python_version <= '3.11'", ">=23,<24.3 ; python_version >= '3.12'"]
|
||||
# specify poetry version to use (examples "<2", "==1.1.1", "", empty string will install the latest version)
|
||||
# poetry_version: "<2",
|
||||
# poetry_install_extra_args: ["-v"]
|
||||
@@ -80,6 +80,14 @@
|
||||
# additional artifact repositories to use when installing python packages
|
||||
# extra_index_url: ["https://allegroai.jfrog.io/clearml/api/pypi/public/simple"]
|
||||
|
||||
# turn on the "--use-deprecated=legacy-resolver" flag for pip, to avoid package dependency version mismatch
|
||||
# is any version restrictions are matched we add the "--use-deprecated=legacy-resolver" flag
|
||||
# example: pip_legacy_resolver = [">=20.3,<24.3", ">99"]
|
||||
# if pip==20.2 or pip==29.0 is installed we do nothing,
|
||||
# if pip==21.1 or pip==101.1 is installed the flag is added
|
||||
# disable the feature by passing an empty list
|
||||
pip_legacy_resolver = [">=20.3,<24.3"]
|
||||
|
||||
# control the pytorch wheel resolving algorithm, options are: "pip", "direct", "none"
|
||||
# Override with environment variable CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE
|
||||
# "pip" (default): would automatically detect the cuda version, and supply pip with the correct
|
||||
@@ -92,7 +100,7 @@
|
||||
# pytorch_resolve: "pip"
|
||||
|
||||
# additional conda channels to use when installing with conda package manager
|
||||
conda_channels: ["pytorch", "conda-forge", "defaults", ]
|
||||
conda_channels: ["pytorch", "conda-forge", "nvidia", "defaults", ]
|
||||
|
||||
# If set to true, Task's "installed packages" are ignored,
|
||||
# and the repository's "requirements.txt" is used instead
|
||||
@@ -177,6 +185,13 @@
|
||||
# these are local for this agent and will not be updated in the experiment's docker_cmd section
|
||||
# extra_docker_arguments: ["--ipc=host", ]
|
||||
|
||||
# Allow the extra docker arg to override task level docker arg (if the same argument is passed on both),
|
||||
# if set to False, a task docker arg will override the docker extra arg
|
||||
# docker_args_extra_precedes_task: true
|
||||
|
||||
# allows the following task docker args to be overridden by the extra_docker_arguments
|
||||
# protected_docker_extra_args: ["privileged", "security-opt", "network", "ipc"]
|
||||
|
||||
# optional shell script to run in docker when started before the experiment is started
|
||||
# extra_docker_shell_script: ["apt-get install -y bindfs", ]
|
||||
|
||||
@@ -211,6 +226,76 @@
|
||||
|
||||
# optional arguments to pass to docker image
|
||||
# arguments: ["--ipc=host", ]
|
||||
|
||||
# Choose the default docker based on the Task properties,
|
||||
# Notice: Enterprise feature, ignored otherwise
|
||||
# Examples: 'script.requirements', 'script.binary', 'script.repository', 'script.branch', 'project'
|
||||
# Notice: Matching is done via regular expression, for example "^searchme$" will match exactly "searchme" string
|
||||
"match_rules": [
|
||||
{
|
||||
"image": "python:3.6-bullseye",
|
||||
"arguments": "--ipc=host",
|
||||
"match": {
|
||||
"script": {
|
||||
"binary": "python3.6$",
|
||||
},
|
||||
}
|
||||
},
|
||||
{
|
||||
"image": "python:3.7-bullseye",
|
||||
"arguments": "--ipc=host",
|
||||
"match": {
|
||||
"script": {
|
||||
"binary": "python3.7$",
|
||||
},
|
||||
}
|
||||
},
|
||||
{
|
||||
"image": "python:3.8-bullseye",
|
||||
"arguments": "--ipc=host",
|
||||
"match": {
|
||||
"script": {
|
||||
"binary": "python3.8$",
|
||||
},
|
||||
}
|
||||
},
|
||||
{
|
||||
"image": "python:3.9-bullseye",
|
||||
"arguments": "--ipc=host",
|
||||
"match": {
|
||||
"script": {
|
||||
"binary": "python3.9$",
|
||||
},
|
||||
}
|
||||
},
|
||||
{
|
||||
"image": "python:3.10-bullseye",
|
||||
"arguments": "--ipc=host",
|
||||
"match": {
|
||||
"script": {
|
||||
"binary": "python3.10$",
|
||||
},
|
||||
}
|
||||
},
|
||||
{
|
||||
"image": "python:3.11-bullseye",
|
||||
"arguments": "--ipc=host",
|
||||
"match": {
|
||||
"script": {
|
||||
"binary": "python3.11$",
|
||||
},
|
||||
}
|
||||
},
|
||||
{
|
||||
"image": "python:3.12-bullseye",
|
||||
"arguments": "--ipc=host",
|
||||
"match": {
|
||||
"script": {
|
||||
"binary": "python3.12$",
|
||||
},
|
||||
}
|
||||
},
|
||||
]
|
||||
}
|
||||
|
||||
# set the OS environments based on the Task's Environment section before launching the Task process.
|
||||
@@ -242,6 +327,20 @@
|
||||
# cuda_version: 10.1
|
||||
# cudnn_version: 7.6
|
||||
|
||||
# Sanitize configuration printout using these settings
|
||||
sanitize_config_printout {
|
||||
# Hide values of configuration keys matching these regexps
|
||||
hide_secrets: ["^sanitize_config_printout$", "secret", "pass", "token", "account_key", "contents"]
|
||||
# As above, only show field's value keys if value is a dictionary
|
||||
hide_secrets_recursive: ["^environment$"]
|
||||
# Do not hide for keys matching these regexps
|
||||
dont_hide_secrets: ["^enable_git_ask_pass$"]
|
||||
# Hide secrets in docker commands, according to the 'agent.hide_docker_command_env_vars' settings
|
||||
docker_commands: ["^extra_docker_arguments$"]
|
||||
# Hide password in URLs found in keys matching these regexps (handles single URLs, lists and dictionaries)
|
||||
urls: ["^extra_index_url$"]
|
||||
}
|
||||
|
||||
# Hide docker environment variables containing secrets when printing out the docker command by replacing their
|
||||
# values with "********". Turning this feature on will hide the following environment variables values:
|
||||
# CLEARML_API_SECRET_KEY, CLEARML_AGENT_GIT_PASS, AWS_SECRET_ACCESS_KEY, AZURE_STORAGE_KEY
|
||||
@@ -268,6 +367,7 @@
|
||||
pip_cache: "/root/.cache/pip"
|
||||
poetry_cache: "/root/.cache/pypoetry"
|
||||
vcs_cache: "/root/.clearml/vcs-cache"
|
||||
venvs_cache: "/root/.clearml/venvs-cache"
|
||||
venv_build: "~/.clearml/venvs-builds"
|
||||
pip_download: "/root/.clearml/pip-download-cache"
|
||||
}
|
||||
|
||||
@@ -140,7 +140,7 @@
|
||||
vcs_repo_detect_async: true
|
||||
|
||||
# Store uncommitted git/hg source code diff in experiment manifest when training in development mode
|
||||
# This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
|
||||
# This stores "git diff" or into the experiment's "script.requirements.diff" section
|
||||
store_uncommitted_code_diff: true
|
||||
|
||||
# Support stopping an experiment in case it was externally stopped, status was changed or task was reset
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
from ...backend_config.converters import safe_text_to_bool
|
||||
from ...backend_config.environment import EnvEntry
|
||||
from clearml_agent.helper.environment import EnvEntry
|
||||
from clearml_agent.helper.environment.converters import safe_text_to_bool
|
||||
|
||||
|
||||
ENV_HOST = EnvEntry("CLEARML_API_HOST", "TRAINS_API_HOST")
|
||||
@@ -11,6 +11,7 @@ ENV_AUTH_TOKEN = EnvEntry("CLEARML_AUTH_TOKEN")
|
||||
ENV_VERBOSE = EnvEntry("CLEARML_API_VERBOSE", "TRAINS_API_VERBOSE", type=bool, default=False)
|
||||
ENV_HOST_VERIFY_CERT = EnvEntry("CLEARML_API_HOST_VERIFY_CERT", "TRAINS_API_HOST_VERIFY_CERT", type=bool, default=True)
|
||||
ENV_CONDA_ENV_PACKAGE = EnvEntry("CLEARML_CONDA_ENV_PACKAGE", "TRAINS_CONDA_ENV_PACKAGE")
|
||||
ENV_USE_CONDA_BASE_ENV = EnvEntry("CLEARML_USE_CONDA_BASE_ENV", type=bool)
|
||||
ENV_NO_DEFAULT_SERVER = EnvEntry("CLEARML_NO_DEFAULT_SERVER", "TRAINS_NO_DEFAULT_SERVER", type=bool, default=True)
|
||||
ENV_DISABLE_VAULT_SUPPORT = EnvEntry('CLEARML_AGENT_DISABLE_VAULT_SUPPORT', type=bool)
|
||||
ENV_ENABLE_ENV_CONFIG_SECTION = EnvEntry('CLEARML_AGENT_ENABLE_ENV_CONFIG_SECTION', type=bool)
|
||||
@@ -21,6 +22,9 @@ ENV_INITIAL_CONNECT_RETRY_OVERRIDE = EnvEntry(
|
||||
'CLEARML_AGENT_INITIAL_CONNECT_RETRY_OVERRIDE', default=True, converter=safe_text_to_bool
|
||||
)
|
||||
ENV_FORCE_MAX_API_VERSION = EnvEntry("CLEARML_AGENT_FORCE_MAX_API_VERSION", type=str)
|
||||
# values are 0/None (task per node), 1/2 (multi-node reporting, colored console), -1 (only report rank 0 node)
|
||||
ENV_MULTI_NODE_SINGLE_TASK = EnvEntry("CLEARML_MULTI_NODE_SINGLE_TASK", type=int, default=None)
|
||||
|
||||
|
||||
"""
|
||||
Experimental option to set the request method for all API requests and auth login.
|
||||
|
||||
@@ -64,6 +64,8 @@ class Session(TokenManager):
|
||||
default_key = "EGRTCO8JMSIGI6S39GTP43NFWXDQOW"
|
||||
default_secret = "x!XTov_G-#vspE*Y(h$Anm&DIc5Ou-F)jsl$PdOyj5wG1&E!Z8"
|
||||
force_max_api_version = ENV_FORCE_MAX_API_VERSION.get()
|
||||
server_version = "1.0.0"
|
||||
user_id = None
|
||||
|
||||
# TODO: add requests.codes.gateway_timeout once we support async commits
|
||||
_retry_codes = [
|
||||
@@ -191,6 +193,8 @@ class Session(TokenManager):
|
||||
|
||||
Session.api_version = str(api_version)
|
||||
Session.feature_set = str(token_dict.get('feature_set', self.feature_set) or "basic")
|
||||
Session.server_version = token_dict.get('server_version', self.server_version)
|
||||
Session.user_id = (token_dict.get("identity") or {}).get("user") or ""
|
||||
except (jwt.DecodeError, ValueError):
|
||||
pass
|
||||
|
||||
@@ -256,8 +260,9 @@ class Session(TokenManager):
|
||||
def parse(vault):
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
print("Loaded {} vault: {}".format(
|
||||
print("Loaded {} vault{}: {}".format(
|
||||
vault.get("scope", ""),
|
||||
"" if not self.user_id else " for user {}".format(self.user_id),
|
||||
(vault.get("description", None) or "")[:50] or vault.get("id", ""))
|
||||
)
|
||||
d = vault.get("data", None)
|
||||
@@ -341,11 +346,12 @@ class Session(TokenManager):
|
||||
if self._propagate_exceptions_on_send:
|
||||
raise
|
||||
sleep_time = sys_random.uniform(*self._request_exception_retry_timeout)
|
||||
self._logger.error(
|
||||
"{} exception sending {} {}: {} (retrying in {:.1f}sec)".format(
|
||||
type(ex).__name__, method.upper(), url, str(ex), sleep_time
|
||||
if self._logger:
|
||||
self._logger.error(
|
||||
"{} exception sending {} {}: {} (retrying in {:.1f}sec)".format(
|
||||
type(ex).__name__, method.upper(), url, str(ex), sleep_time
|
||||
)
|
||||
)
|
||||
)
|
||||
time.sleep(sleep_time)
|
||||
continue
|
||||
|
||||
@@ -364,11 +370,12 @@ class Session(TokenManager):
|
||||
res.status_code == requests.codes.service_unavailable
|
||||
and self.config.get("api.http.wait_on_maintenance_forever", True)
|
||||
):
|
||||
self._logger.warning(
|
||||
"Service unavailable: {} is undergoing maintenance, retrying...".format(
|
||||
host
|
||||
if self._logger:
|
||||
self._logger.warning(
|
||||
"Service unavailable: {} is undergoing maintenance, retrying...".format(
|
||||
host
|
||||
)
|
||||
)
|
||||
)
|
||||
continue
|
||||
break
|
||||
self._session_requests += 1
|
||||
@@ -649,11 +656,14 @@ class Session(TokenManager):
|
||||
"""
|
||||
Return True if Session.api_version is greater or equal >= to min_api_version
|
||||
"""
|
||||
def version_tuple(v):
|
||||
v = tuple(map(int, (v.split("."))))
|
||||
return v + (0,) * max(0, 3 - len(v))
|
||||
return version_tuple(cls.api_version) >= version_tuple(str(min_api_version))
|
||||
|
||||
@classmethod
|
||||
def check_min_server_version(cls, min_server_version):
|
||||
"""
|
||||
Return True if Session.server_version is greater or equal >= to min_server_version
|
||||
"""
|
||||
return version_tuple(cls.server_version) >= version_tuple(str(min_server_version))
|
||||
def _do_refresh_token(self, current_token, exp=None):
|
||||
""" TokenManager abstract method implementation.
|
||||
Here we ignore the old token and simply obtain a new token.
|
||||
@@ -731,3 +741,8 @@ class Session(TokenManager):
|
||||
def propagate_exceptions_on_send(self, value):
|
||||
# type: (bool) -> None
|
||||
self._propagate_exceptions_on_send = value
|
||||
|
||||
|
||||
def version_tuple(v):
|
||||
v = tuple(map(int, (v.split("."))))
|
||||
return v + (0,) * max(0, 3 - len(v))
|
||||
|
||||
@@ -1,69 +1,8 @@
|
||||
import base64
|
||||
from distutils.util import strtobool
|
||||
from typing import Union, Optional, Any, TypeVar, Callable, Tuple
|
||||
|
||||
import six
|
||||
|
||||
try:
|
||||
from typing import Text
|
||||
except ImportError:
|
||||
# windows conda-less hack
|
||||
Text = Any
|
||||
|
||||
|
||||
ConverterType = TypeVar("ConverterType", bound=Callable[[Any], Any])
|
||||
|
||||
|
||||
def text_to_int(value, default=0):
|
||||
# type: (Any, int) -> int
|
||||
try:
|
||||
return int(value)
|
||||
except (ValueError, TypeError):
|
||||
return default
|
||||
|
||||
|
||||
def base64_to_text(value):
|
||||
# type: (Any) -> Text
|
||||
return base64.b64decode(value).decode("utf-8")
|
||||
|
||||
|
||||
def text_to_bool(value):
|
||||
# type: (Text) -> bool
|
||||
return bool(strtobool(value))
|
||||
|
||||
|
||||
def safe_text_to_bool(value):
|
||||
# type: (Text) -> bool
|
||||
try:
|
||||
return text_to_bool(value)
|
||||
except ValueError:
|
||||
return bool(value)
|
||||
|
||||
|
||||
def any_to_bool(value):
|
||||
# type: (Optional[Union[int, float, Text]]) -> bool
|
||||
if isinstance(value, six.text_type):
|
||||
return text_to_bool(value)
|
||||
return bool(value)
|
||||
|
||||
|
||||
def or_(*converters, **kwargs):
|
||||
# type: (ConverterType, Tuple[Exception, ...]) -> ConverterType
|
||||
"""
|
||||
Wrapper that implements an "optional converter" pattern. Allows specifying a converter
|
||||
for which a set of exceptions is ignored (and the original value is returned)
|
||||
:param converters: A converter callable
|
||||
:param exceptions: A tuple of exception types to ignore
|
||||
"""
|
||||
# noinspection PyUnresolvedReferences
|
||||
exceptions = kwargs.get("exceptions", (ValueError, TypeError))
|
||||
|
||||
def wrapper(value):
|
||||
for converter in converters:
|
||||
try:
|
||||
return converter(value)
|
||||
except exceptions:
|
||||
pass
|
||||
return value
|
||||
|
||||
return wrapper
|
||||
from clearml_agent.helper.environment.converters import (
|
||||
base64_to_text,
|
||||
text_to_bool,
|
||||
text_to_int,
|
||||
safe_text_to_bool,
|
||||
any_to_bool,
|
||||
or_,
|
||||
)
|
||||
|
||||
@@ -1,111 +1,6 @@
|
||||
import abc
|
||||
from typing import Optional, Any, Tuple, Callable, Dict
|
||||
from clearml_agent.helper.environment import Entry, NotSet
|
||||
|
||||
import six
|
||||
|
||||
from .converters import any_to_bool
|
||||
|
||||
try:
|
||||
from typing import Text
|
||||
except ImportError:
|
||||
# windows conda-less hack
|
||||
Text = Any
|
||||
|
||||
|
||||
NotSet = object()
|
||||
|
||||
Converter = Callable[[Any], Any]
|
||||
|
||||
|
||||
@six.add_metaclass(abc.ABCMeta)
|
||||
class Entry(object):
|
||||
"""
|
||||
Configuration entry definition
|
||||
"""
|
||||
|
||||
@classmethod
|
||||
def default_conversions(cls):
|
||||
# type: () -> Dict[Any, Converter]
|
||||
return {
|
||||
bool: any_to_bool,
|
||||
six.text_type: lambda s: six.text_type(s).strip(),
|
||||
}
|
||||
|
||||
def __init__(self, key, *more_keys, **kwargs):
|
||||
# type: (Text, Text, Any) -> None
|
||||
"""
|
||||
:param key: Entry's key (at least one).
|
||||
:param more_keys: More alternate keys for this entry.
|
||||
:param type: Value type. If provided, will be used choosing a default conversion or
|
||||
(if none exists) for casting the environment value.
|
||||
:param converter: Value converter. If provided, will be used to convert the environment value.
|
||||
:param default: Default value. If provided, will be used as the default value on calls to get() and get_pair()
|
||||
in case no value is found for any key and no specific default value was provided in the call.
|
||||
Default value is None.
|
||||
:param help: Help text describing this entry
|
||||
"""
|
||||
self.keys = (key,) + more_keys
|
||||
self.type = kwargs.pop("type", six.text_type)
|
||||
self.converter = kwargs.pop("converter", None)
|
||||
self.default = kwargs.pop("default", None)
|
||||
self.help = kwargs.pop("help", None)
|
||||
|
||||
def __str__(self):
|
||||
return str(self.key)
|
||||
|
||||
@property
|
||||
def key(self):
|
||||
return self.keys[0]
|
||||
|
||||
def convert(self, value, converter=None):
|
||||
# type: (Any, Converter) -> Optional[Any]
|
||||
converter = converter or self.converter
|
||||
if not converter:
|
||||
converter = self.default_conversions().get(self.type, self.type)
|
||||
return converter(value)
|
||||
|
||||
def get_pair(self, default=NotSet, converter=None, value_cb=None):
|
||||
# type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Tuple[Text, Any]]
|
||||
for key in self.keys:
|
||||
value = self._get(key)
|
||||
if value is NotSet:
|
||||
continue
|
||||
try:
|
||||
value = self.convert(value, converter)
|
||||
except Exception as ex:
|
||||
self.error("invalid value {key}={value}: {ex}".format(**locals()))
|
||||
break
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
if value_cb:
|
||||
value_cb(key, value)
|
||||
except Exception:
|
||||
pass
|
||||
return key, value
|
||||
|
||||
result = self.default if default is NotSet else default
|
||||
return self.key, result
|
||||
|
||||
def get(self, default=NotSet, converter=None, value_cb=None):
|
||||
# type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Any]
|
||||
return self.get_pair(default=default, converter=converter, value_cb=value_cb)[1]
|
||||
|
||||
def set(self, value):
|
||||
# type: (Any, Any) -> (Text, Any)
|
||||
# key, _ = self.get_pair(default=None, converter=None)
|
||||
for k in self.keys:
|
||||
self._set(k, str(value))
|
||||
|
||||
def _set(self, key, value):
|
||||
# type: (Text, Text) -> None
|
||||
pass
|
||||
|
||||
@abc.abstractmethod
|
||||
def _get(self, key):
|
||||
# type: (Text) -> Any
|
||||
pass
|
||||
|
||||
@abc.abstractmethod
|
||||
def error(self, message):
|
||||
# type: (Text) -> None
|
||||
pass
|
||||
__all__ = [
|
||||
"Entry",
|
||||
"NotSet"
|
||||
]
|
||||
|
||||
@@ -1,32 +1,6 @@
|
||||
from os import getenv, environ
|
||||
from os import environ
|
||||
|
||||
from .converters import text_to_bool
|
||||
from .entry import Entry, NotSet
|
||||
|
||||
|
||||
class EnvEntry(Entry):
|
||||
@classmethod
|
||||
def default_conversions(cls):
|
||||
conversions = super(EnvEntry, cls).default_conversions().copy()
|
||||
conversions[bool] = text_to_bool
|
||||
return conversions
|
||||
|
||||
def pop(self):
|
||||
for k in self.keys:
|
||||
environ.pop(k, None)
|
||||
|
||||
def _get(self, key):
|
||||
value = getenv(key, "").strip()
|
||||
return value or NotSet
|
||||
|
||||
def _set(self, key, value):
|
||||
environ[key] = value
|
||||
|
||||
def __str__(self):
|
||||
return "env:{}".format(super(EnvEntry, self).__str__())
|
||||
|
||||
def error(self, message):
|
||||
print("Environment configuration: {}".format(message))
|
||||
from clearml_agent.helper.environment import EnvEntry
|
||||
|
||||
|
||||
def backward_compatibility_support():
|
||||
@@ -34,6 +8,7 @@ def backward_compatibility_support():
|
||||
if ENVIRONMENT_BACKWARD_COMPATIBLE.get():
|
||||
# Add TRAINS_ prefix on every CLEARML_ os environment we support
|
||||
for k, v in ENVIRONMENT_CONFIG.items():
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
trains_vars = [var for var in v.vars if var.startswith('CLEARML_')]
|
||||
if not trains_vars:
|
||||
@@ -44,6 +19,7 @@ def backward_compatibility_support():
|
||||
except:
|
||||
continue
|
||||
for k, v in ENVIRONMENT_SDK_PARAMS.items():
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
trains_vars = [var for var in v if var.startswith('CLEARML_')]
|
||||
if not trains_vars:
|
||||
@@ -62,3 +38,9 @@ def backward_compatibility_support():
|
||||
backwards_k = k.replace('CLEARML_', 'TRAINS_', 1)
|
||||
if backwards_k not in keys:
|
||||
environ[backwards_k] = environ[k]
|
||||
|
||||
|
||||
__all__ = [
|
||||
"EnvEntry",
|
||||
"backward_compatibility_support"
|
||||
]
|
||||
@@ -31,7 +31,8 @@ def apply_environment(config):
|
||||
keys = list(filter(None, env_vars.keys()))
|
||||
|
||||
for key in keys:
|
||||
os.environ[str(key)] = str(env_vars[key] or "")
|
||||
value = env_vars[key]
|
||||
os.environ[str(key)] = str(value if value is not None else "")
|
||||
|
||||
return keys
|
||||
|
||||
@@ -52,7 +53,7 @@ def apply_files(config):
|
||||
target_fmt = data.get("target_format", "string")
|
||||
overwrite = bool(data.get("overwrite", True))
|
||||
contents = data.get("contents")
|
||||
mode = data.get("mode")
|
||||
mode = data.get("mode", None)
|
||||
|
||||
target = Path(expanduser(expandvars(path)))
|
||||
|
||||
|
||||
@@ -2,6 +2,7 @@ from __future__ import print_function
|
||||
|
||||
import json
|
||||
import time
|
||||
from typing import List, Tuple
|
||||
|
||||
from clearml_agent.commands.base import ServiceCommandSection
|
||||
from clearml_agent.helper.base import return_list
|
||||
@@ -57,6 +58,42 @@ class Events(ServiceCommandSection):
|
||||
# print('Sending events done: %d / %d events sent' % (sent_events, len(list_events)))
|
||||
return sent_events
|
||||
|
||||
def send_log_events_with_timestamps(
|
||||
self, worker_id, task_id, lines_with_ts: List[Tuple[str, str]], level="DEBUG", session=None
|
||||
):
|
||||
log_events = []
|
||||
|
||||
# break log lines into event packets
|
||||
for ts, line in return_list(lines_with_ts):
|
||||
# HACK ignore terminal reset ANSI code
|
||||
if line == '\x1b[0m':
|
||||
continue
|
||||
while line:
|
||||
if len(line) <= self.max_event_size:
|
||||
msg = line
|
||||
line = None
|
||||
else:
|
||||
msg = line[:self.max_event_size]
|
||||
line = line[self.max_event_size:]
|
||||
|
||||
log_events.append(
|
||||
{
|
||||
"type": "log",
|
||||
"level": level,
|
||||
"task": task_id,
|
||||
"worker": worker_id,
|
||||
"msg": msg,
|
||||
"timestamp": ts,
|
||||
}
|
||||
)
|
||||
|
||||
if line and ts is not None:
|
||||
# advance timestamp in case we break a line to more than one part
|
||||
ts += 1
|
||||
|
||||
# now send the events
|
||||
return self.send_events(list_events=log_events, session=session)
|
||||
|
||||
def send_log_events(self, worker_id, task_id, lines, level='DEBUG', session=None):
|
||||
log_events = []
|
||||
base_timestamp = int(time.time() * 1000)
|
||||
|
||||
@@ -1,14 +1,16 @@
|
||||
import json
|
||||
import re
|
||||
import shlex
|
||||
from copy import copy
|
||||
|
||||
from clearml_agent.backend_api.session import Request
|
||||
from clearml_agent.helper.docker_args import DockerArgsSanitizer
|
||||
from clearml_agent.helper.package.requirements import (
|
||||
RequirementsManager, MarkerRequirement,
|
||||
compare_version_rules, )
|
||||
|
||||
|
||||
def resolve_default_container(session, task_id, container_config):
|
||||
def resolve_default_container(session, task_id, container_config, ignore_match_rules=False):
|
||||
container_lookup = session.config.get('agent.default_docker.match_rules', None)
|
||||
if not session.check_min_api_version("2.13") or not container_lookup:
|
||||
return container_config
|
||||
@@ -17,6 +19,12 @@ def resolve_default_container(session, task_id, container_config):
|
||||
try:
|
||||
session.verify_feature_set('advanced')
|
||||
except ValueError:
|
||||
# ignoring matching rules only supported in higher tiers
|
||||
return container_config
|
||||
|
||||
if ignore_match_rules:
|
||||
print("INFO: default docker command line override, ignoring default docker container match rules")
|
||||
# ignoring matching rules only supported in higher tiers
|
||||
return container_config
|
||||
|
||||
result = session.send_request(
|
||||
@@ -159,9 +167,10 @@ def resolve_default_container(session, task_id, container_config):
|
||||
if not container_config.get('image'):
|
||||
container_config['image'] = entry.get('image', None)
|
||||
if not container_config.get('arguments'):
|
||||
container_config['arguments'] = entry.get('arguments', None)
|
||||
container_config['arguments'] = shlex.split(str(container_config.get('arguments') or '').strip())
|
||||
print('Matching default container with rule:\n{}'.format(json.dumps(entry)))
|
||||
container_config['arguments'] = entry.get('arguments', None) or ''
|
||||
if isinstance(container_config.get('arguments'), str):
|
||||
container_config['arguments'] = shlex.split(str(container_config.get('arguments') or '').strip())
|
||||
print('INFO: Matching default container with rule:\n{}'.format(json.dumps(entry)))
|
||||
return container_config
|
||||
|
||||
return container_config
|
||||
|
||||
File diff suppressed because it is too large
Load Diff
@@ -1,6 +1,5 @@
|
||||
import shlex
|
||||
from datetime import timedelta
|
||||
from distutils.util import strtobool
|
||||
from enum import IntEnum
|
||||
from os import getenv, environ
|
||||
from typing import Text, Optional, Union, Tuple, Any
|
||||
@@ -9,6 +8,7 @@ import six
|
||||
from pathlib2 import Path
|
||||
|
||||
from clearml_agent.helper.base import normalize_path
|
||||
from clearml_agent.helper.environment.converters import strtobool
|
||||
|
||||
PROGRAM_NAME = "clearml-agent"
|
||||
FROM_FILE_PREFIX_CHARS = "@"
|
||||
@@ -158,11 +158,16 @@ ENV_WORKER_ID = EnvironmentConfig("CLEARML_WORKER_ID", "TRAINS_WORKER_ID")
|
||||
ENV_WORKER_TAGS = EnvironmentConfig("CLEARML_WORKER_TAGS")
|
||||
ENV_AGENT_SKIP_PIP_VENV_INSTALL = EnvironmentConfig("CLEARML_AGENT_SKIP_PIP_VENV_INSTALL")
|
||||
ENV_AGENT_SKIP_PYTHON_ENV_INSTALL = EnvironmentConfig("CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL", type=bool)
|
||||
ENV_AGENT_FORCE_CODE_DIR = EnvironmentConfig("CLEARML_AGENT_FORCE_CODE_DIR")
|
||||
ENV_AGENT_FORCE_EXEC_SCRIPT = EnvironmentConfig("CLEARML_AGENT_FORCE_EXEC_SCRIPT")
|
||||
ENV_AGENT_FORCE_POETRY = EnvironmentConfig("CLEARML_AGENT_FORCE_POETRY", type=bool)
|
||||
ENV_AGENT_FORCE_TASK_INIT = EnvironmentConfig("CLEARML_AGENT_FORCE_TASK_INIT", type=bool)
|
||||
ENV_DOCKER_SKIP_GPUS_FLAG = EnvironmentConfig("CLEARML_DOCKER_SKIP_GPUS_FLAG", "TRAINS_DOCKER_SKIP_GPUS_FLAG")
|
||||
ENV_AGENT_GIT_USER = EnvironmentConfig("CLEARML_AGENT_GIT_USER", "TRAINS_AGENT_GIT_USER")
|
||||
ENV_AGENT_GIT_PASS = EnvironmentConfig("CLEARML_AGENT_GIT_PASS", "TRAINS_AGENT_GIT_PASS")
|
||||
ENV_AGENT_GIT_HOST = EnvironmentConfig("CLEARML_AGENT_GIT_HOST", "TRAINS_AGENT_GIT_HOST")
|
||||
ENV_AGENT_DISABLE_SSH_MOUNT = EnvironmentConfig("CLEARML_AGENT_DISABLE_SSH_MOUNT", type=bool)
|
||||
ENV_AGENT_DEBUG_GET_NEXT_TASK = EnvironmentConfig("CLEARML_AGENT_DEBUG_GET_NEXT_TASK", type=bool)
|
||||
ENV_SSH_AUTH_SOCK = EnvironmentConfig("SSH_AUTH_SOCK")
|
||||
ENV_TASK_EXECUTE_AS_USER = EnvironmentConfig("CLEARML_AGENT_EXEC_USER", "TRAINS_AGENT_EXEC_USER")
|
||||
ENV_TASK_EXTRA_PYTHON_PATH = EnvironmentConfig("CLEARML_AGENT_EXTRA_PYTHON_PATH", "TRAINS_AGENT_EXTRA_PYTHON_PATH")
|
||||
@@ -240,6 +245,12 @@ ENV_CUSTOM_BUILD_SCRIPT = EnvironmentConfig("CLEARML_AGENT_CUSTOM_BUILD_SCRIPT")
|
||||
|
||||
ENV_PACKAGE_PYTORCH_RESOLVE = EnvironmentConfig("CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE")
|
||||
|
||||
ENV_TEMP_STDOUT_FILE_DIR = EnvironmentConfig("CLEARML_AGENT_TEMP_STDOUT_FILE_DIR")
|
||||
|
||||
ENV_GIT_CLONE_VERBOSE = EnvironmentConfig("CLEARML_AGENT_GIT_CLONE_VERBOSE", type=bool)
|
||||
|
||||
ENV_GPU_FRACTIONS = EnvironmentConfig("CLEARML_AGENT_GPU_FRACTIONS")
|
||||
|
||||
|
||||
class FileBuffering(IntEnum):
|
||||
"""
|
||||
|
||||
@@ -39,7 +39,7 @@ LOCAL_REGEX = re.compile(
|
||||
|
||||
class Requirement(object):
|
||||
"""
|
||||
Represents a single requirementfrom clearml_agent.external.requirements_parser.requirement import Requirement
|
||||
Represents a single requirement from clearml_agent.external.requirements_parser.requirement import Requirement
|
||||
|
||||
Typically instances of this class are created with ``Requirement.parse``.
|
||||
For local file requirements, there's no verification that the file
|
||||
@@ -214,6 +214,7 @@ class Requirement(object):
|
||||
def parse(cls, line):
|
||||
"""
|
||||
Parses a Requirement from a line of a requirement file.
|
||||
This is the main entry point for parsing a single requirements line (not parse_line!)
|
||||
|
||||
:param line: a line of a requirement file
|
||||
:returns: a Requirement instance for the given line
|
||||
@@ -226,7 +227,7 @@ class Requirement(object):
|
||||
return cls.parse_editable(
|
||||
re.sub(r'^(-e|--editable=?)\s*', '', line))
|
||||
elif '@' in line and ('#' not in line or line.index('#') > line.index('@')):
|
||||
# Allegro bug fix: support 'name @ git+' entries
|
||||
# ClearML bug fix: support 'name @ git+' entries
|
||||
name, uri = line.split('@', 1)
|
||||
name = name.strip()
|
||||
uri = uri.strip()
|
||||
|
||||
@@ -1,7 +1,20 @@
|
||||
from clearml_agent.definitions import EnvironmentConfig
|
||||
from clearml_agent.helper.environment import EnvEntry
|
||||
|
||||
ENV_START_AGENT_SCRIPT_PATH = EnvironmentConfig('CLEARML_K8S_GLUE_START_AGENT_SCRIPT_PATH')
|
||||
ENV_START_AGENT_SCRIPT_PATH = EnvEntry("CLEARML_K8S_GLUE_START_AGENT_SCRIPT_PATH", default="~/__start_agent__.sh")
|
||||
"""
|
||||
Script path to use when creating the bash script to run the agent inside the scheduled pod's docker container.
|
||||
Script will be appended to the specified file.
|
||||
"""
|
||||
|
||||
ENV_DEFAULT_EXECUTION_AGENT_ARGS = EnvEntry("K8S_GLUE_DEF_EXEC_AGENT_ARGS", default="--full-monitoring --require-queue")
|
||||
ENV_POD_AGENT_INSTALL_ARGS = EnvEntry("K8S_GLUE_POD_AGENT_INSTALL_ARGS", default="", lstrip=False)
|
||||
ENV_POD_MONITOR_LOG_BATCH_SIZE = EnvEntry("K8S_GLUE_POD_MONITOR_LOG_BATCH_SIZE", default=5, converter=int)
|
||||
ENV_POD_MONITOR_DISABLE_ENQUEUE_ON_PREEMPTION = EnvEntry(
|
||||
"K8S_GLUE_POD_MONITOR_DISABLE_ENQUEUE_ON_PREEMPTION", default=False, converter=bool
|
||||
)
|
||||
|
||||
ENV_POD_USE_IMAGE_ENTRYPOINT = EnvEntry("K8S_GLUE_POD_USE_IMAGE_ENTRYPOINT", default=False, converter=bool)
|
||||
"""
|
||||
Do not inject a cmd and args to the container's image when building the k8s template (depend on the built-in image
|
||||
entrypoint)
|
||||
"""
|
||||
@@ -18,7 +18,6 @@ from typing import Text, List, Callable, Any, Collection, Optional, Union, Itera
|
||||
|
||||
import yaml
|
||||
|
||||
from clearml_agent.backend_api.session import Request
|
||||
from clearml_agent.commands.events import Events
|
||||
from clearml_agent.commands.worker import Worker, get_task_container, set_task_container, get_next_task
|
||||
from clearml_agent.definitions import (
|
||||
@@ -26,9 +25,9 @@ from clearml_agent.definitions import (
|
||||
ENV_AGENT_GIT_USER,
|
||||
ENV_AGENT_GIT_PASS,
|
||||
ENV_FORCE_SYSTEM_SITE_PACKAGES,
|
||||
ENV_AGENT_DEBUG_GET_NEXT_TASK,
|
||||
)
|
||||
from clearml_agent.errors import APIError, UsageError
|
||||
from clearml_agent.glue.definitions import ENV_START_AGENT_SCRIPT_PATH
|
||||
from clearml_agent.glue.errors import GetPodCountError
|
||||
from clearml_agent.glue.utilities import get_path, get_bash_output
|
||||
from clearml_agent.glue.pending_pods_daemon import PendingPodsDaemon
|
||||
@@ -37,12 +36,18 @@ from clearml_agent.helper.dicts import merge_dicts
|
||||
from clearml_agent.helper.process import get_bash_output, stringify_bash_output
|
||||
from clearml_agent.helper.resource_monitor import ResourceMonitor
|
||||
from clearml_agent.interface.base import ObjectID
|
||||
from clearml_agent.backend_api.session import Request
|
||||
from clearml_agent.glue.definitions import (
|
||||
ENV_START_AGENT_SCRIPT_PATH,
|
||||
ENV_DEFAULT_EXECUTION_AGENT_ARGS,
|
||||
ENV_POD_AGENT_INSTALL_ARGS,
|
||||
ENV_POD_USE_IMAGE_ENTRYPOINT,
|
||||
)
|
||||
|
||||
|
||||
class K8sIntegration(Worker):
|
||||
SUPPORTED_KIND = ("pod", "job")
|
||||
K8S_PENDING_QUEUE = "k8s_scheduler"
|
||||
|
||||
K8S_DEFAULT_NAMESPACE = "clearml"
|
||||
AGENT_LABEL = "CLEARML=agent"
|
||||
QUEUE_LABEL = "clearml-agent-queue"
|
||||
@@ -64,19 +69,23 @@ class K8sIntegration(Worker):
|
||||
'echo "ldconfig" >> /etc/profile',
|
||||
"/usr/sbin/sshd -p {port}"]
|
||||
|
||||
DEFAULT_EXECUTION_AGENT_ARGS = os.getenv("K8S_GLUE_DEF_EXEC_AGENT_ARGS", "--full-monitoring --require-queue")
|
||||
POD_AGENT_INSTALL_ARGS = os.getenv("K8S_GLUE_POD_AGENT_INSTALL_ARGS", "")
|
||||
|
||||
CONTAINER_BASH_SCRIPT = [
|
||||
_CONTAINER_APT_SCRIPT_SECTION = [
|
||||
"export DEBIAN_FRONTEND='noninteractive'",
|
||||
"echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean",
|
||||
"chown -R root /root/.cache/pip",
|
||||
"apt-get update",
|
||||
"apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0",
|
||||
]
|
||||
|
||||
CONTAINER_BASH_SCRIPT = [
|
||||
*(
|
||||
'[ ! -z "$CLEARML_AGENT_SKIP_CONTAINER_APT" ] || {}'.format(line)
|
||||
for line in _CONTAINER_APT_SCRIPT_SECTION
|
||||
),
|
||||
"declare LOCAL_PYTHON",
|
||||
"[ ! -z $LOCAL_PYTHON ] || for i in {{15..5}}; do which python3.$i && python3.$i -m pip --version && "
|
||||
"export LOCAL_PYTHON=$(which python3.$i) && break ; done",
|
||||
"[ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip",
|
||||
'[ ! -z "$CLEARML_AGENT_SKIP_CONTAINER_APT" ] || [ ! -z "$LOCAL_PYTHON" ] || apt-get install -y python3-pip',
|
||||
"[ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3",
|
||||
"{extra_bash_init_cmd}",
|
||||
"[ ! -z $CLEARML_AGENT_NO_UPDATE ] || $LOCAL_PYTHON -m pip install clearml-agent{agent_install_args}",
|
||||
@@ -98,6 +107,7 @@ class K8sIntegration(Worker):
|
||||
num_of_services=20,
|
||||
base_pod_num=1,
|
||||
user_props_cb=None,
|
||||
runtime_cb=None,
|
||||
overrides_yaml=None,
|
||||
template_yaml=None,
|
||||
clearml_conf_file=None,
|
||||
@@ -106,6 +116,7 @@ class K8sIntegration(Worker):
|
||||
max_pods_limit=None,
|
||||
pod_name_prefix=None,
|
||||
limit_pod_label=None,
|
||||
force_system_packages=None,
|
||||
**kwargs
|
||||
):
|
||||
"""
|
||||
@@ -124,6 +135,7 @@ class K8sIntegration(Worker):
|
||||
:param callable user_props_cb: An Optional callable allowing additional user properties to be specified
|
||||
when scheduling a task to run in a pod. Callable can receive an optional pod number and should return
|
||||
a dictionary of user properties (name and value). Signature is [[Optional[int]], Dict[str,str]]
|
||||
:param callable runtime_cb: An Optional callable allowing additional task runtime to be specified (see user_props_cb)
|
||||
:param str overrides_yaml: YAML file containing the overrides for the pod (optional)
|
||||
:param str template_yaml: YAML file containing the template for the pod (optional).
|
||||
If provided the pod is scheduled with kubectl apply and overrides are ignored, otherwise with kubectl run.
|
||||
@@ -142,7 +154,8 @@ class K8sIntegration(Worker):
|
||||
self.k8s_pending_queue_name = k8s_pending_queue_name or self.K8S_PENDING_QUEUE
|
||||
self.k8s_pending_queue_id = None
|
||||
self.container_bash_script = container_bash_script or self.CONTAINER_BASH_SCRIPT
|
||||
force_system_packages = ENV_FORCE_SYSTEM_SITE_PACKAGES.get()
|
||||
if force_system_packages is None:
|
||||
force_system_packages = ENV_FORCE_SYSTEM_SITE_PACKAGES.get()
|
||||
self._force_system_site_packages = force_system_packages if force_system_packages is not None else True
|
||||
if self._force_system_site_packages:
|
||||
# Use system packages, because by we will be running inside a docker
|
||||
@@ -157,6 +170,7 @@ class K8sIntegration(Worker):
|
||||
self.base_pod_num = base_pod_num
|
||||
self._edit_hyperparams_support = None
|
||||
self._user_props_cb = user_props_cb
|
||||
self._runtime_cb = runtime_cb
|
||||
self.conf_file_content = None
|
||||
self.overrides_json_string = None
|
||||
self.template_dict = None
|
||||
@@ -181,7 +195,7 @@ class K8sIntegration(Worker):
|
||||
|
||||
self._agent_label = None
|
||||
|
||||
self._pending_pods_daemon = self._create_pending_pods_daemon(
|
||||
self._pending_pods_daemon = self._create_daemon_instance(
|
||||
cls_=PendingPodsDaemon,
|
||||
polling_interval=self._polling_interval
|
||||
)
|
||||
@@ -190,7 +204,15 @@ class K8sIntegration(Worker):
|
||||
self._min_cleanup_interval_per_ns_sec = 1.0
|
||||
self._last_pod_cleanup_per_ns = defaultdict(lambda: 0.)
|
||||
|
||||
def _create_pending_pods_daemon(self, cls_, **kwargs):
|
||||
self._server_supports_same_state_transition = (
|
||||
self._session.feature_set != "basic" and self._session.check_min_server_version("3.22.3")
|
||||
)
|
||||
|
||||
@property
|
||||
def agent_label(self):
|
||||
return self._get_agent_label()
|
||||
|
||||
def _create_daemon_instance(self, cls_, **kwargs):
|
||||
return cls_(agent=self, **kwargs)
|
||||
|
||||
def _load_overrides_yaml(self, overrides_yaml):
|
||||
@@ -370,8 +392,9 @@ class K8sIntegration(Worker):
|
||||
self.log.warning('Failed parsing kubectl output:\n{}\nEx: {}'.format(output, ex))
|
||||
|
||||
def get_pods_for_jobs(self, job_condition: str = None, pod_filters: List[str] = None, debug_msg: str = None):
|
||||
# Use metadata.uid so job related pods can be found filterin g following list with this param
|
||||
controller_uids = self.get_jobs_info(
|
||||
"spec.selector.matchLabels.controller-uid", condition=job_condition, debug_msg=debug_msg
|
||||
"metadata.uid", condition=job_condition, debug_msg=debug_msg
|
||||
)
|
||||
if not controller_uids:
|
||||
# No pods were found for these jobs
|
||||
@@ -417,6 +440,13 @@ class K8sIntegration(Worker):
|
||||
)
|
||||
raise GetPodCountError()
|
||||
|
||||
def resource_applied(self, resource_name: str, namespace: str, task_id: str, session):
|
||||
""" Called when a resource (pod/job) was applied """
|
||||
pass
|
||||
|
||||
def ports_mode_supported_for_task(self, task_id: str, task_data):
|
||||
return self.ports_mode
|
||||
|
||||
def run_one_task(self, queue: Text, task_id: Text, worker_args=None, task_session=None, **_):
|
||||
print('Pulling task {} launching on kubernetes cluster'.format(task_id))
|
||||
session = task_session or self._session
|
||||
@@ -426,7 +456,9 @@ class K8sIntegration(Worker):
|
||||
if self._is_same_tenant(task_session):
|
||||
try:
|
||||
print('Pushing task {} into temporary pending queue'.format(task_id))
|
||||
_ = session.api_client.tasks.stop(task_id, force=True)
|
||||
|
||||
if not self._server_supports_same_state_transition:
|
||||
_ = session.api_client.tasks.stop(task_id, force=True, status_reason="moving to k8s pending queue")
|
||||
|
||||
# Just make sure to clean up in case the task is stuck in the queue (known issue)
|
||||
self._session.api_client.queues.remove_task(
|
||||
@@ -486,8 +518,10 @@ class K8sIntegration(Worker):
|
||||
)
|
||||
)
|
||||
|
||||
if self.ports_mode:
|
||||
ports_mode = False
|
||||
if self.ports_mode_supported_for_task(task_id, task_data):
|
||||
print("Kubernetes looking for available pod to use")
|
||||
ports_mode = True
|
||||
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
@@ -498,12 +532,12 @@ class K8sIntegration(Worker):
|
||||
# Search for a free pod number
|
||||
pod_count = 0
|
||||
pod_number = self.base_pod_num
|
||||
while self.ports_mode or self.max_pods_limit:
|
||||
while ports_mode or self.max_pods_limit:
|
||||
pod_number = self.base_pod_num + pod_count
|
||||
|
||||
try:
|
||||
items_count = self._get_pod_count(
|
||||
extra_labels=[self.limit_pod_label.format(pod_number=pod_number)] if self.ports_mode else None,
|
||||
extra_labels=[self.limit_pod_label.format(pod_number=pod_number)] if ports_mode else None,
|
||||
msg="Looking for a free pod/port"
|
||||
)
|
||||
except GetPodCountError:
|
||||
@@ -553,17 +587,17 @@ class K8sIntegration(Worker):
|
||||
break
|
||||
pod_count += 1
|
||||
|
||||
labels = self._get_pod_labels(queue, queue_name)
|
||||
if self.ports_mode:
|
||||
labels = self._get_pod_labels(queue, queue_name, task_data)
|
||||
if ports_mode:
|
||||
labels.append(self.limit_pod_label.format(pod_number=pod_number))
|
||||
|
||||
if self.ports_mode:
|
||||
if ports_mode:
|
||||
print("Kubernetes scheduling task id={} on pod={} (pod_count={})".format(task_id, pod_number, pod_count))
|
||||
else:
|
||||
print("Kubernetes scheduling task id={}".format(task_id))
|
||||
|
||||
try:
|
||||
template = self._resolve_template(task_session, task_data, queue)
|
||||
template = self._resolve_template(task_session, task_data, queue, task_id)
|
||||
except Exception as ex:
|
||||
print("ERROR: Failed resolving template (skipping): {}".format(ex))
|
||||
return
|
||||
@@ -573,45 +607,79 @@ class K8sIntegration(Worker):
|
||||
except (KeyError, TypeError, AttributeError):
|
||||
namespace = self.namespace
|
||||
|
||||
if template:
|
||||
output, error = self._kubectl_apply(
|
||||
template=template,
|
||||
pod_number=pod_number,
|
||||
clearml_conf_create_script=clearml_conf_create_script,
|
||||
labels=labels,
|
||||
docker_image=container['image'],
|
||||
docker_args=container['arguments'],
|
||||
docker_bash=container.get('setup_shell_script'),
|
||||
task_id=task_id,
|
||||
queue=queue,
|
||||
namespace=namespace,
|
||||
if not template:
|
||||
print("ERROR: no template for task {}, skipping".format(task_id))
|
||||
return
|
||||
|
||||
output, error, pod_name = self._kubectl_apply(
|
||||
template=template,
|
||||
pod_number=pod_number,
|
||||
clearml_conf_create_script=clearml_conf_create_script,
|
||||
labels=labels,
|
||||
docker_image=container['image'],
|
||||
docker_args=container.get('arguments'),
|
||||
docker_bash=container.get('setup_shell_script'),
|
||||
task_id=task_id,
|
||||
queue=queue,
|
||||
namespace=namespace,
|
||||
task_token=task_session.token.encode("ascii") if task_session else None,
|
||||
)
|
||||
|
||||
print('kubectl output:\n{}\n{}'.format(error, output))
|
||||
if error:
|
||||
send_log = "Running kubectl encountered an error: {}".format(error)
|
||||
self.log.error(send_log)
|
||||
self.send_logs(task_id, send_log.splitlines())
|
||||
|
||||
# Make sure to remove the task from our k8s pending queue
|
||||
self._session.api_client.queues.remove_task(
|
||||
task=task_id,
|
||||
queue=self.k8s_pending_queue_id,
|
||||
)
|
||||
# Set task as failed
|
||||
session.api_client.tasks.failed(task_id, force=True)
|
||||
return
|
||||
|
||||
if pod_name:
|
||||
self.resource_applied(
|
||||
resource_name=pod_name, namespace=namespace, task_id=task_id, session=session
|
||||
)
|
||||
|
||||
print('kubectl output:\n{}\n{}'.format(error, output))
|
||||
if error:
|
||||
send_log = "Running kubectl encountered an error: {}".format(error)
|
||||
self.log.error(send_log)
|
||||
self.send_logs(task_id, send_log.splitlines())
|
||||
self.set_task_info(
|
||||
task_id=task_id, task_session=task_session, queue_name=queue_name, ports_mode=ports_mode,
|
||||
pod_number=pod_number, pod_count=pod_count, task_data=task_data
|
||||
)
|
||||
|
||||
def set_task_info(
|
||||
self, task_id: str, task_session, task_data, queue_name: str, ports_mode: bool, pod_number, pod_count
|
||||
):
|
||||
user_props = {"k8s-queue": str(queue_name)}
|
||||
if self.ports_mode:
|
||||
user_props.update(
|
||||
{
|
||||
"k8s-pod-number": pod_number,
|
||||
"k8s-pod-label": labels[0],
|
||||
"k8s-internal-pod-count": pod_count,
|
||||
"k8s-agent": self._get_agent_label(),
|
||||
}
|
||||
)
|
||||
runtime = {}
|
||||
if ports_mode:
|
||||
agent_label = self._get_agent_label()
|
||||
user_props.update({
|
||||
"k8s-pod-number": pod_number,
|
||||
"k8s-pod-label": agent_label, # backwards-compatibility / legacy
|
||||
"k8s-internal-pod-count": pod_count,
|
||||
"k8s-agent": agent_label,
|
||||
})
|
||||
|
||||
if self._user_props_cb:
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
custom_props = self._user_props_cb(pod_number) if self.ports_mode else self._user_props_cb()
|
||||
custom_props = self._user_props_cb(pod_number) if ports_mode else self._user_props_cb()
|
||||
user_props.update(custom_props)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
if self._runtime_cb:
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
custom_runtime = self._runtime_cb(pod_number) if ports_mode else self._runtime_cb()
|
||||
runtime.update(custom_runtime)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
if user_props:
|
||||
self._set_task_user_properties(
|
||||
task_id=task_id,
|
||||
@@ -619,7 +687,38 @@ class K8sIntegration(Worker):
|
||||
**user_props
|
||||
)
|
||||
|
||||
def _get_pod_labels(self, queue, queue_name):
|
||||
if runtime:
|
||||
task_runtime = self._get_task_runtime(task_id) or {}
|
||||
task_runtime.update(runtime)
|
||||
|
||||
try:
|
||||
res = task_session.send_request(
|
||||
service='tasks', action='edit', method=Request.def_method,
|
||||
json={
|
||||
"task": task_id, "force": True, "runtime": task_runtime
|
||||
},
|
||||
)
|
||||
if not res.ok:
|
||||
raise Exception("failed setting runtime property")
|
||||
except Exception as ex:
|
||||
print("WARNING: failed setting custom runtime properties for task '{}': {}".format(task_id, ex))
|
||||
|
||||
def _get_task_runtime(self, task_id) -> Optional[dict]:
|
||||
try:
|
||||
res = self._session.send_request(
|
||||
service='tasks', action='get_by_id', method=Request.def_method,
|
||||
json={"task": task_id, "only_fields": ["runtime"]},
|
||||
)
|
||||
if not res.ok:
|
||||
raise ValueError(f"request returned {res.status_code}")
|
||||
data = res.json().get("data")
|
||||
if not data or "task" not in data:
|
||||
raise ValueError("empty data in result")
|
||||
return data["task"].get("runtime", {})
|
||||
except Exception as ex:
|
||||
print(f"ERROR: Failed getting runtime properties for task {task_id}: {ex}")
|
||||
|
||||
def _get_pod_labels(self, queue, queue_name, task_data):
|
||||
return [
|
||||
self._get_agent_label(),
|
||||
"{}={}".format(self.QUEUE_LABEL, self._safe_k8s_label_value(queue)),
|
||||
@@ -652,9 +751,12 @@ class K8sIntegration(Worker):
|
||||
return {target: results} if results else {}
|
||||
return results
|
||||
|
||||
def get_task_worker_id(self, template, task_id, pod_name, namespace, queue):
|
||||
return f"{self.worker_id}:{task_id}"
|
||||
|
||||
def _create_template_container(
|
||||
self, pod_name: str, task_id: str, docker_image: str, docker_args: List[str],
|
||||
docker_bash: str, clearml_conf_create_script: List[str]
|
||||
docker_bash: str, clearml_conf_create_script: List[str], task_worker_id: str, task_token: str = None
|
||||
) -> dict:
|
||||
container = self._get_docker_args(
|
||||
docker_args,
|
||||
@@ -663,6 +765,32 @@ class K8sIntegration(Worker):
|
||||
convert=lambda env: {'name': env.partition("=")[0], 'value': env.partition("=")[2]},
|
||||
)
|
||||
|
||||
def add_or_update_env_var(name, value):
|
||||
env_vars = container.get('env', [])
|
||||
for entry in env_vars:
|
||||
if entry.get('name') == name:
|
||||
entry['value'] = value
|
||||
break
|
||||
else:
|
||||
container['env'] = env_vars + [{'name': name, 'value': value}]
|
||||
|
||||
# Set worker ID
|
||||
add_or_update_env_var('CLEARML_WORKER_ID', task_worker_id)
|
||||
|
||||
if ENV_POD_USE_IMAGE_ENTRYPOINT.get():
|
||||
# Don't add a cmd and args, just the image
|
||||
|
||||
# Add the task ID and token since we need it (it's usually in the init script passed to us
|
||||
add_or_update_env_var('CLEARML_TASK_ID', task_id)
|
||||
if task_token:
|
||||
# TODO: find a way to base64 encode the token
|
||||
add_or_update_env_var('CLEARML_AUTH_TOKEN', task_token)
|
||||
|
||||
return self._merge_containers(
|
||||
container, dict(name=pod_name, image=docker_image)
|
||||
)
|
||||
|
||||
# Create bash script for container and
|
||||
container_bash_script = [self.container_bash_script] if isinstance(self.container_bash_script, str) \
|
||||
else self.container_bash_script
|
||||
|
||||
@@ -675,8 +803,8 @@ class K8sIntegration(Worker):
|
||||
[line.format(extra_bash_init_cmd=self.extra_bash_init_script or '',
|
||||
task_id=task_id,
|
||||
extra_docker_bash_script=extra_docker_bash_script,
|
||||
default_execution_agent_args=self.DEFAULT_EXECUTION_AGENT_ARGS,
|
||||
agent_install_args=self.POD_AGENT_INSTALL_ARGS)
|
||||
default_execution_agent_args=ENV_DEFAULT_EXECUTION_AGENT_ARGS.get(),
|
||||
agent_install_args=ENV_POD_AGENT_INSTALL_ARGS.get())
|
||||
for line in container_bash_script])
|
||||
|
||||
extra_bash_commands = list(clearml_conf_create_script or [])
|
||||
@@ -710,15 +838,18 @@ class K8sIntegration(Worker):
|
||||
queue,
|
||||
task_id,
|
||||
namespace,
|
||||
template=None,
|
||||
pod_number=None
|
||||
template,
|
||||
pod_number=None,
|
||||
task_token=None,
|
||||
):
|
||||
if "apiVersion" not in template:
|
||||
template["apiVersion"] = "batch/v1" if self.using_jobs else "v1"
|
||||
if "kind" in template:
|
||||
if template["kind"].lower() != self.kind:
|
||||
return (
|
||||
"", f"Template kind {template['kind']} does not maych kind {self.kind.capitalize()} set for agent"
|
||||
"",
|
||||
f"Template kind {template['kind']} does not maych kind {self.kind.capitalize()} set for agent",
|
||||
None
|
||||
)
|
||||
else:
|
||||
template["kind"] = self.kind.capitalize()
|
||||
@@ -740,7 +871,7 @@ class K8sIntegration(Worker):
|
||||
spec.setdefault('backoffLimit', 0)
|
||||
spec_template = spec.setdefault('template', {})
|
||||
if labels:
|
||||
# Place same labels fro any pod spawned by the job
|
||||
# Place same labels for any pod spawned by the job
|
||||
place_labels(spec_template.setdefault('metadata', {}))
|
||||
|
||||
spec = spec_template.setdefault('spec', {})
|
||||
@@ -748,13 +879,17 @@ class K8sIntegration(Worker):
|
||||
containers = spec.setdefault('containers', [])
|
||||
spec.setdefault('restartPolicy', 'Never')
|
||||
|
||||
task_worker_id = self.get_task_worker_id(template, task_id, name, namespace, queue)
|
||||
|
||||
container = self._create_template_container(
|
||||
pod_name=name,
|
||||
task_id=task_id,
|
||||
docker_image=docker_image,
|
||||
docker_args=docker_args,
|
||||
docker_bash=docker_bash,
|
||||
clearml_conf_create_script=clearml_conf_create_script
|
||||
clearml_conf_create_script=clearml_conf_create_script,
|
||||
task_worker_id=task_worker_id,
|
||||
task_token=task_token,
|
||||
)
|
||||
|
||||
if containers:
|
||||
@@ -789,11 +924,11 @@ class K8sIntegration(Worker):
|
||||
process = subprocess.Popen(kubectl_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
|
||||
output, error = process.communicate()
|
||||
except Exception as ex:
|
||||
return None, str(ex)
|
||||
return None, str(ex), None
|
||||
finally:
|
||||
safe_remove_file(yaml_file)
|
||||
|
||||
return stringify_bash_output(output), stringify_bash_output(error)
|
||||
return stringify_bash_output(output), stringify_bash_output(error), name
|
||||
|
||||
def _process_bash_lines_response(self, bash_cmd: str, raise_error=True):
|
||||
res = get_bash_output(bash_cmd, raise_error=raise_error)
|
||||
@@ -901,7 +1036,7 @@ class K8sIntegration(Worker):
|
||||
result = self._session.get(
|
||||
service='tasks',
|
||||
action='get_all',
|
||||
json={"id": task_ids, "status": ["in_progress", "queued"], "only_fields": ["id", "status"]},
|
||||
json={"id": task_ids, "status": ["in_progress", "queued"], "only_fields": ["id", "status", "status_reason"]},
|
||||
method=Request.def_method,
|
||||
)
|
||||
tasks_to_abort = result["tasks"]
|
||||
@@ -911,9 +1046,13 @@ class K8sIntegration(Worker):
|
||||
for task in tasks_to_abort:
|
||||
task_id = task.get("id")
|
||||
status = task.get("status")
|
||||
status_reason = (task.get("status_reason") or "").lower()
|
||||
if not task_id or not status:
|
||||
self.log.warning('Failed getting task information: id={}, status={}'.format(task_id, status))
|
||||
continue
|
||||
if status == "queued" and "pushed back by policy manager" in status_reason:
|
||||
# Task was pushed back to policy queue by policy manager, don't touch it
|
||||
continue
|
||||
try:
|
||||
if status == "queued":
|
||||
self._session.get(
|
||||
@@ -947,6 +1086,9 @@ class K8sIntegration(Worker):
|
||||
|
||||
return deleted_pods
|
||||
|
||||
def check_if_suspended(self) -> bool:
|
||||
pass
|
||||
|
||||
def run_tasks_loop(self, queues: List[Text], worker_params, **kwargs):
|
||||
"""
|
||||
:summary: Pull and run tasks from queues.
|
||||
@@ -958,6 +1100,8 @@ class K8sIntegration(Worker):
|
||||
:param worker_params: Worker command line arguments
|
||||
:type worker_params: ``clearml_agent.helper.process.WorkerParams``
|
||||
"""
|
||||
# print("debug> running tasks loop")
|
||||
|
||||
events_service = self.get_service(Events)
|
||||
|
||||
# make sure we have a k8s pending queue
|
||||
@@ -989,12 +1133,19 @@ class K8sIntegration(Worker):
|
||||
continue
|
||||
|
||||
# iterate over queues (priority style, queues[0] is highest)
|
||||
# print("debug> iterating over queues")
|
||||
for queue in queues:
|
||||
# delete old completed / failed pods
|
||||
self._cleanup_old_pods(namespaces, extra_msg="Cleanup cycle {cmd}")
|
||||
|
||||
if self.check_if_suspended():
|
||||
print("Agent is suspended, sleeping for {:.1f} seconds".format(self._polling_interval))
|
||||
sleep(self._polling_interval)
|
||||
break
|
||||
|
||||
# get next task in queue
|
||||
try:
|
||||
# print(f"debug> getting tasks for queue {queue}")
|
||||
response = self._get_next_task(queue=queue, get_task_info=self._impersonate_as_task_owner)
|
||||
except Exception as e:
|
||||
print("Warning: Could not access task queue [{}], error: {}".format(queue, e))
|
||||
@@ -1008,6 +1159,8 @@ class K8sIntegration(Worker):
|
||||
print("No tasks in queue {}".format(queue))
|
||||
continue
|
||||
|
||||
print('Received task {} from queue {}'.format(task_id, queue))
|
||||
|
||||
task_session = None
|
||||
if self._impersonate_as_task_owner:
|
||||
try:
|
||||
@@ -1059,8 +1212,9 @@ class K8sIntegration(Worker):
|
||||
|
||||
:param list(str) queue: queue name to pull from
|
||||
"""
|
||||
queues = queue if isinstance(queue, (list, tuple)) else ([queue] if queue else None)
|
||||
return self.daemon(
|
||||
queues=[ObjectID(name=queue)] if queue else None,
|
||||
queues=[ObjectID(name=q) for q in queues] if queues else None,
|
||||
log_level=logging.INFO, foreground=True, docker=False, **kwargs,
|
||||
)
|
||||
|
||||
@@ -1069,7 +1223,7 @@ class K8sIntegration(Worker):
|
||||
self._session, queue=queue, get_task_info=get_task_info
|
||||
)
|
||||
|
||||
def _resolve_template(self, task_session, task_data, queue):
|
||||
def _resolve_template(self, task_session, task_data, queue, task_id):
|
||||
if self.template_dict:
|
||||
return deepcopy(self.template_dict)
|
||||
|
||||
|
||||
@@ -9,6 +9,7 @@ from clearml_agent.helper.process import stringify_bash_output
|
||||
from .daemon import K8sDaemon
|
||||
from .utilities import get_path
|
||||
from .errors import GetPodsError
|
||||
from .definitions import ENV_POD_MONITOR_DISABLE_ENQUEUE_ON_PREEMPTION
|
||||
|
||||
|
||||
class PendingPodsDaemon(K8sDaemon):
|
||||
@@ -17,17 +18,16 @@ class PendingPodsDaemon(K8sDaemon):
|
||||
self._polling_interval = polling_interval
|
||||
self._last_tasks_msgs = {} # last msg updated for every task
|
||||
|
||||
def get_pods(self):
|
||||
def get_pods(self, pod_name=None, debug_msg="Detecting pending pods: {cmd}"):
|
||||
filters = ["status.phase=Pending"]
|
||||
if pod_name:
|
||||
filters.append(f"metadata.name={pod_name}")
|
||||
|
||||
if self._agent.using_jobs:
|
||||
return self._agent.get_pods_for_jobs(
|
||||
job_condition="status.active=1",
|
||||
pod_filters=["status.phase=Pending"],
|
||||
debug_msg="Detecting pending pods: {cmd}"
|
||||
job_condition="status.active=1", pod_filters=filters, debug_msg=debug_msg
|
||||
)
|
||||
return self._agent.get_pods(
|
||||
filters=["status.phase=Pending"],
|
||||
debug_msg="Detecting pending pods: {cmd}"
|
||||
)
|
||||
return self._agent.get_pods(filters=filters, debug_msg=debug_msg)
|
||||
|
||||
def _get_pod_name(self, pod: dict):
|
||||
return get_path(pod, "metadata", "name")
|
||||
@@ -73,6 +73,11 @@ class PendingPodsDaemon(K8sDaemon):
|
||||
if not namespace:
|
||||
continue
|
||||
|
||||
updated_pod = self.get_pods(pod_name=pod_name, debug_msg="Refreshing pod information: {cmd}")
|
||||
if not updated_pod:
|
||||
continue
|
||||
pod = updated_pod[0]
|
||||
|
||||
task_id_to_pod[task_id] = pod
|
||||
|
||||
msg = None
|
||||
@@ -149,9 +154,7 @@ class PendingPodsDaemon(K8sDaemon):
|
||||
"id": list(pending_tasks_details),
|
||||
"status": ["stopped"],
|
||||
"only_fields": ["id"]
|
||||
},
|
||||
method=Request.def_method,
|
||||
async_enable=False,
|
||||
}
|
||||
)
|
||||
aborted_task_ids = list(filter(None, (task.get("id") for task in result["tasks"])))
|
||||
|
||||
@@ -160,11 +163,27 @@ class PendingPodsDaemon(K8sDaemon):
|
||||
if not pod:
|
||||
self.log.error("Failed locating aborted task {} in pending pods list".format(task_id))
|
||||
continue
|
||||
|
||||
pod_name = self._get_pod_name(pod)
|
||||
if not self.get_pods(pod_name=pod_name):
|
||||
self.log.debug("K8S Glue pending monitor: pod {} is no longer pending, skipping".format(pod_name))
|
||||
continue
|
||||
|
||||
resource_name = self._get_k8s_resource_name(pod)
|
||||
self.log.info(
|
||||
"K8S Glue pending monitor: task {} was aborted but the k8s resource {} is still pending, "
|
||||
"deleting pod".format(task_id, resource_name)
|
||||
)
|
||||
|
||||
result = self._session.get(
|
||||
service='tasks',
|
||||
action='get_all',
|
||||
json={"id": [task_id], "status": ["stopped"], "only_fields": ["id"]},
|
||||
)
|
||||
if not result["tasks"]:
|
||||
self.log.debug("K8S Glue pending monitor: task {} is no longer aborted, skipping".format(task_id))
|
||||
continue
|
||||
|
||||
output = self.delete_k8s_resource(k8s_resource=pod, msg="Pending resource of an aborted task")
|
||||
if not output:
|
||||
self.log.warning("K8S Glue pending monitor: failed deleting resource {}".format(resource_name))
|
||||
@@ -177,32 +196,39 @@ class PendingPodsDaemon(K8sDaemon):
|
||||
if not msg or self._last_tasks_msgs.get(task_id, None) == (msg, tags):
|
||||
return
|
||||
try:
|
||||
# Make sure the task is queued
|
||||
result = self._session.send_request(
|
||||
service='tasks',
|
||||
action='get_all',
|
||||
json={"id": task_id, "only_fields": ["status"]},
|
||||
method=Request.def_method,
|
||||
async_enable=False,
|
||||
)
|
||||
if result.ok:
|
||||
status = get_path(result.json(), 'data', 'tasks', 0, 'status')
|
||||
# if task is in progress, change its status to enqueued
|
||||
if status == "in_progress":
|
||||
result = self._session.send_request(
|
||||
service='tasks', action='enqueue',
|
||||
json={
|
||||
"task": task_id, "force": True, "queue": self._agent.k8s_pending_queue_id
|
||||
},
|
||||
method=Request.def_method,
|
||||
async_enable=False,
|
||||
)
|
||||
if not result.ok:
|
||||
result_msg = get_path(result.json(), 'meta', 'result_msg')
|
||||
self.log.debug(
|
||||
"K8S Glue pods monitor: failed forcing task status change"
|
||||
" for pending task {}: {}".format(task_id, result_msg)
|
||||
if ENV_POD_MONITOR_DISABLE_ENQUEUE_ON_PREEMPTION.get():
|
||||
# This disables the option to enqueue the task which is supposed to sync the ClearML task status
|
||||
# in case the pod was preempted. In some cases this does not happen due to preemption but due to
|
||||
# cluster communication lag issues that cause us not to discover the pod is no longer pending and
|
||||
# enqueue the task when it's actually already running, thus essentially killing the task
|
||||
pass
|
||||
else:
|
||||
# Make sure the task is queued
|
||||
result = self._session.send_request(
|
||||
service='tasks',
|
||||
action='get_all',
|
||||
json={"id": task_id, "only_fields": ["status"]},
|
||||
method=Request.def_method,
|
||||
async_enable=False,
|
||||
)
|
||||
if result.ok:
|
||||
status = get_path(result.json(), 'data', 'tasks', 0, 'status')
|
||||
# if task is in progress, change its status to enqueued
|
||||
if status == "in_progress":
|
||||
result = self._session.send_request(
|
||||
service='tasks', action='enqueue',
|
||||
json={
|
||||
"task": task_id, "force": True, "queue": self._agent.k8s_pending_queue_id
|
||||
},
|
||||
method=Request.def_method,
|
||||
async_enable=False,
|
||||
)
|
||||
if not result.ok:
|
||||
result_msg = get_path(result.json(), 'meta', 'result_msg')
|
||||
self.log.debug(
|
||||
"K8S Glue pods monitor: failed forcing task status change"
|
||||
" for pending task {}: {}".format(task_id, result_msg)
|
||||
)
|
||||
|
||||
# Update task status message
|
||||
payload = {"task": task_id, "status_message": "K8S glue status: {}".format(msg)}
|
||||
|
||||
@@ -14,7 +14,6 @@ import sys
|
||||
import tempfile
|
||||
from abc import ABCMeta
|
||||
from collections import OrderedDict
|
||||
from distutils.spawn import find_executable
|
||||
from functools import total_ordering
|
||||
from typing import Text, Dict, Any, Optional, AnyStr, IO, Union
|
||||
|
||||
@@ -38,6 +37,7 @@ use_powershell = os.getenv("CLEARML_AGENT_USE_POWERSHELL", None)
|
||||
|
||||
|
||||
def which(cmd, path=None):
|
||||
from clearml_agent.helper.process import find_executable
|
||||
result = find_executable(cmd, path)
|
||||
if not result:
|
||||
raise ValueError('command "{}" not found'.format(cmd))
|
||||
@@ -420,6 +420,7 @@ def mkstemp(
|
||||
open_kwargs=None, # type: Optional[Dict[Text, Any]]
|
||||
text=True, # type: bool
|
||||
name_only=False, # type: bool
|
||||
mode=None, # type: str
|
||||
*args,
|
||||
**kwargs):
|
||||
# type: (...) -> Union[(IO[AnyStr], Text), Text]
|
||||
@@ -429,12 +430,14 @@ def mkstemp(
|
||||
:param open_kwargs: keyword arguments for ``io.open``
|
||||
:param text: open in text mode
|
||||
:param name_only: close the file and return its name
|
||||
:param mode: open file mode
|
||||
:param args: tempfile.mkstemp args
|
||||
:param kwargs: tempfile.mkstemp kwargs
|
||||
"""
|
||||
fd, name = tempfile.mkstemp(text=text, *args, **kwargs)
|
||||
mode = 'w+'
|
||||
if not text:
|
||||
if not mode:
|
||||
mode = 'w+'
|
||||
if not text and 'b' not in mode:
|
||||
mode += 'b'
|
||||
if name_only:
|
||||
os.close(fd)
|
||||
@@ -540,6 +543,36 @@ def convert_cuda_version_to_int_10_base_str(cuda_version):
|
||||
return str(int(float(cuda_version)*10))
|
||||
|
||||
|
||||
def get_python_version(python_executable, log=None):
|
||||
from clearml_agent.helper.process import Argv
|
||||
try:
|
||||
output = Argv(python_executable, "--version").get_output(
|
||||
stderr=subprocess.STDOUT
|
||||
)
|
||||
except subprocess.CalledProcessError as ex:
|
||||
# Windows returns 9009 code and suggests to install Python from Windows Store
|
||||
if is_windows_platform() and ex.returncode == 9009:
|
||||
if log:
|
||||
log.debug("version not found: {}".format(ex))
|
||||
else:
|
||||
if log:
|
||||
log.warning("error getting %s version: %s", python_executable, ex)
|
||||
return None
|
||||
except FileNotFoundError as ex:
|
||||
if log:
|
||||
log.debug("version not found: {}".format(ex))
|
||||
return None
|
||||
|
||||
match = re.search(r"Python ({}(?:\.\d+)*)".format(r"\d+"), output)
|
||||
if match:
|
||||
if log:
|
||||
log.debug("Found: {}".format(python_executable))
|
||||
# only return major.minor version
|
||||
return ".".join(str(match.group(1)).split(".")[:2])
|
||||
|
||||
return None
|
||||
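Usage sketch (illustrative addition, not part of the diff): the new get_python_version() helper shells out to "<python> --version" and returns only the major.minor part; the clearml_agent.helper.base import path is assumed from the surrounding hunks.

from clearml_agent.helper.base import get_python_version  # assumed module path

print(get_python_version("python3"))          # e.g. "3.10" - only major.minor is returned
print(get_python_version("/no/such/python"))  # None - FileNotFoundError is swallowed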
|
||||
|
||||
class NonStrictAttrs(object):
|
||||
|
||||
@classmethod
|
||||
|
||||
@@ -17,6 +17,30 @@ if TYPE_CHECKING:
|
||||
from clearml_agent.session import Session
|
||||
|
||||
|
||||
def sanitize_urls(s: str) -> Tuple[str, bool]:
|
||||
"""
|
||||
Replaces passwords in URLs with asterisks.
|
||||
Returns the sanitized string and a boolean indicating whether sanitation was performed.
|
||||
"""
|
||||
regex = re.compile("^([^:]*:)[^@]+(.*)$")
|
||||
tokens = re.split(r"\s", s)
|
||||
changed = False
|
||||
for k in range(len(tokens)):
|
||||
if "@" in tokens[k]:
|
||||
res = urlparse(tokens[k])
|
||||
if regex.match(res.netloc):
|
||||
changed = True
|
||||
tokens[k] = urlunparse((
|
||||
res.scheme,
|
||||
regex.sub("\\1********\\2", res.netloc),
|
||||
res.path,
|
||||
res.params,
|
||||
res.query,
|
||||
res.fragment
|
||||
))
|
||||
return " ".join(tokens) if changed else s, changed
|
||||
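Usage sketch for the sanitize_urls() helper factored out above (illustrative only, not part of the diff; the docker_args module path is an assumption):

from clearml_agent.helper.docker_args import sanitize_urls  # assumed module path

text, changed = sanitize_urls("git clone https://user:secret@github.com/org/repo.git")
print(text)     # git clone https://user:********@github.com/org/repo.git
print(changed)  # True - credentials were masked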
|
||||
|
||||
class DockerArgsSanitizer:
|
||||
@classmethod
|
||||
def sanitize_docker_command(cls, session, docker_command):
|
||||
@@ -62,11 +86,11 @@ class DockerArgsSanitizer:
|
||||
elif key in keys:
|
||||
val = "********"
|
||||
elif parse_embedded_urls:
|
||||
val = cls._sanitize_urls(val)[0]
|
||||
val = sanitize_urls(val)[0]
|
||||
result[i + 1] = "{}={}".format(key, val)
|
||||
skip_next = True
|
||||
elif parse_embedded_urls and not item.startswith("-"):
|
||||
item, changed = cls._sanitize_urls(item)
|
||||
item, changed = sanitize_urls(item)
|
||||
if changed:
|
||||
result[i] = item
|
||||
except (KeyError, TypeError):
|
||||
@@ -75,22 +99,71 @@ class DockerArgsSanitizer:
|
||||
return result
|
||||
|
||||
@staticmethod
|
||||
def _sanitize_urls(s: str) -> Tuple[str, bool]:
|
||||
""" Replaces passwords in URLs with asterisks """
|
||||
regex = re.compile("^([^:]*:)[^@]+(.*)$")
|
||||
tokens = re.split(r"\s", s)
|
||||
changed = False
|
||||
for k in range(len(tokens)):
|
||||
if "@" in tokens[k]:
|
||||
res = urlparse(tokens[k])
|
||||
if regex.match(res.netloc):
|
||||
changed = True
|
||||
tokens[k] = urlunparse((
|
||||
res.scheme,
|
||||
regex.sub("\\1********\\2", res.netloc),
|
||||
res.path,
|
||||
res.params,
|
||||
res.query,
|
||||
res.fragment
|
||||
))
|
||||
return " ".join(tokens) if changed else s, changed
|
||||
def get_list_of_switches(docker_args: List[str]) -> List[str]:
|
||||
args = []
|
||||
for token in docker_args:
|
||||
if token.strip().startswith("-"):
|
||||
args += [token.strip().split("=")[0].lstrip("-")]
|
||||
|
||||
return args
|
||||
|
||||
@staticmethod
|
||||
def filter_switches(docker_args: List[str], exclude_switches: List[str]) -> List[str]:
|
||||
# shortcut if we are sure we have no matches
|
||||
if (not exclude_switches or
|
||||
not any("-{}".format(s) in " ".join(docker_args) for s in exclude_switches)):
|
||||
return docker_args
|
||||
|
||||
args = []
|
||||
in_switch_args = True
|
||||
for token in docker_args:
|
||||
if token.strip().startswith("-"):
|
||||
if "=" in token:
|
||||
switch = token.strip().split("=")[0]
|
||||
in_switch_args = False
|
||||
else:
|
||||
switch = token
|
||||
in_switch_args = True
|
||||
|
||||
if switch.lstrip("-") in exclude_switches:
|
||||
# if in excluded, skip the switch and following arguments
|
||||
in_switch_args = False
|
||||
else:
|
||||
args += [token]
|
||||
|
||||
elif in_switch_args:
|
||||
args += [token]
|
||||
else:
|
||||
# this is the switch arguments we need to skip
|
||||
pass
|
||||
|
||||
return args
|
||||
|
||||
@staticmethod
|
||||
def merge_docker_args(config, task_docker_arguments: List[str], extra_docker_arguments: List[str]) -> List[str]:
|
||||
base_cmd = []
|
||||
# currently only resolving --network, --ipc, --privileged
|
||||
override_switches = config.get(
|
||||
"agent.protected_docker_extra_args",
|
||||
["privileged", "security-opt", "network", "ipc"]
|
||||
)
|
||||
|
||||
if config.get("agent.docker_args_extra_precedes_task", True):
|
||||
switches = []
|
||||
if extra_docker_arguments:
|
||||
switches = DockerArgsSanitizer.get_list_of_switches(extra_docker_arguments)
|
||||
switches = list(set(switches) & set(override_switches))
|
||||
base_cmd += [str(a) for a in extra_docker_arguments if a]
|
||||
if task_docker_arguments:
|
||||
docker_arguments = DockerArgsSanitizer.filter_switches(task_docker_arguments, switches)
|
||||
base_cmd += [a for a in docker_arguments if a]
|
||||
else:
|
||||
switches = []
|
||||
if task_docker_arguments:
|
||||
switches = DockerArgsSanitizer.get_list_of_switches(task_docker_arguments)
|
||||
switches = list(set(switches) & set(override_switches))
|
||||
base_cmd += [a for a in task_docker_arguments if a]
|
||||
if extra_docker_arguments:
|
||||
extra_docker_arguments = DockerArgsSanitizer.filter_switches(extra_docker_arguments, switches)
|
||||
base_cmd += [a for a in extra_docker_arguments if a]
|
||||
return base_cmd
|
||||
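Illustrative sketch (not part of the diff) of the switch filtering used by merge_docker_args(): protected switches coming from the agent-side extra arguments drop the same switches supplied by the task. The DockerArgsSanitizer module path is assumed.

from clearml_agent.helper.docker_args import DockerArgsSanitizer  # assumed module path

extra_args = ["--network", "host", "-e", "FOO=bar"]
task_args = ["--network", "bridge", "--memory", "2g"]

# switches present in the agent-side extra args: ['network', 'e']
print(DockerArgsSanitizer.get_list_of_switches(extra_args))

# the task's --network (and its value) is dropped, the rest is kept
print(DockerArgsSanitizer.filter_switches(task_args, ["network"]))  # ['--memory', '2g']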
|
||||
clearml_agent/helper/environment/__init__.py (new file, 8 lines)
@@ -0,0 +1,8 @@
from .entry import Entry, NotSet
from .environment import EnvEntry

__all__ = [
    'Entry',
    'NotSet',
    'EnvEntry',
]
clearml_agent/helper/environment/converters.py (new file, 86 lines)
@@ -0,0 +1,86 @@
import base64
from typing import Union, Optional, Any, TypeVar, Callable, Tuple

import six

try:
    from typing import Text
except ImportError:
    # windows conda-less hack
    Text = Any


ConverterType = TypeVar("ConverterType", bound=Callable[[Any], Any])


def base64_to_text(value):
    # type: (Any) -> Text
    return base64.b64decode(value).decode("utf-8")


def text_to_int(value, default=0):
    # type: (Any, int) -> int
    try:
        return int(value)
    except (ValueError, TypeError):
        return default


def text_to_bool(value):
    # type: (Text) -> bool
    return bool(strtobool(value))


def safe_text_to_bool(value):
    # type: (Text) -> bool
    try:
        return text_to_bool(value)
    except ValueError:
        return bool(value)


def any_to_bool(value):
    # type: (Optional[Union[int, float, Text]]) -> bool
    if isinstance(value, six.text_type):
        return text_to_bool(value)
    return bool(value)


# noinspection PyIncorrectDocstring
def or_(*converters, **kwargs):
    # type: (ConverterType, Tuple[Exception, ...]) -> ConverterType
    """
    Wrapper that implements an "optional converter" pattern. Allows specifying a converter
    for which a set of exceptions is ignored (and the original value is returned)
    :param converters: A converter callable
    :param exceptions: A tuple of exception types to ignore
    """
    # noinspection PyUnresolvedReferences
    exceptions = kwargs.get("exceptions", (ValueError, TypeError))

    def wrapper(value):
        for converter in converters:
            try:
                return converter(value)
            except exceptions:
                pass
        return value

    return wrapper


def strtobool(val):
    """Convert a string representation of truth to true (1) or false (0).

    True values are 'y', 'yes', 't', 'true', 'on', and '1'; false values
    are 'n', 'no', 'f', 'false', 'off', and '0'. Raises ValueError if
    'val' is anything else.
    """
    val = val.lower()
    if val in ('y', 'yes', 't', 'true', 'on', '1'):
        return 1
    elif val in ('n', 'no', 'f', 'false', 'off', '0'):
        return 0
    else:
        raise ValueError("invalid truth value %r" % (val,))
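Usage sketch for the new converters module (illustrative only, not part of the diff):

from clearml_agent.helper.environment.converters import any_to_bool, or_, strtobool

assert strtobool("Yes") == 1 and strtobool("off") == 0
assert any_to_bool("true") is True and any_to_bool(0) is False

# or_() tries each converter in turn and falls back to the original value on error
to_int_or_bool = or_(int, any_to_bool)
assert to_int_or_bool("7") == 7        # int() succeeds
assert to_int_or_bool("yes") is True   # int() raises ValueError, any_to_bool() handles it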
clearml_agent/helper/environment/entry.py (new file, 134 lines)
@@ -0,0 +1,134 @@
|
||||
import abc
|
||||
from typing import Optional, Any, Tuple, Callable, Dict
|
||||
|
||||
import six
|
||||
|
||||
from .converters import any_to_bool
|
||||
|
||||
try:
|
||||
from typing import Text
|
||||
except ImportError:
|
||||
# windows conda-less hack
|
||||
Text = Any
|
||||
|
||||
|
||||
NotSet = object()
|
||||
|
||||
Converter = Callable[[Any], Any]
|
||||
|
||||
|
||||
@six.add_metaclass(abc.ABCMeta)
|
||||
class Entry(object):
|
||||
"""
|
||||
Configuration entry definition
|
||||
"""
|
||||
|
||||
def default_conversions(self):
|
||||
# type: () -> Dict[Any, Converter]
|
||||
|
||||
if self.lstrip and self.rstrip:
|
||||
|
||||
def str_convert(s):
|
||||
return six.text_type(s).strip()
|
||||
|
||||
elif self.lstrip:
|
||||
|
||||
def str_convert(s):
|
||||
return six.text_type(s).lstrip()
|
||||
|
||||
elif self.rstrip:
|
||||
|
||||
def str_convert(s):
|
||||
return six.text_type(s).rstrip()
|
||||
|
||||
else:
|
||||
|
||||
def str_convert(s):
|
||||
return six.text_type(s)
|
||||
|
||||
return {
|
||||
bool: lambda x: any_to_bool(x.strip()),
|
||||
six.text_type: str_convert,
|
||||
}
|
||||
|
||||
def __init__(self, key, *more_keys, **kwargs):
|
||||
# type: (Text, Text, Any) -> None
|
||||
"""
|
||||
:rtype: object
|
||||
:param key: Entry's key (at least one).
|
||||
:param more_keys: More alternate keys for this entry.
|
||||
:param type: Value type. If provided, will be used choosing a default conversion or
|
||||
(if none exists) for casting the environment value.
|
||||
:param converter: Value converter. If provided, will be used to convert the environment value.
|
||||
:param default: Default value. If provided, will be used as the default value on calls to get() and get_pair()
|
||||
in case no value is found for any key and no specific default value was provided in the call.
|
||||
Default value is None.
|
||||
:param help: Help text describing this entry
|
||||
"""
|
||||
self.keys = (key,) + more_keys
|
||||
self.type = kwargs.pop("type", six.text_type)
|
||||
self.converter = kwargs.pop("converter", None)
|
||||
self.default = kwargs.pop("default", None)
|
||||
self.help = kwargs.pop("help", None)
|
||||
self.lstrip = kwargs.pop("lstrip", True)
|
||||
self.rstrip = kwargs.pop("rstrip", True)
|
||||
|
||||
def __str__(self):
|
||||
return str(self.key)
|
||||
|
||||
@property
|
||||
def key(self):
|
||||
return self.keys[0]
|
||||
|
||||
def convert(self, value, converter=None):
|
||||
# type: (Any, Converter) -> Optional[Any]
|
||||
converter = converter or self.converter
|
||||
if not converter:
|
||||
converter = self.default_conversions().get(self.type, self.type)
|
||||
return converter(value)
|
||||
|
||||
def get_pair(self, default=NotSet, converter=None, value_cb=None):
|
||||
# type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Tuple[Text, Any]]
|
||||
for key in self.keys:
|
||||
value = self._get(key)
|
||||
if value is NotSet:
|
||||
continue
|
||||
try:
|
||||
value = self.convert(value, converter)
|
||||
except Exception as ex:
|
||||
self.error("invalid value {key}={value}: {ex}".format(**locals()))
|
||||
break
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
if value_cb:
|
||||
value_cb(key, value)
|
||||
except Exception:
|
||||
pass
|
||||
return key, value
|
||||
|
||||
result = self.default if default is NotSet else default
|
||||
return self.key, result
|
||||
|
||||
def get(self, default=NotSet, converter=None, value_cb=None):
|
||||
# type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Any]
|
||||
return self.get_pair(default=default, converter=converter, value_cb=value_cb)[1]
|
||||
|
||||
def set(self, value):
|
||||
# type: (Any, Any) -> (Text, Any)
|
||||
# key, _ = self.get_pair(default=None, converter=None)
|
||||
for k in self.keys:
|
||||
self._set(k, str(value))
|
||||
|
||||
def _set(self, key, value):
|
||||
# type: (Text, Text) -> None
|
||||
pass
|
||||
|
||||
@abc.abstractmethod
|
||||
def _get(self, key):
|
||||
# type: (Text) -> Any
|
||||
pass
|
||||
|
||||
@abc.abstractmethod
|
||||
def error(self, message):
|
||||
# type: (Text) -> None
|
||||
pass
|
||||
clearml_agent/helper/environment/environment.py (new file, 28 lines)
@@ -0,0 +1,28 @@
from os import getenv, environ

from .converters import text_to_bool
from .entry import Entry, NotSet


class EnvEntry(Entry):
    def default_conversions(self):
        conversions = super(EnvEntry, self).default_conversions().copy()
        conversions[bool] = lambda x: text_to_bool(x.strip())
        return conversions

    def pop(self):
        for k in self.keys:
            environ.pop(k, None)

    def _get(self, key):
        value = getenv(key, "")
        return value or NotSet

    def _set(self, key, value):
        environ[key] = value

    def __str__(self):
        return "env:{}".format(super(EnvEntry, self).__str__())

    def error(self, message):
        print("Environment configuration: {}".format(message))
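Usage sketch for EnvEntry (illustrative only, not part of the diff; the environment variable name is hypothetical):

import os
from clearml_agent.helper.environment import EnvEntry

example_flag = EnvEntry("CLEARML_EXAMPLE_FLAG", type=bool, default=False)  # hypothetical variable

os.environ["CLEARML_EXAMPLE_FLAG"] = "true"
print(example_flag.get())       # True
print(example_flag.get_pair())  # ('CLEARML_EXAMPLE_FLAG', True)

example_flag.pop()              # removes the key(s) from os.environ
print(example_flag.get())       # False - falls back to the default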
@@ -15,10 +15,8 @@ from __future__ import print_function
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import json
|
||||
import os.path
|
||||
import platform
|
||||
import sys
|
||||
import time
|
||||
from datetime import datetime
|
||||
from typing import Optional
|
||||
|
||||
@@ -59,6 +57,21 @@ class GPUStat(object):
|
||||
"""
|
||||
return self.entry['uuid']
|
||||
|
||||
@property
|
||||
def mig_index(self):
|
||||
"""
|
||||
Returns the index of the MIG partition (as in nvidia-smi).
|
||||
"""
|
||||
return self.entry.get("mig_index")
|
||||
|
||||
@property
|
||||
def mig_uuid(self):
|
||||
"""
|
||||
Returns the uuid of the MIG partition returned by nvidia-smi when running in MIG mode,
|
||||
e.g. MIG-12345678-abcd-abcd-uuid-123456abcdef
|
||||
"""
|
||||
return self.entry.get("mig_uuid")
|
||||
|
||||
@property
|
||||
def name(self):
|
||||
"""
|
||||
@@ -163,14 +176,16 @@ class GPUStatCollection(object):
|
||||
_initialized = False
|
||||
_device_count = None
|
||||
_gpu_device_info = {}
|
||||
_mig_device_info = {}
|
||||
|
||||
def __init__(self, gpu_list, driver_version=None):
|
||||
def __init__(self, gpu_list, driver_version=None, driver_cuda_version=None):
|
||||
self.gpus = gpu_list
|
||||
|
||||
# attach additional system information
|
||||
self.hostname = platform.node()
|
||||
self.query_time = datetime.now()
|
||||
self.driver_version = driver_version
|
||||
self.driver_cuda_version = driver_cuda_version
|
||||
|
||||
@staticmethod
|
||||
def clean_processes():
|
||||
@@ -181,17 +196,18 @@ class GPUStatCollection(object):
|
||||
@staticmethod
|
||||
def new_query(shutdown=False, per_process_stats=False, get_driver_info=False):
|
||||
"""Query the information of all the GPUs on local machine"""
|
||||
|
||||
initialized = False
|
||||
if not GPUStatCollection._initialized:
|
||||
N.nvmlInit()
|
||||
GPUStatCollection._initialized = True
|
||||
initialized = True
|
||||
|
||||
def _decode(b):
|
||||
if isinstance(b, bytes):
|
||||
return b.decode() # for python3, to unicode
|
||||
return b
|
||||
|
||||
def get_gpu_info(index, handle):
|
||||
def get_gpu_info(index, handle, is_mig=False):
|
||||
"""Get one GPU information specified by nvml handle"""
|
||||
|
||||
def get_process_info(nv_process):
|
||||
@@ -200,10 +216,10 @@ class GPUStatCollection(object):
|
||||
if nv_process.pid not in GPUStatCollection.global_processes:
|
||||
GPUStatCollection.global_processes[nv_process.pid] = \
|
||||
psutil.Process(pid=nv_process.pid)
|
||||
ps_process = GPUStatCollection.global_processes[nv_process.pid]
|
||||
process['pid'] = nv_process.pid
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
# ps_process = GPUStatCollection.global_processes[nv_process.pid]
|
||||
# we do not actually use these, so no point in collecting them
|
||||
# process['username'] = ps_process.username()
|
||||
# # cmdline returns full path;
|
||||
@@ -227,12 +243,14 @@ class GPUStatCollection(object):
|
||||
pass
|
||||
return process
|
||||
|
||||
if not GPUStatCollection._gpu_device_info.get(index):
|
||||
device_info = GPUStatCollection._mig_device_info if is_mig else GPUStatCollection._gpu_device_info
|
||||
|
||||
if not device_info.get(index):
|
||||
name = _decode(N.nvmlDeviceGetName(handle))
|
||||
uuid = _decode(N.nvmlDeviceGetUUID(handle))
|
||||
GPUStatCollection._gpu_device_info[index] = (name, uuid)
|
||||
device_info[index] = (name, uuid)
|
||||
|
||||
name, uuid = GPUStatCollection._gpu_device_info[index]
|
||||
name, uuid = device_info[index]
|
||||
|
||||
try:
|
||||
temperature = N.nvmlDeviceGetTemperature(
|
||||
@@ -286,11 +304,11 @@ class GPUStatCollection(object):
|
||||
for nv_process in nv_comp_processes + nv_graphics_processes:
|
||||
try:
|
||||
process = get_process_info(nv_process)
|
||||
processes.append(process)
|
||||
except psutil.NoSuchProcess:
|
||||
# TODO: add some reminder for NVML broken context
|
||||
# e.g. nvidia-smi reset or reboot the system
|
||||
pass
|
||||
process = None
|
||||
processes.append(process)
|
||||
|
||||
# we do not actually use these, so no point in collecting them
|
||||
# # TODO: Do not block if full process info is not requested
|
||||
@@ -314,7 +332,7 @@ class GPUStatCollection(object):
|
||||
# Convert bytes into MBytes
|
||||
'memory.used': memory.used // MB if memory else None,
|
||||
'memory.total': memory.total // MB if memory else None,
|
||||
'processes': processes,
|
||||
'processes': None if (processes and all(p is None for p in processes)) else processes
|
||||
}
|
||||
if per_process_stats:
|
||||
GPUStatCollection.clean_processes()
|
||||
@@ -328,8 +346,36 @@ class GPUStatCollection(object):
|
||||
for index in range(GPUStatCollection._device_count):
|
||||
handle = N.nvmlDeviceGetHandleByIndex(index)
|
||||
gpu_info = get_gpu_info(index, handle)
|
||||
gpu_stat = GPUStat(gpu_info)
|
||||
gpu_list.append(gpu_stat)
|
||||
mig_cnt = 0
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
mig_cnt = N.nvmlDeviceGetMaxMigDeviceCount(handle)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
if mig_cnt <= 0:
|
||||
gpu_list.append(GPUStat(gpu_info))
|
||||
continue
|
||||
|
||||
got_mig_info = False
|
||||
for mig_index in range(mig_cnt):
|
||||
try:
|
||||
mig_handle = N.nvmlDeviceGetMigDeviceHandleByIndex(handle, mig_index)
|
||||
mig_info = get_gpu_info(mig_index, mig_handle, is_mig=True)
|
||||
mig_info["mig_name"] = mig_info["name"]
|
||||
mig_info["name"] = gpu_info["name"]
|
||||
mig_info["mig_index"] = mig_info["index"]
|
||||
mig_info["mig_uuid"] = mig_info["uuid"]
|
||||
mig_info["index"] = gpu_info["index"]
|
||||
mig_info["uuid"] = gpu_info["uuid"]
|
||||
mig_info["temperature.gpu"] = gpu_info["temperature.gpu"]
|
||||
mig_info["fan.speed"] = gpu_info["fan.speed"]
|
||||
gpu_list.append(GPUStat(mig_info))
|
||||
got_mig_info = True
|
||||
except Exception as e:
|
||||
pass
|
||||
if not got_mig_info:
|
||||
gpu_list.append(GPUStat(gpu_info))
|
||||
|
||||
# 2. additional info (driver version, etc).
|
||||
if get_driver_info:
|
||||
@@ -337,15 +383,32 @@ class GPUStatCollection(object):
|
||||
driver_version = _decode(N.nvmlSystemGetDriverVersion())
|
||||
except N.NVMLError:
|
||||
driver_version = None # N/A
|
||||
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
cuda_driver_version = str(N.nvmlSystemGetCudaDriverVersion())
|
||||
except BaseException:
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
cuda_driver_version = str(N.nvmlSystemGetCudaDriverVersion_v2())
|
||||
except BaseException:
|
||||
cuda_driver_version = None
|
||||
if cuda_driver_version:
|
||||
try:
|
||||
cuda_driver_version = '{}.{}'.format(
|
||||
int(cuda_driver_version)//1000, (int(cuda_driver_version) % 1000)//10)
|
||||
except (ValueError, TypeError):
|
||||
pass
|
||||
else:
|
||||
driver_version = None
|
||||
cuda_driver_version = None
|
||||
|
||||
# no need to shutdown:
|
||||
if shutdown:
|
||||
if shutdown and initialized:
|
||||
N.nvmlShutdown()
|
||||
GPUStatCollection._initialized = False
|
||||
|
||||
return GPUStatCollection(gpu_list, driver_version=driver_version)
|
||||
return GPUStatCollection(gpu_list, driver_version=driver_version, driver_cuda_version=cuda_driver_version)
|
||||
|
||||
def __len__(self):
|
||||
return len(self.gpus)
|
||||
|
||||
File diff suppressed because it is too large
@@ -13,16 +13,17 @@ from .locks import FileLock
|
||||
|
||||
class FolderCache(object):
|
||||
_lock_filename = '.clearml.lock'
|
||||
_lock_timeout_seconds = 30
|
||||
_def_lock_timeout_seconds = 30
|
||||
_temp_entry_prefix = '_temp.'
|
||||
|
||||
def __init__(self, cache_folder, max_cache_entries=5, min_free_space_gb=None):
|
||||
def __init__(self, cache_folder, max_cache_entries=5, min_free_space_gb=None, lock_timeout_seconds=None):
|
||||
self._cache_folder = Path(os.path.expandvars(cache_folder)).expanduser().absolute()
|
||||
self._cache_folder.mkdir(parents=True, exist_ok=True)
|
||||
self._max_cache_entries = max_cache_entries
|
||||
self._last_copied_entry_folder = None
|
||||
self._min_free_space_gb = min_free_space_gb if min_free_space_gb and min_free_space_gb > 0 else None
|
||||
self._lock = FileLock((self._cache_folder / self._lock_filename).as_posix())
|
||||
self._lock_timeout_seconds = float(lock_timeout_seconds or self._def_lock_timeout_seconds)
|
||||
|
||||
def get_cache_folder(self):
|
||||
# type: () -> Path
|
||||
@@ -46,9 +47,11 @@ class FolderCache(object):
|
||||
# lock so we make sure no one deletes it before we copy it
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
self._lock.acquire(timeout=self._lock_timeout_seconds)
|
||||
self._lock.acquire(timeout=self._lock_timeout_seconds, readonly=True)
|
||||
except BaseException as ex:
|
||||
warning('Could not lock cache folder {}: {}'.format(self._cache_folder, ex))
|
||||
import traceback
|
||||
warning('DEBUG: Exception {}: {}'.format(ex, traceback.format_exc()))
|
||||
return None
|
||||
|
||||
src = None
|
||||
@@ -115,6 +118,8 @@ class FolderCache(object):
|
||||
self._lock.acquire(timeout=self._lock_timeout_seconds)
|
||||
except BaseException as ex:
|
||||
warning('Could not lock cache folder {}: {}'.format(self._cache_folder, ex))
|
||||
import traceback
|
||||
warning('DEBUG: Exception {}: {}'.format(ex, traceback.format_exc()))
|
||||
# failed locking do nothing
|
||||
return True
|
||||
keys = sorted(list(set(keys) | set(cached_keys)))
|
||||
@@ -194,16 +199,23 @@ class FolderCache(object):
|
||||
if cache_folder.is_dir() and not cache_folder.name.startswith(self._temp_entry_prefix)]
|
||||
folder_entries = sorted(folder_entries, key=lambda x: x[1], reverse=True)
|
||||
|
||||
number_of_entries_to_keep = self._max_cache_entries - 1 \
|
||||
if max_cache_entries is None else max(0, int(max_cache_entries))
|
||||
|
||||
# if nothing to do, leave
|
||||
if not folder_entries[number_of_entries_to_keep:]:
|
||||
return
|
||||
|
||||
# lock so we make sure no one deletes it before we copy it
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
self._lock.acquire(timeout=self._lock_timeout_seconds)
|
||||
except BaseException as ex:
|
||||
warning('Could not lock cache folder {}: {}'.format(self._cache_folder, ex))
|
||||
import traceback
|
||||
warning('DEBUG: Exception {}: {}'.format(ex, traceback.format_exc()))
|
||||
return
|
||||
|
||||
number_of_entries_to_keep = self._max_cache_entries - 1 \
|
||||
if max_cache_entries is None else max(0, int(max_cache_entries))
|
||||
for folder, ts in folder_entries[number_of_entries_to_keep:]:
|
||||
try:
|
||||
shutil.rmtree(folder.as_posix(), ignore_errors=True)
|
||||
|
||||
@@ -32,7 +32,7 @@ def open_atomic(filename, binary=True):
|
||||
... os.remove(filename)
|
||||
|
||||
>>> with open_atomic(filename) as fh:
|
||||
... written = fh.write(b'test')
|
||||
... written = fh.write(b"test")
|
||||
>>> assert os.path.exists(filename)
|
||||
>>> os.remove(filename)
|
||||
|
||||
@@ -67,7 +67,7 @@ class FileLock(object):
|
||||
def __init__(
|
||||
self, filename, mode='a', timeout=DEFAULT_TIMEOUT,
|
||||
check_interval=DEFAULT_CHECK_INTERVAL, fail_when_locked=False,
|
||||
flags=LOCK_METHOD, **file_open_kwargs):
|
||||
**file_open_kwargs):
|
||||
"""Lock manager with build-in timeout
|
||||
|
||||
filename -- filename
|
||||
@@ -101,11 +101,12 @@ class FileLock(object):
|
||||
self.timeout = timeout
|
||||
self.check_interval = check_interval
|
||||
self.fail_when_locked = fail_when_locked
|
||||
self.flags = flags
|
||||
self.flags_read = constants.LOCK_SH | constants.LOCK_NB
|
||||
self.flags_write = constants.LOCK_EX | constants.LOCK_NB
|
||||
self.file_open_kwargs = file_open_kwargs
|
||||
|
||||
def acquire(
|
||||
self, timeout=None, check_interval=None, fail_when_locked=None):
|
||||
self, timeout=None, check_interval=None, fail_when_locked=None, readonly=False):
|
||||
"""Acquire the locked filehandle"""
|
||||
if timeout is None:
|
||||
timeout = self.timeout
|
||||
@@ -123,12 +124,13 @@ class FileLock(object):
|
||||
if fh:
|
||||
return fh
|
||||
|
||||
# Get a new filehandler
|
||||
fh = self._get_fh()
|
||||
_fh = None
|
||||
try:
|
||||
# Get a new filehandler
|
||||
_fh = self._get_fh()
|
||||
# Try to lock
|
||||
fh = self._get_lock(fh)
|
||||
except exceptions.LockException as exception:
|
||||
fh = self._get_lock(_fh, readonly=readonly)
|
||||
except (exceptions.LockException, IOError) as exception:
|
||||
# Try till the timeout has passed
|
||||
timeoutend = current_time() + timeout
|
||||
while timeoutend > current_time():
|
||||
@@ -144,16 +146,18 @@ class FileLock(object):
|
||||
raise exceptions.AlreadyLocked(exception)
|
||||
|
||||
else: # pragma: no cover
|
||||
if not _fh:
|
||||
_fh = self._get_fh()
|
||||
# We've got the lock
|
||||
fh = self._get_lock(fh)
|
||||
fh = self._get_lock(_fh, readonly=readonly)
|
||||
break
|
||||
|
||||
except exceptions.LockException:
|
||||
except (exceptions.LockException, IOError):
|
||||
pass
|
||||
|
||||
else:
|
||||
# We got a timeout... reraising
|
||||
raise exceptions.LockException(exception)
|
||||
raise exceptions.LockTimeout(exception)
|
||||
|
||||
# Prepare the filehandle (truncate if needed)
|
||||
fh = self._prepare_fh(fh)
|
||||
@@ -176,16 +180,37 @@ class FileLock(object):
|
||||
pass
|
||||
self.fh = None
|
||||
|
||||
def delete_lock_file(self):
|
||||
# type: () -> bool
|
||||
"""
|
||||
Remove the local file used for locking (fail if file is locked)
|
||||
|
||||
:return: True is successful
|
||||
"""
|
||||
if self.fh:
|
||||
return False
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
os.unlink(path=self.filename)
|
||||
except BaseException:
|
||||
return False
|
||||
return True
|
||||
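A minimal sketch of the shared (read-only) locking flow added above, as used by the venvs cache (illustrative only, not part of the diff; the locks module path and the unchanged release() method are assumptions):

from clearml_agent.helper.os.locks import FileLock  # assumed module path

lock = FileLock("/tmp/example.clearml.lock")
fh = lock.acquire(timeout=30, readonly=True)  # shared lock (LOCK_SH), multiple readers allowed
# ... read the cached folder ...
lock.release()                                # assumed unchanged API from the original class
print(lock.delete_lock_file())                # True, unless this instance still holds the lock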
|
||||
def _get_fh(self):
|
||||
"""Get a new filehandle"""
|
||||
# Create the parent directory if it doesn't exist
|
||||
path, name = os.path.split(self.filename)
|
||||
if path and not os.path.isdir(path): # pragma: no cover
|
||||
os.makedirs(path, exist_ok=True)
|
||||
|
||||
return open(self.filename, self.mode, **self.file_open_kwargs)
|
||||
|
||||
def _get_lock(self, fh):
|
||||
def _get_lock(self, fh, readonly=False):
|
||||
"""
|
||||
Try to lock the given filehandle
|
||||
|
||||
returns LockException if it fails"""
|
||||
lock(fh, self.flags)
|
||||
lock(fh, self.flags_read if readonly else self.flags_write)
|
||||
return fh
|
||||
|
||||
def _prepare_fh(self, fh):
|
||||
|
||||
@@ -20,6 +20,9 @@ class exceptions:
|
||||
class FileToLarge(BaseLockException):
|
||||
pass
|
||||
|
||||
class LockTimeout(BaseLockException):
|
||||
pass
|
||||
|
||||
|
||||
class constants:
|
||||
# The actual tests will execute the code anyhow so the following code can
|
||||
@@ -185,6 +188,10 @@ elif os.name == 'posix': # pragma: no cover
|
||||
# The exception code varies on different systems so we'll catch
|
||||
# every IO error
|
||||
raise exceptions.LockException(exc_value, fh=file_)
|
||||
except BaseException as ex:
|
||||
# DEBUG
|
||||
print("Uncaught [{}] Exception [{}] in portalock: {}".format(locking_exceptions, type(ex), ex))
|
||||
raise
|
||||
|
||||
def unlock(file_):
|
||||
fcntl.flock(file_.fileno(), constants.LOCK_UN)
|
||||
|
||||
@@ -28,9 +28,13 @@ class PackageManager(object):
|
||||
_config_cache_folder = 'agent.venvs_cache.path'
|
||||
_config_cache_max_entries = 'agent.venvs_cache.max_entries'
|
||||
_config_cache_free_space_threshold = 'agent.venvs_cache.free_space_threshold_gb'
|
||||
_config_cache_lock_timeout = 'agent.venvs_cache.lock_timeout'
|
||||
_config_pip_legacy_resolver = 'agent.package_manager.pip_legacy_resolver'
|
||||
|
||||
def __init__(self):
|
||||
self._cache_manager = None
|
||||
self._existing_packages = []
|
||||
self._base_install_flags = []
|
||||
|
||||
@abc.abstractproperty
|
||||
def bin(self):
|
||||
@@ -78,6 +82,23 @@ class PackageManager(object):
|
||||
# type: (Iterable[Text]) -> None
|
||||
pass
|
||||
|
||||
def add_extra_install_flags(self, extra_flags): # type: (List[str]) -> None
|
||||
if extra_flags:
|
||||
extra_flags = [
|
||||
e for e in extra_flags if e not in list(self._base_install_flags)
|
||||
]
|
||||
self._base_install_flags = list(self._base_install_flags) + list(extra_flags)
|
||||
|
||||
def remove_extra_install_flags(self, extra_flags): # type: (List[str]) -> bool
|
||||
if extra_flags:
|
||||
_base_install_flags = [
|
||||
e for e in self._base_install_flags if e not in list(extra_flags)
|
||||
]
|
||||
if self._base_install_flags != _base_install_flags:
|
||||
self._base_install_flags = _base_install_flags
|
||||
return True
|
||||
return False
|
||||
|
||||
def upgrade_pip(self):
|
||||
result = self._install(
|
||||
*select_for_platform(
|
||||
@@ -86,19 +107,58 @@ class PackageManager(object):
|
||||
),
|
||||
"--upgrade"
|
||||
)
|
||||
packages = self.run_with_env(('list',), output=True).splitlines()
|
||||
# p.split is ('pip', 'x.y.z')
|
||||
pip = [p.split() for p in packages if len(p.split()) == 2 and p.split()[0] == 'pip']
|
||||
if pip:
|
||||
# noinspection PyBroadException
|
||||
|
||||
packages = (self.freeze(freeze_full_environment=True) or dict()).get("pip")
|
||||
if packages:
|
||||
from clearml_agent.helper.package.requirements import RequirementsManager
|
||||
from .requirements import MarkerRequirement, SimpleVersion
|
||||
|
||||
# store existing packages so that we can check if we can skip preinstalled packages
|
||||
# we will only check "@ file" "@ vcs" for exact match
|
||||
self._existing_packages = RequirementsManager.parse_requirements_section_to_marker_requirements(
|
||||
packages, skip_local_file_validation=True)
|
||||
|
||||
try:
|
||||
from .requirements import MarkerRequirement
|
||||
pip = pip[0][1].split('.')
|
||||
MarkerRequirement.pip_new_version = bool(int(pip[0]) >= 20)
|
||||
except Exception:
|
||||
pass
|
||||
pip_pkg = next(p for p in self._existing_packages if p.name == "pip")
|
||||
except StopIteration:
|
||||
pip_pkg = None
|
||||
|
||||
# check if we need to list the pip version as well
|
||||
if pip_pkg:
|
||||
MarkerRequirement.pip_new_version = SimpleVersion.compare_versions(pip_pkg.version, ">=", "20")
|
||||
|
||||
# add --use-deprecated=legacy-resolver to pip install to avoid mismatched packages issues
|
||||
self._add_legacy_resolver_flag(pip_pkg.version)
|
||||
|
||||
return result
|
||||
|
||||
def _add_legacy_resolver_flag(self, pip_pkg_version):
|
||||
if not self.session.config.get(self._config_pip_legacy_resolver, None):
|
||||
return
|
||||
|
||||
from .requirements import SimpleVersion
|
||||
|
||||
match_versions = self.session.config.get(self._config_pip_legacy_resolver)
|
||||
matched = False
|
||||
for rule in match_versions:
|
||||
matched = False
|
||||
# make sure we match all the parts of the rule
|
||||
for a_version in rule.split(","):
|
||||
o, v = SimpleVersion.split_op_version(a_version.strip())
|
||||
matched = SimpleVersion.compare_versions(pip_pkg_version, o, v)
|
||||
if not matched:
|
||||
break
|
||||
# if the rule is fully matched we have a match
|
||||
if matched:
|
||||
break
|
||||
|
||||
legacy_resolver_flags = ["--use-deprecated=legacy-resolver"]
|
||||
if matched:
|
||||
print("INFO: Using legacy resolver for PIP to avoid inconsistency with package versions!")
|
||||
self.add_extra_install_flags(legacy_resolver_flags)
|
||||
elif self.remove_extra_install_flags(legacy_resolver_flags):
|
||||
print("INFO: removing pip legacy resolver!")
|
||||
|
||||
def get_python_command(self, extra=()):
|
||||
# type: (...) -> Executable
|
||||
return Argv(self.bin, *extra)
|
||||
@@ -148,6 +208,18 @@ class PackageManager(object):
|
||||
return False
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
try:
|
||||
from .requirements import Requirement, MarkerRequirement
|
||||
req = MarkerRequirement(Requirement.parse(package_name))
|
||||
|
||||
# if pip was part of the requirements, make sure we update the flags
|
||||
# add --use-deprecated=legacy-resolver to pip install to avoid mismatched packages issues
|
||||
if req.name == "pip" and req.version:
|
||||
PackageManager._selected_manager._add_legacy_resolver_flag(req.version)
|
||||
except Exception as e:
|
||||
print("WARNING: Error while parsing pip version legacy [{}]".format(e))
|
||||
|
||||
return True
|
||||
|
||||
@classmethod
|
||||
@@ -182,7 +254,7 @@ class PackageManager(object):
|
||||
def get_pip_versions(cls, pip="pip", wrap=''):
|
||||
return [
|
||||
(wrap + pip + version + wrap)
|
||||
for version in cls._pip_version or [pip]
|
||||
for version in cls._pip_version or [""]
|
||||
]
|
||||
|
||||
def get_cached_venv(self, requirements, docker_cmd, python_version, cuda_version, destination_folder):
|
||||
@@ -218,6 +290,8 @@ class PackageManager(object):
|
||||
if not self._get_cache_manager():
|
||||
return
|
||||
|
||||
print('Adding venv into cache: {}'.format(source_folder))
|
||||
|
||||
try:
|
||||
keys = self._generate_reqs_hash_keys(requirements, docker_cmd, python_version, cuda_version)
|
||||
return self._get_cache_manager().add_entry(
|
||||
@@ -302,7 +376,9 @@ class PackageManager(object):
|
||||
max_entries = int(self.session.config.get(self._config_cache_max_entries, 10))
|
||||
free_space_threshold = float(self.session.config.get(self._config_cache_free_space_threshold, 0))
|
||||
self._cache_manager = FolderCache(
|
||||
cache_folder, max_cache_entries=max_entries, min_free_space_gb=free_space_threshold)
|
||||
cache_folder, max_cache_entries=max_entries,
|
||||
min_free_space_gb=free_space_threshold,
|
||||
lock_timeout_seconds=self.session.config.get(self._config_cache_lock_timeout, None))
|
||||
except Exception as ex:
|
||||
print("WARNING: Failed accessing venvs cache at {}: {}".format(cache_folder, ex))
|
||||
print("WARNING: Skipping venv cache - folder not accessible!")
|
||||
|
||||
@@ -5,7 +5,6 @@ import re
|
||||
import os
|
||||
import subprocess
|
||||
from collections import OrderedDict
|
||||
from distutils.spawn import find_executable
|
||||
from functools import partial
|
||||
from itertools import chain
|
||||
from typing import Text, Iterable, Union, Dict, Set, Sequence, Any
|
||||
@@ -22,13 +21,13 @@ from clearml_agent.errors import CommandFailedError
|
||||
from clearml_agent.helper.base import (
|
||||
rm_tree, NonStrictAttrs, select_for_platform, is_windows_platform, ExecutionInfo,
|
||||
convert_cuda_version_to_float_single_digit_str, convert_cuda_version_to_int_10_base_str, )
|
||||
from clearml_agent.helper.process import Argv, Executable, DEVNULL, CommandSequence, PathLike
|
||||
from clearml_agent.helper.process import Argv, Executable, DEVNULL, CommandSequence, PathLike, find_executable
|
||||
from clearml_agent.helper.package.requirements import SimpleVersion
|
||||
from clearml_agent.session import Session
|
||||
from .base import PackageManager
|
||||
from .pip_api.venv import VirtualenvPip
|
||||
from .requirements import RequirementsManager, MarkerRequirement
|
||||
from ...backend_api.session.defs import ENV_CONDA_ENV_PACKAGE
|
||||
from ...backend_api.session.defs import ENV_CONDA_ENV_PACKAGE, ENV_USE_CONDA_BASE_ENV
|
||||
|
||||
package_normalize = partial(re.compile(r"""\[version=['"](.*)['"]\]""").sub, r"\1")
|
||||
|
||||
@@ -79,6 +78,11 @@ class CondaAPI(PackageManager):
|
||||
self.path = path
|
||||
self.env_read_only = False
|
||||
self.extra_channels = self.session.config.get('agent.package_manager.conda_channels', [])
|
||||
# install into base conda environment (should only be used if running in docker mode)
|
||||
self.use_conda_base_env = ENV_USE_CONDA_BASE_ENV.get(
|
||||
default=self.session.config.get('agent.package_manager.use_conda_base_env', None)
|
||||
)
|
||||
# notice this will not install any additional packages into the selected environment
|
||||
self.conda_env_as_base_docker = \
|
||||
self.session.config.get('agent.package_manager.conda_env_as_base_docker', None) or \
|
||||
bool(ENV_CONDA_ENV_PACKAGE.get())
|
||||
@@ -129,16 +133,38 @@ class CondaAPI(PackageManager):
|
||||
def bin(self):
|
||||
return self.pip.bin
|
||||
|
||||
def _parse_package_marker_match_python_ver(self, line=None, marker_req=None):
|
||||
if line:
|
||||
marker_req = MarkerRequirement(Requirement.parse(line))
|
||||
|
||||
try:
|
||||
mock_req = MarkerRequirement(Requirement.parse(marker_req.marker.replace("'", "").replace("\"", "")))
|
||||
except Exception as ex:
|
||||
print("WARNING: failed parsing, assuming package is okay {}".format(ex))
|
||||
return marker_req
|
||||
|
||||
if not mock_req.compare_version(requested_version=self.python):
|
||||
print("SKIPPING package `{}` not required python version {}".format(marker_req.tostr(), self.python))
|
||||
return None
|
||||
return marker_req
|
||||
|
||||
# noinspection SpellCheckingInspection
|
||||
def upgrade_pip(self):
|
||||
# do not change pip version if pre built environement is used
|
||||
if self.env_read_only:
|
||||
print('Conda environment in read-only mode, skipping pip upgrade.')
|
||||
return ''
|
||||
|
||||
pip_versions = []
|
||||
for req_pip_line in self.pip.get_pip_versions():
|
||||
req = self._parse_package_marker_match_python_ver(line=req_pip_line)
|
||||
if req:
|
||||
pip_versions.append(req.tostr(markers=False))
|
||||
|
||||
return self._install(
|
||||
*select_for_platform(
|
||||
windows=self.pip.get_pip_versions(),
|
||||
linux=self.pip.get_pip_versions()
|
||||
windows=pip_versions,
|
||||
linux=pip_versions
|
||||
)
|
||||
)
|
||||
|
||||
@@ -173,6 +199,14 @@ class CondaAPI(PackageManager):
|
||||
else:
|
||||
raise ValueError("Could not restore Conda environment, cannot find {}".format(
|
||||
self.conda_pre_build_env_path))
|
||||
elif self.use_conda_base_env:
|
||||
try:
|
||||
base_path = Path(self.conda).parent.parent.as_posix()
|
||||
print("Using base conda environment at {}".format(base_path))
|
||||
self._init_existing_environment(base_path, is_readonly=False)
|
||||
return self
|
||||
except Exception as ex:
|
||||
print("WARNING: Failed using base conda environment, reverting to new environment: {}".format(ex))
|
||||
|
||||
command = Argv(
|
||||
self.conda,
|
||||
@@ -200,10 +234,25 @@ class CondaAPI(PackageManager):
|
||||
|
||||
return self
|
||||
|
||||
def _init_existing_environment(self, conda_pre_build_env_path):
|
||||
def _init_existing_environment(self, conda_pre_build_env_path, is_readonly=True):
|
||||
print("Using pre-existing Conda environment from {}".format(conda_pre_build_env_path))
|
||||
self.path = Path(conda_pre_build_env_path)
|
||||
self.source = ("conda", "activate", self.path.as_posix())
|
||||
conda_env = self._get_conda_sh()
|
||||
self.source = CommandSequence(('source', conda_env.as_posix()), self.source)
|
||||
|
||||
conda_packages_json = json.loads(
|
||||
self._run_command((self.conda, "list", "--json", "-p", self.path), raw=True))
|
||||
|
||||
try:
|
||||
for package in conda_packages_json:
|
||||
if package.get("name") == "python" and package.get("version"):
|
||||
self.python = ".".join(package.get("version").split(".")[:2])
|
||||
print("Existing conda environment, found python version {}".format(self.python))
|
||||
break
|
||||
except Exception as ex:
|
||||
print("WARNING: failed detecting existing conda python version: {}".format(ex))
|
||||
|
||||
self.pip = CondaPip(
|
||||
session=self.session,
|
||||
source=self.source,
|
||||
@@ -211,9 +260,9 @@ class CondaAPI(PackageManager):
|
||||
requirements_manager=self.requirements_manager,
|
||||
path=self.path,
|
||||
)
|
||||
conda_env = self._get_conda_sh()
|
||||
self.source = self.pip.source = CommandSequence(('source', conda_env.as_posix()), self.source)
|
||||
self.env_read_only = True
|
||||
self.pip.source = self.source
|
||||
|
||||
self.env_read_only = is_readonly
|
||||
|
||||
def remove(self):
|
||||
"""
|
||||
@@ -223,7 +272,7 @@ class CondaAPI(PackageManager):
|
||||
Conda seems to load "vcruntime140.dll" from all its environment on startup.
|
||||
This means environment have to be deleted using 'conda env remove'.
|
||||
If necessary, conda can be fooled into deleting a partially-deleted environment by creating an empty file
|
||||
in '<ENV>\conda-meta\history' (value found in 'conda.gateways.disk.test.PREFIX_MAGIC_FILE').
|
||||
in '<ENV>\\conda-meta\\history' (value found in 'conda.gateways.disk.test.PREFIX_MAGIC_FILE').
|
||||
Otherwise, it complains that said directory is not a conda environment.
|
||||
|
||||
See: https://github.com/conda/conda/issues/7682
|
||||
@@ -499,7 +548,7 @@ class CondaAPI(PackageManager):
|
||||
if '.' not in m.specs[0][1]:
|
||||
continue
|
||||
|
||||
if m.name.lower() == 'cudatoolkit':
|
||||
if m.name.lower() in ('cudatoolkit', 'cuda-toolkit'):
|
||||
# skip cuda if we are running on CPU
|
||||
if not cuda_version:
|
||||
continue
|
||||
@@ -526,10 +575,22 @@ class CondaAPI(PackageManager):
|
||||
has_torch = True
|
||||
m.req.name = 'tensorflow-gpu' if cuda_version > 0 else 'tensorflow'
|
||||
|
||||
# push the clearml packages into the pip_requirements
|
||||
if "clearml" in m.req.name and "clearml" not in self.extra_channels:
|
||||
if self.session.debug_mode:
|
||||
print("info: moving `{}` packages to `pip` section".format(m.req))
|
||||
pip_requirements.append(m)
|
||||
continue
|
||||
|
||||
reqs.append(m)
|
||||
|
||||
if not has_cudatoolkit and cuda_version:
|
||||
m = MarkerRequirement(Requirement.parse("cudatoolkit == {}".format(cuda_version_full)))
|
||||
# nvidia channel is using `cuda-toolkit` and has newer versions of cuda,
|
||||
# older cuda can be picked from conda-forge (<12)
|
||||
if "nvidia" in self.extra_channels:
|
||||
m = MarkerRequirement(Requirement.parse("cuda-toolkit == {}".format(cuda_version_full)))
|
||||
else:
|
||||
m = MarkerRequirement(Requirement.parse("cudatoolkit == {}".format(cuda_version_full)))
|
||||
has_cudatoolkit = True
|
||||
reqs.append(m)
|
||||
|
||||
@@ -589,21 +650,30 @@ class CondaAPI(PackageManager):
|
||||
if r.name and not r.name.startswith('_') and not requirements.get('conda', None):
|
||||
r.name = r.name.replace('_', '-')
|
||||
|
||||
if has_cudatoolkit and r.specs and len(r.specs[0]) > 1 and r.name == 'cudatoolkit':
|
||||
if has_cudatoolkit and r.specs and len(r.specs[0]) > 1 and r.name in ('cudatoolkit', 'cuda-toolkit'):
|
||||
# select specific cuda version if it came from the requirements
|
||||
r.specs = [(r.specs[0][0].replace('==', '='), r.specs[0][1].split('.post')[0])]
|
||||
elif r.specs and r.specs[0] and len(r.specs[0]) > 1:
|
||||
# remove .post from version numbers it fails with ~= version, and change == to ~=
|
||||
r.specs = [(r.specs[0][0].replace('==', '~='), r.specs[0][1].split('.post')[0])]
|
||||
r.specs = [(s[0].replace('==', '~='), s[1].split('.post')[0]) for s in r.specs]
|
||||
|
||||
while reqs:
|
||||
# notice, we give conda more freedom in version selection, to help it choose best combination
|
||||
def clean_ver(ar):
|
||||
if not ar.specs:
|
||||
return ar.tostr()
|
||||
ar.specs = [(ar.specs[0][0], ar.specs[0][1] + '.0' if '.' not in ar.specs[0][1] else ar.specs[0][1])]
|
||||
return ar.tostr()
|
||||
conda_env['dependencies'] = [clean_ver(r) for r in reqs]
|
||||
markers = None
|
||||
if ar.marker:
|
||||
# check if we really need it based on python version
|
||||
ar = self._parse_package_marker_match_python_ver(marker_req=ar)
|
||||
if not ar:
|
||||
# empty lines should be skipped
|
||||
return ""
|
||||
# if we do make sure we note that we ignored markers
|
||||
print("WARNING: ignoring marker in `{}`".format(ar.tostr()))
|
||||
markers = False
|
||||
if ar.specs:
|
||||
ar.specs = [(s[0], s[1] + '.0' if '.' not in s[1] else s[1]) for s in ar.specs]
|
||||
return ar.tostr(markers=markers)
|
||||
conda_env['dependencies'] = [clean_ver(r) for r in reqs if clean_ver(r)]
|
||||
with self.temp_file("conda_env", yaml.dump(conda_env), suffix=".yml") as name:
|
||||
print('Conda: Trying to install requirements:\n{}'.format(conda_env['dependencies']))
|
||||
if self.session.debug_mode:
|
||||
@@ -730,6 +800,25 @@ class CondaAPI(PackageManager):
|
||||
return conda_env
|
||||
return base_conda_env
|
||||
|
||||
def add_cached_venv(self, *args, **kwargs):
|
||||
"""
|
||||
Copy the local venv folder into the venv cache (keys are based on the requirements+python+docker).
|
||||
"""
|
||||
# do not cache if this is a base conda environment
|
||||
if self.conda_env_as_base_docker or self.use_conda_base_env:
|
||||
return
|
||||
return super().add_cached_venv(*args, **kwargs)
|
||||
|
||||
def get_cached_venv(self, *args, **kwargs):
|
||||
"""
|
||||
Copy a cached copy of the venv (based on the requirements) into destination_folder.
|
||||
Return None if failed or cached entry does not exist
|
||||
"""
|
||||
# do not cache if this is a base conda environment
|
||||
if self.conda_env_as_base_docker or self.use_conda_base_env:
|
||||
return
|
||||
return super().get_cached_venv(*args, **kwargs)
|
||||
|
||||
|
||||
# enable hashing with cmp=False because pdb fails on un-hashable exceptions
|
||||
exception = attrs(str=True, cmp=False)
|
||||
|
||||
@@ -97,7 +97,7 @@ class SystemPip(PackageManager):
|
||||
return Argv(self.bin, '-m', 'pip', '--disable-pip-version-check', *command)
|
||||
|
||||
def install_flags(self):
|
||||
indices_args = tuple(
|
||||
base_args = tuple(self._base_install_flags or []) + tuple(
|
||||
chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
|
||||
)
|
||||
|
||||
@@ -105,7 +105,7 @@ class SystemPip(PackageManager):
|
||||
ENV_PIP_EXTRA_INSTALL_FLAGS.get() or \
|
||||
self.session.config.get("agent.package_manager.extra_pip_install_flags", None)
|
||||
|
||||
return (indices_args + tuple(extra_pip_flags)) if extra_pip_flags else indices_args
|
||||
return (base_args + tuple(extra_pip_flags)) if extra_pip_flags else base_args
|
||||
|
||||
def download_flags(self):
|
||||
indices_args = tuple(
|
||||
|
||||
@@ -37,7 +37,9 @@ class VirtualenvPip(SystemPip, PackageManager):
|
||||
|
||||
def load_requirements(self, requirements):
|
||||
if isinstance(requirements, dict) and requirements.get("pip"):
|
||||
requirements["pip"] = self.requirements_manager.replace(requirements["pip"])
|
||||
requirements["pip"] = self.requirements_manager.replace(
|
||||
requirements["pip"], existing_packages=self._existing_packages
|
||||
)
|
||||
super(VirtualenvPip, self).load_requirements(requirements)
|
||||
self.requirements_manager.post_install(self.session, package_manager=self)
|
||||
|
||||
@@ -64,9 +66,18 @@ class VirtualenvPip(SystemPip, PackageManager):
|
||||
Only valid if instantiated with path.
|
||||
Use self.python as self.bin does not exist.
|
||||
"""
|
||||
self.session.command(
|
||||
self.python, "-m", "virtualenv", self.path, *self.create_flags()
|
||||
).check_call()
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
self.session.command(
|
||||
self.python, "-m", "virtualenv", self.path, *self.create_flags()
|
||||
).check_call()
|
||||
except Exception as ex:
|
||||
# let's try with std library instead
|
||||
print("WARNING: virtualenv call failed: {}\n INFO: Creating virtual environment with venv".format(ex))
|
||||
self.session.command(
|
||||
self.python, "-m", "venv", self.path, *self.create_flags()
|
||||
).check_call()
|
||||
|
||||
return self
|
||||
|
||||
def remove(self):
|
||||
|
||||
@@ -6,6 +6,7 @@ import sys
|
||||
import os
|
||||
from pathlib2 import Path
|
||||
|
||||
from clearml_agent.definitions import ENV_AGENT_FORCE_POETRY
|
||||
from clearml_agent.helper.process import Argv, DEVNULL, check_if_command_exists
|
||||
from clearml_agent.session import Session, POETRY
|
||||
|
||||
@@ -39,11 +40,11 @@ def prop_guard(prop, log_prop=None):
|
||||
|
||||
class PoetryConfig:
|
||||
|
||||
def __init__(self, session, interpreter=None):
|
||||
# type: (Session, str) -> ()
|
||||
def __init__(self, session):
|
||||
# type: (Session, str) -> None
|
||||
self.session = session
|
||||
self._log = session.get_logger(__name__)
|
||||
self._python = interpreter or sys.executable
|
||||
self._python = sys.executable # default, overwritten from session config in initialize()
|
||||
self._initialized = False
|
||||
|
||||
@property
|
||||
@@ -52,7 +53,7 @@ class PoetryConfig:
|
||||
|
||||
@property
|
||||
def enabled(self):
|
||||
return self.session.config["agent.package_manager.type"] == POETRY
|
||||
return ENV_AGENT_FORCE_POETRY.get() or self.session.config["agent.package_manager.type"] == POETRY
|
||||
|
||||
_guard_enabled = prop_guard(enabled, log)
|
||||
|
||||
@@ -69,7 +70,7 @@ class PoetryConfig:
|
||||
path = path.replace(':'+sys.base_prefix, ':'+sys.real_prefix, 1)
|
||||
kwargs['env']['PATH'] = path
|
||||
|
||||
if self.session and self.session.config:
|
||||
if self.session and self.session.config and args and args[0] == "install":
|
||||
extra_args = self.session.config.get("agent.package_manager.poetry_install_extra_args", None)
|
||||
if extra_args:
|
||||
args = args + tuple(extra_args)
|
||||
@@ -87,32 +88,53 @@ class PoetryConfig:
|
||||
@_guard_enabled
|
||||
def initialize(self, cwd=None):
|
||||
if not self._initialized:
|
||||
# use correct python version -- detected in Worker.install_virtualenv() and written to
|
||||
# session
|
||||
if self.session.config.get("agent.python_binary", None):
|
||||
self._python = self.session.config.get("agent.python_binary")
|
||||
|
||||
if self.session.config.get("agent.package_manager.poetry_version", None) is not None:
|
||||
version = str(self.session.config.get("agent.package_manager.poetry_version"))
|
||||
print('Upgrading Poetry package {}'.format(version))
|
||||
# first upgrade pip if we need to
|
||||
try:
|
||||
from clearml_agent.helper.package.pip_api.venv import VirtualenvPip
|
||||
pip = VirtualenvPip(
|
||||
session=self.session, python=self._python,
|
||||
requirements_manager=None, path=None, interpreter=self._python)
|
||||
pip.upgrade_pip()
|
||||
except Exception as ex:
|
||||
self.log.warning("failed upgrading pip: {}".format(ex))
|
||||
|
||||
# get poetry version
|
||||
version = version.replace(' ', '')
|
||||
if ('=' in version) or ('~' in version) or ('<' in version) or ('>' in version):
|
||||
version = version
|
||||
elif version:
|
||||
version = "==" + version
|
||||
# (we are not running it yet)
|
||||
argv = Argv(self._python, "-m", "pip", "install", "poetry{}".format(version),
|
||||
"--upgrade", "--disable-pip-version-check")
|
||||
# this is just for beauty and checks, we already set the version in the Argv
|
||||
if not version:
|
||||
version = "latest"
|
||||
else:
|
||||
# mark to install poetry if not already installed (we are not running it yet)
|
||||
argv = Argv(self._python, "-m", "pip", "install", "poetry", "--disable-pip-version-check")
|
||||
version = ""
|
||||
|
||||
# first upgrade pip if we need to
|
||||
try:
|
||||
from clearml_agent.helper.package.pip_api.venv import VirtualenvPip
|
||||
pip = VirtualenvPip(
|
||||
session=self.session, python=self._python,
|
||||
requirements_manager=None, path=None, interpreter=self._python)
|
||||
pip.upgrade_pip()
|
||||
except Exception as ex:
|
||||
self.log.warning("failed upgrading pip: {}".format(ex))
|
||||
|
||||
# check if we do not have a specific version and poetry is found skip installation
|
||||
if not version and check_if_command_exists("poetry"):
|
||||
print("Notice: Poetry was found, no specific version required, skipping poetry installation")
|
||||
else:
|
||||
print('Installing / Upgrading Poetry package to {}'.format(version))
|
||||
# now install poetry
|
||||
try:
|
||||
version = version.replace(' ', '')
|
||||
if ('=' in version) or ('~' in version) or ('<' in version) or ('>' in version):
|
||||
version = version
|
||||
elif version:
|
||||
version = "==" + version
|
||||
argv = Argv(self._python, "-m", "pip", "install", "poetry{}".format(version),
|
||||
"--upgrade", "--disable-pip-version-check")
|
||||
print(argv.get_output())
|
||||
except Exception as ex:
|
||||
self.log.warning("failed upgrading poetry: {}".format(ex))
|
||||
self.log.warning("failed installing poetry: {}".format(ex))
|
||||
|
||||
# now setup poetry
|
||||
self._initialized = True
|
||||
try:
|
||||
self._config("--local", "virtualenvs.in-project", "true", cwd=cwd)
|
||||
|
||||
@@ -53,12 +53,18 @@ class PriorityPackageRequirement(SimpleSubstitution):
|
||||
if not self._replaced_packages:
|
||||
return list_of_requirements
|
||||
|
||||
# we assume that both pip & setuptools are not in list_of_requirements, and we need to add them
|
||||
|
||||
if "pip" in self._replaced_packages:
|
||||
full_freeze = PackageManager.out_of_scope_freeze(freeze_full_environment=True)
|
||||
# now let's look for pip
|
||||
pips = [line for line in full_freeze.get("pip", []) if line.split("==")[0] == "pip"]
|
||||
if pips and "pip" in list_of_requirements:
|
||||
list_of_requirements["pip"] = [pips[0]] + list_of_requirements["pip"]
|
||||
if not full_freeze:
|
||||
if "pip" in list_of_requirements:
|
||||
list_of_requirements["pip"] = [self._replaced_packages["pip"]] + list_of_requirements["pip"]
|
||||
else:
|
||||
# now let's look for pip
|
||||
pips = [line for line in full_freeze.get("pip", []) if str(line.split("==")[0]).strip() == "pip"]
|
||||
if pips and "pip" in list_of_requirements:
|
||||
list_of_requirements["pip"] = [pips[0]] + list_of_requirements["pip"]
|
||||
|
||||
if "setuptools" in self._replaced_packages:
|
||||
try:
|
||||
@@ -87,6 +93,20 @@ class PriorityPackageRequirement(SimpleSubstitution):
|
||||
return list_of_requirements
|
||||
|
||||
|
||||
class CachedPackageRequirement(PriorityPackageRequirement):
|
||||
|
||||
name = ("setuptools", "pip", )
|
||||
optional_package_names = tuple()
|
||||
|
||||
def replace(self, req):
|
||||
"""
|
||||
Put the requirement in the list for later conversion
|
||||
:raises: ValueError if version is pre-release
|
||||
"""
|
||||
self._replaced_packages[req.name] = req.line
|
||||
return Text(req)
|
||||
|
||||
|
||||
class PackageCollectorRequirement(SimpleSubstitution):
|
||||
"""
|
||||
This RequirementSubstitution class will allow you to have multiple instances of the same
|
||||
|
||||
@@ -670,8 +670,7 @@ class PytorchRequirement(SimpleSubstitution):
|
||||
return MarkerRequirement(Requirement.parse(self._fix_setuptools))
|
||||
return None
|
||||
|
||||
@classmethod
|
||||
def get_torch_index_url(cls, cuda_version, nightly=False):
|
||||
def get_torch_index_url(self, cuda_version, nightly=False):
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
cuda = int(cuda_version)
|
||||
@@ -681,39 +680,39 @@ class PytorchRequirement(SimpleSubstitution):
|
||||
if nightly:
|
||||
for c in range(cuda, max(-1, cuda-15), -1):
|
||||
# then try the nightly builds, it might be there...
|
||||
torch_url = cls.nightly_extra_index_url_template.format(c)
|
||||
torch_url = self.nightly_extra_index_url_template.format(c)
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
if requests.get(torch_url, timeout=10).ok:
|
||||
print('Torch nightly CUDA {} index page found'.format(c))
|
||||
cls.torch_index_url_lookup[c] = torch_url
|
||||
return cls.torch_index_url_lookup[c], c
|
||||
self.torch_index_url_lookup[c] = torch_url
|
||||
return self.torch_index_url_lookup[c], c
|
||||
except Exception:
|
||||
pass
|
||||
return
|
||||
|
||||
# first check if key is valid
|
||||
if cuda in cls.torch_index_url_lookup:
|
||||
return cls.torch_index_url_lookup[cuda], cuda
|
||||
if cuda in self.torch_index_url_lookup:
|
||||
return self.torch_index_url_lookup[cuda], cuda
|
||||
|
||||
# then try a new cuda version page
|
||||
for c in range(cuda, max(-1, cuda-15), -1):
|
||||
torch_url = cls.extra_index_url_template.format(c)
|
||||
torch_url = self.extra_index_url_template.format(c)
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
if requests.get(torch_url, timeout=10).ok:
|
||||
print('Torch CUDA {} index page found, adding `{}`'.format(c, torch_url))
|
||||
cls.torch_index_url_lookup[c] = torch_url
|
||||
return cls.torch_index_url_lookup[c], c
|
||||
self.torch_index_url_lookup[c] = torch_url
|
||||
return self.torch_index_url_lookup[c], c
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
keys = sorted(cls.torch_index_url_lookup.keys(), reverse=True)
|
||||
keys = sorted(self.torch_index_url_lookup.keys(), reverse=True)
|
||||
for k in keys:
|
||||
if k <= cuda:
|
||||
return cls.torch_index_url_lookup[k], k
|
||||
return self.torch_index_url_lookup[k], k
|
||||
# return default - zero
|
||||
return cls.torch_index_url_lookup[0], 0
|
||||
return self.torch_index_url_lookup[0], 0
|
||||
|
||||
MAP = {
|
||||
"windows": {
|
||||
|
||||
@@ -19,7 +19,7 @@ import logging
|
||||
from clearml_agent.definitions import PIP_EXTRA_INDICES
|
||||
from clearml_agent.helper.base import (
|
||||
warning, is_conda, which, join_lines, is_windows_platform,
|
||||
convert_cuda_version_to_int_10_base_str, )
|
||||
convert_cuda_version_to_int_10_base_str, dump_yaml, )
|
||||
from clearml_agent.helper.process import Argv, PathLike
|
||||
from clearml_agent.helper.gpu.gpustat import get_driver_cuda_version
|
||||
from clearml_agent.session import Session, normalize_cuda_version
|
||||
@@ -94,6 +94,12 @@ class MarkerRequirement(object):
|
||||
def __repr__(self):
|
||||
return '{self.__class__.__name__}[{self}]'.format(self=self)
|
||||
|
||||
def __eq__(self, other):
|
||||
return isinstance(other, MarkerRequirement) and str(self) == str(other)
|
||||
|
||||
def __hash__(self):
|
||||
return str(self).__hash__()
|
||||
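The string-based `__eq__`/`__hash__` pair added here is what lets requirement lists be de-duplicated with plain set operations further down (the `skipped_packages = set(parsed_requirements) & set(existing_packages)` intersection in `RequirementsManager.replace`). A minimal sketch of the behavior, using a made-up package spec and assuming `MarkerRequirement` and `Requirement` are imported as in the module above:

```python
# two independently parsed but textually identical requirements
a = MarkerRequirement(Requirement.parse("numpy==1.24.0"))
b = MarkerRequirement(Requirement.parse("numpy==1.24.0"))

# equal by string form, so sets collapse duplicates and intersections work
assert a == b
assert len({a, b}) == 1
```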
|
||||
def format_specs(self, num_parts=None, max_num_parts=None):
|
||||
max_num_parts = max_num_parts or num_parts
|
||||
if max_num_parts is None or not self.specs:
|
||||
@@ -116,6 +122,10 @@ class MarkerRequirement(object):
|
||||
def specs(self): # type: () -> List[Tuple[Text, Text]]
|
||||
return self.req.specs
|
||||
|
||||
@property
|
||||
def version(self): # type: () -> Text
|
||||
return self.specs[0][1] if self.specs else ""
|
||||
|
||||
@specs.setter
|
||||
def specs(self, value): # type: (List[Tuple[Text, Text]]) -> None
|
||||
self.req.specs = value
|
||||
@@ -143,6 +153,8 @@ class MarkerRequirement(object):
|
||||
If the requested version is 1.2 the self.spec should be 1.2*
|
||||
etc.
|
||||
|
||||
usage: it returns the value of the following comparison: requested_version "op" self.version
|
||||
|
||||
:param str requested_version:
|
||||
:param str op: '==', '>', '>=', '<=', '<', '~='
|
||||
:param int num_parts: number of parts to compare
|
||||
@@ -152,7 +164,7 @@ class MarkerRequirement(object):
|
||||
if not self.specs:
|
||||
return True
|
||||
|
||||
version = self.specs[0][1]
|
||||
version = self.version
|
||||
op = (op or self.specs[0][0]).strip()
|
||||
|
||||
return SimpleVersion.compare_versions(
|
||||
@@ -170,11 +182,21 @@ class MarkerRequirement(object):
|
||||
self.req.local_file = False
|
||||
return True
|
||||
|
||||
def validate_local_file_ref(self):
|
||||
def is_local_package_ref(self):
|
||||
# if local file does not exist, remove the reference to it
|
||||
if self.vcs or self.editable or self.path or not self.local_file or not self.name or \
|
||||
not self.uri or not self.uri.startswith("file://"):
|
||||
return False
|
||||
return True
|
||||
|
||||
def is_vcs_ref(self):
|
||||
return bool(self.vcs)
|
||||
|
||||
def validate_local_file_ref(self):
|
||||
# if local file does not exist, remove the reference to it
|
||||
if not self.is_local_package_ref():
|
||||
return
|
||||
|
||||
local_path = Path(self.uri[len("file://"):])
|
||||
if not local_path.exists():
|
||||
local_path = Path(unquote(self.uri)[len("file://"):])
|
||||
@@ -221,6 +243,19 @@ class SimpleVersion:
|
||||
_local_version_separators = re.compile(r"[\._-]")
|
||||
_regex = re.compile(r"^\s*" + VERSION_PATTERN + r"\s*$", re.VERBOSE | re.IGNORECASE)
|
||||
|
||||
@classmethod
|
||||
def split_op_version(cls, line):
|
||||
"""
|
||||
Split a string in the form of ">=1.2.3" into a (op, version), i.e. (">=", "1.2.3")
|
||||
Note: if calling with only a version string (e.g. "1.2.3"), the default operator is "=="
|
||||
which means you get ("==", "1.2.3")
|
||||
:param line: string examples: "<=0.1.2"
|
||||
:return: tuple of (op, version) example ("<=", "0.1.2")
|
||||
"""
|
||||
match = r"\s*([>=<~!]*)\s*(\S*)\s*"
|
||||
groups = re.match(match, line).groups()
|
||||
return groups[0] or "==", groups[1]
|
||||
|
||||
@classmethod
|
||||
def compare_versions(cls, version_a, op, version_b, ignore_sub_versions=True, num_parts=3):
|
||||
"""
|
||||
@@ -624,14 +659,54 @@ class RequirementsManager(object):
|
||||
return handler.replace(req)
|
||||
return None
|
||||
|
||||
def replace(self, requirements): # type: (Text) -> Text
|
||||
def replace(
|
||||
self,
|
||||
requirements, # type: Text
|
||||
existing_packages=None, # type: List[MarkerRequirement]
|
||||
pkg_skip_existing_local=True, # type: bool
|
||||
pkg_skip_existing_vcs=True, # type: bool
|
||||
pkg_skip_existing=True, # type: bool
|
||||
): # type: (...) -> Text
|
||||
parsed_requirements = self.parse_requirements_section_to_marker_requirements(
|
||||
requirements=requirements, cwd=self._cwd)
|
||||
requirements=requirements, cwd=self._cwd, skip_local_file_validation=True)
|
||||
|
||||
if parsed_requirements and existing_packages:
|
||||
skipped_packages = None
|
||||
if pkg_skip_existing:
|
||||
skipped_packages = set(parsed_requirements) & set(existing_packages)
|
||||
elif pkg_skip_existing_local or pkg_skip_existing_vcs:
|
||||
existing_packages = [
|
||||
p for p in existing_packages if (
|
||||
(pkg_skip_existing_local and p.is_local_package_ref()) or
|
||||
(pkg_skip_existing_vcs and p.is_vcs_ref())
|
||||
)
|
||||
]
|
||||
skipped_packages = set(parsed_requirements) & set(existing_packages)
|
||||
|
||||
if skipped_packages:
|
||||
# maintain order
|
||||
num_skipped_packages = len(parsed_requirements)
|
||||
parsed_requirements = [p for p in parsed_requirements if p not in skipped_packages]
|
||||
num_skipped_packages -= len(parsed_requirements)
|
||||
print("Skipping {} pre-installed packages:\n{}Remaining {} additional packages to install".format(
|
||||
num_skipped_packages,
|
||||
dump_yaml(sorted([str(p) for p in skipped_packages])),
|
||||
len(parsed_requirements)
|
||||
))
|
||||
|
||||
# nothing to install!
|
||||
if not parsed_requirements:
|
||||
return ""
|
||||
|
||||
# sanity check
|
||||
if not parsed_requirements:
|
||||
# return the original requirements just in case
|
||||
return requirements
|
||||
|
||||
# remove local file references that do not exist
|
||||
for p in parsed_requirements:
|
||||
p.validate_local_file_ref()
|
||||
|
||||
def replace_one(i, req):
|
||||
# type: (int, MarkerRequirement) -> Optional[Text]
|
||||
try:
|
||||
@@ -805,7 +880,7 @@ class RequirementsManager(object):
|
||||
normalize_cuda_version(cudnn_version or 0))
|
||||
|
||||
@staticmethod
|
||||
def parse_requirements_section_to_marker_requirements(requirements, cwd=None):
|
||||
def parse_requirements_section_to_marker_requirements(requirements, cwd=None, skip_local_file_validation=False):
|
||||
def safe_parse(req_str):
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
@@ -815,7 +890,8 @@ class RequirementsManager(object):
|
||||
|
||||
def create_req(x):
|
||||
r = MarkerRequirement(x)
|
||||
r.validate_local_file_ref()
|
||||
if not skip_local_file_validation:
|
||||
r.validate_local_file_ref()
|
||||
return r
|
||||
|
||||
if not requirements:
|
||||
|
||||
@@ -8,7 +8,6 @@ import subprocess
|
||||
import sys
|
||||
from contextlib import contextmanager
|
||||
from copy import copy
|
||||
from distutils.spawn import find_executable
|
||||
from itertools import chain, repeat, islice
|
||||
from os.path import devnull
|
||||
from time import sleep
|
||||
@@ -492,3 +491,40 @@ def double_quote(s):
|
||||
# use single quotes, and put single quotes into double quotes
|
||||
# the string $"b is then quoted as "$"""b"
|
||||
return '"' + s.replace('"', '"\'\"\'"') + '"'
|
||||
|
||||
|
||||
def find_executable(executable, path=None):
|
||||
"""Tries to find 'executable' in the directories listed in 'path'.
|
||||
|
||||
A string listing directories separated by 'os.pathsep'; defaults to
|
||||
os.environ['PATH']. Returns the complete filename or None if not found.
|
||||
"""
|
||||
_, ext = os.path.splitext(executable)
|
||||
if (sys.platform == 'win32') and (ext != '.exe'):
|
||||
executable = executable + '.exe'
|
||||
|
||||
if os.path.isfile(executable):
|
||||
return executable
|
||||
|
||||
if path is None:
|
||||
path = os.environ.get('PATH', None)
|
||||
if path is None:
|
||||
try:
|
||||
path = os.confstr("CS_PATH")
|
||||
except (AttributeError, ValueError):
|
||||
# os.confstr() or CS_PATH is not available
|
||||
path = os.defpath
|
||||
# bpo-35755: Don't use os.defpath if the PATH environment variable is
|
||||
# set to an empty string
|
||||
|
||||
# PATH='' doesn't match, whereas PATH=':' looks in the current directory
|
||||
if not path:
|
||||
return None
|
||||
|
||||
paths = path.split(os.pathsep)
|
||||
for p in paths:
|
||||
f = os.path.join(p, executable)
|
||||
if os.path.isfile(f):
|
||||
# the file exists, we have a shot at spawn working
|
||||
return f
|
||||
return None
|
||||
|
||||
@@ -6,7 +6,6 @@ import stat
|
||||
import subprocess
|
||||
import sys
|
||||
import tempfile
|
||||
from distutils.spawn import find_executable
|
||||
from hashlib import md5
|
||||
from os import environ
|
||||
from random import random
|
||||
@@ -19,7 +18,7 @@ from pathlib2 import Path
|
||||
|
||||
import six
|
||||
|
||||
from clearml_agent.definitions import ENV_AGENT_GIT_USER, ENV_AGENT_GIT_PASS, ENV_AGENT_GIT_HOST
|
||||
from clearml_agent.definitions import ENV_AGENT_GIT_USER, ENV_AGENT_GIT_PASS, ENV_AGENT_GIT_HOST, ENV_GIT_CLONE_VERBOSE
|
||||
from clearml_agent.helper.console import ensure_text, ensure_binary
|
||||
from clearml_agent.errors import CommandFailedError
|
||||
from clearml_agent.helper.base import (
|
||||
@@ -30,7 +29,7 @@ from clearml_agent.helper.base import (
|
||||
create_file_if_not_exists, safe_remove_file,
|
||||
)
|
||||
from clearml_agent.helper.os.locks import FileLock
|
||||
from clearml_agent.helper.process import DEVNULL, Argv, PathLike, COMMAND_SUCCESS
|
||||
from clearml_agent.helper.process import DEVNULL, Argv, PathLike, COMMAND_SUCCESS, find_executable
|
||||
from clearml_agent.session import Session
|
||||
|
||||
|
||||
@@ -197,8 +196,9 @@ class VCS(object):
|
||||
self.log.info("successfully applied uncommitted changes")
|
||||
return True
|
||||
|
||||
# Command-line flags for clone command
|
||||
clone_flags = ()
|
||||
def clone_flags(self):
|
||||
"""Command-line flags for clone command"""
|
||||
return tuple()
|
||||
|
||||
@abc.abstractmethod
|
||||
def executable_not_found_error_help(self):
|
||||
@@ -322,6 +322,8 @@ class VCS(object):
|
||||
return
|
||||
|
||||
# rewrite ssh URLs only if either ssh port or ssh user are forced in config
|
||||
# TODO: fix, when url is in the form of `git@domain.com:user/project.git` we will fail to get scheme
|
||||
# need to add ssh:// and replace first ":" with / , unless port is specified
|
||||
if parsed_url.scheme == "ssh" and (
|
||||
self.session.config.get('agent.force_git_ssh_port', None) or
|
||||
self.session.config.get('agent.force_git_ssh_user', None)
|
||||
@@ -345,11 +347,18 @@ class VCS(object):
|
||||
# if we have git_user / git_pass replace ssh credentials with https authentication
|
||||
if (ENV_AGENT_GIT_USER.get() or self.session.config.get('agent.git_user', None)) and \
|
||||
(ENV_AGENT_GIT_PASS.get() or self.session.config.get('agent.git_pass', None)):
|
||||
|
||||
# only apply to a specific domain (if requested)
|
||||
config_domain = \
|
||||
ENV_AGENT_GIT_HOST.get() or self.session.config.get("agent.git_host", None)
|
||||
if config_domain and config_domain != furl(self.url).host:
|
||||
return
|
||||
if config_domain:
|
||||
if config_domain != furl(self.url).host:
|
||||
# bail out here if we have a git_host configured and it's different than the URL host
|
||||
# however, we should make sure this is not an ssh@ URL that furl failed to parse
|
||||
ssh_git_url_match = self.SSH_URL_GIT_SYNTAX.match(self.url)
|
||||
if not ssh_git_url_match or config_domain != ssh_git_url_match.groupdict().get("host"):
|
||||
# do not replace to ssh url
|
||||
return
|
||||
|
||||
new_url = self.replace_ssh_url(self.url)
|
||||
if new_url != self.url:
|
||||
@@ -366,7 +375,7 @@ class VCS(object):
|
||||
self._set_ssh_url()
|
||||
# if we are on linux no need for the full auth url because we use GIT_ASKPASS
|
||||
url = self.url_without_auth if self._use_ask_pass else self.url_with_auth
|
||||
clone_command = ("clone", url, self.location) + self.clone_flags
|
||||
clone_command = ("clone", url, self.location) + self.clone_flags()
|
||||
# clone all branches regardless of when we want to later checkout
|
||||
# if branch:
|
||||
# clone_command += ("-b", branch)
|
||||
@@ -543,7 +552,6 @@ class VCS(object):
|
||||
class Git(VCS):
|
||||
executable_name = "git"
|
||||
main_branch = ("master", "main")
|
||||
clone_flags = ("--quiet", "--recursive")
|
||||
checkout_flags = ("--force",)
|
||||
COMMAND_ENV = {
|
||||
# do not prompt for password
|
||||
@@ -555,7 +563,7 @@ class Git(VCS):
|
||||
def __init__(self, *args, **kwargs):
|
||||
super(Git, self).__init__(*args, **kwargs)
|
||||
|
||||
self._use_ask_pass = False if not self.session.config.get('agent.enable_git_ask_pass', None) \
|
||||
self._use_ask_pass = False if not self.session.config.get('agent.enable_git_ask_pass', True) \
|
||||
else sys.platform == "linux"
|
||||
|
||||
try:
|
||||
@@ -569,6 +577,12 @@ class Git(VCS):
|
||||
"origin/{}".format(b) for b in ([branch] if isinstance(branch, str) else branch)
|
||||
]
|
||||
|
||||
def clone_flags(self):
|
||||
return (
|
||||
"--recursive",
|
||||
"--verbose" if ENV_GIT_CLONE_VERBOSE.get() else "--quiet"
|
||||
)
|
||||
|
||||
def executable_not_found_error_help(self):
|
||||
return 'Cannot find "{}" executable. {}'.format(
|
||||
self.executable_name,
|
||||
@@ -583,7 +597,8 @@ class Git(VCS):
|
||||
)
|
||||
|
||||
def pull(self):
|
||||
self.call("fetch", "--all", "--recurse-submodules", cwd=self.location)
|
||||
self._set_ssh_url()
|
||||
self.call("fetch", "--all", "--tags", "--recurse-submodules", cwd=self.location)
|
||||
|
||||
def _git_pass_auth_wrapper(self, func, *args, **kwargs):
|
||||
try:
|
||||
@@ -765,7 +780,22 @@ def clone_repository_cached(session, execution, destination):
|
||||
# We clone the entire repository, not a specific branch
|
||||
vcs.clone() # branch=execution.branch)
|
||||
|
||||
vcs.pull()
|
||||
print("pulling git")
|
||||
try:
|
||||
vcs.pull()
|
||||
except Exception as ex:
|
||||
print("git pull failed: {}".format(ex))
|
||||
if (
|
||||
session.config.get("agent.vcs_cache.enabled", False) and
|
||||
session.config.get("agent.vcs_cache.clone_on_pull_fail", False)
|
||||
):
|
||||
print("pulling git failed, re-cloning: {}".format(no_password_url))
|
||||
rm_tree(cached_repo_path)
|
||||
vcs.clone()
|
||||
else:
|
||||
raise ex
|
||||
print("pulling git completed")
|
||||
|
||||
rm_tree(destination)
|
||||
shutil.copytree(Text(cached_repo_path), Text(clone_folder),
|
||||
symlinks=select_for_platform(linux=True, windows=False),
|
||||
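The pull-failure fallback above is gated by two configuration flags; a hypothetical clearml.conf fragment enabling it could look like the following (key names are taken from the code, values are illustrative):

```hocon
agent {
    vcs_cache {
        enabled: true
        # when a cached repository fails to pull, drop the cache entry and re-clone
        clone_on_pull_fail: true
    }
}
```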
@@ -796,8 +826,8 @@ def fix_package_import_diff_patch(entry_script_file):
|
||||
lines = f.readlines()
|
||||
except Exception:
|
||||
return
|
||||
# make sre we are the first import (i.e. we patched the source code)
|
||||
if not lines or not lines[0].strip().startswith('from clearml ') or 'Task.init' not in lines[1]:
|
||||
# make sure we are the first import (i.e. we patched the source code)
|
||||
if len(lines or []) < 2 or not lines[0].strip().startswith('from clearml ') or 'Task.init' not in lines[1]:
|
||||
return
|
||||
|
||||
original_lines = lines
|
||||
@@ -854,3 +884,90 @@ def fix_package_import_diff_patch(entry_script_file):
|
||||
f.writelines(new_lines)
|
||||
except Exception:
|
||||
return
|
||||
|
||||
|
||||
def _locate_future_import(lines):
|
||||
# type: (list[str]) -> int
|
||||
"""
|
||||
:param lines: string lines of a python file
|
||||
:return: line index of the last __future__ import. Returns -1 if no __future__ import was found
|
||||
"""
|
||||
# skip over the first two lines, they are ours
|
||||
# then skip over empty or comment lines
|
||||
lines = [(i, line.split('#', 1)[0].rstrip()) for i, line in enumerate(lines)
|
||||
if line.strip('\r\n\t ') and not line.strip().startswith('#')]
|
||||
|
||||
# remove triple quotes ' """ '
|
||||
nested_c = -1
|
||||
skip_lines = []
|
||||
for i, line_pair in enumerate(lines):
|
||||
for _ in line_pair[1].split('"""')[1:]:
|
||||
if nested_c >= 0:
|
||||
skip_lines.extend(list(range(nested_c, i + 1)))
|
||||
nested_c = -1
|
||||
else:
|
||||
nested_c = i
|
||||
# now keep only the lines that are not inside triple-quoted strings
|
||||
lines = [pair for i, pair in enumerate(lines) if i not in skip_lines]
|
||||
|
||||
from_future = re.compile(r"^from[\s]*__future__[\s]*")
|
||||
import_future = re.compile(r"^import[\s]*__future__[\s]*")
|
||||
# test if we have __future__ import
|
||||
found_index = -1
|
||||
for a_i, (_, a_line) in enumerate(lines):
|
||||
if found_index >= a_i:
|
||||
continue
|
||||
if from_future.match(a_line) or import_future.match(a_line):
|
||||
found_index = a_i
|
||||
# check the last import block
|
||||
i, line = lines[found_index]
|
||||
# whether we have a \\ character at the end of the line or the line is indented
|
||||
parenthesized_lines = '(' in line and ')' not in line
|
||||
while line.endswith('\\') or parenthesized_lines:
|
||||
found_index += 1
|
||||
i, line = lines[found_index]
|
||||
if ')' in line:
|
||||
break
|
||||
|
||||
else:
|
||||
break
|
||||
|
||||
return found_index if found_index < 0 else lines[found_index][0]
|
||||
|
||||
|
||||
def patch_add_task_init_call(local_filename):
|
||||
if not local_filename or not Path(local_filename).is_file() or not str(local_filename).lower().endswith(".py"):
|
||||
return
|
||||
|
||||
idx_a = 0
|
||||
# find the right entry for the patch if we have a local file (basically after __future__
|
||||
try:
|
||||
with open(local_filename, 'rt') as f:
|
||||
lines = f.readlines()
|
||||
except Exception as ex:
|
||||
print("Failed patching entry point file {}: {}".format(local_filename, ex))
|
||||
return
|
||||
|
||||
future_found = _locate_future_import(lines)
|
||||
if future_found >= 0:
|
||||
idx_a = future_found + 1
|
||||
|
||||
# check if we have not already patched it, no need to add another one
|
||||
if len(lines or []) >= idx_a+2 and lines[idx_a].strip().startswith('from clearml ') and 'Task.init' in lines[idx_a+1]:
|
||||
print("File {} already patched with Task.init()".format(local_filename))
|
||||
return
|
||||
|
||||
patch = [
|
||||
"from clearml import Task\n",
|
||||
"(__name__ != \"__main__\") or Task.init()\n",
|
||||
]
|
||||
lines = lines[:idx_a] + patch + lines[idx_a:]
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
with open(local_filename, 'wt') as f:
|
||||
f.writelines(lines)
|
||||
except Exception as ex:
|
||||
print("Failed patching entry point file {}: {}".format(local_filename, ex))
|
||||
return
|
||||
|
||||
print("Force clearml Task.init patch adding to entry point script: {}".format(local_filename))
|
||||
|
||||
@@ -1,19 +1,20 @@
|
||||
from __future__ import unicode_literals, division
|
||||
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
import shlex
|
||||
from collections import deque
|
||||
from itertools import starmap
|
||||
from threading import Thread, Event
|
||||
from time import time
|
||||
from typing import Text, Sequence
|
||||
from typing import Sequence, List, Union, Dict, Optional
|
||||
|
||||
import attr
|
||||
import psutil
|
||||
from pathlib2 import Path
|
||||
|
||||
from clearml_agent.definitions import ENV_WORKER_TAGS, ENV_GPU_FRACTIONS
|
||||
from clearml_agent.session import Session
|
||||
from clearml_agent.definitions import ENV_WORKER_TAGS
|
||||
|
||||
try:
|
||||
from .gpu import gpustat
|
||||
@@ -54,6 +55,14 @@ class ResourceMonitor(object):
|
||||
if value is not None
|
||||
}
|
||||
|
||||
@attr.s
|
||||
class ClusterReport:
|
||||
cluster_key = attr.ib(type=str)
|
||||
max_gpus = attr.ib(type=int, default=None)
|
||||
max_workers = attr.ib(type=int, default=None)
|
||||
max_cpus = attr.ib(type=int, default=None)
|
||||
resource_groups = attr.ib(type=Sequence[str], factory=list)
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
session, # type: Session
|
||||
@@ -61,7 +70,7 @@ class ResourceMonitor(object):
|
||||
sample_frequency_per_sec=2.0,
|
||||
report_frequency_sec=30.0,
|
||||
first_report_sec=None,
|
||||
worker_tags=None,
|
||||
worker_tags=None
|
||||
):
|
||||
self.session = session
|
||||
self.queue = deque(maxlen=1)
|
||||
@@ -79,6 +88,13 @@ class ResourceMonitor(object):
|
||||
self._gpustat_fail = 0
|
||||
self._gpustat = gpustat
|
||||
self._active_gpus = None
|
||||
self._default_gpu_utilization = session.config.get("agent.resource_monitoring.default_gpu_utilization", 100)
|
||||
# allow default_gpu_utilization as null in the config, in which case we don't log anything
|
||||
if self._default_gpu_utilization is not None:
|
||||
self._default_gpu_utilization = int(self._default_gpu_utilization)
|
||||
self._gpu_utilization_warning_sent = False
|
||||
self._disk_use_path = str(session.config.get("agent.resource_monitoring.disk_use_path", None) or Path.home())
|
||||
self._fractions_handler = GpuFractionsHandler() if session.feature_set != "basic" else None
|
||||
if not worker_tags and ENV_WORKER_TAGS.get():
|
||||
worker_tags = shlex.split(ENV_WORKER_TAGS.get())
|
||||
self._worker_tags = worker_tags
|
||||
@@ -91,6 +107,7 @@ class ResourceMonitor(object):
|
||||
else:
|
||||
# None means no filtering, report all gpus
|
||||
self._active_gpus = None
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
active_gpus = Session.get_nvidia_visible_env()
|
||||
# None means no filtering, report all gpus
|
||||
@@ -98,6 +115,10 @@ class ResourceMonitor(object):
|
||||
self._active_gpus = [g.strip() for g in str(active_gpus).split(',')]
|
||||
except Exception:
|
||||
pass
|
||||
self._cluster_report_interval_sec = int(session.config.get(
|
||||
"agent.resource_monitoring.cluster_report_interval_sec", 60
|
||||
))
|
||||
self._cluster_report = None
|
||||
|
||||
def set_report(self, report):
|
||||
# type: (ResourceMonitor.StatusReport) -> ()
|
||||
@@ -129,6 +150,7 @@ class ResourceMonitor(object):
|
||||
)
|
||||
log.debug("sending report: %s", report)
|
||||
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
self.session.get(service="workers", action="status_report", **report)
|
||||
except Exception:
|
||||
@@ -136,7 +158,76 @@ class ResourceMonitor(object):
|
||||
return False
|
||||
return True
|
||||
|
||||
def send_cluster_report(self) -> bool:
|
||||
if not self.session.feature_set == "basic":
|
||||
return False
|
||||
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
properties = {
|
||||
"max_cpus": self._cluster_report.max_cpus,
|
||||
"max_gpus": self._cluster_report.max_gpus,
|
||||
"max_workers": self._cluster_report.max_workers,
|
||||
}
|
||||
payload = {
|
||||
"key": self._cluster_report.cluster_key,
|
||||
"timestamp": int(time() * 1000),
|
||||
"timeout": int(self._cluster_report_interval_sec * 2),
|
||||
# "resource_groups": self._cluster_report.resource_groups, # yet to be supported
|
||||
"properties": {k: v for k, v in properties.items() if v is not None},
|
||||
}
|
||||
self.session.post(service="workers", action="cluster_report", **payload)
|
||||
except Exception as ex:
|
||||
log.warning("Failed sending cluster report: %s", ex)
|
||||
return False
|
||||
return True
|
||||
|
||||
def setup_cluster_report(self, available_gpus, gpu_queues, worker_id=None, cluster_key=None, resource_groups=None):
|
||||
# type: (List[int], Dict[str, int], Optional[str], Optional[str], Optional[List[str]]) -> ()
|
||||
"""
|
||||
Set up a cluster report for the enterprise server dashboard feature.
|
||||
If a worker_id is provided, cluster_key and resource_groups are inferred from it.
|
||||
"""
|
||||
if self.session.feature_set == "basic":
|
||||
return
|
||||
|
||||
if not worker_id and not cluster_key:
|
||||
print("Error: cannot set up dashboard reporting - worker_id or cluster key are required")
|
||||
return
|
||||
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
if not cluster_key:
|
||||
worker_id_parts = worker_id.split(":")
|
||||
if len(worker_id_parts) < 3:
|
||||
cluster_key = self.session.config.get("agent.resource_dashboard.default_cluster_name", "onprem")
|
||||
resource_group = ":".join((cluster_key, worker_id_parts[0]))
|
||||
print(
|
||||
'WARNING: your worker ID "{}" is not suitable for proper resource dashboard reporting, please '
|
||||
'set up agent.worker_name to be at least two colon-separated parts (i.e. "<category>:<name>"). '
|
||||
'Using "{}" as the resource dashboard category and "{}" as the resource group.'.format(
|
||||
worker_id, cluster_key, resource_group
|
||||
)
|
||||
)
|
||||
else:
|
||||
cluster_key = worker_id_parts[0]
|
||||
resource_group = ":".join((worker_id_parts[:2]))
|
||||
|
||||
resource_groups = [resource_group]
|
||||
|
||||
self._cluster_report = ResourceMonitor.ClusterReport(
|
||||
cluster_key=cluster_key,
|
||||
max_gpus=len(available_gpus),
|
||||
max_workers=len(available_gpus) // min(x for x, _ in gpu_queues.values()),
|
||||
resource_groups=resource_groups
|
||||
)
|
||||
|
||||
self.send_cluster_report()
|
||||
except Exception as ex:
|
||||
print("Error: failed setting cluster report: {}".format(ex))
|
||||
|
||||
def _daemon(self):
|
||||
last_cluster_report = 0
|
||||
seconds_since_started = 0
|
||||
reported = 0
|
||||
try:
|
||||
@@ -153,7 +244,7 @@ class ResourceMonitor(object):
|
||||
try:
|
||||
self._update_readouts()
|
||||
except Exception as ex:
|
||||
log.warning("failed getting machine stats: %s", report_error(ex))
|
||||
log.error("failed getting machine stats: %s", report_error(ex))
|
||||
self._failure()
|
||||
|
||||
seconds_since_started += int(round(time() - last_report))
|
||||
@@ -176,6 +267,15 @@ class ResourceMonitor(object):
|
||||
|
||||
# count reported iterations
|
||||
reported += 1
|
||||
|
||||
if (
|
||||
self._cluster_report and
|
||||
self._cluster_report_interval_sec
|
||||
and time() - last_cluster_report > self._cluster_report_interval_sec
|
||||
):
|
||||
if self.send_cluster_report():
|
||||
last_cluster_report = time()
|
||||
|
||||
except Exception as ex:
|
||||
log.exception("Error reporting monitoring info: %s", str(ex))
|
||||
|
||||
@@ -242,7 +342,7 @@ class ResourceMonitor(object):
|
||||
virtual_memory = psutil.virtual_memory()
|
||||
stats["memory_used"] = BytesSizes.megabytes(virtual_memory.used)
|
||||
stats["memory_free"] = BytesSizes.megabytes(virtual_memory.available)
|
||||
disk_use_percentage = psutil.disk_usage(Text(Path.home())).percent
|
||||
disk_use_percentage = psutil.disk_usage(self._disk_use_path).percent
|
||||
stats["disk_free_percent"] = 100 - disk_use_percentage
|
||||
sensor_stat = (
|
||||
psutil.sensors_temperatures()
|
||||
@@ -264,23 +364,48 @@ class ResourceMonitor(object):
|
||||
if self._active_gpus is not False and self._gpustat:
|
||||
try:
|
||||
gpu_stat = self._gpustat.new_query()
|
||||
report_index = 0
|
||||
for i, g in enumerate(gpu_stat.gpus):
|
||||
# only monitor the active gpu's, if none were selected, monitor everything
|
||||
if self._active_gpus and str(i) not in self._active_gpus:
|
||||
continue
|
||||
stats["gpu_temperature_{:d}".format(i)] = g["temperature.gpu"]
|
||||
stats["gpu_utilization_{:d}".format(i)] = g["utilization.gpu"]
|
||||
stats["gpu_mem_usage_{:d}".format(i)] = (
|
||||
if self._active_gpus:
|
||||
uuid = getattr(g, "uuid", None)
|
||||
mig_uuid = getattr(g, "mig_uuid", None)
|
||||
if (
|
||||
str(g.index) not in self._active_gpus
|
||||
and (not uuid or uuid not in self._active_gpus)
|
||||
and (not mig_uuid or mig_uuid not in self._active_gpus)
|
||||
):
|
||||
continue
|
||||
stats["gpu_temperature_{}".format(report_index)] = g["temperature.gpu"]
|
||||
|
||||
if g["utilization.gpu"] is not None:
|
||||
stats["gpu_utilization_{}".format(report_index)] = g["utilization.gpu"]
|
||||
elif self._default_gpu_utilization is not None:
|
||||
stats["gpu_utilization_{}".format(report_index)] = self._default_gpu_utilization
|
||||
if getattr(g, "mig_index", None) is None and not self._gpu_utilization_warning_sent:
|
||||
# this shouldn't happen for non-MIGs, warn the user about it
|
||||
log.error("Failed fetching GPU utilization")
|
||||
self._gpu_utilization_warning_sent = True
|
||||
|
||||
stats["gpu_mem_usage_{}".format(report_index)] = (
|
||||
100.0 * g["memory.used"] / g["memory.total"]
|
||||
)
|
||||
# already in MBs
|
||||
stats["gpu_mem_free_{:d}".format(i)] = (
|
||||
stats["gpu_mem_free_{}".format(report_index)] = (
|
||||
g["memory.total"] - g["memory.used"]
|
||||
)
|
||||
stats["gpu_mem_used_%d" % i] = g["memory.used"]
|
||||
|
||||
stats["gpu_mem_used_{}".format(report_index)] = g["memory.used"] or 0
|
||||
|
||||
if self._fractions_handler:
|
||||
fractions = self._fractions_handler.fractions
|
||||
stats["gpu_fraction_{}".format(report_index)] = \
|
||||
(fractions[i] if i < len(fractions) else fractions[-1]) if fractions else 1.0
|
||||
report_index += 1
|
||||
|
||||
except Exception as ex:
|
||||
# something happened and we can't use gpu stats,
|
||||
log.warning("failed getting machine stats: %s", report_error(ex))
|
||||
log.error("failed getting machine stats: %s", report_error(ex))
|
||||
self._failure()
|
||||
|
||||
return stats
|
||||
@@ -293,19 +418,142 @@ class ResourceMonitor(object):
|
||||
)
|
||||
self._gpustat = None
|
||||
|
||||
BACKEND_STAT_MAP = {"cpu_usage_*": "cpu_usage",
|
||||
"cpu_temperature_*": "cpu_temperature",
|
||||
"disk_free_percent": "disk_free_home",
|
||||
"io_read_mbs": "disk_read",
|
||||
"io_write_mbs": "disk_write",
|
||||
"network_tx_mbs": "network_tx",
|
||||
"network_rx_mbs": "network_rx",
|
||||
"memory_free": "memory_free",
|
||||
"memory_used": "memory_used",
|
||||
"gpu_temperature_*": "gpu_temperature",
|
||||
"gpu_mem_used_*": "gpu_memory_used",
|
||||
"gpu_mem_free_*": "gpu_memory_free",
|
||||
"gpu_utilization_*": "gpu_usage"}
|
||||
BACKEND_STAT_MAP = {
|
||||
"cpu_usage_*": "cpu_usage",
|
||||
"cpu_temperature_*": "cpu_temperature",
|
||||
"disk_free_percent": "disk_free_home",
|
||||
"io_read_mbs": "disk_read",
|
||||
"io_write_mbs": "disk_write",
|
||||
"network_tx_mbs": "network_tx",
|
||||
"network_rx_mbs": "network_rx",
|
||||
"memory_free": "memory_free",
|
||||
"memory_used": "memory_used",
|
||||
"gpu_temperature_*": "gpu_temperature",
|
||||
"gpu_mem_used_*": "gpu_memory_used",
|
||||
"gpu_mem_free_*": "gpu_memory_free",
|
||||
"gpu_utilization_*": "gpu_usage",
|
||||
"gpu_fraction_*": "gpu_fraction"
|
||||
}
|
||||
|
||||
|
||||
class GpuFractionsHandler:
|
||||
_number_re = re.compile(r"^clear\.ml/fraction(-\d+)?$")
|
||||
_mig_re = re.compile(r"^nvidia\.com/mig-(?P<compute>[0-9]+)g\.(?P<memory>[0-9]+)gb$")
|
||||
_frac_gpu_injector_re = re.compile(r"^clearml-injector/fraction$")
|
||||
|
||||
_gpu_name_to_memory_gb = {
|
||||
"A30": 24,
|
||||
"NVIDIA A30": 24,
|
||||
"A100-SXM4-40GB": 40,
|
||||
"NVIDIA-A100-40GB-PCIe": 40,
|
||||
"NVIDIA A100-40GB-PCIe": 40,
|
||||
"NVIDIA-A100-SXM4-40GB": 40,
|
||||
"NVIDIA A100-SXM4-40GB": 40,
|
||||
"NVIDIA-A100-SXM4-80GB": 79,
|
||||
"NVIDIA A100-SXM4-80GB": 79,
|
||||
"NVIDIA-A100-80GB-PCIe": 79,
|
||||
"NVIDIA A100-80GB-PCIe": 79,
|
||||
}
|
||||
|
||||
def __init__(self):
|
||||
self._total_memory_gb = [
|
||||
self._gpu_name_to_memory_gb.get(name, 0)
|
||||
for name in (self._get_gpu_names() or [])
|
||||
]
|
||||
self._fractions = self._get_fractions()
|
||||
|
||||
@property
|
||||
def fractions(self) -> List[float]:
|
||||
return self._fractions
|
||||
|
||||
def _get_fractions(self) -> List[float]:
|
||||
if not self._total_memory_gb:
|
||||
# Can't compute
|
||||
return [1.0]
|
||||
|
||||
fractions = (ENV_GPU_FRACTIONS.get() or "").strip()
|
||||
if not fractions:
|
||||
# No fractions
|
||||
return [1.0]
|
||||
|
||||
decoded_fractions = self.decode_fractions(fractions)
|
||||
|
||||
if isinstance(decoded_fractions, list):
|
||||
return decoded_fractions
|
||||
|
||||
totals = []
|
||||
for i, (fraction, count) in enumerate(decoded_fractions.items()):
|
||||
m = self._mig_re.match(fraction)
|
||||
if not m:
|
||||
continue
|
||||
try:
|
||||
total_gb = self._total_memory_gb[i] if i < len(self._total_memory_gb) else self._total_memory_gb[-1]
|
||||
if not total_gb:
|
||||
continue
|
||||
totals.append((int(m.group("memory")) * count) / total_gb)
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
if not totals:
|
||||
log.warning("Fractions count is empty for {}".format(fractions))
|
||||
return [1.0]
|
||||
|
||||
return totals
|
||||
|
||||
@classmethod
|
||||
def extract_custom_limits(cls, limits: dict):
|
||||
for k, v in list((limits or {}).items()):
|
||||
if cls._number_re.match(k):
|
||||
limits.pop(k, None)
|
||||
|
||||
@classmethod
|
||||
def get_simple_fractions_total(cls, limits: dict) -> float:
|
||||
try:
|
||||
if any(cls._number_re.match(x) for x in limits):
|
||||
return sum(float(v) for k, v in limits.items() if cls._number_re.match(k))
|
||||
except Exception as ex:
|
||||
log.error("Failed summing up fractions from {}: {}".format(limits, ex))
|
||||
return 0
|
||||
|
||||
@classmethod
|
||||
def encode_fractions(cls, limits: dict, annotations: dict) -> str:
|
||||
if limits:
|
||||
if any(cls._number_re.match(x) for x in (limits or {})):
|
||||
return ",".join(str(v) for k, v in sorted(limits.items()) if cls._number_re.match(k))
|
||||
return ",".join(("{}:{}".format(k, v) for k, v in (limits or {}).items() if cls._mig_re.match(k)))
|
||||
elif annotations:
|
||||
if any(cls._frac_gpu_injector_re.match(x) for x in (annotations or {})):
|
||||
return ",".join(str(v) for k, v in sorted(annotations.items()) if cls._frac_gpu_injector_re.match(k))
|
||||
|
||||
@staticmethod
|
||||
def decode_fractions(fractions: str) -> Union[List[float], Dict[str, int]]:
|
||||
try:
|
||||
items = [f.strip() for f in fractions.strip().split(",")]
|
||||
tuples = [(k.strip(), v.strip()) for k, v in (f.partition(":")[::2] for f in items)]
|
||||
if all(not v for _, v in tuples):
|
||||
# comma-separated float fractions
|
||||
return [float(k) for k, _ in tuples]
|
||||
# comma-separated slice:count items
|
||||
return {
|
||||
k.strip(): int(v.strip())
|
||||
for k, v in tuples
|
||||
}
|
||||
except Exception as ex:
|
||||
log.error("Failed decoding GPU fractions '{}': {}".format(fractions, ex))
|
||||
return {}
|
||||
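As a rough illustration of the two encodings `decode_fractions` accepts (the values below are made up, not taken from the repository):

```python
# comma-separated float fractions -> list of floats
GpuFractionsHandler.decode_fractions("0.5,0.25")
# -> [0.5, 0.25]

# comma-separated <mig-slice>:<count> pairs -> dict of counts
GpuFractionsHandler.decode_fractions("nvidia.com/mig-3g.20gb:2")
# -> {"nvidia.com/mig-3g.20gb": 2}
```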
|
||||
@staticmethod
|
||||
def _get_gpu_names():
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
gpus = gpustat.new_query().gpus
|
||||
names = [g["name"] for g in gpus]
|
||||
|
||||
print("GPU names: {}".format(names))
|
||||
|
||||
return names
|
||||
except Exception as ex:
|
||||
log.error("Failed getting GPU names: {}".format(ex))
|
||||
|
||||
|
||||
def report_error(ex):
|
||||
|
||||
@@ -44,6 +44,11 @@ WORKER_ARGS = {
|
||||
}
|
||||
|
||||
DAEMON_ARGS = dict({
|
||||
'--polling-interval': {
|
||||
'help': 'Polling interval in seconds. Minimum is 5 (default 5)',
|
||||
'type': int,
|
||||
'default': 5,
|
||||
},
|
||||
'--foreground': {
|
||||
'help': 'Pipe full log to stdout/stderr, should not be used if running in background',
|
||||
'action': 'store_true',
|
||||
@@ -62,7 +67,10 @@ DAEMON_ARGS = dict({
|
||||
'group': 'Docker support',
|
||||
},
|
||||
'--queue': {
|
||||
'help': 'Queue ID(s)/Name(s) to pull tasks from (\'default\' queue)',
|
||||
'help': 'Queue ID(s)/Name(s) to pull tasks from (\'default\' queue).'
|
||||
' Note that the queue list order determines priority, with the first listed queue having the'
|
||||
' highest priority. To change this behavior, use --order-fairness to pull from each queue in a'
|
||||
' round-robin order',
|
||||
'nargs': '+',
|
||||
'default': tuple(),
|
||||
'dest': 'queues',
|
||||
@@ -107,8 +115,11 @@ DAEMON_ARGS = dict({
|
||||
'--dynamic-gpus': {
|
||||
'help': 'Allow to dynamically allocate gpus based on queue properties, '
|
||||
'configure with \'--queue <queue_name>=<num_gpus>\'.'
|
||||
' Example: \'--dynamic-gpus --gpus 0-3 --queue dual_gpus=2 single_gpu=1\''
|
||||
' Example Opportunistic: \'--dynamic-gpus --gpus 0-3 --queue dual_gpus=2 max_quad_gpus=1-4 \'',
|
||||
' Example: \'--dynamic-gpus --gpus 0-3 --queue dual_gpus=2 single_gpu=1\'.'
|
||||
' Example Opportunistic: \'--dynamic-gpus --gpus 0-3 --queue dual_gpus=2 max_quad_gpus=1-4\'.'
|
||||
' Note that the queue list order determines priority, with the first listed queue having the'
|
||||
' highest priority. To change this behavior, use --order-fairness to pull from each queue in a'
|
||||
' round-robin order',
|
||||
'action': 'store_true',
|
||||
},
|
||||
'--uptime': {
|
||||
|
||||
@@ -4,6 +4,7 @@ import json
|
||||
import logging
|
||||
import os
|
||||
import platform
|
||||
import re
|
||||
import sys
|
||||
from copy import deepcopy
|
||||
from typing import Any, Callable
|
||||
@@ -19,7 +20,7 @@ from clearml_agent.definitions import ENVIRONMENT_CONFIG, ENV_TASK_EXECUTE_AS_US
|
||||
from clearml_agent.errors import APIError
|
||||
from clearml_agent.helper.base import HOCONEncoder
|
||||
from clearml_agent.helper.process import Argv
|
||||
from clearml_agent.helper.docker_args import DockerArgsSanitizer
|
||||
from clearml_agent.helper.docker_args import DockerArgsSanitizer, sanitize_urls
|
||||
from .version import __version__
|
||||
|
||||
POETRY = "poetry"
|
||||
@@ -240,38 +241,49 @@ class Session(_Session):
|
||||
except:
|
||||
pass
|
||||
|
||||
def print_configuration(
|
||||
self,
|
||||
remove_secret_keys=("secret", "pass", "token", "account_key", "contents"),
|
||||
skip_value_keys=("environment", ),
|
||||
docker_args_sanitize_keys=("extra_docker_arguments", ),
|
||||
):
|
||||
def print_configuration(self):
|
||||
def load_config(key, default):
|
||||
return [re.compile(x) for x in self.config.get(f"agent.sanitize_config_printout.{key}", default=default)]
|
||||
|
||||
dont_hide_secret_keys = load_config("dont_hide_secrets", ("^enable_git_ask_pass$",))
|
||||
hide_secret_keys = load_config("hide_secrets", ("secret", "pass", "token", "account_key", "contents"))
|
||||
hide_secret_section_keys = load_config("hide_secrets_recursive", ("^environment$",))
|
||||
docker_cmd_keys = load_config("docker_commands", ("^extra_docker_arguments$",))
|
||||
urls_keys = load_config("urls", ("^extra_index_url$",))
|
||||
|
||||
# remove all the secrets from the print
|
||||
def recursive_remove_secrets(dictionary, secret_keys=(), empty_keys=()):
|
||||
def recursive_remove_secrets(dictionary):
|
||||
for k in list(dictionary):
|
||||
for s in secret_keys:
|
||||
if s in k:
|
||||
dictionary.pop(k)
|
||||
break
|
||||
for s in empty_keys:
|
||||
if s == k:
|
||||
if not any(r.search(k) for r in dont_hide_secret_keys):
|
||||
if any(r.search(k) for r in hide_secret_keys):
|
||||
dictionary[k] = '****'
|
||||
continue
|
||||
if any(r.search(k) for r in hide_secret_section_keys):
|
||||
dictionary[k] = {key: '****' for key in dictionary[k]} \
|
||||
if isinstance(dictionary[k], dict) else '****'
|
||||
break
|
||||
continue
|
||||
if any(r.search(k) for r in urls_keys):
|
||||
value = dictionary.get(k, None)
|
||||
if isinstance(value, str):
|
||||
dictionary[k] = sanitize_urls(value)[0]
|
||||
elif isinstance(value, (list, tuple)):
|
||||
dictionary[k] = [sanitize_urls(v)[0] for v in value]
|
||||
elif isinstance(value, dict):
|
||||
dictionary[k] = {k_: sanitize_urls(v)[0] for k_, v in value.items()}
|
||||
if isinstance(dictionary.get(k, None), dict):
|
||||
recursive_remove_secrets(dictionary[k], secret_keys=secret_keys, empty_keys=empty_keys)
|
||||
recursive_remove_secrets(dictionary[k])
|
||||
elif isinstance(dictionary.get(k, None), (list, tuple)):
|
||||
if k in (docker_args_sanitize_keys or []):
|
||||
if any(r.match(k) for r in docker_cmd_keys):
|
||||
dictionary[k] = DockerArgsSanitizer.sanitize_docker_command(self, dictionary[k])
|
||||
for item in dictionary[k]:
|
||||
if isinstance(item, dict):
|
||||
recursive_remove_secrets(item, secret_keys=secret_keys, empty_keys=empty_keys)
|
||||
recursive_remove_secrets(item)
|
||||
|
||||
config = deepcopy(self.config.to_dict())
|
||||
# remove the env variable, it's not important
|
||||
config.pop('env', None)
|
||||
if remove_secret_keys or skip_value_keys or docker_args_sanitize_keys:
|
||||
recursive_remove_secrets(config, secret_keys=remove_secret_keys, empty_keys=skip_value_keys)
|
||||
if hide_secret_keys or hide_secret_section_keys or docker_cmd_keys or urls_keys:
|
||||
recursive_remove_secrets(config)
|
||||
# remove logging.loggers.urllib3.level from the print
|
||||
try:
|
||||
config['logging']['loggers']['urllib3'].pop('level', None)
|
||||
|
||||
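Judging from the defaults passed to `load_config` above, the new sanitization patterns are regular expressions that can be overridden from clearml.conf; a hypothetical override might look like the following (key names come from the code, the values simply restate the defaults):

```hocon
agent {
    sanitize_config_printout {
        # keys matching these patterns are printed as "****"
        hide_secrets: ["secret", "pass", "token", "account_key", "contents"]
        dont_hide_secrets: ["^enable_git_ask_pass$"]
        # sections whose values are masked recursively
        hide_secrets_recursive: ["^environment$"]
        # keys treated as docker command lines / URL lists and sanitized accordingly
        docker_commands: ["^extra_docker_arguments$"]
        urls: ["^extra_index_url$"]
    }
}
```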
@@ -1 +1 @@
|
||||
__version__ = '1.6.1'
|
||||
__version__ = '1.9.2'
|
||||
|
||||
docker/k8s-glue/build-image-helper.sh (new file, 37 lines)
@@ -0,0 +1,37 @@
#!/bin/bash

# Check if image name and Dockerfile path are provided
if [ -z "$1" ] || [ -z "$2" ]; then
echo "Usage: $0 <image_name> <dockerfile_path> <build_context>"
exit 1
fi

# Build the Docker image
image_name=$1
dockerfile_path=$2
build_context=$3

if [ $build_context == "glue-build-aws" ] || [ $build_context == "glue-build-gcp" ]; then
if [ ! -f $build_context/clearml.conf ]; then
cp build-resources/clearml.conf $build_context
fi
if [ ! -f $build_context/entrypoint.sh ]; then
cp build-resources/entrypoint.sh $build_context
chmod +x $build_context/entrypoint.sh
fi
if [ ! -f $build_context/setup.sh ]; then
cp build-resources/setup.sh $build_context
chmod +x $build_context/setup.sh
fi
fi
cp ../../examples/k8s_glue_example.py $build_context

docker build -f $dockerfile_path -t $image_name $build_context

# cleanup
if [ $build_context == "glue-build-aws" ] || [ $build_context == "glue-build-gcp" ]; then
rm $build_context/clearml.conf
rm $build_context/entrypoint.sh
rm $build_context/setup.sh
fi
rm $build_context/k8s_glue_example.py
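A usage note (not part of the diff): given the relative `../../examples/k8s_glue_example.py` path, the helper is presumably meant to be run from `docker/k8s-glue/`, along the lines of `./build-image-helper.sh clearml-agent-k8s-glue glue-build-aws/Dockerfile glue-build-aws` — the image name and Dockerfile path here are purely illustrative.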
@@ -361,7 +361,7 @@ sdk {
vcs_repo_detect_async: true

# Store uncommitted git/hg source code diff in experiment manifest when training in development mode
# This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
# This stores "git diff" or into the experiment's "script.requirements.diff" section
store_uncommitted_code_diff: true

# Support stopping an experiment in case it was externally stopped, status was changed or task was reset
@@ -1,112 +0,0 @@
|
||||
"""
|
||||
This example assumes you have preconfigured services with selectors in the form of
|
||||
"ai.allegro.agent.serial=pod-<number>" and a targetPort of 10022.
|
||||
The K8sIntegration component will label each pod accordingly.
|
||||
"""
|
||||
from argparse import ArgumentParser
|
||||
|
||||
from clearml_agent.glue.k8s import K8sIntegration
|
||||
|
||||
|
||||
def parse_args():
|
||||
parser = ArgumentParser()
|
||||
group = parser.add_mutually_exclusive_group()
|
||||
|
||||
parser.add_argument(
|
||||
"--queue", type=str, help="Queue to pull tasks from"
|
||||
)
|
||||
group.add_argument(
|
||||
"--ports-mode", action='store_true', default=False,
|
||||
help="Ports-Mode will add a label to the pod which can be used as service, in order to expose ports"
|
||||
"Should not be used with max-pods"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num-of-services", type=int, default=20,
|
||||
help="Specify the number of k8s services to be used. Use only with ports-mode."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--base-port", type=int,
|
||||
help="Used in conjunction with ports-mode, specifies the base port exposed by the services. "
|
||||
"For pod #X, the port will be <base-port>+X. Note that pod number is calculated based on base-pod-num"
|
||||
"e.g. if base-port=20000 and base-pod-num=3, the port for the first pod will be 20003"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--base-pod-num", type=int, default=1,
|
||||
help="Used in conjunction with ports-mode and base-port, specifies the base pod number to be used by the "
|
||||
"service (default: %(default)s)"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--gateway-address", type=str, default=None,
|
||||
help="Used in conjunction with ports-mode, specify the external address of the k8s ingress / ELB"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--pod-clearml-conf", type=str,
|
||||
help="Configuration file to be used by the pod itself (if not provided, current configuration is used)"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--overrides-yaml", type=str,
|
||||
help="YAML file containing pod overrides to be used when launching a new pod"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--template-yaml", type=str,
|
||||
help="YAML file containing pod template. If provided pod will be scheduled with kubectl apply "
|
||||
"and overrides are ignored, otherwise it will be scheduled with kubectl run"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--ssh-server-port", type=int, default=0,
|
||||
help="If non-zero, every pod will also start an SSH server on the selected port (default: zero, not active)"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--namespace", type=str,
|
||||
help="Specify the namespace in which pods will be created (default: %(default)s)", default="clearml"
|
||||
)
|
||||
group.add_argument(
|
||||
"--max-pods", type=int,
|
||||
help="Limit the maximum number of pods that this service can run at the same time."
|
||||
"Should not be used with ports-mode"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--use-owner-token", action="store_true", default=False,
|
||||
help="Generate and use task owner token for the execution of each task"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--standalone-mode", action="store_true", default=False,
|
||||
help="Do not use any network connects, assume everything is pre-installed"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--child-report-tags", type=str, nargs="+", default=None,
|
||||
help="List of tags to send with the status reports from a worker that runs a task"
|
||||
)
|
||||
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def main():
|
||||
args = parse_args()
|
||||
|
||||
user_props_cb = None
|
||||
if args.ports_mode and args.base_port:
|
||||
def k8s_user_props_cb(pod_number=0):
|
||||
user_prop = {"k8s-pod-port": args.base_port + pod_number}
|
||||
if args.gateway_address:
|
||||
user_prop["k8s-gateway-address"] = args.gateway_address
|
||||
return user_prop
|
||||
user_props_cb = k8s_user_props_cb
|
||||
|
||||
k8s = K8sIntegration(
|
||||
ports_mode=args.ports_mode, num_of_services=args.num_of_services, base_pod_num=args.base_pod_num,
|
||||
user_props_cb=user_props_cb, overrides_yaml=args.overrides_yaml, clearml_conf_file=args.pod_clearml_conf,
|
||||
template_yaml=args.template_yaml, extra_bash_init_script=K8sIntegration.get_ssh_server_bash(
|
||||
ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
|
||||
namespace=args.namespace, max_pods_limit=args.max_pods or None
|
||||
)
|
||||
k8s.k8s_daemon(
|
||||
args.queue,
|
||||
use_owner_token=args.use_owner_token,
|
||||
standalone_mode=args.standalone_mode,
|
||||
child_report_tags=args.child_report_tags
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -1,4 +1,4 @@
FROM ubuntu:18.04
FROM ubuntu:22.04

USER root
WORKDIR /root
@@ -8,15 +8,16 @@ ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
ENV PYTHONIOENCODING=UTF-8

COPY ../build-resources/setup.sh /root/setup.sh
COPY ./setup.sh /root/setup.sh
RUN /root/setup.sh

COPY ./setup_aws.sh /root/setup_aws.sh
RUN /root/setup_aws.sh
RUN chmod +x /root/setup_aws.sh && /root/setup_aws.sh

COPY ../build-resources/entrypoint.sh /root/entrypoint.sh
COPY ./entrypoint.sh /root/entrypoint.sh
COPY ./provider_entrypoint.sh /root/provider_entrypoint.sh
COPY ./build-resources/k8s_glue_example.py /root/k8s_glue_example.py
RUN chmod +x /root/provider_entrypoint.sh
COPY ./k8s_glue_example.py /root/k8s_glue_example.py
COPY ./clearml.conf /root/clearml.conf

ENTRYPOINT ["/root/entrypoint.sh"]
@@ -4,7 +4,8 @@ curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip
unzip awscliv2.zip
./aws/install

curl -o kubectl https://amazon-eks.s3-us-west-2.amazonaws.com/1.21.2/2021-07-05/bin/linux/amd64/kubectl
curl -o kubectl https://s3.us-west-2.amazonaws.com/amazon-eks/1.29.3/2024-04-19/bin/linux/amd64/kubectl
#curl -o kubectl https://amazon-eks.s3-us-west-2.amazonaws.com/1.21.2/2021-07-05/bin/linux/amd64/kubectl
#curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.19.6/2021-01-05/bin/linux/amd64/kubectl
chmod +x ./kubectl && mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$PATH:$HOME/bin
@@ -1,4 +1,4 @@
FROM ubuntu:18.04
FROM ubuntu:22.04

USER root
WORKDIR /root
@@ -8,15 +8,15 @@ ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
ENV PYTHONIOENCODING=UTF-8

COPY ../build-resources/setup.sh /root/setup.sh
COPY ./setup.sh /root/setup.sh
RUN /root/setup.sh

COPY ./setup_gcp.sh /root/setup_gcp.sh
RUN /root/setup_gcp.sh
RUN chmod +x /root/setup_gcp.sh && /root/setup_gcp.sh

COPY ../build-resources/entrypoint.sh /root/entrypoint.sh
COPY ./entrypoint.sh /root/entrypoint.sh
COPY ./provider_entrypoint.sh /root/provider_entrypoint.sh
COPY ./build-resources/k8s_glue_example.py /root/k8s_glue_example.py
COPY ./k8s_glue_example.py /root/k8s_glue_example.py
COPY ./clearml.conf /root/clearml.conf

ENTRYPOINT ["/root/entrypoint.sh"]
@@ -1,6 +1,6 @@
#!/bin/bash

curl -LO https://dl.k8s.io/release/v1.21.0/bin/linux/amd64/kubectl
curl -LO https://dl.k8s.io/release/v1.29.3/bin/linux/amd64/kubectl

install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
@@ -20,7 +20,7 @@ FROM python:${TAG} as target

WORKDIR /app

ARG KUBECTL_VERSION=1.24.0
ARG KUBECTL_VERSION=1.29.3

# Not sure about these ENV vars
# ENV LC_ALL=en_US.UTF-8

@@ -2,7 +2,7 @@ ARG TAG=3.7.17-slim-bullseye

FROM python:${TAG} as target

ARG KUBECTL_VERSION=1.22.4
ARG KUBECTL_VERSION=1.29.3

WORKDIR /app
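Since `KUBECTL_VERSION` is declared as a build `ARG`, the bundled kubectl release can presumably still be overridden at image build time via Docker's standard `--build-arg KUBECTL_VERSION=<version>` mechanism; 1.29.3 is only the new default.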
@@ -1,98 +0,0 @@
|
||||
"""
|
||||
This example assumes you have preconfigured services with selectors in the form of
|
||||
"ai.allegro.agent.serial=pod-<number>" and a targetPort of 10022.
|
||||
The K8sIntegration component will label each pod accordingly.
|
||||
"""
|
||||
from argparse import ArgumentParser
|
||||
|
||||
from clearml_agent.glue.k8s import K8sIntegration
|
||||
|
||||
|
||||
def parse_args():
|
||||
parser = ArgumentParser()
|
||||
group = parser.add_mutually_exclusive_group()
|
||||
|
||||
parser.add_argument(
|
||||
"--queue", type=str, help="Queue to pull tasks from"
|
||||
)
|
||||
group.add_argument(
|
||||
"--ports-mode", action='store_true', default=False,
|
||||
help="Ports-Mode will add a label to the pod which can be used as service, in order to expose ports"
|
||||
"Should not be used with max-pods"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num-of-services", type=int, default=20,
|
||||
help="Specify the number of k8s services to be used. Use only with ports-mode."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--base-port", type=int,
|
||||
help="Used in conjunction with ports-mode, specifies the base port exposed by the services. "
|
||||
"For pod #X, the port will be <base-port>+X. Note that pod number is calculated based on base-pod-num"
|
||||
"e.g. if base-port=20000 and base-pod-num=3, the port for the first pod will be 20003"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--base-pod-num", type=int, default=1,
|
||||
help="Used in conjunction with ports-mode and base-port, specifies the base pod number to be used by the "
|
||||
"service (default: %(default)s)"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--gateway-address", type=str, default=None,
|
||||
help="Used in conjunction with ports-mode, specify the external address of the k8s ingress / ELB"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--pod-clearml-conf", type=str,
|
||||
help="Configuration file to be used by the pod itself (if not provided, current configuration is used)"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--overrides-yaml", type=str,
|
||||
help="YAML file containing pod overrides to be used when launching a new pod"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--template-yaml", type=str,
|
||||
help="YAML file containing pod template. If provided pod will be scheduled with kubectl apply "
|
||||
"and overrides are ignored, otherwise it will be scheduled with kubectl run"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--ssh-server-port", type=int, default=0,
|
||||
help="If non-zero, every pod will also start an SSH server on the selected port (default: zero, not active)"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--namespace", type=str,
|
||||
help="Specify the namespace in which pods will be created (default: %(default)s)", default="clearml"
|
||||
)
|
||||
group.add_argument(
|
||||
"--max-pods", type=int,
|
||||
help="Limit the maximum number of pods that this service can run at the same time."
|
||||
"Should not be used with ports-mode"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--use-owner-token", action="store_true", default=False,
|
||||
help="Generate and use task owner token for the execution of each task"
|
||||
)
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def main():
|
||||
args = parse_args()
|
||||
|
||||
user_props_cb = None
|
||||
if args.ports_mode and args.base_port:
|
||||
def k8s_user_props_cb(pod_number=0):
|
||||
user_prop = {"k8s-pod-port": args.base_port + pod_number}
|
||||
if args.gateway_address:
|
||||
user_prop["k8s-gateway-address"] = args.gateway_address
|
||||
return user_prop
|
||||
user_props_cb = k8s_user_props_cb
|
||||
|
||||
k8s = K8sIntegration(
|
||||
ports_mode=args.ports_mode, num_of_services=args.num_of_services, base_pod_num=args.base_pod_num,
|
||||
user_props_cb=user_props_cb, overrides_yaml=args.overrides_yaml, clearml_conf_file=args.pod_clearml_conf,
|
||||
template_yaml=args.template_yaml, extra_bash_init_script=K8sIntegration.get_ssh_server_bash(
|
||||
ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
|
||||
namespace=args.namespace, max_pods_limit=args.max_pods or None,
|
||||
)
|
||||
k8s.k8s_daemon(args.queue, use_owner_token=args.use_owner_token)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -1,4 +1,4 @@
FROM ubuntu:18.04
FROM ubuntu:22.04

USER root
WORKDIR /root

@@ -33,4 +33,10 @@ if [ -z "$CLEARML_AGENT_NO_UPDATE" ]; then
fi
fi

clearml-agent daemon $DAEMON_OPTIONS --queue $QUEUES --docker "${CLEARML_AGENT_DEFAULT_BASE_DOCKER:-$TRAINS_AGENT_DEFAULT_BASE_DOCKER}" --cpu-only ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS}
DOCKER_ARGS="--docker \"${CLEARML_AGENT_DEFAULT_BASE_DOCKER:-$TRAINS_AGENT_DEFAULT_BASE_DOCKER}\""

if [ -n "$CLEARML_AGENT_NO_DOCKER" ]; then
DOCKER_ARGS=""
fi

clearml-agent daemon $DAEMON_OPTIONS --queue $QUEUES $DOCKER_ARGS --cpu-only ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS}
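In effect, the entrypoint now passes `--docker ...` only when `CLEARML_AGENT_NO_DOCKER` is unset; exporting the variable with any non-empty value (for example `CLEARML_AGENT_NO_DOCKER=1`) before the entrypoint runs should start the agent daemon without the docker flag.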
@@ -58,8 +58,8 @@ agent {
# it solves passing user/token to git submodules.
# this is a safer way to ensure multiple users using the same repository will
# not accidentally leak credentials
# Only supported on Linux systems, it will be the default in future releases
# enable_git_ask_pass: false
# Note: this is only supported on Linux systems
# enable_git_ask_pass: true

# in docker mode, if container's entrypoint automatically activated a virtual environment
# use the activated virtual environment and install everything there
@@ -108,10 +108,17 @@ agent {
# pytorch_resolve: "pip"

# additional conda channels to use when installing with conda package manager
conda_channels: ["pytorch", "conda-forge", "defaults", ]
conda_channels: ["pytorch", "conda-forge", "nvidia", "defaults", ]
# conda_full_env_update: false

# notice this will not install any additional packages into the selected environment, should be used in
# conjunction with CLEARML_CONDA_ENV_PACKAGE which points to an existing conda environment directory
# conda_env_as_base_docker: false

# install into base conda environment
# (should only be used if running in docker mode, because it will change the conda base enrichment)
# use_conda_base_env: false

# set the priority packages to be installed before the rest of the required packages
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
# priority_packages: ["cython", "numpy", "setuptools", ]
@@ -154,6 +161,9 @@ agent {
vcs_cache: {
enabled: true,
path: ~/.clearml/vcs-cache

# if git pull failed, always revert to re-cloning the repo, it protects against old user name changes
# clone_on_pull_fail: false
},

# DEPRECATED: please use `venvs_cache` and set `venvs_cache.path`
@@ -189,6 +199,13 @@ agent {
# You can also pass host environments into the container with ["-e", "HOST_NAME=$HOST_NAME"]
# extra_docker_arguments: ["--ipc=host", "-v", "/mnt/host/data:/mnt/data"]

# Allow the extra docker arg to override task level docker arg (if the same argument is passed on both),
# if set to False, a task docker arg will override the docker extra arg
# docker_args_extra_precedes_task: true

# prevent a task docker args to be used if already specified in the extra_docker_arguments
# protected_docker_extra_args: ["privileged", "security-opt", "network", "ipc"]

# optional shell script to run in docker when started before the experiment is started
# extra_docker_shell_script: ["apt-get install -y bindfs", ]

@@ -287,9 +304,11 @@ agent {
# sdk_cache: "/clearml_agent_cache"
# apt_cache: "/var/cache/apt/archives"
# ssh_folder: "/root/.ssh"
# ssh_ro_folder: "/.ssh"
# pip_cache: "/root/.cache/pip"
# poetry_cache: "/root/.cache/pypoetry"
# vcs_cache: "/root/.clearml/vcs-cache"
# venvs_cache: "/root/.clearml/venvs-cache"
# venv_build: "~/.clearml/venvs-builds"
# pip_download: "/root/.clearml/pip-download-cache"
# }
@@ -444,7 +463,7 @@ sdk {
vcs_repo_detect_async: True

# Store uncommitted git/hg source code diff in experiment manifest when training in development mode
# This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
# This stores "git diff" or into the experiment's "script.requirements.diff" section
store_uncommitted_code_diff_on_train: True

# Support stopping an experiment in case it was externally stopped, status was changed or task was reset
@@ -5,27 +5,30 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Auto-Magically Spin AWS EC2 Instances On Demand \n",
|
||||
"# and Create a Dynamic Cluster Running *Trains-Agent*\n",
|
||||
"# and Create a Dynamic Cluster Running *ClearML-Agent*\n",
|
||||
"\n",
|
||||
"### Define your budget and execute the notebook, that's it\n",
|
||||
"### You now have a fully managed cluster on AWS 🎉 🎊 "
|
||||
"## Define your budget and execute the notebook, that's it\n",
|
||||
"## You now have a fully managed cluster on AWS 🎉 🎊"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**trains-agent**'s main goal is to quickly pull a job from an execution queue, setup the environment (as defined in the experiment, including git cloning, python packages etc.) then execute the experiment and monitor it.\n",
|
||||
"**clearml-agent**'s main goal is to quickly pull a job from an execution queue, set up the environment (as defined in the experiment, including git cloning, python packages etc.), then execute the experiment and monitor it.\n",
|
||||
"\n",
|
||||
"This notebook defines a cloud budget (currently only AWS is supported, but feel free to expand with PRs), and spins an instance the minute a job is waiting for execution. It will also spin down idle machines, saving you some $$$ :)\n",
|
||||
"\n",
|
||||
"Configuration steps\n",
|
||||
"> **Note:**\n",
|
||||
"> This is just an example of how you can use ClearML Agent to implement custom autoscaling. For a more structured autoscaler script, see [here](https://github.com/allegroai/clearml/blob/master/clearml/automation/auto_scaler.py).\n",
|
||||
"\n",
|
||||
"Configuration steps:\n",
|
||||
"- Define maximum budget to be used (instance type / number of instances).\n",
|
||||
"- Create new execution *queues* in the **trains-server**.\n",
|
||||
"- Define mapping between the created the *queues* and an instance budget.\n",
|
||||
"- Create new execution *queues* in the **clearml-server**.\n",
|
||||
"- Define mapping between the created *queues* and an instance budget.\n",
|
||||
"\n",
|
||||
"**TL;DR - This notebook:**\n",
|
||||
"- Will spin instances if there are jobs in the execution *queues*, until it will hit the budget limit. \n",
|
||||
"- Will spin instances if there are jobs in the execution *queues* until it will hit the budget limit.\n",
|
||||
"- If machines are idle, it will spin them down.\n",
|
||||
"\n",
|
||||
"The controller implementation itself is stateless, meaning you can always re-execute the notebook, if for some reason it stopped.\n",
|
||||
@@ -39,7 +42,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"##### Install & import required packages"
|
||||
"### Install & import required packages"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -48,7 +51,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!pip install trains-agent\n",
|
||||
"!pip install clearml-agent\n",
|
||||
"!pip install boto3"
|
||||
]
|
||||
},
|
||||
@@ -56,7 +59,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"##### Define AWS instance types and configuration (Instance Type, EBS, AMI etc.)"
|
||||
"### Define AWS instance types and configuration (Instance Type, EBS, AMI etc.)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -92,17 +95,17 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"##### Define machine budget per execution queue\n",
|
||||
"### Define machine budget per execution queue\n",
|
||||
"\n",
|
||||
"Now that we defined our budget, we need to connect it with the **Trains** cluster.\n",
|
||||
"Now that we defined our budget, we need to connect it with the **ClearML** cluster.\n",
|
||||
"\n",
|
||||
"We map each queue to a resource type (instance type).\n",
|
||||
"\n",
|
||||
"Create two queues in the WebUI:\n",
|
||||
"- Browse to http://your_trains_server_ip:8080/workers-and-queues/queues\n",
|
||||
"Create two queues in the Web UI:\n",
|
||||
"- Browse to http://your_clearml_server_ip:8080/workers-and-queues/queues\n",
|
||||
"- Then click on the \"New Queue\" button and name your queues \"aws_normal\" and \"aws_high\" respectively\n",
|
||||
"\n",
|
||||
"The QUEUES dictionary hold the mapping between the queue name and the type/number of instances to spin connected to the specific queue.\n",
|
||||
"The QUEUES dictionary holds the mapping between the queue name and the type/number of instances to spin connected to the specific queue.\n",
|
||||
"```\n",
|
||||
"QUEUES = {\n",
|
||||
" 'queue_name': [(\"instance-type-as-defined-in-RESOURCE_CONFIGURATIONS\", max_number_of_instances), ]\n",
|
||||
@@ -116,7 +119,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Trains-Agent Queues - Machines budget per Queue\n",
|
||||
"# ClearML Agent Queues - Machines budget per Queue\n",
|
||||
"# Per queue: list of (machine type as defined in RESOURCE_CONFIGURATIONS,\n",
|
||||
"# max instances for the specific queue). Order machines from most preferred to least.\n",
|
||||
"QUEUES = {\n",
|
||||
@@ -129,7 +132,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"##### Credentials for your AWS account, as well as for your **Trains-Server**"
|
||||
"### Credentials for your AWS account, as well as for your **ClearML Server**"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -143,24 +146,25 @@
|
||||
"CLOUD_CREDENTIALS_SECRET = \"\"\n",
|
||||
"CLOUD_CREDENTIALS_REGION = \"us-east-1\"\n",
|
||||
"\n",
|
||||
"# TRAINS configuration\n",
|
||||
"TRAINS_SERVER_WEB_SERVER = \"http://localhost:8080\"\n",
|
||||
"TRAINS_SERVER_API_SERVER = \"http://localhost:8008\"\n",
|
||||
"TRAINS_SERVER_FILES_SERVER = \"http://localhost:8081\"\n",
|
||||
"# TRAINS credentials\n",
|
||||
"TRAINS_ACCESS_KEY = \"\"\n",
|
||||
"TRAINS_SECRET_KEY = \"\"\n",
|
||||
"# Git User/Pass to be used by trains-agent,\n",
|
||||
"# CLEARML configuration\n",
|
||||
"CLEARML_WEB_SERVER = \"http://localhost:8080\"\n",
|
||||
"CLEARML_API_SERVER = \"http://localhost:8008\"\n",
|
||||
"CLEARML_FILES_SERVER = \"http://localhost:8081\"\n",
|
||||
"# CLEARML credentials\n",
|
||||
"CLEARML_API_ACCESS_KEY = \"\"\n",
|
||||
"CLEARML_API_SECRET_KEY = \"\"\n",
|
||||
"# Git User/Pass to be used by clearml-agent,\n",
|
||||
"# leave empty if image already contains git ssh-key\n",
|
||||
"TRAINS_GIT_USER = \"\"\n",
|
||||
"TRAINS_GIT_PASS = \"\"\n",
|
||||
"CLEARML_AGENT_GIT_USER = \"\"\n",
|
||||
"CLEARML_AGENT_GIT_PASS = \"\"\n",
|
||||
"\n",
|
||||
"# Additional fields for trains.conf file created on the remote instance\n",
|
||||
"# Additional fields for clearml.conf file created on the remote instance\n",
|
||||
"# for example: 'agent.default_docker.image: \"nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04\"'\n",
|
||||
"EXTRA_TRAINS_CONF = \"\"\"\n",
|
||||
"\n",
|
||||
"EXTRA_CLEARML_CONF = \"\"\"\n",
|
||||
"\"\"\"\n",
|
||||
"\n",
|
||||
"# Bash script to run on instances before running trains-agent\n",
|
||||
"# Bash script to run on instances before running clearml-agent\n",
|
||||
"# Example: \"\"\"\n",
|
||||
"# echo \"This is the first line\"\n",
|
||||
"# echo \"This is the second line\"\n",
|
||||
@@ -168,9 +172,9 @@
|
||||
"EXTRA_BASH_SCRIPT = \"\"\"\n",
|
||||
"\"\"\"\n",
|
||||
"\n",
|
||||
"# Default docker for trains-agent when running in docker mode (requires docker v19.03 and above). \n",
|
||||
"# Leave empty to run trains-agent in non-docker mode.\n",
|
||||
"DEFAULT_DOCKER_IMAGE = \"nvidia/cuda\""
|
||||
"# Default docker for clearml-agent when running in docker mode (requires docker v19.03 and above).\n",
|
||||
"# Leave empty to run clearml-agent in non-docker mode.\n",
|
||||
"CLEARML_AGENT_DOCKER_IMAGE = \"nvidia/cuda\""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -192,7 +196,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"##### Import Packages and Budget Definition Sanity Check"
|
||||
"### Import Packages and Budget Definition Sanity Check"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -209,7 +213,7 @@
|
||||
"from time import sleep, time\n",
|
||||
"\n",
|
||||
"import boto3\n",
|
||||
"from trains_agent.backend_api.session.client import APIClient"
|
||||
"from clearml_agent.backend_api.session.client import APIClient"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -227,36 +231,36 @@
|
||||
" \"A resource name can only appear in a single queue definition.\"\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
"# Encode EXTRA_TRAINS_CONF for later bash script usage\n",
|
||||
"EXTRA_TRAINS_CONF_ENCODED = \"\\\\\\\"\".join(EXTRA_TRAINS_CONF.split(\"\\\"\"))"
|
||||
"# Encode EXTRA_CLEARML_CONF for later bash script usage\n",
|
||||
"EXTRA_CLEARML_CONF_ENCODED = \"\\\\\\\"\".join(EXTRA_CLEARML_CONF.split(\"\\\"\"))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"##### Cloud specific implementation of spin up/down - currently supports AWS only"
|
||||
"### Cloud specific implementation of spin up/down - currently supports AWS only"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Cloud-specific implementation (currently, only AWS EC2 is supported)\n",
|
||||
"def spin_up_worker(resource, worker_id_prefix, queue_name):\n",
|
||||
" \"\"\"\n",
|
||||
" Creates a new worker for trains.\n",
|
||||
" Creates a new worker for clearml.\n",
|
||||
" First, create an instance in the cloud and install some required packages.\n",
|
||||
" Then, define trains-agent environment variables and run \n",
|
||||
" trains-agent for the specified queue.\n",
|
||||
" Then, define clearml-agent environment variables and run\n",
|
||||
" clearml-agent for the specified queue.\n",
|
||||
" NOTE: - Will wait until instance is running\n",
|
||||
" - This implementation assumes the instance image already has docker installed\n",
|
||||
"\n",
|
||||
" :param str resource: resource name, as defined in BUDGET and QUEUES.\n",
|
||||
" :param str worker_id_prefix: worker name prefix\n",
|
||||
" :param str queue_name: trains queue to listen to\n",
|
||||
" :param str queue_name: clearml queue to listen to\n",
|
||||
" \"\"\"\n",
|
||||
" resource_conf = RESOURCE_CONFIGURATIONS[resource]\n",
|
||||
" # Add worker type and AWS instance type to the worker name.\n",
|
||||
@@ -267,8 +271,8 @@
|
||||
" )\n",
|
||||
"\n",
|
||||
" # user_data script will automatically run when the instance is started. \n",
|
||||
" # It will install the required packages for trains-agent configure it using \n",
|
||||
" # environment variables and run trains-agent on the required queue\n",
|
||||
" # It will install the required packages for clearml-agent configure it using\n",
|
||||
" # environment variables and run clearml-agent on the required queue\n",
|
||||
" user_data = \"\"\"#!/bin/bash\n",
|
||||
" sudo apt-get update\n",
|
||||
" sudo apt-get install -y python3-dev\n",
|
||||
@@ -278,36 +282,36 @@
|
||||
" sudo apt-get install -y build-essential\n",
|
||||
" python3 -m pip install -U pip\n",
|
||||
" python3 -m pip install virtualenv\n",
|
||||
" python3 -m virtualenv trains_agent_venv\n",
|
||||
" source trains_agent_venv/bin/activate\n",
|
||||
" python -m pip install trains-agent\n",
|
||||
" echo 'agent.git_user=\\\"{git_user}\\\"' >> /root/trains.conf\n",
|
||||
" echo 'agent.git_pass=\\\"{git_pass}\\\"' >> /root/trains.conf\n",
|
||||
" echo \"{trains_conf}\" >> /root/trains.conf\n",
|
||||
" export TRAINS_API_HOST={api_server}\n",
|
||||
" export TRAINS_WEB_HOST={web_server}\n",
|
||||
" export TRAINS_FILES_HOST={files_server}\n",
|
||||
" python3 -m virtualenv clearml_agent_venv\n",
|
||||
" source clearml_agent_venv/bin/activate\n",
|
||||
" python -m pip install clearml-agent\n",
|
||||
" echo 'agent.git_user=\\\"{git_user}\\\"' >> /root/clearml.conf\n",
|
||||
" echo 'agent.git_pass=\\\"{git_pass}\\\"' >> /root/clearml.conf\n",
|
||||
" echo \"{clearml_conf}\" >> /root/clearml.conf\n",
|
||||
" export CLEARML_API_HOST={api_server}\n",
|
||||
" export CLEARML_WEB_HOST={web_server}\n",
|
||||
" export CLEARML_FILES_HOST={files_server}\n",
|
||||
" export DYNAMIC_INSTANCE_ID=`curl http://169.254.169.254/latest/meta-data/instance-id`\n",
|
||||
" export TRAINS_WORKER_ID={worker_id}:$DYNAMIC_INSTANCE_ID\n",
|
||||
" export TRAINS_API_ACCESS_KEY='{access_key}'\n",
|
||||
" export TRAINS_API_SECRET_KEY='{secret_key}'\n",
|
||||
" export CLEARML_WORKER_ID={worker_id}:$DYNAMIC_INSTANCE_ID\n",
|
||||
" export CLEARML_API_ACCESS_KEY='{access_key}'\n",
|
||||
" export CLEARML_API_SECRET_KEY='{secret_key}'\n",
|
||||
" {bash_script}\n",
|
||||
" source ~/.bashrc\n",
|
||||
" python -m trains_agent --config-file '/root/trains.conf' daemon --queue '{queue}' {docker}\n",
|
||||
" python -m clearml_agent --config-file '/root/clearml.conf' daemon --queue '{queue}' {docker}\n",
|
||||
" shutdown\n",
|
||||
" \"\"\".format(\n",
|
||||
" api_server=TRAINS_SERVER_API_SERVER,\n",
|
||||
" web_server=TRAINS_SERVER_WEB_SERVER,\n",
|
||||
" files_server=TRAINS_SERVER_FILES_SERVER,\n",
|
||||
" api_server=CLEARML_API_SERVER,\n",
|
||||
" web_server=CLEARML_WEB_SERVER,\n",
|
||||
" files_server=CLEARML_FILES_SERVER,\n",
|
||||
" worker_id=worker_id,\n",
|
||||
" access_key=TRAINS_ACCESS_KEY,\n",
|
||||
" secret_key=TRAINS_SECRET_KEY,\n",
|
||||
" access_key=CLEARML_API_ACCESS_KEY,\n",
|
||||
" secret_key=CLEARML_API_SECRET_KEY,\n",
|
||||
" queue=queue_name,\n",
|
||||
" git_user=TRAINS_GIT_USER,\n",
|
||||
" git_pass=TRAINS_GIT_PASS,\n",
|
||||
" trains_conf=EXTRA_TRAINS_CONF_ENCODED,\n",
|
||||
" git_user=CLEARML_AGENT_GIT_USER,\n",
|
||||
" git_pass=CLEARML_AGENT_GIT_PASS,\n",
|
||||
" clearml_conf=EXTRA_CLEARML_CONF_ENCODED,\n",
|
||||
" bash_script=EXTRA_BASH_SCRIPT,\n",
|
||||
" docker=\"--docker '{}'\".format(DEFAULT_DOCKER_IMAGE) if DEFAULT_DOCKER_IMAGE else \"\"\n",
|
||||
" docker=\"--docker '{}'\".format(CLEARML_AGENT_DOCKER_IMAGE) if CLEARML_AGENT_DOCKER_IMAGE else \"\"\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" ec2 = boto3.client(\n",
|
||||
@@ -405,7 +409,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"###### Controller Implementation and Logic"
|
||||
"#### Controller Implementation and Logic"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -430,18 +434,18 @@
|
||||
"\n",
|
||||
" # Internal definitions\n",
|
||||
" workers_prefix = \"dynamic_aws\"\n",
|
||||
" # Worker's id in trains would be composed from:\n",
|
||||
" # Worker's id in clearml would be composed from:\n",
|
||||
" # prefix, name, instance_type and cloud_id separated by ';'\n",
|
||||
" workers_pattern = re.compile(\n",
|
||||
" r\"^(?P<prefix>[^:]+):(?P<name>[^:]+):(?P<instance_type>[^:]+):(?P<cloud_id>[^:]+)\"\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" # Set up the environment variables for trains\n",
|
||||
" os.environ[\"TRAINS_API_HOST\"] = TRAINS_SERVER_API_SERVER\n",
|
||||
" os.environ[\"TRAINS_WEB_HOST\"] = TRAINS_SERVER_WEB_SERVER\n",
|
||||
" os.environ[\"TRAINS_FILES_HOST\"] = TRAINS_SERVER_FILES_SERVER\n",
|
||||
" os.environ[\"TRAINS_API_ACCESS_KEY\"] = TRAINS_ACCESS_KEY\n",
|
||||
" os.environ[\"TRAINS_API_SECRET_KEY\"] = TRAINS_SECRET_KEY\n",
|
||||
" # Set up the environment variables for clearml\n",
|
||||
" os.environ[\"CLEARML_API_HOST\"] = CLEARML_API_SERVER\n",
|
||||
" os.environ[\"CLEARML_WEB_HOST\"] = CLEARML_WEB_SERVER\n",
|
||||
" os.environ[\"CLEARML_FILES_HOST\"] = CLEARML_FILES_SERVER\n",
|
||||
" os.environ[\"CLEARML_API_ACCESS_KEY\"] = CLEARM_API_ACCESS_KEY\n",
|
||||
" os.environ[\"CLEARML_API_SECRET_KEY\"] = CLEARML_API_SECRET_KEY\n",
|
||||
" api_client = APIClient()\n",
|
||||
"\n",
|
||||
" # Verify the requested queues exist and create those that doesn't exist\n",
|
||||
@@ -520,7 +524,7 @@
|
||||
" # skip resource types that might be needed\n",
|
||||
" if resources in required_idle_resources:\n",
|
||||
" continue\n",
|
||||
" # Remove from both aws and trains all instances that are \n",
|
||||
" # Remove from both aws and clearml all instances that are\n",
|
||||
" # idle for longer than MAX_IDLE_TIME_MIN\n",
|
||||
" if time() - timestamp > MAX_IDLE_TIME_MIN * 60.0:\n",
|
||||
" cloud_id = workers_pattern.match(worker.id)[\"cloud_id\"]\n",
|
||||
@@ -535,7 +539,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"##### Execute Forever* (the controller is stateless, so you can always re-execute the notebook)"
|
||||
"### Execute Forever* (the controller is stateless, so you can always re-execute the notebook)"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
||||
@@ -13,61 +13,86 @@ def parse_args():
|
||||
group = parser.add_mutually_exclusive_group()
|
||||
|
||||
parser.add_argument(
|
||||
"--queue", type=str, help="Queue to pull tasks from"
|
||||
"--queue",
|
||||
type=str,
|
||||
help="Queues to pull tasks from. If multiple queues, use comma separated list, e.g. 'queue1,queue2'",
|
||||
)
|
||||
group.add_argument(
|
||||
"--ports-mode", action='store_true', default=False,
|
||||
"--ports-mode",
|
||||
action="store_true",
|
||||
default=False,
|
||||
help="Ports-Mode will add a label to the pod which can be used as service, in order to expose ports"
|
||||
"Should not be used with max-pods"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num-of-services", type=int, default=20,
|
||||
help="Specify the number of k8s services to be used. Use only with ports-mode."
|
||||
"--num-of-services",
|
||||
type=int,
|
||||
default=20,
|
||||
help="Specify the number of k8s services to be used. Use only with ports-mode.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--base-port", type=int,
|
||||
"--base-port",
|
||||
type=int,
|
||||
help="Used in conjunction with ports-mode, specifies the base port exposed by the services. "
|
||||
"For pod #X, the port will be <base-port>+X. Note that pod number is calculated based on base-pod-num"
|
||||
"e.g. if base-port=20000 and base-pod-num=3, the port for the first pod will be 20003"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--base-pod-num", type=int, default=1,
|
||||
"--base-pod-num",
|
||||
type=int,
|
||||
default=1,
|
||||
help="Used in conjunction with ports-mode and base-port, specifies the base pod number to be used by the "
|
||||
"service (default: %(default)s)"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--gateway-address", type=str, default=None,
|
||||
help="Used in conjunction with ports-mode, specify the external address of the k8s ingress / ELB"
|
||||
"--gateway-address",
|
||||
type=str,
|
||||
default=None,
|
||||
help="Used in conjunction with ports-mode, specify the external address of the k8s ingress / ELB",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--pod-clearml-conf", type=str,
|
||||
help="Configuration file to be used by the pod itself (if not provided, current configuration is used)"
|
||||
"--pod-clearml-conf",
|
||||
type=str,
|
||||
help="Configuration file to be used by the pod itself (if not provided, current configuration is used)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--overrides-yaml", type=str,
|
||||
help="YAML file containing pod overrides to be used when launching a new pod"
|
||||
"--overrides-yaml", type=str, help="YAML file containing pod overrides to be used when launching a new pod"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--template-yaml", type=str,
|
||||
"--template-yaml",
|
||||
type=str,
|
||||
help="YAML file containing pod template. If provided pod will be scheduled with kubectl apply "
|
||||
"and overrides are ignored, otherwise it will be scheduled with kubectl run"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--ssh-server-port", type=int, default=0,
|
||||
help="If non-zero, every pod will also start an SSH server on the selected port (default: zero, not active)"
|
||||
"--ssh-server-port",
|
||||
type=int,
|
||||
default=0,
|
||||
help="If non-zero, every pod will also start an SSH server on the selected port (default: zero, not active)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--namespace", type=str,
|
||||
help="Specify the namespace in which pods will be created (default: %(default)s)", default="clearml"
|
||||
"--namespace",
|
||||
type=str,
|
||||
help="Specify the namespace in which pods will be created (default: %(default)s)",
|
||||
default="clearml",
|
||||
)
|
||||
group.add_argument(
|
||||
"--max-pods", type=int,
|
||||
"--max-pods",
|
||||
type=int,
|
||||
help="Limit the maximum number of pods that this service can run at the same time."
|
||||
"Should not be used with ports-mode"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--use-owner-token", action="store_true", default=False,
|
||||
help="Generate and use task owner token for the execution of each task"
|
||||
"--use-owner-token",
|
||||
action="store_true",
|
||||
default=False,
|
||||
help="Generate and use task owner token for the execution of each task",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--create-queue",
|
||||
action="store_true",
|
||||
default=False,
|
||||
help="Create the queue if it does not exist (default: %(default)s)",
|
||||
)
|
||||
return parser.parse_args()
|
||||
|
||||
@@ -77,21 +102,32 @@ def main():

user_props_cb = None
if args.ports_mode and args.base_port:

def k8s_user_props_cb(pod_number=0):
user_prop = {"k8s-pod-port": args.base_port + pod_number}
if args.gateway_address:
user_prop["k8s-gateway-address"] = args.gateway_address
return user_prop

user_props_cb = k8s_user_props_cb

k8s = K8sIntegration(
ports_mode=args.ports_mode, num_of_services=args.num_of_services, base_pod_num=args.base_pod_num,
user_props_cb=user_props_cb, overrides_yaml=args.overrides_yaml, clearml_conf_file=args.pod_clearml_conf,
template_yaml=args.template_yaml, extra_bash_init_script=K8sIntegration.get_ssh_server_bash(
ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
namespace=args.namespace, max_pods_limit=args.max_pods or None,
ports_mode=args.ports_mode,
num_of_services=args.num_of_services,
base_pod_num=args.base_pod_num,
user_props_cb=user_props_cb,
overrides_yaml=args.overrides_yaml,
clearml_conf_file=args.pod_clearml_conf,
template_yaml=args.template_yaml,
extra_bash_init_script=K8sIntegration.get_ssh_server_bash(ssh_port_number=args.ssh_server_port)
if args.ssh_server_port
else None,
namespace=args.namespace,
max_pods_limit=args.max_pods or None,
)
k8s.k8s_daemon(args.queue, use_owner_token=args.use_owner_token)
queue = [q.strip() for q in args.queue.split(",") if q.strip()] if args.queue else None

k8s.k8s_daemon(queue, use_owner_token=args.use_owner_token, create_queue=args.create_queue)


if __name__ == "__main__":
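A minimal sketch of the new comma-separated queue handling introduced above (standalone illustration, not the agent's code):

```python
# Mirrors the parsing used in main(): split on commas, drop empty entries.
def parse_queues(arg):
    return [q.strip() for q in arg.split(",") if q.strip()] if arg else None


assert parse_queues("queue1, queue2,") == ["queue1", "queue2"]
assert parse_queues("") is None
```

So an invocation such as `python k8s_glue_example.py --queue queue1,queue2 --create-queue` would presumably feed both queues to `k8s_daemon` and create any that do not exist yet (the queue names here are placeholders).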
@@ -1,15 +1,15 @@
attrs>=18.0,<23.0.0
attrs>=18.0,<24.0.0
enum34>=0.9,<1.2.0 ; python_version < '3.6'
furl>=2.0.0,<2.2.0
jsonschema>=2.6.0,<5.0.0
pathlib2>=2.3.0,<2.4.0
psutil>=3.4.2,<5.10.0
pyparsing>=2.0.3,<3.1.0
pyparsing>=2.0.3,<3.2.0
python-dateutil>=2.4.2,<2.9.0
pyjwt>=2.4.0,<2.7.0
pyjwt>=2.4.0,<2.9.0
PyYAML>=3.12,<6.1
requests>=2.20.0,<=2.31.0
six>=1.13.0,<1.17.0
typing>=3.6.4,<3.8.0 ; python_version < '3.5'
urllib3>=1.21.1,<1.27.0
urllib3>=1.21.1,<2
virtualenv>=16,<21