mads-oestergaard
cc656e2969
Add support for uv as package manager ( #218 )
...
* add uv as a package manager
* update configs
* update worker and defs
* update environ
* Update configs to highlight sync command
* rename to sync_extra_args and set UV_CACHE_DIR
2024-11-27 13:44:55 +02:00
clearml
b65e5fed94
Scan more Python 3 versions
2024-11-17 13:55:51 +02:00
clearml
3273f76b46
Version bump to v1.9.2
2024-10-28 18:33:04 +02:00
clearml
9af0f9fe41
Fix reload method is found in the config object
2024-10-28 18:12:22 +02:00
clearml
205cd47cb9
Fix use req_token_expiration_sec when creating a task session and not the default value
2024-10-28 18:11:42 +02:00
clearml
0ff428bb96
Fix report index not advancing in resource monitoring causes more than one GPU not to be reported
2024-10-28 18:11:00 +02:00
Matteo Destro
bf8d9c96e9
Handle OSError when checking for is_file ( #215 )
2024-10-13 10:08:03 +03:00
allegroai
a88487ff25
Add support for pip legacy resolver for versions specified in the agent.package_manager.pip_legacy_resolver configuration option
...
Add skip existing packages
2024-09-22 22:36:06 +03:00
Jake Henning
785e22dc87
Version bump to v1.9.1
2024-09-02 01:04:49 +03:00
Jake Henning
6a2b778d53
Add default pip version support for Python 3.12
2024-09-02 01:03:52 +03:00
allegroai
b2c3702830
Version bump to v1.9.0
2024-08-28 23:18:26 +03:00
allegroai
6302d43990
Add support for skipping container apt installs using CLEARML_AGENT_SKIP_CONTAINER_APT env var in k8s
...
Add runtime callback support for setting runtime properties per task in k8s
Fix remove task from pending queue and set to failed when kubectl apply fails
2024-08-27 23:01:27 +03:00
allegroai
760bbca74e
Fix failed Task in services mode logged "User aborted" instead of failed, add Task reason string
2024-08-27 22:56:37 +03:00
allegroai
e63fd31420
Fix string format
2024-08-27 22:55:49 +03:00
allegroai
2ff9985db7
Add user ID to the vault loading print
2024-08-27 22:55:32 +03:00
allegroai
b8c762401b
Fix use same state transition if supported by the server (instead of stopping the task before re-enqueue)
2024-08-27 22:54:45 +03:00
allegroai
99e1e54f94
Add support for tasks containing only bash script or python module command
2024-08-27 22:53:14 +03:00
allegroai
a4d3b5bad6
Fix only set Task started status on node rank 0
2024-08-27 22:52:31 +03:00
allegroai
b21665ed6e
Fix do not cache venv cache if venv/python skip env var was set
2024-08-27 22:52:01 +03:00
pollfly
f99344d194
Add queue priority info to CLI help ( #211 )
...
* add queue priority comment
* Add --order-fairness info
---------
Co-authored-by: Jake Henning <59198928+jkhenning@users.noreply.github.com>
2024-07-29 18:40:38 +03:00
allegroai
d9f2a1999a
Fix Only send pip freeze update on RANK 0, only update task status on exit on RANK 0
2024-07-29 17:40:24 +03:00
allegroai
6213ef4c02
Add /bin/bash -c "command" support. Task binary should be set to /bin/bash and entry_point should be set to -c command
2024-07-24 18:00:13 +03:00
allegroai
aef6aa9fc8
Fix a race condition where in rare conditions popping a Task from a queue that was aborted did not set it to started before the watchdog killed it. Does not happen in k8s/slurm
2024-07-24 17:59:46 +03:00
allegroai
0bb267115b
Add venvs_cache.path mount override for non-root containers (use: agent.docker_internal_mounts.venvs_cache)
2024-07-24 17:59:18 +03:00
allegroai
f89a92556f
Fix check logger is not None
2024-07-24 17:55:02 +03:00
allegroai
8ba4d75e80
Add CLEARML_TASK_ID and auth token to pod env vars in original entrypoint flow
2024-07-24 17:47:48 +03:00
allegroai
edc333ba5f
Add K8S_GLUE_POD_USE_IMAGE_ENTRYPOINT to allow running images without overriding the entrypoint (useful for agents using prebuilt images in k8s)
2024-07-24 17:46:27 +03:00
allegroai
2f0553b873
Fix CLEARML_MULTI_NODE_SINGLE_TASK should be read once not every reported line
2024-07-24 17:45:02 +03:00
allegroai
b2a4bf08ac
Fix pass --docker only (i.e. no default container image) for --dynamic-gpus feature
2024-07-24 17:44:35 +03:00
allegroai
f18c6b809f
Fix slurm multi-node rank detection
2024-07-24 17:44:05 +03:00
allegroai
cd5b4d2186
Add standalone script support for "-m module args" script entries; the standalone script is converted to "untitled.py" by default, or to the file named in working_dir as <dir>:<target_file>, for example ".:standalone.py"
2024-07-24 17:43:21 +03:00
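The `<dir>:<target_file>` convention in the entry above can be illustrated with a small parser. This is a minimal sketch under stated assumptions, not the agent's implementation: `split_working_dir` and `DEFAULT_STANDALONE_NAME` are hypothetical names, and splitting on the first colon is an assumption about how the value is read.

```python
DEFAULT_STANDALONE_NAME = "untitled.py"  # default file name quoted in the commit message


def split_working_dir(working_dir):
    """Split a "<dir>:<target_file>" value into (directory, target_file).

    Hypothetical helper: falls back to "." and "untitled.py" when a part is missing.
    """
    directory, _, target = (working_dir or "").partition(":")
    directory = directory or "."
    target = target or DEFAULT_STANDALONE_NAME
    return directory, target


print(split_working_dir(".:standalone.py"))
print(split_working_dir("."))
```

Splitting on the first colon keeps the sketch short, but it would misread Windows drive paths such as `C:\work`; a real parser would need to handle that case.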
allegroai
5f1bab6711
Add default docker match_rules for enterprise users,
...
NOTICE: matching_rules are ignored if `--docker container` is passed in command line
2024-07-24 17:42:55 +03:00
allegroai
ab9b9db0c9
Add CLEARML_MULTI_NODE_SINGLE_TASK (values -1, 0, 1, 2) for easier multi-node single Task workloads
2024-07-24 17:42:25 +03:00
allegroai
93df021108
Add support for .ipynb script entry files (install nbconvert in runtime, convert to python and execute the python script), including CLEARML_AGENT_FORCE_TASK_INIT patching of ipynb files (post python conversion)
2024-07-24 17:41:59 +03:00
allegroai
700ae85de0
Fix file mode should be optional in configuration files section
2024-07-24 17:41:06 +03:00
allegroai
f367c5a571
Fix git fetch did not update new tags #209
2024-07-24 17:39:53 +03:00
allegroai
ebc5944b44
Fix setting tasks that someone just marked as aborted to started - only force Task to started after dequeuing it, otherwise leave it as is
2024-07-24 17:39:26 +03:00
allegroai
8f41002845
Add task.script.binary /bin/bash support
...
Fix -m module $env to support parsing the $env before launching
2024-07-24 17:37:26 +03:00
allegroai
7e8670d57f
Find the correct python version when using a pre-installed python environment
2024-07-21 14:10:38 +03:00
allegroai
77de343863
Use "venv" module if virtualenv is not supported
2024-07-19 13:18:07 +03:00
allegroai
47147e3237
Fix cached repositories were not passing user/token when pulling, agent.vcs_cache.clone_on_pull_fail now defaults to false
2024-04-19 23:50:17 +03:00
allegroai
41fc4ec646
Fix disabling vcs cache should not add vcs mount point to container
2024-04-19 23:48:50 +03:00
allegroai
441e5a73b2
Fix conda env should not be cached if installing into base conda or conda existing env exists
2024-04-19 23:48:10 +03:00
allegroai
10c6629982
Support skipping re-enqueue on suspected preempted k8s pods
2024-04-19 23:46:57 +03:00
allegroai
6fb48a4c6e
Revert version to v1.8.1
2024-04-19 23:44:31 +03:00
allegroai
105ade31f1
Version bump to v1.8.2
2024-04-14 18:18:10 +03:00
allegroai
502e266b6b
Fix polling interval missing when not using daemon mode
2024-04-14 18:17:57 +03:00
allegroai
cd9a3b9f4e
Version bump to v1.8.1
2024-04-12 20:30:11 +03:00
allegroai
4179ac5234
Fix git pulling on cached invalid git entry. On error, re-clone the entire repo again (disable using "agent.vcs_cache.clone_on_pull_fail: false")
2024-04-12 20:29:36 +03:00
Liron Ilouz
98cc0d86ba
Add option to set daemon polling interval ( #197 )
...
* add option to set worker polling interval
* polling interval minimum value
---------
Co-authored-by: Liron <liron@tapwithus.com>
2024-04-03 14:33:52 +03:00
allegroai
293cbc0ac6
Version bump to v1.8.0
2024-04-02 16:38:22 +03:00
allegroai
4387ed73b6
Fix None handling when no limits exist
2024-04-02 16:36:09 +03:00
allegroai
43443ccf08
Pass task_id when resolving k8s template
2024-04-01 11:37:01 +03:00
allegroai
3d43240c8f
Improve conda package manager support
...
Add agent.package_manager.use_conda_base_env (CLEARML_USE_CONDA_BASE_ENV) allowing to use base conda environment (instead of installing a new one)
Fix conda support for python packages with markers and multiple specifications
Added "nvidia" conda channel and support for cuda-toolkit >= 12
2024-04-01 11:36:26 +03:00
allegroai
22672d2444
Improve GPU monitoring
2024-03-17 19:13:57 +02:00
allegroai
6a4fcda1bf
Improve resource monitor
2024-03-17 19:06:57 +02:00
allegroai
a4ebf8293d
Fix role support
2024-03-17 19:00:59 +02:00
allegroai
10fb157d58
Fix queue handling for backwards compatibility
2024-03-17 19:00:18 +02:00
allegroai
56058beec2
Update deprecated references
2024-03-17 18:59:48 +02:00
allegroai
9f207d5155
Fix dynamic GPU sometimes misses the initial print - if we found the closing print it should be good enough to signal everything is okay
2024-03-17 18:59:04 +02:00
allegroai
8a2bea3c14
Fix comment lines (#) are not ignored in docker startup bash script
2024-03-17 18:58:14 +02:00
allegroai
f1f9278928
Fix torch resolver settings applied to PytorchRequirement instance are not used
2024-03-17 18:56:47 +02:00
nfzd
2de1c926bf
Use correct Python version in Poetry init ( #179 )
...
* Use correct Python version in Poetry init
* Use interpreter override if configured
* Don't use agent.python_binary if it is empty
---------
Co-authored-by: Michael Mueller <michael.mueller@wsa.com>
2024-03-11 23:36:10 +02:00
ae-ae
8b2970350c
Fix FileNotFoundException crash in find_python_executable_for_version… ( #192 )
...
* Fix FileNotFoundException crash in find_python_executable_for_version (#164 )
* Add a Windows check for error 9009 when searching for Python
---------
Co-authored-by: 12037964+ae-ae@users.noreply.github.com 12037964+ae-ae@users.noreply.github.com <ae-ae>
2024-03-06 09:17:31 +02:00
FeU-aKlos
a2758250b2
Fix queue handling in K8sIntegration and k8s_glue_example.py ( #183 )
...
* Fix queue handling in K8sIntegration and k8s_glue_example.py
* Update Dockerfile and k8s_glue_example.py
* Add executable permission to provider_entrypoint.sh
* ADJUST docker
* Update clearml-agent version
* ADJUST stuff
* ADJUST queue string handling
* DELETE pip install from own repo
2024-02-29 14:20:54 +02:00
allegroai
01e8ffd854
Improve venv cache handling:
...
- Add FileLock readonly mode, default is write mode (i.e. exclusive lock, preserving behavior)
- Add venv cache now uses readonly lock when copying folders from venv cache into target folder. This enables multiple read, single write operation
- Do not lock the cache folder if we do not need to delete old entries
2024-02-29 14:19:24 +02:00
allegroai
74edf6aa36
Fix IOError on file lock when using shared folder
2024-02-29 14:16:25 +02:00
allegroai
09c5ef99af
Fix Python 3.12 support by removing distutil imports
2024-02-29 14:12:21 +02:00
allegroai
17ae28a62f
Add agent.venvs_cache.lock_timeout to control the venv cache folder lock timeout (in seconds, default 30)
2024-02-29 14:06:06 +02:00
allegroai
059a9385e9
Fix delete temp console pipe log files after Task execution is completed. This is important for long lasting services agents, avoiding collecting temp files on host machine
2024-02-29 14:03:30 +02:00
allegroai
9a321a410f
Add CLEARML_AGENT_FORCE_TASK_INIT to allow runtime patching of script even if no repo is specified and the code is running a preinstalled docker
2024-02-29 14:02:27 +02:00
allegroai
919013d4fe
Add CLEARML_AGENT_FORCE_POETRY to allow forcing poetry even when using pip requirements manager
2024-02-29 13:59:26 +02:00
allegroai
05530b712b
Fix sanitization did not cover all keys
2024-02-29 13:56:14 +02:00
allegroai
8d15fd8798
Fix pip is returned as a pip version if no value exists in agent.package_manager.pip_version
2024-02-29 13:55:41 +02:00
allegroai
b34329934b
Add queue ID report before pulling task
2024-02-29 13:52:17 +02:00
allegroai
85049d8705
Move configuration sanitization settings to the default config file
2024-02-29 13:51:40 +02:00
allegroai
6fbd70786e
Add protection for truncate() call
2024-02-29 13:51:09 +02:00
allegroai
05a65548da
Fix agent.enable_git_ask_pass does not show in configuration dump
2024-02-29 13:50:52 +02:00
allegroai
6657003d65
Fix using controller-uid will not always return required pods
2024-02-29 13:49:30 +02:00
allegroai
c9fc092f4e
Support force_system_packages argument in k8s glue class
2023-12-26 10:12:32 +02:00
allegroai
432ee395e1
Version bump to v1.7.0
2023-12-20 18:08:38 +02:00
allegroai
98fc4f0fb9
Add agent.resource_monitoring.disk_use_path configuration option to allow monitoring a different volume than the one containing the home folder
2023-12-20 17:49:33 +02:00
allegroai
111e774c21
Add extra_index_url sanitization in configuration printout
2023-12-20 17:49:04 +02:00
allegroai
3dd8d783e1
Fix agent.git_host setting will cause git@domain URLs to not be replaced by SSH URLs since furl cannot parse them to obtain host
2023-12-20 17:48:18 +02:00
allegroai
7c3e420df4
Add git clone verbosity using CLEARML_AGENT_GIT_CLONE_VERBOSE env var
2023-12-20 17:47:52 +02:00
allegroai
55b065a114
Update GPU stats and pynvml support
2023-12-20 17:47:19 +02:00
allegroai
faa97b6cc2
Set worker ID in k8s glue mode
2023-12-20 17:45:34 +02:00
allegroai
f5861b1e4a
Change default agent.enable_git_ask_pass to True
2023-12-20 17:44:41 +02:00
allegroai
030cbb69f1
Fix check if process return code is SIGKILL (-9 or 137) and abort callback was called, do not mark as failed but as aborted
2023-12-20 17:43:02 +02:00
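The two values in the entry above reflect two ways of reporting the same event: Python's `subprocess` reports a child killed by a signal as the negative signal number (-9 for SIGKILL), while shells and container runtimes report 128 plus the signal number (137). A minimal sketch of such a check, assuming hypothetical helper names (`is_sigkill`, `final_status`) that are not taken from the agent's code:

```python
SIGKILL = 9  # POSIX signal number for SIGKILL


def is_sigkill(return_code):
    """True if the exit status says the process was killed by SIGKILL."""
    return return_code in (-SIGKILL, 128 + SIGKILL)


def final_status(return_code, abort_callback_called):
    """Hypothetical status mapping illustrating the rule in the commit message."""
    if return_code == 0:
        return "completed"
    if is_sigkill(return_code) and abort_callback_called:
        # The kill followed a requested abort, so it is not a failure.
        return "aborted"
    return "failed"


print(final_status(137, abort_callback_called=True))
print(final_status(-9, abort_callback_called=False))
```

Requiring both conditions matters: a SIGKILL without a preceding abort request (for example from an out-of-memory killer) is still treated as a failure in this sketch.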
allegroai
564f769ff7
Add agent.docker_args_extra_precedes_task, agent.protected_docker_extra_args
...
to prevent the same switch from being used by both `extra_docker_args` and a Task's docker args
2023-12-20 17:42:36 +02:00
allegroai
dd5d24b0ca
Add CLEARML_AGENT_TEMP_STDOUT_FILE_DIR to allow specifying temp dir used for storing agent log files and temporary log files (daemon and execution)
2023-11-14 11:45:13 +02:00
allegroai
996bb797c3
Add env var in case we're running a service task
2023-11-14 11:44:36 +02:00
allegroai
9ad49a0d21
Fix KeyError if container does not contain the arguments field
2023-11-01 15:11:07 +02:00
allegroai
ba4fee7b19
Fix agent.package_manager.poetry_install_extra_args are used in all Poetry commands and not just in install ( #173 )
2023-11-01 15:10:40 +02:00
allegroai
0131db8b7d
Add support for resource_applied() callback in k8s glue
...
Add support for sending log events with k8s-provided timestamps
Refactor env vars infrastructure
2023-11-01 15:10:08 +02:00
allegroai
d2384a9a95
Add example and support for prebuilt containers including services-mode support with overrides CLEARML_AGENT_FORCE_CODE_DIR CLEARML_AGENT_FORCE_EXEC_SCRIPT
2023-11-01 15:05:57 +02:00
allegroai
5b86c230c1
Fix an environment variable that should be set with a numerical value of 0 (i.e. end up as "0" or "0.0") is set to an empty string
2023-11-01 15:04:59 +02:00
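A bug of the kind described above typically comes from a truthiness test that treats numeric zero as "no value". The sketch below shows that general pattern and one way to avoid it; it is an assumption about the class of bug, not the agent's actual code, and both function names are hypothetical.

```python
def to_env_value_buggy(value):
    # Truthiness check: 0 and 0.0 are falsy, so they collapse to "".
    return str(value) if value else ""


def to_env_value(value):
    # Only None means "unset"; numeric zero is kept as "0" or "0.0".
    return "" if value is None else str(value)


print(repr(to_env_value_buggy(0)))
print(repr(to_env_value(0)))
print(repr(to_env_value(0.0)))
```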
allegroai
21e4be966f
Fix recursion issue when deep-copying a session
2023-11-01 15:04:24 +02:00
allegroai
9c6cb421b3
When cleaning up pending pods, verify task is still aborted and pod is still pending before deleting the pod
2023-11-01 15:04:01 +02:00
allegroai
52405c343d
Fix k8s glue configuration might be contaminated when changed during apply
2023-11-01 15:03:37 +02:00