67 Commits

Author SHA1 Message Date
clearml
28e9280a4f Reduce required packages 2025-01-26 23:03:16 +02:00
clearml
47d35ef48f Fix managed python environment inside container (PEP 668) remove usr/lib/python3.*/EXTERNALLY-MANAGED 2024-12-26 18:59:42 +02:00
clearml
fc1abbab0b Refactor k8s glue 2024-12-26 18:58:00 +02:00
clearml
4fa61dde1f Support ignoring kubectl errors 2024-12-12 23:41:31 +02:00
clearml
b65e5fed94 Scan more Python 3 versions 2024-11-17 13:55:51 +02:00
allegroai
6302d43990 Add support for skipping container apt installs using CLEARML_AGENT_SKIP_CONTAINER_APT env var in k8s
Add runtime callback support for setting runtime properties per task in k8s
Fix remove task from pending queue and set to failed when kubectl apply fails
2024-08-27 23:01:27 +03:00
allegroai
b8c762401b Fix use same state transition if supported by the server (instead of stopping the task before re-enqueue) 2024-08-27 22:54:45 +03:00
allegroai
8ba4d75e80 Add CLEARML_TASK_ID and auth token to pod env vars in original entrypoint flow 2024-07-24 17:47:48 +03:00
allegroai
edc333ba5f Add K8S_GLUE_POD_USE_IMAGE_ENTRYPOINT to allow running images without overriding the entrypoint (useful for agents using prebuilt images in k8s) 2024-07-24 17:46:27 +03:00
allegroai
10c6629982 Support skipping re-enqueue on suspected preempted k8s pods 2024-04-19 23:46:57 +03:00
allegroai
43443ccf08 Pass task_id when resolving k8s template 2024-04-01 11:37:01 +03:00
allegroai
10fb157d58 Fix queue handling for backwards compatibility 2024-03-17 19:00:18 +02:00
FeU-aKlos
a2758250b2 Fix queue handling in K8sIntegration and k8s_glue_example.py (#183)
* Fix queue handling in K8sIntegration and k8s_glue_example.py
* Update Dockerfile and k8s_glue_example.py
* Add executable permission to provider_entrypoint.sh
* ADJUST docker
* Update clearml-agent version
* ADDJUST stuff
* ADJUST queue string handling
* DELETE pip install from own repo
2024-02-29 14:20:54 +02:00
allegroai
b34329934b Add queue ID report before pulling task 2024-02-29 13:52:17 +02:00
allegroai
6657003d65 Fix using controller-uid will not always return required pods 2024-02-29 13:49:30 +02:00
allegroai
c9fc092f4e Support force_system_packages argument in k8s glue class 2023-12-26 10:12:32 +02:00
allegroai
55b065a114 Update GPU stats and pynvml support 2023-12-20 17:47:19 +02:00
allegroai
faa97b6cc2 Set worker ID in k8s glue mode 2023-12-20 17:45:34 +02:00
allegroai
9ad49a0d21 Fix KeyError if container does not contain the arguments field 2023-11-01 15:11:07 +02:00
allegroai
0131db8b7d Add support for resource_applied() callback in k8s glue
Add support for sending log events with k8s-provided timestamps
Refactor env vars infrastructure
2023-11-01 15:10:08 +02:00
allegroai
9c6cb421b3 When cleaning up pending pods, verify task is still aborted and pod is still pending before deleting the pod 2023-11-01 15:04:01 +02:00
allegroai
46f0c991c8 Add status reason when aborting before moving to k8s_scheduler queue 2023-11-01 15:02:24 +02:00
Alex Burlacu
946e9d9ce9 Fix invalid reference 2023-08-24 18:51:27 +03:00
allegroai
4c056a17b9 Add support for k8s jobs execution
Strip docker container obtained from task in k8s apply
2023-07-04 14:45:00 +03:00
allegroai
450df2f8d3 Support skipping agent pip upgrade in container bash script using the CLEARML_AGENT_NO_UPDATE env var 2023-07-04 14:38:50 +03:00
allegroai
95dadca45c Refactor k8s glue running/used pods getter 2023-05-21 22:56:12 +03:00
allegroai
7f5b3c8df4 Fix None config file in session causes k8s agent to raise exception 2023-03-28 14:33:55 +03:00
allegroai
40456be948 Black formatting
Refactor path support
2023-03-05 18:05:00 +02:00
allegroai
4f17a2c17d Fix K8s glue does not delete pending pods if the tasks they represent were aborted 2023-02-05 10:32:16 +02:00
allegroai
af6a77918f Fix _ is allowed in k8s label names 2023-02-05 10:29:48 +02:00
allegroai
b2da639582 Add CLEARML_AGENT_FORCE_SYSTEM_SITE_PACKAGES env var (default true) to allow overriding default "system_site_packages: true" behavior when running tasks in containers (docker mode and k8s-glue) 2022-12-10 20:00:46 +02:00
allegroai
dc5e0033c8 Remove support for kubectl run
Allow customizing pod name prefix and limit pod label
Return deleted pods from cleanup
Some refactoring
2022-12-05 11:40:19 +02:00
allegroai
3dd5973734 Filter by phase when detecting hanging pods
More debug print-outs
Use task session when possible
Push task into k8s scheduler queue only if running from the same tenant
Make sure we pass git_user/pass to the task pod
Fix cleanup command not issued when no pods exist in a multi-queue setup
2022-12-05 11:29:59 +02:00
allegroai
6e7fb5f331 Fix sending task logs fails when agent is not running in the same tenant 2022-12-05 11:19:14 +02:00
allegroai
d794b047be Fix system_site_packages is not turned on in k8s glue 2022-10-23 13:03:59 +03:00
allegroai
857a750eb1 Fix GCP load balancer not fwd GET request body, allow to change default request Action to Put/Post/Get. see api.http.default_method or CLEARML_API_DEFAULT_REQ_METHOD 2022-09-15 20:15:42 +03:00
allegroai
26aa50f1b5 Fix k8s glue extra_bash_init_cmd location in initial bash script 2022-09-02 23:50:03 +03:00
allegroai
8b4f1eefc2 Add more debug printouts in k8s glue 2022-09-02 23:49:28 +03:00
allegroai
97c2e21dcc Fix resolving k8s pending queue may cause a queue with a uuid name to be created 2022-09-02 23:49:28 +03:00
allegroai
7292263f86 Add CLEARML_K8S_GLUE_START_AGENT_SCRIPT_PATH to allow customizing the agent startup script location for k8s glue agent 2022-08-23 23:16:36 +03:00
allegroai
820ab4dc0c Fix k8s glue debug mode, refactoring 2022-08-01 18:55:49 +03:00
allegroai
d96b8ff906 Fix template namespace should override default namespace 2022-07-22 22:44:32 +03:00
allegroai
e687418194 Refactor k8s glue template handling 2022-07-22 22:43:07 +03:00
allegroai
2e5298b737 Add support for use-owner-token in k8s glue 2022-04-27 14:59:27 +03:00
allegroai
4c120d7cd0 Add ability to override container LOCAL_PYTHON, add auto python support (max 3.15) 2022-03-24 21:58:07 +02:00
allegroai
cd046927f3 Add k8s glue update task status_message in hanging pods daemon
Fix k8s glue not throwing error when failing to push to queue
2021-08-02 22:59:31 +03:00
allegroai
42606d9247 Fix multiple k8s glue instances with pod limits
Version bump
2021-07-15 10:28:43 +03:00
allegroai
499b3dfa66 Fix k8s glue, do not reset Task before re-enqueuing as it will remove runtime properties 2021-07-15 10:27:54 +03:00
allegroai
ca360b7d43 Improve max pod limit check 2021-07-15 10:26:49 +03:00
allegroai
6470b16b70 Add k8s set task container if using default image/arguments 2021-07-15 10:26:09 +03:00