---
title: Llama.cpp Model Deployment
---
:::important Enterprise Feature
The llama.cpp Model Deployment App is available under the ClearML Enterprise plan.
:::
The llama.cpp Model Deployment app enables users to quickly deploy LLMs in GGUF format using [`llama.cpp`](https://github.com/ggerganov/llama.cpp).
The llama.cpp Model Deployment application serves your model on a machine of your choice. Once an app instance is
running, it serves your model through a secure, publicly accessible network endpoint. The app monitors endpoint activity
and shuts down if the model remains inactive for a specified maximum idle time.
:::important AI Application Gateway
The llama.cpp Model Deployment app makes use of the ClearML Traffic Router, which implements a secure, authenticated
network endpoint for the model.

If the ClearML AI Application Gateway is not available, the model endpoint might not be accessible.
:::
After starting a llama.cpp Model Deployment instance, you can view the following information in its dashboard:
* Status indicator
  * App instance is running and is actively in use
  * App instance is setting up
  * App instance is idle
  * App instance is stopped
* Idle time - Time elapsed since last activity
* App - The publicly accessible URL of the model endpoint. Active model endpoints are also available in the
[Model Endpoints](../webapp_model_endpoints.md) table, which allows you to view and compare endpoint details and
monitor status over time
* API base - The base URL for the model endpoint
* API key - The authentication key for the model endpoint
* Test Command - An example command line to test the deployed model (see the usage sketch after this list)
* Requests - Number of requests over time
* Latency - Request response time (ms) over time
* Endpoint resource monitoring metrics over time
* CPU usage
* Network throughput
* Disk performance
* Memory performance
* GPU utilization
* GPU memory usage
* GPU temperature
* Console log - The console log shows the app instance's console output: setup progress, status changes, error messages, etc.
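Once the endpoint is up, you can query it programmatically. llama.cpp's server exposes an OpenAI-compatible API, so
the sketch below uses the `openai` Python client. This is a minimal example assuming an OpenAI-compatible
chat-completions route; the base URL, API key, and model name are placeholders to be copied from the app instance
dashboard.

```python
from openai import OpenAI

# Placeholders - copy the real values from the app instance dashboard
# ("API base" and "API key" fields).
client = OpenAI(
    base_url="https://<your-endpoint>/v1",  # API base from the dashboard
    api_key="<your-api-key>",               # API key from the dashboard
)

# Send a chat completion request to the deployed GGUF model. The model
# name is a placeholder; your deployment may ignore it or expect the
# name shown in the dashboard.
response = client.chat.completions.create(
    model="deployed-model",
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```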
## Llama.cpp Model Deployment Instance Configuration
When configuring a new llama.cpp Model Deployment instance, you can fill in the required parameters or reuse the
configuration of a previously launched instance.
Launch an app instance with the configuration of a previously launched instance using one of the following options:
* Cloning a previously launched app instance will open the instance launch form with the original instance's configuration prefilled.
* Importing an app configuration file. You can export the configuration of a previously launched instance as a JSON file when viewing its configuration.
The prefilled configuration form can be edited before launching the new app instance.
To configure a new app instance, click `Launch New` to open the app's configuration form.
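For reference, an exported app configuration is a JSON file whose fields mirror the launch form. The exact schema is
whatever the app exports; the sketch below is purely illustrative, with hypothetical field names, and is written in
Python only so it can be annotated.

```python
import json

# Hypothetical shape of an exported llama.cpp Model Deployment configuration.
# Field names are illustrative only - export a real instance to get the
# authoritative schema.
config = {
    "project_name": "llm-serving",                # ClearML project for the instance
    "task_name": "llama-cpp-deployment",          # task name shown in the UI
    "queue": "gpu-queue",                         # queue an agent is listening on
    "model": "provider/repo/path/to/model.gguf",  # GGUF model path
    "general": {
        "huggingface_token": "",                  # only needed for gated models
        "gpu_layers": 9999,                       # 9999 = load all layers into VRAM
    },
    "advanced": {
        "idle_time_limit_hours": 2,
        "last_action_report_interval_seconds": 30,
    },
}

# Save as JSON, then load it via the form's "Import Configuration" option.
with open("llama_cpp_deployment.json", "w") as f:
    json.dump(config, f, indent=2)
```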
## Configuration Options
* Import Configuration - Import an app instance configuration file. This will fill the configuration form with the
values from the file, which can be modified before launching the app instance
* Project name - ClearML Project where your llama.cpp Model Deployment app instance will be stored
* Task name - Name of [ClearML Task](../../fundamentals/task.md) for your llama.cpp Model Deployment app instance
* Queue - The [ClearML Queue](../../fundamentals/agents_and_queues.md#agent-and-queue-workflow) to which the
llama.cpp Model Deployment app instance task will be enqueued (make sure an agent is assigned to it)
* Model - A ClearML Model ID or a Hugging Face model. The model must be in GGUF format. If you are using a
  Hugging Face model, make sure to pass the path to the GGUF file within the repository. For example: `provider/repo/path/to/model.gguf`
* General
* Hugging Face Token - Token for accessing Hugging Face models that require authentication
  * Number of GPU Layers - Number of model layers to store in VRAM. `9999` indicates that all layers should be loaded
    in VRAM. Lower values offload the remaining layers to CPU RAM (see the sketch after this list)
* Advanced Options
* Idle Time Limit (Hours) - Maximum idle time after which the app instance will shut down
  * Last Action Report Interval (Seconds) - The frequency at which the application's last activity is reported.
    This prevents the app instance from being marked idle when machine metrics are low but the application is
    actually still in use
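The Number of GPU Layers option maps to llama.cpp's layer-offloading mechanism. As a point of reference, the sketch
below shows roughly how the same setting behaves when loading a GGUF model locally with the `llama-cpp-python`
bindings; the app manages this for you, so the snippet only illustrates the setting's effect.

```python
from llama_cpp import Llama

# n_gpu_layers mirrors the app's "Number of GPU Layers" setting:
# 9999 (or -1) keeps every layer in VRAM, while a smaller value keeps
# only that many layers on the GPU and leaves the rest in CPU RAM.
llm = Llama(
    model_path="path/to/model.gguf",  # local GGUF file
    n_gpu_layers=9999,                # load all layers into VRAM
)

output = llm("Q: What is the GGUF format? A:", max_tokens=64)
print(output["choices"][0]["text"])
```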