INFERENCE ENDPOINTS

Effortless AI Inference at any scale

HOW IT WORKS

  • Deploy any model

    Effortlessly deploy open-source or your own models with flexible endpoints

  • Limitless auto-scaling

    Scale to match your needs with endpoints that go from zero to thousands of GPUs

  • Safe & secure

    Protect your AI models with HTTPS and authentication for secure access (see the request sketch after this list)
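
As a minimal sketch of that flow, the example below shows how a client might call a deployed endpoint over HTTPS with a bearer token. The URL, environment variable, and request/response fields are illustrative placeholders, not Ori's documented API.

    import os
    import requests

    # Placeholder endpoint URL and token variable; substitute the values for
    # your own deployment. These names are assumptions for this sketch.
    ENDPOINT_URL = "https://your-endpoint.example/v1/generate"
    API_TOKEN = os.environ["INFERENCE_API_TOKEN"]

    def generate(prompt: str) -> str:
        """Send a prompt to the endpoint over HTTPS with bearer authentication."""
        response = requests.post(
            ENDPOINT_URL,
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            json={"prompt": prompt, "max_tokens": 128},  # assumed payload schema
            timeout=30,
        )
        response.raise_for_status()  # surfaces authentication or capacity errors
        return response.json()["output"]  # response field assumed for illustration

    if __name__ == "__main__":
        print(generate("Summarize the benefits of autoscaling inference."))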

WHY ORI INFERENCE ENDPOINTS?

Optimized to serve and scale inference workloads — effortlessly

  • SCALE: up to 1000+ GPUs to scale to
  • SPEED: 60 seconds or less to scale

FAIR PRICING

Top-Tier GPUs.
Best-in-industry rates.
No hidden fees.

Private Cloud

lets you limitlessly customize for massive scale

Chart your own AI reality