How to run and scale Nemotron Nano 9B v2 on Ori Inference Endpoints with Streamlit UI

Nemotron-Nano 9B v2 is NVIDIA’s latest compact language model aimed at combining reasoning strength with lightning-fast performance. At its core, the model uses a hybrid architecture that blends two approaches: the Transformer, which excels at learning long-range relationships in text, and Mamba 2, a newer “state-space” architecture that handles long sequences more efficiently by maintaining a compact running state instead of attending over every previous token.
In simple terms, the Transformer layers provide accuracy and contextual understanding, while the Mamba 2 layers speed things up dramatically. This design allows Nemotron-Nano 9B v2 to deliver up to six times the inference throughput of comparably sized Transformer-only models, all while supporting 128K-token context lengths and toggleable reasoning (developers can switch “thinking” on or off to balance accuracy and latency). Released under the NVIDIA Open Model License, it’s a small but powerful model for reasoning, multilingual tasks, and real-time AI agents.
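To make the reasoning toggle concrete, here is a minimal sketch that jumps ahead to the OpenAI-compatible endpoint we deploy later in this guide. It assumes the /think and /no_think system-prompt control tokens described on NVIDIA's model card; the host, token, and model name are placeholders you would replace with your own values, and you should confirm the exact control tokens for your build of the model.

```python
# Hedged sketch: switching Nemotron's "thinking" on or off via the system prompt.
# Assumptions: an OpenAI-compatible endpoint (set up later in this guide) and the
# /think and /no_think control tokens from NVIDIA's model card — verify both for
# your deployment before relying on this.
from openai import OpenAI

client = OpenAI(api_key="<endpoint token>", base_url="https://<your-endpoint-host>/v1/")

def ask(question: str, thinking: bool) -> str:
    system = "/think" if thinking else "/no_think"  # reasoning on vs. off
    resp = client.chat.completions.create(
        model="model",  # Ori routes this name to the deployed Nemotron model
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        max_tokens=512,
    )
    return resp.choices[0].message.content

question = "A train travels 120 km in 90 minutes. What is its average speed?"
print(ask(question, thinking=True))   # higher accuracy, higher latency
print(ask(question, thinking=False))  # faster, direct answer
```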
Here’s a brief overview of the model’s key specifications:
| NVIDIA Nemotron Nano 9B v2 | |
|---|---|
| Architecture | Hybrid: Mamba 2 (state-space layers) + Transformer attention layers |
| Size | 9 billion parameters (derived from a 12B-parameter base) |
| Context Length | 128K tokens |
| License | NVIDIA Open Model License (commercial use permitted) |
Performance benchmarks shared by NVIDIA indicate that it outperforms other open-weight models in the small language model category, such as Qwen3 8B.

Source: NVIDIA
Why Ori Inference Endpoints for Nemotron Nano 9B v2?
There’s no shortage of platforms today for running inference on leading open-source AI models. Yet many are either too rigid for real-world business needs or too complex and costly to manage. Ori Inference Endpoints offers a simpler alternative: an easy, scalable way to deploy cutting-edge AI models on dedicated GPUs with just one click. Unlike serverless inference, dedicated GPUs give you greater control over the type of compute, scalability, and the location of deployment.
Here’s how Ori Inference Endpoints makes production inference performant, scalable and effortless:
- Select a GPU and region to unlock powerful inference: serve your models on top-tier GPUs such as NVIDIA H200, H100, L40S or L4, and deploy in a region that helps minimize latency for your users.
- Autoscale without limits: Ori Inference Endpoints automatically scales up or down based on demand. You can also scale all the way down to zero, helping you reduce GPU costs when your endpoints are idle.
- Optimized for quick starts: model loading is designed to launch instantly, making scaling fast, even when starting from zero.
- HTTPS-secured API endpoints: experience peace of mind with HTTPS endpoints and authentication to keep them safe from unauthorized use.
- Pay for what you use, by the minute: per-minute pricing helps you keep your AI infrastructure affordable and costs predictable. No long-term commitments, just transparent, usage-based billing.
In this tutorial, we’ll walk you through deploying NVIDIA Nemotron Nano 9B v2, powered by Ori Inference Endpoints.
How to deploy Nemotron Nano 9B v2 on Ori’s Dedicated Inference Endpoints
We’ll deploy Nemotron Nano 9B v2 on an Ori Inference Endpoint backed by dedicated NVIDIA GPUs, and use Streamlit to create a UI to interact with the model.
Step 1: Spin up an Ori Inference Endpoint and choose Nemotron Nano 9B v2 as the model you want to deploy. Pick a suggested GPU at a location of your choice.

Set up the minimum and maximum number of replicas you need for automatic scaling. Inference Endpoints can scale automatically with demand and go all the way down to zero if your endpoint is idle, helping you save significantly on inference costs.

Note your endpoint's URL and API Access Token.
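Optionally, you can sanity-check the endpoint from Python before wiring up the UI. The sketch below assumes the same conventions the app later in this tutorial uses: the base URL is your endpoint host with /v1/ appended, the access token is passed as the API key, and the model name is simply "model" (Ori routes it to the Nemotron deployment behind the endpoint). The host and token values are placeholders.

```python
# quick_check.py — minimal sanity check of the Ori endpoint
# Assumes an OpenAI-compatible API served at <host>/v1/.
from openai import OpenAI

ENDPOINT_URL = "https://<your-endpoint-host>"   # host root from your Ori endpoint
ENDPOINT_TOKEN = "ogc_***"                      # access token from your Ori endpoint

client = OpenAI(api_key=ENDPOINT_TOKEN, base_url=f"{ENDPOINT_URL}/v1/")

# One short, non-streaming request is enough to confirm auth and routing work.
resp = client.chat.completions.create(
    model="model",  # Ori routes this to the Nemotron Nano 9B v2 deployment behind the endpoint
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```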
Step 2: Install Streamlit and OpenAI packages
```bash
pip install streamlit openai
```
Step 3: Create a Streamlit secrets file to store the endpoint credentials at the location: /root/.streamlit/secrets.toml
```toml
ENDPOINT_URL = "<Host URL provided by the Inference Endpoint>"
ENDPOINT_TOKEN = "<Token from the Inference Endpoint>"
```
Step 4: Save the Python code (app.py) to configure and run your Streamlit UI
```python
import os
import streamlit as st
from openai import OpenAI

st.set_page_config(page_title="Nemotron-Nano 9B — Ori Inference", page_icon="🧠")

# ---- Get config (secrets > env > UI) ----
DEFAULT_URL = st.secrets.get("ENDPOINT_URL") or os.getenv("ENDPOINT_URL") or ""
DEFAULT_TOKEN = st.secrets.get("ENDPOINT_TOKEN") or os.getenv("ENDPOINT_TOKEN") or ""

def normalize_base_url(u: str) -> str:
    """Ensure base_url looks like https://host/v1/ (strip any extra suffixes)."""
    u = (u or "").strip().rstrip("/")
    for suffix in ("/v1/chat/completions", "/v1"):
        if u.endswith(suffix):
            u = u[: -len(suffix)].rstrip("/")
    return f"{u}/v1/"

with st.sidebar:
    st.subheader("Connection")
    raw_url = st.text_input("Endpoint URL (host root)", value=DEFAULT_URL, placeholder="https://<your-endpoint-host>")
    token = st.text_input("Access Token", value=DEFAULT_TOKEN, type="password", placeholder="ogc_***")
    st.caption("Base will resolve to: <host>/v1/ and call /chat/completions")
    st.divider()
    st.subheader("Chat Settings")
    system_prompt = st.text_area("System prompt", value="You are a helpful assistant")
    # Sampling temperature (default 0.2)
    temperature = st.slider("Temperature", 0.0, 1.0, 0.2, 0.05)
    # Max tokens per response: default 256, up to 100,000
    max_tokens = st.slider("Max tokens", 16, 100000, 256, 16)
    stream = st.toggle("Stream responses", value=True)
    if st.button("Clear chat"):
        st.session_state.pop("messages", None)

# Validate inputs
if not raw_url or not token:
    st.error("Provide the Endpoint URL and Access Token (in sidebar).")
    st.stop()

base_url = normalize_base_url(raw_url)

# ---- OpenAI-compatible client pointing at Ori endpoint ----
client = OpenAI(api_key=token, base_url=base_url)

st.title("🧠 Nemotron-Nano 9B on Ori Inference")
st.caption(f"Using base URL: `{base_url}` → endpoint: `/chat/completions`")

# ---- Chat history ----
if "messages" not in st.session_state:
    st.session_state.messages = []
    if system_prompt:
        st.session_state.messages.append({"role": "system", "content": system_prompt})

# Render history
for m in st.session_state.messages:
    if m["role"] == "system":
        continue
    with st.chat_message(m["role"]):
        st.markdown(m["content"])

# Input
user_msg = st.chat_input("Ask me anything…")
if user_msg:
    st.session_state.messages.append({"role": "user", "content": user_msg})

    # Echo user
    with st.chat_message("user"):
        st.markdown(user_msg)

    # Assistant response
    with st.chat_message("assistant"):
        placeholder = st.empty()
        output = ""

        try:
            if stream:
                resp = client.chat.completions.create(
                    model="model",  # Ori routes to your configured Nemotron endpoint
                    messages=st.session_state.messages,
                    temperature=temperature,
                    max_tokens=max_tokens,
                    stream=True,
                )
                for chunk in resp:
                    delta = getattr(chunk.choices[0].delta, "content", None)
                    if delta:
                        output += delta
                        placeholder.markdown(output)
            else:
                resp = client.chat.completions.create(
                    model="model",
                    messages=st.session_state.messages,
                    temperature=temperature,
                    max_tokens=max_tokens,
                    stream=False,
                )
                output = resp.choices[0].message.content
                placeholder.markdown(output)
        except Exception as e:
            output = f"⚠️ Request failed: {e}"
            placeholder.markdown(output)

    st.session_state.messages.append({"role": "assistant", "content": output})
```
Step 5: Run the Streamlit UI file
```bash
streamlit run app.py
```
You’ll see the URL for the Streamlit UI in the terminal; open it in your browser and the UI is ready to use. Here’s a snapshot of how the UI looks:

Check if your endpoint is working as expected with an example prompt

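If you prefer to verify from the command line instead of the browser, the hedged sketch below exercises the same streaming path the Streamlit app uses (same <host>/v1/ base URL convention, same "model" name); the host and token are placeholders for your own values.

```python
# stream_check.py — hedged sketch: verify the endpoint's streaming path outside the UI.
# Assumes the same conventions as app.py: base URL = <host>/v1/ and model name "model".
from openai import OpenAI

client = OpenAI(api_key="ogc_***", base_url="https://<your-endpoint-host>/v1/")

stream = client.chat.completions.create(
    model="model",
    messages=[{"role": "user", "content": "Explain in two sentences why state-space layers help with long-context inference."}],
    max_tokens=256,
    stream=True,
)

# Print tokens as they arrive, mirroring how the Streamlit placeholder updates.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```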
Run limitless AI inference on Ori
Serve cutting-edge AI models in minutes without overspending on infrastructure:
- Deploy your preferred AI model in a single click, making inference truly effortless.
- Scale up automatically with demand, from zero to thousands of GPUs.
- Predictable pricing and automatic scale-down to zero help you minimize idle infrastructure costs.


