How to run and scale Nemotron Nano 9B v2 on Ori Inference Endpoints with Streamlit UI

Nemotron-Nano 9B v2 is NVIDIA’s latest compact language model aimed at combining reasoning strength with lightning-fast performance. At its core, the model uses a hybrid architecture that blends two approaches: the Transformer, which excels at learning long-range relationships in text, and Mamba 2, a newer “state-space” architecture that handles long sequences more efficiently by maintaining a compact running state instead of attending over every previous token.
In simple terms, the Transformer layers provide accuracy and contextual understanding, while the Mamba 2 layers speed things up dramatically. This design allows Nemotron-Nano 9B v2 to deliver up to six times the inference throughput of comparably sized Transformer-only models, all while supporting 128K-token context lengths and toggleable reasoning (developers can switch “thinking” on or off to balance accuracy and latency). Released under the NVIDIA Open Model License, it’s a small but powerful model for reasoning, multilingual tasks, and real-time AI agents.
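To make the reasoning toggle concrete, here is a minimal sketch that jumps ahead to the OpenAI-compatible endpoint we deploy later in this guide. It assumes the /think and /no_think system-prompt control tokens described on NVIDIA's model card; the host, token, and model name are placeholders you would replace with your own values, and you should confirm the exact control tokens for your build of the model.

```python
# Hedged sketch: switching Nemotron's "thinking" on or off via the system prompt.
# Assumptions: an OpenAI-compatible endpoint (set up later in this guide) and the
# /think and /no_think control tokens from NVIDIA's model card — verify both for
# your deployment before relying on this.
from openai import OpenAI

client = OpenAI(api_key="<endpoint token>", base_url="https://<your-endpoint-host>/v1/")

def ask(question: str, thinking: bool) -> str:
    system = "/think" if thinking else "/no_think"  # reasoning on vs. off
    resp = client.chat.completions.create(
        model="model",  # Ori routes this name to the deployed Nemotron model
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        max_tokens=512,
    )
    return resp.choices[0].message.content

question = "A train travels 120 km in 90 minutes. What is its average speed?"
print(ask(question, thinking=True))   # higher accuracy, higher latency
print(ask(question, thinking=False))  # faster, direct answer
```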
Here’s a brief overview of the model’s key specifications:
| NVIDIA Nemotron Nano 9B v2 | |
|---|---|
| Architecture | Hybrid: Mamba 2 (state-space layers) + Transformer attention layers |
| Size | 9 billion parameters (derived from a 12B-parameter base) |
| Context Length | 128K tokens |
| License | NVIDIA Open Model License (commercial use permitted) |
Performance benchmarks shared by NVIDIA indicate that it outperforms other open-weight models in the small language model category, such as Qwen3 8B.

Source: NVIDIA
Why Ori Inference Endpoints for Nemotron Nano 9B v2?
There’s no shortage of platforms today for running inference on leading open-source AI models. Yet many are either too rigid for real-world business needs or too complex and costly to manage. Ori Inference Endpoints offers a simpler alternative: an easy, scalable way to deploy cutting-edge AI models on dedicated GPUs with just one click. Unlike serverless inference, dedicated GPUs give you greater control over the type of compute, scalability, and the location of deployment.
Here’s how Ori Inference Endpoints makes production inference performant, scalable and effortless:
- Select a GPU and region to unlock powerful inference: serve your models on top-tier GPUs such as NVIDIA H200, H100, L40S or L4, and deploy in a region that helps minimize latency for your users.
- Autoscale without limits: Ori Inference Endpoints automatically scales up or down based on demand. You can also scale all the way down to zero, helping you reduce GPU costs when your endpoints are idle.
- Optimized for quick starts: model loading is designed to launch instantly, making scaling fast, even when starting from zero.
- HTTPS-secured API endpoints: experience peace of mind with HTTPS endpoints and authentication to keep them safe from unauthorized use.
- Pay for what you use, by the minute: per-minute pricing helps you keep your AI infrastructure affordable and costs predictable. No long-term commitments, just transparent, usage-based billing.
In this tutorial, we’ll walk you through deploying NVIDIA Nemotron Nano 9B v2, powered by Ori Inference Endpoints.
How to deploy Nemotron Nano 9B v2 on Ori’s Dedicated Inference Endpoints
We’ll deploy Nemotron Nano 9B v2 on an Ori Inference Endpoint backed by dedicated NVIDIA GPUs, and use Streamlit to create a UI to interact with the model.
Step 1: Spin up an Ori Inference Endpoint and choose Nemotron Nano 9B v2 as the model you want to deploy. Pick a suggested GPU at a location of your choice.

Set up the minimum and maximum number of replicas you need for automatic scaling. Inference Endpoints can scale automatically with demand and go all the way down to zero if your endpoint is idle, helping you save significantly on inference costs.

Note your endpoint's URL and API Access Token.
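Optionally, you can sanity-check the endpoint from Python before wiring up the UI. The sketch below assumes the same conventions the app later in this tutorial uses: the base URL is your endpoint host with /v1/ appended, the access token is passed as the API key, and the model name is simply "model" (Ori routes it to the Nemotron deployment behind the endpoint). The host and token values are placeholders.

```python
# quick_check.py — minimal sanity check of the Ori endpoint
# Assumes an OpenAI-compatible API served at <host>/v1/.
from openai import OpenAI

ENDPOINT_URL = "https://<your-endpoint-host>"   # host root from your Ori endpoint
ENDPOINT_TOKEN = "ogc_***"                      # access token from your Ori endpoint

client = OpenAI(api_key=ENDPOINT_TOKEN, base_url=f"{ENDPOINT_URL}/v1/")

# One short, non-streaming request is enough to confirm auth and routing work.
resp = client.chat.completions.create(
    model="model",  # Ori routes this to the Nemotron Nano 9B v2 deployment behind the endpoint
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```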
Step 2: Install Streamlit and OpenAI packages
```bash
pip install streamlit openai
```
Step 3: Create a Streamlit secrets file to store the endpoint credentials at the location: /root/.streamlit/secrets.toml
```toml
ENDPOINT_URL = "<Host URL provided by the Inference Endpoint>"
ENDPOINT_TOKEN = "<Token from the Inference Endpoint>"
```
Step 4: Save the Python code (app.py) to configure and run your Streamlit UI
```python
import os
import streamlit as st
from openai import OpenAI

st.set_page_config(page_title="Nemotron-Nano 9B — Ori Inference", page_icon="🧠")

# ---- Get config (secrets > env > UI) ----
DEFAULT_URL = st.secrets.get("ENDPOINT_URL") or os.getenv("ENDPOINT_URL") or ""
DEFAULT_TOKEN = st.secrets.get("ENDPOINT_TOKEN") or os.getenv("ENDPOINT_TOKEN") or ""

def normalize_base_url(u: str) -> str:
    """Ensure base_url looks like https://host/v1/ (strip any extra suffixes)."""
    u = (u or "").strip().rstrip("/")
    for suffix in ("/v1/chat/completions", "/v1"):
        if u.endswith(suffix):
            u = u[: -len(suffix)].rstrip("/")
    return f"{u}/v1/"

with st.sidebar:
    st.subheader("Connection")
    raw_url = st.text_input("Endpoint URL (host root)", value=DEFAULT_URL, placeholder="https://<your-endpoint-host>")
    token = st.text_input("Access Token", value=DEFAULT_TOKEN, type="password", placeholder="ogc_***")
    st.caption("Base will resolve to: <host>/v1/ and call /chat/completions")
    st.divider()
    st.subheader("Chat Settings")
    system_prompt = st.text_area("System prompt", value="You are a helpful assistant")
    # Sampling temperature (default 0.2)
    temperature = st.slider("Temperature", 0.0, 1.0, 0.2, 0.05)
    # Max tokens per response: default 256, up to 100,000
    max_tokens = st.slider("Max tokens", 16, 100000, 256, 16)
    stream = st.toggle("Stream responses", value=True)
    if st.button("Clear chat"):
        st.session_state.pop("messages", None)

# Validate inputs
if not raw_url or not token:
    st.error("Provide the Endpoint URL and Access Token (in sidebar).")
    st.stop()

base_url = normalize_base_url(raw_url)

# ---- OpenAI-compatible client pointing at Ori endpoint ----
client = OpenAI(api_key=token, base_url=base_url)

st.title("🧠 Nemotron-Nano 9B on Ori Inference")
st.caption(f"Using base URL: `{base_url}` → endpoint: `/chat/completions`")

# ---- Chat history ----
if "messages" not in st.session_state:
    st.session_state.messages = []
    if system_prompt:
        st.session_state.messages.append({"role": "system", "content": system_prompt})

# Render history
for m in st.session_state.messages:
    if m["role"] == "system":
        continue
    with st.chat_message(m["role"]):
        st.markdown(m["content"])

# Input
user_msg = st.chat_input("Ask me anything…")
if user_msg:
    st.session_state.messages.append({"role": "user", "content": user_msg})

    # Echo user
    with st.chat_message("user"):
        st.markdown(user_msg)

    # Assistant response
    with st.chat_message("assistant"):
        placeholder = st.empty()
        output = ""

        try:
            if stream:
                resp = client.chat.completions.create(
                    model="model",  # Ori routes to your configured Nemotron endpoint
                    messages=st.session_state.messages,
                    temperature=temperature,
                    max_tokens=max_tokens,
                    stream=True,
                )
                for chunk in resp:
                    delta = getattr(chunk.choices[0].delta, "content", None)
                    if delta:
                        output += delta
                        placeholder.markdown(output)
            else:
                resp = client.chat.completions.create(
                    model="model",
                    messages=st.session_state.messages,
                    temperature=temperature,
                    max_tokens=max_tokens,
                    stream=False,
                )
                output = resp.choices[0].message.content
                placeholder.markdown(output)
        except Exception as e:
            output = f"⚠️ Request failed: {e}"
            placeholder.markdown(output)

    st.session_state.messages.append({"role": "assistant", "content": output})
```
Step 5: Run the Streamlit UI file
```bash
streamlit run app.py
```
You’ll see the URL for the Streamlit UI in the terminal; open it in your browser and the UI is ready to use. Here’s a snapshot of how the UI looks:

Check if your endpoint is working as expected with an example prompt

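If you prefer to verify from the command line instead of the browser, the hedged sketch below exercises the same streaming path the Streamlit app uses (same <host>/v1/ base URL convention, same "model" name); the host and token are placeholders for your own values.

```python
# stream_check.py — hedged sketch: verify the endpoint's streaming path outside the UI.
# Assumes the same conventions as app.py: base URL = <host>/v1/ and model name "model".
from openai import OpenAI

client = OpenAI(api_key="ogc_***", base_url="https://<your-endpoint-host>/v1/")

stream = client.chat.completions.create(
    model="model",
    messages=[{"role": "user", "content": "Explain in two sentences why state-space layers help with long-context inference."}],
    max_tokens=256,
    stream=True,
)

# Print tokens as they arrive, mirroring how the Streamlit placeholder updates.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```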
Run limitless AI inference on Ori
Serve cutting-edge AI models in minutes without overspending on infrastructure:
- Deploy your preferred AI model in a single click, making inference truly effortless.
- Scale up automatically with demand, from zero to thousands of GPUs.
- Predictable pricing and automatic scale-down to zero help you minimize idle infrastructure costs.


