Tutorials

How to run and scale Nemotron Nano 9B v2 on Ori Inference Endpoints with Streamlit UI

Learn how to deploy NVIDIA’s Nemotron Nano 9B v2 on Ori Inference Endpoints with a Streamlit UI. Scale inference effortlessly with per-minute pricing and GPU autoscaling.
Deepak Manoor
Posted: October 27, 2025
    NVIDIA Nemotron Nano 9B v2

    Nemotron-Nano 9B v2 is NVIDIA’s latest compact language model aimed at combining reasoning strength with lightning-fast performance. At its core, the model uses a hybrid architecture that blends two approaches: the Transformer, which excels at learning long-range relationships in text, and Mamba 2, a newer “state-space” architecture designed to handle sequences more efficiently by processing information in a continuous flow rather than one token at a time.

    In simple terms, the Transformer layers provide accuracy and contextual understanding, while the Mamba 2 layers speed things up dramatically. This design allows Nemotron-Nano 9B v2 to deliver up to six times faster inference throughput than traditional models of similar size, all while supporting 128K-token context lengths and toggleable reasoning (developers can switch “thinking” on or off to balance accuracy and latency). Released under the NVIDIA Open Model License, it’s a small but powerful model for reasoning, multilingual tasks, and real-time AI agents.
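    To make the reasoning toggle concrete, here’s a minimal sketch that calls the model twice through an OpenAI-compatible endpoint (the kind we deploy later in this tutorial). It assumes the /think and /no_think system-prompt controls described on NVIDIA’s model card; the URL and token are placeholders.

    Python
    from openai import OpenAI

    # Placeholders: substitute your own endpoint host and access token.
    client = OpenAI(base_url="https://<your-endpoint-host>/v1/", api_key="<your-token>")

    # Assumption: per NVIDIA's model card, a "/think" system prompt enables the
    # reasoning trace and "/no_think" skips it for lower latency.
    for mode in ("/think", "/no_think"):
        resp = client.chat.completions.create(
            model="model",  # Ori routes this to the deployed Nemotron endpoint
            messages=[
                {"role": "system", "content": mode},
                {"role": "user", "content": "A train covers 120 km in 90 minutes. What is its average speed in km/h?"},
            ],
            max_tokens=1024,
        )
        print(mode, "->", resp.choices[0].message.content)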

    Here’s a brief overview of the model’s key specifications:

    NVIDIA Nemotron Nano 9B v2
    Architecture: Hybrid, Mamba 2 (state-space layers) + Transformer attention layers
    Size: 9 billion parameters (derived from a 12B-parameter base)
    Context Length: 128K tokens
    License: NVIDIA Open Model License (commercial use permitted)

    Join the Ori community on Discord: https://discord.gg/2VrezwZBAR

    Performance benchmarks shared by NVIDIA indicate that it outperforms other open-weights models in the small language model category, such as Qwen3 8B.

    NVIDIA Nemotron Nano Performance Benchmarks
    Source: NVIDIA

    Why Ori Inference Endpoints for Nemotron Nano 9B v2?

    There’s no shortage of platforms today for running inference on leading open-source AI models. Yet many are either too rigid for real-world business needs or too complex and costly to manage. Ori Inference Endpoints offers a simpler alternative: an easy, scalable way to deploy cutting-edge AI models on dedicated GPUs with just one click. Unlike serverless inference, dedicated GPUs give you greater control over the type of compute, how it scales, and where it is deployed.

    Here’s how Ori Inference Endpoints makes production inference performant, scalable and effortless:

    Select a GPU and region: serve your models on top-tier GPUs such as NVIDIA H200, H100, L40S, or L4, and deploy in a region that minimizes latency for your users.

    Autoscale without limits: Ori Inference Endpoints automatically scales up or down based on demand. You can also scale all the way down to zero, helping you reduce GPU costs when your endpoints are idle.

    Optimized for quick starts: models are loaded and launched quickly, keeping scaling fast even when starting from zero.

    HTTPS-secured API endpoints: Experience peace of mind with HTTPS endpoints and authentication to keep them safe from unauthorized use.

    Pay for what you use, by the minute: per-minute pricing helps you keep your AI infrastructure affordable and costs predictable. No long-term commitments, just transparent, usage-based billing (see the cost sketch after this list).
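    To see what per-minute billing with scale-to-zero means in practice, here’s a back-of-the-envelope sketch. The hourly rate is hypothetical; substitute the published price for your chosen GPU and region.

    Python
    # Hypothetical numbers for illustration only; plug in real rates from the
    # Ori pricing page for your GPU and region.
    HOURLY_RATE = 3.00                 # assumed $/hour for one GPU replica
    PER_MINUTE_RATE = HOURLY_RATE / 60

    active_minutes_per_day = 6 * 60    # endpoint busy ~6 h/day, scaled to zero otherwise
    scale_to_zero_monthly = PER_MINUTE_RATE * active_minutes_per_day * 30
    always_on_monthly = HOURLY_RATE * 24 * 30

    print(f"Scale-to-zero: ${scale_to_zero_monthly:,.2f}/month")
    print(f"Always-on:     ${always_on_monthly:,.2f}/month")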

    In this tutorial, we’ll walk you through deploying NVIDIA Nemotron Nano 9B v2, powered by Ori Inference Endpoints.

    How to deploy Nemotron Nano 9B v2 on Ori’s Dedicated Inference Endpoints

    We’ll deploy Nemotron Nano 9B v2 on an Ori Inference Endpoint backed by dedicated NVIDIA GPUs, and use Streamlit to create a UI for interacting with the model.

    Step 1: Spin up an Ori Inference Endpoint and choose Nemotron Nano 9B v2 as the model you want to deploy. Pick a suggested GPU at a location of your choice.

    How to run NVIDIA Nemotron Nano 9B

    Set the minimum and maximum number of replicas you need for automatic scaling. Inference Endpoints scale automatically with demand and can go all the way down to zero when your endpoint is idle, helping you save significantly on inference costs.

    Important

    Note your endpoint's URL and API Access Token.
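    Before moving on, it’s worth a quick sanity check that the endpoint answers. Here’s a minimal sketch using the requests library (pip install requests); it assumes the endpoint exposes the OpenAI-compatible /v1/chat/completions route that the rest of this tutorial relies on, and the URL and token are placeholders for your own values.

    Python
    import requests

    ENDPOINT_URL = "https://<your-endpoint-host>"   # placeholder host root
    ENDPOINT_TOKEN = "<your-access-token>"          # placeholder token

    # One-off chat completion to verify connectivity and authentication.
    resp = requests.post(
        f"{ENDPOINT_URL}/v1/chat/completions",
        headers={"Authorization": f"Bearer {ENDPOINT_TOKEN}"},
        json={
            "model": "model",
            "messages": [{"role": "user", "content": "Say hello in one sentence."}],
            "max_tokens": 32,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])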

    Step 2: Install Streamlit and OpenAI packages

    Bash/Shell
    pip install streamlit openai

    Step 3: Create a Streamlit secrets file to store the endpoint credentials at: /root/.streamlit/secrets.toml (Streamlit reads secrets from the .streamlit directory in your home folder, so this path assumes you’re running as root)

    TOML
    ENDPOINT_URL = "<Host URL provided by the Inference Endpoint>"
    ENDPOINT_TOKEN = "<Token from the Inference Endpoint>"
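    Optionally, confirm the secrets file parses before launching the app. Here’s a small sketch using Python’s built-in tomllib module (Python 3.11+), assuming the /root/.streamlit path used above:

    Python
    import tomllib

    # Path assumes you are running as root, matching the location above.
    with open("/root/.streamlit/secrets.toml", "rb") as f:
        secrets = tomllib.load(f)

    assert secrets["ENDPOINT_URL"].startswith("https://"), "expected an HTTPS endpoint URL"
    assert secrets["ENDPOINT_TOKEN"], "token should not be empty"
    print("secrets.toml looks good")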

    Step 4: Save the following Python code as app.py; it configures and runs your Streamlit UI

    Python
    import os
    import streamlit as st
    from openai import OpenAI

    st.set_page_config(page_title="Nemotron-Nano 9B — Ori Inference", page_icon="🧠")

    # ---- Get config (secrets > env > UI) ----
    DEFAULT_URL = st.secrets.get("ENDPOINT_URL") or os.getenv("ENDPOINT_URL") or ""
    DEFAULT_TOKEN = st.secrets.get("ENDPOINT_TOKEN") or os.getenv("ENDPOINT_TOKEN") or ""

    def normalize_base_url(u: str) -> str:
        """Ensure base_url looks like https://host/v1/ (strip any extra suffixes)."""
        u = (u or "").strip().rstrip("/")
        for suffix in ("/v1/chat/completions", "/v1"):
            if u.endswith(suffix):
                u = u[: -len(suffix)].rstrip("/")
        return f"{u}/v1/"

    with st.sidebar:
        st.subheader("Connection")
        raw_url = st.text_input("Endpoint URL (host root)", value=DEFAULT_URL, placeholder="https://<your-endpoint-host>")
        token = st.text_input("Access Token", value=DEFAULT_TOKEN, type="password", placeholder="ogc_***")
        st.caption("Base will resolve to: <host>/v1/ and call /chat/completions")
        st.divider()
        st.subheader("Chat Settings")
        system_prompt = st.text_area("System prompt", value="You are a helpful assistant")
        # Sampling temperature (0.0-1.0, default 0.2)
        temperature = st.slider("Temperature", 0.0, 1.0, 0.2, 0.05)
        # Max tokens per response (default 256, adjustable up to 100,000)
        max_tokens = st.slider("Max tokens", 16, 100000, 256, 16)
        stream = st.toggle("Stream responses", value=True)
        if st.button("Clear chat"):
            st.session_state.pop("messages", None)

    # Validate inputs
    if not raw_url or not token:
        st.error("Provide the Endpoint URL and Access Token (in sidebar).")
        st.stop()

    base_url = normalize_base_url(raw_url)

    # ---- OpenAI-compatible client pointing at Ori endpoint ----
    client = OpenAI(api_key=token, base_url=base_url)

    st.title("🧠 Nemotron-Nano 9B on Ori Inference")
    st.caption(f"Using base URL: `{base_url}`  →  endpoint: `/chat/completions`")

    # ---- Chat history ----
    if "messages" not in st.session_state:
        st.session_state.messages = []
        if system_prompt:
            st.session_state.messages.append({"role": "system", "content": system_prompt})

    # Render history
    for m in st.session_state.messages:
        if m["role"] == "system":
            continue
        with st.chat_message(m["role"]):
            st.markdown(m["content"])

    # Input
    user_msg = st.chat_input("Ask me anything…")
    if user_msg:
        st.session_state.messages.append({"role": "user", "content": user_msg})

        # Echo user
        with st.chat_message("user"):
            st.markdown(user_msg)

        # Assistant response
        with st.chat_message("assistant"):
            placeholder = st.empty()
            output = ""

            try:
                if stream:
                    resp = client.chat.completions.create(
                        model="model",  # Ori routes to your configured Nemotron endpoint
                        messages=st.session_state.messages,
                        temperature=temperature,
                        max_tokens=max_tokens,
                        stream=True,
                    )
                    for chunk in resp:
                        delta = getattr(chunk.choices[0].delta, "content", None)
                        if delta:
                            output += delta
                            placeholder.markdown(output)
                else:
                    resp = client.chat.completions.create(
                        model="model",
                        messages=st.session_state.messages,
                        temperature=temperature,
                        max_tokens=max_tokens,
                        stream=False,
                    )
                    output = resp.choices[0].message.content
                    placeholder.markdown(output)
            except Exception as e:
                output = f"⚠️ Request failed: {e}"
                placeholder.markdown(output)

        st.session_state.messages.append({"role": "assistant", "content": output})
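    A note on the design: the normalize_base_url helper accepts the endpoint URL in any of the forms users tend to paste (host root, host plus /v1, or the full /v1/chat/completions route) and resolves them all to a clean OpenAI-compatible base URL. Here’s a standalone check, with a hypothetical hostname:

    Python
    # Copied from app.py so it can run without Streamlit; the hostname is made up.
    def normalize_base_url(u: str) -> str:
        u = (u or "").strip().rstrip("/")
        for suffix in ("/v1/chat/completions", "/v1"):
            if u.endswith(suffix):
                u = u[: -len(suffix)].rstrip("/")
        return f"{u}/v1/"

    for candidate in (
        "https://example-endpoint.ori.co",
        "https://example-endpoint.ori.co/v1",
        "https://example-endpoint.ori.co/v1/chat/completions",
    ):
        assert normalize_base_url(candidate) == "https://example-endpoint.ori.co/v1/"
    print("normalize_base_url resolves all three forms identically")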

    Step 5: Run the Streamlit UI file

    Bash/Shell
    streamlit run app.py

    You’ll see the URL for the Streamlit UI in the terminal (Streamlit serves on port 8501 by default; if you’re running on a remote machine, add --server.address 0.0.0.0 so it’s reachable from your browser). Open the URL and your UI is ready. Here’s a snapshot of how the UI looks:

    Nemotron Nano with Streamlit UI

    Check that your endpoint is working as expected with an example prompt:

    Nemotron Nano Streamlit

    Run limitless AI inference on Ori

    Serve cutting-edge AI models in minutes without overspending on infrastructure:

    • Deploy your preferred AI model in a single click, making inference truly effortless.
    • Scale up automatically with demand, from zero to thousands of GPUs.
    • Predictable pricing and automatic scale-down to zero help you minimize spend on idle infrastructure.
