
Benchmarking gpt-oss-20b on Ori Inference Endpoints

Learn how Ori Inference Endpoints perform under load running GPT-OSS-20B on an NVIDIA H200 GPU, with metrics for throughput, latency, and concurrency scaling.
Lucas Domingos
Posted: November 12, 2025

    As generative AI adoption accelerates, inference workloads are becoming increasingly dynamic and unpredictable. From conversational agents to real-time analytics, model inference must scale fluidly while maintaining low latency and high reliability. Ori Inference Endpoints are engineered to deliver this balance, combining dedicated GPUs, autoscaling, and predictable pricing to make inference effortless and cost-effective.

    To validate Inference Endpoints’ performance under load, we benchmarked an Ori Inference Endpoint running GPT-OSS-20B on an NVIDIA H200 SXM GPU, using GPT-2 as a tokenizer and the Flexible Inference Benchmark (FIB) framework from CentML.

    This analysis explores how Ori Inference Endpoints behave under varying concurrency, how throughput and latency scale with input size, and where breakpoints appear that signal the need for replication or horizontal scaling.

    Benchmarking Framework

    The Flexible Inference Benchmarker (FIB) is a modular, open-source tool designed to simulate real-world usage scenarios. It sends controlled traffic to an endpoint and captures a detailed suite of metrics, including:

    • Input/Output Token Throughput (tok/s): Measures how many tokens are processed per second.
    • Time To First Token (TTFT): The latency between request and first generated token - a proxy for responsiveness.
    • Time Per Output Token (TPOT): Average time to generate each token during streaming.
    • Inter-Token Latency (ITL): Variance in time between token outputs, indicating generation stability.
    • Request Loss Rate: The fraction of dropped or failed requests under high concurrency.

    These metrics collectively illustrate how an inference endpoint performs as workloads scale and where operational limits begin.
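
    To make these definitions concrete, here is a minimal sketch (not FIB's internal implementation) of how the per-request latency metrics can be derived from the arrival times of streamed tokens; the function name and example timings are illustrative only.

    Python
    # Minimal sketch (not FIB's implementation): deriving TTFT, TPOT, and ITL
    # for a single request from the arrival times of its streamed output tokens.
    from statistics import mean

    def latency_metrics(request_start: float, token_times: list[float]) -> dict:
        """token_times: absolute arrival times (seconds) of each output token."""
        ttft = token_times[0] - request_start                         # Time To First Token
        gaps = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token latencies
        tpot = mean(gaps) if gaps else 0.0                            # mean time per output token
        return {"ttft_s": ttft, "tpot_s": tpot, "itl_samples_s": gaps}

    # Example: first token arrives after 140 ms, then ~2 ms between subsequent tokens
    print(latency_metrics(0.0, [0.140, 0.142, 0.144, 0.146]))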

    Experiment Setup

    Each experiment used the FIB command-line interface to simulate incremental concurrency loads. The command used was of the form:

    Bash/Shell
    fib benchmark -n <num_requests> \
      --backend openai-chat \
      --base-url <endpoint_base_url> \
      --endpoint /v1/chat/completions \
      --model model \
      --tokenizer gpt2 \
      --dataset-name sharegpt \
      --dataset-path <canonical_dataset> \
      --max-concurrent <concurrency_level> \
      --output-file <results_file> \
      --output-token-distribution uniform 5 6

    To maintain reproducibility, output lengths were constrained to short completions (5–6 tokens), using ShareGPT as the source dataset. Prompts exceeding 1,024 tokens were excluded due to GPT-2 tokenizer limits.
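
    For illustration, the filtering step might look like the sketch below, assuming the common ShareGPT JSON layout and the Hugging Face gpt2 tokenizer; the file name and field handling are placeholders rather than the exact script used in this benchmark.

    Python
    # Illustrative sketch: keep only ShareGPT prompts that fit within the
    # GPT-2 tokenizer's 1,024-token context limit before benchmarking.
    import json
    from transformers import GPT2TokenizerFast  # requires the transformers package

    MAX_TOKENS = 1024
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    def first_human_turns(path: str) -> list[str]:
        """Pull the first human turn from each conversation (assumes ShareGPT-style JSON)."""
        with open(path) as f:
            data = json.load(f)
        prompts = []
        for conv in data:
            turns = conv.get("conversations", [])
            if turns and turns[0].get("from") == "human":
                prompts.append(turns[0]["value"])
        return prompts

    filtered = [p for p in first_human_turns("sharegpt.json")
                if len(tokenizer.encode(p)) <= MAX_TOKENS]
    print(f"kept {len(filtered)} prompts of {MAX_TOKENS} GPT-2 tokens or fewer")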

    Each benchmark was executed on a freshly deployed endpoint to eliminate caching effects. GPU caches and inference sessions were reset between tests using Ori’s API, ensuring that every run reflected true model compute performance.

    Dataset and Caching Effects

    Caching can significantly distort inference metrics - improving latency and throughput by serving repeated inputs from memory rather than recomputing them on the GPU.

    To avoid such distortion:

    • Each run used a unique, filtered subset of the ShareGPT dataset.
    • Endpoints were deleted and recreated between runs to clear caches.

    A control experiment confirmed the impact of caching: repeating inputs from the 512–1024 token range increased token throughput from ~40K tok/s to over 150K tok/s and reduced TTFT from ~580 ms to ~140 ms. Such improvements, while useful in production, are misleading for raw performance measurement - underscoring the need for cold-start benchmarks.

    Performance Testing with Concurrency Ramping

    The central question: how does inference performance change as more users (requests) hit the same model simultaneously?

    By ramping concurrency, we simulated scenarios ranging from low to high user traffic. Resource contention, GPU saturation, and request queuing were measured systematically to find the concurrency threshold where reliability begins to drop.

    This process was automated through Ori’s REST APIs, allowing a Python script to:

    1. Filter and tokenize input data.
    2. Delete existing endpoints and confirm termination.
    3. Recreate endpoints with a defined configuration (H200SXM-141, gpt-oss-20B, london-3).
    4. Wait for endpoint readiness, then run FIB at increasing concurrency levels.

    This method ensured clean, repeatable runs for every concurrency stage.
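
    A simplified version of that loop is sketched below. The API base URL, paths, payload fields, and request counts are hypothetical placeholders standing in for the real endpoint-management calls; only the fib flags mirror the command shown earlier.

    Python
    # Illustrative sketch of the concurrency-ramping loop. The API base URL, paths,
    # and JSON fields are hypothetical placeholders, not the documented Ori API.
    import subprocess
    import time

    import requests

    ORI_API = "https://api.example-ori.invalid/v1"       # placeholder base URL
    HEADERS = {"Authorization": "Bearer <ORI_API_KEY>"}

    def recreate_endpoint() -> str:
        """Delete and recreate the endpoint so every run starts cold (placeholder paths)."""
        requests.delete(f"{ORI_API}/inference-endpoints/benchmark", headers=HEADERS)
        resp = requests.post(
            f"{ORI_API}/inference-endpoints",
            headers=HEADERS,
            json={"name": "benchmark", "model": "gpt-oss-20b",
                  "gpu": "H200SXM-141", "region": "london-3"},
        )
        resp.raise_for_status()
        return resp.json()["base_url"]

    def wait_until_ready(base_url: str) -> None:
        """Poll the OpenAI-compatible models route until the endpoint responds."""
        while True:
            try:
                if requests.get(f"{base_url}/v1/models", timeout=10).status_code == 200:
                    return
            except requests.RequestException:
                pass
            time.sleep(10)

    for concurrency in (8, 16, 32, 64, 128, 256, 512):
        base_url = recreate_endpoint()
        wait_until_ready(base_url)
        subprocess.run([
            "fib", "benchmark", "-n", str(concurrency * 20),
            "--backend", "openai-chat",
            "--base-url", base_url,
            "--endpoint", "/v1/chat/completions",
            "--model", "model",
            "--tokenizer", "gpt2",
            "--dataset-name", "sharegpt",
            "--dataset-path", "sharegpt_filtered.json",
            "--max-concurrent", str(concurrency),
            "--output-file", f"results_c{concurrency}.json",
            "--output-token-distribution", "uniform", "5", "6",
        ], check=True)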

    Results and Observations

    Throughput and Latency

    At moderate concurrency (up to 32 simultaneous requests), the system maintained 100% success rates with predictable, linear performance across input sizes:

    | Input Size (tokens)   | Requests/s | Input tok/s | Output tok/s | Mean TTFT (ms) | Mean TPOT (ms) | Mean ITL (ms) |
    |-----------------------|------------|-------------|--------------|----------------|----------------|---------------|
    | 1–63 (Small)          | 212.3      | 19,122      | 1,061        | 138.9          | 2.1            | 9.2           |
    | 64–255 (Medium)       | 133.5      | 24,865      | 667          | 223.3          | 2.8            | 12.0          |
    | 256–511 (Large)       | 73.6       | 29,655      | 368          | 414.3          | 3.3            | 13.9          |
    | 512–1024 (Summaries)  | 53.5       | 39,237      | 267          | 577.6          | 3.6            | 15.6          |

    • Throughput declined as inputs grew longer, a natural outcome of increased attention compute.
    • TTFT scaled linearly with input size, reflecting efficient batching with no abnormal queueing delays.
    • TPOT and ITL remained consistent, showing that token streaming stayed smooth even as load increased.
    • Output tok/s per user (275–470 tok/s) confirmed a stable and responsive user-side experience (see the quick check below).
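
    As a quick check on that last point, the per-user streaming rate follows almost directly from TPOT (roughly 1,000 ms divided by the mean time per output token, ignoring TTFT):

    Python
    # Per-user output rate implied by the mean TPOT values in the table above
    for label, tpot_ms in [("Small", 2.1), ("Medium", 2.8), ("Large", 3.3), ("Summaries", 3.6)]:
        print(f"{label}: ~{1000 / tpot_ms:.0f} tok/s per user")
    # prints ~476, ~357, ~303, and ~278 tok/s, matching the 275–470 tok/s range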

    These results established a reliable performance baseline for single-replica endpoints.

    Concurrency Scaling

    Under incremental concurrency ramping, Ori’s inference infrastructure remained stable up to roughly 500 concurrent requests. Within the stable concurrency window, throughput per user and latency metrics stayed remarkably consistent. This demonstrates that Ori’s inference scheduler maintained efficient GPU utilization and batching even as request volume scaled by an order of magnitude.

    Ori’s throughput and latency results compare well with many of the cloud providers listed in Artificial Analysis’ gpt-oss-20b benchmarking. The variation in performance with concurrency also followed a linear trend similar to the concurrency benchmark results from AIMultiple.

    Key Takeaways

    • Predictable scaling: Ori Inference Endpoints maintain stable throughput and latency up to ~500 concurrent requests on a single H200 SXM GPU.
    • Caching awareness: Cache invalidation is critical for benchmarking accuracy; cold runs are the only reliable measure.
    • Consistent streaming: TTFT, TPOT, and ITL metrics indicate sustained decoding stability even under stress.
    • Scalable architecture: Beyond 500 concurrent users, horizontal replication or autoscaling ensures continued performance consistency.

    Conclusion

    The benchmarking study confirms that Ori’s inference endpoints deliver reproducible, stable, and scalable performance for large-scale language models. Running GPT-OSS-20B on a single H200 SXM GPU, the platform sustained high token throughput and minimal latency across varying input sizes and traffic patterns.

    At real-world concurrency levels, Ori’s infrastructure efficiently balances GPU utilization, request scheduling, and caching controls, reinforcing its suitability for production-grade inference at scale.

