GPT-OSS-20B benchmark: Comparing response time, tokens/second, and cost efficiency on L4, L40S & H100
We benchmarked modern open-source LLMs across several popular GPUs to measure real-world context limits, throughput, latency, and cost efficiency under varying levels of concurrency — as close as possible to real production conditions. Here we share the results.

More and more companies of all sizes are already using LLMs across many aspects of their operations, but the key concern they raise now is: “Where does our data go?” As a result, many organizations have started considering hosting LLMs locally.
The market already offers high-quality LLMs (OpenAI, DeepSeek, Mistral, Qwen, and many others), but the real questions are: what are the limits, how much throughput can you achieve, and at what cost? In this post, we provide clear answers.
Shadow AI issue
A short quote from the OpenAI Privacy Policy: https://openai.com/policies/row-privacy-policy/
We collect Personal Data that you provide in the input to our Services, including your prompts and other content you upload… We may use Personal Data to provide, maintain, develop, and improve our Services… We may disclose Personal Data if required to do so by law or in response to valid legal requests by public authorities.
So public ChatGPT (the web app at chatgpt.com or the mobile app) should not be considered a legally privileged or confidential communication system. It is a commercial online service governed by its Terms and Privacy Policy, not a protected legal environment.
This creates a new issue called Shadow AI. Companies take a serious risk when their employees use public AI tools outside approved systems, potentially uploading sensitive data — contracts, source code, financials, customer PII, or strategic plans — without governance or data-processing safeguards. This exposes the company to data leaks, compliance violations, IP uncertainty, and legal discovery risks while weakening security controls and auditability. For founders, the real danger is not just misuse — it’s losing visibility and control over how critical information flows through unapproved third-party AI systems.
Strategy 1: Build your own tooling using LLM APIs
Companies can use APIs (e.g., OpenAI) to build agents or custom chat systems. This model is safer than public SaaS, but not risk-free.
OpenAI states that API data is not used for model training by default and is typically retained for up to 30 days for abuse monitoring.
However, data may be disclosed if required by law, and the 30-day period is not an absolute guarantee against extended retention due to legal or compliance obligations.
Policy changes at arbitrary times are also a common concern.
Strategy 2: Self-hosting with a local provider
For example, a German company hosting its LLM with a German provider under a signed DPA gains clearer legal recourse than relying on a U.S.-based AI provider.
If a dispute arises, litigation within the same jurisdiction is generally more straightforward than cross-border claims.
This does not eliminate disclosure or operational risks — it increases jurisdictional control, not absolute security.
Strategy 3: Self-hosting using on-premise hardware
Running LLMs on fully owned on-premise hardware further reduces third-party exposure. Controlled physical access and isolated infrastructure maximize data locality and remove dependence on external AI or cloud providers.
Absolute zero risk does not exist. On-premise deployment reduces third-party trust assumptions but shifts full responsibility for security, resilience, and compliance to the organization.
Self-hosting - measuring limitations
Self-hosting a model on custom hardware is not technically complex for experienced teams, but it is important to understand the limitations, which mainly depend on the hardware. So here we will try to measure them, using the gpt-oss-20b model hosted on different hardware with different vLLM parameters as the example.
To understand every term used below, I strongly recommend reading my LLM Terminology explained simply post.
Max effective sequence length benchmark (input + generated tokens)
On every sequence run, the model takes the whole prompt (including, for example, system messages, history messages, summary, and RAG context) and then generates output from it.
For reasoning models, it first generates some amount of “thoughts” in the output and only then generates the real response.
The sum of the actual input and output tokens is called the effective sequence length.
The actual effective sequence length can’t be larger than context_window. Example context windows:
- Proprietary GPT-5 mini - 400k tokens
- Open gpt-oss-20B - ~131k tokens (128 Ki tokens)
- Open gpt-oss-120B - ~131k tokens (128 Ki tokens)
- Open GPT-J, which we self-hosted in 2022 (a GPT-3 Davinci analog) - ~2k tokens (2 Ki tokens)
So what we will do in the experiment:
- In each experiment, we take the first N words from Moby-Dick (Herman Melville, plain text) and ask the model to return a summary in JSON format.
- We increase the prompt in large steps (e.g., 5k words per point) and measure how fast it generates results and whether the result is correct.
- We add a secret code at the start of the book to ensure the model is still using the full content.
- We always add a random number at the beginning to prevent vLLM from using prefix caching (we’re interested in the worst case).
- At each point we run two checks: whether the response is good and parseable, and whether the secret code is correct. Since availability/parseability issues can generally be retried, we do 3 attempts: if all 3 attempts are OK we mark the point green; if at least one attempt is OK we mark it as “retried” (yellow); if all attempts fail we stop the experiment and mark that point red in the charts (no sense in continuing).
- If the output is parseable but the secret code is not detected in at least one experiment, we mark it red, because this kind of sense loss can’t be fixed by retries.
In this test we never run parallel requests: we wait until the first one finishes before starting the next (single-user emulation).
We always run all requests with:
medium: {
  max_tokens: 1536,
  temperature: 0.15,
  top_p: 0.9
},
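For reference, here is a minimal sketch of how a single benchmark request can be issued against the vLLM OpenAI-compatible endpoint with the parameters above (the endpoint URL, model name, file name, prompt wording, and helper names are illustrative, not our exact harness):

import random
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API on port 8000 (see the docker command below).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def build_prompt(n_words: int, secret_code: str) -> str:
    # Random prefix defeats vLLM prefix caching; the secret code near the start
    # lets us verify later that the model actually used the full content.
    words = open("moby_dick.txt").read().split()
    prefix = f"run-{random.randint(0, 10**9)}"
    return (
        f"{prefix}\nSECRET CODE: {secret_code}\n\n" + " ".join(words[:n_words]) +
        "\n\nSummarize the text above as JSON with keys: main_theme, key_characters, "
        "causal_events, hidden_implication, secret_code, confidence."
    )

resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": build_prompt(5000, "314781-21")}],
    max_tokens=1536,
    temperature=0.15,
    top_p=0.9,
)
print(resp.choices[0].message.content)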
Here is an example of the JSON we received:
{
"status": 200,
"streamedResponse": {
"main_theme": "The obsessive quest for the white whale and its destructive consequences",
"key_characters": [
"Ishmael",
"Captain Ahab",
"Starbuck",
"Queequeg",
"Moby Dick"
],
"causal_events": [
"Ahab's vow to kill Moby Dick",
"The crew's moral conflict",
"The final battle with the whale",
"The ship's destruction"
],
"hidden_implication": "The story critiques the hubris of humanity in confronting nature",
"secret_code": "314781-21",
"confidence": 0
}
Structured output mode
We executed all our experiments in structured output mode. In short, this is an API-level setting that asks vLLM to respect an output JSON schema at decoding time.
Modern structured output modes (like those in vLLM or OpenAI-compatible APIs) don't just prepend "please produce correct JSON" to the prompt; they actively constrain generation at the token level. Under the hood, the model still produces logits (raw scores for every possible next token), but the runtime masks out any tokens that would break the expected structure (for example, invalid JSON syntax at a given position). Sampling with temperature or top-p then happens only over the allowed tokens. This makes invalid outputs effectively impossible rather than just unlikely.
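Conceptually, one constrained decoding step looks like this (a toy sketch for intuition only, not vLLM's actual implementation; the allowed-token set would come from a JSON grammar/state machine):

import numpy as np

def constrained_sample(logits: np.ndarray, allowed_token_ids: list[int],
                       temperature: float = 0.15) -> int:
    # Toy illustration of grammar-constrained decoding: keep only the tokens
    # the JSON grammar allows at this position, then sample among them.
    masked = np.full_like(logits, -np.inf)
    masked[allowed_token_ids] = logits[allowed_token_ids]
    masked = masked - masked.max()            # numerical stability
    probs = np.exp(masked / temperature)      # disallowed tokens get probability 0
    probs = probs / probs.sum()
    return int(np.random.choice(len(logits), p=probs))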
In production, this is critical: it dramatically reduces malformed responses, eliminates most retry logic, and significantly improves consistency and adherence to the required schema. If you naively prompt a model to “return valid JSON,” it will work most of the time — but occasionally (say 1 in 100 responses) you’ll still get broken JSON and need to retry. Structured decoding avoids that class of failures, which is why it’s essential for reliable, production-grade systems.
And vLLM supports it!
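With the vLLM OpenAI-compatible server this is a per-request option. A hedged sketch of how the schema can be passed (the exact field differs between vLLM versions: older releases accept guided_json in extra_body, newer ones also accept an OpenAI-style response_format with a JSON schema; check your version's docs):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

summary_schema = {
    "type": "object",
    "properties": {
        "main_theme": {"type": "string"},
        "key_characters": {"type": "array", "items": {"type": "string"}},
        "causal_events": {"type": "array", "items": {"type": "string"}},
        "hidden_implication": {"type": "string"},
        "secret_code": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["main_theme", "key_characters", "causal_events",
                 "hidden_implication", "secret_code", "confidence"],
}

resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Summarize the attached text as JSON."}],
    max_tokens=1536,
    temperature=0.15,
    top_p=0.9,
    # vLLM-specific extension: constrain decoding to this JSON schema.
    extra_body={"guided_json": summary_schema},
)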
How we started vLLM
We used a plain Docker container, deployed via Terraform to a server with a GPU:
docker run --rm \
--name gpt-20b \
--gpus all \
--ipc=host \
-e HUGGING_FACE_HUB_TOKEN=${var.hf_token} \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model ${var.model_id} \
--dtype bfloat16 \
--kv-cache-dtype fp8
At the time of this research, the latest version of vLLM was 0.15.1.
Besides the explicit parameters, there are many implicit ones under the hood. Several important parameters that may be worth tuning are below. I intentionally left them at default values, since vLLM defaults are a good starting point for many tasks.
| docker command param | default | Description |
|---|---|---|
| --gpu-memory-utilization | 0.9 | Max fraction of VRAM vLLM can use for weights, KV cache, and activations; recommended values are 0.9-0.95 if we run one vLLM instance per GPU. The remaining ~5%-10% is needed for overheads; if you don't leave enough headroom, you will hit OOM (out-of-memory). We use 0.9 for client setups as a battle-tested OOM-safe value. |
| --max-num-batched-tokens | 2048 | Upper bound on the total number of tokens processed in a single scheduler step across all active sequences. If exceeded, the scheduler splits the workload across multiple forward passes. Tradeoff: with lower values, TTFT grows because you need more forward passes, but decoding speed can increase. |
| --max-model-len | 131072 | Maximum allowed ESL. Must not exceed the model's architectural context window; the default equals it (128 Ki tokens for gpt-oss-20b). |
| --max-num-seqs | 256 | How many parallel active sequences vLLM can handle. If it is set to 1 and two requests arrive close in time, the second request is queued and the user waits for the full prefill and full decode of the first request. Increasing this does not generally cause OOM, but actual parallelism may be limited by VRAM capacity. |
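For intuition on the --max-num-batched-tokens tradeoff described in the table above, here is a rough sketch of how the number of prefill passes scales (a simplification: chunked-prefill scheduling also interleaves decode tokens from other sequences):

import math

def prefill_passes(prompt_tokens: int, max_num_batched_tokens: int) -> int:
    # A long prompt is chunked into scheduler steps of at most
    # max_num_batched_tokens each before the first token can be emitted.
    return math.ceil(prompt_tokens / max_num_batched_tokens)

print(prefill_passes(128_000, 2048))   # 63 steps with the default
print(prefill_passes(128_000, 8192))   # 16 steps if you raise MNBT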
gpt-oss-20B on L4 24G
Typical rental price for a server with a single L4 24G, assuming mid-range stable on-demand pricing: €520–€1,175/mo; the higher end is closer to enterprise-grade hosters.
If you want to have it on premises:

Here is our benchmark result:

| input_tokens | thinking_tokens | output_tokens | effective_sequence_length | time_to_first_token | decoding_tokens/s | thinking_time | Json_test_ok | secret_test_ok | nonempty_test_ok |
|---|---|---|---|---|---|---|---|---|---|
| 7591 | 144 | 63 | 7912 | 2.926 | 62.387 | 2.251 | pass | pass | pass |
| 14203 | 205 | 67 | 14537 | 3.160 | 60.728 | 3.227 | pass | pass | pass |
| 20765 | 105 | 90 | 20993 | 5.228 | 60.297 | 1.749 | pass | pass | pass |
| 27431 | 245 | 81 | 27740 | 7.800 | 43.723 | 5.918 | pass | pass | pass |
| 33884 | 159 | 52 | 34043 | 10.718 | 57.761 | 2.810 | pass | pass | pass |
| 40717 | 107 | 100 | 40810 | 16.042 | 56.480 | 1.910 | pass | pass | pass |
| 47511 | 67 | 99 | 47513 | 18.169 | 34.185 | 1.187 | pass | pass | pass |
| 54483 | 71 | 86 | 54420 | 22.886 | 55.011 | 1.265 | pass | pass | pass |
| 61357 | 62 | 93 | 61266 | 29.654 | 54.082 | 1.103 | pass | pass | pass |
| 68018 | 97 | 140 | 67941 | 36.872 | 52.843 | 1.894 | pass | pass | pass |
| 74886 | 107 | 92 | 74717 | 40.947 | 52.013 | 1.918 | pass | pass | pass |
| 81677 | 85 | 187 | 81546 | 49.100 | 50.841 | 1.731 | pass | pass | pass |
| 88556 | 48 | 147 | 88273 | 55.950 | 50.284 | 1.007 | pass | pass | pass |
| 95944 | 104 | 107 | 95611 | 63.822 | 49.415 | 1.991 | pass | pass | pass |
| 102616 | 107 | 99 | 102194 | 71.371 | 48.677 | 2.153 | pass | pass | pass |
| 109060 | 58 | 167 | 108596 | 80.934 | 48.056 | 1.326 | pass | pass | pass |
| 115676 | 111 | 95 | 115147 | 89.445 | 47.455 | 2.119 | pass | pass | pass |
| 122257 | 83 | 115 | 121669 | 98.105 | 46.621 | 1.719 | pass | pass | pass |
| 128819 | 59 | 119 | 128175 | 109.318 | 46.756 | 1.324 | pass | pass | pass |
| 135614 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | fail | fail | fail |
Time to first token grows non-linearly and reaches ~110 seconds for a ~128k-token prompt. Decoding speed falls from ~62 to ~47 tokens/second.
And on this setup we can run sequences with ~55k input tokens (ESL up to ~55k + 1.5k = ~56.5k).
VRAM usage during this experiment was stable, reaching 21.725 GiB / 22.494 GiB.
The two decoding-speed drops are most likely caused by internal attention-kernel switching inside the inference engine. As the input sequence length crosses certain thresholds (around ~30k and ~50k tokens), the runtime switches to different CUDA/FlashAttention kernel variants or GEMM configurations optimized for larger context sizes. These transitions can temporarily reduce SM occupancy or change memory access patterns, resulting in a brief throughput dip before stabilizing again. Since this happens with a single request on an otherwise idle GPU and the performance trend recovers immediately afterward, it strongly suggests a kernel-boundary effect rather than memory pressure or resource contention.
Does the model think more with more content?
In the same experiment we also collected the reasoning-token volumes (all experiments were executed with a fixed reasoning_effort=medium):

Conclusion: the number of thinking tokens does not depend on input size; on average the model "thinks" for about the same time (with some variance) whether the input is 5k words or 40k tokens.
gpt-oss-20B on L40S 48G
Another very popular card on the market is the L40S. The average rental price is around €1,050–€1,400/mo; a new card on eBay might cost around $9,249.
VRAM consumption was 42.112 GiB / 44.988 GiB and stayed stable during the full experiment session.

| input_tokens | thinking_tokens | output_tokens | effective_sequence_length | time_to_first_token | decoding_tokens/s | thinking_time | Json_test_ok | secret_test_ok | nonempty_test_ok |
|---|---|---|---|---|---|---|---|---|---|
| 7589 | 76 | 64 | 7843 | 2.110 | 160.183 | 0.436 | pass | pass | pass |
| 14204 | 70 | 101 | 14437 | 1.048 | 156.022 | 0.452 | pass | pass | pass |
| 20766 | 136 | 105 | 21041 | 3.262 | 151.477 | 0.882 | pass | pass | pass |
| 27431 | 113 | 78 | 27604 | 2.414 | 148.986 | 0.690 | pass | pass | pass |
| 33883 | 120 | 82 | 34034 | 3.243 | 145.848 | 0.794 | pass | pass | pass |
| 40716 | 142 | 76 | 40820 | 4.237 | 142.857 | 0.920 | pass | pass | pass |
| 47511 | 76 | 105 | 47528 | 5.408 | 140.637 | 0.579 | pass | pass | pass |
| 54480 | 86 | 64 | 54410 | 7.690 | 132.159 | 0.650 | pass | pass | pass |
| 61355 | 191 | 77 | 61377 | 8.249 | 121.985 | 1.466 | pass | pass | pass |
| 68020 | 79 | 165 | 67950 | 11.399 | 123.732 | 0.812 | pass | pass | pass |
| 74885 | 178 | 52 | 74747 | 11.489 | 105.023 | 1.708 | pass | pass | pass |
| 81677 | 125 | 128 | 81527 | 14.947 | 105.285 | 1.322 | pass | pass | pass |
| 88557 | 125 | 117 | 88321 | 15.360 | 100.666 | 1.431 | pass | pass | pass |
| 95943 | 83 | 310 | 95792 | 19.289 | 109.961 | 1.050 | pass | pass | pass |
| 102615 | 92 | 100 | 102179 | 21.448 | 94.955 | 1.151 | pass | pass | pass |
| 109060 | 115 | 96 | 108582 | 23.942 | 88.470 | 1.483 | pass | pass | pass |
| 115679 | 132 | 121 | 115197 | 26.399 | 82.761 | 2.045 | pass | pass | pass |
| 122258 | 121 | 149 | 121742 | 28.948 | 81.008 | 2.128 | pass | pass | pass |
| 128817 | 51 | 138 | 128184 | 31.576 | 90.997 | 0.904 | pass | pass | pass |
| 135617 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | fail | fail | fail |
gpt-oss-20B on H100-1-80G
Across the whole test run we saw VRAM up to 74.208Gi/79.647Gi, which means that with one sequence at a time the VRAM was not fully occupied.

| input_tokens | thinking_tokens | output_tokens | effective_sequence_length | time_to_first_token | decoding_tokens/s | thinking_time | Json_test_ok | secret_test_ok | nonempty_test_ok |
|---|---|---|---|---|---|---|---|---|---|
| 7590 | 121 | 62 | 7887 | 2.542 | 228.180 | 0.491 | pass | pass | pass |
| 14202 | 146 | 70 | 14480 | 1.784 | 223.602 | 0.601 | pass | pass | pass |
| 20767 | 72 | 72 | 20945 | 1.799 | 223.256 | 0.300 | pass | pass | pass |
| 27433 | 142 | 54 | 27612 | 1.418 | 219.731 | 0.604 | pass | pass | pass |
| 33885 | 110 | 52 | 33996 | 1.378 | 217.450 | 0.462 | pass | pass | pass |
| 40714 | 48 | 115 | 40763 | 1.727 | 245.482 | 0.145 | pass | pass | pass |
| 47509 | 107 | 184 | 47636 | 2.080 | 192.079 | 0.683 | pass | pass | pass |
| 54482 | 92 | 63 | 54417 | 3.308 | 171.460 | 0.589 | pass | pass | pass |
| 61355 | 106 | 86 | 61301 | 2.919 | 160.401 | 0.742 | retried | retried | retried |
| 68019 | 151 | 86 | 67942 | 4.344 | 146.296 | 1.152 | pass | pass | pass |
| 74887 | 102 | 69 | 74690 | 4.623 | 142.381 | 0.815 | pass | pass | pass |
| 81678 | 91 | 157 | 81523 | 5.058 | 150.303 | 0.863 | pass | pass | pass |
| 88554 | 68 | 91 | 88235 | 6.073 | 141.964 | 0.647 | pass | pass | pass |
| 95943 | 78 | 133 | 95611 | 6.857 | 138.179 | 0.858 | pass | pass | pass |
| 102615 | 66 | 114 | 102167 | 8.650 | 131.965 | 0.798 | pass | pass | pass |
| 109062 | 64 | 121 | 108558 | 7.991 | 131.392 | 0.782 | pass | pass | pass |
| 115678 | 46 | 108 | 115097 | 9.199 | 133.565 | 0.599 | pass | pass | pass |
| 122257 | 36 | 434 | 121941 | 9.913 | 161.512 | 0.641 | pass | pass | pass |
| 128817 | 106 | 128 | 128229 | 10.754 | 107.339 | 1.499 | pass | pass | pass |
| 135615 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | fail | fail | fail |
At the 60k point we saw the first failure, which was then successfully retried on the next attempt. The issue was the following: the model started to repeat itself:
"{"main_theme":"The secret code is 314781-21","key_characters":["Ishmael","Queequeg","Captain Ahab","Captain Peleg","Captain Bildad"],\
"causal_events":[ "Ishmael leaves New Bedford","Queequeg joins him","They sign aboard the Pequod","Captain Ahab is introduced","The Pequod sets sail","The crew faces the first whale encounter","The crew confronts the great whale","The crew is pursued by the whale","The crew is rescued","The crew returns to Nantucket","The crew faces the final whale","The crew is rescued again","The crew returns home","The crew faces the final whale again","The crew is rescued again","The crew returns home","The crew faces the final whale again","The crew is rescued again","The crew returns home","The crew faces the final whale again"... and so on up to max_tokens
See details about this issue at the end of the post.
Managed gpt-oss-20b hosting
Some hosting services nowadays provide inference-as-a-service. For example, they may host gpt-oss-20b for you, so you pay for the tokens used instead of paying for a whole server running 24/7. This is still cheaper than many official vendor LLM APIs, but again the risk of data exposure is high if you don't sign dedicated DPAs.
Generally they don’t disclose what hardware they use under the hood or what vLLM parameters they run.

| input_tokens | thinking_tokens | output_tokens | effective_sequence_length | time_to_first_token | decoding_tokens/s | thinking_time | Json_test_ok | secret_test_ok |
|---|---|---|---|---|---|---|---|---|
| 7585 | 134 | 77 | 7913 | 0.919 | 88.888 | 1.538 | pass | pass |
| 14199 | 138 | 103 | 14506 | 1.289 | 84.099 | 1.599 | pass | pass |
| 20764 | 82 | 88 | 20972 | 1.985 | 74.951 | 1.174 | pass | pass |
| 27429 | 102 | 93 | 27610 | 2.804 | 59.921 | 1.800 | pass | pass |
| 33881 | 150 | 87 | 34069 | 3.747 | 54.639 | 3.151 | pass | pass |
| 40712 | 79 | 90 | 40771 | 5.011 | 49.820 | 1.841 | pass | pass |
| 47507 | 87 | 88 | 47522 | 6.175 | 45.707 | 2.200 | pass | pass |
| 54475 | 182 | 87 | 54527 | 7.829 | 43.253 | 4.825 | pass | pass |
| 61351 | 91 | 84 | 61284 | 9.472 | 36.097 | 3.006 | pass | pass |
| 68016 | 92 | 90 | 67887 | 11.330 | 38.478 | 2.824 | pass | pass |
| 74882 | 46 | 96 | 74660 | 13.323 | 39.453 | 1.876 | pass | pass |
| 81673 | 79 | 88 | 81441 | 15.438 | 37.809 | 2.651 | pass | pass |
| 88551 | 87 | 125 | 88288 | 17.549 | 26.334 | 4.732 | pass | pass |
| 95939 | 97 | 132 | 95628 | 20.319 | 27.321 | 4.479 | pass | pass |
| 102614 | 90 | 101 | 102180 | 23.355 | 22.869 | 5.336 | pass | pass |
| 109059 | 55 | 160 | 108587 | 25.937 | 21.237 | 4.751 | pass | pass |
| 115674 | 106 | 1430 | 116479 | 0.000 | 41.197 | 6.034 | fail | fail |
What we can see from this chart:
- For short sequences, decoding throughput was better than on L4 (~85–90 tokens/s vs ~62 tokens/s), but worse than on our L40S and H100 setups.
- Time to first token is nearly the same as on L40S (e.g., ~5s at ~40k input tokens), but our H100 setup is better.
Parallelism benchmark (multi-user wait times)
Another test: we take a fixed point, e.g. 10k input words (pretty typical for an average RAG/agent task), and simulate requests from parallel users. This is like several users using the model at the same time, and we observe how average per-user throughput and response time react. In the previous example we measured decoding throughput (the speed at which tokens are generated after the first token is received). In this test it’s more interesting to look at min/max TTFT and min response time (for the most “lucky” user) versus max response time (for the most “unlucky” user).
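A simplified sketch of how such a multi-user run can be driven (hedged: the endpoint, model name, and prompt are illustrative, and our harness additionally validates the JSON and the secret code):

import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def one_user(prompt: str) -> tuple[float, float]:
    # Returns (time_to_first_token, total_response_time) for one streamed request.
    start = time.perf_counter()
    ttft = None
    stream = client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1536, temperature=0.15, top_p=0.9,
        stream=True,
    )
    for chunk in stream:
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    return (ttft if ttft is not None else total), total

def run_virtual_users(n_users: int, prompt: str) -> None:
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        results = list(pool.map(one_user, [prompt] * n_users))
    ttfts, totals = zip(*results)
    print(f"{n_users} users: TTFT {min(ttfts):.2f}-{max(ttfts):.2f}s, "
          f"response {min(totals):.2f}-{max(totals):.2f}s")

# Example: emulate 1..8 parallel users on the same ~10k-word prompt.
# for n in range(1, 9):
#     run_virtual_users(n, ten_k_word_prompt)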
gpt-oss-20B on L4 24G
On model load we saw the following in vLLM logs:
Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
Maximum concurrency for 131,072 tokens per request: 2.78x
Let’s run a benchmark:

| virtual_users | min_ttft_s | max_ttft_s | min_response_time_s | max_response_time_s | Json_test_ok | secret_test_ok | nonempty_test_ok |
|---|---|---|---|---|---|---|---|
| 1 | 3.502 | 3.502 | 6.932 | 6.932 | pass | pass | pass |
| 2 | 3.834 | 5.981 | 11.118 | 12.063 | pass | pass | pass |
| 3 | 4.553 | 9.957 | 15.123 | 16.375 | pass | pass | pass |
| 4 | 4.549 | 13.267 | 17.619 | 21.218 | pass | pass | pass |
| 5 | 4.609 | 16.378 | 20.988 | 24.036 | pass | pass | pass |
| 6 | 4.630 | 19.597 | 23.326 | 29.127 | pass | pass | pass |
| 7 | 4.675 | 22.669 | 27.487 | 31.389 | pass | pass | pass |
| 8 | 4.650 | 25.955 | 32.222 | 36.193 | pass | pass | pass |
With 8 parallel users, we still see that all 8 users get the final response at roughly the same time (the spread between max and min response time is ~4 seconds: the most "lucky" user gets a response in ~32 seconds and the most "unlucky" user in ~36 seconds). But the spread between the shortest and longest TTFT still grows. If the LLM is used in a chat-like interface with 8 users asking at the same time and prompts are around 10k words (e.g., including RAG and history), the chart implies that the most "lucky" user receives the first token after ~5 seconds and starts streaming, while the most "unlucky" user waits ~26 seconds for the first token.
You might ask: how, in this 8-user case, do they still receive the full response at almost the same time with such a significant TTFT spread? The large TTFT spread happens because TTFT depends on when each request gets its turn for prefill on the GPU — if a user is later in the queue, their first token simply starts later. However, once decoding begins, vLLM time-slices the GPU fairly across all active requests, so they generate tokens at roughly the same pace. The key insight is that the “unlucky” user doesn’t fall behind by the full TTFT gap, because while they were waiting, the others were not running at full speed either — they were also sharing the same GPU and spending time on prefill and other users’ work. That’s why TTFT spread can be large, but total response-time spread stays relatively small.
To illustrate the difference, let’s set --max-num-seqs to 1, which effectively disables parallelism:

| virtual_users | min_ttft_s | max_ttft_s | min_response_time_s | max_response_time_s | Json_test_ok | secret_test_ok | nonempty_test_ok |
|---|---|---|---|---|---|---|---|
| 1 | 5.153 | 5.153 | 9.102 | 9.102 | pass | pass | pass |
| 2 | 3.387 | 10.664 | 7.379 | 13.281 | pass | pass | pass |
| 3 | 3.494 | 17.423 | 7.309 | 20.060 | pass | pass | pass |
| 4 | 3.512 | 27.533 | 8.696 | 32.178 | pass | pass | pass |
| 5 | 3.550 | 31.566 | 7.217 | 34.308 | pass | pass | pass |
| 6 | 3.550 | 36.732 | 6.366 | 41.123 | pass | pass | pass |
| 7 | 2.707 | 41.665 | 5.849 | 45.934 | pass | pass | pass |
| 8 | 3.686 | 53.682 | 6.377 | 57.237 | pass | pass | pass |
Here we see huge spreads even in total response time: with 8 users, the most "lucky" user receives the full response in ~6 seconds, and the most "unlucky" in ~57 seconds.
Further ways to tune the model
If you wish to reduce TTFT for all requests, you can experiment with raising --max-num-batched-tokens, e.g. to 4k or 8k. With larger prefill chunks, the vLLM scheduler needs fewer forward passes per prompt, which can reduce overall TTFT.
The drawback is that you increase the risk of VRAM OOM. Generally, VRAM OOM is less painful than RAM OOM, and Docker will gracefully auto-restart vLLM (at least it worked on my machine), but any OOM still causes downtime for a minute or so, which is definitely not good.
If you are going to change any parameters like --max-num-batched-tokens from default values, you should stress-test for OOM at the highest context and high concurrency. If you hit OOM after increasing MNBT (which is very probable), you’ll need to sacrifice something, e.g. decrease --max-model-len to 57344 tokens (56 Ki tokens = 56 × 1024).
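A minimal sketch of such an OOM stress test: fire several near-maximum-context requests at the same time and watch whether the engine survives (the endpoint, model name, and prompt construction are illustrative; in practice you would also watch nvidia-smi and the container logs):

import random
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def fire(_: int):
    # Random prefix per request so prefix caching cannot share KV blocks;
    # the body is roughly as long as the largest context you intend to allow.
    prompt = f"run-{random.random()}\n" + "word " * 100_000
    return client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1536,
    )

# If VRAM headroom is insufficient, these calls fail and/or the container restarts.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(fire, range(8)))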
gpt-oss-20B on L40S 48G
Here, in the launch log, vLLM reports that it can run up to ~16 concurrent requests at the full 131k sequence length:
Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
Maximum concurrency for 131,072 tokens per request: 15.97x
So it is still the "Marlin" kernel, but the promised concurrency at 131k is ~16x; let's see what that means in reality.
Here is the result with the same setup as before:

| virtual_users | min_ttft_s | max_ttft_s | min_response_time_s | max_response_time_s | Json_test_ok | secret_test_ok | nonempty_test_ok |
|---|---|---|---|---|---|---|---|
| 1 | 1.169 | 1.169 | 1.916 | 1.916 | pass | pass | pass |
| 2 | 1.448 | 1.912 | 4.242 | 4.424 | pass | pass | pass |
| 3 | 3.112 | 4.435 | 6.415 | 6.558 | pass | pass | pass |
| 4 | 1.512 | 3.654 | 5.369 | 6.214 | pass | pass | pass |
| 5 | 1.415 | 4.550 | 6.780 | 7.647 | pass | pass | pass |
| 6 | 1.424 | 5.466 | 7.273 | 9.793 | pass | pass | pass |
| 7 | 1.439 | 6.365 | 8.698 | 9.480 | pass | pass | pass |
| 8 | 1.526 | 7.321 | 9.123 | 10.611 | pass | pass | pass |
gpt-oss-20B on H100 80G
Using Triton backend
Maximum concurrency for 131,072 tokens per request: 31.52x
This is the first setup in our tests where the model runs via the Triton backend with native FP4 support.

| virtual_users | min_ttft_s | max_ttft_s | min_response_time_s | max_response_time_s | Json_test_ok | secret_test_ok | nonempty_test_ok |
|---|---|---|---|---|---|---|---|
| 1 | 0.798 | 0.798 | 1.621 | 1.621 | pass | pass | pass |
| 2 | 0.896 | 0.898 | 1.970 | 2.352 | pass | pass | pass |
| 3 | 1.657 | 1.690 | 3.160 | 3.503 | pass | pass | pass |
| 4 | 1.235 | 1.740 | 2.933 | 3.479 | pass | pass | pass |
| 5 | 1.289 | 2.365 | 3.483 | 4.512 | pass | pass | pass |
| 6 | 1.324 | 2.836 | 4.290 | 4.948 | pass | pass | pass |
| 7 | 1.385 | 3.411 | 5.031 | 5.866 | pass | pass | pass |
| 8 | 1.276 | 3.626 | 5.071 | 6.241 | pass | pass | pass |
How do managed LLMs handle parallelism?
I tried to run the same test on one of these services.

We see that the third-party hosting keeps latency at a good level, and it does not significantly degrade as the number of users increases. When you use LLM-as-a-Service endpoints, you can generally run as many sequences (API requests) in parallel as you need and expect them to be generated in parallel. This is possible because the external provider has a horizontally scalable (from your perspective, effectively infinite) pool of GPUs and spreads API requests from many clients across multiple inference endpoints for maximum efficiency.
Structured output failure due to degenerate repetition
Since we had one retried case on H100, we stress-tested the model after the main experiments by running at the peak points for half an hour.
Very rarely, we saw retries caused by broken JSON:
{
"main_theme": "Moby-Dick",
"key_characters":[ "Ishmael","Queequeg","Captain Ahab","Captain Peleg","Captain Bildad","Father Mapple"],
"causal_events":[ "Ishmael leaves New Bedford for Nantucket","Ishmael meets Queequeg at the Spouter‑Inn","Ishmael and Queequeg sign aboard the Pequod","Captain Ahab’s leg is lost to a whale","The Pequod sets sail","The crew encounters the great white whale","The crew is pursued by the whale","The crew is rescued by the crew of the ship “The Rachel”","The crew is pursued by the whale again","The crew is rescued by the crew of the ship “The Delight”","The crew is pursued by the whale again"....
This issue was always successfully retried on the next attempt.
Its frequency is quite low: we saw from 1 to 5 occurrences during 30 minutes, but this frequency may depend, for example, on the entropy and lengths of incoming prompts.
This issue happened at various points across the experiments, and it may be a common pattern for the gpt-oss-20b model.
Moreover, this kind of issue happens with any LLM. Gemini, for example, quite often falls into such repetition even in the public UI, while OpenAI may hide it via retries (repetitions are quite easy to detect).
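Since repetitions really are easy to detect, here is a hedged sketch of a check you could run before accepting a response (the n-gram size and threshold are illustrative):

def looks_degenerate(text: str, ngram_words: int = 8, max_repeats: int = 3) -> bool:
    # Flags responses where the same word n-gram occurs many times, which is the
    # typical signature of degenerate repetition; retry the request if it fires.
    words = text.split()
    counts: dict[tuple, int] = {}
    for i in range(len(words) - ngram_words + 1):
        gram = tuple(words[i:i + ngram_words])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] > max_repeats:
            return True
    return False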
Attempts to run gpt-oss-120B on H100 80G
In theory, 80G might be enough to run even the 120B model. Its weights in MXFP4 consume at least ~60GB by definition (in fact I've seen 64.38 GiB in the logs, including some CUDA overhead), so what's left for KV cache and activations is much less than 14G, which is very tight given that this model has more layers.
Default parameters for this model on Hugging Face / vLLM are definitely not optimized for 80G VRAM:
| Parameter | Default |
|---|---|
| MML (max-model-len) | 131072 |
| MNBT (max-num-batched-tokens) | 8192 |
| MMU (gpu-memory-utilization) | 0.9 |
| MNS (max-num-seqs) | 256 |
With these settings the model does not load at all:
Model loading took 64.38 GiB memory
Available KV cache memory: -0.62 GiB
Setting MNBT to 2048 still did not help. So with the same config as we used for 20B on 24G VRAM, we can’t run 120B on 80G VRAM.
The reason 120B cannot start on a single H100 80GB with the same 131k context settings that worked for 20B is that KV cache memory scales with the number of layers and the hidden size of the model, not just with available VRAM. While 20B left ~10GB for KV cache on a 24GB GPU, the 120B model has roughly 4–5× larger per-token KV footprint due to significantly more layers and a much larger hidden dimension. So even though ~16GB appears “available” after loading weights on H100, each token now consumes far more KV memory, making a 131k context mathematically infeasible on a single GPU.
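The budget arithmetic behind the log above, as a quick sketch (numbers taken from the logs and the defaults; the activation/workspace share is inferred from the negative KV figure rather than logged explicitly):

# Rough VRAM budget arithmetic for the failed 120B load on a single H100 80G.
total_vram_gib = 80.0
gpu_memory_utilization = 0.9            # vLLM default
weights_gib = 64.38                     # "Model loading took 64.38 GiB memory"

budget_gib = total_vram_gib * gpu_memory_utilization     # 72.0 GiB vLLM may touch
left_after_weights_gib = budget_gib - weights_gib        # ~7.6 GiB
# The log then reports "Available KV cache memory: -0.62 GiB": activations,
# CUDA graphs and other runtime buffers need more than those ~7.6 GiB,
# so nothing at all is left for the KV cache at this context length.
print(f"budget={budget_gib:.1f} GiB, left after weights={left_after_weights_gib:.2f} GiB")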
Reducing --max-model-len to less than half (57344) didn't help either. You can probably squeeze something out of this card by reducing the values even further, but it will be very limited and needs careful OOM testing. Ideally you should use 2× H100 80G and set --tensor-parallel-size 2.
Conclusion
At a typical “real” chat/RAG workload size (≈10k input tokens/words: history + retrieved context + a short user question like “Given the attached policy and our last 5 messages, what are the top 3 risks and the next steps?”), the single-user latency for gpt-oss-20b differs dramatically by GPU. In our 1-user run at that prompt size, L4 24G delivered ~3.50s TTFT and ~6.93s full response time, L40S 48G ~1.17s TTFT and ~1.92s full response time, and H100 80G ~0.80s TTFT and ~1.62s full response time.
If your system is used frequently by many people (for example, an internal chat tool where several employees ask questions at the same time), the same ~10k context request can be in-flight concurrently. With 5 parallel users at this prompt size, we observed that L4 24G spreads TTFT from ~4.61s (lucky) to ~16.38s (unlucky) and full response time from ~20.99s to ~24.04s; L40S 48G keeps TTFT in ~1.42–4.55s and full response in ~6.78–7.65s; and H100 80G keeps TTFT in ~1.29–2.37s and full response in ~3.48–4.51s. In practice this matters because user-perceived latency is dominated by TTFT in streaming UIs, and concurrency is the normal case for shared tools.
From a cost-efficiency standpoint, L4 is the cheapest entry point (and works well for low concurrency), but it becomes latency-limited quickly as context grows. L40S is often the sweet spot for gpt-oss-20b: it cuts single-user latency by ~3–4× versus L4 while typically costing less than 2×, and it also provides substantially more parallelism headroom at long contexts. H100 wins on raw latency and concurrency, but unless you specifically need the lowest possible tail latencies at higher load, it is usually harder to justify purely on €/latency for a 20B-class model.
Useful links
- Optimization and Tuning of vLLM clearly describes the main tuning ideas.
- LLM Terminology explained simply (a must-read to follow this post) explains all the terms used here.
- gpt-oss-20b on Hugging Face