Self-hosted GPT: real response time, token throughput, and cost on L4, L40S and H100 for GPT-OSS-20B

We benchmarked modern open-source LLMs across several popular GPUs to measure real-world context limits, throughput, latency, and cost efficiency under varying levels of concurrency — as close as possible to real production conditions. Here we share the results.

Ivan Borshchov
CEO · Feb 20, 2026

LLMs are now an integral part of every successful business. When used correctly, they can dramatically increase technology ROI.

More and more companies of all sizes are already using LLMs across many aspects of their operations, but the key concern they raise now is: “Where does our data go?” As a result, many organizations have started considering hosting LLMs locally.

The market already offers high-quality LLM models (OpenAI GPT OSS, DeepSeek, Mistral, Qwen, and many others), but the real questions are: what are the limits, how much throughput can you achieve, and at what cost? In this post, we provide clear answers.

Shadow AI issue

A short quote from the OpenAI Privacy Policy: https://openai.com/policies/row-privacy-policy/

We collect Personal Data that you provide in the input to our Services, including your prompts and other content you upload… We may use Personal Data to provide, maintain, develop, and improve our Services… We may disclose Personal Data if required to do so by law or in response to valid legal requests by public authorities.

Public ChatGPT is not a confidential or legally privileged channel; it’s a commercial service governed by its Terms and Privacy Policy.

This drives Shadow AI: employees using unapproved tools and sending sensitive data (contracts, code, financials, PII, strategy), which increases the risk of data leakage, compliance violations, IP exposure, audit gaps, and legal discovery.

Strategy 1: Build your own tooling using LLM APIs

Using LLM APIs (e.g., OpenAI) to build internal tools is generally safer than public SaaS and helps reduce Shadow AI, but it still requires governance. API data is typically not used for training by default and may be retained for abuse monitoring (often up to ~30 days), and it can still be disclosed under legal requests or affected by policy changes.

Strategy 2: Sovereign AI self-hosting with a local hosting provider

A “local” provider (in the same country/region and legal jurisdiction as your company) under a signed DPA can reduce data exposure and improve jurisdictional control (clearer legal recourse than cross-border setups), but it does not eliminate disclosure or operational risk.

This aligns with the sovereign AI concept, which means jurisdiction-level control over AI infrastructure, data, and operational governance — ensuring models are managed within a defined legal and regulatory framework. In this context, working with a local provider can be viewed as a practical step toward jurisdictional sovereignty, even if it does not amount to full-stack national AI autonomy.

Strategy 3: Self-hosting using on-premise hardware

Running LLMs on fully owned on-premise hardware further reduces third-party exposure. Controlled physical access and isolated infrastructure maximize data locality and remove dependence on external AI or cloud providers.

Absolute zero risk does not exist, but on-premise deployment reduces third-party trust assumptions while shifting full responsibility for security, resilience, and compliance to the organization.

Hardware requirements for LLM self-hosting

Self-hosting a model on custom hardware is not technically complex for professional software dev companies like DevForth, but the important factor is understanding the limitations, which mainly depend on the hardware. So here we will try to measure them using the gpt-oss-20b model from OpenAI on different hardware and with different vLLM params.

To understand every term in what follows, I strongly recommend reading my “LLM Terminology explained simply” post.

For LLM inference, GPU VRAM is usually the main bottleneck: it must fit both the model weights and the KV cache (which grows with effective sequence length and with the number of concurrent requests). Running LLMs from CPU RAM typically makes inference 100×–1000× slower, with TTFT becoming painfully long.

For production-grade LLM hosting, the practical VRAM window usually lies between 24 GB and 80 GB. Yet memory size alone does not guarantee viability. Throughput, tensor performance, memory bandwidth, and platform support often make the decisive difference. For that reason, we consider only GPUs with demonstrated, real-world performance.

Very rough planning price ranges (they vary by region and availability): 24 GB GPUs like the RTX 3090 (350 W), RTX 4090 (450 W), or L4 (72 W, much better suited for servers) are typically €200–€1,200/mo to rent and €700–€3,500 to buy; 48 GB (L40S, 350 W) is often €1,000–€1,600/mo and €7k–€11k to buy; 80 GB (H100, 700 W SXM or 400 W PCIe) is roughly €1,500–€2,500/mo and €18k–€45k to buy.

Benchmark and self-hosted OpenAI GPT OSS software setup

We set up a self-hosted vLLM in Docker as the inference engine for our experiments.

  1. In each experiment, we take a portion of Moby-Dick (Herman Melville, plain text) and ask the model to return a summary in JSON format.
  2. We add a secret code in the first pages of the book and ask the model to find it, to ensure it is still using the full content.
  3. We always add a random number at the beginning to prevent vLLM from using prefix caching (we’re interested in the worst case).
  4. At each point, we run two checks: whether the response is good and parseable, and whether the secret code is correct. Since availability/parseability issues can generally be retried, we do up to 3 retry attempts in case of JSON errors. If the result is good on the first try, you’ll see a green point on the chart; if it was retried, you’ll see a yellow point; if all attempts fail, we stop the experiment and mark that point red in the charts (no sense in continuing).
  5. If the output is parseable but the secret code is not detected or is incorrect, you will see a red point and we stop the experiment, because this kind of sense loss can’t be fixed by retries and means model degradation (a minimal sketch of this loop follows the list).
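
Here is that sketch in TypeScript (illustrative only: the file name, the buildPrompt/runPoint helpers, and the callVllm function are assumptions made for this post, not our exact benchmark code):

import { readFileSync } from "node:fs";

declare function callVllm(prompt: string): Promise<string>; // assumed HTTP helper that calls vLLM

const SECRET_CODE = "314781-21"; // injected near the beginning of the book text
const MAX_RETRIES = 3;

// Build the prompt: a random prefix (defeats prefix caching), the secret code,
// the first N words of Moby-Dick, and the summarization instruction.
function buildPrompt(words: number): string {
  const book = readFileSync("moby-dick.txt", "utf8").split(/\s+/);
  return [
    `Session ${Math.random()}`, // random number -> no prefix-cache hit
    `Remember this secret code: ${SECRET_CODE}`,
    book.slice(0, words).join(" "),
    "Return a JSON summary of the text above, including the field secret_code.",
  ].join("\n\n");
}

// One measurement point: retry on broken JSON, stop the experiment on a wrong secret code.
async function runPoint(words: number) {
  for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
    const raw = await callVllm(buildPrompt(words));
    try {
      const parsed = JSON.parse(raw);
      if (parsed.secret_code !== SECRET_CODE) {
        return { status: "red", reason: "secret lost" }; // sense loss: not retriable
      }
      return { status: attempt === 1 ? "green" : "yellow" };
    } catch {
      // malformed JSON: retriable, fall through to the next attempt
    }
  }
  return { status: "red", reason: "unparseable after retries" };
}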

We always run all requests to vLLM with the same parameters:

reasoning_afford: "medium",  // reasoning affordance
max_tokens: 1536,  // max_tokens
temperature: 0.15, 
top_p: 0.9 

Here is an example JSON response we received:

{
  "status": 200,
  "streamedResponse": {
    "main_theme": "The obsessive quest for the white whale and its destructive consequences",
    "key_characters": [
      "Ishmael",
      "Captain Ahab",
      "Starbuck",
      "Queequeg",
      "Moby Dick"
    ],
    "causal_events": [
      "Ahab's vow to kill Moby Dick",
      "The crew's moral conflict",
      "The final battle with the whale",
      "The ship's destruction"
    ],
    "hidden_implication": "The story critiques the hubris of humanity in confronting nature",
    "secret_code": "314781-21"
  }
}

Structured output mode

We executed all our experiments in Structured Output mode. In short, this is an API-level setting that asks vLLM to respect an output JSON schema at decoding time.

Modern structured output modes (like those in vLLM or OpenAI-compatible APIs) don’t just say to the model “please produce correct JSON” — they actively constrain generation at the token level. Under the hood, the model still produces logits (raw probabilities for every possible next token), but the runtime system masks out any tokens that would break the expected structure (for example, invalid JSON syntax at a given position). Sampling with temperature or top-p then happens only over the allowed tokens. This makes invalid outputs effectively impossible rather than just unlikely.

In production, this is critical: it dramatically reduces malformed responses, eliminates most retry logic, and significantly improves consistency and adherence to the required schema. If you naively prompt a model to “return valid JSON,” it will work most of the time — but occasionally (say 1 in 100 responses) you’ll still get broken JSON and need to retry. Structured decoding avoids that class of failures, which is why it’s essential for reliable, production-grade systems.

And vLLM supports it!
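
To make this concrete, here is a minimal sketch of such a request against the vLLM OpenAI-compatible endpoint we start below (the URL and model name match our setup and the schema mirrors the example JSON above; exact response_format support depends on your vLLM version, so treat this as an illustration rather than a drop-in snippet):

// Ask vLLM's OpenAI-compatible server for schema-constrained JSON.
const summarySchema = {
  type: "object",
  properties: {
    main_theme: { type: "string" },
    key_characters: { type: "array", items: { type: "string" } },
    causal_events: { type: "array", items: { type: "string" } },
    hidden_implication: { type: "string" },
    secret_code: { type: "string" },
  },
  required: ["main_theme", "key_characters", "causal_events", "hidden_implication", "secret_code"],
};

const response = await fetch("http://localhost:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "openai/gpt-oss-20b",
    messages: [{ role: "user", content: "Summarize the attached text as JSON..." }],
    max_tokens: 1536,
    temperature: 0.15,
    top_p: 0.9,
    // Token-level constrained decoding: only tokens that keep the output
    // valid against this schema are allowed at each step.
    response_format: {
      type: "json_schema",
      json_schema: { name: "book_summary", schema: summarySchema },
    },
  }),
});
const data = await response.json();
console.log(JSON.parse(data.choices[0].message.content));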

How we started vLLM

We used a plain Docker container deployed via Terraform to a GPU server:

docker run --rm \
    --name gpt-20b \
    --gpus all \
    --ipc=host \
    -e HUGGING_FACE_HUB_TOKEN=${var.hf_token} \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model openai/gpt-oss-20b \
    --dtype bfloat16 \
    --kv-cache-dtype fp8

At the time of this research, the latest version of vLLM is 0.15.1.

Besides the explicit parameters, there are many implicit ones under the hood. Several important parameters that may be worth tuning are below. We intentionally left them at default values, since vLLM defaults are a good starting point for many tasks.

For our clients’ setups, starting from recent versions of vLLM, we use defaults as they are battle-tested and OOM-safe.

| Docker command param | Default | Description |
|---|---|---|
| --gpu-memory-utilization | 0.9 | Max fraction of VRAM vLLM can use for weights, KV cache, and activations; recommended values are 0.9–0.95 if you run one vLLM per GPU. The remaining ~5–10% is needed for overhead; if you don’t leave enough headroom, you will hit OOM (out-of-memory). |
| --max-num-batched-tokens | 2048 | Upper bound on the total number of tokens processed in a single scheduler step across all active sequences. If exceeded, the scheduler splits the workload across multiple forward passes. Trade-off: with lower values, TTFT grows because you need more forward passes, but decoding speed can increase. |
| --max-model-len | 131072 | Maximum allowed ESL. Must not exceed the model’s architectural context window, which is the default (128 Ki tokens for gpt-oss-20b). |
| --max-num-seqs | 256 | How many parallel active sequences vLLM can handle. If it is set to 1 and two requests arrive close in time, the second request is queued and the user waits for the full prefill and full decode of the first request. Increasing this does not generally cause OOM, but actual parallelism may be limited by VRAM capacity. |

Max effective sequence length benchmark (input + generated tokens)

On every sequence run, the model takes the whole prompt (including, for example, system messages, history messages, summary, and RAG context) and then generates output from it. For reasoning models, it first generates some amount of “thoughts” in the output and only then generates the real response.

The sum of the actual input and output tokens is called the effective sequence length. The actual effective sequence length can’t be larger than context_window. Example context windows:

  • Proprietary GPT-5 mini - 400k tokens
  • Open gpt-oss-20B - ~131k tokens (128 Ki tokens)
  • Open gpt-oss-120B - ~131k tokens (128 Ki tokens)
  • Open GPT-J which we self-hosted in 2022 (GPT 3 Davinci analog) - ~2k tokens (2 Ki tokens)

So here’s what we do in the experiment:

  1. In each experiment, we take the first N words from Moby-Dick and ask the model to return a summary. We slice the text by words, but on the charts you will see actual token counts.
  2. We increase the prompt in large steps (e.g., 5k words per point) and measure how fast it generates results and whether the result is correct.
  3. We record the real maximum sequence length and the time to first token (TTFT).

In this test, we never run parallel requests: we wait until the first one finishes before starting the next (single-user emulation).

Ability to handle parallelism benchmark (multiple users wait time)

In another test, we take a fixed point (10k input words—a fairly typical RAG/agent workload) and simulate requests from parallel users.

This is like several users using the model at the same time, and we observe how the average per-user throughput and response time react. In the previous example we measured decoding throughput (the speed at which tokens are generated after the first token is received).

In this test it’s more interesting to look at min/max TTFT and min response time (for the most “lucky” user) versus max response time (for the most “unlucky” user). Difference in time between the most “lucky” and the most “unlucky” user is a good proxy for how well the model can handle parallel requests without letting some users wait too long.
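
A rough sketch of how such a parallel-user run can be driven (illustrative; buildPrompt and callVllm are the same assumed helpers as in the earlier sketch, and measuring TTFT additionally requires streaming and timing the first received chunk, which is omitted here for brevity):

// Fire N concurrent "virtual users" with the same ~10k-word prompt and
// report the min/max total response time across them.
async function runConcurrent(users: number) {
  const prompt = buildPrompt(10_000);
  const timings = await Promise.all(
    Array.from({ length: users }, async () => {
      const start = performance.now();
      await callVllm(prompt);
      return (performance.now() - start) / 1000; // seconds
    }),
  );
  return {
    users,
    min_response_time_s: Math.min(...timings),
    max_response_time_s: Math.max(...timings),
  };
}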

L4 24G

Typical rental price for a server with 1× L4 24G, assuming mid-range stable on-demand pricing: €520–€1,175/mo; the higher end is closer to enterprise-grade hosters.

If you want to have it on premises

An NVIDIA L4 24GB Tensor Core GPU (PCIe Gen4 x16, Dell part NG3PY) can be found on eBay for around 2.5k. You can install it in a GPU-capable 2U rack server with a full-height PCIe slot, such as a Dell PowerEdge R730 (roughly 500–1,000), and connect it to your network.

Here is a sequence-length benchmark result:

Average decoding speed and time to first token for gpt-oss-20b hosted on L4 24 GiB
| input_tokens | thinking_tokens | output_tokens | effective_sequence_length | time_to_first_token (s) | decoding tokens/s | thinking_time (s) | json_test_ok | secret_test_ok | nonempty_test_ok |
|---|---|---|---|---|---|---|---|---|---|
| 7591 | 144 | 63 | 7912 | 2.926 | 62.387 | 2.251 | pass | pass | pass |
| 14203 | 205 | 67 | 14537 | 3.160 | 60.728 | 3.227 | pass | pass | pass |
| 20765 | 105 | 90 | 20993 | 5.228 | 60.297 | 1.749 | pass | pass | pass |
| 27431 | 245 | 81 | 27740 | 7.800 | 43.723 | 5.918 | pass | pass | pass |
| 33884 | 159 | 52 | 34043 | 10.718 | 57.761 | 2.810 | pass | pass | pass |
| 40717 | 107 | 100 | 40810 | 16.042 | 56.480 | 1.910 | pass | pass | pass |
| 47511 | 67 | 99 | 47513 | 18.169 | 34.185 | 1.187 | pass | pass | pass |
| 54483 | 71 | 86 | 54420 | 22.886 | 55.011 | 1.265 | pass | pass | pass |
| 61357 | 62 | 93 | 61266 | 29.654 | 54.082 | 1.103 | pass | pass | pass |
| 68018 | 97 | 140 | 67941 | 36.872 | 52.843 | 1.894 | pass | pass | pass |
| 74886 | 107 | 92 | 74717 | 40.947 | 52.013 | 1.918 | pass | pass | pass |
| 81677 | 85 | 187 | 81546 | 49.100 | 50.841 | 1.731 | pass | pass | pass |
| 88556 | 48 | 147 | 88273 | 55.950 | 50.284 | 1.007 | pass | pass | pass |
| 95944 | 104 | 107 | 95611 | 63.822 | 49.415 | 1.991 | pass | pass | pass |
| 102616 | 107 | 99 | 102194 | 71.371 | 48.677 | 2.153 | pass | pass | pass |
| 109060 | 58 | 167 | 108596 | 80.934 | 48.056 | 1.326 | pass | pass | pass |
| 115676 | 111 | 95 | 115147 | 89.445 | 47.455 | 2.119 | pass | pass | pass |
| 122257 | 83 | 115 | 121669 | 98.105 | 46.621 | 1.719 | pass | pass | pass |
| 128819 | 59 | 119 | 128175 | 109.318 | 46.756 | 1.324 | pass | pass | pass |
| 135614 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | fail | fail | fail |

Time to first token grows non-linearly and reaches ~109 seconds for a ~130k-token prompt. Decoding speed falls from ~60 to ~45 tokens/second.

VRAM usage in this experiment is stable, reaching 21.725 GiB out of 22.494 GiB.

The two decoding-speed drops are most likely caused by internal attention-kernel switching inside the inference engine. As the input sequence length crosses certain thresholds (around ~30k and ~50k tokens), the runtime switches to different CUDA/FlashAttention kernel variants or GEMM configurations optimized for larger context sizes. These transitions can temporarily reduce SM occupancy or change memory access patterns, resulting in a brief throughput dip before stabilizing again. Since this happens with a single request on an otherwise idle GPU and the performance trend recovers immediately afterward, it strongly suggests a kernel-boundary effect rather than memory pressure or resource contention.

What about the ability to handle parallel users? On model load we saw the following in the vLLM logs:

Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.

Maximum concurrency for 131,072 tokens per request: 2.78x

Let’s run a benchmark:

Model ability to handle parallel users on L4 24G
| virtual_users | min_ttft_s | max_ttft_s | min_response_time_s | max_response_time_s | json_test_ok | secret_test_ok | nonempty_test_ok |
|---|---|---|---|---|---|---|---|
| 1 | 3.502 | 3.502 | 6.932 | 6.932 | pass | pass | pass |
| 2 | 3.834 | 5.981 | 11.118 | 12.063 | pass | pass | pass |
| 3 | 4.553 | 9.957 | 15.123 | 16.375 | pass | pass | pass |
| 4 | 4.549 | 13.267 | 17.619 | 21.218 | pass | pass | pass |
| 5 | 4.609 | 16.378 | 20.988 | 24.036 | pass | pass | pass |
| 6 | 4.630 | 19.597 | 23.326 | 29.127 | pass | pass | pass |
| 7 | 4.675 | 22.669 | 27.487 | 31.389 | pass | pass | pass |
| 8 | 4.650 | 25.955 | 32.222 | 36.193 | pass | pass | pass |

With 8 parallel users, we still see that all 8 users get the final response at roughly the same time (the spread between max and min response time is ~4 seconds: the most “lucky” user gets a response in ~32 seconds and the most “unlucky” user in ~36 seconds). But the spread between the shortest and longest TTFT keeps growing. If the LLM is used in a chat-like interface with 8 users asking at the same time and prompts around 10k words (e.g., including RAG context and history), the chart implies that the most “lucky” user receives the first token after ~4.7 seconds and starts streaming, while the most “unlucky” user waits ~26 seconds for the first token.

You might ask: how, in this 8-user case, do they still receive the full response at almost the same time with such a significant TTFT spread? The large TTFT spread happens because TTFT depends on when each request gets its turn for prefill on the GPU — if a user is later in the queue, their first token simply starts later. However, once decoding begins, vLLM time-slices the GPU fairly across all active requests, so they generate tokens at roughly the same pace. The key insight is that the “unlucky” user doesn’t fall behind by the full TTFT gap, because while they were waiting, the others were not running at full speed either — they were also sharing the same GPU and spending time on prefill and other users’ work. That’s why TTFT spread can be large, but total response-time spread stays relatively small.

To illustrate the difference, let’s set --max-num-seqs to 1, which effectively disables parallelism:

Model ability to handle parallel users on L4 24G with --max-num-seqs 1
| virtual_users | min_ttft_s | max_ttft_s | min_response_time_s | max_response_time_s | json_test_ok | secret_test_ok | nonempty_test_ok |
|---|---|---|---|---|---|---|---|
| 1 | 5.153 | 5.153 | 9.102 | 9.102 | pass | pass | pass |
| 2 | 3.387 | 10.664 | 7.379 | 13.281 | pass | pass | pass |
| 3 | 3.494 | 17.423 | 7.309 | 20.060 | pass | pass | pass |
| 4 | 3.512 | 27.533 | 8.696 | 32.178 | pass | pass | pass |
| 5 | 3.550 | 31.566 | 7.217 | 34.308 | pass | pass | pass |
| 6 | 3.550 | 36.732 | 6.366 | 41.123 | pass | pass | pass |
| 7 | 2.707 | 41.665 | 5.849 | 45.934 | pass | pass | pass |
| 8 | 3.686 | 53.682 | 6.377 | 57.237 | pass | pass | pass |

Here we see huge spreads even in total response time: with 8 users, the most “lucky” user receives the full response within ~6 seconds, while the most “unlucky” one waits almost a minute (~57 seconds).

Further ways to tune the model

If you wish to reduce TTFT for all requests, you can experiment by raising --max-num-batched-tokens, e.g. to 4k or 8k. By doing this, the vLLM scheduler will handle larger prefill chunks, which reduces the total number of forward passes and can reduce overall TTFT. The drawback is that you increase the risk of VRAM OOM. Generally, VRAM OOM is less painful than RAM OOM, and Docker will gracefully auto-restart vLLM (at least it worked on my machine), but any OOM still causes downtime for a minute or so, which is definitely not good.

If you are going to change any parameters like --max-num-batched-tokens from default values, you should stress-test for OOM at the highest context and high concurrency. If you hit OOM after increasing MNBT (which is very probable), you’ll need to sacrifice something, e.g. decrease --max-model-len to 57344 tokens (56 Ki tokens = 56 × 1024).
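
For example, such an experiment could look like the following variation of our command (a sketch of the trade-off described above, not a recommended production config; stress-test it for OOM before relying on it):

docker run --rm \
    --name gpt-20b \
    --gpus all \
    --ipc=host \
    -e HUGGING_FACE_HUB_TOKEN=${var.hf_token} \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model openai/gpt-oss-20b \
    --dtype bfloat16 \
    --kv-cache-dtype fp8 \
    --max-num-batched-tokens 8192 \
    --max-model-len 57344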

Does the model think more with more content?

In the same experiment we also collected the reasoning-token volumes (all experiments were executed with a fixed reasoning_effort=medium):

Reasoning tokens and time depending on ESL

Conclusion: the number of thinking tokens does not depend on input size; on average the model “thinks” about the same amount (with some variance) for a 5k-word input and a 40k-token input alike.

L40S 48G

Another very popular GPU on the market is the L40S. The average rental price is around €1,050–€1,400/mo; a new one on eBay might cost around $9,249.

VRAM consumption was 42.112 GiB out of 44.988 GiB and stayed stable during the full experiment session.

Average decoding speed and time to first token for gpt-oss-20b hosted on L40S-1-48G
| input_tokens | thinking_tokens | output_tokens | effective_sequence_length | time_to_first_token (s) | decoding tokens/s | thinking_time (s) | json_test_ok | secret_test_ok | nonempty_test_ok |
|---|---|---|---|---|---|---|---|---|---|
| 7589 | 76 | 64 | 7843 | 2.110 | 160.183 | 0.436 | pass | pass | pass |
| 14204 | 70 | 101 | 14437 | 1.048 | 156.022 | 0.452 | pass | pass | pass |
| 20766 | 136 | 105 | 21041 | 3.262 | 151.477 | 0.882 | pass | pass | pass |
| 27431 | 113 | 78 | 27604 | 2.414 | 148.986 | 0.690 | pass | pass | pass |
| 33883 | 120 | 82 | 34034 | 3.243 | 145.848 | 0.794 | pass | pass | pass |
| 40716 | 142 | 76 | 40820 | 4.237 | 142.857 | 0.920 | pass | pass | pass |
| 47511 | 76 | 105 | 47528 | 5.408 | 140.637 | 0.579 | pass | pass | pass |
| 54480 | 86 | 64 | 54410 | 7.690 | 132.159 | 0.650 | pass | pass | pass |
| 61355 | 191 | 77 | 61377 | 8.249 | 121.985 | 1.466 | pass | pass | pass |
| 68020 | 79 | 165 | 67950 | 11.399 | 123.732 | 0.812 | pass | pass | pass |
| 74885 | 178 | 52 | 74747 | 11.489 | 105.023 | 1.708 | pass | pass | pass |
| 81677 | 125 | 128 | 81527 | 14.947 | 105.285 | 1.322 | pass | pass | pass |
| 88557 | 125 | 117 | 88321 | 15.360 | 100.666 | 1.431 | pass | pass | pass |
| 95943 | 83 | 310 | 95792 | 19.289 | 109.961 | 1.050 | pass | pass | pass |
| 102615 | 92 | 100 | 102179 | 21.448 | 94.955 | 1.151 | pass | pass | pass |
| 109060 | 115 | 96 | 108582 | 23.942 | 88.470 | 1.483 | pass | pass | pass |
| 115679 | 132 | 121 | 115197 | 26.399 | 82.761 | 2.045 | pass | pass | pass |
| 122258 | 121 | 149 | 121742 | 28.948 | 81.008 | 2.128 | pass | pass | pass |
| 128817 | 51 | 138 | 128184 | 31.576 | 90.997 | 0.904 | pass | pass | pass |
| 135617 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | fail | fail | fail |

Here, in the launch log, vLLM reports that we can run up to 16 concurrent requests at the full 131k sequence length:

Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
Maximum concurrency for 131,072 tokens per request: 15.97x

So it is still the “Marlin” kernel, but the promised concurrency at 131k is ~16×; let’s see what that means in reality.

Here is the result of the same parallel-user scenario as before:

Model ability to handle parallel users on L40S 48G
| virtual_users | min_ttft_s | max_ttft_s | min_response_time_s | max_response_time_s | json_test_ok | secret_test_ok | nonempty_test_ok |
|---|---|---|---|---|---|---|---|
| 1 | 1.169 | 1.169 | 1.916 | 1.916 | pass | pass | pass |
| 2 | 1.448 | 1.912 | 4.242 | 4.424 | pass | pass | pass |
| 3 | 3.112 | 4.435 | 6.415 | 6.558 | pass | pass | pass |
| 4 | 1.512 | 3.654 | 5.369 | 6.214 | pass | pass | pass |
| 5 | 1.415 | 4.550 | 6.780 | 7.647 | pass | pass | pass |
| 6 | 1.424 | 5.466 | 7.273 | 9.793 | pass | pass | pass |
| 7 | 1.439 | 6.365 | 8.698 | 9.480 | pass | pass | pass |
| 8 | 1.526 | 7.321 | 9.123 | 10.611 | pass | pass | pass |

H100 80G

Typical rental price for a server with 1× H100 80G (varies a lot by region, PCIe vs SXM, and commitment): ~€1,500–€9,000/mo. Buying one is usually ~€18k–€45k depending on form factor and market.

Across the whole test run we saw VRAM usage up to 74.208 GiB out of 79.647 GiB, which means that with one sequence at a time the VRAM was not fully occupied.

Average decoding speed and time to first token for gpt-oss-20b hosted on H100-1-80G
| input_tokens | thinking_tokens | output_tokens | effective_sequence_length | time_to_first_token (s) | decoding tokens/s | thinking_time (s) | json_test_ok | secret_test_ok | nonempty_test_ok |
|---|---|---|---|---|---|---|---|---|---|
| 7590 | 121 | 62 | 7887 | 2.542 | 228.180 | 0.491 | pass | pass | pass |
| 14202 | 146 | 70 | 14480 | 1.784 | 223.602 | 0.601 | pass | pass | pass |
| 20767 | 72 | 72 | 20945 | 1.799 | 223.256 | 0.300 | pass | pass | pass |
| 27433 | 142 | 54 | 27612 | 1.418 | 219.731 | 0.604 | pass | pass | pass |
| 33885 | 110 | 52 | 33996 | 1.378 | 217.450 | 0.462 | pass | pass | pass |
| 40714 | 48 | 115 | 40763 | 1.727 | 245.482 | 0.145 | pass | pass | pass |
| 47509 | 107 | 184 | 47636 | 2.080 | 192.079 | 0.683 | pass | pass | pass |
| 54482 | 92 | 63 | 54417 | 3.308 | 171.460 | 0.589 | pass | pass | pass |
| 61355 | 106 | 86 | 61301 | 2.919 | 160.401 | 0.742 | retried | retried | retried |
| 68019 | 151 | 86 | 67942 | 4.344 | 146.296 | 1.152 | pass | pass | pass |
| 74887 | 102 | 69 | 74690 | 4.623 | 142.381 | 0.815 | pass | pass | pass |
| 81678 | 91 | 157 | 81523 | 5.058 | 150.303 | 0.863 | pass | pass | pass |
| 88554 | 68 | 91 | 88235 | 6.073 | 141.964 | 0.647 | pass | pass | pass |
| 95943 | 78 | 133 | 95611 | 6.857 | 138.179 | 0.858 | pass | pass | pass |
| 102615 | 66 | 114 | 102167 | 8.650 | 131.965 | 0.798 | pass | pass | pass |
| 109062 | 64 | 121 | 108558 | 7.991 | 131.392 | 0.782 | pass | pass | pass |
| 115678 | 46 | 108 | 115097 | 9.199 | 133.565 | 0.599 | pass | pass | pass |
| 122257 | 36 | 434 | 121941 | 9.913 | 161.512 | 0.641 | pass | pass | pass |
| 128817 | 106 | 128 | 128229 | 10.754 | 107.339 | 1.499 | pass | pass | pass |
| 135615 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | fail | fail | fail |

At the ~60k point we saw the first failure, which then succeeded on the first retry. The issue was that the model started to repeat itself:

"{"main_theme":"The secret code is 314781-21","key_characters":["Ishmael","Queequeg","Captain Ahab","Captain Peleg","Captain Bildad"],\
"causal_events":[ "Ishmael leaves New Bedford","Queequeg joins him","They sign aboard the Pequod","Captain Ahab is introduced","The Pequod sets sail","The crew faces the first whale encounter","The crew confronts the great whale","The crew is pursued by the whale","The crew is rescued","The crew returns to Nantucket","The crew faces the final whale","The crew is rescued again","The crew returns home","The crew faces the final whale again","The crew is rescued again","The crew returns home","The crew faces the final whale again","The crew is rescued again","The crew returns home","The crew faces the final whale again"... and so on up to max_tokens

See details about this issue at the end of the post.

Now let’s see how it handles parallel users. In startup logs we see:

Using Triton backend
Maximum concurrency for 131,072 tokens per request: 31.52x

This is the first of our setups where vLLM uses the Triton backend with native FP4 support (instead of the Marlin weight-only fallback).

GPT 20B ability to handle parallel users on H100 80G
| virtual_users | min_ttft_s | max_ttft_s | min_response_time_s | max_response_time_s | json_test_ok | secret_test_ok | nonempty_test_ok |
|---|---|---|---|---|---|---|---|
| 1 | 0.798 | 0.798 | 1.621 | 1.621 | pass | pass | pass |
| 2 | 0.896 | 0.898 | 1.970 | 2.352 | pass | pass | pass |
| 3 | 1.657 | 1.690 | 3.160 | 3.503 | pass | pass | pass |
| 4 | 1.235 | 1.740 | 2.933 | 3.479 | pass | pass | pass |
| 5 | 1.289 | 2.365 | 3.483 | 4.512 | pass | pass | pass |
| 6 | 1.324 | 2.836 | 4.290 | 4.948 | pass | pass | pass |
| 7 | 1.385 | 3.411 | 5.031 | 5.866 | pass | pass | pass |
| 8 | 1.276 | 3.626 | 5.071 | 6.241 | pass | pass | pass |

Managed gpt-oss-20b hosting

Some hosting services nowadays provide inference-as-a-service. For example, they may host gpt-oss-20b for you, so you pay for used tokens instead of paying for a whole server running 24/7. This can still be cheaper than many official vendor LLMs, but again the risk of data exposure is high if you don't sign dedicated DPAs.

Generally they don’t disclose what hardware they use under the hood or what vLLM parameters they run.

Average decoding speed and time to first token on 3rd party managed gpt-oss-20b hosting
| input_tokens | thinking_tokens | output_tokens | effective_sequence_length | time_to_first_token (s) | decoding tokens/s | thinking_time (s) | json_test_ok | secret_test_ok |
|---|---|---|---|---|---|---|---|---|
| 7585 | 134 | 77 | 7913 | 0.919 | 88.888 | 1.538 | pass | pass |
| 14199 | 138 | 103 | 14506 | 1.289 | 84.099 | 1.599 | pass | pass |
| 20764 | 82 | 88 | 20972 | 1.985 | 74.951 | 1.174 | pass | pass |
| 27429 | 102 | 93 | 27610 | 2.804 | 59.921 | 1.800 | pass | pass |
| 33881 | 150 | 87 | 34069 | 3.747 | 54.639 | 3.151 | pass | pass |
| 40712 | 79 | 90 | 40771 | 5.011 | 49.820 | 1.841 | pass | pass |
| 47507 | 87 | 88 | 47522 | 6.175 | 45.707 | 2.200 | pass | pass |
| 54475 | 182 | 87 | 54527 | 7.829 | 43.253 | 4.825 | pass | pass |
| 61351 | 91 | 84 | 61284 | 9.472 | 36.097 | 3.006 | pass | pass |
| 68016 | 92 | 90 | 67887 | 11.330 | 38.478 | 2.824 | pass | pass |
| 74882 | 46 | 96 | 74660 | 13.323 | 39.453 | 1.876 | pass | pass |
| 81673 | 79 | 88 | 81441 | 15.438 | 37.809 | 2.651 | pass | pass |
| 88551 | 87 | 125 | 88288 | 17.549 | 26.334 | 4.732 | pass | pass |
| 95939 | 97 | 132 | 95628 | 20.319 | 27.321 | 4.479 | pass | pass |
| 102614 | 90 | 101 | 102180 | 23.355 | 22.869 | 5.336 | pass | pass |
| 109059 | 55 | 160 | 108587 | 25.937 | 21.237 | 4.751 | pass | pass |
| 115674 | 106 | 1430 | 116479 | 0.000 | 41.197 | 6.034 | fail | fail |

What we can see from this chart:

  • For short sequences, the best decoding throughput was better than on L4 (~80 tokens/s), but worse than on L40S and H100 in our setup.
  • Time to first token is nearly the same as on L40S (e.g., ~5s at ~40k input tokens), but our H100 setup is better.

How does it handle parallel users?

Model ability to handle parallel users on 3rd party hosting

We see that the 3rd-party hosting keeps latency at a good level, and it does not degrade significantly as the number of users increases. When you use LLM-as-a-Service endpoints, you can generally run as many sequences (API requests) in parallel as you need and expect them to be generated in parallel. This is possible because the external provider has a horizontally scalable (from your perspective, effectively infinite) pool of GPUs and splits API requests from many clients across multiple inference endpoints for maximum efficiency.

Structured output failure due to degenerate repetition

Since we had one retried case on H100, after running the main experiments, we stress-tested the model by running at peak points for half an hour.

Very rarely, we saw retries caused by broken JSON:

{
 "main_theme": "Moby-Dick",
 "key_characters":[ "Ishmael","Queequeg","Captain Ahab","Captain Peleg","Captain Bildad","Father Mapple"],
  "causal_events":[ "Ishmael leaves New Bedford for Nantucket","Ishmael meets Queequeg at the Spouter‑Inn","Ishmael and Queequeg sign aboard the Pequod","Captain Ahab’s leg is lost to a whale","The Pequod sets sail","The crew encounters the great white whale","The crew is pursued by the whale","The crew is rescued by the crew of the ship “The Rachel”","The crew is pursued by the whale again","The crew is rescued by the crew of the ship “The Delight”","The crew is pursued by the whale again"....

This issue was always successfully retried on the next attempt.

Its frequency is quite low: we saw from 1 to 5 occurrences during 30 minutes, but this frequency may depend, for example, on the entropy and lengths of incoming prompts.

This issue could happen at any point of any experiment, and it may be a common pattern for the gpt-oss-20b model.

Moreover, this issue happens with any LLM. Gemini, for example, quite often falls into this kind of repetition even in the public UI, while OpenAI might hide it via retries (it’s quite easy to detect repetitions).
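
If you want to catch this on your side before retrying, a simple heuristic goes a long way. A minimal sketch (the probe and window sizes are arbitrary illustrations, not tuned values):

// Detect degenerate repetition: does the last chunk of text already occur
// several times in the window just before it?
function looksDegenerate(text: string, probeLen = 60, windowLen = 600, minHits = 3): boolean {
  if (text.length < probeLen + windowLen) return false;
  const probe = text.slice(-probeLen);
  const recent = text.slice(-windowLen, -probeLen);
  let hits = 0;
  let idx = recent.indexOf(probe);
  while (idx !== -1) {
    hits++;
    idx = recent.indexOf(probe, idx + 1);
  }
  return hits >= minHits;
}

// Usage: if looksDegenerate(partialOutput) becomes true while streaming,
// abort the generation and retry instead of waiting for max_tokens.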

Attempts to run gpt-oss-120B on H100 80G

In theory, 80G might be enough to run even a 120B model. Its weights in MXFP4 consume at least 60 GB by definition (in fact I’ve seen 64.38 GiB in the logs, i.e. some CUDA overhead), so what’s left for the KV cache and activations is much less than 14 GB, which is very tight given that this model has more layers.

Default parameters for this model on Hugging Face / vLLM are definitely not optimized for 80G VRAM:

| Parameter | Default |
|---|---|
| MML (max-model-len) | 131072 |
| MNBT (max-num-batched-tokens) | 8192 |
| MMU (gpu-memory-utilization) | 0.9 |
| MNS (max-num-seqs) | 256 |

With these settings the model does not load at all:

Model loading took 64.38 GiB memory  
Available KV cache memory: -0.62 GiB

Setting MNBT to 2048 still did not help. So with the same config as we used for 20B on 24G VRAM, we can’t run 120B on 80G VRAM.

The reason 120B cannot start on a single H100 80GB with the same 131k context settings that worked for 20B is that KV cache memory scales with the number of layers and the hidden size of the model, not just with available VRAM. While 20B left ~10GB for KV cache on a 24GB GPU, the 120B model has roughly 4–5× larger per-token KV footprint due to significantly more layers and a much larger hidden dimension. So even though ~16GB appears “available” after loading weights on H100, each token now consumes far more KV memory, making a 131k context mathematically infeasible on a single GPU.

Reducing --max-model-len to half (e.g. 57344) didn’t help either. You can probably extract something from this card by reducing values even more, but it will be very limited and needs careful OOM testing. Ideally you should use 2× H100 80G and set --tensor-parallel-size 2.
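
A sketch of that two-GPU variant (the only changes versus our 20B command are the model name and the tensor-parallel flag; we have not benchmarked this configuration here):

docker run --rm \
    --name gpt-120b \
    --gpus all \
    --ipc=host \
    -e HUGGING_FACE_HUB_TOKEN=${var.hf_token} \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model openai/gpt-oss-120b \
    --dtype bfloat16 \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 2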

Conclusion

At a typical “real” chat/RAG workload size (≈10k input words: history + retrieved context + a short user question like “Given the attached policy and our last 5 messages, what are the top 3 risks and the next steps?”), the single-user latency for gpt-oss-20b differs dramatically by GPU. In our 1-user run at that prompt size, L4 24G delivered ~3.50s TTFT and ~6.93s full response time, L40S 48G ~1.17s TTFT and ~1.92s full response time, and H100 80G ~0.80s TTFT and ~1.62s full response time.

If your system is used frequently by many people (for example, an internal chat tool where several employees ask questions at the same time), the same ~10k context request can be in-flight concurrently. With 5 parallel users at this prompt size, we observed that L4 24G spreads TTFT from ~4.61s (lucky) to ~16.38s (unlucky) and full response time from ~20.99s to ~24.04s; L40S 48G keeps TTFT in ~1.42–4.55s and full response in ~6.78–7.65s; and H100 80G keeps TTFT in ~1.29–2.37s and full response in ~3.48–4.51s. In practice this matters because user-perceived latency is dominated by TTFT in streaming UIs, and concurrency is the normal case for shared tools.

From a cost-efficiency standpoint, L4 is the cheapest entry point (and works well for low concurrency), but it becomes latency-limited quickly as context grows. L40S is often the sweet spot for gpt-oss-20b: it cuts single-user latency by ~3–4× versus L4 while typically costing less than 2×, and it also provides substantially more parallelism headroom at long contexts. H100 wins on raw latency and concurrency, but unless you specifically need the lowest possible tail latencies at higher load, it is usually harder to justify purely on €/latency for a 20B-class model.