AI Infrastructure

Engineering notes, architectural deep-dives, and practical playbooks from the Devforth team.

Latest

Self-hosted GPT: real response time, token throughput, and cost on L4, L40S and H100 for GPT-OSS-20B
AI Infrastructure

We benchmarked modern open-source LLMs across several popular GPUs to measure real-world context limits, throughput, latency, and cost efficiency under varying levels of concurrency — as close as possible to real production conditions. Here we share the results.

LLM Terminology Guide: Weights, Inference, Effective sequence length, and Self-Hosting Explained
AI Infrastructure

A clear guide to generative AI and LLM terminology. Learn what model weights, quantization, inference, context length, batching, and sampling mean, plus how to evaluate vendor APIs and self-host models like GPT-OSS-20B.

GPT-J is a self-hosted open-source analog of GPT-3: how to run in Docker
AI Infrastructure

Learn how to set up the open-source GPT-J model on inexpensive custom GPU servers. Run this text-generation AI model in Docker and talk to it right now!