The pace of generative AI (gen AI) innovation demands powerful, flexible and efficient solutions for deploying large language models (LLMs). Today we are introducing Red Hat AI Inference Server, a crucial component of the Red Hat AI platform that is incorporated into Red Hat OpenShift AI and Red Hat Enterprise Linux AI (RHEL AI). AI Inference Server is also available as a standalone product, designed to bring optimized LLM inference capabilities with true portability across your hybrid cloud environments.
Across any deployment environment, AI Inference Server provides users with a hardened, supported distribution of vLLM along with intelligent LLM compression tools and an optimized model repository on Hugging Face, all backed by Red Hat’s enterprise support and third-party deployment flexibility pursuant to Red Hat’s third-party support policy.
Accelerating inference with a vLLM core & advanced parallelism
The vLLM serving engine is at the heart of AI Inference Server. vLLM, originally developed at the University of California, Berkeley, is known for its high throughput and memory-efficient performance, achieved through novel techniques such as PagedAttention, which optimizes GPU memory management, and continuous batching; together these frequently deliver several times the performance of traditional serving methods. The server also exposes an OpenAI-compatible API endpoint, making integration easier.
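For illustration, a minimal sketch of calling that OpenAI-compatible endpoint with the standard openai Python client could look like the following; the host, port and model name are assumptions, not values taken from this article.

```python
# Query a running AI Inference Server (vLLM) instance through its
# OpenAI-compatible endpoint. Host, port and model name are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local serving address
    api_key="EMPTY",                      # vLLM accepts any key unless one is configured
)

response = client.chat.completions.create(
    model="RedHatAI/granite-3.1-8b-instruct",  # hypothetical model identifier
    messages=[{"role": "user", "content": "Summarize what PagedAttention does."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```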
To handle today’s massive gen AI models across diverse hardware, vLLM provides robust inference optimizations:
Tensor parallelism (TP): Spreads each model layer across multiple GPUs, typically within a node, reducing latency and increasing computational throughput.
Pipeline parallelism (PP): Stages sequential groups of model layers across different GPUs or nodes. This is crucial for fitting models that are too large even for a single multi-GPU node.
Expert parallelism (EP): For mixture-of-experts (MoE) models, vLLM has specialized optimizations for managing the unique routing and computation requirements of MoE architectures.
Data parallelism (DP): vLLM supports data-parallel attention, which routes individual requests to different vLLM engines. The data-parallel engines unite during MoE layers, sharding experts across all tensor-parallel and data-parallel workers. This is especially crucial for models with few key-value (KV) attention heads, such as DeepSeek V3 and Qwen3, where tensor parallelism leads to wasteful KV cache duplication; data parallelism lets vLLM scale to a larger number of GPUs in these cases.
Quantization: LLM Compressor, a component of AI Inference Server, provides a unified library for creating compressed models with weight and activation quantization or weight-only quantization for faster inference with vLLM. vLLM has custom kernels, like Marlin and Machete, for optimized performance with quantization.
Speculative decoding: Improves inference latency by using a smaller, faster draft model to generate several future tokens, which are then validated or corrected by the main, larger model in fewer steps. This increases throughput and reduces decoding latency without affecting output quality.
It is worth noting that these techniques can often be combined, for instance using pipeline parallelism across nodes and tensor parallelism within each node, to effectively scale the largest models across complex hardware topologies, as the sketch below illustrates.
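As a rough sketch of combining these options, the snippet below uses vLLM's offline Python interface with tensor and pipeline parallelism set explicitly; the model id and the parallelism degrees (four GPUs per node, two nodes) are illustrative assumptions rather than recommendations.

```python
# Conceptual sketch: combining pipeline parallelism (across nodes) with
# tensor parallelism (within a node) in vLLM's offline Python API.
# Model name and parallelism sizes are illustrative assumptions; multi-node
# runs typically also require a distributed backend such as Ray.
from vllm import LLM, SamplingParams

llm = LLM(
    model="RedHatAI/Llama-3.1-70B-Instruct-quantized.w4a16",  # hypothetical repo id
    tensor_parallel_size=4,    # shard each layer across 4 GPUs within a node
    pipeline_parallel_size=2,  # stage groups of layers across 2 nodes
)

outputs = llm.generate(
    ["Explain the difference between tensor and pipeline parallelism."],
    SamplingParams(max_tokens=128, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```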
Deployment portability via containerization
Delivered as a standard container image, AI Inference Server offers unparalleled deployment flexibility. This containerized format is key to its hybrid cloud portability, ensuring that the same inference environment runs consistently whether deployed via Red Hat OpenShift, Red Hat Enterprise Linux (RHEL), non-Red Hat Kubernetes platforms or other standard Linux systems. It provides a standardized, predictable foundation for serving LLMs anywhere your business requires, simplifying operations across diverse infrastructure.
Support for multiple accelerators
As a fundamental design principle, AI Inference Server is engineered with robust support for multiple accelerators. This capability allows the platform to seamlessly leverage a diverse range of hardware accelerators, including NVIDIA and AMD GPUs and Google TPUs. By providing a unified inference serving layer that abstracts away the complexities of the underlying hardware, AI Inference Server offers significant flexibility and optimization opportunities.
This multi-accelerator support enables users to:
Optimize for performance and cost: Deploy inference workloads on the most suitable accelerator based on the specific model characteristics, latency requirements and cost considerations. Different accelerators excel in different areas, and the ability to choose the right hardware for the job leads to better performance and resource utilization.
Future-proof deployments: As new and more efficient accelerator technologies become available, AI Inference Server's architecture allows for their integration without requiring significant changes to the serving infrastructure or application code. This provides long-term viability and adaptability.
Scale inference capacity: Easily scale inference capacity by adding more accelerators of the same type or by incorporating different types of accelerators to handle diverse workload demands. This provides the agility needed to meet fluctuating user traffic and evolving AI model complexities.
Accelerator choice: By supporting a variety of accelerator vendors with the same software interface, the platform reduces dependency on a single hardware provider, offering greater control over hardware procurement and cost management.
Simplify infrastructure management: AI Inference Server provides a consistent management interface across different accelerator types, simplifying the operational overhead associated with deploying and monitoring inference services on heterogeneous hardware.
Model optimization powered by Neural Magic expertise within Red Hat
Effectively deploying massive LLMs frequently requires optimization. AI Inference Server integrates powerful LLM compression capabilities, leveraging the pioneering model optimization expertise brought in by Neural Magic, now part of Red Hat. Using industry-leading quantization and sparsity techniques like SparseGPT, the integrated compressor drastically reduces model size and computational needs without significant accuracy loss. This allows for faster inference speeds and better resource utilization, leading to substantial reductions in memory footprint and enabling models to run effectively even on systems with constrained GPU memory.
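For a sense of what that workflow can look like, here is a hedged sketch using the open source LLM Compressor library to apply one-shot, GPTQ-style W4A16 weight quantization; the model id, calibration dataset and output path are illustrative assumptions, and exact import paths may differ between LLM Compressor releases.

```python
# Minimal sketch of one-shot weight quantization with LLM Compressor.
# Model id, calibration dataset and output directory are assumptions.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = GPTQModifier(
    targets="Linear",      # quantize the linear layers
    scheme="W4A16",        # 4-bit weights, 16-bit activations
    ignore=["lm_head"],    # keep the output head at full precision
)

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # small model used for illustration
    dataset="open_platypus",                     # assumed calibration dataset
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-W4A16",      # vLLM can load this directory directly
    max_seq_length=2048,
    num_calibration_samples=512,
)
```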
Streamlined access with an optimized model repository
To simplify deployment further, AI Inference Server includes access to a curated repository of popular LLMs (such as Llama, Mistral and Granite families), conveniently hosted on the Red Hat AI page on Hugging Face.
These are not just standard models: they have been optimized using the integrated compression techniques specifically for high-performance execution on the vLLM engine. As a result, you get ready-to-use, efficient models, significantly reducing the time and effort required to get your AI applications into production and delivering value more quickly.
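As a brief sketch, such a pre-optimized model can be loaded by vLLM directly by its repository name; the specific repo id below is an assumption used only for illustration.

```python
# Sketch: serving a pre-optimized model from the Red Hat AI organization
# on Hugging Face with vLLM. The repo id is an illustrative assumption.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/granite-3.1-8b-instruct-quantized.w4a16")

outputs = llm.generate(
    ["Write a one-sentence summary of quantization."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```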
Technical overview of Red Hat AI Inference Server
The vLLM architecture seeks to maximize throughput and minimize latency for LLM inference, particularly in systems handling high concurrency with varied request lengths. The EngineCore is the dedicated inference engine at the heart of this design. It coordinates forward computations, manages the KV cache, and dynamically batches tokens from multiple prompts at once. The EngineCore not only reduces the overhead of managing long context windows but also intelligently preempts or interleaves short, latency-sensitive requests with longer-running queries. PagedAttention, a novel strategy that virtualizes the key-value cache for each request, is used in conjunction with queue-based scheduling to accomplish this. In effect, the EngineCore makes GPU memory usage more efficient while reducing the idle time between computation steps.
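To make the scheduling idea concrete, here is a deliberately simplified, hypothetical continuous-batching loop; it is not vLLM's EngineCore implementation, and every name in it is invented for illustration.

```python
# Hypothetical, simplified continuous-batching loop in the spirit of an
# inference engine core. This is NOT vLLM's actual EngineCore; all names
# are invented purely for illustration.
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    prompt_tokens: int                       # length of the prompt
    max_new_tokens: int                      # generation budget
    generated: list = field(default_factory=list)

    @property
    def num_tokens(self) -> int:
        return self.prompt_tokens + len(self.generated)

    def is_finished(self) -> bool:
        return len(self.generated) >= self.max_new_tokens


class ToyEngine:
    """Admits new requests each step and retires finished ones immediately,
    so short, latency-sensitive requests interleave with long-running ones."""

    def __init__(self, max_batch_tokens: int = 4096):
        self.waiting: deque = deque()        # requests not yet scheduled
        self.running: list = []              # requests currently generating
        self.max_batch_tokens = max_batch_tokens  # stand-in for KV-cache capacity

    def add_request(self, request: ToyRequest) -> None:
        self.waiting.append(request)

    def step(self) -> list:
        # Admit waiting requests while the token budget allows it.
        budget = self.max_batch_tokens - sum(r.num_tokens for r in self.running)
        while self.waiting and self.waiting[0].num_tokens <= budget:
            req = self.waiting.popleft()
            budget -= req.num_tokens
            self.running.append(req)

        # One "forward pass": each running request gets one new token.
        for req in self.running:
            req.generated.append(random.randint(0, 31999))  # fake token id

        # Retire finished requests, immediately freeing capacity for others.
        finished = [r for r in self.running if r.is_finished()]
        self.running = [r for r in self.running if not r.is_finished()]
        return finished


engine = ToyEngine()
engine.add_request(ToyRequest(prompt_tokens=900, max_new_tokens=200))
engine.add_request(ToyRequest(prompt_tokens=30, max_new_tokens=10))
while engine.waiting or engine.running:
    for done in engine.step():
        print(f"finished request with {len(done.generated)} new tokens")
```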
To interface with user-facing services, the EngineCoreClient serves as an adapter that interacts with APIs (HTTP, gRPC, etc.) and relays requests to EngineCore. Multiple EngineCoreClients can communicate with one or more EngineCores, facilitating distributed or multinode deployments. By cleanly separating request handling from the low-level inference operations, vLLM allows for flexible infrastructure strategies, such as load balancing across multiple EngineCores or scaling the number of clients to match user demand.
This separation not only allows for flexible integration with various serving interfaces but also enables distributed and scalable deployment. EngineCoreClients can run in separate processes, communicating with one or more EngineCores over the network to balance load and reduce CPU overhead.
