TL;DR: Red Hat has launched the Red Hat AI Inference Server, an open source solution designed to simplify and improve the execution of AI models in hybrid cloud environments. Featuring advanced optimization tools, it runs on any AI accelerator and in any cloud, helping to democratize generative AI in the enterprise.
During Red Hat Summit 2025, Red Hat announced the launch of the Red Hat AI Inference Server, a new component of the Red Hat AI portfolio. Designed for hybrid cloud environments, this open source solution aims to simplify the execution of generative AI models while improving their operational performance. An inference server acts as the interface between AI applications and large language models (LLMs), generating responses from input data. As LLM deployments multiply in production, the inference phase becomes a critical issue, both technically and economically.
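To make this concrete, here is a minimal sketch of how an application might query such a server, assuming it exposes an OpenAI-compatible HTTP endpoint as upstream vLLM does; the URL and model name are placeholders to adapt to your own deployment.

```python
# Minimal sketch: querying an inference server over an OpenAI-compatible API.
# The URL and model name below are placeholders, not Red Hat defaults.
import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical local endpoint

payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.3",      # any model the server has loaded
    "messages": [
        {"role": "user", "content": "Summarize the benefits of hybrid cloud inference."}
    ],
    "max_tokens": 128,
    "temperature": 0.2,
}

response = requests.post(API_URL, json=payload, timeout=60)
response.raise_for_status()

# The server generates a completion from the input messages and returns JSON.
print(response.json()["choices"][0]["message"]["content"])
```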
Based on the vLLM community project initiated at the University of California, Berkeley, Red Hat AI Inference Server includes advanced optimization tools, including those from Neural Magic, that reduce energy consumption, accelerate computation, and improve cost efficiency. Available as a standalone container or integrated with the RHEL AI and Red Hat OpenShift AI solutions, it offers great flexibility by running on any type of AI accelerator and in any cloud.
Among the main announced features:
- Model compression tools that reduce model size and energy footprint without loss of accuracy;
- Enterprise support and interoperability with third-party platforms, including Linux and Kubernetes distributions outside of Red Hat.

With this announcement, Red Hat reaffirms its commitment to making vLLM an open standard for AI inference, promoting greater interoperability and strengthening the technological sovereignty of businesses. By addressing the growing needs of industrial-scale inference, it actively contributes to the democratization of generative AI.
Towards the Democratization of Generative AI
The solution natively supports several leading language models, including Gemma, Llama, Mistral, and Phi, and leverages the latest features of vLLM: high-throughput inference, multi-GPU processing, continuous batching, and extended input context.
Red Hat thus aims to make vLLM an open inference standard for generative AI in the enterprise, regardless of the AI model, the underlying accelerator, or the deployment environment.
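As an illustration of those vLLM capabilities, here is a minimal sketch using the upstream vLLM Python API; the model name, parallelism degree, and context length are illustrative values, not Red Hat defaults.

```python
from vllm import LLM, SamplingParams

# Model name and parallelism settings are illustrative; adapt them to your hardware.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # one of the supported model families
    tensor_parallel_size=2,                    # multi-GPU: shard the model across 2 GPUs
    max_model_len=8192,                        # extended input context
)

sampling = SamplingParams(temperature=0.2, max_tokens=128)

prompts = [
    "Explain continuous batching in one sentence.",
    "List two benefits of model quantization.",
]

# vLLM schedules these requests with continuous batching for high-throughput inference.
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text)
```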
To better understand
What is the vLLM project and why is it important for AI inference?
The vLLM project, initiated at the University of California, Berkeley, is an open source engine for serving large language models efficiently. It improves the operational performance of AI models through techniques such as multi-GPU processing, continuous batching, and high-throughput inference, thereby reducing energy consumption and improving cost efficiency.
How can intelligent compression of AI models reduce energy consumption without compromising accuracy?
Intelligent compression reduces the size of AI models through techniques such as quantization and pruning, which eliminate redundancy and lower compute and memory requirements, while optimization algorithms keep accuracy close to that of the original model.
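To illustrate the general principle, not Red Hat's or Neural Magic's specific tooling, here is a minimal sketch using PyTorch dynamic quantization, one common compression technique: weights are stored as 8-bit integers instead of 32-bit floats, shrinking the memory footprint while keeping results close to the original.

```python
import io

import torch
import torch.nn as nn

# A toy model standing in for a much larger network.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
)

# Dynamic quantization stores Linear weights as 8-bit integers instead of 32-bit floats,
# cutting their memory footprint roughly 4x while keeping outputs close to the original.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_mb(m: nn.Module) -> float:
    """Serialize the model's weights to estimate their size in megabytes."""
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

print(f"original:  {size_mb(model):.1f} MB")
print(f"quantized: {size_mb(quantized):.1f} MB")
```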