vLLM

Software Development

An open source, high-throughput and memory-efficient inference and serving engine for LLMs.

About us

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs

Website
https://github.com/vllm-project/vllm
Industry
Software Development
Company size
51-200 employees
Type
Nonprofit

Updates

  • vLLM reposted this

    View profile for Daniel van Strien

    Machine Learning Librarian at Hugging Face 🤗 | Making AI work for libraries, archives, and their communities

    I just ran batch inference on a 30B parameter LLM across 4 GPUs with a single Python command! The secret? Modern AI infrastructure where everyone handles their specialty:
    📦 UV (by Astral) handles dependencies via uv scripts
    🖥️ Hugging Face Jobs handles GPU orchestration
    🧠 The Qwen AI team handles the model (Qwen3-30B-A3B-Instruct-2507)
    ⚡ vLLM handles efficient batched inference
    I'm very excited about using uv scripts as a nice way of packaging fairly simple but useful ML tasks in a somewhat reproducible way. Combined with Jobs, this opens up some nice opportunities for making pipelines that require different types of compute. Technical deep dive and code examples: https://lnkd.in/e5BEBU95
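    A minimal sketch of what that single command can wrap, assuming the Qwen/Qwen3-30B-A3B-Instruct-2507 repo ID and a 4-GPU machine; the uv inline metadata pulls in vLLM, and tensor parallelism shards the model across the GPUs:

    # /// script
    # requires-python = ">=3.10"
    # dependencies = ["vllm"]
    # ///
    # Sketch: offline batch inference with vLLM, sharded across 4 GPUs.
    from vllm import LLM, SamplingParams

    prompts = [
        "Summarize the history of public libraries in two sentences.",
        "List three ways LLMs can help with archival metadata.",
    ]
    sampling = SamplingParams(temperature=0.7, max_tokens=256)

    # tensor_parallel_size=4 splits the 30B model's weights across the 4 GPUs.
    llm = LLM(model="Qwen/Qwen3-30B-A3B-Instruct-2507", tensor_parallel_size=4)

    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)

    Run it locally with "uv run batch_infer.py" (the filename is just an example), or hand the same script to a GPU job runner such as Hugging Face Jobs.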

  • vLLM reposted this

    View organization page for Anyscale

    51,628 followers

    🚨 Attention vLLM users – last call! 🚨 The Call for Proposals for our vLLM Featured Track at Ray Summit closes this Wednesday, July 30. If you're building with vLLM in production, optimizing inference, or exploring advanced use cases — we want to see it. This track is all about showcasing real-world implementations and hard-won lessons from the vLLM community. Need inspiration? Check out last year's top vLLM talks: https://lnkd.in/gmRhSbHk Submit your proposal here: https://lnkd.in/gjvKdvFF

  • vLLM reposted this

    View profile for Raushan Turganbay

    ML engineer at 🤗 | Multimodality and Generation | Erasmus Mundus MSc

    🚀 Big big news for multimodal devs! The transformers ↔️ vLLM integration just leveled up: Vision-Language Models are now supported out of the box. If the model is integrated into Transformers, you can now run it directly with vLLM — no need to rewrite or duplicate code. Just plug it in and go, with zero extra effort. Performance might differ from model to model (we’re working on that!), but functional support is guaranteed. Curious how to serve Transformers models with vLLM? Full docs here 👉 https://lnkd.in/d-KjqbmU #multimodal #transformers #vLLM #VLM #opensource
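    A minimal sketch of the idea, assuming a VLM that ships a Transformers implementation; the model ID and prompt template below are illustrative, model_impl="transformers" asks vLLM to run the model through its Transformers backend, and the image rides along via multi_modal_data:

    # Sketch: running a Transformers-integrated VLM through vLLM's Transformers backend.
    from PIL import Image
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-VL-3B-Instruct",  # illustrative; any VLM with a Transformers implementation
        model_impl="transformers",            # use the Transformers modeling code inside vLLM
    )

    # Prompt formatting is model-specific; this is a generic chat-style placeholder.
    prompt = "USER: <image>\nDescribe this picture in one sentence. ASSISTANT:"
    image = Image.open("example.jpg")

    outputs = llm.generate(
        {"prompt": prompt, "multi_modal_data": {"image": image}},
        SamplingParams(max_tokens=128),
    )
    print(outputs[0].outputs[0].text)

    The serving path should be analogous (vllm serve exposes the same backend switch); the linked docs have the authoritative details.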

  • vLLM reposted this

    View organization page for NVIDIA AI

    1,325,015 followers

    🎉 Congratulations to Microsoft on the new Phi-4-mini-flash-reasoning model, trained on NVIDIA H100 and A100 GPUs. This latest addition to the Phi family provides developers with a new model optimized for high-throughput and low-latency reasoning in resource-constrained environments. Bring your data and try out demos on the multimodal playground for Phi on the NVIDIA API Catalog ➡️ https://lnkd.in/geuGhZsS 📷 The first plot shows average inference latency as a function of generation length, while the second plot illustrates how inference latency varies with throughput. Both experiments were conducted using the vLLM inference framework on a single A100-80GB GPU over varying concurrency levels of user requests. 🤗 https://lnkd.in/gswYMYt9
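    For anyone who wants to try the model in the same framework the plots used, a minimal sketch of single-GPU generation with vLLM; the microsoft/Phi-4-mini-flash-reasoning repo ID is assumed from the announcement:

    # Sketch: offline generation with Phi-4-mini-flash-reasoning on one GPU via vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model="microsoft/Phi-4-mini-flash-reasoning", max_model_len=8192)
    params = SamplingParams(temperature=0.6, max_tokens=1024)

    outputs = llm.generate(
        ["Reason step by step: a train covers 180 km in 1.5 hours; what is its average speed?"],
        params,
    )
    print(outputs[0].outputs[0].text)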

  • vLLM reposted this

    View profile for Erik Kaunismäki

    Software Engineer @Hugging Face 🤗

    We just released native support for SGLang and vLLM in Inference Endpoints 🔥 Inference Endpoints is becoming the central place to deploy high-performance inference engines, with the managed infrastructure provided for you. Instead of spending weeks configuring infrastructure, managing servers, and debugging deployment issues, you can focus on what matters most: your AI model and your users 🙌
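    A minimal sketch of the consumer side, assuming the endpoint was deployed with the vLLM engine and therefore exposes an OpenAI-compatible API; the URL, token, and model name are placeholders:

    # Sketch: querying a vLLM-backed Inference Endpoint with the standard openai client.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://<your-endpoint>.endpoints.huggingface.cloud/v1",  # placeholder endpoint URL
        api_key="hf_xxx",  # your Hugging Face token
    )

    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model the endpoint serves
        messages=[{"role": "user", "content": "Hello from a managed vLLM endpoint!"}],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)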

  • View organization page for vLLM

    2,991 followers

    Calling vLLM users! In partnership with Anyscale, we’re opening a special call for proposals for our first ever Featured Track – dedicated entirely to the most exciting inference work happening today. If you’re building with vLLM, we want to see what you’ve got. Last year's vLLM sessions were among our most popular – now we're giving this ecosystem the spotlight it deserves. Check out our Ray Summit 2024 vLLM sessions [linked in comments] for inspiration, then show us what you're building next. Submit your proposal by July 29 to be considered for this Featured Track. Submit here: https://lnkd.in/gjvKdvFF

  • vLLM reposted this

    Calling vLLM users! In partnership with the vLLM team, we’re opening a special call for proposals for our first ever Featured Track – dedicated entirely to the most exciting inference work happening today. If you’re building with vLLM, we want to see what you’ve got. Last year's vLLM sessions were among our most popular – now we're giving this ecosystem the spotlight it deserves. Check out our Ray Summit 2024 vLLM sessions [linked in comments] for inspiration, then show us what you're building next. Submit your proposal by July 29 to be considered for this Featured Track. Submit here: https://lnkd.in/gjvKdvFF

  • vLLM reposted this

    A Pro-Tip for vLLM Users: Free ~90% of Your VRAM in Seconds
    Struggling with GPU memory while juggling multiple models? There's a lesser-known vLLM feature that lets your server "power-nap" and wake up in seconds—no restarts needed.
    Imagine you need to:
    - Hot-swap different models or checkpoints on the same GPU without a full server restart.
    - Run multiple models on a single GPU, cycling through them as needed for different tasks.
    - Optimize batch processing jobs that require different LLMs for various stages.
    - Free up GPU memory during the training phase in RLHF/GRPO loops, making the entire process more efficient.
    Quick Setup:
    1. Enable sleep mode:
       export VLLM_SERVER_DEV_MODE=1
       vllm serve $MODEL --enable-sleep-mode
    2. Toggle with a simple API call:
       # Put the model to sleep
       curl -X POST :8000/sleep -d '{"level": 1}'
       # Wake it up
       curl -X POST :8000/wake_up
    Heads up:
    🔒 These are dev endpoints; keep them internal.
    🧠 Level 1 sleep uses host RAM; Level 2 is slower to wake.
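    The same nap cycle is available from the offline Python API, which is handy inside RLHF/GRPO training loops. A minimal sketch, assuming a recent vLLM release where enable_sleep_mode, sleep(), and wake_up() are available:

    # Sketch: sleep/wake from Python to free GPU memory between generation phases.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)  # model ID is illustrative
    params = SamplingParams(max_tokens=32)

    print(llm.generate(["Hello!"], params)[0].outputs[0].text)

    llm.sleep(level=1)   # offload weights to host RAM and drop the KV cache, freeing most VRAM
    # ... train, or run another model on the freed GPU memory ...
    llm.wake_up()        # reload weights onto the GPU in seconds

    print(llm.generate(["Back again?"], params)[0].outputs[0].text)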

  • vLLM reposted this

    View organization page for MiniMax

    9,036 followers

    🎉 Join the MiniMax-M1 Technical Seminar
    We’re excited to announce our first official seminar on MiniMax-M1 — the world’s first open-weight, large-scale hybrid-attention reasoning model, setting new standards in long-context reasoning with a 1M-token input and 80K-token output window. This online event brings together leading voices in AI from Anthropic, Hugging Face, vLLM, SGLang, MIT CSAIL, HKUST, University of Waterloo, and more — alongside the MiniMax technical team.
    🔍 What to Expect:
    • Behind-the-scenes of M1’s architecture & algorithm design
    • Inference performance & real-world applications
    • Expert panel discussions
    • Live Q&A with the global AI community
    📅 Date: Thursday, July 10, 2025
    🕓 Time: 4 PM PST / 7 PM EST / 7 AM CST (July 11)
    💻 Format: Zoom (Online only, limited seats)
    Whether you’re a researcher, developer, or AI enthusiast, we welcome you to join the discussion. Innovation begins with conversation.
    👉 Register here: https://lu.ma/d7ptaky2
