Hacker News · Saturday · May 30, 2026

Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

llm · inference · cuda · c++

Tiny-vLLM is a new open-source inference engine for large language models, written in C++ and CUDA and targeting NVIDIA GPUs. The project, hosted on GitHub at github.com/jmaczan/tiny-vllm, supports models such as LLaMA, Mistral, and GPT-NeoX. To reduce memory usage and raise throughput, it implements continuous batching, PagedAttention, and reduced-precision inference (FP16, plus INT8 and INT4 quantization). The engine is designed to be lightweight and easy to integrate, with a focus on low latency for real-time applications. Initial benchmarks show performance competitive with vLLM and TensorRT-LLM, particularly at smaller batch sizes. The repository includes a Python API for model loading and inference, as well as a C++ API for direct integration. The project is in early development but already supports basic text generation. Developers can clone the repo and build from source with CMake and CUDA Toolkit 12.0 or later. The author notes that the engine is not yet production-ready and is meant as a foundation for further optimization and experimentation.
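For readers unfamiliar with PagedAttention, the core idea is that each sequence's KV cache lives in fixed-size blocks that need not be contiguous, with a per-sequence block table mapping logical token positions to physical blocks. The sketch below is a toy illustration of that bookkeeping only. The class `PagedKVCache`, its methods, and the block sizes are hypothetical names chosen for this example and are not taken from Tiny-vLLM's code or API.

```python
class PagedKVCache:
    """Toy block-table allocator (hypothetical; not Tiny-vLLM's API)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size                 # tokens per physical block
        self.free_blocks = list(range(num_blocks))   # pool of unused block ids
        self.block_tables = {}                       # seq_id -> list of block ids
        self.seq_lens = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> tuple:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        offset = length % self.block_size
        if offset == 0:
            # The sequence is new or its last block is full: take a fresh block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], offset

    def release(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token(seq_id=0)   # three tokens occupy two blocks
cache.append_token(seq_id=1)       # a second sequence takes one block
cache.release(0)                   # finished sequence frees its blocks
```

Because blocks are handed out on demand and returned as soon as a sequence finishes, a scheduler doing continuous batching can admit new requests whenever the free pool has room, rather than reserving memory for each sequence's maximum length up front.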

// why it matters

Offers a lightweight, high-performance alternative for deploying LLMs on NVIDIA GPUs.

Sources

Primary · Hacker News
▸ Read original at github.com
