Hacker News · Saturday · May 23, 2026

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

transformers · gpu · optimization · attention

A new paper titled "CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs" introduces a technique to express transformer blocks as a single matrix multiplication (GEMM) followed by an epilogue kernel. By fusing the attention and MLP computations into a unified GEMM operation, CODA reduces the number of kernel launches and memory reads/writes. The authors demonstrate that this approach can achieve up to 2x speedup over standard implementations on NVIDIA GPUs, with particular benefits for models with large hidden dimensions. The method is compatible with existing transformer architectures and requires no changes to model weights. The paper is available on arXiv (2605.19269) and was published on May 22, 2026.
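To make the general idea of a GEMM epilogue concrete, here is a minimal pure-Python sketch. It is not taken from the paper and does not reproduce CODA; it only illustrates the generic pattern of applying a bias add and an activation to each accumulator before it is stored, instead of running them as separate passes over the output matrix. The function names (`unfused_linear_gelu`, `fused_linear_gelu`), the choice of GELU, and the "full-matrix writes" counter are all assumptions made for illustration.

```python
import math


def gelu(x):
    # tanh approximation of GELU, used here only as an example activation
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def gemm(a, b):
    """Plain matrix multiply on lists of rows: returns a @ b."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def unfused_linear_gelu(a, b, bias):
    """Three separate passes, each writing a full output matrix.

    On a GPU these would typically be three kernel launches
    (hypothetical illustration, not the paper's baseline).
    """
    c = gemm(a, b)                                               # pass 1: GEMM
    c = [[v + bias[j] for j, v in enumerate(row)] for row in c]  # pass 2: bias add
    c = [[gelu(v) for v in row] for row in c]                    # pass 3: activation
    return c, 3  # second value: number of full-matrix writes


def fused_linear_gelu(a, b, bias):
    """One pass: the epilogue (bias + GELU) runs on each accumulator
    before the single store to the output matrix."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = gelu(acc + bias[j])  # epilogue applied before the store
    return out, 1  # second value: number of full-matrix writes


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[0.5, -1.0], [0.25, 2.0]]
    bias = [0.1, -0.2]
    unfused, unfused_writes = unfused_linear_gelu(a, b, bias)
    fused, fused_writes = fused_linear_gelu(a, b, bias)
    print("full-matrix writes:", unfused_writes, "vs", fused_writes)
    print("results match:", all(
        abs(x - y) < 1e-12
        for ru, rf in zip(unfused, fused) for x, y in zip(ru, rf)))
```

Both functions compute the same values; the fused version simply avoids materializing the intermediate matrices. A real GPU implementation would express the epilogue inside the GEMM kernel itself, which this sketch does not attempt.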

// why it matters

If the reported speedups hold, CODA could lower inference and training costs for large transformer models by cutting kernel launches and memory traffic.

Sources

Primary · Hacker News
▸ Read original at arxiv.org
