CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
The paper "CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs" introduces a technique for expressing a transformer block as a single matrix multiplication (GEMM) followed by an epilogue kernel. Fusing the attention and MLP computations into one GEMM operation reduces both the number of kernel launches and the volume of memory reads and writes. The authors report speedups of up to 2x over standard implementations on NVIDIA GPUs, with particular benefits for models with large hidden dimensions. The method is compatible with existing transformer architectures and requires no changes to model weights. The paper is available on arXiv (2605.19269) and was published on May 22, 2026.
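To make the general idea of a GEMM epilogue concrete, here is a minimal pure-Python sketch. It is not the paper's method or code: the bias-plus-GELU epilogue, the tile size, and the names `gelu`, `matmul`, `unfused`, and `fused` are all assumptions chosen for illustration. It only shows why applying the elementwise step to each output tile before it is stored can avoid a second pass over an intermediate matrix.

```python
import math

def gelu(x):
    # Tanh approximation of GELU (assumed here as an example epilogue).
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def matmul(a, b):
    """Plain GEMM on lists of lists: returns a @ b."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def unfused(a, b, bias):
    # Pass 1: the GEMM writes the full intermediate matrix.
    c = matmul(a, b)
    # Pass 2: a separate step rereads it to add the bias and apply GELU.
    return [[gelu(c[i][j] + bias[j]) for j in range(len(bias))]
            for i in range(len(c))]

def fused(a, b, bias, tile=2):
    """Tiled GEMM whose epilogue runs on each output element before the store."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for i in range(i0, min(i0 + tile, n)):
                for j in range(j0, min(j0 + tile, m)):
                    # The accumulator stands in for a value held in registers.
                    acc = sum(a[i][p] * b[p][j] for p in range(k))
                    # Epilogue: bias + GELU applied before the single write.
                    out[i][j] = gelu(acc + bias[j])
    return out

# Hypothetical toy inputs: a 3x2 activation matrix and a 2x3 weight matrix.
a = [[1.0, -2.0], [0.5, 3.0], [-1.5, 0.25]]
b = [[0.2, -0.4, 1.0], [0.7, 0.1, -0.3]]
bias = [0.1, -0.2, 0.05]

max_diff = max(abs(x - y)
               for row_u, row_f in zip(unfused(a, b, bias), fused(a, b, bias))
               for x, y in zip(row_u, row_f))
```

Both functions compute the same result; the fused version simply never materializes the pre-activation matrix. On a GPU, that corresponds to one fewer kernel launch and one fewer round trip through memory for the intermediate, which is the kind of saving the summary above describes.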
If the reported speedups carry over to production workloads, CODA could reduce inference and training costs for large transformer models.