Optimize a Triton vector-addition kernel for 2^20 CUDA elements.
Keywords: cuda
Optimize a Triton vector-addition kernel for 2^28 CUDA elements.
Optimize a Triton vector-addition kernel for 2^24 CUDA elements.
Optimize ragged CUDA attention with per-row length masks.
Optimize packed INT4 quantized dot products on CUDA.
Optimize QK RMSNorm on CUDA tensors.
Optimize mixed-precision linear, bias, and GELU CUDA computation.
Optimize a CUDA/Triton implementation of the Mamba2 sequential scan recurrence.
Optimize batched CUDA matrix multiplication across grouped shapes.
Optimize Triton GEMM with GELU for transformer-like CUDA shapes.
Optimize Triton GEMM with GELU for square CUDA matrix shapes.
Optimize a Triton GEMM with GELU for tall/skinny and short/wide matrices.