Optimize a Triton vector-addition kernel for 2^20 CUDA elements.
Keywords: triton
Optimize a Triton vector-addition kernel for 2^28 CUDA elements.
Optimize a Triton vector-addition kernel for 2^24 CUDA elements.
Optimize ragged CUDA attention with per-row length masks.
Optimize packed INT4 quantized dot products on CUDA.
Optimize a CUDA/Triton implementation of the Mamba2 sequential scan recurrence.
Optimize Triton GEMM with GELU for transformer-like CUDA shapes.
Optimize Triton GEMM with GELU for square CUDA matrix shapes.
Optimize a Triton GEMM with GELU for tall/skinny and short/wide matrices.
Optimize a Triton GEMM with GELU around tile-boundary dimensions.
Optimize a Triton GEMM with GELU for small and large K dimensions.
Optimize a Triton GEMM with GELU for awkward matrix dimensions.