Transformer GEMM Optimization
Optimize Triton GEMM with GELU for transformer-like CUDA shapes.
Ported from Frontier-CS research/problems/gemm_optimization/transformerish.
Agentics Interface
Submit a ZIP project containing the source interface described below. The trusted evaluator imports or compiles participant code from /workspace, so this challenge uses coexecuted_benchmark with acknowledge_danger: true.
Public And Official Data
Public validation uses a small deterministic configuration committed under v1/public. Official scoring uses the private official-runs overlay under private-benchmark/.
Original Statement
GEMM Optimization Problem
Problem Setting
Design and optimize high-performance Triton kernels for General Matrix-Matrix Multiplication (GEMM) on GPU. This problem focuses on implementing efficient matrix multiplication kernels using Triton's JIT compilation system.
The challenge involves optimizing:
- Memory access patterns: Efficient loading and storing of matrix data
- Block tiling: Optimal block sizes for GPU execution
- Autotuning: Leveraging Triton's autotuning capabilities
- Activation functions: Implementing GELU activation within the kernel
- Performance benchmarking: Achieving speedup over baseline implementations
Target
- Primary: Maximize geometric mean speedup over baseline (higher is better)
- Secondary: Ensure correctness across diverse matrix shapes
- Tertiary: Minimize kernel launch overhead and memory usage
API Specification
Implement a Solution class that returns a Triton kernel implementation:
class Solution:
    def solve(self, spec_path: str = None) -> dict:
        """
        Returns a dict with either:
        - {"code": "python_code_string"}
        - {"program_path": "path/to/kernel.py"}
        """
        # Your implementation
        pass
Your kernel implementation must provide:
import torch
import triton
import triton.language as tl

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """
    Matrix multiplication with GELU activation.

    Args:
        a: Input tensor of shape (M, K)
        b: Input tensor of shape (K, N)

    Returns:
        Output tensor of shape (M, N) with GELU activation applied
    """
    pass
Required GELU Implementation:
@triton.jit
def gelu(x):
    return x * 0.5 * (1.0 + tl.extra.cuda.libdevice.erf(x * 0.7071067811865476))
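
For a quick sanity check that needs no GPU, the same erf-based formula can be evaluated on the host with the standard library. This is only a scalar reference sketch; the name `gelu_reference` is hypothetical and is not part of the required interface.

```python
import math

# 1/sqrt(2), the same constant used in the required Triton gelu.
INV_SQRT2 = 0.7071067811865476


def gelu_reference(x: float) -> float:
    """Scalar erf-based GELU: x * 0.5 * (1 + erf(x / sqrt(2)))."""
    return x * 0.5 * (1.0 + math.erf(x * INV_SQRT2))


if __name__ == "__main__":
    for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"gelu({value:+.1f}) = {gelu_reference(value):+.6f}")
```

A useful property for spot checks is that gelu(x) - gelu(-x) equals x, which follows from erf being an odd function.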
API Usage Notes
- The evaluator looks for a matmul function in the module namespace
- Function must handle tensor strides and memory layouts correctly
- Must use Triton JIT compilation for kernel definition
- Should leverage Triton's autotuning features for optimization
- Kernel must apply GELU activation to the result before returning
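
Putting the API pieces together, a submission could look roughly like the sketch below, which returns the kernel module as a `{"code": ...}` string. Everything beyond the required `Solution`, `matmul`, and `gelu` names is an assumption made for illustration: the `KERNEL_SOURCE` and `matmul_gelu_kernel` names, the two autotune configurations, and the simple 2D launch grid. The block sizes are untuned placeholders, and the embedded kernel needs a CUDA GPU with Triton installed, so treat it as a starting point rather than a tuned solution.

```python
# Kernel module source returned by Solution.solve(). It is kept as a string so
# this file can be imported on a machine without Triton or a GPU.
KERNEL_SOURCE = r'''
import torch
import triton
import triton.language as tl


@triton.jit
def gelu(x):
    return x * 0.5 * (1.0 + tl.extra.cuda.libdevice.erf(x * 0.7071067811865476))


# Illustrative, untuned configurations (assumptions, not recommended values).
@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 32}, num_warps=8, num_stages=3),
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 32}, num_warps=4, num_stages=4),
    ],
    key=["M", "N", "K"],
)
@triton.jit
def matmul_gelu_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # One program instance computes one BLOCK_M x BLOCK_N output tile.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    # Explicit strides let the kernel accept non-contiguous inputs.
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a_mask = (offs_m[:, None] < M) & (offs_k[None, :] + k < K)
        b_mask = (offs_k[:, None] + k < K) & (offs_n[None, :] < N)
        a = tl.load(a_ptrs, mask=a_mask, other=0.0)
        b = tl.load(b_ptrs, mask=b_mask, other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    # Fused epilogue: apply GELU on the float32 accumulator before storing.
    acc = gelu(acc)
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc.to(c_ptr.dtype.element_ty), mask=c_mask)


def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    K2, N = b.shape
    assert K == K2, "inner dimensions must match"
    c = torch.empty((M, N), device=a.device, dtype=a.dtype)
    grid = lambda meta: (triton.cdiv(M, meta["BLOCK_M"]), triton.cdiv(N, meta["BLOCK_N"]))
    matmul_gelu_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1), b.stride(0), b.stride(1), c.stride(0), c.stride(1),
    )
    return c
'''


class Solution:
    def solve(self, spec_path: str = None) -> dict:
        # Alternative form: write the source to a file and return
        # {"program_path": "path/to/kernel.py"} instead.
        return {"code": KERNEL_SOURCE}
```

Fusing GELU into the store epilogue avoids a second pass over the (M, N) output. Beyond that, most of the remaining headroom is likely to come from the autotune search space and the tile scheduling order, neither of which this sketch explores.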
Scoring (0-100)
Performance is measured against baseline implementations:
geometric_mean_speedup = geometric_mean(baseline_times / answer_times)
raw_score = min(geometric_mean_speedup, 3.0) # Cap at 3x speedup
score = (raw_score - 1.0) / 2.0 * 100 # Map 1x-3x to 0-100
- 0 points = No speedup (1x baseline performance)
- 50 points = 2x speedup over baseline
- 100 points = 3x+ speedup over baseline
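
The scoring formula above can be reproduced offline to estimate a score from measured timings. The sketch below is a direct transcription of that formula; the function name `score_from_times` and its arguments are hypothetical, and the real evaluator may differ in details. The statement does not say how a geometric mean below 1x is treated, so the sketch applies the formula literally, which yields a negative value in that case.

```python
from statistics import geometric_mean


def score_from_times(baseline_times, answer_times, cap=3.0):
    """Map per-shape timings to a score using the statement's formula."""
    # Per-shape speedup: baseline time divided by submission time.
    speedups = [b / a for b, a in zip(baseline_times, answer_times)]
    raw_score = min(geometric_mean(speedups), cap)  # cap at 3x speedup
    return (raw_score - 1.0) / 2.0 * 100.0  # map 1x-3x to 0-100


if __name__ == "__main__":
    # Made-up timings in milliseconds, for illustration only.
    baseline = [2.0, 4.0, 8.0]
    answer = [1.0, 2.0, 4.0]
    print(score_from_times(baseline, answer))
```

Because the mean is geometric, one shape with a large speedup cannot fully compensate for another shape that regresses: speedups of 4x and 1x average to 2x, not 2.5x.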
Evaluation Details (transformer-ish variant)
- Transformer-like shapes targeting common attention/FFN dimensions:
  - (2048, 4096, 4096)
  - (4096, 4096, 4096)
  - (8192, 4096, 4096)
  - (8192, 8192, 4096)
  - (4096, 11008, 4096)
  - (4096, 4096, 11008)
- Correctness verified with tolerance: rtol=1e-2, atol=5e-3
- Performance measured using median execution time
- Requires CUDA backend and GPU support
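
To get a feel for how strict the stated tolerance is, the sketch below checks a single element, assuming the evaluator follows the `torch.allclose` convention of |actual - expected| <= atol + rtol * |expected|. That convention is an assumption, since the statement gives only the two tolerance values, and the helper name `within_tolerance` is hypothetical.

```python
def within_tolerance(actual: float, expected: float,
                     rtol: float = 1e-2, atol: float = 5e-3) -> bool:
    """Element-wise closeness test in the style of torch.allclose (assumed)."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


if __name__ == "__main__":
    # Near zero only atol matters; for large values rtol dominates.
    print(within_tolerance(0.004, 0.0))    # error 0.004 vs. bound 0.005
    print(within_tolerance(0.006, 0.0))    # error 0.006 vs. bound 0.005
    print(within_tolerance(100.5, 100.0))  # error 0.5 vs. bound 1.005
```

Under this rule, outputs near zero are held to the absolute bound alone, so small negative GELU outputs are where accumulated rounding error is most likely to show up.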
Configuration
This mode runs the trusted coexecuted-evaluator and participant workspace in the same container. Official private data shares that trust boundary.