VLIW Kernel Optimization

Overview

Optimize a kernel that runs on a simulator of a custom VLIW SIMD processor. The kernel performs a batched tree traversal with hashing, and your job is to make it run in as few clock cycles as possible.

Based on Anthropic's Performance Take-Home.

Architecture

The processor is a single-core VLIW (Very Long Instruction Word) machine with SIMD support:

Engines: ALU (12 slots), Vector ALU (6 slots), Load (2 slots), Store (2 slots), Flow (1 slot)
Vector length: 8 elements
Scratch space: 1536 words (registers + cache)
Memory: 32-bit words
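
The slot counts above bound what can be packed into one instruction word. The sketch below checks a candidate bundle against those limits. The engine keys (`alu`, `valu`, `load`, `store`, `flow`) and the `bundle` dictionary format are hypothetical names chosen for illustration, not the simulator's own encoding. The peak figure assumes each Vector ALU slot processes all 8 elements and each ALU slot processes one.

```python
# Per-cycle slot limits copied from the engine list above.
# The short engine keys are illustrative names, not the simulator's API.
SLOT_LIMITS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1}
VLEN = 8  # vector length in elements


def fits_in_one_cycle(bundle):
    """Return True if no engine is asked for more slots than it has.

    `bundle` maps an engine name to the number of operations scheduled on it.
    """
    return all(count <= SLOT_LIMITS[engine] for engine, count in bundle.items())


def peak_element_ops_per_cycle():
    """Upper bound on arithmetic element operations per cycle.

    Assumes one element per scalar ALU slot and VLEN elements per vector slot.
    """
    return SLOT_LIMITS["alu"] + SLOT_LIMITS["valu"] * VLEN


print(fits_in_one_cycle({"alu": 12, "load": 2}))  # within limits
print(fits_in_one_cycle({"load": 3}))             # one load too many
print(peak_element_ops_per_cycle())
```

Under these assumptions, the two load slots are much scarcer than the arithmetic slots, so memory access is a likely place to look for bottlenecks.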

All engine slots execute in parallel within a single cycle. Effects take place at the end of the cycle, so an operation cannot see a value written by another slot in the same cycle.
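
The end-of-cycle rule can be illustrated with a toy model. This is a minimal sketch, not the simulator's implementation: the `(dest, op, src_a, src_b)` slot tuple and the dictionary-based scratch space are assumptions made for the example.

```python
import operator


def run_cycle(scratch, slots):
    """Execute one cycle in a toy model of the machine.

    Each slot is a hypothetical (dest, op, src_a, src_b) tuple of scratch
    addresses. Every slot reads the state from before the cycle; all writes
    are buffered and committed together at the end.
    """
    pending = {}
    for dest, op, src_a, src_b in slots:
        # Reads go to `scratch`, which is not modified until the loop ends.
        pending[dest] = op(scratch[src_a], scratch[src_b])
    scratch.update(pending)  # commit all effects at end of cycle
    return scratch


scratch = {0: 1, 1: 2}
# Both slots read the old values 1 and 2, even though slot one writes address 0.
run_cycle(scratch, [(0, operator.add, 0, 1), (1, operator.add, 0, 1)])
print(scratch)
```

If the slots ran one after another instead, the second slot would read the updated value at address 0 and produce a different result. A dependent operation therefore has to be scheduled in a later cycle than the operation that produces its input.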

Scoring

The naive baseline runs in 147,734 cycles. The task is open-ended: there is no known theoretical minimum, and lower is better.

Milestone	Cycles
Baseline (naive scalar)	147,734
Updated starting point	18,532
Claude Opus 4 (many hours)	2,164
Claude Opus 4.5 (casual session)	1,790
Claude Opus 4.5 (2hr harness)	1,579
Claude Opus 4.5 (11.5hr harness)	1,487
Claude Opus 4.5 (improved harness)	1,363
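
To compare a result against the milestones, it can help to express cycle counts as speedups over the naive baseline. The short sketch below does that arithmetic with the figures from the table; the names `BASELINE_CYCLES`, `MILESTONES`, and `speedup` are illustrative.

```python
BASELINE_CYCLES = 147_734  # naive scalar baseline from the table above

# Cycle counts copied from the milestone table.
MILESTONES = {
    "Updated starting point": 18_532,
    "Claude Opus 4 (many hours)": 2_164,
    "Claude Opus 4.5 (casual session)": 1_790,
    "Claude Opus 4.5 (2hr harness)": 1_579,
    "Claude Opus 4.5 (11.5hr harness)": 1_487,
    "Claude Opus 4.5 (improved harness)": 1_363,
}


def speedup(cycles, baseline=BASELINE_CYCLES):
    """Return how many times faster `cycles` is than the baseline."""
    return baseline / cycles


for name, cycles in MILESTONES.items():
    print(f"{name}: {speedup(cycles):.1f}x")
```

By this measure the best listed result is roughly 108 times faster than the baseline.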