CovAgent Evaluation Report
ReAct Framework vs Codex CLI: Automated Hardware Verification Coverage Closure
Cross-Design Summary
[Charts not reproduced: Final Coverage by Design (%); Token Efficiency (Tokens per Run); Time to 95% Coverage (seconds); Total Run Time (seconds)]
Design-Level Comparison
chacha_top
ChaCha20 stream cipher: memory-mapped register interface, init/next FSM, 20-round quarter-round core
Verdict: Both agents converged to the same ceiling (99.4%) in exactly 2 iterations. ReAct got there with 2.7× fewer tokens and 2.8× less time. The single remaining hole (chacha_core.v:566, the 32-bit block counter overflow) is identical for both agents and is practically unreachable from the top-level interface, since it requires 2³² block operations.
Residual Coverage Holes (Shared)
chacha_core.v:566-567: block1_ctr overflow carry. The counter seed is hardwired to 0 at the top level, so reaching 0xFFFFFFFF requires 2³² next operations. Both agents correctly identified and classified this hole.
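To put the 2³² figure in perspective, here is a back-of-the-envelope estimate. The cycles-per-block count and the simulator throughput are illustrative assumptions, not measurements from this evaluation.

```python
# Rough cost of overflowing a 32-bit block counter in simulation.
# CYCLES_PER_NEXT and SIM_CYCLES_PER_SEC are assumed values for illustration only.
BLOCK_OPS = 2**32               # 'next' operations needed to wrap a counter seeded with 0
CYCLES_PER_NEXT = 25            # assumption: clock cycles per block operation
SIM_CYCLES_PER_SEC = 1_000_000  # assumption: RTL simulator throughput


def overflow_sim_hours(ops=BLOCK_OPS, cycles_per_op=CYCLES_PER_NEXT,
                       rate=SIM_CYCLES_PER_SEC):
    """Return the wall-clock hours needed to simulate `ops` block operations."""
    return ops * cycles_per_op / rate / 3600


print(f"{overflow_sim_hours():.1f} hours")
```

Under these assumptions the overflow needs roughly 30 hours of simulation, which is why a waiver is more practical than stimulus.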
ethmac_eth_with_cop
Ethernet MAC controller: Wishbone bus, TX/RX DMA, MII/RMII PHY, MAC control frames, buffer descriptors
Verdict: ReAct reached 90% coverage in 221s vs Codex's 982s (4.4× faster to this milestone) and achieved slightly higher final coverage (98.6% vs 98.0%). The design is the most complex tested (3482 coverable statements). Both agents stalled on the same class of holes: protocol-level corner cases requiring precise multi-signal timing across TX/RX DMA, MAC control, and Wishbone bus interactions.
Residual Coverage Holes
eth_receivecontrol.v:337: dead branch (ResetSlotTimer already consumed by RxReset). eth_spram_256x32.v:229-230: print_ram task never called by RTL.
eth_registers.v:794,797: TxC interrupt sync requires precise TxCtrlEndFrm + StartTxDone + r_TxFlow timing. eth_maccontrol.v:112: muxed abort needs TxAbortIn rising while TxUsedDataOutDetected is already asserted. eth_rxethmac.v:248: broadcast detect at exact byte count/state alignment. eth_wishbone.v:938-959+: deeper DMA/burst/abort/underrun paths requiring precise TX/RX BD + master-bus handshake sequencing.
trng_top
True Random Number Generator: SHA-512 mixer, ChaCha20 CSPRNG, entropy sources, FIFO, debug subsystem
Verdict: ReAct outperformed Codex in both coverage (+1.6pp) and efficiency (2.1× fewer tokens, 2.3× faster). The gap emerged in the 95–97% range, where ReAct ran 5 targeted iterations vs Codex's 3 broader ones. Both hit the same fundamental wall: integration-tied-off hardware (entropy0 disabled, SHA-512 ports unconnected) and FSM timing precision (CSPRNG cancel edges, mixer discard windows).
Residual Coverage Holes (Combined)
sha512_core.v:257-302: state writeback ports unconnected in trng_mixer instantiation. sha512_core.v:509: work_factor_num hardwired to 0. trng_mixer.v:668+: entropy source 0 disabled at integration. trng_csprng.v:263: ready_we never driven high (dead code).
trng_csprng.v:534,564: CTRL_CANCEL from CTRL_INIT0/CTRL_NEXT0 requires !enable_reg||seed_reg asserted during a precise 1-cycle FSM window. trng_mixer.v:1012+: discard-driven early-exit branches require holding discard asserted across multiple consecutive cycles during active mixer operation. chacha_core.v:682-688: init during CTRL_DONE (re-init without reset) needs a deterministic reseed sequence.
sd-card-controller_top
SD Card Controller: Wishbone slave/master, SD CMD/DAT protocol, DMA, dual-clock FIFO, CRC generation
Verdict: This design produced the largest gap between agents. ReAct reached 95% coverage, which Codex never did, and got there with 2.4× fewer tokens. The SD card controller is protocol-heavy: most remaining holes require a protocol-faithful behavioral model of the SD card (correct CRC7/CRC16, proper response framing, data tokens). Neither agent can synthesize such a model from scratch via stimulus alone. This is the hardest category of coverage hole for LLM-based verification.
Residual Coverage Holes (Combined)
generic_fifo_dc_gray.v:168,192: FIFO clear ports hardwired to 1'b0 in sd_fifo_filler.v. FSM defaults in sd_cmd_serial_host.v:134, sd_cmd_master.v:97, sd_data_master.v:74: defensive branches for impossible state encodings.
sd_cmd_serial_host.v:265: CRC match requires a card-side response with correct CRC7. sd_data_master.v:135,176: TX FIFO underrun and transfer-complete with crc_ok require correct SD data protocol. sd_data_serial_host.v:99,247+: multi-block write continuation and 4-bit CRC shifting require protocol-correct card behavior. sd_data_xfer_trig.v:36,43: data transfer trigger needs correct command_reg programming with "with data" fields.
Coverage Hole Taxonomy
Across all 4 designs, every residual coverage hole falls into one of four categories. This taxonomy generalizes across designs and reveals where LLM-based stimulus generation fundamentally struggles.
| Category | Description | Designs Affected | LLM Solvable? |
|---|---|---|---|
| Type 1 Integration Tied-Off | Hardware ports/signals hardwired or unconnected at the integration level. No top-level stimulus can reach these paths. | trng_top, sd-card-controller_top | No. Requires coverage exclusions or design changes. |
| Type 2 Infeasible Boundary | Code is theoretically reachable but requires astronomically many operations (2³² increments, timeout counters exceeding practical sim time). | chacha_top, trng_top | No. Inherent simulation limitation. Agent should recommend waiver. |
| Type 3 Protocol Model Gap | Coverage requires an external protocol-compliant behavioral model (SD card with CRC, Ethernet PHY) that responds correctly to DUT outputs. | sd-card-controller_top, ethmac | Partially. LLMs can generate simple BFMs but struggle with protocol-correct CRC computation and multi-phase handshakes. |
| Type 4 FSM Timing Precision | Path requires specific multi-signal timing alignment across a narrow FSM window (1-2 cycles). Random stimulus has vanishingly low probability of hitting these. | trng_top, ethmac, sd-card-controller_top | Partially. LLMs can reason about FSM transitions but cannot observe runtime state to time stimulus precisely. |
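The taxonomy above lends itself to a simple triage table that an agent could consult before spending iterations on a hole. The sketch below is one possible encoding, not part of CovAgent. The example holes come from the per-design sections, but their type assignments and the action strings are this sketch's reading of the report.

```python
from enum import Enum


class HoleType(Enum):
    INTEGRATION_TIED_OFF = 1
    INFEASIBLE_BOUNDARY = 2
    PROTOCOL_MODEL_GAP = 3
    FSM_TIMING_PRECISION = 4


# Suggested follow-up per category, paraphrased from the report.
ACTION = {
    HoleType.INTEGRATION_TIED_OFF: "exclude from coverage or change the design",
    HoleType.INFEASIBLE_BOUNDARY: "recommend a waiver",
    HoleType.PROTOCOL_MODEL_GAP: "build a protocol-compliant behavioral model",
    HoleType.FSM_TIMING_PRECISION: "use state-aligned directed stimulus",
}

# Example holes; the type assigned to each is an interpretation, not report data.
HOLES = {
    "generic_fifo_dc_gray.v:168": HoleType.INTEGRATION_TIED_OFF,
    "chacha_core.v:566": HoleType.INFEASIBLE_BOUNDARY,
    "sd_cmd_serial_host.v:265": HoleType.PROTOCOL_MODEL_GAP,
    "trng_csprng.v:534": HoleType.FSM_TIMING_PRECISION,
}


def stimulus_worthwhile(hole_type):
    """Types 1 and 2 are marked 'No' under 'LLM Solvable?', so skip them."""
    return hole_type in (HoleType.PROTOCOL_MODEL_GAP,
                         HoleType.FSM_TIMING_PRECISION)


def triage(holes):
    """Map each hole to (type name, suggested action, worth more stimulus?)."""
    return {loc: (t.name, ACTION[t], stimulus_worthwhile(t))
            for loc, t in holes.items()}


for location, verdict in triage(HOLES).items():
    print(location, verdict)
```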
Key Insights
The 95% Cliff: Where LLMs Hit the Wall
Both agents reach ~90% coverage quickly and comparably. The 90–95% range is where structured coverage feedback matters: ReAct's faster iteration loop consistently outperforms here. Above 95%, neither agent makes meaningful progress regardless of compute invested. This is not a model capability limit; it is a verification methodology limit of open-loop, stimulus-only approaches without behavioral models or internal observability.
Token Efficiency ≠ Iteration Count
ReAct uses 2–3× fewer tokens than Codex while achieving equal or better coverage. Codex's hidden internal iteration (compile-fix loops within its sandbox) consumes significant tokens on error recovery that never reaches the coverage frontier.
Protocol Knowledge Is the Binding Constraint
The hardest residual holes require the testbench to act as a protocol-compliant peer (SD card, Ethernet PHY, bus master). LLMs can generate stimulus sequences that drive inputs, but they cannot easily synthesize correct closed-loop behavioral models that compute CRCs, maintain state machines, and respond to DUT outputs in real-time within a single testbench.
Coverage Feedback Quality > Quantity
ReAct's annotated source feedback (marking specific uncovered lines with surrounding context) outperforms Codex's approach of scanning full coverage reports. However, even high-quality feedback fails when the agent cannot reason about why a line is uncovered.
Agent Architecture Matters Less Than Environment
ReAct and Codex stall on the same classes of residual holes on every design, and their final coverage is identical on chacha_top and close on ethmac_eth_with_cop (98.6% vs 98.0%). The bottleneck is not the reasoning architecture but the information the agent has access to (no runtime observability, no protocol reference implementations, no formal reachability analysis).
Unreachable Code Detection Is a Solved Subtask
Both agents reliably identify Type 1 and Type 2 holes from RTL (tracing signal connectivity, checking port bindings, analyzing counter bounds). The challenge is that agents spend iterations attempting to cover these lines before concluding they are unreachable, wasting 20–40% of post-95% compute.
Why LLMs Fail at Corner-Case Stimulus Generation
Core Finding: LLM-driven stimulus generation is fundamentally an open-loop process: the agent generates input sequences without observing internal DUT state at runtime. The coverage holes that resist this approach are precisely those that require closed-loop interaction or precise temporal targeting.
1. No Runtime Observability → Blind Timing
LLMs generate stimulus at "compile time" of the testbench, so all timing is pre-determined. For FSM corner cases that require stimulus precisely when the DUT is in a specific internal state, the agent must guess timing based on static RTL analysis. The probability of blind timing hitting a 1-cycle window in a 1000+ cycle simulation is vanishingly small.
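The "vanishingly small" claim can be made concrete with a simplified model. Assume, pessimistically, that each timing guess is uniformly random over the simulation and that attempts are independent. Real agents do better than uniform because they read the RTL, so treat this as a lower-bound sketch rather than a measured rate.

```python
def hit_probability(sim_cycles, window_cycles=1, attempts=1):
    """Chance that at least one of `attempts` uniformly random injection
    cycles lands inside a window of `window_cycles` cycles."""
    p_single = window_cycles / sim_cycles
    return 1.0 - (1.0 - p_single) ** attempts


# One blind attempt at a 1-cycle window in a 1000-cycle run: about 0.1%.
print(hit_probability(1000))
# Five independent blind attempts: still only about 0.5%.
print(hit_probability(1000, attempts=5))
```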
2. No Behavioral Model Synthesis → Broken Protocol Loops
Many designs implement protocols where the DUT's behavior depends on correct responses from external agents. Covering the DUT's receive/response-processing paths requires a testbench component that observes DUT outputs in real time and responds correctly: computing CRCs, echoing command indices, and maintaining protocol state.
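As an illustration of what "computing CRCs" and "echoing command indices" involve, the sketch below implements the CRC7 used on the SD command line (polynomial x⁷ + x³ + 1) and packs a 48-bit command or response frame. It is a reference calculation in Python, not a testbench component. The helper names are hypothetical, and a real behavioral model would also need bit-level timing, response-type handling, and CRC16 on the data lines.

```python
def crc7(data):
    """Bitwise CRC7 (polynomial x^7 + x^3 + 1, initial value 0), MSB first."""
    crc = 0
    for byte in data:
        for i in range(7, -1, -1):
            bit = (byte >> i) & 1
            msb = (crc >> 6) & 1
            crc = (crc << 1) & 0x7F
            if bit ^ msb:
                crc ^= 0x09
    return crc


def sd_frame(index, payload, from_host):
    """Pack a 48-bit CMD-line frame: start bit 0, direction bit,
    6-bit command index, 32-bit payload, CRC7, end bit 1."""
    first = (0x40 if from_host else 0x00) | (index & 0x3F)
    body = bytes([first]) + (payload & 0xFFFFFFFF).to_bytes(4, "big")
    return body + bytes([(crc7(body) << 1) | 1])


# CMD0 (GO_IDLE_STATE) from the host is the well-known frame 40 00 00 00 00 95.
print(sd_frame(0, 0, from_host=True).hex())

# A card-side reply that echoes the command index with a 32-bit status word
# (the status value here is an arbitrary placeholder).
print(sd_frame(17, 0x00000900, from_host=False).hex())
```

The DUT's CRC-match branch is only exercised when the card side produces the final byte correctly, which is why hand-written constant responses tend to miss it.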
3. No Formal Reachability → Wasted Compute on Dead Code
Agents spend 20–40% of post-95% compute attempting to cover lines that are provably unreachable from the top-level interface. A formal or static reachability pre-pass would eliminate this wasted compute entirely.
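A very lightweight form of such a pre-pass is a textual scan for named port connections that are tied to a constant or left unconnected. The sketch below is a regex heuristic under that assumption, not a formal reachability check. The instantiation text is a made-up example, and a production flow would work on a parsed or elaborated netlist instead.

```python
import re

# Matches named port connections such as `.clr(1'b0)`, `.en(0)`, or `.dbg()`.
TIED_OFF = re.compile(
    r"\.(\w+)\s*\(\s*((?:\d+)?'[sS]?[bBdDhHoO][0-9a-fA-F_xXzZ]+|\d+)?\s*\)"
)


def tied_off_ports(verilog_text):
    """Return {port: constant or None} for ports bound to a literal
    constant (value is the literal) or left unconnected (value is None)."""
    return {m.group(1): m.group(2) for m in TIED_OFF.finditer(verilog_text)}


# Hypothetical instantiation, for illustration only.
SAMPLE = """
fifo u_fifo (
    .clk(clk),
    .clr(1'b0),
    .din(data_in),
    .level()
);
"""

print(tied_off_ports(SAMPLE))
```

Lines gated only by a port reported here are candidates for exclusion before the agent starts targeting them.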
4. Specification-to-Stimulus Gap → Missing Verification Intent
LLMs read specifications and generate "obvious" stimulus. Specifications rarely describe corner cases explicitly; they describe intended behavior, not edge conditions. Bridging this gap requires RTL-aware reasoning (tracing gating conditions backward from uncovered lines to determine what input sequence could reach them).
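The backward tracing described here can be sketched as a graph walk. Assume a driver map has already been extracted from the RTL, giving for each signal the signals that appear in its driving expression. All signal names below are hypothetical. The walk reports which top-level inputs can influence a gating condition, but it does not by itself produce the values or timing needed.

```python
from collections import deque


def trace_to_ports(target, drivers, top_ports):
    """Breadth-first walk from a gating signal back to the top-level
    inputs in its structural fan-in."""
    seen, queue, reached = {target}, deque([target]), set()
    while queue:
        sig = queue.popleft()
        if sig in top_ports:
            reached.add(sig)
            continue
        for src in drivers.get(sig, ()):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return sorted(reached)


# Hypothetical driver map: signal -> signals used to compute it.
DRIVERS = {
    "cancel_taken": ["state_is_init0", "enable_reg", "seed_reg"],
    "enable_reg": ["bus_we", "bus_addr", "bus_wdata"],
    "seed_reg": ["bus_we", "bus_addr", "bus_wdata"],
    "state_is_init0": ["fsm_state"],
    "fsm_state": ["fsm_state", "enable_reg", "more_seed"],
    "more_seed": ["mixer_ready"],  # internal source with no top-level driver
}
TOP_PORTS = {"bus_we", "bus_addr", "bus_wdata"}

print(trace_to_ports("cancel_taken", DRIVERS, TOP_PORTS))
```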
Implications for CovAgent
The path to higher coverage is not "better prompting" or "more iterations"; it requires fundamentally richer agent capabilities: formal reachability pre-filtering, protocol BFM generation from specifications, and runtime-aware directed stimulus through simulation hooks or co-simulation.
| Capability Gap | Impact on Coverage | Potential Solution |
|---|---|---|
| Reachability Pre-pass | Eliminates 20-40% wasted post-95% compute | Static cone-of-influence analysis or lightweight formal check before agent starts targeting |
| BFM Generation | Unlocks 2-5pp coverage on protocol-heavy designs | Dedicated sub-agent or tool that synthesizes protocol-compliant response models from spec + DUT port observation |
| Runtime Observability | Enables precise FSM-state-aligned stimulus | Co-simulation hooks, SystemVerilog coverpoint-guided feedback, or assertion-based state detection in testbench |
| RTL Backward Tracing | Converts "uncovered line" → "required input sequence" | Tool that traces gating conditions from uncovered line back to top-level ports, providing agent with a concrete stimulus recipe |
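The difference that runtime observability makes can be shown with a toy model. The DUT below is a stand-in, not one of the evaluated designs: it waits for an unknown latency, spends exactly one cycle in INIT0, then finishes. An open-loop stimulus fixes its timing before simulation, while a closed-loop stimulus reacts to the observed state.

```python
def run(latency, stimulus):
    """Toy DUT: WAIT for `latency` cycles, INIT0 for exactly one cycle, then DONE.
    Returns True if `stimulus` asserted cancel during the INIT0 cycle."""
    hit = False
    for cycle in range(latency + 5):
        if cycle < latency:
            state = "WAIT"
        elif cycle == latency:
            state = "INIT0"
        else:
            state = "DONE"
        cancel = stimulus(cycle, state)
        hit |= (state == "INIT0" and cancel)
    return hit


def open_loop(cycle, state):
    """Timing chosen before simulation; ignores the DUT state."""
    return cycle == 10


def closed_loop(cycle, state):
    """Fires when the monitored FSM state matches the target window."""
    return state == "INIT0"


latencies = range(5, 21)
print(sum(run(n, open_loop) for n in latencies), "of", len(latencies))
print(sum(run(n, closed_loop) for n in latencies), "of", len(latencies))
```

The open-loop stimulus hits the window only when its guess happens to match the latency, whereas the closed-loop version hits it in every run.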