CovAgent Evaluation Report

ReAct Framework vs Codex CLI — Automated Hardware Verification Coverage Closure

4 designs · GPT 5.2 · QuestaSim · Top-level stimulus only · February 2026

Agents compared: ReAct (CovAgent) and Codex CLI

Cross-Design Summary

Metric	Value
Avg Coverage (ReAct)	97.7%
Avg Coverage (Codex)	96.7%
Avg Token Ratio	3.2× (Codex uses more)
Avg Time Ratio	2.5× (Codex uses more)

Final Coverage by Design (%)

Design	ReAct	Codex
chacha_top	99.4%	99.4%
ethmac	98.6%	98.0%
trng_top	97.4%	95.8%
sd_card_ctrl	95.2%	93.4%

Token Efficiency (Tokens per Run)

Design	ReAct	Codex
chacha_top	20,217	55,000
ethmac	151,813	179,000
trng_top	44,529	94,000
sd_card_ctrl	38,956	95,000

Time to 95% Coverage (seconds)

Design	ReAct	Codex
chacha_top	67s	139s
ethmac	461s	982s
trng_top	142s	357s
sd_card_ctrl	211s	n/a (never reached 95%)

Total Run Time (seconds)

Design	ReAct	Codex
chacha_top	117s	323s
ethmac	1,470s	1,635s
trng_top	226s	521s
sd_card_ctrl	244s	633s
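The per-design ratios quoted in the verdicts below can be recomputed from the token and run-time figures above. The following is a minimal sketch of that arithmetic; the dictionary and function names are illustrative and not part of CovAgent.

```python
# Per-design figures copied from the tables above, as (ReAct, Codex) pairs.
tokens = {
    "chacha_top": (20_217, 55_000),
    "ethmac": (151_813, 179_000),
    "trng_top": (44_529, 94_000),
    "sd_card_ctrl": (38_956, 95_000),
}
run_time_s = {
    "chacha_top": (117, 323),
    "ethmac": (1_470, 1_635),
    "trng_top": (226, 521),
    "sd_card_ctrl": (244, 633),
}


def codex_over_react(table):
    """Return {design: Codex / ReAct}, rounded to one decimal place."""
    return {d: round(codex / react, 1) for d, (react, codex) in table.items()}


token_ratio = codex_over_react(tokens)
time_ratio = codex_over_react(run_time_s)
for design in tokens:
    print(f"{design:14s} tokens {token_ratio[design]}x   time {time_ratio[design]}x")
```

On these figures the gap is widest on chacha_top and narrowest on ethmac, where the two agents used a similar number of tokens and a similar total run time.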

Design-Level Comparison

chacha_top

ChaCha20 stream cipher — memory-mapped register interface, init/next FSM, 20-round quarter-round core

Metric	ReAct	Codex
Coverage	99.4%	99.4%
Tokens	20K	55K
Iterations	2	2
Total time	117s	323s
Time to 95%	67s	139s

Verdict: Both agents converged to the same ceiling (99.4%) in exactly 2 iterations. ReAct did it in 2.7× fewer tokens and 2.8× less time. The single remaining hole (chacha_core.v:566 — 32-bit block counter overflow) is identical for both and is practically unreachable from the top-level interface (requires 2³² block operations).

Residual Coverage Holes (Shared)

Unreachable

chacha_core.v:566-567 — block1_ctr overflow carry. Counter seed hardwired to 0 at top level; reaching 0xFFFFFFFF requires 2³² next operations. Both agents correctly identified and classified this.
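To give a sense of scale for the 2³² figure, here is a rough back-of-the-envelope estimate. Only the operation count comes from the report; the cycles-per-block and simulator-throughput values are assumptions chosen for illustration, not measurements of this design or of QuestaSim.

```python
BLOCKS_TO_OVERFLOW = 2**32  # from the report: 32-bit block counter seeded with 0

# Hypothetical figures, used only to illustrate the order of magnitude.
CYCLES_PER_BLOCK = 30            # assumed clock cycles per "next" operation
SIM_CYCLES_PER_SECOND = 100_000  # assumed RTL simulation throughput

total_cycles = BLOCKS_TO_OVERFLOW * CYCLES_PER_BLOCK
sim_days = total_cycles / SIM_CYCLES_PER_SECOND / 86_400
print(f"{BLOCKS_TO_OVERFLOW:,} blocks, {total_cycles:.2e} cycles, "
      f"roughly {sim_days:.0f} days of simulation")
```

Under these assumptions a single run would take on the order of days to weeks, compared with the 117s and 323s runs reported above, which is why a waiver is the practical outcome.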

ethmac_eth_with_cop

Ethernet MAC controller — Wishbone bus, TX/RX DMA, MII/RMII PHY, MAC control frames, buffer descriptors

Metric	ReAct	Codex
Coverage	98.6%	98.0%
Tokens	152K	179K
Iterations	stopped manually	5
Total time	1,470s	1,635s
Time to 90%	221s	982s

Verdict: ReAct reached 90% coverage in 221s vs Codex's 982s (4.4× faster to this milestone) and achieved slightly higher final coverage (98.6 vs 98.0). The design is the most complex tested (3482 coverable statements). Both agents stalled on the same class of holes: protocol-level corner cases requiring precise multi-signal timing across TX/RX DMA, MAC control, and Wishbone bus interactions.

Residual Coverage Holes

Excludable

eth_receivecontrol.v:337 — dead branch (ResetSlotTimer already consumed by RxReset). eth_spram_256x32.v:229-230 — print_ram task never called by RTL.

Needs Effort / Protocol Corner Cases

eth_registers.v:794,797 — TxC interrupt sync requires precise TxCtrlEndFrm + StartTxDone + r_TxFlow timing. eth_maccontrol.v:112 — muxed abort needs TxAbortIn rising while TxUsedDataOutDetected already asserted. eth_rxethmac.v:248 — broadcast detect at exact byte count/state alignment. eth_wishbone.v:938-959+ — deeper DMA/burst/abort/underrun paths requiring precise TX/RX BD + master-bus handshake sequencing.

trng_top

True Random Number Generator — SHA-512 mixer, ChaCha20 CSPRNG, entropy sources, FIFO, debug subsystem

Metric	ReAct	Codex
Coverage	97.4%	95.8%
Tokens	45K	94K
Iterations	5	3
Total time	226s	521s
Time to 95%	142s	357s

Verdict: ReAct outperformed Codex in both coverage (+1.6pp) and efficiency (2.1× fewer tokens, 2.3× faster). The gap emerged in the 95–97% range where ReAct ran 5 targeted iterations vs Codex's 3 broader ones. Both hit the same fundamental wall: integration-tied-off hardware (entropy0 disabled, SHA-512 ports unconnected) and FSM timing precision (CSPRNG cancel edges, mixer discard windows).

Residual Coverage Holes (Combined)

Unreachable — Tied-off Hardware

sha512_core.v:257-302 — state writeback ports unconnected in trng_mixer instantiation. sha512_core.v:509 — work_factor_num hardwired to 0. trng_mixer.v:668+ — entropy source 0 disabled at integration. trng_csprng.v:263 — ready_we never driven high (dead code).

Needs Effort — FSM Timing Precision

trng_csprng.v:534,564 — CTRL_CANCEL from CTRL_INIT0/CTRL_NEXT0 requires !enable_reg||seed_reg asserted during a precise 1-cycle FSM window. trng_mixer.v:1012+ — discard-driven early-exit branches require holding discard asserted across multiple consecutive cycles during active mixer operation. chacha_core.v:682-688 — init during CTRL_DONE (re-init without reset) needs deterministic reseed sequence.

sd-card-controller_top

SD Card Controller — Wishbone slave/master, SD CMD/DAT protocol, DMA, dual-clock FIFO, CRC generation

Metric	ReAct	Codex
Coverage	95.2%	93.4%
Tokens	39K	95K
Iterations	5	3
Total time	244s	633s
Time to 95%	211s	never hit 95%

Verdict: This design produced the largest gap between agents. ReAct reached 95% (which Codex never did) and achieved it in 2.4× fewer tokens. The SD card controller is protocol-heavy: most remaining holes require a protocol-faithful behavioral model of the SD card (correct CRC7/CRC16, proper response framing, data tokens). Neither agent can synthesize such a model from scratch via stimulus alone — this represents the hardest category of coverage hole for LLM-based verification.

Residual Coverage Holes (Combined)

Unreachable — Tied-off / Defensive

generic_fifo_dc_gray.v:168,192 — FIFO clear ports hardwired to 1'b0 in sd_fifo_filler.v. FSM defaults in sd_cmd_serial_host.v:134, sd_cmd_master.v:97, sd_data_master.v:74 — defensive branches for impossible state encodings.

Needs Effort — Protocol Behavioral Model Required

sd_cmd_serial_host.v:265 — CRC match requires card-side response with correct CRC7. sd_data_master.v:135,176 — TX FIFO underrun and transfer-complete with crc_ok require correct SD data protocol. sd_data_serial_host.v:99,247+ — multi-block write continuation and 4-bit CRC shifting require protocol-correct card behavior. sd_data_xfer_trig.v:36,43 — data transfer trigger needs correct command_reg programming with "with data" fields.

Coverage Hole Taxonomy

Across all 4 designs, every residual coverage hole falls into one of four categories. This taxonomy generalizes across designs and reveals where LLM-based stimulus generation fundamentally struggles.

Category	Description	Designs Affected	LLM Solvable?
Type 1 Integration Tied-Off	Hardware ports/signals hardwired or unconnected at the integration level. No top-level stimulus can reach these paths.	trng_top, sd-card-controller_top	No. Requires coverage exclusions or design changes.
Type 2 Infeasible Boundary	Code is theoretically reachable but requires astronomically many operations (2³² increments, timeout counters exceeding practical sim time).	chacha_top, trng_top	No. Inherent simulation limitation. Agent should recommend waiver.
Type 3 Protocol Model Gap	Coverage requires an external protocol-compliant behavioral model (SD card with CRC, Ethernet PHY) that responds correctly to DUT outputs.	sd-card-controller_top, ethmac	Partially. LLMs can generate simple BFMs but struggle with protocol-correct CRC computation and multi-phase handshakes.
Type 4 FSM Timing Precision	Path requires specific multi-signal timing alignment across a narrow FSM window (1-2 cycles). Random stimulus has vanishingly low probability of hitting these.	trng_top, ethmac, sd-card-controller_top	Partially. LLMs can reason about FSM transitions but cannot observe runtime state to time stimulus precisely.
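One way an agent could use this taxonomy is as a triage step that decides whether a hole deserves more stimulus at all. The sketch below is a hypothetical data structure, not CovAgent's implementation; the class names and action strings are illustrative, and the example holes are taken from the residual-hole lists above.

```python
from dataclasses import dataclass
from enum import Enum


class HoleType(Enum):
    INTEGRATION_TIED_OFF = 1
    INFEASIBLE_BOUNDARY = 2
    PROTOCOL_MODEL_GAP = 3
    FSM_TIMING_PRECISION = 4


# Suggested next step per category, paraphrasing the "LLM Solvable?" column.
NEXT_ACTION = {
    HoleType.INTEGRATION_TIED_OFF: "exclude from coverage or change the design",
    HoleType.INFEASIBLE_BOUNDARY: "recommend waiver",
    HoleType.PROTOCOL_MODEL_GAP: "build a protocol BFM",
    HoleType.FSM_TIMING_PRECISION: "state-aligned directed stimulus",
}


@dataclass(frozen=True)
class CoverageHole:
    location: str
    kind: HoleType

    @property
    def worth_more_stimulus(self) -> bool:
        # Types 1 and 2 cannot be closed by top-level stimulus in practice.
        return self.kind in (HoleType.PROTOCOL_MODEL_GAP,
                             HoleType.FSM_TIMING_PRECISION)


holes = [
    CoverageHole("chacha_core.v:566-567", HoleType.INFEASIBLE_BOUNDARY),
    CoverageHole("generic_fifo_dc_gray.v:168,192", HoleType.INTEGRATION_TIED_OFF),
    CoverageHole("sd_cmd_serial_host.v:265", HoleType.PROTOCOL_MODEL_GAP),
    CoverageHole("trng_csprng.v:534,564", HoleType.FSM_TIMING_PRECISION),
]
for h in holes:
    print(f"{h.location:32s} Type {h.kind.value}: {NEXT_ACTION[h.kind]}")
```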

Key Insights

Insight 1

The 95% Cliff: Where LLMs Hit the Wall

Both agents reach ~90% coverage quickly and comparably. The 90→95% range is where structured coverage feedback matters — ReAct's faster iteration loop outperforms here consistently. Above 95%, neither agent makes meaningful progress regardless of compute invested. This is not a model capability limit; it is a verification methodology limit of open-loop, stimulus-only approaches without behavioral models or internal observability.

Insight 2

Token Efficiency ≠ Iteration Count

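ReAct uses 2–3× fewer tokens than Codex on three of the four designs (about 1.2× fewer on ethmac) while achieving equal or better coverage on all four. Iteration count does not explain the gap: on chacha_top both agents ran 2 iterations, and on trng_top and sd-card-controller_top ReAct ran 5 iterations to Codex's 3 yet still used fewer tokens. Codex's hidden internal iteration (compile-fix loops within its sandbox) consumes significant tokens on error recovery that never advances coverage.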

Insight 3

Protocol Knowledge Is the Binding Constraint

The hardest residual holes require the testbench to act as a protocol-compliant peer (SD card, Ethernet PHY, bus master). LLMs can generate stimulus sequences that drive inputs, but they cannot easily synthesize correct closed-loop behavioral models that compute CRCs, maintain state machines, and respond to DUT outputs in real-time within a single testbench.

Insight 4

Coverage Feedback Quality > Quantity

ReAct's annotated source feedback (marking specific uncovered lines with surrounding context) outperforms Codex's approach of scanning full coverage reports. However, even high-quality feedback fails when the agent cannot reason about why a line is uncovered.

Insight 5

Agent Architecture Matters Less Than Environment

ReAct and Codex finish within 2 percentage points of each other on every design and stall on the same classes of residual holes. The bottleneck is not in reasoning architecture but in what information the agent has access to (no runtime observability, no protocol reference implementations, no formal reachability analysis).

Insight 6

Unreachable Code Detection Is a Solved Subtask

Both agents reliably identify Type 1 and Type 2 holes from RTL (trace signal connectivity, check port bindings, analyze counter bounds). The remaining cost is one of ordering: agents spend iterations attempting to cover these lines before concluding they are unreachable, wasting 20–40% of post-95% compute.

Why LLMs Fail at Corner-Case Stimulus Generation

Core Finding: LLM-driven stimulus generation is fundamentally an open-loop process — the agent generates input sequences without observing internal DUT state at runtime. The coverage holes that resist this approach are precisely those that require closed-loop interaction or precise temporal targeting.

1. No Runtime Observability → Blind Timing

LLMs generate stimulus at "compile time" of the testbench — all timing is pre-determined. For FSM corner cases that require stimulus precisely when the DUT is in a specific internal state, the agent must guess timing based on static RTL analysis. The probability of blind timing hitting a 1-cycle window in a 1000+ cycle simulation is vanishingly small.
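The scale of the blind-timing problem can be illustrated with an idealized model. The sketch below assumes each blind attempt places its stimulus uniformly at random in the simulation and that signals align independently; real agents do better than uniform by reading the RTL, so this is an illustration of the trend, not a measurement.

```python
def hit_probability(window_cycles, sim_cycles, signals=1, attempts=1):
    """Chance that at least one blind attempt lands inside the target window.

    Idealized model (an assumption): each signal is timed independently and
    uniformly at random over the simulation, so one attempt succeeds with
    probability (window / sim_length) ** signals.
    """
    p_single = (window_cycles / sim_cycles) ** signals
    return 1.0 - (1.0 - p_single) ** attempts


# One signal, 1-cycle window, 1000-cycle simulation.
print(hit_probability(1, 1000))
# Two signals that must align in the same cycle.
print(hit_probability(1, 1000, signals=2))
# The same two-signal case with 100 independent blind attempts.
print(hit_probability(1, 1000, signals=2, attempts=100))
```

In this model, a single signal hits a 1-cycle window about once per thousand attempts, but each additional signal that must align multiplies the odds by another factor of the simulation length, which is the regime the multi-signal holes above fall into.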

2. No Behavioral Model Synthesis → Broken Protocol Loops

Many designs implement protocols where the DUT's behavior depends on correct responses from external agents. Covering the DUT's receive/response-processing paths requires a testbench component that observes DUT outputs in real-time and responds correctly — computing CRCs, echoing command indices, maintaining protocol state.
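As an illustration of what such a component must do, here is a minimal Python sketch of the card side of an SD command exchange: validate the host's CRC7, echo the command index, and seal the reply with its own CRC7. It assumes the standard 48-bit SD command/R1 frame layout and the CRC7 polynomial x^7 + x^3 + 1; the function names, the RCA argument, and the status word are illustrative, and a real BFM would live in the testbench and drive the CMD line bit by bit.

```python
def crc7(data: bytes) -> int:
    """CRC7 with polynomial x^7 + x^3 + 1, initial value 0, MSB first."""
    crc = 0
    for byte in data:
        for bit in range(7, -1, -1):
            feedback = ((byte >> bit) & 1) ^ (crc >> 6)
            crc = (crc << 1) & 0x7F
            if feedback:
                crc ^= 0x09
    return crc


def seal(first_five: bytes) -> bytes:
    """Append CRC7 and the end bit (1) to the first 40 bits of a frame."""
    return first_five + bytes([(crc7(first_five) << 1) | 1])


def host_command(index: int, argument: int) -> bytes:
    # Start bit 0, transmission bit 1 (host to card), 6-bit command index.
    return seal(bytes([0x40 | (index & 0x3F)]) + argument.to_bytes(4, "big"))


def command_is_valid(frame: bytes) -> bool:
    return (len(frame) == 6 and (frame[0] & 0xC0) == 0x40
            and (frame[5] & 1) == 1 and crc7(frame[:5]) == (frame[5] >> 1))


def r1_response(command: bytes, card_status: int) -> bytes:
    """Card-side R1 reply: echoed command index, status word, CRC7, end bit."""
    if not command_is_valid(command):
        raise ValueError("bad framing or CRC7; a card would not respond")
    # Start bit 0, transmission bit 0 (card to host), echoed command index.
    return seal(bytes([command[0] & 0x3F]) + card_status.to_bytes(4, "big"))


cmd13 = host_command(13, 0x0001_0000)    # SEND_STATUS; RCA value is illustrative
reply = r1_response(cmd13, 0x0000_0900)  # status word chosen for illustration
print(cmd13.hex(), reply.hex())
```

The point of the sketch is the dependency: the reply cannot be written ahead of time, because its index and CRC depend on what the DUT actually sent.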

3. No Formal Reachability → Wasted Compute on Dead Code

Agents spend 20–40% of post-95% compute attempting to cover lines that are provably unreachable from the top-level interface. A formal or static reachability pre-pass would eliminate this wasted compute entirely.
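A very lightweight form of such a pre-pass is a textual scan for ports that are tied to constants or left unconnected at instantiation. The sketch below is a simplification under stated assumptions: it uses regular expressions rather than a Verilog parser, handles only named port connections without nested parentheses, and the instantiation snippet is hypothetical (written for illustration, not copied from the evaluated RTL).

```python
import re

# Named port connection: .port_name ( expression-without-parentheses )
PORT_RE = re.compile(r"\.(\w+)\s*\(\s*([^()]*?)\s*\)")
# Verilog constant literal such as 1'b0, 32'hDEAD_BEEF, or a bare decimal.
CONST_RE = re.compile(r"^(\d+)?'[sS]?[bBdDhHoO][0-9a-fA-F_xXzZ]+$|^\d+$")


def tied_off_ports(instantiation: str) -> dict:
    """Map port name -> reason, for ports that no stimulus can influence."""
    result = {}
    for name, expr in PORT_RE.findall(instantiation):
        if expr == "":
            result[name] = "unconnected"
        elif CONST_RE.match(expr):
            result[name] = f"constant {expr}"
    return result


# Hypothetical instantiation; signal and instance names are illustrative.
SNIPPET = """
generic_fifo_dc_gray u_fifo (
    .rd_clk (sd_clk),
    .wr_clk (wb_clk),
    .clr    (1'b0),
    .din    (wb_dat),
    .dout   ()
);
"""

print(tied_off_ports(SNIPPET))
```

Lines whose gating conditions depend only on ports flagged this way are candidates for exclusion before the agent spends any iterations on them; a production flow would confirm them with a formal or cone-of-influence check.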

4. Specification-to-Stimulus Gap → Missing Verification Intent

LLMs read specifications and generate "obvious" stimulus. Specifications rarely describe corner cases explicitly — they describe intended behavior, not edge conditions. Bridging this gap requires RTL-aware reasoning (tracing gating conditions backward from uncovered lines to determine what input sequence could reach them).

Implications for CovAgent

The path to higher coverage is not "better prompting" or "more iterations" — it requires fundamentally richer agent capabilities: formal reachability pre-filtering, protocol BFM generation from specifications, and runtime-aware directed stimulus through simulation hooks or co-simulation.

Capability Gap	Impact on Coverage	Potential Solution
Reachability Pre-pass	Eliminates 20-40% wasted post-95% compute	Static cone-of-influence analysis or lightweight formal check before agent starts targeting
BFM Generation	Unlocks 2-5pp coverage on protocol-heavy designs	Dedicated sub-agent or tool that synthesizes protocol-compliant response models from spec + DUT port observation
Runtime Observability	Enables precise FSM-state-aligned stimulus	Co-simulation hooks, SystemVerilog coverpoint-guided feedback, or assertion-based state detection in testbench
RTL Backward Tracing	Converts "uncovered line" → "required input sequence"	Tool that traces gating conditions from uncovered line back to top-level ports, providing agent with a concrete stimulus recipe
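The backward-tracing idea in the last row can be sketched as a graph walk from the gating condition of an uncovered line back to the top-level inputs that influence it. The example below is a minimal illustration: the driver map is hand-written and hypothetical (only enable_reg and seed_reg are names from the report), whereas a real tool would extract it from elaborated RTL.

```python
from collections import deque


def trace_to_ports(target, drivers, top_level_inputs):
    """Return the top-level inputs in the fan-in cone of `target`.

    `drivers` maps each signal to the signals appearing in its driving
    expression. Signals with no drivers that are not top-level inputs
    (for example, tie-offs) are dead ends and are ignored.
    """
    seen, queue, ports = {target}, deque([target]), set()
    while queue:
        sig = queue.popleft()
        if sig in top_level_inputs:
            ports.add(sig)
            continue
        for src in drivers.get(sig, ()):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return ports


# Hypothetical driver map for a cancel branch gated by enable_reg and seed_reg.
drivers = {
    "cancel_taken": ["enable_reg", "seed_reg", "ctrl_state"],
    "enable_reg": ["wr_en", "addr", "wdata"],
    "seed_reg": ["wr_en", "addr", "wdata"],
    "ctrl_state": ["ctrl_state", "enable_reg", "more_seed"],
    "more_seed": ["tied_low"],
}
top_level_inputs = {"wr_en", "addr", "wdata", "rd_en"}

print(sorted(trace_to_ports("cancel_taken", drivers, top_level_inputs)))
```

An empty result for a target would mark it as unreachable from the top level, so the same walk could also serve as the reachability pre-pass listed in the first row.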

Vihaan Patel
