Vihaan Patel

CovAgent Evaluation Report

ReAct Framework vs Codex CLI: Automated Hardware Verification Coverage Closure

4 designs · GPT 5.2 · QuestaSim · Top-level stimulus only · February 2026

ReAct (CovAgent) Codex CLI

Cross-Design Summary

Avg Coverage (ReAct): 97.7%
Avg Coverage (Codex): 96.7%
Avg Token Ratio: 3.2× (Codex uses more)
Avg Time Ratio: 2.5× (Codex uses more)

Final Coverage by Design (%)

chacha_top: ReAct 99.4%, Codex 99.4%
ethmac: ReAct 98.6%, Codex 98.0%
trng_top: ReAct 97.4%, Codex 95.8%
sd_card_ctrl: ReAct 95.2%, Codex 93.4%

Token Efficiency (Tokens per Run)

chacha_top: ReAct 20,217, Codex 55,000
ethmac: ReAct 151,813, Codex 179,000
trng_top: ReAct 44,529, Codex 94,000
sd_card_ctrl: ReAct 38,956, Codex 95,000

Time to 95% Coverage (seconds)

chacha_top: ReAct 67s, Codex 139s
ethmac: ReAct 461s, Codex 982s
trng_top: ReAct 142s, Codex 357s
sd_card_ctrl: ReAct 211s, Codex n/a (never reached 95%)

Total Run Time (seconds)

chacha_top: ReAct 117s, Codex 323s
ethmac: ReAct 1,470s, Codex 1,635s
trng_top: ReAct 226s, Codex 521s
sd_card_ctrl: ReAct 244s, Codex 633s
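
The per-design Codex/ReAct ratios follow directly from the token and total-run-time figures above. A minimal Python sketch of that arithmetic, using only numbers reported in this section (the dictionary layout and the function name are illustrative, not part of CovAgent):

```python
# Tokens per run and total run time in seconds, copied from the summary figures.
RUNS = {
    "chacha_top":   {"react": (20_217, 117),    "codex": (55_000, 323)},
    "ethmac":       {"react": (151_813, 1_470), "codex": (179_000, 1_635)},
    "trng_top":     {"react": (44_529, 226),    "codex": (94_000, 521)},
    "sd_card_ctrl": {"react": (38_956, 244),    "codex": (95_000, 633)},
}


def codex_to_react_ratios(design: str) -> tuple:
    """Return (token_ratio, time_ratio), each computed as Codex / ReAct."""
    react_tokens, react_time = RUNS[design]["react"]
    codex_tokens, codex_time = RUNS[design]["codex"]
    return codex_tokens / react_tokens, codex_time / react_time


if __name__ == "__main__":
    for design in RUNS:
        tokens, seconds = codex_to_react_ratios(design)
        print(f"{design:13s} tokens x{tokens:.1f}  time x{seconds:.1f}")
```

A ratio above 1 means Codex used more of that resource than ReAct on the given design.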

Design-Level Comparison

chacha_top

ChaCha20 stream cipher: memory-mapped register interface, init/next FSM, 20-round quarter-round core

Coverage: ReAct 99.4% | Codex 99.4%
Tokens: ReAct 20K (2 iterations) | Codex 55K (2 iterations)
Time: ReAct 117s (67s to 95%) | Codex 323s (139s to 95%)
Verdict: Both agents converged to the same ceiling (99.4%) in exactly 2 iterations. ReAct did it with 2.7× fewer tokens and in 2.8× less time. The single remaining hole (chacha_core.v:566, a 32-bit block counter overflow) is identical for both and is practically unreachable from the top-level interface (it requires 2³² block operations).

Residual Coverage Holes (Shared)

Unreachable

chacha_core.v:566-567 - block1_ctr overflow carry. Counter seed hardwired to 0 at top level; reaching 0xFFFFFFFF requires 2³² next operations. Both agents correctly identified and classified this.
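
To see why 2³² next operations are treated as out of reach, a back-of-the-envelope estimate helps. The sketch below is illustrative only: the cycles-per-block figure and the simulator throughput are assumed placeholder values, not measurements from this evaluation.

```python
def wall_clock_seconds(operations: int, cycles_per_op: int,
                       sim_cycles_per_sec: float) -> float:
    """Estimated simulation wall-clock time for a run of back-to-back operations."""
    return operations * cycles_per_op / sim_cycles_per_sec


OPS_TO_OVERFLOW = 2 ** 32             # next operations needed to wrap a 32-bit counter
ASSUMED_CYCLES_PER_BLOCK = 20         # assumption: clock cycles per block operation
ASSUMED_SIM_CYCLES_PER_SEC = 1e6      # assumption: simulated cycles per wall-clock second

if __name__ == "__main__":
    seconds = wall_clock_seconds(OPS_TO_OVERFLOW,
                                 ASSUMED_CYCLES_PER_BLOCK,
                                 ASSUMED_SIM_CYCLES_PER_SEC)
    print(f"{seconds:,.0f} s (about {seconds / 3600:.1f} hours)")
```

Under these assumed figures the estimate is just under 24 hours of simulation for a single line, against total run times of a few minutes for this design. Different assumptions shift the number, but not the conclusion that a waiver is the practical answer.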

ethmac_eth_with_cop

Ethernet MAC controller: Wishbone bus, TX/RX DMA, MII/RMII PHY, MAC control frames, buffer descriptors

Coverage: ReAct 98.6% | Codex 98.0%
Tokens: ReAct 152K (stopped manually) | Codex 179K (5 iterations)
Time: ReAct 1470s (221s to 90%) | Codex 1635s (982s to 90%)
Verdict: ReAct reached 90% coverage in 221s vs Codex's 982s (4.4× faster to this milestone) and achieved slightly higher final coverage (98.6% vs 98.0%). The design is the most complex tested (3,482 coverable statements). Both agents stalled on the same class of holes: protocol-level corner cases requiring precise multi-signal timing across TX/RX DMA, MAC control, and Wishbone bus interactions.

Residual Coverage Holes

Excludable

eth_receivecontrol.v:337 - dead branch (ResetSlotTimer already consumed by RxReset).
eth_spram_256x32.v:229-230 - print_ram task never called by RTL.

Needs Effort / Protocol Corner Cases

eth_registers.v:794,797 - TxC interrupt sync requires precise TxCtrlEndFrm + StartTxDone + r_TxFlow timing.
eth_maccontrol.v:112 - muxed abort needs TxAbortIn rising while TxUsedDataOutDetected is already asserted.
eth_rxethmac.v:248 - broadcast detect at exact byte count/state alignment.
eth_wishbone.v:938-959+ - deeper DMA/burst/abort/underrun paths requiring precise TX/RX BD + master-bus handshake sequencing.

trng_top

True Random Number Generator: SHA-512 mixer, ChaCha20 CSPRNG, entropy sources, FIFO, debug subsystem

Coverage: ReAct 97.4% | Codex 95.8%
Tokens: ReAct 45K (5 iterations) | Codex 94K (3 iterations)
Time: ReAct 226s (142s to 95%) | Codex 521s (357s to 95%)
Verdict: ReAct outperformed Codex in both coverage (+1.6 pp) and efficiency (2.1× fewer tokens, 2.3× faster). The gap emerged in the 95–97% range, where ReAct ran 5 targeted iterations vs Codex's 3 broader ones. Both hit the same fundamental wall: integration-tied-off hardware (entropy0 disabled, SHA-512 ports unconnected) and FSM timing precision (CSPRNG cancel edges, mixer discard windows).

Residual Coverage Holes (Combined)

Unreachable: Tied-off Hardware

sha512_core.v:257-302 - state writeback ports unconnected in trng_mixer instantiation.
sha512_core.v:509 - work_factor_num hardwired to 0.
trng_mixer.v:668+ - entropy source 0 disabled at integration.
trng_csprng.v:263 - ready_we never driven high (dead code).

Needs Effort: FSM Timing Precision

trng_csprng.v:534,564 - CTRL_CANCEL from CTRL_INIT0/CTRL_NEXT0 requires !enable_reg||seed_reg asserted during a precise 1-cycle FSM window.
trng_mixer.v:1012+ - discard-driven early-exit branches require holding discard asserted across multiple consecutive cycles during active mixer operation.
chacha_core.v:682-688 - init during CTRL_DONE (re-init without reset) needs a deterministic reseed sequence.

sd-card-controller_top

SD Card Controller: Wishbone slave/master, SD CMD/DAT protocol, DMA, dual-clock FIFO, CRC generation

Coverage: ReAct 95.2% | Codex 93.4%
Tokens: ReAct 39K (5 iterations) | Codex 95K (3 iterations)
Time: ReAct 244s (211s to 95%) | Codex 633s (never hit 95%)
Verdict: This design produced the largest gap between agents. ReAct reached 95% (which Codex never did) and did so with 2.4× fewer tokens. The SD card controller is protocol-heavy: most remaining holes require a protocol-faithful behavioral model of the SD card (correct CRC7/CRC16, proper response framing, data tokens). Neither agent can synthesize such a model from scratch via stimulus alone; this represents the hardest category of coverage hole for LLM-based verification.

Residual Coverage Holes (Combined)

Unreachable: Tied-off / Defensive

generic_fifo_dc_gray.v:168,192 - FIFO clear ports hardwired to 1'b0 in sd_fifo_filler.v.
FSM defaults in sd_cmd_serial_host.v:134, sd_cmd_master.v:97, sd_data_master.v:74 - defensive branches for impossible state encodings.

Needs Effort: Protocol Behavioral Model Required

sd_cmd_serial_host.v:265 - CRC match requires a card-side response with correct CRC7.
sd_data_master.v:135,176 - TX FIFO underrun and transfer-complete with crc_ok require correct SD data protocol.
sd_data_serial_host.v:99,247+ - multi-block write continuation and 4-bit CRC shifting require protocol-correct card behavior.
sd_data_xfer_trig.v:36,43 - data transfer trigger needs correct command_reg programming with "with data" fields.
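
To make concrete what "a card-side response with correct CRC7" involves, here is a minimal Python sketch of the SD command-line CRC7 (generator polynomial x^7 + x^3 + 1) and the 48-bit command and R1-style response framing. The function names are illustrative and this is not the testbench code either agent produced; a usable card model would also need CRC16 on the data lines and per-command state.

```python
def crc7(data: bytes) -> int:
    """Bitwise CRC7 over data, MSB first, polynomial x^7 + x^3 + 1, initial value 0."""
    crc = 0
    for byte in data:
        for shift in range(7, -1, -1):
            bit = (byte >> shift) & 1
            msb = (crc >> 6) & 1
            crc = (crc << 1) & 0x7F
            if bit ^ msb:
                crc ^= 0x09  # low bits of the polynomial (x^3 + 1)
    return crc


def command_frame(index: int, argument: int) -> bytes:
    """Host-to-card frame: start bit 0, transmission bit 1, index, argument, CRC7, end bit 1."""
    body = bytes([0x40 | (index & 0x3F)]) + argument.to_bytes(4, "big")
    return body + bytes([(crc7(body) << 1) | 1])


def r1_response(index: int, card_status: int) -> bytes:
    """Card-to-host R1-style frame: start 0, transmission 0, echoed index, status, CRC7, end 1."""
    body = bytes([index & 0x3F]) + card_status.to_bytes(4, "big")
    return body + bytes([(crc7(body) << 1) | 1])


if __name__ == "__main__":
    print(command_frame(0, 0).hex())          # CMD0 with a zero argument
    print(r1_response(17, 0x00000900).hex())  # example status value, chosen arbitrarily
```

The CRC itself is small; the difficulty noted above is that the response must be produced bit-serially, at the right time, in reaction to whatever command the DUT actually sent.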


Coverage Hole Taxonomy

Across all 4 designs, every residual coverage hole falls into one of four categories. This taxonomy generalizes across designs and reveals where LLM-based stimulus generation fundamentally struggles.

Category | Description | Designs Affected | LLM Solvable?
Type 1: Integration Tied-Off | Hardware ports/signals hardwired or unconnected at the integration level. No top-level stimulus can reach these paths. | trng_top, sd-card-controller_top | No. Requires coverage exclusions or design changes.
Type 2: Infeasible Boundary | Code is theoretically reachable but requires astronomically many operations (2³² increments, timeout counters exceeding practical sim time). | chacha_top, trng_top | No. Inherent simulation limitation. Agent should recommend waiver.
Type 3: Protocol Model Gap | Coverage requires an external protocol-compliant behavioral model (SD card with CRC, Ethernet PHY) that responds correctly to DUT outputs. | sd-card-controller_top, ethmac | Partially. LLMs can generate simple BFMs but struggle with protocol-correct CRC computation and multi-phase handshakes.
Type 4: FSM Timing Precision | Path requires specific multi-signal timing alignment across a narrow FSM window (1-2 cycles). Random stimulus has vanishingly low probability of hitting these. | trng_top, ethmac, sd-card-controller_top | Partially. LLMs can reason about FSM transitions but cannot observe runtime state to time stimulus precisely.

Key Insights

Insight 1

The 95% Cliff: Where LLMs Hit the Wall

Both agents reach ~90% coverage quickly and comparably. The 90→95% range is where structured coverage feedback matters: ReAct's faster iteration loop outperforms here consistently. Above 95%, neither agent makes meaningful progress regardless of compute invested. This is not a model capability limit; it is a verification methodology limit of open-loop, stimulus-only approaches without behavioral models or internal observability.

Insight 2

Token Efficiency ≠ Iteration Count

ReAct uses 2.1–2.7× fewer tokens than Codex on three of the four designs (1.2× fewer on ethmac) while achieving equal or better coverage. Codex's hidden internal iteration (compile-fix loops within its sandbox) consumes significant tokens on error recovery that never reaches the coverage frontier.

Insight 3

Protocol Knowledge Is the Binding Constraint

The hardest residual holes require the testbench to act as a protocol-compliant peer (SD card, Ethernet PHY, bus master). LLMs can generate stimulus sequences that drive inputs, but they cannot easily synthesize correct closed-loop behavioral models that compute CRCs, maintain state machines, and respond to DUT outputs in real-time within a single testbench.

Insight 4

Coverage Feedback Quality > Quantity

ReAct's annotated source feedback (marking specific uncovered lines with surrounding context) outperforms Codex's approach of scanning full coverage reports. However, even high-quality feedback fails when the agent cannot reason about why a line is uncovered.
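
As an illustration of what annotated source feedback could look like, the sketch below marks uncovered lines and keeps a small window of surrounding context. The output format, the function name, and the sample RTL are hypothetical; the actual CovAgent feedback format is not specified in this report.

```python
from typing import Set


def annotate(source: str, uncovered: Set[int], context: int = 2) -> str:
    """Return only uncovered lines (marked '>>') plus `context` lines around each."""
    lines = source.splitlines()
    keep = set()
    for n in uncovered:
        lo = max(1, n - context)
        hi = min(len(lines), n + context)
        keep.update(range(lo, hi + 1))

    out = []
    previous = None
    for n in sorted(keep):
        if previous is not None and n != previous + 1:
            out.append("    ...")  # gap between non-adjacent excerpts
        marker = ">>" if n in uncovered else "  "
        out.append(f"{marker} {n:4d}: {lines[n - 1]}")
        previous = n
    return "\n".join(out)


if __name__ == "__main__":
    rtl = "\n".join([
        "always @(posedge clk) begin",
        "  if (rst) state <= IDLE;",
        "  else if (abort && busy)",
        "    state <= CANCEL;",
        "  else state <= next_state;",
        "end",
    ])
    print(annotate(rtl, uncovered={4}, context=1))
```

Compared with a full coverage report, an excerpt like this puts the guarding condition next to the missed line, which is the information an agent needs to reason about why the line was not hit.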

Insight 5

Agent Architecture Matters Less Than Environment

ReAct and Codex hit similar coverage ceilings on every design (final coverage within 1.8 pp of each other) and identify largely the same residual holes. The bottleneck is not in reasoning architecture but in what information the agent has access to (no runtime observability, no protocol reference implementations, no formal reachability analysis).

Insight 6

Unreachable Code Detection Is a Solved Subtask

Both agents reliably identify Types 1 and 2 holes from RTL (trace signal connectivity, check port bindings, analyze counter bounds). The challenge is that agents spend iterations attempting to cover these lines before concluding they are unreachable, wasting 20–40% of post-95% compute.


Why LLMs Fail at Corner-Case Stimulus Generation

Core Finding: LLM-driven stimulus generation is fundamentally an open-loop process: the agent generates input sequences without observing internal DUT state at runtime. The coverage holes that resist this approach are precisely those that require closed-loop interaction or precise temporal targeting.

1. No Runtime Observability → Blind Timing

LLMs generate stimulus at "compile time" of the testbench; all timing is pre-determined. For FSM corner cases that require stimulus precisely when the DUT is in a specific internal state, the agent must guess timing based on static RTL analysis. The probability of blind timing hitting a 1-cycle window in a 1000+ cycle simulation is vanishingly small.
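
The size of that probability can be estimated with a simple model. The sketch below assumes each attempt places the stimulus at a uniformly random, independent cycle, which is a simplification: an agent reasoning from RTL will usually do better than uniform, and repeated attempts are not truly independent.

```python
def hit_probability(window_cycles: int, sim_cycles: int, attempts: int = 1) -> float:
    """Chance that at least one of `attempts` blind placements lands in the window."""
    p_single = window_cycles / sim_cycles
    return 1.0 - (1.0 - p_single) ** attempts


if __name__ == "__main__":
    for attempts in (1, 10, 100):
        p = hit_probability(window_cycles=1, sim_cycles=1000, attempts=attempts)
        print(f"{attempts:3d} attempt(s): {p:.4f}")
```

Under this model a single blind attempt at a 1-cycle window in 1000 cycles succeeds 0.1% of the time, and ten attempts still stay below 1%. Holes that need two or more signals aligned in the same window are harder still.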

2. No Behavioral Model Synthesis → Broken Protocol Loops

Many designs implement protocols where the DUT's behavior depends on correct responses from external agents. Covering the DUT's receive/response-processing paths requires a testbench component that observes DUT outputs in real-time and responds correctly: computing CRCs, echoing command indices, maintaining protocol state.

3. No Formal Reachability → Wasted Compute on Dead Code

Agents spend 20–40% of post-95% compute attempting to cover lines that are provably unreachable from the top-level interface. A formal or static reachability pre-pass would eliminate this wasted compute entirely.
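
A static pre-pass of this kind can be very lightweight for the tied-off cases. The sketch below does three-valued constant propagation over a toy combinational netlist; the netlist encoding, signal names, and function names are hypothetical, and it ignores sequential behavior, so it is a conservative illustration and not a substitute for a formal tool.

```python
from typing import Dict, Optional, Tuple

Value = Optional[int]  # 0 or 1 if constant, None if controllable or unknown

# Hypothetical toy netlist: signal -> (kind, operands...). Assumed acyclic.
NETLIST: Dict[str, Tuple] = {
    "wr_en":       ("input",),
    "full":        ("input",),
    "clr":         ("const", 0),                 # tied off at integration
    "not_full":    ("not", "full"),
    "guard_write": ("and", "wr_en", "not_full"),
    "guard_clear": ("and", "clr", "wr_en"),
}


def evaluate(signal: str, netlist: Dict[str, Tuple]) -> Value:
    """Propagate constants; top-level inputs are treated as free (None)."""
    kind, *args = netlist[signal]
    if kind == "input":
        return None
    if kind == "const":
        return args[0]
    vals = [evaluate(a, netlist) for a in args]
    if kind == "not":
        return None if vals[0] is None else 1 - vals[0]
    if kind == "and":
        if 0 in vals:
            return 0
        return 1 if all(v == 1 for v in vals) else None
    if kind == "or":
        if 1 in vals:
            return 1
        return 0 if all(v == 0 for v in vals) else None
    raise ValueError(f"unknown gate kind: {kind}")


def classify(guard: str, netlist: Dict[str, Tuple] = NETLIST) -> str:
    """A guard stuck at 0 can never fire; anything else needs a real attempt."""
    return "unreachable" if evaluate(guard, netlist) == 0 else "possibly reachable"


if __name__ == "__main__":
    for guard in ("guard_write", "guard_clear"):
        print(guard, "->", classify(guard))
```

Lines whose guard evaluates to constant 0 could be handed to the agent as exclusion candidates before it spends any iterations on them.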

4. Specification-to-Stimulus Gap → Missing Verification Intent

LLMs read specifications and generate "obvious" stimulus. Specifications rarely describe corner cases explicitly; they describe intended behavior, not edge conditions. Bridging this gap requires RTL-aware reasoning (tracing gating conditions backward from uncovered lines to determine what input sequence could reach them).
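
The first step of such backward tracing is finding which top-level ports can influence an uncovered line's guard at all. A minimal sketch, assuming a fan-in map has already been extracted from the RTL (the map, the port set, and all signal names here are hypothetical):

```python
from typing import Dict, List, Set

# Hypothetical fan-in map: signal -> signals it is computed from.
FANIN: Dict[str, List[str]] = {
    "guard":         ["state_is_idle", "abort_rise"],
    "state_is_idle": ["state"],
    "state":         ["start", "rst_n", "state"],  # register with feedback
    "abort_rise":    ["abort_in", "abort_q"],
    "abort_q":       ["abort_in"],
}
TOP_LEVEL_PORTS: Set[str] = {"start", "rst_n", "abort_in", "unused_cfg"}


def ports_in_cone(signal: str, fanin: Dict[str, List[str]],
                  ports: Set[str]) -> Set[str]:
    """Walk the fan-in cone backward and collect the top-level ports it reaches."""
    seen: Set[str] = set()
    found: Set[str] = set()
    stack = [signal]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # also breaks register feedback loops
        seen.add(current)
        if current in ports:
            found.add(current)
            continue
        stack.extend(fanin.get(current, []))
    return found


if __name__ == "__main__":
    print(sorted(ports_in_cone("guard", FANIN, TOP_LEVEL_PORTS)))
```

This only narrows down which ports matter. Turning that set into an ordered input sequence still requires reasoning about the conditions and timing along each path.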


Implications for CovAgent

The path to higher coverage is not "better prompting" or "more iterations"; it requires fundamentally richer agent capabilities: formal reachability pre-filtering, protocol BFM generation from specifications, and runtime-aware directed stimulus through simulation hooks or co-simulation.

Capability Gap | Impact on Coverage | Potential Solution
Reachability Pre-pass | Eliminates 20–40% wasted post-95% compute | Static cone-of-influence analysis or lightweight formal check before agent starts targeting
BFM Generation | Unlocks 2–5 pp coverage on protocol-heavy designs | Dedicated sub-agent or tool that synthesizes protocol-compliant response models from spec + DUT port observation
Runtime Observability | Enables precise FSM-state-aligned stimulus | Co-simulation hooks, SystemVerilog coverpoint-guided feedback, or assertion-based state detection in testbench
RTL Backward Tracing | Converts "uncovered line" → "required input sequence" | Tool that traces gating conditions from uncovered line back to top-level ports, providing agent with a concrete stimulus recipe