How AI Models Change Code Hunter's Audit Behavior on the Same Project: Vexus Multi-Language Virtual Currency Exchange

Introduction

Security-audit reports generated by AI systems are shaped by more than the scanned codebase. They are also shaped by the model's confirmation threshold, the evidence format it prefers, the kinds of risks it prioritizes, and the way it handles uncertainty. The Vexus multi-language virtual currency exchange reports provide a useful case study because three Code Hunter versions reviewed the same project family while using different model configurations.

The three source reports cover a Java-based cryptocurrency or virtual-currency exchange platform with backend modules for API access, administration, data services, wallet operations, trading, deposits, withdrawals, DApp functionality, Redis/MySQL infrastructure, backup utilities, and WebSocket delivery. The reports differ materially in scope, finding volume, severity labeling, and confirmation style.

The comparison is not a benchmark of model "accuracy" in the abstract. The source reports do not provide a ground-truth vulnerability set, dynamic exploit validation, or a single normalized severity taxonomy across all runs. This report therefore compares audit behavior as reflected in the reports: what each model surfaced, how it classified evidence, how it separated confirmed issues from hypotheses, and what those differences imply for Code Hunter users.

Source material and comparison method

This analysis uses only the three original Code Hunter reports listed in the final "Source reports" section. No pre-filtered summaries, external vulnerability databases, or invented metrics are used.

Where the source report states a precise value, such as the number of confirmed findings, that value is reproduced. Where a comparison is based on reading the finding descriptions, functional binding tables, or repeated report structure, it is described qualitatively and identified as an inference from the report text.

The three compared runs are:

| Code Hunter report | Model configuration stated in report | Reviewed findings stated in report | Confirmation distribution stated in report |
| --- | --- | --- | --- |
| Code Hunter 3.1.75 | mimo-v2.5-pro | 37 | 26 Confirmed, 5 Highly Likely, 6 Needs More Evidence, 0 Rejected |
| Code Hunter 3.1.80-dev.0 | GPT 5.5 | 19 | 5 Confirmed, 9 Highly Likely, 5 Needs More Evidence, 0 Rejected |
| Code Hunter 3.1.81-dev.0 | auditor_1=mimo-v2.5-pro, auditor_2=deepseek-v4-pro, reviewer=deepseek-v4-pro | 18 | 13 Confirmed, 3 Highly Likely, 2 Needs More Evidence, 0 Rejected |

The 3.1.81 report also states that only 18 of 69 identified candidates were reviewed and that 51 remained untriaged. That limitation is important: its higher confirmation ratio should not be read as full-project coverage.
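The relationship between confirmation ratio and review coverage can be made concrete with a little arithmetic. The sketch below uses only the counts stated above. The `RunSummary` class and `describe` helper are illustrative names introduced here, not part of Code Hunter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunSummary:
    """Counts as stated in each report (illustrative container, not a Code Hunter type)."""
    version: str
    reviewed: int
    confirmed: int
    candidates: Optional[int] = None  # a candidate total is only stated for 3.1.81

    @property
    def confirmation_ratio(self) -> float:
        # Share of reviewed findings that reached Confirmed.
        return self.confirmed / self.reviewed

    @property
    def review_coverage(self) -> Optional[float]:
        # Share of identified candidates that were reviewed at all.
        if self.candidates is None:
            return None
        return self.reviewed / self.candidates


RUNS = [
    RunSummary("3.1.75", reviewed=37, confirmed=26),
    RunSummary("3.1.80-dev.0", reviewed=19, confirmed=5),
    RunSummary("3.1.81-dev.0", reviewed=18, confirmed=13, candidates=69),
]


def describe(run: RunSummary) -> str:
    line = (f"{run.version}: confirmed {run.confirmed}/{run.reviewed} "
            f"({run.confirmation_ratio:.1%})")
    if run.review_coverage is None:
        return line + ", candidate total not stated"
    return line + (f", reviewed {run.reviewed}/{run.candidates} candidates "
                   f"({run.review_coverage:.1%})")


for run in RUNS:
    print(describe(run))
```

On these figures, 3.1.81 confirms roughly 72% of what it reviewed, yet it reviewed only about 26% of its identified candidates. The two numbers answer different questions and should be reported together.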

Detection coverage and severity distribution

Code Hunter 3.1.75 / mimo-v2.5-pro: broadest surface coverage

The 3.1.75 report is the broadest of the three by stated reviewed finding count. It evaluates 37 findings and confirms 26. It covers a wide set of vulnerability classes: OS command injection, arbitrary file read, unsafe upload, weak password hashing, hardcoded secrets, permissive CORS, TLS hostname-verification issues, XSS filtering weakness, deprecated cryptography, token leakage, CSRF gaps, default-allow authorization, dependency risk, Redis-related race conditions, withdrawal double-spend, recharge double-credit, C2C and contract race conditions, and other financial workflow risks.

Its severity distribution for confirmed findings is stated as 4 Critical, 16 High, and 6 Medium. The report text also identifies the four critical findings as OS command injection in database backup, arbitrary file read via path traversal, wallet double-spending through a Redis race condition, and withdrawal double-spend.

A small reporting-quality caveat matters here: the 3.1.75 severity table states 16 High findings, but the listed High-severity IDs appear to contain more entries than that stated count. This analysis does not normalize or correct the figure. It preserves the source report's stated severity distribution and notes the inconsistency as a report-format issue.

The model behavior reflected in 3.1.75 is expansive. It attempts to build a systemic risk map across the application rather than only reporting a small number of tightly closed bugs. It is particularly strong at identifying cross-cutting architectural risk clusters: Redis-first asynchronous wallet updates, JVM-local locks, default-allow authorization, token exposure, and file-handling weaknesses.

Code Hunter 3.1.80-dev.0 / GPT 5.5: narrower confirmation, more conservative triage

The 3.1.80-dev.0 report reviews 19 findings and confirms 5. It classifies 9 as Highly Likely and 5 as Needs More Evidence. This is the most conservative report in the set by confirmed finding count.

The confirmed findings are path traversal / arbitrary file read, unsafe file upload, missing API authentication / authorization on WebSocket subscription, hardcoded plaintext secrets, and an async principal-binding gap. The report's functional binding table marks each of the confirmed findings as High severity. The report does not present a separate aggregate severity-distribution table; the observation that confirmed issues are High severity is inferred from the listed confirmed rows.

The 3.1.80 behavior is strongly shaped by confirmation discipline. Many issues that 3.1.75 treated as confirmed or material are kept in Highly Likely or Needs More Evidence when GPT 5.5 does not see full end-to-end controllability, runtime context, or a policy baseline. For example, CORS is treated as Highly Likely rather than confirmed because the browser-credential and endpoint-sensitivity context is not fully established. Known vulnerable dependency risk is held at Needs More Evidence because the report does not close the proof from CVE to reachable sink.

This model configuration appears useful when users want a smaller set of issues that are easier to assign directly to engineering teams. Its cost is coverage breadth: the report gives fewer confirmed business-logic and financial-integrity findings than the 3.1.75 and 3.1.81 reports.

Code Hunter 3.1.81-dev.0 / multi-auditor reviewer flow: stronger evidence tables, filtered scope

The 3.1.81-dev.0 report reviews 18 finding candidates and confirms 13. It classifies 3 as Highly Likely and 2 as Needs More Evidence. It states that confirmed severity is uniformly High.

Its confirmed findings include OS command injection in backup utilities and command utilities, persistent API tokens without expiration, hardcoded backup/SFTP credential decryption keys, Redis initialized without authentication, unauthenticated data WebSocket access, automatic 302 redirect following in an HTTP helper, wallet transfer recipient-eligibility gaps, caller-supplied wallet transfer fees, non-atomic cross-symbol exchange execution with a sleep gap, in-memory exchange locks rather than distributed locks, spot trigger orders that skip wallet debit at creation, and admin override of recharge success amount.

This report has the strongest explicit confirmation standard. It states that a finding reaches Confirmed only when all evidence points have verified file paths, symbols, and line spans, and its confirmed findings are presented as Source → Transit → Sink → Failed/Missing Control evidence chains.
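The confirmation rule described above can be sketched as a simple gate. This is a minimal illustration and not Code Hunter's implementation: the `EvidencePoint` type, the `passes_evidence_gate` function, the requirement that all four chain roles be present, and every file path, symbol, and line number in the example are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Chain roles named in the 3.1.81 report: Source, Transit, Sink, Failed/Missing Control.
REQUIRED_ROLES = ("source", "transit", "sink", "control")


@dataclass(frozen=True)
class EvidencePoint:
    role: str
    file_path: Optional[str]
    symbol: Optional[str]
    line_span: Optional[Tuple[int, int]]

    def is_verified(self) -> bool:
        # A point counts as verified only if path, symbol, and a sane line span are all present.
        if not self.file_path or not self.symbol or self.line_span is None:
            return False
        start, end = self.line_span
        return 1 <= start <= end


def passes_evidence_gate(points: List[EvidencePoint]) -> bool:
    """Necessary condition for Confirmed: every role present, every point verified."""
    roles_present = {p.role for p in points}
    has_all_roles = all(role in roles_present for role in REQUIRED_ROLES)
    return has_all_roles and all(p.is_verified() for p in points)


# Hypothetical evidence chain; none of these paths or line numbers come from the reports.
complete_chain = [
    EvidencePoint("source", "example/BackupController.java", "startBackup", (41, 48)),
    EvidencePoint("transit", "example/BackupService.java", "buildCommand", (77, 90)),
    EvidencePoint("sink", "example/CommandRunner.java", "exec", (15, 19)),
    EvidencePoint("control", "example/BackupService.java", "buildCommand", (77, 90)),
]
unverified_chain = complete_chain[:3] + [
    EvidencePoint("control", "example/BackupService.java", None, None),
]

print(passes_evidence_gate(complete_chain))    # True
print(passes_evidence_gate(unverified_chain))  # False
```

A finding that fails the gate is not discarded. Under the tiering the reports use, it would fall to Highly Likely or Needs More Evidence until the missing evidence point is supplied.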

At the same time, the report is explicit about its coverage limitation: only 18 of 69 candidates were reviewed. It also notes cross-model divergence inside the run: two independent auditors produced disjoint finding sets, and one auditor's larger set of findings was not incorporated because of evidence-completeness concerns. Disclosing these limitations is valuable because it makes the report's confidence boundary visible.

How the models handle false positives and uncertainty

The most important behavioral difference across the reports is not raw finding count. It is how each model decides whether a candidate is confirmed.

The 3.1.75 report uses a four-tier confirmation structure and includes Highly Likely and Needs More Evidence sections with missing evidence and upgrade paths. It does not collapse all suspicious patterns into confirmed findings. However, it is the most assertive of the three reports: 26 of 37 reviewed findings are confirmed. The report is comfortable confirming systemic issues when it sees source-level evidence for missing controls and plausible business impact. That makes it effective for broad triage, but it also increases the need for downstream validation when runtime configuration or exploit preconditions matter.

The 3.1.80 report is more restrictive. It repeatedly uses phrases such as "missing end-to-end controller-to-caller closure," "runtime gateway/container limits not confirmed," and "policy baseline absent." This model treats incomplete exploitability proof as a reason to hold findings below Confirmed even when unsafe code patterns are visible. That reduces false-positive risk, but it may also under-report material business risk if the missing context is difficult to prove statically.

The 3.1.81 report combines a high confirmation ratio with unusually explicit limitations. It confirms many findings, but it also documents unreviewed candidates, dynamic-testing gaps, frontend exclusion, and database-schema parsing limitations. Its uncertainty handling is strongest where it separates "observable control deficiency" from "confirmed exploit path." For example, the admin XSS filter is treated as Needs More Evidence because the disabled filter is visible, but downstream output rendering is not verified. Fastjson AutoType is also held as Needs More Evidence because the risky configuration is visible, but attacker-controlled deserialization is not traced.

In practical terms, 3.1.80 is the most conservative about false positives, 3.1.75 is the most expansive, and 3.1.81 is the most explicit about confirmation boundaries.

Reasoning style and evidence backing

3.1.75: systemic security narrative

The 3.1.75 report reads like a full audit narrative. It does not only enumerate vulnerabilities; it builds risk clusters and remediation priorities. Its most developed theme is concurrency and idempotency risk. The report repeatedly connects Redis-first wallet updates, asynchronous database synchronization, missing distributed locks, and JVM-local locks into a broader financial-integrity problem.

This reasoning style is useful for executive and architecture-level remediation. It shows how individual bugs compose into systemic risk. For example, it links token leakage, path traversal, and log exposure into account-takeover risk, and it links CORS, token-in-parameter authentication, and missing CSRF into cross-site financial-operation risk.

The weakness of this style is that it can blur the boundary between a confirmed local code issue and a higher-order exploit chain. The report usually provides evidence fields, but some compound conclusions depend on the interaction of several findings and deployment assumptions. Those chains are useful, but users should read them as risk analysis rather than as a substitute for exploit reproduction.

3.1.80: compact contract-based review

The 3.1.80 report is more compact and formal. It uses a consistent template: entry point, risk surface, missing control, evidence and data flow, why confirmed, business impact, and remediation. Its Highly Likely and Needs More Evidence sections are strongly procedural, often stating what closure point is missing and what would be needed to upgrade the finding.

This style is good for reducing ambiguity. A user can quickly see why a finding is confirmed or why it was deferred. It is less useful for strategic risk mapping because it does not develop the same cross-system narrative as 3.1.75. It also confirms fewer domain-specific financial findings.

3.1.81: evidence-chain tables and reviewer filtering

The 3.1.81 report is the most structured and evidence-heavy. Each confirmed issue is framed around a pipeline: Source, Transit, Sink, and Failed/Missing Control. It provides file paths, symbols, and line spans, then explains why the evidence is sufficient.

This reasoning style is well suited to engineering handoff. A developer can locate the entry point and sink quickly. A security reviewer can see whether the model has merely recognized a dangerous pattern or actually traced it through the code.

The tradeoff is that the report's filtered scope must be taken seriously. It may be excellent for the candidates it reviewed, but it explicitly does not claim to have reviewed all identified candidates.

Confirmation quality: paths, lines, reproduction logic, and hypothesis separation

All three reports provide file paths and line evidence, but they differ in precision and consistency.

The 3.1.75 report provides extensive source-level references. Examples include MysqlBackupImpl.java, ShowImgController, ImageDispatcher.findFile, UploadImgController, AllRequestFilter, TokenServiceImpl, WalletServiceImpl, and WithdrawServiceImpl. It also includes reproduction logic for business-impact scenarios, such as concurrent withdrawals where two requests can both observe sufficient balance. Its confirmation categories are clear, but the report's breadth creates some consistency risk. The severity table inconsistency noted above is one example of a publication-quality issue that should be corrected before the report is presented as definitive.

The 3.1.80 report gives strong line evidence for its confirmed findings. Its path traversal finding traces request input from ShowImgController into ImageDispatcher.findFile and then to FileInputStream. Its upload finding identifies multipart handling, disabled validation, and persistence to disk. Its WebSocket findings cite WebSocketServer lifecycle methods and the absence of token, role, or ownership validation. However, its likely findings often use generic upgrade language, so they are less immediately actionable than the confirmed set.

The 3.1.81 report has the strongest confirmation mechanics. Its findings are structured around precise source, transit, sink, and control rows. It also avoids overstating issues where the sink or exploit path is incomplete. For example, it does not confirm Fastjson deserialization merely because AutoType is enabled; it requires proof that attacker-controlled JSON reaches deserialization. That is a stricter confirmation standard than recognizing a risky configuration alone.

On reproduction logic, none of the reports provide dynamic test results. 3.1.75 and 3.1.81 both discuss concurrency and race scenarios, but 3.1.81 explicitly states that dynamic testing was not performed. Concurrency findings should therefore be treated as source-supported risk claims rather than load-tested exploit demonstrations, unless they are separately reproduced.
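The concurrent-withdrawal scenario, in which two requests both observe sufficient balance, is a check-then-act race. The sketch below replays that interleaving deterministically in a single thread so the outcome is reproducible. It is not Vexus code: the `Wallet` class and both functions are hypothetical, and `debit_if_sufficient` merely stands in for whatever atomic primitive a real system would use (for example a database transaction or a distributed lock).

```python
class Wallet:
    """Toy in-memory wallet used only to illustrate the interleaving."""

    def __init__(self, balance: int) -> None:
        self.balance = balance

    def read_balance(self) -> int:
        return self.balance

    def write_balance(self, value: int) -> None:
        self.balance = value

    def debit_if_sufficient(self, amount: int) -> bool:
        # Check and debit as one indivisible step (stand-in for an atomic operation).
        if self.balance >= amount:
            self.balance -= amount
            return True
        return False


def racy_interleaving(wallet: Wallet, amount: int) -> int:
    """Replay two withdrawals where both reads happen before either write."""
    seen_by_a = wallet.read_balance()
    seen_by_b = wallet.read_balance()  # stale as soon as request A writes
    approved = 0
    if seen_by_a >= amount:
        wallet.write_balance(seen_by_a - amount)
        approved += 1
    if seen_by_b >= amount:
        wallet.write_balance(seen_by_b - amount)  # overwrites A's result
        approved += 1
    return approved


def atomic_sequence(wallet: Wallet, amount: int) -> int:
    """The same two withdrawals, each performed as a single check-and-debit step."""
    return sum(wallet.debit_if_sufficient(amount) for _ in range(2))


racy = Wallet(100)
print(racy_interleaving(racy, 80), racy.balance)   # 2 approvals, balance 20

safe = Wallet(100)
print(atomic_sequence(safe, 80), safe.balance)     # 1 approval, balance 20
```

In the racy replay, 160 units are approved against a starting balance of 100 while the stored balance still reads 20. A simulation like this shows why the pattern is dangerous, but it does not prove that a given deployment is exploitable, which is why the reports' concurrency findings still need runtime validation.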

Model strengths and weaknesses on this project

mimo-v2.5-pro in Code Hunter 3.1.75

The main strength of the 3.1.75/mimo report is breadth. It identifies the largest set of reviewed and confirmed findings and connects them across technical and business domains. It is particularly effective at detecting systemic financial-integrity themes: double-spend risk, asynchronous wallet-update risk, in-memory locking, and withdrawal-specific weaknesses.

Its second strength is prioritization. The report groups findings into clusters and gives remediation priorities that align with exchange risk: atomic balance operations, distributed locking, file-handling fixes, authentication modernization, default-deny authorization, dependency upgrades, secrets management, and mandatory secondary authentication.

Its weakness is that the high finding volume can increase triage burden. Some findings rely on source-visible absence of controls plus plausible runtime or business assumptions. That is valuable for discovery, but users should not treat every broad chain as equally confirmed without follow-up validation.

GPT 5.5 in Code Hunter 3.1.80-dev.0

The main strength of the GPT 5.5 report is caution. It is disciplined about closure. If a sink is unsafe but user controllability is not fully proven, or if policy context is missing, the finding tends to remain Highly Likely or Needs More Evidence.

This makes the report useful for teams that want a low-noise, engineering-ready confirmed set. The five confirmed findings have direct evidence and clear missing controls.

The weakness is under-coverage. The report confirms fewer findings and does not develop the deeper financial-logic issues that appear in the other reports. For a cryptocurrency exchange, where business logic and accounting integrity are central, this conservative behavior can leave important risk areas under-prioritized unless users also review the Highly Likely and Needs More Evidence sections.

Multi-auditor / deepseek reviewer flow in Code Hunter 3.1.81-dev.0

The 3.1.81 report's strength is confirmation quality. It is rigorous about evidence chains and gives strong handoff material: file path, symbol, line span, source, transit, sink, failed control, business impact, and remediation.

It is also strong on business logic. Compared with 3.1.80, it surfaces more exchange-specific issues: caller-controlled transfer fees, recipient eligibility validation gaps, non-atomic exchange legs, trigger orders without balance reservation, and admin recharge amount override.

Its weakness is filtered coverage. The report says only 18 of 69 identified candidates were reviewed. It also notes internal cross-model divergence and excludes one auditor's larger finding set because of evidence-completeness concerns. These disclosures are not a defect in the report; they mark the boundary of what was reviewed. Users should not interpret the 13 confirmed findings as the complete Vexus risk inventory.

What these differences mean for Code Hunter as a security-audit product

The three reports show that Code Hunter's value is not simply "AI finds vulnerabilities." The product's behavior changes depending on the model and review workflow. The same project can produce a broad systemic audit, a conservative closed-evidence report, or a structured reviewer-filtered report.

For a security-audit product, this has several implications.

First, model choice changes audit posture. Some models behave like broad reconnaissance analysts. Others behave like strict confirmers. A product should expose that distinction to users rather than hiding it behind a single "scan complete" label.

Second, confirmation tiering is essential. The reports are most useful where they separate Confirmed, Highly Likely, and Needs More Evidence. This prevents plausible but incomplete findings from being misrepresented as proven vulnerabilities. It also preserves useful leads that would otherwise be discarded.

Third, severity must be interpreted relative to the report's taxonomy. The 3.1.75 report uses Critical for several confirmed findings. The 3.1.81 report marks all confirmed findings High, even though some have direct financial-loss implications. A user comparing reports should not assume that absence of "Critical" means lower practical risk. Severity labels are not fully normalized across these reports.

Fourth, coverage is not the same as confirmation ratio. The 3.1.81 report confirms 13 of 18 reviewed candidates, but it explicitly leaves 51 candidates untriaged. The 3.1.75 report reviews more candidates and confirms more total findings. A high confirmation percentage is useful only when paired with scope.

Fifth, evidence quality is a product feature. File paths, line spans, source-to-sink traces, and missing-control explanations are what make AI audit output usable. The 3.1.81 report demonstrates a stronger direction for Code Hunter: reviewer-filtered findings with explicit source evidence and explicit limitations.

Practical guidance for Code Hunter users

Users should choose and interpret Code Hunter reports according to audit objective.

For first-pass security discovery, use the broadest model or workflow available. The 3.1.75 report shows the value of breadth: it identifies systemic clusters that a narrower report may not confirm. Treat this output as a risk map and triage backlog. Start with issues that combine direct source evidence, financial impact, and low mitigation ambiguity.

For engineering-ready tickets, prioritize confirmed findings with closed evidence chains. The 3.1.80 and 3.1.81 reports are stronger examples of this mode. Findings with precise entry points, sinks, line spans, and missing controls can be assigned to developers with less rework.

For business-logic review, do not rely only on generic vulnerability classes. In Vexus, some of the most important risks involve exchange-specific state transitions: wallet updates, withdrawal application, recharge confirmation, transfer fees, trigger orders, and cross-symbol execution. The 3.1.75 and 3.1.81 reports are stronger than 3.1.80 in this respect.

When reading a report, interpret the status layer before the severity label. A High-severity Needs More Evidence item is not equivalent to a High-severity Confirmed item. It may still be important, but it requires additional proof such as runtime configuration, exploit-path closure, policy baseline, database schema verification, or load testing.
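One way to apply "status before severity" is to sort a backlog by confirmation tier first and severity second. The sketch below is one possible ordering, not a rule taken from the reports. The `Finding` type, the rank tables, and the placeholder findings A through D are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List

# Lower rank sorts first. Tier names follow the reports; the numeric ranks are an assumption.
STATUS_RANK = {"Confirmed": 0, "Highly Likely": 1, "Needs More Evidence": 2, "Rejected": 3}
SEVERITY_RANK = {"Critical": 0, "High": 1, "Medium": 2}


@dataclass(frozen=True)
class Finding:
    title: str
    status: str
    severity: str


def triage_order(findings: List[Finding]) -> List[Finding]:
    """Sort by confirmation status first, then by severity within each status."""
    return sorted(
        findings,
        key=lambda f: (STATUS_RANK[f.status], SEVERITY_RANK[f.severity]),
    )


# Placeholder findings; titles, tiers, and severities are hypothetical.
sample = [
    Finding("A", "Needs More Evidence", "High"),
    Finding("B", "Confirmed", "Medium"),
    Finding("C", "Confirmed", "High"),
    Finding("D", "Highly Likely", "Critical"),
]

print([f.title for f in triage_order(sample)])  # ['C', 'B', 'D', 'A']
```

A strict status-first sort places a Confirmed Medium item above a Highly Likely Critical one. Teams that consider that too rigid could route non-confirmed Critical items into a separate validation queue instead of the same list.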

Do not ignore Highly Likely findings. In the Vexus reports, many Highly Likely findings are plausible and security-relevant. They are not rejected; they are deferred because one closure element is missing. Users should convert these into validation tasks rather than discarding them.

For concurrency and financial-integrity findings, perform dynamic validation. Static analysis can identify non-atomic reads, writes, sleeps, local locks, and missing distributed coordination. It cannot fully prove production race exploitability without testing deployment topology, database isolation, queue behavior, and concurrent request timing.

Finally, compare model outputs rather than treating one as authoritative. Overlap between reports is valuable because it strengthens confidence. Divergence is also valuable because it exposes blind spots: one model may find broad architectural risk, another may enforce stricter confirmation, and another may surface business-logic issues that the others missed.
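Comparing outputs amounts to computing overlap and divergence over normalized finding labels. The sketch below shows the set arithmetic on toy input. The report names and labels are hypothetical and are not a transcription of the Vexus reports. In practice the hard part is the normalization step that decides when two differently worded findings describe the same issue, which this sketch does not attempt.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def compare_reports(
    reports: Dict[str, Set[str]],
) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Return (labels present in every report, labels unique to each report)."""
    # Each report contributes a label at most once because its labels are a set.
    counts = Counter(label for labels in reports.values() for label in labels)
    shared = {label for label, n in counts.items() if n == len(reports)}
    unique = {
        name: {label for label in labels if counts[label] == 1}
        for name, labels in reports.items()
    }
    return shared, unique


# Toy input with hypothetical report names and hand-written labels.
toy_reports = {
    "report_x": {"command_injection", "path_traversal", "hardcoded_secret"},
    "report_y": {"path_traversal", "hardcoded_secret", "missing_websocket_auth"},
    "report_z": {"command_injection", "hardcoded_secret", "non_atomic_transfer"},
}

shared, unique = compare_reports(toy_reports)
print(sorted(shared))                               # ['hardcoded_secret']
print({k: sorted(v) for k, v in unique.items()})
```

Labels in `shared` are candidates for higher confidence, and labels in `unique` are candidates for targeted follow-up. Neither set replaces reading the underlying evidence.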

Balanced conclusion

The Vexus reports show that different AI models change Code Hunter's audit behavior in materially different ways.

The 3.1.75/mimo-v2.5-pro report behaves like a broad, system-level audit. It finds the most issues, identifies critical financial and infrastructure risk clusters, and provides remediation priorities. Its breadth is valuable, but it requires careful triage and validation.

The 3.1.80-dev.0/GPT 5.5 report behaves like a conservative verifier. It confirms fewer findings and moves many plausible issues into uncertainty tiers. This lowers noise and improves confidence in confirmed issues, but it can understate broader business-logic risk unless users inspect the non-confirmed sections.

The 3.1.81-dev.0 multi-auditor reviewer report behaves like a structured evidence engine. It provides the clearest source-to-sink confirmation and the most explicit limitations. It is strong for engineering handoff and business-logic findings, but its own scope note prevents treating it as a complete audit of all candidates.

For Code Hunter as a product, the lesson is straightforward: model diversity is useful, but only if the product makes confirmation standards, coverage boundaries, and evidence chains visible. For users, the best practice is to combine breadth-oriented scans with strict confirmation passes, then validate the highest-impact financial and concurrency findings in a controlled runtime environment.

Source reports

https://github.com/SEc-123/codehunter-docs/blob/main/docs/examples/codehunter-3.1.75-vexus-crypto-exchange-audit-report.md

https://github.com/SEc-123/codehunter-docs/blob/main/docs/examples/codehunter-3.1.80-dev.0-vexus-multi-language-virtual-currency-exchange-audit-report.md

https://github.com/SEc-123/codehunter-docs/blob/main/docs/examples/codehunter-3.1.81-dev.0-vexus-multi-language-virtual-currency-exchange-audit-report.md
