How Different AI Models Change Code Hunter's Audit Behavior on the Same Project: Happy Dou Dizhu Web Game

Introduction

Two Code Hunter audit reports examine the same target: the Happy Dou Dizhu (also written Happy Dou Di Zhu) browser-based multiplayer card game. Both reports use the full_deep audit depth and organize findings into layered confirmation categories, but they were generated by different Code Hunter versions and different AI models. The earlier report was generated by Code Hunter 3.1.75 using mimo-v2.5-pro; the later report was generated by Code Hunter 3.1.80-dev.0 using GPT 5.5. The comparison is therefore not a controlled model-only experiment: differences in Code Hunter's audit pipeline, report schema, and model behavior are all mixed together. Still, the two reports provide a useful side-by-side view of how model choice can change the audit surface, severity framing, and confirmation discipline for the same codebase.

The main difference is not simply that one report found "more" and the other found "less." The 3.1.75 / mimo-v2.5-pro report is broader and more expansive. It reviews 30 findings, marks 18 as confirmed, and labels 17 of those 18 Critical. The 3.1.80-dev.0 / GPT 5.5 report reviews 23 findings, confirms 12, and is more restrained about moving structurally suspicious patterns into the confirmed layer. It also focuses more explicitly on business-logic integrity, value-transfer atomicity, and governance controls.

Detection coverage and severity distribution

The 3.1.75 report reviewed 30 findings: 18 Confirmed, 6 Highly Likely, 4 Needs More Evidence, and 2 Rejected / Excluded. Its confirmed layer covers supply-chain update compromise, installer SQL injection, systemic SQL string concatenation, admin SQL execution, weak password hashing, missing cookie flags, IDOR on the score endpoint, unsafe file upload, plaintext credential storage, verbose error disclosure, mass assignment / config injection, weak CSRF enforcement, cookie reuse as a CSRF token, race conditions, chat flooding, and player data disclosure. Counting the severity labels shown in the report text, the confirmed layer contains 17 Critical findings and 1 Medium finding; the Highly Likely and Needs More Evidence layers are all Medium in the layered overview. That severity distribution is inferred from the report's tables, not from a separate metric supplied by the report.

The 3.1.80-dev.0 report reviewed 23 findings: 12 Confirmed, 5 Highly Likely, 5 Needs More Evidence, and 1 Rejected / Excluded. Its confirmed layer emphasizes direct admin SQL execution, weak updater transport and URL controls, weak password and token primitives, missing cookie attributes, room-join race conditions, recharge-code double consumption, daily gift idempotency failure, non-atomic round-start deductions, non-transactional match settlement, patch chain-of-custody weakness, and lack of business-object safeguards around admin SQL execution. Counting the severity labels shown in the report text, its confirmed layer contains 11 High findings and 1 Medium finding; its likely layer contains 1 High and 4 Medium findings; its Needs More Evidence layer contains 1 High, 3 Medium, and 1 Low finding. This is also inferred from the report text.
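The tier counts quoted for both reports can be cross-checked with a few lines of arithmetic. The sketch below is only a tally of the numbers stated in this article; the dictionary layout and the summarize function are illustrative names, not part of Code Hunter's report schema.

```python
# Tier counts as quoted in this article for each report.
TIER_COUNTS = {
    "3.1.75 / mimo-v2.5-pro": {
        "Confirmed": 18,
        "Highly Likely": 6,
        "Needs More Evidence": 4,
        "Rejected / Excluded": 2,
    },
    "3.1.80-dev.0 / GPT 5.5": {
        "Confirmed": 12,
        "Highly Likely": 5,
        "Needs More Evidence": 5,
        "Rejected / Excluded": 1,
    },
}


def summarize(counts):
    """Return (total findings, percentage of findings in the Confirmed tier)."""
    total = sum(counts.values())
    confirmed_pct = round(100 * counts["Confirmed"] / total, 1)
    return total, confirmed_pct


for report, counts in TIER_COUNTS.items():
    total, confirmed_pct = summarize(counts)
    print(f"{report}: {total} findings, {confirmed_pct}% confirmed")
```

The totals match the 30 and 23 findings stated above, and the confirmed share comes out at 60.0% for the earlier report and 52.2% for the later one. That ratio reflects how readily each run promoted findings to Confirmed, not how risky the project is.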

The severity schemes are not directly equivalent. The first report uses Critical / Medium, while the second uses High / Medium / Low. The difference matters because the 3.1.75 report tends to present many confirmed issues as Critical, while the 3.1.80-dev.0 report uses a narrower High label and reserves confirmation for more explicitly closed chains. A user comparing the two reports should therefore avoid treating "Critical" versus "High" as a strict model-to-model severity delta. The labels come from different report versions and likely different severity-normalization logic.

What each model surfaced

The 3.1.75 / mimo-v2.5-pro run is strongest as a broad application-security sweep. It identifies classic PHP web risks across installer exposure, SQL construction, authentication, cookie/session handling, admin configuration, upload handling, disclosure, and update delivery. It also separates candidate findings into confirmation tiers and provides detailed source-to-sink chains for several confirmed issues, such as the updater flow from remote XML to ZIP download and application-root extraction, and the installer SQL injection from account_name to a concatenated PDO::exec() statement.

The 3.1.80-dev.0 / GPT 5.5 run is more compact but more business-logic-oriented. It splits value-transfer and concurrency problems into specific confirmed findings: recharge code double consumption, daily gift multiple-credit risk, seat assignment races, round-start deduction atomicity, and match settlement accounting divergence. These are not merely generic "race condition" labels; they are mapped to concrete endpoints and state transitions such as /hall/buy, /hall/gift, /hall/get, /room/ready, and /room/play.

The two reports overlap on several important themes: updater trust failure, admin SQL execution, weak credentials or token primitives, missing cookie protections, and concurrency-sensitive game-state transitions. The difference is in decomposition. The 3.1.75 report often groups broad risk areas and assigns high-impact severity aggressively. The 3.1.80-dev.0 report tends to decompose business processes into narrower failure modes and then classify each by the strength of its evidence.

False-positive and uncertainty handling

Both reports use four confirmation tiers, but they handle uncertainty differently. The 3.1.75 report defines Confirmed as a complete source-to-sink evidence chain, Highly Likely as strong structural evidence dependent on runtime conditions, Needs More Evidence as preliminary indicators, and Rejected / Excluded as refuted or deprioritized candidates. It uses this framework throughout the report, including for installer exposure, second-order SQL injection, template cache poisoning, room lifecycle races, installer gating weakness, and settlement idempotency failure.

The 3.1.80-dev.0 report is stricter in several places where the earlier report is more assertive. A clear example is shared SQL construction. The 3.1.75 report confirms systemic SQL injection via string concatenation, stating that the database helper constructs SQL through where(), insert(), update(), fetch(), and delete() without prepared statements. The 3.1.80-dev.0 report treats string-built SQL in the shared DB helper as Highly Likely, noting that the unsafe pattern is present but that a full externally controllable exploit chain was not closed for specific endpoints.
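The pattern both reports are arguing about is easier to see in a small, self-contained example. The following sketch uses Python's sqlite3 module rather than the project's PHP database helper, and the player table, function names, and payload are hypothetical. It shows why string-built SQL is structurally unsafe, and also why a reviewer might still want to see attacker-controlled input reach the helper before calling a specific endpoint confirmed.

```python
import sqlite3

# In-memory demo database with a hypothetical player table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO player (name, score) VALUES (?, ?)",
    [("alice", 10), ("bob", 20)],
)


def find_concatenated(name):
    # Unsafe: the caller's input becomes part of the SQL text itself.
    sql = "SELECT name FROM player WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(sql)]


def find_parameterized(name):
    # Safe: the input is bound as a value and never parsed as SQL.
    rows = conn.execute("SELECT name FROM player WHERE name = ?", (name,))
    return [row[0] for row in rows]


payload = "x' OR '1'='1"
print(sorted(find_concatenated(payload)))  # every row matches the injected condition
print(find_parameterized(payload))         # no player has that literal name
```

The concatenated version is exploitable only if an external caller controls the name argument, which is exactly the link the later report says was not closed for specific endpoints.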

Another useful example is upload handling. The 3.1.75 report confirms unsafe file upload based on extension-only validation, missing MIME/content checks, and predictable avatar storage. It still notes that code-execution impact depends on web server configuration. The 3.1.80-dev.0 report classifies the analogous upload issue as Highly Likely, saying that content validation is limited but direct code execution impact is not closed from current evidence. This does not prove that one report is "right" and the other is "wrong"; it shows different confirmation thresholds.

CSRF handling shows the same pattern. The 3.1.75 report confirms cookie reuse as a CSRF token and connects it to missing HttpOnly protection: stealing the cookie would also provide the expected CSRF token. The 3.1.80-dev.0 report treats the referer-plus-cookie-mirrored-token design as Highly Likely, because the control weakness is visible but a practical cross-site bypass was not fully demonstrated. For product users, this distinction is important: both reports identify a real design concern, but the later report is more conservative about exploit confirmation.
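As a rough illustration of the design concern, the sketch below keeps the CSRF token independent of the session cookie value, so that knowing one does not reveal the other. It is written in Python with hypothetical names (new_session, csrf_ok) and an in-memory session dictionary; it is not how the audited application or Code Hunter implements anything.

```python
import hmac
import secrets


def new_session():
    """Create a session whose CSRF token is generated separately from its cookie value."""
    return {
        "session_id": secrets.token_urlsafe(32),  # value sent as the session cookie
        "csrf_token": secrets.token_urlsafe(32),  # value embedded in forms, stored server-side
    }


def csrf_ok(session, submitted_token):
    """Compare the submitted token to the server-side token in constant time."""
    expected = session["csrf_token"].encode("utf-8")
    return hmac.compare_digest(expected, submitted_token.encode("utf-8"))


session = new_session()
print(csrf_ok(session, session["csrf_token"]))  # the legitimate form token is accepted
print(csrf_ok(session, session["session_id"]))  # the cookie value alone is not a valid token
```

In the design both reports flag, the second check would succeed because the token mirrors the cookie; separating the two values removes that shortcut.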

Reasoning style and evidence quality

The 3.1.75 report reads like a detailed security-audit dossier. Each major confirmed issue includes entry point, sensitive sink, control failure, observed controls, expected controls, source/transit/flow steps, reason for confirmation, exploitability, business impact, and remediation. This format is valuable for a human reviewer because it makes the argument explicit. For example, the updater finding links TLS verification being disabled, remote XML persistence, patch URL extraction, ZIP download, and extraction to the application root.

The 3.1.80-dev.0 report reads like a normalized confirmation ledger. It consistently names the bound function, subfunction, flow, entry point, sensitive point, control gap, evidence and data flow, confirmation rationale, business impact, and remediation. The report is shorter, but it is precise about what is closed and what remains open. For uncertain findings, it repeatedly states missing evidence and gives an upgrade path or additional inputs needed.

Neither report should be read as a substitute for runtime reproduction. Both reports are source-code audit artifacts. They provide file paths, line references, control-gap reasoning, and source-to-sink logic, but they do not provide full runnable proof-of-concept requests for every issue. This matters most for concurrency, path traversal, CSRF, update-pipeline exploitation, XML parser behavior, and upload impact, where exploitability depends on timing, runtime configuration, deployment topology, or web server behavior.

Confirmation quality: paths, lines, reproduction logic, and hypothesis separation

On file-path evidence, both reports are strong. The 3.1.75 report often cites exact files and lines inside prose, such as server/controller/admin/update.php, server/model/admin.php, server/controller/install/finish.php, server/library/db.php, server/controller/index/hall.php, and server/controller/index/room.php. The 3.1.80-dev.0 report uses a more compact pipe-delimited style, pairing methods, files, and line ranges such as Command::exec | server/controller/admin/command.php | 18-20, Update::down | server/controller/admin/update.php | 31-35, and Hall::buy | server/controller/index/hall.php | 57-65.
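The pipe-delimited style lends itself to mechanical processing. A minimal sketch of a parser for references shaped like the examples quoted above follows; the EvidenceRef type and parse_evidence function are hypothetical helpers, and the format is assumed to be exactly "symbol | path | first-last" with an optional single-line form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceRef:
    symbol: str      # e.g. "Hall::buy"
    path: str        # source file path as written in the report
    first_line: int
    last_line: int


def parse_evidence(ref: str) -> EvidenceRef:
    """Parse 'Symbol | path | 57-65' (or '... | 57') into an EvidenceRef."""
    symbol, path, lines = (part.strip() for part in ref.split("|"))
    first, _, last = lines.partition("-")
    # A single line number is treated as a one-line range.
    return EvidenceRef(symbol, path, int(first), int(last or first))


ref = parse_evidence("Hall::buy | server/controller/index/hall.php | 57-65")
print(ref.symbol, ref.path, ref.first_line, ref.last_line)
```

A reviewer could use something like this to open each cited range in an editor, or to check that every confirmed finding carries at least one resolvable reference.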

On hypothesis separation, the 3.1.80-dev.0 report is cleaner. It explicitly says when source and sink are present but exploitability is not proven, as with archive extraction path traversal, XML parser hardening, vendored frontend dependencies, room-owner transfer, and audit-trail linkage. It also rejects a stored/reflected HTML injection candidate because the contextual evidence was insufficient.

The 3.1.75 report also separates hypotheses from confirmed issues, but it is more willing to mark broad patterns as confirmed when source-level control gaps are visible. This can be useful for prioritizing design remediation, but it may overstate exploit certainty if the reader treats every confirmed item as an independently reproduced vulnerability. The report itself sometimes acknowledges runtime dependencies, such as upload impact depending on server configuration and settlement double-spend requiring concurrent request timing.

Model strengths and weaknesses on this project

The 3.1.75 / mimo-v2.5-pro report's main strength is breadth. It gives Code Hunter users a wide attack-surface inventory and catches many classes of issues that are common in older PHP applications: SQL concatenation, weak hashing, missing cookie flags, admin SQL execution, unsafe update download, plaintext secrets, mass assignment, and player data disclosure. It is useful when the goal is to avoid missing classes of risk during an initial pass.

Its weakness is that breadth can blur confirmation boundaries. Some findings appear to overlap, such as missing cookie security flags and insecure cookie flags, or IDOR on the score endpoint and player data disclosure through the same Hall::score() flow. The report also assigns Critical severity broadly. That may be acceptable for risk discovery, but it requires reviewer discipline before turning the output into an engineering backlog.

The 3.1.80-dev.0 / GPT 5.5 report's main strength is adjudication discipline. It is better at saying "this is unsafe by design, but the exploit chain is not closed" and "this should be upgraded only after reproducible evidence." It is especially strong on business-logic integrity in a game economy: recharge redemption, daily gifts, seat ownership, round-start costs, and match settlement are treated as distinct asset-movement risks rather than as one generic race-condition bucket.

Its weakness is that the narrower report may feel less complete to a user expecting broad web-audit coverage. Some issues that the earlier report confirms are absent as separate confirmed findings, merged into broader categories, or downgraded into likely / evidence-gap status. For example, the later report treats shared DB helper SQL construction as Highly Likely rather than confirmed, and treats upload content validation as Highly Likely rather than confirmed. That makes the report less noisy, but it may require a second pass if the user wants maximum candidate discovery.

What this means for Code Hunter as a security-audit product

The comparison shows that Code Hunter's value is not only in finding vulnerabilities. It is also in how it structures evidence maturity. A product audit report is more useful when it distinguishes confirmed defects, plausible risks, and unresolved hypotheses. The 3.1.75 report demonstrates the value of broad candidate generation and detailed human-readable reasoning. The 3.1.80-dev.0 report demonstrates the value of tighter confirmation thresholds and business-process decomposition.

For a security-audit product, this suggests that model selection should be exposed as an audit strategy choice rather than hidden as an implementation detail. A broad model run is useful for discovery. A stricter model run is useful for validation and prioritization. When the two disagree, the disagreement itself is useful: it identifies findings that need reproduction, environmental validation, or source-review deepening.

The reports also show why Code Hunter should preserve model name, Code Hunter version, audit depth, confirmation standard, and evidence traces in every report. Without that metadata, a user could misinterpret differences in finding count or severity as changes in project risk rather than changes in audit behavior. In these two reports, the same project produces different output not because the code necessarily changed, but because the audit version, report schema, and model behavior differ.
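One way to make that metadata actionable is to compare it before comparing findings. The sketch below is an assumption about how a user might record the three fields this article knows for both reports; ReportMetadata and differing_fields are hypothetical names, not Code Hunter's schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ReportMetadata:
    codehunter_version: str
    model: str
    audit_depth: str


def differing_fields(a: ReportMetadata, b: ReportMetadata):
    """List the metadata fields on which two reports disagree."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


earlier = ReportMetadata("3.1.75", "mimo-v2.5-pro", "full_deep")
later = ReportMetadata("3.1.80-dev.0", "GPT 5.5", "full_deep")

print(differing_fields(earlier, later))  # ['codehunter_version', 'model']
```

Because two fields differ at once, any change in finding count or severity between these reports cannot be attributed to the model alone.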

Practical guidance for users

Users should not choose a model by counting findings. A higher finding count may mean better recall, but it can also mean more overlapping issues, broader severity assignment, or lower confirmation strictness. A lower finding count may mean better precision, but it can also omit useful candidate risks. In this case, the 3.1.75 report is better suited to first-pass risk discovery, while the 3.1.80-dev.0 report is better suited to evidence-based backlog creation.

Users should inspect the confirmation layer before acting. Confirmed findings with file paths, line evidence, and closed source-to-sink chains should move into remediation planning first. Highly Likely findings should usually become validation tasks: reproduce the exploit path, verify runtime configuration, or inspect the missing source segment. Needs More Evidence findings should not be ignored, but they should not be treated as proven vulnerabilities.
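The tier-to-action guidance above can be expressed as a small triage table. This is a sketch only: the action strings paraphrase this section, the "no action" entry for rejected items is an assumption, and the sample findings are taken from the later report's classifications as described in this article.

```python
# Tiers in the order they should be worked, highest evidence maturity first.
TIER_ORDER = ["Confirmed", "Highly Likely", "Needs More Evidence", "Rejected / Excluded"]

NEXT_ACTION = {
    "Confirmed": "plan remediation",
    "Highly Likely": "create validation task",
    "Needs More Evidence": "keep as open hypothesis",
    "Rejected / Excluded": "no action",  # assumption: not stated in the reports
}


def triage(findings):
    """Sort findings by tier and attach the suggested next action."""
    ranked = sorted(findings, key=lambda f: TIER_ORDER.index(f["tier"]))
    return [(f["title"], NEXT_ACTION[f["tier"]]) for f in ranked]


findings = [
    {"title": "Upload content validation", "tier": "Highly Likely"},
    {"title": "HTML injection candidate", "tier": "Rejected / Excluded"},
    {"title": "Recharge code double consumption", "tier": "Confirmed"},
]

for title, action in triage(findings):
    print(f"{title}: {action}")
```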

Users should also normalize severity across reports. "Critical" in one report and "High" in another are not necessarily equivalent. The better comparison unit is the evidence chain: entry point, attacker control, sensitive sink, missing control, exploit preconditions, and business impact.

For concurrency and game-economy findings, users should request or perform runtime testing. Source review can show read-then-update patterns, missing transaction boundaries, and absent idempotency guards, but race exploitability depends on timing and deployment behavior. The 3.1.80-dev.0 report's treatment of recharge, gifts, round-start deductions, and settlement flows provides a good template for such validation.
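To show what an idempotency guard with a transaction boundary looks like, here is a minimal sketch of single-use code redemption. It uses Python's sqlite3 with a hypothetical schema (recharge_code, player) and a hypothetical redeem function; the audited game is a PHP application, so this illustrates the pattern rather than a patch for /hall/buy.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory database with one unused code and one player."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE recharge_code (
            code TEXT PRIMARY KEY,
            amount INTEGER NOT NULL,
            used INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE player (id INTEGER PRIMARY KEY, coins INTEGER NOT NULL);
        INSERT INTO recharge_code (code, amount) VALUES ('DEMO-1', 100);
        INSERT INTO player (id, coins) VALUES (1, 0);
        """
    )
    return conn


def redeem(conn, player_id, code):
    """Claim the code and credit the player in one transaction; False if already used."""
    with conn:  # commits on normal exit, rolls back if an exception escapes
        # Conditional update: only one caller can flip used from 0 to 1.
        claimed = conn.execute(
            "UPDATE recharge_code SET used = 1 WHERE code = ? AND used = 0",
            (code,),
        )
        if claimed.rowcount != 1:
            return False
        amount = conn.execute(
            "SELECT amount FROM recharge_code WHERE code = ?", (code,)
        ).fetchone()[0]
        conn.execute(
            "UPDATE player SET coins = coins + ? WHERE id = ?", (amount, player_id)
        )
        return True


conn = open_demo_db()
print(redeem(conn, 1, "DEMO-1"))  # first redemption succeeds
print(redeem(conn, 1, "DEMO-1"))  # second attempt is refused
```

The design choice is to let the database decide the winner through the conditional UPDATE instead of reading the flag first and writing later. This sequential demo does not prove behavior under real concurrent load, which is why runtime testing is still recommended.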

For classic web issues such as SQL injection, cookie security, weak password hashing, and unsafe updater behavior, users should prioritize architectural fixes even when exploit reproduction is incomplete. Prepared statements, modern password hashing, independent CSRF tokens, secure cookie attributes, TLS verification, URL allowlisting, signed update artifacts, and transaction boundaries are defensible remediations regardless of which model produced the finding.
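Two of these remediations, URL allowlisting and artifact integrity checking, can be sketched briefly. The example below is a simplified stand-in: it checks a pinned SHA-256 digest rather than a real signature, and the host name, function names, and payload are all hypothetical. It performs no network access.

```python
import hashlib
import hmac
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"updates.example.com"}  # hypothetical update host


def url_allowed(url: str) -> bool:
    """Accept only https URLs whose host is on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS


def digest_matches(payload: bytes, expected_sha256_hex: str) -> bool:
    """Compare the payload's SHA-256 digest to a pinned value in constant time."""
    actual = hashlib.sha256(payload).hexdigest().encode("ascii")
    expected = expected_sha256_hex.strip().lower().encode("utf-8")
    return hmac.compare_digest(actual, expected)


patch = b"demo patch contents"
pinned = hashlib.sha256(patch).hexdigest()

print(url_allowed("https://updates.example.com/patch.zip"))  # True
print(url_allowed("http://updates.example.com/patch.zip"))   # False: not https
print(digest_matches(patch, pinned))                         # True
print(digest_matches(patch + b"tampered", pinned))           # False
```

A pinned digest only helps if it is delivered over a channel the attacker cannot also modify; a signed artifact verified against a key shipped with the application is the stronger version of the same idea.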

Conclusion

The two Code Hunter reports show that different AI models, together with different Code Hunter versions, can materially change audit behavior on the same project. The 3.1.75 / mimo-v2.5-pro report is expansive, narrative, and high-recall. It surfaces a wide range of PHP application-security issues and often frames visible source-level control gaps as confirmed. The 3.1.80-dev.0 / GPT 5.5 report is more compact, more conservative, and more focused on business-logic and value-transfer integrity. It is stronger at separating confirmed issues from hypotheses and at explaining what evidence is still missing.

For Code Hunter users, the practical lesson is to treat model output as an evidence-structured audit artifact, not as a single vulnerability score. A broad model can reveal more candidate risk. A stricter model can reduce noise and improve backlog quality. The best workflow is not to ask which model is universally better, but to use model differences to identify which findings are confirmed, which require validation, and which represent architectural risk even before exploit reproduction is complete.

Source reports

https://github.com/SEc-123/codehunter-docs/blob/main/docs/examples/codehunter-3.1.75-happy-dou-di-zhu-audit-report.md

https://github.com/SEc-123/codehunter-docs/blob/main/docs/examples/codehunter-3.1.80-dev.0-happy-dou-dizhu-web-game-audit-report.md
