Same Project, Different Models: What Two Code Hunter Reports Reveal About AI-Assisted Audit Behavior

Introduction

Two Code Hunter reports examine the same anti-counterfeit traceability / one-item-one-code PHP application, but they do not behave like interchangeable audit outputs. The 3.1.75 report was generated with deepseek-v4-pro and describes a broad full-stack audit covering the public frontend, agent portal, admin backend, business functions, data flows, and infrastructure. The 3.1.80-dev.0 report was generated with GPT 5.5 and describes a narrower "layered confirmation" review focused on public query, admin operations, distributor operations, installer behavior, and integration-related controls. Because the Code Hunter version, report structure, and stated scope also differ, this is not a perfectly isolated model-only experiment. It is better read as a comparison of model-shaped audit behavior as reflected in two Code Hunter outputs on the same project family.

Detection coverage and severity distribution

The most visible difference is detection volume. The Code Hunter 3.1.75 / deepseek-v4-pro report reviewed 69 candidate findings: 52 Confirmed, 13 Highly Likely, 4 Needs More Evidence, and 0 Rejected / Excluded. It also reports an official confirmed-severity distribution of 3 Critical, 25 High, 23 Medium, and 1 Low. Its confirmed findings span conventional web vulnerabilities, dependency and configuration issues, and many business-logic defects tied to code lifecycle, traceability flows, agent operations, and admin privilege boundaries.

The Code Hunter 3.1.80-dev.0 / GPT 5.5 report is much smaller. Its Layered Overview states that 20 findings were reviewed: 9 Confirmed, 6 Highly Likely, 4 Needs More Evidence, and 1 Rejected / Excluded. However, the Confirmed Finding Details and Functional Binding View appear to list 10 confirmed rows, not 9. The official layer counts should therefore be quoted as the report gives them, and any confirmed-severity distribution should be described as inferred from the detailed rows rather than as an official report metric. In the detailed confirmed list, every item is High or Medium severity; no Critical or Low confirmed findings are shown.
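The count reconciliation described above can be written as a small arithmetic check. This is a minimal sketch: the numbers are the ones quoted in this article, while the dictionary layout and function names are illustrative assumptions, not part of either report.

```python
# Figures as quoted in this article; the structure itself is hypothetical.
report_3_1_75 = {
    "total": 69,
    "layers": {"Confirmed": 52, "Highly Likely": 13,
               "Needs More Evidence": 4, "Rejected / Excluded": 0},
    "confirmed_severity": {"Critical": 3, "High": 25, "Medium": 23, "Low": 1},
}
report_3_1_80 = {
    "total": 20,
    "layers": {"Confirmed": 9, "Highly Likely": 6,
               "Needs More Evidence": 4, "Rejected / Excluded": 1},
    # Rows that appear to be listed in the detailed confirmed sections.
    "confirmed_detail_rows": 10,
}


def layer_sum_matches(report):
    """Layer counts should add up to the stated number of reviewed findings."""
    return sum(report["layers"].values()) == report["total"]


def severity_sum_matches(report):
    """Confirmed severities should add up to the Confirmed layer count."""
    return sum(report["confirmed_severity"].values()) == report["layers"]["Confirmed"]


def detail_row_gap(report):
    """Difference between detailed confirmed rows and the overview count."""
    return report["confirmed_detail_rows"] - report["layers"]["Confirmed"]


print(layer_sum_matches(report_3_1_75), severity_sum_matches(report_3_1_75))
print(layer_sum_matches(report_3_1_80), detail_row_gap(report_3_1_80))
```

Both reports pass the layer-sum check; the only discrepancy is the one-row gap between the 3.1.80-dev.0 overview and its detailed confirmed rows.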

The 3.1.75 report behaves like a broad discovery engine. It identifies universal SQL injection, stored and reflected XSS, unauthenticated KindEditor upload and file listing, hardcoded database credentials, plaintext WeChat secrets, missing CSRF controls, weak sessions, web-accessible backups, deprecated mysql_* usage, obfuscated core code, weak cryptography, IDOR, insecure installation behavior, and numerous traceability-specific business risks. The 3.1.80-dev.0 report focuses on a smaller set of closed evidence chains: admin-login SQL injection, bootstrap wx SQL injection, MD5 password hashing, installer config write risk, CSRF on privileged config updates, disabled TLS verification in a WeChat client, shipment replay, batch shipment idempotency, sensitive export persistence, and non-atomic shipment/audit writes.

False-positive and uncertainty handling

The deepseek-v4-pro report is not careless about uncertainty: it has Highly Likely and Needs More Evidence tiers, and many downgraded items explicitly say they were reduced from confirmed status "per cross-model consensus." That is useful because it prevents some speculative items from being counted as confirmed. But the same section sends a mixed signal: several Highly Likely entries still begin with "source code evidence confirms" before explaining that the item was downgraded, which makes it harder for readers to distinguish proof from suspicion.
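A report post-processor could flag this kind of wording mismatch automatically. The sketch below assumes each finding is available as a tier label plus its free text; the function name, the phrase list, and the sample entries are hypothetical.

```python
# Phrases that assert confirmation; only the first is quoted from the report.
CONFIRMING_PHRASES = ("source code evidence confirms",)


def has_mixed_signal(tier, text):
    """Return True when a non-confirmed entry uses confirmation wording."""
    if tier == "Confirmed":
        return False
    lowered = text.lower()
    return any(phrase in lowered for phrase in CONFIRMING_PHRASES)


# Hypothetical entries used only to exercise the check.
entries = [
    ("Highly Likely", "Source code evidence confirms the flow. Downgraded per cross-model consensus."),
    ("Highly Likely", "Current evidence suggests the flow is reachable."),
    ("Confirmed", "Source code evidence confirms the flow."),
]
flagged = [text for tier, text in entries if has_mixed_signal(tier, text)]
print(len(flagged))
```

Only the first entry is flagged, because it combines a downgraded tier with language that reads as confirmed.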

The GPT 5.5 report is stricter and cleaner in its uncertainty language. Highly Likely findings identify current evidence, why the item is likely, the missing critical evidence, and an upgrade-to-confirmed path. Needs More Evidence findings explicitly state why confirmation is not yet possible. The report also includes one rejected / excluded candidate: a supply-chain governance gap based on unmanaged third-party components, excluded because the candidate premise was not supported by file evidence. This is a meaningful product behavior: the model is willing to say "not enough evidence" and remove a candidate from the accepted set.

Reasoning style and evidence backing

The 3.1.75 report uses a wide, thematic reasoning style. It groups issues by risk family and repeatedly ties them to business impact: full system compromise, supply-chain data exposure, broken traceability state, undetectable privilege escalation, and disruption of anti-counterfeit code validity. This is valuable for a product like Code Hunter because a user does not only need CWE labels; they need to understand how a code defect affects the one-item-one-code business process. For example, the report describes a public backend test endpoint that bypasses authentication, CAPTCHA, query limits, and history logging, then explains that attackers could enumerate barcodes and harvest product, agent, and traceability data.

Its weakness is that some reasoning becomes boilerplate. Several unrelated findings reuse broad remediation and impact language such as "attacker can execute arbitrary SQL, execute XSS, or access unauthorized data," even where the issue is hardcoded credentials, missing security headers, or another non-SQLi class. One confirmed SSRF / insecure external request item even shows N/A in the evidence field while still being treated as confirmed. These patterns do not make the underlying findings false, but they reduce confirmation quality because the explanation is less precisely bound to the specific vulnerability.

The 3.1.80-dev.0 report uses a more forensic style. Findings are compact, and each confirmed issue tends to follow a consistent structure: entry point, sensitive point, missing control, source-transit-sink evidence, why confirmed, and business impact. The admin-login SQL injection finding, for example, identifies manage/index.php?act=adminlogin, maps source lines 46-50, transit line 50, sink line 51, and states that the source-to-sink chain is closed. The shipment replay and non-atomic audit-write findings similarly map entry point, flow, sink, and business result.
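The source-transit-sink structure lends itself to a simple record with a closure check. The following is a minimal sketch: the class and field names are assumptions made for illustration, and only the entry point and line references are taken from the report as described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvidenceChain:
    entry_point: str
    source: Optional[str]
    transit: Optional[str]
    sink: Optional[str]

    def is_closed(self):
        """A chain is closed only if every step carries real evidence."""
        steps = (self.source, self.transit, self.sink)
        # Treat empty values and a literal "N/A" as missing evidence.
        return all(s and s.strip().upper() != "N/A" for s in steps)


admin_login_sqli = EvidenceChain(
    entry_point="manage/index.php?act=adminlogin",
    source="lines 46-50",
    transit="line 50",
    sink="line 51",
)

# Hypothetical candidate with no sink evidence recorded.
open_candidate = EvidenceChain(
    entry_point="(hypothetical endpoint)",
    source="request parameter",
    transit=None,
    sink="N/A",
)

print(admin_login_sqli.is_closed(), open_candidate.is_closed())
```

Under this rule, a finding whose evidence field reads N/A could not be marked as a closed chain, whatever tier it was originally assigned.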

The tradeoff is compression. The GPT 5.5 report is easier to triage, but it gives less exploratory context. It does not surface many issue classes that appear in the 3.1.75 report, including unauthenticated KindEditor upload, public backend test exposure, obfuscated core code, web-accessible database backups, and several detailed anti-counterfeit state-machine failures. Given the different scopes and Code Hunter versions, this should not be interpreted as proof that GPT 5.5 "missed" those issues. It means only that those issues fall outside the reviewed set of the linked GPT 5.5 report.

Confirmation quality

For engineering handoff, the GPT 5.5 report has the stronger confirmation format. It separates confirmed issues from hypotheses, identifies missing evidence for unresolved items, and consistently provides source-transit-sink reasoning. Its "Notes on Evidence Gaps" section states that Highly Likely and Needs More Evidence findings are intentionally not promoted to Confirmed without closure. That is exactly the distinction auditors need when deciding what can be fixed immediately versus what requires reproduction, runtime validation, or policy clarification.

The deepseek-v4-pro report provides far more file paths and examples overall, but the evidence quality varies. Its best findings are strong: the universal SQL injection item cites unsanitized bianhao input flowing into a SQL query and history insert; the unauthenticated upload item cites missing session checks in manage/editor/php/upload_json.php; the business-logic findings cite concrete state variables such as tgs_code.hits, tgs_code.fahuo, tgs_code.qiyong, and tgs_admin. These are useful anchors for remediation.

But the 3.1.75 report also contains rough edges: repeated generic remediation, occasional broad "no controls observed" statements, partial or awkward evidence snippets, and some Highly Likely items whose wording still sounds confirmed. That makes it stronger as an attack-surface inventory than as a final, litigation-grade confirmation document.

Model strengths and weaknesses on this project

The deepseek-v4-pro run is strongest at breadth. It behaves like an aggressive auditor that tries to map the whole application: public query, QR code flow, admin CRUD, distributor lifecycle, upload surfaces, database backup/restore, installer behavior, traceability templates, shipment/return state machines, and auditability. For this project type, that breadth matters because security defects are not only web bugs. A one-item-one-code platform can fail through business-state corruption: duplicate shipments, disabled codes moving through the supply chain, agent ownership reassignment without audit logs, and privilege escalation through admin account creation. The 3.1.75 report captures many of those domain risks.
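To make the business-state risk concrete, here is a minimal in-memory sketch of a shipment operation that refuses replays, disabled codes, and double shipments, and that records an audit entry with every state change. All names here (Code, ship, request_id, and so on) are hypothetical; this is not the application's actual data model or the reports' recommended fix.

```python
from dataclasses import dataclass, field
from typing import Optional


class ShipmentError(Exception):
    """Raised when a shipment would corrupt traceability state."""


@dataclass
class Code:
    code_id: str
    enabled: bool = True
    shipped_to: Optional[str] = None
    audit_log: list = field(default_factory=list)


def ship(code, agent_id, request_id, seen_requests):
    """Ship a code once; return False for a replayed request."""
    if request_id in seen_requests:
        return False  # replayed request: idempotent no-op
    if not code.enabled:
        raise ShipmentError("disabled code cannot enter the supply chain")
    if code.shipped_to is not None:
        raise ShipmentError("code has already been shipped")
    # State change and audit record are written together.
    code.shipped_to = agent_id
    code.audit_log.append(("ship", agent_id, request_id))
    seen_requests.add(request_id)
    return True


seen = set()
item = Code("C1")
print(ship(item, "agent-A", "req-1", seen))  # first request ships the code
print(ship(item, "agent-A", "req-1", seen))  # replay is ignored
```

In a real system the same guarantees would need to hold at the database level, which this in-memory sketch does not attempt to show.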

Its weakness is precision discipline. The report's scale creates duplication and boilerplate. Some findings need a human reviewer to normalize severity, merge overlapping items, remove weakly supported claims, and rewrite remediation so it matches the exact flaw. Users should treat it as a strong first-pass discovery report, not as a final validated vulnerability register without review.

The GPT 5.5 run is strongest at evidentiary restraint. It gives cleaner source-to-sink narratives and is more explicit about what remains unproven. It is particularly useful for issue triage, developer ticket creation, and audit workflows where "confirmed" must mean "the report shows a closed evidence chain." Its weakness is narrower coverage. The linked report gives fewer findings and less contextual exploration, and the internal mismatch between the Layered Overview count and the detailed confirmed rows shows that even a stricter report can still have report-generation consistency issues.

What this means for Code Hunter as a security-audit product

The main product lesson is that model choice changes the shape of the audit, not merely the wording. In this comparison, one output optimizes for coverage and domain exploration; the other optimizes for confirmation discipline and uncertainty separation. A security-audit product should make that distinction visible to users.

For Code Hunter, the best product behavior would combine both patterns. A broad model pass can maximize candidate discovery across code paths and business functions. A stricter confirmation pass can then require source-to-sink closure, concrete line evidence, reproduction logic where applicable, and explicit missing-evidence statements. The 3.1.75 report shows the value of broad issue discovery; the 3.1.80-dev.0 report shows the value of disciplined confirmation. The ideal user experience is not choosing one permanently, but understanding which stage of the audit each model behavior serves.
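The confirmation stage of such a pipeline can be sketched as a gate that promotes a candidate only when every required piece of evidence is present and that otherwise reports what is missing. The tiering rule, field names, and sample candidates below are assumptions for illustration, not Code Hunter's actual logic.

```python
# Evidence fields a candidate must carry to be promoted (assumed list).
REQUIRED = ("source", "transit", "sink", "line_evidence")


def confirm(candidate):
    """Assign a tier and list the evidence still missing."""
    missing = [key for key in REQUIRED if not candidate.get(key)]
    if not missing:
        tier = "Confirmed"
    elif len(missing) < len(REQUIRED):
        tier = "Highly Likely"  # partial evidence: a lead, not a proof
    else:
        tier = "Needs More Evidence"
    return {"title": candidate["title"], "tier": tier, "missing": missing}


# Hypothetical output of a broad discovery pass.
candidates = [
    {"title": "closed chain", "source": "param", "transit": "concat",
     "sink": "query", "line_evidence": "file.php:10-12"},
    {"title": "partial chain", "source": "param", "sink": "query"},
    {"title": "bare hypothesis"},
]
for result in map(confirm, candidates):
    print(result["title"], "->", result["tier"], result["missing"])
```

The design choice is that the gate never discards a candidate silently: each result carries an explicit missing-evidence list, which mirrors the upgrade-to-confirmed path described for the stricter report.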

The reports also show why metadata matters. Users need to see the Code Hunter version, model, audit depth, scope, and confirmation standard. Without those fields, a report with 69 candidates and a report with 20 reviewed findings could be misread as directly comparable recall measurements. They are not. They are different report artifacts with different scopes, model behavior, and confirmation styles.
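One way to surface this is to attach a metadata record to each report and refuse direct count comparisons when the records differ. The sketch below uses a hypothetical ReportMetadata class; the version, model, and scope values come from this article, and fields this article does not state are left as None.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ReportMetadata:
    codehunter_version: str
    model: str
    scope: str
    audit_depth: Optional[str] = None            # not stated in this article
    confirmation_standard: Optional[str] = None  # not stated in this article


def differing_fields(a, b):
    """Names of metadata fields whose values differ between two reports."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


broad = ReportMetadata("3.1.75", "deepseek-v4-pro", "broad full-stack audit")
strict = ReportMetadata("3.1.80-dev.0", "GPT 5.5", "layered confirmation review")

diff = differing_fields(broad, strict)
directly_comparable = not diff
print(diff, directly_comparable)
```

Because version, model, and scope all differ, the two finding counts should not be presented side by side as recall measurements.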

Practical guidance for users

Users who want maximum initial coverage should start with the broader report style. It is better for finding classes of risk that may not be obvious from isolated source-to-sink analysis: traceability state-machine failures, query-budget abuse, public test endpoints, admin workflow gaps, weak auditability, and code lifecycle hazards. The output should then be deduplicated and reviewed for overbroad impact statements.

Users who need developer-ready tickets should prefer the stricter confirmation style. A finding with entry point, sensitive point, source-transit-sink path, missing control, and "why confirmed" is easier to assign, reproduce, and fix. In the GPT 5.5 report, that structure is clear for admin-login SQL injection, MD5 password hashing, CSRF on privileged config mutation, shipment replay, export persistence, and non-atomic shipment/audit writes.

When interpreting either report, readers should not treat all findings equally. Confirmed findings can be prioritized for remediation, but "Confirmed" still means confirmed according to static report evidence unless runtime reproduction is also supplied. Highly Likely findings should be treated as serious leads that need closure. Needs More Evidence items should not be counted as vulnerabilities until the missing source controllability, exploit path, policy requirement, or runtime condition is established. Rejected or excluded items should remain out of the vulnerability register unless new evidence appears.
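These interpretation rules can be expressed as a small lookup, shown here as a sketch. The tier names come from the reports; the function name, the action strings, and the has_runtime_repro flag are illustrative assumptions.

```python
TIER_ACTIONS = {
    "Confirmed": "prioritize for remediation",
    "Highly Likely": "treat as a serious lead and close the evidence gap",
    "Needs More Evidence": "do not count as a vulnerability yet",
    "Rejected / Excluded": "keep out of the vulnerability register",
}


def next_step(tier, has_runtime_repro=False):
    """Return the recommended handling for a finding in the given tier."""
    action = TIER_ACTIONS[tier]
    if tier == "Confirmed" and not has_runtime_repro:
        # Confirmation here rests on static report evidence only.
        action += " (confirmed by static evidence; runtime reproduction pending)"
    return action


for tier in TIER_ACTIONS:
    print(tier, "->", next_step(tier))
```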

For this project, the safest remediation priority is the set of issues where the two reports overlap, plus the most severe domain-specific items. SQL injection, weak MD5 credential handling, missing CSRF on privileged mutations, shipment replay/idempotency failures, sensitive export persistence, non-atomic traceability writes, and insecure WeChat/TLS behavior appear as confirmed or strongly evidenced issues in the stricter report and also align with broader systemic themes in the larger report. Those should be treated as high-confidence remediation candidates.
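Computing that overlap requires normalizing finding titles into shared themes first, since the two reports word their findings differently. The sketch below uses hand-assigned theme tags for a few of the issue names listed earlier in this article; the tags and the mapping are assumptions made for illustration, and a real comparison would need a reviewer to validate each mapping.

```python
# Hand-assigned theme tags (assumptions) for a subset of issues named above.
themes_3_1_75 = {
    "universal SQL injection": "sql_injection",
    "missing CSRF controls": "csrf",
    "weak cryptography": "weak_cryptography",
    "insecure installation behavior": "insecure_installer",
    "unauthenticated KindEditor upload": "unauthenticated_upload",
}
themes_3_1_80 = {
    "admin-login SQL injection": "sql_injection",
    "bootstrap wx SQL injection": "sql_injection",
    "CSRF on privileged config updates": "csrf",
    "MD5 password hashing": "weak_cryptography",
    "installer config write risk": "insecure_installer",
    "sensitive export persistence": "export_persistence",
}

# Themes that appear in both reports under this tagging.
overlap = sorted(set(themes_3_1_75.values()) & set(themes_3_1_80.values()))
print(overlap)
```

Themes that appear in only one mapping are not dismissed by this step; they simply need their own review before being prioritized.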

Conclusion

The two Code Hunter reports show two different audit behaviors on the same anti-counterfeit traceability platform. The deepseek-v4-pro report is expansive, domain-aware, and useful for discovering a wide attack surface, especially business-logic flaws in one-item-one-code workflows. The GPT 5.5 report is narrower, cleaner, and better at separating confirmed evidence from hypotheses. Neither style is universally superior. For a security-audit product, the practical answer is staged use: broad candidate discovery first, strict confirmation second, with clear metadata and explicit uncertainty handling throughout.

Source reports

https://github.com/SEc-123/codehunter-docs/blob/main/docs/examples/codehunter-3.1.75-personal-audit-report.md

https://github.com/SEc-123/codehunter-docs/blob/main/docs/examples/codehunter-3.1.80-dev.0-anti-counterfeit-traceability-one-item-one-code-platform-audit-report.md
