How AI Model Choice Changes Code Hunter's Audit Behavior: A Comparative Review of MicroCommunity PropertyApp

Introduction

Security-audit output is shaped by more than the scanned codebase. It is also shaped by the model, the product version, the audit scope, and the confirmation standard used to adjudicate findings. The two Code Hunter reports for MicroCommunity PropertyApp provide a useful case study because they examine the same project family but produce noticeably different audit behavior.

The first report was generated by Code Hunter 3.1.75 using the mimo-v2.5-pro model. It presents a full-repository static audit of MicroCommunity PropertyApp, including Spring Boot backend services, uni-app and WeChat mini-program frontend code, MyBatis mappers, configuration files, dependency declarations, payment logic, IoT integrations, and authentication flows. The second report was generated by Code Hunter 3.1.80-dev.0 using GPT 5.5. It is also marked full_deep, but its stated scope is narrower: frontend application flows, client-side controls, and risk candidates tied to reviewed business and generic findings.

This distinction matters. The comparison below should not be read as a controlled A/B test of two models under identical conditions. The reports differ in Code Hunter version, model, generation date, and declared audit scope, so no single difference can be attributed to the AI model alone. Still, the reports show how different model-and-configuration combinations change Code Hunter's behavior: one report emphasizes backend-confirmed exploitability and systemic architectural failures; the other emphasizes client-flow risk modeling, evidence sufficiency, and conservative classification when backend proof is unavailable.

Report-level comparison

At the highest level, the two reports differ in confirmation density, that is, the share of reviewed findings that reach Confirmed status.

The Code Hunter 3.1.75 / mimo-v2.5-pro report reviewed 27 findings. It classified 14 as Confirmed, 3 as Highly Likely, 10 as Needs More Evidence, and 0 as Rejected / Excluded. For confirmed findings, it also provided a severity breakdown: 5 Critical, 7 High, and 2 Medium. The report's own appendix gives percentages for the status distribution: Confirmed 51.9%, Highly Likely 11.1%, Needs More Evidence 37.0%, and Rejected / Excluded 0.0%.

The Code Hunter 3.1.80-dev.0 / GPT 5.5 report reviewed 18 items. It classified 1 as Confirmed, 5 as Highly Likely, 11 as Needs More Evidence, and 1 as Rejected / Excluded. It does not provide a percentage breakdown or a single aggregate severity table equivalent to the 3.1.75 report. The report does, however, assign severity at the individual finding level. The only confirmed finding is High severity. Many non-confirmed business-flow findings are marked High, while several client-control and workflow findings are Medium. This is a qualitative distribution inferred from the finding table and finding details, not a source-provided aggregate severity metric.
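The 3.1.75 percentages can be reproduced from the raw counts, and the same arithmetic yields a comparable distribution for the 3.1.80-dev.0 report, which does not publish one. The following is a minimal Python sketch: the function and variable names are illustrative, and the second set of percentages is derived here from the stated counts, not quoted from the report.

```python
def status_distribution(counts):
    """Return each status as a percentage of all reviewed findings (one decimal)."""
    total = sum(counts.values())
    return {status: round(100 * n / total, 1) for status, n in counts.items()}

# Status counts as stated in the two reports.
report_3_1_75 = {
    "Confirmed": 14,
    "Highly Likely": 3,
    "Needs More Evidence": 10,
    "Rejected / Excluded": 0,
}
report_3_1_80 = {
    "Confirmed": 1,
    "Highly Likely": 5,
    "Needs More Evidence": 11,
    "Rejected / Excluded": 1,
}

dist_75 = status_distribution(report_3_1_75)
dist_80 = status_distribution(report_3_1_80)
print(dist_75)
print(dist_80)
```

Under these counts the Confirmed share is 51.9% for the 3.1.75 report and 5.6% for the 3.1.80-dev.0 report. The gap quantifies the difference in confirmation density, but because the scopes differ it does not by itself say which report is more reliable.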

The practical implication is straightforward: the 3.1.75 report reads like a high-confidence backend and platform audit with many confirmed source-to-sink chains. The 3.1.80-dev.0 report reads like a cautious frontend and business-flow triage report that intentionally leaves many findings unconfirmed until backend or runtime evidence is available.

Detection coverage

Code Hunter 3.1.75 with mimo-v2.5-pro

The 3.1.75 report has broad coverage across backend, middleware, configuration, payment processing, IoT callbacks, cryptography, data access, and selected mobile configuration. Its confirmed findings include:

Dynamic SQL execution through MyBatis ${} interpolation in QueryServiceDAOImplMapper.xml.

OA workflow dynamic SQL execution through raw SQL and table-name interpolation.

Hardcoded cryptographic constants in AuthenticationFactory.java.

AES/ECB encryption mode.

DES encryption usage.

Unauthenticated smart-meter callback bean resolution.

Excessive JWT filter exclusions affecting more than 100 URI patterns.

Missing rate limiting on authentication endpoints.

Excessive Android permissions in manifest.json.

Zero-amount payment confirmation logic.

Client-supplied deduction amount manipulation.

Unauthenticated charging-station callback endpoints.

The coverage pattern is backend-heavy and exploit-chain oriented. The model follows HTTP entry points into service logic, MyBatis mappers, authentication filters, and payment or IoT callback handlers. It detects not only generic vulnerability classes, such as SQL injection and cryptographic weakness, but also business-logic vulnerabilities, such as confirming a property fee without payment or manipulating deduction amounts.
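The risk behind MyBatis ${} interpolation is that the value is spliced into the SQL text before the statement is parsed, whereas #{} binds it as a parameter. The sketch below reproduces that difference using Python's sqlite3 module as a stand-in. The fee table, its rows, and both helper functions are hypothetical and are not taken from the PropertyApp mappers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fee (id INTEGER, community_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO fee VALUES (?, ?, ?)",
    [(1, "c-001", 120.0), (2, "c-002", 80.0)],
)

def fees_interpolated(community_id):
    # Analogue of ${value}: the input becomes part of the SQL text itself.
    sql = f"SELECT id FROM fee WHERE community_id = '{community_id}'"
    return [row[0] for row in conn.execute(sql)]

def fees_bound(community_id):
    # Analogue of #{value}: the input is passed as a bound parameter.
    sql = "SELECT id FROM fee WHERE community_id = ?"
    return [row[0] for row in conn.execute(sql, (community_id,))]

payload = "c-001' OR '1'='1"
leaked = fees_interpolated(payload)  # predicate is rewritten, so every row matches
safe = fees_bound(payload)           # payload is compared as a literal string
```

With the interpolated query the payload returns rows from both communities, while the bound query returns none. Identifiers such as table names cannot be bound as parameters, so a table-name interpolation case like the one in the OA workflow finding would typically call for an allow-list of permitted names instead.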

The report also captures systemic problems. Its appendix identifies authentication boundary erosion, cryptographic debt, unsafe input-trust assumptions, inconsistent IoT callback authentication, and missing observability. These are not isolated bugs; they are architectural risk themes derived from multiple findings.

Code Hunter 3.1.80-dev.0 with GPT 5.5

The GPT 5.5 report is narrower but more focused on frontend behavior and client-driven business flows. Its single confirmed finding is credential-bearing profile data stored without effective encryption. The evidence chain is closed entirely within visible client code: username and password are inserted into a user-info object, passed through a des_encrypt function whose implementation returns the message unchanged, and written to local storage as USER_INFO.
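The evidence chain described above can be pictured with a short Python analogue. This is only a sketch of the reported behavior: the des_encrypt signature, the key, the JSON serialization, and the sample credentials are assumptions made for illustration, and the real client code is JavaScript.

```python
import json

def des_encrypt(message, key):
    # Hypothetical stand-in for the reported behavior: input is returned unchanged.
    return message

def is_noop_cipher(encrypt, key="k", probes=("alice", "P@ssw0rd!", "13800000000")):
    """Return True if encrypt() leaves every probe value unchanged."""
    return all(encrypt(p, key) == p for p in probes)

# Simulated client storage; a dict stands in for the platform's local storage API.
local_storage = {}
user_info = {"username": "alice", "password": "P@ssw0rd!"}
local_storage["USER_INFO"] = des_encrypt(json.dumps(user_info), "k")

print(is_noop_cipher(des_encrypt))  # the "encryption" changes nothing
print(local_storage["USER_INFO"])   # credentials are readable as stored
```

A probe-based check like is_noop_cipher is one simple way to show that an encryption wrapper offers no protection, which is why this finding can be confirmed from client code alone.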

The Highly Likely findings focus on client-side control weaknesses and plausible downstream effects:

A static sign header reused across requests.

Token storage and cookie reconstruction from client storage.

Insecure runtime/development network configuration.

Repair completion without visible anti-replay or idempotency control.

Workflow file URLs opened or downloaded without visible origin or business-scope checks.
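The first item in the list above, the static sign header, is weak for a structural reason: a constant value cannot bind to the content of any particular request. The Python sketch below contrasts it with a per-request HMAC, a common alternative. The header names, the constant, and the message layout are assumptions, and nothing here describes what the PropertyApp backend actually verifies.

```python
import hashlib
import hmac

STATIC_SIGN = "fixed-value"  # hypothetical constant standing in for the reused header

def static_headers(body):
    # The same value is sent regardless of what the request contains.
    return {"sign": STATIC_SIGN}

def hmac_headers(body, secret, timestamp, nonce):
    # The signature covers the body plus a timestamp and nonce, so it differs per request.
    message = f"{timestamp}.{nonce}.".encode() + body
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return {"sign": digest, "timestamp": str(timestamp), "nonce": nonce}

def verify(body, headers, secret):
    # Server-side check: recompute the signature and compare in constant time.
    message = f"{headers['timestamp']}.{headers['nonce']}.".encode() + body
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, headers["sign"])

secret = b"demo-secret"
signed = hmac_headers(b'{"repairId": "r-1"}', secret, 1700000000, "n-1")
print(verify(b'{"repairId": "r-1"}', signed, secret))  # original body verifies
print(verify(b'{"repairId": "r-2"}', signed, secret))  # tampered body does not
```

Any secret shipped inside client code can be extracted, so a scheme like this raises the cost of tampering and replay but does not replace server-side authorization checks.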

The Needs More Evidence findings are largely business-flow trust-boundary candidates: route-provided room, community, and fee identifiers; client-carried owner/member identifiers; client-supplied repair amounts, goods, actions, and actor fields; OA task identifiers and audit actions from route or UI state; procurement workflow data assembled client-side; and communityId carried in sensitive request payloads.

This is a different detection profile. Instead of confirming backend exploitability, the GPT 5.5 report maps where client state enters sensitive flows and asks whether the backend independently validates that state. It is particularly sensitive to client-controlled identifiers and workflow transitions, but it declines to confirm unauthorized access or financial manipulation without backend enforcement evidence.
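The backend evidence the report is looking for would be a check of roughly the following shape. This is a hypothetical Python sketch: the membership table, the exception type, and the function name are invented for illustration, and whether PropertyApp's backend contains an equivalent check is exactly what the report leaves open.

```python
class Forbidden(Exception):
    """Raised when a user references a community they do not belong to."""

# Hypothetical server-side record of which communities each user may act on.
MEMBERSHIPS = {"user-1": {"c-001"}, "user-2": {"c-002"}}

def resolve_community(session_user, requested_community_id):
    """Accept a client-supplied communityId only if the session user belongs to it."""
    allowed = MEMBERSHIPS.get(session_user, set())
    if requested_community_id not in allowed:
        raise Forbidden(f"{session_user} may not act on {requested_community_id}")
    return requested_community_id

print(resolve_community("user-1", "c-001"))  # own community is accepted
try:
    resolve_community("user-1", "c-002")     # another tenant's community is refused
except Forbidden as exc:
    print("refused:", exc)
```

If a check like this exists for every sensitive flow, the corresponding Needs More Evidence candidates can be closed; if it is absent, they become confirmable.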

Severity distribution and what it means

The 3.1.75 report supplies a confirmed-only severity distribution: 5 Critical, 7 High, and 2 Medium. The Critical confirmed findings cluster around database compromise, arbitrary Spring bean invocation, and payment bypass. The High confirmed findings cluster around cryptographic weakness, authentication bypass surface, client-supplied financial values, and unauthenticated charging callbacks. This distribution reflects a report that reached decisive conclusions on high-impact backend and business-logic risks.

The GPT 5.5 report has a different severity shape. Its only confirmed issue is High severity: local storage of credential-bearing profile data without effective encryption. The report's non-confirmed findings include many High-severity business-flow candidates, but their status remains Needs More Evidence because the client-side evidence alone cannot prove backend authorization failure, settlement manipulation, state-machine bypass, or tenant isolation failure. This severity pattern suggests that GPT 5.5 is willing to label the potential impact as high while still separating impact severity from confirmation status.

That separation is useful. A High-severity Needs More Evidence finding should not be treated as harmless. It means the impact would be serious if the missing server-side or runtime proof confirms the control gap. For users, this distinction prevents two common mistakes: treating all high-severity candidates as confirmed vulnerabilities, or dismissing unconfirmed findings as noise.

False-positive and uncertainty handling

The two reports handle uncertainty differently.

The 3.1.75 report has no rejected or excluded findings. That does not mean it blindly confirms everything; it explicitly classifies 10 findings as Needs More Evidence and 3 as Highly Likely. For example, the WorkType SQL injection candidate is marked Highly Likely because the mapper uses ${typeName}, but the data flow from end-user input to that parameter is not fully traced. The CallComponentController reflection issue is also marked Highly Likely because a privilege check exists, and exploitability depends on whether privilege configuration is complete.

The Needs More Evidence section shows additional restraint. The report does not confirm Fastjson deserialization because runtime autoType behavior is unknown. It does not confirm XXE in the WeChat XML path because dom4j default behavior may not expand external entities and WeChat signature verification narrows exposure. It does not confirm hardcoded database credentials as production exposure because deployment overrides may exist. These are meaningful false-positive controls.

However, because the report has no explicit rejected findings, its uncertainty handling is mostly expressed through partial-evidence tiers rather than active exclusion. That makes the report strong for remediation discovery but may require users to perform their own de-duplication and validation triage.

The GPT 5.5 report is more explicit about false-positive reduction. It includes one Rejected / Excluded candidate: dependency version governance weakness. The report states that the candidate is too generic and lacks a source-to-sink exploit chain. Across the rest of the report, many findings repeat a consistent pattern: current client-side evidence is insufficient to close exploitability or authorization impact without backend or runtime proof. The report repeatedly asks for runtime logs, server-side control logic, and end-to-end traceability.

This produces a more conservative report. It may feel less decisive because only one finding is confirmed, but it is clearer about what is known, what is inferred, and what remains unproven.

Reasoning style and evidence backing

The 3.1.75 report uses a classic source-transit-sink style. Confirmed findings often include:

Entry point.

Sensitive point or risk sink.

Control failure.

Evidence and data flow.

"Why This Is Confirmed."

Business impact.

Remediation.

For example, the dynamic SQL finding traces POST /app/{service} through AppController.java, ApiServiceSMOImpl.java, and an internal call to QueryServiceDAOImplMapper.xml, where ${value} is interpolated directly into the SQL text. The OA workflow SQL finding similarly traces raw SQL into OaWorkflowFormServiceDaoImplMapper.xml. The zero-amount payment bypass traces /app/toPay into ToPaySMOImpl.java, where money <= 0 leads to fee.payFeeConfirm without passing through a payment gateway.
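The zero-amount branch is easier to reason about in isolation. The Python sketch below mirrors only the shape of the condition named in the report (money <= 0 leading straight to confirmation). The function names, the callbacks, and the guarded variant are assumptions, as is the premise that the amount reaching the branch can be influenced by the request.

```python
from decimal import Decimal

def to_pay_as_reported(money, confirm, charge):
    # Shape of the reported branch: a non-positive amount confirms the fee directly.
    if money <= 0:
        return confirm()
    return charge(money)

def to_pay_guarded(client_money, server_amount_due, confirm, charge):
    # Sketch of a guard: the client figure is ignored in favor of the server-side amount.
    if server_amount_due < 0:
        raise ValueError("amount due cannot be negative")
    if server_amount_due == 0:
        return confirm()  # nothing is actually owed
    return charge(server_amount_due)

confirm = lambda: "confirmed without payment"
charge = lambda amount: f"charged {amount}"

print(to_pay_as_reported(Decimal("0"), confirm, charge))
print(to_pay_guarded(Decimal("0"), Decimal("120.00"), confirm, charge))
```

In the guarded variant a zero sent by the client no longer reaches the confirmation path unless the server-side calculation also says nothing is owed.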

This reasoning style is evidence-backed and security-engineering oriented. It is strongest when the repository contains the server-side code needed to close the chain.

The GPT 5.5 report uses a layered confirmation style. It binds each finding to a function, subfunction, and flow, then separates Current Evidence, Why Highly Likely or Why Not Confirmed Yet, Missing Key Evidence, and Upgrade Path to Confirmed. This structure is especially visible in client-side findings. For example, the static sign header finding identifies the fixed header value and where it is attached to outgoing requests, but does not confirm exploitability because backend signature verification is not visible. The route-parameter payment findings identify where roomId, communityId, and feeIds enter frontend flows, but do not confirm unauthorized payment order creation without backend authorization evidence.

This reasoning style is also evidence-backed, but it is oriented toward audit triage rather than confirmed exploitation. It turns the report into a validation queue: each unconfirmed item tells the user what proof would be needed next.

Confirmation quality

Both reports provide file paths and line references, but they differ in the type of confirmation they provide.

The 3.1.75 report provides strong confirmation quality for backend findings. It frequently gives complete file-path and line-level evidence across source, transit, and sink. For confirmed findings, the report often explains why the observed control failure is enough to establish the risk. Its strongest examples are SQL injection, cryptographic constants, JWT filter exclusions, IoT callback authentication gaps, and payment-processing logic.

Its reproduction guidance rests on static data-flow evidence rather than on executable proof-of-concept code. It does not generally provide step-by-step exploit scripts. But for many confirmed issues, the report provides enough path, line, and control-failure evidence for an engineer to reproduce or validate the behavior manually.

The GPT 5.5 report provides strong line-level evidence within the frontend scope. The confirmed credential-storage finding is particularly clear: it identifies the login path, the no-op encryption function, and the local storage sink. For non-confirmed findings, it provides the client-side evidence and then explicitly states what backend or runtime evidence is missing.

The GPT 5.5 report is better at separating confirmed issues from hypotheses. Many findings are framed as "possible if backend accepts tampered identifiers" or "potential if backend authorization is weak." That language is important because it prevents client-side observations from being overstated as server-side vulnerabilities.

Neither report should be read as a complete dynamic test result. Both are static or source-evidence reports. The 3.1.75 report often reaches confirmed status through complete code evidence. The GPT 5.5 report often stops at Needs More Evidence because the reviewed frontend code cannot prove server-side behavior.

Model strengths and weaknesses for this project

mimo-v2.5-pro in Code Hunter 3.1.75

The primary strength of the mimo-v2.5-pro report is depth of backend confirmation. It finds severe vulnerabilities where code evidence supports a complete chain: SQL execution, authentication exclusions, cryptography, payment confirmation, and IoT callback handling. It also synthesizes systemic observations from individual findings, which is valuable for remediation planning.

Its weakness is that it can produce a dense finding set with overlapping business and technical perspectives. For example, smart-meter arbitrary bean invocation appears both as a technical callback issue and as a business-impact issue. CallComponentController reflection also appears in related technical and business-impact forms. This is not necessarily wrong, but it means users may need to merge related findings before creating tickets.
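One way to merge such overlapping entries before ticketing is to group them by the code location they share. The Python sketch below assumes each finding has been reduced to an identifier, a view, a component, and a sink; all of these field names and sample values are hypothetical placeholders, not entries copied from the report.

```python
from collections import defaultdict

def group_by_sink(findings):
    """Group findings that point at the same component and sink."""
    groups = defaultdict(list)
    for finding in findings:
        groups[(finding["component"], finding["sink"])].append(finding["id"])
    return dict(groups)

# Placeholder records: a technical and a business-impact view of the same sink.
findings = [
    {"id": "TECH-1", "view": "technical", "component": "meter-callback", "sink": "bean resolution"},
    {"id": "BIZ-1", "view": "business", "component": "meter-callback", "sink": "bean resolution"},
    {"id": "TECH-2", "view": "technical", "component": "call-component", "sink": "reflection"},
]

grouped = group_by_sink(findings)
for key, ids in grouped.items():
    print(key, ids)
```

Each group can then become a single ticket that carries both the technical description and the business impact.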

Another limitation is that non-confirmed findings in the 3.1.75 functional binding view do not always carry explicit severity values, which makes them harder to rank. The report is highly informative, but users should treat its Needs More Evidence section as a separate validation backlog rather than as part of the confirmed vulnerability list.

GPT 5.5 in Code Hunter 3.1.80-dev.0

The primary strength of the GPT 5.5 report is disciplined uncertainty management. It is careful not to convert client-side observations into confirmed backend vulnerabilities. It also explicitly rejects a generic dependency-governance candidate for lacking a source-to-sink chain. This is useful for reducing false positives.

GPT 5.5 is also strong at business-flow modeling from the frontend: it identifies where route parameters, client-carried identifiers, local state, workflow actions, and community IDs enter sensitive operations. For a property-management application, that is important because many real vulnerabilities arise when backend services trust client-selected tenants, owners, members, fees, task IDs, or workflow transitions.

Its weakness is that the narrower frontend scope prevents decisive confirmation of many high-impact candidates. The report identifies plausible risks, but most remain unresolved until backend access-control, settlement, idempotency, tenant-isolation, and workflow-state-machine logic is reviewed. As a standalone audit, it is therefore less actionable for immediate remediation than the 3.1.75 report, except for the confirmed credential-storage issue.

What this means for Code Hunter as a security-audit product

These reports show that Code Hunter's value is not only in finding vulnerabilities, but in how it expresses confidence. A useful security-audit product must help users distinguish confirmed bugs, high-risk hypotheses, and rejected candidates. The GPT 5.5 report is stronger in this respect: it makes uncertainty highly visible. The 3.1.75 report is stronger at surfacing confirmed backend and architectural failures when the audit scope includes enough code to close the chain.

For Code Hunter, the product lesson is that model choice should be exposed as an audit parameter, not hidden as an implementation detail. Users need to know whether a report was optimized for broad backend confirmation, client-flow triage, conservative uncertainty handling, or systemic architecture review.

The reports also show the need for cross-model normalization. Raw finding counts are not enough. A report with 14 confirmed findings is not automatically "better" than a report with 1 confirmed finding if the scopes are different. Conversely, a conservative report with many Needs More Evidence items is not necessarily weaker; it may be protecting the user from unsupported claims.

A mature Code Hunter workflow should therefore support:

Clear model, version, and scope metadata.

Side-by-side comparison of confirmation tiers.

De-duplication of overlapping technical and business-impact findings.

Explicit "missing evidence" fields.

Reproducibility guidance for static findings.

Runtime validation hooks for client-side and business-flow hypotheses.

Separate prioritization for confirmed remediation and validation backlog.
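Several items in the list above amount to a record format. The Python sketch below shows one possible shape in which severity, status, run metadata, and missing evidence are separate fields. The class and field names are assumptions and not a Code Hunter schema, and the second sample finding is a placeholder.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    CONFIRMED = "Confirmed"
    HIGHLY_LIKELY = "Highly Likely"
    NEEDS_MORE_EVIDENCE = "Needs More Evidence"
    REJECTED = "Rejected / Excluded"

@dataclass
class AuditRun:
    tool_version: str
    model: str
    scope: str

@dataclass
class Finding:
    title: str
    severity: str   # impact if the finding turns out to be true
    status: Status  # how much of the evidence chain is proven
    run: AuditRun
    missing_evidence: list = field(default_factory=list)

    def is_actionable_now(self):
        # Only confirmed findings with no open evidence gaps go straight to remediation.
        return self.status is Status.CONFIRMED and not self.missing_evidence

run = AuditRun("3.1.80-dev.0", "GPT 5.5", "frontend application flows")
confirmed = Finding(
    "Credential-bearing profile data stored without effective encryption",
    "High", Status.CONFIRMED, run,
)
candidate = Finding(
    "Placeholder client-carried identifier candidate",
    "High", Status.NEEDS_MORE_EVIDENCE, run,
    missing_evidence=["backend ACL evidence"],
)
print(confirmed.is_actionable_now(), candidate.is_actionable_now())
```

Keeping severity and status in separate fields lets two findings share a High severity while landing in different workflows.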

The product should also avoid presenting model output as a single flat vulnerability list. The 3.1.80-dev.0 report shows why: many high-impact client-side observations depend entirely on whether the backend enforces authorization and recalculates trusted state. Treating those as confirmed would mislead users. Treating them as irrelevant would also be wrong.

Practical guidance for users choosing a model or interpreting reports

Users should not choose a model by counting findings. They should choose based on the audit question.

If the goal is to find confirmed backend vulnerabilities in a full repository, the 3.1.75 / mimo-v2.5-pro report is the stronger example. It provides many complete source-to-sink chains and identifies severe platform risks in SQL execution, authentication, cryptography, payments, and IoT callbacks.

If the goal is to analyze frontend trust boundaries, client-carried identifiers, workflow transitions, and evidence gaps, the 3.1.80-dev.0 / GPT 5.5 report is the stronger example. It is cautious, structured, and useful for planning targeted backend validation.

When interpreting Code Hunter reports, users should separate findings into three working queues.

First, confirmed findings should go directly into remediation planning. In this project, that includes the 3.1.75 confirmed SQL injection, cryptographic weaknesses, JWT exclusions, payment bypass, deduction manipulation, and unauthenticated IoT callbacks, as well as the GPT 5.5 confirmed credential-storage issue.

Second, Highly Likely findings should become validation tasks. They usually have strong code evidence but lack one decisive proof, such as privilege configuration completeness or backend verification behavior.

Third, Needs More Evidence findings should not be dismissed. They should be tested against runtime behavior, server-side authorization logic, deployment configuration, and production controls. In the GPT 5.5 report, many Needs More Evidence findings are exactly the kinds of issues that become serious if the backend trusts client-provided communityId, memberId, fee identifiers, workflow task IDs, or monetary values.

Users should also pay attention to missing-evidence language. A finding that says "backend ACL evidence missing" is not the same as a false positive. It means the current audit slice cannot answer the exploitability question. In that case, the right next step is to inspect server-side authorization, not to close the finding.

Finally, users should compare reports by evidence quality rather than by severity labels alone. A Medium confirmed issue with a complete file-and-line chain may be more immediately actionable than a High Needs More Evidence issue. Conversely, a High Needs More Evidence issue in a payment or tenant-isolation flow may deserve urgent validation because the business impact would be large if confirmed.
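The three-queue approach described in this section can be sketched as a small routing step. This is a minimal Python illustration: the queue names, the severity ranking, and the sample findings are assumptions chosen for the example, not part of either report.

```python
SEVERITY_RANK = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}

QUEUE_BY_STATUS = {
    "Confirmed": "remediation",
    "Highly Likely": "validation",
    "Needs More Evidence": "evidence",
}

def build_queues(findings):
    """Route findings by status, then order each queue by severity."""
    queues = {"remediation": [], "validation": [], "evidence": []}
    for finding in findings:
        queue = QUEUE_BY_STATUS.get(finding["status"])
        if queue is not None:  # Rejected / Excluded items are left out
            queues[queue].append(finding)
    for items in queues.values():
        # Unknown severities sort after all known ones.
        items.sort(key=lambda f: SEVERITY_RANK.get(f["severity"], len(SEVERITY_RANK)))
    return queues

sample = [
    {"title": "confirmed-medium", "severity": "Medium", "status": "Confirmed"},
    {"title": "confirmed-critical", "severity": "Critical", "status": "Confirmed"},
    {"title": "likely-high", "severity": "High", "status": "Highly Likely"},
    {"title": "candidate-high", "severity": "High", "status": "Needs More Evidence"},
    {"title": "rejected", "severity": "Medium", "status": "Rejected / Excluded"},
]

queues = build_queues(sample)
for name, items in queues.items():
    print(name, [f["title"] for f in items])
```

Status decides the queue and severity decides the order within it, so a High candidate in the evidence queue is surfaced early for validation without being presented as confirmed.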

Balanced conclusion

The two Code Hunter reports show different audit behaviors on MicroCommunity PropertyApp.

The Code Hunter 3.1.75 / mimo-v2.5-pro report behaves like a deep backend and platform auditor. It confirms many severe issues, provides complete data-flow traces, and identifies systemic weaknesses in authentication, cryptography, input trust, IoT callbacks, and payment processing. Its output is remediation-oriented, but users may need to merge overlapping findings and separately manage the validation backlog.

The Code Hunter 3.1.80-dev.0 / GPT 5.5 report behaves like a cautious client-flow and confirmation-layer auditor. It confirms fewer issues, but it is disciplined about not overstating frontend evidence. It is useful for identifying where backend validation must be checked, especially in tenant, owner/member, payment, repair, procurement, and workflow flows. Its main limitation is that many findings remain unresolved without backend or runtime evidence.

For Code Hunter as a product, the comparison reinforces a central principle: model choice changes audit posture. A model can be more confirmatory, more exploratory, more conservative, or more business-flow oriented. The best user experience is not to hide these differences, but to make them explicit and operational. Users should be able to see what was confirmed, what was inferred, what remains hypothetical, and what evidence would close the gap.

Source reports

https://github.com/SEc-123/codehunter-docs/blob/main/docs/examples/codehunter-3.1.75-microcommunity-propertyapp-audit-report.md

https://github.com/SEc-123/codehunter-docs/blob/main/docs/examples/codehunter-3.1.80-dev.0-propertyapp-audit-report.md
