Commerce S&O · Take-Home Assessment

Refund abuse: diagnose it, size it, ship a policy that's fair to the 89%.

An interactive companion to the 2-page brief — live calibration of the proposed policy thresholds against the provided dataset, full SQL query replay via DuckDB-WASM in your browser, and a forward-projection model for 30/60/90 rollout scenarios.

Submitted by Kevin Astuhuaman · Dataset whatnot_refund_data.xlsx · 12 months · Runtime 100% client-side · no server

Scale, drivers, exposure

Leadership suspects buyers are gaming refunds. The data has no confirmed-abuse label, so the analytical lens is buyer concentration within the gameable-reason universe (99% of approved refunds). Ranked by a 2-feature ex-ante risk score, a small buyer cohort drives most of the refund spend — and a disproportionate share of the refund queue.

Requested refund $ · Σ refund_amount_usd WHERE refund_requested=1
Denied refund $ (row-based) · Σ refund_amount_usd WHERE refund_requested=1 AND refund_approved=0
Observed order value · Σ order_value_usd
Approved ÷ observed · share of observed order value returned as approved refund $
Approved refund $ by risk-likelihood decile
D1 = riskiest 10% of refund-requesting buyers. Deciles formed using only prior_excess_ratio + prior_gameable_request_count percentiles. Approved $ is outcome, never part of the score.
Share of approved refund $ from the top-quintile cohort (24 buyers) · computed live for Top 12 (decile) and Top 24 (quintile)
Share of refund requests from that same cohort · computed live
That's why Tier 3+ rules touch nearly half the refund queue — not a niche slice. Actual Tier 3+ volume depends on simulator calibration.
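The two-feature decile construction described above can be sketched in a few lines of Python. This is a hypothetical re-implementation, assuming buyer data arrives as a plain dict rather than the shipped tables; `percentile_rank` is the empirical CDF:

```python
from statistics import mean

def percentile_rank(values, v):
    """Fraction of observations <= v (empirical CDF)."""
    return sum(1 for x in values if x <= v) / len(values)

def risk_decile(buyers):
    """buyers: buyer_id -> (prior_excess_ratio, prior_gameable_request_count).
    Returns buyer_id -> decile, where decile 1 = riskiest 10%."""
    ratios = [r for r, _ in buyers.values()]
    counts = [c for _, c in buyers.values()]
    # Score = average of the two feature percentiles; approved $ never enters.
    score = {b: mean([percentile_rank(ratios, r), percentile_rank(counts, c)])
             for b, (r, c) in buyers.items()}
    ranked = sorted(score, key=score.get, reverse=True)  # riskiest first
    n = len(ranked)
    return {b: min(10, i * 10 // n + 1) for i, b in enumerate(ranked)}
```

Because the score is built only from percentiles of the two prior features, re-ranking a fresh month requires nothing beyond those two columns.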

Reason-code concentration in the high-risk cohort

Within the top-24 cohort, three hard-to-verify reason codes dominate: Marked delivered not received, Missing item, Wrong item — the reasons buyers can file without proof. Outside the cohort, these are much less common.

Abuse vs. legitimate exposure — bounded, not a point claim

Upper-bound exposure · top-quintile approved $
Seller-baseline proxy · Σ(order_value × seller_defect_rate) over cohort orders
Baseline-adjusted suspected excess · upper bound minus baseline, plausibly policy-addressable

Residual is not proof of abuse — it's the headroom a policy could plausibly address. Likely legitimate refunds (outside-cohort + cohort baseline) ≈ .
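The bounding arithmetic is just two sums and a difference; a minimal sketch with made-up numbers, assuming field names that mirror the xlsx columns:

```python
def exposure_bounds(cohort_rows):
    """cohort_rows: dicts with order_value_usd, seller_defect_rate,
    approved_refund_usd for the top-quintile cohort (illustrative only)."""
    upper = sum(r["approved_refund_usd"] for r in cohort_rows)      # upper-bound exposure
    baseline = sum(r["order_value_usd"] * r["seller_defect_rate"]   # seller-baseline proxy
                   for r in cohort_rows)
    excess = max(0.0, upper - baseline)                             # suspected excess
    return upper, baseline, excess
```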

Normalized from the flat xlsx — every number here has a traceable grain.

The raw file mixed order rows with buyer aggregates, which is a trap: SUM() over a column that's already aggregated inflates numbers if you forget to dedupe. I split it into 5 tables with explicit primary keys + foreign keys. Every metric on this page is sourced from one of these tables, and every cited number shows the exact SQL that produced it — click any SQL badge on the page to verify.
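The trap is easy to reproduce: when a buyer-level aggregate column is repeated on every order row, a naive SUM double-counts it. A toy illustration, with a hypothetical `buyer_total_refund_usd` column standing in for the flat file's buyer aggregates:

```python
# Flat rows: a buyer-level total repeated on every order row (the trap).
flat = [
    {"buyer": "B1", "order_id": 1, "buyer_total_refund_usd": 30.0},
    {"buyer": "B1", "order_id": 2, "buyer_total_refund_usd": 30.0},
    {"buyer": "B2", "order_id": 3, "buyer_total_refund_usd": 10.0},
]

# Naive SUM over the mixed-grain column counts B1's total twice.
naive = sum(r["buyer_total_refund_usd"] for r in flat)

# Deduping to one row per buyer restores the correct grain.
per_buyer = {r["buyer"]: r["buyer_total_refund_usd"] for r in flat}
deduped = sum(per_buyer.values())
```

Splitting into tables with explicit primary keys makes the second aggregation the only one the schema allows.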

Replicate anywhere. The same 5 tables ship as parquet (for DuckDB / Snowflake / Redshift) and as a fully-formatted Excel workbook with pre-built pivot tables and an editable policy simulator.

Calibrate the thresholds. Re-route the 297 requests. See the tradeoff live.

Drag the sliders to set risk thresholds and support floors. Every move re-routes all 297 historical refund requests through the decision tree and updates the tier distribution, expected approved-$ reduction, and ops manual-review load. Defaults match the values proposed in the 2-page brief.
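Under the hood the routing is a simple threshold walk. A sketch with illustrative cutoffs — the brief's calibrated thresholds live in the simulator, not here:

```python
from collections import Counter

def assign_tier(risk_score, thresholds=(0.50, 0.80, 0.90, 0.97)):
    """Map a 0-1 risk-likelihood percentile to Tier 1-5.
    Cutoffs are illustrative defaults, not the brief's values."""
    return 1 + sum(risk_score >= t for t in thresholds)

def tier_distribution(scores):
    """Re-route every historical request through the tree; count per tier."""
    return Counter(assign_tier(s) for s in scores)
```

Moving a slider just changes `thresholds` and re-runs `tier_distribution` over all 297 scores, which is why the page can update instantly.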

Refund requests by tier (of 297)
Projected outcome on the 228 historical approvals
Approved $ before policy: $22,413
Approved $ after policy
Exposure reduction
Manual-review volume
Guardrail watch (simulated)
Ordered decision tree (live, using current thresholds)

    Every number above, re-derivable in SQL. In your browser.

    This page runs DuckDB-WASM entirely in your browser against the parquet data we ship. No server, no install. Click Run on any query to see the output; edit the SQL inline and re-run. Use the provided parquet files (orders.parquet, buyers.parquet, rq_buyers.parquet) to replicate in your own DuckDB, Snowflake, or Redshift environment.


    Pick a pilot aggressiveness. See the tradeoff curve.

    Forward projection over 90 days. Toggle pilot aggressiveness; the model re-estimates refund-$ reduction trajectory, appeal-overturn projection, and ops-review volume. Assumes calibration against ~10% of monthly refund requests for QA.
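One plausible functional form for the reduction trajectory is a saturating ramp toward the selected pilot's maximum. This is an assumed shape for illustration, not the page's fitted model; `max_reduction` and `ramp_days` are hypothetical parameters:

```python
import math

def projected_reduction(day, max_reduction, ramp_days=30.0):
    """Refund-$ reduction at `day`: exponential approach to max_reduction.
    Assumed functional form for illustration only."""
    return max_reduction * (1 - math.exp(-day / ramp_days))
```

Toggling pilot aggressiveness would then amount to swapping in a different `max_reduction` (and, plausibly, a faster or slower `ramp_days`).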

    Approved refund $ reduction trajectory (baseline-adjusted)
    Appeal-overturn rate (target < 20%)
    Day 30 · Trust/risk + Data/analytics: calibrated QA panel agreement ≥ 85% on a 50-case blind double-review; sampled false-positive audit (20 Tier 4/5 cases) with false-positive rate ≤ 10%; Pareto concentration re-validated on a fresh month.
    Day 60 · Operations + Product: baseline-adjusted suspected excess in the top-quintile cohort reduced (target depends on selected pilot aggressiveness); appeal-overturn rate < 20%; manual-review volume within capacity.
    Day 90 · Trust/risk + Operations: sustained reduction; repeat-purchase and CSAT within ±2pp of baseline; manual-review SLA < 48h; seller-side NPS unchanged.

    What I assumed, what I caught, what production would look like.

    The 2-feature ex-ante risk score

    risk_likelihood_score(b, t) = avg(
      percentile(prior_excess_ratio(b, t)),
      percentile(prior_gameable_request_count(b, t))
    )
    
    prior_excess_ratio(b, t) =
      Σ refund_amount_usd WHERE buyer=b AND
        is_gameable_reason=1 AND refund_request_date < t
      / Σ order_value_usd WHERE buyer=b AND order_date < t
      − Σ(order_value_i × seller_defect_rate_i) / Σ order_value_i
        [prior-spend-weighted seller-baseline]

    Approved refund $ is never part of the score (avoids circular ranking). current_requested_refund_amount controls evidence burden + queue order only, never tier assignment.

    Data-dictionary discrepancy I caught

    buyer_gameable_refund_count is documented as "approved gameable refunds," but the per-buyer values consistently match the request count, not the approved count. Example: buyer B_A_013 has count=10, but only 8 of their 10 gameable requests were approved.

    Treated throughout as prior_gameable_request_count. Approved exposure is computed directly from row-level refund_approved=1. Flagged so reviewers know I read the data.
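The discrepancy check itself is mechanical: recompute both counts from row level and flag buyers whose documented count matches requests but not approvals. A sketch, with field names following the data dictionary:

```python
def check_gameable_count(buyer_rows, order_rows):
    """Flag buyers where buyer_gameable_refund_count equals the recomputed
    request count but not the recomputed approved count."""
    mismatches = []
    for b in buyer_rows:
        rows = [o for o in order_rows
                if o["buyer_id"] == b["buyer_id"] and o["is_gameable_reason"]]
        requested = sum(o["refund_requested"] for o in rows)
        approved = sum(o["refund_approved"] for o in rows)
        # Chained comparison: count matches requests AND requests != approvals.
        if b["buyer_gameable_refund_count"] == requested != approved:
            mismatches.append((b["buyer_id"], requested, approved))
    return mismatches
```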

    Descriptive vs. backtest

    Percentile breakpoints in the simulator (P50, P80, P90, etc.) are computed over the full-period buyer aggregates provided in the xlsx. This is a descriptive view, not a clean backtest of request-time decisions.

    Production policy would compute prior_* features using only data known at refund-request time (refund_request_date = order_date + days_to_refund_request), and thresholds would be recalibrated on the prior-at-request distribution before Tier 4/5 enforcement goes live.
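A request-time version of `prior_excess_ratio` only has to filter on dates known strictly before t. A sketch, deriving `refund_request_date` from `order_date + days_to_refund_request` as above; field names are assumed to mirror the xlsx columns, the rest follows the formula block:

```python
from datetime import date, timedelta

def prior_excess_ratio(orders, buyer, t):
    """Compute the feature using only data known strictly before request time t.
    orders: dicts with buyer, order_date (date), days_to_refund_request (int or
    None), order_value_usd, refund_amount_usd, is_gameable_reason,
    seller_defect_rate. Field names assumed; sketch, not production code."""
    prior = [o for o in orders if o["buyer"] == buyer and o["order_date"] < t]
    spend = sum(o["order_value_usd"] for o in prior)
    if spend == 0:
        return 0.0
    gameable = 0.0
    for o in orders:
        if o["buyer"] != buyer or not o["is_gameable_reason"]:
            continue
        d = o["days_to_refund_request"]
        if d is None:
            continue
        if o["order_date"] + timedelta(days=d) < t:  # refund_request_date < t
            gameable += o["refund_amount_usd"]
    # Prior-spend-weighted seller baseline, as in the formula above.
    baseline = sum(o["order_value_usd"] * o["seller_defect_rate"] for o in prior) / spend
    return gameable / spend - baseline
```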

    Production scale on AWS

    This analysis runs in DuckDB-WASM against ~160 KB of parquet — perfect for demonstrating logic, overkill for production volume.

    At Whatnot scale, the policy would ingest from Kafka into Snowflake/Redshift; feature store computes prior_* features at request time; rule engine (or lightweight ML scorer as v2) emits a tier label for every refund request; manual-review queue is prioritized by action_priority = risk_likelihood × current_requested_refund_amount. Shadow mode for 2 weeks before enforcement, recalibration quarterly.
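The queue-prioritization rule is a one-liner over a heap. A sketch of the `action_priority` ordering, with request field names assumed:

```python
import heapq

def review_queue(requests):
    """Order the manual-review queue by action_priority =
    risk_likelihood x current_requested_refund_amount, highest first.
    Sketch of the prioritization rule, not the production rule engine."""
    heap = [(-r["risk_likelihood"] * r["current_requested_refund_amount"],
             r["request_id"]) for r in requests]
    heapq.heapify(heap)  # min-heap on negated priority = max-priority first
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

Note the ordering is dollar-weighted: a mid-risk, high-value request can outrank a high-risk, low-value one, which matches the goal of spending reviewer time where exposure is largest.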