Commerce S&O · Take-Home Assessment

Refund abuse: diagnose it, size it, ship a policy that's fair to the 89%.

An interactive companion to the 2-page brief — live calibration of the proposed policy thresholds against the provided dataset, full SQL query replay via DuckDB-WASM in your browser, and a forward-projection model for 30/60/90 rollout scenarios.

Submitted by Kevin Astuhuaman · Dataset whatnot_refund_data.xlsx · 12 months · Runtime 100% client-side · no server

Scale, drivers, exposure

Leadership suspects buyers are gaming refunds. The data has no confirmed-abuse label, so the analytical lens is buyer concentration within the gameable-reason universe (99% of approved refunds). Ranked by a 2-feature ex-ante risk score, a small buyer cohort drives most of the refund spend — and a disproportionate share of the refund queue.

Requested refund $ · Σ refund_amount_usd WHERE refund_requested=1
Denied refund $ (row-based) · Σ refund_amount_usd WHERE refund_requested=1 AND refund_approved=0
Observed order value · Σ order_value_usd
Approved ÷ observed · share of observed order value returned as approved refund $
Approved refund $ by risk-likelihood decile
D1 = riskiest 10% of refund-requesting buyers. Deciles formed using only prior_excess_ratio + prior_gameable_request_count percentiles. Approved $ is outcome, never part of the score.
Share of approved refund $ from the top-quintile cohort (24 buyers) · computed live for Top 12 (decile) and Top 24 (quintile)
Share of refund requests from that same cohort · computed live
That's why Tier 3+ rules touch nearly half the refund queue — not a niche slice. Actual Tier 3+ volume depends on simulator calibration.
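The two-feature decile construction described above can be sketched in a few lines of Python. This is a hypothetical re-implementation, assuming buyer data arrives as a plain dict rather than the shipped tables; `percentile_rank` is the empirical CDF:

```python
from statistics import mean

def percentile_rank(values, v):
    """Fraction of observations <= v (empirical CDF)."""
    return sum(1 for x in values if x <= v) / len(values)

def risk_decile(buyers):
    """buyers: buyer_id -> (prior_excess_ratio, prior_gameable_request_count).
    Returns buyer_id -> decile, where decile 1 = riskiest 10%."""
    ratios = [r for r, _ in buyers.values()]
    counts = [c for _, c in buyers.values()]
    # Score = average of the two feature percentiles; approved $ never enters.
    score = {b: mean([percentile_rank(ratios, r), percentile_rank(counts, c)])
             for b, (r, c) in buyers.items()}
    ranked = sorted(score, key=score.get, reverse=True)  # riskiest first
    n = len(ranked)
    return {b: min(10, i * 10 // n + 1) for i, b in enumerate(ranked)}
```

Because the score is built only from percentiles of the two prior features, re-ranking a fresh month requires nothing beyond those two columns.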

Reason-code concentration in the high-risk cohort

Within the top-24 cohort, three hard-to-verify reason codes dominate: Marked delivered not received, Missing item, Wrong item — the reasons buyers can file without proof. Outside the cohort, these are much less common.

Abuse vs. legitimate exposure — bounded, not a point claim

Upper-bound exposure · top-quintile approved $
Seller-baseline proxy · Σ(order_value × seller_defect_rate) over cohort orders
Baseline-adjusted suspected excess · upper bound minus baseline, plausibly policy-addressable

Residual is not proof of abuse — it's the headroom a policy could plausibly address. Likely legitimate refunds (outside-cohort + cohort baseline) ≈ .
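The bounding arithmetic is just two sums and a difference; a minimal sketch with made-up numbers, assuming field names that mirror the xlsx columns:

```python
def exposure_bounds(cohort_rows):
    """cohort_rows: dicts with order_value_usd, seller_defect_rate,
    approved_refund_usd for the top-quintile cohort (illustrative only)."""
    upper = sum(r["approved_refund_usd"] for r in cohort_rows)      # upper-bound exposure
    baseline = sum(r["order_value_usd"] * r["seller_defect_rate"]   # seller-baseline proxy
                   for r in cohort_rows)
    excess = max(0.0, upper - baseline)                             # suspected excess
    return upper, baseline, excess
```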

Normalized from the flat xlsx — every number here has a traceable grain.

The raw file mixed order rows with buyer aggregates, which is a trap: SUM() over a column that's already aggregated inflates numbers if you forget to dedupe. I split it into 5 tables with explicit primary keys + foreign keys. Every metric on this page is sourced from one of these tables, and every cited number shows the exact SQL that produced it — click any SQL badge on the page to verify.
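The trap is easy to reproduce: when a buyer-level aggregate column is repeated on every order row, a naive SUM double-counts it. A toy illustration, with a hypothetical `buyer_total_refund_usd` column standing in for the flat file's buyer aggregates:

```python
# Flat rows: a buyer-level total repeated on every order row (the trap).
flat = [
    {"buyer": "B1", "order_id": 1, "buyer_total_refund_usd": 30.0},
    {"buyer": "B1", "order_id": 2, "buyer_total_refund_usd": 30.0},
    {"buyer": "B2", "order_id": 3, "buyer_total_refund_usd": 10.0},
]

# Naive SUM over the mixed-grain column counts B1's total twice.
naive = sum(r["buyer_total_refund_usd"] for r in flat)

# Deduping to one row per buyer restores the correct grain.
per_buyer = {r["buyer"]: r["buyer_total_refund_usd"] for r in flat}
deduped = sum(per_buyer.values())
```

Splitting into tables with explicit primary keys makes the second aggregation the only one the schema allows.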

Replicate anywhere. The same 5 tables ship as parquet (for DuckDB / Snowflake / Redshift) and as a fully-formatted Excel workbook with pre-built pivot tables and an editable policy simulator.

Calibrate the thresholds. Re-route the 297 requests. See the tradeoff live.

Drag the sliders to set risk thresholds and support floors. Every move re-routes all 297 historical refund requests through the decision tree and updates the tier distribution, expected approved-$ reduction, and ops manual-review load. Defaults match the values proposed in the 2-page brief.
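Under the hood the routing is a simple threshold walk. A sketch with illustrative cutoffs — the brief's calibrated thresholds live in the simulator, not here:

```python
from collections import Counter

def assign_tier(risk_score, thresholds=(0.50, 0.80, 0.90, 0.97)):
    """Map a 0-1 risk-likelihood percentile to Tier 1-5.
    Cutoffs are illustrative defaults, not the brief's values."""
    return 1 + sum(risk_score >= t for t in thresholds)

def tier_distribution(scores):
    """Re-route every historical request through the tree; count per tier."""
    return Counter(assign_tier(s) for s in scores)
```

Moving a slider just changes `thresholds` and re-runs `tier_distribution` over all 297 scores, which is why the page can update instantly.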

Refund requests by tier (of 297)
Projected outcome on the 228 historical approvals
Approved $ before policy: $22,413
Approved $ after policy
Exposure reduction
Manual-review volume
Guardrail watch (simulated)
Ordered decision tree (live, using current thresholds)

    Every number above, re-derivable in SQL. In your browser.

    This page runs DuckDB-WASM entirely in your browser against the parquet data we ship. No server, no install. Click Run on any query to see the output; edit the SQL inline and re-run. Use the provided parquet files (orders.parquet, buyers.parquet, rq_buyers.parquet) to replicate in your own DuckDB, Snowflake, or Redshift environment.


    Pick a pilot aggressiveness. See the tradeoff curve.

    Forward projection over 90 days. Toggle pilot aggressiveness; the model re-estimates refund-$ reduction trajectory, appeal-overturn projection, and ops-review volume. Assumes calibration against ~10% of monthly refund requests for QA.
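One plausible functional form for the reduction trajectory is a saturating ramp toward the selected pilot's maximum. This is an assumed shape for illustration, not the page's fitted model; `max_reduction` and `ramp_days` are hypothetical parameters:

```python
import math

def projected_reduction(day, max_reduction, ramp_days=30.0):
    """Refund-$ reduction at `day`: exponential approach to max_reduction.
    Assumed functional form for illustration only."""
    return max_reduction * (1 - math.exp(-day / ramp_days))
```

Toggling pilot aggressiveness would then amount to swapping in a different `max_reduction` (and, plausibly, a faster or slower `ramp_days`).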

    Approved refund $ reduction trajectory (baseline-adjusted)
    Appeal-overturn rate (target < 20%)
    Day 30 · Trust/risk + Data/analytics: calibrated QA panel agreement ≥ 85% on a 50-case blind double-review; sampled false-positive audit (20 Tier 4/5 cases) with false-positive rate ≤ 10%; Pareto concentration re-validated on a fresh month.
    Day 60 · Operations + Product: baseline-adjusted suspected excess in the top-quintile cohort reduced (target depends on selected pilot aggressiveness); appeal-overturn rate < 20%; manual-review volume within capacity.
    Day 90 · Trust/risk + Operations: sustained reduction; repeat-purchase and CSAT within ±2pp of baseline; manual-review SLA < 48h; seller-side NPS unchanged.

    What I assumed, what I caught, what production would look like.

    The 2-feature ex-ante risk score

    risk_likelihood_score(b, t) = avg(
      percentile(prior_excess_ratio(b, t)),
      percentile(prior_gameable_request_count(b, t))
    )
    
    prior_excess_ratio(b, t) =
      Σ refund_amount_usd WHERE buyer=b AND
        is_gameable_reason=1 AND refund_request_date < t
      / Σ order_value_usd WHERE buyer=b AND order_date < t
      − Σ(order_value_i × seller_defect_rate_i) / Σ order_value_i
        [prior-spend-weighted seller-baseline]

    Approved refund $ is never part of the score (avoids circular ranking). current_requested_refund_amount controls evidence burden + queue order only, never tier assignment.

    Data-dictionary discrepancy I caught

    buyer_gameable_refund_count is documented as "approved gameable refunds," but the per-buyer values consistently match the request count, not the approved count. Example: buyer B_A_013 has count=10, but only 8 of their 10 gameable requests were approved.

    Treated throughout as prior_gameable_request_count. Approved exposure is computed directly from row-level refund_approved=1. Flagged so reviewers know I read the data.
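The discrepancy check itself is mechanical: recompute both counts from row level and flag buyers whose documented count matches requests but not approvals. A sketch, with field names following the data dictionary:

```python
def check_gameable_count(buyer_rows, order_rows):
    """Flag buyers where buyer_gameable_refund_count equals the recomputed
    request count but not the recomputed approved count."""
    mismatches = []
    for b in buyer_rows:
        rows = [o for o in order_rows
                if o["buyer_id"] == b["buyer_id"] and o["is_gameable_reason"]]
        requested = sum(o["refund_requested"] for o in rows)
        approved = sum(o["refund_approved"] for o in rows)
        # Chained comparison: count matches requests AND requests != approvals.
        if b["buyer_gameable_refund_count"] == requested != approved:
            mismatches.append((b["buyer_id"], requested, approved))
    return mismatches
```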

    Descriptive vs. backtest

    Percentile breakpoints in the simulator (P50, P80, P90, etc.) are computed over the full-period buyer aggregates provided in the xlsx. This is a descriptive view, not a clean backtest of request-time decisions.

    Production policy would compute prior_* features using only data known at refund-request time (refund_request_date = order_date + days_to_refund_request), and thresholds would be recalibrated on the prior-at-request distribution before Tier 4/5 enforcement goes live.
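A request-time version of `prior_excess_ratio` only has to filter on dates known strictly before t. A sketch, deriving `refund_request_date` from `order_date + days_to_refund_request` as above; field names are assumed to mirror the xlsx columns, the rest follows the formula block:

```python
from datetime import date, timedelta

def prior_excess_ratio(orders, buyer, t):
    """Compute the feature using only data known strictly before request time t.
    orders: dicts with buyer, order_date (date), days_to_refund_request (int or
    None), order_value_usd, refund_amount_usd, is_gameable_reason,
    seller_defect_rate. Field names assumed; sketch, not production code."""
    prior = [o for o in orders if o["buyer"] == buyer and o["order_date"] < t]
    spend = sum(o["order_value_usd"] for o in prior)
    if spend == 0:
        return 0.0
    gameable = 0.0
    for o in orders:
        if o["buyer"] != buyer or not o["is_gameable_reason"]:
            continue
        d = o["days_to_refund_request"]
        if d is None:
            continue
        if o["order_date"] + timedelta(days=d) < t:  # refund_request_date < t
            gameable += o["refund_amount_usd"]
    # Prior-spend-weighted seller baseline, as in the formula above.
    baseline = sum(o["order_value_usd"] * o["seller_defect_rate"] for o in prior) / spend
    return gameable / spend - baseline
```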

    Production scale on AWS

    This analysis runs in DuckDB-WASM against ~160 KB of parquet — perfect for demonstrating logic, overkill for production volume.

    At Whatnot scale, the policy would ingest from Kafka into Snowflake/Redshift; feature store computes prior_* features at request time; rule engine (or lightweight ML scorer as v2) emits a tier label for every refund request; manual-review queue is prioritized by action_priority = risk_likelihood × current_requested_refund_amount. Shadow mode for 2 weeks before enforcement, recalibration quarterly.
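The queue-prioritization rule is a one-liner over a heap. A sketch of the `action_priority` ordering, with request field names assumed:

```python
import heapq

def review_queue(requests):
    """Order the manual-review queue by action_priority =
    risk_likelihood x current_requested_refund_amount, highest first.
    Sketch of the prioritization rule, not the production rule engine."""
    heap = [(-r["risk_likelihood"] * r["current_requested_refund_amount"],
             r["request_id"]) for r in requests]
    heapq.heapify(heap)  # min-heap on negated priority = max-priority first
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

Note the ordering is dollar-weighted: a mid-risk, high-value request can outrank a high-risk, low-value one, which matches the goal of spending reviewer time where exposure is largest.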