The non-atomic onboarding write that could strand users

At a Glance

Priority

Eliminated

Race window

~1 hour

Time to fix

Files changed

+35 / −16

Lines delta

What Broke

A new user lands on /onboarding, fills in their academic profile, picks 3–5 interest categories, and hits Continue. The server action replaces their user_interests rows so the new set takes effect. The replacement was a DELETE followed by a separate INSERT — two round-trips, no transaction.

If the INSERT failed for any reason — a transient network drop after the DELETE returned, a temporary RLS hiccup, an FK violation — the row state on the server was now zero interests, but the action returned { ok: false, message: "Failed to save interests" }. The user retries. The Zod schema downstream still requires ≥1 interest. The retry fails on its own merits; meanwhile their old picks are gone. They cannot move forward, they cannot move back, and there is no UI to show them what state the DB is actually in. From the user's seat, the form ate their input and locked them out.

The window was narrow but the blast radius was the wrong shape: this is the only required gate between sign-up and a usable account.

Root Cause

The DELETE+INSERT pattern is a classic non-atomic "replace this set" anti-pattern. The original saveStepTwoAction did this:

await supabase.from("user_interests").delete().eq("user_id", user.id);

const { error: interestsError } = await supabase
  .from("user_interests")
  .insert(rows);

The PostgREST client cannot wrap these in a single transaction — each .from() call is a separate HTTP request to the database. There is no rollback. Once the DELETE commits server-side, the old set is gone forever.

What makes this specific to onboarding is the downstream invariant: OnboardingStepTwoSchema validates interests.length >= 1, and proxy.ts gates the user behind profiles.onboarding_completed = false. The combination means a single failed INSERT is not a recoverable user-facing error — it is a stuck account. Anywhere else in the app you could just re-submit. Here, the form's own contract makes retry insufficient: you need rows that no longer exist.

The PK on user_interests is (user_id, category) — exactly the conflict target needed for a proper upsert. The original code did not use it.

Timeline

Apr 1, 2026
Non-atomic write enters the codebase
The 3-step onboarding form lands (commit 2dfdbcd). saveStepTwoAction ships with a DELETE+INSERT on user_interests. The pattern is conventional-looking enough that it survives the initial review.
May 10, 2026
Sprint audit flags the failure window
During a backlog hardening pass before merging the auth sprint, the row "Make user_interests step-2 update atomic" gets logged in docs/ISSUES.md as P1. The trigger: a re-read of the action with the question "what happens if any of these calls fail?" — the answer was bad enough to escalate. Exact detection timestamp not separately committed; using the patch commit date as the lower bound.
May 10, 2026
Upsert + prune ships
Commit 49d8ca3 replaces DELETE+INSERT with upsert (onConflict: "user_id,category", ignoreDuplicates: true) followed by a fail-soft prune via .notIn(...). Single-file change in features/auth/actions/onboarding.action.ts. No DDL.
May 10, 2026
Fix lands on main; ISSUES row closed
The P1 row in docs/ISSUES.md flips to closed in the same commit. This was the final P1 of the auth-hardening sprint.

The Fix

The fix is a two-statement replacement, but the order matters: insert-first, prune-second. The old set must survive until the new set is committed.

Two real changes worth pulling out separately.

The upsert is insert-first. onConflict: "user_id,category" targets the actual PK; ignoreDuplicates: true means rows that already exist are no-ops, not constraint violations. If the call fails — network, RLS, anything — the old rows are still there and the user can retry without data loss.

The prune is fail-soft. .notIn("category", desired) deletes the user's rows whose category is not in the new set. If the prune fails, the user already has their new picks (the upsert ran). A few stale rows is a recoverable state mismatch; a hard 500 here would block onboarding for no good reason. The next step-2 save will retry the prune naturally.

Before·features/auth/actions/onboarding.action.ts

  // Replace interests: delete existing, then insert new
  await supabase.from("user_interests").delete().eq("user_id", user.id);

  const { error: interestsError } = await supabase
    .from("user_interests")
    .insert(
      parsed.data.interests.map((category) => ({
        user_id: user.id,
        category,
      })),
    );

  if (interestsError) {
    console.error("[onboarding:step2] interests insert:", interestsError);
    return { ok: false, message: "Failed to save interests. Please try again." };
  }

After·features/auth/actions/onboarding.action.ts

  // Replace interests via upsert+prune so partial failures don't lose data.
  // Insert-first preserves the old set if the upsert fails; the previous
  // DELETE+INSERT had a window where a failed insert left zero interests.
  const desired = parsed.data.interests;

  if (desired.length === 0) {
    // Defensive: Zod schema currently requires ≥1 interest (ISSUES.md P2).
    await supabase.from("user_interests").delete().eq("user_id", user.id);
  } else {
    const { error: upsertError } = await supabase
      .from("user_interests")
      .upsert(
        desired.map((category) => ({ user_id: user.id, category })),
        { onConflict: "user_id,category", ignoreDuplicates: true },
      );

    if (upsertError) {
      console.error("[onboarding:step2] interests upsert:", upsertError);
      return { ok: false, message: "Failed to save interests. Please try again." };
    }

    // Prune the complement: rows for this user not in the new set.
    // Prune failure is non-fatal: the user has their new interests; a few
    // stale ones may persist. The next step-2 update will retry the prune.
    // Failing here would block onboarding for a recoverable state mismatch.
    const { error: pruneError } = await supabase
      .from("user_interests")
      .delete()
      .eq("user_id", user.id)
      .notIn("category", desired);

    if (pruneError) {
      console.error("[onboarding:step2:prune-failed]", pruneError);
    }
  }

The empty-set branch is dead code today — Zod rejects empty input before this runs — but ISSUES.md P2 ("Decide step-2 onboarding skip semantics") may relax that. Without this branch, an empty desired array would slip through .notIn([]), which in PostgREST evaluates to "delete nothing." Future-me would have spent an hour staring at it. Cheaper to write the branch now and comment why it is currently unreachable.

Before·features/auth/actions/onboarding.action.ts

  // no defensive branch — desired could be empty after a future schema relaxation

After·features/auth/actions/onboarding.action.ts

  if (desired.length === 0) {
    // Defensive: Zod schema currently requires ≥1 interest (ISSUES.md P2).
    await supabase.from("user_interests").delete().eq("user_id", user.id);
  }

After the change, the worst case shifts from "user is locked out" to "user has a few stale rows alongside their new ones." That is a different category of bug — recoverable, invisible to onboarding, and self-healing on the next save. The action's error contract also gets honest: { ok: false } from this code path now means "your old set is intact, please retry," not "we silently destroyed your data."

The fix is 35 added / 16 removed lines in a single file. No DDL, no RLS changes, no migration. The PK on user_interests was already the right shape for the upsert — the original code just was not using it.

Verification

Verification was adversarial code review, not a chaos-engineering harness. The relevant questions were narrow enough to answer by inspection.

Walk the failure paths
For each error: branch in the new action, ask: "If this fails, what is in the DB?" The upsert-failure branch leaves the old rows. The prune-failure branch leaves the new rows plus possibly some old ones. Neither path leaves zero rows. Confirmed by reading the diff against the PK behavior of upsert(... { ignoreDuplicates: true }).
Verify the conflict target matches the PK
onConflict: "user_id,category" must exactly match the table's PK or the upsert silently degrades to a plain insert. docs/SCHEMA.md § user_interests confirms PK (user_id, category). Match.
Confirm RLS allows the new operations
user_interests policies cover SELECT, INSERT, and DELETE for auth.uid() = user_id. upsert resolves to INSERT (or no-op on conflict) so the INSERT policy applies. The .notIn(...) delete is a normal DELETE. No new RLS work required.
Confirm the issue row closed
docs/ISSUES.md had this tracked as a P1: "Make user_interests step-2 update atomic." The commit closes the row and links the fix. The audit-trail thread is intact.

Lessons Learned

A "replace this set" operation should never have a window where the set is empty. Insert-first, prune-second — and treat the prune as best-effort, because the user already has what they came for.

The thing I keep relearning: PostgREST is not a transaction. It is two HTTP calls. If you write code that reads like a transaction — "delete the old, then insert the new" — you are pattern-matching against the wrong mental model. The database happily commits each call as it arrives, and any failure between them is a real, durable state.

The cheap defense is to never let your code put the world in an invariant-violating state, even transiently. Here, the invariant is "this user has ≥1 interest." The old code violated it between line 1 and line 2 of the action. The new code never does — at worst it has extra rows, which the schema allows.

The second lesson is about failure granularity. The old action returned a single failure mode: "save failed." The new action distinguishes between "upsert failed → your data is intact, retry" (hard failure) and "prune failed → you got what you wanted, ignore" (soft failure, log only). Picking the right severity per error path is small work that pays back every time someone reads a log line.

The third, smaller thing: PostgREST .notIn([]) is "delete nothing," not an error. The empty-set defensive branch costs three lines and prevents a future bug that would be expensive to track down.

What Broke

The window was narrow but the blast radius was the wrong shape: this is the only required gate between sign-up and a usable account.

Root Cause

The DELETE+INSERT pattern is a classic non-atomic "replace this set" anti-pattern. The original saveStepTwoAction did this:

await supabase.from("user_interests").delete().eq("user_id", user.id);

const { error: interestsError } = await supabase
  .from("user_interests")
  .insert(rows);

The PK on user_interests is (user_id, category) — exactly the conflict target needed for a proper upsert. The original code did not use it.

Timeline

Apr 1, 2026

Non-atomic write enters the codebase

The 3-step onboarding form lands (commit 2dfdbcd). saveStepTwoAction ships with a DELETE+INSERT on user_interests. The pattern is conventional-looking enough that it survives the initial review.

May 10, 2026

Sprint audit flags the failure window

During a backlog hardening pass before merging the auth sprint, the row "Make user_interests step-2 update atomic" gets logged in docs/ISSUES.md as P1. The trigger: a re-read of the action with the question "what happens if any of these calls fail?" — the answer was bad enough to escalate. Exact detection timestamp not separately committed; using the patch commit date as the lower bound.

May 10, 2026

Upsert + prune ships

Commit 49d8ca3 replaces DELETE+INSERT with upsert (onConflict: "user_id,category", ignoreDuplicates: true) followed by a fail-soft prune via .notIn(...). Single-file change in features/auth/actions/onboarding.action.ts. No DDL.

May 10, 2026

Fix lands on main; ISSUES row closed

The P1 row in docs/ISSUES.md flips to closed in the same commit. This was the final P1 of the auth-hardening sprint.

The Fix

The fix is a two-statement replacement, but the order matters: insert-first, prune-second. The old set must survive until the new set is committed.

Two real changes worth pulling out separately.

Before·features/auth/actions/onboarding.action.ts

  // Replace interests: delete existing, then insert new
  await supabase.from("user_interests").delete().eq("user_id", user.id);

  const { error: interestsError } = await supabase
    .from("user_interests")
    .insert(
      parsed.data.interests.map((category) => ({
        user_id: user.id,
        category,
      })),
    );

  if (interestsError) {
    console.error("[onboarding:step2] interests insert:", interestsError);
    return { ok: false, message: "Failed to save interests. Please try again." };
  }

After·features/auth/actions/onboarding.action.ts

  // Replace interests via upsert+prune so partial failures don't lose data.
  // Insert-first preserves the old set if the upsert fails; the previous
  // DELETE+INSERT had a window where a failed insert left zero interests.
  const desired = parsed.data.interests;

  if (desired.length === 0) {
    // Defensive: Zod schema currently requires ≥1 interest (ISSUES.md P2).
    await supabase.from("user_interests").delete().eq("user_id", user.id);
  } else {
    const { error: upsertError } = await supabase
      .from("user_interests")
      .upsert(
        desired.map((category) => ({ user_id: user.id, category })),
        { onConflict: "user_id,category", ignoreDuplicates: true },
      );

    if (upsertError) {
      console.error("[onboarding:step2] interests upsert:", upsertError);
      return { ok: false, message: "Failed to save interests. Please try again." };
    }

    // Prune the complement: rows for this user not in the new set.
    // Prune failure is non-fatal: the user has their new interests; a few
    // stale ones may persist. The next step-2 update will retry the prune.
    // Failing here would block onboarding for a recoverable state mismatch.
    const { error: pruneError } = await supabase
      .from("user_interests")
      .delete()
      .eq("user_id", user.id)
      .notIn("category", desired);

    if (pruneError) {
      console.error("[onboarding:step2:prune-failed]", pruneError);
    }
  }

Before·features/auth/actions/onboarding.action.ts

  // no defensive branch — desired could be empty after a future schema relaxation

After·features/auth/actions/onboarding.action.ts

  if (desired.length === 0) {
    // Defensive: Zod schema currently requires ≥1 interest (ISSUES.md P2).
    await supabase.from("user_interests").delete().eq("user_id", user.id);
  }

Verification

Verification was adversarial code review, not a chaos-engineering harness. The relevant questions were narrow enough to answer by inspection.

Walk the failure paths

For each error: branch in the new action, ask: "If this fails, what is in the DB?" The upsert-failure branch leaves the old rows. The prune-failure branch leaves the new rows plus possibly some old ones. Neither path leaves zero rows. Confirmed by reading the diff against the PK behavior of upsert(... { ignoreDuplicates: true }).

Verify the conflict target matches the PK

onConflict: "user_id,category" must exactly match the table's PK or the upsert silently degrades to a plain insert. docs/SCHEMA.md § user_interests confirms PK (user_id, category). Match.

Confirm RLS allows the new operations

user_interests policies cover SELECT, INSERT, and DELETE for auth.uid() = user_id. upsert resolves to INSERT (or no-op on conflict) so the INSERT policy applies. The .notIn(...) delete is a normal DELETE. No new RLS work required.

Confirm the issue row closed

docs/ISSUES.md had this tracked as a P1: "Make user_interests step-2 update atomic." The commit closes the row and links the fix. The audit-trail thread is intact.

Lessons Learned

A "replace this set" operation should never have a window where the set is empty. Insert-first, prune-second — and treat the prune as best-effort, because the user already has what they came for.

The third, smaller thing: PostgREST .notIn([]) is "delete nothing," not an error. The empty-set defensive branch costs three lines and prevents a future bug that would be expensive to track down.

The non-atomic onboarding write that could strand users

At a Glance

What Broke

Root Cause

Timeline

The Fix

Verification

Lessons Learned

Engineering War Stories

The non-atomic onboarding write that could strand users

At a Glance

What Broke

Root Cause

Timeline

The Fix

Verification

Lessons Learned