Context
Fraud today
Fraud is a BIG problem. At the time of writing, it is a problem the size of Uber (or Palantir/Blackstone/Pfizer/pick your “favorite” large-cap company). And it is compounding: U.S. companies lost ~9% of revenue to fraud in 2025, up 46% from the year before.
The incidents are large-scale and hard to ignore:
- Robinhood was hit with large-scale phishing attacks targeting its user base
- Stripe’s 2025 State of AI and Fraud report found free trial abuse grew 6.2x in just four months; AI startups are seeing 10x more abuse than enterprise companies
- Meta took action against 1 billion fake accounts in Q1 2025 alone
Fraudsters are also crawling the web at historic rates. Malicious bots now make up ~37% of all internet traffic, doing things like:
- API reconnaissance: mapping hidden access points to services
- Credential stuffing: testing stolen emails, passwords, credit card numbers, and SSNs at scale
- Vulnerability probing: automated scanning for exploitable surface area
- DDoS: overwhelming services to degrade or deny availability
AI is only accelerating fraudsters’ attack vectors and attack velocity.
Adobe Risk Platform
At Adobe, we’ve similarly seen a 10x increase in fraud attempts between last year and now. Allowing fraudulent payments and risky behavior on Adobe products is a huge problem for us. Not only does fraud force revenue restatements when earnings were inflated by bad transactions, it also degrades customer trust.
That’s where my team comes in. We sit between end users and Adobe’s core products, providing fraud detection and risk assessment at scale. It’s become very clear to us that attackers’ tools have gotten sharper, and the tools we use to detect and respond also need to keep pace.
We’re in a never-ending battle of whack-a-mole, balancing plug-the-holes mitigation work against building solutions that shift fraud prevention left and stop problems we haven’t seen before.
Problem Space: Fraud detection SDK
That’s where the fraud detection SDK comes in. It’s our primary instrumentation layer in the browser: collecting user signals, detecting bot activity, enforcing policy, and orchestrating Adobe-internal and third-party fraud detection vendors.
I joined the team in Q4 2025. The SDK was a few years old by then, and each new feature had to fight the same structural constraints that made it hard to scale and to onboard new Adobe clients. As our platform continued to mature, we needed to address all of these issues so we could move fast and stay correct without reinventing the wheel each time.
Five structural problems drove the revamp:
| # | Issue | Symptom |
|---|---|---|
| 1 | Version skew: Stable URL + aggressive caching | You ship detection rules but can’t know who got them |
| 2 | No singleton | Competing inits + async vendor load = nondeterministic behavior, too dependent on clients |
| 3 | Callback vendor lifecycle | Slow to add vendors when delivery/init are also fragile |
| 4 | Feature flagging | Fraud experiments needed client-side changes and relied on front-end state we couldn’t trust |
| 5 | Too much client control | Clients could pass overrides and create unexplainable behavior |
1. Version skew: Stable URL + aggressive caching
The old integration model was a single minified script behind a stable URL. That made it easy to embed, but browsers and CDNs cache aggressively, so after we shipped a detection rule, not every session was guaranteed to run the new code. For a risk platform SDK, that’s a correctness problem. Short-term mitigations (shorter cache TTLs, cache-busting query params) helped at the margins but didn’t give us what we needed: a stable entrypoint embedders keep forever and a versioned implementation that rolls out predictably.
2. No singleton
The SDK sits in environments we don’t control. SPAs can initialize it multiple times. We also had clients whose JavaScript was loaded inside other clients’ pages, creating race conditions between competing inits. Feature flags, vendor orchestration, and policy evaluation all need to flow from one coherent initialization.
3. Callback vendor lifecycle
Vendor integrations had grown into a tangle of nested callbacks with no clear lifecycle. Each vendor had its own loader and token reporting, so adding one meant copying patterns rather than plugging into a shared model. The result was slow to extend and hard to debug.
4. Feature flagging
Feature flagging depended on a global flag array the host application had to inject before the script loaded. That made fraud experiments brittle: turning a vendor on or off meant a client-side change, not a platform decision. Worse, that array lived in client-side state anyone on the page could influence (via XSS, a compromised embed, or DevTools), so detection policy wasn’t something we could trust. Calling the flag service directly from the browser wasn’t viable either; any credential in front-end code is visible in DevTools.
5. Too much client control
Initialization had become a negotiation. Embedders passed vendor lists, feature overrides, and large metadata maps to override platform defaults on their own timeline. Each knob had a reasonable short-term reason, but together they created unexplainable behavior; debugging meant untangling what the host app passed in from what the platform intended to run.
Solutions
For each issue listed above, I’ll share the core idea of the solution and high-level implementation.
Split what integrators embed from what actually runs
Issue 1: Version skew: Stable URL + aggressive caching
The old monolithic script became two artifacts: a lightweight loader (~1-2KB) and a content-hashed implementation bundle. When the implementation changes, the hash (and thus the URL) changes, but the loader URL never does. Embedders keep the same script tag forever. Versioning happens entirely on our side.
Browser requests loader (short CDN TTL, no browser cache)
→ loader injects <script> tag for hashed SDK (long TTL / immutable)
→ SRI check passes
→ SDK runs
Caching follows this split deliberately.
The loader uses Cache-Control: s-maxage=300, max-age=0, must-revalidate, so it is CDN-cached for 5 minutes and revalidated by the browser on every load. Preventing the browser from reusing a cached loader is a conscious tradeoff: it costs a fast CDN round-trip (~5-20ms) on every page load, but it means a CDN cache purge is immediately effective for all users. If we push a critical security patch and issue a purge, the next request from any browser picks up the new version.
The hashed SDK bundle uses Cache-Control: public, max-age=31536000, immutable. It is safe to cache forever in both CDN and browser, because a new hash means a new URL, which is an automatic cache bust.
Integrity at load time: the build process computes an SHA-256 digest of the hashed SDK bundle and bakes it into the loader as a Subresource Integrity attribute. If the CDN serves a tampered file, the browser rejects it natively before executing.
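A minimal sketch of that build-time digest step, assuming a Node-based build. The helper name `sriFor` is hypothetical; the `sha256-<base64>` shape is the format the `integrity` attribute expects. When the bundle is served from a different origin than the page, the script tag also needs a `crossorigin` attribute for the check to apply.

```typescript
import { createHash } from "node:crypto";

// Hypothetical build helper: computes the value the loader would carry
// in the `integrity` attribute for the hashed SDK bundle.
function sriFor(bundle: string | Uint8Array): string {
  const digest = createHash("sha256").update(bundle).digest("base64");
  return `sha256-${digest}`;
}

// Usage at build time (bundle contents are a placeholder):
const integrity = sriFor("/* hashed SDK bundle contents */");
```

Any change to the bundle bytes changes the digest, so the loader and the bundle have to be produced by the same build.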
Resilience: if loading the hashed script fails from a transient network glitch, the loader retries once before surfacing the failure. The SRI and retry are coupled: if SRI verification fails, onerror fires just as it would for a network failure, and the retry re-fetches from the CDN. If a CDN edge node is consistently serving a bad file, both attempts fail and the SDK does not load.
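The retry rule can be sketched as follows. This is an illustration under assumptions, not the actual loader: `InjectScript` is a hypothetical stand-in for creating a `<script>` element with `src`, `integrity`, and `crossOrigin` set and wiring its `onload` / `onerror` handlers, and `loadBundle` is a made-up name.

```typescript
// Hypothetical stand-in for DOM script injection. In a browser this would
// create a <script> element and call onLoad / onError from its handlers.
type InjectScript = (
  url: string,
  integrity: string,
  onLoad: () => void,
  onError: () => void,
) => void;

function loadBundle(
  inject: InjectScript,
  url: string,
  integrity: string,
  onReady: () => void,
  onFailure: (attempts: number) => void,
  maxAttempts = 2, // first try plus a single retry
): void {
  let attempts = 0;
  const attempt = (): void => {
    attempts += 1;
    inject(url, integrity, onReady, () => {
      // Network failures and SRI mismatches both arrive here via onerror.
      if (attempts < maxAttempts) attempt();
      else onFailure(attempts);
    });
  };
  attempt();
}
```

Because the error path cannot tell a network glitch from an integrity mismatch, the same retry covers both, and a consistently bad file ends in `onFailure` after the second attempt.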
Origin resilience: the loader also carries stale-if-error=86400. If the origin server is unreachable during CDN revalidation, the CDN continues serving the last known good loader for up to 24 hours. A stale loader still points to a valid previously-deployed SDK hash, which is still cached immutably at the CDN edge. Fraud protection stays active during an origin outage.
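Put together, the two caching policies could be expressed as a single origin-side rule. The sketch below assumes a hypothetical path convention in which hashed bundles carry a hex content hash before the `.js` extension; the header values are the ones described in this section, and `cacheControlFor` is an illustrative name.

```typescript
// Hypothetical naming convention: "sdk.<hex hash>.js" for hashed bundles,
// anything else (for example "loader.js") is treated as the loader.
const HASHED_BUNDLE = /\.[0-9a-f]{8,}\.js$/;

function cacheControlFor(path: string): string {
  if (HASHED_BUNDLE.test(path)) {
    // New hash means new URL, so this can be cached forever.
    return "public, max-age=31536000, immutable";
  }
  // Loader: 5 minutes at the CDN, revalidated by browsers,
  // served stale for up to 24 hours if the origin is down.
  return "s-maxage=300, max-age=0, must-revalidate, stale-if-error=86400";
}
```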
Rollback: because we retain the last N hashed bundles on the server, rollback is a loader redeployment: point the loader at a previous hash. The old bundle is still on the CDN with an immutable TTL. Within 5 minutes (the loader’s CDN TTL), clients pick up the previous version.
Enforce a singleton with an explicit factory and loader handoff
Issue 2: No singleton
Issue 1 solves which file runs. The handoff solves when host code can call into the SDK.
The loader installs a stub on window before the hashed bundle has finished downloading. That stub accepts calls like initAsync and queues them against a promise that resolves when the real SDK loads. As an optimization, the stub also immediately fires a fetch() for the client’s vendor configuration from CDN when initAsync is called, before the SDK bundle has even arrived. That prefetch runs in parallel with the SDK download, so by the time the real SDK loads and processes the config, the response is either already in-flight or already cached.
Early callers don’t need custom onload choreography or race-prone “poll until the global exists” loops. They invoke the same API they’d use post-load, and the work is held until the implementation is ready.
When the hashed script runs, it replaces the stub and flushes the queue. From the host’s perspective: one stable entrypoint, one stable call pattern, regardless of when the heavy code actually arrives.
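A simplified sketch of the queue-and-flush handoff, under stated assumptions: the names `createLoaderStub`, `handoff`, and `pending` are hypothetical, the config is reduced to a single field, the vendor-config prefetch is omitted, and the stub forwards to the real implementation rather than being replaced on window.

```typescript
interface InitConfig { clientId: string }
interface SdkApi { initAsync(config: InitConfig): Promise<string> }

interface QueuedCall {
  config: InitConfig;
  resolve: (value: string) => void;
  reject: (reason: unknown) => void;
}

function createLoaderStub() {
  const queue: QueuedCall[] = [];
  let real: SdkApi | null = null;

  // Same call shape before and after the real SDK arrives.
  const api: SdkApi = {
    initAsync(config) {
      if (real !== null) return real.initAsync(config);
      return new Promise<string>((resolve, reject) => {
        queue.push({ config, resolve, reject });
      });
    },
  };

  // Called by the hashed bundle once it has loaded: drain queued calls in order.
  function handoff(impl: SdkApi): void {
    real = impl;
    queue.splice(0).forEach((call) => {
      impl.initAsync(call.config).then(call.resolve, call.reject);
    });
  }

  return { api, handoff, pending: () => queue.length };
}
```

Callers that arrive early get a pending promise; callers that arrive after the handoff go straight to the implementation.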
This design pairs directly with the singleton enforcement. initAsync() flows through a create() factory with a private constructor. At most one instance is ever constructed:
class SDK {
  private static _instance: SDK | null = null;
  private static _createPromise: Promise<SDK> | null = null;

  static async create(config: SDKConfig): Promise<SDK> {
    // Already initialized: warn and hand back the existing instance.
    if (SDK._instance !== null) {
      SDK._instance._logger.warn('SDK.create() called more than once. Returning existing instance.');
      return SDK._instance;
    }
    // Initialization in flight: share the same promise.
    if (SDK._createPromise !== null) {
      return SDK._createPromise; // deduplicate concurrent calls
    }
    SDK._createPromise = (async () => {
      const instance = new SDK(config); // sync: logger, config, validation
      // Flags and vendor config are fetched in parallel.
      const [, vendorConfigResponse] = await Promise.all([
        instance.fetchFeatureFlags(),
        instance.fetchVendorConfig(config._vendorConfigPrefetch),
      ]);
      instance._vendorConfig = instance.applyConfig(vendorConfigResponse);
      instance.loadVendors(); // fire-and-forget
      SDK._instance = instance; // assigned only after init succeeds
      return instance;
    })();
    return SDK._createPromise;
  }

  private constructor(config: SDKConfig) { /* ... */ }
}
_createPromise deduplicates concurrent callers; _instance is assigned only after init succeeds.
Industry leaders like Fingerprint Pro and Stripe.js solve the same problem with implicit singletons: documentation tells integrators to store the promise and reuse it. We enforce it in the SDK itself via a private constructor and static factory, so correct behavior does not depend on developer discipline.
Replace nested callbacks with per-vendor state machines and push-driven lifecycle
Issue 3: Callback vendor lifecycle
Each vendor extends a shared abstract base class with an explicit state machine. Subclasses own vendor-specific load logic: script injection, third-party API calls, and how each vendor reports results back. The base class owns what was duplicated before: state transitions, token storage, cookies, and performance timing.
Idle → Loading → Ready → VendorResults
↘ Error → (single retry)
Valid transitions gate all side effects. Vendors push signals and errors to the SDK. Vendor script loading runs in the background after init returns so third-party scripts do not block the main thread, and injection is deferred until DOMContentLoaded if the DOM is not ready. Each vendor’s callback then fires as soon as that vendor reports in.
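A compact sketch of the base-class idea follows. It is an illustration, not the production class: the method names are hypothetical, and the transition table encodes one reading of the diagram above (Error is reachable from Loading, and the single retry goes from Error back to Loading).

```typescript
type VendorState = "Idle" | "Loading" | "Ready" | "VendorResults" | "Error";

// Assumed transition table; see the caveats in the paragraph above.
const ALLOWED: Record<VendorState, readonly VendorState[]> = {
  Idle: ["Loading"],
  Loading: ["Ready", "Error"],
  Ready: ["VendorResults"],
  VendorResults: [],
  Error: ["Loading"],
};

abstract class Vendor {
  private _state: VendorState = "Idle";
  private retried = false;
  protected token: string | null = null;

  get state(): VendorState { return this._state; }

  // Subclasses own vendor-specific loading (script injection, API calls).
  protected abstract load(): void;

  start(): void {
    if (this.transition("Loading")) this.load();
  }

  // Returns false and changes nothing when the transition is not allowed.
  protected transition(next: VendorState): boolean {
    if (ALLOWED[this._state].indexOf(next) === -1) return false;
    this._state = next;
    return true;
  }

  protected reportReady(): void { this.transition("Ready"); }

  protected reportResults(token: string): void {
    if (this.transition("VendorResults")) this.token = token;
  }

  protected reportError(): void {
    if (!this.transition("Error")) return;
    if (!this.retried) {
      this.retried = true;
      this.start(); // single retry: Error -> Loading
    }
  }
}
```

A concrete vendor only implements `load()` and calls the `report*` methods; a late or duplicate report simply fails the transition check and has no side effect.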
Fetch feature flags through a BFF
Issue 4: Feature flagging
The SDK fetches flags at init through a backend-for-frontend (BFF) we own, with auth handled server-side so the browser never talks to the flag service directly. That fetch runs in parallel with vendor config. Flags control what’s enabled; CDN config controls how vendors are set up, and those two sources stay separate. If either fetch fails, init still succeeds with safe defaults rather than crashing the page.
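One way to get "init still succeeds with safe defaults" is to catch each fetch independently and normalize whatever comes back. The sketch below assumes hypothetical names throughout (`DEFAULT_FLAGS`, `vendorA`, `normalizeFlags`, `loadInitInputs`) and takes the two fetchers as parameters instead of calling a real BFF or CDN.

```typescript
interface Flags { [name: string]: boolean }

// Hypothetical safe defaults used when the BFF is unreachable.
const DEFAULT_FLAGS: Flags = { vendorA: false };

// Keep only boolean values for known flags; anything else keeps its default.
function normalizeFlags(raw: unknown): Flags {
  const flags: Flags = { ...DEFAULT_FLAGS };
  if (typeof raw !== "object" || raw === null) return flags;
  const source = raw as Record<string, unknown>;
  for (const name of Object.keys(DEFAULT_FLAGS)) {
    const value = source[name];
    if (typeof value === "boolean") flags[name] = value;
  }
  return flags;
}

// Both fetchers are injected; a failure in either resolves to null
// instead of rejecting, so the other result is still used.
async function loadInitInputs(
  fetchFlags: () => Promise<unknown>,
  fetchVendorConfig: () => Promise<unknown>,
): Promise<{ flags: Flags; vendorConfig: unknown }> {
  const [rawFlags, vendorConfig] = await Promise.all([
    fetchFlags().catch(() => null),
    fetchVendorConfig().catch(() => null),
  ]);
  // A null vendorConfig means "fall back to built-in vendor defaults".
  return { flags: normalizeFlags(rawFlags), vendorConfig };
}
```

Ignoring unknown flag names also means a tampered response cannot switch on behavior the SDK does not already know about.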
Narrow the initialization contract
Issue 5: Too much client control
Init now accepts a small, stable surface: a client identifier, session identifier, environment, and a few callbacks. That’s the full contract. The rest of the logic flows from our platform internally.
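As an illustration of what a narrow contract can look like in code, here is a sketch with hypothetical field and callback names (`clientId`, `sessionId`, `onReady`, `onError`) and hypothetical environment values. Rejecting unknown keys outright is one possible enforcement choice, assumed here for the example.

```typescript
type Environment = "stage" | "prod"; // hypothetical environment names

interface InitOptions {
  clientId: string;
  sessionId: string;
  environment: Environment;
  onReady?: () => void; // hypothetical callback names
  onError?: (error: Error) => void;
}

const ALLOWED_KEYS = ["clientId", "sessionId", "environment", "onReady", "onError"];

function parseInitOptions(input: Record<string, unknown>): InitOptions {
  // Anything outside the contract (vendor lists, overrides, metadata) is refused.
  const unknownKeys = Object.keys(input).filter((key) => ALLOWED_KEYS.indexOf(key) === -1);
  if (unknownKeys.length > 0) {
    throw new Error(`Unsupported init options: ${unknownKeys.join(", ")}`);
  }
  const { clientId, sessionId, environment, onReady, onError } = input;
  if (typeof clientId !== "string" || clientId === "") throw new Error("clientId is required");
  if (typeof sessionId !== "string" || sessionId === "") throw new Error("sessionId is required");
  if (environment !== "stage" && environment !== "prod") {
    throw new Error("environment must be 'stage' or 'prod'");
  }
  if (onReady !== undefined && typeof onReady !== "function") throw new Error("onReady must be a function");
  if (onError !== undefined && typeof onError !== "function") throw new Error("onError must be a function");
  return {
    clientId,
    sessionId,
    environment: environment as Environment,
    onReady: onReady as InitOptions["onReady"],
    onError: onError as InitOptions["onError"],
  };
}
```

Validating once at the boundary keeps inner modules free of defensive checks for config shapes that can no longer occur.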
Engineering choices and other things worth naming
Coming soon…
Reflections
Write the spec, then keep rewriting it
This was one of the first major production-ready projects led by spec-driven development. I’d design the spec first: what the system had to guarantee, how components interacted, what the integration surface looked like. Then I’d break it into features and a work breakdown that matched those boundaries.
Direction changed constantly, and the sunk cost of already-implemented, suboptimal features stopped being a good reason to stay the course. When the callback model couldn’t support a cleaner state machine, we’d update the spec, rethink the breakdown, and rework the pieces that no longer fit. AI made that loop practical enough to use every day. Prompts like “implement this as cleanly as possible” (basically, code the shit out of this) and “give me three other ways to design this” became part of the normal workflow.
Spec-driven development is the standard workflow today, especially with AI in the loop, and on this project it earned that status. I’m not sure it stays that way. If tooling gets good enough to constantly align the codebase without a separate document, or auto-update spec to follow the codebase, the spec might stop being the primary artifact.
AI defensively over-engineering
AI-generated code still has a habit of being defensively over-engineered. It adds try/catch blocks, fallbacks, and null checks for errors that should be caught upstream, which makes code look robust but hides real bugs. On this project, that meant inner modules handling invalid config that upstream validations should have already rejected.
I still need to get better at spelling out failure boundaries in the spec, so that AI can think in terms of system-wide error handling instead of optimizing locally every time.