← Blog

Applying ASVS, Part 1: Safe Concurrency

ASVSv5 15.4.1 and 15.4.2, demonstrated through a worked Go example.

When I joined the OWASP ASVS working group, I wanted to contribute deeper guidance in areas where the standard had room to grow. I gravitated towards safe concurrency, because race conditions tend to get misunderstood as a single class of bug or as something the language handles for you. In my experience, though, the underlying logic bugs aren’t any easier to avoid just because concurrency syntax is easier to write.

Race conditions are logic bugs, where timing is the broken logic: what happens, and when. The most concrete kind is two threads writing to the same piece of memory at the same moment; this is what most code reviewers are trained to look for. But most race conditions are more abstract than that. For example, in a TOCTOU race, a program reads some state, decides what to do based on it, and acts, while another request changes that state in the gap between the steps. This class has produced critical bugs in production systems.

A reviewer watching only for the memory-level kind misses all of these, and they are often the more serious ones. A low-level race usually crashes a program or corrupts its data, loudly enough to notice. A logic race can leave it running exactly as intended while the wrong person gets in, or learns something they shouldn’t. Code that is clean of every low-level race can still act on stale data, leading to even more unforeseen vulnerabilities.

The existing requirement in 4.0.3 was:

V4.0.3-11.1.6: Verify that the application doesn’t suffer from TOCTOU issues or other race conditions for sensitive operations.

That sentence reads as shorthand for the already-initiated, pointing at TOCTOU and a wider class of race conditions while leaving them undefined for the reader. But there’s more than one way for bad timing to wreck your code, and the old requirement from ASVSv4 treats them all as one catch-all category.

5.0 splits the original requirement into four discrete recommendations we felt would be most impactful for code reviewers:

  • shared objects and the primitives that protect them (15.4.1)
  • atomic compound operations, with TOCTOU named explicitly (15.4.2)
  • deadlock and livelock prevention (15.4.3)
  • thread starvation prevention (15.4.4)

This is the first in a series of worked examples I’m planning to write, each one grounded in the kind of code a reader might plausibly encounter in a production system. I’ll focus on the first two concurrency requirements here; the rest will get their own treatment another day.

The Service

The codebase is a small Go service for issuing and verifying stateful one-time passwords. Stateful here means the server picks the code at issuance time and persists it for the validity window, as opposed to stateless schemes like TOTP where both sides derive the code from a shared secret and the clock. Picture an SMS or email second factor: the server generates a six-digit code, sends it to the user, waits for the user to submit it back, and decides whether it matches. The service is built around three components: a challenge store that holds the active codes, a rate limiter that caps how many guesses a given user can make within a window, and a validator that wires them together.

Go makes sense for a service like this because its concurrency model is one of the language’s strongest features. Goroutines and channels are first-class, the standard library leans heavily on them, and writing concurrent code requires very little ceremony compared to most other languages. None of that makes races impossible though, as we’ll see.

I’ll describe the service from the perspective of a hypothetical, small dev team, shipping this OTP service alongside several others they own. They work in an organization where the platform team runs shared infrastructure including Redis, but they have a standing preference for keeping their Redis footprint lean where they can justify it. They aren’t building for scale they don’t have yet.

The Threat Model

Before writing any code, the team threat-modeled the service. Of the risks in their threat model, the one we’ll follow here is online brute force against the six-digit code: an attacker guessing values until one works. The team addressed it by limiting how many wrong guesses each user gets before the service stops accepting them; we’ll review how they built that limiter. For the full picture, including the risks they accepted as tradeoffs, see the OTP service threat model).

Threat model
ID Risk ROAM STRIDE Notes
OTP-3 Code prediction / weak RNG Owned
Spoofing
Predictable codes let an attacker derive a valid second factor without observing it
OTP-9 Cross-user code confusion Owned
Spoofing Elevation of Privilege
A code valid for one user is accepted for another, granting access to a different account
OTP-1 Online brute force of the code Owned
Spoofing
Attacker impersonates the legitimate user by guessing their second factor
OTP-8 Replay of consumed codes Owned
Spoofing Elevation of Privilege
An already-used code is presented again to authenticate a second time
OTP-7 Timing attacks on code comparison Owned
Information Disclosure
Response latency leaks information about correct prefixes of the code
OTP-6 Email account compromise Accepted
Spoofing
Attacker controls the delivery channel and receives codes intended for the legitimate user
OTP-2 Issuance flooding / SMS bombing Accepted
Denial of Service
Exhausts SMS budget and disrupts the user's device with unsolicited codes
OTP-11 Multi-replica state divergence Accepted
Tampering
Routing requests across replicas circumvents the rate limit counter on any single replica
OTP-10 Service restart wipes rate limit state Accepted
Tampering
The rate limit counter is effectively reset by an out-of-band action, weakening the control
OTP-4 SIM swapping Accepted
Spoofing
Attacker takes control of the delivery channel to receive codes intended for the legitimate user
OTP-5 SS7 interception Accepted
Information Disclosure
Code is exposed to a network-level adversary in transit


The Naive Implementation

The limiter is backed by a map from user identifier to a small entry struct that tracks the failure count and when the current window ends. Locked reads the entry for a key and returns whether the count has reached the limit. RecordFailure reads the entry, checks the count against the limit, and increments it if there’s room.

With the two methods in place, the team wired the limiter into the validator and started writing tests. The threat model had named parallelized brute force as an Owned risk, so they wrote a test that fires that exact load at the limiter: fifty goroutines submit failed verification attempts against the same user. With the limit set to five, the team expected the limiter to admit five and reject the other forty-five.

Explorer
package ratelimit

import (
	"context"
	"time"

	"github.com/cronchie/totpauth/internal/ratelimit/internal/store"
)

type mapLimiter struct {
	attempts map[string]*store.Entry
	ttl      time.Duration
	now      func() time.Time
}

func New(opts ...Option) Limiter {
	s := store.Apply(opts)
	return &mapLimiter{
		attempts: make(map[string]*store.Entry),
		ttl:      s.TTL,
		now:      s.Now,
	}
}

func (l *mapLimiter) Locked(_ context.Context, key string, limit int) error {
	e, ok := store.Lookup(l.attempts, key, l.now())
	if !ok {
		return nil
	}
	if e.Count >= limit {
		return ErrLimitExceeded
	}
	return nil
}

func (l *mapLimiter) RecordFailure(_ context.Context, key string, limit int) error {
	e := store.Ensure(l.attempts, key, l.ttl, l.now())
	if e.Count >= limit {
		return ErrLimitExceeded
	}
	e.Count++
	return nil
}

When they ran it, the test failed in a way that took them a minute to understand. Forty-seven of the fifty attempts had been admitted before the limiter started rejecting, against a limit of five.

They ran it again with Go’s race detector turned on. The -race flag instruments the program at compile time so the runtime can watch for goroutines reading and writing the same memory without coordinating, and report what it sees. It reported eight separate data races inside RecordFailure, all on the same pattern: one goroutine reading the entry while another wrote to it, or two goroutines writing to it at once.


Reaching for sync.Map

Looking at the code with the race detector’s output in hand, the structure of the problem was clear. Inside RecordFailure, the read of e.Count happens at one moment, the comparison against limit happens just after, and the increment writes back at a third moment. In between those steps, nothing prevents another goroutine from doing the same thing on the same entry. Two goroutines could both read e.Count as 2, both see that 2 is under the limit of 5, and both increment to 3. The map ended up with whichever write landed last, and the count would lag behind the number of failures that had actually occurred.

This is what v5.0.0-15.4.1, one of the new requirements in ASVS 5.0, calls out:

Verify that shared objects in multi-threaded code (such as caches, files, or in-memory objects accessed by multiple threads) are accessed safely by using thread-safe types and synchronization mechanisms like locks or semaphores to avoid race conditions and data corruption.

The team reached for sync.Map. The choice made sense: the documentation calls out the case where many goroutines read, write, and overwrite entries across disjoint keys, which is what a per-user rate limiter does, one key per user. The methods on a sync.Map are safe to call from any number of goroutines at once, so the team swapped the map for a sync.Map, adjusted the call sites in RecordFailure and Locked to use the Load, Store, and Delete methods, and reran the tests under -race. The race detector stayed silent and the test passed, so the team considered the work done and the bug fixed.

Explorer
package ratelimit

import (
	"context"
	"sync"
	"time"

	"github.com/cronchie/totpauth/internal/ratelimit/internal/store"
)

type syncMapLimiter struct {
	attempts sync.Map
	ttl      time.Duration
	now      func() time.Time
}

func New(opts ...Option) Limiter {
	s := store.Apply(opts)
	return &syncMapLimiter{ttl: s.TTL, now: s.Now}
}

func (l *syncMapLimiter) load(key string, now time.Time) (store.Entry, bool) {
	val, ok := l.attempts.Load(key)
	if !ok {
		return store.Entry{}, false
	}
	e := val.(store.Entry)
	if store.Expired(&e, now) {
		l.attempts.Delete(key)
		return store.Entry{}, false
	}
	return e, true
}

func (l *syncMapLimiter) Locked(_ context.Context, key string, limit int) error {
	e, ok := l.load(key, l.now())
	if !ok {
		return nil
	}
	if e.Count >= limit {
		return ErrLimitExceeded
	}
	return nil
}

func (l *syncMapLimiter) RecordFailure(_ context.Context, key string, limit int) error {
	now := l.now()
	e, ok := l.load(key, now)
	if !ok {
		e = store.Entry{ExpiresAt: now.Add(l.ttl)}
	}
	if e.Count >= limit {
		return ErrLimitExceeded
	}
	e.Count++
	l.attempts.Store(key, e)
	return nil
}

A Second Test

Something about the work felt unfinished. The team’s test had checked that exactly five attempts were admitted, but that’s not quite the property their threat model cared about. The identified risk was about whether an attacker could keep guessing, and whether the account would lock and stay locked. The team went back to the test and added a second assertion: after the burst of fifty concurrent failures, calling Locked on the key should return ErrLimitExceeded.

The new assertion failed even though the race detector stayed silent and the first assertion passed. After fifty concurrent failures against a limit of five, the stored count sat at one. Every goroutine had loaded the entry while its count was still zero, checked that zero was under five, incremented its own local copy to one, and stored it back. The fifty stores all wrote the same value, and the count converged on a single increment regardless of how many failures had occurred. The account wasn’t locked, an attacker could come back and trigger another burst, and the count would climb by one.

To make matters worse, a counter stuck at one never locks the account, and an unlocked account sets off no alarms. An attacker can make dozens of OTP guesses while the team’s logs and service monitors show nothing amiss. So in this case, this broken rate limiter was arguably worse than no limiter at all.

Going Back to the Standard

Satisfying 15.4.1 had made the data structure safe, but the bug was still there. The bug pointed at something 15.4.2 names directly:

Verify that checks on a resource’s state, such as its existence or permissions, and the actions that depend on them are performed as a single atomic operation to prevent time-of-check to time-of-use (TOCTOU) race conditions. For example, checking if a file exists before opening it, or verifying a user’s access before granting it.

That’s exactly what RecordFailure does wrong. The check is “is e.Count under the limit?” The action that depends on it is incrementing e.Count and storing the result. The two steps aren’t a single atomic operation, even when the underlying map is thread-safe. Fifty goroutines can all check, all see room under the limit, all increment their local copy, and all store back the same value.

Explorer
package ratelimit

import (
	"context"
	"sync"
	"time"

	"github.com/cronchie/totpauth/internal/ratelimit/internal/store"
)

type mutexLimiter struct {
	mu       sync.Mutex
	attempts map[string]*store.Entry
	ttl      time.Duration
	now      func() time.Time
}

func New(opts ...Option) Limiter {
	s := store.Apply(opts)
	return &mutexLimiter{
		attempts: make(map[string]*store.Entry),
		ttl:      s.TTL,
		now:      s.Now,
	}
}

func (l *mutexLimiter) Locked(_ context.Context, key string, limit int) error {
	l.mu.Lock()
	defer l.mu.Unlock()

	e, ok := store.Lookup(l.attempts, key, l.now())
	if !ok {
		return nil
	}
	if e.Count >= limit {
		return ErrLimitExceeded
	}
	return nil
}

func (l *mutexLimiter) RecordFailure(_ context.Context, key string, limit int) error {
	l.mu.Lock()
	defer l.mu.Unlock()

	e := store.Ensure(l.attempts, key, l.ttl, l.now())
	if e.Count >= limit {
		return ErrLimitExceeded
	}
	e.Count++
	return nil
}

A single mutex around RecordFailure fixes the bug by serializing every call through one lock. For this service, that costs nothing the team will feel: one instance, modest traffic, a lock held just long enough to read a couple of fields and store one back. They dropped the sync.Map, and instead put a regular map under one sync.Mutex. One global lock is enough until the service needs more throughput than a single lock can pass.

An alternate fix could have given each account its own lock, held across the check and the increment. Requests for different users never wait on each other, and two requests for the same account serialize, which is the one place the contention belongs.

The point worth leaving on is that sync.Map was never the wrong tool. It did what its documentation promises: concurrent access to the map stays safe, and lock contention drops across disjoint keys. The bug lived a level above the map, in the check-and-act sequence no collection is responsible for making atomic.


Final Thoughts

Two requirements caught two distinct bugs in the same method. 15.4.1 surfaced the data structure problem, which sync.Map fixed. 15.4.2 surfaced the atomicity problem, which the mutex fixed. The split between them isn’t arbitrary; a team can satisfy one without satisfying the other, which is what the team in this article ended up doing.

Starting with a mutex could have satisfied both requirements in this example, but that does not make “use a mutex” the general lesson. Developers often reach for concurrency-safe collections because they provide safe access without forcing every operation through one application-level lock. Go’s sync.Map, C#‘s ConcurrentDictionary, and similar types can be good fits when the operation being performed is the individual load, store, add, update, or delete they protect.

I hope teams use these requirements to inspect the places where their code checks state, makes a security decision from that state, and then updates it. If two requests can make the same decision from the same stale answer, the bug isn’t gone just because the map stopped racing.

The Applying ASVS Series

In my time doing AppSec, the talks I’ve enjoyed most gave me problems to chew on: something concrete enough to follow, but open enough that I kept thinking about how it applied to my own work. That’s why I started building more code samples into my talks, and why I wanted this series to include runnable examples instead of just commentary (although my commentary has been, I’m assured, solid so far). I find that security guidance becomes easier to internalize when you can look at sample code, tweak it, and run it to watch the failure happen for yourself.

If any of these examples help you review code or reminds you of a related security flaw, I’d like to hear about it. That feedback is what keeps the examples grounded in the kind of code I imagine you, the reader, are routinely writing and reviewing. My next article on applying ASVS (Part 2 for Safe Concurrency) is already in the works, and in true concurrent fashion, you’re welcome to appreciate it now while it’s still being written 🙂