There's a lot of discussion about what IDs to use in your databases, but the human/developer aspect of IDs often falls by the wayside.

Your ID format choice is quite important for database performance. I'm not arguing against that. These days, the contenders are usually integers, UUID v4, ULID, or the new UUID v7. There are more options and aspects of the decision beyond performance, like storage costs and information leakage. It's a fun topic.

Those topics aside, a distributed system needs to be easy to reason about and debug. My system has around 25 types of objects in the main business logic, and they all have IDs. These IDs are used in the service, async system, analytics, event system, and event archives. They're also customer-facing in some cases.

Thinking through the system design, I knew it would rarely hit a performance issue based solely on the choice of key. The system uses both PostgreSQL and DynamoDB, and DynamoDB is where millions of records are piling up. In most DynamoDB designs, you want something like a UUID for partition keys so that the data is spread evenly and hotspots are avoided. The typical tradeoffs around temporal locality, key size, etc., don't apply as much there.

I also leaned into the idea that the database keys can be separate from the entity IDs. There are many spots where the PostgreSQL records will have auto-incrementing integer PKs but a separate field for this longer entity ID.

That pattern is common when you have both a customer-facing ID and an internal ID. You end up with an index on each field: the internal PK for joins and FK relationships, and the customer-facing ID for the initial lookup when a customer request arrives. Look the record up once by customer-facing ID, then use the PK from there on in all queries and FK relationships.
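
As a rough sketch of the row shape this implies (the field names here are hypothetical):

// Internal PK for joins and FK relationships; entity ID for external lookups.
struct AccountRow {
    pk: i64,            // auto-incrementing internal primary key
    account_id: String, // longer entity ID, with its own unique index
}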

UUIDs were a natural choice for a distributed system where many different processes create new entities and there can't be any global conflicts. But I decided on two extra things: a prefix and a timestamp. For the timestamp part, pretend I used ULID; my actual implementation differs, but it behaves very similarly.

Each ID has a fixed-length character prefix:

  • FEED-01AN4Z07BY79KA1307SR9X4MV3
  • ACCT-01B3MKZY21AW1WRRASF0TXN2YK
  • PROC-01H1NKAY1T44XKG12RSQRQEZVK
  • RESP-01F4B1AY1TAA9BC3128THNZYV4
  • EVNT-01H3MMWN5K7TC84G533P5HWGV9
  • etc.
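
Minting one is cheap. A minimal sketch, assuming the Rust ulid crate (the helper name is hypothetical):

use ulid::Ulid;

// Produce a prefixed ID like "ACCT-01B3MKZY21AW1WRRASF0TXN2YK".
fn new_prefixed_id(prefix: &str) -> String {
    format!("{}-{}", prefix, Ulid::new())
}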

These IDs are used at all service and messaging boundaries. Calling an internal RPC? Queueing a message? Logging an event? The fields are strings, and they include the prefixes. They aren't print-only artifacts. When the strings are parsed at these system boundaries, the 'acct' field had better contain an ID with an 'ACCT' prefix, etc. Otherwise, it's a catastrophic error, and it would never make it past tests.

It's a sanity check. It's easy to reason about and implement. The parsing overhead is trivial compared to the time it takes to make/receive external calls.
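
A minimal sketch of that boundary check, again assuming the ulid crate (the function name is mine):

use ulid::Ulid;

// Reject any incoming string whose prefix doesn't match what the field expects.
fn check_prefix(expected: &str, raw: &str) -> Result<Ulid, String> {
    let (prefix, rest) = raw
        .split_once('-')
        .ok_or_else(|| format!("malformed ID: {raw}"))?;
    if prefix != expected {
        return Err(format!("expected prefix {expected}, got {prefix}"));
    }
    Ulid::from_string(rest).map_err(|e| format!("bad ULID in {raw}: {e}"))
}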

Database fields are an exception; they'll usually have the prefix stripped. The unmarshaling code must be aware of each field and translate it into its strongly typed form.

In Rust, for that strongly typed form, I chose a struct. It looks similar to this:

use ulid::Ulid;

pub struct TimeBasedId {
    pub prefix: String,
    pub ulid: Ulid,
}

Each specific ID can have its own explicit type.

In Rust, it's tempting to use a type alias for each ID type, but the compiler would treat aliases interchangeably, which defeats the purpose.
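
A newtype wrapper, on the other hand, is a distinct type, so the compiler keeps them apart. A sketch:

// A type alias would compile, but AccountID and FeedID become interchangeable:
// type AccountID = TimeBasedId;

// Newtype wrappers are distinct types; mixing them up is a compile error.
pub struct AccountID(TimeBasedId);
pub struct FeedID(TimeBasedId);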

Using these types prevents confusion and parameter-ordering mistakes. Here's a function signature; strings are nowhere to be found:

fn do_something(acct_id: &AccountID, feed_id: &FeedID) {
    // do something
}
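
Swap the arguments at a call site, and the compiler rejects it (the error, roughly):

do_something(&feed_id, &acct_id);
// error[E0308]: mismatched types
// expected reference &AccountID, found reference &FeedID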

Using these in parameters and return types is probably the biggest benefit. Think about a function that returns a few IDs, each for a different type of object:

fn create_something() -> Result<(FeedID, JobID)> {
    // create something
}
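
For contrast, here's a hypothetical stringly typed version of the same signature:

fn create_something_stringly() -> Result<(String, String)> {
    // create something... but which String is the feed ID and which is the job ID?
}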

When a function returns a tuple of strings like that, the developer has to handle the ordering, probably after reading about it in the doc string, and there are many opportunities to screw it up (or mis-document it). At the least, it's cognitive overhead, and it adds up. With typed IDs, the order is right there in the function signature, and the compiler will catch ordering mistakes.

This also lets you log things quickly. When you're writing some debug statement about an action with associated IDs, it's quick and thoughtless to throw things into it:

debug!("did something. {} / {}", job_id, acct_id);

The Display implementation that renders an ID to a string includes the prefix.
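
A minimal sketch of that impl:

use std::fmt;

// Render the ID as "PREFIX-ULID", so every logged ID carries its prefix.
impl fmt::Display for TimeBasedId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}-{}", self.prefix, self.ulid)
    }
}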

The typical way you'd write these log statements is something like this:

debug!("did something. job ID: {} / account ID: {}", job_id, acct_id);

This is a tiny ergonomic improvement, but it will happen thousands of times over the years. Typing out a label each time is unnecessary with strongly typed IDs.

Having the prefixes in the logs means they are very greppable. Need to gather all account IDs active in a particular hour, and you don't have a service/database function for that yet? Some grep, cut, sort, uniq, etc., and you are on your way. Running a stress test and need to gather all of the generated object IDs so that you can investigate them in a loop in a script? Pretty trivial to do, even though many other types of IDs are being logged.

I'm not sure this happens to everyone, but I've been lied to by log statements because of ordering mixups. Once strongly typed IDs are up and running and tested, that whole class of mistake is gone.

It's best to have exactly one function that parses them from incoming strings and one function that creates new ones. There should be no other way to obtain an instance.

To do that in Rust, I add a private field, which prevents the struct from being constructed any way other than through those functions (most languages have some pattern for guaranteeing validity when an object is instantiated):

pub struct TimeBasedId {
    pub prefix: String,
    pub ulid: Ulid,
    _none: (), // private: forces construction to go through this module's functions
}
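
With the private field in place, the two blessed functions might look something like this (a sketch with simplified error handling):

impl TimeBasedId {
    // Mint a brand-new ID with the given prefix.
    pub fn new(prefix: impl Into<String>) -> Self {
        Self { prefix: prefix.into(), ulid: Ulid::new(), _none: () }
    }

    // Parse an incoming "PREFIX-ULID" string. No other way in.
    pub fn parse(raw: &str) -> Result<Self, String> {
        let (prefix, rest) = raw.split_once('-').ok_or("missing prefix")?;
        let ulid = Ulid::from_string(rest).map_err(|e| e.to_string())?;
        Ok(Self { prefix: prefix.to_string(), ulid, _none: () })
    }
}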

The timestamp component can play a role in your ID choice for efficient databases. It has no role in this global entity ID scheme other than as a debugging aid. It's helpful for sorting when you're grepping logs. It helps zero in on what log file to look at for events around the same time period. It shouldn't be directly interpreted by your application logic, though. I always keep a separate created_at field. IDs should be opaque.

Objection: "You're saying IDs should be opaque, but you're using prefixes that are interpreted by the code? Then those aren't opaque IDs!"

Yes. The prefix is the one piece of structure the code is allowed to interpret; everything after it stays opaque. We're rejecting some customs, just not going too far.

It's not novel. And I'm sure it won't sit right with some people. But it worked out quite well.