It occurred to me after sending my book off to the editors that I missed a crucial section. A huge failure mode of moving quickly is being stymied by your data model. I’d argue that devoting serious thought and time to how you store and model your data is a true make-or-break decision. In my book, “Building A Startup - A Primer For The Individual Contributor”, I often talk about the concept of one-way streets: decisions that, once made, are hard to undo and tend to set a direction for your project. When the data model works against you, things that should seemingly be easy become extraordinarily difficult.
Why is this a problem? Because nearly everything in your application uses data! Data informs the domain model, it determines how classes and data objects are structured, and that then propagates upward into how relationships and code are structured. If the data is poorly managed, the higher-level APIs and abstractions often become clunky and difficult to use. Direct REST APIs, or CRUD generated from gRPC/Thrift/etc., can leak the difficult data model, requiring disjoint inputs just to do basic work. These issues all stem from a database that isn’t structured well, or wasn’t given enough love and care to properly model the domain it’s storing.
Naming is hard
First and foremost is naming. Names are hard, and changing column or field names in a database after the fact can be daunting, especially if the data size is enormous: migrations can be error prone, take a long time, and sometimes require application downtime (which nobody likes). Do the names match the domain they are modeling? Are they descriptive? Shortcuts and acronyms aren’t worth the effort.
If it’s an “account_id” then don’t call it “acc_id”. Long term, think about who is going to read this content: other developers, data scientists, you (after you’ve forgotten everything). Ensure those names are consistent with each other as well! If you’re using snake_case, keep snake_case everywhere. Consistency minimizes cognitive load; people will learn and understand patterns from consistent naming conventions. Don’t reference “account_id” in one table and “account” in another.
When modeling cross-domain references, I like to use context to prefix a foreign key id. For example, if we have a users table that references both an internal and an external account id (where both fields are foreign keys to the “accounts” table), then we might want to call them “external_account_id” and “internal_account_id”.
users
id, name, external_account_id, internal_account_id
Keeping to a logical convention, whichever one you choose, makes it easy to reason about the data system and to understand what links and relationships the data has.
Consider also how you name your foreign keys. Use a consistent naming pattern. It might be
fk_<source_table>_<field>_<foreign table>_<field>
like “fk_users_external_account_id_accounts_id”, which immediately tells someone
users.external_account_id -> accounts.id
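A convention like this can be made mechanical rather than memorized. As a small sketch in TypeScript (the `fkName` helper is my own illustration, not from the book):

```typescript
// Sketch: build a foreign-key name following the
// fk_<source_table>_<field>_<foreign_table>_<field> convention.
function fkName(
  sourceTable: string,
  sourceField: string,
  foreignTable: string,
  foreignField: string
): string {
  return `fk_${sourceTable}_${sourceField}_${foreignTable}_${foreignField}`;
}

// Produces "fk_users_external_account_id_accounts_id".
const constraintName = fkName("users", "external_account_id", "accounts", "id");
```

Generating constraint names from one helper in your migration tooling keeps the convention enforced by code instead of by review comments.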
Names aren’t just important in relational storage either; they have just as much importance, if not more, in non-relational data, because in a No-SQL world different keys can hold differently shaped data. Shapes are important, and ensuring that the shapes you store are consistent across the semantic type they represent matters. I often say that just because No-SQL is schema-less doesn’t mean you don’t enforce a schema. Quite the opposite! You now have to enforce the schema in application code, and without a clear and unified schema, managing your data becomes a nightmare. What data is being represented by a specific key/value pair? Is that data correct? What does changing that data mean when you start to add or remove fields?
In the No-SQL world I strongly recommend storing data blobs with version and type fields, so that you can disambiguate how data maps to model classes in code. For example, using a JSON format like the one below can make your key-value data easily consumable and discoverable, as well as safe from compatibility mismatches:
{
  "version": 1,
  "type": "foo",
  "data": {
    ... your fields ...
  }
}
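To make the envelope concrete, here is a minimal TypeScript sketch of reading such a blob (the `Envelope` and `FooV1` names are hypothetical):

```typescript
// Sketch: a versioned, typed envelope for No-SQL data blobs.
// "version" and "type" let application code pick the right decoder
// before touching "data".
interface Envelope {
  version: number;
  type: string;
  data: unknown;
}

// One concrete shape this store might hold, at version 1.
interface FooV1 {
  name: string;
}

function decodeFoo(raw: string): FooV1 {
  const env = JSON.parse(raw) as Envelope;
  if (env.type !== "foo") {
    throw new Error(`expected type "foo", got "${env.type}"`);
  }
  if (env.version !== 1) {
    // Older or newer versions would get their own decode/migration path here.
    throw new Error(`unsupported version ${env.version}`);
  }
  return env.data as FooV1;
}

const foo = decodeFoo('{"version": 1, "type": "foo", "data": {"name": "bar"}}');
```

The point isn’t the specific helper; it’s that every reader checks `version` and `type` before trusting the shape of `data`, so a schema change becomes an explicit branch instead of a silent corruption.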
Sources of truth
The next thing I often consider, after agonizing over how to name things, is what the relationships of the data are and what to store in a particular table (or data blob). For example: why are we storing this? Do we have to store it? What does it mean to store this data? The goal of this thought exercise is to make sure that we’re storing exactly what we need, that we’re not denormalizing it (or if we are, it’s for good reason), and that adding this particular field creates long-term value for us. I’ve seen places where people store things like
table
id, name, event_payload, first_name, name_title, name2, event_name
It seems contrived, but we’ve probably all seen travesties of data modeling that look similar. In this example we have four different name fields, and presumably some raw data in “event_payload” that “event_name” is perhaps part of. I’d argue that we didn’t think clearly about what it means to store a name, and we didn’t think clearly about what it means to store parts (or all) of an event. Who is the source of truth here? Not to mention: is the name being stored somewhere else as well? Is it duplicated across databases? Is there a centralized way to query and get that data?
Without clear sources of truth and ownership, questions tend to arise about how you manage and update this data. What do you actually need to perform an update? Is the update valid? Do you have to update multiple places at once?
Assuming this is the one and only place this name has to exist, I’d probably recommend (in a relational world) storing compound data as a JSON blob in a field, using virtual columns to project data out of the blob and make querying easier. For example, if we created a unified name object:
interface Name {
given_name: string
family_name: string
middle_name?: string
..
}
Then we can pair data together as a group, update it as a group, and query it more sanely. Without proper pairing of data we might end up having to query or join multiple tables just to materialize a person’s full name, which makes answering basic questions a difficult and frustrating experience for developers.
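As a sketch of the payoff (the `fullName` helper is my own illustration): once the name lives in one object, assembling a display name is a single local operation rather than a multi-table join.

```typescript
// Sketch: the unified name object from above, stored and updated as one group.
interface Name {
  given_name: string;
  family_name: string;
  middle_name?: string;
}

// Materialize a display name from the paired data; no joins required.
function fullName(n: Name): string {
  return [n.given_name, n.middle_name, n.family_name]
    .filter((part): part is string => part !== undefined && part.length > 0)
    .join(" ");
}

const ada: Name = { given_name: "Ada", family_name: "Lovelace" };
// fullName(ada) yields "Ada Lovelace"
```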
Smell tests
In general, I like to use a litmus test of “how easy is it to do basic CRUD and answer domain-related questions?”. If your domain has user data split across four tables and they are always queried together… are they really separate? Should it all be flattened into one? Conversely, if you have disparate data mashed into one table, should it logically be split out and referenced separately? When CRUD is complex, whether by requiring multiple different calls or by forcing complex SQL to be written, that’s a signal that something is amiss. In my experience, non-trivial queries tend to be limited in scope, and most things do not require more than a handful of joins.
I also like to make sure that data is never contextual. You should never have a field that means X sometimes and Y other times. This kind of implicit codification based on data state makes systems hard to reason about. If you have conditional data, pair it together as a blob or a type, or add explicit metadata describing what you’re storing. A contextual example might be “sometimes field A is the user id and sometimes it’s the account id”. This happens a lot when designing single-table No-SQL schemas, where columns get overloaded based on the context of the key. Woe to the engineer who has to decipher this spaghetti.
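In code terms, the fix is to make the context explicit rather than implicit. A TypeScript sketch of the “field A is sometimes a user id, sometimes an account id” case (the `OwnerRef` name is hypothetical):

```typescript
// Anti-pattern: one overloaded field whose meaning depends on hidden context.
//   interface Row { owner_id: string } // user id? account id? who knows.

// Explicit alternative: a discriminated union that stores *what kind*
// of id this is alongside the id itself.
type OwnerRef =
  | { kind: "user"; user_id: string }
  | { kind: "account"; account_id: string };

function describeOwner(ref: OwnerRef): string {
  switch (ref.kind) {
    case "user":
      return `user:${ref.user_id}`;
    case "account":
      return `account:${ref.account_id}`;
  }
}

// The compiler now forces every reader to handle both meanings.
const desc = describeOwner({ kind: "user", user_id: "u123" });
```

The same idea applies at the storage layer: persist the `kind` discriminator next to the id, and the “which id is this?” question answers itself.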
Conclusion
Working with data is challenging because data often changes over time. The domain you think you are modeling now might not be the domain you end up with. However, data is the most basic layer of abstraction in your system: if the data is hard to work with, every layer above it will be hard to work with. Giving the data model the respect and diligence it deserves pays dividends over time.