Data Fingerprinting in JavaScript

Published January 26, 2020
I want to talk a little about how you can use content-based addressing (aka data fingerprinting) as a general approach to make your applications faster and more secure with some practical JavaScript examples.

Intro

First off, I find the concept of content-based addressing dope af. 👀
It's an extremely powerful tool for building services that are fundamentally more performant, scalable and secure. 💪
It's related to immutability, decentralization, data integrity, and more buzzwords...
But it's also so useful and under-appreciated in general that I wanted to write a practical intro to show how it works alongside some real-world JavaScript.

What the hell are you talking about?

You can think of content-based addressing as fingerprinting for data.
Just like how fingerprints allow you to:
  • Identify a person based on their fingerprint
  • Refer to a fingerprint as a unique ID for the person
  • Tell if two people are the same person based on their fingerprints
  • Quickly test to see if a person is in a database using just their fingerprint
Just replace "person" with "data" in the above descriptions and you have a rough overview of what content-based addressing enables.
Do you know where your data's been? 👀
Put another way, content-based addressing allows you to uniquely and efficiently reference data based on its actual content as opposed to something external like an ID or a URL.
Database-generated IDs, random GUIDs, and URLs are all useful in their own right, but they're not quite as powerful as data fingerprinting (more on this below).

Shut up and show me some code

Let's see how this looks with some real-world code that I've used for reals:
const pick = require('lodash.pick')
const stableStringify = require('fast-json-stable-stringify')

const data = pick(myData, ['keyFoo', 'keyBar'])
const fingerprint = hash(stableStringify(data))
This tiny snippet is hiding so much power...
This snippet leaves out the hash function (more on that below), but it does represent the core algorithm pretty clearly.
It creates a content-based hash fingerprint of any JavaScript object myData: a unique representation of that object based on the keys we care about, ['keyFoo', 'keyBar'].
In short, this fingerprint offers you a very efficient way of telling when two JavaScript objects are the same.
If two content-based IDs are the same, the data in those objects is the same.
No need for a deep comparison. No need for Redux. Just pure immutable goodness.
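Here's a minimal sketch of that comparison in action, assuming a hash helper like the one defined later in this post (the objects and keys here are made up for illustration):

const pick = require('lodash.pick')
const stableStringify = require('fast-json-stable-stringify')
const hasha = require('hasha')

const hash = (input) => hasha(input, { algorithm: 'sha256' })
const fingerprint = (obj) => hash(stableStringify(pick(obj, ['keyFoo', 'keyBar'])))

// same content, different key order, different metadata
const a = { keyFoo: 1, keyBar: 'x', updatedAt: '2020-01-25' }
const b = { keyBar: 'x', keyFoo: 1, updatedAt: '2020-01-26' }

console.log(fingerprint(a) === fingerprint(b)) // true => same data, no deep comparison needed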

So how does this actually work?

We'll break this process into three distinct steps: 1) input data, 2) data cleaning, and 3) simplification.
Let's take another look at our JavaScript code:
const pick = require('lodash.pick')
const stableStringify = require('fast-json-stable-stringify')

const data = pick(myData, ['keyFoo', 'keyBar'])
const fingerprint = hash(stableStringify(data))
First, we take as input any JavaScript object myData. This could be a model from your database or some object containing Redux-like app state, for instance.
Second, we clean our data to ensure that we're only considering the parts we actually care about via lodash.pick. This step is optional, but you'll usually want to clean your data like this before proceeding. In practice, I've found that most data contains parts that aren't actually representative of the model's uniqueness (we'll refer to this extra stuff as metadata 😉).
As an example, let's say I want to create unique IDs for all of the rows in a SQL table. Most SQL implementations will add metadata to your table like the date an entry was created or modified, and it's unlikely we'd want this metadata to affect our notion of uniqueness. In other words, if two rows were inserted into the table at different times but have the exact same values according to our application's business logic, then we want to treat them as having the same fingerprint, so we filter out this extra metadata.
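As a rough sketch of this cleaning step (the row shape and field names here are hypothetical):

const pick = require('lodash.pick')
const stableStringify = require('fast-json-stable-stringify')

// hypothetical row, including DB-managed metadata we want to ignore
const row = {
  id: 42,
  email: 'jane@example.com',
  plan: 'pro',
  created_at: '2020-01-26T10:00:00Z',
  updated_at: '2020-01-26T11:30:00Z'
}

// only these fields define uniqueness for our business logic
const data = pick(row, ['email', 'plan'])
const fingerprint = hash(stableStringify(data)) // hash is defined in the next section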
Third, we simplify our cleaned data into a stable, efficient representation that we can store and use for quick comparisons. Most of the time this step involves some sort of cryptographic hash to normalize the way we refer to our content in a unique, concise manner.
In the code above, we want to make sure that our hashing is stable, which is made easy for us by the fast-json-stable-stringify package.
This awesome package recursively makes sure that no matter how our JavaScript object was constructed or what order its keys may be in, it will always output the same string representation for any two objects that have deep equality.
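You can see this stability directly (a quick sketch; the objects are made up):

const stableStringify = require('fast-json-stable-stringify')

const a = { keyFoo: 1, keyBar: { nested: true } }
const b = { keyBar: { nested: true }, keyFoo: 1 }

// keys are sorted recursively, so deeply-equal objects serialize identically
console.log(stableStringify(a)) // '{"keyBar":{"nested":true},"keyFoo":1}'
console.log(stableStringify(a) === stableStringify(b)) // true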
There are some details this explanation is glossing over, but that's the beauty of the NPM ecosystem – we don't have to understand all the bits & pieces to take advantage of their abstractions.

Let's hash this thing out

Up until now, we've glossed over the hashing aspect of things, so let's see what this looks like in code:
const hasha = require('hasha')

const hash = (input) => hasha(input, { algorithm: 'sha256' })
Note that there are lots of different ways you could define your hash function. This example uses a very common SHA256 hash function and outputs a 64-character hex encoding of the results.
Here is an example output fingerprint: 2d3ea73f0faacebbb4a437ff758c84c8ef7fd6cce45c07bee1ff59deae3f67f5
Here is an alternative hash implementation that uses the Node.js crypto package directly:
const crypto = require('crypto')

const hash = (d) => {
  const buffer = Buffer.isBuffer(d) ? d : Buffer.from(d.toString())
  return crypto.createHash('sha256').update(buffer).digest('hex')
}
Both of these hash implementations are equivalent for our purposes.
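If you want to convince yourself of that, here's a quick sanity check (a sketch; both helpers are copied from above):

const hasha = require('hasha')
const crypto = require('crypto')

const hashA = (input) => hasha(input, { algorithm: 'sha256' })
const hashB = (d) => {
  const buffer = Buffer.isBuffer(d) ? d : Buffer.from(d.toString())
  return crypto.createHash('sha256').update(buffer).digest('hex')
}

console.log(hashA('hello') === hashB('hello')) // true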
The most important thing to keep in mind here is that we want to use a cryptographic hash function to output a compact, unique fingerprint that changes if our input data changes and remains the same if our input data remains the same.

So where should you go from here?

Once you start thinking about how data can be uniquely defined by its content, the applications are really endless.
Here are a few use cases where I've personally found this approach useful:
  • Generating unique identifiers for immutable deployments of serverless functions at Saasify. I know ZEIT uses a very similar approach to optimize their lambda deployments and package dependencies.
  • Generating unique identifiers for videos based on the database schema we used to generate them at Automagical. If two videos have the same fingerprint, they should have the same content. One note here is that it's often useful to add a version number to your data before hashing; in our case, changes to the video renderer changed the output videos even when the underlying data stayed the same (see the versioning sketch after this list).
  • Caching Stripe plans and coupons that have the same parameters across different projects and accounts at Saasify.
  • Caching client-side messages and HTTP metadata in a React webapp for Eko.
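And here's the versioning trick mentioned above as a minimal sketch (RENDERER_VERSION and the settings object are hypothetical, and hash is the helper from earlier):

const stableStringify = require('fast-json-stable-stringify')

// hypothetical render settings pulled from the database
const renderSettings = { keyFoo: 1, keyBar: 'x' }

// bump this whenever the renderer's output changes so that old
// fingerprints stop matching and stale cached videos get regenerated
const RENDERER_VERSION = 2

const fingerprint = hash(stableStringify({ ...renderSettings, version: RENDERER_VERSION }))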
We've really only started to scratch the surface of what you can do with content-based addressing. Hopefully, I've shown how simple this mindset shift is to apply in JavaScript and touched a bit on the benefits this approach brings to the table.
If you enjoy this stuff, I would recommend checking out:
  • Merkle trees - A recursive data structure built on top of content-based hashes.
  • IPFS - InterPlanetary File System.
  • libp2p - Modular building blocks for decentralized applications.
  • Saasify - An easier way for devs to earn passive income... Oh wait, that's my company and it's not really related to content-based addressing but cut me some slack haha 😂
Thanks! 🙏
 
👉 Follow me on twitter for more awesome stuff like this: @transitive_bs