Normalization vs Compression
July 9, 2010, 3:02:36 pm

Recently I was working on a relational database and I was juggling the eternal act of normalization with efficiency. There are many books about database normalization written by people who love writing books about database normalization. I've read a couple myself and it all seemed reasonable back in university.

But it's the 21st century and it dawned on me that normalization is the futile act of trying to manually compress data. If you break up your data records in to its related component parts to -avoid duplicate data- then you're literally trying to re-invent compression ..badly.

So let's say we let compression deal with our data storage. Smart hashing would mean you'd easily be able to store just as much data in "document" format. But how do you identify when two records are actually the same? I suspect this is the job of hashing and indexing. If two people have the same (and more interestingly - similar) address, statistical machine analysis will find this fact without a programmer having to define the concept of address as its own table.

I've also seen a few impressive search engines that work by doing exactly this - compressing the data and using hashes to look it up rapidly. That combined with similar indexing and automatic data normalization through reduction seems interesting to me. It could find patterns that you'd never think of normalizing normally but ultimately make your program more efficient, based on how you're using the data.

You can also use the same technique to apply indexing automatically. It is compression after all, so finding the kinds of data you're looking for is the job of statistical machine learning. Is there anyone out there doing this right now? I'd be interested in trying to use this kind of database over a classical relational database.

By Cees de Groot on July 13, 2010, 8:38:00 am

It's exactly the sort of stuff we deal with on a day-by-day basis. Any big website will have search engines and whatnot backing up a relational database, but we never use the actual relational bits of it. Accesses to the DB are usually by primary key only, the search engine handles the rest.

So, what you end up with is further screwing up the issues with disk latency. Instead of a single read to pull a 4k block from the drive containing The Document, the relational database gets a slew of queries scattered over several tables to pull together all the various normalized records that comprise the logical datum.

As far as I'm concerned, normalization is definitely a thing of the past for the sort of website I work at :)

By Isaac on July 13, 2010, 8:45:36 am

I don't think normalization is primarily meant to save disk/memory space, but rather to make the model of the data as exact as possible so that you can contruct any sort of query you want against it, and to make sure you know that data has a single canonical source. If the same data is in the database in multiple places (an employee's phone number is repeated for every project she is assigned to) then it's not clear which is authoritative.

I've generally found normalization to be extremely helpful with any long lasting dataset (since you end up using it later in ways you hadn't expected originally), and also think it's acceptable to de-normalize pieces of data for performance reasons, as long as it's still clear where the "official" source of that data is.

By JenaVelescu on July 13, 2010, 3:37:03 pm

mmmmI think you may be right in what you have said in this post. Whoever there will be many people will not agree with your opinion

By Joerg Beekmann on July 16, 2010, 5:51:48 pm

In your example if the "same" address can be entered multiple times you know there will be discrepancies. If the data was guaranteed to be duplicated exactly I couldn't care less how many times it written.