Monday, April 20, 2009

The Database Rant

It's time for SQL to die.

SQL was invented in the 70s for business analysts to write reports without having to go bug the programmers. Let me repeat that. SQL was invented in the 70s for business analysts to write reports without having to go bug the programmers.

Now, ask yourself why SQL databases are the standard data store for web apps in 2009?

Most applications that use SQL aren't using the right tool for the job. This is especially true of web-apps. Developers pick SQL databases because they are the closest thing we have to what is actually wanted.

Let's start with the good features of SQL databases have that developers want, and then we'll get to the parts that are just plain broken.

The Good

  • Libraries - I claim the primary reason developers default to using DBs is that their programming language of choice already has a library to interact with the DB. When faced with the challenge of coming up with a file format to store their data, and writing the code to pull data out of the format, most developers would (rightly) balk. "We already have tools for that, that's what databases are for". They'd be right, except for the fact that SQL sucks. Having a library that (mostly) handles the serialization and de-serialization of a whole bunch of data is pretty nice.

  • Data Integrity / Constraints - Being able to write declarative constraints on your data, such as "serial_number varchar(11) unique not null" is very powerful.

  • Transactions

  • Relational Model - This rant is about SQL the language, not the relational database model. I still think the relational model is pretty useful, and essential for allowing ad-hoc queries.

  • ad-hoc declarative queries

The Bad

All of software is about building one abstraction on another, into more powerful tools. We build new programming languages on top of old ones. We build Twitter on top of HTTP. It is obvious that SQL is not a good tool for building abstractions on because there are no good tools that build on it.


The fact that ORMs are the dominant paradigm for interacting with databases today is an excellent demonstration that SQL databases do not provide the correct tool that programmers want. ORMs are duct tape to make databases look almost like the correct tool. Every ORM I've ever seen is a very leaky abstraction []. Sooner or later, you are always back to writing SQL by hand, only now you have hand written SQL and ORM code in the same code base, usually with two different ways of handling connections and transactions, and the data "comes out" in two different formats.

queries are not composable

SQL queries are not composable. In functional programming languages, this means that if I have a function f(x), and a function g(x), I can put them together and make a new function h(x) that just calls f(g(x)). Notice that I can do this without modifying f or g. SQL doesn't have this property.

Let's say I have a query to find all employees who are in the accounting department:

"select * from employee where dept = 'accounting'"

and I have another query to find all employees who have been with the company more than 5 years:

"select * from employee where hire_date < now() - interval '5 years'"

Now I want to put them together, how do I do that? The naive and wrong solution is to use a subselect i.e.

"select * from (select * from employee where hire_data < now() - interval '5 years') where dept = 'accounting'"

Notice that this solution is composed of the first two queries, however composing in SQL kills your performance, and this is generally not the right way to structure the query. The "right" way to do this is to write a new query that merges the two together. Because our queries aren't composable, we can't write modular, reusable code snippets and piece them together to solve new problems; this is one of the primary techniques developers have for solving large, complex problems.

The Ugly

SQL has it's own sucky, ancient programming language for doing logic

PL/SQL is primitive by today's standards, yet it's the only game in town for certain classes of features, like stored procedures and triggers. Basically every production programming language is better than PL/SQL in every way. I don't need to go on, do I?

It's not obvious which queries are slow

The SQL model is declarative, which is nice when you want to write simple queries quickly, but falls down quickly when you need to write complex, high performing queries. Being able to look at a query and know whether it is fast or not depends highly on your DB vendor and product version, and is basically impossible in the general case without the EXPLAIN command. Writing efficient SQL requires you know that your vendors optimizes this query, but not that query, but only in versions 3.1.4 and later.

It gets worse. Finding the optimal plan for a complex queries is an NP hard problem . So in the cases where the DB optimization, shouldn't there be a manual override?

No back door for map, filter, reduce

No matter how good the planner gets, there will always be queries that generate slow plans. In some cases, you can fix things by rewriting your query so that it parses into a different tree that the planner does know how to optimize. In other cases, you're just screwed and you have to pull all the data into RAM and write the query yourself in your programming language.

Current databases don't give the developer a way to bypass the optimizer. At the end of the day, databases are just a bunch of data structures and algorithms that are efficient when using disk. Tables are just b-trees and sets and vectors. It should be possible to expose an API that looks like "do a search on that b-tree using index foo, now filter the data using function bar"

In some cases, it is much simpler than a page of SQL. No programming paradigm (imperative, OO, declarative, functional) is perfectly applicable in all situations. So why does SQL decide that declarative style is the One True Way to get access to your data?

datatypes don't match your PL

Ok, this is just annoying, but it's still a problem. Your database has it's own opinion of how big an Integer is, or what constitutes a timestamp. I spent weeks dealing with timestamps in a product, because Ruby and Rails had no concept of timezones on timestamps (until recently), yet the data had to be stored as "timestamp with time zone" in Postgres. There were all kinds of stupid bugs because the Rails code would insert a time in the current timezone, and postgres would interpret that as a UTC time.

I had to deal with another bug between Java and Postgres. Postgres has a built in datatype to represent IP addresses, along with the assorted functions you would expect, like "find me all the ip addresses that came from this subnet". Java also has a class to represent internet addresses, yet you can't insert them into Postgres directly. The Postgres datatype stores the netmask, while the Java type does not. This bug has been known since the early 2000s, yet it hasn't been fixed, and doesn't look like it will be fixed any time soon.

The Solution

Enough complaining. Time to talk about how to fix all of this. The design I'm thinking of looks fairly similar to a modern RDBMS. The relational model would still be present, you would still have ACID and transactions and constraints and foreign keys, except without the SQL.

  • Pick a VM. I like the JVM, and I think it has a strong future. This ties the database to a known set of primitive datatypes, to make interoperating with the programming language much simpler. True, it won't make everyone happy, but it will at least make some people happy.

  • Pick a simple syntax. I think a lisp/s-exp based syntax is appropriate. It's easy to generate, it's easy to parse, and a sane DB API is going to end up looking very similar to map/filter/reduce anyways, which fits the lisp concept very well. Plus, an s-exp based language would work very well with Clojure, my favorite programming language. But because sexp syntax is so formless, you could even translate the syntax into a JSON based format as well.

  • Expose primitive operations. This is a key point. In all databases today, there are functions that look like "go search this b-tree using this index". Expose all of those in a public (but "advanced") API. Build the query planner on top of the primitive operations. I think git is a good example of this. They have a set of tools called plumbing, and a set of tools called porcelain. The user-visible commands like "git checkout" "git commit" are all porcelain, but they're all built entirely on top of plumbing commands. In our DB example, this is all plumbing. The declarative query stuff is all porcelain.

  • Create a declarative query syntax. This would be s-exp based, and have similar concepts to today's SELECT. However, < big handwave > the API should be designed to allow queries to compose without killing performance </big handwave >.

  • Allow the DB to operate inside the app's process, or externally, similary to Derby.

That's my solution. What's yours?