Monday, April 20, 2009

The Database Rant

It's time for SQL to die.

SQL was invented in the 70s for business analysts to write reports without having to go bug the programmers. Let me repeat that. SQL was invented in the 70s for business analysts to write reports without having to go bug the programmers.

Now ask yourself: why are SQL databases the standard data store for web apps in 2009?

Most applications that use SQL aren't using the right tool for the job. This is especially true of web-apps. Developers pick SQL databases because they are the closest thing we have to what is actually wanted.

Let's start with the good features SQL databases have that developers want, and then we'll get to the parts that are just plain broken.

The Good

  • Libraries - I claim the primary reason developers default to using DBs is that their programming language of choice already has a library to interact with the DB. When faced with the challenge of coming up with a file format to store their data, and writing the code to pull data out of the format, most developers would (rightly) balk. "We already have tools for that, that's what databases are for". They'd be right, except for the fact that SQL sucks. Having a library that (mostly) handles the serialization and de-serialization of a whole bunch of data is pretty nice.

  • Data Integrity / Constraints - Being able to write declarative constraints on your data, such as "serial_number varchar(11) unique not null" is very powerful.

  • Transactions

  • Relational Model - This rant is about SQL the language, not the relational database model. I still think the relational model is pretty useful, and essential for allowing ad-hoc queries.

  • Ad-hoc declarative queries
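
To make the constraints point concrete, here is a minimal sketch using Python's built-in sqlite3 module (the table and values are made up): the database itself enforces "unique not null", with no application code.

```python
import sqlite3

# In-memory database; the column mirrors the serial_number example above.
conn = sqlite3.connect(":memory:")
conn.execute("create table widget (serial_number varchar(11) unique not null)")
conn.execute("insert into widget values ('ABC-123')")

# The database rejects the duplicate -- no application-level checking needed.
try:
    conn.execute("insert into widget values ('ABC-123')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

print(duplicate_allowed)  # False
```

(SQLite happens not to enforce the varchar length, which is itself a small example of the datatype mismatches discussed below.)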

The Bad

All of software is about building one abstraction on top of another, into ever more powerful tools. We build new programming languages on top of old ones. We build Twitter on top of HTTP. It is telling that SQL is not a good tool for building abstractions on: there are no good tools that build on it.


The fact that ORMs are the dominant paradigm for interacting with databases today is an excellent demonstration that SQL databases do not provide the tool programmers actually want. ORMs are duct tape to make databases look almost like the correct tool. Every ORM I've ever seen is a very leaky abstraction. Sooner or later, you are always back to writing SQL by hand, only now you have hand-written SQL and ORM code in the same code base, usually with two different ways of handling connections and transactions, and the data "comes out" in two different formats.

Queries are not composable

SQL queries are not composable. In functional programming, composability means that if I have a function f(x) and a function g(x), I can put them together to make a new function h(x) that just calls f(g(x)). Notice that I can do this without modifying f or g. SQL doesn't have this property.
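
The function-composition idea, sketched in Python (no database involved; `double` and `increment` are arbitrary stand-ins for f and g):

```python
def compose(f, g):
    """Return h where h(x) = f(g(x)), built without modifying f or g."""
    return lambda x: f(g(x))

double = lambda x: x * 2
increment = lambda x: x + 1

h = compose(double, increment)
print(h(3))  # double(increment(3)) == 8
```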

Let's say I have a query to find all employees who are in the accounting department:

"select * from employee where dept = 'accounting'"

and I have another query to find all employees who have been with the company more than 5 years:

"select * from employee where hire_date < now() - interval '5 years'"

Now I want to put them together; how do I do that? The naive and wrong solution is to use a subselect, i.e.

"select * from (select * from employee where hire_date < now() - interval '5 years') as recent where dept = 'accounting'"

Notice that this solution is composed of the first two queries; however, composing in SQL kills your performance, and this is generally not the right way to structure the query. The "right" way to do this is to write a new query that merges the two together. Because our queries aren't composable, we can't write modular, reusable code snippets and piece them together to solve new problems; this is one of the primary techniques developers have for solving large, complex problems.
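
The workaround most code bases land on is to compose WHERE fragments by hand at the application layer. A minimal sketch with sqlite3 (the table and data are made up, and "now" is hard-coded since SQLite lacks Postgres's interval syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employee (name text, dept text, hire_date text)")
conn.executemany("insert into employee values (?, ?, ?)", [
    ("alice", "accounting", "2001-01-15"),
    ("bob",   "accounting", "2008-06-01"),
    ("carol", "sales",      "2000-03-10"),
])

# Each reusable predicate is a (sql fragment, params) pair.
in_accounting   = ("dept = ?", ["accounting"])
five_plus_years = ("hire_date < ?", ["2004-04-20"])  # "now" minus 5 years

def query(*predicates):
    """Compose predicates by AND-ing their SQL fragments together."""
    where = " and ".join(f"({sql})" for sql, _ in predicates)
    params = [p for _, ps in predicates for p in ps]
    return conn.execute(f"select name from employee where {where}", params).fetchall()

print(query(in_accounting, five_plus_years))  # [('alice',)]
```

Note that this is exactly the duct tape the post complains about: the composition happens by string pasting in the application, not in the query language itself.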

The Ugly

SQL has its own sucky, ancient programming language for doing logic

PL/SQL is primitive by today's standards, yet it's the only game in town for certain classes of features, like stored procedures and triggers. Basically every production programming language is better than PL/SQL in every way. I don't need to go on, do I?

It's not obvious which queries are slow

The SQL model is declarative, which is nice when you want to write simple queries quickly, but falls down quickly when you need to write complex, high-performing queries. Being able to look at a query and know whether it is fast depends highly on your DB vendor and product version, and is basically impossible in the general case without the EXPLAIN command. Writing efficient SQL requires knowing that your vendor optimizes this query, but not that query, and only in versions 3.1.4 and later.

It gets worse. Finding the optimal plan for a complex query is an NP-hard problem. So in the cases where the DB gets the optimization wrong, shouldn't there be a manual override?
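
A sketch of what "asking the planner" looks like, using sqlite3 (table and index names are made up). The point is that nothing about the query text itself tells you whether the index gets used; you have to ask, and the answer depends on your DB and version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employee (name text, dept text)")
conn.execute("create index dept_idx on employee (dept)")

# The query looks identical whether or not dept_idx exists; only the
# planner's output reveals the difference.
plan = conn.execute(
    "explain query plan select * from employee where dept = 'accounting'"
).fetchall()
for row in plan:
    print(row)  # the detail column says whether dept_idx is used
```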

No back door for map, filter, reduce

No matter how good the planner gets, there will always be queries that generate slow plans. In some cases, you can fix things by rewriting your query so that it parses into a different tree that the planner does know how to optimize. In other cases, you're just screwed and you have to pull all the data into RAM and write the query yourself in your programming language.

Current databases don't give the developer a way to bypass the optimizer. At the end of the day, databases are just a bunch of data structures and algorithms that are efficient on disk. Tables are just b-trees and sets and vectors. It should be possible to expose an API that looks like "do a search on that b-tree using index foo, now filter the data using function bar".

In some cases, that would be much simpler than a page of SQL. No programming paradigm (imperative, OO, declarative, functional) is perfectly applicable in all situations. So why does SQL decide that declarative style is the One True Way to get access to your data?
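
A toy sketch of what that "plumbing" API might feel like, in pure Python. A sorted list stands in for the b-tree, and everything here (the data, the index, the functions) is made up for illustration:

```python
import bisect

employees = [
    {"name": "alice", "dept": "accounting", "hire_year": 2001},
    {"name": "bob",   "dept": "accounting", "hire_year": 2008},
    {"name": "carol", "dept": "sales",      "hire_year": 2000},
]

# Stand-in for a b-tree index on dept: a sorted list of (key, row id) pairs.
dept_index = sorted((e["dept"], i) for i, e in enumerate(employees))

def index_scan(index, key):
    """Plumbing: "do a search on that b-tree using index foo"."""
    pos = bisect.bisect_left(index, (key,))
    rows = []
    while pos < len(index) and index[pos][0] == key:
        rows.append(employees[index[pos][1]])
        pos += 1
    return rows

# "now filter the data using function bar" -- an ordinary filter, no SQL.
result = [e["name"] for e in index_scan(dept_index, "accounting")
          if e["hire_year"] < 2004]
print(result)  # ['alice']
```

The developer, not the planner, chose the access path here; a declarative layer could still be built on top of exactly these operations.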

Datatypes don't match your PL

Ok, this is just annoying, but it's still a problem. Your database has its own opinion of how big an integer is, or what constitutes a timestamp. I spent weeks dealing with timestamps in a product, because Ruby and Rails had no concept of timezones on timestamps (until recently), yet the data had to be stored as "timestamp with time zone" in Postgres. There were all kinds of stupid bugs because the Rails code would insert a time in the current timezone, and Postgres would interpret that as a UTC time.
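
A sketch of that bug class with Python's datetime (the specific offset is made up): a naive timestamp meant as local time, read back as if it were UTC, is silently off by the whole UTC offset.

```python
from datetime import datetime, timedelta, timezone

# A naive timestamp, no zone attached -- what old Rails produced.
naive = datetime(2009, 4, 20, 9, 0, 0)

intended = naive.replace(tzinfo=timezone(timedelta(hours=-7)))  # 09:00 -07:00
misread_as_utc = naive.replace(tzinfo=timezone.utc)             # 09:00 +00:00

error = intended - misread_as_utc
print(error)  # 7:00:00 -- the stored instant is wrong by the UTC offset
```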

I had to deal with another bug between Java and Postgres. Postgres has a built in datatype to represent IP addresses, along with the assorted functions you would expect, like "find me all the ip addresses that came from this subnet". Java also has a class to represent internet addresses, yet you can't insert them into Postgres directly. The Postgres datatype stores the netmask, while the Java type does not. This bug has been known since the early 2000s, yet it hasn't been fixed, and doesn't look like it will be fixed any time soon.

The Solution

Enough complaining. Time to talk about how to fix all of this. The design I'm thinking of looks fairly similar to a modern RDBMS. The relational model would still be present, you would still have ACID and transactions and constraints and foreign keys, except without the SQL.

  • Pick a VM. I like the JVM, and I think it has a strong future. This ties the database to a known set of primitive datatypes, to make interoperating with the programming language much simpler. True, it won't make everyone happy, but it will at least make some people happy.

  • Pick a simple syntax. I think a lisp/s-exp based syntax is appropriate. It's easy to generate, it's easy to parse, and a sane DB API is going to end up looking very similar to map/filter/reduce anyway, which fits the lisp concept very well. Plus, an s-exp based language would work very well with Clojure, my favorite programming language. And because s-exp syntax is so formless, you could even translate it into a JSON based format as well.

  • Expose primitive operations. This is a key point. In all databases today, there are functions that look like "go search this b-tree using this index". Expose all of those in a public (but "advanced") API. Build the query planner on top of the primitive operations. I think git is a good example of this. They have a set of tools called plumbing, and a set of tools called porcelain. The user-visible commands like "git checkout" "git commit" are all porcelain, but they're all built entirely on top of plumbing commands. In our DB example, this is all plumbing. The declarative query stuff is all porcelain.

  • Create a declarative query syntax. This would be s-exp based, and have similar concepts to today's SELECT. However, < big handwave > the API should be designed to allow queries to compose without killing performance </big handwave >.

  • Allow the DB to operate inside the app's process, or externally, similarly to Derby.
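
To sketch the syntax and composability points together: queries as plain nested lists, built in Python (every operator and name here is hypothetical, not any real database's API). Composing two queries is just building a bigger tree, and the same tree serializes to JSON for free.

```python
import json

# Hypothetical s-exp-style predicates as nested lists.
in_accounting   = ["=", "dept", "accounting"]
five_plus_years = ["<", "hire_date", ["-", ["now"], ["years", 5]]]

def compose_and(*preds):
    """Compose queries structurally; the planner, not the programmer,
    decides how to execute the combined tree."""
    return ["and", *preds]

query = ["select", ["from", "employee"],
         ["where", compose_and(in_accounting, five_plus_years)]]

print(json.dumps(query))  # the same tree, serialized as JSON
```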

That's my solution. What's yours?


Andy Chambers said...

I'd suggest building your API on top of an rdf triple store.

arthole said...

This has been done by IBM! It's called RPG, the Report Program Generator language. It provides a non-SQL interface to a DBMS on iSeries/AS400 midrange computers.

Compared to Python and SQL, it sucks.

You started your rant by pointing out that SQL was created for the business analysts of a DBMS, not programmers. Thus, the easiest solution to ad-hoc queries is to expose the SQL interface to those users.

Managing users and table/column authorizations in a single application is a reasonably safe way to handle this. Letting a user write their own SQL and not have to bother the developers.

For complex and varied queries, I generally embed select statements in the database itself, using variables, to achieve a level of composability. In this way, specific queries can be generated as a function of a higher query with a bit of search and replace.

It's also possible to store database definitions in tables, and abstract the application's SQL statements into higher order SQL combinations.

This does not happen often in practice, because it is essentially creating an API for the application's database, and unless ad-hoc queries are the rule, the universe of query statements in most applications becomes static over time.

In general, I have found that having to go into code and tweak SQL over and over is caused by a changing database. And if that is the case, it means the database is incomplete - an incomplete semantic representation of the user's data "world".

At some point, this changing slows or stops entirely, and the employed SQL becomes very static. The beauty of SQL is that it will be quite portable to a new language when the old one dies off.

I've seen single SQL statements replace whole RPG (and Cobol!) programs. It is a very good (though not perfect) interface to an RDBMS. Other interfaces will always be language or RDBMS specific. And for this reason, those platforms will suffer "isolation" problems. (this is exactly the problem faced by the iSeries/AS400)

pawel said...

Disagree with Andy Chambers; an RDF triple store would be the SQL of the semantic web :-)

Agree SQL should and will die soon.

Advanced semantic technologies will replace all this stuff.

Eventually all will be in the Global Brain anyway.

Pawel Lubczonok

Timothy said...

I would respectfully disagree. I will be happy to move away from SQL when something better comes along, but I am not certain that what you are describing is better, and I think some of your assumptions are false. (Though I would love to see select statements become fully composable as you described and still maintain performance.)

For one thing, you say that ORMs are the dominant model. I am not certain that is true yet since most projects I have worked on and talked to others about do not use them.

And you point out that ORMs are a leaky model, which I will agree with, but look at how young they are, relatively. Microsoft's main ORM, LINQ, was released at the end of 2007. The other one I have worked with is SQLAlchemy, which has not even hit the 1.0 milestone. They have major problems now, but they will get better.

And you say that there is no low-level access to the data. This is partially true, but for multiuser environments with potentially mixed programming languages I think this may be a good thing, and is at worst a mixed blessing. Also keep in mind that many RDBMSs do offer a fairly good degree of control over the plan that will be used. MS SQL Server, for instance, has a variety of query hints which let you force a lot of the parameters of the final plan.

As someone who regularly uses more than one programming language, I would be very concerned about your suggestion to pick a virtual machine. That would strongly suggest a particular language to use with that database, as well as retain many of your complaints about primitive type mismatches whenever you did use a different language.

I see where you are coming from and you have some valid points, but at least for the foreseeable future I believe the answer is to continue to refine and evolve SQL and the new ORM systems, not to cast them by the wayside.

John "Z-Bo" Zabroski said...

Check out Adam Marcus's BlendDB MIT Master's thesis project. It talks about table layout schemes for relational databases that are geared toward navigating records.