Wednesday, July 08, 2009

Can your programming language do this?

In information theory, entropy is the measurement of the amount of randomness in a probability distribution.

It's defined as

In my code, it's

(defn entropy [X]
      (* -1 (Σ [i X] (* (p i) (log (p i))))))

Compare the similarity of my code above with the actual definition. Having a language that lets you declare constructs that look very similar to their mathematical definition is a huge win for readability.

The two important features I used in this example are

1) Clojure allows identifiers to be unicode
2) Macros

My macro is defined as

(defmacro Σ
  "computes the sum of a series."
  [[var-name values] & body]
  `(reduce +
       (map (fn [~var-name]
          ~@body) ~values)))

Of course, I have a similar one:

(defmacro Π
  "computes the product of a series."

You can imagine how it looks.

The other big win here is I've abstracted out *how* the sum and product of a series are implemented. This allows me to do performance optimizations behind the scenes. For example, I could replace that call to map with a call to pmap.

The point of this isn't to show off how Clojure is better than other languages; I'm trying to show that these features are powerful. If your language of choice doesn't have these features, maybe it should.

Of course, macros are just a means to an end. I find them great, but that doesn't mean they're the One True Way to accomplish our goal of readable code. I'd love to see examples of solving this problem in other ways, in any language.

Wednesday, June 24, 2009

Startups as Functions

I used to worry about the efficiency of my code. I still do, but now that I'm working on a startup I have additional concerns. Now I have to worry about money as well.

I pay money to get web hosting for my site. I'm using AWS, which charges by the CPU/hour. For the smallest box, they charge $0.10/hr. Each of those boxes can support some number of users, and I make some amount of money off each user. Now I have an equation:

`"input money" - ($0.10 / "CPU-hr") * "CPU-hr" / "user" * "users"/"month" + "users"/ "month" *"money" / "user" = "output money"/"month"`

Wait a minute. I have money on both sides of the equation. But the input money and the output money can be different. And hopefully, output money > input money. What would it look like if I took the output money and fed it into the other side?

Obviously at this point we don't have a true equation because there's time involved, and that isn't modeled here. What have now is more like a recursive function: the output of one run through the function is the input of the next call.

if output money - expenses > input money, I have a successful company.

There's another input to the function here, because you need input users, (and you want to output happy users). If your users are happy, they'll return or recommend you to other users:

So what are the benefits of this mental model? If there are no benefits, there's no reason to bother with it. Well, I thought it was interesting. But more significantly, it turns company creation and analysis into an engineering problem, and we're good at those. It becomes easier to see the whole picture, so we can identify problems. There's also a large body of Engineering/CS/technical knowledge that can be applied to this model. Debugging, profiling, pipeline stalls, complexity analysis, all of these tools are easily applicable to this model.

Look at the function again. Each input is a parameter that can be tweaked. What happens when I improve the code to make users happier? What happens when I raise my advertising budget, attempting to acquire more users?

All successful companies have a similar function, it's just easy to see in a web startup because the cycle is simple. Retail stores have policies for what employees are supposed to do, and when to re-order goods. Some stores have even automated the process

There's one more input to the recursive function. Eric Ries has impressed on me the value of using A/B split testing and Google Analytics. What this means is that each time a user goes through the system, I collect data on their trip. I can use that data to analyze the system's performance, and make decisions about how to make the system more profitable. I can improve the code to make the users happier, increase monitization per user, or reduce costs:

It turns out, like startups, people are functions too. If the company is a function, then founders are higher-order functions. Every day, we take the existing function (company) and improve it. And now, I've derived the Y Combinator, 4 years after Paul Graham.

People reading this article are part of the loop too. Some percentage of readers will visit reasonr. They'll either use the site, or not, and like it, or not. All of that feedback is useful for making the company succeed, and all I had to do was write this blog post.

Footnote:
I was originally going to write an article about how my site isn't in a position to do A/B tests effectively yet, because I don't have enough traffic to make statistically sound judgments. This idea of "companies as functions" had been banging around in my head for a while, and the two ideas merged into A strange Douglas Hofstadter meta post, where talking about my idea results in more data to do A/B testing.

Wednesday, June 17, 2009

Preventing tests (or, 5 Whys in action)

I was working on some Clojure for my site, Reasonr today. I develop on my laptop, using Emacs, and I have a production box on AWS. I recently set up swank-clojure on the production web server, so I can SSH tunnel onto the box, and get a REPL. Very cool.

There is one downside however. I had tunneled into the box to check something out, and then forgot about. I went back to developing. I was ready to test my changes, and I ran (run-all-tests). Normally this would be fine, except my slime buffer was still connected to the production box, and run-all-tests does more than just unit tests, it does things like insert (fake) data into the database and makes sure it comes back out correctly. And I was still connected to the production box, so now that fake data is in my production DB.

#$^%#

After deleting all the crap data out of the DB, I thought of Eric Ries' 5 Whys, and set about trying to make sure this never happens again.

Step 1: Don't allow tests in production

(defn no-tests-for-you []
  (println "You're in production, dumbass. No tests for you!"))

(defn prevent-tests []
  (with-ns/with-ns 'clojure.contrib.test-is)
    (def run-all-tests user/no-tests-for-you)
    (def test-ns user/no-tests-for-you)
    (def test-var user/no-tests-for-you))

(when (= winston.env/environment :production)
  (prevent-tests))

Here, I have a dummy function that reminds the user we're talking to the production box. I also have a function that redefines most of the ways to run clojure tests. with-ns is from clojure.contrib.with-ns, which evaluates body in the specified namespace. Thankfully, (run-all-tests) is the most common way for me to start tests, and the remaining ways all take more effort to run, so this net will likely catch most of my goofs. I also have a var, winston.env/environment which specifies if we're in production or development. If we're in production, we obviously don't want to run tests.

Step 2: Fix the prompt

A big part of my confusion was because I wasn't sure which box I was connected to. Let's fix that by modifying the repl prompt to print the hostname. I already had code that started a repl manually, rather than using the default one in clojure.main. All I had to do was supply a new definition for the repl-prompt

(def hostname (.trim (sh "hostname")))

(defn repl-prompt []
  (printf "%s:%s=> " winston.env/hostname *ns*))

(defn start-repl []
  (binding [clojure.main/repl-prompt winston.repl/repl-prompt]
    (clojure.main/repl)))

I used the sh function from clojure.contrib.shell-out to determine the hostname, and store that in a variable. Then I modified the repl prompt to print hostname:namespace=> rather than just "namespace=>". binding is clojure's way to safely monkey patch. Inside the scope of binding, all calls to clojure.main/repl-prompt will be replaced with calls to winston.repl/repl-prompt.

Step 3: Fix Slime?

I sort of cheated on step 2. Everything I wrote works great, but it doesn't fix Slime. Apparently slime hardcodes the definition of the prompt. You'll see the modified prompt when using the repl at the console, but not via slime. And I don't know of a good way to fix the slime prompt. Anyone out there have ideas?

Monday, April 20, 2009

The Database Rant

It's time for SQL to die.

SQL was invented in the 70s for business analysts to write reports without having to go bug the programmers. Let me repeat that. SQL was invented in the 70s for business analysts to write reports without having to go bug the programmers.

Now, ask yourself why SQL databases are the standard data store for web apps in 2009?

Most applications that use SQL aren't using the right tool for the job. This is especially true of web-apps. Developers pick SQL databases because they are the closest thing we have to what is actually wanted.

Let's start with the good features of SQL databases have that developers want, and then we'll get to the parts that are just plain broken.

The Good

Libraries - I claim the primary reason developers default to using DBs is that their programming language of choice already has a library to interact with the DB. When faced with the challenge of coming up with a file format to store their data, and writing the code to pull data out of the format, most developers would (rightly) balk. "We already have tools for that, that's what databases are for". They'd be right, except for the fact that SQL sucks. Having a library that (mostly) handles the serialization and de-serialization of a whole bunch of data is pretty nice.

Data Integrity / Constraints - Being able to write declarative constraints on your data, such as "serial_number varchar(11) unique not null" is very powerful.

Transactions

Relational Model - This rant is about SQL the language, not the relational database model. I still think the relational model is pretty useful, and essential for allowing ad-hoc queries.

ad-hoc declarative queries

The Bad

All of software is about building one abstraction on another, into more powerful tools. We build new programming languages on top of old ones. We build Twitter on top of HTTP. It is obvious that SQL is not a good tool for building abstractions on because there are no good tools that build on it.

ORMs

The fact that ORMs are the dominant paradigm for interacting with databases today is an excellent demonstration that SQL databases do not provide the correct tool that programmers want. ORMs are duct tape to make databases look almost like the correct tool. Every ORM I've ever seen is a very leaky abstraction [http://www.joelonsoftware.com/articles/LeakyAbstractions.html]. Sooner or later, you are always back to writing SQL by hand, only now you have hand written SQL and ORM code in the same code base, usually with two different ways of handling connections and transactions, and the data "comes out" in two different formats.

queries are not composable

SQL queries are not composable. In functional programming languages, this means that if I have a function f(x), and a function g(x), I can put them together and make a new function h(x) that just calls f(g(x)). Notice that I can do this without modifying f or g. SQL doesn't have this property.

Let's say I have a query to find all employees who are in the accounting department:

"select * from employee where dept = 'accounting'"

and I have another query to find all employees who have been with the company more than 5 years:

"select * from employee where hire_date < now() - interval '5 years'"

Now I want to put them together, how do I do that? The naive and wrong solution is to use a subselect i.e.

"select * from (select * from employee where hire_data < now() - interval '5 years') where dept = 'accounting'"

Notice that this solution is composed of the first two queries, however composing in SQL kills your performance, and this is generally not the right way to structure the query. The "right" way to do this is to write a new query that merges the two together. Because our queries aren't composable, we can't write modular, reusable code snippets and piece them together to solve new problems; this is one of the primary techniques developers have for solving large, complex problems.

The Ugly

SQL has it's own sucky, ancient programming language for doing logic

PL/SQL is primitive by today's standards, yet it's the only game in town for certain classes of features, like stored procedures and triggers. Basically every production programming language is better than PL/SQL in every way. I don't need to go on, do I?

It's not obvious which queries are slow

The SQL model is declarative, which is nice when you want to write simple queries quickly, but falls down quickly when you need to write complex, high performing queries. Being able to look at a query and know whether it is fast or not depends highly on your DB vendor and product version, and is basically impossible in the general case without the EXPLAIN command. Writing efficient SQL requires you know that your vendors optimizes this query, but not that query, but only in versions 3.1.4 and later.

It gets worse. Finding the optimal plan for a complex queries is an NP hard problem . So in the cases where the DB optimization, shouldn't there be a manual override?

No back door for map, filter, reduce

No matter how good the planner gets, there will always be queries that generate slow plans. In some cases, you can fix things by rewriting your query so that it parses into a different tree that the planner does know how to optimize. In other cases, you're just screwed and you have to pull all the data into RAM and write the query yourself in your programming language.

Current databases don't give the developer a way to bypass the optimizer. At the end of the day, databases are just a bunch of data structures and algorithms that are efficient when using disk. Tables are just b-trees and sets and vectors. It should be possible to expose an API that looks like "do a search on that b-tree using index foo, now filter the data using function bar"

In some cases, it is much simpler than a page of SQL. No programming paradigm (imperative, OO, declarative, functional) is perfectly applicable in all situations. So why does SQL decide that declarative style is the One True Way to get access to your data?

datatypes don't match your PL

Ok, this is just annoying, but it's still a problem. Your database has it's own opinion of how big an Integer is, or what constitutes a timestamp. I spent weeks dealing with timestamps in a product, because Ruby and Rails had no concept of timezones on timestamps (until recently), yet the data had to be stored as "timestamp with time zone" in Postgres. There were all kinds of stupid bugs because the Rails code would insert a time in the current timezone, and postgres would interpret that as a UTC time.

I had to deal with another bug between Java and Postgres. Postgres has a built in datatype to represent IP addresses, along with the assorted functions you would expect, like "find me all the ip addresses that came from this subnet". Java also has a class to represent internet addresses, yet you can't insert them into Postgres directly. The Postgres datatype stores the netmask, while the Java type does not. This bug has been known since the early 2000s, yet it hasn't been fixed, and doesn't look like it will be fixed any time soon.

The Solution

Enough complaining. Time to talk about how to fix all of this. The design I'm thinking of looks fairly similar to a modern RDBMS. The relational model would still be present, you would still have ACID and transactions and constraints and foreign keys, except without the SQL.

Pick a VM. I like the JVM, and I think it has a strong future. This ties the database to a known set of primitive datatypes, to make interoperating with the programming language much simpler. True, it won't make everyone happy, but it will at least make some people happy.

Pick a simple syntax. I think a lisp/s-exp based syntax is appropriate. It's easy to generate, it's easy to parse, and a sane DB API is going to end up looking very similar to map/filter/reduce anyways, which fits the lisp concept very well. Plus, an s-exp based language would work very well with Clojure, my favorite programming language. But because sexp syntax is so formless, you could even translate the syntax into a JSON based format as well.

Expose primitive operations. This is a key point. In all databases today, there are functions that look like "go search this b-tree using this index". Expose all of those in a public (but "advanced") API. Build the query planner on top of the primitive operations. I think git is a good example of this. They have a set of tools called plumbing, and a set of tools called porcelain. The user-visible commands like "git checkout" "git commit" are all porcelain, but they're all built entirely on top of plumbing commands. In our DB example, this is all plumbing. The declarative query stuff is all porcelain.

Create a declarative query syntax. This would be s-exp based, and have similar concepts to today's SELECT. However, < big handwave > the API should be designed to allow queries to compose without killing performance </big handwave >.

Allow the DB to operate inside the app's process, or externally, similary to Derby.

That's my solution. What's yours?

/dev/rohner

AsciiMathML