Not Sure About “NO DB”

Abstractions are fundamental to object-oriented programming — to computing even — to life even. One abstraction that Uncle Bob likes is the database abstraction. In his NO DB post, he makes two main points: 

  1. Choosing a SQL database isn’t a given — you should decide what storage is best 

  2. Your database should be totally abstracted in your code — you can switch between any underlying data store and your code still works the same 

The first point is a good one, imo. The second is a bit more contentious. It’s a pretty popular view though — every one of my colleagues believes in it strongly. All of these people are smarter than me, so I should probably change my opinion, but… 

…I’m a bit quirky, and even though my ratio of correct to incorrect is about 1:10, I’ll show you why I still believe that taking the database abstraction to that extreme is possibly a bad idea.

THIS IS NOT A RANT OK??!!!

Abstraction Trade Offs

Abstractions are concepts. We take a set of lower-level features and combine them into a single conceptual unit. We then think about this single unit without caring about the lower-level features it encompasses.

Complexity / Runtime Performance Ratio

All abstractions leave open the possibility of a runtime performance penalty. You are hiding those lower-level features and potentially blocking off your path to them. Sometimes this is good…

C# is compiled to CIL, which in turn is compiled down to machine code. If you were to write the assembly yourself you could write it in a more efficient way that resulted in faster execution of your code at runtime. It would take you a looooong time, though. 

For this abstraction, the complexity is reduced by an insane amount, yet the performance penalty is relatively small on modern computers. 

With a database abstraction, the complexity being hidden is fairly small, and hiding it doesn’t help us create software much faster. Yet because of database querying and crossing the network, it can slow down runtime performance by orders of magnitude more than other abstractions (from nanoseconds, to hundreds of milliseconds, to seconds even).

Switching Abstractions

Uncle Bob didn’t care much for performance. He focused on the other impact of abstractions — you can switch the lower-level features you’ve hidden behind that abstraction. 

And for many an abstraction, I think that is a grand thing to do. Different payment calculations for different types of staff, different promotions for different months of the year, different validation rules for different business rules — some abstractions we switch freely and frequently, and they let us write simple code. Most of the time the performance implications of these small in-memory algorithms are nil. 
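
To show the kind of thing I mean (the names here are invented for illustration), this is a switch that happens entirely in memory, so swapping the implementation behind the abstraction costs essentially nothing at runtime:

using System;

// Hypothetical example of an abstraction we switch freely and cheaply:
// different payment calculations for different types of staff.
public interface IPaymentCalculator
{
    decimal CalculatePay(decimal hoursWorked);
}

public class PermanentStaffCalculator : IPaymentCalculator
{
    public decimal CalculatePay(decimal hoursWorked) => hoursWorked * 20m;
}

public class ContractorCalculator : IPaymentCalculator
{
    // Say contractors get a higher rate but no paid overtime.
    public decimal CalculatePay(decimal hoursWorked) => Math.Min(hoursWorked, 40m) * 35m;
}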

I don’t recall too many times I’ve used a database abstraction polymorphically. I don’t recall too many times I’ve switched databases, either. You may have different experiences to me, though.Abstractions are Inefficient at Runtime

So far, we’ve accumulated the benefit of being able to switch the underlying database — although we probably won’t ever do it, and we’ve hidden the complexity — although there probably isn’t that much. 

In return we’re expecting some performance degradation, with the potential for significant performance degradation. I’ll explore now how much this might be.Different Database Abstractions Need Different Data Models

To get optimal performance from our database, we need to design our code model so that it fits with a data model that is best suited for querying the chosen type of database. In his example Uncle Bob talked about being able to switch from a relational to a document database with no change in code. 

Considering relational and document databases are fundamentally different — normalised tables and joins vs. denormalised documents — it already seems quite possible that a code model that fits both is going to have significant performance implications with one or both databases lying underneath it. 

Let’s have a look at simple examples of how you would model for relational and document databases to get better performance.

SQL

If using SQL via NHibernate, a typical model might look like this:

using System.Collections.Generic;

public class House
{
    public string Id { get; set; }
    public string Address { get; set; }
    public bool HasGarden { get; set; }
    public bool HasBeenPurchased { get; set; }
    public IList<Room> Rooms { get; set; }
}

public class Room
{
    public string Id { get; set; }
    public string Name { get; set; }
    public int Area { get; set; }
    public int Height { get; set; }
    public int NumberOfElectricalSockets { get; set; }
}

Sometimes we want to get a House but not its rooms, so NHibernate will lazy load them for us. We could use a micro-ORM that doesn’t lazy load, in which case we would keep a list of the Ids of each room instead.
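
As a rough sketch of how that model gets wired up, a Fluent NHibernate mapping along these lines would give us the lazily loaded Rooms collection. Treat it as an assumption about the mapping rather than a copy of a real one, and note that in practice NHibernate would want the entity properties to be virtual so it can proxy them:

using FluentNHibernate.Mapping;

// Sketch only: map House with a lazily loaded Rooms collection, so loading
// a House does not touch the Room table until Rooms is first accessed.
// (In practice the properties on House and Room would need to be virtual
// for NHibernate's proxies to work.)
public class HouseMap : ClassMap<House>
{
    public HouseMap()
    {
        Id(x => x.Id).GeneratedBy.Assigned();
        Map(x => x.Address);
        Map(x => x.HasGarden);
        Map(x => x.HasBeenPurchased);
        HasMany(x => x.Rooms).LazyLoad(); // separate SELECT, issued only when needed
    }
}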

Document Databases

using System.Collections.Generic;

public class House
{
    public string Id { get; set; }
    public string Address { get; set; }
    public bool HasGarden { get; set; }
    public bool HasBeenPurchased { get; set; }
    public int RoomsId { get; set; }
}

public class HouseRooms
{
    public int Id { get; set; }
    public IList<Room> Rooms { get; set; }
}

public class Room
{
    public string Name { get; set; }
    public int Area { get; set; }
    public int Height { get; set; }
    public int NumberOfElectricalSockets { get; set; }
}

When using a document database, to get the best performance I’d denormalise the rooms, so all of a house’s rooms live in one document (HouseRooms). This fits my usage patterns perfectly — I either want a house on its own, or a house with all of its rooms. 

The alternative is a map-reduce, which I believe is not good for performance or complexity. 

This is a fairly small difference, but applied to every entity in the domain it could be very significant.
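
To make that usage pattern concrete, the two reads could look roughly like this against a RavenDB-style session. The exact calls (particularly loading a document by a value-type id, and the Include) are assumptions about the client API of that era, not gospel:

using Raven.Client;

// Sketch only: either a house on its own, or a house with all of its rooms,
// each as cheap key lookups rather than joins or map-reduce.
public static class HouseQueries
{
    // A house on its own: one document, no rooms loaded.
    public static House GetHouse(IDocumentSession session, string houseId)
    {
        return session.Load<House>(houseId);
    }

    // A house with all of its rooms: Include asks the server to send the
    // referenced HouseRooms document back in the same round trip.
    public static HouseRooms GetRoomsForHouse(IDocumentSession session, string houseId)
    {
        var house = session.Include<House>(x => x.RoomsId).Load(houseId);
        return session.Load<HouseRooms>(house.RoomsId); // already in the session, no extra trip
    }
}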

More Opinions

I found the following links useful when I was learning about document databases and how to model for them.

If we accept the assumptions so far, then our database abstraction will cost us performance, and potentially quite a lot. But so far, the customers are happy, the business is happy and our application works well. We are selling lots of automated house designs.

Now the business is spreading across the land into all sorts of new territories. Customers are going insane after the last bout of television ads for the product. But the application is taking a pretty heavy load. 

Poor Mr Database, he is being battered. If it weren’t for all the select N+1s in the code and the inefficient queries from the application he could easily handle this load. As it stands, though, he is the bottleneck in the system and the application has slowed down to 30-second response times. 
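
If you haven’t been bitten by a select N+1 before, it usually hides behind innocent-looking code like this (the repository and report class here are hypothetical):

using System.Collections.Generic;

// A hypothetical repository abstraction hiding the database.
public interface IHouseRepository
{
    IEnumerable<House> GetUnpurchased();
}

public class RoomReport
{
    // Looks harmless, but issues one query for the houses plus one
    // lazy-load query per house the moment Rooms is touched.
    public int CountUnpurchasedRooms(IHouseRepository houses)
    {
        var total = 0;
        foreach (var house in houses.GetUnpurchased())   // 1 query
        {
            total += house.Rooms.Count;                  // + 1 query per house
        }
        return total;
    }
}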

As a side-note, it was interesting to hear that Twitter had a similar problem where their database abstractions did not scale: http://www.infoq.com/presentations/Abstractions-at-Scale (go to around 27–33 minutes).

Hardware is Cheaper Than Dev Time — Throw Money at it

An obvious answer to the scaling problem is to just beef up the server. And indeed, that probably works up to a point. That point will either be when a bigger server is more expensive than multiple small servers, or when there are no bigger servers. What then? 

Now you add more servers. Now you need to start replicating and maybe sharding your data. Distributed transactions, retry logic, failover….. 

I would personally like to put that off as long as possible. I’m sure the systems team and the DBA team would, too. 

We may even come to rue the associations between our domain entities that heavily impact how the data can be split up amongst servers — if at all possible.

Are Performance Requirements and Load Testing the Answer?

One idea floating around is that the business provides performance requirements and a load testing environment that is able to accurately simulate that load to verify it. Personally I have never seen this and don’t know if it is possible. I would love it though. 

But even if it were possible, and the load test — which somehow replicated live load and live usage patterns — gave a damning report of your performance, now you have to optimise. How do you do that? 

Your domain model needs to change now for performance. 

We’ve seen how to get the best out of your database you need to model your domain entities in a way that fits with your database. So we could re-model the entities? Those entities which the entire domain model is built around? 

Then you need to change the queries — those queries that hide the database technology behind your database abstraction. On a lucky day we could just optimise the logic inside the abstraction, but if the whole algorithm the query is part of has to change, it might be too difficult, depending on how many abstractions you have built on top.

Now for the caveat

I have never been in that exact position, so I’m talking semi-shit. But I’ve seen code susceptible to those problems. The data access was so far abstracted that small changes hurt and performance improvements would have been difficult. 

We didn’t hit those heights of performance drama, but it felt like we were well on the way. That code base is now being completely over-hauled as it happens.How Much Should we Abstract?

If we have a simple system along the lines of CRUD then there’s no domain model to abstract. I showed in my last post that I feel abstractions in those cases are unnecessary. 

For a domain model as complex as the one Uncle Bob has here, I would probably pass around the interface I use to communicate with my database (e.g. a RavenDB or NHibernate session), or maybe something lightweight over a micro-ORM if I was using one. 

I want it to be clear when I’m communicating with the database, or shooting myself in the foot with a select N+1. When I need to optimise I can see exactly where the problems are. I don’t believe this pollutes the domain too much at all. And we can certainly refactor to methods, before interfaces, to reveal intention. 
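
As a sketch of what I mean, here is a handler that takes the NHibernate ISession directly (the handler and the business rule are invented; the query uses the NHibernate.Linq extension on ISession):

using System.Linq;
using NHibernate;
using NHibernate.Linq;

// Sketch only: the database call sits in plain sight, so anyone reading
// this can see exactly one query and exactly what it fetches.
public class MarkGardenHousesForPromotion
{
    private readonly ISession session;

    public MarkGardenHousesForPromotion(ISession session)
    {
        this.session = session;
    }

    public void Execute()
    {
        var houses = session.Query<House>()
                            .Where(h => h.HasGarden && !h.HasBeenPurchased)
                            .ToList(); // one visible query

        foreach (var house in houses)
        {
            // business rule here; touching house.Rooms would be a visible,
            // deliberate decision to go back to the database
        }
    }
}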

If the database changes, then yes, the code for the business rules will change a bit too. It’s not ideal, but it’s a trade-off. If future performance is any kind of concern you might like it. 

Even Eric Evans in DDD concedes that the database will have an effect on the domain model — although that effect should be kept limited. 

I certainly do not like seeing SQL data readers being passed around. Nor does my friend Ronald…

If Things are Working for You…

If your abstractions are working for you then you probably do not need to worry. You might be abstracting your database into oblivion, yet your product is performing well and making money for the business.

You will possibly not be running at optimal performance, and arguably it would have cost little to get there (by just not abstracting the database), but that’s no problem right now.

Sometimes it might be the opposite case, and what I’ve suggested in this post might be close to correct. But there are a lot of ifs and buts in this post, and probably some half-baked theory. So just remember… 

I talk rubbish and I’m wrong 9 times out of 10