Sunday, January 18, 2009

Views across distributed CouchDB instances?

I just got into an interesting conversation over on Chris Schandler's blog. I ended up writing a long comment about a topic only loosely related to his original post, so I'm making a post out of it here.

I wondered if there is such a thing in CouchDB as a view that spans multiple databases. Paul J. Davis confirmed that there is not, and suggested creating a "meta db" to serve that kind of need.

It sounds like in the "meta db" case, what you're effectively doing is rolling your own instance of the sort of meta-view I was asking about. To CouchDB it would simply be another database, and it would be up to other parts of your application to keep it synchronized with the other databases it is aggregating from.

It sounds like there is no compelling reason to do this sort of sharding for an app that fits on one server (or one disk, let's say), unless one is convinced that map-reduce views across databases are not and will not be wanted or needed.

When/if you need to split databases to meet scaling demands, then you'll have databases on separate compute nodes and so a CouchDB instance level view that aggregates them wouldn't be too useful anyway. The application would need some other way of combining results.

When running a map-reduce across multiple databases, the 'map' part wouldn't need to change at all. The 'reduce' is where it gets interesting. I could see wanting to reduce each database (and store as is done currently), and then re-reduce between separate databases to combine the output into one result.
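To make the reduce / re-reduce idea concrete, here is a toy sketch in plain Python. It is not CouchDB's API: each "database" is just a list of dicts, and the names map_doc, reduce_values, reduce_database and rereduce are made up for illustration. It also assumes a reduce (a sum) whose output can be fed back into itself.

```python
# Toy model: a "database" is a list of documents (dicts).
# None of these names come from CouchDB; they only mimic the shape of a view.

def map_doc(doc):
    """Emit (key, value) pairs for one document; identical for every database."""
    for tag in doc.get("tags", []):
        yield tag, 1

def reduce_values(values):
    """Combine the values for one key. A sum also works as a re-reduce,
    because a sum of partial sums equals the overall sum."""
    return sum(values)

def reduce_database(docs):
    """Map and reduce inside a single database."""
    grouped = {}
    for doc in docs:
        for key, value in map_doc(doc):
            grouped.setdefault(key, []).append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}

def rereduce(per_db_results):
    """Combine the already-reduced output of several databases."""
    grouped = {}
    for result in per_db_results:
        for key, partial in result.items():
            grouped.setdefault(key, []).append(partial)
    return {key: reduce_values(partials) for key, partials in grouped.items()}

db_a = [{"tags": ["couchdb", "erlang"]}, {"tags": ["couchdb"]}]
db_b = [{"tags": ["erlang"]}, {"tags": ["python", "couchdb"]}]

# Each database is reduced on its own; only the small results are combined.
combined = rereduce([reduce_database(db_a), reduce_database(db_b)])
print(combined)
```

The per-database results could be stored as they are today; only the final rereduce step is new.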

Assuming we're in a distributed system with databases on multiple nodes, the question is where does the reduction happen? The vanilla MapReduce model is to have an arbitrarily large set of reducers, have all key-value pairs with the same key sent to a common reducer, and then run the reduce and collect the results from the reducers.
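The "same key goes to a common reducer" step can be sketched like this. It is a single-process toy, assuming three reducers and a CRC32 hash to pick one; NUM_REDUCERS, reducer_for, shuffle and run_reducers are hypothetical names, and a real system would ship the pairs over the network.

```python
import zlib

NUM_REDUCERS = 3  # hypothetical; the model allows arbitrarily many

def reducer_for(key, num_reducers=NUM_REDUCERS):
    """Pick a reducer deterministically so that equal keys always meet."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers=NUM_REDUCERS):
    """Route each (key, value) pair to the inbox of its reducer."""
    inboxes = [dict() for _ in range(num_reducers)]
    for key, value in pairs:
        inbox = inboxes[reducer_for(key, num_reducers)]
        inbox.setdefault(key, []).append(value)
    return inboxes

def run_reducers(inboxes, reduce_fn):
    """Run the reduce inside each reducer, then collect the results."""
    results = {}
    for inbox in inboxes:
        for key, values in inbox.items():
            results[key] = reduce_fn(values)
    return results

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
inboxes = shuffle(pairs)
totals = run_reducers(inboxes, sum)
print(totals)
```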

I could see a place for some companion piece of software to do this sort of distributed view re-reduce for CouchDB setups. It could take the form of another CouchDB instance (or likely more than one) running a particular application geared toward such a re-reduction. (And taking it a step farther, multiple MapReduce passes would be more of the same).

What's interesting to me now is that this may not be too hard to build out of multiple CouchDB instances, without changing CouchDB itself at all. The main reason for piling on "more CouchDBs" would be to persist the results at each step to disk, so that if the same view is asked for twice, it is still only computed once.

What do people think about this?

Tuesday, January 13, 2009

Voldemort, a distributed key-value store based on Dynamo

Came across an interesting post on Voldemort over at High Scalability:
http://highscalability.com/product-project-voldemort-distributed-database

Having just read up on non-relational databases, including CouchDB and Dynamo, a couple of obvious differences jump out:

o Voldemort doesn't seem to mention anything akin to CouchDB's incrementally-updated views. Presumably you could roll your own, where the objects that make up the view are simply more objects with their own identifying keys.

o CouchDB doesn't have the features that actually manage a distributed system, such as allocating keys to servers and handling failure scenarios. If Voldemort is similar to the system described in the Dynamo paper, then it presumably does have them. Presumably you could roll your own for CouchDB, where the CouchDB instances act as the storage mechanism in a similar way to how BerkeleyDB or MySQL can act as Dynamo's storage mechanism.
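As a rough idea of what "allocating keys to servers" could look like if you rolled your own, here is a minimal hash ring in Python, loosely in the spirit of the consistent hashing the Dynamo paper describes. It is only a sketch: the node names are made up, there are no virtual nodes or replication, and HashRing is not part of CouchDB, Voldemort or Dynomite.

```python
import bisect
import hashlib

def _position(name):
    """Map a string to a point on the ring (a large integer)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """Toy ring: a key belongs to the first node clockwise from its position."""

    def __init__(self, nodes):
        self._ring = sorted((_position(node), node) for node in nodes)
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # Wrap around to the first node when the key lands past the last one.
        index = bisect.bisect_right(self._points, _position(key)) % len(self._ring)
        return self._ring[index][1]

    def without(self, node):
        """Return a new ring that models the loss of one node."""
        return HashRing([n for _, n in self._ring if n != node])

# Hypothetical CouchDB instances acting as storage nodes.
ring = HashRing(["couch-1", "couch-2", "couch-3"])
smaller = ring.without("couch-2")

keys = ["doc-%d" % i for i in range(100)]
moved = [k for k in keys if ring.node_for(k) != smaller.node_for(k)]
print(len(moved), "of", len(keys), "keys changed owner")
```

The point of the ring layout is that losing a node only reassigns the keys that node owned; everything else stays where it was.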

Since Voldemort and Dynomite apparently both aim to be open source implementations of Dynamo, that naturally leads to a wish to compare the two. Looks like Voldemort is written in Java and hosted in SVN on Google Code, while Dynomite is written in Erlang and hosted on GitHub. Sounds like Dynomite gets all the cool points :)

On the other hand, Voldemort is already in production use at LinkedIn and has some nice illustrated documentation, while I haven't seen anything indicating how mature or widely used (or not) Dynomite is. (Update: Dynomite is in production use at Powerset! So it gets production points and still more cool points...)

I'm also curious about the idea of combining Dynomite (or something like it) with CouchDB, since they appear to target slightly different levels of the stack. Dynamo and its kin don't actually take on the job of storage, after all. Is there something to be gained by combining its highly available architecture with CouchDB's highly available node, and incorporating CouchDB's incremental mapreduce views?

Sunday, June 15, 2008

Ruby on Rails vs (Python on) Django: A Comparison (Part 5)

Rails and Django both include tools aimed at easing the task of designing and specifying the models an application will use, and keeping these in sync with the schema of the database the project is pointing at. They both aim to do this in their native languages (Ruby and Python, respectively), allowing a developer to easily switch out one database for another (say, SQLite for PostgreSQL). However, the two frameworks take different approaches to solving this issue.

The philosophy behind Django's approach is the more straightforward of the two: for each type of entity to be modeled and represented in the database, a programmer defines a class inheriting from Django's django.db.models.Model class. A Django Model is "the single, definitive source of data about your data. It contains the essential fields and behaviors of the data you're storing... The goal is to define your data model in one place and automatically derive things from it." (The quote is from the Django tutorial on the project's website.) A Model defines a list of fields and their associated types, as well as inner classes with extra metadata such as whether to include a model in Django's admin application (covered below), how to represent the model as a string, what type of user interface form is preferred for viewing or editing the model, etc. To name one example of the DRY principle at work here, this class is the place to specify the maximum length of a character field (string) in a model. This max_length value is then mapped to both the field size in the database schema and the validation logic for user-submitted forms bound to that model.
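To illustrate the "define it once, derive the rest" idea without requiring Django, here is a toy sketch in plain Python. The CharField and Person classes below are stand-ins written for this example, not Django's real classes (in Django itself the field would be declared on a django.db.models.Model subclass). The point is only that a single max_length value feeds both the schema and the form validation.

```python
class CharField:
    """Toy stand-in for a character field; NOT Django's real class."""

    def __init__(self, max_length):
        self.max_length = max_length

    def column_sql(self, name):
        # The schema is derived from max_length...
        return "%s VARCHAR(%d)" % (name, self.max_length)

    def validate(self, value):
        # ...and so is the validation rule.
        return isinstance(value, str) and len(value) <= self.max_length

class Person:
    """Hypothetical model: the only place the field sizes are written down."""
    fields = {
        "first_name": CharField(max_length=30),
        "last_name": CharField(max_length=30),
    }

def create_table_sql(model):
    columns = ", ".join(f.column_sql(n) for n, f in model.fields.items())
    return "CREATE TABLE %s (%s)" % (model.__name__.lower(), columns)

def validate_form(model, data):
    """Return the names of fields whose submitted values fail validation."""
    return [n for n, f in model.fields.items() if not f.validate(data.get(n, ""))]

print(create_table_sql(Person))
print(validate_form(Person, {"first_name": "Ada", "last_name": "x" * 31}))
```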

Once a model is defined in Django, you can run the command "manage.py syncdb" to generate and execute a set of SQL commands tailored to the particular database specified in your settings file. This will add primary key fields automatically, handle foreign key fields specified in your models and make them explicit where supported, and add some other necessities (e.g. indexes, join tables for many-to-many relationships, or constraints) that may be defined in your model code. It will create tables for any new models defined since it was last run.

As we in software development all know, requirements change. It is extremely likely that during the course of a project we'll want to add or remove some tables, some columns within a table, or some relationships between tables. This complicates things a bit. Changing a database schema always involves an extra step since it is derived from, but not synonymous with, our application's models. Deleting a table or column probably means deleting data. We may decide that the last schema change we made wasn't a good idea after all, and we want to return to a schema design we had earlier. Unfortunately, Django's syncdb command is a bit crude - it's not always possible to inspect a database schema to the degree that would be required to keep the database perfectly in sync. If you want to add a new column to a table, for example, you can use some Django commands to help you determine what ALTER TABLE command you'll need, but the tool won't do it for you.

The Rails approach to this problem of mapping model classes to database schemas was designed with this set of issues in mind. If, after using Rails' script to auto-generate a new model with a couple of string fields, you decide to have a look at your new model class, you may be surprised to find that it is completely empty. There are plenty of reasons to have the class there, like specifying relationships and implementing business logic, but the fields have been recorded elsewhere. As it turns out, they're in the project's 'db/migrate' directory, which keeps a chronological history of all the changes that have been made to the database schema. (They're also in 'db/schema.rb', which holds the end product of applying those changes in order, and should reflect the actual, current state of the database you're working with.) By tracking model changes this way Rails is able, through a set of convenient scripts, to go back and forth over the project's history of database changes, undoing or redoing them to revisit schemas of weeks or months past.
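The idea of walking back and forth over a history of schema changes can be sketched in a few lines. This is written in Python to keep the examples in one language, and it is a toy, not the Rails migration API: AddColumn and Migrator are hypothetical names, and the "schema" is just a dict of table name to column list.

```python
class AddColumn:
    """One reversible schema change (hypothetical, not the Rails API)."""

    def __init__(self, table, column):
        self.table = table
        self.column = column

    def up(self, schema):
        schema.setdefault(self.table, []).append(self.column)

    def down(self, schema):
        schema[self.table].remove(self.column)

class Migrator:
    """Applies or undoes changes in order to reach a target version."""

    def __init__(self, migrations):
        self.migrations = migrations
        self.schema = {}
        self.version = 0  # number of migrations currently applied

    def migrate(self, target):
        while self.version < target:       # redo newer changes
            self.migrations[self.version].up(self.schema)
            self.version += 1
        while self.version > target:       # undo the most recent changes first
            self.version -= 1
            self.migrations[self.version].down(self.schema)

history = [
    AddColumn("people", "first_name"),
    AddColumn("people", "last_name"),
    AddColumn("people", "email"),
]

migrator = Migrator(history)
migrator.migrate(3)
latest = list(migrator.schema["people"])
migrator.migrate(1)   # revisit the schema as it was after the first change
earlier = list(migrator.schema["people"])
print(latest, earlier)
```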

Rails appears to have made a bit of a tradeoff to provide this capability, however. Whereas Django has one canonical place where everything to do with models is spelled out, Rails divides this information between the app/models folder and the db/migrate folder (and the db/schema.rb file). Thus, while Django was able to specify a column size in a model and map that to both the database table definition and the model validation logic, it takes a bit more work to accomplish that same feat in Rails.

Friday, June 13, 2008

Ruby on Rails vs (Python on) Django: A Comparison (Part 4)

In both Rails and Django, a project has a small collection of settings files that define some constants like the database connection parameters the application will use, which modules external to the project will be imported, which modules internal to the project will be enabled, and so on.

In Django, these files are written in pure Python and live in a project's root directory. Rails uses both Ruby and YAML (hint, it's a recursive acronym and the M and L stand for "markup language") for config files, which live in a 'config' directory under the project root.

Many settings may be specified independently for different configurations. For example, in a development configuration, you'd connect to a local database containing meaningless data for the purpose of testing, and set a debug flag to "on" to get a detailed stack trace whenever an unhandled exception brings things to a halt. In a production configuration you'd point the database parameters to the production database with all the application's real, live data, and make sure the debug setting is "off" so that an error tells an end user something like, "500 Internal Server Error" rather than something which may be potentially helpful to an attacker, such as a stack trace. One nice thing about using a framework is the fact that others have thought of all these things, so you don't need to remember them all each time you start a new project!
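Here is a small sketch of the per-configuration idea in plain Python. The setting names (DEBUG, DATABASE_NAME), the database names and the helper functions are illustrative assumptions, not the actual layout of either framework's settings files.

```python
import traceback

# Defaults shared by every configuration (hypothetical names).
BASE = {"DEBUG": False, "DATABASE_NAME": None}

ENVIRONMENTS = {
    "development": {"DEBUG": True, "DATABASE_NAME": "myapp_dev"},
    "production": {"DATABASE_NAME": "myapp_prod"},
}

def load_settings(environment):
    """Start from the defaults and apply one configuration's overrides."""
    settings = dict(BASE)
    settings.update(ENVIRONMENTS[environment])
    return settings

def error_response(exc, settings):
    """Show a stack trace only when the debug flag is on."""
    if settings["DEBUG"]:
        return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return "500 Internal Server Error"

try:
    1 / 0
except ZeroDivisionError as exc:
    dev_page = error_response(exc, load_settings("development"))
    prod_page = error_response(exc, load_settings("production"))

print(prod_page)
```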

This post is rather thin; I realize that. I plan to fill in more detail later as I use the frameworks more and important differences are brought to my attention.

Next, we'll talk about specifying models and synchronizing them with a database schema.

Monday, June 9, 2008

Ruby on Rails vs (Python on) Django: A Comparison (Part 3)

Whether you're developing with Rails or Django, the way you start a new project is basically the same: run a script to generate a new project.

With either tool, you'll open a command line in the directory you want the project to live underneath (e.g. ~/projects) and run a script provided by the framework to auto-generate a bare-bones project in a new directory matching your chosen project name. For Rails this is simply the command "rails" followed by the project name; for Django it is "django-admin.py startproject" followed by the project name.

A major difference between Rails and Django, however, is in the organization of files and directories. The Rails approach could be summed up as "a place for everything and everything in its place". Each model, view, and controller class resides in its own file in the '/app/models', '/app/views', or '/app/controllers' directory, respectively. You, the programmer, really don't have to bother with any decisions about where to put new classes - they've been made for you. There are rules spelling out exactly how classes, and the files containing them, are named and capitalized, and even whether or not they are pluralized (a 'Person' class will map to a table named 'people' for instance - Rails includes a pluralization dictionary that a developer may extend). In choosing which template to render when generating a response, Rails first looks for one matching the first part of the name of the controller being run.
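The flavor of these naming rules can be shown with a toy sketch. It is written in Python for consistency with the other examples and is only an approximation: the tiny irregular-plurals dictionary and the function names are made up here, and Rails' real inflection rules cover many more cases.

```python
import re

# A tiny, extendable dictionary of irregular plurals (illustrative only).
IRREGULAR_PLURALS = {"person": "people", "child": "children"}

def underscore(class_name):
    """Convert CamelCase to snake_case, e.g. LineItem to line_item."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()

def pluralize(word):
    """Pluralize with a couple of simple rules plus the irregular dictionary."""
    if word in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[word]
    if word.endswith("y") and word[-2:-1] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"

def table_name(class_name):
    """Guess the table a model class maps to, by convention."""
    return pluralize(underscore(class_name))

def model_path(class_name):
    """Guess the file a model class lives in, by convention."""
    return "app/models/%s.rb" % underscore(class_name)

for name in ["Person", "Category", "LineItem"]:
    print(name, table_name(name), model_path(name))
```

Because every location is computed from the class name, nothing has to be spelled out in a configuration file.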

As long as a developer sticks with these conventions, he is free from the task of wiring all these pieces together by spelling out their locations in configuration files or include statements -- this is all handled behind the scenes by some clever metaprogramming in the Rails framework. Here, Rails is trading some performance for convenience to the application developer. The question of whether this design choice results in more readable code is a subject of debate. There are a lot of conventions. If the person reading the code knows and understands these conventions he'll know where to find just about anything in a project he's never seen before, and he'll have a pretty good idea of how most of it fits together.

If, on the other hand, someone who knows the Ruby language but not the Rails framework opens up a Rails project for the first time, things may not make much sense to him. He'll see a large directory tree of modules, many with just one or two functions or properties, and not much control logic tying it all together. If he understands the model-view-controller design pattern, though, he'll probably make good headway with the aid of Rails' excellent naming conventions. It helps here to think in terms of a number of small components that may be simply dropped into a system that will inspect and index them, find the components it needs when it needs them, and know how to use them.

Django's approach follows a philosophy that the application developer is basically building a standard Python application using standard Python programming practices, such as grouping functions and classes into modules and importing a module explicitly when it is needed. View functions always render a template explicitly (if returning a response and not, say, an error). In fact, this goes back to one of the 'core values' of Python: explicit is better than implicit. As a consequence of this, while it will not be as concise or as DRY as a similar Rails application, a typical Django application will probably be readily understandable by a programmer with a solid background in Python -- even if he has never seen Django before.

In keeping with this philosophy, running Django's startproject script creates a new directory with just four files (one of which is a completely empty __init__.py file), as opposed to the equivalent Rails command, which generates a directory tree with more than 40 files and folders. Classes in a Django project are grouped into files rather than one class per file, as in Rails, making a file called "models.py" the Django equivalent of a Rails "models/" folder. Personally, I prefer having one class per file. Then, in an editor like TextMate, I can go straight to a class by clicking its filename in the tree view. I once attempted to bring this Rails convention to Django but found that this resulted in typing a lot of repetitive code such as "from mysite.myapp.person import Person" (import the Person class from the person module), repeated for every model I wanted to import, in every file in which I wished to use it. The design of Python and Django just didn't seem to lend itself well to this kind of convention, so I decided it's probably better to go with the flow here.

Where files belong is more loosely defined in Django than in Rails. While a model in a Rails application will be found in the folder /app/models, Django allows for an arbitrary number of applications within a project. Therefore, a model in a Django application may be in /[app_name]/models.py where [app_name] may be one of any number of applications. Thus, Django also has a command (manage.py startapp) for generating a new Application (which I use here in capitals to distinguish the meaning as a Django package from the general meaning as in "web application"). In fact, Applications can be located outside a project directory, referenced as external libraries, and used by an arbitrary number of projects. Thus, Applications can be designed as self-contained modules which may be reused across projects, and there exists a many-to-many relationship between Applications and projects.

After generating a new project, both frameworks allow you to immediately run a development server and open it in a web browser. These lightweight servers show a running log of any warnings or errors raised, automatically look for updated time-stamps on any loaded modules, and reload modules that have been updated. This facilitates a quick change-test cycle, helping the developer get things done.

The next part of the development process is to configure a project settings file. We'll talk about that in Part 4.

Saturday, June 7, 2008

Ruby on Rails vs (Python on) Django: A Comparison (Part 2)

So, what is it like to create an application using these two tools?

In many respects, the process advocated by the project leaders and book authors is very similar for each. Both follow the Model-View-Controller (MVC) design pattern, though Django uses some different terms for it. In both frameworks, a Model is an object representing a record in a database. In Django a Template - a document (typically HTML) with placeholders for content that may change - performs exactly the same function as a View in Rails. Somewhat non-intuitively, a View in Django actually corresponds to a Controller in Rails. This is more a slight philosophical difference between the respective designers than a functional one.

By the reckoning of Django's designers, a view's job is to assemble an abstract view of something - a series of named attributes - and a template simply gives one possible rendering of that view. They avoided the term 'controller' because they delegate business logic, as much as possible, to the Models in their framework.

The Rails designers gave the name Controller to this same class in their design because the Controller determines which View (meaning 'template' on the Django side) will be rendered. Again, this division of roles works the same way in Django, so the difference is in the naming.

Both frameworks also use a central list of URL patterns to match against incoming requests. A request URL is tested against a list of regular expressions, and the first match determines which handler function will be called by the framework.
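A first-match-wins dispatcher of this kind can be sketched in a few lines of Python. The URL patterns and handler functions below are invented for illustration and do not use either framework's actual routing API.

```python
import re

# Hypothetical handler functions.
def list_posts():
    return "all posts"

def show_post(post_id):
    return "post %s" % post_id

def show_post_by_slug(slug):
    return "post titled %s" % slug

# Order matters: the first pattern that matches decides the handler.
URL_PATTERNS = [
    (re.compile(r"^posts/$"), list_posts),
    (re.compile(r"^posts/(?P<post_id>\d+)/$"), show_post),
    (re.compile(r"^posts/(?P<slug>[\w-]+)/$"), show_post_by_slug),
]

def dispatch(path):
    """Test the path against each pattern in turn and call the first match."""
    for pattern, handler in URL_PATTERNS:
        match = pattern.match(path)
        if match:
            # Named groups in the pattern become keyword arguments.
            return handler(**match.groupdict())
    return "404 Not Found"

print(dispatch("posts/42/"))
print(dispatch("posts/hello-world/"))
print(dispatch("about/"))
```

Note that "posts/42/" would also match the slug pattern; it reaches show_post only because that pattern is listed first.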

In both frameworks, developers follow a rough sequence of steps in starting a new project or adding a new feature to an existing project:

  • Run a script to generate a new project
  • Add some necessary configuration parameters to a settings file
  • Design your models and synchronize your database schema to the updated models
  • Auto-generate a web interface for viewing and entering data
  • Design the urls your application will use
  • Write your controllers or views
  • Write your views or templates
  • Write tests

Let's look at each in turn.


Go to Part 3

Friday, June 6, 2008

Ruby on Rails vs (Python on) Django: A Comparison (Part 1)

I did some reading on how to put a web application together on the cheap. There are many options available, and I decided to evaluate them. I started with what many consider the bread and butter of open source web development tools: the LAMP stack. LAMP stands for Linux, Apache, MySQL, PHP - an operating system, web server, database, and development language, respectively, widely used across the internet from the smallest sites to some of the largest and most successful.

Using the first two pieces of that stack - the operating system and server software - is largely a matter of downloading, installing and configuring them. These basic system administration tasks are a routine matter of reading and following documentation, until one faces the task of scaling an application to handle a high level of visitor traffic.

The real design work, then, lies in creating the database schema, writing the application layer, designing the graphic layout of documents served to the client, and writing client-side application logic (if any).

Writing a standard web application requires some understanding of quite an array of languages and technologies. Keeping our focus on this standard LAMP architecture, we must know SQL to create and access a relational database. We must know PHP to implement the core application logic and map objects in memory to rows in database tables. We must know the HTTP protocol and the details of request and response methods, headers and status codes in order to specify content types, control caching behavior, track visitor sessions, control access to content, support different locales, guard against attacks, and handle a host of other considerations. We must know HTML to create the structure of pages which may include complex interactive elements such as user-submitted forms, and to add semantic meaning to that structure. We must know CSS to control the look and feel of those structured semantic documents. We must know JavaScript (the de facto client-side language on the web, though Flash is also worth a mention) if we wish to add dynamic behaviors to otherwise static pages, and we must deal with idiosyncrasies in the way different web browsers handle the combination of HTML, CSS and JavaScript. For any other services we interface with, we must be able to work with machine-readable formats for structured data such as XML and JSON, likely using conventions specific to each service.

For these reasons, creating a web application can be a complex undertaking. We'll want a design broken into modules that deal with each of the challenges presented above, loosely coupled enough that we may easily swap them out for alternative tools from other sources, but integrated enough that they work without requiring whole extra layers of repetitive 'glue code' or configuration files. We'll want to separate responsibilities in a way that keeps each module clear and understandable for the designers and any maintainers or testers that come after. A PHP application can be written in a single document, with logic and presentation interwoven (the PHP language actually encourages this) but in a good design, we want to avoid that.

It is worth considering at this point that many of the above issues have already been solved by thousands of applications running on the web today. There are some very large, very active communities of developers and designers sharing ideas and code, and improving on one another's work.

As we attack this problem, we seek leverage. We'd like to draw from some of the best design ideas available, and concentrate as much as possible on the challenges that are unique to our particular project. Since there are so many alternatives available which we may immediately download and use, narrowing the field and evaluating a few top contenders becomes a useful skill.

To narrow the field, there are some non-technical considerations that, while not directly related to technical merit, make great practical heuristics. Top among these, arguably, is the presence of an active, passionate, and helpful community developing the tool. An active community will likely keep a tool up to date with new improvements, keep the documentation up to date and help new users, publish books and tutorials, subject a tool to a lot of real-world testing, bring new people into the fold, and keep this virtuous cycle alive. That implies that our investment in learning and building on our chosen piece of software will pay dividends well into the future.

With that in mind, I focused on two of the best-regarded web frameworks, based on a substantial amount of online reading, both with very active communities, both based on powerful, elegant scripting languages (which themselves have great communities), both adhering to a DRY ("Don't Repeat Yourself") principle of design, and both possessing great documentation and tutorials: Ruby on Rails (which uses the Ruby language, as the name implies), and Django (which uses Python). I read blog posts, followed tutorials, experimented with small projects, and even watched screencasts (both communities have a few members who publish video podcasts with narrated, screen captured walkthroughs covering particular features of their tool). I read two books covering Rails development and one covering Django.

So, what is it like to create an application using these two tools?

Go to Part 2