Archive for the 'open source' Category

Django External Schema Evolution Branch

Friday, October 19th, 2007

Just an update on my former SoC2006 work…

We now no longer require a patch to Django. One import statement in settings.py allows our program to fake it via the very crafty Python language.

The new website is here: http://code.google.com/p/deseb/
The discussion list us here: http://groups.google.com/group/deseb-discuss

Plus there is an introductory video available here: http://kered.org/deseb_demo.mpeg

Schema Evolution Confusion / Example Case

Friday, August 3rd, 2007

A concern of my schema evolution solution is as follows:

The ‘aka’ approach has some serious flaws. It is ambiguous for all but trivial use cases. It also doesn’t capture the idea that database changes occur in bulk, in sequence. For example, On Monday, I add two fields, remove 1 field, rename a table. That creates v2 of the database. On Tuesday, I bring back the deleted field, and remove one of the added fields, creating v3 of the database. In each stage of the migration, the DB is a stable state; this approach doesn’t track which state a given database is in, and doesn’t apply changes in blocks appropriate to versioned changes.

The fallacy in this is twofold:

  1. that an automated introspection/evolution must generate and apply schema changes in the same logical order that a DBA would
  2. that keeping intermediate state metadata is always necessary (obviously required from #1)

I argue that the exact path from v1 => v3 is irrelevant, as long as it is functionally equivalent to the DBA generated one, and minimizes information loss. To demonstrate this, I’ve coded the above example into three different models.py files:

v1
from django.db import models

class Russ(models.Model):
    "this model is going to have a bit of a day (v1)"
    a = models.CharField(maxlength=200)
v2
from django.db import models

class WasRuss(models.Model):
    "this model is going to have a bit of a day (v2)"
    b = models.CharField(maxlength=200)
    c = models.CharField(maxlength=200)

    class Meta:
        aka = ('Russ')
v3
from django.db import models

class WasRuss(models.Model):
    "this model is going to have a bit of a day (v3)"
    a = models.CharField(maxlength=200)
    b = models.CharField(maxlength=200)

    class Meta:
        aka = ('Russ')

Now let’s assume we have three users: Alice, Bob and Charles. Alice is the developer and Bob and Charles are sys-admins, deploying her application.

On day one, Alice writes her new model (v1) and calls syncdb to create it as you normally would. She then adds data to the table for testing. But on day two, she decides that her original implementation is inadequate and makes her modifications (v2). But instead of writing and storing her own migration scripts or just tossing all her data, she runs sqlevolve, which gives her the following:

v1 => v2
ALTER TABLE `case06_russ_russ` RENAME TO `case06_russ_wasruss`;
ALTER TABLE `case06_russ_wasruss` ADD COLUMN `b` varchar(200) NOT NULL;
ALTER TABLE `case06_russ_wasruss` ADD COLUMN `c` varchar(200) NOT NULL;
-- warning: the following may cause data loss
ALTER TABLE `case06_russ_wasruss` DROP COLUMN `a`;
-- end warning

Now day three rolls around, and she’s changed her model again (v3). Again she run’s sqlevolve to get the following:

v2 => v3
ALTER TABLE `case06_russ_wasruss` ADD COLUMN `a` varchar(200) NOT NULL;
-- warning: the following may cause data loss
ALTER TABLE `case06_russ_wasruss` DROP COLUMN `c`;
-- end warning

Which gets her exactly to where she needs to be: a schema identical to what a fresh sqlall would give, without destroying all her data. (she did lose everything in column a, however this is acceptable because an identical loss would come from the versioned scripts she would have written by hand)

Now Bob is a bleeding-edge kind of guy. He likes to stay on top of Alice’s work daily. So, assuming she’s a timely svn commiter, each day he runs the following four commands:

$ /etc/init.d/apache stop
$ svn update
$ ./manage sqlevolve | mysql -u root -p my_db
$ /etc/init.d/apache start

This deploys to his database in two days the exact same two scripts she ran, including the same information loss in column a.

Now Charles is more of a conservative deployer – he only deploys when Alice gives them notice, which happened at the end of days one and three. On day one, his syncdb created the database to v1’s specifications. However on day three, when he runs the same commands Bob ran, the following is deployed to his database:

v1 => v3
ALTER TABLE `case06_russ_russ` RENAME TO `case06_russ_wasruss`;
ALTER TABLE `case06_russ_wasruss` ADD COLUMN `b` varchar(200) NOT NULL;

As you can see, it is a different script than either Alice or Bob ran, however it gets him to a functionally equivalent schema, and it gets him there with less data loss. (he gets to keep his column a information)

Now this can be argued as either a wonderful or horrible thing. Should Charles be forced to dump his column a data? In some really huge, highly critical, heavily deployed production environments, maybe. But I have managed such before, and I think those cases are few and far between. Much more likely the user is going to want to keep their data. But if they do, a simple procedural change is all that’s necessary. Alice needs only to dump her generated evolution SQL into versioned migration scripts, ala Mike Heald’s dbmigration tool.

So to wrap up, I hope I’ve demonstrated that the idea of “database changes must occur in bulk, in sequence” is flawed, and that what is key is schema equivalence , not making sure you can recreate the exact same set of scripts at runtime for all users using all versions. But that if you do need to make sure identical scripts are run by all users, this can be easily done still using the evolution functionality through minor procedural changes in development and deployment.

I should also note that all the scripts used in this article were generated with the code already checked into the schema-evolution branch. I encourage you to try it out for yourself. (and send me bug reports if you find them!)

Thanks,
Derek

Django Schema Evolution

Thursday, July 19th, 2007

I’ve ported my schema evolution work from my SoC project last summer to Django v0.96.   To use it, download the patch below, and run the following:

$ cd /<path_to_python_dir>/site-packages/django/
$ patch -p1 < ~/<download_dir>/django_schema_evolution-v096patch.txt

It should output the following:

patching file core/management.py
patching file db/backends/mysql/base.py
patching file db/backends/mysql/introspection.py
patching file db/backends/postgresql/base.py
patching file db/backends/postgresql/introspection.py
patching file db/backends/sqlite3/base.py
patching file db/backends/sqlite3/introspection.py
patching file db/models/fields/__init__.py
patching file db/models/options.py

To use it:

$ cd /<path_to_project_dir>/
$ ./manage.py sqlevolve <app_name>

It should output something like this:

BEGIN;
ALTER TABLE `main_query` CHANGE COLUMN `accuracy` `accuracynew` numeric(10, 6) NULL;
ALTER TABLE `main_query` ADD COLUMN `price` varchar(256) NULL;
COMMIT;

Assuming you have a model such as this:

class Query(models.Model):
    query = models.CharField(maxlength=256, blank=False)
    accuracynew = models.FloatField(max_digits=10, decimal_places=6, null=True, blank=True, aka='accuracy')
    price = models.CharField(maxlength=256, null=True, blank=True) # new column

Note the aka field where I changed the name of “accuracy” to “accuracynew”.

Source code:

Documentation:

Let me know if you find any bugs.

UPDATE: This project is now on Google Code: http://code.google.com/p/deseb/

A great quote…

Saturday, November 18th, 2006

From a guy on /., of all places.

It’s not about communist hippies in Birkenstocks saying everything should be free.   Every web application I build needs a database, an application server and a web server.   On top of that it needs libraries that do a lot of things common to other projects.   It makes no sense for us to take on the burden of developing these from scratch.   It is in our best interest to collaborate with others to build stable, portable solutions that can be deployed en mass without exorbitant licensing fees.

d3ik

Summer of Code

Wednesday, May 24th, 2006

Google’s Summer of Code project has accepted my application. For those of you not familiar with this, basically Google funds you for a summer to work on an open source project of your choosing. It’s a fairly big w00t in the programming and academic worlds. I can’t tell you all how exciting this is. :)

The skinny: I will be implementing schema evolution for the Django project. Watch this category for further updates.

The meat of my proposal:

I would like to implement the introspection and migration schema evolution suggestions detailed here: [http://code.djangoproject.com/wiki/SchemaEvolution]

I am in a unique position for this proposal because I have implemented much of this functionality before. I am the PM/tech lead for a large (~400k loc) java based web application. We have a 200+ table schema, and last December I wrote a schema manager that implements much of what is described in your schema evolution proposal. (this is no coincidence – Django’s need was my catalyst/inspiration) Obviously I would not be able to use any of the code (as it was written for an Air Force contract), but the techniques would be transferable.

The schema manager I wrote (in Java, manipulating an Oracle DB, and likely on the outer difficulty edge of any similar implementation) was developed independently from your wiki’s proposal. (it was inspired by the one-line ‘this would be hard but nice to have’ comment in the 0.9.0 documentation, IIRC) However it implements everything listed in your ‘Automatically applied migration code’ suggestion. Additionally, it supports the following:

  • automatic downgrades are supported (in addition to upgrades)
  • the application can automatically identify a schema without a ’schema version’ table
  • the application can verify the accuracy of a schema
  • multiple schema identification algorithms are simultaneously supported (for instance: primary keys, foreign keys and constraints can be named in Oracle – should a constraint name change be considered a schema version change? you decide)
  • multiple different schemas with ‘equivalent’ functionality can be mapped back to a single logical schema version (‘equivalency’ obviously being determined by the developer)

I would like to note that this would not be simply a porting project. For obvious legal reasons I would not have access to my previous code base. It would be in a different language. (I am familiar with Python, but not yet competent) It would be designed against a different database. And it would be for an object framework very different from my previous application experience. Nonetheless, many of the ‘hard’ problems associated with this type of functionality I have already solved, proven and deployed.

As far as the introspection functionality, this was prototyped, but never deployed for the following reasons:

  • my application does not have a central, unified object definition mechanism from which to base such a function
  • converting our application to such would have been prohibitive
  • other development requirements took priority

However my exploratory code towards this was promising. (you all would be proud of my tuple-based model implementation in Java – made quite the use of the new 1.5 features and had full compile-time checking of all parameters)

BTW, I briefly considered a full Java port of Django last winter. (I’ve owned django4j.com for a while now) I still think it would be a good long-term idea, but a bit much for a summer project.

As far as any legal concerns you might have, I’ve contributed to open source projects before with both my company’s and the government’s knowledge and approval. (including my own project, buglist) I know the landscape well. Plus my contract will be ending soon. (on 6/30)

Anyway, I look forward to working with you all. :)
– Derek Anderson

P.S. Yes, getting *everyone* to save their migration scripts in a VCS should be an absolute, unwavering requirement. Good god it scares me how few people do this.

The Importance of Open Source

Wednesday, April 6th, 2005

If you’re not watching what’s going on between BitKeeper and OSDL, here’s a quick recap:

BitKeeper provides a “Free Linux License” for their version control system which Linus and other major kernel developers have adopted for Linux development. Linus’ employer OSDL employs many contractors working on many different projects. One of their contractors (on his own time – not on a contracted project) decided to create a truly open source version of the BitKeeper client. BitKeeper didn’t like this (for obvious reasons), so they threatened OSDL/Linus/everyone that “unless you convince OSDL’s contractor to stop his efforts on his own project, we’ll stop creating a Linux version of the BitKeeper client”. And now they’ve followed through.

This will obviously create a significant hardship for any OSS project using BitKeeper’s technology. (the Linux Kernel being the most prominent)

Now I feel that BitKeeper has the 100% right to control how they give/sell/whatever their own closed source products. We’ve got no course telling them what their business process should be. But this highlights EXACTLY why proprietary software is a significant business risk, even if it is completely free. If another company can rip out the rug from underneath you at any time for any reason, and you don’t have access and control of the products you depend on… You’re asking for trouble. And it Linus is powerless to stop it – there is little chance any of us have.

This is why software freedom matters. This is why I do everything I can to avoid non-OSS software. And this is *exactly* why I encourage others to do the same.


<Kered.org>   © Copyright 2000-2005 by Derek Anderson
Get Firefox