Promiscuous Django Models

One of the first steps in scaling your web application (after investing in caching and streamlining your database queries) is offloading processes that take an inordinate amount of time and would otherwise disrupt your users’ experience. Things like updating Facebook profile caches—something typically done when users authenticate—or delivering e-mail. These are what are referred to as blocking operations and you pretty much want to avoid having your users wait for them at all costs.

Nifty Fact
Here at matchFWD we utilize a package called Celery to perform our background operations; the reasons for picking Celery over other solutions are beyond the scope of this short blog post, but rest assured it’s pretty awesome.

One of the interesting problems when using background tasks is how you pass data to them. Most solutions in Python use a serialization format native to Python called pickling, provided by the pickle or cPickle modules. Django models, by default, do some pretty unfortunate things when you try to pickle them.

Before I go into details let’s set up a simple test model:

from pickle import dumps
from django.db import models
 
class User(models.Model):
    name = models.CharField(max_length=200)
    email = models.EmailField()
 
# Create and save the record we'll be testing with.
meep = User.objects.create(name="Bob Dole", email="bdole@whitehouse.gov")

Now that we have a model and sample record, let’s see what happens when we pickle it using Django’s default hooks:

print dumps(meep)
# cdjango.db.models.base\nmodel_unpickle\np0\n(csrc.testing.models\nUser\np1\n(lp2\ncdjango.db.models.base\nsimple_class_factory\np3\ntp4\nRp5\n(dp6\nS'email'\np7\nS'bdole@whitehouse.gov'\np8\nsS'_state'\np9\nccopy_reg\n_reconstructor\np10\n(cdjango.db.models.base\nModelState\np11\nc__builtin__\nobject\np12\nNtp13\nRp14\n(dp15\nS'adding'\np16\nI00\nsS'db'\np17\nS'default'\np18\nsbsS'id'\np19\nI1\nsS'name'\np20\nS'Bob Dole'\np21\nsb.

That’s 389 bytes, including a complete copy of the record’s data! Surely the default hooks provided by Python for pickling objects would do a better job, so let’s try that next:

# We replace Django's reduce method with the default.
User.__reduce__ = object.__reduce__
 
print dumps(meep)
# ccopy_reg\n_reconstructor\np0\n(csrc.testing.models\nUser\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\n(dp5\nS'email'\np6\nS'bdole@whitehouse.gov'\np7\nsS'_state'\np8\ng0\n(cdjango.db.models.base\nModelState\np9\ng2\nNtp10\nRp11\n(dp12\nS'adding'\np13\nI00\nsS'db'\np14\nS'default'\np15\nsbsS'id'\np16\nI1\nsS'name'\np17\nS'Bob Dole'\np18\nsb.

At only 300 bytes that’s a little better, but let me explain what’s really going on here. Notice what looks like a package path at the beginning of the (very long) dumps output. This path references the function used to reconstitute the object when it is de-pickled. Sensibly, since models are fairly fancy objects, Django uses one of its own functions to do this. What do these functions actually do?

Django’s de-pickle function does things like check for lazily evaluated values (values originally excluded from the object) and a bunch of other important things related to passing around copies of real data and re-integrating it into an instance of your model. Why aren’t Django signals sent when de-pickling? Because when you de-pickle something only the class’ __new__ is called, not __init__, which is good.

Python’s default pickling mechanism basically just copies the instance’s __dict__ and a reference to the class that spawned the instance. Simple, but effective. Both of these are the Wrong Solution™ when dealing with data loaded from a database. The biggest reason why this is bad is simple and illustrated by the following scenario:

  1. Bob Dole gets elected president and signs up to your service with the e-mail address president@whitehouse.gov.
  2. At some point poor Mr. Dole isn’t president any more.
  3. Your application decides to send him an e-mail. It queues up your spammy_spam function for background execution, passing along Bob’s User instance.
  4. Bob Dole changes his e-mail address on your service to bdole1969@hotmail.com.
  5. The spammy_spam function is eventually executed, Bob’s User object de-pickled, and you send some delicious food-like products by e-mail… to the wrong person.
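
The scenario above can be reproduced without Django at all. The following is a minimal sketch, assuming a plain dict (FAKE_DB, a made-up name) in place of the database and an ordinary class (PlainUser, also made up) in place of the model; it only illustrates how default pickling freezes whatever values the instance held when it was queued.

```python
import pickle

# Hypothetical stand-in for a database table: primary key -> row.
FAKE_DB = {1: {"name": "Bob Dole", "email": "president@whitehouse.gov"}}


class PlainUser(object):
    """Not a Django model; just an object carrying the same fields."""

    def __init__(self, pk):
        row = FAKE_DB[pk]
        self.pk = pk
        self.name = row["name"]
        self.email = row["email"]


# Step 3: queueing the task pickles the instance as it is right now.
queued = pickle.dumps(PlainUser(1))

# Step 4: Bob changes his address before the task runs.
FAKE_DB[1]["email"] = "bdole1969@hotmail.com"

# Step 5: the worker de-pickles its argument and gets the old address back.
stale = pickle.loads(queued)
print(stale.email)          # president@whitehouse.gov
print(FAKE_DB[1]["email"])  # bdole1969@hotmail.com
```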

That’s a problem! It gets worse when you realize that it’s a problem for any not-lazily-loaded database column stored this way. A scheduled task to remind someone of a past-due balance? Pickle their record and they’ll get e-mailed even if they paid after the task was queued but before it ran. So how do we fix this and make de-pickling actually load the record out of the database for us, all fresh and accurate?

The first half of the problem was solved above by replacing the __reduce__ method on our model with Python’s default one. If we don’t do this then the following addition to our model will never be executed:

class User(models.Model):
    name = models.CharField(max_length=200)
    email = models.EmailField()
 
    __reduce__ = object.__reduce__  # from above
 
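    def __getstate__(self):
        # Pickle nothing but the primary key.
        return self.pk
 
    def __setstate__(self, pk):
        # Reload the record and adopt the fresh instance's attributes.
        self.__dict__ = self.__class__.objects.get(pk=pk).__dict__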

This might look quite hairy, but works extremely well. What happens now is:

  1. When pickling an instance of the model the __getstate__ method is called and the returned value—whatever it is—is pickled.
  2. When de-pickling, a new instance is created—without calling __init__—and __setstate__ is called with the de-pickled value returned above as its only argument.
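
The two steps above can be watched in isolation. This is a small sketch using a plain class (Tracked, a made-up name) rather than a Django model; it records which hooks run and in what order, and the bulky payload attribute is only there to show that nothing beyond the returned state is pickled.

```python
import pickle

calls = []  # records which hooks run, in order


class Tracked(object):
    def __init__(self, pk):
        calls.append("__init__")
        self.pk = pk
        self.payload = "x" * 1000  # bulky data we would rather not pickle

    def __getstate__(self):
        calls.append("__getstate__")
        return self.pk  # only this value ends up in the pickle

    def __setstate__(self, state):
        calls.append("__setstate__(%r)" % (state,))
        self.pk = state


original = Tracked(1)
data = pickle.dumps(original)
clone = pickle.loads(data)

print(calls)
# ['__init__', '__getstate__', '__setstate__(1)']
print(len(data) < 1000)  # True: the payload was never pickled
```

One caveat from the pickle documentation: if __getstate__ returns a false value, __setstate__ is not called on de-pickling, so a primary key of 0 would bypass the reload unless the state is wrapped in something that is always truthy.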

In this case we save the primary key, then attempt to get the object by its primary key. But since __setstate__ doesn’t return a new instance, we have to hot-swap the current instance for the one just loaded from the database. This is technically a mutation of the borg (monostate) pattern, the pattern most Pythonistas use instead of singleton. What does the result of pickling look like now?

print dumps(meep)
# ccopy_reg\n_reconstructor\np0\n(csrc.testing.models\nUser\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\nI1\nb.

Much better! 94 bytes, a reduction to roughly 24% of the original size. Now the only things being stored are the function used to de-pickle, a full reference to the class we’ve pickled, and the ID of 1.

At matchFWD we use a parent class common to all of our models to ensure everything is loaded from the database when de-pickled, but you can use a mix-in class to selectively apply this behaviour if you wish.
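
A mix-in along these lines might look like the sketch below. It is not matchFWD's actual code: FreshOnUnpickle, FakeManager and Account are hypothetical names, and the manager is a dict-backed stand-in so the example runs without Django. It assumes only that the host class exposes pk and an objects.get(pk=...) lookup, as Django models do; on a real model the __reduce__ override shown earlier would still be needed.

```python
import pickle


class FreshOnUnpickle(object):
    """Hypothetical mix-in: pickle only the primary key, reload on unpickle."""

    def __getstate__(self):
        # Wrapped in a dict so the state is never falsy, even for pk == 0.
        return {"pk": self.pk}

    def __setstate__(self, state):
        # Look the record up again and copy its attributes onto this instance.
        fresh = type(self).objects.get(pk=state["pk"])
        self.__dict__.update(fresh.__dict__)


class FakeManager(object):
    """Stand-in for a Django manager, backed by a plain dict."""

    def __init__(self):
        self.rows = {}

    def get(self, pk):
        return self.rows[pk]


class Account(FreshOnUnpickle):
    objects = FakeManager()

    def __init__(self, pk, email):
        self.pk = pk
        self.email = email
        Account.objects.rows[pk] = self  # "save" the record


bob = Account(1, "president@whitehouse.gov")
queued = pickle.dumps(bob)

# The record changes while the task waits in the queue.
bob.email = "bdole1969@hotmail.com"

restored = pickle.loads(queued)
print(restored.email)  # bdole1969@hotmail.com
```

Classes that do not inherit from the mix-in keep their usual pickling behaviour, which is what makes the approach selective.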

Updated to add: In response to comments from several social sharing sites, here’s the second definition of promiscuous provided by Mac’s Dictionary app to help clarify the title:

pro·mis·cu·ous |prəˈmiskyo͞oəs|
2a. demonstrating or implying an undiscriminating or unselective approach; indiscriminate or casual: the city fathers were promiscuous with their honours
2b. consisting of a wide range of different things: Americans are free to pick and choose from a promiscuous array of values and behaviour.

  • Anders Pearson

    When I offload background tasks to Celery, I just pass the primary key and let the task retrieve it from the database. Much simpler and pretty much guaranteed to be smaller than a pickle. The only reason I can think of that I’d ever want to pass a pickle to a task instead of a primary key is if I specifically don’t want it to update (i.e., I want to be sure that the task runs with the object in exactly the state it was in when the task was created).

    Could you explain why you would want to do it this way? Why pickle it if you are just going to reload it from the database anyway?

    • http://nomulous.com/ nomulous

      The point is to avoid the extra steps involved in getting and passing around a primary key and then manually getting the object back. Also for things like introspecting the arguments of a task, it’s helpful to see the actual objects being used instead of just an ID or something. It won’t be faster than storing an id (which is pickled anyways by the way), but it won’t be significantly slower, and it will be more elegant in a lot of ways.

      • Anders Pearson

        I don’t know…

        I feel like doing some_task.delay(myobj.pk) and

        @task
        def some_task(obj_pk):
            obj = MyObj.objects.get(pk=obj_pk)
            ...
        

        Is easy, simple, and explicit. No subclassing needed, less data passed on the wire, and anyone glancing at the code sees right away that it’s retrieving data from the database.

        With your approach, with your model classes all subclassing a common parent class that enables the hotswapping, I think you’re setting yourself up for subtle bugs in the future. You get yourself in the habit of writing your tasks to expect objects passed in to behave that way, and they do as long as they are the objects from your own codebase. But then someday one of your developers pulls in a model object from some common third party reusable django application (django-taggit, django-reminders, whatever) which doesn’t subclass your common parent and passes it directly to a task just like everything else. Everything works fine until some day a race condition like you describe above happens.

        I’d rather have it explicit and consistent.

        • http://www.gothcandy.com/ Alice Bevan-McGregor

          There are a number of ways to handle the external model case; if we monkey patch Django’s base Model’s __reduce__ to raise an exception we’ll be explicitly notified if we try to pass to Celery an object not from our codebase. We could also directly apply our patch against Django’s base Model at which point all models, internal or external, behave the same. We did this initially, but switched to using our own BaseModel superclass to avoid potential issues with other uses of pickling, esp. of third-party models, e.g. in session storage.

          So this method can be made consistent, but we weren’t comfortable with the potential wide-ranging side-effects of such a change, though our testing suite should catch such conditions.

        • http://www.gothcandy.com/ Alice Bevan-McGregor

          One other quick note: anything worth doing twice is worth automating. The boilerplate of ‘obj = SomeModel.objects.get(pk=pk)’ in two Celery tasks is too much boilerplate in my view.

          • http://lukeplant.me.uk/ spookylukey

            If you write any amount of Django you’ll have loads of repetition of SomeModel.objects.get() – that’s just called using an API. If you really can’t cope with the verbosity, you can always write something like:

            @task
            @withModel(SomeModel)
            def do_something(instance):
                ...

            where withModel converts a pk to an instance. Or use withSomeModel = withModel(SomeModel). You’ve now got it down to one token – @withSomeModel. It’s a little bit magical in that it effectively changes the ‘type signature’ of do_something, but the magic is explicitly added right there, you don’t have to dig into base classes to work out what is going on. And you are left with the option of using pickling to pickle and unpickle the object as it was, if you ever need that case. That seems far less likely to trip up future maintainers to me.

          • http://www.gothcandy.com/ Alice Bevan-McGregor

            So instead of writing one line to load a specific record, you write a two line decorator and add one line everywhere it’s used. Instead of n repetitions, you have n+2. Making it general increases that overhead. That doesn’t look like an improvement over my solution.

            Pickling a record (which Celery does automatically anyway for all arguments to your task) you put in a record and get one back. No need to dig into the underlying machinery since nothing has effectively changed other than the fact that the record is fresh, not stale!

            I also don’t have to worry about function signatures (see functools.wraps to mostly solve that, BTW) and using a mix-in on the models we want ‘live un-pickling’ of continues to give us the ability to use stale data if we really want to. (Which we don’t. ;^)

            Edited to add: in addition, decorators are non-obvious visual noise to new Python programmers (or any programmers not familiar with first-class functions) which, worse than the automatic pickling machinery, mutates the value that was stored into something else. You certainly would have to look at the decorator’s code to see what’s going on…

      • http://mikegrouchy.com mgrouchy

        One advantage of passing the id of the model (especially if you have long-running tasks or very large queues) is that you can ensure the data you get is the freshest from the database. If, for example, you modify a model object that was pickled and passed to a Celery task, saving it can overwrite fresher data with your modifications. This is highly dependent on your use case, but it is one more way to introduce subtle bugs by pickling model objects for processing in background tasks.

    • http://www.gothcandy.com/ Alice Bevan-McGregor

      Even an integer will get pickled by Celery, though yes, it’s even smaller. (E.g. ‘I1\n.’)

      The flexibility of being able to pass around real objects without worrying is huge. Many of our models are based on common abstract bases, for example TalentOpportunity and JobOpportunity. Some of our background tasks might not care which type of opportunity is being passed to it. By passing just an ID around we’d be forced to additionally pass the ‘type’ of record the ID represents. Pickling does this combination for us automatically.

  • Paul Garner

    I wouldn’t want to change the pickling implementation on my model like this for two reasons:

    - I’m already passing primary keys to the task queue instead of model instances
    - I’m using a memcache layer over the ORM… naturally this caches pickled model instances. It loses all benefit of caching if cached objects have to be fetched from the datastore on unpickling.

    I’d rather keep all the data access explicit since it’s often the slow point in the web application. You don’t want ‘hidden’ db hits happening semi-magically when you later need to track them down.

    Also, what if you are passing several model instances across to the task? They’re going to cause a series of discrete ‘get’ queries but it’d be more efficient to do a single filter(pk__in=[...])

    • http://www.gothcandy.com/ Alice Bevan-McGregor

      Indeed, multiple record lookups would be inefficient this way, which is why in those situations we pass a list of IDs. Even if we weren’t adjusting the pickling mechanic you wouldn’t pass a list of real model objects since the whole data set would be pickled, which would be terrible!