Upgrading to Django REST Framework 3

A client recently asked me to help them with a ticket: upgrading their project from Django REST Framework 2 to 3.

At this point I'd been working with Django and Django REST Framework for years, so I thought to myself: upgrading a 3rd party library from 2 to 3, easy pickings! Change a few field names and maybe some Serializer method names. I like Serializers. Serializers are cool.

A first

$ pip install -U djangorestframework
$ python manage.py runserver

revealed a large number of AssertionErrors coming from the DRF 3 source code.

Code like this:

class Field(object):
    def __init__(...):
        # Some combinations of keyword arguments do not make sense.
        assert not (read_only and write_only), NOT_READ_ONLY_WRITE_ONLY
        assert not (read_only and required), NOT_READ_ONLY_REQUIRED
        assert not (required and default is not empty), NOT_REQUIRED_DEFAULT
        assert not (read_only and self.__class__ == Field), USE_READONLYFIELD

was causing errors like

    AssertionError: Instantiate ReadOnlyField instead of Field(read_only=True)


    Field.to_representation() must be implemented.
    If you are upgrading from REST framework version 2
    you might want `ReadOnlyField`

Cool, s/Field/ReadOnlyField/ should take care of that.
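The pattern behind those errors is worth internalising before you start renaming things: DRF 3 validates field keyword arguments at construction time and fails fast. A minimal stdlib sketch of the same idea (class name and messages are made up here, not DRF's actual code):

```python
# Minimal sketch of DRF 3's fail-fast constructor asserts
# (class name and messages are hypothetical, not DRF's actual code).
class MyField:
    def __init__(self, read_only=False, write_only=False, required=False):
        # Reject keyword combinations that make no sense, up front.
        assert not (read_only and write_only), \
            "May not set both `read_only` and `write_only`"
        assert not (read_only and required), \
            "May not set both `read_only` and `required`"
        self.read_only = read_only
        self.write_only = write_only
        self.required = required
```

Instantiating `MyField(read_only=True, write_only=True)` trips the first assert immediately, which is exactly where tracebacks like the ones above come from.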

Time to set some aggressive estimates.

It wasn't until the next day that I ran the entire test suite:

=========================== short test summary info ============================
=== 341 failed, 4068 passed, 511 skipped, 2 warnings, 533 errors in 993.89 seconds ===

Hmm, 341 failed + 533 errors.

Wait, how many custom Serializers do we have? 104?!


A week later I was still sitting on 409 failed + 29 errors.

This was another beast altogether.

Every code base is unique, but I've compiled a list of ideas that helped me through the process:

  •  Read the announcement. You'd think this would be obvious, and the announcement actually tells you to read it too, but one thing that would've helped me tremendously is if it had done so in bold red. Read the page! -- Yes. That is better.
  •  Familiarise yourself with the DRF3 source before starting. What exactly do BaseSerializers, Serializers and ModelSerializers do, and when do you want to sub-class each one? What's the difference between .data and .validated_data (hint: it's not the validation)? How does validation work now anyway? Why does DRF3 not call Model.clean() anymore when de-serializing? Should you instantiate a throw-away in order to run Model(**data).clean(), or maybe move the validation code into a more generalised spot? The announcement and DRF3 source code (which is well documented) will help with such questions.
  •  Set all the compatibility flags in the beginning and work backwards. DRF3 coerces beautiful native Date, Time, DateTime and Decimal objects into strings in Serializer output ... setting DATETIME_FORMAT, DATE_FORMAT and TIME_FORMAT to None, and COERCE_DECIMAL_TO_STRING to False, will revert such behaviour.
  •  Get to green as quickly as possible. I started working through the test failures on a submodule-by-submodule basis but midway through went rambo: I started to leave small optimisations, refactors & cleanups by the way-side and went into hack-mode, spraying code and TODOs wherever I went in an effort to "just get to green" on my test build and have a solid basis to work from. After this was done, I also had a better overview of which problems kept recurring and required careful refactoring, and which ones were one-offs that could be left as special cases.
  •  Leave bugs that confuse you for later. When all of a sudden completely unrelated code starts breaking, it might be a good idea to leave it and focus on the more concrete stuff first. This ties in with "Get to green as quickly as possible". Some things are obviously wrong behaviour, while others can simply be a special case that was intended but undocumented, or a new DRF3 behaviour, or an old DRF2 behaviour that's not possible anymore, or a different output format causing something else to behave differently in subtle ways, and so on and so forth.
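For reference, the compatibility flags from the bullet above live in Django's REST_FRAMEWORK settings dict. A sketch of what that looks like, using the flag names from the DRF 3 announcement:

```python
# settings.py sketch: revert DRF 3's string coercion while migrating.
REST_FRAMEWORK = {
    "DATETIME_FORMAT": None,   # keep native datetime objects in output
    "DATE_FORMAT": None,       # keep native date objects
    "TIME_FORMAT": None,       # keep native time objects
    "COERCE_DECIMAL_TO_STRING": False,  # keep Decimal as Decimal
}
```

Once the test suite is green you can flip these back one at a time and fix the fallout per flag, rather than all at once.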

Here's an example of a create flow for one of the Serializers:

You would probably be right to assume that this can and will break in many places if you completely re-architect the APIs and underlying philosophies of the code the white-and-gold and blue-and-purple rows were built upon.

Imagine the CustomAddressField throws a ValidationError. After tracing through the code you find that while the value originates in gen_contact_details(), it sometimes gets modified in fill_project_data(), a utility function used widely throughout the 300kLOC code base. What is this mysterious value supposed to be? Searching through the code base, you find some other function tries to cast this value to an int() after calling fill_project_data(), but leaves it be in case it can't. This means it must be str or int in most cases, maybe a float. Since we're using dynamically typed Python, there's a lot of guess-work involved. Leaving hard-to-debug errors for later gives you more time to get to know your own code base and those of DRF2 and DRF3.

A lot of the time opaque high-level failures will resolve themselves when you get to the more fine-grained tests and are able to quickly pin-point & solve issues at a lower level.
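That defensive cast boils down to a tiny helper; a sketch of the pattern (the helper name is made up, the real code base did it inline):

```python
def maybe_int(value):
    """Try to coerce value to int; on failure, hand it back untouched."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return value
```

Scattering casts like this around a code base is exactly what makes the "what type is this, really?" question so hard to answer later.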

  •  Ration your willpower. Carefully stepping through nested upon nested levels of testing + view + serializer + DRF3 source code is mentally taxing work and I decided to reserve my most productive phases of the day for the heavier mental lifting while picking out easier tasks otherwise. There's no shame in switching to a more pedestrian JavaScript bug late in the day, when the same time tomorrow could also be spent working on something tougher.
  •  Don't be afraid to go against what you see in the DRF3 source. A lot of the implicit behaviours in DRF 2 have been replaced by stricter and more explicit ones in DRF3. I enjoy code-philosophical discourse as much as the next guy, but sometimes re-implementing a bit of woo-woo and magic can be an easier way to keep API consistency than to try and shove a DRF2-grown cactus into a DRF3-shaped pot.

Was it worth it?

For me personally: Yes. While altogether a rather bumpy & long ride, I believe it's often the things we don't enjoy at first that help us grow. Serializers were one of the areas I was less familiar with and I now - quite literally - know every nook and cranny of these little beasts.

Like with other big projects, the reasons for upgrading from version 2 to 3 are not immediately obvious. Under the hood it's a lot nicer. And there's quite a few goodies. At the same time all this goes against the maxim of "If it ain't broke, don't fix it".

I wonder if there's something about working on large open-source code-bases, where a v3 makes developers & maintainers snap and spiral into a we-do-things-like-this-now frenzy because working with the old code day-in-day-out simply becomes too intellectually & aesthetically insulting.




7 Python Libraries you should know about

In my years of programming in Python and roaming around GitHub's Explore section, I've come across a few libraries that stood out to me as being particularly enjoyable to use. This blog post is an attempt to further spread that knowledge.

I decided to exclude awesome libraries like requests, SQLAlchemy, Flask, fabric etc. because I think they're already pretty "main-stream". If you know what you're trying to do, it's almost guaranteed that you'll stumble over the aforementioned. This is a list of libraries that in my opinion should be better known, but aren't.

1. pyquery (with lxml)

pip install pyquery

For parsing HTML in Python, Beautiful Soup is oft recommended and it does a great job. It sports a good pythonic API and it's easy to find introductory guides on the web. All is good in parsing-land ... until you want to parse more than a dozen documents at a time and immediately run head-first into performance problems. It's - simply put - very, very slow.

Just how slow? Check out this chart from the excellent Python HTML Parser comparison Ian Bicking compiled in 2008.

What immediately stands out is how fast lxml is. Compared to Beautiful Soup, the lxml docs are pretty sparse and that's what originally kept me from adopting this mustang of a parsing library. lxml is also pretty clunky to use. Yeah, you can learn and use XPath or cssselect to select specific elements out of the tree and it becomes kind of tolerable. But once you've selected the elements that you actually want, you have to navigate the labyrinth of attributes lxml exposes, some containing the bits you want to get at, but the vast majority just returning None. This becomes easier after a couple dozen uses but it remains unintuitive.

So either slow and easy to use or fast and hard to use, right?


Enter PyQuery

Oh PyQuery you beautiful seductress:

from pyquery import PyQuery
page = PyQuery(some_html)

last_red_anchor = page('#container > a.red:last')

Easy as pie. It's the ever-beloved jQuery, but in Python!

There are some gotchas, like for example that PyQuery, like jQuery, exposes its internals upon iteration, forcing you to re-wrap:

for paragraph in page('#container > p'):
    paragraph = PyQuery(paragraph)
    text = paragraph.text()

That's a wart the PyQuery creators ported over from jQuery (where they'd fix it if it didn't break compatibility). Understandable, but still unfortunate for such a great library.

2. dateutil

pip install python-dateutil

Handling dates is a pain. Thank god dateutil exists. I won't even go near parsing dates without trying dateutil.parser first:

from dateutil.parser import parse

>>> parse('Mon, 11 Jul 2011 10:01:56 +0200 (CEST)')
datetime.datetime(2011, 7, 11, 10, 1, 56, tzinfo=tzlocal())

# fuzzy ignores unknown tokens

>>> s = """Today is 25 of September of 2003, exactly
...        at 10:49:41 with timezone -03:00."""
>>> parse(s, fuzzy=True)
datetime.datetime(2003, 9, 25, 10, 49, 41,
                  tzinfo=tzoffset(None, -10800))

Another thing that dateutil does for you, that would be a total pain to do manually, is recurrence:

>>> from datetime import datetime
>>> from dateutil.rrule import rrule, DAILY, TU, TH
>>> list(rrule(DAILY, count=3, byweekday=(TU,TH),
...            dtstart=datetime(2007,1,1)))
[datetime.datetime(2007, 1, 2, 0, 0),
 datetime.datetime(2007, 1, 4, 0, 0),
 datetime.datetime(2007, 1, 9, 0, 0)]
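Along the same lines, dateutil's relativedelta handles calendar arithmetic that a plain timedelta can't express, such as "one month later" landing on a day that actually exists:

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta

# Jan 31 plus "one month" clamps to the last real day of February.
later = datetime(2003, 1, 31) + relativedelta(months=1)
```

Here `later` comes out as February 28th, 2003, instead of an invalid February 31st.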

3. fuzzywuzzy

pip install fuzzywuzzy

fuzzywuzzy allows you to do fuzzy string comparison. This has a whole host of use cases and is especially nice when you have to deal with human-generated data.

Consider the following code that uses the Levenshtein distance comparing some user input to an array of possible choices.

from Levenshtein import distance

countries = ['Canada', 'Antarctica', 'Togo', ...]

def choose_least_distant(element, choices):
    'Return the one element of choices that is most similar to element'
    return min(choices, key=lambda s: distance(element, s))

>>> user_input = 'canaderp'
>>> choose_least_distant(user_input, countries)
'Canada'

This is all nice and dandy but we can do better. The ocean of 3rd party libs in Python is so vast, that in most cases we can just import something and be on our way:

from fuzzywuzzy import process

>>> process.extractOne("canaderp", countries)
("Canada", 97)

More has been written about fuzzywuzzy here.
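If pulling in a third-party lib isn't an option, the stdlib's difflib can do a rougher version of the same trick:

```python
import difflib

countries = ['Canada', 'Antarctica', 'Togo']

# get_close_matches ranks candidates by SequenceMatcher ratio
# and drops anything below the cutoff.
matches = difflib.get_close_matches('canaderp', countries, n=1, cutoff=0.4)
```

Note that difflib's matching is case-sensitive and generally cruder than Levenshtein-based scoring, so for messy human input fuzzywuzzy tends to give better results.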

4. watchdog

pip install watchdog

watchdog is a Python API and a set of shell utilities for monitoring file system events. This means you can watch some directory and build a "push-based" system on top of it, which comes in handy for all kinds of problems. A solid piece of engineering that does it much better than the 5 or so libraries I tried before finding out about it.

5. sh

pip install sh

sh allows you to call any program as if it were a function:

from sh import git, ls, wc

# checkout master branch
git.checkout("master")

# print the contents of this directory
print(ls("-l"))

# get the longest line of this file
longest_line = wc(__file__, "-L")
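For contrast, here's roughly what sh is saving you over raw subprocess (a sketch; assumes a Unix `wc` on the PATH):

```python
import subprocess

def count_lines(path):
    # The subprocess spelling of sh-style wc(path, "-l"):
    # run the command, capture stdout, parse the count back out.
    out = subprocess.run(["wc", "-l", path], capture_output=True, text=True)
    return int(out.stdout.split()[0])
```

Every call becomes a list of argv strings plus manual output parsing, which is exactly the boilerplate sh hides behind its function-call syntax.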

6. pattern

pip install pattern

This behemoth of a library advertises itself quite modestly:

Pattern is a web mining module for the Python programming language.

... that does Data Mining, Natural Language Processing, Machine Learning and Network Analysis all in one. I myself have yet to play with it, but a friend's verdict was very positive.

7. path.py

pip install path.py

When I first learned Python, os.path was my least favorite part of the stdlib.

Even something as simple as creating a list of files in a directory turned out to be grating:

import os

some_dir = '/some_dir'
files = []

for f in os.listdir(some_dir):
    files.append(os.path.join(some_dir, f))

That listdir is in os and not os.path is unfortunate and unexpected, and one would really hope for more from such a prominent module. And then there's all this manual fiddling for what really should be as simple as possible.
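To be fair, the stdlib version at least compresses into a comprehension, though the listdir/join split remains:

```python
import os

def list_files(some_dir):
    # listdir yields bare names; join each one back onto its directory.
    return [os.path.join(some_dir, f) for f in os.listdir(some_dir)]
```

It works, but you're still gluing two functions from two namespaces together for the single most common path operation there is.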

But with the power of path, handling file paths becomes fun again:

from path import path

some_dir = path('/some_dir')

files = some_dir.files()


Other goodies include:

>>> path('/').owner

>>> path('a/b/c').splitall()
[path(''), 'a', 'b', 'c']

# overriding __div__
>>> path('a') / 'b' / 'c'

>>> path('ab/c').relpathto('ab/d/f')

Best part of it all? path subclasses Python's str so you can use it completely guilt-free, without constantly being forced to cast it to str and worrying about libraries that check isinstance(s, basestring) (or even worse isinstance(s, str)).
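The central trick really is just subclassing str. A toy sketch of the idea (not path.py's actual implementation, and using Python 3's __truediv__ rather than the __div__ shown above):

```python
import os

class ToyPath(str):
    """Toy sketch: a str subclass whose / operator joins paths."""
    def __truediv__(self, other):
        # os.path.join accepts self directly because ToyPath *is* a str.
        return ToyPath(os.path.join(self, other))

p = ToyPath('a') / 'b' / 'c'
```

Because every ToyPath is a str, it passes isinstance checks and slots into any API expecting plain strings.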

That's it! I hope I was able to introduce you to some libraries you didn't know before.