Python DTO Choices

DTO stands for 'Data Transfer Object'. But what does that mean? Well, this Stack Overflow post gives a nice summary:

‘A Data Transfer Object is an object that is used to encapsulate data, and send it from one subsystem of an application to another.’

Is data coming into your system from 'one place' and being handed to 'another'? If so, a Data Transfer Object can act as an intermediary in the code.

For a recent project, I found myself using Pydantic to perform exactly this function. My Django application called a third-party API; I parsed the response into Pydantic classes and then displayed these to my users in templates. Having previously toiled with accessing values in dict-like structures using square brackets, being able to get data points as attributes using dot notation felt like a breath of fresh air. The Pydantic classes are also readable and typed: everything is so simple to look at that I felt I was comprehending my code better.

'Classic'(?) Class Definition: 

from typing import Optional


class Example:
    def __init__(self, other: str, thing: Optional[str] = None):
        self.other = other
        self.thing = thing

Same class, but via Pydantic: 

from typing import Optional

from pydantic import BaseModel


class Example(BaseModel):
    other: str
    thing: Optional[str] = None

Accessing values from a dict-like object: 

a_thing['foo']['bar']

Getting values with class-like dot notation (IMO: nicer to write, nicer to read) 

a_thing.foo.bar
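
As a minimal sketch of how Pydantic enables that dot notation on nested data (the Thing/Foo names and fields here are invented for illustration): nested models are declared as fields, and a nested dict passed at construction is validated into nested model instances.

from pydantic import BaseModel


class Foo(BaseModel):
    bar: str


class Thing(BaseModel):
    foo: Foo


# The nested dict is validated into a nested Foo instance.
a_thing = Thing(**{"foo": {"bar": "hello"}})
print(a_thing.foo.bar)  # prints "hello"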


However, it soon became clear to me that this 'nicer' class definition is a problem that has been solved many times over in Python. There are several standard library solutions, as well as third-party packages, that let you write classes without the 'dunder' boilerplate. I began to panic and wonder whether I had made the right choice with Pydantic. This warranted some kind of benchmarking attempt, so that I could make a call on whether or not to swap out Pydantic for the equally esteemed Attrs, or something from the standard library like TypedDict.
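
To give a flavour of those alternatives, here is roughly how the same Example class could be written with each of the options I'll be benchmarking below (a sketch; the class names are my own, and TypedDict omits the default because TypedDict keys can't carry default values):

from dataclasses import dataclass
from typing import NamedTuple, Optional, TypedDict

import attr


@attr.define
class ExampleAttrs:
    other: str
    thing: Optional[str] = None


@dataclass
class ExampleDataclass:
    other: str
    thing: Optional[str] = None


class ExampleTypedDict(TypedDict):
    other: str
    thing: Optional[str]


class ExampleNamedTuple(NamedTuple):
    other: str
    thing: Optional[str] = None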

To simulate what I'm doing in my app as closely as possible, I am going to:

1) Create views which initially call an external API endpoint. I found this one which gives fake ‘user’ data:

https://randomuser.me/api/

It returns a response body like so:

{
   "results":[
      {
         "gender":"male",
         "name":{
            "title":"Mr",
            "first":"Mohammed",
            "last":"Strauß"
         },
         "location":{
            "street":{
               "number":8343,
               "name":"Eichenweg"
            },
            "city":"Hemmingen",
            "state":"Berlin",
            "country":"Germany",
            "postcode":19282,
            "coordinates":{
               "latitude":"-40.3361",
               "longitude":"-84.7425"
            },
            "timezone":{
               "offset":"+2:00",
               "description":"Kaliningrad, South Africa"
            }
         },
         "email":"mohammed.strauss@example.com",
         "login":{
            "uuid":"bd8d4521-8a2d-449a-b131-04d528893f42",
            "username":"bluebutterfly132",
            "password":"allgood",
            "salt":"HMjZsJOB",
            "md5":"59207c14ae03ca031b93f60d62674d11",
            "sha1":"23c6f7dfb2bedb68da99419a259732e95d4fbf93",
            "sha256":"09436ae464d047025d375c40597569f7dfebdbd1eaaa1f1fb703ac61a36fbecd"
         },
         "dob":{
            "date":"1958-09-07T07:19:21.517Z",
            "age":64
         },
         "registered":{
            "date":"2016-11-04T17:36:00.287Z",
            "age":6
         },
         "phone":"0155-6292119",
         "cell":"0171-0685963",
         "id":{
            "name":"",
            "value":null
         },
         "picture":{
            "large":"https://randomuser.me/api/portraits/men/31.jpg",
            "medium":"https://randomuser.me/api/portraits/med/men/31.jpg",
            "thumbnail":"https://randomuser.me/api/portraits/thumb/men/31.jpg"
         },
         "nat":"DE"
      }
   ],
   "info":{
      "seed":"cde68e97c38bbef9",
      "results":1,
      "page":1,
      "version":"1.3"
   }
}

2) Create an object which represents a subset of the data returned by that API. In the case of Pydantic, my definitions of the DTOs look like this:

from pydantic import BaseModel, AnyUrl


class PydanticAddress(BaseModel):
    street: str
    city: str
    postcode: str


class PydanticExample(BaseModel):
    name: str
    address: PydanticAddress
    age: int
    avatar: AnyUrl

Note that one class is nested inside the other. 
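
A small aside: the API actually returns postcode as an integer, but Pydantic (v1, which is what I'm using here) will happily coerce it to the str declared on the model:

# Values taken from the sample response above.
address = PydanticAddress(street="8343 Eichenweg", city="Hemmingen", postcode=19282)
print(type(address.postcode))  # <class 'str'>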


3) Have a parsing function which takes care of the unseemly fiddling around.

def parse_to_dto(payload, dto_class, dto_class_address):
    # Everything we need lives under the first entry in "results".
    result = payload["results"][0]
    return dto_class(
        name=result["name"]["title"]
        + " "
        + result["name"]["first"]
        + " "
        + result["name"]["last"],
        address=dto_class_address(
            street=str(result["location"]["street"]["number"])
            + " "
            + result["location"]["street"]["name"],
            city=result["location"]["city"],
            postcode=result["location"]["postcode"],
        ),
        age=result["dob"]["age"],
        avatar=result["picture"]["thumbnail"],
    )
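
Because the DTO classes are passed in as arguments, this one function can be reused for every flavour being tested; each class just needs to accept the same keyword arguments at construction time. For instance, with hypothetical Attrs counterparts to the two Pydantic models (AttrsExample and AttrsAddress, defined analogously to PydanticExample and PydanticAddress), the call would be:

parsed = parse_to_dto(response.json(), AttrsExample, AttrsAddress)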

4) Have a template which reads from the DTO:

<ul>
    <li>{{ object.name }}</li>
    <li>{{ object.address.street }} {{ object.address.city }} {{ object.address.postcode }}</li>
    <li>This person is {{ object.age }} years old</li>
    <li>{{ object.avatar }}</li>
    <li>The DTO was {{ size }} bytes</li>
</ul>


The entire view for the Pydantic endpoint looks like this:
 

from sys import getsizeof

import requests
from django.template.response import TemplateResponse


def pydantic_view(request):
    response = requests.get("https://randomuser.me/api/")
    parsed = parse_to_dto(response.json(), PydanticExample, PydanticAddress)
    return TemplateResponse(
        request, "comparison.html", {"object": parsed, "size": getsizeof(parsed)}
    )

You will have noticed from the template and view that I am using the sys module's getsizeof function to measure the size of the DTO in memory.
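
One caveat: getsizeof only reports the shallow size of the object itself; anything the object merely references (its strings, the nested address DTO) is not included in the figure. A quick illustration:

from sys import getsizeof

# Both dicts report the same size, however long the inner string is,
# because getsizeof counts only the dict's own structure.
print(getsizeof({"outer": "x" * 10_000}))
print(getsizeof({"outer": "x"}))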

In a really basic Django app, I've hooked up routes and views for Pydantic, Attrs, TypedDict, NamedTuple and Data Classes.
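
The routing itself is nothing special; the urls.py looks something like this (the route names, and all view names other than pydantic_view, are my assumptions):

# urls.py
from django.urls import path

from . import views

urlpatterns = [
    path("pydantic/", views.pydantic_view),
    path("attrs/", views.attrs_view),
    path("typeddict/", views.typeddict_view),
    path("namedtuple/", views.namedtuple_view),
    path("dataclass/", views.dataclass_view),
    path("standard/", views.standard_view),
]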


I used the Salvo load testing tool to measure speed. I'm not especially interested in how the app behaves under load, but given that there is an (unpredictable) network call in amongst all of this, the fairest thing to do seems to be to make a lot of synchronous requests; Salvo can then calculate the average length of the request/response cycle.

With the app running on my local host, Salvo can be run like so: 

salvo http://127.0.0.1:8000/pydantic/ --concurrency 1 --requests 100

The Results:

DTO             Average Response Time (seconds)    Size (bytes)
Pydantic        0.2175                             48
Attrs           0.2153                             72
DataClass       0.2083                             48
TypedDict       0.2636                             232
NamedTuple      0.2135                             72
StandardClass   0.2226                             48

The 'Standard' class was defined like so: 

class StandardAddress:
    def __init__(self, street, city, postcode):
        self.street = street
        self.city = city
        self.postcode = postcode


class StandardExample:
    def __init__(self, name, address, age, avatar):
        self.name = name
        self.address = address
        self.age = age
        self.avatar = avatar

So both of the third-party options come in faster than a standard class (beaten only by plain Data Classes), with Attrs having a slightly larger memory footprint. TypedDicts seem significantly slower and bigger.


This overall approach feels a bit 'hacky' to me: I'm effectively using a load testing tool to profile an application. This got me thinking: was there a different approach, one where I could analyse what is going on in the code in a more granular, less crude way? The package django-cprofile-middleware seemingly fits the bill. If I add its middleware to my project, I can get stats for different parts of the call stack; it will, in effect, tell me how long the program spends in each part of the code.

After pip installing it, I amend my project's settings:

# settings.py

...

MIDDLEWARE = [
    "django.middleware.security.SecurityMiddleware",
    "django.contrib.sessions.middleware.SessionMiddleware",
    "django.middleware.common.CommonMiddleware",
    "django.middleware.csrf.CsrfViewMiddleware",
    "django.contrib.auth.middleware.AuthenticationMiddleware",
    "django.contrib.messages.middleware.MessageMiddleware",
    "django.middleware.clickjacking.XFrameOptionsMiddleware",
    
    # New! 
    "django_cprofile_middleware.middleware.ProfilerMiddleware",
]

DJANGO_CPROFILE_MIDDLEWARE_REQUIRE_STAFF = False

...

Now if I start the Django development server, I can hit any endpoint and, by adding a 'prof' query param to the URL, have the stats returned to me.
 


In order to get a grasp of what the middleware is doing, let's focus on the 'cumtime' column, representing the cumulative time spent in a function (which includes time spent in any functions called from within it). If we examine the Pydantic endpoint sorted by cumulative time, http://127.0.0.1:8000/pydantic/?prof&sort=cumtime , we see the pydantic_view view at the very top. This is expected, as the view is responsible for the entire request/response cycle: everything else we are measuring happens 'within' that view.

Another thing worth noting here is that the call with the next-largest cumulative time comes from the requests package, and its cumulative time is only 0.001 seconds less than the view's. In other words, everything else that happens in the view (the parsing and the DTO initialisation) accounts for almost none of the time: far and away the biggest overhead comes from the call to the external API.

If I sort by 'tottime' (i.e. the total time spent in each function itself, ignoring sub-calls), I can confirm that the parts of the program which are actually slowest are all related to making the HTTP call, as suspected. This holds for the faster and slower DTOs alike, as diagnosed by my Salvo runs.

 

So far, django-cprofile-middleware seems to be pointing me to the fact that the differences are negligible.

Here are some decent tutorials relating to django-cprofile-middleware:

It is often used in conjunction with the SnakeViz viewer to produce a GUI for inspecting the results (detailed in both the above posts). 
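
A rough sketch of that workflow (consult the package READMEs for the exact parameters; the download step below is my recollection of django-cprofile-middleware's docs):

pip install snakeviz
# Download the raw profile data by adding a download param to the
# prof URL (e.g. http://127.0.0.1:8000/pydantic/?prof&download),
# then open the resulting .prof file in the browser-based viewer:
snakeviz view.prof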


Given the extra functionality* one can get with Pydantic and Attrs, I'd say they are well worth using, especially as there seems to be little speed difference compared to the standard library offerings (despite the crude nature of my benchmarking).

*I've deliberately avoided discussing those 'extra' functionalities here, but some excellent articles can be found below, which discuss the feature sets of the options explored in greater depth.

Dataclasses vs Attrs vs Pydantic by Jack McKew

7 ways to implement DTOs in Python... by Izabela Kowal

Why not... from the Attrs docs

It would appear to me that Attrs' raison d'être is to help people write classes more easily. The standard library's Data Classes were created as a response to this same problem, but Attrs remains more extensible. Pydantic's main focus is data validation and parsing. NamedTuples and TypedDicts are fundamentally those kinds of data structures under the hood; it's just that they can be defined with a class-like syntax. So, to summarise: if you want to write cleaner classes and only care about speed, use Data Classes. If you think you might care about more than speed in the future, use Attrs. If you want parsing, validation and much more (alongside everything else), Pydantic is the way to go!


For my test I was using Python 3.8.9, and the versions of the key libraries were:

attrs==21.4.0
Django==4.0.2
pydantic==1.9.0
requests==2.27.1
salvo==0.2