---
layout: post
title: "python: metaprogramming marshmallow"
author: mjadud
tags:
- cc-sa
- python
- metaprogramming
- blog
- "2020"
- 2020-03
publishDate: 2020-03-19
---
## tl;dr
I used Python's metaprogramming features to auto-generate Marshmallow schemas that correspond to `attrs`-derived data classes.
If you like the thought of thinking about metaprogramming as much as I do, you'll groove on this post.
## a theme of metaprogramming...
*Oddly, related as a piece to my explorations of `tbl` in Python, as well as looking at GraphQL, but still its own post...*
It is hard to extend Python's syntax, but that doesn't mean you can't engage in some dynamic metaprogramming in the language. While it isn't always the first tool you should reach for, it can be nice for **reducing boilerplate**.
For example, I am staring down a bunch of JSON-y things. They come-and-go from the front-end to the back-end:
```json
{ "email": "vaderd@empire.com",
  "token": "89425abc-69f9-11ea-b973-a244a7b51496" }
```
Let's pretend that the front-end is [React](https://reactjs.org/), the storage layer is [MongoDB](https://www.mongodb.com/), and the middleware is [Flask](https://palletsprojects.com/p/flask/) (a Python web framework).
At the Flask layer, there's a lot of work that needs to be done: the JSON comes in, and in the first instance, it comes in as a dictionary. This is not very nice. By "not very nice," I mean "dictionaries convey no notion of types or the regularity of their contents, and therefore provide us with no notion of safety." What I'd like is for the data coming from the front-end to be strongly typed and well described, the middleware to be aware of those types, and the database to help enforce them as well. (I'm thinking GraphQL starts to do things like this... almost.)
BUT, we have a RESTful web application sharing data in webby, untyped ways. This inspired me to do some digging. First, I found Flask-RESTful, which is a nice library. It lets you define a class, set up `get`, `put`, `post`, and other methods on endpoints, and register them with the app. Leaving a bunch of bits out, this looks like:
```python
import uuid
from time import time

from flask_restful import Resource, Api
import db.models as M
import db.db as DB

class Tokens(Resource):
    def post(self, email):
        # Create a UUID string
        tok = str(uuid.uuid1())
        # Create a TimedToken object, with a current timestamp
        t = M.TimedToken(email=email, token=tok, created_at=time())
        # Grab the correct collection in Mongo for tokens
        collection = DB.get_collection(M.TimedToken.collection)
        # Save the token into Mongo by dumping the token through marshmallow
        as_json = t.dump()
        collection.insert(as_json)
        # Return the token as JSON to the client
        return as_json

mapping = [
    [Tokens, "/token/"]
]

def add_api(api):
    for m in mapping:
        api.add_resource(m[0], m[1])
```
which is in a module called "API", and at the top level of the app:
```python
from flask_restful import Api
from flask import Flask
import hydra

import db.models as M
import db.db as DB
from api.api import add_api

app = Flask(__name__)

@hydra.main(config_path="config.yaml")
def init(cfg):
    # Dynamically define classes from the YAML config.
    M.create_classes(cfg)
    # Set the Mongo params from the config.
    DB.set_params(cfg.db.host, cfg.db.port, cfg.db.database)
    # Add the REST API to the app.
    A = Api(app)
    add_api(A)
```
This is a lot to take in, but I'm actually trying to get to the good bit. The top level has an `init` function that reads in a configuration file (more on that later), and uses that to build a whole bunch of classes *dynamically at run time*. (This is the cool bit.) Those are instantiated in the `models` submodule of `db`, and they get used throughout the application.
Looking back at the first code block, it's possible to see some of those uses. For example, I'm creating a timed token (e.g. a random string associated with a user that will ultimately have a finite lifetime).
```python
t = M.TimedToken(email=email, token=tok, created_at=time())
```
This class takes three parameters: `email`, `token`, and `created_at`. The whole purpose of the class is that I want it to serve as a `struct` (in Racket or C) or `record` (in... Pascal?). In Python, `namedtuple`s, `dataclass`es, and classes decorated with `attrs` are all examples of what I'm aiming for.
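As a rough, stdlib-only illustration of the record-like shape I'm describing, here is a sketch using `namedtuple` and `dataclass`. The names `TokenTuple` and `TokenRecord`, and the sample values, are made up for this example; they are not part of the app.
```python
from collections import namedtuple
from dataclasses import dataclass, asdict

# A namedtuple: immutable, with named fields.
TokenTuple = namedtuple("TokenTuple", ["email", "token", "created_at"])

# A dataclass: the same shape, with type annotations.
# (The annotations document intent; Python does not enforce them at run time.)
@dataclass
class TokenRecord:
    email: str
    token: str
    created_at: float

t1 = TokenTuple("vaderd@empire.com", "abc", 1584576000.0)
t2 = TokenRecord(email="vaderd@empire.com", token="abc", created_at=1584576000.0)

print(t1.email)     # access a field by name
print(asdict(t2))   # a plain dictionary with the same three keys
```
Either of these gives named fields, but neither knows anything about JSON validation on its own.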
But... **BUT**... I also want easy marshalling to-and-from JSON. The front-end speaks it, and Mongo speaks it... but, while I'm in the middle, I need to interact with it. I would like it to be *typed* (in as much as Python is typed) while I am working with it in the middleware. And, I'd rather not do the conversions myself. (Why would I write code if I wanted to do all the hard stuff by hand?)
To solve this, enter [marshmallow](https://marshmallow.readthedocs.io/en/stable/). This Python library lets you define schemas for classes, and in doing so, leverage machinery to marshal JSON structures to-and-from those classes. For example, my `TimedToken` class looks (er, used to look) like:
```python
import attr

@attr.s
class TimedToken:
    email = attr.ib(type=str)
    token = attr.ib(type=str)
    created_at = attr.ib(type=float)
```
To marshal this to-and-from JSON, I can use marshmallow. I need to create a schema first:
```python
from marshmallow import Schema, fields

class TimedTokenSchema(Schema):
    email = fields.Str()
    token = fields.Str()
    created_at = fields.Number()
```
Once I have a schema, I can do things like this:
```python
a_token = TimedToken(...)
schema = TimedTokenSchema()
as_json = schema.dump(a_token)
```
The machinery inside of marshmallow will take an object of type `TimedToken` and a schema describing it (`TimedTokenSchema`), and use the schema to walk through the object and convert it to JSON (and back, if you want). Strictly speaking, in marshmallow 3 `dump()` hands back a JSON-ready dictionary, and `dumps()` gives you the JSON string.
This is cool.
But, it's not automatic. And, for every data structure I want to create in my app, I need to write a schema. This is duplicating code. If I change a structure, I need to remember to change the corresponding schema. *That isn't going to happen*. What's actually going to happen is that I'll forget something, and everything will break.
## enter metaprogramming!
I wanted to be able to declare my data structures as YAML, and then have Python generate both the `attrs`-based class as well as the `marshmallow`-based schema. Is that so much to ask? No, I don't think it is.
Using Facebook's [Hydra](https://hydra.cc/), I created a config file. The important bit (for this discussion) looks like this:
```yaml
models:
  - name: TimedToken
    fields:
      - email
      - token
      - created_at
    types:
      - String
      - UUID
      - Number
```
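One thing to watch with this layout: `fields` and `types` are parallel lists, so they only make sense if they stay the same length and in the same order. Python's `zip` silently stops at the shorter list, so a missing type would quietly drop a field. Here is a small sketch of a sanity check. The function `check_model` is hypothetical (it is not in my code), and plain dictionaries stand in for the Hydra config.
```python
def check_model(model):
    """Pair up fields and types, raising ValueError if the lists don't match."""
    fields, types = model["fields"], model["types"]
    if len(fields) != len(types):
        raise ValueError(
            "{}: {} fields but {} types".format(model["name"], len(fields), len(types))
        )
    return list(zip(fields, types))

good = {
    "name": "TimedToken",
    "fields": ["email", "token", "created_at"],
    "types": ["String", "UUID", "Number"],
}

print(check_model(good))
```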
Then, the fun bit is the function `create_classes`. It takes a config that includes the `models` key, and does the following:
```python
def create_classes(cfg):
    for c in cfg.models:
        make_classes(c.name, c.fields, c.types)
```
OK... so, `make_classes` must do the interesting work.
```python
def make_classes(name, fs, ts):
    # Dynamically generate the marshmallow schema
    schema = make_schema(fs, ts)
    # Generate a base class, and wrap it with the attr.s decorator.
    base = attr.s(make_base(name, fs, ts, schema))
    # Insert the class into the namespace.
    globals()[name] = base
```
This is probably **really bad**. But, it's fun, so I'll keep going.
I pass in the name of the class as a string (`"TimedToken"`), and then I pass in the fields as a list of strings, and their types as a list of strings. (These are given in the YAML, above). The last line here is where the evil happens. The function `globals()` returns the dictionary representing the current namespace. I proceed to overwrite the namespace; specifically, I insert a new class of the name `TimedToken` (in this example). (I *hope* the use of `globals()` is restricted to the *module*, and not the entire *application*... I have some more reading/experimenting to do in that regard. It *seems* like it is the module...)
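On the module question: Python's documentation says that, inside a function, `globals()` is the namespace of the module where the function is *defined*. Here is a small stdlib-only sketch that checks this. The throwaway module `models` and the function `register` are made up for the experiment.
```python
import types

source = '''
def register(name, value):
    # globals() here is the namespace of the module that defines register.
    globals()[name] = value
'''

# Build a throwaway module and define register() inside it.
models = types.ModuleType("models")
exec(source, models.__dict__)

# Calling register() from outside still writes into the models module.
models.register("TimedToken", type("TimedToken", (), {}))

print(hasattr(models, "TimedToken"))                    # True
print(models.register.__globals__ is models.__dict__)   # True
```
So the generated classes land in the module that holds the function, not in some application-wide namespace.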
Backing up, I'll start with `make_schema()`. It takes the fields and types, and does the following:
```python
def make_schema(fs, ts):
    # Create an empty dictionary
    d = {}
    # Walk the fields and types together (using zip)
    for f, t in zip(fs, ts):
        # Convert each type into the appropriate fields.X from marshmallow
        # and insert it into the dictionary
        d[f] = get_field_type(t)
    # Use marshmallow's functionality to create a schema from a dictionary
    return Schema.from_dict(d)
```
`get_field_type()` is pretty simple:
```python
def get_field_type(t):
    if t == "Integer":
        return fields.Integer()
    if t == "Float":
        return fields.Float()
    if t == "String":
        return fields.String()
    if t == "UUID":
        return fields.UUID()
    if t == "Number":
        return fields.Number()
```
(No, there's no error handling yet. Not even a default case... *sigh*.)
The `make_schema` function literally returns a `class` that I can use to convert objects that match the layout of the dictionary that I built. That's great... but what good is a `TimedTokenSchema` if I don't have a `TimedToken` class in the first place? Hm...
```python
@attr.s
class Base:
    pass

def make_base(name, fs, ts, schema):
    # Create a new class named `name` with Base as its only superclass.
    cls = type(name, (Base,), {})
    setattr(cls, "schema", schema)
    setattr(cls, "dump", lambda self: self.schema().dump(self))
    setattr(cls, "collection", "{}s".format(name.lower()))
    for f, t in zip(fs, ts):
        setattr(cls, f, attr.ib())
    return cls
```
The function `make_base()` does some heavy lifting for me. First, it uses the `type()` function in Python to dynamically generate a class. In this case, it will create a class with the name `TimedToken`, it will use `Base` as a superclass, and it will attach no attributes at time of creation. (I actually do not want to overwrite anything, because `attrs` does a lot of invisible work.)
The function `setattr` is, used casually, probably a bad thing. It literally reaches into a class (not an *object*, but a *class*) and attaches attributes to the class. If you're not used to metaprogramming, this is like... writing the code for the class on-the-fly.
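If `type()` and `setattr` are new to you, here is a tiny stdlib-only sketch of the two moves in isolation. The class name `Point` and its attributes are made up for the example.
```python
# type(name, bases, namespace) builds a class at run time.
Point = type("Point", (object,), {})

# Attach a class attribute and a method after the class already exists.
setattr(Point, "dimensions", 2)
setattr(
    Point,
    "describe",
    lambda self: "{} with {} dimensions".format(type(self).__name__, self.dimensions),
)

p = Point()
print(p.describe())   # Point with 2 dimensions
```
Because the lambda lives on the class, Python passes the instance in as `self` when it is called, just like a method written inside a `class` block.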
I attach three attributes:
* `schema` is a field that will hold a marshmallow `Schema` class. (Because, in Python, classes are objects too! Wait...) If you look back, you can see that I pass it in after creating it in `make_classes()`.
* `dump`, which is a function of zero arguments. It takes a reference to `self` (because this class will get instantiated as an object), and it instantiates the `schema` that I've stored, and then invokes `dump()` on... itself. This feels metacircular, but fortunately marshmallow knows to only look for fields that are in the schema. Therefore, we don't get an infinite traversal here.
* `collection`, which is so I can map directly into Mongo. I take the name of the class, lowercase it, and add an 's'. So, `TimedToken` becomes `timedtokens` as a collection name. I like the idea of the object knowing where it should be stored, so I don't have to think about it.
Once I have these things set up, I walk the fields, and add them to the class. For each, I add a (currently) untyped `attr.ib()` to the field. This way, the `TimedToken` class will act like a proper `attrs` class.
Finally, I return this class, which then gets attached (back in `make_classes()`) to the `globals()` namespace.
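Since the `attr.ib()` calls are untyped for now, here is a sketch of how the YAML type names might be carried over to `attrs`. It assumes `attrs` is installed. The mapping `PY_TYPES` and the function `make_record` are hypothetical, and the choice of `str` for `UUID` is only an assumption for the illustration.
```python
import attr

# Hypothetical mapping from the YAML type names to Python types.
PY_TYPES = {"String": str, "UUID": str, "Number": float, "Integer": int, "Float": float}

def make_record(name, fs, ts):
    cls = type(name, (object,), {})
    for f, t in zip(fs, ts):
        # attr.s finds these attr.ib() markers in the class namespace later.
        setattr(cls, f, attr.ib(type=PY_TYPES[t]))
    return attr.s(cls)

TimedToken = make_record(
    "TimedToken",
    ["email", "token", "created_at"],
    ["String", "UUID", "Number"],
)

tok = TimedToken(email="vaderd@empire.com", token="abc", created_at=1584576000.0)
print(tok)
print([(a.name, a.type) for a in attr.fields(TimedToken)])
```
The `type=` argument is metadata only; `attrs` records it but does not check values against it unless you add validators.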
## what?
If you like the thought of thinking about metaprogramming as much as I do, you're excited at this point. If you're wondering why I would do this... well, I'll go back to my REST handler for TimedTokens:
```python
import uuid
from time import time

from flask_restful import Resource, Api
import db.models as M
import db.db as DB

class Tokens(Resource):
    def post(self, email):
        # Create a UUID string
        tok = str(uuid.uuid1())
        # Create a TimedToken object, with a current timestamp
        t = M.TimedToken(email=email, token=tok, created_at=time())
        # Grab the correct collection in Mongo for tokens
        collection = DB.get_collection(M.TimedToken.collection)
        # Save the token into Mongo by dumping the token through marshmallow
        as_json = t.dump()
        collection.insert(as_json)
        # Return the token as JSON to the client
        return as_json

mapping = [
    [Tokens, "/token/"]
]

def add_api(api):
    for m in mapping:
        api.add_resource(m[0], m[1])
```
The function `create_classes(cfg)` is in the `db.models` module. I import that as `M`. Because I created classes in this module at the point that Flask was initialized, I now have a whole bunch of dynamically generated classes floating around in there. Those classes were generated *from a YAML file*, and can be used anywhere in the application.
```yaml
models:
  - name: TimedToken
    fields:
      - email
      - token
      - created_at
    types:
      - String
      - UUID
      - Number
```
To add a new class to my application, I add it to the YAML file, and restart Flask. This will call `create_classes` as part of the init, and the new class will be generated in the `db.models` module. I can then use those classes just as if I had written them out, by hand, duplicating the effort of defining both the `attrs` class and the marshmallow `Schema` class.
In my REST handler, this is where the metaprogramming comes into play:
```python
# Create a TimedToken object, with a current timestamp
t = M.TimedToken(email=email, token=tok, created_at=time())
# Grab the correct collection in Mongo for tokens
collection = DB.get_collection(M.TimedToken.collection)
# Save the token into Mongo by dumping the token through marshmallow
as_json = t.dump()
collection.insert(as_json)
# Return the token as JSON to the client
return as_json
```
I create the object. Then, I use the `collection` attribute to ask for a database connection to the collection that holds objects of this type (this is like a table in relational databases). Next, I convert the object to JSON by invoking the `.dump()` method, which was added dynamically. In fact, it is using a Schema class that was created dynamically as well, and then embedded in the enclosing object for later use. Finally, I insert this JSON into the Mongo database, and return it to the client, because both Mongo and the client speak JSON natively.
The result is that I've metaprogrammed my way around `attrs` and `marshmallow` to create a dynamic middleware layer that can marshal to-and-from JSON. In doing this, I've saved myself a large amount of boilerplate, and I have a single point of control/failure for all of my class definitions, which is external to the code itself. (I think I still need to add the marshalling *from* JSON, but that won't be hard.)
## what will you do with this, matt?
Personally, I haven't found anything on the net that eliminates the boilerplate in marshmallow. In the world of open source, I'd say this is an "itch" that I scratched. It might be an itch other people have.
Perhaps my next post will be about packaging code for `pip`?