Failsafing Multitenancy in SQLAlchemy
July 6, 2016
python
sqlalchemy
development
At Cratejoy, we run a hosted, multi-tenant application, and for a number of reasons certain data is shared between our separate merchants. Merchant-specific data should never cross between merchants, for obvious reasons, yet our setup made it easy for simple engineering oversights to cause exactly that. We traced several data leaks back to our use of SQLAlchemy's joinedload on backrefs of centralized models. That is, we have a centralized model with a child relation to something that is store-specific; when we joinedload the child models, the collection includes children across all merchants, not just the one we care about at the time. While we could manually filter this in most cases, we take these data leakages very seriously, and we do not want a single oversight by an engineer to lead to a leak. We took some time to develop a failsafe system that should prevent data leakage between stores, even when an explicit filter is missing.
For this walkthrough, we'll use the following two models:
## models.py
class Customer(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(50))
    addresses = db.relationship('Address',
                                backref='customer',
                                collection_class=FilteredList)


class Address(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    email = db.Column(db.String(50))
    customer_id = db.Column(db.Integer, db.ForeignKey('customer.id'))
    merchant_id = db.Column(db.Integer, db.ForeignKey('merchant.id'))
Here we have a Customer who can have multiple addresses, across multiple stores. However, accessing customer.addresses will give us addresses from both stores, which is not what we want. Luckily for us, the majority of client-specific tables already have a merchant_id attribute, and we were able to reduce this problem to a single statement: "If something is loaded through a SQLAlchemy backref during a request, and it has a merchant_id, its merchant_id should always be the same as the merchant that the request is for". From this problem statement, we were able to build a custom collection class for our SQLAlchemy backrefs, which explicitly checks this and transparently removes models that were incorrectly loaded.
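To make the leak concrete, here is a hypothetical query, without any safeguard in place, that eagerly loads the backref. It's only a sketch: db is the Flask-SQLAlchemy instance, and customer_id stands in for whatever the request provides.

from sqlalchemy.orm import joinedload

# Load a customer and eagerly load its addresses in one query
customer = (db.session.query(Customer)
            .options(joinedload(Customer.addresses))
            .filter(Customer.id == customer_id)
            .one())

# Without a safeguard, this collection holds Address rows for every
# merchant the customer exists on, not just the current request's merchant
print([a.merchant_id for a in customer.addresses])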
A custom collection_class must implement the list interface; SQLAlchemy uses it to hold the collection of loaded child models. With this in mind, we can create the following collection_class:
## filteredlist.py
from sqlalchemy.orm.collections import InstrumentedList


class FilteredList(InstrumentedList):
    """
    List-like collection allowing us to prevent cross-account
    data leakage from centralized resources. Works by
    preventing non-matching objects from being added to
    backref containers.

    e.g.:
        attr_name = u'merchant_id'
        attr_val = 1

    will prevent anything with a `merchant_id` attribute
    whose value does not equal `1` from being added to a
    FilteredList instance.

    Use by setting the `collection_class` attribute on
    SQLAlchemy backrefs.
    """

    # Class-level filter attributes, shared by all instances
    attr_name = None
    attr_val = None

    @classmethod
    def enable_filter(cls, filter_name, filter_val):
        """
        'Activate' the filtering using the provided attribute
        name and value.
        """
        cls.attr_name = filter_name
        cls.attr_val = filter_val

    @classmethod
    def reset_filter(cls):
        """
        'Deactivate' the filtering.
        """
        cls.attr_name, cls.attr_val = None, None

    def append(self, item):
        """
        Append iff:
          - the filters are not set, or
          - the object matches the filters
        """
        if FilteredList.attr_name and FilteredList.attr_val:
            if hasattr(item, FilteredList.attr_name):
                object_val = getattr(item, FilteredList.attr_name)
                if (object_val is not None and
                        object_val != FilteredList.attr_val):
                    # Prevent actually appending, effectively dropping
                    # the non-matching object
                    return
        super(FilteredList, self).append(item)
As you can see, this checks the merchant_id against a class-level value on each append; only if the merchant_id matches do we actually add the item to the list. In our case we're scoping the filtering to a given request, so the ID to match must be explicitly set at the beginning of each request. We do this in a pre-request hook (shown below), though your usage may vary.
## register_filter_hooks.py
def register_hooks(app):
    @app.before_request
    def _before_request():
        # Scope the filter to the merchant handling the current request;
        # `cur_store` is our request-local store/merchant object
        FilteredList.enable_filter(u'merchant_id', cur_store.id)

    @app.after_request
    def _after_request(response):
        FilteredList.reset_filter()
        return response
Note that this works for us since each request is synchronous and blocking, but this would have to be modified slightly to work in an asynchronous environment where a single thread could be handling multiple requests simultaneously.
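If you do need this in such an environment, one option (on Python 3.7+) is to keep the filter in a contextvars.ContextVar instead of class attributes, so each request only ever sees its own filter. A minimal sketch of that idea, using a hypothetical ContextFilteredList, might look like:

import contextvars

from sqlalchemy.orm.collections import InstrumentedList

# Each request (thread or asyncio task) gets its own copy of this value
_backref_filter = contextvars.ContextVar('backref_filter', default=(None, None))


class ContextFilteredList(InstrumentedList):
    """Like FilteredList, but the filter is context-local instead of global."""

    @classmethod
    def enable_filter(cls, filter_name, filter_val):
        _backref_filter.set((filter_name, filter_val))

    @classmethod
    def reset_filter(cls):
        _backref_filter.set((None, None))

    def append(self, item):
        attr_name, attr_val = _backref_filter.get()
        if attr_name and attr_val and hasattr(item, attr_name):
            object_val = getattr(item, attr_name)
            if object_val is not None and object_val != attr_val:
                # Drop the non-matching object for this context only
                return
        super(ContextFilteredList, self).append(item)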
Now you’re probably thinking something like:
> But wait Nick, this seems very inefficient. You're loading things that you know you don't want, and then filtering them afterwards. Why not just filter them specifically?!
You would be right. This is very inefficient, and if we load a lot of unrelated data it will be very slow. New code should always filter explicitly (there's a quick sketch of that after the list below), and oversights like this are something we always look for in code review. This method is meant as our fallback for two cases:
- Both the engineer writing the code and the reviewers overlook bad data being loaded.
- Situations where explicit filtering isn't possible or practical. This is the case with certain libraries we use, where rewriting and refactoring their usage would have taken much longer and been more error-prone.
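For reference, explicit filtering in request code looks something like the following sketch; customer and cur_store stand in for whatever your request already has in hand.

# Explicitly scope the children to the current merchant instead of
# relying on the backref collection
addresses = (db.session.query(Address)
             .filter(Address.customer_id == customer.id,
                     Address.merchant_id == cur_store.id)
             .all())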
This approach actually works really well. By modifying the FilteredList implementation above slightly, we can add logging to identify when things are being loaded that shouldn’t be, and go back and explicitly fix it.
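A sketch of that logging tweak, assuming the standard library logging module and subclassing the FilteredList defined above, could be as small as:

import logging

from filteredlist import FilteredList

logger = logging.getLogger(__name__)


class LoggingFilteredList(FilteredList):
    """FilteredList that also logs whenever it is about to drop an object."""

    def append(self, item):
        name, val = FilteredList.attr_name, FilteredList.attr_val
        if name and val and hasattr(item, name):
            object_val = getattr(item, name)
            if object_val is not None and object_val != val:
                # Surface the leak so the offending query can be fixed explicitly
                logger.warning('Dropping %r: %s=%r does not match %r',
                               item, name, object_val, val)
        # The parent class still performs the actual filtering
        super(LoggingFilteredList, self).append(item)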
One final thing: a transient bug was later discovered with this implementation. In the event that objects were being excluded by the filters, sometimes (but only sometimes) data that we actually wanted was also being dropped. More information, as well as a fix, can be found here.
Hopefully this works out for you, or was at least helpful. If you have any questions, don’t hesitate to shoot me an email, or follow me on twitter @nrmitchi.