VDR Platform - S3 Connector

My most recent project has been to develop a proof of concept for a common integration use case in the LegalTech world. Virtual Data Rooms (VDRs) facilitate M&A transactions, and VDR platforms allow law firms to run many transactions at once, all managed from a single SaaS platform.

Crucially, VDR platforms generally charge by the amount of data stored, so an integration which can provide efficient transfer out of the system is a valuable tool for system administrators. At a very high level, this app sits as a connector between a VDR platform and a cloud storage solution (in this case, Amazon S3).

The Django app itself provides a basic dashboard for searching and viewing metadata associated with these data rooms (also termed 'sites'). Individual sites can have their details viewed, and from there the file and folder structures of these sites can be downloaded onto the server, as well as to the connected remote storage. Individual sites can also be 'soft' deleted (i.e. sent to the recycle bin) or even 'hard' (irretrievably) deleted. It's supposed to be a 'one-stop shop' for managing data rooms post-transaction: the folder structures are archived in another solution and the data is purged from the VDR service.

Points of discussion…

Service Layer

Whilst there are many debates within the Django world which never seem to die off (FBV vs CBV, anyone?), I don't suppose there is any contention around the notion that Views should be 'thin'. Often this involves pushing a lot of the business logic right down to the model layer, which isn't always ideal (and could be said to be trampling all over the 'single responsibility principle').

When consuming an external service in an app like the one I've built here, it's doubly an issue because:

  1.  The app is a 'connector', so we don't REALLY have a model layer to dump logic into; we are not really persisting any data at our end, just grabbing it from one place and slinging it to another.
  2.  There is A LOT to do: make the API call, check the response code, parse the body of the response, transform the data, make another API call, and so on. A View which encapsulated all of this could easily become an unreadable nightmare.

Could a 'Service Layer' come to the rescue here, and what is a service layer anyway? The HackSoft Styleguide defines a service layer as:

"The service layer speaks the specific domain language of the software, can access the database & other resources & can interact with other parts of your system."

and goes on to say:

"A service can be:

  • A simple function.
  • A class.
  • An entire module.
  • Whatever makes sense in your case."

That feels as though I've been given a lot of wriggle room to define a service layer however I see fit, so long as the Views stay 'view-y' and the Models stay 'model-y' (their diagram couldn't really make it any clearer).

In the end I settled on groups of functions, each of which does 'one thing', before passing off to the next layer.

core/
|-- http_handlers/
|   |-- some python files...
|-- data_parsers/
|   |-- etc...
|-- data_classes/
|   |-- etc…

The http_handler functions only make a call to the external service; they then hand the response body off to the data_parser functions, which populate data_classes that are eventually returned to the view. In the code examples below I've omitted comments and some of the code for simplicity, but it should be clear what's going on. This is the 'flow' through the service layers when a user wants to see the details of a virtual data room.

core/views.py

...

def site_detail_view(request, id: int):
    site = get_single_site(request.user, id)
    return TemplateResponse(request, "site_detail.html", {"context_data": site})

...

core/http_handlers/site_http_handlers.py

...

def get_single_site(request_user, id: int):
    VDR_BASEURL = get_setting("remote_system_base_url")

    access_token = SocialToken.objects.get(account__user=request_user)
    url = f"{VDR_BASEURL}/sites/{id}"
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Accept": "application/json",
    }

    response = requests.get(url, headers=headers)

    # omitting some error handling here for brevity

    return parse_get_single_site(response.json())

...
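That elided error handling would sit between the request and the parse. The VDRServiceError class turns up again in the tests further down; as a rough sketch (the field names here are my shorthand, not the real ones), the branch might look like:

if response.status_code != 200:
    # hand back an error DTO instead of attempting to parse a failed response
    return VDRServiceError(error=1, message=response.text)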

core/data_parsers/site_data_parsers.py

...

def parse_get_single_site(json) -> VDRSiteDetail:
    # omitting some logic for brevity (the dates, description and admin notes
    # are pulled out and tidied up before the object is built)

    site = VDRSiteDetail(
        id=json["id"],
        name=json["sitename"],
        owner_email=json["siteowner"]["email"],
        owner_name=json["siteowner"]["firstname"] + " " + json["siteowner"]["lastname"],
        created_date=created_date,
        status=json["status"],
        active_document_size=json["rawsitesize"]["activedocumentsize"],
        deleted_document_size=json["rawsitesize"]["deleteddocumentsize"],
        total_size=json["rawsitesize"]["totalsize"],
        site_root_folder_id=json["sitefolderID"],
        description=site_description,
        administrator_notes=admin_note,
        start_date=start_date,
        archived_date=archived_date,
        bidder_site=int(json["biddersite"]["enable"]),
        modules=parse_site_modules(json["module"]),
        categories=parse_site_categories(json["categories"]),
        password_protected=json["siteRestrictionType"]["passwordprotected"],
        two_factor_authentication=json["siteRestrictionType"]["twoFactorAuthentication"],
        terms_and_conditions=json["siteRestrictionType"]["termsandconditions"],
        ip_restriction=json["siteRestrictionType"]["iprestrictedsite"],
        digital_rights_management=json["siteRestrictionType"]["drm"],
        error=False,
    )

    return site

...

Data Transfer Objects

As mentioned above, beyond some config we are not really persisting anything at the application level, so 'Models' as we understand them in Django are less of a concern here. More relevant is the notion of a 'Data Transfer Object', as we are passing data between entities. There is a good article on Python and DTOs HERE.

Accessing attributes from deeply nested objects is a bit fiddly and ugly in Python. Depending on the complexity of whatever data structure the API you are calling gives you back, it's not uncommon to see things like:

thing = json["foo"][0]["bar"][1]["fizz"]

It's best to have the data_parsing layer deal exclusively with this kind of grubbing about, and define something far more legible and elegant that your code can actually make use of.
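Taking the owner's name from the parser above as an example (with payload standing in for the raw response JSON), the payoff for calling code looks like this:

# before: fiddly nested access leaks into every caller
owner_name = payload["siteowner"]["firstname"] + " " + payload["siteowner"]["lastname"]

# after: the parser grubs about once and callers get clean attributes
site = parse_get_single_site(payload)
owner_name = site.owner_name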

For this project, I've employed the popular Pydantic library. Not only does it give us a less cluttered interface when defining objects, but we also get data validation thrown in for free!
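A toy illustration of that validation (nothing to do with the app itself): Pydantic will happily coerce sensible values to the declared types, but raises a ValidationError the moment something nonsensical arrives.

from pydantic import BaseModel, ValidationError

class Toy(BaseModel):
    id: int
    name: str

Toy(id="42", name="Site A")  # fine: "42" is coerced to the int 42

try:
    Toy(id="not-a-number", name="Site A")
except ValidationError as exc:
    print(exc)  # explains that id is not a valid integer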

Here is the site detail object, relevant to the code examples above:

... 

class VDRSiteDetail(BaseModel):
    id: int
    name: str
    description: str
    administrator_notes: str
    owner_email: str
    owner_name: str
    created_date: str
    start_date: str
    archived_date: str
    status: str
    active_document_size: int
    deleted_document_size: int
    total_size: int
    site_root_folder_id: int
    bidder_site: bool
    error: bool = False
    modules: List[VDRSiteModule]
    categories: List[VDRSiteCategory]
    password_protected: bool
    two_factor_authentication: bool
    terms_and_conditions: bool
    ip_restriction: bool
    digital_rights_management: bool

... 


Background Tasks

In my experience (supporting a VDR Platform), users tend to put in an amount of data which could be described as 'big-medium' (!?). Think tens of GBs. Nothing crazy, but also something which will take long enough to download that you'll want to do it outside of the request-response cycle. Obviously, Celery is the Python world's go-to for background tasks. I want users of my app to be able to identify the sites they want to either download or delete, and then have those operations run by a background worker process whilst the frontend of my app polls to see whether or not it's finished.
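The polling end of that is just a view the frontend can hit. The details below are a sketch rather than the real code: it assumes a Report model (the report_id threaded through the tasks further down suggests something like it) with a flag the workers flip when they finish.

from django.http import JsonResponse

from .models import Report  # hypothetical model, updated by the workers

def report_status_view(request, report_id: int):
    report = Report.objects.get(id=report_id)
    return JsonResponse({"finished": report.finished})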

There is a good tutorial on this HERE. My integration uses Celery groups to run its 'local' (i.e. on the server) download and the streaming of the files to S3 in parallel, roughly as sketched below.
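Something along these lines, using the recursive task shown further down; stream_site_to_s3_task is an assumed name for the S3 side of the job.

from celery import group

def start_site_transfer(user_id: int, root_folder_id: int, report_id: int) -> None:
    # run the local build and the S3 stream side by side as one group
    group(
        recursive_site_builder_task.s(user_id, root_folder_id, report_id=report_id),
        stream_site_to_s3_task.s(user_id, root_folder_id, report_id=report_id),
    ).apply_async()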

The Celery tasks are also recursive, allowing us to easily 'walk down' a VDR's folder structure:

@shared_task()
def recursive_site_builder_task(
    request_user_id: int,
    folder_id: int,
    local_path=None,
    vdr_path=None,
    report_id: int = None,
) -> None:
    current_folder = FolderContentsForLocal(
        user=request_user_id,
        folder_id=folder_id,
        local_path=local_path,
        vdr_path=vdr_path,
        report_id=report_id,
    )
    current_folder.prepare_folder()
    current_folder.write_folder_to_local()

    if current_folder.has_files():
        current_folder.iterate_over_and_write_files_to_local()

    if current_folder.has_subfolders():
        for folder in current_folder.subfolders.subfolder_list:
            recursive_site_builder_task.delay(
                current_folder.user.id,
                folder.id,
                current_folder.local_path,
                current_folder.vdr_path,
                current_folder.report_id,
            )

In the above code, we have a FolderContentsForLocal object which does a lot of our heavy lifting. After calling a prepare_folder method (just to check we have all the detail we need to perform the different actions), we create the folder on the local server. If there are files in the folder, we download all of those, and if there are subfolders, we recursively call the same Celery task on each of them. Recursion is made easier in this instance, as there will always be a termination case (eventually we will hit a folder with no subfolders!).


Mocking & Monkey Patching


Alongside Celery and Pydantic, Pytest must be one of the most ubiquitous packages in Python web development. The testing framework makes for easy mocking and monkey patching, which is crucial in a layered architecture for testing components in isolation, especially when some layers will be calling external APIs.

Here is an example of a Pytest fixture, which returns the type of JSON response we know the external API will give: 

import json

import pytest

@pytest.fixture
def vdr_site_list_json_response():
    with open("tests/test_utilities/json_files/site_list.json", "r") as f:
        return json.load(f)

Its usage in an actual test: 

@pytest.mark.django_db
def test_parse_get_all_sites(vdr_site_list_json_response):
    result = parse_get_all_sites(vdr_site_list_json_response)
    assert isinstance(result, VDRSiteList)

Testing an error response from the external service is just as straightforward. Note we are monkeypatching the retrieval of the authorization token as well as the network call:

@pytest.mark.django_db
def test_get_single_folder_details_error_response(
    monkeypatch,
    generic_user,
    mock_get_bearer_token,
    mock_object_with_error_response,
    remote_system_settings,
):

    monkeypatch.setattr(SocialToken.objects, "get", mock_get_bearer_token)
    monkeypatch.setattr(requests, "get", mock_object_with_error_response)

    result = file_and_folder_http_handlers.get_single_folder_details(generic_user, 4)
    assert isinstance(result, VDRServiceError)
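The two mock fixtures referenced there aren't shown in this post, but their shape is easy enough to guess at; the bodies below are assumptions, only the names come from the test.

@pytest.fixture
def mock_get_bearer_token():
    # stands in for SocialToken.objects.get; just returns a fake token
    def _fake_get(*args, **kwargs):
        return "fake-access-token"
    return _fake_get

@pytest.fixture
def mock_object_with_error_response():
    # stands in for requests.get; returns an object quacking like a failed Response
    class _FakeResponse:
        status_code = 500
        def json(self):
            return {"error": "Internal Server Error"}
    def _fake_get(*args, **kwargs):
        return _FakeResponse()
    return _fake_get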


Boto3

The Boto3 library is AWS's Python toolkit. I've made use of it here because my integration streams the files to an S3 bucket. Crucially, other cloud service providers often make their APIs interoperable with AWS's, so for instance, it wouldn't be too difficult to swap S3 out for DigitalOcean Spaces, or to use another provider's SDK.
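The streaming itself boils down to boto3's upload_fileobj, which takes any file-like object and handles the multipart upload for you. A minimal sketch, reusing the app's get_setting helper from earlier (the credential key names are assumptions):

import boto3

def stream_file_to_s3(file_obj, bucket: str, key: str) -> None:
    s3 = boto3.client(
        "s3",
        aws_access_key_id=get_setting("aws_access_key_id"),
        aws_secret_access_key=get_setting("aws_secret_access_key"),
    )
    # streams the file-like object without buffering the whole thing in memory
    s3.upload_fileobj(file_obj, bucket, key)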

API Proxy

As detailed in this blog post I made, the API endpoints in the main application effectively obfuscate the 'real' API which I need to call. I have created a proxy using the micro-framework Bottle. In the long term this will serve to make the connector more generic: the 'stub' endpoints can remain the same, but the proxy can be switched out depending on the VDR provider.
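A stripped-down sketch of one proxy route (the upstream URL and route shape are assumptions, not the real proxy code):

from bottle import Bottle, HTTPResponse, request
import requests

app = Bottle()
REAL_VDR_API = "https://real-vdr-provider.example.com/api"  # assumed upstream

@app.route("/sites/<site_id:int>")
def proxy_single_site(site_id):
    # pass the stub request through to the real endpoint, auth header and all
    upstream = requests.get(
        f"{REAL_VDR_API}/sites/{site_id}",
        headers={"Authorization": request.headers.get("Authorization", "")},
    )
    return HTTPResponse(
        body=upstream.content,
        status=upstream.status_code,
        headers={"Content-Type": upstream.headers.get("Content-Type", "application/json")},
    )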

Singleton Object for Settings

Rather than have the settings ('base' URL for API calls, AWS credentials, etc.) be configured by the developers as environment variables, we want the users of this integration to be able to set them. After all, they may want to update them in the future to point to a different instance of the VDR Platform, or a different provider altogether. Having said this, we need this configuration to make the external calls, so rather than query the DB each time, we can use Memcached to cache the settings in memory for faster retrieval. The Django Solo package accommodates these settings being a 'Singleton Object': there will only ever be one settings object stored, and the database won't become unnecessarily 'clogged' should users keep switching the settings.
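In outline it looks something like this; the model and field names are assumptions (modelled on the settings used elsewhere in this post), and get_setting, seen in the http handler earlier, presumably wraps django-solo's cached get_solo():

from django.db import models
from solo.models import SingletonModel

class RemoteSystemSettings(SingletonModel):
    remote_system_base_url = models.URLField(blank=True, default="")
    aws_access_key_id = models.CharField(max_length=255, blank=True)
    aws_secret_access_key = models.CharField(max_length=255, blank=True)

def get_setting(name: str):
    # get_solo() serves the single row from the cache when SOLO_CACHE is configured
    return getattr(RemoteSystemSettings.get_solo(), name)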

Django AllAuth Custom Provider

The Django AllAuth package's ability to accommodate a custom auth provider is something I've written about in this blog post. It not only allows us to use the VDR for authentication, but also takes care of a lot of the token management required for the API calls the app needs to make.

The GitHub repository can be viewed HERE.
