Data Repository – A design pattern for data storage – Part 1

Data storage in applications is a complex task that requires careful consideration of various factors. These include the location and method of data storage, the metadata associated with it, data querying, access authorization, cost optimization, data auditing, and more.

For simple applications with a single reader and writer, using Binary Large OBject (BLOB) storage such as Azure Blob or Amazon S3, or plain disk storage, directly works well. Azure Blob also lets you store simple metadata¹ or tags² alongside the data, which can be queried to create custom collections.

However, when multiple applications need to read, write, and update the data, and an authorization system has to be overlaid on top, it becomes necessary to place something in front of the BLOB storage. A simple façade that also abstracts away the storage API is a good starting point, although you may quickly find the need to extend it.
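As a rough illustration of such a façade, the sketch below hides the storage backend behind one small class and adds a very naive write check. All names here (`BlobBackend`, `InMemoryBackend`, `StorageFacade`, the `writers` set) are hypothetical, and the in-memory backend stands in for a real Azure Blob, Amazon S3, or disk implementation.

```python
from typing import Protocol


class BlobBackend(Protocol):
    """Hypothetical minimal contract that any storage backend must satisfy."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryBackend:
    """Stand-in backend; a real one would wrap Azure Blob, Amazon S3, or disk."""

    def __init__(self) -> None:
        self._items = {}

    def put(self, key: str, data: bytes) -> None:
        self._items[key] = data

    def get(self, key: str) -> bytes:
        return self._items[key]


class StorageFacade:
    """Single entry point that hides the backend and checks write permission."""

    def __init__(self, backend: BlobBackend, writers: set) -> None:
        self._backend = backend
        self._writers = writers  # callers allowed to write (illustrative only)

    def write(self, caller: str, key: str, data: bytes) -> None:
        if caller not in self._writers:
            raise PermissionError(f"{caller} may not write {key}")
        self._backend.put(key, data)

    def read(self, key: str) -> bytes:
        return self._backend.get(key)
```

Consumers only ever see `read` and `write`, so swapping the backend does not touch them. The limits of this shape (no metadata, no lifecycle, no enrichment) are what motivate the fuller pattern.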

The implementation of a data repository pattern is recommended for several reasons:

Abstraction: It abstracts the specific implementation of file storage from ‘n’ other services.
Metadata Enrichment: It provides metadata enrichment from a single location.
Statelessness: It allows consumers to be more stateless.
Reduced Complexity: It reduces the complexity of consumers.
Data Consistency: It enforces data consistency.
Reduced Redundancy: It reduces data redundancy.
Lifecycle Management: It manages the data lifecycle easily.
Dynamic Collections: It dynamically creates collections of data.

The most significant advantage is the first one. Instead of implementing the same blob management structures in multiple languages and frameworks, you implement them once, and the ‘n’ services interact with a greatly simplified API.

The simplest data repository takes the form of the graph below.

flowchart TD
    C[c. Data Service] --- D[Website]
    C --- E[Phone]
    C --- F[Car]
subgraph ci[Data Repository]
    A[a. BLOB Storage] --- C
    B[b. Database] --- C
    G[e. Metadata Enrichment] -.- H
    G -.- C
    H[d. Message Queue] -.- C
end

The components of this data repository are:

BLOB Storage: This is the service, infrastructure, or system to store binary files. This could be Azure Blobs, Amazon S3, Disk Drives, or any location where a file can be stored.
Database: This is the service, infrastructure, or system to store the metadata. This could be a relational DB such as PostgreSQL, Microsoft SQL Server, or MariaDB, or it could be a document DB such as MongoDB, CouchDB, or Azure Cosmos DB.
Data Service: This is the service that consumer applications interact with. It is the external-facing API surface area that provides the façade, or interface³, to the operations and processes behind it.
Message Queue: This is a service such as RabbitMQ that queues up new data received by the Data Repository for metadata enrichment.
Metadata Enrichment: These are bespoke services that consume from the message queue. Enrichment can range from a single service that does something as simple as averaging the colour of an image to ‘n’ Machine Learning models that describe, classify, or create a vector representation of the data.

The Data Service abstracts all of these so that downstream services do not need to be concerned about the specific implementation details.
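To make the interplay of the five components concrete, here is a minimal in-process sketch. It is a toy under stated assumptions, not a reference implementation: a dict stands in for BLOB storage, SQLite for the database, `queue.Queue` for a broker such as RabbitMQ, and the enrichment simply records the payload size. Every class, method, and field name is hypothetical.

```python
import queue
import sqlite3
import uuid


class DataRepository:
    """Toy wiring of the components; every name here is hypothetical."""

    def __init__(self) -> None:
        self.blobs = {}                         # a. BLOB storage stand-in
        self.db = sqlite3.connect(":memory:")   # b. metadata database
        self.db.execute(
            "CREATE TABLE metadata (data_id TEXT, name TEXT, value TEXT)"
        )
        self.pending = queue.Queue()            # d. message queue stand-in

    def upload(self, data: bytes, content_type: str) -> str:
        """c. Data service entry point: store, record metadata, enqueue."""
        data_id = str(uuid.uuid4())
        self.blobs[data_id] = data
        self.set_metadata(data_id, "content_type", content_type)
        self.pending.put(data_id)  # ask for enrichment later
        return data_id

    def set_metadata(self, data_id: str, name: str, value: str) -> None:
        self.db.execute(
            "INSERT INTO metadata VALUES (?, ?, ?)", (data_id, name, value)
        )

    def metadata(self, data_id: str) -> dict:
        rows = self.db.execute(
            "SELECT name, value FROM metadata WHERE data_id = ?", (data_id,)
        )
        return dict(rows.fetchall())

    def find(self, name: str, value: str) -> list:
        """Dynamic collection: ids of all items whose metadata matches."""
        rows = self.db.execute(
            "SELECT data_id FROM metadata WHERE name = ? AND value = ?",
            (name, value),
        )
        return [row[0] for row in rows]


def enrich_pending(repo: DataRepository) -> int:
    """e. Enrichment worker: drain the queue and add a derived field."""
    handled = 0
    while not repo.pending.empty():
        data_id = repo.pending.get()
        size = len(repo.blobs[data_id])
        repo.set_metadata(data_id, "size_bytes", str(size))
        handled += 1
    return handled
```

A consumer would call `upload`, and later `metadata` or `find`, without knowing which storage, database, or queue sits behind the service. In a real deployment the enrichment worker would run as a separate process consuming from the broker.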

However, removing direct access to the blob storage can lead to issues if not handled correctly. For instance, if the Valet Pattern⁴ (often called the Valet Key pattern) is not implemented, every piece of data a consumer requests must first be downloaded (or proxied) via the data service. With the valet pattern, the consumer instead requests a pre-signed URL and downloads the data directly from storage using only that URL. This is hugely powerful, as simple HTTP requests can be made without needing specific packages for Azure Blobs, Amazon S3, or others.

An implementation of the Valet Pattern can be seen in the post "How to use dotnet and Traefik to connect a legacy PHP application to AWS S3".
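The idea behind a pre-signed URL can be sketched with a plain HMAC signature, as below. This only illustrates the principle and is not the format used by Azure SAS tokens or Amazon S3 pre-signed URLs; the host `storage.example.com`, the `expires` and `sig` parameters, and the shared `SECRET` are all assumptions made for the example.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # assumption: a key known only to the storage side


def _signature(path: str, expires: int) -> str:
    """Sign the path together with its expiry so neither can be altered."""
    message = f"{path}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def presign(path: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Issued by the data service: a URL valid for ttl_seconds."""
    issued = now if now is not None else time.time()
    expires = int(issued) + ttl_seconds
    query = urlencode({"expires": expires, "sig": _signature(path, expires)})
    return f"https://storage.example.com{path}?{query}"


def is_valid(url: str, now: Optional[float] = None) -> bool:
    """Checked by the storage side before serving the bytes."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        supplied = params["sig"][0]
    except (KeyError, ValueError):
        return False  # missing or malformed parameters
    current = now if now is not None else time.time()
    if current > expires:
        return False  # link has expired
    expected = _signature(parsed.path, expires)
    return hmac.compare_digest(supplied, expected)
```

The consumer only needs an HTTP client to use the resulting URL. The data service stays out of the download path, while the signature and expiry keep the grant narrow and short-lived.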

Another potential issue is failing to consider the identity of the user or entity that uploads the data. This will have implications later, as all data in your system has ownership, usage, and licensing applied to it. It becomes a more significant consideration when the data is sensitive in nature or has privacy implications.
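One way to avoid losing this information is to refuse uploads that do not carry an identified owner, and to store ownership alongside the rest of the metadata. The sketch below assumes a hypothetical `UploadRecord` shape; the field names and the licence string are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class UploadRecord:
    """Hypothetical ownership metadata captured at upload time."""

    data_id: str
    owner_id: str      # user or entity that uploaded the data
    license: str       # terms under which the data may be used
    sensitive: bool    # flags data with privacy implications
    uploaded_at: datetime


def record_upload(
    data_id: str, owner_id: str, license: str, sensitive: bool = False
) -> UploadRecord:
    """Build the record, rejecting uploads with no identified owner."""
    if not owner_id:
        raise ValueError("every upload needs an identified owner")
    return UploadRecord(
        data_id, owner_id, license, sensitive, datetime.now(timezone.utc)
    )
```

Capturing this at the point of upload is far cheaper than trying to reconstruct ownership after the fact.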

Part 2 will contain additional considerations to be made during the implementation of a Data Repository.