Data Governance of Data Mesh Architecture

October 24, 2022

Data Governance and Data Mesh Architecture

Data products have a long life cycle, so they need data governance throughout their lifetime. This blog describes how data can be governed in compliance with global policies over time. The goal of a successful data governance strategy is to convert data governance policies into code and embed them in each data product. These embedded policies are activated and applied at the appropriate time in the product’s life cycle. A good illustration is the in-memory encryption of certain data types when a policy embedded in the data product as code demands it. ‘Policy as code’ means that policies are versioned, tested, executed and observed just like code. Policies, like data and data schemas, change over time.
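To make ‘policy as code’ more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the names FieldPolicy, pseudonymize and apply_policies are hypothetical, and a SHA-256 digest stands in for real encryption, which would need a proper cryptography library and key management.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FieldPolicy:
    """A versioned rule for one field of a data product (hypothetical)."""
    name: str
    version: str                      # policies are versioned like code
    applies_to: str                   # field name the rule targets
    transform: Callable[[str], str]   # action taken when the rule fires


def pseudonymize(value: str) -> str:
    # Stand-in for real protection such as encryption: a one-way digest.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def apply_policies(record: Dict[str, str],
                   policies: List[FieldPolicy]) -> Dict[str, str]:
    """Return a copy of the record with every matching policy applied."""
    out = dict(record)
    for policy in policies:
        if policy.applies_to in out:
            out[policy.applies_to] = policy.transform(out[policy.applies_to])
    return out


# The policy list ships with the data product and can be unit-tested.
POLICIES = [FieldPolicy("protect-email", "1.2.0", "email", pseudonymize)]

record = {"customer_id": "42", "email": "ada@example.com"}
protected = apply_policies(record, POLICIES)
```

Because the policy is an ordinary object with a version number, it can be reviewed, tested and rolled out in the same way as the rest of the data product’s code.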

Data Governance Design

Since data governance is a universal and ongoing concern, it is advisable to embed this capability within the data platform. It should be exposed in the mesh experience plane so that users can easily configure policies and control the right to be forgotten for data products across the mesh.

Policies are of two types: domain-specific and domain-agnostic. A data platform should be able to handle both, because a particular domain may sometimes require a different level of data governance in order to comply with a governing body’s security standards.

Standardized Data Governance Policies

Policy standardization is necessary to express, configure and enforce policies consistently across all data products. Every data product has a policy that forms a part of its interface, so standardizing how policies are expressed and enforced simplifies the design. For example, the access control policy that defines and enforces who can access what data requires the identification of data users, whether they are human or machine users and whether they are on the mesh or off it. If user identity cannot be verified in a standardized way, there is little hope of enabling data sharing across multiple domains. The more standardized the identification method, the lower the friction and cost of sharing data. This may seem like common sense, yet data management systems have still to standardize identity and access control.
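As a rough sketch of what a standardized access control policy could look like, the Python below models one identity shape for every user and one rule shape for every data product. The names Principal, AccessRule and can_read, and the rule that off-mesh users are denied unless explicitly allowed, are assumptions for illustration and not part of any data mesh specification.

```python
from dataclasses import dataclass
from enum import Enum


class PrincipalKind(Enum):
    HUMAN = "human"
    MACHINE = "machine"


@dataclass(frozen=True)
class Principal:
    """One identity shape for every data user (hypothetical)."""
    subject: str          # verified identifier, e.g. from a token
    kind: PrincipalKind   # human or machine user
    on_mesh: bool         # whether the user is on the mesh or off it
    domain: str           # domain the user belongs to


@dataclass(frozen=True)
class AccessRule:
    """Access policy published as part of a data product's interface."""
    data_product: str
    allowed_domains: frozenset
    allow_off_mesh: bool = False


def can_read(principal: Principal, rule: AccessRule) -> bool:
    # Off-mesh users are denied unless the rule opts in explicitly.
    if not principal.on_mesh and not rule.allow_off_mesh:
        return False
    return principal.domain in rule.allowed_domains


rule = AccessRule("orders", frozenset({"sales", "finance"}))
analyst = Principal("alice", PrincipalKind.HUMAN, True, "finance")
batch_job = Principal("etl-7", PrincipalKind.MACHINE, False, "sales")

analyst_allowed = can_read(analyst, rule)
batch_job_allowed = can_read(batch_job, rule)
```

Since every data product evaluates the same Principal shape, a user verified once can be checked by any domain without per-product identity mapping.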

Encryption

As confidential computing evolves, encryption methods are becoming more advanced and more frequently used. Depending on the security policy of a data product, data may be encrypted in transit, at rest and in storage. Unlike access control policies, data encryption methods are fairly well standardized, because encryption always involves data travelling between two parties, and without a standard encryption protocol this exchange would be impossible. Data Mesh can build on this standardization by capturing the nature, level and method of encryption in the data product’s discovery interfaces and schemas.
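One way to picture encryption details in a discovery interface is as metadata attached to each field of the schema. The sketch below is only an illustration: EncryptionInfo, FieldSchema and describe are hypothetical names, and the scope, level and method values are example strings, not a standard vocabulary.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EncryptionInfo:
    """Nature, level and method of encryption (hypothetical vocabulary)."""
    scope: str    # e.g. "in_transit" or "at_rest"
    level: str    # e.g. "field" or "dataset"
    method: str   # e.g. "AES-256-GCM"


@dataclass(frozen=True)
class FieldSchema:
    name: str
    type: str
    encryption: tuple = ()   # zero or more EncryptionInfo entries


def describe(product_name: str, fields: list) -> str:
    """Render the schema, with encryption metadata, for discovery."""
    doc = {
        "data_product": product_name,
        "fields": [asdict(f) for f in fields],
    }
    return json.dumps(doc, indent=2)


catalog_entry = describe("customers", [
    FieldSchema("customer_id", "string"),
    FieldSchema("email", "string",
                (EncryptionInfo("at_rest", "field", "AES-256-GCM"),)),
])
```

A consumer browsing the mesh could then see, before requesting access, which fields are encrypted and how.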

Access Control and Identity

The need to standardize the rules that define and verify identity becomes even more acute in a distributed architecture like data mesh, where data is shared across domains and organizations. Moreover, this standardization has to be technology-agnostic and independent of the cloud provider.

While standardization of identity-related policies for accessing operational data is well established, it lags behind for analytical data access. Standardization in operational data resulted from the move towards distributed APIs (end-user applications and services), which produced standards such as the OpenID Connect protocol, JSON Web Token (JWT) validation, X.509 certificates for services, and SPIFFE.
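To show what one of these operational standards involves, the sketch below signs and validates a JWT with the HS256 algorithm using only the Python standard library. It is a teaching sketch under stated assumptions: the sign and validate helpers are written for this post, only the exp claim is checked, and a production system would normally rely on a maintained JWT library and asymmetric keys.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_encode(raw: bytes) -> str:
    # JWTs use URL-safe base64 without padding.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(claims: dict, secret: bytes) -> str:
    """Build a compact HS256 token: header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [
        _b64url_encode(json.dumps(p, separators=(",", ":")).encode("utf-8"))
        for p in (header, claims)
    ]
    signing_input = ".".join(parts)
    sig = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256)
    return signing_input + "." + _b64url_encode(sig.digest())


def validate(token: str, secret: bytes, now: float = None) -> dict:
    """Return the claims if signature and expiry check out, else raise."""
    pieces = token.split(".")
    if len(pieces) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, sig_b64 = pieces
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    signing_input = (header_b64 + "." + payload_b64).encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if "exp" in claims and claims["exp"] <= current:
        raise ValueError("token expired")
    return claims


secret = b"demo-secret"   # illustrative only; never hard-code real keys
token = sign({"sub": "svc-orders", "exp": 2000}, secret)
claims = validate(token, secret, now=1000)
```

The same token format can identify a human or a service, which is part of why it has become a common building block for operational access control.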

Analytical data, by contrast, remains locked in by technology vendors who see no commercial advantage in sharing data with other vendors and therefore no need to standardize access control for analytical data. Data Mesh architecture may change this situation, since it shares data beyond technology silos and thus creates the need for standardized access to analytical data.

Privacy and Consent

Regional privacy laws protect individuals whose data is captured and controlled by organizations by safeguarding their personally identifiable information. These laws have led to a rudimentary form of standardization of data management processes. However, the lack of standardization in data sharing has also meant little interest in standardizing privacy and consent. We remain far from a consistent way of recognizing which information must be protected as private data, how individual consent is incorporated within the data set, how consent can be revoked or altered, and how transparency can be ensured. As a distributed architecture, data mesh requires a consistent policy for managing privacy and consent across all data products. Individual consent can be considered a kind of computational policy, and this policy can be embedded in a data product.
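The idea of consent as a computational policy embedded in a data product can be sketched as follows. The ConsentPolicy and DataProduct classes, the subject_id field and the purpose strings are all hypothetical and chosen only to show consent being granted, revoked and enforced at read time.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ConsentPolicy:
    """Per-individual consent, keyed by purpose (hypothetical model)."""
    grants: Dict[str, Set[str]] = field(default_factory=dict)

    def grant(self, subject_id: str, purpose: str) -> None:
        self.grants.setdefault(subject_id, set()).add(purpose)

    def revoke(self, subject_id: str, purpose: str) -> None:
        self.grants.get(subject_id, set()).discard(purpose)

    def allows(self, subject_id: str, purpose: str) -> bool:
        return purpose in self.grants.get(subject_id, set())


@dataclass
class DataProduct:
    """Data and its consent policy travel together as one unit."""
    rows: List[dict]
    consent: ConsentPolicy

    def read(self, purpose: str) -> List[dict]:
        # Only rows whose subject consented to this purpose are served.
        return [r for r in self.rows
                if self.consent.allows(r["subject_id"], purpose)]


consent = ConsentPolicy()
consent.grant("u1", "marketing")
consent.grant("u2", "marketing")

product = DataProduct(
    rows=[{"subject_id": "u1", "city": "Oslo"},
          {"subject_id": "u2", "city": "Lima"}],
    consent=consent,
)

before = product.read("marketing")
consent.revoke("u2", "marketing")   # consent can be altered at any time
after = product.read("marketing")
```

Because the check runs on every read, a revocation takes effect immediately without any change to the stored rows.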

Data Governance Policy Integration

For a data product to function as a stand-alone commodity, its data, code and policy have to be handled together as a single unit. This unity helps data governance, especially when privacy and consent must be managed outside a particular data store. If the consent policy is separate from the data, it is difficult to track once the data leaves the technical system where consent was originally managed. Data Mesh makes it easier to track consent across platforms by linking the consent policy to the respective data.

Linking Policies

Policies can also be linked to data in such a way that the link is sustained even when the data moves from one data product to another. If, for some reason, the data gets de-linked from its original schema, the link to the governing policy is automatically severed. Maintaining the link to the original data set is important so that, if the policy changes within the source data product, the governing policies stay up to date for all data products. Although data mesh architecture allows for a distributed policy that can be shared along with the data across systems, this area still requires further development.
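A simple way to think about such a link is a policy reference that travels with the data and is resolved against the source at read time. The sketch below assumes a hypothetical PolicyRegistry, Policy and Dataset, along with a made-up retention_days setting; it is one possible design, not an established data mesh mechanism.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Policy:
    policy_id: str
    version: int
    retention_days: int   # illustrative policy setting


class PolicyRegistry:
    """Holds the latest version of each policy (hypothetical)."""

    def __init__(self) -> None:
        self._latest: Dict[str, Policy] = {}

    def publish(self, policy: Policy) -> None:
        current = self._latest.get(policy.policy_id)
        if current is None or policy.version > current.version:
            self._latest[policy.policy_id] = policy

    def resolve(self, policy_id: str) -> Policy:
        return self._latest[policy_id]


@dataclass(frozen=True)
class Dataset:
    name: str
    source: str                 # data product the data originated from
    policy_id: Optional[str]    # link to the governing policy, if any


def derive(parent: Dataset, name: str) -> Dataset:
    # The policy link is carried over when data moves to a new product.
    return Dataset(name=name, source=parent.source,
                   policy_id=parent.policy_id)


registry = PolicyRegistry()
registry.publish(Policy("orders-retention", 1, retention_days=365))

orders = Dataset("orders", source="orders", policy_id="orders-retention")
report = derive(orders, "monthly-report")

# The source product tightens its policy after the report was created.
registry.publish(Policy("orders-retention", 2, retention_days=90))
effective = registry.resolve(report.policy_id)
```

Because the derived data set stores a reference and not a copy, the policy change made at the source applies to it the next time the policy is resolved.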