Bringing Home Your Very First Data Product | Issue #49
From Logical Modeling and Product Ideation to Data Product Improvement
This piece is a community contribution from Francesco, an expert craftsman of efficient Data Architectures using various patterns. He embraces new patterns as well, such as Data Mesh and Fabric, to capitalize on data value more effectively. We highly appreciate his contribution and readiness to share his knowledge with MD101.
We actively collaborate with data experts to bring the best resources to a 3000+ strong community of data practitioners. If you have something to say on Modern Data practices and innovations, feel free to reach out to us!
Note: All submissions are vetted for quality and relevance.
We want to partake in the data-as-a-product experience, and in that pursuit, we need to define our first data product. To do so, we need to meet several prerequisites:
Understanding the topology of data domains.
Identifying the potential data owners affected.
Understanding the data model.
For the sake of discussion, let's consider ourselves the data owners of all the data we will use, based on the following logical data model for the Customer Domain of our enterprise (called ACME):
Now, let's consider our physical data model, which consists of three distinct data assets inside our Data System (it could be a DWH, a Data Platform, or anything else): two relational tables ("CustomerTable" and "AddressTable") that express the many-to-many relationship through a simple foreign key, and a topic "AddressQueue" that encapsulates address change information (for simplicity, let's assume its schema perfectly overlaps with "AddressTable"). In this case, we have the following representation (note: in real life, it would be slightly different, but I've simplified it for illustrative purposes):
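To make the three physical assets a bit more tangible, here is a minimal Python sketch of what they might look like; all column names are assumptions for illustration, not the actual schemas:

```python
from dataclasses import dataclass

# Column names are assumptions for illustration only.

@dataclass
class CustomerRow:            # "CustomerTable" (relational)
    customer_id: int
    full_name: str
    email: str

@dataclass
class AddressRow:             # "AddressTable" (relational)
    address_id: int
    customer_id: int          # foreign key back to CustomerTable
    street: str
    city: str
    country: str

# "AddressQueue" (topic): its message schema is assumed to mirror
# AddressTable, as stated above.
AddressChangedEvent = AddressRow
```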
📝 Note from Editor: More on Modeling for Data Products
💡 Data Product Ideation
Let's assume that our first data product aims to publish a comprehensive version of the customer base along with their addresses. At this point, someone should have already defined the structure of the data contract, which should include at least the following information (a minimal code sketch follows the list):
Name: The name of the data product.
Data Asset: Identification of the specific data assets involved, such as CustomerTable, AddressTable, etc.
Data Content: Description of the content included in the data product, which in this case would be the complete customer base along with their addresses.
Quality Assurance: Procedures and standards for ensuring the quality and accuracy of the data.
Sensitivity: This includes assessing the Maturity Level, Reliability, and Trustworthiness of the data product.
SLA (Service Level Agreement): Agreed-upon performance metrics and commitments regarding data availability, reliability, and other service aspects.
Support and Contact Information: Details of who to contact for support, including contact methods and availability.
Data Delivery: Specifications regarding how the data will be delivered, whether it's through a file transfer, API, database access, etc.
Data Usage: Guidelines and restrictions on how the data can be used, ensuring compliance with legal and regulatory requirements.
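Here is that minimal sketch of the contract structure; the field names and types are my own assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class DataContract:
    name: str                    # name of the data product
    data_assets: list[str]       # e.g. ["CustomerTable", "AddressTable"]
    data_content: str            # description plus the business entities covered
    quality_assurance: dict      # DQ checks, exposed scores and their weighting
    sensitivity: str             # maturity level, reliability, trustworthiness
    sla: dict                    # availability, freshness, support commitments
    support_contact: str         # who to contact for support, and how
    data_delivery: list[str]     # output ports: file transfer, API, database access...
    data_usage: str              # usage guidelines and restrictions
```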
📝 Note from Editor: More on Data Contracts for Data Products
Step 0️⃣: Design Information
The data product owner should identify the basic information needed to satisfy a business need, whether for publication/sharing (a source-aligned data product) or for solving a specific use case (a consumer-aligned data product).
Together with any supporting roles (data owner, data steward, and so on), they should be able to complete the initial sections of the data contract, such as the name, classification, and necessary assets.
For our first iteration, let's say we want to publish customer registry information with addresses, without any logic other than mere publication on our marketplace (or data mesh registry). In this way, the first version of the data contract for the 'Customer Registry' product would be as follows:
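Purely as an illustrative stand-in (every value below is an assumption), a first version could be expressed by reusing the DataContract sketch from above:

```python
customer_registry_v1 = DataContract(
    name="Customer Registry",
    data_assets=["CustomerTable", "AddressTable"],
    data_content="Complete ACME customer base with addresses "
                 "(business entities: Customer, Address)",
    quality_assurance={},          # filled in once the DQ engine is wired in
    sensitivity="internal",
    sla={"availability": "business hours", "refresh": "daily"},
    support_contact="customer-domain-team@acme.example",
    data_delivery=[],              # defined later, in Step 2 (output ports)
    data_usage="internal use only",
)
```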
There are some fields that are worth focusing on. The "Data Content: Business Entities" field should refer to the taxonomy of our data catalog (in simple terms, its list of business terms). This is one of the cornerstones that enables subsequent semantic interoperability because it ensures two effects:
It helps identify the data owners involved in any certification processes.
It enables search and cataloging.
Then there is the "Quality Assurance" field, which is actually a separate section, as it involves integration with the data quality engine, the exposure of quality scores, and their weighting (the data veracity paradigm).
The absence of these fields is limiting: without them, interoperability cannot be managed, and the reliability of the data product cannot be evaluated in its context.
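For the quality section in particular, here is a small sketch of what that score weighting could mean in practice; the dimension names and weights are invented purely for illustration:

```python
# Hypothetical quality dimensions and weights; the "data veracity" idea boils
# down to a weighted combination of the scores exposed by the DQ engine.
quality_scores = {"completeness": 0.98, "accuracy": 0.95, "timeliness": 0.90}
weights        = {"completeness": 0.5,  "accuracy": 0.3,  "timeliness": 0.2}

def veracity(scores: dict, weights: dict) -> float:
    """Weighted overall reliability score for the data product."""
    return sum(scores[dim] * weights[dim] for dim in scores)

print(round(veracity(quality_scores, weights), 3))   # ~0.955
```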
📝 Note from Editor:
Step 1️⃣: Design the Access
The preceding information should be sufficient for publication on a generic marketplace. Once the product is published there, it's necessary to define the access possibilities. The first check to perform is on the company policies for "internal" data (the classification we included in the first part of the contract). Assuming, therefore, that our company is not particularly restrictive and allows usage by the entire internal user base, we could have different visibility scopes on our marketplace, for roles:
Data Analyst
Data Scientist
Business User
and for organizational units:
Users from the same department
Users from the same region
All employees
By invitation only
These evaluations obviously have aspects that can be explored further (in terms of scope and cost model), but let's say it's decided that the dataset can be requested by "data analysts" and "data scientists" belonging to our same region (for example, Italy). We thus obtain the following enrichment:
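Expressed as a rough sketch (the role names and region code are assumptions), the enrichment boils down to an access rule like this:

```python
# Illustrative access rule for the Customer Registry: only data analysts and
# data scientists belonging to the product's own region (Italy) can request it.
ALLOWED_ROLES  = {"data_analyst", "data_scientist"}
PRODUCT_REGION = "IT"

def can_request(role: str, region: str) -> bool:
    """Marketplace-side check applied before a usage request is raised."""
    return role in ALLOWED_ROLES and region == PRODUCT_REGION

print(can_request("data_analyst", "IT"))     # True
print(can_request("business_user", "IT"))    # False: sees the product, cannot request it
print(can_request("data_scientist", "FR"))   # False: wrong region
```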
Step 2️⃣: Design Output Ports / Delivery Mode
Once the data product and its audience have been defined, it's necessary to establish how the product can be accessed. Obviously, the components can vary greatly depending on the architectural choices, but we can assume that the following methods have been defined:
For Data Analysts & Business Users:
Provisioning in the reporting workspace.
For Data Scientists:
Read-only access to the structures.
Provisioning in the AI workspace (e.g., data lake + Python).
Having selected both data scientists and data analysts as potential users in the previous section, there should be a constraint to select at least one output port per role. With that constraint satisfied, it is decided to eliminate read-only access (for example, due to workload or access cost reasons). The contract would then appear as follows:
In the example, the possibility of applying filters or local authorizations was not explicitly stated, but the system accommodates the profiling described above. Therefore, depending on their role, users would be able to access only one mode of consumption (reporting for data analysts, the lake for data scientists).
All other categories, such as non-Italian users, external users, and business users, would see the product on the marketplace but would not be able to access it (not even as a preview).
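To make the "at least one output port per role" constraint concrete, here is a small sketch; the port and role names are assumptions:

```python
# Hypothetical output-port mapping after dropping read-only access: each role
# admitted in Step 1 still keeps at least one port.
OUTPUT_PORTS = {
    "data_analyst":   ["reporting_workspace"],
    "data_scientist": ["ai_workspace"],   # data lake + Python; read-only port removed
}

def validate_ports(selected_roles: set[str], ports: dict) -> None:
    """Enforce the 'at least one output port per admitted role' constraint."""
    missing = [role for role in selected_roles if not ports.get(role)]
    if missing:
        raise ValueError(f"No output port defined for: {missing}")

validate_ports({"data_analyst", "data_scientist"}, OUTPUT_PORTS)   # passes silently
```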
📝 Note from Editor: More on Governance for Data Products
Step 3️⃣: Certify
In our example, we have covered the simple case where the data product owner (DPO) or data product manager coincides with the data owner of all the entities, and there are no complex validation paths due to the sensitivity level of the data. It is therefore reasonable to assume that publication is self-approving and there are no validation steps.
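A small sketch of that self-approval logic might look as follows; the step names and thresholds are invented for illustration:

```python
# Illustrative self-approval rule: when the DPO also owns every underlying
# entity and the data is not sensitive, publication needs no extra sign-off.
def certification_steps(dpo: str, entity_owners: set[str], sensitivity: str) -> list[str]:
    if entity_owners == {dpo} and sensitivity == "internal":
        return []                                            # self-approving publication
    return ["data_owner_signoff", "compliance_review"]       # hypothetical extra steps

print(certification_steps("dpo@acme.example", {"dpo@acme.example"}, "internal"))   # []
```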
Step 4️⃣: Monitor & Improve
Once the data product is published and made accessible and searchable, it should receive usage requests (here, it needs to be evaluated whether they are approved automatically or not) and become an active member of the data economy. Therefore, DPOs must keep their products alive in the solution, evolve them, and manage them like any other asset under their responsibility.
Some of the possible evolutions include (see the sketch after this list):
Including the "AddressQueue" topic in our data product.
Enhancing the level of data quality.
Enriching the record schema.
Offering services on the data product.
Increasing/decreasing support SLAs (Service Level Agreements).
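Sticking with the earlier Python sketch, the first of those evolutions could be expressed as a new contract version; the values below are assumptions:

```python
from dataclasses import replace

# Illustrative evolution of the contract sketched earlier: expose the
# "AddressQueue" topic as well and tighten the refresh SLA (assumed values).
customer_registry_v2 = replace(
    customer_registry_v1,
    data_assets=["CustomerTable", "AddressTable", "AddressQueue"],
    sla={"availability": "business hours", "refresh": "near-real-time"},
)
```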
📝 Note from Editor: More on strategising an ecosystem of Data Products
🔏 Author Connect
Find me on LinkedIn
For more diverse resources and insights, hop to our freshly printed website!