Optimising Data Modeling for the Data-First Stack | Issue #4
Existing Challenges, Solutions, and Proposed Data Flow
In a recent LinkedIn post, I pointed out the discord between data producers and consumers that is ignited largely by poor data modeling practices. The irony is that data modeling is supposed to break down the high wall between the two counterparts.
Given the vast number of nods on that post, it was validation that this is not an exclusive problem tormenting just me, my friends, or acquaintances; it is, in fact, a devil for the data industry at large.
Let's break down the problem into the trifecta of data personas.
Data Producers
Data producers are constantly stuck with rigid legacy models that reduce the flexibility of data production and hinder the creation of high-quality, composable data. Multiple iterations with data teams are often necessary to arrive at the required production schema and requirements.
Data Consumers
Data consumers suffer from slow metrics and KPIs, especially as the data and query load of warehouses or data stores increases over time. Expensive, complex, and time-consuming joins leave the consumption process littered with stretched-out timelines, bugs, and unnecessary iterations with data teams. There is often no single source of truth that different consumers can reliably refer to, and as a result, discrepancies are rampant across the outputs of BI tools. Consumers are the biggest victims of broken dependencies, and sometimes they are not even aware of it.
Data Engineers
Data engineers are bogged down with countless requests from both producers and consumers. They are consistently stuck between the choices of creating a new data model or updating an old one. Every new model they generate for a unique request adds to the plethora of data models they must maintain for as long as the dependencies last (often a lifetime). Updating models often means falling back on complex queries that are buggy, break pipelines, and trigger a fresh wave of requests because of those broken pipelines. In short, data engineers suffer tremendously in the current data stack, and it is not sustainable.
In short, Data Models are creating a high wall between data producers and data consumers, while their sole objective is to eliminate the gap between the two ends. However, it is not Data Modeling's fault. Data Modeling has been and remains one of the coolest ways to manage data. The problem lies in how models are implemented, constantly turning poor data engineers into bottlenecks.
The chaos goes back years, even decades, but it has started impacting strategic conversations lately, especially due to the growing importance and volume of data for organizations. Data used to be an afterthought, used only for fundamental analysis work. But the narrative has changed, and how!
Today, a good grasp of data makes the difference between winning and losing the competitive edge. Many data-first organizations, the likes of Uber, Airbnb, and Google, understood this long ago and dedicated major projects to becoming data-first.
Modern Data Stack Vs. Data-First Stack
Contrary to popular belief, the modern data stack is a barrier to optimizing the capabilities of data models. The primary cause of the silo between data producers and data consumers is the chaotic bunch of tools and processes clogging the system, each somehow trying to make use of a rigid data model defined by someone who possibly has no idea of the lay of the business landscape.
Spending more capital on one more tool is not a solution; it is an additional layer on a chaotic base. More tools bring in more cruft (debt) and make the problem more complex.
As one of my idols, Martin Fowler, would say:
"๐๐ฉ๐ช๐ด ๐ด๐ช๐ต๐ถ๐ข๐ต๐ช๐ฐ๐ฏ ๐ช๐ด ๐ค๐ฐ๐ถ๐ฏ๐ต๐ฆ๐ณ ๐ต๐ฐ ๐ฐ๐ถ๐ณ ๐ถ๐ด๐ถ๐ข๐ญ ๐ฆ๐น๐ฑ๐ฆ๐ณ๐ช๐ฆ๐ฏ๐ค๐ฆ. ๐๐ฆ ๐ข๐ณ๐ฆ ๐ถ๐ด๐ฆ๐ฅ ๐ต๐ฐ ๐ด๐ฐ๐ฎ๐ฆ๐ต๐ฉ๐ช๐ฏ๐จ ๐ต๐ฉ๐ข๐ต ๐ช๐ด "๐ฉ๐ช๐จ๐ฉ ๐ฒ๐ถ๐ข๐ญ๐ช๐ต๐บ" ๐ข๐ด ๐ด๐ฐ๐ฎ๐ฆ๐ต๐ฉ๐ช๐ฏ๐จ ๐ต๐ฉ๐ข๐ต ๐ค๐ฐ๐ด๐ต๐ด ๐ฎ๐ฐ๐ณ๐ฆ. ๐๐ถ๐ต ๐ธ๐ฉ๐ฆ๐ฏ ๐ช๐ต ๐ค๐ฐ๐ฎ๐ฆ๐ด ๐ต๐ฐ ๐ต๐ฉ๐ฆ ๐ข๐ณ๐ค๐ฉ๐ช๐ต๐ฆ๐ค๐ต๐ถ๐ณ๐ฆ ๐ข๐ฏ๐ฅ ๐ฐ๐ต๐ฉ๐ฆ๐ณ ๐ข๐ด๐ฑ๐ฆ๐ค๐ต๐ด ๐ฐ๐ง ๐ช๐ฏ๐ต๐ฆ๐ณ๐ฏ๐ข๐ญ ๐ฒ๐ถ๐ข๐ญ๐ช๐ต๐บ, ๐ต๐ฉ๐ช๐ด ๐ณ๐ฆ๐ญ๐ข๐ต๐ช๐ฐ๐ฏ๐ด๐ฉ๐ช๐ฑ ๐ช๐ด ๐ณ๐ฆ๐ท๐ฆ๐ณ๐ด๐ฆ๐ฅ. ๐๐ช๐จ๐ฉ ๐ช๐ฏ๐ต๐ฆ๐ณ๐ฏ๐ข๐ญ ๐ฒ๐ถ๐ข๐ญ๐ช๐ต๐บ ๐ญ๐ฆ๐ข๐ฅ๐ด ๐ต๐ฐ ๐ง๐ข๐ด๐ต๐ฆ๐ณ ๐ฅ๐ฆ๐ญ๐ช๐ท๐ฆ๐ณ๐บ ๐ฐ๐ง ๐ฏ๐ฆ๐ธ ๐ง๐ฆ๐ข๐ต๐ถ๐ณ๐ฆ๐ด ๐ฃ๐ฆ๐ค๐ข๐ถ๐ด๐ฆ ๐ต๐ฉ๐ฆ๐ณ๐ฆ ๐ช๐ด ๐ญ๐ฆ๐ด๐ด ๐ค๐ณ๐ถ๐ง๐ต ๐ต๐ฐ ๐จ๐ฆ๐ต ๐ช๐ฏ ๐ต๐ฉ๐ฆ ๐ธ๐ข๐บ.โ
Source: martinfowler.com
Proposed Solutions
Transition to a Unified Approach or a Data-First Stack.
Contrary to the widespread mindset, building a data-first stack no longer takes years. With the new storage and compute tools and innovative technologies that have popped up in the last couple of years, it is entirely possible to build a data-first stack and reap value from it within weeks instead of months or years.
Referring again to Martin Fowler's architectural ideology:
"High internal quality leads to faster delivery of new features because there is less cruft to get in the way. While it is true that we can sacrifice quality for faster delivery in the short term, before the build-up of cruft has an impact, people underestimate how quickly the cruft leads to an overall slower delivery. While this isn't something that can be objectively measured, experienced developers reckon that attention to internal quality pays off in weeks, not months."
We need to be ruthless about trimming the countless moving parts that plug into a data model. Chop down the multitude of tools and, with them, eliminate the integration overheads, maintenance overheads, expertise overheads, and licensing costs that build up to millions with no tangible outcome.
For a deep dive into the overheads of point solutions that make up the Modern Data Stack, refer to:
Hand over the reins of business logic
In the traditional approach, data teams are stuck with defining data models despite not having much exposure to the business side. The task still falls on them because data modeling is largely considered part of the engineering stack. This narrative needs to change.
The purpose of a data model is to build the right roadmap for data to fall into. Who better to do this than the business folks who work day and night with the data and know exactly how and why they want it? This would give control of business logic back to business teams and leave the management to data teams and technologies. But how should this be materialized? Through declarative and semantic layers of abstraction.
Business teams would give a hard pass to complex SQL, databases, or low-level modeling techniques. It's a tragedy that they are forced to deal with them; given the opportunity, they would choose the more intuitive and quicker path that impacts the business at the sweet time and the sweet spot.
Moreover, such abstractions are not exclusively for business folks. To make life easier for all (producers, consumers, and data engineers), we need to create a seamless way for business personnel to inject their vast industry and domain knowledge into the modeling pipeline. Reduce the complexity of SQL through abstractions that analysts can easily understand and operate, and remove the need for analysts to struggle with a double life as analytics engineers.
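To make this concrete, here is a minimal sketch of the kind of declarative abstraction described above. The `SemanticLayer` class and its `query` method are hypothetical, illustrative names rather than an existing product's API; the point is simply that the data team registers the gnarly SQL once, and analysts ask business questions by name.

```python
# Minimal sketch of a declarative abstraction over SQL.
# `SemanticLayer` and its API are hypothetical, illustrative names only.

class SemanticLayer:
    """Hypothetical facade that hides joins and SQL behind named metrics."""

    def __init__(self, metrics: dict):
        # metrics maps a business-friendly name to the SQL the data team maintains
        self.metrics = metrics

    def query(self, metric: str, dimensions: list[str]) -> str:
        # In a real system this would compile to SQL and execute it;
        # here we just return the pre-approved SQL for inspection.
        sql = self.metrics[metric]
        return f"{sql}  -- grouped by {', '.join(dimensions)}"


# The data team registers the complex SQL once...
layer = SemanticLayer(
    metrics={
        "monthly_active_users": (
            "SELECT COUNT(DISTINCT user_id) FROM events "
            "WHERE event_ts >= date_trunc('month', current_date)"
        )
    }
)

# ...and analysts ask business questions without touching joins or schemas.
print(layer.query("monthly_active_users", dimensions=["region"]))
```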
Transition to a semantic source of truth
A semantic source of truth is different from what is usually referred to as a single source of truth for data. A semantic source of truth refers to a single point that emits verified logic that the organization could blindly rely on.
"Blindly relying" is a big step, so we need the right system to enable optimal reliability. Surely you've heard of data contracts? Contracts are your one-stop lever to declaratively manage schema, semantics, and governance to bring harmony between producers and consumers (keep watching Modern Data 101 for more about Contracts).
Proposed data flow
Raw Data
This is where all the data sources and raw data are pooled. This data could live in a warehouse database, sit as unstructured data in data lakes/lakehouses, or be sourced directly from third-party systems and SaaS applications.
This layer is constantly on the verge of turning into a tangled data swamp. A data swamp is a heap of rich and valuable data that cannot be utilized or operationalized by data teams due to missing business context. To combat this, the next layer comes to the rescue.

Integrated Knowledge
The knowledge layer strives to take the necessary raw data from the first data layer, connect the right dots with metadata, and eventually activate the dormant data that rests in the bottom layer.
The knowledge layer is typically powered by knowledge graphs, catalog services, and metadata engines. The web developed at this layer can be leveraged for countless applications, including upstream/downstream lineage, observability assertions, governance policy validation, and much more.
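As a rough illustration of the lineage use case, the sketch below builds a tiny metadata graph and walks it upstream and downstream. It assumes the networkx Python library, and the dataset names are made up for the example.

```python
# Illustrative sketch of upstream/downstream lineage over a metadata graph.
# Assumes the networkx library; dataset names are invented for the example.
import networkx as nx

lineage = nx.DiGraph()

# Edges point from source asset to derived asset, as a metadata engine would record.
lineage.add_edge("raw.orders", "staging.orders_clean")
lineage.add_edge("raw.customers", "staging.customers_clean")
lineage.add_edge("staging.orders_clean", "marts.revenue_daily")
lineage.add_edge("staging.customers_clean", "marts.revenue_daily")

# Upstream lineage: every asset marts.revenue_daily depends on.
print(nx.ancestors(lineage, "marts.revenue_daily"))

# Downstream impact: every asset affected if raw.orders breaks.
print(nx.descendants(lineage, "raw.orders"))
```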
Contractual Handshake

The Contract layer is the key bridge that enables declarative collaboration between data producers and data consumers. Contracts can exist on both the consumption and production sides and are ideally created by domain owners who hold the relevant domain expertise. A contract captures strongly typed data expectations such as column types, business meaning, quality, and security. Think of them as guidelines to which data may or may not adhere.
Simply put, a data contract is a simple agreement between data producers and data consumers that documents and guarantees the fulfilment of data requirements and constraints. A data contract does not just describe data but also adds meaning to it by defining semantics or logical business constraints. It is enforceable at any chosen point in the data ecosystem, enabling automation, standardization, control, and reliability.
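Here is a minimal, hypothetical sketch of what such a contract could look like when expressed in code. The class names, fields, and checks are illustrative assumptions only; real contracts would also carry ownership, SLAs, security classifications, and richer semantics.

```python
# Hypothetical sketch of a data contract in code; names and checks are illustrative.
from dataclasses import dataclass, field


@dataclass
class ColumnExpectation:
    name: str
    dtype: type           # strongly typed column expectation
    meaning: str          # business meaning attached to the column
    nullable: bool = False


@dataclass
class DataContract:
    dataset: str
    owner: str
    columns: list[ColumnExpectation] = field(default_factory=list)

    def validate(self, record: dict) -> list[str]:
        """Return a list of contract violations for a single record."""
        violations = []
        for col in self.columns:
            value = record.get(col.name)
            if value is None:
                if not col.nullable:
                    violations.append(f"{col.name}: missing but not nullable")
            elif not isinstance(value, col.dtype):
                violations.append(f"{col.name}: expected {col.dtype.__name__}")
        return violations


orders_contract = DataContract(
    dataset="orders",
    owner="sales-domain-team",
    columns=[
        ColumnExpectation("order_id", str, "unique order identifier"),
        ColumnExpectation("amount", float, "order value in USD"),
    ],
)

# A producer (or a pipeline step) can enforce the contract wherever it is needed.
print(orders_contract.validate({"order_id": "A-1", "amount": "12.5"}))
# -> ['amount: expected float']
```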
Semantic Data Modeling

The semantic data model is powered by the business user's exclusive knowledge as well as the semantics extracted from the underlying knowledge and contract layers. This layer should ideally be a simple interface that enables abstractions through a drag-and-drop GUI or abstracted queries on top of SQL, eliminating complexities from the business user's plate.
The semantic data model is a custom lens over the entire data pod that powers the specific use case the model demands. This layer takes the pressure off data engineers, who are no longer bottlenecks or obstacles to fluent business logic, which tends to change dynamically. With control back in the hands of the business, fluent data activation becomes second nature to the organization.
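The sketch below shows one way such an abstraction might work under the hood: a declarative model maps business terms to SQL expressions, and a tiny compiler turns a business-level request into the query the data team would otherwise write by hand. All names and the spec format are illustrative assumptions, not a specific tool's API.

```python
# Sketch of how a declarative semantic model might compile business terms
# into SQL behind the scenes. All names and the spec format are illustrative.

semantic_model = {
    "entity": "orders",
    "table": "marts.orders",
    "dimensions": {
        "region": "ship_region",
        "month": "date_trunc('month', ordered_at)",
    },
    "measures": {"revenue": "SUM(amount)", "order_count": "COUNT(*)"},
}


def compile_query(model: dict, measure: str, dimension: str) -> str:
    """Translate a business-level request into SQL using the model's mappings."""
    dim_expr = model["dimensions"][dimension]
    measure_expr = model["measures"][measure]
    return (
        f"SELECT {dim_expr} AS {dimension}, {measure_expr} AS {measure}\n"
        f"FROM {model['table']}\n"
        f"GROUP BY 1"
    )


# A business user asks for "revenue by region"; the SQL is generated for them.
print(compile_query(semantic_model, measure="revenue", dimension="region"))
```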
Activated and Reliable Data

The outcome of the above layers is a Data Product that achieves the DAUNTIVS capabilities: Discoverable, Addressable, Understandable, Natively accessible, Trustworthy and truthful, Interoperable and composable, Valuable on its own, and Secure.
These data products can then be used across data applications to power analytics, research, growth campaigns, development, and much more, impacting actual business outcomes at the sweet spot!
I'm Animesh, and I solve data problems. Creating, modifying, destroying, obviating: those are the details.
Data is a product, and it thrives when treated like one. This is the irrevocable conclusion I've come to after spending about twenty years in the data industry and working with a plethora of data experts, including data engineers, data scientists, and academic researchers.
I share my thoughts on innovating a holistic data experience here, where we debate ideas around Data as a product, Data as an experience, and Data as the differentiator.
Since its inception, ModernData101 has garnered a select group of Data Leaders and Practitioners among its readership. We'd love to welcome more experts in the field to share their story here and connect with more folks building for better. If you have a story to tell, feel free to email your title and a brief synopsis to the email-ID below: