Comments
john boddie:

Re: " ELT Over ETL

Transformation belongs where computation is elastic and parallel, not in fragile ingestion code. That’s why the evolved approach is ELT: keep extract and load as lean as possible, then let downstream systems handle the heavy work."

If I'm reading this correctly, your position is that it's a good idea to bring new data into your data of record and then clean it up, rather than cleaning it first, although you later espouse ENL, which normalizes the data before loading it.

Darpan Vyas:

John - thanks for the careful read and for calling this out.

I think the confusion you’re pointing to is a fair one, and it comes down to an important distinction we didn’t make explicit enough in the piece.

When we argue for ELT, we’re specifically talking about business and semantic transformation (joins, aggregations, metric logic, and enrichment), all of which should reside downstream, where compute is elastic and logic can evolve independently of ingestion.

When we later introduce ENL, the “N” is not meant to contradict that. It’s not about “cleaning” or applying business logic before load. It’s about the minimal structural normalization required to make data usable as a data of record: flattening nested payloads, canonicalizing types, handling schema drift, and preserving fidelity so downstream systems aren’t forced to re-interpret raw blobs.

In other words, the position isn’t “load raw and clean later” versus “clean first and then load,” but:

- "Extract" bytes safely

- "Normalize" just enough to make the data structurally reliable

- "Load" into the data of record

- "Transform" for meaning and analytics downstream

Framed that way, ENL doesn’t replace or weaken ELT; it reinforces it by keeping ingestion lean without pushing avoidable structural debt downstream.
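
To make that concrete, here is a minimal sketch of what the “N” step could look like in Python. Every name in it (normalize_record, flatten, EXPECTED_TYPES, the example fields) is hypothetical and invented for illustration, not taken from the article; the point is only that the step reshapes data without encoding any business meaning.

```python
# Illustrative sketch of the "N" in ENL: structural normalization only.
# All names and fields here are hypothetical.
import json
from datetime import datetime, timezone

# Hypothetical canonical types for known fields.
EXPECTED_TYPES = {"order_id": str, "amount": float, "created_at": str}

def flatten(payload: dict, prefix: str = "") -> dict:
    """Flatten nested payloads into dotted keys, preserving every field."""
    out = {}
    for key, value in payload.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        else:
            out[name] = value
    return out

def normalize_record(raw: bytes) -> dict:
    """Flatten, canonicalize known types, and flag (but keep) drifted fields."""
    record = flatten(json.loads(raw))
    drift = [k for k in record if k not in EXPECTED_TYPES]
    for key, expected in EXPECTED_TYPES.items():
        value = record.get(key)
        if value is not None and not isinstance(value, expected):
            record[key] = expected(value)  # canonicalize, don't interpret
    record["_ingested_at"] = datetime.now(timezone.utc).isoformat()
    record["_schema_drift"] = drift  # preserve fidelity: unknown fields stay
    return record
```

Note what is absent: no joins, no aggregations, no metric logic. Those live in the downstream “T”, where they can evolve without touching ingestion.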

john boddie:

Thank you for your response.

Another issue rears its head during initial data discovery. My practice is to identify data that is present in existing reports and concentrate on that, because it is (or was at some point) demonstrably important to the operation of the enterprise. It sounds simple, but in fact many of the useful reports are delivered to the user as Excel spreadsheets, and the user will merge in data from other reports or sources to create the output that directly supports actions.

When we are tasked with a data migration, it can be difficult to find these derivative spreadsheets until we "complete" the migration and get the question, "Where's my closure report by customer age?" All of a sudden there is data that has been overlooked, and we can only assume that a new spreadsheet will be generated to replace it. That spreadsheet will remain hidden until the next migration.

The fact that we "overlooked" a report will not be forgotten, and it will make it harder to obtain future funding for already underfunded migration efforts.

Darpan Vyas:

John, this is a really important point!

What you’re describing, critical logic living in derivative Excel reports and ad hoc merges, is how many enterprises actually operate. Missing one of those during a migration absolutely has long-term trust and funding consequences.

The ENL/ELT discussion in the article is about how data moves, not an assumption that we already know which data matters. In fact, the reality you describe is a strong argument against shaping or filtering too aggressively up front.

Landing data broadly, normalizing it structurally, and preserving fidelity in a data of record makes it far easier to respond when the inevitable “where’s my report?” question shows up, and to trace gaps back to sources instead of discovering them too late.
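
As one sketch of what “trace gaps back to sources” can mean mechanically, each landed record might carry minimal lineage metadata attached at load time. The shape below is an assumption for illustration; Lineage and land are invented names, not part of the article’s design.

```python
# Hypothetical sketch: attach source lineage at load time so that a later
# "where's my report?" question can be traced back to a concrete source.
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class Lineage:
    source_system: str  # e.g. "crm_export"
    source_object: str  # e.g. "closures.csv"
    extracted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def land(record: dict, lineage: Lineage) -> dict:
    """Wrap a normalized record with its lineage before loading."""
    return {"data": record, "lineage": asdict(lineage)}
```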

Shadow reports will always surface after the fact. The goal is to design data movement systems that make those discoveries diagnosable and recoverable rather than program-derailing.

If you consider it further, the Excel reports you’re describing are already “products” in active use by data consumers. They just lack explicit ownership, contracts, and lineage, which makes them poor products: consumers have to scramble to get full value out of them.

During migration, the risk isn’t missing a table; it’s missing an implicit product someone depends on. Treating those outputs as first-class data products lets teams work backward: identify what decisions they support and what data they depend on, then ensure the underlying data of record can reliably serve them.
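
To sketch what treating one of those shadow spreadsheets as a first-class data product might look like: the descriptor below is purely illustrative (every field and value is invented), but it captures the work-backward direction of decision, owner, then upstream data of record.

```python
# Hypothetical descriptor for an implicit report promoted to a data product.
from dataclasses import dataclass

@dataclass
class DataProduct:
    name: str                    # the report as consumers know it
    owner: str                   # accountable team or person
    decision_supported: str      # why anyone cares that it exists
    upstream_sources: list[str]  # data-of-record tables it depends on

# Example values are made up for illustration.
shadow_report = DataProduct(
    name="closure_report_by_customer_age",
    owner="ops-analytics",
    decision_supported="staffing of customer retention outreach",
    upstream_sources=["record.closures", "record.customers"],
)
```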

ENL + incremental ingestion makes the supply side resilient, but data products are what make demand visible, which is ultimately what protects trust (and funding) during migrations.