This post is about module 11 in the Cloud Skills Challenge.

Microsoft Learn Module

Organizing a Fabric lakehouse using medallion architecture design

This module covers medallion architecture, a recommended design pattern for organizing data in a lakehouse into Bronze, Silver, and Gold layers. The focus is on understanding how to organize, refine, and curate data effectively as it moves through those layers.

  • Bronze = raw data, structured or unstructured, kept in its original format and typically with no changes made to it yet.
  • Silver = a validated layer where data is combined, merged, and cleaned: duplicates are removed and whatever other business logic is required is applied.
  • Gold = further refinement of the data for specific business and analytics needs. This can include aggregating to a particular level of granularity, and typically includes modelling the data in a star schema to optimize it for reporting.
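
To make the three layers a bit more concrete, here is a minimal sketch in plain Python (not PySpark, so it runs without a Spark session). The sample order records, field names, and the to_silver / to_gold helpers are all hypothetical and invented for illustration; they are not taken from the module.

```python
from collections import defaultdict

# Bronze: raw records exactly as they arrived (hypothetical sample data).
bronze = [
    {"order_id": "1", "customer": "Ana", "amount": "10.50"},
    {"order_id": "1", "customer": "Ana", "amount": "10.50"},         # duplicate row
    {"order_id": "2", "customer": "Ben", "amount": "not-a-number"},  # fails validation
    {"order_id": "3", "customer": "Ana", "amount": "4.50"},
]


def to_silver(rows):
    """Silver: cast types, drop rows with an invalid amount, remove duplicate orders."""
    seen_ids = set()
    silver_rows = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not numeric
        if row["order_id"] in seen_ids:
            continue  # skip orders that were already kept
        seen_ids.add(row["order_id"])
        silver_rows.append(
            {
                "order_id": int(row["order_id"]),
                "customer": row["customer"],
                "amount": amount,
            }
        )
    return silver_rows


def to_gold(rows):
    """Gold: aggregate to one total per customer, ready for reporting."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

The bronze list is never modified here; each layer is derived from the one before it, which is the main idea the pattern is trying to preserve.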

The exercise in this module has the user create a lakehouse and then walk through ingesting data into a "bronze" layer, transforming it into a "silver" layer, and transforming it further into a star schema "gold" layer for reporting. It finishes with creating a semantic model and then creating relationships between the tables in Power BI. The transformations are done in notebooks with PySpark.
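
As a rough illustration of the star schema step, the sketch below splits cleaned rows into two dimension tables and one fact table. It is plain Python rather than the PySpark used in the exercise, and the table names, column names, and the build_dimension helper are assumptions made up for this example.

```python
# Hypothetical silver-layer rows (not the dataset used in the exercise).
silver_sales = [
    {"order_id": 1, "customer": "Ana", "product": "Widget", "amount": 10.5},
    {"order_id": 2, "customer": "Ben", "product": "Gadget", "amount": 7.0},
    {"order_id": 3, "customer": "Ana", "product": "Gadget", "amount": 4.5},
]


def build_dimension(rows, column):
    """Map each distinct value of `column` to a surrogate key, in first-seen order."""
    keys = {}
    for row in rows:
        keys.setdefault(row[column], len(keys) + 1)
    return keys


# Dimension tables: one surrogate key per distinct customer / product.
dim_customer = build_dimension(silver_sales, "customer")
dim_product = build_dimension(silver_sales, "product")

# Fact table: measures plus foreign keys pointing at the dimensions.
fact_sales = [
    {
        "order_id": row["order_id"],
        "customer_key": dim_customer[row["customer"]],
        "product_key": dim_product[row["product"]],
        "amount": row["amount"],
    }
    for row in silver_sales
]
```

The relationships created in Power BI at the end of the exercise correspond to the key columns here: each fact row points to one row in each dimension.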

The link below is the session I attended on this topic. It was the 7th session in the live video series.

Key Takeaways

I admit I hadn't heard the term "medallion architecture" before, but the concepts are not unfamiliar to me. The most interesting parts of the discussion were about data security and automating deployment.

From a security standpoint, it is important to define who needs access at each layer, so that only authorized users can interact with sensitive data and data governance is maintained. I can envision scenarios where someone has too much access to a bronze or silver layer and inadvertently changes something that ultimately impacts the semantic model further down the chain. The bronze layer would be the most restricted (read-only), whereas the silver layer would need to balance the flexibility required for data modelling tasks against the security of the data.
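
One way to think through the per-layer access described above is to write it down as a simple lookup. This is only a sketch: the role names, the permission sets, and the is_allowed helper are hypothetical, and none of it is a Fabric API or the way Fabric actually stores permissions.

```python
# Hypothetical role-to-layer permissions; illustrative only, not a Fabric API.
LAYER_ACCESS = {
    # Bronze is the most restricted: people can only read it.
    "bronze": {"data_engineer": {"read"}},
    # Silver balances modelling flexibility with protecting the data.
    "silver": {"data_engineer": {"read", "write"}, "analyst": {"read"}},
    # Gold is what reporting users consume.
    "gold": {
        "data_engineer": {"read", "write"},
        "analyst": {"read"},
        "report_viewer": {"read"},
    },
}


def is_allowed(role, layer, action):
    """Return True if `role` may perform `action` on `layer`; unknown cases are denied."""
    return action in LAYER_ACCESS.get(layer, {}).get(role, set())
```

For example, is_allowed("data_engineer", "bronze", "write") is False under this table, which is the kind of guard that would prevent the accidental upstream change described above.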