This post is the last of the 9 that had a "Learn Together" live video session. The posts after this will cover the modules I missed, numbers 8, 9, and 13 to round out the rest of the challenge. The deadline for the challenge is coming up quickly (ends Feb 22, 2024), and as of writing this post, I have not yet completed those 3 modules!

Microsoft Learn Module

Ingest data with Spark and Microsoft Fabric notebooks

This module is another one I will need to revisit, as it covers newer topics (for me) that I am just not as familiar with as I am with SQL or Power Query. The focus of this module is using Spark and notebooks to ingest data into a lakehouse: connecting to external sources, authenticating, and writing the data to structured or unstructured destinations in the lakehouse.

The exercise in this module has the user creating a lakehouse and ingesting sample data from the public New York City Taxi & Limousine Commission dataset.
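From what I can tell, the core ingestion pattern looks roughly like this (a sketch from memory rather than the exact exercise code; the storage path and folder names are placeholders, and it assumes the notebook's built-in spark session and an attached lakehouse):

```python
# Read the sample taxi data from an external source into a Spark DataFrame.
# The path below is a placeholder, not the exact one from the exercise.
df = spark.read.parquet(
    "wasbs://<container>@<storage-account>.blob.core.windows.net/yellow/*.parquet"
)

# Land the raw data in the lakehouse "Files" area (an unstructured destination).
df.write.mode("overwrite").parquet("Files/raw/nyc_taxi")

# Quick sanity check on what was ingested.
display(df.limit(10))
```

From there, the same DataFrame can be cleaned up and written to a Delta table, which is where the rest of the module goes.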

The link below is for the session I watched. What I appreciated most was that both presenters called out that the Microsoft Learn material presents examples of what would be considered bad practice for data security. The examples in Unit 2 show tokens and secrets in plain text in a notebook that could be read by multiple users, where best practice would be to use a key vault for that kind of thing instead. Had I read through Microsoft Learn without that context, I might have missed it.
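For what it's worth, the key-vault alternative the presenters described would look something like this in a Fabric notebook (a hedged sketch on my part; the vault URL, secret name, and storage account are placeholders, and the getSecret call is the one I've seen in the docs, so double-check the exact signature):

```python
from notebookutils import mssparkutils  # built into Fabric notebooks

# Fetch the secret at runtime instead of pasting it into the notebook in plain text.
# Vault URL and secret name are placeholders.
sas_token = mssparkutils.credentials.getSecret(
    "https://<your-key-vault>.vault.azure.net/", "<your-secret-name>"
)

# Use the retrieved secret to configure access to the external storage account.
spark.conf.set(
    "fs.azure.sas.<container>.<storage-account>.blob.core.windows.net",
    sas_token,
)
```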

Key Takeaways

Other than the examples not being representative of good data security, as I noted above, the majority of the module is about using Spark and Python to write to Delta tables, which is still new to me and not something I understand well yet. Conceptually I get the process, but not the specific commands and syntax.
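For my own future reference, the basic write-to-Delta pattern looks like this (as I currently understand it; the table name is a placeholder and df is assumed to be a DataFrame already loaded in the notebook):

```python
# Overwrite (or create) a managed Delta table in the lakehouse.
df.write.format("delta").mode("overwrite").saveAsTable("nyc_taxi_yellow")

# Or append new rows to an existing table instead of replacing it.
df.write.format("delta").mode("append").saveAsTable("nyc_taxi_yellow")

# The table can then be queried with Spark SQL from the same notebook.
spark.sql("SELECT COUNT(*) AS row_count FROM nyc_taxi_yellow").show()
```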

There were also some scripts (commands?) for optimizing table reads and writes, which matters because notebooks are best suited for ingesting large volumes of data. Enabling "V-Order" makes reads faster, and enabling optimized write reduces the number of files written by making each file larger.
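These are the session-level settings I noted down (the config names are the ones I recall from the module, so treat this as a sketch and double-check against the docs):

```python
# V-Order writes parquet files optimized for faster reads by the Fabric engines.
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Optimized write produces fewer, larger files during ingestion.
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.binSize", "1073741824")  # target ~1 GB per file
```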