In this post, I am covering module 3 of the Microsoft Learn challenge.
Previous posts in the series
Microsoft Learn Module 2
Use Apache Spark in Microsoft Fabric
- This module shows how to configure and use Apache Spark for data processing and analytics in Microsoft Fabric. Spark is widely used for large-scale data processing and analytics, and it is available in several places in Microsoft Azure as well as in Microsoft Fabric. The module explores adding CSV files to a lakehouse and then loading them into a Spark dataframe for further processing, including generating some visuals in the notebook via Python graphics libraries (a sketch of that CSV-loading pattern follows this list).
- I definitely need to revisit this module and its exercises before attempting the DP-600 exam! Content that is close to databases and SQL I'm learning quickly; content that departs from that I find harder to wrap my head around.
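To give a flavor of what the exercise covers, here's a minimal sketch of reading a CSV file from the lakehouse's Files area into a Spark dataframe in a Fabric notebook. The file name and columns are hypothetical, not from the module itself:

```python
# Sketch: load a CSV from the attached lakehouse into a Spark dataframe.
# The file "orders.csv" and its columns are made up for illustration.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Defining a schema up front avoids a second pass over the data for inference
order_schema = StructType([
    StructField("OrderID", IntegerType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("Price", DoubleType()),
])

# In a Fabric notebook, "Files/" resolves to the default lakehouse's file area,
# and the `spark` session is already created for you
df = spark.read.format("csv") \
    .option("header", "true") \
    .schema(order_schema) \
    .load("Files/orders.csv")

display(df)  # Fabric's built-in rich table/chart preview
```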
Learn Together links (recordings from wave 1)
The Learn Together sessions for this were the "Day 2" content. I watched the Pacific time zone session live, with Microsoft MVPs Shabnam Watson and Matthias Falland leading the session. That link is below:
Key Takeaways
One small "hint" I picked up from another training session makes it sound like using Spark in Fabric is far easier (or more approachable?) than it can be in other areas of Azure. To be honest, for a Day 2 topic, I felt like I jumped right into the deep end of the pool for this one and still do not entirely understand it.
Spark can process large volumes of data quickly by distributing the workload across multiple nodes in a cluster, and Spark handles that distribution automatically. You can run Spark code in notebooks or by defining a Spark job; the exercises for this module focus on the notebook side of things, although the learning content does include a unit on Spark jobs. Notebooks allow you to explore and analyze data as well as process or transform it. I like that code can be written in a variety of languages, including Spark SQL, which isn't the same as T-SQL, but SQL is my strength, so yay. Markdown can also be used to document the code, which is cool and better than just "commenting code", since it can be more verbose if and when needed.
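For the SQL-minded (like me), here's roughly what that workflow looks like: register a dataframe as a temporary view, then query it with familiar SQL syntax. This continues the hypothetical orders example from above; the view name and aggregation are illustrative, not from the exercise:

```python
# Expose the dataframe to Spark SQL under a view name we can query
df.createOrReplaceTempView("orders")

# Standard-ish SQL: aggregates, GROUP BY, ORDER BY all work as expected.
# (In a notebook you could also put this in a %%sql magic cell instead.)
summary = spark.sql("""
    SELECT Item,
           SUM(Quantity) AS TotalQuantity,
           SUM(Quantity * Price) AS Revenue
    FROM orders
    GROUP BY Item
    ORDER BY Revenue DESC
""")

display(summary)
```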
One of the cool things covered in the exercises was creating a visual in the notebook using two different Python graphics libraries (seaborn and matplotlib). The exercise in the module walks you through all of this right from creating the workspace. The thing I like about this collection of Microsoft Learn modules is that each one seems to be independent of the previous, even when they were assigned to a "collection" together. None of the exercises I've done so far require the user to have the files or data from the previous module, and, at the end of each exercise component, it walks users through removing all the artifacts and the workspace.
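In case it's helpful, here's a hedged sketch of that visualization step: convert a (small, already-aggregated) Spark result to pandas, then plot it with matplotlib and style it via seaborn. The column names are the hypothetical ones from the earlier snippets:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# toPandas() pulls the data to the driver, so only do this once the
# result set is small (e.g., after aggregation)
pdf = summary.toPandas()

sns.set_theme(style="whitegrid")   # seaborn styling applied to matplotlib
plt.figure(figsize=(8, 4))
sns.barplot(data=pdf, x="Item", y="Revenue")
plt.title("Revenue by item")
plt.xticks(rotation=45)
plt.show()
```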
Overall, I can't envision where I would use notebooks myself beyond practicing the concepts, as in the course of my own consulting business I would not necessarily be dealing with "big data" that necessitates tools like this.