This post is about module 9 of the Cloud Challenge series. Like the previous post in this series, it covers a topic that is not on the DP-600 exam, at least based on the study guide I read. After this, there will be one more post to complete the modules in the challenge, just before the deadline of February 22, 2024.
Previous posts in this series:
- Part 1 (Introduction, Lakehouses)
- Part 2 (Apache Spark)
- Part 3 (Delta Lake tables)
- Part 4 (Using Data Factory pipelines)
- Part 5 (Ingesting data with Dataflows Gen2)
- Part 6 (Getting started with data warehouses)
- Part 7 (Administration of Microsoft Fabric)
- Part 8 (Medallion architecture design)
- Part 9 (Spark & notebooks)
- Part 10 (Get started with real-time analytics)
Microsoft Learn Module
Get started with data science in Microsoft Fabric
This module covers the data science process and some machine learning, in the context of a typical data science project. Units cover the types of models, the typical process to follow, and how that process works in Microsoft Fabric.
The exercise has us working with a diabetes dataset from Azure Open Datasets, using Notebooks (of course! 😄) to ingest the data, prepare it using Data Wrangler, train machine learning models, and then explore the results.
Key Takeaways
All of this is new to me, so the takeaways are going to be fairly rudimentary!
Data scientists train machine learning models to find patterns in their data, and those patterns can then be used to predict behaviour or generate new insights. There are four common types of machine learning models, and knowing which one to train requires first understanding the business problem and what kind of data is available to you.
- Classification models predict a categorical value, like whether a customer will churn.
- Regression models predict something like the price of a product, a numerical value.
- Clustering models group similar data points into clusters.
- Forecasting models predict future numerical values based on time-series data, like expected revenues or sales.
The forecasting concept is the one I find most relevant and understandable, given my background as an accountant, where forecasting profits or revenues is a typical part of the role.
One interesting point I noted in the unit reading material is that we must understand how our choices in training a model will influence its success.
While attempting the exercise, I realized quite quickly how much I hated statistics in university! I am not ashamed to admit that I did not complete the exercise; data science is one area of data I will never fully understand. Thank goodness this is not part of the DP-600 exam! 😄
Finally, I am headed on to the last module!