Greg Beaumont's Data & Analytics Blog

220M+ row Microsoft Fabric demo using Direct Lake, Lakehouse, Warehouse, Spark and Pipelines

November 28, 2023

Microsoft Fabric became generally available on November 15th. Fabric is a SaaS (Software as a Service) product that unites Azure data tools including Azure Data Lake, Power BI, Azure Data Factory, Azure Synapse, IoT tools such as Kusto DBs for streaming data and more into a single unified product. The underlying data structure integrates smoothly with Azure Databricks and other tools. If you are interested in trying out Microsoft Fabric, we’ve put together a Git Repo that can be deployed in a few hours for end-to-end solution modules that can be used as for POCs.

Data used in the Git Repo is open healthcare data, so you can try out the tools without PHI/PII concerns. There are two modules at the time of this article: 1) the Business Intelligence and Data Warehouse module uses over 220 million rows of real CMS Medicare Part D data and 2) the Data Science module uses data features that can predict diabetes.

The Git Repo is available at this link: https://github.com/isinghrana/fabric-samples-healthcare/tree/main

A 5 minute preview demo of the Direct Lake/Data Warehouse module is provided below:
End-to-end Fabric Git Repo for 220M rows of CMS data

August 4, 2023
Reposted from the Microsoft Healthcare & Life Sciences Blog at this link: https://techcommunity.microsoft.com/blog/healthcareandlifesciencesblog/end-to-end-fabric-git-repo-for-220m-rows-of-cms-data/3891410

Microsoft Fabric is uniting data and analytics tools within a single SaaS platform. Fabric encompasses many different user personas and tools, and new users are seeking opportunities to skill up. This article reviews a new Github repository that will allow anyone with access to Fabric to deploy an end-to-end solution in Fabric that leverages 220 million rows of real healthcare open data from CMS. Without having to code, users can follow the instructions in the Git repo to import the data into OneLake, serve it up in the Lakehouse, and then query it from Power BI and Excel using the new Direct Lake connector. Below is an architectural diagram of the solution:

Here is a link to the Git repo: https://github.com/isinghrana/fabric-samples-healthcare/tree/main/analytics-bi-directlake-starschema

This is the first release for the Github repository which will be a hub for new easy-to-deploy Fabric healthcare solutions moving forward. In the diagram above, the simple steps of the solution are shown:
1. Download the files from CMS and Upload them to Fabric OneLake
2. Combine the files into a single table in the Fabric Lakehouse using delta parquet file format. Don’t worry, you can deploy the Spark Notebook without having to write code!
3. Create a Fabric Direct Lake dataset that queries the table without caching any of the 220M+ rows.
4. Create reports in Power BI and Excel that query the Lakehouse with impressive query performance.
The data used in the solution is real CMS open data for Medicare Part D Prescribers – By Prescriber and Drug. The data details drug names, physician names, geographical data, costs, beneficiary counts, and more. The data spans from 2013 to 2021, and totals over 220 million rows.

This solution was created by Greg Beaumont and Inder Rana, who are Data & AI Technical Specialists for Microsoft Healthcare and Life Sciences:

Inder Rana

Linkedin: https://www.linkedin.com/in/singhinderjit

Blog: https://i singhrana.medium.com/

Greg Beaumont

Linkedin: https://www.linkedin.com/in/gregbeaumont

Twitter: https://twitter.com/grbeaumont

Future planned releases for this GitHub repo include easy-to-deploy healthcare solutions such as:
- Machine Learning and Predictive Analytics within Fabric
- Comparing Direct Lake, SQL Serverless Endpoint, and Import models for Power BI
- Comparing query performance for flattened data models versus star schemas and composite models
- OpenAI integration with Fabric
- Pass along your ideas and suggestions!
Here’s a few of the instructional videos from the Git Repo tutorial:

Import the files manually into Fabric OneLake

Use a Fabric Spark Notebook to create a table in the Lakehouse in delta parquet format

Create a Fabric Power BI dataset in Direct Lake mode to query 220M+ rows of data
Azure OpenAI ChatGPT Solves Cryptic Error Codes from Power BI, SQL Server, Power Apps, and more!

April 24, 2023

Reposted from the Microsoft Healthcare & Life Sciences Blog at this link: https://techcommunity.microsoft.com/blog/healthcareandlifesciencesblog/azure-openai-chatgpt-solves-cryptic-error-codes-from-power-bi-sql-server-power-a/3803661

Are you tired of cryptic error message codes from your IT tools filled with GUIDS, seemingly random numbers, and vague text? Azure OpenAI ChatGPT can help! When tools such as Power BI, SQL Server, Power Apps, and others encounter problems the error codes can often be difficult to decipher. Users will often paste them into search engines and comb through results for hours, open service tickets with their IT team or Microsoft, or paste the messages (which may sometimes contain company data) to community forums for help. The video embedded below shows how Azure OpenAI ChatGPT can decipher those cryptic error codes in minutes. Here’s a link: https://youtu.be/D-O3GuqSWjw

The example in the video uses a simple deployment of gpt-3.5-turbo in the Azure OpenAI studio. There are no embeddings, no fine-tuning, and no plug-ins. The model is being used straight out of the box. By using Azure OpenAI, you can capture the chat history between your user and ChatGPT. Down the road, this data could be used to enhance the solution using embeddings or other techniques. Also, if a user enters sensitive data that is within the error message, it will be contained in your Azure environment. Here’s a quick diagram of how that might look at a high level:

If we built out a GitHub framework to deploy over Azure OpenAI ChatGPT to enable simple error code resolution and capture chat, how would you like that to look? A Power App? Custom Bot? Something else?
Add an Azure SQL DB Turbo Button using Power Apps to Reduce Costs & Keep Users Happy

August 17, 2022

A recurring challenge I’ve encountered over the years is finding the balance between cost and performance for SQL databases used by front end business teams. Business Users always want more memory and compute, IT Leadership always wants to minimize unnecessary costs, and usage of databases by business end users can be cyclical and unpredictable.

With Azure SQL Database you can programmatically scale the memory and compute up and down, but how do you make sure it happens at the right time? Also, how do you do chargebacks to the teams that needed extra compute for their SQL Server? This video will review a method for adding a Power Apps button that end users can push to turbo-charge an Azure SQL Database for a period of time. IT Leadership can control who pushes the button, the size of the scale-up, the duration of the scale-up, and the Power App will also trigger a record that can be used for chargebacks to prevent abuse of the turbo button. This solution is a simple and easy approach to keep business end users happy, possibly reduce or at least control costs and possibly even your carbon footprint, and store end user activity for easy and customizable translation to chargebacks.
Predictive Diagnostics using Power Apps in Teams with Azure ML

July 7, 2022

Connecting low code / no code tools to Machine Learning Models is often discussed but rarely implemented in the real world. The video above demonstrates near-real-time predictions in a Power App using Azure ML to make a presumptive medical diagnosis based on answers about symptoms entered by a user. Symptom data entered into a Power App within Microsoft Teams is passed to Azure ML via Power Automate, scored by the Machine Learning Model, logged in the Dataverse, and then seamlessly returned to the Power App in Teams.

The demo predicts Acute Inflammation of Urinary Bladder (Cystitis) and/or Acute Nephritis of Renal Pelvis, which are separate diagnoses often having overlapping symptoms. The University of California Irvine provides an Open Source dataset cataloguing assessments of patients presenting with symptoms related to these diagnoses, which was used to train the Machine Learning Model. Links to the source data and reference articles are included in the video description at this link. This Demo is not intended to be used in the real world (even though the data is real), but is for the purpose of demonstrating the ease by which these types of Presumptive Diagnostic Algorithms can be deployed to achieve Realized Value. Once a machine learning model has been built in Azure ML, the front end interface can be built out in a few hours using low code / no code Microsoft tools.

The video also walks through the architecture for ingesting the ML training data using Azure Data Factory, Azure Data Lake, analyzing the training data with Power BI, and then building an Azure ML Predictive Model. Presumptive Diagnostics are a complicated topic requiring due diligence by medical professionals, but when implemented ethically and with scientific rigor can provide benefits to quality and availability of care.

Everything in this demo is 100% real and operational. As I’m writing this article, it is available in the Microsoft Healthcare Demo environment. If you are a Microsoft Customer, you can ask your field representative to give you a tour of this solution.
Azure Data LakeHouse in an Hour Virtual On-Demand Workshop

May 23, 2022
The video embedded above is a virtual on-demand workshop that walks you through the process of building a simple Azure Data LakeHouse using real CDC National Notifiable Diseases Surveillance System (NNDSS) open data in about an hour. You can watch the workshop to learn about Azure Data LakeHouse architecture, or for a more hands-on experience there are Chapters and Links in the description of the video to help you build your own Azure Data LakeHouse with a low code experience.

When I first heard the term “Data LakeHouse” I assumed that the architecture would be complex and difficult to learn. Instead, I found it to be an architectural pattern that makes sense not only from a Data Warehouse perspective, but also for Business Analysts and Data Scientists. This Virtual On-Demand Workshop is designed to be useful for:
- Data Architects, Developers, and Engineers
- Business Intelligence Professionals looking to expand their skills
- Data Scientists seeking a scalable and flexible Azure data architecture
- Anyone looking to become a Data Professional
In addition to a Workshop teaching the basics of an Azure Data LakeHouse architecture, the end product will provide you a POC with access to real CDC National Notifiable Diseases Surveillance System (NNDSS) which updates weekly: https://data.cdc.gov/NNDSS/NNDSS-Weekly-Data/x9gk-5huc

Here’s a link to watch the on-demand video workshop with Chapters and Links to Resources: https://youtu.be/G1IzJG0N8Z0
Use a Keyword in Teams Chat to Resume and Pause Azure Synapse using Power Automate

January 27, 2022
Reposted from the Microsoft Healthcare & Life Sciences Blog at this link: https://techcommunity.microsoft.com/blog/healthcareandlifesciencesblog/use-a-keyword-in-teams-chat-to-resume-and-pause-azure-synapse-using-power-automa/3073096

Recently I worked with a team having a requirement to send ad-hoc queries to an Azure Synapse Analytics database using front-end tools such as Power BI and Excel. Users would need to query Synapse periodically a few times a week without a set schedule or specific working hours. Behind the SQL endpoint, the underlying Azure Synapse Dedicated SQL Pools provide the power of a Cloud MPP database for massive data volumes (potentially billions of rows) receiving complex queries. When the Synapse Dedicated SQL Pools are running, they are also running up a bill. The situation presented a unique problem – How can I give non-technical ad-hoc users the ability to resume Synapse without taking the risk that they forget to pause it and run up unnecessary charges by leaving it running while nobody is sending queries?

Fortunately, integration with other tools in the Microsoft ecosystem provided a solution. In less than an hour (a few minutes once you’ve done it before) you can enable those non-technical ad-hoc users to simply enter a keyword into Teams to Resume Synapse, run it for a set length of time, and then have it automatically Pause again.

I’ve embedded two videos below that walk you through the process of setting up a Microsoft Teams “Keyword Trigger” to Resume and then Pause Synapse Dedicated SQL Pools. Here’s a quick list of the Microsoft tools you will need to build it yourself:
- Instance of Azure Synapse Analytics with Dedicated SQL Pools (you can deploy a demo and test version from this GitHub site – https://github.com/kunal333/E2ESynapseDemo)
- Azure Data Factory (can be deployed within the same Azure Resource Group as Synapse). In the future, it may be possible to use Azure Synapse Pipelines with Power Automate and avoid a separate Azure Data Factory.
- A Team in Microsoft Teams
- Power Automate
Video #1 – Create Azure Data Factory Pipelines to Pause & Resume Azure Synapse Dedicated SQL Pools

Steps:
1. Create a New Data Factory (or use an existing one!)
2. Give “Power Apps and Flow” access to use that Data Factory
3. Create a Pipeline with a Web Activity
4. Add the API for “Resume Synapse”
5. Repeat steps 3-4 for another Pipeline to “Pause Synapse”
6. Give the Data Factory access to the Azure Synapse Workspace
Video #2 – Use a Keyword in Microsoft Teams with a Power Automate Flow to Resume & Pause Azure Synapse

Steps:
1. Create a Power Automate Flow
2. Designate a Teams Keyword to Trigger the Flow
3. Choose the “Resume Synapse” Data Factory Pipeline created in Video #1
4. Determine the duration of time that Synapse will run (“Delay” step in the Flow)
5. Choose the “Pause Synapse” Data Factory Pipeline created in Video #1
6. Test the Power Automate Flow in Teams
Those of you who have experience with either Azure Data Factory or Power Automate will realize that this architectural pattern can be extended to a great deal of new possibilities. For example:
- With Azure Data Factory
  - Extend the Synapse Pause/Resume pipelines in Azure Data Factory to be part of larger ELT/ETL processes
  - Trigger Pipelines completely unrelated to Synapse – Update forecasting reports, refresh the daily data in an Azure SQL DB, etc.
- With Power Automate
  - Create multiple Power Automate Flows that different people can use for different durations of Synapse uptime. Maybe a manager can Resume Synapse for 8 hours, while an analyst can resume it for 1 hour?
  - Use different triggers to turn on Synapse. Maybe a Power BI button or an action from another tool?
  - Add steps to log who is using the Power Automate Flow, and when.
Raw Data Podcast from P3 Adaptive – Healthcare Analytics and Business Intelligence with Greg Beaumont

August 17, 2021
Microsoft Partner P3 Adaptive has been a part of the Microsoft data ecosystem for many years. I recently had the opportunity to join an episode of their Raw Data podcast to discuss Business Intelligence and Analytics from the perspective of Microsoft Healthcare. We took a deep dive into common customer use cases, discussed COVID as a catalyst for analytics in Healthcare, talked about Security and Governance in Regulated Industries, discussed common challenges Healthcare customers encounter due to data volumes and complexity, and speculated about the future of data and analytics.

Here is a link to the podcast episode: https://p3adaptive.com/rawdatapodcast/episode/greg-beaumont/
- P3 Adaptive recently conducted a Power BI Administration and Governance Webinar which can be accessed here: https://p3adaptive.com/power-bi-governance-for-health-life-sciences-presentation-files/
- P3 Adaptive offers Power BI Training which you can learn more about at this link: https://p3adaptive.com/training/
- Information about their Consulting Services is here: https://p3adaptive.com/consulting/
Planning for a Secure and Scalable Power BI Enterprise Architecture with Sensitive Data at the Core

June 28, 2021
Reposted from the Microsoft Healthcare & Life Sciences Blog at this link: https://techcommunity.microsoft.com/blog/healthcareandlifesciencesblog/planning-for-a-secure-and-scalable-power-bi-enterprise-architecture-with-sensiti/2492766

Leveraging data for Analytics, Reporting, and self-service Business Intelligence can provide great value to an organization. Most large organizations (especially those in Regulated Industries) also need to ensure that Sensitive Data such as PHI, PII, PIFI, Company IP, and Company Financials are properly managed and controlled. With Microsoft Power BI and some thoughtful planning, organizations can scale out an Analytics, Reporting and self-service Business Intelligence platform that ensures Security and Compliance for Sensitive Data.

Below is a recorded version of a presentation reviewing considerations and best practices for planning a secure and scalable Power BI Enterprise Architecture with Sensitive Data at the core. Topics include:
- Securing Power BI Tools such as Power BI Desktop and Gateways
- Planning Azure Active Directory (AAD) based authentication and Workspace Architectures
- Row Level Security (RLS) and Object Level Security (OLS)
- Microsoft Information Protection (MIP)
- Microsoft Cloud App Security (MCAS)
- vNets / Private Links
- Azure Data Lake integration
Ingest Healthcare Open Data into Azure and Power BI using New GitHub Repository

April 20, 2021
Reposted from the Microsoft Healthcare & Life Sciences Blog at this link: https://techcommunity.microsoft.com/blog/healthcareandlifesciencesblog/ingest-healthcare-open-data-into-azure-and-power-bi-using-new-github-repository/2278523

Numerous Government agencies make Healthcare Open Data available to the public at no cost. Data from the CDC, CMS, FDA, World Bank, US Census, USDA and many others provide rich sources of valuable data. These vast sources of robust and useful data are free to use but can have different file formats, different table structures, different context, and different data granularities. Ingesting all of this data into a common place where it can be used and shared is often time-consuming and challenging. I’ve put together a repository in GitHub called Power Pop Health to help with these challenges.

Power Pop Health is a collection of content intended to simplify the process of ingesting and prepping Healthcare Open Data for Analytics, Business Intelligence, Data Science, and more. Power Pop Health has a simple mission: Make it easy for you to ingest, transform and format Healthcare Open Data and common reference tables so that you can achieve more. The GitHub repository can be viewed at this link.

How does Power Pop Health work? I’ve tried to make it simple with low code/no code/no PowerShell deployment so that anyone can use it with nothing more than an Azure subscription and Power BI. Where code is necessary, there are cut-and-paste scripts with tutorial videos for the deployment:
- Step 1 – Ingest Raw Data into an Azure Data Lake
- Step 2 – Make the Data usable in Azure and/or Power BI
- Step 3 – You take it from here! The data is ready to blend with your Organizational data, use for training, create demos, analyze to find trends, etc.
What data is currently available in the first release of Power Pop Health?

Over the last few years I have accumulated examples and tutorials that leverage public Healthcare data. This first release is a repository to share those examples in a unified format, and in one place. Future additions to this repository will be based on feedback from the community, with an initial plan to focus primarily on Population Health data such as Social Determinants of Health. Below is a chart of the data available in this first release:

Here’s a quick summary of each data set in the initial release. Before using these data sources, I’d also recommend reading the licensing terms from the data providers to ensure that you are using the data appropriately:

1. CDC Daily PM 2.5 Concentrations – Air Quality measurements at the level of States and Counties for 2001-2016.
2. CDC Population Weighted UV Irradiance – Ultraviolet Radiation measurements at the level of States and Counties for 2004-2015.
3. CMS DRG /MDC / Surgical Class v38.1 – Diagnoses Related Groups (DRGs), Major Diagnostic Categories, and Surgical Class version 38.1.
4. CMS ICD10 CM 2021 – 2021 ICD10 CM Diagnosis codes for the US.
5. CMS ICD10 PCS 2021 – 2021 ICD10 PCS Procedural codes for the US.
6. Date Table (DataFlows) – A custom Date Table that can be deployed to Power BI DataFlows.
7. Date Table (Power Query) – A custom Date Table that can be deployed to Power BI Power Query.
8. Time Table – (DataFlows) – A custom Time Table that can be deployed to Power BI DataFlows.
9. Time Table (Power Query) – A custom Time Table that can be deployed to Power BI Power Query.
10. FCC State & County FIPS – A reference table for State and County FIPS geographical mapping codes provided by the FCC.
11. FDA Food Recall Enforcement Reports – Foods that have been recalled.
12. FDA CAERS Reports (Food Events) – Adverse events attributed to Foods.

13. Medicare Part D Provider Utilization and Payment Data 2013-2018 – I’ll have this data available in the next release, but for now it is available in an end-to-end Azure Synapse and Power BI solution at this link: https://github.com/kunal333/E2ESynapseDemo

recent posts

about