Power BI dataflows to store historical data

Power BI dataflows is a nice capability of Power BI. It’s actually not a new Microsoft product. If you use Excel and Power BI long enough, there is a big chance that you’ve already used Power Query at least once. And basically, to keep things simple, it’s just Power Query, but you have to edit it on a web-browser, in Power BI Service.

Pro license is your entrance ticket to use this nice capability. Although Premium licenses (per capacity or per user) provide more advanced features, many applications of Power BI dataflows are already available with Pro license. One of them is to store historical data, and of course, to show them in reports.

In this blog post, I will briefly introduce Power BI dataflows (with some useful links if you want to learn more). Next, I’ll share how to store historical data in your own storage account and show both current and historical data in a Power BI Report (with ready M scripts, and a pbit file posted on my GitHub repository), so you can roll up your sleeves and start right away with your own use case!

How I Met “pOWER BI dataflows”?

Two years ago, I was required to build a report with data source as APIs[1]. I had to show not only the current data, but also the historical data in the report. Imagine, you transform the data from many sources, then build your report. But the report can only show the current state of your data. After the refresh, your old data in the data model is replaced by the new one. So basically, displaying current and historical data in the same report is not something off-the-shelf.

For example, a report shows the “Completion percentage” of a project. Yesterday it was 50%, today it is 55%. In a normal report, it always shows the latest number which is 55%. But what if you want to see the trend of the “Completion percentage” of not only today, but also yesterday, a month ago, and a year ago? Even if you had a proper data warehouse with Persistent Staging Area (PSA), retrieving and transforming only the history that you need from that huge PSA would not always be an easy and lightweight solution. Needless to say, you sometimes cannot avoid data sources like Excel files, csv files or APIs. So, the main goal is to transform the raw data, accumulate the cleansed data somewhere, and then combine them.

At that moment, I thought of Python as a promising solution. Whenever the data is refreshed, the “Run Python script” step in Power Query can save the Source (the transformed table in the first step) as a csv file and dump it in a storage location. From that storage place, the accumulated-overtime files can be appended together to show the historical data.

Running Python script in Power Query to save as csv file

It worked as expected in Power BI Desktop. However, when I published it to Power BI Service, it didn’t support the standard mode of the Gateway for Python/R scripts in Power Query steps[2]. Although I think using Python/R can be a great supplement to Power Query for data transformation, by the time I’m writing this post, this idea still hasn’t got enough attention from Microsoft for a proper action yet[3].

While waiting for Microsoft to fix that (and I’m happy that I didn’t wait), I did some research, and Power BI dataflows came as a savior, although, at that time, it was still in Preview. If you compare it to today’s version, PBI dataflows has come a long way.

WHAT IS “pOWER BI dataflows”?

As I mentioned earlier, Power BI dataflows (PBI dataflows)[4] is not a new tool. If you know Power Query already, PBI dataflows is just Power Query[5], but hosted on Power BI Service. The good news is that the learning curve for it is not that steep if you already master Power Query. Actually, there is not much to learn more about the tool itself, other than its applications to the real world use cases, for example the topic that we discuss today. The other good news is that Power Query is not only used in Power BI Desktop, or Excel, but it has been integrated in more and more Microsoft products like Power Platform (Power BI Service as we talk about in this post, Power Apps, and Power Automate[6]), Dynamics 365 Customer Insights, and Azure Data Factory. So it’s safe to say that you just need to master one “Power Query” tool, you can apply that knowledge to different Microsoft products to connect and transform your data either in a self-service or an enterprise-scale project[7].

In this post, I don’t intend to tell you everything about PBI dataflows, as it’s by itself already a huge topic, and I am not the expert in all details about PBI dataflows. This series of blog posts from Matthew Roche – Principal Program Manager Lead, who leads the development of PBI dataflows, is more than enough for you to start.

let’s Play with “pBI dataflows”

If your organization already has its own Azure Data Lake Storage Gen2 (ADLS Gen2) in place to store the PBI dataflows, you may skip to “The most interesting part” section.

STORE pBI DATAFLOWS IN YOUR AZURE ENVIRONMENT

By default, when you create PBI dataflows in Power BI service, the metadata json file and the content csv files of the transformed tables (or entities as an alternative term) are stored and maintained in ADLS Gen2 by Power BI Service itself. The good news is that you can also configure the PBI dataflows to write all of those data and their versions into an ADLS Gen2 resource of your own organization’s environment. Details of the integration are in this documentation. There might be some hiccups while doing the configuration, as I did encounter myself. However, as long as you are the owner of a ADLS Gen2 (not Gen1) resource, and either a Power BI administrator (if you want to connect to the whole Power BI tenant) or a Workspace Admin (if connect to a specific Workspace only), you should be able to have your PBI dataflows storing data inside your own tenant’s ADLS Gen2. If not, please let me know in the comment.

create your FIRST pBI dataflow

When the configuration to store the PBI is done, you can start creating your first PBI dataflow. As it’s not the scope of this post, I just introduce you to this Microsoft document to show you how to create your own PBI dataflows. You can then use your Power Query skills to transform the data as much as you wish.

In one PBI dataflow, you can have as many tables (or entities) as you want. For each one, you can choose to “Enable load” it (by right-clicking while hovering the table) to store it in your already-configured ADLS Gen2.

In some cases, you want use your existing table (Source) to create a new table (Target), you can use the “Reference” function as in the picture above. However, if you do that, and your Source and Target tables are in “Enable load” mode, PBI dataflows needs in-storage computations, which is a Premium feature. As a workaround when you have only Pro license, you can disable load the Source. This tip also applies to when you want to merge/append table A with table B.

QUICK LOOK AT YOUR ADLS GEN2

When you are satisfied with your PBI dataflow, you can save it and refresh it for the first time. If the refresh is not successful, you may have to review your code until it refreshes successfully. Afterwards, the data as well as the metadata of your PBI dataflow stored in your ADLS Gen2 should look like the screenshot below.

Your ADLS Gen2 of the PBI dataflow (viewed in Microsoft Azure Storage Explorer)

As you can see on the left side, your ADLS Gen2 storage account contains 2 containers with the default names as “powerbi” and “power-bi-backup”. In the “powerbi” container, it contains all the Power BI workspaces (as yourWorkspaceName in the screenshot) that you configure to store the PBI dataflows. You can have more than one PBI dataflow in one PBI workspace. If you choose one of them (as yourDataflowName in the screenshot), it should contain at least two folders:

  • model.json.snapshots contains json files, which are versions of the metadata of the csv files contained in the other “xyz.csv.snapshots” folder(s). The folder always has the fixed name like that.
  • yourEntityNameCurrent.csv.snapshots and yourEntityNameHistorical.csv.snapshots contain csv files, which are the data versions of the entities (tables) created in “yourDataflowName” dataflow.

You may wonder why there are two entities yourEntityNameCurrent and yourEntityNameHistorical. My reasoning is to make the M scripts available on my GitHub more general. In most cases, you just need to create one entity in the PBI dataflow for both the current and historical data. After any refresh, the new data from the PBI dataflow is accumulated in the ADLS Gen2. So you can extract the latest data from the PBI dataflow itself and the historical data from ADLS Gen2. That idea is summarized in the following diagram and is explained in more details in the next part.

Summarized architect of the PBI report showing both current and historical data (Image by author)

THE MOST INTERESTING PART

Ok, that’s enough of the introduction. Now, I would like to tell you the part that I love most. I will explain the high-level idea, so you can understand the big picture. For more details, you can use the ready M scripts and the pbit file in my GitHub repository to jump start your own solution.

THE POWER QUERY

As you can see in the diagram above, we have two main sources: PBI dataflow for the most updated data, and ADLS Gen2 for historical data. The ADLS Gen2 stores the snapshots of the tables/entities created by the aforementioned PBI dataflow.

In the Power Query structure screenshot below, you can see a “Parameters” group and two main query groups underneath. As the M scripts are highly parameterized, you do not need to change anything, except for the five parameters listed in that “Parameters” group. I will give more details how to find the values of your own environment for those parameters later in the post.

Power Query structure

“Current Data (from PBI dataflow)”: it’s pretty straightforward as we have only one table “Current Data” in this group. Actually, I have tweaked the code a bit, to make it more parameterized for generalization. If you connect to PBI dataflows directly from your Power BI Desktop, then select the appropriate Workspace, PBI dataflow, and table, the default code generated is very specific with GUIDs. You can drag the “< > icon” in the picture below to see the difference.

Comparison between highly parametized code and default code by Microsoft

Under “Historical Data (from ADLS Gen2)” group, we have three queries:

  • The stColumnsAttributes (a staging table) gets the latest json file, containing the column names and column types of csv files.
  • The fnRenameColumns (a function) gets the results of stColumnsAttributes and renames the unamed columns of the csv files.
  • The Historical Data (the final table) gets the results of the previous two queries, merges all csv files (the history) together, and changes the data types of the columns, so they are the same as configured in the PBI dataflow.

You don’t need to care too much about the details of those queries to start with your own project. (But if you are curious, let me know in the comment!). What you need to know is how to find the values for the parameters.

As introduced in the “Quick look at your ALDS Gen2” part, you can use Microsoft Azure Storage Explorer (similar in Azure Portal) to find the appropriate values for the five parameters prWorkspaceName, prDataflowName, prEntityNameCurrent, prADLSStorageAccountName, prEntityNameHistorical. Unlike DAX, please be aware that M in Power Query is case-sensitive, so capital and lower-case letters matter.

You do not need to copy and paste all the M scripts showed above in Power Query Editor. You can just download the .PBIDataflow_StoreHistoricalData.pbit file. When opening it, you’ll be asked to fill in the 5 parameters that I mentioned earlier, which looks like the screenshot below.

After having inputted the correct parameters, you are prompted to input your credentials. For the ADLS Gen2, you can choose between “Organizaion acount” and “Account key” methods. It’s highly recommended to use the former, as it’s more secured and gives you more granular control of the access to the folders and files (it’s called Access control lists or ACLs). For the PBI dataflows, the only method is your “Organizaion acount”. Having the authorized access to those resources, it takes a while to load your data. The end result is the only two tables that I enable load: “Historical Data” and “Current Data”.

THE DATA MODEL

For the sake of demonstration, and for you to start your own use case, I have this simple data model of 2 fact tables, where I have ProjectID as mapping key, and a one-to-many relationship between “Current Data” (one side) and “Historical Data (many side). It makes sense as for a single ProjectID in “Current Data” table, you have many “duplicated” ProjectID but with different Dates in the history. Each date can have different “% Complete” value.

Relationship between Current and Historical Data

It’s worth mentioning that when you load your data with the correct parameters and credentials, you do not have the column names like I show above, but your own data. You have to find your own mapping key, which should be unique in the “Current Data” table. Otherwise, you may encounter a Many-to-Many relationship, which should be avoided. Then, you have to create one yourself in both tables and create again the relationship. It’s good practice to create a better model with dimension tables around the “Current Data” fact table in case you want to slice and dice your data more. Furthermore, you should create your own Date table, instead of using the Auto date/time table as in this simplified demo.

This type of data model allows us to do some drill-through or tooltips that I will show in the next part.

THE VISUALIZATION

As you can see in the screenshot below, the “Current overview” shows the basic information of a project, including the Project Name and % Complete. As an example, Project 04 is currently 100% regarding the “% Complete”. So how is the trend of completion percentage of that project in the past until it gets 100%?

Current overview page with drill-through feature

With some initial setup, you can right-click and drill-through to the “DT_Project Progress” page to answer the question above. As you can see in the screenshot below, the project started at around 40% in November 2020 (when we started to collect the data), and reached 100% at around March 2021.

Drill-through page with dynamic title

You can also set up the tooltip as in the screenshot below, where you can hover the project and quickly see its trend of Completion percentage.

Current overview with tooltip feature

The setup of the Drill-through page, Tooltip or Dynamic title that you see above is out of scope of this blog post. You can find some clues how I did that in the pbit file. Otherwise, Bing or Google or DuckDuckGo or whatever search engines are your best friends. Please be informed that when you load your own data in the pbit file, all of the visualizations get errors as you have to change your own column name to the visuals.

Last but not least, the whole solution that I have described works almost the same for Excel (exactly the same for the “Historical Data” part, and some modifications for the “Current Data” part). That is because up until now, Excel does not support PBI dataflows connector yet. So you have to find a workaroud for that. You can use the same ADLS Gen2 connector, but this time not combine all the history, but just take the latest csv file.

Should you have any difficulty in working on your own use case, please comment. I would be more than happy to help and see what the use cases can be. Thanks for reading!


[1] The trick introduced by this blog post can be applied to any other data sources connectable to Power BI dataflows, like Excel, SQL, Oracle, etc.
[2] To enable Power BI Service (a cloud service) to “talk” securely with on-premise data sources, you need to install a Data Gateway. There are two modes: Standard and Personal. Actually, you can still set schedule refresh for a report with Python/R scripts by using Personal mode, but that data cannot be shared with other users.
[3] Just to avoid any misconception, users need the Gateway if there are Python/R scripts in their Power Query steps. They don’t need to install any Gateway if they use Python/R only in the visualizations. However, Python/R visuals in Power BI are slow, does not support full interaction and has many other limitations.
[4] Although “Power BI dataflows” is in plural form, I refer to it as a whole product name. If “PBI dataflow” is written, it refers to a single PBI dataflow. For the ease of reading, I shorten them as PBI dataflows/PBI dataflow.
[5] As you may already know, Power Query is a wonderful data connectivity and preparation tool that allows you to import and transform the data from many sources, by interacting with a User Interface, without using any code. But if you know enough M script (the language behind Power Query), you still can tweak the code, or write the code from scratch to achieve many other powerful transformation needs.
[6] If you click the link that I shared above, you will see the name “Microsoft Flow”. It is totally different from “Power BI dataflows”. A bit out of topic, believe it or not, on top of my head right now, apart from these two, I can list the other four more Microsoft products (of all time) that contains “flow” word: Mapping data flows (or Data flows in the UI) in Azure Data Factory (ADF), Wrangling data flows (now rebranded as Power Query in ADF), Control Flow and Data Flow in SQL Server Integration Services. Bonus: in Power Automate, you have Cloud flow (called Microsoft Flow before), Desktop flow, and Business process flow. I’m a huge fan of Microsoft, but not of its products name. Although I know the core differences among these products, in practice, it’s hard to talk with business users, and they can easily get confused.
[7] It’s worth mentioning that the features and connectors supported by Power Query differ per hosting products. I do have a strong belief that Power Query will be integrated in more products of Microsoft, and sooner or later, we can have the same Power Query experience across all of these products.

3 thoughts on “Power BI dataflows to store historical data

  1. This is great! What happens when you try to put together the historical data and you have added or removed columns from your dataflow table? For example, maybe your first version of the table had 5 columns but your latest version has 8. Thanks.

    Like

    1. Great question!

      Like in your question, 1st version has 5 columns A,B,C,D,E (existed from Day 1 to Day 10), and the 2nd version has 8 columns (A,B,C,D,E, and extra F,G,H) (Day 11 to present). When you combine the history, you will have a table of 8 columns, with columns F,G,H having blank values from Day 1 to 10.

      Other columns will have data as expected.

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s