Recently, I had a chance to work with Azure Analysis Services (AS) sourcing data from Azure Data Warehouse (DW) external tables. Optimizing the processing of Azure Analysis Services partitions that source from Azure DW external tables is a bit different from working with regular (physical) data tables, and I will discuss the steps I took to speed up the processing time.
Dataset
The dataset is a collection of daily sales data from a fictional retail company. Each day, there are about 2.5M rows of data – around 150 MB. The daily sales data come into the Azure Data Lake Storage (DLS) via another process, and the folder structure looks like the diagram below: Each year is a folder containing 12 monthly folders, and each of the monthly folders contains 28-31 daily folders with the data files.
Figure: Data Structure in Azure Data Lake Storage
Steps to Optimization
The Initial Setup
| Component | Pricing Tier |
| --- | --- |
| Azure Data Warehouse | DW400 |
| Azure Analysis Services | S4 400 QPU |
Table: Initial Azure Component Pricing Tier
Initially, an external table was created at the fact/sales folder level.
CREATE EXTERNAL TABLE [dbo].[Sales]
(
    [Date] [date] NULL,
    [StoreProductID] [nvarchar](50) NULL,
    [StoreID] [nvarchar](100) NULL,
    [Amount] [float] NULL,
    [Units] [int] NULL
)
WITH (
    DATA_SOURCE = [MyDLS],
    LOCATION = N'/fact/sales/',
    FILE_FORMAT = [textdelimitedCSV],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 1000000
)
This external table is great because it will pick up any sales data added to the Azure DLS going forward.
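For context, the [MyDLS] data source and [textdelimitedCSV] file format referenced above would have been created ahead of time. A minimal sketch of those definitions, assuming an Azure Data Lake Store source reached via PolyBase, is below; the account URL, the [ADLSCredential] database scoped credential, and the delimiter settings are placeholders and assumptions, not values from the actual project.
CREATE EXTERNAL DATA SOURCE [MyDLS]
WITH (
    TYPE = HADOOP,
    LOCATION = 'adl://<your-account>.azuredatalakestore.net', -- placeholder ADLS account
    CREDENTIAL = [ADLSCredential]                             -- assumed database scoped credential for ADLS access
);
CREATE EXTERNAL FILE FORMAT [textdelimitedCSV]
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', USE_TYPE_DEFAULT = TRUE) -- assumed comma-delimited files
);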
On the Azure Analysis Services side, we created a single table (well, an expression and a table) that consumed data from the Sales table. Then, partitions were created within the Analysis Services table for each month. Roughly, the query for each partition in the Sales table would look like the one below. PLEASE NOTE that, in reality, M (aka Power Query) is used in Analysis Services, not SQL as shown below.
SELECT
    *
FROM
    dbo.Sales
WHERE
    [Date] BETWEEN '1/1/2018' AND '1/31/2018' -- These date ranges would change for each partition
This initial setup worked fine with the development dataset (around 60M rows), but with the two full years' worth of data (1.8B rows), processing the table in Analysis Services failed after a few hours due to a connection timeout.
If Azure Analysis Services were sourcing the data from a physical table in a database, this monthly Analysis Services partitioning would have been perfect, as the physical table in the database can also be partitioned into monthly chunks. The external tables in Azure DW, however, are essentially views over data stored somewhere else (in this case, Azure DLS), so partitioning only on the Analysis Services side did not bring any performance improvement.
The Lazy Man’s Attempt
"More hardware" was the first thing we thought of, and Azure made scaling up the hardware very easy. The Azure components were scaled up to the settings shown below. I would never recommend to a client that the hardware simply be scaled up when facing a performance issue, but since this project was not for a client, there was more freedom to play with the hardware. Although there were improvements in the Analysis Services processing time with a smaller dataset, the full dataset still failed to process. "More hardware" was not the solution.
| Component | Initial Pricing Tier | Scaled-Up Pricing Tier |
| --- | --- | --- |
| Azure Data Warehouse | DW400 | DW6000 |
| Azure Analysis Services | S4 400 QPU | S9 640 QPU |
Table: Azure Components Scaled Up
The First Optimization Test
Hypothesis
Most of the time was spent on the IO between Azure DLS and Azure DW.
Test
- The hardware was scaled back down to a more reasonable level.
| Component | Pricing Tier |
| --- | --- |
| Azure Data Warehouse | DW1500 |
| Azure Analysis Services | S4 400 QPU |
Table: Azure Components Pricing Tier for the First Optimization Attempt
- A physical Sales table was created in Azure DW instead of the external table (a sketch of how this might be done follows this list).
- No partitions were created for the Sales table in Azure DW since the hypothesis we wanted to test was that the IO time was the bottleneck.
- The control table was still the external Sales table in Azure DW.
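To populate the physical table, a CTAS (CREATE TABLE AS SELECT) from the existing external table is the typical approach in Azure DW. A minimal sketch is below; the [SalesPhysical] name and the distribution/index choices are assumptions for illustration, not the project's actual settings.
CREATE TABLE [dbo].[SalesPhysical]
WITH (
    DISTRIBUTION = HASH([StoreID]), -- assumed distribution column
    CLUSTERED COLUMNSTORE INDEX     -- the Azure DW default index type
)
AS
SELECT * FROM [dbo].[Sales];        -- [dbo].[Sales] is the external table defined earlier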
Results
| Number of Rows | External Table | Physical Table |
| --- | --- | --- |
| 60M | ~9 minutes | ~7 minutes |
| 1.8B | Failed | Failed |
In the preliminary test with a smaller subset of data, the physical table did perform better than the external table – just around 7 minutes for 60M rows compared to around 9 minutes for the same dataset. When the full 1.8B rows of data were loaded into the physical table, however, the performance gains were wiped out, and Analysis Services never processed the full 1.8B rows before failing.
The Second Optimization Test
Because the requirement was to NOT have a physical table, and because we believed we had shown that the issue was not entirely the IO time between Azure DLS and Azure DW, we moved on to the second hypothesis.
Hypothesis
Azure Analysis Services loads all 1.8B rows of data first, then filters the data down when processing each partition.
Test
- The hardware was kept the same as in the first optimization test.
- A view was created in Azure DW for each month (a sketch of the view definition follows the partition query below).
- The control table was still the external Sales table in Azure DW.
Each Analysis Services partition now sourced data specifically from a monthly view rather than the whole table. The SQL representation of each partition would be:
SELECT
    *
FROM
    dbo.cpgSalesYYYYMM -- YYYYMM represents the year and month part of the view name
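Each monthly view simply wraps the original external table with a date filter. A sketch for January 2018 is below; the cpgSales naming follows the partition query above, while the exact filter expression is an assumption.
CREATE VIEW [dbo].[cpgSales201801]
AS
SELECT *
FROM [dbo].[Sales]                                   -- the original external table
WHERE [Date] BETWEEN '2018-01-01' AND '2018-01-31';  -- one calendar month per view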
Results
| Number of Rows | External Table | Monthly View |
| --- | --- | --- |
| 60M | ~9 minutes | ~9 minutes |
| 1.8B | Failed | Failed |
In the preliminary test with a smaller subset of data, there was no discernible difference between using a view and using the whole table. That made sense to us, since Azure Analysis Services would have sent a query down to Azure DW to get a subset of the data even in the initial setup of the partitions. During the Analysis Services processing of the Sales table, we noticed that Analysis Services was loading data into one partition at a time.
The Third Optimization Test
Because the previous test showed that Analysis Services was not bringing in the entire dataset for each partition and then filtering it down, we moved on to the next hypothesis.
Hypothesis
An Azure DW external table does not cache any data: every time you query an external table, Azure DW reaches back to the source system (Azure DLS in this case), reads all the underlying data into memory, and then performs the query. The problem is not only IO but also the sheer amount of data that must be loaded into memory on every query. Therefore, an external table for each month would let each external table carry less data at a time.
Test
- The hardware was kept the same as in the first optimization test.
- An external table was created in Azure DW for each month (a sketch of the table definition follows the partition query below).
- The control table was still the external Sales table in Azure DW.
Each Analysis Services partition now sourced data specifically from a monthly external table rather than the whole table. The SQL representation of each partition would be:
SELECT
    *
FROM
    dbo.SalesYYYYMM -- YYYYMM represents the year and month part of the external table name
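Unlike the monthly views, each monthly external table points directly at the corresponding monthly folder in Azure DLS rather than the top-level /fact/sales/ folder. A sketch for January 2018 is below; the exact folder path is an assumption based on the year/month folder structure described earlier.
CREATE EXTERNAL TABLE [dbo].[Sales201801]
(
    [Date] [date] NULL,
    [StoreProductID] [nvarchar](50) NULL,
    [StoreID] [nvarchar](100) NULL,
    [Amount] [float] NULL,
    [Units] [int] NULL
)
WITH (
    DATA_SOURCE = [MyDLS],
    LOCATION = N'/fact/sales/2018/01/', -- assumed path to the January 2018 monthly folder
    FILE_FORMAT = [textdelimitedCSV],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 1000000
)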
Results
| Number of Rows | External Table | Monthly External Table |
| --- | --- | --- |
| 60M | ~9 minutes | ~6 minutes |
| 1.8B | Failed | Succeeded in ~50 minutes |
In the preliminary test with a smaller subset of data, using the monthly external tables showed about a 33% improvement in performance over the control table (a single external table with 25 months of data / 1.8B rows). With the full 1.8B rows of data, using the monthly external tables succeeded in processing the whole Analysis Services table. During the Analysis Services processing of the Sales table, we also noticed that Analysis Services was loading data into multiple partitions at a time. Azure DW and Azure Analysis Services were finally able to ingest the data in parallel.
Conclusion
In our tests, creating external tables in Azure DW at the monthly level was the key to optimizing the processing time in Azure Analysis Services. To generalize this result, a smaller external table, confined to fewer physical files in the source system, lets Azure DW do what it does best: work in parallel. I believe the issue is still on the IO side; because an Azure DW external table has to load every underlying file on every query, the sheer amount of IO time simply bottlenecked everything else.
I welcome your feedback on these findings. Please do add to this article in the comments below.