Optimizing Azure Analysis Services Partitions for Azure Data Warehouse External Tables


Recently, I had a chance to work with Azure Analysis Services (AS) sourcing data from Azure Data Warehouse (DW) external tables. Optimizing the processing of Azure Analysis Services partitions built on Azure DW external tables is a bit different from working with regular (physical) data tables, and I will discuss the steps I took to speed up the processing time.

Dataset

The dataset is a collection of daily sales data from a fictional retail company. Each day, there are about 2.5M rows of data – around 150 MB. The daily sales data come into the Azure Data Lake Storage (DLS) via another process, and the folder structure looks like the diagram below: Each year is a folder containing 12 monthly folders, and each of the monthly folders contains 28-31 daily folders with the data files.

Figure: Data Structure in Azure Data Lake Storage

Steps to Optimization

The Initial Setup

Component Pricing Tier
Azure Data Warehouse DW400
Azure Analysis Services S4 400 QPU

Table: Initial Azure Component Pricing Tier

Initially, an external table was created at the fact/sales folder level.

CREATE EXTERNAL TABLE [dbo].[Sales]
(
    [Date] [date] NULL,
    [StoreProductID] [nvarchar](50) NULL,
    [StoreID] [nvarchar](100) NULL,
    [Amount] [float] NULL,
    [Units] [int] NULL
)
WITH (
    DATA_SOURCE = [MyDLS],
    LOCATION = N'/fact/sales/',
    FILE_FORMAT = [textdelimitedCSV],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 1000000
)

This external table is great because it will pick up any sales data added to the Azure DLS going forward.

On the Azure Analysis Services side, we created a single table (well, an expression and a table) that consumed data from the Sales table. Then, partitions were created within the Analysis Services table for each month. Roughly, the query for each partition in the Sales table would look like the one below. PLEASE NOTE that, in reality, M (aka Power Query) is used in Analysis Services, not SQL as shown below.

SELECT
    *
FROM
    dbo.Sales
WHERE
    [Date] BETWEEN '1/1/2018' AND '1/31/2018' -- These date ranges would change for each partition

The initial setup worked fine with the development dataset (around 60M rows), but with the full 2 years' worth of data (1.8B rows), processing the table in Analysis Services failed after a few hours due to a connection timeout.

If Azure Analysis Services were sourcing the data from a physical table in a database, this monthly Analysis Services table partitioning would have been perfect, as the physical table in the database could also be partitioned into monthly chunks. The external tables in Azure DW, however, are essentially views over data stored somewhere else (in this case, Azure DLS); therefore, partitioning only on the Analysis Services side did not yield any performance improvement.

The Lazy Man’s Attempt

“More hardware” was the first thing we thought of, and Azure makes scaling up the hardware very easy. The Azure components were scaled up to the settings shown below. I would never recommend to a client that hardware simply be scaled up when facing a performance issue, but since this project was not for a client, there was more freedom to play with the hardware. Although there were improvements in the Analysis Services processing time with a smaller dataset, the full dataset still failed to process. “More hardware” was not the solution.

Component Initial Pricing Tier Scaled Up Pricing Tier
Azure Data Warehouse DW400 DW6000
Azure Analysis Services S4 400 QPU S9 640 QPU

Table: Azure Components Scaled Up

The First Optimization Test

Hypothesis

Most of the time was spent on the IO between Azure DLS and Azure DW.

Test

  1. The hardware was scaled back down to a more reasonable level.

Component Pricing Tier
Azure Data Warehouse DW1500
Azure Analysis Services S4 400 QPU

Table: Azure Components Pricing Tier for the First Optimization Attempt

  2. A physical Sales table was created in Azure DW instead of the external table.
  3. No partitions were created for the physical Sales table in Azure DW, since the hypothesis under test was that IO time was the bottleneck.
  4. The Control table is still the external Sales table in Azure DW.
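Although the original code is not shown, loading an external table into a physical table in Azure DW is typically done with CTAS (CREATE TABLE AS SELECT). A minimal sketch, with the table name and the distribution/index choices here being assumptions:

    -- Hypothetical CTAS: materialize the external Sales table as a physical table.
    -- Table name, distribution, and index are assumed, not the original code.
    CREATE TABLE dbo.SalesPhysical
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN,
        CLUSTERED COLUMNSTORE INDEX
    )
    AS
    SELECT * FROM dbo.Sales;

CTAS runs as a parallel operation in Azure DW, so it is the usual way to pull external-table data into local storage in one pass.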

Results

Number of Rows External Table Physical Table
60M ~9 minutes ~7 minutes
1.8B Failed Failed

In the preliminary test with a smaller subset of data, the physical table did perform better than the external table – just around 7 minutes for 60M rows compared to around 9 minutes for the same dataset. When the full 1.8B rows of data were loaded into the physical table, however, the performance gains were wiped out, and Analysis Services never processed the full 1.8B rows before failing.

The Second Optimization Test

Because the requirement was to NOT have a physical table, and because we believed we had shown that the issue was not entirely the IO time between Azure DLS and Azure DW, we moved on to the second hypothesis.

Hypothesis

Azure Analysis Services loads all 1.8B rows of data first, then filters the data when processing each partition.

Test

  1. The hardware was kept the same as in the first optimization test.
  2. A view was created in Azure DW for each month.
  3. The Control table is still the external Sales table in Azure DW.
  4. Each Analysis Services partition now sourced data specifically from a monthly view rather than the whole table. The SQL representation of each partition would be:

    SELECT
        *
    FROM
        dbo.cpgSalesYYYYMM -- YYYYMM represents the Year Month part of the view name
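For reference, each monthly view is a simple date-range filter over the external table. A sketch for January 2018, following the cpgSalesYYYYMM naming convention (the exact view bodies are assumed, not quoted from the project):

    -- Sketch of one monthly view over the external Sales table (January 2018).
    CREATE VIEW dbo.cpgSales201801
    AS
    SELECT [Date], [StoreProductID], [StoreID], [Amount], [Units]
    FROM dbo.Sales
    WHERE [Date] BETWEEN '2018-01-01' AND '2018-01-31';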

Results

Number of Rows External Table Monthly View
60M ~9 minutes ~9 minutes
1.8B Failed Failed

In the preliminary test with a smaller subset of data, there was no discernible difference between using a view and using the whole table. That made sense to us, since Azure Analysis Services would have sent a query down to Azure DW to get a subset of the data even in the initial setup of the partition. During the Analysis Services processing of the Sales table, we noticed that Analysis Services was loading data into one partition at a time.

The Third Optimization Test

Because the previous test showed that Analysis Services was not bringing in the entire dataset for each partition and then filtering it, we moved on to the next hypothesis.

Hypothesis

An Azure DW external table does not cache any data: every time you query an external table, Azure DW reaches back to the source system (Azure DLS in this case), reads all the data into memory, and then performs the query. This is not only an IO issue but also a matter of the sheer amount of data that must be loaded into memory on every query. Therefore, an external table for each month would let each external table carry less data at a time.

Test

  1. The hardware was kept the same as in the first optimization test.
  2. An external table was created in Azure DW for each month.
  3. The Control table is still the external Sales table in Azure DW.
  4. Each Analysis Services partition now sourced data specifically from a monthly external table rather than the whole table. The SQL representation of each partition would be:

    SELECT
        *
    FROM
        dbo.SalesYYYYMM -- YYYYMM represents the Year Month part of the external table name
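For completeness, each monthly external table points at just that month's folder in Azure DLS. A sketch for January 2018, reusing the column definitions from the original DDL; the LOCATION path assumes the year/month folder layout described earlier and is not quoted from the project:

    -- Sketch of one monthly external table (January 2018). LOCATION assumes
    -- a /fact/sales/<year>/<month>/ folder layout in Azure DLS.
    CREATE EXTERNAL TABLE [dbo].[Sales201801]
    (
        [Date] [date] NULL,
        [StoreProductID] [nvarchar](50) NULL,
        [StoreID] [nvarchar](100) NULL,
        [Amount] [float] NULL,
        [Units] [int] NULL
    )
    WITH (
        DATA_SOURCE = [MyDLS],
        LOCATION = N'/fact/sales/2018/01/',
        FILE_FORMAT = [textdelimitedCSV],
        REJECT_TYPE = VALUE,
        REJECT_VALUE = 1000000
    );

Because each table's LOCATION covers only one month's folder, any single query touches far fewer files, which is what allowed the parallel loading described below in the results.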

Results

Number of Rows External Table Monthly External Table
60M ~9 minutes ~6 minutes
1.8B Failed Succeeded in ~50 minutes

In the preliminary test with a smaller subset of data, using the monthly external tables showed about a 33% improvement in performance over the control table (an external table with 25 months of data / 1.8B rows). With the full set of data, 1.8B rows, using the monthly external tables succeeded in processing the whole Analysis Services table. During the Analysis Services processing of the Sales table, we also noticed that Analysis Services was loading data into multiple partitions at a time. Azure DW and Azure Analysis Services were finally able to parallelize the data ingestion.

Conclusion

In our tests, creating an external table in Azure DW at the monthly level was the key to optimizing the processing time in Azure Analysis Services. To generalize this result: a smaller external table confined to a smaller number of physical files in the source system allows Azure DW to do what it does best, work in parallel. I believe the issue is still on the IO side; because an Azure DW external table has to load every file on every query, the sheer IO time bottlenecked everything else.

I welcome your feedback on these findings. Please do add to this article in the comments below.
