This post is the beginning of a series of articles about building analytical capabilities in Azure using a data lake, Databricks, and Power BI. On the surface, these technologies seem like they were specifically designed to complement each other, as they provide a set of foundational capabilities necessary to develop scalable and cost-effective business intelligence solutions.
Unfortunately, for reasons we don’t need to get into in this post, you are very likely to run into issues as you try to connect the three together. I will try to address a couple of them here to hopefully make your life a little easier.
First, let’s talk about connecting Databricks and ADLS Gen2
Databricks provides a fairly good overview of the steps necessary to mount ADLS Gen2 to DBFS (the Databricks internal file system). This mount point allows us to create Spark tables on top of ADLS Gen2 folders. Why would we want to do that? Well, a lot more details will follow in subsequent articles, but for now let’s just say that we are looking to leverage Compute-on-Demand and Schema-on-Read design principles in our solution architecture. I don’t see much value in rehashing the concepts already covered in that article, so I will just talk about some issues that you might run into as you follow its directions.
At a high level, the process is very simple: (1) create a Service Principal with “Delegated permissions”, (2) run the dbutils.fs.mount(…) command to create a mount, (3) make sure that the Service Principal has the necessary permissions on the data lake folders, and then (4) create your Spark tables. As you follow these steps, the first issue you might encounter is the “This request is not authorized to perform this operation using this permission” error message when you try to create the mount point. To get past it, make sure that you have admin access to AAD applications so you can find your app in the list of app registrations, then go to Settings -> Required permissions and click on Grant permissions. Before that, you also need to go to Add permissions -> Select an API, type in your application name (I know it’s confusing because you are already in the settings for that same application), select it, and add it to the list of permissions before you click on Grant permissions.
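For reference, here is a minimal sketch of what step (2) looks like in a Python notebook, assuming you keep the Service Principal secret in a Databricks secret scope. Every angle-bracketed value, the secret scope/key names, and the /mnt/datalake mount point are placeholders of my own, not values from this article:

# Minimal sketch of the mount command (step 2). All angle-bracketed values,
# the secret scope/key names, and the mount point are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",  # the Service Principal's Application (client) ID
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<secret-scope>", key="<sp-secret>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",  # placeholder mount point
    extra_configs=configs,
)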
If you set up your Service Principal correctly, you will be able to mount ADLS Gen2. However, just because you can mount it does not mean that you can create or query your Spark tables yet. If your access to the data lake is not set up correctly, you will still get the same error message (“This request is not authorized to perform this operation using this permission”): the CREATE statement will succeed with no error, but the error will appear as soon as you try to SELECT from your newly defined Spark tables.
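To make that failure mode concrete, here is a hypothetical table definition on top of the mount sketched above (the table and folder names are made up for illustration). The CREATE goes through even without data lake permissions; the SELECT is where the authorization error surfaces:

# Hypothetical Spark table on top of a mounted ADLS Gen2 folder (names are placeholders).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_raw
    USING PARQUET
    LOCATION '/mnt/datalake/sales/'
""")

# This is the step that fails if the Service Principal lacks data lake permissions.
spark.sql("SELECT * FROM sales_raw LIMIT 10").show()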
The reason you will still get the same error message is most likely that you did not assign access permissions to your Service Principal. This is where things get confusing. Unlike ADLS Gen1, you don’t assign access from the Azure portal interface; you have to do it using Azure Storage Explorer. The next confusing thing is that you cannot use the Service Principal name when you add it to the list of users; you have to use its Object ID. If you provided an Object ID and your Spark tables still don’t work, that is most likely because you provided the wrong Object ID. The confusing part is that, according to the documentation, you use the App registrations menu option under Azure Active Directory to create your Service Principal, so you will also tend to pick your app from the existing App registrations and use its Object ID to assign access/permissions.
That Object ID will not work; you must go to Enterprise applications instead, find your app there, click on it, go to Properties, and get the Object ID (not the Application ID) there.
Now you can use this Object ID to assign access/permissions to all the folders necessary for the mount (at a minimum, execute on the top-level folders, and read on the folders/files that hold the data. I usually use the same Service Principal for other things, like Data Factory, so I just give it read/write/execute permissions on the whole thing, at least in the Dev environment).
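If you would rather script the ACL assignment than click through Storage Explorer, something along these lines should work with the azure-storage-file-datalake Python package. This is only a sketch, not what I used for this post; it assumes a recent version of the package (which exposes the recursive ACL calls), and the account, key, container, folder, and Object ID values are all placeholders:

# Hypothetical sketch: assigning the same ACLs from Python instead of Storage Explorer.
# All angle-bracketed values are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential="<storage-account-key>",  # or an Azure AD credential
)
folder = service.get_file_system_client("<container>").get_directory_client("<top-level-folder>")

# Merge a read+execute ACL entry for the Service Principal (use the Object ID taken
# from Enterprise applications, as described above) into the folder and everything under it.
folder.update_access_control_recursive(acl="user:<object-id>:r-x")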
What about connecting Spark tables in Databricks to Power BI?
At this point, your CREATE and SELECT statements for Spark tables will work while you are in the Databricks environment. This may give you a false sense of accomplishment, because you are likely to be disappointed again when you try to connect Power BI to your newly minted Spark tables and views.
At the time of this writing, Databricks has a bug in its distributions above version 5.1 that causes an error that looks something like this:
OLE DB or ODBC error: [DataSource.Error] ODBC: ERROR [HY000] [Microsoft][Hardy] (35) Error from server: error code: ‘0’ error message: ‘com.google.common.util.concurrent.UncheckedExecutionException: com.databricks.backend.daemon.data.common.InvalidMountException: Error while using path /mnt/mnt/xyz for resolving path ‘/xyz within mount at ‘/mnt/mnt.’..
The way to fix this error is to go to your Databricks cluster configuration, click Edit, and add the following line to your Spark Config:
spark.hadoop.hive.server2.enable.doAs false
(please see the screenshot below for details)
After you restart your cluster, Power BI will not only be able to see the Spark tables but also actually load the data from them.
Worked like a charm. How did you figure out the root cause on this one?
Primarily by banging my head against the wall and trying different things until something worked
I owe you a beer
🙂