Connect Azure Databricks to Azure Cosmos DB in a Firewall-Enabled Environment Using the Spark Connector

 

by Ranjeet Singh Ahlawat


Introduction

Most of you are probably already connecting Azure Databricks to Cosmos DB using a service principal (SPN) in a firewall-enabled environment. If you are one of them, just give this post a like and go enjoy something different.

But if you have been trying hard and are tired of trying, take a walk, have some tea or coffee, and come back to read this article; a tired mind does not find solutions.

Now let’s proceed to the article and solve this issue.

 

Understanding the Challenge:

To meet cybersecurity requirements, we normally disable public access to Azure Cosmos DB and create a private endpoint for it. On top of that, most documentation from Microsoft or Databricks shows only the master key as a connection method, and the properties for SPN authentication are hardly mentioned.

 

And master key connectivity is blocked by the admins. Now what can we do? Nothing; ask the Cyber team to make the connection and tell us how to do it. That’s the best solution. But somehow this is the Data Engineer’s job, not Cyber’s, so let’s proceed.

 

 

The Solution:

 

Library used: "com.azure.cosmos.spark:azure-cosmos-spark_3-3_2-12:4.17.2". Choose yours based on the Spark version you are using.

To find all the configuration properties:

https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/docs/configuration-reference.md

 

Basic configs are here:

# Replace every "<...>" placeholder with your own value.
Cfg = {
  "spark.cosmos.accountEndpoint": "<URL>",
  # Spelling as used with this connector version; check the configuration reference for yours.
  "spark.cosmos.auth.type": "ServicePrinciple",
  "spark.cosmos.account.tenantId": "<Azure tenant>",
  "spark.cosmos.account.subscriptionId": "<Subscription ID>",
  "spark.cosmos.account.resourceGroupName": "<Cosmos DB Resource Group Name>",
  "spark.cosmos.auth.aad.clientId": "<SPN Client ID>",
  "spark.cosmos.auth.aad.clientSecret": "<SPN Secret>",
  "spark.cosmos.database": "<Database>",
  "spark.cosmos.container": "<Container>",
  "spark.cosmos.write.strategy": "ItemOverwrite",
  "spark.cosmos.write.bulk.enabled": "true",
  # IMPORTANT: with any other partitioning strategy the index is not used for counts,
  # so latency and RU consumption spike.
  "spark.cosmos.read.partitioning.strategy": "Restrictive",
  "spark.cosmos.read.inferSchema.enabled": "false",
}
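Before blaming the network, it is worth ruling out a config entry that is still empty or still a placeholder. Here is a minimal sketch of such a check. The helper name `find_unfilled_keys` and the list `REQUIRED_KEYS` are my own for illustration (the list is simply the identity and target keys from the config above, not something the connector defines).

```python
# Keys from the config above that must carry a real value.
# This list is an assumption for this check, not a connector-defined set.
REQUIRED_KEYS = [
    "spark.cosmos.accountEndpoint",
    "spark.cosmos.auth.type",
    "spark.cosmos.account.tenantId",
    "spark.cosmos.account.subscriptionId",
    "spark.cosmos.account.resourceGroupName",
    "spark.cosmos.auth.aad.clientId",
    "spark.cosmos.auth.aad.clientSecret",
    "spark.cosmos.database",
    "spark.cosmos.container",
]


def find_unfilled_keys(cfg):
    """Return the required keys that are missing, empty, or still a <placeholder>."""
    unfilled = []
    for key in REQUIRED_KEYS:
        value = str(cfg.get(key, "")).strip()
        if not value or (value.startswith("<") and value.endswith(">")):
            unfilled.append(key)
    return unfilled


# Example with a deliberately incomplete config.
example_cfg = {
    "spark.cosmos.accountEndpoint": "<URL>",
    "spark.cosmos.auth.type": "ServicePrinciple",
}
print(find_unfilled_keys(example_cfg))
```

In a notebook you would call `find_unfilled_keys(Cfg)` and stop if the list is not empty. For the client secret, reading it from a Databricks secret scope with `dbutils.secrets.get(scope=..., key=...)` is usually a better idea than pasting it into the notebook; the scope and key names are yours to choose.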

 

# Schema inference is switched on for this read only, overriding the value in Cfg.
query_df = spark.read.format("cosmos.oltp").options(**Cfg).option("spark.cosmos.read.inferSchema.enabled", "true").load()
display(query_df.limit(10))

 

# df is the DataFrame you want to write to the container.
df.write \
  .format("cosmos.oltp") \
  .mode("append") \
  .options(**Cfg) \
  .save()

 

 

ARE YOU ABLE TO CONNECT???

 

Still NO???

 

What can I do, friends? It’s fine if you are not able to.

 

Oh wait, read on a little more; I still haven’t lost hope. I know you can do it. Many have done it, so why not us?

 

I assume you have granted sufficient access to your SPN. Normally, we are all smart enough to grant the access first.

 

Now, as the headline says, the environment is firewall-enabled: public access is disabled and a private endpoint (PE) has been created. But as we are smart, we also put an NSG on the subnet where our PE was created.

 

Are you taking Cyber lightly? Come on, we also allow only specific IPs to connect on specific ports. Ah, yes, I know: you have already given your IP access on port 443. But sorry, you are using the Spark connector, not the Python library, and Azure Cosmos DB has its own ports for each API.

And yes, Microsoft is smart; they have already documented those. You just have to find them. Try it. You can. I know you can.

 

Ok.. Here is the MSFT document.

 

https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/sdk-connection-modes

 

Come on, scroll to the middle and you will find it.

 


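Now, if you have found it: for NoSQL, the port that needed opening here was 10350. (The same page also lists a port range for direct mode, so if one port is not enough in your setup, check that range too.) I don’t know why there are so many ports; just make one port handle everything. But yes, everything has its own significance.

Oh, that reminds me: Databricks just launched the Universal Format in Delta Lake. So read that article too.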

https://www.databricks.com/company/newsroom/press-releases/announcing-delta-lake-30-new-universal-format-offers-automatic

 

Fine, proceed and add an NSG rule with the Databricks host subnet as the source, port 10350 as the destination port, and your private endpoint as the target.

 

Now wait a couple of minutes. Not sure why, but sometimes it takes a couple of minutes for an NSG rule to take effect.
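If you don’t want to guess whether the rule is live yet, you can test the port from a notebook cell on the cluster. This is only a rough sketch with plain Python sockets: the helper names `can_connect` and `check_ports` are mine, and the host in the usage comment is a placeholder for your own Cosmos DB account hostname. An open TCP port only tells you the network path is clear; it says nothing about whether the SPN is authorized.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_ports(host, ports, timeout=5.0):
    """Map each port to True/False depending on whether it is reachable."""
    return {port: can_connect(host, port, timeout) for port in ports}


# Usage (placeholder hostname, replace with your own account endpoint host):
# check_ports("<your-account>.documents.azure.com", [443, 10350])
```

If 443 comes back True and 10350 comes back False, the NSG rule is the first place to look.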

 

Now you should be good to go. Enjoy your connection.

 

What, it’s still not connecting? Now you can tell my name to your subscription; I heard it works sometimes. If not, let’s connect, and I hope I can help. Or maybe not, but there is nothing wrong in trying.

 

Conclusion:

By following the steps outlined in this blog, you can overcome the challenges of connecting Azure Cosmos DB in a firewall-enabled environment to Azure Databricks using the Cosmos DB Spark Connector. This powerful combination empowers data scientists and developers to unleash the full potential of their data while maintaining a secure and compliant environment.

Unlock the insights hidden in your data with the Azure Cosmos DB Spark Connector and experience the full power of Azure Databricks for advanced analytics and data processing!
