Connect Azure Databricks to Azure Cosmos DB in a Firewall-Enabled Environment Using the Spark Connector
by Ranjeet Singh Ahlawat
Introduction
Most of you are already connecting Azure Databricks to Cosmos DB using an
SPN (service principal) in a firewall-enabled environment. If you are one of
those many, just give this post a like and enjoy something different.
But if you have been trying hard and are tired of trying, then take a walk,
drink some tea or coffee, and come back to read this article, because a tired
mind will not find the solution.
Now proceed with the article to solve this issue.
Understanding the Challenge:
To meet Cyber requirements, we normally disable public access to Azure
Cosmos DB and create a private endpoint for it. On top of that, most
documentation from Microsoft or Databricks shows only the Master Key as a
connection method, and the SPN properties are hardly mentioned.
And Master Key connectivity is blocked by the admins. Now what can we do?
Nothing. Ask the Cyber personnel to make the connection and tell us how we
should do it. That’s the best solution. But somehow this is the Data
Engineer's job, not Cyber's, so let’s proceed.
The Solution:
Library used: “com.azure.cosmos.spark:azure-cosmos-spark_3-3_2-12:4.17.2”.
Choose yours based on the Spark version you are using.
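The artifact name above encodes the Spark version (3-3) and the Scala version (2-12). As a rough sketch, a small helper can build the coordinate for your own cluster. The function name is hypothetical, and whether a given Spark, Scala and connector combination is actually published is an assumption you need to check on Maven Central.

```python
def cosmos_connector_coordinate(spark_version: str,
                                scala_version: str = "2.12",
                                connector_version: str = "4.17.2") -> str:
    """Build a Maven coordinate that follows the naming pattern shown above.

    Hypothetical helper: it only formats the string and does not check
    that the artifact exists.
    """
    # Only the major.minor parts are used in the artifact name.
    spark_major, spark_minor = spark_version.split(".")[:2]
    scala_major, scala_minor = scala_version.split(".")[:2]
    artifact = (
        f"azure-cosmos-spark_{spark_major}-{spark_minor}"
        f"_{scala_major}-{scala_minor}"
    )
    return f"com.azure.cosmos.spark:{artifact}:{connector_version}"


print(cosmos_connector_coordinate("3.3.2"))
```

In a Databricks notebook you could pass spark.version as the first argument instead of typing the version by hand.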
The full list of config properties is in the connector's configuration
reference. The basic configs are here:
Cfg = {
    # Replace every "<...>" placeholder with your own value.
    "spark.cosmos.accountEndpoint": "<URL>",
    "spark.cosmos.auth.type": "ServicePrinciple",
    "spark.cosmos.account.tenantId": "<Azure tenant>",
    "spark.cosmos.account.subscriptionId": "<Subscription ID>",
    "spark.cosmos.account.resourceGroupName": "<Cosmos DB Resource Group Name>",
    "spark.cosmos.auth.aad.clientId": "<SPN Client ID>",
    "spark.cosmos.auth.aad.clientSecret": "<SPN Secret>",
    "spark.cosmos.database": "<Database>",
    "spark.cosmos.container": "<Container>",
    "spark.cosmos.write.strategy": "ItemOverwrite",
    "spark.cosmos.write.bulk.enabled": "true",
    # IMPORTANT - any other partitioning strategy will result in indexing
    # not being used to count, so latency and RU would spike up
    "spark.cosmos.read.partitioning.strategy": "Restrictive",
    "spark.cosmos.read.inferSchema.enabled": "false",
}
# Read: the extra .option() overrides inferSchema from Cfg for this read only.
query_df = (
    spark.read.format("cosmos.oltp")
    .options(**Cfg)
    .option("spark.cosmos.read.inferSchema.enabled", "true")
    .load()
)
display(query_df.limit(10))
# Write: df is the DataFrame you want to save to the container.
df.write \
    .format("cosmos.oltp") \
    .mode("append") \
    .options(**Cfg) \
    .save()
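Before running the read or the write, it can help to confirm that no setting was left as a placeholder. The sketch below is only an illustration: the helper name is hypothetical, and treating exactly these keys as required is an assumption based on the config above, not something the connector enforces this way.

```python
# Keys from the config above that are assumed to be required for SPN auth.
REQUIRED_KEYS = [
    "spark.cosmos.accountEndpoint",
    "spark.cosmos.auth.type",
    "spark.cosmos.account.tenantId",
    "spark.cosmos.account.subscriptionId",
    "spark.cosmos.account.resourceGroupName",
    "spark.cosmos.auth.aad.clientId",
    "spark.cosmos.auth.aad.clientSecret",
    "spark.cosmos.database",
    "spark.cosmos.container",
]


def missing_settings(cfg: dict) -> list:
    """Return required keys that are absent, empty, or still a <placeholder>."""
    missing = []
    for key in REQUIRED_KEYS:
        value = cfg.get(key)
        text = "" if value is None else str(value).strip()
        # A value such as "<SPN Secret>" was never filled in.
        if not text or (text.startswith("<") and text.endswith(">")):
            missing.append(key)
    return missing
```

Calling missing_settings(Cfg) and getting back an empty list means every key in that list has a real value. It says nothing about whether the values are correct.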
ARE YOU ABLE TO CONNECT???
Still NO???
What can I do, friends? That’s fine if you are not able to.
Oh wait, read on a little more. I still haven’t lost hope.
I know you can do it. Many have done it, so why not us?
I believe you have granted sufficient access to your SPN.
Normally, we are all smart enough to provide the access first.
Now, as the headline itself says, the environment is firewall enabled: public
access is disabled and a Private Endpoint (PE) has been created. But as we are
smart, we also put an NSG on the subnet where our PE was created.
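Are you taking Cyber lightly? Come on, we also made sure that only a
specific IP on a specific port can connect. Ah, yes, I know, you have
already given access to the IP on port 443. But sorry, you are using the Spark
connector, not the Python library, and Azure Cosmos DB has its own ports for
each API. The Spark connector is built on the Java SDK, which uses direct mode
over TCP by default, so HTTPS on port 443 alone is not enough.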
And yes, Microsoft is smart. They have already documented those ports. You
just have to find them. Try it. You can. I know you can.
OK, here is the MSFT document.
https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/sdk-connection-modes
Come on, scroll to the middle and you will find it.
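Now, if you have found it: for NoSQL, the port I opened is 10350. Check the
table on that page against your own API and connection mode, since it lists
different ports and port ranges for each. I don’t know why there are so many
ports. Just make one port handle everything! But yes, everything has its own
significance.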
Oh, that reminds me: Databricks just launched Universal Format in Delta
Lake, so go read that article too.
Fine, proceed and add an NSG rule with the Databricks host subnet as the
source, 10350 as the port, and your Private Endpoint as the target.
Now wait for a couple of minutes. I am not sure why, but sometimes it takes
a couple of minutes for an NSG rule to come into effect.
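One way to see whether the rule has taken effect is to try a plain TCP connection from a notebook cell on the cluster. This is a minimal sketch using only the Python standard library. The function name is hypothetical, and a successful connect only shows that the port is reachable from the cluster, not that the SPN authentication will work.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, calling can_connect with the host name from your account endpoint and port 10350, and then again with port 443, shows which of the two the NSG is letting through.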
Now you should be good to go. Enjoy your connection.
What, it’s still not connecting? Now you can tell my name to your
subscription; I heard it works sometimes. If not, then let’s connect and I
hope I can help. Or maybe not, but there is nothing wrong in trying.
Conclusion:
By following the steps outlined in this blog, you can
overcome the challenges of connecting Azure Cosmos DB in a firewall-enabled
environment to Azure Databricks using the Cosmos DB Spark Connector. This
powerful combination empowers data scientists and developers to unleash the
full potential of their data while maintaining a secure and compliant
environment.
Unlock the insights hidden in your data with the Azure
Cosmos DB Spark Connector and experience the full power of Azure Databricks for
advanced analytics and data processing!