Use the Spark connector in Python

This post walks through how to use the Azure SQL Spark connector from a Python notebook in Databricks.

The Spark connector provides connectivity to external SQL databases (Azure SQL Database/SQL Server) from Databricks. It is similar to the default Databricks JDBC connector, but it is considerably faster because it supports bulk copy inserts.

Problem context

You are working in a Python notebook and want to bulk-copy data into SQL Server, but there is no such function in PySpark; from Python you only get the standard JDBC row-by-row inserts. What can you do as a workaround to achieve the same functionality?

Workaround solution:
The Spark connector is used to connect to Azure SQL Database or SQL Server. At present it is fully supported only in Scala, so the configuration below is written in Scala.

Code developed in Scala:

import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
import com.microsoft.azure.sqldb.spark.bulkcopy.BulkCopyMetadata

val bulkCopyConfig = Config(Map(
  "url"               -> "mysqlserver.database.windows.net", // Azure SQL database server name
  "databasename"      -> "mydatabase",                       // Azure SQL database name
  "user"              -> "username",                         // Azure SQL DB login username
  "password"          -> "password",                         // Azure SQL DB login password
  "dbtable"           -> "mytable",                          // Azure SQL DB table name with schema, e.g. dbo.sampletable
  "bulkCopyBatchSize" -> "1000",                             // Batch size
  "bulkCopyTableLock" -> "true",                             // Table lock while inserting, for faster copy
  "bulkCopyTimeout"   -> "600"                               // In seconds
))
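
Once the connection is configured, the bulk insert itself is a single call on the DataFrame. A minimal sketch, assuming a DataFrame named df has already been prepared and that the imports and bulkCopyConfig above are in scope:

// df is an existing DataFrame prepared for insertion (assumed name)
df.bulkCopyToSqlDB(bulkCopyConfig)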

Parameters:
1. url
Name of the server (Azure SQL Server/SQL Server).

2. databasename
Name of the database to connect to on the server (Azure SQL Server/SQL Server).

3. user
Username for SQL Server authentication.

4. password
Password for the user for SQL Server authentication.

5. dbtable
Name of the table to access for read/write operations in Azure SQL/SQL Server.

6. bulkCopyBatchSize
Batch size for each read/write against Azure SQL/SQL Server.

7. bulkCopyTableLock
Set to 'true' to hold a full table lock for the duration of the read/write (faster bulk copy); set to 'false' if no lock is required while processing.

8. bulkCopyTimeout
Time limit, in seconds, for the entire batch processing.
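
The BulkCopyMetadata import shown above is optional: it lets you describe the destination columns yourself instead of having the connector look them up from the target table. A hedged sketch, reusing bulkCopyConfig and the DataFrame df from above; the column names, JDBC types, and sizes are made-up examples that must match your own table:

// Optional: describe the destination columns explicitly (illustrative columns only)
val bulkCopyMetadata = new BulkCopyMetadata
bulkCopyMetadata.addColumnMetadata(1, "id", java.sql.Types.INTEGER, 0, 0)
bulkCopyMetadata.addColumnMetadata(2, "name", java.sql.Types.NVARCHAR, 128, 0)

// Pass both the config and the column metadata when bulk inserting
df.bulkCopyToSqlDB(bulkCopyConfig, bulkCopyMetadata)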

There is no bulkCopyToSqlDB function support in Python.
The question is: can we still achieve this functionality from Python without much extra coding and analysis?

The answer is 'Yes'. If you have already implemented this functionality in Scala, no worries: you can reuse the same code, without any changes, from the Python notebook.
Steps:

1. Navigate to the Databricks workspace -> Cluster -> Notebook
2. Create one more cell in the Python notebook
3. Add '%scala' at the top of the new cell
4. Use the same Scala code as above without any change
5. Pass the final DataFrame from the Python cell to the Scala cell for bulk insertion (a sketch of this handoff follows below)
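
For step 5, a common way to hand the DataFrame across languages is a temporary view: in the Python cell, register the final DataFrame with something like final_df.createOrReplaceTempView("final_df_view") (the view name is just an example), then read it back in the '%scala' cell. A minimal sketch of that Scala cell, reusing the bulkCopyConfig defined earlier:

%scala
// Read the DataFrame shared by the Python cell via the temporary view
// (the view name "final_df_view" is an assumed example)
val finalDf = spark.table("final_df_view")

// Bulk insert into Azure SQL / SQL Server using the config defined above
finalDf.bulkCopyToSqlDB(bulkCopyConfig)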

Conclusion
If you want to know how to pass DataFrame data from a Python cell to a Scala cell in more detail, click here.

Check that the data is available in the Azure SQL database or SQL Server. That's it. 👍
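
One way to check from the notebook itself is to read the table back through the same connector. A sketch, assuming the same connection details as in the configuration above; the 10-row preview is only an example:

%scala
// Read back a few rows from the target table to confirm the bulk insert landed
val verifyConfig = Config(Map(
  "url"          -> "mysqlserver.database.windows.net",
  "databasename" -> "mydatabase",
  "user"         -> "username",
  "password"     -> "password",
  "dbtable"      -> "mytable"
))
val result = spark.read.sqlDB(verifyConfig)
result.show(10)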


