Azure Databricks: Data Ingestion and Reading (Querying SQL Server with the JDBC Driver, Encrypting Key Values with a Secret Scope)



helenaaaaa 2023. 5. 15. 15:30

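The basics of data ingestion and reading in Databricks are covered in a previous post.

This post goes a step further: it queries SQL Server through the JDBC driver and uses a Secret Scope so that key values such as the password do not appear in plain text.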

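Querying SQL Server with the JDBC Driver

Azure Databricks supports connecting to external databases over JDBC.

Under New > Data, you can see how to load a variety of data sources into Databricks, including not only SQL Server but also PostgreSQL, MySQL, MongoDB, and Kafka.

Connection information

To connect to SQL Server, store the SQL Server name, database name, table name, and SQL Server user credentials in variables.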

driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
url = "jdbc:sqlserver://<Server-Name>:1433;DatabaseName=<Database-Name>"
table = "Sales.Customers"
user = "user"
password = "password"
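If several servers or databases are involved, the URL can be assembled from its parts instead of being typed by hand. The following is a minimal sketch that only reproduces the URL shape shown above; the function name `build_sqlserver_url` and the server and database names are hypothetical.

```python
def build_sqlserver_url(server: str, database: str, port: int = 1433) -> str:
    """Return a SQL Server JDBC URL in the same shape as the one used in this post."""
    return f"jdbc:sqlserver://{server}:{port};DatabaseName={database}"


# Hypothetical server and database names, for illustration only.
url = build_sqlserver_url("myserver.database.windows.net", "SalesDb")
print(url)
```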

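Encrypting Key Values with a Secret Scope

Storing the user credentials from the connection information as secrets, instead of writing them in plain text, improves security.

Create an Azure Key Vault

Create a Secret Scope to integrate with the Databricks workspace

Go to https://<Databricks URL>#secrets/createScope and create the Secret Scope through the UI.

Create a Secret in Key Vault

Create a Databricks cluster - reference the Secret through a Spark configuration property

- Only the cluster owner can do this.

- Enter one key-value pair per line.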

spark.<property-name> {{secrets/<scope-name>/<secret-name>}}
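As a concrete instance of the template above, the sketch below formats one such configuration line. The property name `sqldw` matches the `spark.sqldw` key read later in this post; the scope name `kv-scope`, the secret name `sql-password`, and the helper `secret_conf_line` are assumptions made for illustration.

```python
def secret_conf_line(property_name: str, scope: str, secret: str) -> str:
    """Format one Spark config line that references a secret, one key-value pair per line."""
    return "spark." + property_name + " {{secrets/" + scope + "/" + secret + "}}"


# Hypothetical scope and secret names; paste the printed line into the cluster's Spark config.
print(secret_conf_line("sqldw", "kv-scope", "sql-password"))
```

In a notebook attached to that cluster, the value would then be read with `spark.conf.get("spark.sqldw")`.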

Reading the data

df = (
  spark.read.format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("dbtable", table)
    .option("user", user)
    .option("password", password)
    # The options below configure parallel reads. Without them, a single thread reads all the data.
    # partitionColumn: a column with a uniformly distributed range of values, used to split the read
    # .option("partitionColumn", "partition_key")
    # lowerBound: lowest partitionColumn value to pull data for
    # .option("lowerBound", "minValue")
    # upperBound: highest partitionColumn value to pull data for
    # .option("upperBound", "maxValue")
    # numPartitions: number of partitions to distribute the data into. Keep it modest (not in the hundreds) so the database is not overwhelmed
    # .option("numPartitions", <cluster_cores>)
    .load()
)
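The commented-out parallelism options have to be supplied together, so it can help to build them in one place. The sketch below is one way to do that under stated assumptions: the helper `partitioned_read_options` and the column name `CustomerID` are hypothetical, and the bounds shown are made-up values. In Spark, the bounds only determine how the column range is split into partitions; they do not filter rows.

```python
def partitioned_read_options(column: str, lower: int, upper: int, num_partitions: int) -> dict:
    """Build the four JDBC options that Spark needs for a partitioned read."""
    if lower >= upper:
        raise ValueError("lower must be smaller than upper")
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    return {
        "partitionColumn": column,  # numeric, date, or timestamp column
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


# Hypothetical column and bounds, for illustration only.
parallel_opts = partitioned_read_options("CustomerID", 1, 100000, 8)
print(parallel_opts)

# On a Databricks cluster the options could be applied like this:
# df = (spark.read.format("jdbc")
#       .option("driver", driver).option("url", url).option("dbtable", table)
#       .option("user", user).option("password", password)
#       .options(**parallel_opts)
#       .load())
```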

df = (
  spark.read.format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("dbtable", table)
    .option("user", user)
    .option("password", spark.conf.get("spark.sqldw"))  # password comes from Key Vault through the cluster's Spark config
    .load()
)
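As an alternative to going through the cluster's Spark configuration, a notebook can also read secrets directly with `dbutils.secrets.get`. The sketch below is an assumption-laden illustration rather than the method used in this post: the helper `jdbc_credentials`, the scope name, and the secret key names are all hypothetical, and `dbutils` is only available inside Databricks.

```python
def jdbc_credentials(dbutils, scope: str, user_key: str, password_key: str) -> dict:
    """Fetch the JDBC user and password from a secret scope via dbutils.secrets.get."""
    return {
        "user": dbutils.secrets.get(scope=scope, key=user_key),
        "password": dbutils.secrets.get(scope=scope, key=password_key),
    }


# In a Databricks notebook (hypothetical scope and key names):
# creds = jdbc_credentials(dbutils, "kv-scope", "sql-user", "sql-password")
# df = (spark.read.format("jdbc")
#       .option("driver", driver).option("url", url).option("dbtable", table)
#       .options(**creds)
#       .load())
```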

Creating the table

df.write.format("delta").mode("overwrite").saveAsTable("Customers")