Steve Flowers

Automate SQL Column Metadata in Microsoft Purview





Microsoft Purview is Microsoft's unified data governance platform. Purview has grown a lot during my time at Microsoft, and the platform is evolving rapidly. The new tenant-level experience and integrated information protection services meet the needs of organizations large and small that are trying to govern the ocean of data surrounding them.


The Purview data catalog allows us to ingest and store metadata about our data assets, both on-premises and in the cloud. Common metadata includes the data source type (file share, SQL, and so on), data product owner, schema, and glossary terms.


A customer of mine recently asked how to automate column metadata for an Azure SQL database (or, more specifically, a Synapse dedicated SQL pool). How can we enhance our dataset so that Purview can automatically ingest this metadata? Unfortunately, Purview does not offer this out of the box, but the simple solution below may suffice. To use this in production, I recommend folding the process below into your change management process so that metadata is updated whenever new tables are created or schemas are extended.


All of the code discussed has been pushed to the accompanying GitHub repository.



SQL Server has built-in support for creating and maintaining this kind of metadata: extended properties. The "sys.extended_properties" catalog view exposes these flexible name/value pairs, which we can attach to objects to describe our data. We can leverage this functionality to describe our SQL tables for Purview. But how do we do this? Let's say I have a table named "Orders" with a column named "Comments".


exec sp_addextendedproperty  
     @name = N'Description' 
    ,@value = N'String based text collected from order form.' 
    ,@level0type = N'Schema', @level0name = 'dbo' 
    ,@level1type = N'Table',  @level1name = 'Orders' 
    ,@level2type = N'Column', @level2name = 'Comments'
go

This system stored procedure adds a new extended property, which then appears in sys.extended_properties. The parameters are:


  • @name - the name of the extended property.

  • @value - the value of the property.

  • @level0type / @level0name - the first level in the hierarchy, the schema.

  • @level1type / @level1name - the second level in the hierarchy, the table.

  • @level2type / @level2name - the third level in the hierarchy, the column.


In my case, I want to ensure a description is captured for every column when a new table is created. In reality, it will likely be difficult to do this for all columns in your environment, but there is a lot of value in it for downstream data workloads and governance. To implement this effectively, you will need to build the step into your processes for creating new tables and for any table schema updates. Additionally, for ease of querying this data, I created a SQL view over the extended properties. The SELECT statement behind the view looks like this:


SELECT
   SCHEMA_NAME(tbl.schema_id) AS SchemaName,	
   tbl.name AS TableName, 
   clmns.name AS ColumnName,
   p.name AS ExtendedPropertyName,
   CAST(p.value AS sql_variant) AS ExtendedPropertyValue
FROM
   sys.tables AS tbl
   INNER JOIN sys.all_columns AS clmns ON clmns.object_id=tbl.object_id
   INNER JOIN sys.extended_properties AS p ON p.major_id=tbl.object_id AND p.minor_id=clmns.column_id AND p.class=1
WHERE
   SCHEMA_NAME(tbl.schema_id)='dbo'
   and tbl.name='Orders'

Querying the view is much simpler:


SELECT 
	SchemaName, 
	TableName, 
	ColumnName, 
	ExtendedPropertyName, 
	ExtendedPropertyValue 
FROM {view_name}


Let's fast-forward and assume I have implemented this metadata process and all of my tables have a Description extended property. You could also extend this to additional properties such as owner, business unit, and so on.


We can now use this metadata to automatically update column descriptions in Purview, our data governance tool. Doing so ensures that when people are browsing the catalog, they get a description for columns which will assist in their development, data engineering, data science, or data analysis work.


Purview offers a few ways to interact with the catalog programmatically, including the Atlas REST APIs and the open-source PyApacheAtlas Python package.


For this example, I use the pyapacheatlas package in Python. To follow along, open the "Purview Azure SQL Column Description.ipynb" file from the GitHub repository. I am using the Microsoft Fabric Spark engine for this example.


In the first cell, we install the pyapacheatlas package. Next we import the required libraries. The pyapacheatlas libraries allow us to authenticate to Purview, build a client, and work with Atlas entities.


Note: Apache Atlas is the underlying technology of Microsoft Purview, hence the references to Atlas.
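
A rough sketch of those first cells looks something like this (the repository notebook has the exact versions); the PySpark import supports the data frame filtering done later in "get_description()":


%pip install pyapacheatlas

# Authentication and client classes from PyApacheAtlas, plus the PySpark
# col() helper used later when filtering the extended-properties data frame.
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient
from pyspark.sql.functions import col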

In my notebook, to avoid storing secrets, I fetch all of the private variables from Azure Key Vault. I am using Azure SQL in this example, and the goal is to update the descriptions of this database's columns in Purview. The SQL connection information consists of the server URI, database name, and a SQL username and password (don't do this; use Entra ID). There is also a client ID and secret for an Entra ID app registration that has been granted permissions in Purview.
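
As a rough sketch, fetching those values in a Fabric or Synapse notebook might look like the following; the Key Vault URI and secret names are placeholders, so substitute your own:


# Fetch secrets from Azure Key Vault so nothing sensitive is hard-coded.
# The vault URI and secret names below are placeholders.
from notebookutils import mssparkutils

kv_uri = "https://my-key-vault.vault.azure.net/"

tenant_id     = mssparkutils.credentials.getSecret(kv_uri, "tenant-id")
client_id     = mssparkutils.credentials.getSecret(kv_uri, "client-id")
client_secret = mssparkutils.credentials.getSecret(kv_uri, "client-secret")
purview_name  = mssparkutils.credentials.getSecret(kv_uri, "purview-account-name")
sql_uri       = mssparkutils.credentials.getSecret(kv_uri, "sql-server-uri")
sql_uname     = mssparkutils.credentials.getSecret(kv_uri, "sql-username")
sql_pass      = mssparkutils.credentials.getSecret(kv_uri, "sql-password")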


Create an app registration in your Entra ID tenant. Ideally, you would create a security group to manage Purview permissions and add the app registration to it. Add this security group to the "Data Catalog Curators" role group in Microsoft Purview. Additionally, within your Purview data map, you must grant this security group the Data Reader, Data Source Admin, and Collection Admin roles.


Next, if we move on to the "Definitions" section of the notebook, you will find the Python functions that do the work. The first function, "check_entity_if_exist()", simply calls the Purview API to confirm that the table you want to work with exists. It also returns the entity, which we need to parse to retrieve the column information. The entity is returned as a dictionary. The qualified name can be found in the Purview portal in the "Overview" pane. For an Azure SQL database it would look like:


mssql://{SQL server name}.database.windows.net/{database}/{schema}/{table}

The "typeName" property refers to the Apache Atlas entity type. In this example we are working with the "azure_sql_table" entity type.


# Check if the table exists in Purview
def check_entity_if_exist(client, qualifiedName, typeName):
    entities = client.get_entity(qualifiedName=qualifiedName, typeName=typeName)
    return entities

The function "get_description()" parses the extended property table for each of the columns retrieved from the entity. A dictionary of the description is returned.


# Get the value of the Description extended property for a column.
def get_description(df, column_name):
    epn = "Description"
    pdf = df.where(col("ColumnName") == column_name).where(col("ExtendedPropertyName") == epn).select("ExtendedPropertyValue").toPandas()
    values = pdf.to_dict(orient='list')
    if len(values['ExtendedPropertyValue']) > 0:
        return values['ExtendedPropertyValue'][0]

Finally, the "update_entity()" function performs the work of updating the Purview entity with the new column descriptions. If you examine its parameters, you will see that it needs a client, a list of the entity's columns, and a data frame. To update an entity this way, the GUID of each column entity is required; the GUIDs are part of the dictionary returned when we parse the entity later in the notebook. The data frame holds the results of querying the Azure SQL view that exposes the extended properties.


# Update column description in Purview
def update_entity(client, columns_list, df):
    for element in columns_list:
        name = element['name']
        userDescription = get_description(df, element['name'])
        if userDescription:
            client.partial_update_entity(
                guid=element['guid'],
                attributes={
                    "userDescription" : userDescription
                })
            print("Column {0} updated with description: {1}.".format(name, userDescription ))

Looking at the "Work" section, the first thing we do is authenticate and build the client.


# Authenticate against your Atlas server
oauth = ServicePrincipalAuthentication(
    tenant_id= tenant_id,
    client_id= client_id,
    client_secret= client_secret
)

# Instantiate Purview client
client = PurviewClient(
    account_name = purview_name,
    authentication=oauth
)

We then configure the settings required to query our Azure SQL view. Following that, we run "spark.read.jdbc()" to send the query to Azure SQL.


# Get extended properties from Azure SQL.

jdbcHostname = sql_uri

# My example database is called "sql-test"
jdbcDatabase = "sql-test"


jdbcPort = 1433
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)

view_name = "OrdersProperties"

query = f"(SELECT SchemaName, TableName, ColumnName, ExtendedPropertyName, cast(ExtendedPropertyValue as varchar(255)) as ExtendedPropertyValue FROM {view_name}) as view_query"

connectionProperties = {
    "user": sql_uname,
    "password": sql_pass,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

Spdf = spark.read.jdbc(
    url=jdbcUrl,
    table=query,
    properties=connectionProperties
)

display(Spdf)

The returned data is stored in a data frame. Finally, we use our function to retrieve the entity based on the information provided. The entity is of type "azure_sql_table".


sql_db = "sql-test"
sql_schema = "dbo"
sql_table = "Orders"

entity = check_entity_if_exist(client, "mssql://{0}/{1}/{2}/{3}".format(sql_uri, sql_db, sql_schema, sql_table), "azure_sql_table")

The rest is standard Python: get the columns from the entity dictionary, then reshape them into a list of name/GUID pairs that better fits our purpose. There may be a better way to achieve this (see the alternative sketch after the code), but I am not a strong Python developer (clearly).


columns = entity["entities"][0]["relationshipAttributes"]["columns"]

columns_list = []

for i in columns:
    column_entry = {
        "name" : i['displayText'],
        "guid" : i['guid']
    }
    columns_list.append(column_entry)

# Update all columns for the SQL table.

update_entity(client, columns_list, Spdf)
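
Incidentally, the name/GUID reshaping above could also be written as a list comprehension; this is simply a more compact, equivalent form of the same loop:


# Equivalent to the loop above: build the name/guid pairs in one expression.
columns_list = [{"name": c['displayText'], "guid": c['guid']} for c in columns]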

We now have everything required to update the entity in Microsoft Purview. Every column will be updated whether or not its description has changed, which is another area where this notebook could be improved. Additionally, I would only trigger this notebook when a change to a SQL table has been detected; running it on a schedule might work for a small number of databases, but most of the customers I work with have tens or hundreds of thousands of tables.


I hope you found this post valuable, and please feel free to contribute to the repository if you think this process could be improved.


