Data Saturday Atlanta Wrap-Up

Data Saturday Atlanta was a major success, both for the event and for me personally. Rie Merritt (@IrishSQL) and the organizing committee did a great job bringing together a stellar group of speakers covering a wide range of topics. A HUGE thank you to the Atlanta Azure Data User Group for the opportunity to speak about PowerShell. Thank you to each and every attendee of my session, Reducing Data Warehouse Processing with PowerShell Pre-ETL Processing. It was great to have so many people show interest in PowerShell and in how data developers, data analysts, QA analysts, and others can leverage its capabilities.

The PDF version of the slide deck is available for download.

In addition to the slides, below are the demo scripts from the presentation. The data files used in the presentation are being made available, but you really shouldn’t need them. The idea is to use the scripts to get inspiration for your own solutions. The examples make use of two functions I’ve created. They are documented in this post.

CSV Validation – Loading data from a CSV into SQL Server is straightforward, provided the columns in the CSV file match the database table’s. This validation performs a cursory check of the columns being passed in a CSV file. Certainly, there are other checks that can be performed, but these are a couple to get a solution started.

# It is assumed the CSV being used contains a header record in the first row of the file. As a result,
# only the first record ( -TotalCount 1 ) is being read.

$header = Get-Content -Path $FileName -TotalCount 1
# Add .Replace("`"", "") to the end of the split operation to remove quotes from the values.
$headerColumns = ($header -split ',') 


# With a database table designed with the schema intended to be in the CSV 
#   file, query SQL Server to get the list of columns (and data types).
# NULL is being passed to the -Credential parameter, which informs the 
#   function to use Windows authentication to connect to the SQL instance.
$databaseTableColumns = Get-SQLColumnsForDataTable `
                -ServerInstance localhost `
                -Database AdventureWorks `
                -Credential $null `
                -SchemaName Production `
                -TableName Product

# Perform a quick column count to see if the number of columns match.  By no means does this
#   mean the columns are the same, even if the counts are equal.
if ($headerColumns.Count -eq $databaseTableColumns.Count)
{
    write-output "The columns in the file match the table definition"
}
else
{
    write-output "The number of columns in the source file does not match the table definition"
}

# Check the CSV's columns to see if columns from the table are missing.
$missingColumn = $false
foreach ($column in $databaseTableColumns) 
{
    if ($headerColumns -notcontains $column.ColumnName) 
    { 
        $missingColumn = $true
        write-output "   Column, $($column.ColumnName), was not found" 
    } 
}
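
A natural extension of the demo, not in the original script, is to run the same comparison in the other direction and flag CSV columns that do not exist in the destination table. This sketch assumes the variables from the snippet above; the Trim('"') call handles header values that still carry quotes.

# Check the table's columns to see if the CSV contains columns the table does not have.
$extraColumn = $false
foreach ($headerColumn in $headerColumns)
{
    if ($databaseTableColumns.ColumnName -notcontains $headerColumn.Trim('"'))
    {
        $extraColumn = $true
        write-output "   Column, $headerColumn, is not in the table definition"
    }
}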

Excel Validation – Like the CSV validation, validating Excel files is also possible with the PSExcel module (technically, only EPPlus.dll is being used, for the OfficeOpenXML namespace). As you compare the two scripts, there is more work in the Excel validation, due to needing the assistance of EPPlus.dll to access the data, versus reading the first line of text from the CSV file with Get-Content.

# Install the PSExcel module from the PowerShell Gallery; the install only needs to be run once on a computer,
# while Import-Module runs in each new session.  Run PowerShell as administrator to install the module for all users.

install-module PSExcel
import-module PSExcel

$excelFile = New-Object OfficeOpenXml.ExcelPackage $FileName
$workbook  = $excelFile.Workbook

# Every worksheet in Excel has a name on the tab.  That name goes here.
$worksheet = $workbook.Worksheets["Details"]
 
# $worksheet.Dimension.Rows returns the number of records in the Excel worksheet.  
# It may be useful to log the number of records received in a pre-processing/validation step.
Write-Output "#################"
Write-Output "   File contains $($worksheet.Dimension.Rows) rows"
Write-Output "#################"

# Like Dimension.Rows, Dimension.Columns returns the number of columns with data in the worksheet.
$worksheetColumns = $worksheet.Dimension.Columns

 #
 # Get columns for Sales.SalesOrderDetails
 #
 $dbColumns = Get-SQLColumnsForDataTable -ServerInstance localhost `
                                         -Database AdventureWorks `
                                         -Credential $null `
                                         -SchemaName "Sales" `
                                         -TableName "SalesOrderDetail"

if ($worksheetColumns -eq $dbColumns.Count)
{
    write-output "             "
    write-output "The Excel file and the database table have the same number of columns"
    write-output "             "
}
else
{
    write-output "The Excel file and the database table do not have the same number of columns"

    #
    # Find the differing columns
    #
    $worksheetColumnNames = @()
    $dbColumnNames = $dbColumns.ColumnName
    foreach ($columnNum in 1..$worksheetColumns)   # Excel columns are 1-based, so loop through all of them
    {
        $columnName = $worksheet.Cells.Item(1,$columnNum).Text
        
        if ($dbColumns.ColumnName -notcontains $columnName)
        {
            write-output "$columnName is not in the database table"
        }

        $worksheetColumnNames += $columnName
    }
}
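
Two small additions worth considering here (my own sketch, not part of the original demo): reuse the $worksheetColumnNames and $dbColumnNames arrays collected in the else branch above to flag table columns that are missing from the worksheet, and dispose of the ExcelPackage when finished so the file handle is released.

# Check for table columns that were not found in the worksheet's header row.
foreach ($dbColumnName in $dbColumnNames)
{
    if ($worksheetColumnNames -notcontains $dbColumnName)
    {
        write-output "$dbColumnName was not found in the Excel file"
    }
}

# Release the file handle held by EPPlus.
$excelFile.Dispose()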

JSON to CSV Transformation – This script relates directly to the screenshot of the JSON file on slide 21 of the slide deck; it creates a CSV file from the highlighted elements in the JSON.

$activities = (Get-Content $FileName | convertfrom-json).Value

$activitiesTable = $activities | select activityName, activityRunStart, activityRunEnd, durationInMs, 
                                  @{Name="storedProcedureName";
                                    Expression={
                                                    ($_ | select-object -expandProperty input).storedProcedureName
                                               }
                                   }

$activitiesTable | format-table

$activitiesTable | export-csv c:\demos\activities.csv -NoTypeInformation

Using Metadata to Load Tables – When you need to load files into tables repeatedly over time, such as updated customer information arriving monthly, it may be useful to create a table to define metadata about those files. This could include the naming pattern of the file and which database the data should reside in, along with the specific schema and table names. This example takes that metadata and instantiates a bulk copy object from Microsoft.Data.SqlClient. With the System.Data.DataTable object holding the contents of the CSV file, that data is loaded into the specified table.
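
For reference, here is one possible shape for that metadata table. This is a hypothetical sketch; the Logging.LoadMetadata name and column names come from the demo below, but the data types and the computed column are my assumptions.

# Hypothetical definition of the metadata table queried in the script below.
$createMetadataTable = @"
CREATE TABLE Logging.LoadMetadata
(
    FileNamePattern            varchar(260) NOT NULL,
    DestinationDatabaseName    sysname      NOT NULL,
    DestinationSchemaName      sysname      NOT NULL,
    DestinationTableName       sysname      NOT NULL,
    DestinationSchemaTableName AS (DestinationSchemaName + '.' + DestinationTableName)
);
"@
invoke-sqlcmd -ServerInstance localhost -Database PowerShellDemo -Query $createMetadataTable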


$query = "select FileNamePattern, 
                 DestinationDatabaseName, 
                 DestinationSchemaName, 
                 DestinationTableName, 
                 DestinationSchemaTableName 
          from Logging.LoadMetadata"

# invoke-sqlcmd with no credential uses context of the current user
$files = invoke-sqlcmd -ServerInstance localhost -Database PowerShellDemo -Query $query 

foreach ($file in $files)
{
    write-output "...Loading $($file.DestinationSchemaTableName)"

    #Query sys.columns & sys.types to get the fields of a table
    $dbColumns = Get-SQLColumnsForDataTable -ServerInstance localhost `
                                            -Credential $null `
                                            -Database $file.DestinationDatabaseName `
                                            -SchemaName $file.DestinationSchemaName `
                                            -TableName $file.DestinationTableName

    #Load CSV data into memory and generate a DataTable with defined data types
    $dataTable = Import-CSV "C:\Demos\CSVs\$($file.FileNamePattern)" | Out-DataTableWithDataTypes -Columns $dbColumns

    # Bulk copy the DataTable into the specified table
    $bulkCopy = new-object ([Microsoft.Data.SqlClient.SqlBulkCopy]) `
                    -ArgumentList "server=localhost;database=$($file.DestinationDatabaseName);Integrated Security=SSPI"
    $bulkCopy.DestinationTableName = $file.DestinationSchemaTableName
    $bulkCopy.WriteToServer($dataTable)
}
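
If the files grow large, SqlBulkCopy has a few settings worth tuning before the WriteToServer() call. A hedged sketch follows; the values and the "CustomerID" column name are arbitrary examples, not part of the original demo.

# Optional tuning, set before calling WriteToServer().
$bulkCopy.BatchSize = 5000        # commit in batches instead of one large transaction
$bulkCopy.BulkCopyTimeout = 0     # 0 = no timeout for long-running loads
# Explicit column mappings help when the CSV column order differs from the table.
[void]$bulkCopy.ColumnMappings.Add("CustomerID", "CustomerID")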

If you have any questions about these demo scripts, you can find me on Twitter, @em_dempster.

Database Helpers for PowerShell

PowerShell has a lot of great functionality to pull data from various sources on a server, like event logs, WMI objects, the Registry, etc. However, preparing data for logging in a database is an area where PowerShell is lacking.

As supporting material for SQL Saturday/Data Saturday presentations, here are a couple of functions, with basic functionality, to aid in scripting bulk loads to SQL Server. If a different database platform is being used, such as PostgreSQL, Oracle, etc., this code will need to be updated to support that platform.

Get-SQLColumnsForDataTable, as the name should imply, returns an array of objects representing the column names and data types from a table in a SQL Server database. The object array can be used by the second function, Out-DataTableWithDataTypes, to predefine the data types of columns in a System.Data.DataTable object. It does not return the length of strings or the scale and precision of decimal and numeric data types. Additionally, the query translates common SQL data types to .Net data types; if the table includes binary, image, CLR data types, etc., those will need to be translated to an equivalent .Net data type.
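
As a quick illustration of what the function (listed just below) returns, here is a sketch using the same AdventureWorks example as the demos above.

$cols = Get-SQLColumnsForDataTable -ServerInstance localhost -Database AdventureWorks `
                                   -Credential $null -SchemaName Production -TableName Product

# Each element carries the SQL column name, its SQL type, and the translated .Net type.
$cols | Format-Table ColumnName, DataType, DataTableType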

function Get-SQLColumnsForDataTable
{
        Param(
		[Parameter(Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
		[string]$ServerInstance,
		[Parameter(Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
		[string]$Database,
		[Parameter(Mandatory=$false, ValueFromPipelineByPropertyName=$true)]
		[PSCredential]$Credential,
        [Parameter(Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [String] $SchemaName,
        [Parameter(Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [String] $TableName
        )
    $Query = "SELECT c.Name ColumnName, ty.Name DataType, case when ty.Name like '%char' then 'String' 
                                                               when ty.Name = 'int' then 'Int32' 
                                                               when ty.name = 'Integer' then 'Int32' 
                                                               when ty.name = 'Smallint' then 'Int16' 
                                                               when ty.Name = 'BigInt' then 'Int64' 
                                                               when ty.name = 'money' then 'Decimal' 
                                                               when ty.name = 'numeric' then 'Decimal' 
                                                               when ty.name = 'float' then 'Decimal'
                                                               when ty.name = 'date' then 'DateTime'
                                                               when ty.name = 'bit' then 'Boolean'
                                                               else ty.name 
                                                          end DataTableType
    FROM sys.tables t
	    INNER JOIN sys.columns c ON c.object_id = t.object_id
	    INNER JOIN sys.types ty ON ty.system_type_id = c.system_type_id AND ty.user_type_id = c.user_type_id
		left JOIN sys.identity_columns ic ON c.object_id = ic.object_id AND c.column_id = ic.column_id
    WHERE t.name = '$TableName' AND t.schema_id =SCHEMA_ID('$SchemaName') "

    if ($Credential -eq $null)
    {
	    $columns = invoke-sqlcmd -ServerInstance $ServerInstance -Database $Database -Query $Query
    }
    else
    {
        $columns = invoke-sqlcmd -ServerInstance $ServerInstance -Database $Database -Query $Query -Username $Credential.UserName -Password $Credential.GetNetworkCredential().Password
    }

    return , $columns
}


Out-DataTableWithDataTypes has been derived from code by Chad Miller and taken over by Rambling Cookie Monster (GitHub repository link). The original code attempted to determine the data type of the incoming object. However, the data is loaded into the object as strings, so that’s the only data type that can be determined. This adaptation adds a second input parameter, $Columns. The best way to define this parameter is with the function above, Get-SQLColumnsForDataTable.

function Out-DataTableWithDataTypes
{ 
    [CmdletBinding()] 
    param([Parameter(Position=0, Mandatory=$true, ValueFromPipeline = $true)] [PSObject[]]$InputObject,
          [Parameter(Position=1, Mandatory=$true)] [PSObject[]]$Columns)
 
    Begin 
    { 
        
        $dt = new-object Data.datatable 
        if ($columns.Count -gt 0)
        {
            foreach($column in $Columns)
            {
                [void]$dt.Columns.Add($column.ColumnName, $column.DataTableType)
            }
        }
    } 
    Process 
    { 
        foreach ($object in $InputObject) 
        { 
            $DR = $DT.NewRow()   
            foreach($property in $object.PsObject.get_properties()) 
            {     
                if ($DT.Columns.Contains($property.Name))
                {
                    if ($property.value -eq $null -or $property.Value -eq "" -or $property.Value -eq "NULL")
                    {
                        $DR.Item($property.Name) = [System.DBNull]::Value
                    }
                    else 
                    {
                        $DR.Item($property.Name) = $property.value
                    }
                } 
            }   
            $DT.Rows.Add($DR)  
             
        } 
    }  
      
    End 
    { 
        Write-Output @(,($dt)) 
    } 
 
}

With these two functions, the basic functionality is there to load data into a SQL Server database table with the proper data types. As with any data loading process, if the data types do not match, Out-DataTableWithDataTypes will fail on the load.
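
A hedged sketch of catching that failure so a bad file does not halt an unattended load; it assumes $dbColumns and $bulkCopy were created as in the metadata-driven demo earlier, and the file name is a placeholder.

try
{
    $dataTable = Import-CSV "C:\Demos\CSVs\Customers.csv" | Out-DataTableWithDataTypes -Columns $dbColumns
    $bulkCopy.WriteToServer($dataTable)
}
catch
{
    # A type-conversion or bulk-load error lands here; log it and move on to the next file.
    write-output "Load failed: $($_.Exception.Message)"
}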

As with most any blog on the Internet, this code has been working for me and my needs. I fully recognize it does not cover every scenario. You’re welcome to use this code, but it comes with no guarantees.

Working with Data in .Net

As a PowerShell developer (yes, I consider PowerShell coding development, when it’s for pre-ETL processing), there are a handful of very useful namespaces in .Net.

System.Data – The “basic” data containers are classes in this namespace, including DataTable and DataSet. These are generic types, regardless of the source or destination database or data source.
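
A minimal sketch of building a typed DataTable directly (the column names are just examples):

# Build a DataTable with typed columns, then add a row.
$table = New-Object System.Data.DataTable
[void]$table.Columns.Add("ProductID", [int])
[void]$table.Columns.Add("ListPrice", [decimal])
[void]$table.Rows.Add(1, 49.99)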

System.XML – If you’re ingesting XML data, there are several classes here you’ll use. It’s likely you’ll use XmlReader to stream data or XmlDocument to load a document into memory to read or manipulate.
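
A minimal sketch of the in-memory approach with XmlDocument (the XML here is a made-up example):

# Load a small document into memory and pull values with an XPath query.
$doc = New-Object System.Xml.XmlDocument
$doc.LoadXml("<orders><order id='1' /><order id='2' /></orders>")
$doc.SelectNodes("//order") | ForEach-Object { $_.GetAttribute("id") }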

Microsoft.Data.SqlClient – The updated version of System.Data.SqlClient, used to access SQL Server, including Azure SQL Database and Azure SQL Managed Instance. SqlConnection, SqlCommand and SqlBulkCopy are a data developer’s best friends.
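
A minimal sketch of those classes together, assuming the Microsoft.Data.SqlClient assembly is already loaded (for example, via Add-Type or a module that carries it):

# Open a connection, run a scalar query, and close the connection.
$connection = new-object Microsoft.Data.SqlClient.SqlConnection `
                  -ArgumentList "server=localhost;database=AdventureWorks;Integrated Security=SSPI"
$connection.Open()
$command = $connection.CreateCommand()
$command.CommandText = "SELECT COUNT(*) FROM Production.Product"
$rowCount = $command.ExecuteScalar()
$connection.Close()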

OfficeOpenXML.Core.ExcelPackage – Working with XLSX files requires an assembly found in the PowerShell Gallery, among other places. Classes in this assembly allow developers to read, update and write Excel data.
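
And a minimal sketch of writing a workbook with the same assembly, assuming EPPlus is already loaded as in the validation demo above (the file path and values are placeholders):

# Create a worksheet, write a couple of cells, and save the package.
$package   = New-Object OfficeOpenXml.ExcelPackage ([System.IO.FileInfo]"C:\Demos\NewReport.xlsx")
$worksheet = $package.Workbook.Worksheets.Add("Summary")
$worksheet.Cells[1,1].Value = "RowCount"
$worksheet.Cells[2,1].Value = 42
$package.Save()
$package.Dispose()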

This is a placeholder and this post will be updated at a later date…

A Comeback in 2022

I’m lazy and a bit selfish.

What a way to start off a post, eh!?

I was getting some content out in 2019 and 2020, as well as starting to hit a stride with presentations at SQL Saturdays. Then a new bug came and rocked everyone, COVID-19. SQL Saturdays disappeared, as well as gatherings in general, for the next 12 – 15 months. By not having that motivator of speaking to people about SQL Server and supporting technologies, my blogging has been on hiatus. When blogging on a topic, I try my best to get the details right, so I can hopefully help those who come across the post. As a result, I tend to spend 6 – 8+ hours crafting a single post (some of that may be a little anxiety and over-analyzing). If I perceive there may be a likelihood of views dropping, because I’m not introducing myself to people, my lazy side comes out very quickly and says, “Why bother? Let’s go watch some TV or take a nap.”

That doesn’t mean life has stood still, and I haven’t learned anything. Quite the opposite, in fact. I have been extremely fortunate to work for a growing company in a key industry in this country. While the pandemic did affect us, it was the catalyst to grow in new directions. This required the data team (of which I’m one of three co-leads) to develop new ways to absorb the growth in our data, not just in volume, but also in the breadth of information we’re managing.

Over the next several months, I’ll be posting about these new adventures, generalizing the concepts for the larger audience. You should see quite a bit coming about PowerShell and pre-ETL processing. Sprinkled in will be posts about Azure Automation and Azure Data Factory. By the time I get posts caught up to July 2022, I’m sure I will have learned, or been introduced to, new topics to explore with future posts.

Yesterday (July 7th, 2022), the PASS Data Community Summit announced general sessions and speakers for the 2022 event. My session, “Power-up the Data Warehouse with Pre-ETL Processing,” was chosen. I am incredibly honored to have the opportunity to speak at the Summit, in person. I did speak at the 2020 PASS Virtual Summit, pre-recording the session, so this will be my 2nd appearance at the “main event”.

I’ll try to keep the train going longer this time, providing new content.

Stay tuned…

Standing up Azure SQL Managed Instance & Connecting to a Storage Account

This is a quick article, related to connecting an Azure SQL Managed Instance to an Azure Storage Account.

When creating an Azure SQL Managed Instance, you have the options of creating a public endpoint and/or configuring the connection type of the private endpoint (as shown below). The default connection type for the private endpoint is Proxy; however, Microsoft recommends the Redirect method.

Using Redirect will create a Network Security Group (NSG) with various security rules. In the outbound rules, I have found that, at most, 2 rules need to be added, which are highlighted in the screenshot below. One rule allows any connection from the SQL Managed Instance subnet (172.x.x.x/27 as an example) to the subnet with the primary NIC of the storage account (172.x.x.0/24). The other rule allows traffic from the MI subnet to the ‘Storage.EastUS’ service tag.

Partial listing of outbound network security group rules in Azure for a SQL Managed Instance.
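
For what it’s worth, a rule like the second one can also be scripted with the Az PowerShell module. This is a hedged sketch; the NSG name, resource group, priority, and address prefix are placeholders for your environment, and the final rule set should follow Microsoft’s Managed Instance connectivity guidance.

# Add an outbound rule allowing the MI subnet to reach the regional Storage service tag.
$nsg = Get-AzNetworkSecurityGroup -Name "SQLMI-NSG" -ResourceGroupName "rg-sqlmi"
$nsg | Add-AzNetworkSecurityRuleConfig -Name "allow_out_storage_eastus" `
        -Access Allow -Direction Outbound -Priority 200 -Protocol "*" `
        -SourceAddressPrefix "172.16.0.0/27" -SourcePortRange "*" `
        -DestinationAddressPrefix "Storage.EastUS" -DestinationPortRange "*" |
    Set-AzNetworkSecurityGroup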

More investigation needs to be completed to tighten down these outbound rules, so they target specific ports, and ideally specific IP addresses. This will evolve…

Recent and Upcoming Presentations

The first set of blogs I posted on this site center around Row Level Security in Microsoft SQL Server. In addition to these posts, I’ve decided to present the topic at PASS events, including multiple SQL Saturdays. Below are the events I’ve already presented at in 2020 and am hoping to present at later this year.

Confirmed Events

Submission-pending Events

These are wonderful opportunities for multiple reasons. SQL Saturdays are unique events, in that they are hosted by the PASS community in each city, providing a free day of training (with the exception of a small fee for lunch). How crazy is that? With each event, I’m slowly trying to improve myself in the first three areas listed below. Plus, everyone needs some fun now and again.

  • Networking – A few speakers are planning to be at a number of these events, which will give me an opportunity to meet and get to know them, as well as other speakers and event attendees.
  • Building Technical Skills – Just as I’m presenting on a relatively unknown topic, Row Level Security, there are others speaking about other, lesser-known topics. Of course, many of the presentations will cover better known topics, and we’ll get different perspectives from each speaker.
  • Building Communication Skills – For a long time, I was quite terrified of public speaking, which has likely had a negative impact on my career. By “practicing” my public speaking skills, I can only improve, and this will help further my career, especially if I want to go into management or a product evangelist-type role.
  • Seeing Different Cities – Who doesn’t like to check out different cities and see what they have to offer? Having a day or two free in each city will give me an opportunity to do just that.

Restricting Data Access with Row Level Security – Part 3

To this point in the series, the examples we’ve used are limited to the single Azure database for the Endless Timekeeping application. It’s great to define the concepts, but for an enterprise application, a few more tools need to be added to the tool belt. Endless Reporting isn’t large enough to really have good enterprise scenarios, yet, so we’ll rely on ACME Corporation, made famous by Wile E. Coyote. ACME has several divisions, including Recreational Gear, Physical Security and Explosives, which receive materials and subcomponents from their vendors ACME Skates, ACME IronWorks and ACME Gun Powder, respectively. Each division produces its own products, manages inventories, etc. The corporate accounting department is responsible for the finances of all the divisions. With the implementation of Just-in-Time inventory, the vendors have to be responsive to ACME’s needs.

Extending the Limits of Row Level Security

Mapping Users to Divisions

ACME is fortunate to use an Enterprise Resource Planning (ERP) system that allows each division to be split into separate “companies.” Knowing data security would be important (or maybe it was dumb luck) when the system was implemented, ACME did take advantage of this feature and laid out the divisions as follows.

  • 01 – Physical Security
  • 03 – Explosives
  • 04 – Recreational Gear

Of course, Row Level Security cannot be implemented in the ERP system directly, because it’s a 3rd-party application that was not designed to handle Row Level Security. However, the data is loaded into a data warehouse each day. Our examples will work with the inventory data mart.

The Accounting department, of course, has access to all companies in the system, as do all internal users of ACME Corporation. The vendors, on the other hand, should only have access to the ERP company they work with. Users from each vendor have Active Directory accounts in the ACME domain. We can take the companies and users and map them together in a table named UserSecurity.CompanyUserMapping. We can continue to add users to this table, which will be time-consuming but possible. Several users are listed here.

CompanyID | ADUserName
ALL       | ACMECorpUS\AcctUser1
01        | ACMECorpUS\AIUser01
03        | ACMECorpUS\AGPUser01
04        | ACMECorpUS\ASUser01

In the data warehouse, there is a table, Dim.Company, that has a matching column named CompanyID. It turns out this table will play a key role with data access. A predicate function, like udf_CompanyFilterPredicate, can be created to filter Dim.Company, which is used in nearly every query against the data warehouse.

CREATE FUNCTION Dim.udf_CompanyFilterPredicate
(
    @CompanyID char(3)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
	-- Internal users who have access to all data (CompanyID = 'ALL')
	SELECT 1
	FROM UserSecurity.CompanyUserMapping m
	WHERE m.ADUserName = USER_NAME()
		AND m.CompanyID = 'ALL'

	UNION
	-- External users who have access to specific companies
	SELECT 1
	FROM UserSecurity.CompanyUserMapping m 
	WHERE m.ADUserName = USER_NAME() 
                AND m.CompanyID = @CompanyID
)

Let’s bind this function to the Dim.Company table.

CREATE SECURITY POLICY [dbo].[FilterDimCompanyPolicy] 
       ADD FILTER PREDICATE Dim.udf_CompanyFilterPredicate(CompanyID) ON Dim.Company
WITH (STATE = ON, SCHEMABINDING = ON)
GO

Assuming CompanyID is a surrogate key, defined as an IDENTITY, this solution won’t restrict data in the Inventory fact table, but it will be harder to determine what’s what if the user doesn’t have access to all companies. Dim.Company is a small table, so filtering it for each and every query will be quite fast, compared to filtering DimCompanyID in the fact table. Of course, filtering Fact.Inventory would be the ideal solution, but there could be some performance issues, depending on the size of the table.

Filtering Data by Active Directory Groups

We started keying Active Directory accounts into the CompanyUserMapping table for every user at ACME Corp. and for all their vendors’ users who need access to the data warehouse. That could be a lot of users to keep track of and a maintenance nightmare for the database administrator. A better solution might be to map the companies to Active Directory groups, such that every member of a group has access to one or more companies. Let’s rename the table to CompanyGroupMapping.

CompanyID | ADGroup
ALL       | ACMECorpUS\InternalUsers
01        | ACMECorpUS\ACMEPhysicalSecurityUsers
03        | ACMECorpUS\ACMEGunPowderUsers
04        | ACMECorpUS\ACMESkatesUsers

The predicate function we wrote earlier needs just a couple of changes. Instead of comparing the current user’s login to the ADUserName field, the IS_MEMBER() function can be used, which determines whether the current user is a member of the specified database role or Windows (Active Directory or local server) group. The management of user access then moves from the DBA to the Active Directory administrators, who add user accounts to the respective groups.

CREATE FUNCTION Dim.udf_CompanyFilterPredicate
(
    @CompanyID char(3)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
	-- Internal users who have access to all data (CompanyID = 'ALL')
	SELECT 1
	FROM UserSecurity.CompanyGroupMapping m
	WHERE IS_MEMBER(m.ADGroup) = 1
		AND m.CompanyID = 'ALL'

	UNION
	-- External users who have access to specific companies
	SELECT 1
	FROM UserSecurity.CompanyGroupMapping m 
	WHERE IS_MEMBER(m.ADGroup) = 1
                AND m.CompanyID = @CompanyID
)

These examples are not perfect, but I don’t know if there is a perfect example. However, my goal is to give you some ideas on ways to implement Row Level Security in a manner that might work for your enterprise.

In the next blog, we’ll look at some query patterns to follow and to avoid with Row Level Security. Stay tuned…