
Azure Data Factory Deployment Mistake

Recently, I was working on a release pipeline to add our Azure Data Factory solution to it. This was our first attempt to release Azure Data Factory components and, of course, there were many issues to resolve. The release pipeline handles both databases from multiple Git repositories and, now, ADF code, so it’s no longer an out-of-the-box solution.

Microsoft provides documentation for AzureResourceManagerTemplateDeployment@3, which is helpful if you read it thoroughly. Of course, I did not read it thoroughly, so I missed the note about the DeploymentMode options. While working through the various issues, I decided to set the deploymentMode parameter to ‘Complete’ instead of the default value, ‘Incremental’. The task looked a lot like the example below.

    - task: AzureResourceManagerTemplateDeployment@3
      displayName: 'Deploy ADF Solution'
      inputs:
        deploymentScope: 'Resource Group'
        azureResourceManagerConnection: '$(azureSubscription)'
        subscriptionId: '<subscriptionID>'
        action: 'Create Or Update Resource Group'
        location: '<Azure region>'
        resourceGroupName: '<resource group>'
        templateLocation: 'Linked artifact'
        csmFile: '$(System.ArtifactsDirectory)/<SolutionPath>/ARMTemplateForFactory.json'
        csmParametersFile: '$(System.ArtifactsDirectory)/<SolutionPath>/ARMTemplateParametersForFactory.json'
        overrideParameters: >-
          -factoryName xxxxxxxxxxxxx
          -KeyVault_properties_typeProperties_baseUrl $(KeyVault-AutomatedTesting)
          -ADLS_Storage_properties_typeProperties_url $(ADLSLocation-AutomatedTesting)
        deploymentMode: Complete

Oops! That was a mistake. The deployment failed, with a new error message about the data factory not being found. Excuse me?! I just saw it a couple minutes ago. Going back to Azure Data Factory, the instance I had been working with was no longer listed. What happened in the last couple minutes? The only thing I did was change the DeploymentMode to ‘Complete’.

Referring back to the documentation, I found this little nugget.

Complete mode deletes resources that are not in your template. Complete mode takes relatively more time than incremental mode. If the task times out, consider increasing the timeout or changing to the Incremental mode.

My ARM template only included the code contained within Azure Data Factory. It did not contain the definition of the data factory itself. As a result, the Complete deployment mode deleted the data factory (along with a storage account) and then tried to deploy the code. Of course, this didn’t work.
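The fix, in my case, was simply to go back to the default deployment mode, since the exported ARM template only describes the contents of the factory, not the factory itself. The relevant line of the task above becomes:

        # Incremental (the default) only creates or updates the resources defined in the template;
        # existing resources in the resource group that are not in the template are left untouched.
        deploymentMode: Incremental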

Fortunately, a proof-of-concept environment had been created to develop the release pipeline, so I was the only one using the instances that were deleted. This mistake is a good reminder of why resource groups should be specific to a resource instance rather than encompass a large (or even modest) number of resources. It’s also a reminder to read the documentation when using new features. ADO pipeline tasks (and jobs) are concisely defined in YAML, but they perform a lot of work, and it is easy to run into unexpected outcomes.

SQL Saturday Boston Wrap-Up

SQL Saturday Boston was a major success, both for the event and for myself. The organizing committee did a great job bringing together a stellar group of speakers who covered many great topics. A HUGE thank you to New England SQL Server User Group for the opportunity to speak about PowerShell. Thank you to each and every attendee of my session, Reducing Data Warehouse Processing with PowerShell Pre-ETL Processing. It was great to have so many people show interest in PowerShell and how data developers, data analysts, QA analysts, and others can leverage its capabilities.

The PDF version of the slide deck is available for download.

In addition to the slides, below are the demo scripts from the presentation. The data files used in the presentation are being made available, but you really shouldn’t need them. The idea is to use the scripts to get inspiration for your own solutions. The examples make use of two functions I’ve created. They are documented in this post.

CSV Validation – Loading data from a CSV into SQL Server is straightforward if the columns in the CSV file match the database table’s. This validation performs a cursory check of the columns being passed in a CSV file. Certainly, there are other checks that can be performed, but these are a couple to get a solution started.

# It is assumed the CSV being used contains a header record in the first row of the file. As a result,
# only the first record ( -TotalCount 1 ) is being read.

$header = Get-Content -Path $FileName -TotalCount 1
# Add .Replace("`"", "") to the end of the split operation to remove quotes from the values.
$headerColumns = ($header -split ',') 


# With a database table designed with the schema intended to be in the CSV 
#   file, query SQL Server to get the list of columns (and data types).
# NULL is being passed to the -Credential parameter, which informs the 
#   function to use Windows authentication to connect to the SQL instance.
$databaseTableColumns= Get-SQLColumnsForDataTable `
                -ServerInstance localhost `
                -Database AdventureWorks `
                -Credential $null `
                -SchemaName Production `
                -TableName Product

# Perform a quick column count to see if the number of columns match.  By no means does this mean the columns are the same, even if the counts are equal.
if ($headerColumns.Count -eq $databaseTableColumns.Count)
{
    write-output "The columns in the file match the table definition"
}
else
{
    write-output "The number of columns in the source file does not match the table definition"
}

# Check the CSV's columns to see if columns from the table are missing.
$missingColumn = $false
foreach ($column in $databaseTableColumns)
{
    if ($headerColumns -notcontains $column.ColumnName) 
    { 
        $missingColumn = $true
        write-output "   Column, $($column.ColumnName), was not found" 
    } 
}

Excel Validation – Like the CSV validation, validating Excel files is also possible with the PSExcel module (technically, only EPPlus.dll is being used, for the OfficeOpenXML namespace). As you compare the two scripts, there is more work in the Excel validation, due to needing the assistance of EPPlus.dll to access the data, versus reading the first line of text from the CSV file with Get-Content.

# Import the PSExcel module from the PowerShell Gallery; only needs to be run once on a computer.  
# Run the PowerShell session in administrator mode to install the module for all users.

install-module PSExcel
import-module PSExcel

$excelFile = New-Object OfficeOpenXml.ExcelPackage $FileName
$workbook  = $excelFile.Workbook

# Every worksheet in Excel has a name on the tab.  That name goes here.
$worksheet = $workbook.Worksheets["Details"]
 
# $worksheet.Dimension.Rows returns the number of records in the Excel worksheet.  
# It may be useful to log the number of records received in a pre-processing/validation step.
Write-Output "#################"
Write-Output "   File contains $($worksheet.Dimension.Rows) rows"
Write-Output "#################"

# Like Dimension.Rows, Dimension.Columns returns the number of columns with data in the worksheet.
$worksheetColumns = $worksheet.Dimension.Columns

 #
 # Get columns for Sales.SalesOrderDetails
 #
 $dbColumns = Get-SQLColumnsForDataTable -ServerInstance localhost `
                                         -Database AdventureWorks `
                                         -Credential $null `
                                         -SchemaName "Sales" `
                                         -TableName "SalesOrderDetail"

if ($worksheetColumns -eq $dbColumns.Count)
{
    write-output "             "
    write-output "The Excel file and the database table have the same number of columns"
    write-output "             "
}
else
{
    write-output "The Excel file and the database table do not have the same number of columns"

    #
    # Find the differing columns
    #
    $worksheetColumnNames = @()
    $dbColumnNames = $dbColumns.ColumnName
    foreach ($columnNum in 1..$worksheetColumns)
    {
        $columnName = $worksheet.Cells.Item(1,$columnNum).Text
        
        if ($dbColumns.ColumnName -notcontains $columnName)
        {
            write-output "$columnName is not in the database table"
        }

        $worksheetColumnNames += $columnName
    }
}

JSON to CSV Transformation – This script is directly related to the screenshot of the JSON file from slide 21 in the slide deck, creating a CSV file from the highlighted elements in the JSON.

$activities= (Get-Content $FileName | convertfrom-json).Value

$activitiesTable = $activities | select activityName, activityRunStart, activityRunEnd, durationInMs, 
                                  @{Name="storedProcedureName";
                                    Expression={
                                                    ($_ | select-object -expandProperty input).storedProcedureName
                                               }
                                   }

$activitiesTable | format-table

$activitiesTable | export-csv c:\demos\activities.csv -NoTypeInformation

Using Metadata to Load Tables – When you need to load files into tables repeatedly over time, such as updated customer information on a monthly basis, it may be useful to create a table to define metadata about those files. This could include the naming pattern of the file and which database the data should reside in, along with the specific schema and table names. This example takes that metadata and instantiates a bulk copy object from Microsoft.Data.SqlClient. With the System.Data.DataTable object holding the contents of the CSV file, that data is loaded into the specified table.

$query = "select FileNamePattern, 
                 DestinationDatabaseName, 
                 DestinationSchemaName, 
                 DestinationTableName, 
                 DestinationSchemaTableName 
          from Logging.LoadMetadata"

# invoke-sqlcmd with no credential uses context of the current user
$files = invoke-sqlcmd -ServerInstance localhost -Database PowerShellDemo -Query $query 

foreach ($file in $files)
{
    write-output "...Loading $($file.DestinationSchemaTableName)"

    #Query sys.columns & sys.types to get the fields of a table
    $dbColumns = Get-SQLColumnsForDataTable -ServerInstance localhost `
                                            -Credential $null `
                                            -Database $file.DestinationDatabaseName `
                                            -SchemaName $file.DestinationSchemaName `
                                            -TableName $file.DestinationTableName

    #Load CSV data into memory and generate a DataTable with defined data types
    $dataTable = Import-CSV "C:DemosCSVs$($file.FileNamePattern)"  | Out-DataTableWithDBDataTypes -Columns $columns

    # Bulk copy the DataTable into the specified table
    $bulkCopy = new-object ([Microsoft.Data.SqlClient.SqlBulkCopy]) `
                    -ArgumentList "server=localhost;database=$($file.DestinationDatabaseName);Integrated Security=SSPI"
    $bulkCopy.DestinationTableName = $file.DestinationSchemaTableName
    $bulkCopy.WriteToServer($dataTable)
}

If you have any questions about these demo scripts, you can find me on Twitter, @em_dempster.

Azure Dev Ops – Job-Focused vs Task-Focused

This is the first post in the series, Azure Dev Ops Odds and Ends. The starting page has information about the code environment I’m working with, as well as links to other posts in this series.

The first step to consolidate builds into a single pipeline was to gather code from four repositories into a single location. From there, the projects could be built and deployed as a set. Based on our existing pipelines, I needed to create jobs: first one to check out code, then jobs to build and publish the solutions. Below is the start of the “BuildDACPACs” stage of the pipeline.

- stage: BuildDACPACs
  jobs:
  - job: BuildDACPACs
    displayName: Build and publish database DACPACs
    workspace:
      clean: all
    steps:
    # First step: check out the repositories with the source code and other scripts the pipeline needs.
    - checkout: git://AutomatedTestingPoC/AutomatedTestingPoC #Optional but included to gain access to artifacts in the "root" repo, such as PowerShell scripts needed in the deployment process
    - checkout: git://AutomatedTestingPoC/Staging
    - checkout: git://AutomatedTestingPoC/SystemLogging
    - checkout: git://AutomatedTestingPoC/Warehouse
    - checkout: git://AutomatedTestingPoC/AzureDataFactory
  - job: BuildWarehouseDB
    displayName: Build Warehouse database project
    steps:
    # Second step: Build and publish the Warehouse database project 
    #   Two things:
    #     1) the DACPAC is buried, so it is being copied to a pre-defined 'ArtifactStagingDirectory' from the bin folder
    #     2) There are 2 settings files for publishing; one for Automated Testing and another for other environments.  Those need to be copied into the artifacts folder, along with the DACPAC.
    - task: VSBuild@1
      displayName: Build Warehouse Database
      inputs:
       solution: "$(Build.SourcesDirectory)/Warehouse/**/*.sln"
       platform: $(buildPlatform)
       configuration: $(buildConfiguration)
       clean: true
    - powershell: copy-item (get-childItem "$($env:Build_SourcesDirectory)\Warehouse" -Filter *.dacpac -Recurse | select fullname).FullName $($env:Build_ArtifactStagingDirectory)
      displayName: Copy Warehouse DACPAC
    - powershell: copy-item (get-childItem "$($env:Build_SourcesDirectory)\Warehouse" -Filter Default.publish.xml -Recurse | select fullname).FullName "$($env:Build_ArtifactStagingDirectory)\Warehouse.Publish.xml"
      displayName: Copy Warehouse Publish Settings XML
    - task: PublishBuildArtifacts@1
      displayName: Publish Warehouse Database
      inputs:
        PathtoPublish: '$(Build.ArtifactStagingDirectory)'
        ArtifactName: 'BlueMonkey_artifacts'
        publishLocation: 'Container'

The syntax of the YAML was valid and the pipeline started to run. The checkouts worked as expected, so there should have been four directories with code in them. The Warehouse database build was the first to start, and it failed on the build step itself. In Image #1, the error states the solution couldn’t be found in the folder path, ‘D:\a\1\s\Warehouse\**\*.sln’. Not being the most observant character, I was stumped and didn’t understand what was happening. After some digging, I realized the step before the build was doing exactly what it said it was doing. Imagine that!! I had checked out code earlier, but code was being checked out again in this second job. Looking through the YAML above, there wasn’t an explicit step to check out the code again. Where was this being called?

Image #1 – Pipeline logs; Warehouse database failed

It turns out that when a job is defined, ADO performs a number of steps, including setting up an agent. This occurred for the first job, which checked out the code, and it happened again for the job that builds the database. One of the automated steps is to check out code from the repository in which the YAML file is defined. In order to not wipe out the code that was checked out earlier, we can’t use a separate job for each major process in building and deploying the solution. On line 5 of the YAML above, a clue was provided by the workspace: clean option. That led me to dig deeper into what was happening and to review the logs in more detail.

Starting: Initialize job
Agent name: 'Hosted Agent'
Agent machine name: 'WIN-6C2TM2SQ98G'
Current agent version: '2.211.0'
Operating System
Runner Image
Runner Image Provisioner
Current image version: '20221002.2'
Agent running as: 'VssAdministrator'
Prepare build directory.
Set build variables.
Download all required tasks.
Downloading task: VSBuild (1.208.0)
Downloading task: PowerShell (2.210.0)
Downloading task: PublishBuildArtifacts (1.200.0)
Checking job knob settings.
   Knob: AgentToolsDirectory = C:\hostedtoolcache\windows Source: ${AGENT_TOOLSDIRECTORY} 
   Knob: AgentPerflog = C:\agents\perflog Source: ${VSTS_AGENT_PERFLOG} 
Finished checking job knob settings.
Start tracking orphan processes.
Finishing: Initialize job
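As an aside, that implicit checkout can be controlled per job: a job can declare the checkout explicitly or suppress it entirely. A minimal sketch of the syntax (not what I ended up using here):

    steps:
    # Skip the automatic checkout of the repository that hosts the pipeline YAML
    - checkout: none

Even with the checkout suppressed, each job still gets its own agent and its own fresh workspace, so splitting the work across jobs would still leave the build without the code checked out earlier.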

Instead, we need to structure the YAML to run many tasks within a single job. The build and publish tasks are now contained within a single job, BuildArtifacts. When the job is initialized, the current repository is still checked out, but the next steps explicitly check out all of the repositories, placing each one into its own subfolder under $(Build.SourcesDirectory).

- stage: BuildArtifacts
  jobs:
  - job: BuildArtifacts
    displayName: Build and publish database DACPACs and ADF code
    workspace:
      clean: all
    steps:
    # First step: check out the repositories with the source code and other scripts the pipeline needs.
    - checkout: git://AutomatedTestingPoC/AutomatedTestingPoC #Optional but included to gain access to artifacts in the "root" repo, such as PowerShell scripts needed in the deployment process
    - checkout: git://AutomatedTestingPoC/Staging
    - checkout: git://AutomatedTestingPoC/SystemLogging
    - checkout: git://AutomatedTestingPoC/Warehouse
    - checkout: git://AutomatedTestingPoC/AzureDataFactory
    # Second step: Build and publish the Warehouse database project 
    #   Two things:
    #     1) the DACPAC is buried, so it is being copied to a pre-defined 'ArtifactStagingDirectory' from the bin folder
    #     2) There are 2 settings files for publishing; one for Automated Testing and another for other environments.  Those need to be copied into the artifacts folder, along with the DACPAC.
    - task: VSBuild@1
      displayName: Build Warehouse Database
      inputs:
       solution: "$(Build.SourcesDirectory)/Warehouse/**/*.sln"
       platform: $(buildPlatform)
       configuration: $(buildConfiguration)
       clean: true
    - powershell: copy-item (get-childItem "$($env:Build_SourcesDirectory)\Warehouse" -Filter *.dacpac -Recurse | select fullname).FullName $($env:Build_ArtifactStagingDirectory)
      displayName: Copy Warehouse DACPAC
    - powershell: copy-item (get-childItem "$($env:Build_SourcesDirectory)\Warehouse" -Filter Default.publish.xml -Recurse | select fullname).FullName "$($env:Build_ArtifactStagingDirectory)\Warehouse.Publish.xml"
      displayName: Copy Warehouse Publish Settings XML
    - task: PublishBuildArtifacts@1
      displayName: Publish Warehouse Database
      inputs:
        PathtoPublish: '$(Build.ArtifactStagingDirectory)'
        ArtifactName: 'BlueMonkey_artifacts'
        publishLocation: 'Container'

With each repository’s code stored in its own subfolder, as originally intended, the code is found when the build task is executed and the build process proceeds.

Image #2 – Pipeline logs; Warehouse build succeeded

This was my first major hurdle, and it clearly demonstrated that there is a lot going on behind each YAML instruction. Implementing pipelines is greatly simplified because of that, but at the same time, more care has to be taken to ensure the pipeline behaves as you expect.

Azure Dev Ops Pipeline Odds and Ends

When one works in a smaller company, there is an expectation of wearing many hats. As the Data Operations Team Lead, I’m no different. In addition to working with SQL Server, Azure infrastructure, Azure Data Factory and more, I have started to tackle Azure Dev Ops. Specifically, I’m working to enhance, or streamline, existing build and release pipelines to work better in our organization. Learning how pipelines are built and behave, while re-vamping the process has led to some deeper dives into YAML and what Azure Dev Ops is doing.

Some of the developers on our team are relatively new to developing code in a corporate environment and using source control. To help reduce issues around check-ins and code merging, we have made some conservative decisions. The data warehouse solution consists of 3 database projects, Staging, Warehouse and SystemLogging, as well as Azure Data Factory artifacts. Each solution is stored in its own repository within the same Azure Dev Ops project. The build and release pipelines are within a proof-of-concept project, AutomatedTestingPoC.

In this series of posts, I’ll document some of the nuances found in Azure Dev Ops and try to explain their benefits and drawbacks. The links below are listed chronologically, as I’m documenting them.

Check back often, as the list will grow and topics will be linked.

Data Saturday Atlanta Wrap-Up

Data Saturday Atlanta was a major success, both for the event and for myself. Rie Merritt (@IrishSQL) and the organizing committee did a great job bringing together a stellar group of speakers who covered many great topics. A HUGE thank you to Atlanta Azure Data User Group for the opportunity to speak about PowerShell. Thank you to each and every attendee of my session, Reducing Data Warehouse Processing with PowerShell Pre-ETL Processing. It was great to have so many people show interest in PowerShell and how data developers, data analysts, QA analysts, and others can leverage its capabilities.

The PDF version of the slide deck is available for download.

In addition to the slides, below are the demo scripts from the presentation. The data files used in the presentation are being made available, but you really shouldn’t need them. The idea is to use the scripts to get inspiration for your own solutions. The examples make use of two functions I’ve created. They are documented in this post.

CSV Validation – Loading data from a CSV into SQL Server is straightforward if the columns in the CSV file match the database table’s. This validation performs a cursory check of the columns being passed in a CSV file. Certainly, there are other checks that can be performed, but these are a couple to get a solution started.

# It is assumed the CSV being used contains a header record in the first row of the file. As a result,
# only the first record ( -TotalCount 1 ) is being read.

$header = Get-Content -Path $FileName -TotalCount 1
# Add .Replace("`"", "") to the end of the split operation to remove quotes from the values.
$headerColumns = ($header -split ',') 


# With a database table designed with the schema intended to be in the CSV 
#   file, query SQL Server to get the list of columns (and data types).
# NULL is being passed to the -Credential parameter, which informs the 
#   function to use Windows authentication to connect to the SQL instance.
$databaseTableColumns= Get-SQLColumnsForDataTable `
                -ServerInstance localhost `
                -Database AdventureWorks `
                -Credential $null `
                -SchemaName Production `
                -TableName Product

# Perform a quick column count to see if the number of columns match.  By no means does this mean the columns are the same, even if the counts are equal.
if ($headerColumns.Count -eq $databaseTableColumns.Count)
{
    write-output "The columns in the file match the table definition"
}
else
{
    write-output "The number of columns in the source file does not match the table definition"
}

# Check the CSV's columns to see if columns from the table are missing.
$missingColumn = $false
foreach ($column in $databaseTableColumns)
{
    if ($headerColumns -notcontains $column.ColumnName) 
    { 
        $missingColumn = $true
        write-output "   Column, $($column.ColumnName), was not found" 
    } 
}

Excel Validation – Like the CSV validation, validating Excel files is also possible with the PSExcel module (technically, only EPPlus.dll is being used, for the OfficeOpenXML namespace). As you compare the two scripts, there is more work in the Excel validation, due to needing the assistance of EPPlus.dll to access the data, versus reading the first line of text from the CSV file with Get-Content.

# Import the PSExcel module from the PowerShell Gallery; only needs to be run once on a computer.  
# Run the PowerShell session in administrator mode to install the module for all users.

install-module PSExcel
import-module PSExcel

$excelFile = New-Object OfficeOpenXml.ExcelPackage $FileName
$workbook  = $excelFile.Workbook

# Every worksheet in Excel has a name on the tab.  That name goes here.
$worksheet = $workbook.Worksheets["Details"]
 
# $worksheet.Dimension.Rows returns the number of records in the Excel worksheet.  
# It may be useful to log the number of records received in a pre-processing/validation step.
Write-Output "#################"
Write-Output "   File contains $($worksheet.Dimension.Rows) rows"
Write-Output "#################"

# Like Dimension.Rows, Dimension.Columns returns the number of columns with data in the worksheet.
$worksheetColumns = $worksheet.Dimension.Columns

 #
 # Get columns for Sales.SalesOrderDetails
 #
 $dbColumns = Get-SQLColumnsForDataTable -ServerInstance localhost `
                                         -Database AdventureWorks `
                                         -Credential $null `
                                         -SchemaName "Sales" `
                                         -TableName "SalesOrderDetail"

if ($worksheetColumns -eq $dbColumns.Count)
{
    write-output "             "
    write-output "The Excel file and the database table have the same number of columns"
    write-output "             "
}
else
{
    write-output "The Excel file and the database table do not have the same number of columns"

    #
    # Find the differing columns
    #
    $worksheetColumnNames = @()
    $dbColumnNames = $dbColumns.ColumnName
    foreach ($columnNum in 1..$worksheetColumns)
    {
        $columnName = $worksheet.Cells.Item(1,$columnNum).Text
        
        if ($dbColumns.ColumnName -notcontains $columnName)
        {
            write-output "$columnName is not in the database table"
        }

        $worksheetColumnNames += $columnName
    }
}

JSON to CSV Transformation – This script is directly related to the screenshot of the JSON file from slide 21 in the slide deck, creating a CSV file from the highlighted elements in the JSON.

$activities= (Get-Content $FileName | convertfrom-json).Value

$activitiesTable = $activities | select activityName, activityRunStart, activityRunEnd, durationInMs, 
                                  @{Name="storedProcedureName";
                                    Expression={
                                                    ($_ | select-object -expandProperty input).storedProcedureName
                                               }
                                   }

$activitiesTable | format-table

$activitiesTable | export-csv c:\demos\activities.csv -NoTypeInformation

Using Metadata to Load Tables – When you need to load files into tables repeatedly over time, such as updated customer information on a monthly basis, it may be useful to create a table to define metadata about those files. This could include the naming pattern of the file and which database the data should reside in, along with the specific schema and table names. This example takes that metadata and instantiates a bulk copy object from Microsoft.Data.SqlClient. With the System.Data.DataTable object holding the contents of the CSV file, that data is loaded into the specified table.


$query = "select FileNamePattern, 
                 DestinationDatabaseName, 
                 DestinationSchemaName, 
                 DestinationTableName, 
                 DestinationSchemaTableName 
          from Logging.LoadMetadata"

# invoke-sqlcmd with no credential uses context of the current user
$files = invoke-sqlcmd -ServerInstance localhost -Database PowerShellDemo -Query $query 

foreach ($file in $files)
{
    write-output "...Loading $($file.DestinationSchemaTableName)"

    #Query sys.columns & sys.types to get the fields of a table
    $dbColumns = Get-SQLColumnsForDataTable -ServerInstance localhost `
                                            -Credential $null `
                                            -Database $file.DestinationDatabaseName `
                                            -SchemaName $file.DestinationSchemaName `
                                            -TableName $file.DestinationTableName

    #Load CSV data into memory and generate a DataTable with defined data types
    $dataTable = Import-CSV "C:\Demos\CSVs\$($file.FileNamePattern)"  | Out-DataTableWithDBDataTypes -Columns $columns

    # Bulk copy the DataTable into the specified table
    $bulkCopy = new-object ([Microsoft.Data.SqlClient.SqlBulkCopy]) `
                    -ArgumentList "server=localhost;database=$($file.DestinationDatabaseName);Integrated Security=SSPI"
    $bulkCopy.DestinationTableName = $file.DestinationSchemaTableName
    $bulkCopy.WriteToServer($dataTable)
}

If you have any questions about these demo scripts, you can find me on Twitter, @em_dempster.

Database Helpers for PowerShell

PowerShell has a lot of great functionality to pull data from various sources on a server, like event logs, WMI objects, the Registry, etc. However, preparing data for logging to a database is an area where PowerShell is lacking.

As supporting material for SQL Saturday/Data Saturday presentations, here are a couple functions, with basic functionality, to aid in scripting bulk loads to SQL Server. If a different database platform is being used, such as PostgreSQL, Oracle, etc., this code will need to be updated to support that platform.

Get-SQLColumnsForDataTable, as the name should imply, returns an array of objects representing the column names and data types from a table in a SQL Server database. The object array can be used by the second function, Out-DataTableWithDataTypes to predefine the data types of columns in a System.Data.DataTable object. This does not return the length of strings or the scale and precision of decimal and numeric data types. Additionally, the query translates common SQL data types to .Net data types. If the table includes binary, image, CLR data types, etc., those data types will need to be translated to an equivalent .Net data type.

function Get-SQLColumnsForDataTable
{
    Param(
        [Parameter(Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [string]$ServerInstance,
        [Parameter(Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [string]$Database,
        [Parameter(Mandatory=$false, ValueFromPipelineByPropertyName=$true)]
        [PSCredential]$Credential,
        [Parameter(Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [String] $SchemaName,
        [Parameter(Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [String] $TableName
    )
    $Query = "SELECT c.Name ColumnName, ty.Name DataType, case when ty.Name like '%char' then 'String' 
                                                               when ty.Name = 'int' then 'Int32' 
                                                               when ty.name = 'Integer' then 'Int32' 
                                                               when ty.name = 'Smallint' then 'Int16' 
                                                               when ty.Name = 'BigInt' then 'Int64' 
                                                               when ty.name = 'money' then 'Decimal' 
                                                               when ty.name = 'numeric' then 'Decimal' 
                                                               when ty.name = 'float' then 'Decimal'
                                                               when ty.name = 'date' then 'DateTime'
                                                               when ty.name = 'bit' then 'Boolean'
                                                               else ty.name 
                                                          end DataTableType
    FROM sys.tables t
	    INNER JOIN sys.columns c ON c.object_id = t.object_id
	    INNER JOIN sys.types ty ON ty.system_type_id = c.system_type_id AND ty.user_type_id = c.user_type_id
		left JOIN sys.identity_columns ic ON c.object_id = ic.object_id AND c.column_id = ic.column_id
    WHERE t.name = '$TableName' AND t.schema_id =SCHEMA_ID('$SchemaName') "

    if ($Credential -eq $null)
    {
	    $columns = invoke-sqlcmd -ServerInstance $ServerInstance -Database $Database -Query $Query
    }
    else
    {
        $columns = invoke-sqlcmd -ServerInstance $ServerInstance -Database $Database -Query $Query -Username $Credential.UserName -Password $Credential.GetNetworkCredential().Password
    }

    return , $columns
}


Out-DataTableWithDataTypes has been derived from code by Chad Miller, later taken over by Rambling Cookie Monster (GitHub repository link). The original code attempted to determine the data type of the incoming object; however, the data is loaded into the object as strings, so that’s the only data type that can be determined. This adaptation adds a second input parameter, $Columns. The best way to define this parameter is with the function above, Get-SQLColumnsForDataTable.

function Out-DataTableWithDataTypes
{ 
    [CmdletBinding()] 
    param([Parameter(Position=0, Mandatory=$true, ValueFromPipeline = $true)] [PSObject[]]$InputObject,
          [Parameter(Position=1, Mandatory=$true)] [PSObject[]]$Columns)
 
    Begin 
    { 
        
        $dt = new-object Data.datatable 
        if ($columns.Count -gt 0)
        {
            foreach($column in $Columns)
            {
                [void]$dt.Columns.Add($column.ColumnName, $column.DataTableType)
            }
        }
    } 
    Process 
    { 
        foreach ($object in $InputObject) 
        { 
            $DR = $DT.NewRow()   
            foreach($property in $object.PsObject.get_properties()) 
            {     
                if ($DT.Columns.Contains($property.Name))
                {
                    if ($property.value -eq $null -or $property.Value -eq "" -or $property.Value -eq "NULL")
                    {
                        $DR.Item($property.Name) = [System.DBNull]::Value
                    }
                    else 
                    {
                        $DR.Item($property.Name) = $property.value
                    }
                } 
            }   
            $DT.Rows.Add($DR)  
             
        } 
    }  
      
    End 
    { 
        Write-Output @(,($dt)) 
    } 
 
}

With these 2 functions, the basic functionality is there to load data into a SQL Server database table with the proper data types. As with any data loading process, if the data types do not match, Out-DataTableWithDataTypes will fail on the load.
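For context, here is a minimal sketch of how the two functions might be chained together in a load script. The CSV path is hypothetical, and the connection string assumes Windows authentication against a local instance holding AdventureWorks:

# Get the column definitions for the destination table ($null credential = Windows authentication).
$columns = Get-SQLColumnsForDataTable -ServerInstance localhost `
                                      -Database AdventureWorks `
                                      -Credential $null `
                                      -SchemaName Production `
                                      -TableName Product

# Build a typed DataTable from the CSV and bulk copy it into the destination table.
$dataTable = Import-Csv "C:\Demos\CSVs\Product.csv" | Out-DataTableWithDataTypes -Columns $columns

$bulkCopy = New-Object Microsoft.Data.SqlClient.SqlBulkCopy -ArgumentList "server=localhost;database=AdventureWorks;Integrated Security=SSPI"
$bulkCopy.DestinationTableName = "Production.Product"
$bulkCopy.WriteToServer($dataTable)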

As with most every blog on the Internet, this code has been working for me and the needs I have for it. I fully recognize it does not meet every scenario. You are welcome to use this code, but it comes with no guarantees.

Working with Data in .Net

As a PowerShell developer (yes, I consider PowerShell coding development, when it’s for pre-ETL processing), there are a handful of very useful namespaces in .Net.

System.Data – The “basic” data containers are classes in this namespace, including DataTable and DataSet. These are generic types, regardless of the source or destination database or data source.

System.XML – If you’re ingesting XML data, there are several classes here you’ll use. It’s likely you’ll use XmlReader to stream data or XmlDocument to load a document into memory to read or manipulate.

Microsoft.Data.SqlClient – The updated version of System.Data.SqlClient, used to access SQL Server, including Azure SQL Database and Azure SQL Managed Instance. SqlConnection, SqlCommand and SqlBulkCopy are a data developer’s best friends (a small sketch follows below).

OfficeOpenXML.Core.ExcelPackage – Working with XLSX files requires an assembly found in the PowerShell Gallery, among other places. Classes in this assembly allow developers to read, update and write Excel data.
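For a quick, hedged illustration of the kind of call pattern the Microsoft.Data.SqlClient classes enable from PowerShell (the query and connection string are illustrative, and the assembly is assumed to already be loaded in the session):

# Open a connection, run a scalar query, and close the connection.
$connection = New-Object Microsoft.Data.SqlClient.SqlConnection -ArgumentList "server=localhost;database=AdventureWorks;Integrated Security=SSPI"
$command    = New-Object Microsoft.Data.SqlClient.SqlCommand -ArgumentList "SELECT COUNT(*) FROM Production.Product", $connection

$connection.Open()
$rowCount = $command.ExecuteScalar()
$connection.Close()

Write-Output "Production.Product contains $rowCount rows"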

This is a placeholder and this post will be updated at a later date…

A Comeback in 2022

I’m lazy and a bit selfish.

What a way to start off a post, eh!?

I was getting some content out in 2019 and 2020, as well as starting to hit a stride with presentations at SQL Saturdays. Then a new bug came and rocked everyone, COVID-19. SQL Saturdays disappeared, as well as gatherings in general, for the next 12 – 15 months. By not having that motivator of speaking to people about SQL Server and supporting technologies, my blogging has been on hiatus. When blogging on a topic, I try my best to get the details right, so I can hopefully help those who come across the post. As a result, I tend to spend 6 – 8+ hours crafting a single post (some of that may be a little anxiety and over-analyzing). If I perceive there may be a likelihood of views dropping, because I’m not introducing myself to people, my lazy side comes out very quickly and says, “Why bother? Let’s go watch some TV or take a nap.”

That doesn’t mean life has stood still and I haven’t learned anything. Quite the opposite, in fact. I have been extremely fortunate to work for a growing company in a key industry in this country. While the pandemic did affect us, it was the catalyst to grow in new directions. This required the data team (of which I’m one of three co-leads) to develop new ways to absorb the growth in our data, not just in volume, but in the breadth of information we’re managing.

Over the next several months, I’ll be posting about these new adventures, generalizing the concepts for the larger audience. You should see quite a bit coming about PowerShell and pre-ETL processing. Sprinkled in will be posts about Azure Automation and Azure Data Factory. By the time I get posts caught up to July 2022, I’m sure I will have learned, or been introduced to, new topics to explore with future posts.

Yesterday (July 7th, 2022), the PASS Data Community Summit announced general sessions and speakers for the 2022 event. My session, “Power-up the Data Warehouse with Pre-ETL Processing,” was chosen. I am incredibly honored to have the opportunity to speak at the Summit, in person. I did speak at the 2020 PASS Virtual Summit, pre-recording the session, so this will be my 2nd appearance at the “main event”.

I’ll try to keep the train going longer this time, providing new content.

Stay tuned…

Standing up Azure SQL Managed Instance & Connect to Storage Account

This is a quick article, related to connecting an Azure SQL Managed Instance to an Azure Storage Account.

When creating an Azure SQL Managed Instance, you have the option of creating a public endpoint and/or configuring the connection type of the private endpoint (as shown below). The default connection type for the private endpoint is Proxy; however, Microsoft recommends using the Redirect method.

Using Redirect will create a Network Security Group (NSG) with various security rules. In the outbound rules, I have found that, at most, two rules need to be added, which are highlighted in the screenshot below. One rule allows any connection from the SQL Managed Instance subnet (172.x.x.x/27, as an example) to the subnet with the primary NIC of the storage account (172.x.x.0/24). The other rule allows traffic from the MI subnet to the ‘Storage.EastUS’ service tag.

Partial listing of outbound network security group rules in Azure for a SQL Managed Instance.
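For illustration, here is a hedged sketch of adding one of these outbound rules with the Az PowerShell module. The NSG name, resource group, priority, and address prefix are placeholders for your own environment:

# Allow outbound traffic from the Managed Instance subnet to the regional Azure Storage service tag.
$nsg = Get-AzNetworkSecurityGroup -Name "<managed instance NSG>" -ResourceGroupName "<resource group>"

Add-AzNetworkSecurityRuleConfig -NetworkSecurityGroup $nsg `
                                -Name "allow_mi_subnet_to_storage" `
                                -Description "Allow the SQL MI subnet to reach Azure Storage in the region" `
                                -Access Allow `
                                -Protocol "*" `
                                -Direction Outbound `
                                -Priority 1100 `
                                -SourceAddressPrefix "172.x.x.x/27" `
                                -SourcePortRange "*" `
                                -DestinationAddressPrefix "Storage.EastUS" `
                                -DestinationPortRange "*"

Set-AzNetworkSecurityGroup -NetworkSecurityGroup $nsg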

More investigation needs to be completed to tighten down these outbound rules, so they target specific ports, and ideally specific IP addresses. This will evolve…

Recent and Upcoming Presentations

The first set of blogs I posted on this site center around Row Level Security in Microsoft SQL Server. In addition to these posts, I’ve decided to present the topic at PASS events, including multiple SQL Saturdays. Below are the events I’ve already presented at in 2020 and am hoping to present at later this year.

Confirmed Events

Submission-pending Events

These are wonderful opportunities for multiple reasons. SQL Saturdays are unique events, in that they are hosted by the PASS community in each city, providing a free day of training (with the exception of a small fee for lunch). How crazy is that? With each event, I’m slowly trying to improve myself in the first three areas listed below. Plus, everyone needs some fun now and again.

  • Networking – A few speakers are planning to be at a number of these events, which will give me an opportunity to meet and get to know them, as well as other speakers and event attendees.
  • Building Technical Skills – Just as I’m presenting on a relatively unknown topic, Row Level Security, there are others speaking about other, lesser-known topics. Of course, many of the presentations will cover better known topics, and we’ll get different perspectives from each speaker.
  • Building Communication Skills – For a long time, I was quite terrified of public speaking, which has likely had a negative impact on my career. By “practicing” my public speaking skills, I can only improve, and this will help further my career, especially if I want to go into a management or product evangelist-type role.
  • Seeing Different Cities – Who doesn’t like to check out different cities and see what they have to offer? Having a day or two free in each city will give me an opportunity to do exactly that.