
SQLBits 2023 Recap – Part 1

After several months of planning and preparation (…just on my part – organizers spent much longer), SQLBits 2023 came and went this past week at the ICC Wales, near the beautiful community of Newport. This was my first trip to SQLBits, and I didn’t know what to expect. It was a really great conference, from the speakers and topics to the organization and logistics of running such a large event. It was also a time to meet new people and connect with others in the SQL Server community.

What did I learn this week, you ask?

#1 – Rob Sewell and Ben Weissman have amazing suits! Not many could pull it off, but they…well, it’s not that bad!

#2 – Erin Stellato reminded me about Extended Events in SQL Server

Admittedly, I either try to brute-force my way through issues or just “try it again,” hoping it was intermittent. When I do look for more help, I tend to go to the Query Store, and then on to Profiler to get more detail about what’s going on. What I didn’t know is that Profiler only exposes about 180 events to track. I mean, there are only 180 events. According to Erin, in SQL Server 2022 there are closer to 2,000 Extended Events to track!! Different events can also be combined with different filters to get what you need more quickly. Next time my developers say the SQL instance is slow, Extended Events will be fired up.
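As a concrete starting point (my own sketch, not something from Erin’s session), here is a minimal Extended Events session created from PowerShell with Invoke-Sqlcmd. The session name, the one-second threshold, and the file name are placeholders; duration is measured in microseconds.

# Create and start a minimal Extended Events session that captures completed
# statements running longer than 1 second, written to an .xel file target.
# (Session name, threshold, and file name are hypothetical.)
$xeSession = @"
CREATE EVENT SESSION [SlowStatements] ON SERVER
ADD EVENT sqlserver.sql_statement_completed (
    ACTION (sqlserver.sql_text, sqlserver.database_name, sqlserver.client_app_name)
    WHERE duration > 1000000  -- microseconds
)
ADD TARGET package0.event_file (SET filename = N'SlowStatements.xel');
ALTER EVENT SESSION [SlowStatements] ON SERVER STATE = START;
"@

invoke-sqlcmd -ServerInstance localhost -Database master -Query $xeSession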

#3 – Microsoft Purview and Azure Synapse Analytics

I need to learn a lot more about these two products. My manager keeps telling me Purview can do this and it can do that. Can it really?? After this week at SQLBits, I believe it can do those things, and a lot more. My company has grown quickly over the years, and a lot of key data lives in Excel spreadsheets, CSV files, etc. Knowing who owns those files, and that they understand the information the files provide, is really important to us. Purview gives us that ability…I think. I need to read more…

Azure Synapse Analytics will also play a large role, maybe with Dataverse, maybe not. With Azure Synapse Analytics landing the data in a data lake, Spark and Databricks can be added on top to query it. Power BI then sits on top of all of that, providing the visualizations our users really need.

So much more there to discover…

#4 – A new type of Database Project is coming

For the past 6–10 years, Microsoft has had a project type in Visual Studio to store state-based SQL Server database schemas. Ideally, when a new project is started, the database is created in the project and kept under source control from the beginning. What’s more realistic is that once the database reaches UAT or production, a schema comparison is used to load the full database schema into the project.

With Azure Data Studio and Visual Studio Code, there are extensions to manage schema changes for SQL Server/Azure SQL DB/Azure SQL Managed Instance. Both of these extensions will “compile” into a DACPAC, which is deployed to the database with sqlpackage.exe.
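For local testing of that build-and-publish flow, a sketch like the following may help. It assumes an SDK-style (Microsoft.Build.Sql) project, which these extensions can create; the project path, server, and database names are placeholders.

# Build the database project into a DACPAC (SDK-style projects build with dotnet),
# then publish the DACPAC with sqlpackage.exe. Paths and names are hypothetical;
# check the actual bin output path for your project.
dotnet build .\Warehouse\Warehouse.sqlproj -c Release

sqlpackage.exe /Action:Publish `
    /SourceFile:".\Warehouse\bin\Release\Warehouse.dacpac" `
    /TargetServerName:"localhost" `
    /TargetDatabaseName:"Warehouse"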

As I continue to refine my workshop to deploy database changes with Azure DevOps (and GitHub), I’ll be moving over to these 2 new tools.

There’s more to share, certainly, from SQLBits. Keep an eye out for the next post…

Azure Pipelines – Deploy SQL Changes with Service Principal

For several years, Microsoft has provided a project type in Visual Studio (in tooling previously known as BIDS) called SQL Server Database Project. This project allows developers to manage database schema in a project. When the project is built, a DACPAC is generated, which sqlpackage.exe uses to deploy (technically, publish) changes to a server. This is available for SQL Server, as well as Azure SQL Database/Azure SQL Managed Instance.

In Azure DevOps, a built-in task, Azure SQL Database deployment, is available to deploy DACPACs to Azure SQL DB/Azure SQL Managed Instance. (Going forward, I will shorten these two services to simply Azure SQL.) The Classic Editor for Azure Pipelines provides a nice user interface to configure the task. It is important to remember that when the pipeline executes, a YAML file is what actually runs, and in YAML everything is stored as a string. Within the task properties, there are multiple authentication types, as shown in the screenshot below. For 4 of the top 5 (excluding Active Directory – Integrated), passwords will be stored in the configuration…in plain text.

To provide a more secure deployment environment, those options are not ideal. A Service Principal is a better fit: it is an Azure Active Directory object that can be referenced by GUIDs and a secret.

Let’s get started…

Create an App Registration

Creating an App Registration isn’t difficult, but it does require elevated permissions in Azure Active Directory. Unless you are an administrator, your infrastructure engineer or Azure administrator will need to perform the setup. The portal steps are below; an Azure CLI sketch of the same steps follows the list.

  • In the Azure Portal, search for your organization’s Azure Active Directory. Within the blade, locate App registrations from the left navigation bar, under Manage.
  • The organization’s existing enterprise applications will be listed. Click New registration from the top of the blade.
  • Define the Name of the App registration to get started. Under Supported account types, the most likely option is the first, Accounts in this organizational directory only. Click Register to create the registration.
  • Upon completion, the portal will take you to the application, which has information like the following screenshot.
  • Keep track of the Application (client) ID and Directory (tenant) ID from the Essentials section of the Overview.
  • Next, a secret needs to be created in the App registration. Under Manage on the navigation bar, choose Certificates & secrets.
  • In the middle of the screen, choose New client secret. Enter a description and the duration the secret should be valid. Click Add to create the secret.
  • The new secret will appear in the list. In the Value column, copy the secret to the clipboard and temporarily save it to a text editor. (Note: this needs to be done before navigating away from the secrets blade. Otherwise, the secret will not be available to copy and a new secret will need to be created.)
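If you would rather script it (or hand a script to the administrator), roughly the same steps can be done with the Azure CLI. This is only a sketch, run under the assumption that you are logged in with sufficient Azure AD permissions; the display name is a placeholder, and parameter behavior can vary a bit between CLI versions.

# Create the App registration and capture its application (client) ID.
$appId = az ad app create --display-name "WorkshopServiceConnection" --query appId -o tsv

# Create the service principal for the registration.
az ad sp create --id $appId

# Create a client secret (shown once in the output, like the portal's Value column).
az ad app credential reset --id $appId --years 1

# The directory (tenant) ID is available from the account context.
az account show --query tenantId -o tsv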

Create the Azure DevOps Service Connection

With the App registration created, the Azure DevOps service connection can be created. In a PowerShell session, use the following script. From the steps above, <Secret>, <ApplicationID> and <DirectoryID> are known. The SubscriptionID can be found in the Azure Portal by searching for the company’s subscription that contains the SQL instance. <DisplayName> is the name the Service Connection will have within Azure DevOps. The --organization argument needs your organization name in Azure DevOps; your default URL to Azure DevOps is what you want here. Finally, <Project> is the Azure DevOps project in which you want to create the Service Connection.

$env:AZURE_DEVOPS_EXT_AZURE_RM_SERVICE_PRINCIPAL_KEY="<Secret>"
az devops service-endpoint azurerm create `
          --azure-rm-service-principal-id <ApplicationID> `
          --azure-rm-subscription-id <SubscriptionID> `
          --azure-rm-tenant-id <DirectoryID> `
          --name <DisplayName> `
          --organization "https://dev.azure.com/<Organization>" `
          --project <Project>
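If the az devops commands aren’t recognized, the Azure DevOps CLI extension likely isn’t installed yet. Once the connection is created, it can be verified from the same session; the list command’s arguments mirror the ones above.

# One-time setup of the Azure DevOps extension for the Azure CLI.
az extension add --name azure-devops

# Confirm the new service connection exists in the project.
az devops service-endpoint list `
          --organization "https://dev.azure.com/<Organization>" `
          --project <Project>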

Assign SQL Permissions

The last step is to grant the App registration access to the Azure SQL database. Within SQL Server Management Studio, or your preferred T-SQL editor connected to the database, run the following T-SQL statements while logged in as a database administrator. Two assumptions are being made: first, that the service connection name is WorkshopServiceConnection, and second, that it should be added to the db_owner role. Please adjust these appropriately for your environment.

-- The user name must match the display name of the Azure AD object created above.
create user [WorkshopServiceConnection] from external provider;
-- db_owner keeps the example simple; use a narrower role if your deployments allow it.
exec sp_addrolemember @rolename = 'db_owner', @memberName = 'WorkshopServiceConnection';

Wrap-up

With these items configured, the Azure Pipeline can now deploy DACPACs to an Azure SQL database in a much more secure manner. Passwords are not being stored in the clear. If someone gets hold of your Azure DevOps project, they might see the application ID from Azure Active Directory, but the secret won’t be visible. The authentication to Azure SQL is managed through the App registration, which requires the secret along with the ApplicationID.

SQL Saturday Austin 2023 – Recap

For those who attended my session, “Introducing Automated Deployments with Azure DevOps,” thank you. I think we had a good discussion on multiple pieces covered in the slides.

Speaking of the slides, here is a link to them – https://endrephome.files.wordpress.com/2023/03/sqlsaturdayaustin2023-devops.pptx

In the main demo, a deployment pipeline was created to build AND release code to an Azure SQL database. Below is the pipeline as we ended it. There was a question about deploying database schema changes to on-premises SQL Servers. There are tasks available in the Marketplace that will do this. It can also be accomplished through the use of a PowerShell script.

Using the pipeline above to deploy locally requires a few changes. First, instead of using Azure SQL Database deployment for the final task, add a PowerShell script task, as was done in the build section of this pipeline. The script will be inline and should look similar to the snippet below (the options should all be on a single line with the command – they are split here for readability). The command sqlpackage.exe will be executed with an Action of publish. More information regarding the syntax for sqlpackage.exe can be found here.

sqlpackage.exe /Action:publish 
    /SourceFile:$(Pipeline.Workspace)\WorldWideImporters.dacpac 
    /TargetServerName:myserver.mydomain.com
    /TargetDatabaseName:WideWorldImporters
    /TargetEncryptConnection:false

What was not mentioned in the session is that the PowerShell script needs to be able to connect to your on-premises SQL Server. To do that, an agent will be needed on your network and registered with Azure DevOps. Defining an agent pool is beyond the scope of this post, but this is the documentation I used to create my pool. Below is what my pool, WorkshopPool, has for agents. (Yes, I have 4 agents for workshops; you only need 1 agent in a pool.)

In the pipeline’s deployment job, below, the properties need to be adjusted. Instead of inheriting the agent from the pipeline, the Agent pool needs to be the one created for your on-premises agent. When the pipeline runs, this job will run on your agent, log into the SQL Server instance, and deploy the DACPAC.

With a little luck the first time and knowledge on subsequent setups, on-premises SQL Server database deployments should become reasonably easy, though probably never as easy as deploying to Azure SQL DB or Azure SQL Managed Instance.

Azure Data Factory Deployment Mistake

Recently, I was working with a release pipeline to add our Azure Data Factory solution. This was the first attempt to release Azure Data Factory components, and of course there were many issues to resolve. The release pipeline handles databases from multiple Git repositories and now ADF code as well, so it’s no longer an out-of-the-box solution.

Microsoft provides documentation for AzureResourceManagerTemplateDeployment@3, which is helpful, if you read it thoroughly. Of course, I did not read it thoroughly, so I didn’t see the note about the DeploymentMode options. While working through the other issues, I decided to set the deploymentMode parameter to ‘Complete’ instead of the default value, ‘Incremental’. The task looked a lot like the example below.

    - task: AzureResourceManagerTemplateDeployment@3
      displayName: 'Deploy ADF Solution'
      inputs:
        deploymentScope: 'Resource Group'
        azureResourceManagerConnection: '$(azureSubscription)'
        subscriptionId: '<subscriptionID>'
        action: 'Create Or Update Resource Group'
        location: '<Azure region>'
        resourceGroupName: '<resource group>'
        templateLocation: 'Linked artifact'
        csmFile: '$(System.ArtifactsDirectory)/<SolutionPath>/ARMTemplateForFactory.json'
        csmParametersFile: '$(System.ArtifactsDirectory)/<SolutionPath>/ARMTemplateParametersForFactory.json'
        overrideParameters: -factoryName xxxxxxxxxxxxx
                            -KeyVault_properties_typeProperties_baseUrl $(KeyVault-AutomatedTesting)
                            -ADLS_Storage_properties_typeProperties_url $(ADLSLocation-AutomatedTesting)
        deploymentMode: Complete

Oops! That was a mistake. The deployment failed, with a new error message saying the data factory could not be found. Excuse me?! I had just seen it a couple of minutes ago. Going back to Azure Data Factory, the instance I had been working with was no longer listed. What happened in the last couple of minutes? The only thing I had done was change the DeploymentMode to ‘Complete’.

Referring back to the documentation, I found this little nugget.

Complete mode deletes resources that are not in your template. Complete mode takes relatively more time than incremental mode. If the task times out, consider increasing the timeout or changing to the Incremental mode.

My ARM template only included the code contained within Azure Data Factory. It did not contain the definition of the data factory itself. As a result, the Complete deployment mode deleted the data factory (along with a storage account) and then tried to deploy the code. Of course this didn’t work.

Fortunately, a proof-of-concept environment had been created to develop the release pipeline, so I was the only one using the instances that were deleted. This mistake is a good reminder of why resource groups should be specific to a resource instance and not encompass a large number (or even a modest number) of resources. It’s also a reminder to read the documentation when new features are being used. ADO Pipeline tasks (and jobs) are concisely defined in YAML, but they perform a lot of work, and it is easy to run into unexpected outcomes.

SQL Saturday Boston Wrap-Up

SQL Saturday Boston was a major success, both for the event and for myself. The organizing committee did a great job bringing together a stellar group of speakers who covered many great topics. A HUGE thank you to New England SQL Server User Group for the opportunity to speak about PowerShell. Thank you to each and every attendee of my session, Reducing Data Warehouse Processing with PowerShell Pre-ETL Processing. It was great to have so many people show interest in PowerShell and how data developers, data analysts, QA analysts, and others can leverage its capabilities.

The PDF version of the slide deck is available for download.

In addition to the slides, below are the demo scripts from the presentation. The data files used in the presentation are being made available, but you really shouldn’t need them. The idea is to use the scripts to get inspiration for your own solutions. The examples make use of two functions I’ve created. They are documented in this post.

CSV Validation – Loading data from a CSV into SQL Server is straightforward, if the columns in the CSV file match the database table’s. This validation performs a cursory check of the columns being passed in a CSV file. Certainly, there are other checks that can be performed, but these are a couple to get a solution started.

# It is assumed the CSV being used contains a header record in the first row of the file. As a result,
# only the first record ( -TotalCount 1 ) is being read.

$header = Get-Content -Path $FileName -TotalCount 1
# Add .Replace("`"", "") to the end of the split operation to remove quotes from the values.
$headerColumns = ($header -split ',') 


# With a database table designed with the schema intended to be in the CSV 
#   file, query SQL Server to get the list of columns (and data types).
# NULL is being passed to the -Credential parameter, which informs the 
#   function to use Windows authentication to connect to the SQL instance.
$databaseTableColumns= Get-SQLColumnsForDataTable `
                -ServerInstance localhost `
                -Database AdventureWorks `
                -Credential $null `
                -SchemaName Production `
                -TableName Product

# Perform a quick column count to see if the number of columns match.  Equal counts do not
#   guarantee the columns themselves match, but it is a cheap first check.
if ($headerColumns.Count -eq $databaseTableColumns.Count)
{
    write-output "The columns in the file match the table definition"
}
else
{
    write-output "The number of columns in the source file does not match the table definition"
}

# Check the CSV's columns to see if columns from the table are missing.
$missingColumn = $false
foreach ($column in $databaseTableColumns) 
{
    if ($headerColumns -notcontains $column.ColumnName) 
    { 
        $missingColumn = $true
        write-output "   Column, $($column.ColumnName), was not found" 
    } 
}
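As written, the demo sets $missingColumn but doesn’t act on it. A small follow-up like the snippet below (my addition, not part of the original demo) turns the check into a gate before any load step.

# Stop before any load step if required columns are missing from the file.
if ($missingColumn)
{
    write-output "Validation failed for $FileName; skipping the load"
    return
}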

Excel Validation – Like the CSV validation, validating Excel files is also possible with the PSExcel module (technically, only EPPlus.dll is being used, for the OfficeOpenXML namespace). As you compare the two scripts, there is more work in the Excel validation, due to needing the assistance of EPPlus.dll to access the data, versus reading the first line of text from the CSV file with Get-Content.

# Import the PSExcel module from the PowerShell Gallery; only needs to be run once on a computer.  
# Run the PowerShell session in administrator mode to install the module for all users.

install-module PSExcel
import-module PSExcel

$excelFile = New-Object OfficeOpenXml.ExcelPackage $FileName
$workbook  = $excelFile.Workbook

# Every worksheet in Excel has a name on the tab.  That name goes here.
$worksheet = $workbook.Worksheets["Details"]
 
# $worksheet.Dimension.Rows returns the number of records in the Excel worksheet.  
# It may be useful to log the number of records received in a pre-processing/validation step.
Write-Output "#################"
Write-Output "   File contains $($worksheet.Dimension.Rows) rows"
Write-Output "#################"

# Like Dimension.Rows, Dimension.Columns returns the number of columns with data in the worksheet.
$worksheetColumns = $worksheet.Dimension.Columns

 #
 # Get columns for Sales.SalesOrderDetails
 #
 $dbColumns = Get-SQLColumnsForDataTable -ServerInstance localhost `
                                         -Database AdventureWorks `
                                         -Credential $null `
                                         -SchemaName "Sales" `
                                         -TableName "SalesOrderDetail"

if ($worksheetColumns -eq $dbColumns.Count)
{
    write-output "             "
    write-output "The Excel file and the database table have the same number of columns"
    write-output "             "
}
else
{
    write-output "The Excel file and the database table do not have the same number of columns"

    #
    # Find the differing columns
    #
    $worksheetColumnNames = @()
    $dbColumnNames = $dbColumns.ColumnName
    foreach ($columnNum in 1..$worksheetColumns)   # worksheet columns are 1-based, so include the last column
    {
        $columnName = $worksheet.Cells.Item(1,$columnNum).Text
        
        if ($dbColumns.ColumnName -notcontains $columnName)
        {
            write-output "$columnName is not in the database table"
        }

        $worksheetColumnNames += $columnName
    }
}

JSON to CSV Transformation – This script is directly related to the screenshot of the JSON file from slide 21 in the slide deck, creating a CSV file from the highlighted elements in the JSON.

$activities= (Get-Content $FileName | convertfrom-json).Value

$activitiesTable = $activities | select activityName, activityRunStart, activityRunEnd, durationInMs, 
                                  @{Name="storedProcedureName";
                                    Expression={
                                                    ($_ | select-object -expandProperty input).storedProcedureName
                                               }
                                   }

$activitiesTable | format-table

$activitiesTable | export-csv c:\demos\activities.csv -NoTypeInformation

Using Metadata to Load Tables – When you need to load files into tables repeatedly over time, such as updated customer information on a monthly basis, it may be useful to create a table to define metadata about those files. This could include the naming pattern of the file, which database the data should reside in, along with the specific schema and table names. This example takes that metadata and instantiates a bulk copy object from Microsoft.Data.SqlClient. With a System.Data.DataTable object holding the contents of the CSV file, that data is loaded into the specified table.

$query = "select FileNamePattern, 
                 DestinationDatabaseName, 
                 DestinationSchemaName, 
                 DestinationTableName, 
                 DestinationSchemaTableName 
          from Logging.LoadMetadata"

# invoke-sqlcmd with no credential uses context of the current user
$files = invoke-sqlcmd -ServerInstance localhost -Database PowerShellDemo -Query $query 

foreach ($file in $files)
{
    write-output "...Loading $($file.DestinationSchemaTableName)"

    #Query sys.columns & sys.types to get the fields of a table
    $dbColumns = Get-SQLColumnsForDataTable -ServerInstance localhost `
                                            -Credential $null `
                                            -Database $file.DestinationDatabaseName `
                                            -SchemaName $file.DestinationSchemaName `
                                            -TableName $file.DestinationTableName

    #Load CSV data into memory and generate a DataTable with defined data types
    $dataTable = Import-CSV "C:\Demos\CSVs\$($file.FileNamePattern)" | Out-DataTableWithDataTypes -Columns $dbColumns

    # Bulk copy the DataTable into the specified table
    $bulkCopy = new-object ([Microsoft.Data.SqlClient.SqlBulkCopy]) `
                    -ArgumentList "server=localhost;database=$($file.DestinationDatabaseName);Integrated Security=SSPI"
    $bulkCopy.DestinationTableName = $file.DestinationSchemaTableName
    $bulkCopy.WriteToServer($dataTable)
}
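The Logging.LoadMetadata table itself isn’t shown in the demos. A minimal definition consistent with the query above might look like the sketch below; the column types and lengths are my assumptions, so adjust them to your files.

# Hypothetical definition of the metadata table queried above (assumes the Logging schema exists).
$createMetadata = @"
CREATE TABLE Logging.LoadMetadata
(
    FileNamePattern            varchar(260) NOT NULL,
    DestinationDatabaseName    sysname      NOT NULL,
    DestinationSchemaName      sysname      NOT NULL,
    DestinationTableName       sysname      NOT NULL,
    DestinationSchemaTableName AS (DestinationSchemaName + N'.' + DestinationTableName)
);
"@
invoke-sqlcmd -ServerInstance localhost -Database PowerShellDemo -Query $createMetadata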

If you have any questions about these demo scripts, you can find me on Twitter, @em_dempster.

Azure DevOps – Job-Focused vs Task-Focused

This is the first post in the series, Azure DevOps Odds and Ends. The starting page has information about the code environment I’m working with, as well as links to other posts in this series.

The first step in consolidating builds into a single pipeline was to gather code from four repositories into a single location. From there, the projects could be built and deployed as a set. Based on our existing pipelines, I figured I needed separate jobs: first one to check out code, then jobs to build and publish the solutions. Below is the start of the “BuildDACPACs” stage of the pipeline.

- stage: BuildDACPACs
  jobs:
  - job: BuildDACPACs
    displayName: Build and publish database DACPACs
    workspace:
      clean: all
    steps:
    # First step: check out the repositories with the source code and other scripts the pipeline needs.
    - checkout: git://AutomatedTestingPoC/AutomatedTestingPoC #Optional but included to gain access to artifacts in the "root" repo, such as PowerShell scripts needed in the deployment process
    - checkout: git://AutomatedTestingPoC/Staging
    - checkout: git://AutomatedTestingPoC/SystemLogging
    - checkout: git://AutomatedTestingPoC/Warehouse
    - checkout: git://AutomatedTestingPoC/AzureDataFactory
  - job: BuildWarehouseDB
    displayName: Build Warehouse database project
    steps:
    # Second step: Build and publish the Warehouse database project 
    #   Two things:
    #     1) the DACPAC is buried, so it is being copied to a pre-defined 'ArtifactStagingDirectory' from the bin folder
    #     2) There are 2 settings files for publishing; one for Automated Testing and another for other environments.  Those need to be copied into the artifacts folder, along with the DACPAC.
    - task: VSBuild@1
      displayName: Build Warehouse Database
      inputs:
       solution: "$(Build.SourcesDirectory)/Warehouse/**/*.sln"
       platform: $(buildPlatform)
       configuration: $(buildConfiguration)
       clean: true
    - powershell: copy-item (get-childItem "$($env:Build_SourcesDirectory)\Warehouse" -Filter *.dacpac -Recurse | select fullname).FullName $($env:Build_ArtifactStagingDirectory)
      displayName: Copy Warehouse DACPAC
    - powershell: copy-item (get-childItem "$($env:Build_SourcesDirectory)\Warehouse" -Filter Default.publish.xml -Recurse | select fullname).FullName "$($env:Build_ArtifactStagingDirectory)\Warehouse.Publish.xml"
      displayName: Copy Warehouse Publish Settings XML
    - task: PublishBuildArtifacts@1
      displayName: Publish Warehouse Database
      inputs:
        PathtoPublish: '$(Build.ArtifactStagingDirectory)'
        ArtifactName: 'BlueMonkey_artifacts'
        publishLocation: 'Container'

The syntax of the YAML was good and the pipeline started to run. The checkouts worked as expected, so there should have been 4 directories with code in them. The Warehouse database build was the first to start, and it failed on the build step itself. In Image #1, the error states the solution couldn’t be found in the folder path, ‘D:\a\1\s\Warehouse\**\*.sln’. Not being the most observant character, I was stumped and didn’t understand what was happening. After some digging, I realized the step before the build was doing exactly what it said it was doing. Imagine that!! I had checked out code earlier, but code was being checked out again before the build started. Looking through the YAML above, there wasn’t an explicit step to check out the code again. Where was this being called?

Image #1 – Pipeline logs; Warehouse database failed

It turns out that when a job is defined, ADO performs a number of steps, including setting up an agent. This occurred for the first job that checked out code, and it happened again for the job that built the database, each time with its own workspace. One of the automated steps is to check out code from the repository in which the YAML file is defined. In order to not wipe out the code gathered earlier, we can’t use a separate job for each major process to build and deploy the solution. On line 5 of the YAML above, a clue was provided by the workspace: clean option. That led me to dig deeper into what was happening and to review the logs in more detail.

Starting: Initialize job
Agent name: 'Hosted Agent'
Agent machine name: 'WIN-6C2TM2SQ98G'
Current agent version: '2.211.0'
Operating System
Runner Image
Runner Image Provisioner
Current image version: '20221002.2'
Agent running as: 'VssAdministrator'
Prepare build directory.
Set build variables.
Download all required tasks.
Downloading task: VSBuild (1.208.0)
Downloading task: PowerShell (2.210.0)
Downloading task: PublishBuildArtifacts (1.200.0)
Checking job knob settings.
   Knob: AgentToolsDirectory = C:\hostedtoolcache\windows Source: ${AGENT_TOOLSDIRECTORY} 
   Knob: AgentPerflog = C:\agents\perflog Source: ${VSTS_AGENT_PERFLOG} 
Finished checking job knob settings.
Start tracking orphan processes.
Finishing: Initialize job

Instead, we need to structure the YAML to run many tasks within a single job. The build and publish tasks are now contained within one job, BuildArtifacts. When the job is initialized, the current repository is still checked out automatically, but the next steps explicitly check out all of the repositories, placing each one into its own subfolder under $(Build.SourcesDirectory).

- stage: BuildArtifacts
  jobs:
  - job: BuildArtifacts
    displayName: Build and publish database DACPACs and ADF code
    workspace:
      clean: all
    steps:
    # First step: check out the repositories with the source code and other scripts the pipeline needs.
    - checkout: git://AutomatedTestingPoC/AutomatedTestingPoC #Optional but included to gain access to artifacts in the "root" repo, such as PowerShell scripts needed in the deployment process
    - checkout: git://AutomatedTestingPoC/Staging
    - checkout: git://AutomatedTestingPoC/SystemLogging
    - checkout: git://AutomatedTestingPoC/Warehouse
    - checkout: git://AutomatedTestingPoC/AzureDataFactory
    # Second step: Build and publish the Warehouse database project 
    #   Two things:
    #     1) the DACPAC is buried, so it is being copied to a pre-defined 'ArtifactStagingDirectory' from the bin folder
    #     2) There are 2 settings files for publishing; one for Automated Testing and another for other environments.  Those need to be copied into the artifacts folder, along with the DACPAC.
    - task: VSBuild@1
      displayName: Build Warehouse Database
      inputs:
       solution: "$(Build.SourcesDirectory)/Warehouse/**/*.sln"
       platform: $(buildPlatform)
       configuration: $(buildConfiguration)
       clean: true
    - powershell: copy-item (get-childItem "$($env:Build_SourcesDirectory)\Warehouse" -Filter *.dacpac -Recurse | select fullname).FullName $($env:Build_ArtifactStagingDirectory)
      displayName: Copy Warehouse DACPAC
    - powershell: copy-item (get-childItem "$($env:Build_SourcesDirectory)\Warehouse" -Filter Default.publish.xml -Recurse | select fullname).FullName "$($env:Build_ArtifactStagingDirectory)\Warehouse.Publish.xml"
      displayName: Copy Warehouse Publish Settings XML
    - task: PublishBuildArtifacts@1
      displayName: Publish Warehouse Database
      inputs:
        PathtoPublish: '$(Build.ArtifactStagingDirectory)'
        ArtifactName: 'BlueMonkey_artifacts'
        publishLocation: 'Container'

With each repository’s code stored in its own subfolder, as originally intended, the code is found when the build task executes and the build process proceeds.

Image #2 – Pipeline logs; Warehouse build succeeded

This was my first major hurdle, and it made clear there is a lot going on behind each YAML instruction. Implementing pipelines is greatly simplified because of that, but at the same time, more care has to be taken to ensure the pipeline behaves as you expect.

Azure DevOps Pipeline Odds and Ends

When one works in a smaller company, there is an expectation of wearing many hats. As the Data Operations Team Lead, I’m no different. In addition to working with SQL Server, Azure infrastructure, Azure Data Factory and more, I have started to tackle Azure DevOps. Specifically, I’m working to enhance, or streamline, existing build and release pipelines to work better in our organization. Learning how pipelines are built and behave while revamping the process has led to some deeper dives into YAML and what Azure DevOps is doing.

Some of the developers on our team are relatively new to developing code in a corporate environment and using source control. To help reduce issues around check-ins and code merging, we have made some conservative decisions. The data warehouse solution consists of 3 database projects, Staging, Warehouse and SystemLogging, as well as Azure Data Factory artifacts. Each solution is stored in its own repository within the same Azure DevOps project. The build and release pipelines live in a proof-of-concept project, AutomatedTestingPoC.

In this series of posts, I’ll document some of the nuances found in Azure DevOps and try to explain their benefits and drawbacks. The links below are listed chronologically, as I document them.

Check back often, as the list will grow and topics will be linked.

Data Saturday Atlanta Wrap-Up

Data Saturday Atlanta was a major success, both for the event and for myself. Rie Merritt (@IrishSQL) and the organizing committee did a great job bringing together a stellar group of speakers who covered many great topics. A HUGE thank you to Atlanta Azure Data User Group for the opportunity to speak about PowerShell. Thank you to each and every attendee of my session, Reducing Data Warehouse Processing with PowerShell Pre-ETL Processing. It was great to have so many people show interest in PowerShell and how data developers, data analysts, QA analysts, and others can leverage its capabilities.

The PDF version of the slide deck is available for download.

In addition to the slides, below are the demo scripts from the presentation. The data files used in the presentation are being made available, but you really shouldn’t need them. The idea is to use the scripts to get inspiration for your own solutions. The examples make use of two functions I’ve created. They are documented in this post.

CSV Validation – Loading data from a CSV into SQL Server is straightforward, if the columns in the CSV file match the database table’s. This validation performs a cursory check of the columns being passed in a CSV file. Certainly, there are other checks that can be performed, but these are a couple to get a solution started.

# It is assumed the CSV being used contains a header record in the first row of the file. As a result,
# only the first record ( -TotalCount 1 ) is being read.

$header = Get-Content -Path $FileName -TotalCount 1
# Add .Replace("`"", "") to the end of the split operation to remove quotes from the values.
$headerColumns = ($header -split ',') 


# With a database table designed with the schema intended to be in the CSV 
#   file, query SQL Server to get the list of columns (and data types).
# NULL is being passed to the -Credential parameter, which informs the 
#   function to use Windows authentication to connect to the SQL instance.
$databaseTableColumns= Get-SQLColumnsForDataTable `
                -ServerInstance localhost `
                -Database AdventureWorks `
                -Credential $null `
                -SchemaName Production `
                -TableName Product

# Perform a quick column count to see if the number of columns match.  Equal counts do not
#   guarantee the columns themselves match, but it is a cheap first check.
if ($headerColumns.Count -eq $databaseTableColumns.Count)
{
    write-output "The columns in the file match the table definition"
}
else
{
    write-output "The number of columns in the source file does not match the table definition"
}

# Check the CSV's columns to see if columns from the table are missing.
$missingColumn = $false
foreach ($column in $databaseTableColumns) 
{
    if ($headerColumns -notcontains $column.ColumnName) 
    { 
        $missingColumn = $true
        write-output "   Column, $($column.ColumnName), was not found" 
    } 
}

Excel Validation – Like the CSV validation, validating Excel files is also possible with the PSExcel module (technically, only EPPlus.dll is being used, for the OfficeOpenXML namespace). As you compare the two scripts, there is more work in the Excel validation, due to needing the assistance of EPPlus.dll to access the data, versus reading the first line of text from the CSV file with Get-Content.

# Import the PSExcel module from the PowerShell Gallery; only needs to be run once on a computer.  
# Run the PowerShell session in administrator mode to install the module for all users.

install-module PSExcel
import-module PSExcel

$excelFile = New-Object OfficeOpenXml.ExcelPackage $FileName
$workbook  = $excelFile.Workbook

# Every worksheet in Excel has a name on the tab.  That name goes here.
$worksheet = $workbook.Worksheets["Details"]
 
# $worksheet.Dimension.Rows returns the number of records in the Excel worksheet.  
# It may be useful to log the number of records received in a pre-processing/validation step.
Write-Output "#################"
Write-Output "   File contains $($worksheet.Dimension.Rows) rows"
Write-Output "#################"

# Like Dimension.Rows, Dimension.Columns returns the number of columns with data in the worksheet.
$worksheetColumns = $worksheet.Dimension.Columns

 #
 # Get columns for Sales.SalesOrderDetails
 #
 $dbColumns = Get-SQLColumnsForDataTable -ServerInstance localhost `
                                         -Database AdventureWorks `
                                         -Credential $null `
                                         -SchemaName "Sales" `
                                         -TableName "SalesOrderDetail"

if ($worksheetColumns -eq $dbColumns.Count)
{
    write-output "             "
    write-output "The Excel file and the database table have the same number of columns"
    write-output "             "
}
else
{
    write-output "The Excel file and the database table do not have the same number of columns"

    #
    # Find the differing columns
    #
    $worksheetColumnNames = @()
    $dbColumnNames = $dbColumns.ColumnName
    foreach ($columnNum in 1..$worksheetColumns)   # worksheet columns are 1-based, so include the last column
    {
        $columnName = $worksheet.Cells.Item(1,$columnNum).Text
        
        if ($dbColumns.ColumnName -notcontains $columnName)
        {
            write-output "$columnName is not in the database table"
        }

        $worksheetColumnNames += $columnName
    }
}

JSON to CSV Transformation – This script is directly related to the screenshot of the JSON file from slide 21 in the slide deck, creating a CSV file from the highlighted elements in the JSON.

$activities= (Get-Content $FileName | convertfrom-json).Value

$activitiesTable = $activities | select activityName, activityRunStart, activityRunEnd, durationInMs, 
                                  @{Name="storedProcedureName";
                                    Expression={
                                                    ($_ | select-object -expandProperty input).storedProcedureName
                                               }
                                   }

$activitiesTable | format-table

$activitiesTable | export-csv c:\demos\activities.csv -NoTypeInformation

Using Metadata to Load Tables – When you need to load files into tables repeatedly over time, such as updated customer information on a monthly basis, it may be useful to create a table to define metadata about those files. This could include the naming pattern of the file, which database the data should reside in, along with the specific schema and table names. This example takes that metadata and instantiates a bulk copy object from Microsoft.Data.SqlClient. With a System.Data.DataTable object holding the contents of the CSV file, that data is loaded into the specified table.


$query = "select FileNamePattern, 
                 DestinationDatabaseName, 
                 DestinationSchemaName, 
                 DestinationTableName, 
                 DestinationSchemaTableName 
          from Logging.LoadMetadata"

# invoke-sqlcmd with no credential uses context of the current user
$files = invoke-sqlcmd -ServerInstance localhost -Database PowerShellDemo -Query $query 

foreach ($file in $files)
{
    write-output "...Loading $($file.DestinationSchemaTableName)"

    #Query sys.columns & sys.types to get the fields of a table
    $dbColumns = Get-SQLColumnsForDataTable -ServerInstance localhost `
                                            -Credential $null `
                                            -Database $file.DestinationDatabaseName `
                                            -SchemaName $file.DestinationSchemaName `
                                            -TableName $file.DestinationTableName

    #Load CSV data into memory and generate a DataTable with defined data types
    $dataTable = Import-CSV "C:\Demos\CSVs\$($file.FileNamePattern)" | Out-DataTableWithDataTypes -Columns $dbColumns

    # Bulk copy the DataTable into the specified table
    $bulkCopy = new-object ([Microsoft.Data.SqlClient.SqlBulkCopy]) `
                    -ArgumentList "server=localhost;database=$($file.DestinationDatabaseName);Integrated Security=SSPI"
    $bulkCopy.DestinationTableName = $file.DestinationSchemaTableName
    $bulkCopy.WriteToServer($dataTable)
}

If you have any questions about these demo scripts, you can find me on Twitter, @em_dempster.

Database Helpers for PowerShell

PowerShell has a lot of great functionality to pull data from various sources on a server, like event logs, WMI objects, the Registry, etc. However, preparing data for loading into a database is an area where PowerShell is lacking.

As supporting material for my SQL Saturday/Data Saturday presentations, here are a couple of functions, with basic functionality, to aid in scripting bulk loads to SQL Server. If a different database platform is being used, such as PostgreSQL, Oracle, etc., this code will need to be updated to support that platform.

Get-SQLColumnsForDataTable, as the name should imply, returns an array of objects representing the column names and data types from a table in a SQL Server database. The object array can be used by the second function, Out-DataTableWithDataTypes to predefine the data types of columns in a System.Data.DataTable object. This does not return the length of strings or the scale and precision of decimal and numeric data types. Additionally, the query translates common SQL data types to .Net data types. If the table includes binary, image, CLR data types, etc., those data types will need to be translated to an equivalent .Net data type.

function Get-SQLColumnsForDataTable
{
        Param(
		[Parameter(Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
		[string]$ServerInstance,
		[Parameter(Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
		[string]$Database,
		[Parameter(Mandatory=$false, ValueFromPipelineByPropertyName=$true)]
		[PSCredential]$Credential,
        [Parameter(Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [String] $SchemaName,
        [Parameter(Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [String] $TableName
        )
    $Query = "SELECT c.Name ColumnName, ty.Name DataType, case when ty.Name like '%char' then 'String' 
                                                               when ty.Name = 'int' then 'Int32' 
                                                               when ty.name = 'Integer' then 'Int32' 
                                                               when ty.name = 'Smallint' then 'Int16' 
                                                               when ty.Name = 'BigInt' then 'Int64' 
                                                               when ty.name = 'money' then 'Decimal' 
                                                               when ty.name = 'numeric' then 'Decimal' 
                                                               when ty.name = 'float' then 'Decimal'
                                                               when ty.name = 'date' then 'DateTime'
                                                               when ty.name = 'bit' then 'Boolean'
                                                               else ty.name 
                                                          end DataTableType
    FROM sys.tables t
	    INNER JOIN sys.columns c ON c.object_id = t.object_id
	    INNER JOIN sys.types ty ON ty.system_type_id = c.system_type_id AND ty.user_type_id = c.user_type_id
		left JOIN sys.identity_columns ic ON c.object_id = ic.object_id AND c.column_id = ic.column_id
    WHERE t.name = '$TableName' AND t.schema_id =SCHEMA_ID('$SchemaName') "

    if ($Credential -eq $null)
    {
	    $columns = invoke-sqlcmd -ServerInstance $ServerInstance -Database $Database -Query $Query
    }
    else
    {
        $columns = invoke-sqlcmd -ServerInstance $ServerInstance -Database $Database -Query $Query -Username $Credential.UserName -Password $Credential.GetNetworkCredential().Password
    }

    return , $columns
}


Out-DataTableWithDataTypes has been derived from code by Chad Miller and taken over by Rambling Cookie Monster (GitHub repository link). The original code attempted to determine the data type of incoming object. However, the data is loaded into the object as strings, so that’s the only data type that can be determined. This adaptation adds a second input parameter, $Columns. The best way to define this parameter is with the function above, Get-SQLColumnsForDataTable.

function Out-DataTableWithDataTypes
{ 
    [CmdletBinding()] 
    param([Parameter(Position=0, Mandatory=$true, ValueFromPipeline = $true)] [PSObject[]]$InputObject,
          [Parameter(Position=1, Mandatory=$true)] [PSObject[]]$Columns)
 
    Begin 
    { 
        
        $dt = new-object Data.datatable 
        if ($columns.Count -gt 0)
        {
            foreach($column in $Columns)
            {
                [void]$dt.Columns.Add($column.ColumnName, $column.DataTableType)
            }
        }
    } 
    Process 
    { 
        foreach ($object in $InputObject) 
        { 
            $DR = $DT.NewRow()   
            foreach($property in $object.PsObject.get_properties()) 
            {     
                if ($DT.Columns.Contains($property.Name))
                {
                    if ($property.value -eq $null -or $property.Value -eq "" -or $property.Value -eq "NULL")
                    {
                        $DR.Item($property.Name) = [System.DBNull]::Value
                    }
                    else 
                    {
                        $DR.Item($property.Name) = $property.value
                    }
                } 
            }   
            $DT.Rows.Add($DR)  
             
        } 
    }  
      
    End 
    { 
        Write-Output @(,($dt)) 
    } 
 
}
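Putting the two functions together, a typical call chain looks like the sketch below. The table and file names are placeholders; the bulk copy portion mirrors the demo scripts above.

# Get the destination table's columns, build a typed DataTable from a CSV, then bulk load it.
$columns   = Get-SQLColumnsForDataTable -ServerInstance localhost -Database AdventureWorks `
                 -Credential $null -SchemaName Production -TableName Product
$dataTable = Import-Csv "C:\Demos\Product.csv" | Out-DataTableWithDataTypes -Columns $columns

$bulkCopy = new-object Microsoft.Data.SqlClient.SqlBulkCopy "server=localhost;database=AdventureWorks;Integrated Security=SSPI"
$bulkCopy.DestinationTableName = "Production.Product"
$bulkCopy.WriteToServer($dataTable)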

With these 2 functions, the basic functionality is there to load data into a SQL Server database table with the proper data types. As with any data loading process, if the data types do not match, Out-DataTableWithDataTypes will fail on the load.

As with most every blog on the Internet, this code has been working for me and the needs I have for it. I fully recognize it does not meet every scenario. You are welcome to use this code, but it comes with no guarantees.

Working with Data in .Net

As a PowerShell developer (yes, I consider PowerShell coding to be development, when it’s for pre-ETL processing), there are a handful of very useful namespaces in .Net.

System.Data – The “basic” data containers are classes in this namespace, including DataTable and DataSet. These are generic types, regardless of the source or destination database or data source.

System.XML – If you’re ingesting XML data, there are several classes here you’ll use. It’s likely you’ll use XmlReader to stream data or XmlDocument to load a document into memory to read or manipulate.

Microsoft.Data.SqlClient – The updated version of System.Data.SqlClient, used to access SQL Server, including Azure SQL Database and Azure SQL Managed Instance. SqlConnection, SqlCommand and SqlBulkCopy are a data developer’s best friends.
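As a quick illustration of those classes from PowerShell, here is a sketch only; the connection string and query are placeholders, and it assumes the Microsoft.Data.SqlClient assembly is already loaded (for example, by importing the SqlServer module).

# Open a connection, run a scalar query, and clean up. Names here are placeholders.
$connection = new-object Microsoft.Data.SqlClient.SqlConnection "server=localhost;database=PowerShellDemo;Integrated Security=SSPI"
$connection.Open()

$command = $connection.CreateCommand()
$command.CommandText = "select count(*) from Logging.LoadMetadata"
$rowCount = $command.ExecuteScalar()
write-output "Logging.LoadMetadata has $rowCount rows"

$connection.Close()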

OfficeOpenXML.Core.ExcelPackage – Working with XLSX files requires an assembly found in the PowerShell Gallery, among other places. Classes in this assembly allow developers to read, update and write Excel data.

This is a placeholder and this post will be updated at a later date…