Data Saturday Atlanta Wrap-Up

Data Saturday Atlanta was a major success, both for the event and for me personally. Rie Merritt (@IrishSQL) and the organizing committee did a great job bringing together a stellar group of speakers who covered many great topics. A HUGE thank you to the Atlanta Azure Data User Group for the opportunity to speak about PowerShell. Thank you to each and every attendee of my session, Reducing Data Warehouse Processing with PowerShell Pre-ETL Processing. It was great to have so many people show interest in PowerShell and how data developers, data analysts, QA analysts, and others can leverage its capabilities.

The PDF version of the slide deck is available for download.

In addition to the slides, below are the demo scripts from the presentation. The data files used in the presentation are being made available, but you really shouldn’t need them. The idea is to use the scripts to get inspiration for your own solutions. The examples make use of two functions I’ve created. They are documented in this post.
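For context on those two functions: Get-SQLColumnsForDataTable queries sys.columns and sys.types for a table's column names and data types, and Out-DataTableWithDBDataTypes builds a typed System.Data.DataTable from piped-in rows. The linked post has the real implementations; the query below is only my rough approximation of what the first function wraps.

```powershell
# Approximation only -- the documented Get-SQLColumnsForDataTable function
# wraps a sys.columns/sys.types query along these lines.
$columnQuery = @"
select c.name as ColumnName, t.name as DataType
from sys.columns c
join sys.types t on c.user_type_id = t.user_type_id
where c.object_id = object_id('Production.Product')
order by c.column_id
"@

# No -Credential supplied, so the current Windows user's context is used.
Invoke-Sqlcmd -ServerInstance localhost -Database AdventureWorks -Query $columnQuery
```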

CSV Validation – Loading data from a CSV into SQL Server is straightforward, as long as the columns in the CSV file match the database table's. This validation performs a cursory check of the columns being passed in a CSV file. Certainly, there are other checks that can be performed, but these are a couple to get a solution started.

# It is assumed the CSV being used contains a header record in the first row of the file. As a result,
# only the first record ( -TotalCount 1 ) is being read.

$header = Get-Content -Path $FileName -TotalCount 1
# Add .Replace("`"", "") to the end of the split operation to remove quotes from the values.
$headerColumns = ($header -split ',') 

# With a database table designed with the schema intended to be in the CSV 
#   file, query SQL Server to get the list of columns (and data types).
# NULL is being passed to the -Credential parameter, which informs the 
#   function to use Windows authentication to connect to the SQL instance.
$databaseTableColumns = Get-SQLColumnsForDataTable `
                -ServerInstance localhost `
                -Database AdventureWorks `
                -Credential $null `
                -SchemaName Production `
                -TableName Product

# Perform a quick column count to see if the number of columns match.
# Equal counts alone do not prove the columns are the same.
if ($headerColumns.Count -eq $databaseTableColumns.Count) {
    Write-Output "The number of columns in the file matches the table definition"
}
else {
    Write-Output "The number of columns in the source file does not match the table definition"
}

# Check the CSV's columns to see if any columns from the table are missing.
$missingColumn = $false
foreach ($column in $databaseTableColumns) {
    if ($headerColumns -notcontains $column.ColumnName) {
        $missingColumn = $true
        Write-Output "   Column, $($column.ColumnName), was not found"
    }
}
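One more check of the kind mentioned above is the reverse test: flag columns that appear in the CSV but not in the table. A minimal sketch, reusing the $headerColumns and $databaseTableColumns variables from the script above:

```powershell
# Flag columns present in the CSV header but absent from the table definition.
$extraColumn = $false
foreach ($headerColumn in $headerColumns) {
    if ($databaseTableColumns.ColumnName -notcontains $headerColumn) {
        $extraColumn = $true
        Write-Output "   Column, $headerColumn, is not in the table definition"
    }
}
```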

Excel Validation – Like the CSV validation, validating Excel files is possible with the PSExcel module (technically, only EPPlus.dll is being used, for the OfficeOpenXML namespace). As you compare the two scripts, there is more work in the Excel validation, due to needing EPPlus.dll to access the data, versus reading the first line of text from the CSV file with Get-Content.

# Install the PSExcel module from the PowerShell Gallery; Install-Module only needs to be run once on a computer.
# Run the PowerShell session in administrator mode to install the module for all users.

Install-Module PSExcel
Import-Module PSExcel

$excelFile = New-Object OfficeOpenXml.ExcelPackage $FileName
$workbook  = $excelFile.Workbook

# Every worksheet in Excel has a name on the tab.  That name goes here.
$worksheet = $workbook.Worksheets["Details"]
# $worksheet.Dimension.Rows returns the number of records in the Excel worksheet.  
# It may be useful to log the number of records received in a pre-processing/validation step.
Write-Output "#################"
Write-Output "   File contains $($worksheet.Dimension.Rows) rows"
Write-Output "#################"

# Like Dimension.Rows, Dimension.Columns returns the number of columns with data in the worksheet.
$worksheetColumns = $worksheet.Dimension.Columns

 # Get columns for Sales.SalesOrderDetails
 $dbColumns = Get-SQLColumnsForDataTable -ServerInstance localhost `
                                         -Database AdventureWorks `
                                         -Credential $null `
                                         -SchemaName "Sales" `
                                         -TableName "SalesOrderDetail"

if ($worksheetColumns -eq $dbColumns.Count) {
    Write-Output "The Excel file and the database table have the same number of columns"
}
else {
    Write-Output "The Excel file and the database table do not have the same number of columns"

    # Find the differing columns; EPPlus cell addresses are 1-based.
    $worksheetColumnNames = @()
    foreach ($columnNum in 1..$worksheetColumns) {
        $columnName = $worksheet.Cells.Item(1, $columnNum).Text
        if ($dbColumns.ColumnName -notcontains $columnName) {
            Write-Output "$columnName is not in the database table"
        }
        $worksheetColumnNames += $columnName
    }
}

JSON to CSV Transformation – This script is directly related to the screenshot of the JSON file from slide 21 in the slide deck, creating a CSV file from the highlighted elements in the JSON.

$activities = (Get-Content $FileName | ConvertFrom-Json).Value

# A calculated property pulls storedProcedureName out of the nested input element.
$activitiesTable = $activities |
    Select-Object activityName, activityRunStart, activityRunEnd, durationInMs,
        @{Name = 'storedProcedureName'; Expression = { $_.input.storedProcedureName }}

$activitiesTable | Format-Table

$activitiesTable | Export-Csv C:\Demos\activities.csv -NoTypeInformation

Using Metadata to Load Tables – When you need to load files into tables repeatedly over time, such as updated customer information on a monthly basis, it may be useful to create a table to define metadata about those files. This could include the naming pattern of the file and which database the data should reside in, along with the specific schema and table names. This example takes that metadata and instantiates a bulk copy object from Microsoft.Data.SqlClient. With the System.Data.DataTable object holding the contents of the CSV file, that data is loaded into the specified table.

$query = "select FileNamePattern, DestinationDatabaseName, DestinationSchemaName,
                 DestinationTableName, DestinationSchemaTableName
          from Logging.LoadMetadata"

# invoke-sqlcmd with no credential uses context of the current user
$files = Invoke-Sqlcmd -ServerInstance localhost -Database PowerShellDemo -Query $query

foreach ($file in $files) {
    Write-Output "...Loading $($file.DestinationSchemaTableName)"

    # Query sys.columns & sys.types to get the fields of the destination table
    $dbColumns = Get-SQLColumnsForDataTable -ServerInstance localhost `
                                            -Credential $null `
                                            -Database $file.DestinationDatabaseName `
                                            -SchemaName $file.DestinationSchemaName `
                                            -TableName $file.DestinationTableName

    # Load CSV data into memory and generate a DataTable with defined data types
    $dataTable = Import-Csv "C:\Demos\CSVs\$($file.FileNamePattern)" | Out-DataTableWithDBDataTypes -Columns $dbColumns

    # Bulk copy the DataTable into the specified table
    $bulkCopy = New-Object Microsoft.Data.SqlClient.SqlBulkCopy `
                    -ArgumentList "server=localhost;database=$($file.DestinationDatabaseName);Integrated Security=SSPI"
    $bulkCopy.DestinationTableName = $file.DestinationSchemaTableName
    $bulkCopy.WriteToServer($dataTable)
    $bulkCopy.Close()
}
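If the column order in a CSV does not match the table's ordinal positions, SqlBulkCopy can map columns by name instead. A sketch, assuming the same $bulkCopy and $dataTable objects and placed before the bulk copy writes to the server:

```powershell
# Map each DataTable column to the destination column with the same name,
# so the load does not depend on ordinal position.
foreach ($column in $dataTable.Columns) {
    $null = $bulkCopy.ColumnMappings.Add($column.ColumnName, $column.ColumnName)
}
```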

If you have any questions about these demo scripts, you can find me on Twitter, @em_dempster.

