Disclaimer: This blog was previously posted to Superior Consulting Services and is being posted again to my personal site.
Hello! How is your week going? Did you just get slammed with a large, high priority data project? Or, are you in the middle of a project but aren’t sure how to get from point E to point L? If you’ve come across this blog post via a search for SSIS or Integration Services resources, you have come to the right place.
First things first, please take a nice deep breath. Your organization and I need you to keep oxygen flowing to the brain, so you can tackle this project or, get it over the few last hurdles keeping it from going to beta or production. Over the next several posts I’ll walk through the design and implementation of solutions using SQL Server Integration Services (SSIS) and some areas of potential interest.
This is what we’ll cover in this series:
Part 1 – Technical Design (Data Requirements)
Part 2 – Build a Skeleton SSIS Staging Package
Part 3 – Add Functionality to Staging Package
Part 4 – Design an SSIS Package for File Extracts
Part 5 – Design an SSIS Package for Data Warehousing
Part 1 – Technical Design
“…but, Erin, we don’t have time to design. We have to start coding! NOW!!…”
Yes, you do have time, and doing it now will save time during development and testing. Now that this project has landed on your desk, (hopefully) there have been conversations in the organization and with any luck, those conversations have been documented and summarized resulting in some type of requirements document. Maybe the project is to generate a file extract to send to a vendor, such as retirement benefits being sent to a bank. In that case, the vendor (bank) should be providing requirements about what is needed in the flat file and how to format it. Perhaps there is a need to pull data from a couple of systems (or, maybe several systems) to support new reporting requirements. Did business users sketch what they need, and/or provide a list of fields they want to see in reports, along with a quick description of what any calculations mean? Or, maybe they took screenshots from an application and highlighted the fields they would like included?
Chances are you have something, which is better than nothing. And now it’s your turn to execute on this request. There are a few steps that really should be taken to help yourself (and your team). If you only have an email saying we need to do x, y and z, now would be the time to reply and request additional information before you proceed.
Data Discovery and Mapping
With the requirements you have been given, the first step is to understand if your organization has the data being requested, and where it’s located. If you are fortunate to have a business or data analyst involved in the project, this should already be done. If you are the analyst and developer, my apologies, as it means more work for you. If any application screenshots were provided, then you know the data exists. However, if you receive a sketch of a report, it might be possible the data simply doesn’t exist. Or, maybe the data does exist in a spreadsheet a user maintains on their computer.
An inventory of your data sources, with regards to this project, needs to be completed to understand where all of the data lives (or doesn’t). Open a spreadsheet, or a text editor, and make a list of all of the fields being sought. If the field is a calculation, break down the calculation into its components. Then, field by field, track down the location of the data and record it. If there is an existing query for some of the fields, terrific! Simply transfer those fields to this list, as you continue through the list. The mapping should provide enough information to be able to get back to the data during development. It likely needs to include the table and field in a specific database and system, the file path to a CSV file or an object name in a hosted system, like Salesforce.
Here are the easy wins gained by completing this process:
- If data doesn’t exist at all, letting the stakeholders know upfront is better than after you went hunting for it halfway through the project.
- If data does exist but it is stored remotely, your IT team may need to set up a connection to that system. Giving them a heads up before it’s needed will make for a happier IT team, and that’s always a good thing.
- Begin to talk with managers about the potential magnitude of the project (but stay away from timelines and estimates at this point).
- A critical piece of documentation has been created for use during the remaining design steps, development and especially for maintenance in the future. It’s one less document to write up at the end of the project, when it’s least useful and more difficult to do.
To Stage or Not to Stage…
With the data mapping complete, this is the time to decide whether or not data needs to be staged for the final solution. If all of the data lives in the same system and file extract(s) are being created, it is likely the data does not need to be staged. However, if you are designing a solution for a complex project or data lives in multiple locations and systems, it may be very beneficial or necessary to have an intermediate database, such as an operational data store. Even if all of the data is in a single system, if that system is a transactional database, a goal should be to minimize access to that data during business hours. If a process can scan the table once and save the data to another server to use multiple times, such as with reports or data mart processing, the application’s users will thank you for just reading from their system once.
Understand Logging Requirements
This isn’t meant to discourage you or your development team but to start the thought processes around what, if anything, needs to be logged for troubleshooting and auditing requirements. This will help you to better manage expectations with stakeholders of the project and your manager(s). In the event it was decided to create a staging database, having a table to record when packages ran, which staging tables were touched, and how many insertions, modifications or deletions were made is a good tool to have.
The logging will serve a couple of important purposes:
- For your DBA – If there are problems with a package, records may not be filled out completely, or the package didn’t work at all and a record doesn’t exist for a specific date. This is a good clue that there is a problem somewhere in the ETL process and it may be able to be narrowed down to a specific package.
- Auditing – If your organization is subject to compliance regulations surrounding business processes, like Sarbanes-Oxley, copying financial data from the source to almost anywhere else (flat file, staging database, data warehouse, etc.) will likely trigger a visit from the auditors. They will want to understand the ETL process itself, but they may also ask questions about what type of logging is implemented and how long those logs are retained. By implementing a custom logging solution used by your SSIS packages, what is logged can be customized to please the auditors.
Planning upfront and implementing a consistent strategy for your SSIS packages should make their requests a lot easier to manage than not having a strategy in place.
With these steps completed, the project should be on solid footing. In the coming posts, we’ll take these technical requirements and turn them into solutions with Integration Services packages.
Stay tuned. In our next post, we’ll set up a new SSIS project