Generating Test Data in Python – Part 1

As a data professional, regardless of discipline, you’re going to need to generate sample data to populate a table or series of tables. One could write a simple SQL statement to insert a bunch of names along the lines of Jane Doe1, Jane Doe2, Jane Doe3, etc., but the data becomes a bit monotonous. When it comes to addresses, it’s not quite so simple. To have test data on different streets and in different parts of town, a little more is needed beyond SQL.

As I’ve been learning Python through Data Camp, the wheels have started to turn. After learning about list comprehensions and iterators, generating “large” amounts of data seems much more realistic. Below is an address generator I wrote this morning to test my skills. In this blog, I’m not going to go into what the code is doing…yes, I do know what it’s doing. I will dive into the code in a future blog. What I will say now is that each expression inside square brackets [] is known as a list comprehension in Python, and each one below is doing quite a bit of work.
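For readers new to the syntax, here is a minimal sketch (separate from the generator itself) that puts a list comprehension next to the loop it replaces. The variable names are made up for illustration, and the data is the monotonous Jane Doe list from the introduction.

```python
# Loop version: build the names one at a time.
names_loop = []
for i in range(1, 4):
    names_loop.append("Jane Doe" + str(i))

# List comprehension version: the same result in a single expression.
names = ["Jane Doe" + str(i) for i in range(1, 4)]

print(names)  # ['Jane Doe1', 'Jane Doe2', 'Jane Doe3']
```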

import random

def generate_sample_addresses(maxHouseNumberPerBlock,
                              houseNumberGapSize,
                              listOfAvenues, 
                              numberOfStreets):
    blockHouses = [houseNumber 
                    for houseNumber in range(0, maxHouseNumberPerBlock) 
                       if houseNumber%houseNumberGapSize == 0]
    
    streetNames = [convert_number_to_ordinal(num) 
                       for num in range(1,numberOfStreets+1)]
    streetAddresses = [str(streetNum) + ("0" + str(houseNum))[-2:] + " " + convert_number_to_ordinal(streetNum) + " St" 
                           for houseNum in blockHouses 
                               # start at 1 so no "000 0th St" addresses are generated
                               for streetNum in range(1,len(listOfAvenues)+1)]
        
    avenueAddresses = [str(blockNum) + ("0" + str(houseNum))[-2:] + " " + avenue + " Ave" 
                           for houseNum in blockHouses for avenue in listOfAvenues
                               for blockNum in range(1,numberOfStreets+1)]

    allAddresses = streetAddresses + avenueAddresses

    for count in range(0, len(allAddresses)):
        randomNumber = int(random.uniform(0,1) * len(allAddresses))

        yield allAddresses[randomNumber]


def convert_number_to_ordinal(n):
    num = int(n)
    
    if 11 <= (num % 100) <= 13:
        ordinal_suffix = 'th'
    elif num%10 == 1 :
        ordinal_suffix = "st"
    elif num%10 == 2:
        ordinal_suffix = "nd"
    elif num%10 == 3:
        ordinal_suffix = "rd"
    else:
        ordinal_suffix = "th"
    return str(num) + ordinal_suffix

This function, generate_sample_addresses, requires four parameters to generate street addresses.

  1. maxHouseNumberPerBlock – an integer upper bound on the two-digit house number within a block. Because the value is passed to range(), it is exclusive: passing 45 with a gap size of 3 produces house numbers 00 through 42, so the last house on the 700 block of Main St would be 742 Main St. To include 745 Main St, pass 46.
  2. houseNumberGapSize – helps define the density of homes on the block. From 0 up to maxHouseNumberPerBlock, an address is generated for every multiple of this value. Below, you’ll see I used a value of 3. This is handy for me, as addresses in most places in the US have odd numbers on one side of the street and even numbers on the other. If I wanted to map the data and someone zoomed all the way in, it would be nice if they saw houses on both sides of the street. An odd gap size gives you that, because its multiples alternate between even and odd.
  3. listOfAvenues – provide a list of avenue names to be used in the addresses.
  4. numberOfStreets – assuming streets are ordinal and not named, the generator will create street names 1st – nth St.
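
To make the first two parameters concrete, here is a small sketch of the house numbers they produce, using the values 30 and 3 from the example call in this post. The snake_case variable names are my own; the filtering and zero-padding expressions mirror the ones in the generator.

```python
# Values from the example call in this post.
max_house_number_per_block = 30
house_number_gap_size = 3

# Same filter the generator uses: multiples of the gap size below the maximum.
block_houses = [n for n in range(0, max_house_number_per_block)
                if n % house_number_gap_size == 0]

# Zero-pad each house number to two digits, as the generator does.
house_suffixes = [("0" + str(n))[-2:] for n in block_houses]
print(house_suffixes)
# ['00', '03', '06', '09', '12', '15', '18', '21', '24', '27']

# Prefix a block number (7 here) to get full house numbers on the 700 block.
print(["7" + suffix + " Main St" for suffix in house_suffixes[:3]])
# ['700 Main St', '703 Main St', '706 Main St']
```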

What blows me away about Python is that this generator has eight statements. EIGHT!!! Okay, okay, there is a 2nd function used to create the street ordinals. (By the way, the ordinal function was inspired by StackOverflow.) But still, check out this list of addresses. Here are 20 pseudo-random addresses from the generator, based on the inputs below. With 4 avenues and 10 streets (in a grid formation) and 10 houses per block (the 1st and 2nd parameters determine this), we can get 440 unique addresses. Increasing the houses per block to 15 (a value of 45 for the 1st parameter) allows 660 addresses to be generated. Adding avenues or increasing the number of streets also results in more addresses being generated.
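
The 440 and 660 figures can be reproduced with a little arithmetic. The helper below is a hypothetical sketch, not part of the generator; it assumes one street address per house per avenue, with street blocks numbered from 1, which is the assumption under which the totals quoted above work out.

```python
def count_addresses(max_house_number_per_block, house_number_gap_size,
                    number_of_avenues, number_of_streets):
    """Count the distinct addresses the grid can hold (hypothetical helper)."""
    # Houses per block: multiples of the gap size below the maximum.
    houses_per_block = len(range(0, max_house_number_per_block,
                                 house_number_gap_size))
    # One avenue address per house, per avenue, per block (one block per street).
    avenue_addresses = houses_per_block * number_of_avenues * number_of_streets
    # Assumption: one street address per house for blocks 1..number_of_avenues.
    street_addresses = houses_per_block * number_of_avenues
    return avenue_addresses + street_addresses

print(count_addresses(30, 3, 4, 10))  # 440
print(count_addresses(45, 3, 4, 10))  # 660
```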

avenueNames = ['Ash', 'Birch', 'Aldrich', 'Colfax']

address = generate_sample_addresses(30, 3, avenueNames, 10)
for num in range(1, 21):
    print("#" + str(num) + " - " + next(address))

#1 - 221 Ash Ave
#2 - 206 Birch Ave
#3 - 515 Ash Ave
#4 - 409 Birch Ave
#5 - 103 Birch Ave
#6 - 412 4th St
#7 - 724 Colfax Ave
#8 - 321 Birch Ave
#9 - 309 Birch Ave
#10 - 312 Aldrich Ave
#11 - 709 Ash Ave
#12 - 724 Colfax Ave
#13 - 115 Birch Ave
#14 - 503 Aldrich Ave
#15 - 806 Ash Ave
#16 - 827 Aldrich Ave
#17 - 121 1st St
#18 - 209 Ash Ave
#19 - 415 Aldrich Ave
#20 - 503 Birch Ave
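
Notice that #7 and #12 are the same address. The generator picks a random index on every pass, so it samples with replacement and repeats are possible. If each address should appear at most once, one option is to shuffle the list and yield from the shuffled copy. This is a minimal sketch with a hypothetical helper and a short stand-in list, not a change to the generator above.

```python
import random

def yield_in_random_order(addresses):
    """Yield every address exactly once, in random order (hypothetical helper)."""
    shuffled = list(addresses)   # copy so the caller's list is left untouched
    random.shuffle(shuffled)     # in-place shuffle of the copy
    for address in shuffled:
        yield address

# Stand-in data taken from the sample output above.
sample = ["221 Ash Ave", "206 Birch Ave", "412 4th St", "121 1st St"]
for address in yield_in_random_order(sample):
    print(address)
```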

An important feature of the generator is the use of the yield statement instead of return. Because the function yields each address and then pauses until the next one is requested, calling code can iterate through the records one by one. This is helpful if there is also code to generate names of people, giving each person a place to live. (Code for generating names, based on files of popular names, will be in a future blog.)
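
As a rough illustration of that idea, the sketch below pairs a list of people with addresses pulled one at a time from a generator. Both the stand-in generator and the names are hypothetical; the real name-generation code is left for the future blog.

```python
def sample_addresses():
    """Stand-in generator (hypothetical) yielding a few fixed addresses."""
    yield "221 Ash Ave"
    yield "206 Birch Ave"
    yield "412 4th St"

people = ["Jane Doe", "John Doe"]  # hypothetical names

addresses = sample_addresses()
# Each call to next() resumes the generator just long enough to get one address.
residents = [(person, next(addresses)) for person in people]

print(residents)
# [('Jane Doe', '221 Ash Ave'), ('John Doe', '206 Birch Ave')]

# The generator remembers where it stopped, so the third address is still waiting.
print(next(addresses))  # 412 4th St
```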

Stay tuned for more Python code and how it can be used to benefit data professionals…

Follow me on Twitter – @em_dempster