Fun with XML, part 2

Today's post is a continuation of the mini-series I started on Feb 28th. Read the first part here which contains the story behind this. In this post, I'll be getting into the technical details of what I learned.

This is about the guts of the XML itself, in SQL, where I ended up writing this procedure. First, before I get into the details, I wanted to reference this Stack Overflow post which was the single most useful post and an example of the work I did. The person who posted the example, Ryan, also laid out a very simple temp table structure so I copied that code and pasted it into my test SQL server and played with it to understand it.

What I needed to create

The file format I needed to create is for the Canada Revenue Agency and specifically for T5 submissions. The specs are relatively well laid out here, although the formatting of the page makes it hard to read the details and find sections easily (IMHO).

The basic format of the file is this:

<Submission>
   <T619>
      Organizational info (the filer)
   </T619> -- referred to as section A in my code at the end
   <Return>
      <T5>
         <T5Slip>
            Data for a single T5 slip goes here
            (for my file, I needed over 100 T5Slip sections)
         </T5Slip> -- referred to as section B
         <T5Summary>
            Summary of all T5 slips
         </T5Summary> -- referred to as section C
      </T5> -- referred to as section D
   </Return> -- referred to as section E
</Submission> -- referred to as section Z

Here's a visual example of the simple file I've mocked up to write this post.

A mocked-up file with one T619 section, 1 Return, 4 T5Slips + 1 T5 Summary in that Return.

What I found challenging

There were a few challenging things along the way. The biggest was there are multiple levels of information required in the file and based on the Stack Overflow post and answer, that would require multiple levels of nested select statements to accommodate. At times it was getting hard to identify which part was related to what parent element. In each section, there are further nested elements required which made the file contain a lot more levels than any example I could find online to follow along with. For example: any fields with names or addresses are encapsulated in a parent node with child elements underneath.

How I approached it

First, I used the example from that Stack Overflow post referenced at the top of this article. I used the sample temp table scripts and dummy data and then played with it, trying to add more complexity to it until I understood the code. I'm not a developer, so this was all new to me. I've written tons of SQL queries in my life but none using FOR XML.

Second, I mapped his example vs. my file format to see what sections aligned with his, now that I understood the working parts a little better. In my case, I had to nest a lot deeper than the example so I started from the inside out, at the lowest level of the child elements in the file. Here's an example. The original example lined up nicely with the T5 section inwards. This helped me get my query right, from the simplest innermost detail and working outwards back to the Submission level overall. The biggest diff between his example and mine: this data was all coming from one source table, not 3 tables. His example had a bunch of joins to get it working together which I didn't need, yet.

Stack Overflow example & my comparative section in brackets

<Elements> (T5)
   <Element> (T5Slip)
      <Field1></Field1> (fields I need that were not in the S/O example)
      <Field2></Field2>
      <Field3></Field3>
      <Customer> (Recipient Name section)
         some data
      </Customer>
      <RepairFacility> (Recipient Address section)
         some data
      </RepairFacility>
      <Action> (T5 Amount section)
         some data
      </Action>
   </Element>
   <Element> (Next T5 Slip)
      repeat as needed for the next customer section
   </Element>
<Elements>

I started with the T619 section since that was the simplest to create: it would contain 1 record of data, for our organization, plus, it has fewer nested levels to figure out.

Example using FOR XML PATH

Using a generic example that resembles the T619 section of the CRA XML, my sample below is based on my data fields in SQL not being the same as the element names I need in the file so I am aliasing everything. I show some constant values as well since some of my XML files were constant values hard-coded in my query, whereas most of the rest came from SQL tables. I used PATH instead of AUTO because I found it easier to work with, by explicitly telling SQL what my XML element name was for a given section. Each subquery is for an area where there is a parent-child relationship to lower-level data in the file.

SELECT 'TEST01' as element1,
       A.FilerBN as element2,
       (SELECT OrgName as Name1
           FROM MySQLTable1
           FOR XML PATH('FilerName'),TYPE),
       (SELECT OrgAddr1 as Addr1, 
               OrgCity as City, 
               OrgProv as Prov, 
               OrgCountry as Country, 
               OrgPC as PC
           FROM MySQLTable1 
           FOR XML PATH('FilerAddress'),TYPE),
       (SELECT OrgContact as ContactName, 
               OrgPhone as ContactPhone,
               OrgEmail as ContactEmail
           FROM MySQLTable1 
           FOR XML PATH('FilerContact'),TYPE)
   FROM MySQLTable1 A
   FOR XML PATH('T619'), TYPE

The results of that code generated this in SQL. This resembles the T619, with fewer fields just to demonstrate the concept.

Example of the T619 section of the T5 file

Once that section was working, I then moved on to what would make that fit in the larger file, i.e. the T619 section fits inside a Submission section. So, I took that script above and placed it inside another SELECT statement, within brackets. Line 1, and the last 2 lines are new code, the rest is the same code as before placed in the new code.

SELECT 
   (SELECT 'TEST01' as element1,
           A.FilerBN as element2,
           (SELECT OrgName as Name1
              FROM MySQLTable1
              FOR XML PATH('FilerName'),TYPE),
           (SELECT OrgAddr1 as Addr1, 
                  OrgCity as City, 
                  OrgProv as Prov, 
                  OrgCountry as Country, 
                  OrgPC as PC
              FROM MySQLTable1 
              FOR XML PATH('FilerAddress'),TYPE),
           (SELECT OrgContact as ContactName, 
                  OrgPhone as ContactPhone,
                  OrgEmail as ContactEmail
              FROM MySQLTable1 
              FOR XML PATH('FilerContact'),TYPE)
      FROM MySQLTable1 A
      FOR XML PATH('T619'), TYPE) 
   FROM MySQLTable1 Z
   FOR XML PATH('Submission')

That worked exactly as expected and the beginning of the final file format began to take shape. I was missing namespace info on the top line of the file and of course, I need the Return section in as well.

The next iteration of the results started to resemble the final file format.

Next example: the big T5Slip section

Here is where things got interesting. I got a certain point very quickly because the core file format was similar to the above code. Again, here's a simplified example with some additional nuances thrown in to handle null values. The table this data is from is a different table than the organizational info but what I did include was one field in common to link the two together later. I chose the FilerBN (business number) field but could have easily made up a field instead. Like the CRA file, the element names are the same for some data. I did that intentionally to mimic the CRA format (and to avoid having to make up new examples!).

SELECT  (SELECT RcpName as BusName1
                NULLIF(RcpName2,'') as BusName2 -- ** see note 1 **
            FROM MySQLTable2
            WHERE RecipientType > 2 -- ** see note 2 **
            FOR XML PATH('BusinessName'),TYPE),
        (SELECT RcpSurname as Surname,
                RcpFirstName as GivenName,
                NULLIF(RcpInitial,'') as Initial -- ** see note 1 **
            FROM MySQLTable2
            WHERE RecipientType < 3 -- ** see note 2 **
            FOR XML PATH('IndividualName'),TYPE), 
        (SELECT RcpAddr1 as Addr1, 
                RcpCity as City, 
                RcpProv as Prov, 
                RcpCountry as Country, 
                RcpPC as PC
            FROM MySQLTable2 
            FOR XML PATH('RecipientAddress'),TYPE),
        'CAD' as CurrencyID,
        B.RecipientType as RcpType,
        (SELECT InterestAmount as IntAmt
            FROM MySQLTable2 
            FOR XML PATH('T5_Amounts'),TYPE)
    FROM MySQLTable2 B
    FOR XML PATH('T5Slip'), TYPE

Here's how that code turned out. I had 4 "records" (slips), one per person or business that I was expecting to see. I've expanded 2 of the 4 to highlight specific things.

Notes from above:

NULLIF was a simple way to correct what for me is a data issue. My empty records are empty strings, not NULLs, so if I don't do something to hide them, I will get an empty tag, which the CRA does not like. So, on fields like a 2nd business name (where the name is too long for one field), or a person's initial which I may not have, I used NULLIF(fieldname,'')empty set to indicate if the value is an empty string, treat it as null. The result, for the first T5Slip shows the Initial element is not showing on Jane Smith because she didn't give me her middle initial.
In order to hide an entire Parent-Child section if null, I had to use a where clause on the subquery. Without this, I would have had below an empty set of tags. The CRA schema does not allow this. If a tag is present, the minimum length check should be 1 character or more (for some fields at least). The first slip contains no BusinessName element and the second slip contains no IndividualName element.
One other observation: the CRA spec indicated where an & was used, I needed to put 'amp;' in the file after it. Initially, I was going to code a REPLACE to do that, until I realized quite by accident that SQL generates it that way auto-magically. Cool. One less thing to worry about! (J&S Plumbing in the 2nd slip as the example).

Example of 2 T5Slip sections, one of which had an & (ampersand) in the name.

Another round of nested selects

Once again, I wanted to visualize this nested inside an <T5> element instead of an <Return> element inside a <Submission> element. That meant nesting that code above inside 3 more select statements. The one difference here compared to my first example is I was selecting from my first table that only has one record. When I selected from the same "table2" detail table I would get many-to-many thing going on and far too many records. Here is the example, slimmed down to ignore the details in the actual working sections within.

SELECT -- Submission level
   (SELECT -- Return level
      (SELECT -- T5 level
           (** T5Slip code **) 
        FROM MySQLTable1 D
        WHERE D.FilerBN = Z.FilerBN
        FOR XML PATH('T5'), TYPE) -- D - T5level
     FROM MySQLTable1 E
     WHERE E.FilerBN = Z.FilerBN
     FOR XML PATH('Return'), TYPE) -- E - Return level
  FROM MySQLTable1 Z
  FOR XML PATH('Submission') -- Z - Submission level

Result:

Here's what it would look like if I had joined section D to the details (2nd) table not the summary (1st) table: a duplication of the data 4 times in the T5 level!

What the code would have looked like with an incorrect join.

Putting it altogether

This was where I thought I was stuck originally. I got to this point pretty quickly, detours aside (from my previous post) and there was a point at which I couldn't get it all in one Submission element. As soon as I tried to combine the T619 with the Return, I got some weird format back, not XML.

It turns out I was missing something SUPER basic and that was in my T619 section in my original code (not in this post), I used the tag ELEMENTS not TYPE. Well, that was one annoying little detail I missed that caused all hell to break loose when I thought I was close to getting a file ready.

Weird results when attempting to put the code all together.

Adding the T619 and T5Summary in

Here's the above example, now with the other 2 working sections pieced in. Each of the other pieces becomes a "select" in brackets with a comma at the end to separate it from the next "field" selected in the query.

SELECT -- Submission level
   (** T619 code **), -- A - T619
   (SELECT -- Return level
      (SELECT -- T5 level
         (** T5Slip code **), -- B - T5Slip
         (** T5Summary code **) -- C - T5Summary
        FROM MySQLTable1 D
        WHERE D.FilerBN = Z.FilerBN
        FOR XML PATH('T5'), TYPE) -- D - T5level
     FROM MySQLTable1 E
     WHERE E.FilerBN = Z.FilerBN
     FOR XML PATH('Return'), TYPE) -- E - Return level
  FROM MySQLTable1 Z
  FOR XML PATH('Submission') -- Z - Submission level

Result

Recap

In the end, a lot is going on in this post but I know it might have helped me if I had found something like this when I was working through the issues. Using XML PATH and "TYPE" throughout got me the results I needed.

Next, I’ll cover how I got namespaces to work and another around using a parser in Notepad++ to validate the file against the CRA xsd schema files.