I have to deal with XML data in PowerShell. That’s fine, but I always find myself sighing when I need to read an XML object. I have a copy of ‘XML in a Nutshell’. It is over six hundred pages of essential information on XML. Some nutshell. XML is, I believe, a way of representing data that should be kept at arm’s length, especially if the arm is reaching out of the window.
I want to convert reasonably small XML files to hash tables and PowerShell objects. PowerShell has never had a ConvertFrom-XML cmdlet, because gulping a large XML file into a PowerShell data object is expensive in resources: it is the sheer time it takes to consume a large XML file. Instead, you have to use the XMLDocument object to navigate to the data you want, or use an XPath query. It is all well and good to handle XML in this way, but it is inconsistent to have no ConvertFrom-XML cmdlet. After all, there are ConvertFrom cmdlets for CSV, JSON and a variety of other text-based data, so it would be good to have one for XML as well. Usually, I just want to consume relatively small XML files and pick out the data I want. I hoped that one that worked would turn up, but somehow it never did. So I wrote my own.
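To show what that navigation looks like in practice, here is a minimal sketch of both approaches; the file name and element names are invented for illustration:

# Pick data out of an XML document directly, without any conversion.
[xml]$doc = Get-Content -Raw 'employees.xml'
# Either use an XPath query ...
$doc.SelectNodes('//employee/firstname') | ForEach-Object { $_.InnerText }
# ... or navigate the XmlDocument with dot notation:
$doc.employees.employee.firstname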
There are certain problems with tackling a routine that has to successfully convert all the permutations of XML into arrays and hashtables. XML doesn’t handle arrays natively but implies them by repeating elements with the same name (see the sketch below). It allows empty elements, and elements that contain only other elements. There is no built-in concept of NULLs. It can have elements that contain only text, or that mix text and elements. Additionally, attributes don’t have any intrinsic order, whereas elements do. It is interesting to see how the online conversion utilities fare: there is little consensus about how to handle any of this.
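As a quick illustration of such a ‘pseudo-array’ (the element names here are invented), PowerShell’s [xml] accessor shows how repetition is the only hint that an array is intended:

# Repeated child elements with the same name imply an array.
[xml]$x = '<list><item>one</item><item>two</item></list>'
$x.list.item        # returns both values: one, two
$x.list.item.Count  # 2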
In addition, the requirements of users vary. How do you distinguish attributes from elements? Do you prefix their names with a character such as ‘@’ or ‘-’? Do you show the document element? You soon come to understand and appreciate how difficult it is to interpret XML consistently.
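The function below leaves the attribute question to the user via a -Prefix parameter. As a sketch of the difference it makes, using one of the test elements from later in this article (output shown condensed in the comments):

[xml]'<food type="dessert">Ice cream</food>' | ConvertFrom-XML | ConvertTo-Json
# { "type": "dessert", "#text": "Ice cream" }
[xml]'<food type="dessert">Ice cream</food>' | ConvertFrom-XML -Prefix '@' | ConvertTo-Json
# { "@type": "dessert", "#text": "Ice cream" }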
XML is better understood as a document language that can also serve as a data description language, but it is too open-ended to be optimal for data interchange. Because it is so open-ended, there are fewer certainties about how it will be used for storing data, and that makes it more difficult to produce a function that renders any XML file as PowerShell. Hopefully, this is one of those routines that can be improved by experience.
function ConvertFrom-XML
{
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline)]
        [System.Xml.XmlNode]$node, # we are working through the nodes
        [string]$Prefix = '', # do we indicate an attribute with a prefix?
        $ShowDocElement = $false # do we show the document element?
    )
    process
    {
        # unless the option is set, we skip the document element
        if ($node.DocumentElement -and !($ShowDocElement))
        { $node = $node.DocumentElement }
        $oHash = [ordered]@{ } # start with an ordered hashtable.
        # The order of elements is always significant, regardless of what they are
        Write-Verbose "calling with $($node.LocalName)"
        if ($node.Attributes -ne $null) # if there are attributes,
        # record them all first in the ordered hash
        {
            $node.Attributes | ForEach-Object {
                $oHash.$($Prefix + $_.FirstChild.ParentNode.LocalName) = $_.FirstChild.Value
            }
        }
        # check to see if there is a pseudo-array (more than one
        # child node with the same name, which must be handled as an array)
        $node.ChildNodes | # we just group the names and create an empty array for each
            Group-Object -Property LocalName |
            Where-Object { $_.Count -gt 1 } |
            Select-Object Name |
            ForEach-Object {
                Write-Verbose "pseudo-array $($_.Name)"
                $oHash.($_.Name) = @() # create an empty array for each one
            }
        foreach ($child in $node.ChildNodes)
        { # now we look at each node in turn.
            Write-Verbose "processing the '$($child.LocalName)'"
            $childName = $child.LocalName
            if ($child -is [System.Xml.XmlText]) # if it is simple XML text
            {
                Write-Verbose "simple xml $childName"
                $oHash.$childName += $child.InnerText
            }
            # if it has a #text child, we may need to cope with attributes
            elseif ($child.FirstChild.Name -eq '#text' -and $child.ChildNodes.Count -eq 1)
            {
                Write-Verbose "text"
                # note: filtering an empty attribute collection against $null
                # yields an empty (falsy) array, so plain text takes the else branch
                if ($child.Attributes -ne $null) # hah, an attribute
                { # we need to record the text with the #text label
                  # and preserve all the attributes
                    $aHash = [ordered]@{ }
                    $child.Attributes | ForEach-Object {
                        $aHash.$($_.FirstChild.ParentNode.LocalName) = $_.FirstChild.Value
                    }
                    # now we add the text with an explicit name
                    $aHash.'#text' += $child.'#text'
                    $oHash.$childName += $aHash
                }
                else
                { # phew, just simple text content.
                    $oHash.$childName += $child.FirstChild.InnerText
                }
            }
            elseif ($child.'#cdata-section' -ne $null)
            # a CDATA section: a block of text that isn't parsed by the parser,
            # but would otherwise be interpreted as markup
            {
                Write-Verbose "cdata section"
                $oHash.$childName = $child.'#cdata-section'
            }
            elseif ($child.ChildNodes.Count -gt 1 -and
                ($child | Get-Member -MemberType Property).Count -eq 1)
            {
                $oHash.$childName = @()
                foreach ($grandchild in $child.ChildNodes)
                { $oHash.$childName += (ConvertFrom-XML $grandchild) }
            }
            else
            { # add the recursively-converted child as a value in the hashtable
                $oHash.$childName += (ConvertFrom-XML $child)
            }
        }
        $oHash
    }
}
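As a quick check that the function behaves, here is a small worked example using one of the test cases from later in the article, assuming the function has been loaded into the session (the exact JSON formatting varies between PowerShell versions):

[xml]$sample = '<employee>
    <firstname>Jane</firstname>
    <lastname>Smith</lastname>
</employee>'
$sample | ConvertFrom-XML | ConvertTo-Json
# {
#     "firstname": "Jane",
#     "lastname": "Smith"
# }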
Testing this routine has been an interesting experience. The method I’ve used is to take a range of XML files and pass them through some online XML-to-JSON translation systems, picking the output that seems the best fit. Then I take the output of this routine, converted with the ConvertTo-Json cmdlet, and check that it produces the same JSON.
Here is a sample of the tests, which are placed in an array and executed in turn. Any problems, and a warning appears.
@(
    # A complex XML element which contains both elements and text:
    @{
        'Test'    = 'A complex XML element which contains both elements and text'
        'TheXML'  = '<description>
It happened on <date lang="norwegian">03.03.99</date> ....
</description>';
        'TheJSON' = '{
    "#text": [
        "\nIt happened on ",
        " ....\n"
    ],
    "date": {
        "lang": "norwegian",
        "#text": "03.03.99"
    }
}'
    },
    # A complex XML element, "food", which contains only text:
    @{
        'Test'    = 'A complex XML element, which contains only text'
        'TheXML'  = '<food type="dessert">Ice cream</food>';
        'TheJSON' = '{
    "type": "dessert",
    "#text": "Ice cream"
}'
    },
    # An empty element:
    @{
        'Test'    = 'empty element'
        'TheXML'  = '<product pid="1345"/>';
        'TheJSON' = '{
    "pid": "1345"
}'
    },
    # A complex XML element, "employee", which contains only other elements:
    @{
        'Test'    = 'complex element, which contains only other elements'
        'TheXML'  = '<employee>
    <firstname>Jane</firstname>
    <lastname>Smith</lastname>
</employee>';
        'TheJSON' = '{
    "firstname": "Jane",
    "lastname": "Smith"
}'
    }
) | ForEach-Object {
    $Reference = $_.TheJSON
    $Difference = [xml]$_.TheXML | ConvertFrom-XML | ConvertTo-Json
    if ($Reference -ine $Difference)
    { Write-Warning "An anomaly testing $($_.Test). The $Reference was different to $Difference" }
    else
    { "passed test $($_.Test)" }
}
I’ve added the source to a collection of PowerShell utilities in my GitHub repositories.