{"id":91185,"date":"2021-06-08T16:05:14","date_gmt":"2021-06-08T16:05:14","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=91185"},"modified":"2021-07-07T10:43:46","modified_gmt":"2021-07-07T10:43:46","slug":"convert-from-xml","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/convert-from-xml\/","title":{"rendered":"ConvertFrom-XML: a PowerShell utility Cmdlet"},"content":{"rendered":"<p>I have to deal with XML data with PowerShell. That\u2019s fine, but I always find myself sighing when I need to read an XML object. I have a copy of \u2018XML in a Nutshell\u2019. It is over six hundred pages of essential information on XML. Some nutshell. XML is, I believe, a way of representing data that should be kept at arms-length, especially if the arm is reaching out of the window.<\/p>\n<p>I want to convert reasonably small XML files to hash tables and PowerShell objects. PowerShell never had a <strong>ConvertFrom-XML<\/strong> Cmdlet because gulping a large XML file into a PowerShell data object is expensive in resources. It is the sheer time it takes to consume a large XML file. Instead, you have to use the <strong>XMLDocument<\/strong> object to navigate to the data you want or use an Xpath query. It is all well and good to handle XML in this way, but it is inconsistent to have no<strong> ConvertFrom-XML<\/strong> cmdlet. After all, there is a <strong>ConvertFrom<\/strong> cmdlet for CSV, JSON, and a variety of text-based data. It would be good to have one for XML as well. Usually, I just want to consume relatively small XML files and just pick out the data I want. I hoped that one that worked would turn up but somehow it never did. So I wrote my own.<\/p>\n<p>There are certain problems with tackling a routine that has to successfully convert all the permutations of XML into arrays and hashtables. XML doesn\u2019t handle arrays natively but implies them by assigning them the same keys, it allows empty elements, or elements that contain only other elements. There is no built-in concept of NULLs. It can have elements that contain only text, or that mix text and elements. Additionally, attributes don\u2019t have any intrinsic order whereas elements do. It is interesting to see how the online conversion utilities fare. There is little consensus about this.<\/p>\n<p>In addition, the requirements of users vary. How do you distinguish attributes? Do you prefix them with a character such as \u2018@\u2019 or \u2018-\u2018. Do you show the document element? You will soon understand and appreciate how difficult it is to consistently interpret XML.<\/p>\n<p>XML is better understood as a document language that can be used as a data description language but it is too open-ended to be optimal for data-interchange. Because it is so open-ended, there are fewer certainties as to how it is used for storing data. This makes it more difficult to produce a function that renders any XML file as PowerShell. Hopefully, this is one of those routines that can be improved by experience.<\/p>\n<pre class=\"theme:powershell-ise font:consolas font-size:13 line-height:10 marking:false ranges:false nums-toggle:false wrap-toggle:false lang:ps decode:true\">function ConvertFrom-XML\r\n{\r\n\t[CmdletBinding()]\r\n\tparam\r\n\t(\r\n\t\t[Parameter(Mandatory = $true, ValueFromPipeline)]\r\n\t\t[System.Xml.XmlNode]$node, #we are working through the nodes\r\n\t\t[string]$Prefix='',#do we indicate an attribute with a prefix?\r\n\t\t$ShowDocElement=$false #Do we show the document element? \r\n\t)\r\n\tprocess\r\n\t{   #if option set, we skip the Document element\r\n\t\tif ($node.DocumentElement -and !($ShowDocElement)) \r\n            { $node = $node.DocumentElement }\r\n\t\t$oHash = [ordered] @{ } # start with an ordered hashtable.\r\n        #The order of elements is always significant regardless of what they are\r\n\t\twrite-verbose \"calling with $($node.LocalName)\"\r\n\t\tif ($node.Attributes -ne $null) #if there are elements\r\n\t\t# record all the attributes first in the ordered hash\r\n\t\t{\r\n\t\t\t$node.Attributes | foreach {\r\n\t\t\t\t$oHash.$($Prefix+$_.FirstChild.parentNode.LocalName) = $_.FirstChild.value\r\n\t\t\t}\r\n\t\t}\r\n\t\t# check to see if there is a pseudo-array. (more than one\r\n\t\t# child-node with the same name that must be handled as an array)\r\n\t\t$node.ChildNodes | #we just group the names and create an empty\r\n        #array for each\r\n\t\tGroup-Object -Property LocalName | where { $_.count -gt 1 } | select Name |\r\n\t\tforeach{\r\n\t\t\twrite-verbose \"pseudo-Array $($_.Name)\"\r\n\t\t\t$oHash.($_.Name) = @() &lt;# create an empty array for each one#&gt;\r\n\t\t};\r\n\t\tforeach ($child in $node.ChildNodes)\r\n\t\t{#now we look at each node in turn.\r\n\t\t\twrite-verbose \"processing the '$($child.LocalName)'\"\r\n\t\t\t$childName = $child.LocalName\r\n\t\t\tif ($child -is [system.xml.xmltext])\r\n\t\t\t# if it is simple XML text \r\n\t\t\t{\r\n\t\t\t\twrite-verbose \"simple xml $childname\";\r\n\t\t\t\t$oHash.$childname += $child.InnerText\r\n\t\t\t}\r\n\t\t\t# if it has a #text child we may need to cope with attributes\r\n\t\t\telseif ($child.FirstChild.Name -eq '#text' -and $child.ChildNodes.Count -eq 1)\r\n\t\t\t{\r\n\t\t\t\twrite-verbose \"text\";\r\n\t\t\t\tif ($child.Attributes -ne $null) #hah, an attribute\r\n\t\t\t\t{\r\n\t\t\t\t\t&lt;#we need to record the text with the #text label and preserve all\r\n\t\t\t\t\tthe attributes #&gt;\r\n\t\t\t\t\t$aHash = [ordered]@{ };\r\n\t\t\t\t\t$child.Attributes | foreach {\r\n\t\t\t\t\t\t$aHash.$($_.FirstChild.parentNode.LocalName) = $_.FirstChild.value\r\n\t\t\t\t\t}\r\n                    #now we add the text with an explicit name\r\n\t\t\t\t\t$aHash.'#text' += $child.'#text'\r\n\t\t\t\t\t$oHash.$childname += $aHash\r\n\t\t\t\t}\r\n\t\t\t\telse\r\n\t\t\t\t{ #phew, just a simple text attribute. \r\n\t\t\t\t\t$oHash.$childname += $child.FirstChild.InnerText\r\n\t\t\t\t}\r\n\t\t\t}\r\n\t\t\telseif ($child.'#cdata-section' -ne $null)\r\n\t\t\t# if it is a data section, a block of text that isnt parsed by the parser,\r\n\t\t\t# but is otherwise recognized as markup\r\n\t\t\t{\r\n\t\t\t\twrite-verbose \"cdata section\";\r\n\t\t\t\t$oHash.$childname = $child.'#cdata-section'\r\n\t\t\t}\r\n\t\t\telseif ($child.ChildNodes.Count -gt 1 -and \r\n                        ($child | gm -MemberType Property).Count -eq 1)\r\n\t\t\t{\r\n\t\t\t\t$oHash.$childname = @()\r\n\t\t\t\tforeach ($grandchild in $child.ChildNodes)\r\n\t\t\t\t{\r\n\t\t\t\t\t$oHash.$childname += (ConvertFrom-XML $grandchild)\r\n\t\t\t\t}\r\n\t\t\t}\r\n\t\t\telse\r\n\t\t\t{\r\n\t\t\t\t# create an array as a value  to the hashtable element\r\n\t\t\t\t$oHash.$childname += (ConvertFrom-XML $child)\r\n\t\t\t}\r\n\t\t}\r\n\t\t$oHash\r\n\t}\r\n} <\/pre>\n<p>Testing this routine has been an interesting experience. The method I\u2019ve used is to take a range of XML files, an pass them through some online XML to JSON translation systems. I pick the one that seems the best fit. Then I take the output of this routine and check that it produces the same JSON, using the <strong>ConvertTo-JSON<\/strong> cmdlet.<\/p>\n<p>Here is a sample of the tests, which are placed in an array and executed in turn. Any problems, and a warning appears.<\/p>\n<pre class=\"theme:powershell-ise font:consolas font-size:13 line-height:10 marking:false ranges:false nums-toggle:false wrap-toggle:false lang:ps decode:true\">@(\r\n\t#A complex XML element which contains both elements and text:\r\n\t@{\r\n\t\t'Test' = ' #A complex XML element which contains both elements and text:'\r\n\t\t'TheXML' = '&lt;description&gt;\r\nIt happened on &lt;date lang=\"norwegian\"&gt;03.03.99&lt;\/date&gt; ....\r\n&lt;\/description&gt;';\r\n\t\t'TheJSON' = '{\r\n    \"#text\":  [\r\n                  \"\\nIt happened on \",\r\n                  \" ....\\n\"\r\n              ],\r\n    \"date\":  {\r\n                 \"lang\":  \"norwegian\",\r\n                 \"#text\":  \"03.03.99\"\r\n             }\r\n}'\r\n\t},\r\n\t#A complex XML element, \"food\", which contains only text:\r\n\t@{\r\n\t\t'Test' = 'A complex XML element, which contains only text'\r\n\t\t'TheXML' = '&lt;food type=\"dessert\"&gt;Ice cream&lt;\/food&gt;';\r\n\t\t'TheJSON' = '{\r\n    \"type\":  \"dessert\",\r\n    \"#text\":  \"Ice cream\"\r\n}'\r\n\t},\r\n\t#empty  element\r\n\t@{\r\n\t\t'Test' = 'empty  element'\r\n\t\t'TheXML' = '&lt;product pid=\"1345\"\/&gt;';\r\n\t\t'TheJSON' = '{\r\n    \"pid\":  \"1345\"\r\n}'\r\n\t},\r\n\t#A complex XML element, \"employee\", which contains only other elements:\r\n\t@{\r\n\t\t'Test' = 'complex element,which contains only other elements'\r\n\t\t'TheXML' = '\r\n&lt;employee&gt;\r\n  &lt;firstname&gt;Jane&lt;\/firstname&gt;\r\n  &lt;lastname&gt;Smith&lt;\/lastname&gt;\r\n&lt;\/employee&gt;';\r\n\t\t'TheJSON' = '{\r\n    \"firstname\":  \"Jane\",\r\n    \"lastname\":  \"Smith\"\r\n}'\r\n\t}) | &lt;strong&gt;foreach&lt;\/strong&gt;{\r\n\t$Reference = $_.TheJSON;\r\n\t$Difference = [xml]$_.TheXML | convertFrom-XML | &lt;strong&gt;convertto-json&lt;\/strong&gt;\r\n\tif ($Reference -ine $Difference)\r\n\t{ &lt;strong&gt;Write-Warning&lt;\/strong&gt; \"An anomaly testing $($_.Test). The $Reference was different to $Difference\" }\r\n\telse\r\n\t{ \"passed test $($_.Test)\" }\r\n}<\/pre>\n<p>I\u2019ve added the source to a <a href=\"https:\/\/github.com\/Phil-Factor\/PowerShell-Utility-Cmdlets\">collection of PowerShell utilities<\/a> on my github repositories.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I have to deal with XML data with PowerShell. That\u2019s fine, but I always find myself sighing when I need to read an XML object. I have a copy of \u2018XML in a Nutshell\u2019. It is over six hundred pages of essential information on XML. Some nutshell. XML is, I believe, a way of representing&#8230;&hellip;<\/p>\n","protected":false},"author":154613,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2],"tags":[],"coauthors":[6813],"class_list":["post-91185","post","type-post","status-publish","format-standard","hentry","category-blogs"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/91185","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/154613"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=91185"}],"version-history":[{"count":4,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/91185\/revisions"}],"predecessor-version":[{"id":91542,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/91185\/revisions\/91542"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=91185"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=91185"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=91185"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=91185"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}