Code Smells: Raw Strings and the Subtle Syntax Catastrophe

Strings. Strings are indispensable and ubiquitous in almost any programming language. What could be more innocuous than strings? For some time, I have been gaining awareness of a potential for misuse, something significant enough that is worth sharing with the developer community. And that is the issue of constructing objects that are themselves strings but represent semantic objects. For example, “<tag>content</tag>” is a semantic object; it is a valid representation of XML. At the same time, it is “just” a string. So it seems perfectly natural and reasonable that you could create that string by just saying, for example,

1	var myXml = "<tag id='x123'>content</tag>"

or, perhaps it makes more sense to separate the components just a bit:

1	var myXml = "<tag id='x123'>" + "content" + "</tag>"

How could those possibly be unreasonable? This is the perfect case, the “poster child” if you will, to apply the analogy of the “slippery slope”. If those expressions are fine, you could introduce some content by variable:

1	var myXml = "<tag id='x123'>" + myContentVariable + "</tag>"

…or introduce an attribute by variable:

1 2	var myXml = "<tag id='" + myAttribute + "'>" + myContent + "</tag>"

…or perhaps you want to apply the best practice of reducing magic string use and introduce the tag by variable, too:

var myXml = "<" + tagVar

+ "id='" + myAttribute + "'>"

+ myContent + "<"+ tagVar +">"

All seems perfectly natural and reasonable, something you will find used widely, right? Except there is a big problem with that last line of code. It will execute just fine, but did you notice that it is no longer valid XML? In fact, there are two errors there; can you spot them?

What I have been describing is working with raw strings—which is to say, plain old string objects—used to build a semantic string object. Using raw strings can easily, as it has done above, lead to what I refer to as the subtle string catastrophe: you have what you think is a semantic object, what looks like a semantic object, but fails catastrophically at runtime. So raw strings are a strong source of code smell!

Solution Part 1: Encapsulate

The solution to this problem lies with best practices of software coding. First, encapsulate any such potentially hazardous code into its own class or method. Then when you need to generate, for example, an XML string, you use that dedicated helper or builder or generator. This immediately provides two benefits: it avoids code duplication and it avoids an SRP violation.

Code Duplication

The original scenario had the potential for code duplication. By parameterizing that line of XML (corrected here) you would likely have put it into a simple function, such as this:

function GenerateXml(tagVar, myAttribute, myContent) {

var myXml = "<" + tagVar

+ " id='" + myAttribute + "'>"

+ myContent + "</" + tagVar + ">";

return myXml;

}

But say that sometime later there is another class that needs to use it. Because the method is so short and simple, the natural tendency is just to copy it into the new class, resulting in duplicate code. Even though it is just a couple lines of code, it is still duplicate code. And the debate has largely been settled in favor of this being a bad thing, because it makes maintenance of a software system much more challenging. (Of course, there are always edge cases where, perhaps for performance reasons, you might argue that you have to do it.)

I said above that it is the “natural tendency”. Well, not really. Most of us strive to use best practices and know better. Yet, we still must succumb to the evils of code duplication for a couple reasons.

Efficacy is a common reason. Class X has just the method you need to add to the new class Y. You could extract that method from X into a new class Z, then wire up both X and Y to use the method from Z… but you simply do not have time right now, with a deadline looming, so you just copy the method from X to Y and make a mental note of the technical debt that you will get back to. Eventually.

Ignorance is probably the most prevalent reason, though. You did not know that you have a function X₁ in project Y that already knows how to interpolate a franistan into a gazorniplatz so you wrote function X₂ in project Z to do the same thing.

SRP Violation

The Single Responsibility Principle (SRP) states that a class should be focused on doing one thing and have only a single reason to change. So you have a GradeCalculator class that calculates student grades and needs to return that data as XML. But this GradeCalculator has no need to know, nor should it know, the details of building XML. Perhaps sometime later the requirements change so that GradeCalculator needs to return XML with an additional attribute. But the GenerateXml method only accepts one attribute so the method will need to be modified. If GenerateXml is within the GradeCalculator class, that is a violation of SRP.

Solution Part 2: Builders

Just moving the fragile code into the GenerateXml method above did not solve the problem; the code is still fragile. This next example illustrates what you might think at first glance is a good solution to the raw string code smell. This example is in C# but the principle applies to any language. A StringBuilder (line 1) is just a fancy wrapper for doing string concatenation efficiently, and is not significant here; just note that you get the finished string from its ToString method at the end (line 13).

public string GetXyzAsXml()

{

var stringBuilder = new StringBuilder(); // 1

stringBuilder.Append("<xyz name='" + Name + "'>"); // 2

foreach (KeyValuePair<string, string> pair in ItemList) // 3

{

stringBuilder.Append("<content>"); // 4

stringBuilder.Append("<key>"); // 5

stringBuilder.Append(pair.Key); // 6

stringBuilder.Append("</key>"); // 7

stringBuilder.Append("<value>"); // 8

stringBuilder.Append(pair.Value); // 9

stringBuilder.Append("</value>"); // 10

stringBuilder.Append("</content>"); // 11

}

stringBuilder.Append("</xyz>"); // 12

return stringBuilder.ToString(); // 13

}

Arguably this is a better approach than the prior example, but it still suffers the same fragility. This method is still attempting to build valid XML just by concatenating strings together. In line 2, for example, it creates a single XML tag containing an attribute by using direct string concatenation via the plus (+) operator. The loop in lines 3 – 11 creates some content and that tag created on line 2 is not closed until line 12. You can see several more instances of creating elements by string concatenation via the Append method (lines 5 – 7, 8 – 10, and 4 – 11).

But this is still a terribly fragile way to build semantically meaningful objects. It aims to build XML with nothing to safeguard that it is, in fact, delivering XML. Delete line 9 and the code will still compile just fine, and it will still run just fine, and the XML it delivers will still be valid XML (though it will not contain the correct data). However, delete line 10 instead and it still compiles and it still runs, but it is no longer delivering valid XML at all: another subtle syntax catastrophe!

Contrast that approach with a general purpose, dedicated XML builder. You, as the consumer of it, have no access to its internals. So while you could still inject incorrect data into the XML, you are guaranteed to have semantically valid XML object whenever you call it. Here’s one way the above XML could be generated with such a library:

public string GetXyzAsXml()

{

return xmlBuilder.build(

"xyz",

new AttributeList { new Attribute("name", Name) },

ItemList.Select(pair =>

xmlBuilder.build(

"content",

xmlBuilder.build("key", pair.Key),

xmlBuilder.build("value", pair.Value)

)

).ToList()

);

}

The above provides the data, and provides the ordering, but does not deal with the syntax of the XML at all; that has been completely abstracted away into the XML builder. With an XML builder, no longer are you dealing with XML as raw strings. If you always use the xmlBuilder when you need a piece of XML, you have its safety net guaranteeing the string you get back is not a string of meaningless characters, but rather valid XML.

More Examples plus Solutions

I hope the above has started to persuade you that dealing in raw strings (think raw fish) is a big code smell! I hope not to overwhelm your olfactory sense, but let me show just a few more entities that you will widely see dealt with as raw strings that should instead be handled by a dedicated builder. HTML, of course, is an obvious candidate as it has many similarities with XML. A couple of the others may be more surprising.

Remember: this table shows examples of what not to do.

Entity	Example as Raw String
HTML	`"<span class='"+myClass+"' + xyz='5'>" + myContent + "</span>"`
URL	`"http://foo.com/bar?id="+id+"&term="+termValue`
Regular expression	`suspectRegexString.replace(/[.*+?^${}()\|[\]\\]/g, '\\$&')`
Quantified English phrase	`count + " bird" + (count == 1 ? "" : "s")`
General string	`"error (" + object.ref.depth.code + ") + due to " + r1 + " or " + r2`

The remaining sections discuss the requirements of each of these, provide example implementations, and show a “before” and “after” picture using the samples in the table above. (The implementations are mostly done as AngularJS services, so it is trivial to adapt them to other JavaScript uses, and reasonably straightforward to adapt to other languages.)

HTML Builder

At first glance, an HTML builder seems like just a few lines of code should suffice, and you might start in the same class where it will be used, as discussed above, with a simple function that accepts a tag, optional content, and an optional attribute list, then combines them all appropriately:

function build(tag, content, attributes) {

content = content || '';

attributes = attributes || {};

var result = '<' + tag;

for (var attr in attributes) {

result += ' ' + attr + '=' + attributes[attr];

}

result += ">" + content + "</" + tag + ">";

return result;

};

That’s fine for a “proof of concept”, but remember that the goal is to make a robust, self-contained, and separate module that does its one task well. Co-opting that popular expression “set it and forget it” here you want to “write it and forget it”. That is, once written, anytime you need to output HTML you want to be able to rely on your HTML builder, so all you have to do is call it with the correct parameters.

So to start on the real code, let’s start with requirements, considering the full range of what it needs to do. There is just something about moving a function to its own class… as soon as I do that I feel the weight of the function’s obligations, and can then see more clearly all the variations and edge cases.

build with NO content and NO attributes returns an empty element
build with content and NO attributes returns an element with content
build with NO content and ONE attribute returns an empty element with attribute
build with content and ONE attribute returns an element with attribute
build with content and NULL attribute returns an element with attribute having no value
build with content and UNDEFINED attribute returns an element with attribute having no value
build with content and EMPTY attribute returns an element with attribute having empty value
build with content and WHITESPACE attribute returns an element with attribute having whitespace value
build with NO content and MULTIPLE attributes returns an empty element with attributes
build with content and MULTIPLE attributes returns an empty element with attributes
build with reserved characters in content encodes them
build with NO content and self-closing flag ON returns an empty element
build with NO content and self-closing flag OFF returns an explicitly closed element
build with NO content and an attribute and self-closing flag OFF returns an explicitly closed element
build with attribute containing DOUBLE quotes uses single quote delimiters
build with attribute containing SINGLE quotes uses double quote delimiters
build throws exception when attribute value contains BOTH double quotes and single quotes
build throws exception when tag contains invalid characters
build throws exception when attribute name contains invalid characters

Those requirements are, of course, the names of the unit tests. (If you are wondering why the uppercase on certain words, that is my convention to help indicate a unit test’s uniqueness.) With TDD, here is the final implementation that evolved to satisfy those unit tests:

angular('myModule').service('htmlBuilder',

function() {

'use strict';

var me = this;

me.build = function(tag, content, attributes, config) {

if (tag.match(/[^a-z]/i)) {

throw new Error('tag may only contain alphabetic characters');

}

// process optional arguments

content = content ? me.htmlEscape(content) : '';

attributes = attributes || {};

var allowSelfClosing =

!config || typeof config.allowSelfClosing === 'undefined'

? true

: config.allowSelfClosing;

// do the work

var formatString = content || !allowSelfClosing

? '<{0}{2}>{1}</{0}>'

: '<{0}{2}/>';

var result = formatString.format(

tag, content, buildAttributeString(attributes));

return result;

};

me.htmlEscape = function(str) {

return String(str)

.replace(/&/g, '&'+'amp;')

.replace(/"/g, '&'+'quot;')

.replace(/'/g, '&'+'#39;')

.replace(/</g, '&'+'lt;')

.replace(/>/g, '&'+'gt;');

};

function buildAttributeString(attributes) {

var attrList = Object.keys(attributes)

.map(function(attrName) {

if (attrName.match(/[ "'>/=]/)) {

throw new Error(

'invalid character in attribute name ({0})'

.format(attrName));

}

var attrValue = attributes[attrName];

if (typeof (attrValue) === 'undefined'

|| attrValue === null) {

return ' ' + attrName;

} else {

if (typeof (attrValue) === 'string'

&& attrValue.match(/".*'|'.*"/)) {

throw new Error(

'invalid character in attribute value ({0})'

.format(attrValue));

}

var quote = typeof (attrValue) === 'string'

&& attrValue.match(/"/) ? '\'' : '"';

return ' {0}={2}{1}{2}' // note leading space

.format(attrName, attrValue, quote);

}

});

return attrList.join('');

}

});

When you create a dedicated engine for a task, you have to make sure it does what one could reasonably expect with all sorts of inputs. For example, the above code checks to see what kind of quotes each attribute uses and accommodates either type. The code also accommodates an HTML peculiarity wherein, even though the syntax might allow something, the semantics do not: the most notorious example of this is the <script> tag, which must be written

1	<script src="..."></script>

rather than

1	<script src="..."/>

The builder service by default creates an empty (self-closing) element, but allows you to specify the former behavior via a parameter:

htmlBuilder.build(

'script',

undefined,

{ src: 'c:/path/to/my/script.js' },

{ allowSelfClosing: false }

);

By setting config.allowSelfClosing to false, you then get an explicit end tag even with no content.

TL;DR

Something like this…

1	"<span class='"+myClass+"' xyz='5'>" + myContent + "</span>"

Now becomes this…

htmlBuilder.build(

'span',

myContent,

{

class: myClass,

xyz: 5

}

)

URL Builder

Why would you need a builder for something as simple as a URL? Here again is the example I gave in the introduction to this section:

1	"http://foo.com/bar?id="+id+"&term="+termValue

Sure, you can write the URL in one line with string concatenation. Undoubtedly you know that attributes are assigned by the equals sign, and attributes are separated by the ampersand, and the URL is divided by the question mark… but what if an attribute name has a character that needs to be encoded? Or what if an attribute value inadvertently has a trailing space? Once you start to think about it there are a few more details than fit into a single line of code.

Here are the requirements (unit tests) for a simple URL builder. I say “simple” because it does not, for example, cover anchor expressions nor encoding the path component. In the nature of just-in-time coding principles, my particular application did not need those, so I did not add support for them at this time.

build returns supplied path when NO parameters specified
build returns formatted url when ONE parameter specified
build returns formatted url when MULTIPLE parameters specified
build encodes parameter VALUE if necessary
build encodes parameter NAME if necessary
build trims off LEADING spaces of path
build trims off TRAILING spaces of path
build throws exception if path contains question mark

Here is the implementation that materialized from that list, again as an AngularJS service:

angular('myModule').service('urlBuilder',

function() {

var me = this;

me.build = function(path, paramObject) {

if (path.match(/\?/)) {

throw new Error('invalid character in path');

}

paramObject = paramObject || {};

var paramList = Object.keys(paramObject)

.map(function(paramName) {

// similar to angular's own $httpParamSerializer

return "{0}={1}".format(

encodeURIComponent(paramName),

encodeURIComponent(paramObject[paramName]));

});

return paramList.length > 0

? "{0}?{1}".format(path.trim(), paramList.join('&'))

: path.trim();

};

});

With that service in place, you can then describe a URL in a much more… civilized fashion, as shown in the summary below.

TL;DR

Something like this…

1	"http://foo.com/bar?id="+id+"&term="+termValue

Now becomes this…

urlBuilder.build(

"http://foo.com/bar",

{ id: id, term: termValue }

)

Regex Cleaner

When it comes to a regular expression, I find I did not need a builder per se, but rather a cleaner: a way to make sure that my regular expressions were safe from corruption due to content inadvertently containing regular expression meta-characters. The example above was this:

1	suspectRegexString.replace(/[.*+?^${}()\|[\]\\]/g, '\\$&')

That is actually very close to encapsulating the details, but every time you need to cleanse a regular expression you should not have to specify the full set of regular expression meta-characters! This is really just a one-liner, and the requirement is clear: escape the requisite meta-characters. All that is needed is to wrap that one line of code into a function:

angular('myModule').service('regexCleaner',

function() {

'use strict';

var me = this;

me.escape = function(str) {

return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

};

});

TL;DR

Something like this…

1	suspectRegexString.replace(/[.*+?^${}()\|[\]\\]/g, '\\$&')

Now becomes this…

1	regexCleaner.escape(suspectRegexString)

Quantified English Phrase Matcher

Have you noticed that there is something unique about one number… that is to say, there is something unique about the number one? How many times have you seen in a user interface “… which occurs 1 times” or anything similar where the number does not agree with the noun? In an effort to ameliorate that grammatical error the quick and dirty fix is to make the plural parenthetical: “… which occurs 1 time(s)” so whether the number is “5” or “1”, it still works… sort of. A better approach is with a dedicated quantifier matcher.

Unit tests/requirements:

Match does not pluralize a noun when the quantity is 1
Match pluralizes a noun with quantity 0
Match pluralizes a noun with quantity greater than 1
Match uses a DEFAULT ending when none supplied
Match uses a SPECIFIED ending when supplied
Match truncates the noun per the suffix to replace then adds the DEFAULT ending
Match truncates the noun per the suffix to replace then adds the SPECIFIED ending
Match does not truncate the noun if the suffix to replace is not found

This simple AngularJS service accepts a quantity, a noun, an optional plural ending, and an optional suffix to replace. If no plural ending is supplied it defaults to the most common, “s”. By default that ending will just be appended to the noun. However, some English nouns are made plural by replacing an existing ending on the noun. If specified as a regular expression, that existing suffix will be replaced with the new ending.

angular('myModule').service('pluralizer',

function() {

'use strict';

var me = this;

me.match = function(quantity, noun, pluralEnding, suffixToReplace) {

pluralEnding = pluralEnding || 's';

if (suffixToReplace) {

var re = new RegExp(

suffixToReplace +

(suffixToReplace.match(/\$$/) ? '' : '$'));

return "{0} {1}"

.format(

quantity,

quantity === 1

? noun

: (noun.match(re)

? noun.replace(re, pluralEnding)

: noun + pluralEnding)

);

} else {

return "{0} {1}{2}"

.format(

quantity,

noun,

quantity === 1 ? '' : pluralEnding

);

}

};

});

TL;DR

Something like this…

1	count + " bird" + (count == 1 ? "" : "s")

Now becomes this…

1	pluralizer.match(count, “bird”) // emits “1 bird” or “29 birds”

A couple more examples:

1 2	pluralizer.match(4, 'latch', 'es') // emits “4 latches” pluralizer.match(0, 'index', 'ices', 'ex') // emits “0 indices”

General String Formatting

You have seen a few ideas above on where to use dedicated builders and helpers instead of raw strings. You can expand outward from there in all directions, of course, but as my final example, I want to take you inward. The example I gave of what not to do is this:

1	"error (" + object.ref.depth.code + ") + due to " + r1 + " or " + r2

Yes, I am, in fact, saying that string concatenation should be considered harmful! If you think about it, it suffers the same ills as the previous entities you have seen. In this case, you need to make sure that there is a leading space or a trailing space in certain pieces, that you have a closing parenthesis to match the opening parenthesis, and if the variable names are long, it might wrap over more than one line making it difficult to grasp the whole string in a single mental snapshot.

The solution of this issue is already built-in to many languages, dating back to the venerable sprintf from the “C” language, so I am not going to start from scratch in my discussion here for this one. Java uses essentially the same thing as C but calls it String.format. The .NET framework morphed it slightly but used the same name (albeit with a capital), String.Format. There is an sprintf for JavaScript but my preference in JavaScript is to use the .NET equivalent, called simply format. Here is the complete code from that last link:

if (!String.prototype.format) {

String.prototype.format = function() {

var args = arguments;

return this.replace(/{(\d+)}/g, function(match, number) {

return typeof args[number] != 'undefined'

? args[number]

: match

;

});

};

}

By using this format method, the string that you are rendering is much cleaner and easier to grasp at a glance. As proof, go back over all the example implementations in this article—you will find that I used the method a lot. My rule of thumb for string building, then, is that it is OK to have one plus sign but never more than one. Thus, if I have just two strings to concatenate I use a plus (+) sign; for any more than two strings, I use format.

TL;DR

Something like this…

1	"error (" + object.ref.depth.code + ") + due to " + r1 + " or " + r2

Now becomes this…

1 2	"error ({0}) due to {1} or {2}" .format(object.ref.depth.code, r1, r2)

Conclusion

With raw strings, creating semantically meaningful objects is fraught with fragility. Instead create dedicated builders for any such objects you may need. So remove your nose plugs, and take a metaphorical whiff of the following table (bringing together all of the results detailed in the preceding sections), now free of rank odors!

Entity	Example with semantic helpers
HTML	`htmlBuilder.build('span', myContent, { class: myClass, xyz: 5 } )`
URL	`urlBuilder.build( "http://foo.com/bar", { id: id, term: termValue } )`
Regular expression	`regexCleaner.escape(suspectRegexString)`
Quantified English phrase	`pluralizer.match(count, “bird”)`
General string	`"error ({0}) due to {1} or {2}".format(object.ref.depth.code, r1, r2)`

Now that you are thinking along these lines, any similar helpers, builders, matchers, or formatters come to your mind? Add a comment!

Register for Simple Talk

Code Smells: Raw Strings and the Subtle Syntax Catastrophe

Solution Part 1: Encapsulate

Code Duplication

SRP Violation

Solution Part 2: Builders

More Examples plus Solutions

HTML Builder

TL;DR

URL Builder

TL;DR

Regex Cleaner

TL;DR

Quantified English Phrase Matcher

TL;DR

General String Formatting

TL;DR

Conclusion

About the author

Michael Sorens

Michael's contributions

Articles

Books

Top topics

Michael's latest contributions:

Checking the pulse of Kubernetes

The Phantom Menace in Unit Testing

Better Code Reviews with GIT

Solution Part 1: Encapsulate

Code Duplication

SRP Violation

Solution Part 2: Builders

More Examples plus Solutions

HTML Builder

TL;DR

URL Builder

TL;DR

Regex Cleaner

TL;DR

Quantified English Phrase Matcher

TL;DR

General String Formatting

TL;DR

Conclusion

Recommended

About the author

Michael Sorens

Michael's contributions

Articles

Books

Top topics

Michael's latest contributions:

Checking the pulse of Kubernetes

The Phantom Menace in Unit Testing

Better Code Reviews with GIT