Cowboy filenames

If you ask me, and I take it as implicit by your visit that you do (sorry about that, but it’s Friday afternoon and I need a weekend), Windows is far too fascist about filenames. Most notably in terms of the characters one can put into a filename, and the obstreperous way it will cough and whine at you if you dare to break the rules.

For the application I’m working on, we automatically generate filenames based on certain criteria. These filenames can contain descriptive text which we don’t necessarily want to restrict to the set of filename-legal characters (I’ll return to this point momentarily). So we have to filter the descriptive text and strip out illegal characters.

So what is the set of legal characters? I consulted my trusty internet, and after the usual few minutes throwing seaweed over it and sticking pins in effigies, it yielded the following rather useful bit of wisdom, from the Windows Platform SDK, under Win32 and COM Development, System Services, Files and I/O, SDK Documentation, Storage, Storage Overview, File Management, Creating, Deleting and Maintaining Files, Naming a File.


Use any character in the current code page for a name, except characters in the range of 0 through 31, or any character that the file system does not allow. A name can contain characters in the extended character set (128-255). However, it cannot contain the following reserved characters:
< > : " / \ |

The following reserved device names cannot be used as the name of a file: CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9. Also avoid these names followed by an extension, for example, NUL.tx7.
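
The device-name rule in that excerpt is easy to check mechanically. Here is a minimal sketch in Python; the function name is made up for illustration, and comparing case-insensitively and cutting the name at the first dot are my assumptions, since the quoted text spells out neither.

```python
# Reserved device names, exactly as listed in the quoted SDK text.
RESERVED_DEVICE_NAMES = frozenset(
    ["CON", "PRN", "AUX", "NUL"]
    + [f"COM{i}" for i in range(1, 10)]
    + [f"LPT{i}" for i in range(1, 10)]
)


def is_reserved_device_name(filename: str) -> bool:
    """Return True if filename is a reserved device name, with or without an extension.

    Assumptions (not stated in the quoted text): the comparison ignores case,
    and everything after the first dot counts as the extension.
    """
    stem = filename.split(".", 1)[0]
    return stem.upper() in RESERVED_DEVICE_NAMES
```

With this, is_reserved_device_name("NUL.tx7") is true, while a name that merely starts with a reserved word, such as "console.log", passes.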

Now at first, as I was in a hurry, I read this as “everything except < > : " / \ | is ok”.

But I forgot that characters below 32 are explicitly disallowed.

And then I noticed that (quite naturally) applications don’t let you save filenames containing the “*” character, as this is used as a wildcard.

And filenames containing the “?” character are disallowed too. This is also a wildcard character, and it turns up in the “\\?\” prefix that is prepended to a path to escape the usual length limit (I always liked that little hack. Such style.)

But with a firm conviction that MSDN Does Not Lie, I checked my ASCII table to make sure I wasn’t hallucinating. Sure enough, “*” and “?” are ASCII 42 and 63 (please don’t anyone tell me they knew that from memory, or I may cry).

But, and aha, there’s another clause I hadn’t read properly: “…or any character that the file system does not allow”. So, time to work out what characters are considered illegal by FAT, FAT32 and NTFS, methought.

Around this point I came across the system structure BIGFATBOOTFSINFO, which I only mention as it’s an excellent name. But I digress.

So, I wandered towards the Platform SDK again, by way of Win32 and COM Development, Development Guides, Windows 95/98/Me Programming, Long File Names, Long File Names and the Protected-Mode FAT System. It was a long walk, but I sustained myself on the way with a coffee. Herein I found more wisdom.


When an application creates a file or directory that has a long file name, the system automatically generates a corresponding alias for that file or directory using the standard 8.3 format. The characters used in the alias are the same characters that are available for use in MS-DOS file and directory names. Valid characters for the alias are any combination of letters, digits, or characters with ASCII codes greater than 127, the space character (ASCII 20h), as well as any of the following special characters.

$ % ' - _ @ ~ ` ! ( ) { } ^ # &

The space character has been available to applications for file names and directory names through the functions in current and earlier versions of MS-DOS. However, many applications do not recognize the space character as a valid character, and the system does not use the space character when it generates an alias for a long file name. MS-DOS does not distinguish between uppercase and lowercase letters in file names and directory names, and this is also true for aliases.

The set of valid characters for long file names includes all the characters that are valid for an alias as well as the following additional characters.

+ , ; = [ ]
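
Taken literally, that excerpt describes a whitelist, which can be written down as a small membership test. The following Python is only a sketch of the quoted rules; the names are invented for illustration, and "letters" and "digits" are assumed to mean the ASCII ones. Note that the period is not in the quoted lists, so this test rejects it, and a real filter would have to treat the dot before the extension separately.

```python
import string

# Special characters the quoted text allows in an 8.3 alias.
ALIAS_SPECIALS = frozenset("$%'-_@~`!(){}^#&")
# Additional characters the quoted text allows only in long file names.
LONG_NAME_EXTRAS = frozenset("+,;=[]")
# Assumption: "letters" and "digits" in the quote mean ASCII letters and digits.
ASCII_ALNUM = frozenset(string.ascii_letters + string.digits)


def is_valid_long_name_char(ch: str) -> bool:
    """Return True if ch is valid in a long file name according to the quoted text."""
    if len(ch) != 1:
        raise ValueError("expected a single character")
    if ch in ASCII_ALNUM or ch == " ":
        return True
    if ord(ch) > 127:  # "characters with ASCII codes greater than 127"
        return True
    return ch in ALIAS_SPECIALS or ch in LONG_NAME_EXTRAS
```

Both is_valid_long_name_char("*") and is_valid_long_name_char("?") come out false, purely because neither character appears in any of the quoted lists.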

So, we’re getting a little closer. At least FAT16 fesses up to not allowing “*” and “?”.

As to exactly which characters are disallowed by NTFS, this remains a mystery to me. It doesn’t fess up to disallowing “*” and “?”, and indeed googling suggests that NTFS allows all Unicode characters in filenames. So it is presumably Windows itself, quite possibly no deeper than the Win32 API, that disallows “*” and “?” as well as the characters listed in the first quoted section above. Could it disallow others? Since “*” and “?” are not documented explicitly, one has to presume so.

If it weren’t for the fact that our application has to support Unicode, I would write a simple filter based on ASCII character codes, and be rather fascist about it. In our case we can simply replace offending characters with underscores “_” and have done with it. Once, of course, we’ve determined what those offending characters are.
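
For what it’s worth, the replace-with-underscore idea might look something like the Python sketch below. It is not our application’s code, and the blacklist is only my best guess so far: control characters below 32, the reserved characters from the SDK text (including the backslash, which is a path separator), and the two wildcards. The function and constant names are made up for illustration. Everything else, Unicode included, passes through untouched.

```python
# Best-guess blacklist, assembled from the documentation quoted in this post.
RESERVED_CHARS = frozenset('<>:"/\\|')  # reserved characters from the SDK text
WILDCARDS = frozenset("*?")             # undocumented there, but rejected in practice


def sanitise_filename_text(text: str, replacement: str = "_") -> str:
    """Replace every character believed to be illegal in a Windows filename.

    Control characters (code points 0 to 31), the reserved characters and
    the wildcards are swapped for `replacement`; all other characters,
    including non-ASCII ones, are kept as they are.
    """
    cleaned = []
    for ch in text:
        if ord(ch) < 32 or ch in RESERVED_CHARS or ch in WILDCARDS:
            cleaned.append(replacement)
        else:
            cleaned.append(ch)
    return "".join(cleaned)
```

A blacklist suits Unicode text better than a whitelist here, since the set of known offenders is small. The catch is the one already noted: if Windows rejects a character that nobody documented, this filter will happily let it through.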

I’m sure I’ve encountered this issue before, and I seem to recall wishing then just for a simple table, for both Unicode and ASCII characters, showing which were disallowed and by what level of the system (shell, file system API, file system itself). My wish is still pending, but since I have some pressing work to do I should Alt-Tab and get on with it…