Natural Sort

Natural Sort is a hazy concept; it basically means something like human-friendly sort; it could be useful if the strings being sorted are going to be visible to the user.

Difference between normal sort and natural sort

Let us look at an example, we consider the list [1, 2, 10] to be an ascending order list, and sorting algorithms would not disagree. But if the list happened to be a list of strings most sorting libraries would sort it like this [ “1”, “10”, “2”]. The other common sorting inconvenience is related to the fact that in ASCII order, capital letters come before lower case letters. So if you try to sort the list [“a”, “B”] in ascending order, you will get the result [“B”, “a”].

I had always assumed that a character by character sort is the only practical way to sort strings and have always relied on zero-padding to avoid this problem. It was only recently that I noticed a difference between the way Windows Explorer sorts file names and the way PowerShell sorts file names. I was trying to create an m3u playlist using PowerShell. When I use PowerShell to create the playlist, files are in the wrong order. But if I create the playlist using drag-and-drop via Explorer, the order of files is correct. Here is an example of how Explorer sorts file names.

Files sorted by name in Windows Explorer

As you can see, Explorer does not have either the case problem or the number-as-string problem. This led me on a search for sorting libraries which sorts the way Explorer does. It seems like PHP and Javascript has some sort of support for it. The standard libraries of none of the languages I know seem to support natural sorting. In the end, I decided to use the Python library natsort.

Limitations of Natural Sort

Despite the fact that Windows Explorer, PHP, Javascript, etc support it; Natural sort really is a vague concept. It is understandable that most languages do not want to provide official support for it. In order to understand the limitations of Natural Sort, it is useful to think about the root causes of the sort problems which it is trying to solve.

The Case Problem

The case problem is the result of a mismatch between ASCII or Unicode ordering and alphabetic ordering. This is one of the most common sorting complaints users have and it is solvable. I believe the Python Natsort library handles it in a locale-aware manner. If you are dealing with only English, you don’t need an external library to handle this.

Numbers with special meaning inside a string

The problem of numbers in a string is a lot more thorny. Let us look at an example of how numbers with special meaning get added to a string. A user wants to store multiple pieces of data relating to an object. Only one variable or field is available to store the data. If the only variable available happens to be a permissive data type like the String, the user will likely dump all the data into the single string.

In order to do a real natural sort on this kind of compound string, you need to recognize the components which make up the string and guess which component the user wants to sort by. For example, let us say I have two files Zebra_and_4_Gazelles_Lesson1.ai and Elephant_Lesson2.ai. Here I have added the file name and lesson number into one string. What should be the natural order of these files? Should lesson1 come before lesson2 or should the sorting be alphabetical? If we decide that lesson1 must come before lesson2, we will have a hard time dealing with that 4 in the middle which has no special meaning.

Wrong Datatype

Sometimes the value inside the string has a definite and single data type. For example, there are strings which contain only values like dates, measurements, coordinate values, etc. Naturally, the first thing to try when faced with these type of strings is to see if a more appropriate data type can be used. It is not possible for a general sorting algorithm to deal with all these specific cases; I believe they do deal with some of them.

Conclusion

You have to say that programmers are in a somewhat unenviable position. Natural sorting has its unpredictability. On the other hand, users expect programmers to take care of at least the two most common sorting inconveniences: ASCII ordering and the case of numbers tagged to the beginning of the string. As mentioned before popular applications like Windows Explorer are already doing it. It is probably best to use an algorithm which handles only these two cases and sorts the rest in a predictable manner.