By David Weinberger
July 27, 2014 08:22 PM EDT
This is one of the most amazing examples I’ve seen of the complexity of even simple organizational schemes. “Unicode Collation Algorithm (Unicode Technical Standard #10)” spells out in precise detail how to sort strings in what we might colloquially call “alphabetical order.” But it’s way, way, way more complex than that.
Unicode is an international standard for how strings of characters get represented within computing systems. For example, in the familiar ASCII encoding, the letter “A” is represented in computers by the number 65. But ASCII is too limited to encode the world’s alphabets. Unicode does the job.
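To make the “A is 65” point concrete, here is a small Python sketch. The helper name `describe` is mine, not anything from the standard; it reports a character's code point number, its usual U+ notation, whether it fits in ASCII's 0 to 127 range, and its UTF-8 bytes.

```python
def describe(ch):
    """Return (code point, U+ notation, fits in ASCII?, UTF-8 bytes) for one character."""
    number = ord(ch)                      # the integer Unicode assigns to this character
    return number, f"U+{number:04X}", number < 128, ch.encode("utf-8")

print(describe("A"))   # (65, 'U+0041', True, b'A')
print(describe("ø"))   # (248, 'U+00F8', False, b'\xc3\xb8')
```

“A” keeps the number it had in ASCII, while “ø” falls outside ASCII's range and needs two bytes in UTF-8.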
As the paper says, “Collation is the general term for the process and function of determining the sorting order of strings of characters” so that, for example, users can look them up on a list. Alphabetical order is a simple form of collation.
Sorting inconsistent alphabets is, well, a problem. But let Technical Standard #10 explain the problem:
It is important to ensure that collation meets user expectations as fully as possible. For example, in the majority of Latin languages, ø sorts as an accented variant of o, meaning that most users would expect ø alongside o. However, a few languages, such as Norwegian and Danish, sort ø as a unique element after z. Sorting “Søren” after “Sylt” in a long list, as would be expected in Norwegian or Danish, will cause problems if the user expects ø as a variant of o. A user will look for “Søren” between “Sorem” and “Soret”, not see it in the selection, and assume the string is missing, confused because it was sorted in a completely different location.
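The “Søren” problem can be sketched in a few lines of Python. This is a toy illustration, not the Unicode Collation Algorithm: the two key functions below are hypothetical, handle only the letters a to z plus ø, and ignore everything else a real collation has to deal with.

```python
names = ["Sylt", "Soret", "Søren", "Sorem"]

def variant_of_o_key(word):
    """Treat ø as an accented variant of o; use the original spelling only to break ties."""
    return (word.lower().replace("ø", "o"), word)

def after_z_key(word):
    """Treat ø as a separate letter that comes after z (toy version: a-z and ø only)."""
    return [27 if ch == "ø" else ord(ch) - ord("a") + 1 for ch in word.lower()]

print(sorted(names, key=variant_of_o_key))  # ['Sorem', 'Søren', 'Soret', 'Sylt']
print(sorted(names, key=after_z_key))       # ['Sorem', 'Soret', 'Sylt', 'Søren']
```

Same four names, two defensible orders. A reader expecting the first list will not find “Søren” where the second list puts it.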
Heck, some French dictionaries even sort their accents in reverse order. (See Section 1.3.)
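Here is a rough Python sketch of what sorting accents in reverse order means. It assumes a two-level comparison of my own devising (base letters first, then accents), which is far simpler than the real algorithm. The function names are hypothetical; the only library calls are `unicodedata.normalize` and `unicodedata.combining` from the standard library.

```python
import unicodedata

def split_levels(word):
    """Split a word into base letters and, per letter, its combining accent marks."""
    base, marks = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and marks:
            marks[-1] += ch          # attach the accent to the preceding letter
        else:
            base.append(ch.lower())
            marks.append("")         # no accent seen yet for this letter
    return base, marks

def forward_accent_key(word):
    base, marks = split_levels(word)
    return (base, marks)             # compare accents from the start of the word

def backward_accent_key(word):
    base, marks = split_levels(word)
    return (base, marks[::-1])       # compare accents from the end of the word

words = ["côté", "cote", "coté", "côte"]
print(sorted(words, key=forward_accent_key))   # ['cote', 'coté', 'côte', 'côté']
print(sorted(words, key=backward_accent_key))  # ['cote', 'côte', 'coté', 'côté']
```

All four words share the base letters c, o, t, e, so only the accents decide the order, and the direction in which they are compared changes the result.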
But that’s nothing. Here’s a fairly random paragraph from further into this magnificent document (section 7.2):
In the DUCET, characters are given tertiary weights according to Table 17. The Decomposition Type is from the Unicode Character Database [UAX44]. The Case or Kana Subtype entry refers either to a case distinction or to a specific list of characters. The weights are from MIN = 2 to MAX = 1F₁₆, excluding 7, which is not used for historical reasons.
Or from section 8.2:
Users often find asymmetric searching to be a useful option. When doing an asymmetric search, a character (or grapheme cluster) in the query that is unmarked at the secondary and/or tertiary levels will match a character in the target that is either marked or unmarked at the same levels, but a character in the query that is marked at the secondary and/or tertiary levels will only match a character in the target that is marked in the same way.
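The asymmetric rule is easier to see in code. The following Python sketch makes some loud assumptions: it treats accents as the “secondary” marking and uppercase as the “tertiary” marking, approximates grapheme clusters as a base letter plus its combining marks, and compares whole strings rather than searching inside a target. The function names are hypothetical and none of this comes from the standard's own data tables.

```python
import unicodedata

def clusters(text):
    """Break text into (base letter, accent marks, is_uppercase) triples."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and out:
            base, marks, upper = out[-1]
            out[-1] = (base, marks + ch, upper)   # attach accent to previous letter
        else:
            out.append((ch.lower(), "", ch.isupper()))
    return out

def char_matches(query_char, target_char):
    q_base, q_marks, q_upper = query_char
    t_base, t_marks, t_upper = target_char
    if q_base != t_base:
        return False                 # base letters must always agree
    if q_marks and q_marks != t_marks:
        return False                 # an accented query letter needs the same accents
    if q_upper and not t_upper:
        return False                 # an uppercase query letter needs uppercase
    return True                      # an unmarked query letter matches either way

def asymmetric_match(query, target):
    q, t = clusters(query), clusters(target)
    return len(q) == len(t) and all(char_matches(a, b) for a, b in zip(q, t))

print(asymmetric_match("resume", "résumé"))   # True: unmarked query matches marked target
print(asymmetric_match("résumé", "resume"))   # False: marked query needs the same marks
print(asymmetric_match("Resume", "resume"))   # False: uppercase R is a marked query letter
```

The asymmetry is the point: typing the plain form casts a wide net, while typing the accented or capitalized form narrows it.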
You may think I’m being snarky. I’m not at all. This document dives resolutely into the brambles and does not give up. It incidentally exposes just how complicated even the simplest of sorting tasks is when looked at in its full context, where that context is history, language, culture, and the ambiguity in which they thrive.