By David Weinberger
July 27, 2014 08:22 PM EDT
This is one of the most amazing examples I’ve seen of the complexity of even simple organizational schemes. “Unicode Collation Algorithm (Unicode Technical Standard #10)” spells out in precise detail how to sort strings in what we might colloquially call “alphabetical order.” But it’s way, way, way more complex than that.
Unicode is an international standard for how strings of characters get represented within computing systems. For example, in the familiar ASCII encoding, the letter “A” is represented in computers by the number 65. But ASCII is too limited to encode the world’s alphabets. Unicode does the job.
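To make the "A is 65" point concrete, here is a minimal Python sketch. The helper name `code_points` is made up for illustration; `ord`, `str.encode`, and `str.isascii` are standard Python built-ins.

```python
def code_points(text: str) -> list[int]:
    """Return the Unicode code point (an integer) of each character."""
    return [ord(ch) for ch in text]

# "A" has the same number in ASCII and in Unicode.
print(code_points("A"))        # [65]

# "ø" has no ASCII representation at all, but Unicode assigns it 248 (U+00F8).
print(code_points("ø"))        # [248]
print("ø".isascii())           # False

# How that code point is stored as bytes depends on the encoding, e.g. UTF-8.
print("ø".encode("utf-8"))     # b'\xc3\xb8'
```

Note that these numbers only identify characters. They say nothing reliable about the order in which a reader expects those characters to sort, which is the problem the standard takes on.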
As the paper says, “Collation is the general term for the process and function of determining the sorting order of strings of characters” so that, for example, users can look them up on a list. Alphabetical order is a simple form of collation.
Sorting inconsistent alphabets is, well, a problem. But let Technical Standard #10 explain the problem:
It is important to ensure that collation meets user expectations as fully as possible. For example, in the majority of Latin languages, ø sorts as an accented variant of o, meaning that most users would expect ø alongside o. However, a few languages, such as Norwegian and Danish, sort ø as a unique element after z. Sorting “Søren” after “Sylt” in a long list, as would be expected in Norwegian or Danish, will cause problems if the user expects ø as a variant of o. A user will look for “Søren” between “Sorem” and “Soret”, not see it in the selection, and assume the string is missing, confused because it was sorted in a completely different location.
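The “Søren” example can be sketched in a few lines of Python. This is a toy, not the Unicode Collation Algorithm: the `make_key` helper, the 26-letter base alphabet, and the two "tailorings" are assumptions invented for illustration, and they handle only the single letter ø.

```python
BASE = "abcdefghijklmnopqrstuvwxyz"  # assumed base alphabet for this toy

def make_key(alphabet, folds=None):
    """Build a sort key from an ordered alphabet.

    `folds` maps a character to the letter it should sort as
    (hypothetical stand-in for treating it as an accented variant).
    """
    folds = folds or {}
    rank = {ch: i for i, ch in enumerate(alphabet)}

    def key(word):
        # Primary comparison: position of each (folded) letter in the alphabet.
        primary = tuple(
            rank.get(folds.get(ch, ch), len(alphabet)) for ch in word.lower()
        )
        # Tie-break on the original spelling so the order is deterministic.
        return (primary, word)

    return key

words = ["Sylt", "Soret", "Søren", "Sorem"]

# ø treated as a variant of o: Søren lands between Sorem and Soret.
variant_key = make_key(BASE, folds={"ø": "o"})
print(sorted(words, key=variant_key))
# ['Sorem', 'Søren', 'Soret', 'Sylt']

# ø treated as a separate letter after z: Søren lands after Sylt.
danish_like_key = make_key(BASE + "ø")
print(sorted(words, key=danish_like_key))
# ['Sorem', 'Soret', 'Sylt', 'Søren']
```

Same four words, two defensible orders. A reader who expects the first and gets the second will scan right past the spot where “Søren” ought to be.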
Heck, some French dictionaries even sort their accents in reverse order. (See Section 1.3.)
But that’s nothing. Here’s a fairly random paragraph from further into this magnificent document (section 7.2):
In the DUCET, characters are given tertiary weights according to Table 17. The Decomposition Type is from the Unicode Character Database [UAX44]. The Case or Kana Subtype entry refers either to a case distinction or to a specific list of characters. The weights are from MIN = 2 to MAX = 1F₁₆, excluding 7, which is not used for historical reasons.
Or from section 8.2:
Users often find asymmetric searching to be a useful option. When doing an asymmetric search, a character (or grapheme cluster) in the query that is unmarked at the secondary and/or tertiary levels will match a character in the target that is either marked or unmarked at the same levels, but a character in the query that is marked at the secondary and/or tertiary levels will only match a character in the target that is marked in the same way.
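The asymmetric rule in that passage is easier to see in code. What follows is a rough Python sketch under loud assumptions: it compares whole strings rather than searching inside a target, it models only accents (roughly the secondary level) and ignores case, and it uses canonical decomposition from the standard `unicodedata` module in place of real collation weights. The function names `clusters` and `asymmetric_match` are made up for illustration.

```python
import unicodedata

def clusters(text):
    """Split text into (base character, combining marks) pairs."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and out:
            base, marks = out[-1]
            out[-1] = (base, marks + ch)   # attach the mark to the previous base
        else:
            out.append((ch, ""))
    return out

def asymmetric_match(query, target):
    """Unmarked query letters match marked or unmarked target letters;
    marked query letters match only identically marked target letters."""
    q, t = clusters(query), clusters(target)
    if len(q) != len(t):
        return False
    for (q_base, q_marks), (t_base, t_marks) in zip(q, t):
        if q_base != t_base:
            return False
        if q_marks and q_marks != t_marks:
            return False
    return True

print(asymmetric_match("resume", "résumé"))   # True: unmarked query is lenient
print(asymmetric_match("résumé", "resume"))   # False: marked query is strict
print(asymmetric_match("résumé", "résumé"))   # True
```

The asymmetry is the whole point: someone who types the plain letters probably wants every variant, while someone who bothers to type the accents probably means them.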
You may think I’m being snarky. I’m not at all. This document dives resolutely into the brambles and does not give up. It incidentally exposes just how complicated even the simplest of sorting tasks is when looked at in its full context, where that context is history, language, culture, and the ambiguity in which they thrive.