Welcome!

Blog Feed Post

[eim] Alphabetical order explained in a mere 27,817 words

This is one of the most amazing examples I’ve seen of the complexity of even simple organizational schemes. “Unicode Collation Algorithm (Unicode Technical Standard #10)” spells out in precise detail how to sort strings in what we might colloquially call “alphabetical order.” But it’s way, way, way more complex than that.

Unicode is an international standard for how strings of characters get represented within computing systems. For example, in the familiar ASCII encoding, the letter “A” is represented in computers by the number 65. But ASCII is too limited to encode the world’s alphabets. Unicode does the job.

As the paper says, “Collation is the general term for the process and function of determining the sorting order of strings of characters” so that, for example, users can look them up on a list. Alphabetical order is a simple form of collation.

Sorting inconsistent alphabets is, well, a problem. But let Technical Standard #10 explain the problem:

It is important to ensure that collation meets user expectations as fully as possible. For example, in the majority of Latin languages, ø sorts as an accented variant of o, meaning that most users would expect ø alongside o. However, a few languages, such as Norwegian and Danish, sort ø as a unique element after z. Sorting “Søren” after “Sylt” in a long list, as would be expected in Norwegian or Danish, will cause problems if the user expects ø as a variant of o. A user will look for “Søren” between “Sorem” and “Soret”, not see it in the selection, and assume the string is missing, confused because it was sorted in a completely different location.

Heck, some French dictionaries even sort their accents in reverse order. (See Section 1.3.)

But that’s nothing. Here’s a fairly random paragraph from further into this magnificent document (section 7.2):

In the DUCET, characters are given tertiary weights according to Table 17. The Decomposition Type is from the Unicode Character Database [UAX44]. The Case or Kana Subtype entry refers either to a case distinction or to a specific list of characters. The weights are from MIN = 2 to MAX = 1F16, excluding 7, which is not used for historical reasons.

Or from section 8.2:

Users often find asymmetric searching to be a useful option. When doing an asymmetric search, a character (or grapheme cluster) in the query that is unmarked at the secondary and/or tertiary levels will match a character in the target that is either marked or unmarked at the same levels, but a character in the query that is marked at the secondary and/or tertiary levels will only match a character in the target that is marked in the same way.

You may think I’m being snarky. I’m not at all. This document dives resolutely into the brambles and does not give up. It incidentally exposes just how complicated even the simplest of sorting tasks is when looked at in their full context, where that context is history, language, culture, and the ambiguity in which they thrive.

Read the original blog entry...

More Stories By David Weinberger

David is the author of JOHO the blog (www.hyperorg.com/blogger). He is an independent marketing consultant and a frequent speaker at various conferences. "All I can promise is that I will be honest with you and never write something I don't believe in because someone is paying me as part of a relationship you don't know about. Put differently: All I'll hide are the irrelevancies."

Latest Stories
Much of the value of DevOps comes from a (renewed) focus on measurement, sharing, and continuous feedback loops. In increasingly complex DevOps workflows and environments, and especially in larger, regulated, or more crystallized organizations, these core concepts become even more critical. In his session at @DevOpsSummit at 18th Cloud Expo, Andi Mann, Chief Technology Advocate at Splunk, showed how, by focusing on 'metrics that matter,' you can provide objective, transparent, and meaningful f...
SYS-CON Events announced today that ReadyTalk, a leading provider of online conferencing and webinar services, has been named Vendor Presentation Sponsor at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. ReadyTalk delivers audio and web conferencing services that inspire collaboration and enable the Future of Work for today’s increasingly digital and mobile workforce. By combining intuitive, innovative tec...
Big Data has been changing the world. IoT fuels the further transformation recently. How are Big Data and IoT related? In his session at @BigDataExpo, Tony Shan, a renowned visionary and thought leader, will explore the interplay of Big Data and IoT. He will anatomize Big Data and IoT separately in terms of what, which, why, where, when, who, how and how much. He will then analyze the relationship between IoT and Big Data, specifically the drilldown of how the 4Vs of Big Data (Volume, Variety,...
SYS-CON Events announced today that Bsquare has been named “Silver Sponsor” of SYS-CON's @ThingsExpo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. For more than two decades, Bsquare has helped its customers extract business value from a broad array of physical assets by making them intelligent, connecting them, and using the data they generate to optimize business processes.
Cognitive Computing is becoming the foundation for a new generation of solutions that have the potential to transform business. Unlike traditional approaches to building solutions, a cognitive computing approach allows the data to help determine the way applications are designed. This contrasts with conventional software development that begins with defining logic based on the current way a business operates. In her session at 18th Cloud Expo, Judith S. Hurwitz, President and CEO of Hurwitz & ...
Kubernetes is a new and revolutionary open-sourced system for managing containers across multiple hosts in a cluster. Ansible is a simple IT automation tool for just about any requirement for reproducible environments. In his session at @DevOpsSummit at 18th Cloud Expo, Patrick Galbraith, a principal engineer at HPE, discussed how to build a fully functional Kubernetes cluster on a number of virtual machines or bare-metal hosts. Also included will be a brief demonstration of running a Galera M...
WebRTC adoption has generated a wave of creative uses of communications and collaboration through websites, sales apps, customer care and business applications. As WebRTC has become more mainstream it has evolved to use cases beyond the original peer-to-peer case, which has led to a repeating requirement for interoperability with existing infrastructures. In his session at @ThingsExpo, Graham Holt, Executive Vice President of Daitan Group, will cover implementation examples that have enabled ea...
In his session at @DevOpsSummit at 19th Cloud Expo, Robert Doyle, lead architect at eCube Systems, will examine the issues and need for an agile infrastructure and show the advantages of capturing developer knowledge in an exportable file for migration into production. He will introduce the use of NXTmonitor, a next-generation DevOps tool that captures application environments, dependencies and start/stop procedures in a portable configuration file with an easy-to-use GUI. In addition to captu...
SYS-CON Events announced today that Tintri Inc., a leading producer of VM-aware storage (VAS) for virtualization and cloud environments, will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Tintri VM-aware storage is the simplest for virtualized applications and cloud. Organizations including GE, Toyota, United Healthcare, NASA and 6 of the Fortune 15 have said “No to LUNs.” With Tintri they mana...
SYS-CON Events announced today that Interface Masters Technologies, a leader in Network Visibility and Uptime Solutions, will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Interface Masters Technologies is a leading vendor in the network monitoring and high speed networking markets. Based in the heart of Silicon Valley, Interface Masters' expertise lies in Gigabit, 10 Gigabit and 40 Gigabit Eth...
The vision of a connected smart home is becoming reality with the application of integrated wireless technologies in devices and appliances. The use of standardized and TCP/IP networked wireless technologies in line-powered and battery operated sensors and controls has led to the adoption of radios in the 2.4GHz band, including Wi-Fi, BT/BLE and 802.15.4 applied ZigBee and Thread. This is driving the need for robust wireless coexistence for multiple radios to ensure throughput performance and th...
Vidyo, Inc., has joined the Alliance for Open Media. The Alliance for Open Media is a non-profit organization working to define and develop media technologies that address the need for an open standard for video compression and delivery over the web. As a member of the Alliance, Vidyo will collaborate with industry leaders in pursuit of an open and royalty-free AOMedia Video codec, AV1. Vidyo’s contributions to the organization will bring to bear its long history of expertise in codec technolo...
SYS-CON Events announced today that Commvault, a global leader in enterprise data protection and information management, has been named “Bronze Sponsor” of SYS-CON's 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Commvault is a leading provider of data protection and information management solutions, helping companies worldwide activate their data to drive more value and business insight and to transform moder...
If you’re responsible for an application that depends on the data or functionality of various IoT endpoints – either sensors or devices – your brand reputation depends on the security, reliability, and compliance of its many integrated parts. If your application fails to deliver the expected business results, your customers and partners won't care if that failure stems from the code you developed or from a component that you integrated. What can you do to ensure that the endpoints work as expect...
An IoT product’s log files speak volumes about what’s happening with your products in the field, pinpointing current and potential issues, and enabling you to predict failures and save millions of dollars in inventory. But until recently, no one knew how to listen. In his session at @ThingsExpo, Dan Gettens, Chief Research Officer at OnProcess, will discuss recent research by Massachusetts Institute of Technology and OnProcess Technology, where MIT created a new, breakthrough analytics model f...