Ordering German special characters and those from other languages when sorting


I want to sort strings (text) in a software project of mine. I'm planning to do this in the lexically best way.
My set of possible characters consists of the full alphabet (a–z and A–Z), the typical Latin-1 Umlauts and related characters such as Ä, ö, and ß, and also characters from other Latin-1 languages such as à, á, â, ã.
(It’s technically impossible to order the data by expanding characters like Ä to Ae.)

How would one sort those characters so that humans could also look them up quickly?

Would one look for Ä after A (I guess so), and for é after e?

In which order would à, á, â, ã, and ä be sorted in between a and b?

Is there some kind of ISO standard defining such things? How would those characters be arranged?

edited yesterday

Wrzlprmft♦


asked Jun 16 at 15:24

Matthias


12

In most programming languages a collation function is available which compares strings according to a locale. In C, this function is strcoll(). Java has a Collator class.

– RHa
Jun 16 at 19:12

The right way to do this is to use strxfrm() to convert your user-visible strings into encoded strings that can be compared using strcmp(). If you're sorting user records, say, and each has a "name" field, you'd have to then create a "nameSortable" field and use strxfrm to populate it. That resulting field may or may not look much like the ASCII you put in but is guaranteed to sort in the right order for the current locale. And since you do the transformation once, it should then sort quickly.

– Swiss Frank
Jun 17 at 13:45

An easier but typically lower-performance method is to use strcoll() to compare. This may be slower because in effect it's internally creating buffers, calling strxfrm() to get comparable strings, comparing them, and throwing away the results of the conversion, leaving only the result of the comparison. strxfrm() requires you to manage the caches yourself, but pays you back by allowing far faster compares.

– Swiss Frank
Jun 17 at 13:45

3

Humans? Unfortunately, it depends which humans. Swedish humans looking for these characters will look in a different place from German humans. There are many standards for collating sequences, some country-specific, some industry-specific. Whatever you do, don't go inventing another one.

– Michael Kay
Jun 17 at 16:02
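
The two approaches from the comments above (transform once with strxfrm(), or compare pairwise with strcoll()) can be sketched in Python, whose standard `locale` module wraps both C functions. This is only a rough illustration: the locale names tried here are assumptions about what might be installed on a given system, the helper names are made up for this example, and the resulting order depends entirely on the locale that is actually active.

```python
import locale
from functools import cmp_to_key


def set_collation_locale(candidates=("de_DE.UTF-8", "de_DE.utf8", "")):
    """Try each locale name in turn; fall back to the portable "C" locale.

    The candidate names are assumptions; which ones exist is system-specific.
    """
    for name in candidates:
        try:
            return locale.setlocale(locale.LC_COLLATE, name)
        except locale.Error:
            continue
    return locale.setlocale(locale.LC_COLLATE, "C")


def sort_with_strxfrm(words):
    """Transform each string once, then let the sort compare the keys."""
    return sorted(words, key=locale.strxfrm)


def sort_with_strcoll(words):
    """Call the locale-aware comparison for every pair the sort inspects."""
    return sorted(words, key=cmp_to_key(locale.strcoll))


if __name__ == "__main__":
    print("collating with:", set_collation_locale())
    print(sort_with_strxfrm(["Zebra", "Äpfel", "Apfel", "Öl", "Ofen"]))
```

Using strxfrm as the sort key corresponds to the precomputed "nameSortable" field described above: each string is transformed once instead of on every comparison.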


dictionary umlaut eszett


3 Answers

Short answer:

Take a look at MySQL and its different character collations. Choose one and follow its rules. Or, as @RHa and @cbeleites suggest, find a library that provides locale-dependent sorting.

Long answer:

There are 3 different solutions to your problem (actually there are 4, but believe me, you don't want to implement the 4th ;) )

1. Rewrite every Umlaut to its base (German dictionary rules - DIN 5007-1 var. 1)

  Every Umlaut and diacritic maps to its base character, e.g.

  áàâäã = a

  ß = ss

  and so on.

  Sort them.

2. Rewrite every Umlaut by adding an e; other diacritics are removed (German phone book rules - DIN 5007-1 var. 2)

  ä = ae

  áàâã = a

  ü = ue

  ß = ss

  Sort them.

3. Umlauts are new chars added to the alphabet

  3.1. Swedish/Finnish collation rules

  Every Umlaut and å are treated like new chars, which are added after the z of the alphabet. Other chars with diacritics are converted to their base.

  So sort like

  aáàâãbcd [...] xyzåäü

  3.2. Austrian phone book order (kudos to @rexkogitans from the comments)

  It's like 3.1., but Umlauts and chars with diacritics are placed directly after their base characters.

  aäáà [...] bcdeèé [...] uü ...

  But watch out: Wikipedia says this is true for the Austrian White Pages, but the Austrian Yellow Pages are sorted like DIN 5007-1 var. 1.
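
The first two variants above can be sketched as plain sort keys in Python. This is a minimal illustration rather than a complete DIN 5007-1 implementation: the function names are made up for this example, case is ignored via `casefold()` (which also turns ß into ss), and words that fold to the same key simply keep their input order because Python's sort is stable.

```python
import unicodedata

# Variant 2 only: umlauts are expanded before the remaining accents are stripped.
UMLAUT_EXPANSION = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def strip_diacritics(text):
    """Decompose characters and drop the combining marks (ä -> a, é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def din_5007_var1_key(text):
    """Dictionary style: every umlaut or accent folds to its base letter."""
    folded = unicodedata.normalize("NFC", text).casefold()  # ß becomes ss here
    return strip_diacritics(folded)


def din_5007_var2_key(text):
    """Phone book style: ä/ö/ü become ae/oe/ue, other accents are dropped."""
    folded = unicodedata.normalize("NFC", text).casefold()
    return strip_diacritics(folded.translate(UMLAUT_EXPANSION))


if __name__ == "__main__":
    names = ["Müller", "Mueller", "Muller", "Mufti"]
    print(sorted(names, key=din_5007_var1_key))
    print(sorted(names, key=din_5007_var2_key))
```

With variant 1 "Müller" sorts together with "Muller"; with variant 2 it sorts together with "Mueller", which is the behaviour phone books want for names.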

In addition, according to EN 13710, the order of diacritics is:

1. Acute accent (á)
2. Grave accent (à)
3. Breve (ă)
4. Circumflex (â)
5. Hacek (háček) (š)
6. Ring (å)
7. Trema (ä)
8. Double acute accent (ő)
9. Tilde (ã)
10. Dot (ż)
11. Cedilla (ş)
12. Ogonek (ą)
13. Macron (ā)
14. With stroke through (ø)
15. Modified letter(s) (æ)
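
Applied to the question's à, á, â, ã and ä, the list above can be turned into a two-level sort key: compare base letters first and use the diacritic rank only as a tie-breaker. The sketch below is a simplification and not an implementation of EN 13710 itself. The rank table just mirrors the list, letters with a stroke (ø) and modified letters (æ) are left as they are because Unicode does not decompose them, and the function name is made up for this example.

```python
import unicodedata

# Combining marks in the order of the list above (1 = sorts first).
DIACRITIC_RANK = {
    "\u0301": 1,   # acute
    "\u0300": 2,   # grave
    "\u0306": 3,   # breve
    "\u0302": 4,   # circumflex
    "\u030C": 5,   # hacek (caron)
    "\u030A": 6,   # ring
    "\u0308": 7,   # trema (diaeresis)
    "\u030B": 8,   # double acute
    "\u0303": 9,   # tilde
    "\u0307": 10,  # dot above
    "\u0327": 11,  # cedilla
    "\u0328": 12,  # ogonek
    "\u0304": 13,  # macron
}
UNKNOWN_MARK = len(DIACRITIC_RANK) + 1


def diacritic_order_key(text):
    """Return (base letters, per-letter diacritic ranks) as a sort key."""
    base_letters = []
    ranks = []
    for ch in unicodedata.normalize("NFD", text.casefold()):
        if unicodedata.combining(ch):
            if ranks:  # attach the mark to the preceding base letter
                ranks[-1] = DIACRITIC_RANK.get(ch, UNKNOWN_MARK)
        else:
            base_letters.append(ch)
            ranks.append(0)  # 0 = no diacritic, sorts before any accent
    return ("".join(base_letters), tuple(ranks))


if __name__ == "__main__":
    print(sorted(["ä", "ã", "a", "â", "à", "á", "b"], key=diacritic_order_key))
```

Under these assumptions the plain a comes first, followed by á, à, â, ä and ã, and only then b.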

For further information take a look at

European ordering rules (EOR / EN 13710)

Unicode collation algorithm

ISO 14651 (download)

One very last comment:

There are a lot of countries and there are a lot of different languages and a lot of different characters. Some are frequently used in one country, but unknown in another.

Therefore there are a lot of different standards for how to compare strings alphabetically. Even though there are international standards, you have to choose which one to follow.

How? Well, there's a saying: When in Rome, do as the Romans do.

Take a look at your target group and their expectations. Then choose the collation most of them will accept.

Spanish people? Well, ñ is an independent character sorted in between n and o.

Germans? Take a look above.

Russians and Western Europeans? Oh boy, you are in trouble sorting all these Cyrillic and Latin characters ;)

Or choose the one you are most comfortable with.

Also ... one of my professors once said: "The first thing I do before coding is searching the internet if there is already a solution." And as @RHa and @cbeleites said in the comments there are solutions (C, Java, PHP, etc.), so unless you insist ... use one of them and you only have to worry about choosing the right locale (and looking up which sorting rules they follow).

Last and least: The 4th sorting method

As I mentioned earlier, there is a 4th (horrible) solution. German Wikipedia describes it as

Gleichordnung von Grundbuchstaben, Doppelbuchstaben und Umlaut, wenn Doppelbuchstabe wie Umlaut gesprochen wird. Mull wird wie Muell oder Müll sortiert. Duell dagegen zwischen Duden und Dugast.

To explain it in English I would like to quote @FabioTurati from the comments

It says that an approach is treating plain letters, double letters and Umlauts the same way, but only when the double letters are pronounced as an Umlaut. For example, "Mull", "Muell" and "Müll" are treated the same way (note that Mull and Müll are different words!), and the "ue" in Muell here is a way to avoid typing "ü". In other cases, the letters "ue" are pronounced separately: a "u" followed by an "e", for example in "Duell", which is why Duell is placed between Duden and Dugast.

Why is this a problem?

Consider it from the other point of view: if you find a word containing "ue", how do you treat it? For Muell you should pretend the "e" is not there, and treat it as "Mull". For "Duell", doing it would lead to "Dull", which would be sorted after "Dugast", which would be wrong. So, to know how to sort words you need to have some more info (that is, how they are pronounced), which a sorting algorithm normally doesn't have. That's why this approach is troublesome!
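
The Duell problem described above can be reproduced in a few lines. The sketch handles only ü/ue to stay short, and the set of words in which "ue" stands for "ü" is a hypothetical stand-in for the pronunciation knowledge that a sorting algorithm normally does not have.

```python
# Hypothetical lookup: words whose "ue" is just a substitute spelling of "ü".
UE_SPELLS_UMLAUT = {"muell"}


def naive_key(word):
    """Collapse every "ue" and "ü" to "u", regardless of pronunciation."""
    return word.casefold().replace("ü", "u").replace("ue", "u")


def pronunciation_aware_key(word, ue_spells_umlaut=UE_SPELLS_UMLAUT):
    """Collapse "ue" only for words known to use it as a spelling of "ü"."""
    folded = word.casefold()
    if folded in ue_spells_umlaut:
        folded = folded.replace("ue", "u")
    return folded.replace("ü", "u")


if __name__ == "__main__":
    words = ["Dugast", "Duell", "Duden"]
    print(sorted(words, key=naive_key))                # Duell ends up last
    print(sorted(words, key=pronunciation_aware_key))  # Duell between the others
```

The second key only works because of the lookup set, which would have to be maintained by hand for every affected word and name.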

edited Jun 19 at 9:20

answered Jun 16 at 16:32

mtwde


1

Another sorting rule: Treat ä, ö, ü as additional characters, but not appended after z, but instead after a, o, u respectively: a, ä, b, c, ..., m, n, o, ö, p, ... This is called Austrian Order.

– rexkogitans
Jun 17 at 6:11

12

So... what's the fourth?

– sgf
Jun 17 at 7:47

2

@sgf from wikipedia: "Gleichordnung von Grundbuchstaben, Doppelbuchstaben und Umlaut, wenn Doppelbuchstabe wie Umlaut gesprochen wird. Mull wird wie Muell oder Müll sortiert. Duell dagegen zwischen Duden und Dugast." which is horrifying to code and gets nasty with names like "Schröder / Schroeder", etc

– mtwde
Jun 17 at 9:50

2

Your description of the Swedish collation order is incorrect. You are right that å, ä and ö are treated as full-worthy letters and sorted after z, but all other diacritics are ignored. E.g. á is sorted as a and not after a. Until some 10 years ago, v and w were also considered equal when sorting, but w is now usually considered a 'proper' letter in itself and sorted after v. AFAIK, the same collation rules are used in Finnish.

– jarnbjo
Jun 17 at 13:06

1

@mtwde My German isn't good enough to understand that Wikipedia excerpt. Could you translate it, please?

– Nzall
Jun 18 at 7:08


If it's not names you are dealing with, it would be best to ignore all diacritics when sorting (and count ß as ss).

The only reason to deviate from this simple system lies in the unfortunate fact that German names show unpredictable variation between ä, ö, ü and ae, oe, ue. This has led to phone books and library catalogues sorting e.g. Räder as Raeder, Örtel as Oertel, Hüber as Hueber.

Wikipedia has a good write-up.

answered Jun 16 at 16:11

David Vogt


I can only answer regarding the German characters. "Ä" is considered equivalent to "Ae", "Ö" to "Oe", "Ü" to "Ue" and "ß" to "ss". This is how those characters are sorted in a phone book.

answered Jun 16 at 15:40

ziganotschka


Thank you for your answer. Sadly I cannot implement this behaviour. I'm sorry. I removed the phone book reference.

– Matthias
Jun 16 at 16:06

1

This means that it is sorted like this, not that it is written like this.

– Kami Kaze
Jun 17 at 6:43

This is how German users would expect it.

– Simon Richter
Jun 17 at 8:56

3

@SimonRichter: There are (at least) two different widely-used collation orders in Germany, and even more if you also consider other German-speaking countries such as Austria and Switzerland. So, whether or not "German users would expect it" this way depends very much on context and on whether those German users are actually from Germany or German-speaking from Austria, for example.

– Jörg W Mittag
Jun 17 at 12:47
