Ordering German special characters and those from other languages when sorting
I want to sort strings (text) in a software project of mine. I'm planning to do this in the lexically best way.
My set of possible characters consists of the full alphabet (a–z and A–Z), the typical Latin-1 umlauts and ß (Ä, ö, ß, ...), and characters from other Latin-1 languages such as à, á, â, ã.
(It’s technically impossible to order the data by expanding characters like Ä to Ae.)
How would one sort those characters so that humans can also look them up quickly?
- Would one look for Ä after A (I guess so)? And for é after e?
- In which order would à, á, â, ã, and ä be sorted in between a and b?
- Is there some kind of ISO standard defining such things? How would those characters be arranged?
dictionary umlaut eszett
12
In most programming languages a collation function is available which compares strings according to a locale. In C, this function is strcoll(). Java has a Collator class.
– RHa
Jun 16 at 19:12
The right way to do this is to use strxfrm() to convert your user-visible strings into encoded strings that can be compared using strcmp(). If you're sorting user records, say, and each has a "name" field, you'd then have to create a "nameSortable" field and use strxfrm() to populate it. The resulting field may or may not look much like the ASCII you put in, but it is guaranteed to sort in the right order for the current locale. And since you do the transformation only once, it should then sort quickly.
– Swiss Frank
Jun 17 at 13:45
An easier but typically lower-performance method is to use strcoll() to compare. This may be slower because in effect it internally creates buffers, calls strxfrm() to get comparable strings, compares them, and throws away the converted strings, leaving only the result. strxfrm() requires you to manage the caches yourself, but pays you back by allowing far faster comparisons.
– Swiss Frank
Jun 17 at 13:45
3
Humans? Unfortunately, it depends which humans. Swedish humans looking for these characters will look in a different place from German humans. There are many standards for collating sequences, some country-specific, some industry-specific. Whatever you do, don't go inventing another one.
– Michael Kay
Jun 17 at 16:02
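The two approaches from the comments above (compare with strcoll() each time, or transform once with strxfrm() and compare the cached keys) can be sketched in Python, whose `locale` module wraps the same C functions. This is only a sketch: the locale names tried below and the `name_sortable` field are assumptions for illustration, and which locales are installed depends on the system, so the code falls back to the "C" locale if none is available.

```python
import locale
from functools import cmp_to_key


def use_locale(preferred=("de_DE.UTF-8", "de_DE.utf8", "")):
    """Try each locale name in turn and return the first one the system accepts."""
    for name in preferred:
        try:
            locale.setlocale(locale.LC_COLLATE, name)
            return name
        except locale.Error:
            continue
    # Portable fallback: the "C" locale is always available.
    locale.setlocale(locale.LC_COLLATE, "C")
    return "C"


active = use_locale()
names = ["Zebra", "Ärger", "Apfel", "Öl", "Ofen", "Straße"]

# Approach 1: call strcoll() for every pairwise comparison.
by_strcoll = sorted(names, key=cmp_to_key(locale.strcoll))

# Approach 2: transform each string once with strxfrm(), store the result
# next to the record (hypothetical "name_sortable" field), sort on that.
records = [{"name": n, "name_sortable": locale.strxfrm(n)} for n in names]
by_strxfrm = [r["name"] for r in sorted(records, key=lambda r: r["name_sortable"])]

print(active, by_strxfrm)
```

The actual order printed depends on the locale that could be activated; the point is only that both approaches should agree, while the second one does the expensive transformation once per string instead of once per comparison.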
asked Jun 16 at 15:24 by Matthias (new contributor)
edited yesterday by Wrzlprmft♦
3 Answers
Short answer:
Take a look at MySQL and its different character collations. Choose one and follow its rules. Or, as @RHa and @cbeleites suggest, find a library that provides locale-dependent sorting.
Long answer:
There are 3 different solutions for your problem (actually there are 4, but believe me, you don't want to implement the 4th ;) )
1. Rewrite every umlaut to its base letter (German dictionary rules, DIN 5007-1 var. 1)
Every umlaut and every letter with a diacritic is treated as its base character, e.g.
áàâäã = a
ß = ss
and so on.
Then sort.
2. Rewrite every umlaut by appending an e; other diacritics are removed (German phone book rules, DIN 5007-1 var. 2)
ä = ae
áàâã = a
ü = ue
ß = ss
Then sort.
3.1. Umlauts are additional letters of the alphabet (Swedish/Finnish collation rules)
å, ä and ö are treated as separate letters, which come after the z of the alphabet. Other characters with diacritics are converted to their base.
So sort like
aáàâãbcd [...] xyzåäö
3.2. Umlauts are additional letters of the alphabet (Austrian phone book order, kudos to @rexkogitans from the comments)
As in 3.1, but umlauts and characters with diacritics are placed directly after their base characters.
aäáà [...] bcdeèé [...] uü ...
But watch out: according to Wikipedia this is true for the Austrian White Pages, whereas the Austrian Yellow Pages are sorted like DIN 5007-1 var. 1.
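The first two options can be sketched as sort-key functions. This is a minimal Python illustration of the mappings listed above, not a complete implementation of DIN 5007-1; the function names are made up for this example, and using the original spelling as a tie-breaker is an assumption.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters and drop the combining marks (é -> e, ä -> a)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def din_5007_var1_key(text):
    """Dictionary order: umlauts and accented letters count as the base letter."""
    folded = text.casefold()  # casefold() also turns ß into ss
    return strip_diacritics(folded)


# Only the German umlauts are expanded; all other diacritics are dropped.
UMLAUT_EXPANSION = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def din_5007_var2_key(text):
    """Phone book order: ä/ö/ü become ae/oe/ue, other accents are removed."""
    folded = unicodedata.normalize("NFC", text.casefold())
    return strip_diacritics(folded.translate(UMLAUT_EXPANSION))


words = ["Müller", "Mueller", "Muller", "Mühle", "Mutter"]

# The original spelling is used as a tie-breaker so the result is deterministic.
var1 = sorted(words, key=lambda w: (din_5007_var1_key(w), w))
var2 = sorted(words, key=lambda w: (din_5007_var2_key(w), w))

print(var1)  # ['Mueller', 'Mühle', 'Muller', 'Müller', 'Mutter']
print(var2)  # ['Mühle', 'Mueller', 'Müller', 'Muller', 'Mutter']
```

Note how "Mühle" moves in front of "Mueller" under var. 2, because it is compared as "muehle".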
In addition, according to EN 13710, the order of diacritics is:
- Acute accent (á)
- Grave accent (à)
- Breve (ă)
- Circumflex (â)
- Hacek (háček) (š)
- Ring (å)
- Trema (ä)
- Double acute accent (ő)
- Tilde (ã)
- Dot (ż)
- Cedilla (ş)
- Ogonek (ą)
- Macron (ā)
- With stroke through (ø)
- Modified letter(s) (æ)
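To answer the question of how à, á, â, ã and ä would be ordered between a and b, the list above can be turned into a tie-breaking sort key. The sketch below is an assumption-laden illustration: it takes the ranking exactly as listed in this answer (not checked against the text of the standard), compares base letters first and accents second, only looks at the first accent on each letter, and leaves out ø and æ because they do not decompose into a base letter plus a combining mark.

```python
import unicodedata

# Combining marks in the order of the list above (acute, grave, breve,
# circumflex, caron, ring, diaeresis, double acute, tilde, dot above,
# cedilla, ogonek, macron). Rank 0 is reserved for "no diacritic".
DIACRITIC_RANK = {
    mark: rank
    for rank, mark in enumerate(
        ["\u0301", "\u0300", "\u0306", "\u0302", "\u030C", "\u030A", "\u0308",
         "\u030B", "\u0303", "\u0307", "\u0327", "\u0328", "\u0304"],
        start=1,
    )
}
UNKNOWN_RANK = len(DIACRITIC_RANK) + 1  # marks that are not in the list


def accent_key(text):
    """Return (base letters, accent ranks) so accents only break ties."""
    base, ranks = [], []
    for ch in unicodedata.normalize("NFD", text.casefold()):
        if unicodedata.combining(ch):
            # Record only the first mark attached to the preceding letter.
            if ranks and ranks[-1] == 0:
                ranks[-1] = DIACRITIC_RANK.get(ch, UNKNOWN_RANK)
        else:
            base.append(ch)
            ranks.append(0)
    return ("".join(base), tuple(ranks))


letters = ["ã", "ä", "â", "á", "à", "a", "b"]
print(sorted(letters, key=accent_key))  # ['a', 'á', 'à', 'â', 'ä', 'ã', 'b']
```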
For further information, take a look at:
- European ordering rules (EOR / EN 13710)
- Unicode collation algorithm
- ISO 14651 (download)
One very last comment:
There are many countries, many languages and many different characters. Some are frequently used in one country but unknown in another.
Therefore there are many different standards for comparing strings alphabetically. Even though there are international standards, you have to choose which one to follow.
How? Well, there's a saying: When in Rome, do as the Romans do.
Take a look at your target group and their expectations. Then choose the collation most of them will accept.
- Spanish people? Well, ñ is an independent character sorted between n and o.
- Germans? Take a look above.
- Russians and Western Europeans? Oh boy, you are in trouble sorting all these Cyrillic and Latin characters ;)
Or choose the one you are most comfortable with.
Also ... one of my professors once said: "The first thing I do before coding is search the internet to see whether there is already a solution." And as @RHa and @cbeleites said in the comments, there are solutions (C, Java, PHP, etc.), so unless you insist ... use one of them, and you only have to worry about choosing the right locale (and looking up which sorting rules it follows).
Last and least: The 4th sorting method
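As I mentioned earlier, there is a 4th (horrible) solution. German Wikipedia describes it as
Gleichordnung von Grundbuchstaben, Doppelbuchstaben und Umlaut, wenn Doppelbuchstabe wie Umlaut gesprochen wird. Mull wird wie Muell oder Müll sortiert. Duell dagegen zwischen Duden und Dugast. (In English: base letters, double letters and umlauts are ordered identically when the double letter is pronounced like an umlaut. Mull is sorted like Muell or Müll. Duell, by contrast, goes between Duden and Dugast.)
To explain it, I would like to quote @FabioTurati from the comments: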
It says that an approach is treating plain letters, double letters and Umlauts the same way, but only when the double letters are pronounced as an Umlaut. For example, "Mull", "Muell" and "Müll" are treated the same way (note that Mull and Müll are different words!), and the "ue" in Muell here is a way to avoid typing "ü". In other cases, the letters "ue" are pronounced separately: a "u" followed by an "e", for example in "Duell", which is why Duell is placed between Duden and Dugast.
Why is this a problem?
Consider it from the other point of view: if you find a word containing "ue", how do you treat it? For Muell you should pretend the "e" is not there, and treat it as "Mull". For "Duell", doing it would lead to "Dull", which would be sorted after "Dugast", which would be wrong. So, to know how to sort words you need to have some more info (that is, how they are pronounced), which a sorting algorithm normally doesn't have. That's why this approach is troublesome!
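The Duell problem is easy to reproduce. The following Python sketch uses a deliberately naive key that collapses every "ue" without knowing how the word is pronounced, and then a second key that consults a hand-maintained exception set. Both function names and the exception set are hypothetical and only serve to illustrate why extra information is required.

```python
import unicodedata

VOWEL_GROUPS = (("ae", "ä", "a"), ("oe", "ö", "o"), ("ue", "ü", "u"))


def naive_key(word):
    """Collapse ae/oe/ue and ä/ö/ü to the plain vowel, ignoring pronunciation."""
    key = unicodedata.normalize("NFC", word.casefold())
    for digraph, umlaut, base in VOWEL_GROUPS:
        key = key.replace(digraph, base).replace(umlaut, base)
    return key


# Hypothetical list of words whose "ue" is spoken as two separate vowels.
# A real system would need a full pronunciation dictionary here.
SPOKEN_SEPARATELY = {"duell"}


def key_with_exceptions(word):
    """Same as naive_key, but leave known exceptions untouched."""
    folded = unicodedata.normalize("NFC", word.casefold())
    return folded if folded in SPOKEN_SEPARATELY else naive_key(word)


words = ["Dugast", "Duell", "Duden", "Müll", "Muell", "Mull"]

print(sorted(words, key=naive_key))
# ['Duden', 'Dugast', 'Duell', 'Müll', 'Muell', 'Mull']  (Duell is misplaced)

print(sorted(words, key=key_with_exceptions))
# ['Duden', 'Duell', 'Dugast', 'Müll', 'Muell', 'Mull']
```

The second result is only correct because the exception was entered by hand, which is exactly the additional information a generic sorting algorithm does not have.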
1
Another sorting rule: Treat ä, ö, ü as additional characters, but not appended after z, but instead after a, o, u respectively: a, ä, b, c, ..., m, n, o, ö, p, ... This is called Austrian Order.
– rexkogitans
Jun 17 at 6:11
12
So... what's the fourth?
– sgf
Jun 17 at 7:47
2
@sgf from wikipedia: "Gleichordnung von Grundbuchstaben, Doppelbuchstaben und Umlaut, wenn Doppelbuchstabe wie Umlaut gesprochen wird. Mull wird wie Muell oder Müll sortiert. Duell dagegen zwischen Duden und Dugast." which is horrifying to code and gets nasty with names like "Schröder / Schroeder", etc
– mtwde
Jun 17 at 9:50
2
Your description of the Swedish collation order is incorrect. You are right that å, ä and ö are treated as full-worthy letters and sorted after z, but all other diacritics are ignored. E.g. á is sorted as a and not after a. Until some 10 years ago, v and w were also considered equal when sorting, but w is now usually considered a 'proper' letter in itself and sorted after v. AFAIK, the same collation rules are used in Finnish.
– jarnbjo
Jun 17 at 13:06
1
@mtwde My German isn't good enough to understand that Wikipedia excerpt. Could you translate it, please?
– Nzall
Jun 18 at 7:08
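The Swedish rule as described by @jarnbjo above (å, ä and ö are letters of their own after z, all other diacritics are ignored) could look roughly like this in Python. It is a sketch of that comment only, not of the official Swedish collation; the function name is made up, and details such as the treatment of w or ü are not modelled.

```python
import unicodedata

# Positions of the three extra letters relative to "z".
SWEDISH_EXTRA_RANK = {"å": 1, "ä": 2, "ö": 3}


def swedish_key(text):
    """Return a list of integers: å, ä, ö sort after z, other accents are dropped."""
    key = []
    for ch in unicodedata.normalize("NFC", text.casefold()):
        if ch in SWEDISH_EXTRA_RANK:
            key.append(ord("z") + SWEDISH_EXTRA_RANK[ch])
        else:
            # The first code point of the decomposition is the base letter.
            key.append(ord(unicodedata.normalize("NFD", ch)[0]))
    return key


letters = ["ö", "z", "ä", "a", "å", "á", "b"]
print(sorted(letters, key=swedish_key))  # ['a', 'á', 'b', 'z', 'å', 'ä', 'ö']
```

Here a and á get the same key, so Python's stable sort simply keeps them in their input order.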
If it's not names you are dealing with, it would be best to ignore all diacritics when sorting (and count ß as ss).
The only reason to deviate from this simple system lies in the unfortunate fact that German names show unpredictable variation between ä, ö, ü and ae, oe, ue. This has led to phone books and library catalogues sorting e.g. Räder as Raeder, Örtel as Oertel, Hüber as Hueber.
Wikipedia has a good write-up.
I can only answer regarding the German characters. "Ä" is considered equivalent to "Ae", "Ö" to "Oe", "Ü" to "Ue" and "ß" to "ss". This is how those characters are sorted in a phone book.
Thank you for your answer. Sadly I cannot implement this behaviour. I'm sorry. I removed the phone book reference.
– Matthias
Jun 16 at 16:06
1
This means that it is sorted like this, not that it is written like this.
– Kami Kaze
Jun 17 at 6:43
This is how German users would expect it.
– Simon Richter
Jun 17 at 8:56
3
@SimonRichter: There are (at least) two different widely-used collation orders in Germany, and even more if you also consider other German-speaking countries such as Austria and Switzerland. So, whether or not "German users would expect it" this way depends very much on context and on whether those German users are actually from Germany or German-speaking from Austria, for example.
– Jörg W Mittag
Jun 17 at 12:47
add a comment |
Your Answer
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "253"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
noCode: true, onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Matthias is a new contributor. Be nice, and check out our Code of Conduct.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fgerman.stackexchange.com%2fquestions%2f52765%2fordering-german-special-characters-and-those-from-other-languages-when-sorting%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
Short answer:
Take a look at MySQL and different character-collations. Choose one and follow its rules. Or as @RHa and @cbeleites suggest find a library that provides locale-dependent sorting.
Long answer:
There are 3 different solutions for your problem (actually there are 4, but believe me, you don't want to realize the 4th ;) )
Rewrite every Umlaut to its base (German dictionary rules - DIN 5007-1 var. 1)
Every Umlaut and Diacritic results in the same char.
e.g.
áàâäã = a
ß = ss
and so on.
Sort them.
Rewrite every Umlaut by adding an e, diacritics are removed (German phone book rules - DIN 5007-1 var. 2)
ä = ae
áàâã = a
ü = ue
ß = ss
Sort them.
Umlaute are new chars added to the alphabet (Swedish/Finnish collation rules)
Every Umlaut and å are treated like new chars, which are added after the z of the alphabet. Other chars with diacretics are converted to their base.
So sort like
aáàâãbcd [...] xyzåäü
Umlaute are new chars added to the alphabet (Austrian phone book order - kudos to @rexkogitans from the comments)
It's as 3.1., but Umlaute and chars with Diacretics are appended to their base characters.
aäáà [...] bcdeèé [...] uü ...
But watch out, wiki says this is true for the Austrian White Pages, but the Austrian Yellow Pages are sorted like DIN 5007-1 var. 1.
In addition. According to EN 13710 the order of diacritics is
- Acute accent (á)
- Grave accent (à)
- Breve (ă)
- Circumflex (â)
- Hacek (háček) (š)
- Ring (å)
- Trema (ä)
- Double acute accent (ő)
- Tilde (ã)
- Dot (ż)
- Cedilla (ş)
- Ogonek (ą)
- Macron (ā)
- With stroke through (ø)
- Modified letter(s) (æ)
For further information take a look at
European ordering rules (EOR / EN 13710)
Unicode collation algorithm
ISO 14651 (download)
One very last comment:
There are a lot of countries and there are a lot of different languages and a lot of different characters. Some are frequently used in one country, but unknown in another.
Therefore there are a lot of different standards how to compare strings alphabetically. Even though there are international standards, you have to choose which one to follow.
How? Well, there's a saying: When in Rome, do as the Romans do.
Take a look at your target group and their expectations. Then choose the collation most of them will accept.
Spanish people? Well, ñ is an independent character sorted in
between n and o.Germans? Take a look above.
Russians and western europeans? Oh boy, you are in trouble sorting
all these cyrillic and latin characters ;)Or choose the one you are most comfortable with.
Also ... one of my professors once said: "The first thing I do before coding is searching the internet if there is already a solution." And as @RHa and @cbeleites said in the comments there are solutions (C, Java, PHP, etc.), so unless you insist ... use one of them and you only have to worry about choosing the right locale (and looking up which sorting rules they follow).
Last and least: The 4th sorting method
As I mentioned ealier there is a 4th (horrible) solution. German Wikipedia describes it as
Gleichordnung von Grundbuchstaben, Doppelbuchstaben und Umlaut, wenn Doppelbuchstabe wie Umlaut gesprochen wird. Mull wird wie Muell oder Müll sortiert. Duell dagegen zwischen Duden und Dugast.
To explain it in English I would like to quote @FabioTurati from the comments
It says that an approach is treating plain letters, double letters and Umlauts the same way, but only when the double letters are pronounced as an Umlaut. For example, "Mull", "Muell" and "Müll" are treated the same way (note that Mull and Müll are different words!), and the "ue" in Muell here is a way to avoid typing "ü". In other cases, the letters "ue" are pronounced separately: a "u" followed by an "e", for example in "Duell", which is why Duell is placed between Duden and Dugast.
Why is this a problem?
Consider it from the other point of view: if you find a word containing "ue", how do you treat it? For Muell you should pretend the "e" is not there, and treat it as "Mull". For "Duell", doing it would lead to "Dull", which would be sorted after "Dugast", which would be wrong. So, to know how to sort words you need to have some more info (that is, how they are pronounced), which a sorting algorithm normally doesn't have. That's why this approach is troublesome!
1
Another sorting rule: Treat ä, ö, ü as additional characters, but not appended after z, but instead after a, o, u respectively: a, ä, b, c, ..., m, n, o, ö, p, ... This is called Austrian Order.
– rexkogitans
Jun 17 at 6:11
12
So... what's the fourth?
– sgf
Jun 17 at 7:47
2
@sgf from wikipedia: "Gleichordnung von Grundbuchstaben, Doppelbuchstaben und Umlaut, wenn Doppelbuchstabe wie Umlaut gesprochen wird. Mull wird wie Muell oder Müll sortiert. Duell dagegen zwischen Duden und Dugast." which is horrifying to code and gets nasty with names like "Schröder / Schroeder", etc
– mtwde
Jun 17 at 9:50
2
Your description of the Swedish collation order is incorrect. You are right that å, ä and ö are treated as full-worthy letters and sorted after z, but all other diacritics are ignored. E.g. á is sorted as a and not after a. Until some 10 years ago, v and w were also considered equal when sorting, but w is now usually considered a 'proper' letter in itself and sorted after v. AFAIK, the same collation rules are used in Finnish.
– jarnbjo
Jun 17 at 13:06
1
@mtwde My German isn't good enough to understand that Wikipedia excerpt. Could you translate it, please?
– Nzall
Jun 18 at 7:08
|
show 7 more comments
Short answer:
Take a look at MySQL and different character-collations. Choose one and follow its rules. Or as @RHa and @cbeleites suggest find a library that provides locale-dependent sorting.
Long answer:
There are 3 different solutions for your problem (actually there are 4, but believe me, you don't want to realize the 4th ;) )
Rewrite every Umlaut to its base (German dictionary rules - DIN 5007-1 var. 1)
Every Umlaut and Diacritic results in the same char.
e.g.
áàâäã = a
ß = ss
and so on.
Sort them.
Rewrite every Umlaut by adding an e, diacritics are removed (German phone book rules - DIN 5007-1 var. 2)
ä = ae
áàâã = a
ü = ue
ß = ss
Sort them.
Umlaute are new chars added to the alphabet (Swedish/Finnish collation rules)
Every Umlaut and å are treated like new chars, which are added after the z of the alphabet. Other chars with diacretics are converted to their base.
So sort like
aáàâãbcd [...] xyzåäü
Umlaute are new chars added to the alphabet (Austrian phone book order - kudos to @rexkogitans from the comments)
It's as 3.1., but Umlaute and chars with Diacretics are appended to their base characters.
aäáà [...] bcdeèé [...] uü ...
But watch out, wiki says this is true for the Austrian White Pages, but the Austrian Yellow Pages are sorted like DIN 5007-1 var. 1.
In addition. According to EN 13710 the order of diacritics is
- Acute accent (á)
- Grave accent (à)
- Breve (ă)
- Circumflex (â)
- Hacek (háček) (š)
- Ring (å)
- Trema (ä)
- Double acute accent (ő)
- Tilde (ã)
- Dot (ż)
- Cedilla (ş)
- Ogonek (ą)
- Macron (ā)
- With stroke through (ø)
- Modified letter(s) (æ)
For further information take a look at
European ordering rules (EOR / EN 13710)
Unicode collation algorithm
ISO 14651 (download)
One very last comment:
There are a lot of countries and there are a lot of different languages and a lot of different characters. Some are frequently used in one country, but unknown in another.
Therefore there are a lot of different standards how to compare strings alphabetically. Even though there are international standards, you have to choose which one to follow.
How? Well, there's a saying: When in Rome, do as the Romans do.
Take a look at your target group and their expectations. Then choose the collation most of them will accept.
Spanish people? Well, ñ is an independent character sorted in
between n and o.Germans? Take a look above.
Russians and western europeans? Oh boy, you are in trouble sorting
all these cyrillic and latin characters ;)Or choose the one you are most comfortable with.
Also ... one of my professors once said: "The first thing I do before coding is searching the internet if there is already a solution." And as @RHa and @cbeleites said in the comments there are solutions (C, Java, PHP, etc.), so unless you insist ... use one of them and you only have to worry about choosing the right locale (and looking up which sorting rules they follow).
Last and least: The 4th sorting method
As I mentioned ealier there is a 4th (horrible) solution. German Wikipedia describes it as
Gleichordnung von Grundbuchstaben, Doppelbuchstaben und Umlaut, wenn Doppelbuchstabe wie Umlaut gesprochen wird. Mull wird wie Muell oder Müll sortiert. Duell dagegen zwischen Duden und Dugast.
To explain it in English I would like to quote @FabioTurati from the comments
It says that an approach is treating plain letters, double letters and Umlauts the same way, but only when the double letters are pronounced as an Umlaut. For example, "Mull", "Muell" and "Müll" are treated the same way (note that Mull and Müll are different words!), and the "ue" in Muell here is a way to avoid typing "ü". In other cases, the letters "ue" are pronounced separately: a "u" followed by an "e", for example in "Duell", which is why Duell is placed between Duden and Dugast.
Why is this a problem?
Consider it from the other point of view: if you find a word containing "ue", how do you treat it? For Muell you should pretend the "e" is not there, and treat it as "Mull". For "Duell", doing it would lead to "Dull", which would be sorted after "Dugast", which would be wrong. So, to know how to sort words you need to have some more info (that is, how they are pronounced), which a sorting algorithm normally doesn't have. That's why this approach is troublesome!
edited Jun 19 at 9:20
answered Jun 16 at 16:32
mtwde
4,452 reputation, 1 gold badge, 3 silver badges, 20 bronze badges
1
Another sorting rule: Treat ä, ö, ü as additional characters, but not appended after z, but instead after a, o, u respectively: a, ä, b, c, ..., m, n, o, ö, p, ... This is called Austrian Order.
– rexkogitans
Jun 17 at 6:11
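The ordering described in this comment could be sketched in Python as a custom alphabet. This is only an illustration: the names `AUSTRIAN_ALPHABET`, `AUSTRIAN_RANK` and `austrian_key` are made up here, the treatment of ß as ss is an assumption the comment does not cover, and the input is assumed to use precomposed characters (NFC).

```python
# Alphabet with the umlauts placed directly after their base vowels.
AUSTRIAN_ALPHABET = "aäbcdefghijklmnoöpqrstuüvwxyz"
AUSTRIAN_RANK = {letter: position for position, letter in enumerate(AUSTRIAN_ALPHABET)}


def austrian_key(word: str) -> list:
    """Map each letter to its position in AUSTRIAN_ALPHABET."""
    lowered = word.lower().replace("ß", "ss")  # assumption: ß sorts as ss
    # Characters outside the alphabet sort after all known letters.
    return [AUSTRIAN_RANK.get(ch, len(AUSTRIAN_RANK) + ord(ch)) for ch in lowered]


words = ["Pfad", "Öl", "Ozean", "Ofen", "Oper"]
print(sorted(words, key=austrian_key))
# ['Ofen', 'Oper', 'Ozean', 'Öl', 'Pfad']
```

Because ö is a letter of its own here, every word starting with "O" comes before "Öl", and "Öl" still comes before anything starting with "P".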
12
So... what's the fourth?
– sgf
Jun 17 at 7:47
2
@sgf from wikipedia: "Gleichordnung von Grundbuchstaben, Doppelbuchstaben und Umlaut, wenn Doppelbuchstabe wie Umlaut gesprochen wird. Mull wird wie Muell oder Müll sortiert. Duell dagegen zwischen Duden und Dugast." which is horrifying to code and gets nasty with names like "Schröder / Schroeder", etc
– mtwde
Jun 17 at 9:50
2
Your description of the Swedish collation order is incorrect. You are right that å, ä and ö are treated as letters in their own right and sorted after z, but all other diacritics are ignored. E.g. á is sorted as a and not after a. Until some 10 years ago, v and w were also considered equal when sorting, but w is now usually considered a 'proper' letter in itself and sorted after v. AFAIK, the same collation rules are used in Finnish.
– jarnbjo
Jun 17 at 13:06
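A rough Python sketch of the rule as this comment states it: å, ä and ö go after z, and every other diacritic is dropped. The names `SWEDISH_ALPHABET`, `SWEDISH_RANK` and `swedish_key` are invented for the example, and it only covers this first level of comparison, so "idé" and "ide" end up with equal keys.

```python
import unicodedata

# v and w are separate letters, as in the current convention the comment mentions.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {letter: position for position, letter in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word: str) -> list:
    """Rank å, ä, ö after z and ignore all other diacritics."""
    key = []
    for ch in unicodedata.normalize("NFC", word).lower():
        if ch not in SWEDISH_RANK:
            # Not a Swedish letter: keep only the base character, e.g. á -> a.
            ch = unicodedata.normalize("NFD", ch)[0]
        # Anything still unknown sorts after all known letters.
        key.append(SWEDISH_RANK.get(ch, len(SWEDISH_RANK) + ord(ch)))
    return key


words = ["öga", "zon", "åka", "ära", "idé", "ide"]
print(sorted(words, key=swedish_key))
# ['idé', 'ide', 'zon', 'åka', 'ära', 'öga']
```

A production system would normally take these rules from a collation library rather than hard-coding an alphabet.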
1
@mtwde My German isn't good enough to understand that Wikipedia excerpt. Could you translate it, please?
– Nzall
Jun 18 at 7:08
If it's not names you are dealing with, it would be best to ignore all diacritics when sorting (and count ß as ss).
The only reason to deviate from this simple system lies in the unfortunate fact that German names show unpredictable variation between ä, ö, ü and ae, oe, ue. This has led to phone books and library catalogues sorting e.g. Räder as Raeder, Örtel as Oertel, Hüber as Hueber.
Wikipedia has a good write-up.
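Both rules from this answer can be expressed as sort keys. The following Python sketch uses only the standard `unicodedata` module; the function names are made up for illustration, and it assumes that treating upper and lower case alike is acceptable.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks: ä -> a, é -> e, ñ -> n."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def dictionary_key(word: str) -> str:
    """Ignore all diacritics; casefold() also turns ß into ss."""
    return strip_diacritics(word).casefold()


def phone_book_key(name: str) -> str:
    """Expand ä/ö/ü to ae/oe/ue first, then ignore the remaining diacritics."""
    key = unicodedata.normalize("NFC", name).casefold()
    for umlaut, digraph in (("ä", "ae"), ("ö", "oe"), ("ü", "ue")):
        key = key.replace(umlaut, digraph)
    return strip_diacritics(key)


names = ["Rauch", "Räder", "Raeder", "Radler", "Rad"]
print(sorted(names, key=dictionary_key))
# ['Rad', 'Räder', 'Radler', 'Raeder', 'Rauch']
print(sorted(names, key=phone_book_key))
# ['Rad', 'Radler', 'Räder', 'Raeder', 'Rauch']
```

With the phone-book key, "Räder" and "Raeder" receive the same key and therefore end up next to each other, which is the point of that rule for names.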
answered Jun 16 at 16:11
David Vogt
6,843 reputation, 1 gold badge, 6 silver badges, 36 bronze badges
I can only answer with regard to the German characters. "Ä" is considered equivalent to "Ae", "Ö" to "Oe", "Ü" to "Ue" and "ß" to "ss". This is how those characters are sorted in a phone book.
answered Jun 16 at 15:40
ziganotschka
27 reputation, 1 bronze badge
Thank you for your answer. Sadly I cannot implement this behaviour. I'm sorry. I removed the phone book reference.
– Matthias
Jun 16 at 16:06
1
This means that it is sorted like this, not that it is written like this.
– Kami Kaze
Jun 17 at 6:43
This is how German users would expect it.
– Simon Richter
Jun 17 at 8:56
3
@SimonRichter: There are (at least) two different widely-used collation orders in Germany, and even more if you also consider other German-speaking countries such as Austria and Switzerland. So, whether or not "German users would expect it" this way depends very much on context and on whether those German users are actually from Germany or German-speaking from Austria, for example.
– Jörg W Mittag
Jun 17 at 12:47
12
In most programming languages a collation function is available which compares strings according to a locale. In C, this function is strcoll(). Java has a Collator class.
– RHa
Jun 16 at 19:12
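Python exposes the same C function as `locale.strcoll`, so the idea from this comment can be sketched as follows. The helper name `sort_with_locale` is made up, and the locale names `de_DE.UTF-8` and `sv_SE.UTF-8` are assumptions: which names exist, and which order they produce, depends on the operating system.

```python
import functools
import locale


def sort_with_locale(words, locale_name):
    """Sort words with the C library's collation rules for locale_name."""
    previous = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(words, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # always restore it


words = ["Zebra", "Öl", "Ofen", "Apfel"]
for name in ("de_DE.UTF-8", "sv_SE.UTF-8"):  # assumed names, platform-dependent
    try:
        print(name, sort_with_locale(words, name))
    except locale.Error:
        print(name, "is not installed on this system")
```

Note that `setlocale` changes a process-wide setting, so this pattern is best kept out of multi-threaded code.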
The right way to do this is to use strxfrm() to convert your user-visible strings into encoded strings that can be compared using strcmp(). If you're sorting user records, say, and each has a "name" field, you'd have to then create a "nameSortable" field and use strxfrm() to populate it. That resulting field may or may not look much like the ASCII you put in but is guaranteed to sort in the right order for the current locale. And since you do the transformation once, it should then sort quickly.
– Swiss Frank
Jun 17 at 13:45
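The "nameSortable" idea might look like this in Python, which wraps the C function as `locale.strxfrm`. The `UserRecord` class and the helper names are hypothetical; the sketch also assumes the collation locale has already been set and does not change afterwards, because the stored keys are only meaningful for the locale that produced them.

```python
import locale
from dataclasses import dataclass


@dataclass
class UserRecord:
    name: str
    name_sortable: str = ""  # filled in once by add_sort_keys()


def add_sort_keys(records):
    """Store the transformed name so later sorts are plain string compares."""
    for record in records:
        record.name_sortable = locale.strxfrm(record.name)


def sorted_by_name(records):
    """Sort on the precomputed key; no collation call is needed here."""
    return sorted(records, key=lambda record: record.name_sortable)


records = [UserRecord("Meier"), UserRecord("Bauer"), UserRecord("Huber")]
add_sort_keys(records)
print([record.name for record in sorted_by_name(records)])
```

The transformation cost is paid once per record, so repeated sorts only compare the cached strings.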
An easier but typically lower-performance method is to use strcoll() to compare. This may be slower because in effect it's internally creating buffers, calling strxfrm() to get comparable strings, comparing them, and throwing away the results of the conversion, leaving only the comparison result. strxfrm() requires you to manage the caches, but pays you back by allowing far faster compares.
– Swiss Frank
Jun 17 at 13:45
3
Humans? Unfortunately, it depends which humans. Swedish humans looking for these characters will look in a different place from German humans. There are many standards for collating sequences, some country-specific, some industry-specific. Whatever you do, don't go inventing another one.
– Michael Kay
Jun 17 at 16:02