Why is this code 6.5x slower with optimizations enabled?With arrays, why is it the case that a[5] == 5[a]?Why is the Android emulator so slow? How can we speed up the Android emulator?Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?Why are elementwise additions much faster in separate loops than in a combined loop?What is “:-!!” in C code?Why does changing 0.1f to 0 slow down performance by 10x?Why is it faster to process a sorted array than an unsorted array?Why does the C preprocessor interpret the word “linux” as the constant “1”?Why is printing “B” dramatically slower than printing “#”?Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?
LWC and complex parameters
Are white and non-white police officers equally likely to kill black suspects?
Why is my log file so massive? 22gb. I am running log backups
Is there a name of the flying bionic bird?
How is it possible for user's password to be changed after storage was encrypted? (on OS X, Android)
How can I plot a Farey diagram?
Could Giant Ground Sloths have been a good pack animal for the ancient Mayans?
What causes the sudden spool-up sound from an F-16 when enabling afterburner?
I see my dog run
Does the average primeness of natural numbers tend to zero?
Prime joint compound before latex paint?
What are the advantages and disadvantages of running one shots compared to campaigns?
What is GPS' 19 year rollover and does it present a cybersecurity issue?
Re-submission of rejected manuscript without informing co-authors
What is the meaning of "of trouble" in the following sentence?
Add an angle to a sphere
How to make payment on the internet without leaving a money trail?
Landlord wants to switch my lease to a "Land contract" to "get back at the city"
I’m planning on buying a laser printer but concerned about the life cycle of toner in the machine
If a centaur druid Wild Shapes into a Giant Elk, do their Charge features stack?
Ideas for 3rd eye abilities
extract characters between two commas?
Manga about a female worker who got dragged into another world together with this high school girl and she was just told she's not needed anymore
Is Social Media Science Fiction?
Why is this code 6.5x slower with optimizations enabled?
With arrays, why is it the case that a[5] == 5[a]?Why is the Android emulator so slow? How can we speed up the Android emulator?Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?Why are elementwise additions much faster in separate loops than in a combined loop?What is “:-!!” in C code?Why does changing 0.1f to 0 slow down performance by 10x?Why is it faster to process a sorted array than an unsorted array?Why does the C preprocessor interpret the word “linux” as the constant “1”?Why is printing “B” dramatically slower than printing “#”?Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty height:90px;width:728px;box-sizing:border-box;
I wanted to benchmark glibc
's strlen
function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.
Here's my code:
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main()
char *s = calloc(1 << 20, 1);
memset(s, 65, 1000000);
clock_t start = clock();
for (int i = 0; i < 128; ++i)
s[strlen(s)] = 'A';
clock_t end = clock();
printf("%lldn", (long long)(end-start));
return 0;
On my machine it outputs:
$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415
Somehow, enabling optimizations causes it to execute longer.
c performance gcc glibc
|
show 5 more comments
I wanted to benchmark glibc
's strlen
function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.
Here's my code:
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main()
char *s = calloc(1 << 20, 1);
memset(s, 65, 1000000);
clock_t start = clock();
for (int i = 0; i < 128; ++i)
s[strlen(s)] = 'A';
clock_t end = clock();
printf("%lldn", (long long)(end-start));
return 0;
On my machine it outputs:
$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415
Somehow, enabling optimizations causes it to execute longer.
c performance gcc glibc
With gcc-8.2 debug version takes 51334, release 8246. Release compiler options-O3 -march=native -DNDEBUG
– Maxim Egorushkin
yesterday
1
Please report it to gcc's bugzilla.
– Marc Glisse
yesterday
1
Using-fno-builtin
makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtinstrlen
is slower than the library's.
– David Schwartz
yesterday
1
It is generatingrepnz scasb
for strlen at -O1.
– Marc Glisse
yesterday
3
@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809
– Justin
yesterday
|
show 5 more comments
I wanted to benchmark glibc
's strlen
function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.
Here's my code:
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main()
char *s = calloc(1 << 20, 1);
memset(s, 65, 1000000);
clock_t start = clock();
for (int i = 0; i < 128; ++i)
s[strlen(s)] = 'A';
clock_t end = clock();
printf("%lldn", (long long)(end-start));
return 0;
On my machine it outputs:
$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415
Somehow, enabling optimizations causes it to execute longer.
c performance gcc glibc
I wanted to benchmark glibc
's strlen
function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.
Here's my code:
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main()
char *s = calloc(1 << 20, 1);
memset(s, 65, 1000000);
clock_t start = clock();
for (int i = 0; i < 128; ++i)
s[strlen(s)] = 'A';
clock_t end = clock();
printf("%lldn", (long long)(end-start));
return 0;
On my machine it outputs:
$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415
Somehow, enabling optimizations causes it to execute longer.
c performance gcc glibc
c performance gcc glibc
edited 21 hours ago
Muntasir
6321919
6321919
asked yesterday
TsarNTsarN
17328
17328
With gcc-8.2 debug version takes 51334, release 8246. Release compiler options-O3 -march=native -DNDEBUG
– Maxim Egorushkin
yesterday
1
Please report it to gcc's bugzilla.
– Marc Glisse
yesterday
1
Using-fno-builtin
makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtinstrlen
is slower than the library's.
– David Schwartz
yesterday
1
It is generatingrepnz scasb
for strlen at -O1.
– Marc Glisse
yesterday
3
@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809
– Justin
yesterday
|
show 5 more comments
With gcc-8.2 debug version takes 51334, release 8246. Release compiler options-O3 -march=native -DNDEBUG
– Maxim Egorushkin
yesterday
1
Please report it to gcc's bugzilla.
– Marc Glisse
yesterday
1
Using-fno-builtin
makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtinstrlen
is slower than the library's.
– David Schwartz
yesterday
1
It is generatingrepnz scasb
for strlen at -O1.
– Marc Glisse
yesterday
3
@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809
– Justin
yesterday
With gcc-8.2 debug version takes 51334, release 8246. Release compiler options
-O3 -march=native -DNDEBUG
– Maxim Egorushkin
yesterday
With gcc-8.2 debug version takes 51334, release 8246. Release compiler options
-O3 -march=native -DNDEBUG
– Maxim Egorushkin
yesterday
1
1
Please report it to gcc's bugzilla.
– Marc Glisse
yesterday
Please report it to gcc's bugzilla.
– Marc Glisse
yesterday
1
1
Using
-fno-builtin
makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtin strlen
is slower than the library's.– David Schwartz
yesterday
Using
-fno-builtin
makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtin strlen
is slower than the library's.– David Schwartz
yesterday
1
1
It is generating
repnz scasb
for strlen at -O1.– Marc Glisse
yesterday
It is generating
repnz scasb
for strlen at -O1.– Marc Glisse
yesterday
3
3
@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809
– Justin
yesterday
@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809
– Justin
yesterday
|
show 5 more comments
1 Answer
1
active
oldest
votes
Testing your code on Godbolt's Compiler Explorer provides this explanation:
- at
-O0
or without optimisations, the generated code call the C library functionstrlen
- at
-O1
the generated code uses a simple inline expansion using arep scasb
instruction. - at
-O2
and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:
- the
-O1
code is much slower than the C library implementation:32240
vs3090
- the
-O2
code is faster than the-O1
but still substantially slower than the C ibrary code:8570
vs3090
.
This behavior is specific to gcc
and the glibc
. The same test on OS/X with clang
and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang
generates a call to the C library strlen
at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen
has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I updated the benchmark for smaller strings and it shows similar performance for string lengths varying from 0
to 100
at -O0
and -O2
but still a much worse performance at -O1
, 3 times slower.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen)
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++)
for (int i = minlen; i < maxlen; ++i)
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';
clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);
int main()
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
Wouldn't it still be better for the inlined version to use the same optimizations as the librarystrlen
, giving the best of both worlds?
– Daniel H
yesterday
2
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in<string.h>
and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.
– chqrlie
yesterday
Does it change if you use-march=native -mtune=native
?
– Deduplicator
yesterday
1
@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. Ifstrlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use
strlen(); and (for people that care about performance)
strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.
– Brendan
yesterday
4
@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was astrlen_small()
and a separatestrlen_large()
, but there isn't.
– Brendan
yesterday
|
show 12 more comments
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55563598%2fwhy-is-this-code-6-5x-slower-with-optimizations-enabled%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Testing your code on Godbolt's Compiler Explorer provides this explanation:
- at
-O0
or without optimisations, the generated code call the C library functionstrlen
- at
-O1
the generated code uses a simple inline expansion using arep scasb
instruction. - at
-O2
and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:
- the
-O1
code is much slower than the C library implementation:32240
vs3090
- the
-O2
code is faster than the-O1
but still substantially slower than the C ibrary code:8570
vs3090
.
This behavior is specific to gcc
and the glibc
. The same test on OS/X with clang
and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang
generates a call to the C library strlen
at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen
has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I updated the benchmark for smaller strings and it shows similar performance for string lengths varying from 0
to 100
at -O0
and -O2
but still a much worse performance at -O1
, 3 times slower.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen)
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++)
for (int i = minlen; i < maxlen; ++i)
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';
clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);
int main()
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
Wouldn't it still be better for the inlined version to use the same optimizations as the librarystrlen
, giving the best of both worlds?
– Daniel H
yesterday
2
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in<string.h>
and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.
– chqrlie
yesterday
Does it change if you use-march=native -mtune=native
?
– Deduplicator
yesterday
1
@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. Ifstrlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use
strlen(); and (for people that care about performance)
strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.
– Brendan
yesterday
4
@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was astrlen_small()
and a separatestrlen_large()
, but there isn't.
– Brendan
yesterday
|
show 12 more comments
Testing your code on Godbolt's Compiler Explorer provides this explanation:
- at
-O0
or without optimisations, the generated code call the C library functionstrlen
- at
-O1
the generated code uses a simple inline expansion using arep scasb
instruction. - at
-O2
and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:
- the
-O1
code is much slower than the C library implementation:32240
vs3090
- the
-O2
code is faster than the-O1
but still substantially slower than the C ibrary code:8570
vs3090
.
This behavior is specific to gcc
and the glibc
. The same test on OS/X with clang
and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang
generates a call to the C library strlen
at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen
has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I updated the benchmark for smaller strings and it shows similar performance for string lengths varying from 0
to 100
at -O0
and -O2
but still a much worse performance at -O1
, 3 times slower.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen)
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++)
for (int i = minlen; i < maxlen; ++i)
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';
clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);
int main()
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
Wouldn't it still be better for the inlined version to use the same optimizations as the librarystrlen
, giving the best of both worlds?
– Daniel H
yesterday
2
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in<string.h>
and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.
– chqrlie
yesterday
Does it change if you use-march=native -mtune=native
?
– Deduplicator
yesterday
1
@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. Ifstrlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use
strlen(); and (for people that care about performance)
strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.
– Brendan
yesterday
4
@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was astrlen_small()
and a separatestrlen_large()
, but there isn't.
– Brendan
yesterday
|
show 12 more comments
Testing your code on Godbolt's Compiler Explorer provides this explanation:
- at
-O0
or without optimisations, the generated code call the C library functionstrlen
- at
-O1
the generated code uses a simple inline expansion using arep scasb
instruction. - at
-O2
and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:
- the
-O1
code is much slower than the C library implementation:32240
vs3090
- the
-O2
code is faster than the-O1
but still substantially slower than the C ibrary code:8570
vs3090
.
This behavior is specific to gcc
and the glibc
. The same test on OS/X with clang
and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang
generates a call to the C library strlen
at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen
has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I updated the benchmark for smaller strings and it shows similar performance for string lengths varying from 0
to 100
at -O0
and -O2
but still a much worse performance at -O1
, 3 times slower.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen)
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++)
for (int i = minlen; i < maxlen; ++i)
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';
clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);
int main()
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
Testing your code on Godbolt's Compiler Explorer provides this explanation:
- at
-O0
or without optimisations, the generated code call the C library functionstrlen
- at
-O1
the generated code uses a simple inline expansion using arep scasb
instruction. - at
-O2
and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:
- the
-O1
code is much slower than the C library implementation:32240
vs3090
- the
-O2
code is faster than the-O1
but still substantially slower than the C ibrary code:8570
vs3090
.
This behavior is specific to gcc
and the glibc
. The same test on OS/X with clang
and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang
generates a call to the C library strlen
at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen
has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I updated the benchmark for smaller strings and it shows similar performance for string lengths varying from 0
to 100
at -O0
and -O2
but still a much worse performance at -O1
, 3 times slower.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen)
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++)
for (int i = minlen; i < maxlen; ++i)
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';
clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);
int main()
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
edited yesterday
answered yesterday
chqrliechqrlie
63.1k850108
63.1k850108
Wouldn't it still be better for the inlined version to use the same optimizations as the librarystrlen
, giving the best of both worlds?
– Daniel H
yesterday
2
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in<string.h>
and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.
– chqrlie
yesterday
Does it change if you use-march=native -mtune=native
?
– Deduplicator
yesterday
1
@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. Ifstrlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use
strlen(); and (for people that care about performance)
strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.
– Brendan
yesterday
4
@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was astrlen_small()
and a separatestrlen_large()
, but there isn't.
– Brendan
yesterday
|
show 12 more comments
Wouldn't it still be better for the inlined version to use the same optimizations as the librarystrlen
, giving the best of both worlds?
– Daniel H
yesterday
2
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in<string.h>
and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.
– chqrlie
yesterday
Does it change if you use-march=native -mtune=native
?
– Deduplicator
yesterday
1
@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. Ifstrlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use
strlen(); and (for people that care about performance)
strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.
– Brendan
yesterday
4
@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was astrlen_small()
and a separatestrlen_large()
, but there isn't.
– Brendan
yesterday
Wouldn't it still be better for the inlined version to use the same optimizations as the library
strlen
, giving the best of both worlds?– Daniel H
yesterday
Wouldn't it still be better for the inlined version to use the same optimizations as the library
strlen
, giving the best of both worlds?– Daniel H
yesterday
2
2
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in
<string.h>
and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.– chqrlie
yesterday
It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in
<string.h>
and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.– chqrlie
yesterday
Does it change if you use
-march=native -mtune=native
?– Deduplicator
yesterday
Does it change if you use
-march=native -mtune=native
?– Deduplicator
yesterday
1
1
@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. If
strlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use
strlen(); and (for people that care about performance)
strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.– Brendan
yesterday
@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. If
strlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use
strlen(); and (for people that care about performance)
strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.– Brendan
yesterday
4
4
@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was a
strlen_small()
and a separate strlen_large()
, but there isn't.– Brendan
yesterday
@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was a
strlen_small()
and a separate strlen_large()
, but there isn't.– Brendan
yesterday
|
show 12 more comments
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55563598%2fwhy-is-this-code-6-5x-slower-with-optimizations-enabled%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
With gcc-8.2 debug version takes 51334, release 8246. Release compiler options
-O3 -march=native -DNDEBUG
– Maxim Egorushkin
yesterday
1
Please report it to gcc's bugzilla.
– Marc Glisse
yesterday
1
Using
-fno-builtin
makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtinstrlen
is slower than the library's.– David Schwartz
yesterday
1
It is generating
repnz scasb
for strlen at -O1.– Marc Glisse
yesterday
3
@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809
– Justin
yesterday