Performance of loop vs expansion
I need expert suggestions on the comparison below:
Code segment using a loop:
for file in `cat large_file_list`
do
gzip -d $file
done
Code segment using simple expansion:
gzip -d `cat large_file_list`
Which one will be faster? I have to process a large data set.
linux bash shell-script shell
The correct answer will depend on how long it takes to start gzip on your system, the number of files in the file list and the size of those files.
– Kusalananda♦
Jul 1 at 6:25
The file list will have around 1000 - 10000 files. Sizes vary from a few kilobytes to 500 MB. I have no idea how long it takes to start gzip on my system. Any way to check?
– Leon
Jul 1 at 6:27
Ok, then it may also depend on the length of the filenames. If the filenames are long, some systems might generate an "argument list too long" error if you tried to do it without a loop, since the command substitution would result in a command line too long for the shell to execute. If you don't want to depend on the number of files in the list, just use a loop. Are you spending a significant amount of time decompressing these files compared to the other processing that you will perform on them?
– Kusalananda♦
Jul 1 at 6:54
Leon take a look at my test results: "huge-arglist" is 20x faster than "loop" in my setting.
– sam68
Jul 2 at 16:08
for a happy medium between process starts and command line length, use something like xargs gzip -d < large_file_list but watch out for spaces in filenames, maybe with tr '\n' '\0' < large_file_list | xargs -0 gzip -d
– w00t
Jul 5 at 8:13
5 Answers
Complications
The following will work only sometimes:
gzip -d `cat large_file_list`
Four problems (in bash and most other Bourne-like shells) are:
1. It will fail if any file name has space, tab or newline characters in it (assuming $IFS has not been modified). This is because of the shell's word splitting.
2. It is also liable to fail if any file name has glob-active characters in it. This is because the shell will apply pathname expansion to the file list.
3. It will also fail if file names start with - (if POSIXLY_CORRECT=1, that only applies to the first file) or if any file name is -.
4. It will also fail if there are too many file names in it to fit on one command line.
The code below is subject to the same problems as the code above (except for the fourth):
for file in `cat large_file_list`
do
gzip -d $file
done
Reliable solution
If your large_file_list has exactly one file name per line, and a file called - is not among them, and you're on a GNU system, then use:
xargs -rd'\n' gzip -d -- <large_file_list
-d'\n' tells xargs to treat each line of input as a separate file name.
-r tells xargs not to run the command if the input file is empty.
-- tells gzip that the following arguments are not to be treated as options even if they start with -. A - on its own would still be treated as meaning standard input rather than the file called - though.
xargs will put many file names on each command line, but not so many that it exceeds the command line limit. This reduces the number of times that a gzip process must be started and therefore makes this fast. It is also safe: the file names are protected from word splitting and pathname expansion.
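If the list could also contain newlines inside file names, a NUL-delimited list sidesteps the one-name-per-line restriction. A minimal sketch, assuming GNU find and xargs, and that the list can be regenerated from a hypothetical directory ./data:
# Rebuild the list NUL-delimited (./data stands in for wherever the archives live)
find ./data -name '*.gz' -print0 > large_file_list.nul
# -0 splits on NUL bytes only, so any file name is handled safely; -r skips gzip if the list is empty
xargs -0r gzip -d -- < large_file_list.nul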
Thanks for the detailed reply. I understand the issues you mentioned. The file names are simple and won't face those challenges, and the list will hold up to 20000 entries. My question is basically about the performance of those two segments. Thanks.
– Leon
Jul 1 at 6:20
@Leon The for loop will be —by far— the slowest. The other two methods will be very close in speed to each other.
– John1024
Jul 1 at 6:36
Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.
– John1024
Jul 1 at 6:38
Note also that there's a variation on reading a file with xargs: at least the GNU version has an --arg-file option (short form -a). So one could do xargs -a large_file_list -rd'\n' gzip -d instead. Effectively, there's no difference, aside from the fact that < is a shell operator and would make xargs read from stdin (which the shell "links" to the file), while -a would make xargs explicitly open the file in question
– Sergiy Kolodyazhnyy
Jul 1 at 7:34
terdon noted in another comment about using parallel to run multiple copies of gzip, but xargs (at least the GNU one) has the -P switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.
– ilkkachu
Jul 2 at 14:19
I doubt it would matter much.
I would use a loop, just because I don't know how many files are listed in the list file, and I don't (generally) know if any of the filenames have spaces in them. Doing a command substitution that generates a very long list of arguments may result in an "Argument list too long" error if the generated command line exceeds the shell's limit.
My loop would look like
while IFS= read -r name; do
gunzip "$name"
done <file.list
This would additionally allow me to insert commands for processing the data after the gunzip command. In fact, depending on what the data actually is and what needs to be done with it, it may even be possible to process it without saving it to file at all:
while IFS= read -r name; do
zcat "$name" | process_data
done <file.list
(where process_data is some pipeline that reads the uncompressed data from standard input)
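For instance, with a trivial stand-in for process_data (purely illustrative; wc -l just counts lines of each uncompressed stream):
while IFS= read -r name; do
    zcat "$name" | wc -l     # replace wc -l with the real processing pipeline
done <file.list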
If the processing of the data takes longer than the uncompressing of it, the question of whether a loop is more efficient or not becomes irrelevant.
Ideally, I would prefer to not work off a list of filenames though, and instead use a filename globbing pattern, as in
for name in ./*.gz; do
# processing of "$name" here
done
where ./*.gz is some pattern that matches the relevant files. This way we are not depending on the number of files nor on the characters used in the filenames (they may contain newlines or other whitespace characters, or start with dashes, etc.)
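As a small illustration of the globbing approach (a sketch, assuming bash; nullglob makes the pattern expand to nothing, rather than the literal ./*.gz, when there are no matches):
#!/bin/bash
shopt -s nullglob            # an empty directory means the loop body never runs
for name in ./*.gz; do
    gunzip -- "$name"        # quoted and after --, so unusual file names are safe
done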
Related:
- Understanding "IFS= read -r line"
Out of those two, the one with all files passed to a single invocation of gzip is likely to be faster, exactly because you only need to launch gzip once. (That is, if the command works at all, see the other answers for the caveats.)
But I'd like to remind you of the golden rule of optimization: don't do it prematurely.
Don't optimize that sort of thing before you know it's a problem.
Does this part of the program take a long time? Well, decompressing large files might, and you're going to have to do it anyway, so it might not be that easy to answer.
Measure. Really, it's the best way to be sure.
You'll see the results with your own eyes (or with your own stopwatch), and they'll apply to your situation, which random answers on the Internet might not. Put both variants in scripts and run time script1.sh and time script2.sh. (Do that with a list of empty compressed files to measure the absolute amount of the overhead.)
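One way to set such a measurement up (a sketch with assumed names: script1.sh and script2.sh stand for the two variants, and the test files are tiny so the timing mostly reflects startup overhead):
mkdir -p /tmp/gztest && cd /tmp/gztest
for i in $(seq 1 1000); do
    printf '' | gzip > "file$i.gz"   # 1000 tiny but valid .gz files
done
ls file*.gz > large_file_list
time sh ./script1.sh                 # the loop variant
gzip file*                           # re-compress before timing the other variant
time sh ./script2.sh                 # the single-invocation variant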
How fast is your disk?
This should use all your CPUs:
parallel -X gzip -d :::: large_file_list
So your limit is likely going to be the speed of your disk.
You can try adjusting with -j:
parallel -j50% -X gzip -d :::: large_file_list
This will run half as many jobs in parallel as the previous command, and will stress your disk less, so depending on your disk this can be faster.
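If GNU parallel is not installed, GNU xargs can approximate the same idea with its -P option (a sketch, not from this answer; it assumes one file name per line and GNU xargs):
# Run up to 4 gzip processes at a time, each handed a batch of 64 file names
xargs -rd'\n' -n64 -P4 gzip -d -- < large_file_list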
ADDED after some profiling:
I made one hundred dirs, 00 to 99, on a tmpfs and filled them with man-1 files starting with "s". That gives me 113 files from 20 bytes to 69 KB in each dir, totalling 1.2 MB per dir.
Result: it ALL works:
Globbing and huge arg list:
gzip -d ??/*
and
gzip -d $(<filelist)
both 0.8 s
With a loop:
for file in $(<filelist)
do
gzip -d $file
done
14 sec
And when I add my ampersand:
...
gzip -d $file &
...
5.8 sec
The performance side of this is easily explained: the less you call gzip and the less you wait for it, the faster it runs.
Now: is there a maximum arg list length at all? Is there a maximum number of processes? (I have a ulimit -u of 31366, but it looks like I could even raise that, as root on my $300 i5 Mini-PC...) Or is the maximum "only limited by the amount of available memory"?
Despite the overhead of ampersanding (forking off) 11300 mostly tiny jobs, this takes half as long as waiting for each command to complete. And this was on a tmpfs...
I forgot this one:
time find ?? -name "*.gz" -exec gzip -d {} +
also 0.8 s (6.2 s without the "+", but with the ugly ";")
ORIGINAL answer:
Short answer: use ampersand & to execute each gzip command in the background to start the next loop iteration immediately.
for filnam in $(<flist)
do
gzip -d $filnam &
done
I also replaced cat with "<" (not in a loop, so it is not critical) and backticks with $() (more "modern").
To test/benchmark: place the shell builtin time in front of your command. I did time . loop.cmd (with loop.cmd as shown above). This gives you the time spent, and after hitting Enter again the shell shows you the finished gzip "jobs".
I tested this out some weeks ago with a (small) list of (medium size) tar files to extract. The & made such a difference in time I could hardly believe it. The & has a double effect: it not only starts the next gzip immediately, it also gives you the prompt as soon as the last gzip got started.
CAREFUL: if you have whitespace in your filenames, it doesn't work. I don't know how to implement that. I tried with quotes and IFS...
Another well-known tip is: use globbing, if possible. Maybe start with a list of directories and for each go (cd $dir; gzip -d *.gz &).
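To keep that ampersand idea from spawning an unbounded number of background jobs (a concern raised in the comments below), the loop can be capped, e.g. with bash's wait -n. A sketch, assuming bash 4.3+ and a hypothetical dirlist file:
max_jobs=4
while IFS= read -r dir; do
    ( cd "$dir" && gzip -d -- *.gz ) &
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
        wait -n                  # block until any one background job finishes
    done
done < dirlist
wait                             # let the remaining jobs finish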
--> I can only give you this ampersand "trick". The first step is always optimizing the design. I am not sure what exactly you need this huge list of files for. Is it a hand-made list? If it is generated by rules, then maybe use these rules directly and not via a file list.
Your second example should be a bit faster, because gzip starts only once. But: I don't know what the limit is if you go gzip -d 1.gz 2.gz 3.gz ... 9999.gz, i.e. giving gzip a very large argument list.
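That limit can be inspected rather than guessed (a sketch assuming Linux and GNU xargs; not part of the original answer):
getconf ARG_MAX                      # kernel limit for argv + environment, in bytes
xargs -r --show-limits < /dev/null   # GNU xargs prints the command-line size it will actually use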
Whitespace: a filename like '10 A.gz' results in gzip -d ... 9.gz 10 A.gz 11.gz.... Gzip will say: '10' and 'A.gz' not found...unless these files DO exist, and if it is rm and not gzip then you have that classic shell trap.
added:
What about:
find $(<dirlist) -name "*.gz" -exec gzip -d {} +
...starting from a smaller list of dirs? The + is the "parallelization" option (a find -exec thing, like ...).
If you can redesign a bit, this must be the best...
In the meantime, "kusa" has answered...this IFS/read is what I meant...I hope my explanations are useful anyway.
– sam68
Jul 1 at 8:37
Now I proposed a "find -exec" command at the end of my answer. Maybe not what Leon wants, but still important.
– sam68
Jul 1 at 9:34
Your loop will probably bring the OP's machine to its knees. The file list can have up to 10000 files, so running ten thousand gzip instances in parallel will very likely cause the system to hang.
– terdon♦
Jul 1 at 9:44
process #1 might be finished when #10 or #100 gets started...it started as a performance Q, then whitespace, now system resources...this question is not so easy to answer! An extra process/job for each small .gz would not be ideal, I admit. 10000 separate gzips is the thing to avoid in the first place.
– sam68
Jul 1 at 11:09
This is the sort of thing GNU parallel is for. You could simply do parallel -j3 gzip -d ::: < <(find $(<dirlist) -name '*.gz') and that would only launch 3 gzips at a time.
– terdon♦
Jul 1 at 11:49
Your Answer
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f527813%2fperformance-of-loop-vs-expansion%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
5 Answers
5
active
oldest
votes
5 Answers
5
active
oldest
votes
active
oldest
votes
active
oldest
votes
Complications
The following will work only sometimes:
gzip -d `cat large_file_list`
Three problems are (in bash
and most other Bourne-like shells):
It will fail if any file name has space tab or newline characters in it (assuming
$IFS
has not been modified). This is because of the shell's word splitting.It is also liable to fail if any file name has glob-active characters in it. This is because the shell will apply pathname expansion to the file list.
It will also fail if filenames starts with
-
(ifPOSIXLY_CORRECT=1
that only applies to the first file) or if any filename is-
.It will also fail if there are too many file names in it to fit on one command line.
The code below is subject to the same problems as the code above (except for the fourth)
for file in `cat large_file_list`
do
gzip -d $file
done
Reliable solution
If your large_file_list
has exactly one file name per line, and a file called -
is not among them, and you're on a GNU system, then use:
xargs -rd'n' gzip -d -- <large_file_list
-d'n'
tells xargs
to treat each line of input as a separate file name.
-r
tells xargs
not to run the command if the input file is empty.
--
tells gzip
that the following arguments are not to be treated as options even if they start with -
. -
alone would still be treated as -
instead of the file called -
though.
xargs
will put many file names on each command line but not so many that it exceeds the command line limit. This reduces the number of times that a gzip
process must be started and therefore makes this fast. It is also safe: the file names will also be protected from word splitting and pathname expansion.
Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.
– Leon
Jul 1 at 6:20
1
@Leon Thefor
loop will be —by far— the slowest. The other two methods will be very close in speed to each other.
– John1024
Jul 1 at 6:36
7
Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.
– John1024
Jul 1 at 6:38
5
Note also that there's variation on reading a file withxargs
: at least GNU version has--arg-file
option (short form-a
). So one could doxargs -a large_file_list -rd'n' gzip -d
instead. Effectively, there's no difference, aside from the fact that<
is shell operator and would makexargs
read from stdin (which shell "links" to file), while-a
would makexargs
explicitly open the file in question
– Sergiy Kolodyazhnyy
Jul 1 at 7:34
2
terdon noted in another comment about usingparallel
to run multiple copies ofgzip
, butxargs
(at least the GNU one), has the-P
switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.
– ilkkachu
Jul 2 at 14:19
|
show 1 more comment
Complications
The following will work only sometimes:
gzip -d `cat large_file_list`
Three problems are (in bash
and most other Bourne-like shells):
It will fail if any file name has space tab or newline characters in it (assuming
$IFS
has not been modified). This is because of the shell's word splitting.It is also liable to fail if any file name has glob-active characters in it. This is because the shell will apply pathname expansion to the file list.
It will also fail if filenames starts with
-
(ifPOSIXLY_CORRECT=1
that only applies to the first file) or if any filename is-
.It will also fail if there are too many file names in it to fit on one command line.
The code below is subject to the same problems as the code above (except for the fourth)
for file in `cat large_file_list`
do
gzip -d $file
done
Reliable solution
If your large_file_list
has exactly one file name per line, and a file called -
is not among them, and you're on a GNU system, then use:
xargs -rd'n' gzip -d -- <large_file_list
-d'n'
tells xargs
to treat each line of input as a separate file name.
-r
tells xargs
not to run the command if the input file is empty.
--
tells gzip
that the following arguments are not to be treated as options even if they start with -
. -
alone would still be treated as -
instead of the file called -
though.
xargs
will put many file names on each command line but not so many that it exceeds the command line limit. This reduces the number of times that a gzip
process must be started and therefore makes this fast. It is also safe: the file names will also be protected from word splitting and pathname expansion.
Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.
– Leon
Jul 1 at 6:20
1
@Leon Thefor
loop will be —by far— the slowest. The other two methods will be very close in speed to each other.
– John1024
Jul 1 at 6:36
7
Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.
– John1024
Jul 1 at 6:38
5
Note also that there's variation on reading a file withxargs
: at least GNU version has--arg-file
option (short form-a
). So one could doxargs -a large_file_list -rd'n' gzip -d
instead. Effectively, there's no difference, aside from the fact that<
is shell operator and would makexargs
read from stdin (which shell "links" to file), while-a
would makexargs
explicitly open the file in question
– Sergiy Kolodyazhnyy
Jul 1 at 7:34
2
terdon noted in another comment about usingparallel
to run multiple copies ofgzip
, butxargs
(at least the GNU one), has the-P
switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.
– ilkkachu
Jul 2 at 14:19
|
show 1 more comment
Complications
The following will work only sometimes:
gzip -d `cat large_file_list`
Three problems are (in bash
and most other Bourne-like shells):
It will fail if any file name has space tab or newline characters in it (assuming
$IFS
has not been modified). This is because of the shell's word splitting.It is also liable to fail if any file name has glob-active characters in it. This is because the shell will apply pathname expansion to the file list.
It will also fail if filenames starts with
-
(ifPOSIXLY_CORRECT=1
that only applies to the first file) or if any filename is-
.It will also fail if there are too many file names in it to fit on one command line.
The code below is subject to the same problems as the code above (except for the fourth)
for file in `cat large_file_list`
do
gzip -d $file
done
Reliable solution
If your large_file_list
has exactly one file name per line, and a file called -
is not among them, and you're on a GNU system, then use:
xargs -rd'n' gzip -d -- <large_file_list
-d'n'
tells xargs
to treat each line of input as a separate file name.
-r
tells xargs
not to run the command if the input file is empty.
--
tells gzip
that the following arguments are not to be treated as options even if they start with -
. -
alone would still be treated as -
instead of the file called -
though.
xargs
will put many file names on each command line but not so many that it exceeds the command line limit. This reduces the number of times that a gzip
process must be started and therefore makes this fast. It is also safe: the file names will also be protected from word splitting and pathname expansion.
Complications
The following will work only sometimes:
gzip -d `cat large_file_list`
Three problems are (in bash
and most other Bourne-like shells):
It will fail if any file name has space tab or newline characters in it (assuming
$IFS
has not been modified). This is because of the shell's word splitting.It is also liable to fail if any file name has glob-active characters in it. This is because the shell will apply pathname expansion to the file list.
It will also fail if filenames starts with
-
(ifPOSIXLY_CORRECT=1
that only applies to the first file) or if any filename is-
.It will also fail if there are too many file names in it to fit on one command line.
The code below is subject to the same problems as the code above (except for the fourth)
for file in `cat large_file_list`
do
gzip -d $file
done
Reliable solution
If your large_file_list
has exactly one file name per line, and a file called -
is not among them, and you're on a GNU system, then use:
xargs -rd'n' gzip -d -- <large_file_list
-d'n'
tells xargs
to treat each line of input as a separate file name.
-r
tells xargs
not to run the command if the input file is empty.
--
tells gzip
that the following arguments are not to be treated as options even if they start with -
. -
alone would still be treated as -
instead of the file called -
though.
xargs
will put many file names on each command line but not so many that it exceeds the command line limit. This reduces the number of times that a gzip
process must be started and therefore makes this fast. It is also safe: the file names will also be protected from word splitting and pathname expansion.
edited Jul 2 at 19:32
Stéphane Chazelas
325k57 gold badges628 silver badges997 bronze badges
325k57 gold badges628 silver badges997 bronze badges
answered Jul 1 at 5:51
John1024John1024
51k5 gold badges117 silver badges132 bronze badges
51k5 gold badges117 silver badges132 bronze badges
Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.
– Leon
Jul 1 at 6:20
1
@Leon Thefor
loop will be —by far— the slowest. The other two methods will be very close in speed to each other.
– John1024
Jul 1 at 6:36
7
Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.
– John1024
Jul 1 at 6:38
5
Note also that there's variation on reading a file withxargs
: at least GNU version has--arg-file
option (short form-a
). So one could doxargs -a large_file_list -rd'n' gzip -d
instead. Effectively, there's no difference, aside from the fact that<
is shell operator and would makexargs
read from stdin (which shell "links" to file), while-a
would makexargs
explicitly open the file in question
– Sergiy Kolodyazhnyy
Jul 1 at 7:34
2
terdon noted in another comment about usingparallel
to run multiple copies ofgzip
, butxargs
(at least the GNU one), has the-P
switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.
– ilkkachu
Jul 2 at 14:19
|
show 1 more comment
Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.
– Leon
Jul 1 at 6:20
1
@Leon Thefor
loop will be —by far— the slowest. The other two methods will be very close in speed to each other.
– John1024
Jul 1 at 6:36
7
Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.
– John1024
Jul 1 at 6:38
5
Note also that there's variation on reading a file withxargs
: at least GNU version has--arg-file
option (short form-a
). So one could doxargs -a large_file_list -rd'n' gzip -d
instead. Effectively, there's no difference, aside from the fact that<
is shell operator and would makexargs
read from stdin (which shell "links" to file), while-a
would makexargs
explicitly open the file in question
– Sergiy Kolodyazhnyy
Jul 1 at 7:34
2
terdon noted in another comment about usingparallel
to run multiple copies ofgzip
, butxargs
(at least the GNU one), has the-P
switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.
– ilkkachu
Jul 2 at 14:19
Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.
– Leon
Jul 1 at 6:20
Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.
– Leon
Jul 1 at 6:20
1
1
@Leon The
for
loop will be —by far— the slowest. The other two methods will be very close in speed to each other.– John1024
Jul 1 at 6:36
@Leon The
for
loop will be —by far— the slowest. The other two methods will be very close in speed to each other.– John1024
Jul 1 at 6:36
7
7
Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.
– John1024
Jul 1 at 6:38
Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.
– John1024
Jul 1 at 6:38
5
5
Note also that there's variation on reading a file with
xargs
: at least GNU version has --arg-file
option (short form -a
). So one could do xargs -a large_file_list -rd'n' gzip -d
instead. Effectively, there's no difference, aside from the fact that <
is shell operator and would make xargs
read from stdin (which shell "links" to file), while -a
would make xargs
explicitly open the file in question– Sergiy Kolodyazhnyy
Jul 1 at 7:34
Note also that there's variation on reading a file with
xargs
: at least GNU version has --arg-file
option (short form -a
). So one could do xargs -a large_file_list -rd'n' gzip -d
instead. Effectively, there's no difference, aside from the fact that <
is shell operator and would make xargs
read from stdin (which shell "links" to file), while -a
would make xargs
explicitly open the file in question– Sergiy Kolodyazhnyy
Jul 1 at 7:34
2
2
terdon noted in another comment about using
parallel
to run multiple copies of gzip
, but xargs
(at least the GNU one), has the -P
switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.– ilkkachu
Jul 2 at 14:19
terdon noted in another comment about using
parallel
to run multiple copies of gzip
, but xargs
(at least the GNU one), has the -P
switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.– ilkkachu
Jul 2 at 14:19
|
show 1 more comment
I doubt it would matter much.
I would use a loop, just because I don't know how many files are listed in the list file, and I don't (generally) know if any of the filenames have spaces in their names. Doing a command substitution that would generate a very long list of argument may result in an "Argument list too long" error when the length of the list generated is too long.
My loop would look like
while IFS= read -r name; do
gunzip "$name"
done <file.list
This would additionally allow me to insert commands for processing the data after the gunzip
command. In fact, depending on what the data actually is and what needs to be done with it, it may even be possible to process it without saving it to file at all:
while IFS= read -r name; do
zcat "$name" | process_data
done <file.list
(where process_data
is some pipeline that reads the uncompressed data from standard input)
If the processing of the data takes longer than the uncompressing of it, the question of whether a loop is more efficient or not becomes irrelevant.
Ideally, I would prefer to not work off a list of filenames though, and instead use a filename globbing pattern, as in
for name in ./*.gz; do
# processing of "$name" here
done
where ./*.gz
is some pattern that matches the relevant files. This way we are not depending on the number of files nor on the characters used in the filenames (they may contain newlines or other whitespace characters, or start with dashes, etc.)
Related:
- Understanding "IFS= read -r line"
add a comment |
I doubt it would matter much.
I would use a loop, just because I don't know how many files are listed in the list file, and I don't (generally) know if any of the filenames have spaces in their names. Doing a command substitution that would generate a very long list of argument may result in an "Argument list too long" error when the length of the list generated is too long.
My loop would look like
while IFS= read -r name; do
gunzip "$name"
done <file.list
This would additionally allow me to insert commands for processing the data after the gunzip
command. In fact, depending on what the data actually is and what needs to be done with it, it may even be possible to process it without saving it to file at all:
while IFS= read -r name; do
zcat "$name" | process_data
done <file.list
(where process_data
is some pipeline that reads the uncompressed data from standard input)
If the processing of the data takes longer than the uncompressing of it, the question of whether a loop is more efficient or not becomes irrelevant.
Ideally, I would prefer to not work off a list of filenames though, and instead use a filename globbing pattern, as in
for name in ./*.gz; do
# processing of "$name" here
done
where ./*.gz
is some pattern that matches the relevant files. This way we are not depending on the number of files nor on the characters used in the filenames (they may contain newlines or other whitespace characters, or start with dashes, etc.)
Related:
- Understanding "IFS= read -r line"
add a comment |
I doubt it would matter much.
I would use a loop, just because I don't know how many files are listed in the list file, and I don't (generally) know if any of the filenames have spaces in their names. Doing a command substitution that would generate a very long list of argument may result in an "Argument list too long" error when the length of the list generated is too long.
My loop would look like
while IFS= read -r name; do
gunzip "$name"
done <file.list
This would additionally allow me to insert commands for processing the data after the gunzip
command. In fact, depending on what the data actually is and what needs to be done with it, it may even be possible to process it without saving it to file at all:
while IFS= read -r name; do
zcat "$name" | process_data
done <file.list
(where process_data
is some pipeline that reads the uncompressed data from standard input)
If the processing of the data takes longer than the uncompressing of it, the question of whether a loop is more efficient or not becomes irrelevant.
Ideally, I would prefer to not work off a list of filenames though, and instead use a filename globbing pattern, as in
for name in ./*.gz; do
# processing of "$name" here
done
where ./*.gz
is some pattern that matches the relevant files. This way we are not depending on the number of files nor on the characters used in the filenames (they may contain newlines or other whitespace characters, or start with dashes, etc.)
Related:
- Understanding "IFS= read -r line"
I doubt it would matter much.
I would use a loop, just because I don't know how many files are listed in the list file, and I don't (generally) know if any of the filenames have spaces in their names. Doing a command substitution that would generate a very long list of argument may result in an "Argument list too long" error when the length of the list generated is too long.
My loop would look like
while IFS= read -r name; do
gunzip "$name"
done <file.list
This would additionally allow me to insert commands for processing the data after the gunzip
command. In fact, depending on what the data actually is and what needs to be done with it, it may even be possible to process it without saving it to file at all:
while IFS= read -r name; do
zcat "$name" | process_data
done <file.list
(where process_data
is some pipeline that reads the uncompressed data from standard input)
If the processing of the data takes longer than the uncompressing of it, the question of whether a loop is more efficient or not becomes irrelevant.
Ideally, I would prefer to not work off a list of filenames though, and instead use a filename globbing pattern, as in
for name in ./*.gz; do
# processing of "$name" here
done
where ./*.gz
is some pattern that matches the relevant files. This way we are not depending on the number of files nor on the characters used in the filenames (they may contain newlines or other whitespace characters, or start with dashes, etc.)
Related:
- Understanding "IFS= read -r line"
edited Jul 1 at 9:39
answered Jul 1 at 7:21
Kusalananda♦Kusalananda
154k18 gold badges304 silver badges488 bronze badges
154k18 gold badges304 silver badges488 bronze badges
add a comment |
add a comment |
Out of those two, the one with all files passed to a single invocation of gzip
is likely to be faster, exactly because you only need to launch gzip
once. (That is, if the command works at all, see the other answers for the caveats.)
But, I'd like to remind of the golden rule of optimization: Don't do it prematurely.
Don't optimize that sort of thing before you know it's a problem.
Does this part of the program take a long time? Well, decompressing large files might, and you're going to have to do it anyway, so it might not be that easy to answer.
Measure. Really, it's the best way to be sure.
You'll see the results with your own eyes (or with your own stopwatch), and they'll apply to your situation which random answers on the Internet might not. Put both variants in scripts and run
time script1.sh
, andtime script2.sh
. (Do that with a list of empty compressed files to measure the absolute amount of the overhead.)
add a comment |
Out of those two, the one with all files passed to a single invocation of gzip
is likely to be faster, exactly because you only need to launch gzip
once. (That is, if the command works at all, see the other answers for the caveats.)
But, I'd like to remind of the golden rule of optimization: Don't do it prematurely.
Don't optimize that sort of thing before you know it's a problem.
Does this part of the program take a long time? Well, decompressing large files might, and you're going to have to do it anyway, so it might not be that easy to answer.
Measure. Really, it's the best way to be sure.
You'll see the results with your own eyes (or with your own stopwatch), and they'll apply to your situation which random answers on the Internet might not. Put both variants in scripts and run
time script1.sh
, andtime script2.sh
. (Do that with a list of empty compressed files to measure the absolute amount of the overhead.)
add a comment |
Out of those two, the one with all files passed to a single invocation of gzip
is likely to be faster, exactly because you only need to launch gzip
once. (That is, if the command works at all, see the other answers for the caveats.)
But, I'd like to remind of the golden rule of optimization: Don't do it prematurely.
Don't optimize that sort of thing before you know it's a problem.
Does this part of the program take a long time? Well, decompressing large files might, and you're going to have to do it anyway, so it might not be that easy to answer.
Measure. Really, it's the best way to be sure.
You'll see the results with your own eyes (or with your own stopwatch), and they'll apply to your situation which random answers on the Internet might not. Put both variants in scripts and run
time script1.sh
, andtime script2.sh
. (Do that with a list of empty compressed files to measure the absolute amount of the overhead.)
Out of those two, the one with all files passed to a single invocation of gzip
is likely to be faster, exactly because you only need to launch gzip
once. (That is, if the command works at all, see the other answers for the caveats.)
But, I'd like to remind of the golden rule of optimization: Don't do it prematurely.
Don't optimize that sort of thing before you know it's a problem.
Does this part of the program take a long time? Well, decompressing large files might, and you're going to have to do it anyway, so it might not be that easy to answer.
Measure. Really, it's the best way to be sure.
You'll see the results with your own eyes (or with your own stopwatch), and they'll apply to your situation which random answers on the Internet might not. Put both variants in scripts and run
time script1.sh
, andtime script2.sh
. (Do that with a list of empty compressed files to measure the absolute amount of the overhead.)
answered Jul 1 at 15:03
ilkkachuilkkachu
64.9k10 gold badges109 silver badges189 bronze badges
64.9k10 gold badges109 silver badges189 bronze badges
add a comment |
add a comment |
How fast is your disk?
This should use all your CPUs:
parallel -X gzip -d :::: large_file_list
So your limit is likely going to be the speed of your disk.
You can try adjusting with -j
:
parallel -j50% -X gzip -d :::: large_file_list
This will run half of the jobs in parallel as the previous command, and will stress your disk less, so depending on your disk this can be faster.
add a comment |
How fast is your disk?
This should use all your CPUs:
parallel -X gzip -d :::: large_file_list
So your limit is likely going to be the speed of your disk.
You can try adjusting with -j
:
parallel -j50% -X gzip -d :::: large_file_list
This will run half of the jobs in parallel as the previous command, and will stress your disk less, so depending on your disk this can be faster.
add a comment |
How fast is your disk?
This should use all your CPUs:
parallel -X gzip -d :::: large_file_list
So your limit is likely going to be the speed of your disk.
You can try adjusting with -j
:
parallel -j50% -X gzip -d :::: large_file_list
This will run half of the jobs in parallel as the previous command, and will stress your disk less, so depending on your disk this can be faster.
How fast is your disk?
This should use all your CPUs:
parallel -X gzip -d :::: large_file_list
So your limit is likely going to be the speed of your disk.
You can try adjusting with -j
:
parallel -j50% -X gzip -d :::: large_file_list
This will run half of the jobs in parallel as the previous command, and will stress your disk less, so depending on your disk this can be faster.
answered Jul 2 at 17:09
Ole TangeOle Tange
13.6k17 gold badges60 silver badges108 bronze badges
13.6k17 gold badges60 silver badges108 bronze badges
add a comment |
add a comment |
ADDED after some profiling:
I made one hundred dirs 00 to 99 on a tmpfs and filled them with man-1 files starting with "s". That gives me 113 files from 20 byte to 69 K in each dir, totalling to 1.2 Mb in each dir.
Result: it ALL works:
Globbing and huge arg list:
gzip -d ??/*
and
gzip -d $(<filelist)
both 0.8 s
With a loop:
for file in $(<filelist)
do
gzip -d $file
done
14 sec
And when I add my ampersand:
...
gzip -d $file &
...
5.8 sec
The performance side of this is easily explained: the less you call gzip and the less you wait for it, the faster it runs.
Now: Is there at all a maximum arg list length? Is there a maximum number of processes? (I have ulimit -u of 31366, but it looks like I could even raise that, as root on my 300$ i5 Mini-PC...). Or will the max. be "only limited by the amount of avail. Memory".
Despite the overhead of ampersanding (forking off) 11300 mostly tiny jobs, this takes half as long as to wait for each command to complete. And this was on a tmpfs...
I forgot this one:
time find ?? -name "*.gz" -exec gzip -d +
also 0.8 s (6.2 s without the "+", but with ugly ";")
ORIGINAL answer:
Short answer: use ampersand &
to execute each gzip
command in the background to start the next loop iteration immediately.
for filnam in $(<flist)
do
gzip -d $filnam &
done
I also replaced cat
with "<" (not in a loop, so it is not critical) and backticks with $()
(more "modern").
to test/benchmark: place the shell builtin time
in front of your command: I did time . loop.cmd
(with loop.cmd as shown above). This gives you the time spent, and after hitting Enter
again the shell shows you the finished gzip "jobs".
I tested this out some weeks ago with a (small) list of (medium size) tar files to extract. The &
made such a difference in time
I could hardly believe it. The & has a double effect: it not only starts the next gzip immediately, it also gives you the prompt as soon as the last gzip got started.
CAREFUL: if you have whitespace in your filenames, it doesn't work. I don't know how to implement that. I tried with quotes and IFS...
Another well-known tip is: use globbing, if possible. Maybe start with a list of directories and for each go (cd $dir; gzip -d *.gz &)
.
--> I only can give you this ampersand "trick". First step is always optimizing the design. I am not sure what exactly you want this huge list of files for? Is it a hand made list? If it is generated by rules, then maybe use these rules directly and not via a filelist.
Your second example should be a bit faster, because gzip starts only once. But: I don't know what the limit is if you go gzip -d 1.gz 2.gz 3.gz ... 9999.gz
, i.e. giving gzip a very large argument list.
Whitespace: a filename like '10 A.gz'
results in gzip -d ... 9.gz 10 A.gz 11.gz...
. Gzip will say: '10' and 'A.gz' not found...unless these files DO exist, and if it is rm
and not gzip
then you have that classic shell trap.
added:
What about:
find $(<dirlist) -name "*.gz" -exec gzip -d +
...starting from a smaller list of dirs? The +
is the "parallelization" option (a find -exec
thing, like ...).
If you can redesign a bit, this must be the best...
In the meanwhile, "kusa" has answered...this IFS/read is what I meant...I hope my explanations are useful anyway.
– sam68
Jul 1 at 8:37
Now I proposed a "find -exec" command at the end of my answer. Maybe not what Leon wants, but still important.
– sam68
Jul 1 at 9:34
4
Your loop will probably bring the OP's machine to its knees. The file list can have up to 10000 files, so running ten thousand gzip instances in parallel will very likely cause the system to hang.
– terdon♦
Jul 1 at 9:44
process #1 might be finished when #10 or #100 gets started...it started as a performance Q, then whitespace, now system resources...this question is not so easy to answer! A extra process/job for each small .gz would not be ideal, I admit. 10000 separate gzips is the thing to avoid in the first place.
– sam68
Jul 1 at 11:09
5
This is the sort of thing GNUparallel
is for. You could simply doparallel -j3 gzip -d ::: < <(find $(<dirlist) -name '*.gz')
and that would only launch 3 gzips at a time.
– terdon♦
Jul 1 at 11:49
|
show 2 more comments
ADDED after some profiling:
I made one hundred dirs 00 to 99 on a tmpfs and filled them with man-1 files starting with "s". That gives me 113 files from 20 byte to 69 K in each dir, totalling to 1.2 Mb in each dir.
Result: it ALL works:
Globbing and huge arg list:
gzip -d ??/*
and
gzip -d $(<filelist)
both 0.8 s
With a loop:
for file in $(<filelist)
do
gzip -d $file
done
14 sec
And when I add my ampersand:
...
gzip -d $file &
...
5.8 sec
ADDED after some profiling:
I made one hundred dirs, 00 to 99, on a tmpfs and filled each of them with the man-1 pages whose names start with "s". That gives me 113 files, from 20 bytes to 69 KB, in each dir, totalling about 1.2 MB per dir.
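For reference, a rough sketch of how such a test tree could be rebuilt (my addition; the tmpfs mount point /dev/shm and the man page location /usr/share/man/man1/*.1.gz are assumptions about a typical Linux install, not part of the original test):
# build 100 dirs, 00..99, each filled with compressed section-1 man pages starting with "s"
mkdir -p /dev/shm/gztest && cd /dev/shm/gztest
for d in $(seq -w 0 99)
do
    mkdir "$d"
    cp /usr/share/man/man1/s*.1.gz "$d"/
done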
Result: it ALL works:
Globbing and huge arg list:
gzip -d ??/*
and
gzip -d $(<filelist)
both 0.8 s
With a loop:
for file in $(<filelist)
do
gzip -d $file
done
14 s
And when I add my ampersand:
...
gzip -d $file &
...
5.8 s
The performance side of this is easily explained: the fewer times you start gzip, and the less you wait for each invocation to finish, the faster it runs.
Now: is there a maximum argument list length at all? Is there a maximum number of processes? (I have a ulimit -u of 31366, but it looks like I could raise even that, as root, on my $300 i5 mini-PC...) Or is the maximum "only limited by the amount of available memory"?
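If you would rather look these limits up than guess, the standard commands below report them (my addition; the numbers they print are of course machine-specific):
getconf ARG_MAX                   # max bytes of argv + environment for one exec
ulimit -u                         # max number of user processes for this shell/user
xargs --show-limits </dev/null    # GNU xargs: the command-line size it will actually use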
Despite the overhead of backgrounding (forking off) 11300 mostly tiny jobs, this takes less than half as long as waiting for each command to complete. And this was on a tmpfs...
I forgot this one:
time find ?? -name "*.gz" -exec gzip -d {} +
also 0.8 s (6.2 s without the "+", i.e. with the ugly ";", which starts one gzip per file)
ORIGINAL answer:
Short answer: use an ampersand (&) to run each gzip command in the background, so that the next loop iteration starts immediately.
for filnam in $(<flist)
do
gzip -d $filnam &
done
I also replaced cat with the "<" redirection (it is not inside the loop, so it is not critical) and the backticks with $() (more "modern").
To test/benchmark: place the shell builtin time in front of your command. I did time . loop.cmd (with loop.cmd as shown above). This gives you the time spent, and after hitting Enter again the shell shows you the finished gzip "jobs".
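One caveat worth adding (my note, not part of the original benchmark): once the loop backgrounds its jobs, time only measures how long it takes to launch them. A minimal sketch that times the whole run, launch plus decompression:
# "wait" blocks until every background gzip has finished, so "time" covers the real work
time { . loop.cmd; wait; }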
I tested this out some weeks ago with a (small) list of (medium-sized) tar files to extract. The & made such a difference in time that I could hardly believe it. The & has a double effect: it not only starts the next gzip immediately, it also gives you the prompt back as soon as the last gzip has been started.
CAREFUL: if you have whitespace in your filenames, this doesn't work. I don't know how to handle that within this loop; I tried with quotes and IFS...
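A possible whitespace-safe variant (my sketch, not from the original answer): read the list line by line with the IFS= read -r idiom mentioned in the comments. It assumes flist holds exactly one filename per line; filenames containing newlines would still break it:
while IFS= read -r filnam
do
    gzip -d "$filnam" &
done < flist
wait    # let all background gzip jobs finish before moving on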
Another well-known tip: use globbing if possible. Maybe start with a list of directories and, for each directory, run (cd "$dir" && gzip -d *.gz) &.
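Spelled out as a loop (my sketch; it assumes dirlist holds one directory name per line and that those names contain no whitespace):
for dir in $(<dirlist)
do
    ( cd "$dir" && gzip -d ./*.gz ) &    # && so gzip never runs in the wrong directory
done
wait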
--> I can only give you this ampersand "trick". The first step is always to optimize the design: I am not sure what exactly this huge list of files is for. Is it a hand-made list? If it is generated by rules, then maybe apply those rules directly instead of going through a file list.
Your second example should be a bit faster, because gzip starts only once. But I don't know what the limit is if you run gzip -d 1.gz 2.gz 3.gz ... 9999.gz, i.e. hand gzip a very large argument list.
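If the list ever does blow past the kernel's argument-length limit, xargs is the usual middle ground (my addition, along the lines of one of the comments on this page): it packs as many names as fit into each gzip invocation. This still assumes the filenames contain no whitespace; -n and -P are supported by both GNU and BSD xargs:
xargs gzip -d < large_file_list                 # a few gzip runs, each with a big batch of names
xargs -n 500 -P 4 gzip -d < large_file_list     # at most 4 gzip processes at a time, 500 names each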
Whitespace: a filename like '10 A.gz' results in gzip -d ... 9.gz 10 A.gz 11.gz.... Gzip will say '10' and 'A.gz' not found... unless files with those names DO exist; and if the command is rm rather than gzip, then you have the classic shell trap.
added:
What about:
find $(<dirlist) -name "*.gz" -exec gzip -d {} +
...starting from a smaller list of dirs? The + makes find pass many filenames to each gzip invocation (a find -exec feature that works much like xargs), instead of running one gzip per file.
If you can redesign things a bit, this is probably the best option...
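For completeness, a variant of that idea which is both whitespace-safe and bounded in parallelism (my sketch; GNU find and xargs assumed, and the directory names in dirlist themselves must not contain whitespace, since $(<dirlist) is still word-split):
find $(<dirlist) -name '*.gz' -print0 | xargs -0 -P 4 gzip -d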
edited Jul 2 at 16:24
answered Jul 1 at 8:32
sam68
In the meantime, "kusa" has answered... this IFS/read approach is what I meant... I hope my explanations are useful anyway.
– sam68
Jul 1 at 8:37
I have now proposed a "find -exec" command at the end of my answer. Maybe not what Leon wants, but still important.
– sam68
Jul 1 at 9:34
4
Your loop will probably bring the OP's machine to its knees. The file list can have up to 10000 files, so running ten thousand gzip instances in parallel will very likely cause the system to hang.
– terdon♦
Jul 1 at 9:44
Process #1 might be finished by the time #10 or #100 gets started... it started as a performance question, then whitespace, now system resources... this question is not so easy to answer! An extra process/job for each small .gz would not be ideal, I admit. 10000 separate gzips is the thing to avoid in the first place.
– sam68
Jul 1 at 11:09
5
This is the sort of thing GNU parallel is for. You could simply do parallel -j3 gzip -d ::: < <(find $(<dirlist) -name '*.gz')
and that would only launch 3 gzips at a time.
– terdon♦
Jul 1 at 11:49
1
Ok, then it may also depend on the length of the filenames. If the filenames are long, some systems might generate an "argument list too long" error if you tried to do it without a loop, since the command substitution would result in a command line too long for the shell to execute. If you don't want to depend on the number of files in the list, just use a loop. Are you spending a significant amount of time decompressing these files compared to the other processing that you will perform on them?
– Kusalananda♦
Jul 1 at 6:54
Leon, take a look at my test results: "huge-arglist" is 20x faster than "loop" in my setup.
– sam68
Jul 2 at 16:08
for a happy medium between process starts and command line length, use something like
xargs gzip -d < large_file_list
but watch out for spaces in filenames, maybe with tr '\n' '\0' < large_file_list | xargs -0 gzip -d
– w00t
Jul 5 at 8:13