Performance of loop vs expansion
I need expert suggestions on the comparison below:
Code segment using a loop:
for file in `cat large_file_list`
do
gzip -d $file
done
Code segment using simple expansion:
gzip -d `cat large_file_list`
Which one will be faster? I have to process a large data set.
linux bash shell-script shell
The correct answer will depend on how long it takes to start gzip on your system, the number of files in the file list and the size of those files.
– Kusalananda♦
Jul 1 at 6:25
The file list will have around 1000 - 10000 files. Sizes vary from a few kilobytes to 500 MB. I have no idea how long it takes to start gzip on my system. Any way to check?
– Leon
Jul 1 at 6:27
Ok, then it may also depend on the length of the filenames. If the filenames are long, some systems might generate an "argument list too long" error if you tried to do it without a loop, since the command substitution would result in a command line too long for the shell to execute. If you don't want to depend on the number of files in the list, just use a loop. Are you spending a significant amount of time decompressing these files compared to the other processing that you will perform on them?
– Kusalananda♦
Jul 1 at 6:54
Leon take a look at my test results: "huge-arglist" is 20x faster than "loop" in my setting.
– sam68
Jul 2 at 16:08
for a happy medium between process starts and command line length, use something like xargs gzip -d < large_file_list but watch out for spaces in filenames, maybe with tr '\n' '\0' < large_file_list | xargs -0 gzip -d
– w00t
Jul 5 at 8:13
5 Answers
Complications
The following will work only sometimes:
gzip -d `cat large_file_list`
Four problems (in bash and most other Bourne-like shells) are:
1. It will fail if any file name has space, tab or newline characters in it (assuming $IFS has not been modified). This is because of the shell's word splitting.
2. It is also liable to fail if any file name has glob-active characters in it. This is because the shell will apply pathname expansion to the file list.
3. It will also fail if file names start with - (if POSIXLY_CORRECT=1, that only applies to the first file) or if any file name is -.
4. It will also fail if there are too many file names in it to fit on one command line.
The code below is subject to the same problems as the code above (except for the fourth):
for file in `cat large_file_list`
do
gzip -d $file
done
Reliable solution
If your large_file_list has exactly one file name per line, and a file called - is not among them, and you're on a GNU system, then use:
xargs -rd'\n' gzip -d -- <large_file_list
-d'\n' tells xargs to treat each line of input as a separate file name.
-r tells xargs not to run the command if the input file is empty.
-- tells gzip that the following arguments are not to be treated as options even if they start with -. A - on its own would still be treated as meaning standard input rather than the file called - though.
xargs will put many file names on each command line, but not so many that it exceeds the command line limit. This reduces the number of times that a gzip process must be started and therefore makes this fast. It is also safe: the file names are protected from word splitting and pathname expansion.
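If the list could also contain newlines inside file names, a NUL-delimited list sidesteps the one-name-per-line restriction. A minimal sketch, assuming GNU find and xargs, and that the list can be regenerated from a hypothetical directory ./data:
# Rebuild the list NUL-delimited (./data stands in for wherever the archives live)
find ./data -name '*.gz' -print0 > large_file_list.nul
# -0 splits on NUL bytes only, so any file name is handled safely; -r skips gzip if the list is empty
xargs -0r gzip -d -- < large_file_list.nul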
Thanks for the detailed reply. I understand the issues you mentioned. The file names are simple and won't face those challenges, and the list will hold up to 20000 entries. My question is basically about the performance of those two segments. Thanks.
– Leon
Jul 1 at 6:20
@Leon The for loop will be —by far— the slowest. The other two methods will be very close in speed to each other.
– John1024
Jul 1 at 6:36
Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.
– John1024
Jul 1 at 6:38
Note also that there's a variation on reading a file with xargs: at least the GNU version has an --arg-file option (short form -a). So one could do xargs -a large_file_list -rd'\n' gzip -d instead. Effectively, there's no difference, aside from the fact that < is a shell operator and would make xargs read from stdin (which the shell "links" to the file), while -a would make xargs explicitly open the file in question
– Sergiy Kolodyazhnyy
Jul 1 at 7:34
terdon noted in another comment about using parallel to run multiple copies of gzip, but xargs (at least the GNU one) has the -P switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.
– ilkkachu
Jul 2 at 14:19
I doubt it would matter much.
I would use a loop, just because I don't know how many files are listed in the list file, and I don't (generally) know if any of the filenames have spaces in them. Doing a command substitution that generates a very long list of arguments may result in an "Argument list too long" error if the generated command line exceeds the shell's limit.
My loop would look like
while IFS= read -r name; do
gunzip "$name"
done <file.list
This would additionally allow me to insert commands for processing the data after the gunzip command. In fact, depending on what the data actually is and what needs to be done with it, it may even be possible to process it without saving it to file at all:
while IFS= read -r name; do
zcat "$name" | process_data
done <file.list
(where process_data is some pipeline that reads the uncompressed data from standard input)
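For instance, with a trivial stand-in for process_data (purely illustrative; wc -l just counts lines of each uncompressed stream):
while IFS= read -r name; do
    zcat "$name" | wc -l     # replace wc -l with the real processing pipeline
done <file.list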
If the processing of the data takes longer than the uncompressing of it, the question of whether a loop is more efficient or not becomes irrelevant.
Ideally, I would prefer to not work off a list of filenames though, and instead use a filename globbing pattern, as in
for name in ./*.gz; do
# processing of "$name" here
done
where ./*.gz is some pattern that matches the relevant files. This way we are not depending on the number of files nor on the characters used in the filenames (they may contain newlines or other whitespace characters, or start with dashes, etc.)
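As a small illustration of the globbing approach (a sketch, assuming bash; nullglob makes the pattern expand to nothing, rather than the literal ./*.gz, when there are no matches):
#!/bin/bash
shopt -s nullglob            # an empty directory means the loop body never runs
for name in ./*.gz; do
    gunzip -- "$name"        # quoted and after --, so unusual file names are safe
done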
Related:
- Understanding "IFS= read -r line"
Out of those two, the one with all files passed to a single invocation of gzip is likely to be faster, exactly because you only need to launch gzip once. (That is, if the command works at all, see the other answers for the caveats.)
But I'd like to remind you of the golden rule of optimization: don't do it prematurely.
Don't optimize that sort of thing before you know it's a problem.
Does this part of the program take a long time? Well, decompressing large files might, and you're going to have to do it anyway, so it might not be that easy to answer.
Measure. Really, it's the best way to be sure.
You'll see the results with your own eyes (or with your own stopwatch), and they'll apply to your situation, which random answers on the Internet might not. Put both variants in scripts and run time script1.sh and time script2.sh. (Do that with a list of empty compressed files to measure the absolute amount of the overhead.)
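One way to set such a measurement up (a sketch with assumed names: script1.sh and script2.sh stand for the two variants, and the test files are tiny so the timing mostly reflects startup overhead):
mkdir -p /tmp/gztest && cd /tmp/gztest
for i in $(seq 1 1000); do
    printf '' | gzip > "file$i.gz"   # 1000 tiny but valid .gz files
done
ls file*.gz > large_file_list
time sh ./script1.sh                 # the loop variant
gzip file*                           # re-compress before timing the other variant
time sh ./script2.sh                 # the single-invocation variant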
How fast is your disk?
This should use all your CPUs:
parallel -X gzip -d :::: large_file_list
So your limit is likely going to be the speed of your disk.
You can try adjusting with -j:
parallel -j50% -X gzip -d :::: large_file_list
This will run half as many jobs in parallel as the previous command, and will stress your disk less, so depending on your disk this can be faster.
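If GNU parallel is not installed, GNU xargs can approximate the same idea with its -P option (a sketch, not from this answer; it assumes one file name per line and GNU xargs):
# Run up to 4 gzip processes at a time, each handed a batch of 64 file names
xargs -rd'\n' -n64 -P4 gzip -d -- < large_file_list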
ADDED after some profiling:
I made one hundred dirs, 00 to 99, on a tmpfs and filled them with man-1 files starting with "s". That gives me 113 files from 20 bytes to 69 KB in each dir, totalling 1.2 MB per dir.
Result: it ALL works:
Globbing and huge arg list:
gzip -d ??/*
and
gzip -d $(<filelist)
both 0.8 s
With a loop:
for file in $(<filelist)
do
gzip -d $file
done
14 sec
And when I add my ampersand:
...
gzip -d $file &
...
5.8 sec
The performance side of this is easily explained: the less you call gzip and the less you wait for it, the faster it runs.
Now: is there a maximum arg list length at all? Is there a maximum number of processes? (I have a ulimit -u of 31366, but it looks like I could even raise that, as root on my $300 i5 Mini-PC...) Or is the maximum "only limited by the amount of available memory"?
Despite the overhead of ampersanding (forking off) 11300 mostly tiny jobs, this takes half as long as waiting for each command to complete. And this was on a tmpfs...
I forgot this one:
time find ?? -name "*.gz" -exec gzip -d {} +
also 0.8 s (6.2 s without the "+", but with the ugly ";")
ORIGINAL answer:
Short answer: use ampersand & to execute each gzip command in the background to start the next loop iteration immediately.
for filnam in $(<flist)
do
gzip -d $filnam &
done
I also replaced cat with "<" (not in a loop, so it is not critical) and backticks with $() (more "modern").
To test/benchmark: place the shell builtin time in front of your command. I did time . loop.cmd (with loop.cmd as shown above). This gives you the time spent, and after hitting Enter again the shell shows you the finished gzip "jobs".
I tested this out some weeks ago with a (small) list of (medium size) tar files to extract. The & made such a difference in time I could hardly believe it. The & has a double effect: it not only starts the next gzip immediately, it also gives you the prompt as soon as the last gzip got started.
CAREFUL: if you have whitespace in your filenames, it doesn't work. I don't know how to implement that. I tried with quotes and IFS...
Another well-known tip is: use globbing, if possible. Maybe start with a list of directories and for each go (cd $dir; gzip -d *.gz &).
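To keep that ampersand idea from spawning an unbounded number of background jobs (a concern raised in the comments below), the loop can be capped, e.g. with bash's wait -n. A sketch, assuming bash 4.3+ and a hypothetical dirlist file:
max_jobs=4
while IFS= read -r dir; do
    ( cd "$dir" && gzip -d -- *.gz ) &
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
        wait -n                  # block until any one background job finishes
    done
done < dirlist
wait                             # let the remaining jobs finish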
--> I can only give you this ampersand "trick". The first step is always optimizing the design. I am not sure what exactly you need this huge list of files for. Is it a hand-made list? If it is generated by rules, then maybe use these rules directly and not via a file list.
Your second example should be a bit faster, because gzip starts only once. But: I don't know what the limit is if you go gzip -d 1.gz 2.gz 3.gz ... 9999.gz, i.e. giving gzip a very large argument list.
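That limit can be inspected rather than guessed (a sketch assuming Linux and GNU xargs; not part of the original answer):
getconf ARG_MAX                      # kernel limit for argv + environment, in bytes
xargs -r --show-limits < /dev/null   # GNU xargs prints the command-line size it will actually use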
Whitespace: a filename like '10 A.gz' results in gzip -d ... 9.gz 10 A.gz 11.gz.... Gzip will say: '10' and 'A.gz' not found...unless these files DO exist, and if it is rm and not gzip then you have that classic shell trap.
added:
What about:
find $(<dirlist) -name "*.gz" -exec gzip -d {} +
...starting from a smaller list of dirs? The + is the "parallelization" option (a find -exec thing, like ...).
If you can redesign a bit, this must be the best...
In the meantime, "kusa" has answered...this IFS/read is what I meant...I hope my explanations are useful anyway.
– sam68
Jul 1 at 8:37
Now I proposed a "find -exec" command at the end of my answer. Maybe not what Leon wants, but still important.
– sam68
Jul 1 at 9:34
Your loop will probably bring the OP's machine to its knees. The file list can have up to 10000 files, so running ten thousand gzip instances in parallel will very likely cause the system to hang.
– terdon♦
Jul 1 at 9:44
process #1 might be finished when #10 or #100 gets started...it started as a performance Q, then whitespace, now system resources...this question is not so easy to answer! An extra process/job for each small .gz would not be ideal, I admit. 10000 separate gzips is the thing to avoid in the first place.
– sam68
Jul 1 at 11:09
This is the sort of thing GNU parallel is for. You could simply do parallel -j3 gzip -d ::: < <(find $(<dirlist) -name '*.gz') and that would only launch 3 gzips at a time.
– terdon♦
Jul 1 at 11:49
Your Answer
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f527813%2fperformance-of-loop-vs-expansion%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
5 Answers
5
active
oldest
votes
5 Answers
5
active
oldest
votes
active
oldest
votes
active
oldest
votes
Complications
The following will work only sometimes:
gzip -d `cat large_file_list`
Three problems are (in bash
and most other Bourne-like shells):
It will fail if any file name has space tab or newline characters in it (assuming
$IFS
has not been modified). This is because of the shell's word splitting.It is also liable to fail if any file name has glob-active characters in it. This is because the shell will apply pathname expansion to the file list.
It will also fail if filenames starts with
-
(ifPOSIXLY_CORRECT=1
that only applies to the first file) or if any filename is-
.It will also fail if there are too many file names in it to fit on one command line.
The code below is subject to the same problems as the code above (except for the fourth)
for file in `cat large_file_list`
do
gzip -d $file
done
Reliable solution
If your large_file_list
has exactly one file name per line, and a file called -
is not among them, and you're on a GNU system, then use:
xargs -rd'n' gzip -d -- <large_file_list
-d'n'
tells xargs
to treat each line of input as a separate file name.
-r
tells xargs
not to run the command if the input file is empty.
--
tells gzip
that the following arguments are not to be treated as options even if they start with -
. -
alone would still be treated as -
instead of the file called -
though.
xargs
will put many file names on each command line but not so many that it exceeds the command line limit. This reduces the number of times that a gzip
process must be started and therefore makes this fast. It is also safe: the file names will also be protected from word splitting and pathname expansion.
Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.
– Leon
Jul 1 at 6:20
1
@Leon Thefor
loop will be —by far— the slowest. The other two methods will be very close in speed to each other.
– John1024
Jul 1 at 6:36
7
Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.
– John1024
Jul 1 at 6:38
5
Note also that there's variation on reading a file withxargs
: at least GNU version has--arg-file
option (short form-a
). So one could doxargs -a large_file_list -rd'n' gzip -d
instead. Effectively, there's no difference, aside from the fact that<
is shell operator and would makexargs
read from stdin (which shell "links" to file), while-a
would makexargs
explicitly open the file in question
– Sergiy Kolodyazhnyy
Jul 1 at 7:34
2
terdon noted in another comment about usingparallel
to run multiple copies ofgzip
, butxargs
(at least the GNU one), has the-P
switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.
– ilkkachu
Jul 2 at 14:19
|
show 1 more comment
Complications
The following will work only sometimes:
gzip -d `cat large_file_list`
Three problems are (in bash
and most other Bourne-like shells):
It will fail if any file name has space tab or newline characters in it (assuming
$IFS
has not been modified). This is because of the shell's word splitting.It is also liable to fail if any file name has glob-active characters in it. This is because the shell will apply pathname expansion to the file list.
It will also fail if filenames starts with
-
(ifPOSIXLY_CORRECT=1
that only applies to the first file) or if any filename is-
.It will also fail if there are too many file names in it to fit on one command line.
The code below is subject to the same problems as the code above (except for the fourth)
for file in `cat large_file_list`
do
gzip -d $file
done
Reliable solution
If your large_file_list
has exactly one file name per line, and a file called -
is not among them, and you're on a GNU system, then use:
xargs -rd'n' gzip -d -- <large_file_list
-d'n'
tells xargs
to treat each line of input as a separate file name.
-r
tells xargs
not to run the command if the input file is empty.
--
tells gzip
that the following arguments are not to be treated as options even if they start with -
. -
alone would still be treated as -
instead of the file called -
though.
xargs
will put many file names on each command line but not so many that it exceeds the command line limit. This reduces the number of times that a gzip
process must be started and therefore makes this fast. It is also safe: the file names will also be protected from word splitting and pathname expansion.
Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.
– Leon
Jul 1 at 6:20
1
@Leon Thefor
loop will be —by far— the slowest. The other two methods will be very close in speed to each other.
– John1024
Jul 1 at 6:36
7
Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.
– John1024
Jul 1 at 6:38
5
Note also that there's variation on reading a file withxargs
: at least GNU version has--arg-file
option (short form-a
). So one could doxargs -a large_file_list -rd'n' gzip -d
instead. Effectively, there's no difference, aside from the fact that<
is shell operator and would makexargs
read from stdin (which shell "links" to file), while-a
would makexargs
explicitly open the file in question
– Sergiy Kolodyazhnyy
Jul 1 at 7:34
2
terdon noted in another comment about usingparallel
to run multiple copies ofgzip
, butxargs
(at least the GNU one), has the-P
switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.
– ilkkachu
Jul 2 at 14:19
|
show 1 more comment
Complications
The following will work only sometimes:
gzip -d `cat large_file_list`
Three problems are (in bash
and most other Bourne-like shells):
It will fail if any file name has space tab or newline characters in it (assuming
$IFS
has not been modified). This is because of the shell's word splitting.It is also liable to fail if any file name has glob-active characters in it. This is because the shell will apply pathname expansion to the file list.
It will also fail if filenames starts with
-
(ifPOSIXLY_CORRECT=1
that only applies to the first file) or if any filename is-
.It will also fail if there are too many file names in it to fit on one command line.
The code below is subject to the same problems as the code above (except for the fourth)
for file in `cat large_file_list`
do
gzip -d $file
done
Reliable solution
If your large_file_list
has exactly one file name per line, and a file called -
is not among them, and you're on a GNU system, then use:
xargs -rd'n' gzip -d -- <large_file_list
-d'n'
tells xargs
to treat each line of input as a separate file name.
-r
tells xargs
not to run the command if the input file is empty.
--
tells gzip
that the following arguments are not to be treated as options even if they start with -
. -
alone would still be treated as -
instead of the file called -
though.
xargs
will put many file names on each command line but not so many that it exceeds the command line limit. This reduces the number of times that a gzip
process must be started and therefore makes this fast. It is also safe: the file names will also be protected from word splitting and pathname expansion.
Complications
The following will work only sometimes:
gzip -d `cat large_file_list`
Three problems are (in bash
and most other Bourne-like shells):
It will fail if any file name has space tab or newline characters in it (assuming
$IFS
has not been modified). This is because of the shell's word splitting.It is also liable to fail if any file name has glob-active characters in it. This is because the shell will apply pathname expansion to the file list.
It will also fail if filenames starts with
-
(ifPOSIXLY_CORRECT=1
that only applies to the first file) or if any filename is-
.It will also fail if there are too many file names in it to fit on one command line.
The code below is subject to the same problems as the code above (except for the fourth)
for file in `cat large_file_list`
do
gzip -d $file
done
Reliable solution
If your large_file_list
has exactly one file name per line, and a file called -
is not among them, and you're on a GNU system, then use:
xargs -rd'n' gzip -d -- <large_file_list
-d'n'
tells xargs
to treat each line of input as a separate file name.
-r
tells xargs
not to run the command if the input file is empty.
--
tells gzip
that the following arguments are not to be treated as options even if they start with -
. -
alone would still be treated as -
instead of the file called -
though.
xargs
will put many file names on each command line but not so many that it exceeds the command line limit. This reduces the number of times that a gzip
process must be started and therefore makes this fast. It is also safe: the file names will also be protected from word splitting and pathname expansion.
edited Jul 2 at 19:32
Stéphane Chazelas
325k57 gold badges628 silver badges997 bronze badges
325k57 gold badges628 silver badges997 bronze badges
answered Jul 1 at 5:51
John1024John1024
51k5 gold badges117 silver badges132 bronze badges
51k5 gold badges117 silver badges132 bronze badges
Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.
– Leon
Jul 1 at 6:20
1
@Leon Thefor
loop will be —by far— the slowest. The other two methods will be very close in speed to each other.
– John1024
Jul 1 at 6:36
7
Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.
– John1024
Jul 1 at 6:38
5
Note also that there's variation on reading a file withxargs
: at least GNU version has--arg-file
option (short form-a
). So one could doxargs -a large_file_list -rd'n' gzip -d
instead. Effectively, there's no difference, aside from the fact that<
is shell operator and would makexargs
read from stdin (which shell "links" to file), while-a
would makexargs
explicitly open the file in question
– Sergiy Kolodyazhnyy
Jul 1 at 7:34
2
terdon noted in another comment about usingparallel
to run multiple copies ofgzip
, butxargs
(at least the GNU one), has the-P
switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.
– ilkkachu
Jul 2 at 14:19
|
show 1 more comment
Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.
– Leon
Jul 1 at 6:20
1
@Leon Thefor
loop will be —by far— the slowest. The other two methods will be very close in speed to each other.
– John1024
Jul 1 at 6:36
7
Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.
– John1024
Jul 1 at 6:38
5
Note also that there's variation on reading a file withxargs
: at least GNU version has--arg-file
option (short form-a
). So one could doxargs -a large_file_list -rd'n' gzip -d
instead. Effectively, there's no difference, aside from the fact that<
is shell operator and would makexargs
read from stdin (which shell "links" to file), while-a
would makexargs
explicitly open the file in question
– Sergiy Kolodyazhnyy
Jul 1 at 7:34
2
terdon noted in another comment about usingparallel
to run multiple copies ofgzip
, butxargs
(at least the GNU one), has the-P
switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.
– ilkkachu
Jul 2 at 14:19
Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.
– Leon
Jul 1 at 6:20
Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.
– Leon
Jul 1 at 6:20
1
1
@Leon The
for
loop will be —by far— the slowest. The other two methods will be very close in speed to each other.– John1024
Jul 1 at 6:36
@Leon The
for
loop will be —by far— the slowest. The other two methods will be very close in speed to each other.– John1024
Jul 1 at 6:36
7
7
Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.
– John1024
Jul 1 at 6:38
Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.
– John1024
Jul 1 at 6:38
5
5
Note also that there's variation on reading a file with
xargs
: at least GNU version has --arg-file
option (short form -a
). So one could do xargs -a large_file_list -rd'n' gzip -d
instead. Effectively, there's no difference, aside from the fact that <
is shell operator and would make xargs
read from stdin (which shell "links" to file), while -a
would make xargs
explicitly open the file in question– Sergiy Kolodyazhnyy
Jul 1 at 7:34
Note also that there's variation on reading a file with
xargs
: at least GNU version has --arg-file
option (short form -a
). So one could do xargs -a large_file_list -rd'n' gzip -d
instead. Effectively, there's no difference, aside from the fact that <
is shell operator and would make xargs
read from stdin (which shell "links" to file), while -a
would make xargs
explicitly open the file in question– Sergiy Kolodyazhnyy
Jul 1 at 7:34
2
2
terdon noted in another comment about using
parallel
to run multiple copies of gzip
, but xargs
(at least the GNU one), has the -P
switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.– ilkkachu
Jul 2 at 14:19
terdon noted in another comment about using
parallel
to run multiple copies of gzip
, but xargs
(at least the GNU one), has the -P
switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.– ilkkachu
Jul 2 at 14:19
|
show 1 more comment
I doubt it would matter much.
I would use a loop, just because I don't know how many files are listed in the list file, and I don't (generally) know if any of the filenames have spaces in their names. Doing a command substitution that would generate a very long list of argument may result in an "Argument list too long" error when the length of the list generated is too long.
My loop would look like
while IFS= read -r name; do
gunzip "$name"
done <file.list
This would additionally allow me to insert commands for processing the data after the gunzip
command. In fact, depending on what the data actually is and what needs to be done with it, it may even be possible to process it without saving it to file at all:
while IFS= read -r name; do
zcat "$name" | process_data
done <file.list
(where process_data
is some pipeline that reads the uncompressed data from standard input)
If the processing of the data takes longer than the uncompressing of it, the question of whether a loop is more efficient or not becomes irrelevant.
Ideally, I would prefer to not work off a list of filenames though, and instead use a filename globbing pattern, as in
for name in ./*.gz; do
# processing of "$name" here
done
where ./*.gz
is some pattern that matches the relevant files. This way we are not depending on the number of files nor on the characters used in the filenames (they may contain newlines or other whitespace characters, or start with dashes, etc.)
Related:
- Understanding "IFS= read -r line"
add a comment |
I doubt it would matter much.
I would use a loop, just because I don't know how many files are listed in the list file, and I don't (generally) know if any of the filenames have spaces in their names. Doing a command substitution that would generate a very long list of argument may result in an "Argument list too long" error when the length of the list generated is too long.
My loop would look like
while IFS= read -r name; do
gunzip "$name"
done <file.list
This would additionally allow me to insert commands for processing the data after the gunzip
command. In fact, depending on what the data actually is and what needs to be done with it, it may even be possible to process it without saving it to file at all:
while IFS= read -r name; do
zcat "$name" | process_data
done <file.list
(where process_data
is some pipeline that reads the uncompressed data from standard input)
If the processing of the data takes longer than the uncompressing of it, the question of whether a loop is more efficient or not becomes irrelevant.
Ideally, I would prefer to not work off a list of filenames though, and instead use a filename globbing pattern, as in
for name in ./*.gz; do
# processing of "$name" here
done
where ./*.gz
is some pattern that matches the relevant files. This way we are not depending on the number of files nor on the characters used in the filenames (they may contain newlines or other whitespace characters, or start with dashes, etc.)
Related:
- Understanding "IFS= read -r line"
add a comment |
I doubt it would matter much.
I would use a loop, just because I don't know how many files are listed in the list file, and I don't (generally) know if any of the filenames have spaces in their names. Doing a command substitution that would generate a very long list of argument may result in an "Argument list too long" error when the length of the list generated is too long.
My loop would look like
while IFS= read -r name; do
gunzip "$name"
done <file.list
This would additionally allow me to insert commands for processing the data after the gunzip
command. In fact, depending on what the data actually is and what needs to be done with it, it may even be possible to process it without saving it to file at all:
while IFS= read -r name; do
zcat "$name" | process_data
done <file.list
(where process_data
is some pipeline that reads the uncompressed data from standard input)
If the processing of the data takes longer than the uncompressing of it, the question of whether a loop is more efficient or not becomes irrelevant.
Ideally, I would prefer to not work off a list of filenames though, and instead use a filename globbing pattern, as in
for name in ./*.gz; do
# processing of "$name" here
done
where ./*.gz
is some pattern that matches the relevant files. This way we are not depending on the number of files nor on the characters used in the filenames (they may contain newlines or other whitespace characters, or start with dashes, etc.)
Related:
- Understanding "IFS= read -r line"
I doubt it would matter much.
I would use a loop, just because I don't know how many files are listed in the list file, and I don't (generally) know if any of the filenames have spaces in their names. Doing a command substitution that would generate a very long list of argument may result in an "Argument list too long" error when the length of the list generated is too long.
My loop would look like
while IFS= read -r name; do
gunzip "$name"
done <file.list
This would additionally allow me to insert commands for processing the data after the gunzip
command. In fact, depending on what the data actually is and what needs to be done with it, it may even be possible to process it without saving it to file at all:
while IFS= read -r name; do
zcat "$name" | process_data
done <file.list
(where process_data
is some pipeline that reads the uncompressed data from standard input)
If the processing of the data takes longer than the uncompressing of it, the question of whether a loop is more efficient or not becomes irrelevant.
Ideally, I would prefer to not work off a list of filenames though, and instead use a filename globbing pattern, as in
for name in ./*.gz; do
# processing of "$name" here
done
where ./*.gz
is some pattern that matches the relevant files. This way we are not depending on the number of files nor on the characters used in the filenames (they may contain newlines or other whitespace characters, or start with dashes, etc.)
Related:
- Understanding "IFS= read -r line"
edited Jul 1 at 9:39
answered Jul 1 at 7:21
Kusalananda♦Kusalananda
154k18 gold badges304 silver badges488 bronze badges
154k18 gold badges304 silver badges488 bronze badges
add a comment |
add a comment |
Out of those two, the one with all files passed to a single invocation of gzip
is likely to be faster, exactly because you only need to launch gzip
once. (That is, if the command works at all, see the other answers for the caveats.)
But, I'd like to remind of the golden rule of optimization: Don't do it prematurely.
Don't optimize that sort of thing before you know it's a problem.
Does this part of the program take a long time? Well, decompressing large files might, and you're going to have to do it anyway, so it might not be that easy to answer.
Measure. Really, it's the best way to be sure.
You'll see the results with your own eyes (or with your own stopwatch), and they'll apply to your situation which random answers on the Internet might not. Put both variants in scripts and run
time script1.sh
, andtime script2.sh
. (Do that with a list of empty compressed files to measure the absolute amount of the overhead.)
add a comment |
Out of those two, the one with all files passed to a single invocation of gzip
is likely to be faster, exactly because you only need to launch gzip
once. (That is, if the command works at all, see the other answers for the caveats.)
But, I'd like to remind of the golden rule of optimization: Don't do it prematurely.
Don't optimize that sort of thing before you know it's a problem.
Does this part of the program take a long time? Well, decompressing large files might, and you're going to have to do it anyway, so it might not be that easy to answer.
Measure. Really, it's the best way to be sure.
You'll see the results with your own eyes (or with your own stopwatch), and they'll apply to your situation which random answers on the Internet might not. Put both variants in scripts and run
time script1.sh
, andtime script2.sh
. (Do that with a list of empty compressed files to measure the absolute amount of the overhead.)
add a comment |
Out of those two, the one with all files passed to a single invocation of gzip
is likely to be faster, exactly because you only need to launch gzip
once. (That is, if the command works at all, see the other answers for the caveats.)
But, I'd like to remind of the golden rule of optimization: Don't do it prematurely.
Don't optimize that sort of thing before you know it's a problem.
Does this part of the program take a long time? Well, decompressing large files might, and you're going to have to do it anyway, so it might not be that easy to answer.
Measure. Really, it's the best way to be sure.
You'll see the results with your own eyes (or with your own stopwatch), and they'll apply to your situation which random answers on the Internet might not. Put both variants in scripts and run
time script1.sh
, andtime script2.sh
. (Do that with a list of empty compressed files to measure the absolute amount of the overhead.)
Out of those two, the one with all files passed to a single invocation of gzip
is likely to be faster, exactly because you only need to launch gzip
once. (That is, if the command works at all, see the other answers for the caveats.)
But, I'd like to remind of the golden rule of optimization: Don't do it prematurely.
Don't optimize that sort of thing before you know it's a problem.
Does this part of the program take a long time? Well, decompressing large files might, and you're going to have to do it anyway, so it might not be that easy to answer.
Measure. Really, it's the best way to be sure.
You'll see the results with your own eyes (or with your own stopwatch), and they'll apply to your situation which random answers on the Internet might not. Put both variants in scripts and run
time script1.sh
, andtime script2.sh
. (Do that with a list of empty compressed files to measure the absolute amount of the overhead.)
answered Jul 1 at 15:03
ilkkachuilkkachu
64.9k10 gold badges109 silver badges189 bronze badges
64.9k10 gold badges109 silver badges189 bronze badges
add a comment |
add a comment |
How fast is your disk?
This should use all your CPUs:
parallel -X gzip -d :::: large_file_list
So your limit is likely going to be the speed of your disk.
You can try adjusting with -j
:
parallel -j50% -X gzip -d :::: large_file_list
This will run half of the jobs in parallel as the previous command, and will stress your disk less, so depending on your disk this can be faster.
add a comment |
How fast is your disk?
This should use all your CPUs:
parallel -X gzip -d :::: large_file_list
So your limit is likely going to be the speed of your disk.
You can try adjusting with -j
:
parallel -j50% -X gzip -d :::: large_file_list
This will run half of the jobs in parallel as the previous command, and will stress your disk less, so depending on your disk this can be faster.
add a comment |
How fast is your disk?
This should use all your CPUs:
parallel -X gzip -d :::: large_file_list
So your limit is likely going to be the speed of your disk.
You can try adjusting with -j
:
parallel -j50% -X gzip -d :::: large_file_list
This will run half of the jobs in parallel as the previous command, and will stress your disk less, so depending on your disk this can be faster.
How fast is your disk?
This should use all your CPUs:
parallel -X gzip -d :::: large_file_list
So your limit is likely going to be the speed of your disk.
You can try adjusting with -j
:
parallel -j50% -X gzip -d :::: large_file_list
This will run half of the jobs in parallel as the previous command, and will stress your disk less, so depending on your disk this can be faster.
answered Jul 2 at 17:09
Ole TangeOle Tange
13.6k17 gold badges60 silver badges108 bronze badges
13.6k17 gold badges60 silver badges108 bronze badges
add a comment |
add a comment |
ADDED after some profiling:
I made one hundred dirs 00 to 99 on a tmpfs and filled them with man-1 files starting with "s". That gives me 113 files from 20 byte to 69 K in each dir, totalling to 1.2 Mb in each dir.
Result: it ALL works:
Globbing and huge arg list:
gzip -d ??/*
and
gzip -d $(<filelist)
both 0.8 s
With a loop:
for file in $(<filelist)
do
gzip -d $file
done
14 sec
And when I add my ampersand:
...
gzip -d $file &
...
5.8 sec
The performance side of this is easily explained: the less you call gzip and the less you wait for it, the faster it runs.
Now: Is there at all a maximum arg list length? Is there a maximum number of processes? (I have ulimit -u of 31366, but it looks like I could even raise that, as root on my 300$ i5 Mini-PC...). Or will the max. be "only limited by the amount of avail. Memory".
Despite the overhead of ampersanding (forking off) 11300 mostly tiny jobs, this takes half as long as to wait for each command to complete. And this was on a tmpfs...
I forgot this one:
time find ?? -name "*.gz" -exec gzip -d +
also 0.8 s (6.2 s without the "+", but with ugly ";")
ORIGINAL answer:
Short answer: use ampersand &
to execute each gzip
command in the background to start the next loop iteration immediately.
for filnam in $(<flist)
do
gzip -d $filnam &
done
I also replaced cat
with "<" (not in a loop, so it is not critical) and backticks with $()
(more "modern").
to test/benchmark: place the shell builtin time
in front of your command: I did time . loop.cmd
(with loop.cmd as shown above). This gives you the time spent, and after hitting Enter
again the shell shows you the finished gzip "jobs".
I tested this out some weeks ago with a (small) list of (medium size) tar files to extract. The &
made such a difference in time
I could hardly believe it. The & has a double effect: it not only starts the next gzip immediately, it also gives you the prompt as soon as the last gzip got started.
CAREFUL: if you have whitespace in your filenames, it doesn't work. I don't know how to implement that. I tried with quotes and IFS...
Another well-known tip is: use globbing, if possible. Maybe start with a list of directories and for each go (cd $dir; gzip -d *.gz &)
.
--> I only can give you this ampersand "trick". First step is always optimizing the design. I am not sure what exactly you want this huge list of files for? Is it a hand made list? If it is generated by rules, then maybe use these rules directly and not via a filelist.
Your second example should be a bit faster, because gzip starts only once. But: I don't know what the limit is if you go gzip -d 1.gz 2.gz 3.gz ... 9999.gz
, i.e. giving gzip a very large argument list.
Whitespace: a filename like '10 A.gz'
results in gzip -d ... 9.gz 10 A.gz 11.gz...
. Gzip will say: '10' and 'A.gz' not found...unless these files DO exist, and if it is rm
and not gzip
then you have that classic shell trap.
added:
What about:
find $(<dirlist) -name "*.gz" -exec gzip -d +
...starting from a smaller list of dirs? The +
is the "parallelization" option (a find -exec
thing, like ...).
If you can redesign a bit, this must be the best...
In the meanwhile, "kusa" has answered...this IFS/read is what I meant...I hope my explanations are useful anyway.
– sam68
Jul 1 at 8:37
Now I proposed a "find -exec" command at the end of my answer. Maybe not what Leon wants, but still important.
– sam68
Jul 1 at 9:34
4
Your loop will probably bring the OP's machine to its knees. The file list can have up to 10000 files, so running ten thousand gzip instances in parallel will very likely cause the system to hang.
– terdon♦
Jul 1 at 9:44
process #1 might be finished when #10 or #100 gets started...it started as a performance Q, then whitespace, now system resources...this question is not so easy to answer! A extra process/job for each small .gz would not be ideal, I admit. 10000 separate gzips is the thing to avoid in the first place.
– sam68
Jul 1 at 11:09
5
This is the sort of thing GNUparallel
is for. You could simply doparallel -j3 gzip -d ::: < <(find $(<dirlist) -name '*.gz')
and that would only launch 3 gzips at a time.
– terdon♦
Jul 1 at 11:49
|
show 2 more comments
ADDED after some profiling:
I made one hundred dirs 00 to 99 on a tmpfs and filled them with man-1 files starting with "s". That gives me 113 files from 20 byte to 69 K in each dir, totalling to 1.2 Mb in each dir.
Result: it ALL works:
Globbing and huge arg list:
gzip -d ??/*
and
gzip -d $(<filelist)
both 0.8 s
With a loop:
for file in $(<filelist)
do
gzip -d $file
done
14 sec
And when I add my ampersand:
...
gzip -d $file &
...
5.8 sec
ADDED after some profiling:
I made one hundred dirs, 00 to 99, on a tmpfs and filled each of them with the man-1 pages whose names start with "s". That gives me 113 files, from 20 bytes to 69 KB, in each dir, totalling about 1.2 MB per dir.
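For reference, a rough sketch of how such a test tree could be rebuilt (my addition; the tmpfs mount point /dev/shm and the man page location /usr/share/man/man1/*.1.gz are assumptions about a typical Linux install, not part of the original test):
# build 100 dirs, 00..99, each filled with compressed section-1 man pages starting with "s"
mkdir -p /dev/shm/gztest && cd /dev/shm/gztest
for d in $(seq -w 0 99)
do
    mkdir "$d"
    cp /usr/share/man/man1/s*.1.gz "$d"/
done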
Result: it ALL works:
Globbing and huge arg list:
gzip -d ??/*
and
gzip -d $(<filelist)
both 0.8 s
With a loop:
for file in $(<filelist)
do
gzip -d $file
done
14 s
And when I add my ampersand:
...
gzip -d $file &
...
5.8 s
The performance side of this is easily explained: the fewer times you start gzip, and the less you wait for each invocation to finish, the faster it runs.
Now: is there a maximum argument list length at all? Is there a maximum number of processes? (I have a ulimit -u of 31366, but it looks like I could raise even that, as root, on my $300 i5 mini-PC...) Or is the maximum "only limited by the amount of available memory"?
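If you would rather look these limits up than guess, the standard commands below report them (my addition; the numbers they print are of course machine-specific):
getconf ARG_MAX                   # max bytes of argv + environment for one exec
ulimit -u                         # max number of user processes for this shell/user
xargs --show-limits </dev/null    # GNU xargs: the command-line size it will actually use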
Despite the overhead of backgrounding (forking off) 11300 mostly tiny jobs, this takes less than half as long as waiting for each command to complete. And this was on a tmpfs...
I forgot this one:
time find ?? -name "*.gz" -exec gzip -d {} +
also 0.8 s (6.2 s without the "+", i.e. with the ugly ";", which starts one gzip per file)
ORIGINAL answer:
Short answer: use an ampersand (&) to run each gzip command in the background, so that the next loop iteration starts immediately.
for filnam in $(<flist)
do
gzip -d $filnam &
done
I also replaced cat with the "<" redirection (it is not inside the loop, so it is not critical) and the backticks with $() (more "modern").
To test/benchmark: place the shell builtin time in front of your command. I did time . loop.cmd (with loop.cmd as shown above). This gives you the time spent, and after hitting Enter again the shell shows you the finished gzip "jobs".
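One caveat worth adding (my note, not part of the original benchmark): once the loop backgrounds its jobs, time only measures how long it takes to launch them. A minimal sketch that times the whole run, launch plus decompression:
# "wait" blocks until every background gzip has finished, so "time" covers the real work
time { . loop.cmd; wait; }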
I tested this out some weeks ago with a (small) list of (medium-sized) tar files to extract. The & made such a difference in time that I could hardly believe it. The & has a double effect: it not only starts the next gzip immediately, it also gives you the prompt back as soon as the last gzip has been started.
CAREFUL: if you have whitespace in your filenames, this doesn't work. I don't know how to handle that within this loop; I tried with quotes and IFS...
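A possible whitespace-safe variant (my sketch, not from the original answer): read the list line by line with the IFS= read -r idiom mentioned in the comments. It assumes flist holds exactly one filename per line; filenames containing newlines would still break it:
while IFS= read -r filnam
do
    gzip -d "$filnam" &
done < flist
wait    # let all background gzip jobs finish before moving on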
Another well-known tip: use globbing if possible. Maybe start with a list of directories and, for each directory, run (cd "$dir" && gzip -d *.gz) &.
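Spelled out as a loop (my sketch; it assumes dirlist holds one directory name per line and that those names contain no whitespace):
for dir in $(<dirlist)
do
    ( cd "$dir" && gzip -d ./*.gz ) &    # && so gzip never runs in the wrong directory
done
wait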
--> I can only give you this ampersand "trick". The first step is always to optimize the design: I am not sure what exactly this huge list of files is for. Is it a hand-made list? If it is generated by rules, then maybe apply those rules directly instead of going through a file list.
Your second example should be a bit faster, because gzip starts only once. But I don't know what the limit is if you run gzip -d 1.gz 2.gz 3.gz ... 9999.gz, i.e. hand gzip a very large argument list.
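If the list ever does blow past the kernel's argument-length limit, xargs is the usual middle ground (my addition, along the lines of one of the comments on this page): it packs as many names as fit into each gzip invocation. This still assumes the filenames contain no whitespace; -n and -P are supported by both GNU and BSD xargs:
xargs gzip -d < large_file_list                 # a few gzip runs, each with a big batch of names
xargs -n 500 -P 4 gzip -d < large_file_list     # at most 4 gzip processes at a time, 500 names each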
Whitespace: a filename like '10 A.gz' results in gzip -d ... 9.gz 10 A.gz 11.gz.... Gzip will say '10' and 'A.gz' not found... unless files with those names DO exist; and if the command is rm rather than gzip, then you have the classic shell trap.
added:
What about:
find $(<dirlist) -name "*.gz" -exec gzip -d {} +
...starting from a smaller list of dirs? The + makes find pass many filenames to each gzip invocation (a find -exec feature that works much like xargs), instead of running one gzip per file.
If you can redesign things a bit, this is probably the best option...
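For completeness, a variant of that idea which is both whitespace-safe and bounded in parallelism (my sketch; GNU find and xargs assumed, and the directory names in dirlist themselves must not contain whitespace, since $(<dirlist) is still word-split):
find $(<dirlist) -name '*.gz' -print0 | xargs -0 -P 4 gzip -d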
edited Jul 2 at 16:24
answered Jul 1 at 8:32
sam68
In the meantime, "kusa" has answered... this IFS/read approach is what I meant... I hope my explanations are useful anyway.
– sam68
Jul 1 at 8:37
I have now proposed a "find -exec" command at the end of my answer. Maybe not what Leon wants, but still important.
– sam68
Jul 1 at 9:34
4
Your loop will probably bring the OP's machine to its knees. The file list can have up to 10000 files, so running ten thousand gzip instances in parallel will very likely cause the system to hang.
– terdon♦
Jul 1 at 9:44
Process #1 might be finished by the time #10 or #100 gets started... it started as a performance question, then whitespace, now system resources... this question is not so easy to answer! An extra process/job for each small .gz would not be ideal, I admit. 10000 separate gzips is the thing to avoid in the first place.
– sam68
Jul 1 at 11:09
5
This is the sort of thing GNU parallel is for. You could simply do parallel -j3 gzip -d ::: < <(find $(<dirlist) -name '*.gz')
and that would only launch 3 gzips at a time.
– terdon♦
Jul 1 at 11:49
1
Ok, then it may also depend on the length of the filenames. If the filenames are long, some systems might generate an "argument list too long" error if you tried to do it without a loop, since the command substitution would result in a command line too long for the shell to execute. If you don't want to depend on the number of files in the list, just use a loop. Are you spending a significant amount of time decompressing these files compared to the other processing that you will perform on them?
– Kusalananda♦
Jul 1 at 6:54
Leon, take a look at my test results: "huge-arglist" is 20x faster than "loop" in my setup.
– sam68
Jul 2 at 16:08
for a happy medium between process starts and command line length, use something like
xargs gzip -d < large_file_list
but watch out for spaces in filenames, maybe with tr '\n' '\0' < large_file_list | xargs -0 gzip -d
– w00t
Jul 5 at 8:13