Performance of loop vs expansion


I need expert suggestions on the comparison below.

Code segment using a loop:

for file in `cat large_file_list`
do
gzip -d $file
done

Code segment using simple expansion:

gzip -d `cat large_file_list`

Which one will be faster? I have to manipulate a large data set.










Tags: linux, bash, shell-script, shell






asked Jul 1 at 5:16 by Leon







Comments on the question:

  • The correct answer will depend on how long it takes to start gzip on your system, the number of files in the file list, and the size of those files. – Kusalananda, Jul 1 at 6:25

  • The file list will have around 1,000–10,000 files. Sizes vary from a few kilobytes to 500 MB. I have no idea how long it takes to start gzip on my system. Is there any way to check? – Leon, Jul 1 at 6:27

  • OK, then it may also depend on the length of the filenames. If the filenames are long, some systems might generate an "argument list too long" error if you tried to do it without a loop, since the command substitution would produce a command line too long for the shell to execute. If you don't want to depend on the number of files in the list, just use a loop. Are you spending a significant amount of time decompressing these files compared to the other processing that you will perform on them? – Kusalananda, Jul 1 at 6:54

  • Leon, take a look at my test results: "huge-arglist" is 20x faster than "loop" in my setting. – sam68, Jul 2 at 16:08

  • For a happy medium between process starts and command-line length, use something like xargs gzip -d < large_file_list, but watch out for spaces in filenames, perhaps with tr '\n' '\0' < large_file_list | xargs -0 gzip -d. – w00t, Jul 5 at 8:13












5 Answers


















Answer by John1024 (score 19; answered Jul 1 at 5:51, edited Jul 2 at 19:32 by Stéphane Chazelas)














Complications

The following will work only sometimes:

gzip -d `cat large_file_list`

The problems are (in bash and most other Bourne-like shells):

  1. It will fail if any file name contains space, tab or newline characters (assuming $IFS has not been modified). This is because of the shell's word splitting.

  2. It is also liable to fail if any file name contains glob-active characters. This is because the shell will apply pathname expansion to the file list.

  3. It will also fail if any filename starts with - (if POSIXLY_CORRECT=1, that only applies to the first file) or if any filename is -.

  4. It will also fail if there are too many file names to fit on one command line.

The code below is subject to the same problems as the code above (except for the fourth):

for file in `cat large_file_list`
do
gzip -d $file
done
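To make problem 1 concrete, here is a small demonstration added for illustration (the file name 'my file.txt' and the steps are hypothetical, not from the question):

# A file whose name contains a space, and a list file that refers to it.
printf 'hello\n' > 'my file.txt'
gzip 'my file.txt'                               # produces 'my file.txt.gz'
printf '%s\n' 'my file.txt.gz' > large_file_list

# Word splitting hands gzip two separate arguments, 'my' and 'file.txt.gz',
# so it cannot find either file and the decompression fails.
gzip -d `cat large_file_list`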


Reliable solution

If your large_file_list has exactly one file name per line, a file called - is not among them, and you're on a GNU system, then use:

xargs -rd'\n' gzip -d -- <large_file_list

-d'\n' tells xargs to treat each line of input as a separate file name.

-r tells xargs not to run the command if the input file is empty.

-- tells gzip that the following arguments are not to be treated as options even if they start with -. A lone - would still be treated as standard input rather than as the file called -, though.

xargs will put many file names on each command line, but not so many that it exceeds the command-line limit. This reduces the number of times a gzip process must be started and therefore makes this fast. It is also safe: the file names are protected from word splitting and pathname expansion.
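If you want to see the limits involved on your own machine, the following sketch (an addition, assuming GNU xargs and coreutils; the numbers printed are system-dependent) shows how to inspect them and how to preview the batches xargs would build:

# Kernel limit on the combined size of arguments plus environment, in bytes.
getconf ARG_MAX

# GNU xargs can report the limits it will actually honour.
xargs -r --show-limits </dev/null

# Dry run: print the gzip command lines xargs would build from the list,
# without decompressing anything.
xargs -rd'\n' echo gzip -d -- <large_file_list | head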






Comments:

  • Thanks for the detailed reply. I understand the issues you mention. The file names are simple and won't face those challenges, and the list will hold up to 20,000 entries. My question is basically about the performance of those two segments. Thanks. – Leon, Jul 1 at 6:20

  • @Leon The for loop will be by far the slowest. The other two methods will be very close in speed to each other. – John1024, Jul 1 at 6:36

  • Also, don't dismiss the potential problems: many, many questions here on Stack Exchange arise because word splitting or pathname expansion happened to people who weren't expecting it. – John1024, Jul 1 at 6:38

  • Note also that there's a variation on reading a file with xargs: at least the GNU version has an --arg-file option (short form -a). So one could do xargs -a large_file_list -rd'\n' gzip -d instead. Effectively there's no difference, aside from the fact that < is a shell operator and would make xargs read from stdin (which the shell "links" to the file), while -a makes xargs explicitly open the file in question. – Sergiy Kolodyazhnyy, Jul 1 at 7:34

  • terdon noted in another comment about using parallel to run multiple copies of gzip, but xargs (at least the GNU one) has the -P switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway. – ilkkachu, Jul 2 at 14:19


















Answer by Kusalananda (score 12; answered Jul 1 at 7:21, edited Jul 1 at 9:39)














I doubt it would matter much.

I would use a loop, just because I don't know how many files are listed in the list file, and I don't (generally) know whether any of the filenames have spaces in them. Doing a command substitution that generates a very long list of arguments may result in an "Argument list too long" error when the generated list exceeds the shell's limit.

My loop would look like

while IFS= read -r name; do
gunzip "$name"
done <file.list

This would additionally allow me to insert commands for processing the data after the gunzip command. In fact, depending on what the data actually is and what needs to be done with it, it may even be possible to process it without saving it to a file at all:

while IFS= read -r name; do
zcat "$name" | process_data
done <file.list

(where process_data is some pipeline that reads the uncompressed data from standard input)

If the processing of the data takes longer than the uncompressing of it, the question of whether a loop is more efficient or not becomes irrelevant.

Ideally, though, I would prefer not to work off a list of filenames, and instead use a filename globbing pattern, as in

for name in ./*.gz; do
# processing of "$name" here
done

where ./*.gz is some pattern that matches the relevant files. This way we are not depending on the number of files nor on the characters used in the filenames (they may contain newlines or other whitespace characters, start with dashes, etc.).
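One bash-specific detail worth adding to the glob variant (an addition, not part of the original answer): if no .gz files match, the pattern ./*.gz is left unexpanded and the loop body runs once on the literal pattern, so it can help to enable nullglob first:

# Bash: make a non-matching glob expand to nothing instead of to itself.
shopt -s nullglob
for name in ./*.gz; do
    zcat "$name" | wc -c    # placeholder processing; replace with real work
done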



Related:



  • Understanding "IFS= read -r line"





Answer by ilkkachu (score 6; answered Jul 1 at 15:03)














Out of those two, the one with all files passed to a single invocation of gzip is likely to be faster, exactly because you only need to launch gzip once. (That is, if the command works at all; see the other answers for the caveats.)

But I'd like to remind you of the golden rule of optimization: don't do it prematurely.

  1. Don't optimize that sort of thing before you know it's a problem.

     Does this part of the program take a long time? Well, decompressing large files might, and you're going to have to do it anyway, so it might not be that easy to answer.

  2. Measure. Really, it's the best way to be sure.

     You'll see the results with your own eyes (or with your own stopwatch), and they'll apply to your situation, which random answers on the Internet might not. Put both variants in scripts and run time script1.sh and time script2.sh. (Do that with a list of empty compressed files to measure the absolute amount of the overhead.)
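As a concrete way to follow the "measure" advice, here is a rough sketch added for illustration (assuming bash and GNU gzip; the directory, file count, and names are made up for the test, not taken from the question):

# Build a throwaway corpus of 1000 tiny gzip files.
mkdir -p /tmp/gzbench && cd /tmp/gzbench
for i in $(seq 1 1000); do : > "file$i"; gzip "file$i"; done
ls *.gz > large_file_list

# Variant 1: one gzip process per file.
time for file in `cat large_file_list`; do gzip -d "$file"; done

# Re-compress, then Variant 2: a single gzip invocation for the whole list.
gzip file*
time gzip -d `cat large_file_list`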







Answer by Ole Tange (score 0; answered Jul 2 at 17:09)














How fast is your disk?

This should use all your CPUs:

parallel -X gzip -d :::: large_file_list

So your limit is likely going to be the speed of your disk.

You can try adjusting with -j:

parallel -j50% -X gzip -d :::: large_file_list

This will run half as many jobs in parallel as the previous command and will stress your disk less, so depending on your disk this can be faster.
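If GNU parallel is not installed, a roughly equivalent sketch (an addition, assuming GNU xargs and coreutils, and one file name per line in large_file_list; the batch size of 100 is an arbitrary choice) uses the -P option of xargs:

# Up to one gzip per CPU core, each handed a batch of at most 100 names.
xargs -rd'\n' -n100 -P"$(nproc)" gzip -d -- <large_file_list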






Answer by sam68 (score -1)














ADDED after some profiling:

I made one hundred dirs, 00 to 99, on a tmpfs and filled them with the man-1 files starting with "s". That gives me 113 files from 20 bytes to 69 K in each dir, totalling 1.2 MB per dir.

Result: it ALL works.

Globbing and a huge arg list:

gzip -d ??/*
and
gzip -d $(<filelist)

both 0.8 s.

With a loop:

for file in $(<filelist)
do
gzip -d $file
done

14 sec.

And when I add my ampersand:

...
gzip -d $file &
...

5.8 sec.

The performance side of this is easily explained: the less you call gzip and the less you wait for it, the faster it runs.

Now: is there a maximum arg list length at all? Is there a maximum number of processes? (I have a ulimit -u of 31366, but it looks like I could even raise that, as root on my $300 i5 mini-PC...) Or will the maximum be limited only by the amount of available memory?

Despite the overhead of ampersanding (forking off) 11300 mostly tiny jobs, this takes half as long as waiting for each command to complete. And this was on a tmpfs...

I forgot this one:

time find ?? -name "*.gz" -exec gzip -d {} +

also 0.8 s (6.2 s without the "+", but with the ugly ";").

ORIGINAL answer:

Short answer: use an ampersand & to execute each gzip command in the background, so the next loop iteration starts immediately.

for filnam in $(<flist)
do
gzip -d $filnam &
done

I also replaced cat with "<" (not in a loop, so it is not critical) and backticks with $() (more "modern").

To test/benchmark: place the shell builtin time in front of your command. I did time . loop.cmd (with loop.cmd as shown above). This gives you the time spent, and after hitting Enter again the shell shows you the finished gzip "jobs".

I tested this out some weeks ago with a (small) list of (medium-size) tar files to extract. The & made such a difference in time I could hardly believe it. The & has a double effect: it not only starts the next gzip immediately, it also gives you the prompt back as soon as the last gzip has been started.
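Since backgrounding every single file can fork thousands of gzip processes at once (a concern raised in the comments below), here is a hedged, bash 4.3+ sketch of the same idea with a cap on concurrency, added for illustration; the limit of 4 is arbitrary, and reading with IFS= read -r also copes with spaces in names:

max_jobs=4
while IFS= read -r filnam; do
    gzip -d "$filnam" &
    # Once the cap is reached, wait for any one background gzip to finish.
    while (( $(jobs -rp | wc -l) >= max_jobs )); do
        wait -n
    done
done <flist
wait    # let the remaining background jobs finish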



CAREFUL: if you have whitespace in your filenames, this doesn't work. I don't know how to make it handle that; I tried with quotes and IFS...

Another well-known tip: use globbing if possible. Maybe start with a list of directories and, for each, run (cd $dir; gzip -d *.gz &).

I can only give you this ampersand "trick". The first step is always to optimize the design. I am not sure what exactly you want this huge list of files for. Is it a hand-made list? If it is generated by rules, then maybe use those rules directly and not via a file list.

Your second example should be a bit faster, because gzip starts only once. But I don't know what the limit is if you run gzip -d 1.gz 2.gz 3.gz ... 9999.gz, i.e. give gzip a very large argument list.

Whitespace: a filename like '10 A.gz' results in gzip -d ... 9.gz 10 A.gz 11.gz ... . gzip will say '10' and 'A.gz' not found... unless those files DO exist; and if the command is rm rather than gzip, you have that classic shell trap.

added:

What about:

find $(<dirlist) -name "*.gz" -exec gzip -d {} +

...starting from a smaller list of dirs? The + is the "parallelization" option (a find -exec thing, like ...).

If you can redesign a bit, this would be best.






Comments:

  • In the meantime, "kusa" has answered... this IFS/read is what I meant. I hope my explanations are useful anyway. – sam68, Jul 1 at 8:37

  • Now I have proposed a "find -exec" command at the end of my answer. Maybe not what Leon wants, but still important. – sam68, Jul 1 at 9:34

  • Your loop will probably bring the OP's machine to its knees. The file list can have up to 10,000 files, so running ten thousand gzip instances in parallel will very likely cause the system to hang. – terdon, Jul 1 at 9:44

  • Process #1 might be finished when #10 or #100 gets started... It started as a performance question, then whitespace, now system resources... this question is not so easy to answer! An extra process/job for each small .gz would not be ideal, I admit. 10,000 separate gzips is the thing to avoid in the first place. – sam68, Jul 1 at 11:09

  • This is the sort of thing GNU parallel is for. You could simply do parallel -j3 gzip -d ::: < <(find $(<dirlist) -name '*.gz') and that would only launch 3 gzips at a time. – terdon, Jul 1 at 11:49






        Your Answer








        StackExchange.ready(function()
        var channelOptions =
        tags: "".split(" "),
        id: "106"
        ;
        initTagRenderer("".split(" "), "".split(" "), channelOptions);

        StackExchange.using("externalEditor", function()
        // Have to fire editor after snippets, if snippets enabled
        if (StackExchange.settings.snippets.snippetsEnabled)
        StackExchange.using("snippets", function()
        createEditor();
        );

        else
        createEditor();

        );

        function createEditor()
        StackExchange.prepareEditor(
        heartbeatType: 'answer',
        autoActivateHeartbeat: false,
        convertImagesToLinks: false,
        noModals: true,
        showLowRepImageUploadWarning: true,
        reputationToPostImages: null,
        bindNavPrevention: true,
        postfix: "",
        imageUploader:
        brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
        contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
        allowUrls: true
        ,
        onDemand: true,
        discardSelector: ".discard-answer"
        ,immediatelyShowMarkdownHelp:true
        );



        );













        draft saved

        draft discarded


















        StackExchange.ready(
        function ()
        StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f527813%2fperformance-of-loop-vs-expansion%23new-answer', 'question_page');

        );

        Post as a guest















        Required, but never shown

























        5 Answers
        5






        active

        oldest

        votes








        5 Answers
        5






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes









        19














        Complications



        The following will work only sometimes:



        gzip -d `cat large_file_list`


        Three problems are (in bash and most other Bourne-like shells):



        1. It will fail if any file name has space tab or newline characters in it (assuming $IFS has not been modified). This is because of the shell's word splitting.


        2. It is also liable to fail if any file name has glob-active characters in it. This is because the shell will apply pathname expansion to the file list.


        3. It will also fail if filenames starts with - (if POSIXLY_CORRECT=1 that only applies to the first file) or if any filename is -.


        4. It will also fail if there are too many file names in it to fit on one command line.


        The code below is subject to the same problems as the code above (except for the fourth)



        for file in `cat large_file_list`
        do
        gzip -d $file
        done


        Reliable solution



        If your large_file_list has exactly one file name per line, and a file called - is not among them, and you're on a GNU system, then use:



        xargs -rd'n' gzip -d -- <large_file_list


        -d'n' tells xargs to treat each line of input as a separate file name.



        -r tells xargs not to run the command if the input file is empty.



        -- tells gzip that the following arguments are not to be treated as options even if they start with -. - alone would still be treated as - instead of the file called - though.



        xargs will put many file names on each command line but not so many that it exceeds the command line limit. This reduces the number of times that a gzip process must be started and therefore makes this fast. It is also safe: the file names will also be protected from word splitting and pathname expansion.






        share|improve this answer

























        • Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.

          – Leon
          Jul 1 at 6:20







        • 1





          @Leon The for loop will be —by far— the slowest. The other two methods will be very close in speed to each other.

          – John1024
          Jul 1 at 6:36







        • 7





          Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.

          – John1024
          Jul 1 at 6:38






        • 5





          Note also that there's variation on reading a file with xargs: at least GNU version has --arg-file option (short form -a). So one could do xargs -a large_file_list -rd'n' gzip -d instead. Effectively, there's no difference, aside from the fact that < is shell operator and would make xargs read from stdin (which shell "links" to file), while -a would make xargs explicitly open the file in question

          – Sergiy Kolodyazhnyy
          Jul 1 at 7:34






        • 2





          terdon noted in another comment about using parallel to run multiple copies of gzip, but xargs (at least the GNU one), has the -P switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.

          – ilkkachu
          Jul 2 at 14:19















        19














        Complications



        The following will work only sometimes:



        gzip -d `cat large_file_list`


        Three problems are (in bash and most other Bourne-like shells):



        1. It will fail if any file name has space tab or newline characters in it (assuming $IFS has not been modified). This is because of the shell's word splitting.


        2. It is also liable to fail if any file name has glob-active characters in it. This is because the shell will apply pathname expansion to the file list.


        3. It will also fail if filenames starts with - (if POSIXLY_CORRECT=1 that only applies to the first file) or if any filename is -.


        4. It will also fail if there are too many file names in it to fit on one command line.


        The code below is subject to the same problems as the code above (except for the fourth)



        for file in `cat large_file_list`
        do
        gzip -d $file
        done


        Reliable solution



        If your large_file_list has exactly one file name per line, and a file called - is not among them, and you're on a GNU system, then use:



        xargs -rd'n' gzip -d -- <large_file_list


        -d'n' tells xargs to treat each line of input as a separate file name.



        -r tells xargs not to run the command if the input file is empty.



        -- tells gzip that the following arguments are not to be treated as options even if they start with -. - alone would still be treated as - instead of the file called - though.



        xargs will put many file names on each command line but not so many that it exceeds the command line limit. This reduces the number of times that a gzip process must be started and therefore makes this fast. It is also safe: the file names will also be protected from word splitting and pathname expansion.






        share|improve this answer

























        • Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.

          – Leon
          Jul 1 at 6:20







        • 1





          @Leon The for loop will be —by far— the slowest. The other two methods will be very close in speed to each other.

          – John1024
          Jul 1 at 6:36







        • 7





          Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.

          – John1024
          Jul 1 at 6:38






        • 5





          Note also that there's variation on reading a file with xargs: at least GNU version has --arg-file option (short form -a). So one could do xargs -a large_file_list -rd'n' gzip -d instead. Effectively, there's no difference, aside from the fact that < is shell operator and would make xargs read from stdin (which shell "links" to file), while -a would make xargs explicitly open the file in question

          – Sergiy Kolodyazhnyy
          Jul 1 at 7:34






        • 2





          terdon noted in another comment about using parallel to run multiple copies of gzip, but xargs (at least the GNU one), has the -P switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.

          – ilkkachu
          Jul 2 at 14:19













        19












        19








        19







        Complications



        The following will work only sometimes:



        gzip -d `cat large_file_list`


        Three problems are (in bash and most other Bourne-like shells):



        1. It will fail if any file name has space tab or newline characters in it (assuming $IFS has not been modified). This is because of the shell's word splitting.


        2. It is also liable to fail if any file name has glob-active characters in it. This is because the shell will apply pathname expansion to the file list.


        3. It will also fail if filenames starts with - (if POSIXLY_CORRECT=1 that only applies to the first file) or if any filename is -.


        4. It will also fail if there are too many file names in it to fit on one command line.


        The code below is subject to the same problems as the code above (except for the fourth)



        for file in `cat large_file_list`
        do
        gzip -d $file
        done


        Reliable solution



        If your large_file_list has exactly one file name per line, and a file called - is not among them, and you're on a GNU system, then use:



        xargs -rd'n' gzip -d -- <large_file_list


        -d'n' tells xargs to treat each line of input as a separate file name.



        -r tells xargs not to run the command if the input file is empty.



        -- tells gzip that the following arguments are not to be treated as options even if they start with -. - alone would still be treated as - instead of the file called - though.



        xargs will put many file names on each command line but not so many that it exceeds the command line limit. This reduces the number of times that a gzip process must be started and therefore makes this fast. It is also safe: the file names will also be protected from word splitting and pathname expansion.






        share|improve this answer















        Complications



        The following will work only sometimes:



        gzip -d `cat large_file_list`


        Three problems are (in bash and most other Bourne-like shells):



        1. It will fail if any file name has space tab or newline characters in it (assuming $IFS has not been modified). This is because of the shell's word splitting.


        2. It is also liable to fail if any file name has glob-active characters in it. This is because the shell will apply pathname expansion to the file list.


        3. It will also fail if filenames starts with - (if POSIXLY_CORRECT=1 that only applies to the first file) or if any filename is -.


        4. It will also fail if there are too many file names in it to fit on one command line.


        The code below is subject to the same problems as the code above (except for the fourth)



        for file in `cat large_file_list`
        do
        gzip -d $file
        done


        Reliable solution



        If your large_file_list has exactly one file name per line, and a file called - is not among them, and you're on a GNU system, then use:



        xargs -rd'n' gzip -d -- <large_file_list


        -d'n' tells xargs to treat each line of input as a separate file name.



        -r tells xargs not to run the command if the input file is empty.



        -- tells gzip that the following arguments are not to be treated as options even if they start with -. - alone would still be treated as - instead of the file called - though.



        xargs will put many file names on each command line but not so many that it exceeds the command line limit. This reduces the number of times that a gzip process must be started and therefore makes this fast. It is also safe: the file names will also be protected from word splitting and pathname expansion.







        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Jul 2 at 19:32









        Stéphane Chazelas

        325k57 gold badges628 silver badges997 bronze badges




        325k57 gold badges628 silver badges997 bronze badges










        answered Jul 1 at 5:51









        John1024John1024

        51k5 gold badges117 silver badges132 bronze badges




        51k5 gold badges117 silver badges132 bronze badges












        • Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.

          – Leon
          Jul 1 at 6:20







        • 1





          @Leon The for loop will be —by far— the slowest. The other two methods will be very close in speed to each other.

          – John1024
          Jul 1 at 6:36







        • 7





          Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.

          – John1024
          Jul 1 at 6:38






        • 5





          Note also that there's variation on reading a file with xargs: at least GNU version has --arg-file option (short form -a). So one could do xargs -a large_file_list -rd'n' gzip -d instead. Effectively, there's no difference, aside from the fact that < is shell operator and would make xargs read from stdin (which shell "links" to file), while -a would make xargs explicitly open the file in question

          – Sergiy Kolodyazhnyy
          Jul 1 at 7:34






        • 2





          terdon noted in another comment about using parallel to run multiple copies of gzip, but xargs (at least the GNU one), has the -P switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.

          – ilkkachu
          Jul 2 at 14:19

















        • Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.

          – Leon
          Jul 1 at 6:20







        • 1





          @Leon The for loop will be —by far— the slowest. The other two methods will be very close in speed to each other.

          – John1024
          Jul 1 at 6:36







        • 7





          Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.

          – John1024
          Jul 1 at 6:38






        • 5





          Note also that there's variation on reading a file with xargs: at least GNU version has --arg-file option (short form -a). So one could do xargs -a large_file_list -rd'n' gzip -d instead. Effectively, there's no difference, aside from the fact that < is shell operator and would make xargs read from stdin (which shell "links" to file), while -a would make xargs explicitly open the file in question

          – Sergiy Kolodyazhnyy
          Jul 1 at 7:34






        • 2





          terdon noted in another comment about using parallel to run multiple copies of gzip, but xargs (at least the GNU one), has the -P switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.

          – ilkkachu
          Jul 2 at 14:19
















        Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.

        – Leon
        Jul 1 at 6:20






        Thanks for detailed reply. I understand your mentioned 3 issues. File name is simple and won't face those challenges as the list will hold upto 20000. And my question is basically on performance of those two segments. Thanks.

        – Leon
        Jul 1 at 6:20





        1




        1





        @Leon The for loop will be —by far— the slowest. The other two methods will be very close in speed to each other.

        – John1024
        Jul 1 at 6:36






        @Leon The for loop will be —by far— the slowest. The other two methods will be very close in speed to each other.

        – John1024
        Jul 1 at 6:36





        7




        7





        Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.

        – John1024
        Jul 1 at 6:38





        Also, don't dismiss the potential problems: many many questions here on StackExchange are because word splitting or pathname expansion happened to people who weren't expecting it.

        – John1024
        Jul 1 at 6:38




        5




        5





        Note also that there's variation on reading a file with xargs: at least GNU version has --arg-file option (short form -a). So one could do xargs -a large_file_list -rd'n' gzip -d instead. Effectively, there's no difference, aside from the fact that < is shell operator and would make xargs read from stdin (which shell "links" to file), while -a would make xargs explicitly open the file in question

        – Sergiy Kolodyazhnyy
        Jul 1 at 7:34





        Note also that there's variation on reading a file with xargs: at least GNU version has --arg-file option (short form -a). So one could do xargs -a large_file_list -rd'n' gzip -d instead. Effectively, there's no difference, aside from the fact that < is shell operator and would make xargs read from stdin (which shell "links" to file), while -a would make xargs explicitly open the file in question

        – Sergiy Kolodyazhnyy
        Jul 1 at 7:34




        2




        2





        terdon noted in another comment about using parallel to run multiple copies of gzip, but xargs (at least the GNU one), has the -P switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.

        – ilkkachu
        Jul 2 at 14:19





        terdon noted in another comment about using parallel to run multiple copies of gzip, but xargs (at least the GNU one), has the -P switch for that, too. On multicore machines that might make a difference. But it's also possible that the decompression is completely I/O-bound anyway.

        – ilkkachu
        Jul 2 at 14:19













        12














        I doubt it would matter much.



        I would use a loop, just because I don't know how many files are listed in the list file, and I don't (generally) know if any of the filenames have spaces in their names. Doing a command substitution that would generate a very long list of argument may result in an "Argument list too long" error when the length of the list generated is too long.



        My loop would look like



        while IFS= read -r name; do
        gunzip "$name"
        done <file.list


        This would additionally allow me to insert commands for processing the data after the gunzip command. In fact, depending on what the data actually is and what needs to be done with it, it may even be possible to process it without saving it to file at all:



        while IFS= read -r name; do
        zcat "$name" | process_data
        done <file.list


        (where process_data is some pipeline that reads the uncompressed data from standard input)



        If the processing of the data takes longer than the uncompressing of it, the question of whether a loop is more efficient or not becomes irrelevant.



        Ideally, I would prefer to not work off a list of filenames though, and instead use a filename globbing pattern, as in



        for name in ./*.gz; do
        # processing of "$name" here
        done


        where ./*.gz is some pattern that matches the relevant files. This way we are not depending on the number of files nor on the characters used in the filenames (they may contain newlines or other whitespace characters, or start with dashes, etc.)



        Related:



        • Understanding "IFS= read -r line"





        share|improve this answer





























          12














          I doubt it would matter much.



          I would use a loop, just because I don't know how many files are listed in the list file, and I don't (generally) know if any of the filenames have spaces in their names. Doing a command substitution that would generate a very long list of argument may result in an "Argument list too long" error when the length of the list generated is too long.



          My loop would look like



          while IFS= read -r name; do
          gunzip "$name"
          done <file.list


          This would additionally allow me to insert commands for processing the data after the gunzip command. In fact, depending on what the data actually is and what needs to be done with it, it may even be possible to process it without saving it to file at all:



          while IFS= read -r name; do
          zcat "$name" | process_data
          done <file.list


          (where process_data is some pipeline that reads the uncompressed data from standard input)



          If the processing of the data takes longer than the uncompressing of it, the question of whether a loop is more efficient or not becomes irrelevant.



          Ideally, I would prefer to not work off a list of filenames though, and instead use a filename globbing pattern, as in



          for name in ./*.gz; do
          # processing of "$name" here
          done


          where ./*.gz is some pattern that matches the relevant files. This way we are not depending on the number of files nor on the characters used in the filenames (they may contain newlines or other whitespace characters, or start with dashes, etc.)



          Related:



          • Understanding "IFS= read -r line"





          share|improve this answer



























            12












            12








            12







            I doubt it would matter much.



            I would use a loop, just because I don't know how many files are listed in the list file, and I don't (generally) know if any of the filenames have spaces in their names. Doing a command substitution that would generate a very long list of argument may result in an "Argument list too long" error when the length of the list generated is too long.



            My loop would look like



            while IFS= read -r name; do
            gunzip "$name"
            done <file.list


            This would additionally allow me to insert commands for processing the data after the gunzip command. In fact, depending on what the data actually is and what needs to be done with it, it may even be possible to process it without saving it to file at all:



            while IFS= read -r name; do
            zcat "$name" | process_data
            done <file.list


            (where process_data is some pipeline that reads the uncompressed data from standard input)



            If the processing of the data takes longer than the uncompressing of it, the question of whether a loop is more efficient or not becomes irrelevant.



            Ideally, I would prefer to not work off a list of filenames though, and instead use a filename globbing pattern, as in



            for name in ./*.gz; do
            # processing of "$name" here
            done


            where ./*.gz is some pattern that matches the relevant files. This way we are not depending on the number of files nor on the characters used in the filenames (they may contain newlines or other whitespace characters, or start with dashes, etc.)



            Related:



            • Understanding "IFS= read -r line"





            share|improve this answer















            I doubt it would matter much.



            I would use a loop, just because I don't know how many files are listed in the list file, and I don't (generally) know if any of the filenames have spaces in their names. Doing a command substitution that would generate a very long list of argument may result in an "Argument list too long" error when the length of the list generated is too long.



            My loop would look like



            while IFS= read -r name; do
            gunzip "$name"
            done <file.list


            This would additionally allow me to insert commands for processing the data after the gunzip command. In fact, depending on what the data actually is and what needs to be done with it, it may even be possible to process it without saving it to file at all:



            while IFS= read -r name; do
            zcat "$name" | process_data
            done <file.list


            (where process_data is some pipeline that reads the uncompressed data from standard input)



            If the processing of the data takes longer than the uncompressing of it, the question of whether a loop is more efficient or not becomes irrelevant.



            Ideally, I would prefer to not work off a list of filenames though, and instead use a filename globbing pattern, as in



            for name in ./*.gz; do
            # processing of "$name" here
            done


            where ./*.gz is some pattern that matches the relevant files. This way we are not depending on the number of files nor on the characters used in the filenames (they may contain newlines or other whitespace characters, or start with dashes, etc.)



            Related:



            • Understanding "IFS= read -r line"






            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Jul 1 at 9:39

























            answered Jul 1 at 7:21









            KusalanandaKusalananda

            154k18 gold badges304 silver badges488 bronze badges




            154k18 gold badges304 silver badges488 bronze badges





















                6














                Out of those two, the one with all files passed to a single invocation of gzip is likely to be faster, exactly because you only need to launch gzip once. (That is, if the command works at all, see the other answers for the caveats.)



                But, I'd like to remind of the golden rule of optimization: Don't do it prematurely.




                1. Don't optimize that sort of thing before you know it's a problem.



                  Does this part of the program take a long time? Well, decompressing large files might, and you're going to have to do it anyway, so it might not be that easy to answer.




                2. Measure. Really, it's the best way to be sure.



                  You'll see the results with your own eyes (or with your own stopwatch), and they'll apply to your situation which random answers on the Internet might not. Put both variants in scripts and run time script1.sh, and time script2.sh. (Do that with a list of empty compressed files to measure the absolute amount of the overhead.)







                share|improve this answer



























                  6














                  Out of those two, the one with all files passed to a single invocation of gzip is likely to be faster, exactly because you only need to launch gzip once. (That is, if the command works at all, see the other answers for the caveats.)



                  But, I'd like to remind of the golden rule of optimization: Don't do it prematurely.




                  1. Don't optimize that sort of thing before you know it's a problem.



                    Does this part of the program take a long time? Well, decompressing large files might, and you're going to have to do it anyway, so it might not be that easy to answer.




                  2. Measure. Really, it's the best way to be sure.



                    You'll see the results with your own eyes (or with your own stopwatch), and they'll apply to your situation which random answers on the Internet might not. Put both variants in scripts and run time script1.sh, and time script2.sh. (Do that with a list of empty compressed files to measure the absolute amount of the overhead.)







                  share|improve this answer

























                    6












                    6








                    6







                    Out of those two, the one with all files passed to a single invocation of gzip is likely to be faster, exactly because you only need to launch gzip once. (That is, if the command works at all, see the other answers for the caveats.)



                    But, I'd like to remind of the golden rule of optimization: Don't do it prematurely.




                    1. Don't optimize that sort of thing before you know it's a problem.



                      Does this part of the program take a long time? Well, decompressing large files might, and you're going to have to do it anyway, so it might not be that easy to answer.




                    2. Measure. Really, it's the best way to be sure.



                      You'll see the results with your own eyes (or with your own stopwatch), and they'll apply to your situation which random answers on the Internet might not. Put both variants in scripts and run time script1.sh, and time script2.sh. (Do that with a list of empty compressed files to measure the absolute amount of the overhead.)







                    share|improve this answer













                    Out of those two, the one with all files passed to a single invocation of gzip is likely to be faster, exactly because you only need to launch gzip once. (That is, if the command works at all, see the other answers for the caveats.)



                    But, I'd like to remind of the golden rule of optimization: Don't do it prematurely.




                    1. Don't optimize that sort of thing before you know it's a problem.



                      Does this part of the program take a long time? Well, decompressing large files might, and you're going to have to do it anyway, so it might not be that easy to answer.




                    2. Measure. Really, it's the best way to be sure.



                      You'll see the results with your own eyes (or with your own stopwatch), and they'll apply to your situation which random answers on the Internet might not. Put both variants in scripts and run time script1.sh, and time script2.sh. (Do that with a list of empty compressed files to measure the absolute amount of the overhead.)








                    share|improve this answer












                    share|improve this answer



                    share|improve this answer










                    answered Jul 1 at 15:03









                    ilkkachuilkkachu

                    64.9k10 gold badges109 silver badges189 bronze badges




                    64.9k10 gold badges109 silver badges189 bronze badges





















                        0














                        How fast is your disk?



                        This should use all your CPUs:



                        parallel -X gzip -d :::: large_file_list


                        So your limit is likely going to be the speed of your disk.



                        You can try adjusting with -j:



                        parallel -j50% -X gzip -d :::: large_file_list


                        This will run half of the jobs in parallel as the previous command, and will stress your disk less, so depending on your disk this can be faster.






answered Jul 2 at 17:09 by Ole Tange


ADDED after some profiling:

I made one hundred dirs 00 to 99 on a tmpfs and filled them with the man-1 files starting with "s". That gives me 113 files from 20 bytes to 69 K in each dir, totalling about 1.2 MB per dir.

Result: it ALL works:

Globbing and a huge arg list:

gzip -d ??/*
and
gzip -d $(<filelist)

both 0.8 s

With a loop:

for file in $(<filelist)
do
gzip -d $file
done

14 sec

And when I add my ampersand:

...
gzip -d $file &
...

5.8 sec

The performance side of this is easily explained: the less often you call gzip and the less you wait for it, the faster it runs.

Now: is there a maximum arg list length at all? Is there a maximum number of processes? (I have a ulimit -u of 31366, but it looks like I could even raise that, as root on my $300 i5 Mini-PC...) Or is the maximum "only limited by the amount of available memory"?

Despite the overhead of ampersanding (forking off) 11300 mostly tiny jobs, this takes half as long as waiting for each command to complete. And this was on a tmpfs...

I forgot this one:

time find ?? -name "*.gz" -exec gzip -d {} +

also 0.8 s (6.2 s without the "+", using the ugly ";" form that runs one gzip per file)

ORIGINAL answer:

Short answer: use an ampersand & to execute each gzip command in the background, so the next loop iteration starts immediately.

for filnam in $(<flist)
do
gzip -d $filnam &
done

I also replaced cat with "<" (not in a loop, so it is not critical) and backticks with $() (more "modern").

To test/benchmark, place the shell builtin time in front of your command: I did time . loop.cmd (with loop.cmd as shown above). This gives you the time spent, and after hitting Enter again the shell shows you the finished gzip "jobs".

I tested this out some weeks ago with a (small) list of (medium size) tar files to extract. The & made such a difference in time I could hardly believe it. The & has a double effect: it not only starts the next gzip immediately, it also gives you the prompt back as soon as the last gzip has been started.

CAREFUL: if you have whitespace in your filenames, it doesn't work. I don't know how to implement that. I tried with quotes and IFS...

Another well-known tip: use globbing, if possible. Maybe start with a list of directories and for each one go (cd $dir; gzip -d *.gz &).

--> I can only give you this ampersand "trick". The first step is always optimizing the design. I am not sure what exactly you want this huge list of files for. Is it a hand-made list? If it is generated by rules, then maybe use those rules directly instead of going through a file list.

Your second example should be a bit faster, because gzip starts only once. But I don't know what the limit is if you go gzip -d 1.gz 2.gz 3.gz ... 9999.gz, i.e. give gzip a very large argument list.

Whitespace: a filename like '10 A.gz' results in gzip -d ... 9.gz 10 A.gz 11.gz.... gzip will say '10' and 'A.gz' not found... unless those files DO exist; and if the command is rm rather than gzip, you have that classic shell trap.


added:

What about:

find $(<dirlist) -name "*.gz" -exec gzip -d {} +

...starting from a smaller list of dirs? The "+" terminator batches many filenames into each gzip invocation (a find -exec feature, similar to xargs), instead of running one gzip per file as ";" does.

If you can redesign a bit, this must be the best...






edited Jul 2 at 16:24; answered Jul 1 at 8:32 by sam68


• In the meanwhile, "kusa" has answered... this IFS/read is what I meant... I hope my explanations are useful anyway.
  – sam68, Jul 1 at 8:37

• Now I proposed a "find -exec" command at the end of my answer. Maybe not what Leon wants, but still important.
  – sam68, Jul 1 at 9:34

• Your loop will probably bring the OP's machine to its knees. The file list can have up to 10000 files, so running ten thousand gzip instances in parallel will very likely cause the system to hang.
  – terdon, Jul 1 at 9:44

• Process #1 might be finished when #10 or #100 gets started... It started as a performance question, then whitespace, now system resources... this question is not so easy to answer! An extra process/job for each small .gz would not be ideal, I admit; 10000 separate gzips is the thing to avoid in the first place.
  – sam68, Jul 1 at 11:09

• This is the sort of thing GNU parallel is for. You could simply do parallel -j3 gzip -d ::: < <(find $(<dirlist) -name '*.gz') and that would only launch 3 gzips at a time.
  – terdon, Jul 1 at 11:49
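As a middle ground between one gzip at a time and ten thousand backgrounded jobs, the ampersand idea can be combined with whitespace-safe reading and a cap on concurrent jobs. A sketch in plain bash, assuming bash 4.3 or newer for wait -n (the cap of 4 is arbitrary):

#!/usr/bin/env bash
# Decompress the files named in large_file_list, at most max_jobs at a time.
# Reading whole lines with IFS= read -r and quoting "$file" keeps filenames
# containing spaces intact.
max_jobs=4

while IFS= read -r file; do
    gzip -d -- "$file" &
    # Once max_jobs decompressions are running, wait for one of them to
    # finish before starting the next (wait -n needs bash >= 4.3).
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
        wait -n
    done
done < large_file_list

wait    # let the remaining background jobs finish

This keeps the per-file gzip startup cost, but it bounds the load in the same spirit as parallel -j.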














