Repairing the headers of phylip bioinformatics files to accurately reflect the updated number of samples in the file(s)












I am working with a dataset made up of phylip files that I have been editing. Phylip is a bioinformatics format whose header gives the number of samples and the sequence length, followed by one line per sample with its name and sequence. For example:

    5 10
    sample_1 gaatatccga
    sample_2 gaatatccga
    sample_3 gaatatcgca
    sample_4 caatatccga
    sample_5 gaataagcga

My issue is that after trimming these datasets, the sample count in the header is no longer accurate (e.g. the example above says five, but I may have since trimmed the file to only three samples). I need to replace that sample count with the new, accurate count, but I cannot figure out how to do so without losing the sequence length (the 10 in the example).

I have 550 files, so doing this by hand is not an option. I can loop wc over the files, but I still need to retain the sequence length and somehow combine it with the new, accurate count.







text-processing bioinformatics wc






edited May 16 at 1:22 by Jeff Schaller

asked May 15 at 17:49 by erikusrex













  • Please don't post pictures of text but post the actual text in a code block.
    – Jesse_b, May 15 at 17:49

  • Also, it's unclear how the number should be changed. Always to 3?
    – choroba, May 15 at 17:55

  • OK, thank you, will do in the future. Not always to three: the sample number differs across files, but most of the files have been edited, so the new number of samples is often fewer than what is currently stated in the header.
    – erikusrex, May 15 at 18:18






















3 Answers















If I understand your requirement correctly, you can use the following awk command:

    awk -v samples="$(($(grep -c . input)-1))" 'NR == 1 { $1 = samples } 1' input

samples is set to the number of non-empty lines in the input file minus one (the header line is not counted as a sample).

awk then changes the first field of the first line to the new sample count and prints every line.

    $ cat input
    5 10
    sample_1 gaatatccga
    sample_2 gaatatccga
    sample_3 gaatatccga
    $ awk -v samples="$(($(grep -c . input)-1))" 'NR == 1 { $1 = samples } 1' input
    3 10
    sample_1 gaatatccga
    sample_2 gaatatccga
    sample_3 gaatatccga

With GNU awk (4.1 or later) you can use -i inplace to modify the files in place, but I would prefer to write a second set of modified files so you can check that the correct changes have been made.

Something like:

    for file in *.phy; do
        awk -v samples="$(($(grep -c . "$file")-1))" 'NR == 1 { $1 = samples } 1' "$file" > "$file.new"
    done














answered May 15 at 17:58 by Jesse_b (edited May 15 at 18:05)
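One way to check the rewritten files before replacing the originals might look like the sketch below. It is not part of the answer above: the function name check_phylip_header is hypothetical, and it assumes one sample per line and the *.phy.new naming used in the loop above.

```shell
# Hypothetical helper (not from the answer): succeeds only when the sample
# count in the header equals the number of non-empty lines that follow it.
check_phylip_header() {
    awk 'NR == 1 { declared = $1; next }   # header: remember the declared count
         NF      { actual++ }              # count non-empty sample lines
         END     { if (NR == 0) exit 1     # an empty file cannot be valid
                   exit !(declared + 0 == actual + 0) }' "$1"
}

# Report every rewritten file whose header still disagrees with its contents.
for file in *.phy.new; do
    [ -e "$file" ] || continue             # the glob matched nothing
    check_phylip_header "$file" || printf 'header mismatch: %s\n' "$file"
done
```

If the loop prints nothing, every .new file has a header that agrees with its sample lines, and the originals can be replaced with more confidence.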




































Another option would be to use ed (of course!):

    for f in input*
    do
        printf '1s/[[:digit:]][[:digit:]]*/%d/\nw\nq\n' "$(( $(wc -l < "$f") - 1 ))" | ed -s "$f"
    done

This loops over the files (named, for example, input-something) and sends a simple ed script to ed:

• on line 1, search and replace (s//) the first run of one or more digits, which is the sample count at the start of the header, with another number: the line count of the input minus one

• after that, w writes the file out and

• then q quits ed
















answered May 15 at 18:30 by Jeff Schaller
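Where ed is not installed, the same substitution could be sketched with sed and a temporary file instead. This swaps in a different tool from the one the answer uses; the function name fix_phylip_header and the .tmp suffix are hypothetical, and it assumes one sample per line, no blank lines, and a final newline, since it relies on wc -l just as the ed loop does.

```shell
# Hypothetical helper (not from the answer): replace the first number on
# line 1 of a phylip file with the file's line count minus one.
# Assumes one sample per line, no blank lines, and a trailing newline.
fix_phylip_header() {
    f=$1
    n=$(( $(wc -l < "$f") - 1 ))           # every line except the header
    # Write to a temporary file first so a failed sed leaves the original intact.
    sed "1s/[0-9][0-9]*/$n/" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
}
```

It can be called from the same kind of loop, e.g. for f in input*; do fix_phylip_header "$f"; done. Only the first run of digits on line 1 is touched, so the sequence length is left as it was.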


































In Vim, run:

    :execute '1s/^[0-9]\+/' . (line('$')-1) . '/'

(Thanks also to this answer for pointing me in the right direction.)

You can also do this in a loop, e.g. using :bufdo or just a shell for loop.

















answered May 16 at 19:53 by Wildcard



















