Repairing the headers of phylip bioinformatics files to accurately reflect the updated number of samples in the file(s)
I have a dataset made up of phylip files that I have been editing. Phylip is a bioinformatics format whose header line contains the number of samples and the sequence length, followed by one line per sample with its sequence. For example:
5 10
sample_1 gaatatccga
sample_2 gaatatccga
sample_3 gaatatcgca
sample_4 caatatccga
sample_5 gaataagcga
My issue is that, after trimming these datasets, the sample count in the header is no longer accurate (e.g. the header above might say five, but I've since trimmed the file to only three samples). What I need to do is replace that sample count with the new, accurate one, but I cannot figure out how to do so without losing the sequence length (e.g. the 10).
I have 550 files, so doing this by hand is not an option. I can for-loop wc, but again I need to retain the sequence length and somehow combine it with a new, accurate line count.
text-processing bioinformatics wc
asked May 15 at 17:49 by erikusrex (new contributor)
edited May 16 at 1:22 by Jeff Schaller♦
Please don't post pictures of text but post the actual text in a code block.
– Jesse_b
May 15 at 17:49
Also, it's unclear how the number should be changed. Always to 3?
– choroba
May 15 at 17:55
OK, thank you, will do in the future. Not always to three: the sample number differs across files, but most of the files have been edited, so the new number of samples is often lower than what the header currently states.
– erikusrex
May 15 at 18:18
3 Answers
If I understand your requirement correctly, you can use the following awk command:
awk -v samples="$(($(grep -c . input)-1))" 'NR == 1 {$1=samples} 1' input
samples will be set to the number of non-empty lines in the input file minus one (since you aren't counting the header line). awk will then change the first column of the first line to the new sample count and print everything.
$ cat input
5 10
sample_1 gaatatccga
sample_2 gaatatccga
sample_3 gaatatccga
$ awk -v samples="$(($(grep -c . input)-1))" 'NR == 1 {$1=samples} 1' input
3 10
sample_1 gaatatccga
sample_2 gaatatccga
sample_3 gaatatccga
With GNU awk (4.1 or later) you can use the -i inplace option to modify the files in place, but I would prefer to write a second set of modified files so you can check that the correct changes have been made.
Something like:
for file in *.phy; do
    awk -v samples="$(($(grep -c . "$file")-1))" 'NR == 1 {$1=samples} 1' "$file" > "$file.new"
done
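As a variation on the same idea (not from the answer itself), awk can do the counting too if it is handed the file twice: the first pass counts the non-empty lines after the header, the second pass prints the file with the corrected count. This is only a sketch; the function name fix_phylip_header is made up for illustration, and it assumes one sample per line with no interleaved blocks.

```shell
# fix_phylip_header is a hypothetical helper name.
# awk reads the same file twice:
#   pass 1 (NR == FNR): count non-empty lines after the header
#   pass 2: replace field 1 of the header with that count, print everything
fix_phylip_header() {
    awk 'NR == FNR { if (FNR > 1 && NF) n++; next }
         FNR == 1  { $1 = n + 0 }
         1' "$1" "$1"
}

# Write corrected copies next to the originals, leaving the originals untouched.
for file in *.phy; do
    [ -e "$file" ] || continue    # the glob matched nothing
    fix_phylip_header "$file" > "$file.new"
done
```

Because assigning to $1 makes awk rebuild the header line, any leading whitespace or repeated spaces in the header collapse to single spaces; the sequence length itself is kept.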
answered May 15 at 17:58 by Jesse_b, edited May 15 at 18:05
Another option would be to use ed (of course!):
for f in input*
do
    printf '1s/[[:digit:]][[:digit:]]*/%d/\nw\nq\n' "$(( $(wc -l < "$f") - 1 ))" | ed -s "$f"
done
This loops over the files (named, for example, input-something) and sends a simple ed script to ed:
- on line 1, search and replace (s///) the first run of one or more digits with another number, that replacement number being the line count of the input minus one
- after that, w writes the file out
- then q quits ed
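To see what ed actually receives, the script can be generated on its own before piping it anywhere. The sketch below assumes the newlines in the printf format are written as \n escapes and that the substitution is closed with a trailing slash; make_ed_script is a made-up helper name, not something ed provides.

```shell
# make_ed_script is a hypothetical helper name.
# Print the ed commands for one file without running ed,
# so they can be inspected first.
make_ed_script() {
    count=$(( $(wc -l < "$1") - 1 ))    # total lines minus the header
    printf '1s/[[:digit:]][[:digit:]]*/%d/\nw\nq\n' "$count"
}
```

For a file with a header and three samples this prints the three lines 1s/[[:digit:]][[:digit:]]*/3/, w and q. Once that looks right, make_ed_script "$f" | ed -s "$f" applies it, and since ed edits the file in place it is worth trying on a copy first.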
answered May 15 at 18:30 by Jeff Schaller♦
In Vim, run:
:execute '1s/^[0-9]\+/' . (line('$')-1) . '/'
(Thanks also to this answer for pointing me in the right direction.)
You can also do this in a loop, e.g. using :bufdo or just a shell for loop.
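For the shell-loop route, one possible sketch is to run Vim non-interactively in silent Ex mode. It assumes a Vim build with expression support (+eval, so not a tiny build) and a header that starts with the digits, with no leading whitespace; fix_header_with_vim is a made-up name.

```shell
# fix_header_with_vim is a hypothetical helper name.
#   -N -u NONE : non-compatible mode, skip vimrc and plugins
#   -es        : silent Ex mode, no terminal UI needed
# The first -c runs the substitution on line 1, the second writes and quits.
fix_header_with_vim() {
    vim -N -u NONE -es \
        -c "execute '1s/^[0-9]\\+/' . (line('\$')-1) . '/'" \
        -c 'wq' "$1" < /dev/null
}
```

Calling fix_header_with_vim "$f" for each file in a for loop covers a whole directory. It overwrites the file it is given, so running it on copies first seems prudent.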
answered May 16 at 19:53 by Wildcard