Convert BAM to properly paired FASTQ filesHow to filter out cross alignments from a BED file?samtools mpileup empty when filtering out flagsWhat is the fastest way to calculate the number of unknown nucleotides in FASTA / FASTQ files?How to safely and efficiently convert subset of bam to fastq?Convert bam file to highly compressible bamHow does htslib/samtools access optional BAM fields?Double-counting coverage of overlapped read pairsExtracting all reads from bam file which match read IDs in another fileconvert supplementary reads to primary in sam or bamBatch alignment of inconsistently named Fastq files
How can I help our ranger feel special about her beast companion?
How can one convert an expression to a string while keeping the quotation marks of strings that are part of the expression?
When designing an adventure, how can I ensure a continuous player experience in a setting that's likely to favor TPKs?
Who determines when road center lines are solid or dashed?
What would be the safest way to drop thousands of small, hard objects from a typical, high wing, GA airplane?
Last-minute canceled work-trip mean I'll lose thousands of dollars on planned vacation
What happens if a company buys back all of its shares?
Wordplay addition paradox
Drawing a circle with nodes shift with Tikz
Applying for jobs with an obvious scar
How do you send money when you're not sure it's not a scam?
Arithmetics in LuaLaTeX
Practical example in using (homotopy) type theory
Strategy to pay off revolving debt while building reserve savings fund?
Random piece of plastic
Is it possible to have a career in SciComp without contributing to arms research?
The most secure way to handle someone forgetting to verify their account?
"This used to be my phone number"
How to remove the first colon ':' from a timestamp?
Why are there few or no black super GMs?
Time signature inconsistent
I have found a mistake on someone's code published online: what is the protocol?
Did Hitler say this quote about homeschooling?
How to find location on Cambridge-Mildenhall railway that still has tracks/rails?
Convert BAM to properly paired FASTQ files
How to filter out cross alignments from a BED file?samtools mpileup empty when filtering out flagsWhat is the fastest way to calculate the number of unknown nucleotides in FASTA / FASTQ files?How to safely and efficiently convert subset of bam to fastq?Convert bam file to highly compressible bamHow does htslib/samtools access optional BAM fields?Double-counting coverage of overlapped read pairsExtracting all reads from bam file which match read IDs in another fileconvert supplementary reads to primary in sam or bamBatch alignment of inconsistently named Fastq files
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;
$begingroup$
I thought I had figured this one out. But apparently not.
I need to convert a BAM file of paired-end alignments to two FASTQ files of paired reads to realign them, with a twist: I only want reads that fall within a defined region.
Currently I am using the following command (the region is an example; it also fails on different regions; at any rate, this is on hs37d5 but presumably the issue would exist for other references too):
samtools view -b -u -@ 48 alignments.bam 21:31,831,502-31,896,094
| samtools collate -n 128 -u -O -@ 48 - /tmp/bam-to-fastq-
| samtools fastq -F 0x900 -@ 48
-0 /dev/null -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz -
(Apart from the region filtering, this command is essentially the same that word of god tells us to use.)
The resulting files are not properly paired:
⟩⟩⟩ diff <(gunzip -c reads_R1.fastq.gz | sed -n -e 1~4p -e 10q)
1 ⟩ <(gunzip -c reads_R2.fastq.gz | sed -n -e 1~4p -e 10q)
1,2d0
< @A00346:17:H52H7DSXX:3:2142:32316:24518
< @A00346:17:H52H7DSXX:3:1153:6705:26412
3a2,3
> @A00346:17:H52H7DSXX:3:1173:9118:3129
> @A00346:17:H52H7DSXX:3:1109:29414:11757
As far as I can see from performing a spot check, these are all reads which are marked as paired in the BAM file, but for which only one mate falls within the filtered region.
And running bwa mem
on the files gives me the error:
[mem_sam_pe] paired reads have different names: […]
How can I ensure that only properly paired reads are included in the FASTQ files?
bam samtools fastq
$endgroup$
|
show 4 more comments
$begingroup$
I thought I had figured this one out. But apparently not.
I need to convert a BAM file of paired-end alignments to two FASTQ files of paired reads to realign them, with a twist: I only want reads that fall within a defined region.
Currently I am using the following command (the region is an example; it also fails on different regions; at any rate, this is on hs37d5 but presumably the issue would exist for other references too):
samtools view -b -u -@ 48 alignments.bam 21:31,831,502-31,896,094
| samtools collate -n 128 -u -O -@ 48 - /tmp/bam-to-fastq-
| samtools fastq -F 0x900 -@ 48
-0 /dev/null -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz -
(Apart from the region filtering, this command is essentially the same that word of god tells us to use.)
The resulting files are not properly paired:
⟩⟩⟩ diff <(gunzip -c reads_R1.fastq.gz | sed -n -e 1~4p -e 10q)
1 ⟩ <(gunzip -c reads_R2.fastq.gz | sed -n -e 1~4p -e 10q)
1,2d0
< @A00346:17:H52H7DSXX:3:2142:32316:24518
< @A00346:17:H52H7DSXX:3:1153:6705:26412
3a2,3
> @A00346:17:H52H7DSXX:3:1173:9118:3129
> @A00346:17:H52H7DSXX:3:1109:29414:11757
As far as I can see from performing a spot check, these are all reads which are marked as paired in the BAM file, but for which only one mate falls within the filtered region.
And running bwa mem
on the files gives me the error:
[mem_sam_pe] paired reads have different names: […]
How can I ensure that only properly paired reads are included in the FASTQ files?
bam samtools fastq
$endgroup$
2
$begingroup$
I'm guessing this is just that some reads fall in the target region but their pairs fall outside it. I don't know how to do this (hence a comment), but have you tried extracting only those reads that fall entirely within the target region?
$endgroup$
– terdon♦
Jul 9 at 11:32
$begingroup$
@terdon Right, that’s precisely the problem. However, I’d naively expectsamtools fastq
to do the right thing here, and actually output pairs of reads, or no read at all. Because resolving this efficiently separately is pretty hard.
$endgroup$
– Konrad Rudolph
Jul 9 at 12:40
1
$begingroup$
The samtools devs might not agree that this is a bug, but the use case you describe is certainly sufficiently common that they should consider implementing it as an option. There's a strong argument that it should really be the default behavior, but that would probably require a major version bump, which raises another set of practical considerations.
$endgroup$
– Daniel Standage
Jul 9 at 13:28
$begingroup$
That said, I can see some complications with its implementation. We typically see paired reads mapped within a few hundred base pairs, but how will it affect performance if we have very large insert libraries or pairs that are discordantly mapped? In the absence of a data structure that explicitly tracks the location of a read's pair, there are some edge cases that wouldn't be straightforward to handle. From this perspective, I'm not surprised the use case you describe doesn't seem to be supported out-of-the-box.
$endgroup$
– Daniel Standage
Jul 9 at 13:32
$begingroup$
@DanielStandagesamtools collate
should take care of this issue: If both reads of a pair are present, they will be adjacent; so no fancy data structure is necessary. In principle this should even be possible using a simple additional pass usingsed
but I’m currently failing at that.
$endgroup$
– Konrad Rudolph
Jul 9 at 13:41
|
show 4 more comments
$begingroup$
I thought I had figured this one out. But apparently not.
I need to convert a BAM file of paired-end alignments to two FASTQ files of paired reads to realign them, with a twist: I only want reads that fall within a defined region.
Currently I am using the following command (the region is an example; it also fails on different regions; at any rate, this is on hs37d5 but presumably the issue would exist for other references too):
samtools view -b -u -@ 48 alignments.bam 21:31,831,502-31,896,094
| samtools collate -n 128 -u -O -@ 48 - /tmp/bam-to-fastq-
| samtools fastq -F 0x900 -@ 48
-0 /dev/null -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz -
(Apart from the region filtering, this command is essentially the same that word of god tells us to use.)
The resulting files are not properly paired:
⟩⟩⟩ diff <(gunzip -c reads_R1.fastq.gz | sed -n -e 1~4p -e 10q)
1 ⟩ <(gunzip -c reads_R2.fastq.gz | sed -n -e 1~4p -e 10q)
1,2d0
< @A00346:17:H52H7DSXX:3:2142:32316:24518
< @A00346:17:H52H7DSXX:3:1153:6705:26412
3a2,3
> @A00346:17:H52H7DSXX:3:1173:9118:3129
> @A00346:17:H52H7DSXX:3:1109:29414:11757
As far as I can see from performing a spot check, these are all reads which are marked as paired in the BAM file, but for which only one mate falls within the filtered region.
And running bwa mem
on the files gives me the error:
[mem_sam_pe] paired reads have different names: […]
How can I ensure that only properly paired reads are included in the FASTQ files?
bam samtools fastq
$endgroup$
I thought I had figured this one out. But apparently not.
I need to convert a BAM file of paired-end alignments to two FASTQ files of paired reads to realign them, with a twist: I only want reads that fall within a defined region.
Currently I am using the following command (the region is an example; it also fails on different regions; at any rate, this is on hs37d5 but presumably the issue would exist for other references too):
samtools view -b -u -@ 48 alignments.bam 21:31,831,502-31,896,094
| samtools collate -n 128 -u -O -@ 48 - /tmp/bam-to-fastq-
| samtools fastq -F 0x900 -@ 48
-0 /dev/null -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz -
(Apart from the region filtering, this command is essentially the same that word of god tells us to use.)
The resulting files are not properly paired:
⟩⟩⟩ diff <(gunzip -c reads_R1.fastq.gz | sed -n -e 1~4p -e 10q)
1 ⟩ <(gunzip -c reads_R2.fastq.gz | sed -n -e 1~4p -e 10q)
1,2d0
< @A00346:17:H52H7DSXX:3:2142:32316:24518
< @A00346:17:H52H7DSXX:3:1153:6705:26412
3a2,3
> @A00346:17:H52H7DSXX:3:1173:9118:3129
> @A00346:17:H52H7DSXX:3:1109:29414:11757
As far as I can see from performing a spot check, these are all reads which are marked as paired in the BAM file, but for which only one mate falls within the filtered region.
And running bwa mem
on the files gives me the error:
[mem_sam_pe] paired reads have different names: […]
How can I ensure that only properly paired reads are included in the FASTQ files?
bam samtools fastq
bam samtools fastq
edited Jul 9 at 12:54
Konrad Rudolph
asked Jul 9 at 11:22
Konrad RudolphKonrad Rudolph
3,4307 silver badges33 bronze badges
3,4307 silver badges33 bronze badges
2
$begingroup$
I'm guessing this is just that some reads fall in the target region but their pairs fall outside it. I don't know how to do this (hence a comment), but have you tried extracting only those reads that fall entirely within the target region?
$endgroup$
– terdon♦
Jul 9 at 11:32
$begingroup$
@terdon Right, that’s precisely the problem. However, I’d naively expectsamtools fastq
to do the right thing here, and actually output pairs of reads, or no read at all. Because resolving this efficiently separately is pretty hard.
$endgroup$
– Konrad Rudolph
Jul 9 at 12:40
1
$begingroup$
The samtools devs might not agree that this is a bug, but the use case you describe is certainly sufficiently common that they should consider implementing it as an option. There's a strong argument that it should really be the default behavior, but that would probably require a major version bump, which raises another set of practical considerations.
$endgroup$
– Daniel Standage
Jul 9 at 13:28
$begingroup$
That said, I can see some complications with its implementation. We typically see paired reads mapped within a few hundred base pairs, but how will it affect performance if we have very large insert libraries or pairs that are discordantly mapped? In the absence of a data structure that explicitly tracks the location of a read's pair, there are some edge cases that wouldn't be straightforward to handle. From this perspective, I'm not surprised the use case you describe doesn't seem to be supported out-of-the-box.
$endgroup$
– Daniel Standage
Jul 9 at 13:32
$begingroup$
@DanielStandagesamtools collate
should take care of this issue: If both reads of a pair are present, they will be adjacent; so no fancy data structure is necessary. In principle this should even be possible using a simple additional pass usingsed
but I’m currently failing at that.
$endgroup$
– Konrad Rudolph
Jul 9 at 13:41
|
show 4 more comments
2
$begingroup$
I'm guessing this is just that some reads fall in the target region but their pairs fall outside it. I don't know how to do this (hence a comment), but have you tried extracting only those reads that fall entirely within the target region?
$endgroup$
– terdon♦
Jul 9 at 11:32
$begingroup$
@terdon Right, that’s precisely the problem. However, I’d naively expectsamtools fastq
to do the right thing here, and actually output pairs of reads, or no read at all. Because resolving this efficiently separately is pretty hard.
$endgroup$
– Konrad Rudolph
Jul 9 at 12:40
1
$begingroup$
The samtools devs might not agree that this is a bug, but the use case you describe is certainly sufficiently common that they should consider implementing it as an option. There's a strong argument that it should really be the default behavior, but that would probably require a major version bump, which raises another set of practical considerations.
$endgroup$
– Daniel Standage
Jul 9 at 13:28
$begingroup$
That said, I can see some complications with its implementation. We typically see paired reads mapped within a few hundred base pairs, but how will it affect performance if we have very large insert libraries or pairs that are discordantly mapped? In the absence of a data structure that explicitly tracks the location of a read's pair, there are some edge cases that wouldn't be straightforward to handle. From this perspective, I'm not surprised the use case you describe doesn't seem to be supported out-of-the-box.
$endgroup$
– Daniel Standage
Jul 9 at 13:32
$begingroup$
@DanielStandagesamtools collate
should take care of this issue: If both reads of a pair are present, they will be adjacent; so no fancy data structure is necessary. In principle this should even be possible using a simple additional pass usingsed
but I’m currently failing at that.
$endgroup$
– Konrad Rudolph
Jul 9 at 13:41
2
2
$begingroup$
I'm guessing this is just that some reads fall in the target region but their pairs fall outside it. I don't know how to do this (hence a comment), but have you tried extracting only those reads that fall entirely within the target region?
$endgroup$
– terdon♦
Jul 9 at 11:32
$begingroup$
I'm guessing this is just that some reads fall in the target region but their pairs fall outside it. I don't know how to do this (hence a comment), but have you tried extracting only those reads that fall entirely within the target region?
$endgroup$
– terdon♦
Jul 9 at 11:32
$begingroup$
@terdon Right, that’s precisely the problem. However, I’d naively expect
samtools fastq
to do the right thing here, and actually output pairs of reads, or no read at all. Because resolving this efficiently separately is pretty hard.$endgroup$
– Konrad Rudolph
Jul 9 at 12:40
$begingroup$
@terdon Right, that’s precisely the problem. However, I’d naively expect
samtools fastq
to do the right thing here, and actually output pairs of reads, or no read at all. Because resolving this efficiently separately is pretty hard.$endgroup$
– Konrad Rudolph
Jul 9 at 12:40
1
1
$begingroup$
The samtools devs might not agree that this is a bug, but the use case you describe is certainly sufficiently common that they should consider implementing it as an option. There's a strong argument that it should really be the default behavior, but that would probably require a major version bump, which raises another set of practical considerations.
$endgroup$
– Daniel Standage
Jul 9 at 13:28
$begingroup$
The samtools devs might not agree that this is a bug, but the use case you describe is certainly sufficiently common that they should consider implementing it as an option. There's a strong argument that it should really be the default behavior, but that would probably require a major version bump, which raises another set of practical considerations.
$endgroup$
– Daniel Standage
Jul 9 at 13:28
$begingroup$
That said, I can see some complications with its implementation. We typically see paired reads mapped within a few hundred base pairs, but how will it affect performance if we have very large insert libraries or pairs that are discordantly mapped? In the absence of a data structure that explicitly tracks the location of a read's pair, there are some edge cases that wouldn't be straightforward to handle. From this perspective, I'm not surprised the use case you describe doesn't seem to be supported out-of-the-box.
$endgroup$
– Daniel Standage
Jul 9 at 13:32
$begingroup$
That said, I can see some complications with its implementation. We typically see paired reads mapped within a few hundred base pairs, but how will it affect performance if we have very large insert libraries or pairs that are discordantly mapped? In the absence of a data structure that explicitly tracks the location of a read's pair, there are some edge cases that wouldn't be straightforward to handle. From this perspective, I'm not surprised the use case you describe doesn't seem to be supported out-of-the-box.
$endgroup$
– Daniel Standage
Jul 9 at 13:32
$begingroup$
@DanielStandage
samtools collate
should take care of this issue: If both reads of a pair are present, they will be adjacent; so no fancy data structure is necessary. In principle this should even be possible using a simple additional pass using sed
but I’m currently failing at that.$endgroup$
– Konrad Rudolph
Jul 9 at 13:41
$begingroup$
@DanielStandage
samtools collate
should take care of this issue: If both reads of a pair are present, they will be adjacent; so no fancy data structure is necessary. In principle this should even be possible using a simple additional pass using sed
but I’m currently failing at that.$endgroup$
– Konrad Rudolph
Jul 9 at 13:41
|
show 4 more comments
1 Answer
1
active
oldest
votes
$begingroup$
As noted in the comments, the problem is “some reads fall in the target region but their pairs fall outside it”, leading to non-trivial numbers of singleton reads coming out of samtools collate
.
… | samtools fastq -F 0x900 -@ 48
-0 /dev/null -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz -
Your samtools fastq
command is not doing anything to siphon off these singleton reads. You should add -s /dev/null
to get rid of them.
See the discussion in samtools/samtools#874 (especially the part starting here). I have not looked at how the recent changes from jkbonfield, which will land in an upcoming samtools version (after 1.9), might affect this.
$endgroup$
$begingroup$
Oooh, that explains why I previously thought this worked! My script used to include-s /dev/null
but then I removed it because I thought it had no effect. The documentation of these different outputs is somewhat confusing.
$endgroup$
– Konrad Rudolph
Jul 9 at 14:14
add a comment |
Your Answer
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "676"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f8938%2fconvert-bam-to-properly-paired-fastq-files%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
As noted in the comments, the problem is “some reads fall in the target region but their pairs fall outside it”, leading to non-trivial numbers of singleton reads coming out of samtools collate
.
… | samtools fastq -F 0x900 -@ 48
-0 /dev/null -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz -
Your samtools fastq
command is not doing anything to siphon off these singleton reads. You should add -s /dev/null
to get rid of them.
See the discussion in samtools/samtools#874 (especially the part starting here). I have not looked at how the recent changes from jkbonfield, which will land in an upcoming samtools version (after 1.9), might affect this.
$endgroup$
$begingroup$
Oooh, that explains why I previously thought this worked! My script used to include-s /dev/null
but then I removed it because I thought it had no effect. The documentation of these different outputs is somewhat confusing.
$endgroup$
– Konrad Rudolph
Jul 9 at 14:14
add a comment |
$begingroup$
As noted in the comments, the problem is “some reads fall in the target region but their pairs fall outside it”, leading to non-trivial numbers of singleton reads coming out of samtools collate
.
… | samtools fastq -F 0x900 -@ 48
-0 /dev/null -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz -
Your samtools fastq
command is not doing anything to siphon off these singleton reads. You should add -s /dev/null
to get rid of them.
See the discussion in samtools/samtools#874 (especially the part starting here). I have not looked at how the recent changes from jkbonfield, which will land in an upcoming samtools version (after 1.9), might affect this.
$endgroup$
$begingroup$
Oooh, that explains why I previously thought this worked! My script used to include-s /dev/null
but then I removed it because I thought it had no effect. The documentation of these different outputs is somewhat confusing.
$endgroup$
– Konrad Rudolph
Jul 9 at 14:14
add a comment |
$begingroup$
As noted in the comments, the problem is “some reads fall in the target region but their pairs fall outside it”, leading to non-trivial numbers of singleton reads coming out of samtools collate
.
… | samtools fastq -F 0x900 -@ 48
-0 /dev/null -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz -
Your samtools fastq
command is not doing anything to siphon off these singleton reads. You should add -s /dev/null
to get rid of them.
See the discussion in samtools/samtools#874 (especially the part starting here). I have not looked at how the recent changes from jkbonfield, which will land in an upcoming samtools version (after 1.9), might affect this.
$endgroup$
As noted in the comments, the problem is “some reads fall in the target region but their pairs fall outside it”, leading to non-trivial numbers of singleton reads coming out of samtools collate
.
… | samtools fastq -F 0x900 -@ 48
-0 /dev/null -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz -
Your samtools fastq
command is not doing anything to siphon off these singleton reads. You should add -s /dev/null
to get rid of them.
See the discussion in samtools/samtools#874 (especially the part starting here). I have not looked at how the recent changes from jkbonfield, which will land in an upcoming samtools version (after 1.9), might affect this.
answered Jul 9 at 14:07
John MarshallJohn Marshall
6433 silver badges6 bronze badges
6433 silver badges6 bronze badges
$begingroup$
Oooh, that explains why I previously thought this worked! My script used to include-s /dev/null
but then I removed it because I thought it had no effect. The documentation of these different outputs is somewhat confusing.
$endgroup$
– Konrad Rudolph
Jul 9 at 14:14
add a comment |
$begingroup$
Oooh, that explains why I previously thought this worked! My script used to include-s /dev/null
but then I removed it because I thought it had no effect. The documentation of these different outputs is somewhat confusing.
$endgroup$
– Konrad Rudolph
Jul 9 at 14:14
$begingroup$
Oooh, that explains why I previously thought this worked! My script used to include
-s /dev/null
but then I removed it because I thought it had no effect. The documentation of these different outputs is somewhat confusing.$endgroup$
– Konrad Rudolph
Jul 9 at 14:14
$begingroup$
Oooh, that explains why I previously thought this worked! My script used to include
-s /dev/null
but then I removed it because I thought it had no effect. The documentation of these different outputs is somewhat confusing.$endgroup$
– Konrad Rudolph
Jul 9 at 14:14
add a comment |
Thanks for contributing an answer to Bioinformatics Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f8938%2fconvert-bam-to-properly-paired-fastq-files%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
2
$begingroup$
I'm guessing this is just that some reads fall in the target region but their pairs fall outside it. I don't know how to do this (hence a comment), but have you tried extracting only those reads that fall entirely within the target region?
$endgroup$
– terdon♦
Jul 9 at 11:32
$begingroup$
@terdon Right, that’s precisely the problem. However, I’d naively expect
samtools fastq
to do the right thing here, and actually output pairs of reads, or no read at all. Because resolving this efficiently separately is pretty hard.$endgroup$
– Konrad Rudolph
Jul 9 at 12:40
1
$begingroup$
The samtools devs might not agree that this is a bug, but the use case you describe is certainly sufficiently common that they should consider implementing it as an option. There's a strong argument that it should really be the default behavior, but that would probably require a major version bump, which raises another set of practical considerations.
$endgroup$
– Daniel Standage
Jul 9 at 13:28
$begingroup$
That said, I can see some complications with its implementation. We typically see paired reads mapped within a few hundred base pairs, but how will it affect performance if we have very large insert libraries or pairs that are discordantly mapped? In the absence of a data structure that explicitly tracks the location of a read's pair, there are some edge cases that wouldn't be straightforward to handle. From this perspective, I'm not surprised the use case you describe doesn't seem to be supported out-of-the-box.
$endgroup$
– Daniel Standage
Jul 9 at 13:32
$begingroup$
@DanielStandage
samtools collate
should take care of this issue: If both reads of a pair are present, they will be adjacent; so no fancy data structure is necessary. In principle this should even be possible using a simple additional pass usingsed
but I’m currently failing at that.$endgroup$
– Konrad Rudolph
Jul 9 at 13:41