How drastic would the result be if I use a FASTA reference assembly from UCSC and a GTF from GENCODE?
There are different annotation files for UCSC and GENCODE.
If I use the reference assembly from UCSC and the GTF from GENCODE, or vice versa, would my downstream results be wrong?
rna-seq
edited Jun 13 at 7:49
Devon Ryan♦
asked Jun 13 at 7:20
krushnach Chandra
Do you want to include older UCSC/GENCODE releases or only the most recent ones? They should now be fully compatible.
– Devon Ryan♦
Jun 13 at 7:50
I used gencode.v21 as my GTF file, but I'm not sure about the reference assembly. I used hg38, but I don't know whether it came from GENCODE or UCSC. Is there a way to tell which source a reference assembly came from by looking at the file itself?
– krushnach Chandra
Jun 13 at 7:58
2 Answers
Never use genomes or annotations from UCSC: they're poorly versioned, and only recently, for mouse and human, have they even included all of the contigs. For FASTA/GTF files from early in the GRCh38 release, you can tell whether you're using UCSC or GENCODE by the presence or absence of _random contigs, which only exist in the UCSC files. These were mostly split into the actual contigs later, so a recent download from UCSC should more closely match what you find at GENCODE/Ensembl. Further, that period predates UCSC's adoption of GENCODE's vastly superior annotations, so if your GTF file has instances where the same gene ID is on multiple strands or multiple chromosomes (which is obviously biologically impossible), then you have a UCSC GTF file.
In general, with early GRCh38 releases your only real issues will be with the _random contigs, which are a minority of the genome and don't contain all that many genes. But really, you should be keeping track of the sources of your files and ensuring that they're compatible.
Update: I should expand a bit on my "Never use genomes or annotations from UCSC" comment. In point of fact, the genomes themselves aren't so terrible. Early on, UCSC had the bad habit of concatenating contigs together into _random "chromosomes", but they seem to have mostly kicked that habit of late. Note, however, that their genomes have no versions. Since reference genomes continue to be updated over time (mostly through the addition of patches), the lack of actual versions means you have to manually check whether a recently downloaded file matches what was downloaded previously or by someone else. This has obvious consequences for reproducibility.
The same issue occurs for annotations from UCSC, but they have the additional problem of historically having biologically incoherent concepts of genes. That is, they contain the same gene in multiple places with multiple orientations, which breaks many tools in ways both obvious and completely unclear. For example, DEXSeq will simply stop with an error message if given a UCSC annotation, since such annotations break biological plausibility. If you were to use these annotation files with deepTools, you wouldn't get an error message, but the resulting output would be only partial, because the biologically impossible annotation effectively corrupts the most obvious ways of storing annotation data in a data structure (i.e., you can no longer treat IDs as unique). This could have downstream ramifications for the biological interpretation of results.
edited Jun 13 at 9:00
answered Jun 13 at 8:08
Devon Ryan♦
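The gene-ID test described in the answer above (the same gene ID appearing on more than one chromosome or strand) can be sketched in shell with awk. This is only an illustration under stated assumptions: the function name gtf_inconsistent_genes is hypothetical, the GTF is uncompressed and tab-separated, and the ninth column uses the usual gene_id "..." attribute syntax.

```shell
# Sketch only: list gene IDs that occur on more than one chromosome/strand
# combination in a GTF file. The function name is hypothetical.
gtf_inconsistent_genes() {
    # $1: path to an uncompressed, tab-separated GTF file
    awk -F '\t' '
        /^#/ { next }                       # skip header/comment lines
        match($9, /gene_id "[^"]+"/) {
            # drop the leading gene_id " (9 characters) and the closing quote
            id  = substr($9, RSTART + 9, RLENGTH - 10)
            loc = $1 ":" $7                 # chromosome plus strand
            if (!((id, loc) in seen)) {     # count each location once per gene
                seen[id, loc] = 1
                n[id]++
            }
        }
        END { for (id in n) if (n[id] > 1) print id }
    ' "$1" | sort
}
```

Calling gtf_inconsistent_genes on an annotation file prints one offending gene ID per line; empty output means every gene ID maps to a single chromosome and strand.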
To check whether I have a GENCODE file, I ran cat gencode.v21.annotation.gtf | grep "_random" and found no instance of that word.
– krushnach Chandra
Jun 13 at 8:13
You could also just head the file. GENCODE files tend to start with a few comment lines (granted, given what the file is named, the odds of it being from GENCODE were very high to begin with).
– Devon Ryan♦
Jun 13 at 8:15
##description: evidence-based annotation of the human genome (GRCh38), version 21 (Ensembl 77)
##provider: GENCODE
##contact: gencode@sanger.ac.uk
##format: gtf
##date: 2014-09-29
I'm sure now that I'm on the right path.
– krushnach Chandra
Jun 13 at 8:16
@krushnachChandra not really relevant, but just so you know, grep can read files. You don't need cat file | grep foo; you can always do grep foo file.
– terdon♦
Jun 13 at 8:43
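The two checks discussed in these comments (reading the header of the GTF, and searching for _random without piping through cat) might be wrapped up as below. This is a rough sketch: the function names gtf_provider and count_random_contigs are hypothetical, and it assumes the ##provider: line sits within the first 20 lines of the file, as in the GENCODE header quoted above.

```shell
# Sketch only; both function names are hypothetical.

# Print the value of the "##provider:" header line, if the file has one.
gtf_provider() {
    # $1: path to an uncompressed GTF file
    head -n 20 "$1" | sed -n 's/^##provider: *//p'
}

# Count FASTA header lines whose sequence name contains _random.
# grep reads the file directly, so no cat is needed; "|| true" keeps the
# exit status at zero when the count is 0.
count_random_contigs() {
    # $1: path to an uncompressed FASTA file
    grep -c '^>.*_random' "$1" || true
}
```

For a GENCODE annotation, gtf_provider prints GENCODE; a file without such a header line produces no output, which says nothing either way about its origin.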
Devon, could you elaborate on your advice never to use UCSC genomes? "Never" seems a bit extreme. Can you explain what they're missing in a bit more detail? Is there any reason not to use them for the human reference? Yes, they may have *random in them, but is that such a problem?
– terdon♦
Jun 13 at 8:49
The main difference is in how the chromosomes are named: UCSC uses the "chr" prefix (so chromosome 1 is "chr1"), while GENCODE doesn't (so chromosome 1 is just "1"). Depending on your use case, this can cause problems. If you're trying to match a locus from the annotation (e.g. GENCODE 1:1000002) against your alignments, whatever tool you use will look in your aligned data for "1:1000002", but there it is named "chr1:1000002", so the two won't be matched up.
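Whether this applies to a given pair of files can be checked by comparing the sequence names directly. The following is a minimal sketch, assuming uncompressed FASTA and GTF input read line by line; the function names and the sample lines are made up for illustration and don't come from any particular tool.

```python
def fasta_names(lines):
    """Return the set of sequence names found in FASTA header lines."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                # The name is the first whitespace-delimited token after ">".
                names.add(fields[0])
    return names


def gtf_names(lines):
    """Return the set of sequence names used in column 1 of a GTF."""
    names = set()
    for line in lines:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_names(fasta, gtf):
    """Return (names present in both, names only present in the GTF)."""
    return fasta & gtf, gtf - fasta


# Illustrative input only: a "chr"-prefixed genome and an unprefixed annotation.
example_fasta = [">chr1 1\n", "ACGT\n", ">chr2 2\n", "ACGT\n"]
example_gtf = [
    "##description: example annotation\n",
    '1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "G1";\n',
]

shared, missing = compare_names(fasta_names(example_fasta), gtf_names(example_gtf))
print("shared:", sorted(shared))
print("in GTF but not in FASTA:", sorted(missing))
```

If the second list is non-empty for the primary chromosomes, features on those sequences will most likely not be matched to any reads. For real files, pass open file handles instead of the example lists.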
answered Jun 13 at 7:46
JenG
Both use the "chr" prefix for the most recent human and mouse releases.
– Devon Ryan♦
Jun 13 at 7:48
So it's with the annotation file that my results would vary, but not much with the reference?
– krushnach Chandra
Jun 13 at 7:59
Do you want to include older UCSC/GENCODE releases or only the most recent ones? They should now be fully compatible.
– Devon Ryan♦
Jun 13 at 7:50
So I used gencode.v21 as my GTF file, but I'm not sure about the reference assembly. I used hg38, but I'm not sure whether it's from GENCODE or UCSC. Is there a way to find out which reference assembly I'm using from the assembly file itself?
– krushnach Chandra
Jun 13 at 7:58
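One rough way to approach this is to look at the sequence names and lengths in the FASTA file itself. The sketch below is only a heuristic: it relies on the chr1 lengths of GRCh38 (248,956,422 bp) and GRCh37 (249,250,621 bp), and it assumes that contig names ending in "_random" or starting with "chrUn_" point to a UCSC-style file. The function names are hypothetical. Since the primary chromosomes are the same sequence whoever distributes them, the provider can't always be recovered this way.

```python
# chr1 lengths of the two most common human assemblies.
KNOWN_CHR1_LENGTHS = {
    248956422: "GRCh38/hg38",
    249250621: "GRCh37/hg19",
}


def sequence_lengths(lines):
    """Map each FASTA sequence name to its length in bases."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else None
            if name is not None:
                lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def describe_reference(lengths):
    """Guess (assembly, naming style) from sequence names and lengths."""
    # Accept both "chr1" and "1" as the name of chromosome 1.
    chr1_length = lengths.get("chr1", lengths.get("1"))
    assembly = KNOWN_CHR1_LENGTHS.get(chr1_length, "unknown")
    names = set(lengths)
    # Assumption: only UCSC-style files name unplaced contigs this way.
    if any(n.endswith("_random") or n.startswith("chrUn_") for n in names):
        style = "UCSC-style names"
    elif "chr1" in names:
        style = "chr-prefixed names"
    elif "1" in names:
        style = "unprefixed names"
    else:
        style = "unknown"
    return assembly, style


# Illustrative name/length table; the contig name and its length are examples only.
example = {"chr1": 248956422, "chr1_KI270706v1_random": 175055}
print(describe_reference(example))
```

For a real genome, open the FASTA file and pass the file handle to `sequence_lengths`, then hand the result to `describe_reference`. Reading a whole human genome this way takes a while, so a precomputed table of names and lengths is a faster source for the same dictionary if you have one.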