Lazy Loading a Bioinformatic SAM record
$begingroup$
I'm currently writing an API to work with Bioinformatic SAM records. Here's an example of one:
SBL_XSBF463_ID:3230017:BCR1:GCATAA:BCR2:CATATA/1:vpe 97 hs07 38253395 3 30M = 38330420 77055 TTGTTCCACTGCCAAAGAGTTTCTTATAAT EEEEEEEEEEEEAEEEEEEEEEEEEEEEEE PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:30 ZS:Z:R NH:i:2 HI:i:1 IH:i:1
Each tab-separated piece of information is its own field and corresponds to some type of data.
Now, it's important to note that these files get BIG (tens of GB), so splitting each record into its fields as soon as it's instantiated in some kind of POJO would be inefficient.
Hence, I've decided to create an object with a lazy loading mechanism. Only the original string is stored until one of the fields is requested by some calling code. This should minimise the amount of work done when the object is created, as well as minimise the amount of memory taken by the objects.
Here's my attempt:
    import java.util.Objects;

    /** Class for storing and working with sam formatted DNA sequence.
     *
     * Upon construction, only the String record is stored.
     * All querying of fields is done on demand, to save time.
     *
     */
    public class SamRecord implements Record {

        private final String read;
        private String id = null;
        private int flag = -1;
        private String referenceName = null;
        private int pos = -1;
        private int mappingQuality = -1;
        private String cigar = null;
        private String mateReferenceName = null;
        private int matePosition = -1;
        private int templateLength = -1;
        private String sequence = null;
        private String quality = null;
        private String variableTerms = null;

        private final static String REPEAT_TERM = "ZS:Z:R";
        private final static String MATCH_TERM = "ZS:Z:NM";
        private final static String QUALITY_CHECK_TERM = "ZS:Z:QC";

        /** Simple constructor for the sam record
         * @param read full read
         */
        public SamRecord(String read) {
            this.read = read;
        }

        public String getRead() {
            return read;
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public String getId() {
            if(id == null) {
                id = XsamReadQueries.findID(read);
            }
            return id;
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public int getFlag() throws NumberFormatException {
            if(flag == -1) {
                flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
            }
            return flag;
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public String getReferenceName() {
            if(referenceName == null) {
                referenceName = XsamReadQueries.findReferneceName(read);
            }
            return referenceName;
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public int getPos() throws NumberFormatException {
            if(pos == -1) {
                pos = Integer.parseInt(XsamReadQueries.findElement(read, 3));
            }
            return pos;
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public int getMappingQuality() throws NumberFormatException {
            if(mappingQuality == -1) {
                mappingQuality = Integer.parseInt(XsamReadQueries.findElement(read, 4));
            }
            return mappingQuality;
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public String getCigar() {
            if(cigar == null) {
                cigar = XsamReadQueries.findCigar(read);
            }
            return cigar;
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public String getMateReferenceName() {
            if(mateReferenceName == null) {
                mateReferenceName = XsamReadQueries.findElement(read, 6);
            }
            return mateReferenceName;
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public int getMatePosition() throws NumberFormatException {
            if(matePosition == -1) {
                matePosition = Integer.parseInt(XsamReadQueries.findElement(read, 7));
            }
            return matePosition;
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public int getTemplateLength() throws NumberFormatException {
            if(templateLength == -1) {
                templateLength = Integer.parseInt(XsamReadQueries.findElement(read, 8));
            }
            return templateLength;
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public String getSequence() {
            if(sequence == null) {
                sequence = XsamReadQueries.findBaseSequence(read);
            }
            return sequence;
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public String getQuality() {
            if(quality == null) {
                quality = XsamReadQueries.findElement(read, 10);
            }
            return quality;
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public boolean isRepeat() {
            return read.contains(REPEAT_TERM);
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public boolean isMapped() {
            return !read.contains(MATCH_TERM);
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public String getVariableTerms() {
            if(variableTerms == null) {
                variableTerms = XsamReadQueries.findVariableRegionSequence(read);
            }
            return variableTerms;
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public boolean isQualityFailed() {
            return read.contains(QUALITY_CHECK_TERM);
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (o == null || getClass() != o.getClass()) return false;
            SamRecord samRecord = (SamRecord) o;
            return Objects.equals(read, samRecord.read);
        }

        @Override
        public int hashCode() {
            return Objects.hash(read);
        }

        @Override
        public String toString() {
            return read;
        }
    }
The fields are returned by static methods in a helper class, which retrieve them by looking at where the tab characters are, e.g. flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
Below is the XsamReadQueries class:
    import java.util.Optional;

    /**
     * Non-instantiable utility class for working with Xsam reads
     */
    public final class XsamReadQueries {

        // Suppress instantiation
        private XsamReadQueries() {
            throw new AssertionError();
        }

        /** finds the position of the tab directly before the start of the variable region
         * @param read whole sam or Xsam read to search
         * @return position of the tab in the String
         */
        public static int findVariableRegionStart(String read) {
            int found = 0;
            for(int i = 0; i < read.length(); i++) {
                if(read.charAt(i) == '\t') {
                    found++;
                    if(found >= 11 && i+1 < read.length() && (read.charAt(i+1) != 'x' && read.charAt(i+1) != '\t')) { //guard against double-tabs
                        return i + 1;
                    }
                }
            }
            return -1;
        }

        /** Attempts to find the library name from SBL reads
         * where SBL reads have the id SBL_LibraryName_ID:XXXXX
         * if LibraryName ends with a lower case letter, the letter will be removed.
         * if SBL_LibID is not valid, return the full ID.
         * @param ID or String to search.
         * @return Library name with lower case endings removed
         */
        public static String findLibraryName(String ID) {
            if(!ID.startsWith("SBL")) return "";
            try {
                int firstPos = XsamReadQueries.findPosAfter(ID, "_");
                int i = firstPos;
                while (ID.charAt(i) != '_' && ID.charAt(i) != '\t') {
                    i++;
                }
                String library = ID.substring(firstPos, i);
                char lastChar = library.charAt(library.length()-1);
                if(lastChar >= 97 && lastChar <= 122) {
                    library = library.substring(0, library.length()-1);
                }
                return library;
            } catch (Exception e) {
                int i = 0;
                while(ID.charAt(i) != '\t') {
                    i++;
                    if(i == ID.length()) {
                        break;
                    }
                }
                return ID.substring(0, i);
            }
        }

        /** Returns the ID from the sample
         * @param sample Xsam read
         * @return ID
         */
        public static String findID(String sample) {
            return findElement(sample, 0);
        }

        /** Returns the phred score from the sample
         * @param sample Xsam read
         * @return phred string
         */
        public static String findPhred(String sample) {
            return findElement(sample, 10);
        }

        /**
         * Returns the cigar from the xsam read
         *
         * @param sample read
         * @return cigar string
         */
        public static String findCigar(String sample) {
            return findElement(sample, 5);
        }

        /**
         * Returns the bases from the xsam read
         *
         * @param sample read
         * @return base string
         */
        public static String findBaseSequence(String sample) {
            return findElement(sample, 9);
        }

        /**
         * finds the n'th element in the tab delimited sample
         * i.e. findElement("one\ttwo", 0) returns "one"
         * 0 indexed.
         *
         * @param sample String to search
         * @param element element to find
         * @return found element or "" if not found
         */
        public static String findElement(String sample, int element) {
            boolean tabsFound = false;
            int i = 0;
            int firstTab = 0;
            int secondTab = 0;
            int tabsToSkip = element - 1 >= 0 ? element - 1 : 0;
            int skippedTabs = 0;
            if (element == 0) {
                while (sample.charAt(i) != '\t') {
                    i++;
                }
                return sample.substring(0, i);
            } else {
                while (!tabsFound) {
                    if (sample.charAt(i) != '\t') {
                        i++;
                    } else {
                        if (skippedTabs == tabsToSkip) {
                            if (firstTab == 0) {
                                firstTab = i;
                            } else {
                                secondTab = i;
                                tabsFound = true;
                            }
                        } else {
                            skippedTabs++;
                        }
                        i++;
                    }
                }
                return sample.substring(firstTab + 1, secondTab);
            }
        }

        /** finds the variable region past the quality
         * @param sample sam or Xsam record string
         * @return variable sequence or empty string
         */
        public static String findVariableRegionSequence(String sample) {
            int start = findVariableRegionStart(sample);
            if(start == -1) return "";
            return sample.substring(findVariableRegionStart(sample));
        }

        /** finds the xL field
         * @param sample String to search
         * @return position if found, -1 if not.
         */
        public static int findxLField(String sample) {
            int chartStart = findPosAfter(sample, "\txL:i:");
            if (chartStart == -1) {
                return -1; //return -1 if not found.
            }
            int i = chartStart;
            while (sample.charAt(i) != '\t') {
                i++;
            }
            return Integer.parseInt(sample.substring(chartStart, i));
        }

        /** finds the xR field
         * @param sample String to search
         * @return position if found, '\0' (null) value if not.
         */
        public static int findxRField(String sample) {
            int chartStart = findPosAfter(sample, "\txR:i:");
            if (chartStart == -1) {
                return '\0'; //return NULL if not found.
            }
            int i = chartStart;
            while (sample.charAt(i) != '\t') {
                i++;
            }
            return Integer.parseInt(sample.substring(chartStart, i));
        }

        /** finds the xLSeq field
         * @param sample String to search
         * @return String if found, empty Optional if not.
         */
        public static Optional<String> findxLSeqField(String sample) {
            int charStart = findPosAfter(sample, "\txLseq:i:");
            if (charStart == -1) {
                return Optional.empty(); //return empty if not found.
            }
            int i = charStart;
            while (sample.charAt(i) != '\t') {
                i++;
            }
            return Optional.of(sample.substring(charStart, i));
        }

        /** finds the reference name field
         * @param sample String to search
         * @return String if found, empty string if not.
         */
        public static String findReferneceName(String sample) {
            //should always appear between the second and third tabs
            boolean tabsFound = false;
            int i = 0;
            int secondTab = 0;
            int thirdTab = 0;
            boolean skippedFirstTab = false;
            while (!tabsFound) {
                if (sample.charAt(i) != '\t') {
                    i++;
                } else {
                    if (skippedFirstTab) {
                        if (secondTab == 0) {
                            secondTab = i;
                        } else {
                            thirdTab = i;
                            tabsFound = true;
                        }
                    }
                    skippedFirstTab = true;
                    i++;
                }
            }
            if(sample.substring(secondTab + 1, thirdTab).contains("/")) {
                String[] split = sample.substring(secondTab + 1, thirdTab).split("/");
                return split[split.length-1];
            }
            return sample.substring(secondTab + 1, thirdTab);
        }

        /**
         * Finds the needle in the haystack, and returns the position of the single next digit.
         *
         * @param haystack The string to search
         * @param needle String field to search on.
         * @return position of the end of the needle
         */
        private static int findPosAfter(String haystack, String needle) {
            int hLen = haystack.length();
            int nLen = needle.length();
            int maxSearch = hLen - nLen;
            outer:
            for (int i = 0; i < maxSearch; i++) {
                for (int j = 0; j < nLen; j++) {
                    if (haystack.charAt(i + j) != needle.charAt(j)) {
                        continue outer;
                    }
                }
                // If it reaches here, match has been found:
                return i + nLen;
            }
            return -1; // Not found
        }
    }
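For comparison, the tab-scanning lookup described above can be sketched more compactly with String.indexOf. This is only an illustrative sketch, not the code under review: the class name TabFields and the method nthField are hypothetical, and unlike findElement it assumes that a missing field should yield an empty string and that the last field may have no trailing tab.

```java
// Hypothetical helper, shown only to illustrate the tab-scanning idea.
final class TabFields {
    private TabFields() {}

    /** Returns the n-th (0-indexed) tab-delimited field, or "" if the line has fewer fields. */
    static String nthField(String line, int n) {
        int start = 0;
        // Skip past n tabs; each iteration moves start to the character after the next tab.
        for (int k = 0; k < n; k++) {
            int tab = line.indexOf('\t', start);
            if (tab == -1) {
                return ""; // fewer than n + 1 fields
            }
            start = tab + 1;
        }
        // The field ends at the next tab, or at the end of the line for the last field.
        int end = line.indexOf('\t', start);
        return end == -1 ? line.substring(start) : line.substring(start, end);
    }
}
```

With this sketch, TabFields.nthField(read, 1) would return the flag field as a String without allocating anything for the other fields.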
My question is: are there any drawbacks to this approach? Or is there an alternative that might be more effective?
Thanks in advance,
Sam
Edit:
Instantiating code:
    public interface RecordFactory<T extends Record> {
        T createRecord(String recordString);
    }
Implementing it like:
    private RecordFactory<SamRecord> samRecordFactory = SamRecord::new;
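To show how such a factory might be used, here is a minimal, self-contained sketch that wraps raw lines in records without parsing any field. The names RawSamRecord, FactoryDemo and wrapAll are hypothetical stand-ins, and the Record interface is reduced to a single assumed method, getRead().

```java
import java.util.List;
import java.util.stream.Collectors;

// Stand-in for the Record interface; only getRead() is assumed here.
interface Record {
    String getRead();
}

// Stand-in for SamRecord: stores the raw line and nothing else.
final class RawSamRecord implements Record {
    private final String read;

    RawSamRecord(String read) {
        this.read = read;
    }

    @Override
    public String getRead() {
        return read;
    }
}

interface RecordFactory<T extends Record> {
    T createRecord(String recordString);
}

// Hypothetical helper class for the demonstration.
final class FactoryDemo {
    private FactoryDemo() {}

    /** Wraps each raw line in a record via the factory; no field is parsed at this point. */
    static <T extends Record> List<T> wrapAll(List<String> lines, RecordFactory<T> factory) {
        return lines.stream()
                .map(factory::createRecord)
                .collect(Collectors.toList());
    }
}
```

A call such as FactoryDemo.wrapAll(lines, RawSamRecord::new) then produces the collection that would be handed to a worker.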
java bioinformatics lazy
$endgroup$
$begingroup$
Is this class, or could this class, be used in a multithreaded scenario?
$endgroup$
– IEatBagels
Apr 23 at 13:47
$begingroup$
It will be, yes. However, it's unlikely to be shared between threads. It's more likely that collections of these records will be handed off to separate workers.
$endgroup$
– Sam
Apr 23 at 13:48
$begingroup$
0. Do you have an actual, tested performance issue, or are you guessing at a potential problem? 1. What are you trying to minimize, execution time or memory footprint?
$endgroup$
– Eric Stein
Apr 23 at 14:41
$begingroup$
Also, are you able/willing to share the code that instantiates SamRecords?
$endgroup$
– Eric Stein
Apr 23 at 15:09
$begingroup$
Hi @EricStein. Sure thing, but it's just a short factory method using generics (since SamRecord extends from Record, and I also have a few other types that extend Record too). I know from past experience that this is a performance issue, since these files have hundreds of millions of records. When I split the file and hand the records to workers, I need to do as little as possible with the records themselves. The software using this API most likely won't need every detail of the record at one time.
$endgroup$
– Sam
2 days ago
$begingroup$
if (charStart == -1)
return Optional.empty(); //return NULL if not found.
int i = charStart;
while (sample.charAt(i) != 't')
i++;
return Optional.of(sample.substring(charStart, i));
/** finds the reference name field
* @param sample String to search
* @return String if found, empty string if not.
*/
public static String findReferneceName(String sample)
//should always appear between the second and third tabs
boolean tabsFound = false;
int i = 0;
int secondTab = 0;
int thirdTab = 0;
boolean skippedFirstTab = false;
while (!tabsFound)
if (sample.charAt(i) != 't')
i++;
else
if (skippedFirstTab)
if (secondTab == 0)
secondTab = i;
else
thirdTab = i;
tabsFound = true;
skippedFirstTab = true;
i++;
if(sample.substring(secondTab + 1, thirdTab).contains("/"))
String[] split = sample.substring(secondTab + 1, thirdTab).split("/");
return split[split.length-1];
return sample.substring(secondTab + 1, thirdTab);
/**
* Finds the needle in the haystack, and returns the position of the single next digit.
*
* @param haystack The string to search
* @param needle String field to search on.
* @return position of the end of the needle
*/
private static int findPosAfter(String haystack, String needle)
int hLen = haystack.length();
int nLen = needle.length();
int maxSearch = hLen - nLen;
outer:
for (int i = 0; i < maxSearch; i++)
for (int j = 0; j < nLen; j++)
if (haystack.charAt(i + j) != needle.charAt(j))
continue outer;
// If it reaches here, match has been found:
return i + nLen;
return -1; // Not found
My question is, are there are any drawbacks to this approach? Or any alternative way that might be more effective?
Thanks in advance,
Sam
Edit:
Instantiating code:
public interface RecordFactory<T extends Record>
T createRecord(String recordString);
Implementing it like:
private RecordFactory<SamRecord> samRecordFactory = SamRecord::new
java bioinformatics lazy
$endgroup$
I'm currently writing an API to work with Bioinformatic SAM records. Here's an example of one:
SBL_XSBF463_ID:3230017:BCR1:GCATAA:BCR2:CATATA/1:vpe 97 hs07 38253395 3 30M = 38330420 77055 TTGTTCCACTGCCAAAGAGTTTCTTATAAT EEEEEEEEEEEEAEEEEEEEEEEEEEEEEE PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:30 ZS:Z:R NH:i:2 HI:i:1 IH:i:1
Each tab-separated piece of information is its own field and corresponds to some type of data.
Now, it's important to note that these files get BIG (tens of GB), so splitting each record into its fields as soon as it is instantiated in some kind of POJO would be inefficient.
Hence, I've decided to create an object with a lazy loading mechanism. Only the original string is stored until one of the fields is requested by some calling code. This should minimise the amount of work done when the object is created, as well as minimise the amount of memory taken by the objects.
Here's my attempt:
import java.util.Objects;

/**
 * Class for storing and working with sam formatted DNA sequence.
 *
 * Upon construction, only the String record is stored.
 * All querying of fields is done on demand, to save time.
 */
public class SamRecord implements Record {

    private final String read;
    private String id = null;
    private int flag = -1;
    private String referenceName = null;
    private int pos = -1;
    private int mappingQuality = -1;
    private String cigar = null;
    private String mateReferenceName = null;
    private int matePosition = -1;
    private int templateLength = -1;
    private String sequence = null;
    private String quality = null;
    private String variableTerms = null;

    private final static String REPEAT_TERM = "ZS:Z:R";
    private final static String MATCH_TERM = "ZS:Z:NM";
    private final static String QUALITY_CHECK_TERM = "ZS:Z:QC";

    /**
     * Simple constructor for the sam record
     * @param read full read
     */
    public SamRecord(String read) {
        this.read = read;
    }

    public String getRead() {
        return read;
    }

    /** {@inheritDoc} */
    @Override
    public String getId() {
        if (id == null) {
            id = XsamReadQueries.findID(read);
        }
        return id;
    }

    /** {@inheritDoc} */
    @Override
    public int getFlag() throws NumberFormatException {
        if (flag == -1) {
            flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
        }
        return flag;
    }

    /** {@inheritDoc} */
    @Override
    public String getReferenceName() {
        if (referenceName == null) {
            referenceName = XsamReadQueries.findReferneceName(read);
        }
        return referenceName;
    }

    /** {@inheritDoc} */
    @Override
    public int getPos() throws NumberFormatException {
        if (pos == -1) {
            pos = Integer.parseInt(XsamReadQueries.findElement(read, 3));
        }
        return pos;
    }

    /** {@inheritDoc} */
    @Override
    public int getMappingQuality() throws NumberFormatException {
        if (mappingQuality == -1) {
            mappingQuality = Integer.parseInt(XsamReadQueries.findElement(read, 4));
        }
        return mappingQuality;
    }

    /** {@inheritDoc} */
    @Override
    public String getCigar() {
        if (cigar == null) {
            cigar = XsamReadQueries.findCigar(read);
        }
        return cigar;
    }

    /** {@inheritDoc} */
    @Override
    public String getMateReferenceName() {
        if (mateReferenceName == null) {
            mateReferenceName = XsamReadQueries.findElement(read, 6);
        }
        return mateReferenceName;
    }

    /** {@inheritDoc} */
    @Override
    public int getMatePosition() throws NumberFormatException {
        if (matePosition == -1) {
            matePosition = Integer.parseInt(XsamReadQueries.findElement(read, 7));
        }
        return matePosition;
    }

    /** {@inheritDoc} */
    @Override
    public int getTemplateLength() throws NumberFormatException {
        if (templateLength == -1) {
            templateLength = Integer.parseInt(XsamReadQueries.findElement(read, 8));
        }
        return templateLength;
    }

    /** {@inheritDoc} */
    @Override
    public String getSequence() {
        if (sequence == null) {
            sequence = XsamReadQueries.findBaseSequence(read);
        }
        return sequence;
    }

    /** {@inheritDoc} */
    @Override
    public String getQuality() {
        if (quality == null) {
            quality = XsamReadQueries.findElement(read, 10);
        }
        return quality;
    }

    /** {@inheritDoc} */
    @Override
    public boolean isRepeat() {
        return read.contains(REPEAT_TERM);
    }

    /** {@inheritDoc} */
    @Override
    public boolean isMapped() {
        return !read.contains(MATCH_TERM);
    }

    /** {@inheritDoc} */
    @Override
    public String getVariableTerms() {
        if (variableTerms == null) {
            variableTerms = XsamReadQueries.findVariableRegionSequence(read);
        }
        return variableTerms;
    }

    /** {@inheritDoc} */
    @Override
    public boolean isQualityFailed() {
        return read.contains(QUALITY_CHECK_TERM);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        SamRecord samRecord = (SamRecord) o;
        return Objects.equals(read, samRecord.read);
    }

    @Override
    public int hashCode() {
        return Objects.hash(read);
    }

    @Override
    public String toString() {
        return read;
    }
}
The fields are returned by static methods in a helper class, which retrieve them by looking at where the tab characters are, e.g. flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
Below is the XsamReadQueries class:
import java.util.Optional;

/**
 * Non-instantiable utility class for working with Xsam reads
 */
public final class XsamReadQueries {

    // Suppress instantiation
    private XsamReadQueries() {
        throw new AssertionError();
    }

    /**
     * finds the position of the tab directly before the start of the variable region
     * @param read whole sam or Xsam read to search
     * @return position of the tab in the String
     */
    public static int findVariableRegionStart(String read) {
        int found = 0;
        for (int i = 0; i < read.length(); i++) {
            if (read.charAt(i) == '\t') {
                found++;
                if (found >= 11 && i + 1 < read.length()
                        && (read.charAt(i + 1) != 'x' && read.charAt(i + 1) != '\t')) { //guard against double-tabs
                    return i + 1;
                }
            }
        }
        return -1;
    }

    /**
     * Attempts to find the library name from SBL reads
     * where SBL reads have the id SBL_LibraryName_ID:XXXXX
     * if LibraryName ends with a lower case letter, the letter will be removed.
     * if SBL_LibID is not valid, return the full ID.
     * @param ID or String to search.
     * @return Library name with lower case endings removed
     */
    public static String findLibraryName(String ID) {
        if (!ID.startsWith("SBL")) return "";
        try {
            int firstPos = XsamReadQueries.findPosAfter(ID, "_");
            int i = firstPos;
            while (ID.charAt(i) != '_' && ID.charAt(i) != '\t') {
                i++;
            }
            String library = ID.substring(firstPos, i);
            char lastChar = library.charAt(library.length() - 1);
            if (lastChar >= 97 && lastChar <= 122) {
                library = library.substring(0, library.length() - 1);
            }
            return library;
        } catch (Exception e) {
            int i = 0;
            while (ID.charAt(i) != '\t') {
                i++;
                if (i == ID.length()) {
                    break;
                }
            }
            return ID.substring(0, i);
        }
    }

    /**
     * Returns the ID from the sample
     * @param sample Xsam read
     * @return ID
     */
    public static String findID(String sample) {
        return findElement(sample, 0);
    }

    /**
     * Returns the phred score from the sample
     * @param sample Xsam read
     * @return phred string
     */
    public static String findPhred(String sample) {
        return findElement(sample, 10);
    }

    /**
     * Returns the cigar from the xsam read
     *
     * @param sample read
     * @return cigar string
     */
    public static String findCigar(String sample) {
        return findElement(sample, 5);
    }

    /**
     * Returns the bases from the xsam read
     *
     * @param sample read
     * @return base string
     */
    public static String findBaseSequence(String sample) {
        return findElement(sample, 9);
    }

    /**
     * finds the n'th element in the tab delimited sample
     * i.e. findElement(0) returns one from "one\ttwo"
     * 0 indexed.
     *
     * @param sample String to search
     * @param element element to find
     * @return found element or "" if not found
     */
    public static String findElement(String sample, int element) {
        boolean tabsFound = false;
        int i = 0;
        int firstTab = 0;
        int secondTab = 0;
        int tabsToSkip = element - 1 >= 0 ? element - 1 : 0;
        int skippedTabs = 0;
        if (element == 0) {
            while (sample.charAt(i) != '\t') {
                i++;
            }
            return sample.substring(0, i);
        } else {
            while (!tabsFound) {
                if (sample.charAt(i) != '\t') {
                    i++;
                } else {
                    if (skippedTabs == tabsToSkip) {
                        if (firstTab == 0) {
                            firstTab = i;
                        } else {
                            secondTab = i;
                            tabsFound = true;
                        }
                    } else {
                        skippedTabs++;
                    }
                    i++;
                }
            }
            return sample.substring(firstTab + 1, secondTab);
        }
    }

    /**
     * finds the variable region past the quality
     * @param sample sam or Xsam record string
     * @return variable sequence or empty string
     */
    public static String findVariableRegionSequence(String sample) {
        int start = findVariableRegionStart(sample);
        if (start == -1) return "";
        return sample.substring(start);
    }

    /**
     * finds the xL field
     * @param sample String to search
     * @return position if found, -1 if not.
     */
    public static int findxLField(String sample) {
        int charStart = findPosAfter(sample, "\txL:i:");
        if (charStart == -1) {
            return -1; //return -1 if not found.
        }
        int i = charStart;
        while (sample.charAt(i) != '\t') {
            i++;
        }
        return Integer.parseInt(sample.substring(charStart, i));
    }

    /**
     * finds the xR field
     * @param sample String to search
     * @return position if found, '\0' (null) value if not.
     */
    public static int findxRField(String sample) {
        int charStart = findPosAfter(sample, "\txR:i:");
        if (charStart == -1) {
            return '\0'; //return NULL if not found.
        }
        int i = charStart;
        while (sample.charAt(i) != '\t') {
            i++;
        }
        return Integer.parseInt(sample.substring(charStart, i));
    }

    /**
     * finds the xLSeq field
     * @param sample String to search
     * @return String if found, empty Optional if not.
     */
    public static Optional<String> findxLSeqField(String sample) {
        int charStart = findPosAfter(sample, "\txLseq:i:");
        if (charStart == -1) {
            return Optional.empty(); //return empty if not found.
        }
        int i = charStart;
        while (sample.charAt(i) != '\t') {
            i++;
        }
        return Optional.of(sample.substring(charStart, i));
    }

    /**
     * finds the reference name field
     * @param sample String to search
     * @return String if found, empty string if not.
     */
    public static String findReferneceName(String sample) {
        //should always appear between the second and third tabs
        boolean tabsFound = false;
        int i = 0;
        int secondTab = 0;
        int thirdTab = 0;
        boolean skippedFirstTab = false;
        while (!tabsFound) {
            if (sample.charAt(i) != '\t') {
                i++;
            } else {
                if (skippedFirstTab) {
                    if (secondTab == 0) {
                        secondTab = i;
                    } else {
                        thirdTab = i;
                        tabsFound = true;
                    }
                }
                skippedFirstTab = true;
                i++;
            }
        }
        if (sample.substring(secondTab + 1, thirdTab).contains("/")) {
            String[] split = sample.substring(secondTab + 1, thirdTab).split("/");
            return split[split.length - 1];
        }
        return sample.substring(secondTab + 1, thirdTab);
    }

    /**
     * Finds the needle in the haystack, and returns the position of the single next digit.
     *
     * @param haystack The string to search
     * @param needle String field to search on.
     * @return position of the end of the needle
     */
    private static int findPosAfter(String haystack, String needle) {
        int hLen = haystack.length();
        int nLen = needle.length();
        int maxSearch = hLen - nLen;
        outer:
        for (int i = 0; i < maxSearch; i++) {
            for (int j = 0; j < nLen; j++) {
                if (haystack.charAt(i + j) != needle.charAt(j)) {
                    continue outer;
                }
            }
            // If it reaches here, match has been found:
            return i + nLen;
        }
        return -1; // Not found
    }
}
My question is: are there any drawbacks to this approach? Or is there an alternative that might be more effective?
Thanks in advance,
Sam
Edit:
Instantiating code:
public interface RecordFactory<T extends Record> {
    T createRecord(String recordString);
}
Implementing it like:
private RecordFactory<SamRecord> samRecordFactory = SamRecord::new;
java bioinformatics lazy
edited 2 days ago
Sam
asked Apr 23 at 12:45
Sam
Is this class, or could this class, be used in a multithreaded scenario?
– IEatBagels Apr 23 at 13:47
It will be, yes. However, it's unlikely to be shared between threads. It's more likely that collections of these records will be handed off to separate workers.
– Sam Apr 23 at 13:48
0. Do you have an actual, tested performance issue, or are you guessing at a potential problem? 1. What are you trying to minimize, execution time or memory footprint?
– Eric Stein Apr 23 at 14:41
Also, are you able/willing to share the code that instantiates SamRecords?
– Eric Stein Apr 23 at 15:09
Hi @EricStein. Sure thing, but it's just a short factory method using generics (since SamRecord extends from Record, and I also have a few other types that extend Record too). I know from past experience that this is a performance issue, since these files have hundreds of millions of records. When I split the file and hand the records to workers, I need to do as little as possible with the records themselves. The software using this API most likely won't need every detail of the record at one time.
– Sam 2 days ago
1 Answer
Performance
There is one thing that I believe could increase the performance of your application.
You often call findElement, which scans the SAM record from the start every time.
If you load a record, you can be fairly certain that you will access it at least once.
At some point, maybe when creating the object or when accessing a property for the first time, you should "index" your SAM record.
Go through the whole record once and keep an array of where the tabs are. This way, if your code ends up calling:
XsamReadQueries.findElement(read, 1)
XsamReadQueries.findElement(read, 2)
XsamReadQueries.findElement(read, 3)
the second and third calls would be much faster than they are now.
To do this, you could add a method to XsamReadQueries named something like IndexTabs, which would return an array of ints.
If you want more insight as to how to do this, you can write a comment and I'll add more information, but I'm pretty sure this would help you.
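The indexing idea could be sketched roughly as follows. This is only an illustration under assumptions: the class name TabIndexSketch, the method name indexTabs (lower-cased to follow Java conventions) and the three-argument findElement overload are hypothetical and not part of the original code.

```java
import java.util.Arrays;

/** Hypothetical helper (not in the original code): records tab offsets once per record. */
final class TabIndexSketch {

    private TabIndexSketch() { }

    /** Returns the offsets of every tab character in the record, in order of appearance. */
    static int[] indexTabs(String read) {
        int[] tabs = new int[16]; // room for the 11 mandatory SAM fields plus a few optional tags
        int count = 0;
        for (int i = 0; i < read.length(); i++) {
            if (read.charAt(i) == '\t') {
                if (count == tabs.length) {
                    tabs = Arrays.copyOf(tabs, count * 2); // grow when a record has many tags
                }
                tabs[count++] = i;
            }
        }
        return Arrays.copyOf(tabs, count); // trim to the number of tabs actually found
    }

    /** Returns the zero-based field using a precomputed tab index, or "" if it does not exist. */
    static String findElement(String read, int[] tabs, int element) {
        if (element < 0 || element > tabs.length) {
            return "";
        }
        int start = element == 0 ? 0 : tabs[element - 1] + 1;
        int end = element == tabs.length ? read.length() : tabs[element];
        return read.substring(start, end);
    }
}
```

A record could build the int[] lazily on the first getter call and keep it in a field, so that every later lookup is a single substring call instead of a new scan. The trade-off is one extra array per record, which matters if memory rather than time is the main concern.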
Code style
There are one or two things that bother me in your code with regard to clarity and future maintenance.
You have methods such as findPhred, which call findElement, but in SamRecord you sometimes call findElement directly and sometimes a specific find* method, which is basically the same code. You should decide on one way to do things: either have a specific method for each field in XsamReadQueries, or keep only the findElement method.
Finally, you could consider using an enum for the element parameter of the findElement method.
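As a rough sketch of the enum suggestion: the enum SamField and the class SamFieldSketch below are hypothetical names, and the findElement shown is a simplified stand-in for the original method rather than a drop-in replacement. The constants use the column names from the SAM specification, in column order, which matches the indices used in the question (0 for the ID, 5 for the CIGAR, 10 for the quality).

```java
/** Hypothetical enum naming the 11 mandatory SAM columns in their on-disk order. */
enum SamField {
    QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL;

    /** Zero-based column index; relies on the declaration order above. */
    int index() {
        return ordinal();
    }
}

/** Hypothetical stand-in for XsamReadQueries, reduced to a single lookup method. */
final class SamFieldSketch {

    private SamFieldSketch() { }

    /** Returns the requested column of a tab-delimited record, or "" if it is missing. */
    static String findElement(String sample, SamField field) {
        int start = 0;
        // Skip one tab for every column that precedes the requested one.
        for (int skipped = 0; skipped < field.index(); skipped++) {
            int tab = sample.indexOf('\t', start);
            if (tab == -1) {
                return "";
            }
            start = tab + 1;
        }
        int end = sample.indexOf('\t', start);
        return end == -1 ? sample.substring(start) : sample.substring(start, end);
    }
}
```

A call such as SamFieldSketch.findElement(read, SamField.CIGAR) then documents itself, whereas findElement(read, 5) needs a comment or a wrapper method. Because index() depends on ordinal(), the constants must stay in column order; an explicit int field per constant would avoid that coupling.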
edited Apr 23 at 14:19
answered Apr 23 at 14:13
IEatBagels
Hi @IEatBagels, thanks for the answer. Indexing is a really good idea, I'll definitely look to implement that. I agree with the find* notation. It's partly because I'm halfway through coding the API and wanted some feedback before committing to one way or the other. The enum's a good idea too, it'll definitely make it more readable!
– Sam Apr 23 at 16:00
Is this class, or could this class be, used in a multithreaded scenario?
– IEatBagels, Apr 23 at 13:47
It will be, yes. However, it's unlikely to be shared between threads. It's more likely that collections of these records will be handed off to separate workers.
– Sam, Apr 23 at 13:48
0. Do you have an actual, tested performance issue, or are you guessing at a potential problem? 1. What are you trying to minimize, execution time or memory footprint?
– Eric Stein, Apr 23 at 14:41
Also, are you able/willing to share the code that instantiates SamRecords?
– Eric Stein, Apr 23 at 15:09
Hi @EricStein. Sure thing, but it's just a short factory method using generics (since SamRecord extends from Record, and I also have a few other types that extend Record too). I know from past experience that this is a performance issue, since these files have hundreds of millions of records. When I split the file and hand the records to workers, I need to do as little as possible with the records themselves. The software using this API most likely won't need every detail of the record at one time.
– Sam, 2 days ago
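For readers following the thread, the "do as little as possible per record" goal described above could be sketched roughly as follows. This is an assumption-laden illustration, not the asker's SamRecord: the class name LazySamRecord and its methods are hypothetical. The constructor only stores the raw line, and the POS column (the fourth mandatory SAM field) is parsed and cached the first time it is requested. The caching is not synchronized, which matches the stated expectation that a record is not shared between threads.

```java
// Hypothetical sketch: LazySamRecord is an illustrative name, not the asker's class.
final class LazySamRecord {
    private final String rawLine;  // storing the line is the only work done at construction
    private Integer position;      // null until getPosition() is first called

    LazySamRecord(String rawLine) {
        this.rawLine = rawLine;
    }

    /** Leftmost mapping position (POS, fourth SAM column), parsed on first access and then cached. */
    int getPosition() {
        if (position == null) {
            position = Integer.parseInt(column(3));
        }
        return position;
    }

    /** Exposed only so the lazy behaviour can be observed in this sketch. */
    boolean isPositionParsed() {
        return position != null;
    }

    /** Extracts one zero-based column by scanning for tabs instead of splitting the line. */
    private String column(int index) {
        int start = 0;
        for (int i = 0; i < index; i++) {
            int tab = rawLine.indexOf('\t', start);
            if (tab < 0) {
                throw new IllegalStateException("Missing column " + index);
            }
            start = tab + 1;
        }
        int end = rawLine.indexOf('\t', start);
        return end < 0 ? rawLine.substring(start) : rawLine.substring(start, end);
    }
}
```

Whether this actually beats eager parsing depends on how many fields each worker ends up reading, so it is worth measuring against a real file before committing to it.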