


Lazy Loading a Bioinformatic SAM record














$begingroup$


I'm currently writing an API to work with Bioinformatic SAM records. Here's an example of one:



SBL_XSBF463_ID:3230017:BCR1:GCATAA:BCR2:CATATA/1:vpe 97 hs07 38253395 3 30M = 38330420 77055 TTGTTCCACTGCCAAAGAGTTTCTTATAAT EEEEEEEEEEEEAEEEEEEEEEEEEEEEEE PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:30 ZS:Z:R NH:i:2 HI:i:1 IH:i:1



Each piece of information separated by a tab is its own field and corresponds to some type of data.



Now, it's important to note that these files get BIG (tens of GB), so splitting each record as soon as it's instantiated in some kind of POJO would be inefficient.



Hence, I've decided to create an object with a lazy loading mechanism. Only the original string is stored until one of the fields is requested by some calling code. This should minimise the amount of work done when the object is created, as well as minimise the amount of memory taken by the objects.



Here's my attempt:



/** Class for storing and working with sam formatted DNA sequence.
 *
 * Upon construction, only the String record is stored.
 * All querying of fields is done on demand, to save time.
 *
 */
public class SamRecord implements Record {

    private final String read;
    private String id = null;
    private int flag = -1;
    private String referenceName = null;
    private int pos = -1;
    private int mappingQuality = -1;
    private String cigar = null;
    private String mateReferenceName = null;
    private int matePosition = -1;
    private int templateLength = -1;
    private String sequence = null;
    private String quality = null;
    private String variableTerms = null;

    private final static String REPEAT_TERM = "ZS:Z:R";
    private final static String MATCH_TERM = "ZS:Z:NM";
    private final static String QUALITY_CHECK_TERM = "ZS:Z:QC";

    /** Simple constructor for the sam record
     * @param read full read
     */
    public SamRecord(String read) {
        this.read = read;
    }

    public String getRead() {
        return read;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getId() {
        if (id == null) {
            id = XsamReadQueries.findID(read);
        }
        return id;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public int getFlag() throws NumberFormatException {
        if (flag == -1) {
            flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
        }
        return flag;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getReferenceName() {
        if (referenceName == null) {
            referenceName = XsamReadQueries.findReferneceName(read);
        }
        return referenceName;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public int getPos() throws NumberFormatException {
        if (pos == -1) {
            pos = Integer.parseInt(XsamReadQueries.findElement(read, 3));
        }
        return pos;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public int getMappingQuality() throws NumberFormatException {
        if (mappingQuality == -1) {
            mappingQuality = Integer.parseInt(XsamReadQueries.findElement(read, 4));
        }
        return mappingQuality;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getCigar() {
        if (cigar == null) {
            cigar = XsamReadQueries.findCigar(read);
        }
        return cigar;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getMateReferenceName() {
        if (mateReferenceName == null) {
            mateReferenceName = XsamReadQueries.findElement(read, 6);
        }
        return mateReferenceName;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public int getMatePosition() throws NumberFormatException {
        if (matePosition == -1) {
            matePosition = Integer.parseInt(XsamReadQueries.findElement(read, 7));
        }
        return matePosition;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public int getTemplateLength() throws NumberFormatException {
        if (templateLength == -1) {
            templateLength = Integer.parseInt(XsamReadQueries.findElement(read, 8));
        }
        return templateLength;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getSequence() {
        if (sequence == null) {
            sequence = XsamReadQueries.findBaseSequence(read);
        }
        return sequence;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getQuality() {
        if (quality == null) {
            quality = XsamReadQueries.findElement(read, 10);
        }
        return quality;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public boolean isRepeat() {
        return read.contains(REPEAT_TERM);
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public boolean isMapped() {
        return !read.contains(MATCH_TERM);
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getVariableTerms() {
        if (variableTerms == null) {
            variableTerms = XsamReadQueries.findVariableRegionSequence(read);
        }
        return variableTerms;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public boolean isQualityFailed() {
        return read.contains(QUALITY_CHECK_TERM);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        SamRecord samRecord = (SamRecord) o;
        return Objects.equals(read, samRecord.read);
    }

    @Override
    public int hashCode() {
        return Objects.hash(read);
    }

    @Override
    public String toString() {
        return read;
    }
}




The fields are returned by static methods in a helper class, which retrieves them by looking at where the tab characters are, e.g. flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));



Below is the XsamReadQueries class:
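To show the tab-position idea in isolation, here is a minimal sketch of an n-th field lookup. It is not the helper used in this post: `TabFields` and `nthField` are hypothetical names, and it uses `String.indexOf(int, int)` instead of a hand-written character loop. It also assumes a missing field should come back as an empty string.

```java
// Hypothetical helper for illustration only; not part of XsamReadQueries.
final class TabFields {

    private TabFields() { }

    /**
     * Returns the n-th (0-indexed) tab-delimited field of line,
     * or "" if the line has fewer than n + 1 fields (an assumed convention).
     */
    static String nthField(String line, int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be >= 0");
        }
        int start = 0;
        // Skip n tabs to reach the start of the requested field.
        for (int skipped = 0; skipped < n; skipped++) {
            int tab = line.indexOf('\t', start);
            if (tab == -1) {
                return ""; // fewer fields than requested
            }
            start = tab + 1;
        }
        // The field ends at the next tab, or at the end of the line.
        int end = line.indexOf('\t', start);
        return end == -1 ? line.substring(start) : line.substring(start, end);
    }
}
```

Calling `TabFields.nthField(read, 1)` would then return the flag column as a String, ready to be passed to `Integer.parseInt`.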



/**
 * Non-instantiable utility class for working with Xsam reads
 */
public final class XsamReadQueries {

    // Suppress instantiation
    private XsamReadQueries() {
        throw new AssertionError();
    }

    /** finds the position of the tab directly before the start of the variable region
     * @param read whole sam or Xsam read to search
     * @return position of the tab in the String
     */
    public static int findVariableRegionStart(String read) {

        int found = 0;

        for (int i = 0; i < read.length(); i++) {
            if (read.charAt(i) == '\t') {
                found++;
                if (found >= 11 && i+1 < read.length() && (read.charAt(i+1) != 'x' && read.charAt(i+1) != '\t')) { //guard against double-tabs
                    return i + 1;
                }
            }
        }

        return -1;
    }

    /** Attempts to find the library name from SBL reads
     * where SBL reads have the id SBL_LibraryName_ID:XXXXX
     * if LibraryName ends with a lower case letter, the letter will be removed.
     * if SBL_LibID is not valid, return the full ID.
     * @param ID or String to search.
     * @return Library name with lower case endings removed
     */
    public static String findLibraryName(String ID) {

        if (!ID.startsWith("SBL")) return "";

        try {
            int firstPos = XsamReadQueries.findPosAfter(ID, "_");
            int i = firstPos;
            while (ID.charAt(i) != '_' && ID.charAt(i) != '\t') {
                i++;
            }

            String library = ID.substring(firstPos, i);

            char lastChar = library.charAt(library.length()-1);
            if (lastChar >= 97 && lastChar <= 122) {
                library = library.substring(0, library.length()-1);
            }

            return library;

        } catch (Exception e) {
            int i = 0;
            while (ID.charAt(i) != '\t') {
                i++;
                if (i == ID.length()) {
                    break;
                }
            }
            return ID.substring(0, i);
        }
    }

    /** Returns the ID from the sample
     * @param sample Xsam read
     * @return ID
     */
    public static String findID(String sample) {
        return findElement(sample, 0);
    }

    /** Returns the phred score from the sample
     * @param sample Xsam read
     * @return phred string
     */
    public static String findPhred(String sample) {
        return findElement(sample, 10);
    }

    /**
     * Returns the cigar from the xsam read
     *
     * @param sample read
     * @return cigar string
     */
    public static String findCigar(String sample) {
        return findElement(sample, 5);
    }

    /**
     * Returns the bases from the xsam read
     *
     * @param sample read
     * @return base string
     */
    public static String findBaseSequence(String sample) {
        return findElement(sample, 9);
    }

    /**
     * finds the n'th element in the tab delimited sample
     * i.e findElement(0) returns one from "one\ttwo"
     * 0 indexed.
     *
     * @param sample String to search
     * @param element element to find
     * @return found element or "" if not found
     */
    public static String findElement(String sample, int element) {
        boolean tabsFound = false;
        int i = 0;
        int firstTab = 0;
        int secondTab = 0;
        int tabsToSkip = element - 1 >= 0 ? element - 1 : 0;
        int skippedTabs = 0;
        if (element == 0) {
            while (sample.charAt(i) != '\t') {
                i++;
            }
            return sample.substring(0, i);
        } else {
            while (!tabsFound) {
                if (sample.charAt(i) != '\t') {
                    i++;
                } else {
                    if (skippedTabs == tabsToSkip) {
                        if (firstTab == 0) {
                            firstTab = i;
                        } else {
                            secondTab = i;
                            tabsFound = true;
                        }
                    } else {
                        skippedTabs++;
                    }
                    i++;
                }
            }
        }
        return sample.substring(firstTab + 1, secondTab);
    }

    /** finds the variable region past the quality
     * @param sample sam or Xsam record string
     * @return variable sequence or empty string
     */
    public static String findVariableRegionSequence(String sample) {

        int start = findVariableRegionStart(sample);

        if (start == -1) return "";

        return sample.substring(findVariableRegionStart(sample));
    }

    /** finds the xL field
     * @param sample String to search
     * @return position if found, '' (null) value if not.
     */
    public static int findxLField(String sample) {
        int chartStart = findPosAfter(sample, "\txL:i:");
        if (chartStart == -1) {
            return -1; //return -1 if not found.
        }
        int i = chartStart;
        while (sample.charAt(i) != '\t') {
            i++;
        }
        return Integer.parseInt(sample.substring(chartStart, i));
    }

    /** finds the xR field
     * @param sample String to search
     * @return position if found, '' (null) value if not.
     */
    public static int findxRField(String sample) {
        int chartStart = findPosAfter(sample, "\txR:i:");
        if (chartStart == -1) {
            return ''; //return NULL if not found.
        }
        int i = chartStart;
        while (sample.charAt(i) != '\t') {
            i++;
        }
        return Integer.parseInt(sample.substring(chartStart, i));
    }

    /** finds the xLSeq field
     * @param sample String to search
     * @return String if found, empty string if not.
     */
    public static Optional<String> findxLSeqField(String sample) {
        int charStart = findPosAfter(sample, "\txLseq:i:");
        if (charStart == -1) {
            return Optional.empty(); //return NULL if not found.
        }
        int i = charStart;
        while (sample.charAt(i) != '\t') {
            i++;
        }
        return Optional.of(sample.substring(charStart, i));
    }

    /** finds the reference name field
     * @param sample String to search
     * @return String if found, empty string if not.
     */
    public static String findReferneceName(String sample) {
        //should always appear between the second and third tabs
        boolean tabsFound = false;
        int i = 0;
        int secondTab = 0;
        int thirdTab = 0;
        boolean skippedFirstTab = false;
        while (!tabsFound) {
            if (sample.charAt(i) != '\t') {
                i++;
            } else {
                if (skippedFirstTab) {
                    if (secondTab == 0) {
                        secondTab = i;
                    } else {
                        thirdTab = i;
                        tabsFound = true;
                    }
                }
                skippedFirstTab = true;
                i++;
            }
        }

        if (sample.substring(secondTab + 1, thirdTab).contains("/")) {
            String[] split = sample.substring(secondTab + 1, thirdTab).split("/");
            return split[split.length-1];
        }

        return sample.substring(secondTab + 1, thirdTab);
    }

    /**
     * Finds the needle in the haystack, and returns the position of the single next digit.
     *
     * @param haystack The string to search
     * @param needle String field to search on.
     * @return position of the end of the needle
     */
    private static int findPosAfter(String haystack, String needle) {
        int hLen = haystack.length();
        int nLen = needle.length();
        int maxSearch = hLen - nLen;

        outer:
        for (int i = 0; i < maxSearch; i++) {
            for (int j = 0; j < nLen; j++) {
                if (haystack.charAt(i + j) != needle.charAt(j)) {
                    continue outer;
                }
            }
            // If it reaches here, match has been found:
            return i + nLen;
        }
        return -1; // Not found
    }
}



My question is: are there any drawbacks to this approach? Or is there an alternative that might be more effective?



Thanks in advance,



Sam



Edit:



Instantiating code:



public interface RecordFactory<T extends Record> {

    T createRecord(String recordString);
}



Implementing it like:



private RecordFactory<SamRecord> samRecordFactory = SamRecord::new;
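As a rough sketch of how such a factory might be applied to a batch of lines: the types below are self-contained stand-ins rather than the real classes. `RecordFactoryDemo`, `LineRecord` and `toRecords` are hypothetical names, and the one-method `Record` interface is an assumption made only so the example compiles on its own.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative stand-ins only; the real Record and SamRecord types are richer.
final class RecordFactoryDemo {

    private RecordFactoryDemo() { }

    /** Assumed minimal shape of the Record interface. */
    interface Record {
        String getRead();
    }

    /** Stand-in for SamRecord: stores only the raw line, parses nothing up front. */
    static final class LineRecord implements Record {
        private final String read;

        LineRecord(String read) {
            this.read = read;
        }

        @Override
        public String getRead() {
            return read;
        }
    }

    interface RecordFactory<T extends Record> {
        T createRecord(String recordString);
    }

    /** Wraps each raw line in a record; the only per-line work is the constructor call. */
    static <T extends Record> List<T> toRecords(List<String> lines, RecordFactory<T> factory) {
        return lines.stream()
                .map(factory::createRecord)
                .collect(Collectors.toList());
    }
}
```

A constructor reference such as `LineRecord::new` satisfies `RecordFactory<LineRecord>` because the interface has a single abstract method taking a String, which is the same mechanism that lets `SamRecord::new` be assigned above.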




















$endgroup$











  • Is this class, or could this class, be used in a multithreaded scenario? – IEatBagels, Apr 23 at 13:47

  • It will be, yes. However, it's unlikely to be shared between threads. It's more likely that collections of these records will be handed off to separate workers. – Sam, Apr 23 at 13:48

  • 0. Do you have an actual, tested performance issue, or are you guessing at a potential problem? 1. What are you trying to minimize, execution time or memory footprint? – Eric Stein, Apr 23 at 14:41

  • Also, are you able/willing to share the code that instantiates SamRecords? – Eric Stein, Apr 23 at 15:09

  • Hi @EricStein. Sure thing, but it's just a short factory method using generics (since SamRecord extends from Record, and I also have a few other types that extend Record too). I know from past experience that this is a performance issue, since these files have hundreds of millions of records. When I split the file and hand the records to workers, I need to do as little as possible with the records themselves. The software using this API most likely won't need every detail of the record at one time. – Sam, 2 days ago


















$begingroup$


/**
* Returns the bases from the xsam read
*
* @param sample read
* @return base string
*/
public static String findBaseSequence(String sample)
return findElement(sample, 9);


/**
* finds the n'th element in the tab delimited sample
* i.e findElement(0) returns one from "onettwo"
* 0 indexed.
*
* @param sample String to search
* @param element element to find
* @return found element or "" if not found
*/
public static String findElement(String sample, int element)
boolean tabsFound = false;
int i = 0;
int firstTab = 0;
int secondTab = 0;
int tabsToSkip = element - 1 >= 0 ? element - 1 : 0;
int skippedTabs = 0;
if (element == 0)
while (sample.charAt(i) != 't')
i++;

return sample.substring(0, i);
else
while (!tabsFound)
if (sample.charAt(i) != 't')
i++;
else
if (skippedTabs == tabsToSkip)
if (firstTab == 0)
firstTab = i;
else
secondTab = i;
tabsFound = true;

else
skippedTabs++;

i++;




return sample.substring(firstTab + 1, secondTab);


/** finds the variable region past the quality
* @param sample sam or Xsam record string
* @return variable sequence or empty string
*/
public static String findVariableRegionSequence(String sample)

int start = findVariableRegionStart(sample);

if(start == -1) return "";

return sample.substring(findVariableRegionStart(sample));


/** finds the xL field
* @param sample String to search
* @return position if found, '' (null) value if not.
*/
public static int findxLField(String sample)
int chartStart = findPosAfter(sample, "txL:i:");
if (chartStart == -1)
return -1; //return -1 if not found.

int i = chartStart;
while (sample.charAt(i) != 't')
i++;

return Integer.parseInt(sample.substring(chartStart, i));



/** finds the xR field
* @param sample String to search
* @return position if found, '' (null) value if not.
*/
public static int findxRField(String sample)
int chartStart = findPosAfter(sample, "txR:i:");
if (chartStart == -1)
return ''; //return NULL if not found.

int i = chartStart;
while (sample.charAt(i) != 't')
i++;

return Integer.parseInt(sample.substring(chartStart, i));



/** finds the xLSeq field
* @param sample String to search
* @return String if found, empty string if not.
*/
public static Optional<String> findxLSeqField(String sample)
int charStart = findPosAfter(sample, "txLseq:i:");
if (charStart == -1)
return Optional.empty(); //return NULL if not found.

int i = charStart;
while (sample.charAt(i) != 't')
i++;

return Optional.of(sample.substring(charStart, i));



/** finds the reference name field
* @param sample String to search
* @return String if found, empty string if not.
*/
public static String findReferneceName(String sample)
//should always appear between the second and third tabs
boolean tabsFound = false;
int i = 0;
int secondTab = 0;
int thirdTab = 0;
boolean skippedFirstTab = false;
while (!tabsFound)
if (sample.charAt(i) != 't')
i++;
else
if (skippedFirstTab)
if (secondTab == 0)
secondTab = i;
else
thirdTab = i;
tabsFound = true;


skippedFirstTab = true;
i++;



if(sample.substring(secondTab + 1, thirdTab).contains("/"))
String[] split = sample.substring(secondTab + 1, thirdTab).split("/");
return split[split.length-1];



return sample.substring(secondTab + 1, thirdTab);


/**
* Finds the needle in the haystack, and returns the position of the single next digit.
*
* @param haystack The string to search
* @param needle String field to search on.
* @return position of the end of the needle
*/
private static int findPosAfter(String haystack, String needle)
int hLen = haystack.length();
int nLen = needle.length();
int maxSearch = hLen - nLen;

outer:
for (int i = 0; i < maxSearch; i++)
for (int j = 0; j < nLen; j++)
if (haystack.charAt(i + j) != needle.charAt(j))
continue outer;


// If it reaches here, match has been found:
return i + nLen;

return -1; // Not found





My question is: are there any drawbacks to this approach? Or is there an alternative that might be more effective?



Thanks in advance,



Sam



Edit:



Instantiating code:



public interface RecordFactory<T extends Record> {

    T createRecord(String recordString);
}


Implementing it like:



private RecordFactory<SamRecord> samRecordFactory = SamRecord::new;




















$endgroup$










java bioinformatics lazy














edited 2 days ago

asked Apr 23 at 12:45 by Sam











  • Is this class, or could this class, be used in a multithreaded scenario? – IEatBagels, Apr 23 at 13:47

  • It will be, yes. However, it's unlikely to be shared between threads. It's more likely that collections of these records will be handed off to separate workers. – Sam, Apr 23 at 13:48

  • 0. Do you have an actual, tested performance issue, or are you guessing at a potential problem? 1. What are you trying to minimize, execution time or memory footprint? – Eric Stein, Apr 23 at 14:41

  • Also, are you able/willing to share the code that instantiates SamRecords? – Eric Stein, Apr 23 at 15:09

  • Hi @EricStein. Sure thing, but it's just a short factory method using generics (since SamRecord extends from Record, and I also have a few other types that extend Record too). I know from past experience that this is a performance issue, since these files have 100's of millions of records. When I split the file and hand the records to workers, I need to do as little as possible with the records themselves. The software using this API most likely won't need every detail of the record at one time. – Sam, 2 days ago




























1 Answer












$begingroup$

Performance



There is one thing that I believe could increase the performance of your application.



You often call findElement, which goes through the SAM record every time.



If you load a record, you can be fairly sure that at least one of its fields will be accessed.



At some point, maybe when creating the class, or when accessing the first property for the first time, you should "index" your SAM record.



Go through the whole record once and keep an array of where the tabs are. This way, if your code ends up calling:



XsamReadQueries.findElement(read, 1)
XsamReadQueries.findElement(read, 2)
XsamReadQueries.findElement(read, 3)


The second and third calls would be much faster than they are now, because they would no longer rescan the record from the start.



To do this, you could add a method to XsamReadQueries named something like IndexTabs, that would return an array of ints.
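A minimal sketch of that idea follows. It is only an illustration, not the definitive implementation: the class name TabIndex, the method name indexTabs (lower-cased to follow Java naming conventions), and the three-argument findElement overload are all assumptions that do not exist in your code.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of the original code): indexes the tabs of one SAM line. */
final class TabIndex {

    private TabIndex() {
    }

    /** Returns the offset of every '\t' in the record, in ascending order. */
    static int[] indexTabs(String read) {
        int[] tabs = new int[16]; // room for the 11 mandatory fields plus a few tags; grows on demand
        int count = 0;
        for (int i = read.indexOf('\t'); i >= 0; i = read.indexOf('\t', i + 1)) {
            if (count == tabs.length) {
                tabs = Arrays.copyOf(tabs, count * 2);
            }
            tabs[count++] = i;
        }
        return Arrays.copyOf(tabs, count); // trim to the number of tabs actually found
    }

    /** Returns the 0-indexed field by slicing between two already-known tab offsets. */
    static String findElement(String read, int[] tabs, int element) {
        if (element < 0 || element > tabs.length) {
            throw new IndexOutOfBoundsException("No field " + element);
        }
        int start = element == 0 ? 0 : tabs[element - 1] + 1;
        int end = element == tabs.length ? read.length() : tabs[element];
        return read.substring(start, end);
    }

    public static void main(String[] args) {
        String read = "id1\t97\ths07\t38253395";
        int[] tabs = indexTabs(read);                   // one scan of the record
        System.out.println(findElement(read, tabs, 2)); // prints "hs07"
    }
}
```

The record would build the array lazily on the first getter call and reuse it afterwards. The trade-off is one extra int[] per indexed record, so it is worth measuring against your memory budget before committing to it.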



If you want more insight as to how to do this, you can write a comment and I'll add more information, but I'm pretty sure this would help you.



Code style



There are one or two things that bother me in your code with regard to clarity and future maintenance.



You have methods such as findPhred, which call findElement, but in your SamRecord you sometimes call findElement and sometimes a specific find*, which is basically the same code. You should decide on one way to do things: either have a specific method for each field in XsamReadQueries, or keep only the findElement method.



Finally, you could consider using an enum for the element parameter of the findElement method.
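As a rough sketch of the enum idea: the names SamField and SamFields below are hypothetical, and the constants use the column names of the 11 mandatory SAM fields in file order, which matches the indices your getters already pass (FLAG is 1, POS is 3, QUAL is 10, and so on). The lookup itself is just a simplified stand-in for your findElement, written here so the example is self-contained.

```java
/** Hypothetical enum (not in the original code) naming the 11 mandatory SAM columns in file order. */
enum SamField {
    QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL;

    /** 0-indexed column of this field; relies on the declaration order above. */
    int column() {
        return ordinal();
    }
}

/** Hypothetical stand-in for XsamReadQueries, reduced to a single enum-based lookup. */
final class SamFields {

    private SamFields() {
    }

    /** Returns the requested field, or "" if the record has too few columns. */
    static String findElement(String read, SamField field) {
        int start = 0;
        // Skip one tab per preceding column.
        for (int skipped = 0; skipped < field.column(); skipped++) {
            int tab = read.indexOf('\t', start);
            if (tab < 0) {
                return "";
            }
            start = tab + 1;
        }
        int end = read.indexOf('\t', start);
        return end < 0 ? read.substring(start) : read.substring(start, end);
    }
}
```

A call then reads SamFields.findElement(read, SamField.MAPQ) instead of findElement(read, 4), so the magic numbers disappear from SamRecord and an out-of-range column can no longer be requested by accident.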

















$endgroup$








  • Hi @IEatBagels, thanks for the answer. Indexing is a really good idea, I'll definitely look to implement that. I agree with the find* notation - It's partially as I'm halfway through coding the API and wanted some feedback before committing to one way or the other. The enum's a good idea too, it'll definitely make it more readable! – Sam, Apr 23 at 16:00











Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f217955%2flazy-loading-a-bioinformatic-sam-record%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









5












$begingroup$

Performance



There is one thing that I believe could increase the performance of your application.



You often call findElement, which goes through the SAM record every time.



By loading a record, you are pretty certain that you will at least access it once.



At some point, maybe when creating the class, or when accessing the first property for the first time, you should "index" your SAM record.



Go through the whole file once and keep an array of where the tabs are. This way, if your code ends up calling :



XsamReadQueries.findElement(read, 1)
XsamReadQueries.findElement(read, 2)
XsamReadQueries.findElement(read, 3)


The calls to the second and third method would be much faster than they are now.



To do this, you could add a method to XsamReadQueries names something like IndexTabs, that would return an array of ints.



If you want more insight as to how to do this, you can write a comment and I'll add more information, but I'm pretty sure this would help you.



Code style



There are one of two things that are bothering me in your code with regards to clarity and future maintenance.



You have methods named findPhred, which call findElement , but in your SamRecord sometimes you call findElement and something a specific find*, which is basically the same code. You should decide on one way to do things, either have specific methods for each field in the XsamReadQueries or keep only the findElement method.



Finally, you could consider using an enum for the element parameter of the findElement method.






share|improve this answer











$endgroup$








  • 1




    $begingroup$
    Hi @IEatBagels, thanks for the answer. Indexing is a really good idea, I'll definitely look to implement that. I agree with the find* notation - It's partially as I'm halfway through coding the API and wanted some feedback before committing to one way or the other. The enum's a good idea too, it'll definitely make it more readable!
    $endgroup$
    – Sam
    Apr 23 at 16:00















5












$begingroup$

Performance



There is one thing that I believe could increase the performance of your application.



You often call findElement, which goes through the SAM record every time.



By loading a record, you are pretty certain that you will at least access it once.



At some point, maybe when creating the class, or when accessing the first property for the first time, you should "index" your SAM record.



Go through the whole file once and keep an array of where the tabs are. This way, if your code ends up calling :



XsamReadQueries.findElement(read, 1)
XsamReadQueries.findElement(read, 2)
XsamReadQueries.findElement(read, 3)


The calls to the second and third method would be much faster than they are now.



To do this, you could add a method to XsamReadQueries names something like IndexTabs, that would return an array of ints.



If you want more insight as to how to do this, you can write a comment and I'll add more information, but I'm pretty sure this would help you.



Code style



There are one of two things that are bothering me in your code with regards to clarity and future maintenance.



You have methods named findPhred, which call findElement , but in your SamRecord sometimes you call findElement and something a specific find*, which is basically the same code. You should decide on one way to do things, either have specific methods for each field in the XsamReadQueries or keep only the findElement method.



Finally, you could consider using an enum for the element parameter of the findElement method.






share|improve this answer











$endgroup$








  • 1




    $begingroup$
    Hi @IEatBagels, thanks for the answer. Indexing is a really good idea, I'll definitely look to implement that. I agree with the find* notation - It's partially as I'm halfway through coding the API and wanted some feedback before committing to one way or the other. The enum's a good idea too, it'll definitely make it more readable!
    $endgroup$
    – Sam
    Apr 23 at 16:00













5












5








5





$begingroup$

Performance



There is one thing that I believe could increase the performance of your application.



You often call findElement, which goes through the SAM record every time.



By loading a record, you are pretty certain that you will at least access it once.



At some point, maybe when creating the class, or when accessing the first property for the first time, you should "index" your SAM record.



Go through the whole file once and keep an array of where the tabs are. This way, if your code ends up calling :



XsamReadQueries.findElement(read, 1)
XsamReadQueries.findElement(read, 2)
XsamReadQueries.findElement(read, 3)


The calls to the second and third method would be much faster than they are now.



To do this, you could add a method to XsamReadQueries names something like IndexTabs, that would return an array of ints.



If you want more insight as to how to do this, you can write a comment and I'll add more information, but I'm pretty sure this would help you.



Code style



There are one or two things that bother me in your code with regard to clarity and future maintenance.

You have methods named findPhred, which call findElement, but in your SamRecord you sometimes call findElement and sometimes a specific find*, which is basically the same code. You should decide on one way to do things: either have specific methods for each field in XsamReadQueries, or keep only the findElement method.



Finally, you could consider using an enum for the element parameter of the findElement method.
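
A minimal sketch of the enum idea follows. The constants are the eleven mandatory SAM columns in their specification order; the names SamField and SamFieldQueries, the zero-based column numbering, and this findElement overload are assumptions for illustration, not the original API. The split call is only a simple stand-in for whatever lookup findElement really performs.

```java
// Hypothetical sketch: SamField and SamFieldQueries are illustrative names.
enum SamField {
    QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL;

    // Zero-based column of this field within a SAM record (an assumption).
    int column() {
        return ordinal();
    }
}

final class SamFieldQueries {
    // Stand-in lookup: splits the whole record, then picks the column.
    static String findElement(String read, SamField field) {
        String[] columns = read.split("\t", -1);
        return columns[field.column()];
    }
}
```

A call such as findElement(read, SamField.MAPQ) then says what is being fetched, where findElement(read, 4) does not.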

















$endgroup$











edited Apr 23 at 14:19

























answered Apr 23 at 14:13









IEatBagels







    Hi @IEatBagels, thanks for the answer. Indexing is a really good idea; I'll definitely look to implement that. I agree about the find* naming. It's partly because I'm halfway through coding the API and wanted some feedback before committing to one way or the other. The enum's a good idea too; it'll definitely make it more readable!
    – Sam
    Apr 23 at 16:00




























