Lazy Loading a Bioinformatic SAM record


I'm currently writing an API to work with Bioinformatic SAM records. Here's an example of one:

SBL_XSBF463_ID:3230017:BCR1:GCATAA:BCR2:CATATA/1:vpe 97 hs07 38253395 3 30M = 38330420 77055 TTGTTCCACTGCCAAAGAGTTTCTTATAAT EEEEEEEEEEEEAEEEEEEEEEEEEEEEEE PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:30 ZS:Z:R NH:i:2 HI:i:1 IH:i:1

Each tab-separated piece of information is its own field and corresponds to some type of data.

Now, it's important to note that these files get BIG (tens of GB), so splitting each record into its fields as soon as it's instantiated in some kind of POJO would be inefficient.

Hence, I've decided to create an object with a lazy loading mechanism. Only the original string is stored until one of the fields is requested by some calling code. This should minimise the amount of work done when the object is created, as well as minimise the amount of memory taken by the objects.

Here's my attempt:

/** Class for storing and working with sam formatted DNA sequence.
 *
 * Upon construction, only the String record is stored.
 * All querying of fields is done on demand, to save time.
 *
 */
public class SamRecord implements Record {

    private final String read;
    private String id = null;
    private int flag = -1;
    private String referenceName = null;
    private int pos = -1;
    private int mappingQuality = -1;
    private String cigar = null;
    private String mateReferenceName = null;
    private int matePosition = -1;
    private int templateLength = -1;
    private String sequence = null;
    private String quality = null;
    private String variableTerms = null;

    private final static String REPEAT_TERM = "ZS:Z:R";
    private final static String MATCH_TERM = "ZS:Z:NM";
    private final static String QUALITY_CHECK_TERM = "ZS:Z:QC";

    /** Simple constructor for the sam record
     * @param read full read
     */
    public SamRecord(String read) {
        this.read = read;
    }

    public String getRead() {
        return read;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getId() {
        if(id == null){
            id = XsamReadQueries.findID(read);
        }

        return id;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public int getFlag() throws NumberFormatException {
        if(flag == -1) {
            flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
        }
        return flag;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getReferenceName() {
        if(referenceName == null){
            referenceName = XsamReadQueries.findReferneceName(read);
        }

        return referenceName;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public int getPos() throws NumberFormatException {
        if(pos == -1){
            pos = Integer.parseInt(XsamReadQueries.findElement(read, 3));
        }

        return pos;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public int getMappingQuality() throws NumberFormatException {
        if(mappingQuality == -1){
            mappingQuality = Integer.parseInt(XsamReadQueries.findElement(read, 4));
        }

        return mappingQuality;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getCigar() {
        if(cigar == null){
            cigar = XsamReadQueries.findCigar(read);
        }

        return cigar;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getMateReferenceName() {
        if(mateReferenceName == null){
            mateReferenceName = XsamReadQueries.findElement(read, 6);
        }

        return mateReferenceName;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public int getMatePosition() throws NumberFormatException {
        if(matePosition == -1){
            matePosition = Integer.parseInt(XsamReadQueries.findElement(read, 7));
        }

        return matePosition;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public int getTemplateLength() throws NumberFormatException {
        if(templateLength == -1){
            templateLength = Integer.parseInt(XsamReadQueries.findElement(read, 8));
        }

        return templateLength;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getSequence() {
        if(sequence == null){
            sequence = XsamReadQueries.findBaseSequence(read);
        }

        return sequence;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getQuality() {
        if(quality == null){
            quality = XsamReadQueries.findElement(read, 10);
        }

        return quality;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public boolean isRepeat() {
        return read.contains(REPEAT_TERM);
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public boolean isMapped() {
        return !read.contains(MATCH_TERM);
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getVariableTerms() {
        if(variableTerms == null){
            variableTerms = XsamReadQueries.findVariableRegionSequence(read);
        }

        return variableTerms;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public boolean isQualityFailed() {
        return read.contains(QUALITY_CHECK_TERM);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        SamRecord samRecord = (SamRecord) o;
        return Objects.equals(read, samRecord.read);
    }

    @Override
    public int hashCode() {
        return Objects.hash(read);
    }

    @Override
    public String toString() {
        return read;
    }
}

The fields are returned by static methods in a helper class, which retrieve them by looking at where the tab characters are, e.g. flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
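To make the tab-position idea concrete, here is a minimal sketch of that kind of lookup written with String.indexOf. It is an illustration only and not the helper class itself: the names TabFieldSketch and nthField are hypothetical, and returning an empty string when the line has too few fields is an assumption about the desired behaviour.

```java
final class TabFieldSketch {

    private TabFieldSketch() { }

    /**
     * Returns the 0-indexed tab-delimited field of the line,
     * or "" if the line has fewer than n + 1 fields (assumed behaviour).
     */
    static String nthField(String line, int n) {
        if (n < 0) {
            return "";
        }
        int start = 0;
        // Skip n tabs; start ends up just past the n-th tab.
        for (int skipped = 0; skipped < n; skipped++) {
            int tab = line.indexOf('\t', start);
            if (tab == -1) {
                return ""; // ran out of fields
            }
            start = tab + 1;
        }
        // The field ends at the next tab, or at the end of the line.
        int end = line.indexOf('\t', start);
        return end == -1 ? line.substring(start) : line.substring(start, end);
    }

    public static void main(String[] args) {
        String line = "read1\t97\ths07\t38253395";
        System.out.println(nthField(line, 1));
        System.out.println(nthField(line, 3));
    }
}
```

Only the characters up to the requested field are scanned, so asking for an early column such as the flag touches a small prefix of the record.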

Below is the XsamReadQueries class:

/**
 * Non-instantiable utility class for working with Xsam reads
 */
public final class XsamReadQueries {

    // Suppress instantiation
    private XsamReadQueries() {
        throw new AssertionError();
    }

    /** finds the position of the tab directly before the start of the variable region
     * @param read whole sam or Xsam read to search
     * @return position of the tab in the String
     */
    public static int findVariableRegionStart(String read){

        int found = 0;

        for(int i = 0; i < read.length(); i++){
            if(read.charAt(i) == '\t'){
                found++;
                if(found >= 11 && i+1 < read.length() && (read.charAt(i+1) != 'x' && read.charAt(i+1) != '\t')){ //guard against double-tabs
                    return i + 1;
                }
            }
        }

        return -1;

    }

    /** Attempts to find the library name from SBL reads
     * where SBL reads have the id SBL_LibraryName_ID:XXXXX
     * if LibraryName ends with a lower case letter, the letter will be removed.
     * if SBL_LibID is not valid, return the full ID.
     * @param ID or String to search.
     * @return Library name with lower case endings removed
     */
    public static String findLibraryName(String ID){

        if(!ID.startsWith("SBL")) return "";

        try {
            int firstPos = XsamReadQueries.findPosAfter(ID, "_");
            int i = firstPos;
            while (ID.charAt(i) != '_' && ID.charAt(i) != '\t') {
                i++;
            }

            String library = ID.substring(firstPos, i);

            char lastChar = library.charAt(library.length()-1);
            if(lastChar >= 97 && lastChar <= 122){
                library = library.substring(0, library.length()-1);
            }

            return library;

        }catch (Exception e){
            int i = 0;
            while(ID.charAt(i) != '\t'){
                i++;
                if(i == ID.length()){
                    break;
                }
            }
            return ID.substring(0, i);
        }
    }

    /** Returns the ID from the sample
     * @param sample Xsam read
     * @return ID
     */
    public static String findID(String sample){
        return findElement(sample, 0);
    }

    /** Returns the phred score from the sample
     * @param sample Xsam read
     * @return phred string
     */
    public static String findPhred(String sample){
        return findElement(sample, 10);
    }

    /**
     * Returns the cigar from the xsam read
     *
     * @param sample read
     * @return cigar string
     */
    public static String findCigar(String sample) {
        return findElement(sample, 5);
    }

    /**
     * Returns the bases from the xsam read
     *
     * @param sample read
     * @return base string
     */
    public static String findBaseSequence(String sample) {
        return findElement(sample, 9);
    }

    /**
     * finds the n'th element in the tab delimited sample
     * i.e findElement(0) returns one from "one\ttwo"
     * 0 indexed.
     *
     * @param sample String to search
     * @param element element to find
     * @return found element or "" if not found
     */
    public static String findElement(String sample, int element) {
        boolean tabsFound = false;
        int i = 0;
        int firstTab = 0;
        int secondTab = 0;
        int tabsToSkip = element - 1 >= 0 ? element - 1 : 0;
        int skippedTabs = 0;
        if (element == 0) {
            while (sample.charAt(i) != '\t') {
                i++;
            }
            return sample.substring(0, i);
        } else {
            while (!tabsFound) {
                if (sample.charAt(i) != '\t') {
                    i++;
                } else {
                    if (skippedTabs == tabsToSkip) {
                        if (firstTab == 0) {
                            firstTab = i;
                        } else {
                            secondTab = i;
                            tabsFound = true;
                        }
                    } else {
                        skippedTabs++;
                    }
                    i++;
                }
            }
        }

        return sample.substring(firstTab + 1, secondTab);
    }

    /** finds the variable region past the quality
     * @param sample sam or Xsam record string
     * @return variable sequence or empty string
     */
    public static String findVariableRegionSequence(String sample){

        int start = findVariableRegionStart(sample);

        if(start == -1) return "";

        return sample.substring(findVariableRegionStart(sample));
    }

    /** finds the xL field
     * @param sample String to search
     * @return position if found, '\0' (null) value if not.
     */
    public static int findxLField(String sample) {
        int chartStart = findPosAfter(sample, "\txL:i:");
        if (chartStart == -1) {
            return -1; //return -1 if not found.
        }
        int i = chartStart;
        while (sample.charAt(i) != '\t') {
            i++;
        }
        return Integer.parseInt(sample.substring(chartStart, i));

    }

    /** finds the xR field
     * @param sample String to search
     * @return position if found, '\0' (null) value if not.
     */
    public static int findxRField(String sample) {
        int chartStart = findPosAfter(sample, "\txR:i:");
        if (chartStart == -1) {
            return '\0'; //return NULL if not found.
        }
        int i = chartStart;
        while (sample.charAt(i) != '\t') {
            i++;
        }
        return Integer.parseInt(sample.substring(chartStart, i));

    }

    /** finds the xLSeq field
     * @param sample String to search
     * @return String if found, empty string if not.
     */
    public static Optional<String> findxLSeqField(String sample) {
        int charStart = findPosAfter(sample, "\txLseq:i:");
        if (charStart == -1) {
            return Optional.empty(); //return NULL if not found.
        }
        int i = charStart;
        while (sample.charAt(i) != '\t') {
            i++;
        }
        return Optional.of(sample.substring(charStart, i));

    }

    /** finds the reference name field
     * @param sample String to search
     * @return String if found, empty string if not.
     */
    public static String findReferneceName(String sample) {
        //should always appear between the second and third tabs
        boolean tabsFound = false;
        int i = 0;
        int secondTab = 0;
        int thirdTab = 0;
        boolean skippedFirstTab = false;
        while (!tabsFound) {
            if (sample.charAt(i) != '\t') {
                i++;
            } else {
                if (skippedFirstTab) {
                    if (secondTab == 0) {
                        secondTab = i;
                    } else {
                        thirdTab = i;
                        tabsFound = true;
                    }
                }
                skippedFirstTab = true;
                i++;
            }
        }

        if(sample.substring(secondTab + 1, thirdTab).contains("/")){
            String[] split = sample.substring(secondTab + 1, thirdTab).split("/");
            return split[split.length-1];
        }

        return sample.substring(secondTab + 1, thirdTab);
    }

    /**
     * Finds the needle in the haystack, and returns the position of the single next digit.
     *
     * @param haystack The string to search
     * @param needle String field to search on.
     * @return position of the end of the needle
     */
    private static int findPosAfter(String haystack, String needle) {
        int hLen = haystack.length();
        int nLen = needle.length();
        int maxSearch = hLen - nLen;

        outer:
        for (int i = 0; i < maxSearch; i++) {
            for (int j = 0; j < nLen; j++) {
                if (haystack.charAt(i + j) != needle.charAt(j)) {
                    continue outer;
                }
            }
            // If it reaches here, match has been found:
            return i + nLen;
        }
        return -1; // Not found
    }
}

My question is: are there any drawbacks to this approach? Or is there an alternative that might be more effective?

Thanks in advance,

Sam

Edit:

Instantiating code:

public interface RecordFactory<T extends Record> {

    T createRecord(String recordString);
}

Implementing it like:

private RecordFactory<SamRecord> samRecordFactory = SamRecord::new;
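For illustration, a minimal sketch of how a factory like this might be applied to a batch of lines. Everything here apart from the shape of RecordFactory is hypothetical: the one-method Record interface, the cut-down LazyRecord stand-in for SamRecord, and the wrap helper are assumptions made so the example is self-contained.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class FactorySketch {

    private FactorySketch() { }

    /** Hypothetical stand-in for the Record interface, cut down to one method. */
    interface Record {
        String getId();
    }

    interface RecordFactory<T extends Record> {
        T createRecord(String recordString);
    }

    /** Cut-down stand-in for SamRecord: keeps the line, parses the ID on first request. */
    static final class LazyRecord implements Record {
        private final String read;
        private String id; // null until getId() is first called

        LazyRecord(String read) {
            this.read = read;
        }

        @Override
        public String getId() {
            if (id == null) {
                int tab = read.indexOf('\t');
                id = tab == -1 ? read : read.substring(0, tab);
            }
            return id;
        }
    }

    /** Wraps each line in a record; no field is parsed at this point. */
    static <T extends Record> List<T> wrap(List<String> lines, RecordFactory<T> factory) {
        return lines.stream().map(factory::createRecord).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        RecordFactory<LazyRecord> factory = LazyRecord::new;
        List<LazyRecord> records = wrap(Arrays.asList("r1\t97\ths07", "r2\t145\ths07"), factory);
        System.out.println(records.get(1).getId());
    }
}
```

Creating a record this way costs one object allocation and one reference assignment per line; the parsing work is deferred to whichever worker first asks for a field.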

edited 2 days ago

asked Apr 23 at 12:45

Sam


Is this class, or could this class, be used in a multithreaded scenario?
– IEatBagels
Apr 23 at 13:47

It will be, yes. However, it's unlikely to be shared between threads. It's more likely that collections of these records will be handed off to separate workers.
– Sam
Apr 23 at 13:48

0. Do you have an actual, tested performance issue, or are you guessing at a potential problem? 1. What are you trying to minimize, execution time or memory footprint?
– Eric Stein
Apr 23 at 14:41

Also, are you able/willing to share the code that instantiates SamRecords?
– Eric Stein
Apr 23 at 15:09

Hi @EricStein. Sure thing, but it's just a short factory method using generics (since SamRecord extends from Record, and I also have a few other types that extend Record too). I know from past experience that this is a performance issue, since these files have hundreds of millions of records. When I split the file and hand the records to workers, I need to do as little as possible with the records themselves. The software using this API most likely won't need every detail of the record at one time.
– Sam
2 days ago

I'm currently writing an API to work with Bioinformatic SAM records. Here's an example of one:

Each piece of information separated by a tab is it's own field and corresponds to some type of data.

Now, it's important to note that these files get BIG (10's of GB) and so splitting each one as soon as it's instantiated in some kind of POJO would be inefficient.

Here's my attempt:

/** Class for storing and working with sam formatted DNA sequence.
 *
 * Upon construction, only the String record is stored.
 * All querying of fields is done on demand, to save time.
 *
 */
public class SamRecord implements Record 

 private final String read;
 private String id = null;
 private int flag = -1;
 private String referenceName = null;
 private int pos = -1;
 private int mappingQuality = -1;
 private String cigar = null;
 private String mateReferenceName = null;
 private int matePosition = -1;
 private int templateLength = -1;
 private String sequence = null;
 private String quality = null;
 private String variableTerms = null;

 private final static String REPEAT_TERM = "ZS:Z:R";
 private final static String MATCH_TERM = "ZS:Z:NM";
 private final static String QUALITY_CHECK_TERM = "ZS:Z:QC";

 /** Simple constructor for the sam record
 * @param read full read
 */
 public SamRecord(String read) 
 this.read = read;
 

 public String getRead() 
 return read;
 

 /**
 * @inheritDoc
 */
 @Override
 public String getId() 
 if(id == null)
 id = XsamReadQueries.findID(read);
 

 return id;
 

 /**
 * @inheritDoc
 */
 @Override
 public int getFlag() throws NumberFormatException 
 if(flag == -1) 
 flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
 
 return flag;
 

 /**
 * @inheritDoc
 */
 @Override
 public String getReferenceName() 
 if(referenceName == null)
 referenceName = XsamReadQueries.findReferneceName(read);
 

 return referenceName;
 

 /**
 * @inheritDoc
 */
 @Override
 public int getPos() throws NumberFormatException
 if(pos == -1)
 pos = Integer.parseInt(XsamReadQueries.findElement(read, 3));
 

 return pos;
 

 /**
 * @inheritDoc
 */
 @Override
 public int getMappingQuality() throws NumberFormatException 
 if(mappingQuality == -1)
 mappingQuality = Integer.parseInt(XsamReadQueries.findElement(read, 4));
 

 return mappingQuality;
 

 /**
 * @inheritDoc
 */
 @Override
 public String getCigar() 
 if(cigar == null)
 cigar = XsamReadQueries.findCigar(read);
 

 return cigar;
 

 /**
 * @inheritDoc
 */
 @Override
 public String getMateReferenceName() 
 if(mateReferenceName == null)
 mateReferenceName = XsamReadQueries.findElement(read, 6);
 

 return mateReferenceName;
 

 /**
 * @inheritDoc
 */
 @Override
 public int getMatePosition() throws NumberFormatException 
 if(matePosition == -1)
 matePosition = Integer.parseInt(XsamReadQueries.findElement(read, 7));
 

 return matePosition;
 

 /**
 * @inheritDoc
 */
 @Override
 public int getTemplateLength() throws NumberFormatException 
 if(templateLength == -1)
 templateLength = Integer.parseInt(XsamReadQueries.findElement(read, 8));
 

 return templateLength;
 

 /**
 * @inheritDoc
 */
 @Override
 public String getSequence() 
 if(sequence == null)
 sequence = XsamReadQueries.findBaseSequence(read);
 

 return sequence;
 

 /**
 * @inheritDoc
 */
 @Override
 public String getQuality() 
 if(quality == null)
 quality = XsamReadQueries.findElement(read, 10);
 

 return quality;
 

 /**
 * @inheritDoc
 */
 @Override
 public boolean isRepeat() 
 return read.contains(REPEAT_TERM);
 

 /**
 * @inheritDoc
 */
 @Override
 public boolean isMapped() 
 return !read.contains(MATCH_TERM);
 


 /**
 * @inheritDoc
 */
 @Override
 public String getVariableTerms() 
 if(variableTerms == null)
 variableTerms = XsamReadQueries.findVariableRegionSequence(read);
 

 return variableTerms;
 

 /**
 * @inheritDoc
 */
 @Override
 public boolean isQualityFailed() 
 return read.contains(QUALITY_CHECK_TERM);
 


 @Override
 public boolean equals(Object o) getClass() != o.getClass()) return false;
 SamRecord samRecord = (SamRecord) o;
 return Objects.equals(read, samRecord.read);
 

 @Override
 public int hashCode() 
 return Objects.hash(read);
 

 @Override
 public String toString() 
 return read;

The fields are returned by static methods in a helper class which retrieve them by looking at where the tab characters are. i.e. flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));

Below is the XsamReadQuery class

/**
 * Non-instantiable utility class for working with Xsam reads
 */
public final class XsamReadQueries 

 // Suppress instantiation
 private XsamReadQueries() 
 throw new AssertionError();
 

 /** finds the position of the tab directly before the start of the variable region
 * @param read whole sam or Xsam read to search
 * @return position of the tab in the String
 */
 public static int findVariableRegionStart(String read)

 int found = 0;

 for(int i = 0; i < read.length(); i++)
 if(read.charAt(i) == 't')
 found++;
 if(found >= 11 && i+1 < read.length() && (read.charAt(i+1) != 'x' && read.charAt(i+1) != 't')) //guard against double-tabs
 return i + 1;
 
 
 

 return -1;

 

 /** Attempts to find the library name from SBL reads
 * where SBL reads have the id SBL_LibraryName_ID:XXXXX
 * if LibraryName end's with a lower case letter, the letter will be removed.
 * if SBL_LibID is not valid, return the full ID.
 * @param ID or String to search.
 * @return Library name with lower case endings removed
 */
 public static String findLibraryName(String ID)

 if(!ID.startsWith("SBL")) return "";

 try 
 int firstPos = XsamReadQueries.findPosAfter(ID, "_");
 int i = firstPos;
 while (ID.charAt(i) != '_' && ID.charAt(i) != 't') 
 i++;
 

 String library = ID.substring(firstPos, i);

 char lastChar = library.charAt(library.length()-1);
 if(lastChar >= 97 && lastChar <= 122)
 library = library.substring(0, library.length()-1);
 

 return library;

 catch (Exception e)
 int i = 0;
 while(ID.charAt(i) != 't')
 i++;
 if(i == ID.length())
 break;
 
 
 return ID.substring(0, i);
 
 

 /** Returns the ID from the sample
 * @param sample Xsam read
 * @return ID
 */
 public static String findID(String sample)
 return findElement(sample, 0);
 
 /** Returns the phred score from the sample
 * @param sample Xsam read
 * @return phred string
 */
 public static String findPhred(String sample)
 return findElement(sample, 10);
 

 /**
 * Returns the cigar from the xsam read
 *
 * @param sample read
 * @return cigar string
 */
 public static String findCigar(String sample) 
 return findElement(sample, 5);
 

 /**
 * Returns the bases from the xsam read
 *
 * @param sample read
 * @return base string
 */
 public static String findBaseSequence(String sample) 
 return findElement(sample, 9);
 

 /**
 * finds the n'th element in the tab delimited sample
 * i.e findElement(0) returns one from "onettwo"
 * 0 indexed.
 *
 * @param sample String to search
 * @param element element to find
 * @return found element or "" if not found
 */
 public static String findElement(String sample, int element) 
 boolean tabsFound = false;
 int i = 0;
 int firstTab = 0;
 int secondTab = 0;
 int tabsToSkip = element - 1 >= 0 ? element - 1 : 0;
 int skippedTabs = 0;
 if (element == 0) 
 while (sample.charAt(i) != 't') 
 i++;
 
 return sample.substring(0, i);
 else 
 while (!tabsFound) 
 if (sample.charAt(i) != 't') 
 i++;
 else 
 if (skippedTabs == tabsToSkip) 
 if (firstTab == 0) 
 firstTab = i;
 else 
 secondTab = i;
 tabsFound = true;
 
 else 
 skippedTabs++;
 
 i++;
 
 
 

 return sample.substring(firstTab + 1, secondTab);
 

 /** finds the variable region past the quality
 * @param sample sam or Xsam record string
 * @return variable sequence or empty string
 */
 public static String findVariableRegionSequence(String sample)

 int start = findVariableRegionStart(sample);

 if(start == -1) return "";

 return sample.substring(findVariableRegionStart(sample));
 

 /** finds the xL field
 * @param sample String to search
 * @return position if found, '' (null) value if not.
 */
 public static int findxLField(String sample) 
 int chartStart = findPosAfter(sample, "txL:i:");
 if (chartStart == -1) 
 return -1; //return -1 if not found.
 
 int i = chartStart;
 while (sample.charAt(i) != 't') 
 i++;
 
 return Integer.parseInt(sample.substring(chartStart, i));

 

 /** finds the xR field
 * @param sample String to search
 * @return position if found, '' (null) value if not.
 */
 public static int findxRField(String sample) 
 int chartStart = findPosAfter(sample, "txR:i:");
 if (chartStart == -1) 
 return ''; //return NULL if not found.
 
 int i = chartStart;
 while (sample.charAt(i) != 't') 
 i++;
 
 return Integer.parseInt(sample.substring(chartStart, i));

 

 /** finds the xLSeq field
 * @param sample String to search
 * @return String if found, empty string if not.
 */
 public static Optional<String> findxLSeqField(String sample) 
 int charStart = findPosAfter(sample, "txLseq:i:");
 if (charStart == -1) 
 return Optional.empty(); //return NULL if not found.
 
 int i = charStart;
 while (sample.charAt(i) != 't') 
 i++;
 
 return Optional.of(sample.substring(charStart, i));

 

 /** finds the reference name field
 * @param sample String to search
 * @return String if found, empty string if not.
 */
 public static String findReferneceName(String sample) 
 //should always appear between the second and third tabs
 boolean tabsFound = false;
 int i = 0;
 int secondTab = 0;
 int thirdTab = 0;
 boolean skippedFirstTab = false;
 while (!tabsFound) 
 if (sample.charAt(i) != 't') 
 i++;
 else 
 if (skippedFirstTab) 
 if (secondTab == 0) 
 secondTab = i;
 else 
 thirdTab = i;
 tabsFound = true;
 
 
 skippedFirstTab = true;
 i++;
 
 

 if(sample.substring(secondTab + 1, thirdTab).contains("/"))
 String[] split = sample.substring(secondTab + 1, thirdTab).split("/");
 return split[split.length-1];
 


 return sample.substring(secondTab + 1, thirdTab);
 

 /**
 * Finds the needle in the haystack, and returns the position of the single next digit.
 *
 * @param haystack The string to search
 * @param needle String field to search on.
 * @return position of the end of the needle
 */
 private static int findPosAfter(String haystack, String needle) 
 int hLen = haystack.length();
 int nLen = needle.length();
 int maxSearch = hLen - nLen;

 outer:
 for (int i = 0; i < maxSearch; i++) 
 for (int j = 0; j < nLen; j++) 
 if (haystack.charAt(i + j) != needle.charAt(j)) 
 continue outer;
 
 
 // If it reaches here, match has been found:
 return i + nLen;
 
 return -1; // Not found

My question is, are there are any drawbacks to this approach? Or any alternative way that might be more effective?

Thanks in advance,

Sam

Edit:

Instantiating code:

public interface RecordFactory<T extends Record> 

 T createRecord(String recordString);

Implementing it like:

private RecordFactory<SamRecord> samRecordFactory = SamRecord::new

edited 2 days ago

asked Apr 23 at 12:45

Sam

21217

$begingroup$
Is this class, or could this class, be used in a multithreaded scenario?
$endgroup$
– IEatBagels
Apr 23 at 13:47

$begingroup$
It will be, yes. However, it's unlikely to be shared between threads. It's more likely that collections of these records will be handed off to separate workers.
$endgroup$
– Sam
Apr 23 at 13:48

$begingroup$
0. Do you have an actual, tested performance issue, or are you guessing at a potential problem? 1. What are you trying to minimize, execution time or memory footprint?
$endgroup$
– Eric Stein
Apr 23 at 14:41

$begingroup$
Also, are you able/willing to share the code that instantiates SamRecords?
$endgroup$
– Eric Stein
Apr 23 at 15:09

$begingroup$
Hi @EricStein. Sure thing, but it's just a short factory method using generics (since SamRecord extends from Record - and I also have a few other types that extend Record too.) I know from past experience that this is a performance issue, since these files have 100's of millions of records. When I split the file, and hand the records to workers, I need to do as little as possible with the record themselves. The software using this API most likely wont need every detail of the record at one time
$endgroup$
– Sam
2 days ago

add a comment |

I'm currently writing an API to work with Bioinformatic SAM records. Here's an example of one:

Each piece of information separated by a tab is it's own field and corresponds to some type of data.

Now, it's important to note that these files get BIG (10's of GB) and so splitting each one as soon as it's instantiated in some kind of POJO would be inefficient.

Here's my attempt:

/** Class for storing and working with sam formatted DNA sequence.
 *
 * Upon construction, only the String record is stored.
 * All querying of fields is done on demand, to save time.
 *
 */
public class SamRecord implements Record 

 private final String read;
 private String id = null;
 private int flag = -1;
 private String referenceName = null;
 private int pos = -1;
 private int mappingQuality = -1;
 private String cigar = null;
 private String mateReferenceName = null;
 private int matePosition = -1;
 private int templateLength = -1;
 private String sequence = null;
 private String quality = null;
 private String variableTerms = null;

 private final static String REPEAT_TERM = "ZS:Z:R";
 private final static String MATCH_TERM = "ZS:Z:NM";
 private final static String QUALITY_CHECK_TERM = "ZS:Z:QC";

 /** Simple constructor for the sam record
 * @param read full read
 */
 public SamRecord(String read) 
 this.read = read;
 

 public String getRead() 
 return read;
 

 /**
 * @inheritDoc
 */
 @Override
 public String getId() 
 if(id == null)
 id = XsamReadQueries.findID(read);
 

 return id;
 

 /**
 * @inheritDoc
 */
 @Override
 public int getFlag() throws NumberFormatException 
 if(flag == -1) 
 flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
 
 return flag;
 

 /**
 * @inheritDoc
 */
 @Override
 public String getReferenceName() 
 if(referenceName == null)
 referenceName = XsamReadQueries.findReferneceName(read);
 

 return referenceName;
 

 /**
 * @inheritDoc
 */
 @Override
 public int getPos() throws NumberFormatException
 if(pos == -1)
 pos = Integer.parseInt(XsamReadQueries.findElement(read, 3));
 

 return pos;
 

 /**
 * @inheritDoc
 */
 @Override
 public int getMappingQuality() throws NumberFormatException 
 if(mappingQuality == -1)
 mappingQuality = Integer.parseInt(XsamReadQueries.findElement(read, 4));
 

 return mappingQuality;
 

 /**
 * @inheritDoc
 */
 @Override
 public String getCigar() 
 if(cigar == null)
 cigar = XsamReadQueries.findCigar(read);
 

 return cigar;
 

 /**
 * @inheritDoc
 */
 @Override
 public String getMateReferenceName() 
 if(mateReferenceName == null)
 mateReferenceName = XsamReadQueries.findElement(read, 6);
 

 return mateReferenceName;
 

 /**
 * @inheritDoc
 */
 @Override
 public int getMatePosition() throws NumberFormatException 
 if(matePosition == -1)
 matePosition = Integer.parseInt(XsamReadQueries.findElement(read, 7));
 

 return matePosition;
 

 /**
 * @inheritDoc
 */
 @Override
 public int getTemplateLength() throws NumberFormatException 
 if(templateLength == -1)
 templateLength = Integer.parseInt(XsamReadQueries.findElement(read, 8));
 

 return templateLength;
 

 /**
 * @inheritDoc
 */
 @Override
 public String getSequence() 
 if(sequence == null)
 sequence = XsamReadQueries.findBaseSequence(read);
 

 return sequence;
 

 /**
 * @inheritDoc
 */
 @Override
 public String getQuality() 
 if(quality == null)
 quality = XsamReadQueries.findElement(read, 10);
 

 return quality;
 

 /**
 * @inheritDoc
 */
 @Override
 public boolean isRepeat() 
 return read.contains(REPEAT_TERM);
 

 /**
 * @inheritDoc
 */
 @Override
 public boolean isMapped() 
 return !read.contains(MATCH_TERM);
 


 /**
 * @inheritDoc
 */
 @Override
 public String getVariableTerms() 
 if(variableTerms == null)
 variableTerms = XsamReadQueries.findVariableRegionSequence(read);
 

 return variableTerms;
 

 /**
 * @inheritDoc
 */
 @Override
 public boolean isQualityFailed() 
 return read.contains(QUALITY_CHECK_TERM);
 


 @Override
 public boolean equals(Object o) getClass() != o.getClass()) return false;
 SamRecord samRecord = (SamRecord) o;
 return Objects.equals(read, samRecord.read);
 

 @Override
 public int hashCode() 
 return Objects.hash(read);
 

 @Override
 public String toString() 
 return read;

The fields are returned by static methods in a helper class which retrieve them by looking at where the tab characters are. i.e. flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));

Below is the XsamReadQuery class

/**
 * Non-instantiable utility class for working with Xsam reads
 */
public final class XsamReadQueries {

    // Suppress instantiation
    private XsamReadQueries() {
        throw new AssertionError();
    }

    /** finds the position of the tab directly before the start of the variable region
     * @param read whole sam or Xsam read to search
     * @return position of the tab in the String
     */
    public static int findVariableRegionStart(String read) {

        int found = 0;

        for(int i = 0; i < read.length(); i++) {
            if(read.charAt(i) == '\t') {
                found++;
                if(found >= 11 && i+1 < read.length() && (read.charAt(i+1) != 'x' && read.charAt(i+1) != '\t')) { //guard against double-tabs
                    return i + 1;
                }
            }
        }

        return -1;

    }

    /** Attempts to find the library name from SBL reads
     * where SBL reads have the id SBL_LibraryName_ID:XXXXX
     * if LibraryName end's with a lower case letter, the letter will be removed.
     * if SBL_LibID is not valid, return the full ID.
     * @param ID or String to search.
     * @return Library name with lower case endings removed
     */
    public static String findLibraryName(String ID) {

        if(!ID.startsWith("SBL")) return "";

        try {
            int firstPos = XsamReadQueries.findPosAfter(ID, "_");
            int i = firstPos;
            while (ID.charAt(i) != '_' && ID.charAt(i) != '\t') {
                i++;
            }

            String library = ID.substring(firstPos, i);

            char lastChar = library.charAt(library.length()-1);
            if(lastChar >= 97 && lastChar <= 122) {
                library = library.substring(0, library.length()-1);
            }

            return library;

        } catch (Exception e) {
            int i = 0;
            while(ID.charAt(i) != '\t') {
                i++;
                if(i == ID.length()) {
                    break;
                }
            }
            return ID.substring(0, i);
        }
    }

    /** Returns the ID from the sample
     * @param sample Xsam read
     * @return ID
     */
    public static String findID(String sample) {
        return findElement(sample, 0);
    }

    /** Returns the phred score from the sample
     * @param sample Xsam read
     * @return phred string
     */
    public static String findPhred(String sample) {
        return findElement(sample, 10);
    }

    /**
     * Returns the cigar from the xsam read
     *
     * @param sample read
     * @return cigar string
     */
    public static String findCigar(String sample) {
        return findElement(sample, 5);
    }

    /**
     * Returns the bases from the xsam read
     *
     * @param sample read
     * @return base string
     */
    public static String findBaseSequence(String sample) {
        return findElement(sample, 9);
    }

    /**
     * finds the n'th element in the tab delimited sample
     * i.e findElement(0) returns one from "one\ttwo"
     * 0 indexed.
     *
     * @param sample String to search
     * @param element element to find
     * @return found element or "" if not found
     */
    public static String findElement(String sample, int element) {
        boolean tabsFound = false;
        int i = 0;
        int firstTab = 0;
        int secondTab = 0;
        int tabsToSkip = element - 1 >= 0 ? element - 1 : 0;
        int skippedTabs = 0;
        if (element == 0) {
            while (sample.charAt(i) != '\t') {
                i++;
            }
            return sample.substring(0, i);
        } else {
            while (!tabsFound) {
                if (sample.charAt(i) != '\t') {
                    i++;
                } else {
                    if (skippedTabs == tabsToSkip) {
                        if (firstTab == 0) {
                            firstTab = i;
                        } else {
                            secondTab = i;
                            tabsFound = true;
                        }
                    } else {
                        skippedTabs++;
                    }
                    i++;
                }
            }
        }

        return sample.substring(firstTab + 1, secondTab);
    }

    /** finds the variable region past the quality
     * @param sample sam or Xsam record string
     * @return variable sequence or empty string
     */
    public static String findVariableRegionSequence(String sample) {

        int start = findVariableRegionStart(sample);

        if(start == -1) return "";

        return sample.substring(findVariableRegionStart(sample));
    }

    /** finds the xL field
     * @param sample String to search
     * @return position if found, '\0' (null) value if not.
     */
    public static int findxLField(String sample) {
        int chartStart = findPosAfter(sample, "\txL:i:");
        if (chartStart == -1) {
            return -1; //return -1 if not found.
        }
        int i = chartStart;
        while (sample.charAt(i) != '\t') {
            i++;
        }
        return Integer.parseInt(sample.substring(chartStart, i));

    }

    /** finds the xR field
     * @param sample String to search
     * @return position if found, '\0' (null) value if not.
     */
    public static int findxRField(String sample) {
        int chartStart = findPosAfter(sample, "\txR:i:");
        if (chartStart == -1) {
            return '\0'; //return NULL if not found.
        }
        int i = chartStart;
        while (sample.charAt(i) != '\t') {
            i++;
        }
        return Integer.parseInt(sample.substring(chartStart, i));

    }

    /** finds the xLSeq field
     * @param sample String to search
     * @return String if found, empty string if not.
     */
    public static Optional<String> findxLSeqField(String sample) {
        int charStart = findPosAfter(sample, "\txLseq:i:");
        if (charStart == -1) {
            return Optional.empty(); //return NULL if not found.
        }
        int i = charStart;
        while (sample.charAt(i) != '\t') {
            i++;
        }
        return Optional.of(sample.substring(charStart, i));

    }

    /** finds the reference name field
     * @param sample String to search
     * @return String if found, empty string if not.
     */
    public static String findReferneceName(String sample) {
        //should always appear between the second and third tabs
        boolean tabsFound = false;
        int i = 0;
        int secondTab = 0;
        int thirdTab = 0;
        boolean skippedFirstTab = false;
        while (!tabsFound) {
            if (sample.charAt(i) != '\t') {
                i++;
            } else {
                if (skippedFirstTab) {
                    if (secondTab == 0) {
                        secondTab = i;
                    } else {
                        thirdTab = i;
                        tabsFound = true;
                    }
                }
                skippedFirstTab = true;
                i++;
            }
        }

        if(sample.substring(secondTab + 1, thirdTab).contains("/")) {
            String[] split = sample.substring(secondTab + 1, thirdTab).split("/");
            return split[split.length-1];
        }

        return sample.substring(secondTab + 1, thirdTab);
    }

    /**
     * Finds the needle in the haystack, and returns the position of the single next digit.
     *
     * @param haystack The string to search
     * @param needle String field to search on.
     * @return position of the end of the needle
     */
    private static int findPosAfter(String haystack, String needle) {
        int hLen = haystack.length();
        int nLen = needle.length();
        int maxSearch = hLen - nLen;

        outer:
        for (int i = 0; i < maxSearch; i++) {
            for (int j = 0; j < nLen; j++) {
                if (haystack.charAt(i + j) != needle.charAt(j)) {
                    continue outer;
                }
            }
            // If it reaches here, match has been found:
            return i + nLen;
        }
        return -1; // Not found
    }
}

My question is: are there any drawbacks to this approach? Or is there an alternative that might be more effective?

Thanks in advance,

Sam

Edit:

Instantiating code:

public interface RecordFactory<T extends Record> {

    T createRecord(String recordString);
}

Implementing it like:

private RecordFactory<SamRecord> samRecordFactory = SamRecord::new;

edited 2 days ago

asked Apr 23 at 12:45

Sam

21217

java bioinformatics lazy


Is this class, or could this class, be used in a multithreaded scenario?
– IEatBagels
Apr 23 at 13:47

It will be, yes. However, it's unlikely to be shared between threads. It's more likely that collections of these records will be handed off to separate workers.
– Sam
Apr 23 at 13:48

0. Do you have an actual, tested performance issue, or are you guessing at a potential problem? 1. What are you trying to minimize, execution time or memory footprint?
– Eric Stein
Apr 23 at 14:41

Also, are you able/willing to share the code that instantiates SamRecords?
– Eric Stein
Apr 23 at 15:09

Hi @EricStein. Sure thing, but it's just a short factory method using generics (since SamRecord extends from Record, and I also have a few other types that extend Record too). I know from past experience that this is a performance issue, since these files have hundreds of millions of records. When I split the file and hand the records to workers, I need to do as little as possible with the records themselves. The software using this API most likely won't need every detail of the record at one time.
– Sam
2 days ago


1 Answer

Performance

There is one thing that I believe could increase the performance of your application.

You often call findElement, which goes through the SAM record every time.

Once you load a record, you can be fairly certain that you will access at least one of its fields.

At some point, maybe when creating the class, or when accessing the first property for the first time, you should "index" your SAM record.

Go through the whole record once and keep an array of where the tabs are. This way, if your code ends up calling:

XsamReadQueries.findElement(read, 1)
XsamReadQueries.findElement(read, 2)
XsamReadQueries.findElement(read, 3)

The second and third calls would be much faster than they are now, because they would no longer rescan the record from the start.

To do this, you could add a method to XsamReadQueries named something like IndexTabs, which would return an array of ints.

If you want more insight as to how to do this, you can write a comment and I'll add more information, but I'm pretty sure this would help you.
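As a rough illustration of the indexing idea, here is a minimal sketch. The class name TabIndexSketch and the helpers indexTabs and field are hypothetical and are not part of the code in the question; it assumes the record is a single tab-delimited line.

```java
import java.util.Arrays;

// Hypothetical sketch: scan the record once, remember the tab positions,
// then slice any field without rescanning.
final class TabIndexSketch {

    /** Returns the position of every tab in the record, in ascending order. */
    static int[] indexTabs(String record) {
        int[] positions = new int[16];
        int count = 0;
        for (int i = record.indexOf('\t'); i >= 0; i = record.indexOf('\t', i + 1)) {
            if (count == positions.length) {
                // Grow the buffer when a record has more tabs than expected.
                positions = Arrays.copyOf(positions, count * 2);
            }
            positions[count++] = i;
        }
        return Arrays.copyOf(positions, count);
    }

    /** Returns the 0-indexed field, using the precomputed tab positions. */
    static String field(String record, int[] tabs, int element) {
        if (element < 0 || element > tabs.length) {
            throw new IndexOutOfBoundsException("no field " + element);
        }
        int start = element == 0 ? 0 : tabs[element - 1] + 1;
        int end = element == tabs.length ? record.length() : tabs[element];
        return record.substring(start, end);
    }

    public static void main(String[] args) {
        String record = "read1\t99\tchr1\t100";
        int[] tabs = indexTabs(record);          // one pass over the record
        System.out.println(field(record, tabs, 2)); // prints "chr1"
    }
}
```

In SamRecord, the int[] could itself be built lazily on the first getter call and cached, so records that are never queried still cost nothing beyond the stored String. The trade-off is one extra array per queried record.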

Code style

There are one or two things that bother me in your code with regard to clarity and future maintenance.

You have methods such as findPhred, which call findElement, but in your SamRecord you sometimes call findElement and sometimes a specific find* method, which is basically the same code. You should decide on one way to do things: either have a specific method for each field in XsamReadQueries, or keep only the findElement method.

Finally, you could consider using an enum for the element parameter of the findElement method.
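A possible shape for that enum is sketched below. The names SamFieldSketch and SamField are hypothetical; the column indexes are the ones already used in the question's getters (ID at 0, flag at 1, and so on up to quality at 10). The findElement overload here uses String.split only as a short stand-in for the real lookup.

```java
// Hypothetical sketch: an enum replaces the magic int passed to findElement.
final class SamFieldSketch {

    /** One constant per column, carrying its 0-based position in the record. */
    enum SamField {
        ID(0), FLAG(1), REFERENCE_NAME(2), POSITION(3), MAPPING_QUALITY(4),
        CIGAR(5), MATE_REFERENCE_NAME(6), MATE_POSITION(7), TEMPLATE_LENGTH(8),
        SEQUENCE(9), QUALITY(10);

        private final int index;

        SamField(int index) {
            this.index = index;
        }

        int index() {
            return index;
        }
    }

    /** Stand-in lookup: splits on tabs and returns the requested column. */
    static String findElement(String sample, SamField field) {
        // Limit -1 keeps trailing empty columns instead of dropping them.
        return sample.split("\t", -1)[field.index()];
    }

    public static void main(String[] args) {
        String record = "r1\t99\tchr1\t100\t60\t4M";
        System.out.println(findElement(record, SamField.CIGAR)); // prints "4M"
    }
}
```

With this in place, a call such as findElement(read, SamField.MATE_POSITION) documents itself, and the per-field helpers like findCigar become optional.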

edited Apr 23 at 14:19

answered Apr 23 at 14:13

IEatBagels

9,07323579

Hi @IEatBagels, thanks for the answer. Indexing is a really good idea, I'll definitely look to implement that. I agree with the find* notation - It's partially as I'm halfway through coding the API and wanted some feedback before committing to one way or the other. The enum's a good idea too, it'll definitely make it more readable!
– Sam
Apr 23 at 16:00
