Lazy Loading a Bioinformatic SAM record
$begingroup$
I'm currently writing an API to work with Bioinformatic SAM records. Here's an example of one:
SBL_XSBF463_ID:3230017:BCR1:GCATAA:BCR2:CATATA/1:vpe 97 hs07 38253395 3 30M = 38330420 77055 TTGTTCCACTGCCAAAGAGTTTCTTATAAT EEEEEEEEEEEEAEEEEEEEEEEEEEEEEE PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:30 ZS:Z:R NH:i:2 HI:i:1 IH:i:1
Each piece of information separated by a tab is its own field and corresponds to some type of data.
Now, it's important to note that these files get BIG (tens of GB), so splitting each record into its fields as soon as it's instantiated in some kind of POJO would be inefficient.
Hence, I've decided to create an object with a lazy loading mechanism. Only the original string is stored until one of the fields is requested by some calling code. This should minimise the amount of work done when the object is created, as well as minimise the amount of memory taken by the objects.
Here's my attempt:
import java.util.Objects;

/** Class for storing and working with a SAM-formatted DNA sequence record.
*
* Upon construction, only the String record is stored.
* All querying of fields is done on demand, to save time.
*
*/
public class SamRecord implements Record {
private final String read;
private String id = null;
private int flag = -1;
private String referenceName = null;
private int pos = -1;
private int mappingQuality = -1;
private String cigar = null;
private String mateReferenceName = null;
private int matePosition = -1;
private int templateLength = -1;
private String sequence = null;
private String quality = null;
private String variableTerms = null;
private final static String REPEAT_TERM = "ZS:Z:R";
private final static String MATCH_TERM = "ZS:Z:NM";
private final static String QUALITY_CHECK_TERM = "ZS:Z:QC";
/** Simple constructor for the sam record
* @param read full read
*/
public SamRecord(String read) {
this.read = read;
}
public String getRead() {
return read;
}
/**
* {@inheritDoc}
*/
@Override
public String getId() {
if(id == null){
id = XsamReadQueries.findID(read);
}
return id;
}
/**
* {@inheritDoc}
*/
@Override
public int getFlag() throws NumberFormatException {
if(flag == -1) {
flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
}
return flag;
}
/**
* {@inheritDoc}
*/
@Override
public String getReferenceName() {
if(referenceName == null){
referenceName = XsamReadQueries.findReferneceName(read);
}
return referenceName;
}
/**
* {@inheritDoc}
*/
@Override
public int getPos() throws NumberFormatException{
if(pos == -1){
pos = Integer.parseInt(XsamReadQueries.findElement(read, 3));
}
return pos;
}
/**
* {@inheritDoc}
*/
@Override
public int getMappingQuality() throws NumberFormatException {
if(mappingQuality == -1){
mappingQuality = Integer.parseInt(XsamReadQueries.findElement(read, 4));
}
return mappingQuality;
}
/**
* {@inheritDoc}
*/
@Override
public String getCigar() {
if(cigar == null){
cigar = XsamReadQueries.findCigar(read);
}
return cigar;
}
/**
* {@inheritDoc}
*/
@Override
public String getMateReferenceName() {
if(mateReferenceName == null){
mateReferenceName = XsamReadQueries.findElement(read, 6);
}
return mateReferenceName;
}
/**
* {@inheritDoc}
*/
@Override
public int getMatePosition() throws NumberFormatException {
if(matePosition == -1){
matePosition = Integer.parseInt(XsamReadQueries.findElement(read, 7));
}
return matePosition;
}
/**
* {@inheritDoc}
*/
@Override
public int getTemplateLength() throws NumberFormatException {
if(templateLength == -1){
templateLength = Integer.parseInt(XsamReadQueries.findElement(read, 8));
}
return templateLength;
}
/**
* {@inheritDoc}
*/
@Override
public String getSequence() {
if(sequence == null){
sequence = XsamReadQueries.findBaseSequence(read);
}
return sequence;
}
/**
* {@inheritDoc}
*/
@Override
public String getQuality() {
if(quality == null){
quality = XsamReadQueries.findElement(read, 10);
}
return quality;
}
/**
* {@inheritDoc}
*/
@Override
public boolean isRepeat() {
return read.contains(REPEAT_TERM);
}
/**
* {@inheritDoc}
*/
@Override
public boolean isMapped() {
return !read.contains(MATCH_TERM);
}
/**
* {@inheritDoc}
*/
@Override
public String getVariableTerms() {
if(variableTerms == null){
variableTerms = XsamReadQueries.findVariableRegionSequence(read);
}
return variableTerms;
}
/**
* {@inheritDoc}
*/
@Override
public boolean isQualityFailed() {
return read.contains(QUALITY_CHECK_TERM);
}
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
SamRecord samRecord = (SamRecord) o;
return Objects.equals(read, samRecord.read);
}
@Override
public int hashCode() {
return Objects.hash(read);
}
@Override
public String toString() {
return read;
}
}
The fields are returned by static methods in a helper class, which locate them by the positions of the tab characters, e.g. flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
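To make the tab-based lookup concrete, here is a minimal sketch of extracting the n-th field with String.indexOf. It is not the helper class from this post: the names TabFieldSketch and nthField are hypothetical, and the shortened record in main is made up for illustration.

```java
public final class TabFieldSketch {
    private TabFieldSketch() {}

    /**
     * Returns the n-th (0-indexed) tab-delimited field of the record,
     * or "" if the record has fewer than n + 1 fields.
     * Hypothetical helper, for illustration only.
     */
    public static String nthField(String record, int n) {
        int start = 0;
        // Skip past the first n tabs to reach the start of the wanted field.
        for (int k = 0; k < n; k++) {
            int tab = record.indexOf('\t', start);
            if (tab < 0) {
                return ""; // record ends before field n
            }
            start = tab + 1;
        }
        // The field runs to the next tab, or to the end of the record.
        int end = record.indexOf('\t', start);
        return end < 0 ? record.substring(start) : record.substring(start, end);
    }

    public static void main(String[] args) {
        // Made-up, shortened record with six tab-separated fields.
        String record = "read1\t97\ths07\t38253395\t3\t30M";
        System.out.println(nthField(record, 1)); // the flag column
        System.out.println(nthField(record, 5)); // the last column, no trailing tab
    }
}
```

Only the characters up to the requested field are scanned, and a field at the very end of the record (with no trailing tab) is still returned.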
Below is the XsamReadQueries class:
import java.util.Optional;

/**
* Non-instantiable utility class for working with Xsam reads
*/
public final class XsamReadQueries {
// Suppress instantiation
private XsamReadQueries() {
throw new AssertionError();
}
/** finds the position of the tab directly before the start of the variable region
* @param read whole sam or Xsam read to search
* @return position of the tab in the String
*/
public static int findVariableRegionStart(String read){
int found = 0;
for(int i = 0; i < read.length(); i++){
if(read.charAt(i) == '\t'){
found++;
if(found >= 11 && i+1 < read.length() && (read.charAt(i+1) != 'x' && read.charAt(i+1) != '\t')){ //guard against double-tabs
return i + 1;
}
}
}
return -1;
}
/** Attempts to find the library name from SBL reads
* where SBL reads have the id SBL_LibraryName_ID:XXXXX
* if LibraryName ends with a lower case letter, the letter will be removed.
* if SBL_LibID is not valid, return the full ID.
* @param ID or String to search.
* @return Library name with lower case endings removed
*/
public static String findLibraryName(String ID){
if(!ID.startsWith("SBL")) return "";
try {
int firstPos = XsamReadQueries.findPosAfter(ID, "_");
int i = firstPos;
while (ID.charAt(i) != '_' && ID.charAt(i) != '\t') {
i++;
}
String library = ID.substring(firstPos, i);
char lastChar = library.charAt(library.length()-1);
if(lastChar >= 97 && lastChar <= 122){
library = library.substring(0, library.length()-1);
}
return library;
}catch (Exception e){
int i = 0;
while(ID.charAt(i) != '\t'){
i++;
if(i == ID.length()){
break;
}
}
return ID.substring(0, i);
}
}
/** Returns the ID from the sample
* @param sample Xsam read
* @return ID
*/
public static String findID(String sample){
return findElement(sample, 0);
}
/** Returns the phred score from the sample
* @param sample Xsam read
* @return phred string
*/
public static String findPhred(String sample){
return findElement(sample, 10);
}
/**
* Returns the cigar from the xsam read
*
* @param sample read
* @return cigar string
*/
public static String findCigar(String sample) {
return findElement(sample, 5);
}
/**
* Returns the bases from the xsam read
*
* @param sample read
* @return base string
*/
public static String findBaseSequence(String sample) {
return findElement(sample, 9);
}
/**
* finds the n'th element in the tab delimited sample
* e.g. findElement("one\ttwo", 0) returns "one"
* 0 indexed.
*
* @param sample String to search
* @param element element to find
* @return found element or "" if not found
*/
public static String findElement(String sample, int element) {
boolean tabsFound = false;
int i = 0;
int firstTab = 0;
int secondTab = 0;
int tabsToSkip = element - 1 >= 0 ? element - 1 : 0;
int skippedTabs = 0;
if (element == 0) {
while (sample.charAt(i) != '\t') {
i++;
}
return sample.substring(0, i);
} else {
while (!tabsFound) {
if (sample.charAt(i) != '\t') {
i++;
} else {
if (skippedTabs == tabsToSkip) {
if (firstTab == 0) {
firstTab = i;
} else {
secondTab = i;
tabsFound = true;
}
} else {
skippedTabs++;
}
i++;
}
}
}
return sample.substring(firstTab + 1, secondTab);
}
/** finds the variable region past the quality
* @param sample sam or Xsam record string
* @return variable sequence or empty string
*/
public static String findVariableRegionSequence(String sample){
int start = findVariableRegionStart(sample);
if(start == -1) return "";
return sample.substring(findVariableRegionStart(sample));
}
/** finds the xL field
* @param sample String to search
* @return position if found, -1 if not.
*/
public static int findxLField(String sample) {
int chartStart = findPosAfter(sample, "\txL:i:");
if (chartStart == -1) {
return -1; //return -1 if not found.
}
int i = chartStart;
while (sample.charAt(i) != '\t') {
i++;
}
return Integer.parseInt(sample.substring(chartStart, i));
}
/** finds the xR field
* @param sample String to search
* @return position if found, '\0' (null) value if not.
*/
public static int findxRField(String sample) {
int chartStart = findPosAfter(sample, "\txR:i:");
if (chartStart == -1) {
return '\0'; //return NULL if not found.
}
int i = chartStart;
while (sample.charAt(i) != '\t') {
i++;
}
return Integer.parseInt(sample.substring(chartStart, i));
}
/** finds the xLSeq field
* @param sample String to search
* @return Optional containing the String if found, empty Optional if not.
*/
public static Optional<String> findxLSeqField(String sample) {
int charStart = findPosAfter(sample, "\txLseq:i:");
if (charStart == -1) {
return Optional.empty(); //return an empty Optional if not found.
}
int i = charStart;
while (sample.charAt(i) != '\t') {
i++;
}
return Optional.of(sample.substring(charStart, i));
}
/** finds the reference name field
* @param sample String to search
* @return String if found, empty string if not.
*/
public static String findReferneceName(String sample) {
//should always appear between the second and third tabs
boolean tabsFound = false;
int i = 0;
int secondTab = 0;
int thirdTab = 0;
boolean skippedFirstTab = false;
while (!tabsFound) {
if (sample.charAt(i) != '\t') {
i++;
} else {
if (skippedFirstTab) {
if (secondTab == 0) {
secondTab = i;
} else {
thirdTab = i;
tabsFound = true;
}
}
skippedFirstTab = true;
i++;
}
}
if(sample.substring(secondTab + 1, thirdTab).contains("/")){
String[] split = sample.substring(secondTab + 1, thirdTab).split("/");
return split[split.length-1];
}
return sample.substring(secondTab + 1, thirdTab);
}
/**
* Finds the needle in the haystack, and returns the position of the single next digit.
*
* @param haystack The string to search
* @param needle String field to search on.
* @return position of the end of the needle
*/
private static int findPosAfter(String haystack, String needle) {
int hLen = haystack.length();
int nLen = needle.length();
int maxSearch = hLen - nLen;
outer:
for (int i = 0; i < maxSearch; i++) {
for (int j = 0; j < nLen; j++) {
if (haystack.charAt(i + j) != needle.charAt(j)) {
continue outer;
}
}
// If it reaches here, match has been found:
return i + nLen;
}
return -1; // Not found
}
}
My question is: are there any drawbacks to this approach? Or is there an alternative that might be more effective?
Thanks in advance,
Sam
java bioinformatics lazy
$endgroup$
$begingroup$
Is this class, or could this class, be used in a multithreaded scenario?
$endgroup$
– IEatBagels
11 hours ago
$begingroup$
It will be, yes. However, it's unlikely to be shared between threads. It's more likely that collections of these records will be handed off to separate workers.
$endgroup$
– Sam
11 hours ago
$begingroup$
0. Do you have an actual, tested performance issue, or are you guessing at a potential problem? 1. What are you trying to minimize, execution time or memory footprint?
$endgroup$
– Eric Stein
10 hours ago
$begingroup$
Also, are you able/willing to share the code that instantiates SamRecords?
$endgroup$
– Eric Stein
10 hours ago
$begingroup$
I'm currently writing an API to work with Bioinformatic SAM records. Here's an example of one:
SBL_XSBF463_ID:3230017:BCR1:GCATAA:BCR2:CATATA/1:vpe 97 hs07 38253395 3 30M = 38330420 77055 TTGTTCCACTGCCAAAGAGTTTCTTATAAT EEEEEEEEEEEEAEEEEEEEEEEEEEEEEE PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:30 ZS:Z:R NH:i:2 HI:i:1 IH:i:1
Each piece of information separated by a tab is it's own field and corresponds to some type of data.
Now, it's important to note that these files get BIG (10's of GB) and so splitting each one as soon as it's instantiated in some kind of POJO would be inefficient.
Hence, I've decided to create an object with a lazy loading mechanism. Only the original string is stored until one of the fields is requested by some calling code. This should minimise the amount of work done when the object is created, as well as minimise the amount of memory taken by the objects.
Here's my attempt:
/** Class for storing and working with sam formatted DNA sequence.
*
* Upon construction, only the String record is stored.
* All querying of fields is done on demand, to save time.
*
*/
public class SamRecord implements Record {
private final String read;
private String id = null;
private int flag = -1;
private String referenceName = null;
private int pos = -1;
private int mappingQuality = -1;
private String cigar = null;
private String mateReferenceName = null;
private int matePosition = -1;
private int templateLength = -1;
private String sequence = null;
private String quality = null;
private String variableTerms = null;
private final static String REPEAT_TERM = "ZS:Z:R";
private final static String MATCH_TERM = "ZS:Z:NM";
private final static String QUALITY_CHECK_TERM = "ZS:Z:QC";
/** Simple constructor for the sam record
* @param read full read
*/
public SamRecord(String read) {
this.read = read;
}
public String getRead() {
return read;
}
/**
* {@inheritDoc}
*/
@Override
public String getId() {
if(id == null){
id = XsamReadQueries.findID(read);
}
return id;
}
/**
* {@inheritDoc}
*/
@Override
public int getFlag() throws NumberFormatException {
if(flag == -1) {
flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
}
return flag;
}
/**
* {@inheritDoc}
*/
@Override
public String getReferenceName() {
if(referenceName == null){
referenceName = XsamReadQueries.findReferneceName(read);
}
return referenceName;
}
/**
* {@inheritDoc}
*/
@Override
public int getPos() throws NumberFormatException{
if(pos == -1){
pos = Integer.parseInt(XsamReadQueries.findElement(read, 3));
}
return pos;
}
/**
* {@inheritDoc}
*/
@Override
public int getMappingQuality() throws NumberFormatException {
if(mappingQuality == -1){
mappingQuality = Integer.parseInt(XsamReadQueries.findElement(read, 4));
}
return mappingQuality;
}
/**
* {@inheritDoc}
*/
@Override
public String getCigar() {
if(cigar == null){
cigar = XsamReadQueries.findCigar(read);
}
return cigar;
}
/**
* {@inheritDoc}
*/
@Override
public String getMateReferenceName() {
if(mateReferenceName == null){
mateReferenceName = XsamReadQueries.findElement(read, 6);
}
return mateReferenceName;
}
/**
* {@inheritDoc}
*/
@Override
public int getMatePosition() throws NumberFormatException {
if(matePosition == -1){
matePosition = Integer.parseInt(XsamReadQueries.findElement(read, 7));
}
return matePosition;
}
/**
* {@inheritDoc}
*/
@Override
public int getTemplateLength() throws NumberFormatException {
if(templateLength == -1){
templateLength = Integer.parseInt(XsamReadQueries.findElement(read, 8));
}
return templateLength;
}
/**
* {@inheritDoc}
*/
@Override
public String getSequence() {
if(sequence == null){
sequence = XsamReadQueries.findBaseSequence(read);
}
return sequence;
}
/**
* {@inheritDoc}
*/
@Override
public String getQuality() {
if(quality == null){
quality = XsamReadQueries.findElement(read, 10);
}
return quality;
}
/**
* {@inheritDoc}
*/
@Override
public boolean isRepeat() {
return read.contains(REPEAT_TERM);
}
/**
* {@inheritDoc}
*/
@Override
public boolean isMapped() {
return !read.contains(MATCH_TERM);
}
/**
* {@inheritDoc}
*/
@Override
public String getVariableTerms() {
if(variableTerms == null){
variableTerms = XsamReadQueries.findVariableRegionSequence(read);
}
return variableTerms;
}
/**
* {@inheritDoc}
*/
@Override
public boolean isQualityFailed() {
return read.contains(QUALITY_CHECK_TERM);
}
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
SamRecord samRecord = (SamRecord) o;
return Objects.equals(read, samRecord.read);
}
@Override
public int hashCode() {
return Objects.hash(read);
}
@Override
public String toString() {
return read;
}
}
The fields are returned by static methods in a helper class which retrieve them by looking at where the tab characters are. i.e. flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
Below is the XsamReadQuery
class
/**
* Non-instantiable utility class for working with Xsam reads
*/
public final class XsamReadQueries {
// Suppress instantiation
private XsamReadQueries() {
throw new AssertionError();
}
/** finds the position of the tab directly before the start of the variable region
* @param read whole sam or Xsam read to search
* @return position of the tab in the String
*/
public static int findVariableRegionStart(String read){
int found = 0;
for(int i = 0; i < read.length(); i++){
if(read.charAt(i) == 't'){
found++;
if(found >= 11 && i+1 < read.length() && (read.charAt(i+1) != 'x' && read.charAt(i+1) != 't')){ //guard against double-tabs
return i + 1;
}
}
}
return -1;
}
/** Attempts to find the library name from SBL reads
* where SBL reads have the id SBL_LibraryName_ID:XXXXX
* if LibraryName end's with a lower case letter, the letter will be removed.
* if SBL_LibID is not valid, return the full ID.
* @param ID or String to search.
* @return Library name with lower case endings removed
*/
public static String findLibraryName(String ID){
if(!ID.startsWith("SBL")) return "";
try {
int firstPos = XsamReadQueries.findPosAfter(ID, "_");
int i = firstPos;
while (ID.charAt(i) != '_' && ID.charAt(i) != 't') {
i++;
}
String library = ID.substring(firstPos, i);
char lastChar = library.charAt(library.length()-1);
if(lastChar >= 97 && lastChar <= 122){
library = library.substring(0, library.length()-1);
}
return library;
}catch (Exception e){
int i = 0;
while(ID.charAt(i) != 't'){
i++;
if(i == ID.length()){
break;
}
}
return ID.substring(0, i);
}
}
/** Returns the ID from the sample
* @param sample Xsam read
* @return ID
*/
public static String findID(String sample){
return findElement(sample, 0);
}
/** Returns the phred score from the sample
* @param sample Xsam read
* @return phred string
*/
public static String findPhred(String sample){
return findElement(sample, 10);
}
/**
* Returns the cigar from the xsam read
*
* @param sample read
* @return cigar string
*/
public static String findCigar(String sample) {
return findElement(sample, 5);
}
/**
* Returns the bases from the xsam read
*
* @param sample read
* @return base string
*/
public static String findBaseSequence(String sample) {
return findElement(sample, 9);
}
/**
 * Finds the n'th element in the tab delimited sample,
 * i.e. findElement("one\ttwo", 0) returns "one".
 * 0 indexed.
 *
 * @param sample String to search
 * @param element element to find
 * @return found element or "" if not found
 */
public static String findElement(String sample, int element) {
    boolean tabsFound = false;
    int i = 0;
    int firstTab = 0;
    int secondTab = 0;
    int tabsToSkip = element - 1 >= 0 ? element - 1 : 0;
    int skippedTabs = 0;
    if (element == 0) {
        while (i < sample.length() && sample.charAt(i) != '\t') {
            i++;
        }
        return sample.substring(0, i);
    } else {
        while (!tabsFound) {
            if (i == sample.length()) {
                if (firstTab == 0) {
                    return ""; // fewer fields than requested
                }
                secondTab = i; // last field has no trailing tab
                tabsFound = true;
            } else if (sample.charAt(i) != '\t') {
                i++;
            } else {
                if (skippedTabs == tabsToSkip) {
                    if (firstTab == 0) {
                        firstTab = i;
                    } else {
                        secondTab = i;
                        tabsFound = true;
                    }
                } else {
                    skippedTabs++;
                }
                i++;
            }
        }
    }
    return sample.substring(firstTab + 1, secondTab);
}
/** finds the variable region past the quality
 * @param sample sam or Xsam record string
 * @return variable sequence or empty string
 */
public static String findVariableRegionSequence(String sample){
    int start = findVariableRegionStart(sample);
    if(start == -1) return "";
    return sample.substring(start); // reuse the position instead of scanning twice
}
/** finds the xL field
 * @param sample String to search
 * @return value of the xL field if found, -1 if not.
 */
public static int findxLField(String sample) {
    int charStart = findPosAfter(sample, "\txL:i:");
    if (charStart == -1) {
        return -1; // return -1 if not found.
    }
    int i = charStart;
    while (i < sample.length() && sample.charAt(i) != '\t') {
        i++;
    }
    return Integer.parseInt(sample.substring(charStart, i));
}
/** finds the xR field
 * @param sample String to search
 * @return value of the xR field if found, -1 if not.
 */
public static int findxRField(String sample) {
    int charStart = findPosAfter(sample, "\txR:i:");
    if (charStart == -1) {
        return -1; // return -1 if not found, consistent with findxLField.
    }
    int i = charStart;
    while (i < sample.length() && sample.charAt(i) != '\t') {
        i++;
    }
    return Integer.parseInt(sample.substring(charStart, i));
}
/** finds the xLSeq field
 * @param sample String to search
 * @return Optional containing the field value if found, empty Optional if not.
 */
public static Optional<String> findxLSeqField(String sample) {
    int charStart = findPosAfter(sample, "\txLseq:i:");
    if (charStart == -1) {
        return Optional.empty(); // not found
    }
    int i = charStart;
    while (i < sample.length() && sample.charAt(i) != '\t') {
        i++;
    }
    return Optional.of(sample.substring(charStart, i));
}
/** finds the reference name field
 * @param sample String to search
 * @return reference name; if it contains '/', only the part after the last '/'.
 */
public static String findReferneceName(String sample) {
    // should always appear between the second and third tabs
    boolean tabsFound = false;
    int i = 0;
    int secondTab = 0;
    int thirdTab = 0;
    boolean skippedFirstTab = false;
    while (!tabsFound) {
        if (sample.charAt(i) != '\t') {
            i++;
        } else {
            if (skippedFirstTab) {
                if (secondTab == 0) {
                    secondTab = i;
                } else {
                    thirdTab = i;
                    tabsFound = true;
                }
            }
            skippedFirstTab = true;
            i++;
        }
    }
    if(sample.substring(secondTab + 1, thirdTab).contains("/")){
        String[] split = sample.substring(secondTab + 1, thirdTab).split("/");
        return split[split.length-1];
    }
    return sample.substring(secondTab + 1, thirdTab);
}
/**
 * Finds the needle in the haystack, and returns the position directly after it.
 *
 * @param haystack The string to search
 * @param needle String field to search on.
 * @return index immediately after the first occurrence of the needle, or -1 if not found
 */
private static int findPosAfter(String haystack, String needle) {
    int hLen = haystack.length();
    int nLen = needle.length();
    int maxSearch = hLen - nLen;
    outer:
    for (int i = 0; i <= maxSearch; i++) { // <= so a needle at the very end is still found
        for (int j = 0; j < nLen; j++) {
            if (haystack.charAt(i + j) != needle.charAt(j)) {
                continue outer;
            }
        }
        // If it reaches here, match has been found:
        return i + nLen;
    }
    return -1; // Not found
}
}
My question is: are there any drawbacks to this approach? Or is there an alternative that might be more effective?
Thanks in advance,
Sam
java bioinformatics lazy
$endgroup$
asked 12 hours ago by Sam (edited 11 hours ago)
$begingroup$
Is this class, or could this class, be used in a multithreaded scenario?
$endgroup$
– IEatBagels
11 hours ago
$begingroup$
It will be, yes. However, it's unlikely to be shared between threads. It's more likely that collections of these records will be handed off to separate workers.
$endgroup$
– Sam
11 hours ago
$begingroup$
0. Do you have an actual, tested performance issue, or are you guessing at a potential problem? 1. What are you trying to minimize, execution time or memory footprint?
$endgroup$
– Eric Stein
10 hours ago
$begingroup$
Also, are you able/willing to share the code that instantiates SamRecords?
$endgroup$
– Eric Stein
10 hours ago
1 Answer
Performance
There is one thing that I believe could improve the performance of your application.
You often call findElement, which scans the SAM record from the start every time.
If you load a record, you can be fairly certain that you will access it at least once.
At some point, maybe when creating the object, or when a field is accessed for the first time, you should "index" your SAM record.
Go through the whole record once and keep an array of where the tabs are. This way, if your code ends up calling:
XsamReadQueries.findElement(read, 1)
XsamReadQueries.findElement(read, 2)
XsamReadQueries.findElement(read, 3)
the second and third calls would be much faster than they are now.
To do this, you could add a method to XsamReadQueries, named something like IndexTabs, that would return an array of ints.
If you want more insight into how to do this, write a comment and I'll add more information, but I'm pretty sure this would help you.
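A minimal sketch of the tab-indexing idea, under some assumptions: the class name TabIndexSketch and the methods indexTabs and field are hypothetical and not part of the original XsamReadQueries, and a missing field is reported as an empty string. It scans the record once, stores the tab positions, and then cuts any field out with a single substring call.

```java
import java.util.Arrays;

/** Hypothetical helper illustrating the idea; not part of the original XsamReadQueries. */
final class TabIndexSketch {
    private TabIndexSketch() {}

    /** Scans the record once and returns the position of every tab, in ascending order. */
    static int[] indexTabs(String record) {
        int[] tabs = new int[16]; // grows on demand if the record has more tabs
        int count = 0;
        for (int i = record.indexOf('\t'); i >= 0; i = record.indexOf('\t', i + 1)) {
            if (count == tabs.length) {
                tabs = Arrays.copyOf(tabs, count * 2);
            }
            tabs[count++] = i;
        }
        return Arrays.copyOf(tabs, count); // trim to the number of tabs found
    }

    /** Returns the n'th (0-indexed) field using the precomputed index, or "" if it does not exist. */
    static String field(String record, int[] tabs, int n) {
        if (n < 0 || n > tabs.length) {
            return "";
        }
        int start = (n == 0) ? 0 : tabs[n - 1] + 1;
        int end = (n == tabs.length) ? record.length() : tabs[n];
        return record.substring(start, end);
    }
}
```

One way to use this would be for SamRecord to keep the int[] in a field that stays null until the first getter is called, so that records which are never queried pay nothing. The array does cost extra memory per record, so whether this is a net win depends on how many fields are typically read.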
Code style
There are one or two things that bother me in your code with regard to clarity and future maintenance.
You have methods such as findPhred that simply call findElement, but in SamRecord you sometimes call findElement directly and sometimes a specific find* method, which is basically the same code. You should decide on one way to do things: either have a specific method for each field in XsamReadQueries, or keep only the findElement method.
Finally, you could consider using an enum for the element parameter of the findElement method.
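The enum suggestion could look roughly like the sketch below. The names SamFieldSketch and SamField are hypothetical; the column numbers are taken from the indices that SamRecord already passes to findElement (0 for the ID up to 10 for the quality string). The lookup method here is a simplified stand-in written for this example, not the original findElement.

```java
/** Hypothetical sketch; the enum and the lookup are illustrative, not the original API. */
final class SamFieldSketch {
    private SamFieldSketch() {}

    /** The eleven fixed columns, numbered as in the calls made by SamRecord. */
    enum SamField {
        ID(0), FLAG(1), REFERENCE_NAME(2), POS(3), MAPPING_QUALITY(4), CIGAR(5),
        MATE_REFERENCE_NAME(6), MATE_POSITION(7), TEMPLATE_LENGTH(8), SEQUENCE(9), QUALITY(10);

        private final int column;

        SamField(int column) {
            this.column = column;
        }

        int column() {
            return column;
        }
    }

    /** Returns the requested field of a tab-delimited record, or "" if the record is too short. */
    static String findElement(String sample, SamField field) {
        int start = 0;
        // Skip one tab per preceding column.
        for (int skipped = 0; skipped < field.column(); skipped++) {
            int tab = sample.indexOf('\t', start);
            if (tab < 0) {
                return "";
            }
            start = tab + 1;
        }
        int end = sample.indexOf('\t', start);
        return sample.substring(start, end < 0 ? sample.length() : end);
    }
}
```

A call such as findElement(read, SamField.CIGAR) then documents itself at the call site, and the per-field wrappers like findCigar or findPhred become optional.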
answered 11 hours ago by IEatBagels (edited 10 hours ago)
$begingroup$
Hi @IEatBagels, thanks for the answer. Indexing is a really good idea, I'll definitely look to implement that. I agree with the find* notation - It's partially as I'm halfway through coding the API and wanted some feedback before committing to one way or the other. The enum's a good idea too, it'll definitely make it more readable!
$endgroup$
– Sam
9 hours ago