I finished my initial SSE2 project. It worked out well but I was hoping that it would work out better than just “well.”

The algorithm I needed to speed up is called the Smith-Waterman algorithm. In reality, Smith and Waterman modified a type of algorithm called a dynamic programming algorithm. I’m not sure why it’s called that, but I’m not a very academic person, so names like that hold little meaning to me until someone explains them very clearly. I do understand the dynamic part of it.

The algorithm compares two single-dimensional arrays of data. In my case, nucleotide or amino acid data. DNA and that sort of stuff. The algorithm basically computes a score for every x,y position in a two-dimensional array bounded by each of the one-dimensional arrays of data. In other words, take a sheet of graph paper and fill in boxes across the top with letters of the alphabet. Then fill in letters down the left side with other letters of the alphabet. Any letters will do. Then compute a score for each x,y position and fill in each of the graph paper squares with that score. The best score is the score that is reported.

There are a few interesting things about this type of algorithm. First, the score for a specific cell, or box on the graph paper, is computed using the score values from other cells that have already been calculated. Second, there are a few different values that are used as possible scores and the highest value is the one stored in the cell. To be a little clearer, the score is the maximum of four different values. There are also two other scores computed for every cell, and they are used to get two of the values used in that max operation. The details are not important and there are papers on the internet that describe the algorithm. It is public knowledge.
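
For the curious, here is a rough sketch of what one cell update looks like in the common affine-gap formulation, which matches the description above: the final score is a max of four values, and two extra per-cell scores feed into it. The function and variable names here are mine for illustration, not from my actual code:

/* One cell update in the affine-gap form of the algorithm (a sketch). */
static int max2( int a, int b ) { return a > b ? a : b; }

static int cell_score( int h_diag, int h_left, int h_up,   /* neighbour cell scores          */
                       int e_left, int f_up,               /* the two extra scores, carried  */
                       int *e_out, int *f_out,             /* in from the left and upper cells */
                       int sub, int gap_open, int gap_extend )
{
    *e_out = max2( e_left - gap_extend, h_left - gap_open );  /* gap in one sequence    */
    *f_out = max2( f_up   - gap_extend, h_up   - gap_open );  /* gap in the other       */
    return max2( max2( 0, h_diag + sub ),                     /* match / mismatch       */
                 max2( *e_out, *f_out ) );                    /* best of the two gaps   */
}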

My first attempt at speeding up this algorithm was futile. I thought that I could get rid of a bunch of if() tests in my code by using bit masks to accomplish the max() operations. This worked and the code looked neat, but it ran at the same speed as with the if() tests. I had already optimized the if() test code so that no unneeded operations were being done. Let me explain:

To get the max of four different values a, b, c, and d, you can do this:

if( b > a )
    a = b;

if( c > a )
    a = c;

if( d > a )
    a = d;

It is clear that this will work fine and can’t even be optimized much. The a variable holds the maximum of the values. Sure, you could subtract one value from another, shift the result right 31 bits (assuming 32-bit integers) to smear the sign bit into all bits, and use that as a mask to select between the values. But it is more code and runs at the same speed as the code above.
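
If you are curious what that mask trick looks like, here is a minimal sketch. It assumes 32-bit ints, a compiler that does an arithmetic shift on signed values, and values far enough from the integer limits that the subtraction doesn’t overflow; the function name is mine:

/* Branchless max of two values using the sign bit as a mask. */
static inline int branchless_max( int a, int b )
{
    int diff = a - b;                    /* negative when b > a                     */
    int mask = diff >> 31;               /* all ones when b > a, all zeros when not */
    return ( a & ~mask ) | ( b & mask ); /* picks b when b > a, otherwise a         */
}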

So how can this algorithm be sped up? I’ll explain only what is published information. None of this information is proprietary to my company or even intellectual property, because you can download an SSE2 version of this software from various places.

The important point of my project was not to speed up the individual cell calculations. In fact, they are much slower using SSE2 instructions. The optimization is to calculate the results for 8 cells in a single set of operations. Since each cell relies on the cell above it, the cell to the left, and the cell up and to the left, a pipeline is used to process the data diagonally in strips of 8 row values. I won’t go into any more details because it is time consuming and the published papers already describe them. All I am pointing out is that to speed up a process with SSE2, or any SIMD operations on Intel and AMD processors, find a way to do multiple operations in parallel.

I will go into a little detail.

I fill my 8-cell pipeline with information about the first 8 rows of data. That includes some scoring information and a bunch of zeroed data. A flag is stored in the cells indicating which ones are valid rows, because some rows in the pipeline might be past the end of the row data. Then I loop through columns. For the first column of data, I stick the “letter of the alphabet” in the first cell in the pipeline. Then all cells get scores computed at the same time. I essentially do the code shown up above, but each “if( b > a ) a = b;” test and operation is done for all 8 cells. The results are stored and the “if( c > a ) a = c;” operation is done on all 8 cells. I use the _mm_max_epi16() compiler intrinsic function to accomplish each max test. Once the scores are calculated, the next column is scored.
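
In other words, the scalar max-of-four from earlier collapses into something like this, where a, b, c, and d are now __m128i values, each holding eight signed 16-bit scores (the names are mine, not the real code’s):

a = _mm_max_epi16( a, b );   /* "if( b > a ) a = b;" for all 8 cells at once */
a = _mm_max_epi16( a, c );   /* "if( c > a ) a = c;" for all 8 cells         */
a = _mm_max_epi16( a, d );   /* "if( d > a ) a = d;" for all 8 cells         */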

The details of the scoring go way beyond this but basically, the pipeline method of computing this stuff works well. It’s called a pipeline because some of the results of the pipeline calculations are passed from one pipeline cell to the next for each iteration of column processing. Some data, on the other hand, remains in the cells until all processing is done. The scores for the cells are compared, again with a parallel max operation, to the best scores so far and any score that is better than what has been calculated for the cell in the past is saved.

I wrote my implementation from scratch and it runs at about 75% of the speed of a version that I downloaded. I have not bothered to find the source of the difference. I think it is something very clever in the other package that keeps it from having to do memory accesses for each set of pipeline calculations. There is some temporary storage that allows the next 8 rows to be processed after the first 8 have gone through the pipeline. The first cell needs information that was calculated during the entire previous pass through the column data. Reducing or removing the need for that temporary storage would make a big change in performance.

Ah, but now the interesting part of this article. To get a human readable report on how the best score was generated, other information is needed besides the score. At a minimum, we need this code:

flag = A_IS_BEST;
if( b > a )
{
    a = b;
    flag = B_IS_BEST;
}

if( c > a )
{
    a = c;
    flag = C_IS_BEST;
}

if( d > a )
{
    a = d;
    flag = D_IS_BEST;
}

There is a way to optimize that to skip at least one of the flag settings and my old code did that. In fact, I had moved the code around and tried all kinds of software changes to make the old version as quick as possible. Now, I needed to change it so that I could use the SSE2 instructions to set those flags and also keep track of the location of the best score in the 2D array. That’s where _mm_cmpgt_epi16() comes in. What a great function. It does the comparison and, for each of the eight 16-bit elements, the result is all ones if the test is true or all zeros if it is false. An inversion and a few AND operations and the flags are set. The rest of the code uses these same instructions.

Here is a single max operation that gets the max of 8 values vs. 8 other values but also sets the flag separately for each of the 8 values:

mask = _mm_cmpgt_epi16( a, b );             /* all ones where a > b, all zeros where not */
temp = _mm_andnot_si128( mask, B_IS_BEST ); /* B_IS_BEST where b >= a                    */
flag = _mm_and_si128( mask, A_IS_BEST );    /* A_IS_BEST where a > b                     */
flag = _mm_or_si128( flag, temp );          /* combine into one flag per element         */
a = _mm_max_epi16( a, b );                  /* and keep the larger value                 */

The _mm_max_epi16() is still used because it is probably faster than trying to use the mask to get the maximum value. The mask is only used for handling the flag selection. I’ll leave it to the reader to figure out exactly what this code does with the bits of the various values. This compares 8 values in a to 8 values in b and sets 8 flags appropriately.
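
If you want to see it run, here is a small self-contained sketch of the same idea. The flag values and variable names are mine for illustration, not the ones in my actual code:

/* A self-contained sketch of the max-with-flags operation above.
 * Compile with an SSE2-capable compiler, e.g. gcc -msse2. */
#include <emmintrin.h>
#include <stdio.h>

int main( void )
{
    /* Two sets of eight signed 16-bit scores. */
    __m128i a = _mm_setr_epi16( 5, -3, 12,  7, 0,  9,  9, -8 );
    __m128i b = _mm_setr_epi16( 2,  4, 12, 10, 0, -1, 20, -9 );

    /* Per-element flag values (hypothetical). */
    const __m128i A_IS_BEST = _mm_set1_epi16( 1 );
    const __m128i B_IS_BEST = _mm_set1_epi16( 2 );

    __m128i mask = _mm_cmpgt_epi16( a, b );             /* a > b, per element     */
    __m128i temp = _mm_andnot_si128( mask, B_IS_BEST ); /* B_IS_BEST where b >= a */
    __m128i flag = _mm_and_si128( mask, A_IS_BEST );    /* A_IS_BEST where a > b  */
    flag = _mm_or_si128( flag, temp );                  /* one flag per element   */
    a = _mm_max_epi16( a, b );                          /* keep the larger values */

    short scores[8], flags[8];
    _mm_storeu_si128( (__m128i *)scores, a );
    _mm_storeu_si128( (__m128i *)flags, flag );

    for( int i = 0; i < 8; i++ )
        printf( "element %d: max=%d flag=%d\n", i, scores[i], flags[i] );

    return 0;
}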

To be as optimal as possible, the best score row and column are also stored in __m128i data types. Only after all of the data is processed do I bother to figure out just which of the pipeline cells contains the very best score, and then I extract the row and column information and I’m almost done. To get human readable output, there is other code that starts at the best score location and simply follows the flags back through the data, creating strings of the alphabet letters to show which ones, and in what locations, generated the best score.
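
As a rough sketch of what that looks like, again with my own names rather than the real code’s, keeping the best score and its row for all eight pipeline cells at once goes something like this:

/* Track the best score and the row it occurred in, for eight cells at
 * once.  Called once per row inside the column loop; the row number is
 * broadcast to all eight 16-bit elements of "row". */
#include <emmintrin.h>

static void track_best( __m128i score, __m128i row,
                        __m128i *best, __m128i *best_row )
{
    __m128i mask = _mm_cmpgt_epi16( score, *best );                   /* where the new score wins        */
    *best_row = _mm_or_si128( _mm_and_si128( mask, row ),             /* take the new row there...       */
                              _mm_andnot_si128( mask, *best_row ) );  /* ...keep the old row elsewhere   */
    *best = _mm_max_epi16( *best, score );                            /* and keep the better score       */
}

The column is handled the same way, and only at the very end do you store best and best_row out to arrays of eight shorts with _mm_storeu_si128() and scan them to find the overall winner.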

It was very challenging and some of the most interesting work I’ve done in a while. There are other optimizations that can allow some data to be skipped entirely and I’m working on those now. They don’t involve any new parallelization so I’m done with the SSE2 work for now.