I don’t have time to do it now, but I’m looking forward to writing up a description of the things I’m learning from using the SSE2 instructions. One thing I see already is that building an 8-item pipeline and operating on 8 items in parallel only seems to get me a 3.5x speed improvement. I guess my algorithm was already fairly well optimized.
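To give a rough idea of what "8 items in parallel" can look like, here is a minimal sketch, assuming the items are 16-bit scores (an SSE2 register is 128 bits wide, so it holds eight of them). The function name `sw_cell_x8` and its parameters are made up for illustration and this is not my actual code; it only shows one step of a typical scoring recurrence computed for eight cells at once.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: for each of 8 lanes k, computes
 *   out[k] = max(0, diag[k] + score[k], up[k] - gap, left[k] - gap)
 * using saturating 16-bit arithmetic. All arrays must hold 8 elements. */
void sw_cell_x8(const int16_t *diag, const int16_t *up, const int16_t *left,
                const int16_t *score, int16_t gap, int16_t *out)
{
    /* Unaligned loads, so the caller does not need 16-byte alignment. */
    __m128i d = _mm_loadu_si128((const __m128i *)diag);
    __m128i u = _mm_loadu_si128((const __m128i *)up);
    __m128i l = _mm_loadu_si128((const __m128i *)left);
    __m128i s = _mm_loadu_si128((const __m128i *)score);
    __m128i g = _mm_set1_epi16(gap);           /* gap penalty in every lane */

    __m128i h = _mm_adds_epi16(d, s);           /* diagonal + match/mismatch */
    h = _mm_max_epi16(h, _mm_subs_epi16(u, g)); /* gap coming from above */
    h = _mm_max_epi16(h, _mm_subs_epi16(l, g)); /* gap coming from the left */
    h = _mm_max_epi16(h, _mm_setzero_si128());  /* clamp at zero */

    _mm_storeu_si128((__m128i *)out, h);
}
```

Each intrinsic here handles eight lanes in one instruction, but the loads, stores, and the work of arranging the data into lanes still cost time, which is one plausible reason the real speedup falls short of 8x.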

If anyone reads this and is interested, look up the Smith-Waterman algorithm used in bioinformatics. In FPGA hardware, it can be made to perform hundreds of operations in parallel.
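For anyone who wants a starting point, here is a minimal scalar sketch of the Smith-Waterman scoring pass. It assumes a simple linear gap penalty and returns only the best local alignment score, with no traceback. The function name `sw_score` and the parameters `match`, `mismatch`, and `gap` are illustrative choices, not taken from any particular library or from my own implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: best local alignment score of strings a and b.
 * match is added for equal characters, mismatch (usually negative) for
 * unequal ones, and gap (a positive penalty) is subtracted per gap. */
int sw_score(const char *a, const char *b, int match, int mismatch, int gap)
{
    size_t n = strlen(a), m = strlen(b);
    /* One row of the score matrix is enough when only the score is needed. */
    int *row = calloc(m + 1, sizeof *row);
    int best = 0;
    assert(row != NULL);

    for (size_t i = 1; i <= n; i++) {
        int diag = 0;                       /* H[i-1][j-1]; column 0 is all zeros */
        for (size_t j = 1; j <= m; j++) {
            int up = row[j];                /* H[i-1][j], not yet overwritten */
            int h = diag + (a[i - 1] == b[j - 1] ? match : mismatch);
            if (up - gap > h) h = up - gap;                 /* gap in b */
            if (row[j - 1] - gap > h) h = row[j - 1] - gap; /* gap in a */
            if (h < 0) h = 0;               /* local alignment never goes negative */
            row[j] = h;
            diag = up;
            if (h > best) best = h;
        }
    }
    free(row);
    return best;
}
```

With these assumptions, a call such as `sw_score("ACGT", "ACT", 2, -1, 1)` scores the two sequences with +2 per match, -1 per mismatch, and a penalty of 1 per gap. Each cell depends only on its left, upper, and upper-left neighbours, so cells along the same anti-diagonal are independent of each other, which is what hardware implementations exploit to run many of them at once.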