What alignment algorithm is used for the "Align to Reference Sequence" tool?

Updated June 12, 2025 02:46

The "Align to Reference" tool (see Align Sanger Reads to a Reference Sequence) iteratively finds seed matches between the reference sequence and each sequence being aligned. These matches are then combined to generate the best alignment. We use our own implementation of the Smith-Waterman algorithm^† (Smith & Waterman, 1981 - https://doi.org/10.1016/0022-2836(81)90087-5) to refine the gaps between perfectly matching regions and to extend the alignment at its ends.
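As a rough illustration of the seeding step, the Python sketch below finds exact matches shared by a reference and a read and merges them into maximal runs along each diagonal. It is not SnapGene's implementation: the k-mer index, the seed length `k`, and the function name `find_seed_matches` are all assumptions made for this example.

```python
from collections import defaultdict


def find_seed_matches(reference: str, read: str, k: int = 8):
    """Return (ref_start, read_start, length) for maximal exact matches >= k.

    Hypothetical helper: the seed length k and the k-mer index are
    illustrative choices, not documented SnapGene behavior.
    """
    # Index every k-mer of the reference by its start position.
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)

    # Collect every (reference position, read position) pair sharing a k-mer.
    hits = set()
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], ()):
            hits.add((i, j))

    # Merge consecutive hits on the same diagonal into one longer seed.
    seeds = []
    for i, j in sorted(hits):
        if (i - 1, j - 1) in hits:
            continue  # this hit continues a run that started earlier
        length = k
        while (i + length - k + 1, j + length - k + 1) in hits:
            length += 1
        seeds.append((i, j, length))
    return seeds
```

For example, `find_seed_matches("TTGACCATGCAAGT", "ACCATGCA", k=5)` reports a single seed that starts at reference position 3 and read position 0 and spans 8 bases.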

^† When extending the first or last match (Dynamic Smith-Waterman), the weights are as follows:

MATCH_SCORE = 1

AMBIGUOUS_SCORE = -1

MISMATCH_SCORE = -2

OPEN_GAP_SCORE = -3

EXTEND_GAP_SCORE = -1

MAX_DROP = 40
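To show how weights like these can interact, here is a deliberately simplified Python sketch of a drop-off-limited extension. It makes several assumptions that the text above does not confirm: `MAX_DROP` is treated as the largest allowed fall below the best score seen so far, any base other than A, C, G or T is treated as ambiguous, and gaps are ignored entirely (so `OPEN_GAP_SCORE` and `EXTEND_GAP_SCORE` are unused). The function names are hypothetical.

```python
MATCH_SCORE = 1
AMBIGUOUS_SCORE = -1
MISMATCH_SCORE = -2
MAX_DROP = 40

UNAMBIGUOUS = set("ACGT")


def pair_score(a: str, b: str) -> int:
    """Score one aligned pair of bases (assumed handling of ambiguity codes)."""
    a, b = a.upper(), b.upper()
    if a not in UNAMBIGUOUS or b not in UNAMBIGUOUS:
        return AMBIGUOUS_SCORE
    return MATCH_SCORE if a == b else MISMATCH_SCORE


def extend_right(reference, read, ref_pos, read_pos, max_drop=MAX_DROP):
    """Extend rightwards without gaps; return (best_score, best_length).

    Extension stops once the running score has fallen more than
    max_drop below the best score reached so far.
    """
    score = best_score = 0
    length = best_length = 0
    while ref_pos + length < len(reference) and read_pos + length < len(read):
        score += pair_score(reference[ref_pos + length], read[read_pos + length])
        length += 1
        if score > best_score:
            best_score, best_length = score, length
        elif best_score - score > max_drop:
            break
    return best_score, best_length
```

With the default drop limit, `extend_right("ACGTACGT", "ACGTTCGT", 0, 0)` reads through the single mismatch and returns `(5, 8)`; with `max_drop=1` it stops at the mismatch and returns `(4, 4)`.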

For traditional Smith-Waterman (when refining gaps between extended seed matches), the weights are as follows:

MATCH_SCORE = 1

MISMATCH_SCORE = -2

OPEN_GAP_SCORE = -3

EXTEND_GAP_SCORE = -1

SHIFT_INCREMENT = 500 (increment by which we explore alignments around the numerical origin)
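For readers who want to experiment with these weights, the following minimal Python sketch computes a Smith-Waterman local alignment score with affine gap penalties. It is a textbook formulation, not SnapGene's code. It assumes that the first position of a gap costs `OPEN_GAP_SCORE` and each additional position costs `EXTEND_GAP_SCORE`, and it does not model `SHIFT_INCREMENT`. The function name is hypothetical.

```python
MATCH_SCORE = 1
MISMATCH_SCORE = -2
OPEN_GAP_SCORE = -3
EXTEND_GAP_SCORE = -1

NEG = -10**9  # stands in for "minus infinity"


def smith_waterman_score(a: str, b: str) -> int:
    """Best local alignment score of a and b with affine gap penalties."""
    m = len(b)
    h_prev = [0] * (m + 1)    # best score of an alignment ending at (i-1, j)
    f_prev = [NEG] * (m + 1)  # same, but ending with a gap in b
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (m + 1)
        f_cur = [NEG] * (m + 1)
        e = NEG  # best score ending at (i, j) with a gap in a
        for j in range(1, m + 1):
            # Either open a new gap or extend an existing one.
            e = max(h_cur[j - 1] + OPEN_GAP_SCORE, e + EXTEND_GAP_SCORE)
            f_cur[j] = max(h_prev[j] + OPEN_GAP_SCORE,
                           f_prev[j] + EXTEND_GAP_SCORE)
            sub = MATCH_SCORE if a[i - 1] == b[j - 1] else MISMATCH_SCORE
            # Local alignment: a score never drops below zero.
            h_cur[j] = max(0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best
```

Under these assumptions, `smith_waterman_score("ACGTACGTAC", "ACGTAGTAC")` returns 6: nine matching bases minus one single-base gap.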

This approach allows SnapGene to align test sequences to very large reference sequences as well as circular reference sequences, while also handling mismatches, insertions, deletions and replacements within the alignment.