Mutations in the human genome are diverse: single nucleotide substitutions (SNPs), insertions, deletions, and structural variants. ML models currently struggle to predict the effects of insertions and deletions (indels) because an indel shifts the flanking sequence out of alignment with the reference. Our new paper "Shift augmentation improves DNA convolutional neural network indel effect predictions" addresses this problem.
We've developed new data augmentation strategies that significantly reduce technical variance in CNN predictions for insertions and deletions. Our "stitching" approach improves indel eQTL classification accuracy and extends to even larger structural variants and tandem repeat expansions!
Plus, our new in silico deletion (ISD) technique offers another approach to sequence interpretation alongside established in silico mutagenesis (ISM). These advances help unlock the full spectrum of effect prediction for noncoding genetic variation, which is critical for understanding human development and disease.
The figure above (left) demonstrates how shifting a DNA sequence before a max pooling block affects the pooling boundaries and therefore the output of the max pooling operation. For a width 2 max pool, a 2 nt shift yields essentially the same output, with all values shifted by one position. A 1 nt shift, however, changes some output values because the max is computed over different adjacent pairs. The output of max pooling is therefore sensitive to these minor shifts, which can lead to different predictions for the same sequence after a small shift.
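The shift sensitivity described above can be reproduced with a few lines of plain Python. This is a toy sketch, not code from the paper: the `signal` values are made-up activations, and `max_pool` and `shift_left` are hypothetical helpers written only for this illustration.

```python
def max_pool(values, width=2):
    """Non-overlapping max pool; a trailing partial window is dropped."""
    return [max(values[i:i + width])
            for i in range(0, len(values) - width + 1, width)]


def shift_left(values, n, pad=0):
    """Shift a track left by n positions, padding the right end with `pad`."""
    return values[n:] + [pad] * n


signal = [1, 5, 2, 4, 9, 3, 0, 7]  # made-up activations, one per nucleotide

pooled = max_pool(signal)                        # [5, 4, 9, 7]
pooled_shift2 = max_pool(shift_left(signal, 2))  # [4, 9, 7, 0]
pooled_shift1 = max_pool(shift_left(signal, 1))  # [5, 9, 3, 7]
```

The 2 nt shift keeps the pooling pairs intact, so `pooled_shift2` is `pooled` moved over by one bin (apart from the padded edge). The 1 nt shift regroups the pairs, and a value such as `3` appears that was never a pooled maximum in the unshifted output.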
This is a problem for convolutional neural networks (CNNs) that use max pooling. In DNA variant effect predictors, we usually compare the reference allele with the alternative. If the alternative allele is a deletion or insertion, the sequence downstream of the indel is inevitably shifted. This results in overestimation of the indel effect size because of a purely technical difference in predictions between the reference allele and the shifted part of the alternative allele.
The figure (right) illustrates a couple of augmentation strategies that address this issue. The first strategy takes the average of both the left and right shifts to mitigate the effect of newly inserted or deleted pieces of sequence. The second strategy, called "stitching", involves concatenating (or stitching) together left side predictions from the left-matched prediction and right side predictions from the right-matched prediction. This approach showed the best results in the indel and SV eQTL benchmarks -- check out the paper for more details!
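To make the two strategies concrete, here is a minimal sketch on a toy "model" that is just a per-base score followed by a width 2 max pool. Everything in it is an assumption for illustration: `toy_predict`, the sequence, the window size, and the choice to stitch at the bin containing the variant are not taken from the paper's implementation.

```python
SCORE = {"A": 1, "C": 2, "G": 3, "T": 4}  # arbitrary per-base scores
WIDTH = 2     # pooling width of the toy model
WINDOW = 12   # fixed input length of the toy model


def toy_predict(seq):
    """Hypothetical stand-in for a CNN: per-base score, then max pool."""
    track = [SCORE[base] for base in seq]
    return [max(track[i:i + WIDTH])
            for i in range(0, len(track) - WIDTH + 1, WIDTH)]


genome = "CATTGGCTAACCGGTTA"
start = 2                  # window start in reference coordinates
del_pos, del_len = 7, 1    # delete the T at reference position 7

ref_pred = toy_predict(genome[start:start + WINDOW])
alt_genome = genome[:del_pos] + genome[del_pos + del_len:]

# Left-matched: upstream flank stays aligned with the reference bins.
left_pred = toy_predict(alt_genome[start:start + WINDOW])

# Right-matched: move the window so the downstream flank stays aligned.
right_start = start - del_len
right_pred = toy_predict(alt_genome[right_start:right_start + WINDOW])

# Strategy 1: average the two shifted predictions.
average_pred = [(l + r) / 2 for l, r in zip(left_pred, right_pred)]

# Strategy 2: stitch left bins from left-matched, the rest from right-matched.
variant_bin = (del_pos - start) // WIDTH
stitched_pred = left_pred[:variant_bin] + right_pred[variant_bin:]


def total_abs_diff(pred):
    """Summed absolute difference from the reference prediction."""
    return sum(abs(a - b) for a, b in zip(ref_pred, pred))


naive_effect = total_abs_diff(left_pred)         # 5
stitched_effect = total_abs_diff(stitched_pred)  # 1
```

In this toy, the naive left-matched comparison differs from the reference in every bin downstream of the deletion, while the stitched prediction differs only in the bin that contains the variant. A real CNN has a much wider receptive field, so stitched predictions would still change away from the variant; the aim is to keep the part of that change that reflects the variant rather than the shift.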
In silico mutagenesis (ISM) is a common regulatory sequence interpretation technique to identify the influential functional DNA motifs and other sequence factors driving model predictions. ISM mutates every reference nucleotide to its three alternatives, computing a prediction for each (see figure above), and scoring the reference nucleotide based on the reference prediction relative to the average alternative.
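The ISM scoring rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: `predict` is any function that maps a sequence to a scalar, and `toy_predict` (which simply counts a GATA motif) is a hypothetical stand-in for a trained model, not something from the paper.

```python
def ism_scores(seq, predict):
    """Score each position as reference prediction minus mean alternative."""
    ref_pred = predict(seq)
    scores = []
    for i, ref_base in enumerate(seq):
        # Predictions for the three substitutions at position i.
        alt_preds = [predict(seq[:i] + base + seq[i + 1:])
                     for base in "ACGT" if base != ref_base]
        scores.append(ref_pred - sum(alt_preds) / len(alt_preds))
    return scores


def toy_predict(seq):
    """Hypothetical model: activity equals the number of GATA motifs."""
    return float(seq.count("GATA"))


scores = ism_scores("CCGATACC", toy_predict)
# Positions inside the motif get a score of 1, flanking positions get 0.
```

Each sequence of length L needs 3L extra forward passes, which is why ISM is usually restricted to a window of interest.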
In silico deletion (ISD) of reference nucleotides could be an alternative to the typical ISM technique: we delete one reference nucleotide at a time and compute the prediction for the reference and for each resulting alternative sequence.
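A minimal ISD sketch looks like this. The scoring rule (reference prediction minus deletion prediction) and the toy model are assumptions made for illustration. `toy_predict` is a hypothetical stand-in that fires only when a GATA motif and a TGC motif sit exactly two bases apart, and unlike a real fixed-window CNN it accepts sequences of any length.

```python
import re


def isd_scores(seq, predict):
    """Score each position by the prediction change when it is deleted."""
    ref_pred = predict(seq)
    return [ref_pred - predict(seq[:i] + seq[i + 1:])
            for i in range(len(seq))]


def toy_predict(seq):
    """Hypothetical model: 1.0 if GATA and TGC are exactly 2 nt apart."""
    return 1.0 if re.search(r"GATA..TGC", seq) else 0.0


seq = "CGATAACTGCA"
scores = isd_scores(seq, toy_predict)
# Motif bases and the two spacer bases score 1; the outer flanks score 0.
```

Substituting either spacer base would leave this toy model's output unchanged, since the spacer can be any two nucleotides, but deleting one breaks the spacing. That is the kind of deletion-specific sensitivity ISD is meant to surface.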
We benchmarked ISD on the MPRA data (see the manuscript for details) and found that ISD is a good alternative to ISM, or it could be used in conjunction with ISM to determine motifs that are more sensitive to deletion than substitution.