CNN for Genomic Sequence Analysis¶

What this notebook does

  • Demonstrates a DeepBind-style 1D CNN on synthetic DNA data.
  • Implements motif-extraction: finds top-activating k-mers for convolutional filters, builds PWMs and plots sequence logos using logomaker.

How to run: install TensorFlow (plus logomaker, biopython, scikit-learn, matplotlib, and tqdm) in the active environment.
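A minimal environment setup (package names taken from the imports used below; exact versions are just whatever this notebook happened to run with):

```shell
# install the notebook's dependencies into the active environment
pip install tensorflow logomaker biopython scikit-learn matplotlib tqdm requests
```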

In [39]:
import sys
print(sys.executable)
/opt/anaconda3/envs/genomics-cnn/bin/python
In [40]:
import tensorflow as tf
print(tf.__version__)
2.16.2
In [3]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, callbacks
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import logomaker
from Bio import SeqIO
import requests
from tqdm import tqdm

print('TensorFlow version:', tf.__version__)
print('Logomaker version present')
TensorFlow version: 2.16.2
Logomaker version present
In [4]:
# One-hot encoding utilities and synthetic dataset
BASES = 'ACGT'
base_to_idx = {b:i for i,b in enumerate(BASES)}

def one_hot_encode(seqs, L=None):
    if L is None: L = len(seqs[0])
    arr = np.zeros((len(seqs), L, 4), dtype=np.float32)
    for i, s in enumerate(seqs):
        for j, ch in enumerate(s[:L]):  # truncate sequences longer than L
            if ch in base_to_idx:
                arr[i,j,base_to_idx[ch]] = 1.0
    return arr

import random

def random_sequence(L):
    return ''.join(np.random.choice(list(BASES), L))

def implant_motif(seq, motif, pos=None):
    L = len(seq)
    m = len(motif)
    if pos is None:
        pos = np.random.randint(0, L-m+1)
    return seq[:pos] + motif + seq[pos+m:]

# Synthetic dataset generator (positive = sequence contains motif)
def generate_synthetic_dataset(n=2000, L=101, motif='ACGTG'):
    seqs = [random_sequence(L) for _ in range(n)]
    labels = np.zeros((n,1), dtype=np.float32)
    for i in range(n):
        if np.random.rand() < 0.5:
            seqs[i] = implant_motif(seqs[i], motif)
            labels[i,0] = 1.0
    X = one_hot_encode(seqs, L)
    return X, labels, seqs

# Quick smoke test
X, y, raw = generate_synthetic_dataset(n=200, L=101, motif='ACGTG')
print('X shape, y shape:', X.shape, y.shape)
print('Example sequence:', raw[0])
X shape, y shape: (200, 101, 4) (200, 1)
Example sequence: GGATAGTGGGCTCCTCCTTAACAAAGCATAGTCGTAAAATGAGCCCAGATTCAGAGACGTGTACTCACATTAAACGTAACATCCGTCGCCCCGAACCTCCG
In [5]:
# Build DeepBind-style model
from tensorflow.keras import layers, models

def make_deepbind_model(L=101, n_filters=32, filter_len=9):
    inp = layers.Input(shape=(L,4))
    x = layers.Conv1D(filters=n_filters, kernel_size=filter_len, activation='relu', name='conv1')(inp)
    x = layers.GlobalMaxPooling1D(name='gmp')(x)
    x = layers.Dense(32, activation='relu', name='dense1')(x)
    out = layers.Dense(1, activation='sigmoid', name='out')(x)
    model = models.Model(inputs=inp, outputs=out)
    model.compile(optimizer=optimizers.Adam(1e-3), loss='binary_crossentropy', metrics=['accuracy'])
    return model

model = make_deepbind_model(L=101, n_filters=32, filter_len=9)
model.summary()
2026-02-11 12:56:38.373774: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1
2026-02-11 12:56:38.373837: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 16.00 GB
2026-02-11 12:56:38.373844: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 5.33 GB
2026-02-11 12:56:38.373915: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2026-02-11 12:56:38.373941: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Model: "functional"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ input_layer (InputLayer)             │ (None, 101, 4)              │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ conv1 (Conv1D)                       │ (None, 93, 32)              │           1,184 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ gmp (GlobalMaxPooling1D)             │ (None, 32)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense1 (Dense)                       │ (None, 32)                  │           1,056 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ out (Dense)                          │ (None, 1)                   │              33 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 2,273 (8.88 KB)
 Trainable params: 2,273 (8.88 KB)
 Non-trainable params: 0 (0.00 B)
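The shapes and parameter counts in this summary can be verified by hand (a quick arithmetic check using the hyperparameters above):

```python
L, k, n_filters = 101, 9, 32           # input length, kernel size, filter count

# Conv1D with default 'valid' padding: output length = L - k + 1
conv_len = L - k + 1                   # 93

# conv1 parameters: (k positions x 4 channels + 1 bias) per filter
conv_params = (k * 4 + 1) * n_filters  # 1184

# dense1: one pooled score per filter -> 32 units
dense_params = n_filters * 32 + 32     # 1056

# out: 32 -> 1
out_params = 32 + 1                    # 33

print(conv_len, conv_params, dense_params, out_params,
      conv_params + dense_params + out_params)  # 93 1184 1056 33 2273
```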

32 filters¶

Why 32 filters? The model learns 32 candidate motif detectors in parallel. Individual filters may come to detect:

  1. The core TF motif
  2. Secondary motifs
  3. GC-rich regions
  4. Co-factor motifs
  5. Noise

Dense layer¶

A Dense layer computes: $$ \text{output} = Wx + b $$

  1. x = the 32 pooled motif scores
  2. W = a 32 × 32 weight matrix
  3. b = 32 biases

So it learns how to combine motif signals. Adding dense1 with a ReLU makes the model:

  1. Nonlinear

  2. Able to model motif interactions
Biological interpretation¶

  1. Conv layer: detects motifs individually

  2. Pooling: takes the strongest occurrence of each motif

  3. Dense layer: learns interactions between motifs

This layer can learn things like:

(1) Motif A + Motif B together = strong binding

(2) Motif C suppresses effect of Motif D

(3) Certain motifs matter more than others

So it models combinatorial regulation.
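A tiny hand-built sketch of that AND-like interaction (the weights here are invented for illustration, not taken from the trained model): a single ReLU unit that only fires when two motif scores are high together.

```python
def relu(x):
    return max(0.0, x)

# pooled scores for two motif filters (all other filters omitted for clarity)
w = (1.0, 1.0)   # weights onto a single dense unit
b = -1.5         # bias chosen so one motif alone (score ~1.0) is not enough

for scores in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]:
    h = relu(sum(wi * si for wi, si in zip(w, scores)) + b)
    print(scores, '->', h)
# only (1.0, 1.0) produces a positive activation (0.5): motif A AND motif B
```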

In [17]:
# Train on synthetic data (small example)
X, y, seqs = generate_synthetic_dataset(n=3000, L=101, motif='ACGTG')
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42)

model = make_deepbind_model(L=101, n_filters=32, filter_len=9)

es = callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
history = model.fit(Xtrain, ytrain, validation_split=0.1, epochs=30, batch_size=64, callbacks=[es])

# Evaluate
yp = model.predict(Xtest).ravel()
print('AUROC (synthetic):', roc_auc_score(ytest, yp))

# Plot a small training curve
import matplotlib.pyplot as plt
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.legend()
plt.xlabel('epoch')
plt.show()
Epoch 1/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.5027 - loss: 0.7553 - val_accuracy: 0.6042 - val_loss: 0.6810
Epoch 2/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.5858 - loss: 0.6791 - val_accuracy: 0.6458 - val_loss: 0.6684
Epoch 3/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.6770 - loss: 0.6628 - val_accuracy: 0.7042 - val_loss: 0.6498
Epoch 4/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.8393 - loss: 0.6233 - val_accuracy: 0.7375 - val_loss: 0.5973
Epoch 5/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.8917 - loss: 0.5518 - val_accuracy: 0.9000 - val_loss: 0.4994
Epoch 6/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9361 - loss: 0.4487 - val_accuracy: 0.9000 - val_loss: 0.4215
Epoch 7/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9389 - loss: 0.3495 - val_accuracy: 0.9125 - val_loss: 0.3393
Epoch 8/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9326 - loss: 0.2800 - val_accuracy: 0.9042 - val_loss: 0.2934
Epoch 9/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9508 - loss: 0.2137 - val_accuracy: 0.9042 - val_loss: 0.2812
Epoch 10/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.9413 - loss: 0.2052 - val_accuracy: 0.9125 - val_loss: 0.2871
Epoch 11/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9397 - loss: 0.1963 - val_accuracy: 0.9042 - val_loss: 0.2754
Epoch 12/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9414 - loss: 0.1909 - val_accuracy: 0.9125 - val_loss: 0.2824
Epoch 13/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9442 - loss: 0.1766 - val_accuracy: 0.9042 - val_loss: 0.2780
Epoch 14/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9345 - loss: 0.1944 - val_accuracy: 0.9042 - val_loss: 0.2736
Epoch 15/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.9384 - loss: 0.1844 - val_accuracy: 0.9125 - val_loss: 0.3060
Epoch 16/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.9394 - loss: 0.1894 - val_accuracy: 0.9042 - val_loss: 0.2736
Epoch 17/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.9432 - loss: 0.1651 - val_accuracy: 0.9042 - val_loss: 0.2747
Epoch 18/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9391 - loss: 0.1679 - val_accuracy: 0.9083 - val_loss: 0.2816
Epoch 19/30
34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9498 - loss: 0.1508 - val_accuracy: 0.9125 - val_loss: 0.2903
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step 
AUROC (synthetic): 0.9510327892532139
[figure: training and validation loss curves]
In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import logomaker
from tqdm import tqdm
import tensorflow as tf

# get conv layer and weights
conv_layer = model.get_layer('conv1')
conv_w = conv_layer.get_weights()[0]   # shape (filter_len, 4, n_filters)
filter_len = conv_w.shape[0]
n_filters = conv_w.shape[2]
print('conv filters shape:', conv_w.shape)   # expect (9, 4, 32) for your model

# build a small model that outputs conv activations
conv_model = tf.keras.Model(inputs=model.input, outputs=conv_layer.output)

# compute conv activations for Xtest
# choose a reasonable batch_size depending on RAM; 128 is usually fine
conv_out = conv_model.predict(Xtest, batch_size=128)  
# conv_out shape: (N, positions, n_filters)
print('conv_out shape:', conv_out.shape)

# helper: get top activating k-mers for a filter
def top_kmers_for_filter(filter_idx, raw_seqs, conv_out, k=200):
    # conv_out: (N, P, n_filters)
    activations = conv_out[:, :, filter_idx]   # (N, P)
    N, P = activations.shape
    flat_idx = np.argsort(activations.ravel())[::-1]  # descending
    kmers = []
    seen = 0
    for idx in flat_idx:
        seq_idx = idx // P
        pos = idx % P
        score = activations[seq_idx, pos]
        # stop at non-positive activations (ReLU was used in conv)
        if score <= 0:
            break
        kmer = raw_seqs[seq_idx][pos:pos+filter_len]
        kmers.append(kmer)
        seen += 1
        if seen >= k:
            break
    return kmers

# sanity check your variables:
print('Xtest.shape =', Xtest.shape)
print('Number of raw seqs available =', len(seqs))

# Example: extract top kmers for filter 0
# NOTE: `seqs` is the full, unsplit list (3000 sequences) while conv_out was
# computed on Xtest (600 sequences), so these k-mers are NOT aligned with the
# activations -- which is why the PWM below comes out near-uniform. A later
# cell re-splits with the sequences included to fix the alignment.
filter_idx = 0
kmers = top_kmers_for_filter(filter_idx, seqs, conv_out, k=200)
print('Top kmers example (filter %d):' % filter_idx, kmers[:10])

# Build PWM from kmers (counts -> probabilities)
def kmers_to_pwm(kmers):
    if len(kmers) == 0:
        raise ValueError('No kmers found for this filter.')
    L = len(kmers[0])
    counts = np.ones((L,4), dtype=float) * 1e-3  # pseudocount
    base_to_idx_local = {'A':0,'C':1,'G':2,'T':3}
    for k in kmers:
        for i, ch in enumerate(k):
            if ch in base_to_idx_local:  # skip non-ACGT characters
                counts[i, base_to_idx_local[ch]] += 1
    pwm = counts / counts.sum(axis=1, keepdims=True)
    return pwm

pwm = kmers_to_pwm(kmers)
pwm_df = pd.DataFrame(pwm, columns=list('ACGT'))
print(pwm_df.head())

# Plot sequence logo using logomaker (pass an axis: otherwise Logo opens its
# own figure and the bare plt.figure() call leaves an empty one behind)
fig, ax = plt.subplots(figsize=(max(6, filter_len/1.5), 3))
logo = logomaker.Logo(pwm_df, ax=ax)
ax.set_title(f'Filter {filter_idx} PWM logo (top {len(kmers)} kmers)')
plt.savefig("figure.png")
plt.show()
conv filters shape: (9, 4, 32)
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step 
conv_out shape: (600, 93, 32)
Xtest.shape = (600, 101, 4)
Number of raw seqs available = 3000
Top kmers example (filter 0): ['ACTAACCGG', 'AACCACGGG', 'GCATGACTC', 'GGCACGATG', 'CGACAAGGA', 'GGCGGCTTT', 'GAACGTGCA', 'GGAATTGTT', 'GGACCTGAC', 'CACCCGCTC']
          A         C         G         T
0  0.205001  0.245000  0.250000  0.299999
1  0.225000  0.265000  0.265000  0.245000
2  0.260000  0.255000  0.255000  0.230000
3  0.240000  0.309999  0.255000  0.195001
4  0.289999  0.284999  0.190001  0.235000
[figure: PWM logo for filter 0]
In [19]:
### The strongest filter by activation
# Compute average max activation per filter on positives
pos_idx = (ytest.ravel() == 1)

pos_conv = conv_out[pos_idx]  # only positives

# For each filter, compute mean max activation
mean_scores = pos_conv.max(axis=1).mean(axis=0)

print(mean_scores)
print("Best filter:", np.argmax(mean_scores))
[0.48183775 0.29306132 0.80339956 0.97591794 0.832431   0.41517872
 1.2068839  0.646036   0.29869142 0.01184291 1.0127937  0.5475508
 0.8214295  0.5333251  0.6611986  0.53398556 1.054617   1.0388899
 0.08910651 0.90396106 0.71023184 0.7482838  0.6017198  1.1549921
 0.6085278  1.0348054  0.4923234  0.7413878  0.8477556  0.5728764
 0.29544333 0.858427  ]
Best filter: 6
In [30]:
# ---- extract motifs from positives ----
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd
import logomaker

# re-split including seqs to get seqs_test aligned (use same random_state as training)
Xtrain, Xtest, ytrain, ytest, seqtrain, seqtest = train_test_split(X, y, seqs, test_size=0.2, random_state=42)

# If conv_out wasn't computed for this Xtest, (re)compute it:
conv_model = tf.keras.Model(inputs=model.input, outputs=model.get_layer('conv1').output)
conv_out = conv_model.predict(Xtest, batch_size=128)   # shape (N_test, P, n_filters)

print("Xtest.shape:", Xtest.shape, "conv_out.shape:", conv_out.shape)
print("len(seqtest):", len(seqtest), "len(ytest):", len(ytest))

# restrict to positives
pos_mask = (ytest.ravel() == 1)
pos_conv = conv_out[pos_mask]   # shape (N_pos, P, n_filters)
pos_seqs = [seqtest[i] for i in range(len(seqtest)) if pos_mask[i]]

print("Positives:", pos_conv.shape[0], "windows per seq:", pos_conv.shape[1])

# rank filters by mean (or mean of max) activation on positives
# mean of per-sequence max activation (good for finding motif detectors)
mean_scores = pos_conv.max(axis=1).mean(axis=0)   # shape (n_filters,)
best_filters = np.argsort(mean_scores)[::-1]     # descending order

print("Top filters by mean(max activation):", best_filters[:6])
best_filter = best_filters[0]
print("Best filter:", best_filter, "score:", mean_scores[best_filter])

# helper to extract top kmers for a filter (works with conv_out that aligns with raw_seqs)
def top_kmers_for_filter(filter_idx, raw_seqs, conv_out_for_seqs, k=200):
    activations = conv_out_for_seqs[:, :, filter_idx]  # (N, P)
    N, P = activations.shape
    flat_idx = np.argsort(activations.ravel())[::-1]
    kmers = []
    seen = 0
    for idx in flat_idx:
        seq_idx = idx // P
        pos = idx % P
        score = activations[seq_idx, pos]
        if score <= 0:
            break
        kmer = raw_seqs[seq_idx][pos:pos+filter_len]
        kmers.append(kmer)
        seen += 1
        if seen >= k:
            break
    return kmers

# extract kmers from positives only for top few filters and plot logos
n_show = 3
for rank in range(n_show):
    f = best_filters[rank]
    kmers = top_kmers_for_filter(f, pos_seqs, pos_conv, k=200)
    if len(kmers) == 0:
        print(f"Filter {f}: no positive kmers with positive activation found.")
        continue

    # Build PWM
    def kmers_to_pwm(kmers):
        L = len(kmers[0])
        counts = np.ones((L,4), dtype=float) * 1e-3
        base_to_idx_local = {'A':0,'C':1,'G':2,'T':3}
        for kmer in kmers:
            for i,ch in enumerate(kmer):
                if ch in base_to_idx_local:  # skip non-ACGT characters
                    counts[i, base_to_idx_local[ch]] += 1
        pwm = counts / counts.sum(axis=1, keepdims=True)
        return pwm

    pwm = kmers_to_pwm(kmers)
    pwm_df = pd.DataFrame(pwm, columns=list('ACGT'))

    # plot (pass an axis so logomaker draws into this figure)
    fig, ax = plt.subplots(figsize=(max(6, filter_len/1.2), 2.5))
    logomaker.Logo(pwm_df, ax=ax)
    ax.set_title(f'Filter {f} PWM logo (top {len(kmers)} kmers)')
    plt.tight_layout()
    filename = "figure" + str(rank) + ".png"
    plt.savefig(filename)
    plt.show()
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step 
Xtest.shape: (600, 101, 4) conv_out.shape: (600, 93, 32)
len(seqtest): 600 len(ytest): 600
Positives: 299 windows per seq: 93
Top filters by mean(max activation): [ 6 23 16 17 25 10]
Best filter: 6 score: 1.2068839
[figures: PWM logos for the top 3 filters]
In [31]:
print("AUROC:", roc_auc_score(ytest, yp))
AUROC: 0.9510327892532139
In [33]:
f = 6
max_act = conv_out[:,:,f].max(axis=1)

import scipy.stats as stats
print(stats.pearsonr(max_act, ytest.ravel()))
PearsonRResult(statistic=0.0010995743041838368, pvalue=0.9785571988475986)
In [34]:
f = 23
max_act = conv_out[:,:,f].max(axis=1)

import scipy.stats as stats
print(stats.pearsonr(max_act, ytest.ravel()))
PearsonRResult(statistic=-0.010052912043016407, pvalue=0.8058852430996971)
In [35]:
# approximate the importance of each dense1 unit to the final prediction
# (note: the output-layer weights rank dense1 units, not conv filters directly;
#  a true filter-level score would also propagate back through dense1)

# get weights
dense_weights = model.get_layer('dense1').get_weights()[0]  # shape (32, 32)
out_weights = model.get_layer('out').get_weights()[0]       # shape (32, 1)

# approximate importance via the absolute output-layer weight per unit
importance = np.abs(out_weights[:,0])

print("Most important filters:", np.argsort(importance)[::-1][:10])
Most important filters: [31  3  5 11  2  6  9 27  7 12]
In [36]:
import numpy as np

# compute max activation per sequence per filter
max_acts = conv_out.max(axis=1)  # shape (N_test, n_filters)

# compute correlation between each filter and label
from scipy.stats import pearsonr

correlations = []
for f in range(max_acts.shape[1]):
    r, _ = pearsonr(max_acts[:, f], ytest.ravel())
    correlations.append(abs(r))

correlations = np.array(correlations)

best_filter = np.argmax(correlations)
print("Best predictive filter:", best_filter)
print("Correlation:", correlations[best_filter])
Best predictive filter: 27
Correlation: 0.86943315562593
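Pearson correlation with a binary label is one proxy for per-filter predictiveness; per-filter AUROC is another that is insensitive to activation scale. A standalone rank-based sketch (pure Python; the scores and labels below are toy values, not the notebook's data):

```python
def auroc(scores, labels):
    # probability that a random positive outscores a random negative (ties = 0.5)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# a filter whose max activation separates the classes perfectly scores 1.0
print(auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
# an uninformative filter hovers around 0.5
print(auroc([0.5, 0.1, 0.5, 0.1], [1, 0, 0, 1]))  # 0.5
```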
In [37]:
pos_mask = (ytest.ravel() == 1)
pos_conv = conv_out[pos_mask]
pos_seqs = [seqtest[i] for i in range(len(seqtest)) if pos_mask[i]]

kmers = top_kmers_for_filter(best_filter, pos_seqs, pos_conv, k=50)
In [38]:
import numpy as np
import pandas as pd
import logomaker
import matplotlib.pyplot as plt

def kmers_to_pwm(kmers):
    L = len(kmers[0])
    counts = np.ones((L, 4)) * 1e-3  # small pseudocount
    base_to_idx = {'A':0, 'C':1, 'G':2, 'T':3}
    
    for k in kmers:
        for i, ch in enumerate(k):
            if ch in base_to_idx:
                counts[i, base_to_idx[ch]] += 1
    
    pwm = counts / counts.sum(axis=1, keepdims=True)
    return pwm

pwm = kmers_to_pwm(kmers)
pwm_df = pd.DataFrame(pwm, columns=list("ACGT"))
fig, ax = plt.subplots(figsize=(8, 3))
logo = logomaker.Logo(pwm_df, ax=ax)
ax.set_title(f"Filter {best_filter} PWM logo (top {len(kmers)} kmers)")
ax.set_xlabel("Position")
ax.set_ylabel("Probability")
plt.tight_layout()
plt.show()
[figure: PWM logo for the best predictive filter]
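The logos above put raw probabilities on the y-axis; published motif logos usually show information content (bits) instead, which flattens uninformative positions. The conversion is sketched below in pure Python (logomaker also offers `logomaker.transform_matrix(pwm_df, from_type='probability', to_type='information')` to do this on the DataFrame directly).

```python
import math

def to_information(row, n_bases=4):
    # column IC = log2(n_bases) - Shannon entropy; letter height = p * IC
    entropy = -sum(p * math.log2(p) for p in row if p > 0)
    ic = math.log2(n_bases) - entropy
    return [p * ic for p in row]

# toy PWM rows: one conserved position (~A) and one uniform position
pwm = [[0.97, 0.01, 0.01, 0.01],
       [0.25, 0.25, 0.25, 0.25]]

for row in pwm:
    print([round(v, 3) for v in to_information(row)])
# the conserved row carries ~1.76 bits in total; the uniform row collapses to 0
```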