CNN for Genomic Sequence Analysis¶
What this notebook does
- Demonstrates a DeepBind-style 1D CNN on synthetic DNA data.
- Implements motif extraction: finds top-activating k-mers for convolutional filters, builds PWMs, and plots sequence logos using logomaker.
How to run: install TensorFlow (plus logomaker, scikit-learn, biopython, tqdm, and matplotlib) into the active environment, then run the cells in order.
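Setup could be sketched as below; the package list is inferred from the imports used later in this notebook, not from an official requirements file, so adjust pins and names to your environment:

```shell
# Hypothetical environment setup for this notebook (package list inferred
# from the imports in the cells below).
pip install tensorflow logomaker biopython scikit-learn matplotlib tqdm requests
# Optional, Apple Silicon only (the run shown below uses the Metal plugin):
# pip install tensorflow-metal
```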
In [39]:
import sys
print(sys.executable)
/opt/anaconda3/envs/genomics-cnn/bin/python
In [40]:
import tensorflow as tf
print(tf.__version__)
2.16.2
In [3]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, callbacks
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import logomaker
from Bio import SeqIO
import requests
from tqdm import tqdm
print('TensorFlow version:', tf.__version__)
print('Logomaker version present')
TensorFlow version: 2.16.2
Logomaker version present
In [4]:
# One-hot encoding utilities and synthetic dataset
BASES = 'ACGT'
base_to_idx = {b:i for i,b in enumerate(BASES)}
def one_hot_encode(seqs, L=None):
if L is None: L = len(seqs[0])
arr = np.zeros((len(seqs), L, 4), dtype=np.float32)
for i, s in enumerate(seqs):
for j, ch in enumerate(s):
if ch in base_to_idx:
arr[i,j,base_to_idx[ch]] = 1.0
return arr
import random
def random_sequence(L):
return ''.join(np.random.choice(list(BASES), L))
def implant_motif(seq, motif, pos=None):
L = len(seq)
m = len(motif)
if pos is None:
pos = np.random.randint(0, L-m+1)
return seq[:pos] + motif + seq[pos+m:]
# Synthetic dataset generator (positive = sequence contains motif)
def generate_synthetic_dataset(n=2000, L=101, motif='ACGTG'):
seqs = [random_sequence(L) for _ in range(n)]
labels = np.zeros((n,1), dtype=np.float32)
for i in range(n):
if np.random.rand() < 0.5:
seqs[i] = implant_motif(seqs[i], motif)
labels[i,0] = 1.0
X = one_hot_encode(seqs, L)
return X, labels, seqs
# Quick smoke test
X, y, raw = generate_synthetic_dataset(n=200, L=101, motif='ACGTG')
print('X shape, y shape:', X.shape, y.shape)
print('Example sequence:', raw[0])
X shape, y shape: (200, 101, 4) (200, 1)
Example sequence: GGATAGTGGGCTCCTCCTTAACAAAGCATAGTCGTAAAATGAGCCCAGATTCAGAGACGTGTACTCACATTAAACGTAACATCCGTCGCCCCGAACCTCCG
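As a quick sanity check of the encoding, the sketch below re-implements `one_hot_encode` minimally (mirroring the cell above) and verifies that each position becomes a one-hot row and that unknown characters such as `N` stay all-zero:

```python
import numpy as np

BASES = 'ACGT'
base_to_idx = {b: i for i, b in enumerate(BASES)}

def one_hot_encode(seqs, L=None):
    # Returns an (N, L, 4) array; characters outside ACGT (e.g. 'N') stay all-zero
    if L is None:
        L = len(seqs[0])
    arr = np.zeros((len(seqs), L, 4), dtype=np.float32)
    for i, s in enumerate(seqs):
        for j, ch in enumerate(s):
            if ch in base_to_idx:
                arr[i, j, base_to_idx[ch]] = 1.0
    return arr

X = one_hot_encode(['ACGTN'])
assert X.shape == (1, 5, 4)
assert X[0, 0].tolist() == [1.0, 0.0, 0.0, 0.0]  # 'A' -> first channel
assert X[0, 4].sum() == 0.0                      # 'N' -> all zeros
print('encoding OK')
```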
In [5]:
# Build DeepBind-style model
from tensorflow.keras import layers, models
def make_deepbind_model(L=101, n_filters=32, filter_len=9):
inp = layers.Input(shape=(L,4))
x = layers.Conv1D(filters=n_filters, kernel_size=filter_len, activation='relu', name='conv1')(inp)
x = layers.GlobalMaxPooling1D(name='gmp')(x)
x = layers.Dense(32, activation='relu', name='dense1')(x)
out = layers.Dense(1, activation='sigmoid', name='out')(x)
model = models.Model(inputs=inp, outputs=out)
model.compile(optimizer=optimizers.Adam(1e-3), loss='binary_crossentropy', metrics=['accuracy'])
return model
model = make_deepbind_model(L=101, n_filters=32, filter_len=9)
model.summary()
2026-02-11 12:56:38.373774: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1
2026-02-11 12:56:38.373837: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 16.00 GB
2026-02-11 12:56:38.373844: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 5.33 GB
2026-02-11 12:56:38.373915: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2026-02-11 12:56:38.373941: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Model: "functional"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Layer (type)                 ┃ Output Shape       ┃   Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ input_layer (InputLayer)     │ (None, 101, 4)     │         0 │
├──────────────────────────────┼────────────────────┼───────────┤
│ conv1 (Conv1D)               │ (None, 93, 32)     │     1,184 │
├──────────────────────────────┼────────────────────┼───────────┤
│ gmp (GlobalMaxPooling1D)     │ (None, 32)         │         0 │
├──────────────────────────────┼────────────────────┼───────────┤
│ dense1 (Dense)               │ (None, 32)         │     1,056 │
├──────────────────────────────┼────────────────────┼───────────┤
│ out (Dense)                  │ (None, 1)          │        33 │
└──────────────────────────────┴────────────────────┴───────────┘
Total params: 2,273 (8.88 KB)
Trainable params: 2,273 (8.88 KB)
Non-trainable params: 0 (0.00 B)
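The `(None, 93, 32)` conv output in the summary follows from valid (unpadded) 1D convolution: a length-L input and kernel size k leave L - k + 1 = 101 - 9 + 1 = 93 positions. A minimal check with NumPy (whose `valid`-mode cross-correlation matches the sliding-window behavior of `Conv1D` with `padding='valid'`):

```python
import numpy as np

# Valid (unpadded) sliding window: output length is L - k + 1
L, k = 101, 9
x = np.random.rand(L)
w = np.random.rand(k)
out = np.correlate(x, w, mode='valid')
assert out.shape[0] == L - k + 1 == 93
print(out.shape[0])
```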
32 filters¶
Why 32 filters? Each filter can learn a different sequence pattern. On real data, individual filters may detect:
- Core TF motif
- Secondary motifs
- GC-rich regions
- Co-factor motifs
- Noise
Dense layer¶
A Dense layer computes: $$\text{output} = Wx + b$$
- $x$ = the 32 pooled motif scores
- $W$ = a 32 × 32 weight matrix
- $b$ = 32 biases
So dense1 learns how to combine motif signals. Adding dense1 with a ReLU makes the model:
- Nonlinear
- Able to model motif interactions
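A toy NumPy sketch of that computation (the weights here are made up purely for illustration): one hidden unit responds only when two motif scores are jointly high, and the ReLU clips a suppressive combination to zero.

```python
import numpy as np

# Toy illustration with made-up weights: dense1 combines pooled motif scores.
x = np.array([2.0, 0.5, 0.0])         # pooled scores for 3 hypothetical filters
W = np.array([[ 1.0, 1.0, 0.0],       # unit 0: responds to motif A AND motif B
              [-1.0, 0.0, 1.0]])      # unit 1: motif C suppressed by motif A
b = np.array([-1.5, 0.0])
h = np.maximum(0.0, W @ x + b)        # ReLU(Wx + b)
assert h[0] == 1.0                    # A + B together clear the -1.5 threshold
assert h[1] == 0.0                    # suppression clipped to zero by ReLU
print(h)
```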
Biological interpretation¶
- Conv layer:
Detect motifs individually
- Pooling:
Get strongest occurrence of each motif
- Dense layer:
Learn interactions between motifs
This layer can learn things like:
(1) Motif A + Motif B together = strong binding
(2) Motif C suppresses effect of Motif D
(3) Certain motifs matter more than others
So it models combinatorial regulation.
In [17]:
# Train on synthetic data (small example)
X, y, seqs = generate_synthetic_dataset(n=3000, L=101, motif='ACGTG')
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42)
model = make_deepbind_model(L=101, n_filters=32, filter_len=9)
es = callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
history = model.fit(Xtrain, ytrain, validation_split=0.1, epochs=30, batch_size=64, callbacks=[es])
# Evaluate
yp = model.predict(Xtest).ravel()
print('AUROC (synthetic):', roc_auc_score(ytest, yp))
# Plot a small training curve
import matplotlib.pyplot as plt
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.legend()
plt.xlabel('epoch')
plt.show()
Training log (condensed, 34 batches/epoch): loss falls from 0.755 (epoch 1, val_loss 0.6810) to 0.151 (epoch 19); val_loss reaches its minimum of 0.2736 at epoch 14, and early stopping (patience=5) halts training after epoch 19 and restores the best weights.
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
AUROC (synthetic): 0.9510327892532139
In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import logomaker
from tqdm import tqdm
import tensorflow as tf
# get conv layer and weights
conv_layer = model.get_layer('conv1')
conv_w = conv_layer.get_weights()[0] # shape (filter_len, 4, n_filters)
filter_len = conv_w.shape[0]
n_filters = conv_w.shape[2]
print('conv filters shape:', conv_w.shape) # expect (9, 4, 32) for your model
# build a small model that outputs conv activations
conv_model = tf.keras.Model(inputs=model.input, outputs=conv_layer.output)
# compute conv activations for Xtest
# choose a reasonable batch_size depending on RAM; 128 is usually fine
conv_out = conv_model.predict(Xtest, batch_size=128)
# conv_out shape: (N, positions, n_filters)
print('conv_out shape:', conv_out.shape)
# helper: get top activating k-mers for a filter
def top_kmers_for_filter(filter_idx, raw_seqs, conv_out, k=200):
# conv_out: (N, P, n_filters)
activations = conv_out[:, :, filter_idx] # (N, P)
N, P = activations.shape
flat_idx = np.argsort(activations.ravel())[::-1] # descending
kmers = []
seen = 0
for idx in flat_idx:
seq_idx = idx // P
pos = idx % P
score = activations[seq_idx, pos]
# stop at non-positive activations (ReLU was used in conv)
if score <= 0:
break
kmer = raw_seqs[seq_idx][pos:pos+filter_len]
kmers.append(kmer)
seen += 1
if seen >= k:
break
return kmers
# sanity check your variables:
print('Xtest.shape =', Xtest.shape)
print('Number of raw seqs available =', len(seqs))
# Example: extract top kmers for filter 0
# NOTE: `seqs` holds all 3000 sequences in their original order, while conv_out
# was computed on the shuffled Xtest split, so the rows here are misaligned and
# the resulting PWM comes out near-uniform. The In [30] cell below re-splits
# with `seqs` included so that sequences and activations line up.
filter_idx = 0
kmers = top_kmers_for_filter(filter_idx, seqs, conv_out, k=200)
print('Top kmers example (filter %d):' % filter_idx, kmers[:10])
print('Top kmers example (filter %d):' % filter_idx, kmers[:10])
# Build PWM from kmers (counts -> probabilities)
def kmers_to_pwm(kmers):
if len(kmers) == 0:
raise ValueError('No kmers found for this filter.')
L = len(kmers[0])
counts = np.ones((L,4), dtype=float) * 1e-3 # pseudocount
base_to_idx_local = {'A':0,'C':1,'G':2,'T':3}
for k in kmers:
for i, ch in enumerate(k):
counts[i, base_to_idx_local.get(ch,0)] += 1
pwm = counts / counts.sum(axis=1, keepdims=True)
return pwm
pwm = kmers_to_pwm(kmers)
pwm_df = pd.DataFrame(pwm, columns=list('ACGT'))
print(pwm_df.head())
# Plot sequence logo using logomaker; pass an Axes to Logo so it doesn't open
# a second, empty figure
fig, ax = plt.subplots(figsize=(max(6, filter_len/1.5), 3))
logo = logomaker.Logo(pwm_df, ax=ax)
ax.set_title(f'Filter {filter_idx} PWM logo (top {len(kmers)} kmers)')
fig.savefig("figure.png")
plt.show()
conv filters shape: (9, 4, 32)
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
conv_out shape: (600, 93, 32)
Xtest.shape = (600, 101, 4)
Number of raw seqs available = 3000
Top kmers example (filter 0): ['ACTAACCGG', 'AACCACGGG', 'GCATGACTC', 'GGCACGATG', 'CGACAAGGA', 'GGCGGCTTT', 'GAACGTGCA', 'GGAATTGTT', 'GGACCTGAC', 'CACCCGCTC']
          A         C         G         T
0  0.205001  0.245000  0.250000  0.299999
1  0.225000  0.265000  0.265000  0.245000
2  0.260000  0.255000  0.255000  0.230000
3  0.240000  0.309999  0.255000  0.195001
4  0.289999  0.284999  0.190001  0.235000
In [19]:
# Find the strongest filter
# Compute average max activation per filter on positives
pos_idx = (ytest.ravel() == 1)
pos_conv = conv_out[pos_idx] # only positives
# For each filter, compute mean max activation
mean_scores = pos_conv.max(axis=1).mean(axis=0)
print(mean_scores)
print("Best filter:", np.argmax(mean_scores))
[0.48183775 0.29306132 0.80339956 0.97591794 0.832431   0.41517872
 1.2068839  0.646036   0.29869142 0.01184291 1.0127937  0.5475508
 0.8214295  0.5333251  0.6611986  0.53398556 1.054617   1.0388899
 0.08910651 0.90396106 0.71023184 0.7482838  0.6017198  1.1549921
 0.6085278  1.0348054  0.4923234  0.7413878  0.8477556  0.5728764
 0.29544333 0.858427  ]
Best filter: 6
In [30]:
# ---- extract motifs from positives ----
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd
import logomaker
# re-split including seqs to get seqs_test aligned (use same random_state as training)
Xtrain, Xtest, ytrain, ytest, seqtrain, seqtest = train_test_split(X, y, seqs, test_size=0.2, random_state=42)
# If conv_out wasn't computed for this Xtest, (re)compute it:
conv_model = tf.keras.Model(inputs=model.input, outputs=model.get_layer('conv1').output)
conv_out = conv_model.predict(Xtest, batch_size=128) # shape (N_test, P, n_filters)
print("Xtest.shape:", Xtest.shape, "conv_out.shape:", conv_out.shape)
print("len(seqtest):", len(seqtest), "len(ytest):", len(ytest))
# restrict to positives
pos_mask = (ytest.ravel() == 1)
pos_conv = conv_out[pos_mask] # shape (N_pos, P, n_filters)
pos_seqs = [seqtest[i] for i in range(len(seqtest)) if pos_mask[i]]
print("Positives:", pos_conv.shape[0], "windows per seq:", pos_conv.shape[1])
# rank filters by mean (or mean of max) activation on positives
# mean of per-sequence max activation (good for finding motif detectors)
mean_scores = pos_conv.max(axis=1).mean(axis=0) # shape (n_filters,)
best_filters = np.argsort(mean_scores)[::-1] # descending order
print("Top filters by mean(max activation):", best_filters[:6])
best_filter = best_filters[0]
print("Best filter:", best_filter, "score:", mean_scores[best_filter])
# helper to extract top kmers for a filter (works with conv_out that aligns with raw_seqs)
def top_kmers_for_filter(filter_idx, raw_seqs, conv_out_for_seqs, k=200):
activations = conv_out_for_seqs[:, :, filter_idx] # (N, P)
N, P = activations.shape
flat_idx = np.argsort(activations.ravel())[::-1]
kmers = []
seen = 0
for idx in flat_idx:
seq_idx = idx // P
pos = idx % P
score = activations[seq_idx, pos]
if score <= 0:
break
kmer = raw_seqs[seq_idx][pos:pos+filter_len]
kmers.append(kmer)
seen += 1
if seen >= k:
break
return kmers
# extract kmers from positives only for top few filters and plot logos
n_show = 3
for rank in range(n_show):
f = best_filters[rank]
kmers = top_kmers_for_filter(f, pos_seqs, pos_conv, k=200)
if len(kmers) == 0:
print(f"Filter {f}: no positive kmers with positive activation found.")
continue
# Build PWM
def kmers_to_pwm(kmers):
L = len(kmers[0])
counts = np.ones((L,4), dtype=float) * 1e-3
base_to_idx_local = {'A':0,'C':1,'G':2,'T':3}
for kmer in kmers:
for i,ch in enumerate(kmer):
counts[i, base_to_idx_local.get(ch,0)] += 1
pwm = counts / counts.sum(axis=1, keepdims=True)
return pwm
pwm = kmers_to_pwm(kmers)
pwm_df = pd.DataFrame(pwm, columns=list('ACGT'))
# plot
# plot; pass an Axes to Logo so it doesn't open a second, empty figure
fig, ax = plt.subplots(figsize=(max(6, filter_len/1.2), 2.5))
logomaker.Logo(pwm_df, ax=ax)
ax.set_title(f'Filter {f} PWM logo (top {len(kmers)} kmers)')
fig.tight_layout()
fig.savefig(f"figure{rank}.png")
plt.show()
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
Xtest.shape: (600, 101, 4) conv_out.shape: (600, 93, 32)
len(seqtest): 600 len(ytest): 600
Positives: 299 windows per seq: 93
Top filters by mean(max activation): [ 6 23 16 17 25 10]
Best filter: 6 score: 1.2068839
In [31]:
print("AUROC:", roc_auc_score(ytest, yp))
AUROC: 0.9510327892532139
In [33]:
f = 6
max_act = conv_out[:,:,f].max(axis=1)
import scipy.stats as stats
print(stats.pearsonr(max_act, ytest.ravel()))
PearsonRResult(statistic=0.0010995743041838368, pvalue=0.9785571988475986)
In [34]:
f = 23
max_act = conv_out[:,:,f].max(axis=1)
import scipy.stats as stats
print(stats.pearsonr(max_act, ytest.ravel()))
PearsonRResult(statistic=-0.010052912043016407, pvalue=0.8058852430996971)
In [35]:
# compute importance of each filter to the final prediction
# get weights from the dense1 and output layers
dense_weights = model.get_layer('dense1').get_weights()[0] # shape (32, 32): conv filters -> dense1 units
out_weights = model.get_layer('out').get_weights()[0] # shape (32, 1): dense1 units -> output
# NOTE: |out_weights| ranks dense1 units, not conv filters directly; a rough
# filter-level score would chain both layers: np.abs(dense_weights @ out_weights).ravel()
importance = np.abs(out_weights[:,0])
print("Most important filters:", np.argsort(importance)[::-1][:10])
Most important filters: [31 3 5 11 2 6 9 27 7 12]
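The cell above ranks dense1 units by the absolute output-layer weight; to attribute importance back to the conv filters themselves, one rough heuristic is to chain both linear maps. A sketch with random toy weights in the model's shapes (this ignores ReLU gating, so treat the ranking as approximate):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy weights in the model's shapes: dense1 (32 filters -> 32 units),
# out (32 units -> 1 logit)
dense_w = rng.normal(size=(32, 32))
out_w = rng.normal(size=(32, 1))

# Rough linear attribution per conv filter: |W_dense @ w_out|
# (ignores the ReLU between the layers, so it's a heuristic ranking only)
filter_importance = np.abs(dense_w @ out_w).ravel()
assert filter_importance.shape == (32,)
top = np.argsort(filter_importance)[::-1][:5]
print("Top filters (toy weights):", top)
```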
In [36]:
import numpy as np
# compute max activation per sequence per filter
max_acts = conv_out.max(axis=1) # shape (N_test, n_filters)
# compute correlation between each filter and label
from scipy.stats import pearsonr
correlations = []
for f in range(max_acts.shape[1]):
r, _ = pearsonr(max_acts[:, f], ytest.ravel())
correlations.append(abs(r))
correlations = np.array(correlations)
best_filter = np.argmax(correlations)
print("Best predictive filter:", best_filter)
print("Correlation:", correlations[best_filter])
Best predictive filter: 27
Correlation: 0.86943315562593
In [37]:
pos_mask = (ytest.ravel() == 1)
pos_conv = conv_out[pos_mask]
pos_seqs = [seqtest[i] for i in range(len(seqtest)) if pos_mask[i]]
kmers = top_kmers_for_filter(best_filter, pos_seqs, pos_conv, k=50)
In [38]:
import numpy as np
import pandas as pd
import logomaker
import matplotlib.pyplot as plt
def kmers_to_pwm(kmers):
L = len(kmers[0])
counts = np.ones((L, 4)) * 1e-3 # small pseudocount
base_to_idx = {'A':0, 'C':1, 'G':2, 'T':3}
for k in kmers:
for i, ch in enumerate(k):
if ch in base_to_idx:
counts[i, base_to_idx[ch]] += 1
pwm = counts / counts.sum(axis=1, keepdims=True)
return pwm
pwm = kmers_to_pwm(kmers)
pwm_df = pd.DataFrame(pwm, columns=list("ACGT"))
# pass an Axes to Logo so it doesn't open a second, empty figure
fig, ax = plt.subplots(figsize=(8, 3))
logo = logomaker.Logo(pwm_df, ax=ax)
ax.set_title(f"Filter {best_filter} PWM logo (top {len(kmers)} kmers)")
ax.set_xlabel("Position")
ax.set_ylabel("Probability")
fig.tight_layout()
plt.show()
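Motif logos are often drawn in bits rather than probabilities, which makes conserved positions tall and uninformative positions flat. The sketch below computes per-position information content for a DNA PWM with plain NumPy; logomaker can do the same conversion via `transform_matrix(df, from_type='probability', to_type='information')` before plotting.

```python
import numpy as np

# Per-position information content (bits) for a DNA PWM:
# IC(i) = 2 + sum_b p_ib * log2(p_ib), with a maximum of 2 bits for 4 bases
pwm = np.array([
    [0.97, 0.01, 0.01, 0.01],   # near-certain 'A' -> close to 2 bits
    [0.25, 0.25, 0.25, 0.25],   # uniform background -> 0 bits
])
ic = 2.0 + (pwm * np.log2(pwm)).sum(axis=1)
assert abs(ic[1]) < 1e-9        # uniform position carries no information
assert 1.7 < ic[0] < 2.0        # conserved position is close to the 2-bit max
print(ic)
```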