RNA-seq Expression Homework — Treatment Shift Version¶
RNA-seq Expression Homework (Treatment) — Instructor Solutions ** Author: John Doe
In this assignment, we will
(1) Simulate gene expression data
(2) Implement function to compute gene-level summaries
(3) Filter genes, and normalize expression matrices.
(4) Perform differetial gene analysis
(5) Use volcano plot to inspect the result.
Do not change function names or arguments¶
How to use:
- Run the simulation cell to create an example dataset (
df_expr,df_meta) with a subset of genes shifted in the Treatment group. - Implement the functions in the cells marked
TODO.
Learning Objectives¶
- Practice NumPy and pandas operations
- Filter low-quality features
- Detect genes with treatment-associated mean shifts
Sections
- Simulated data with treatment effect (run to get
df_expranddf_meta) - Implementations (fill in the
TODOcells) - Exploration & tests (run these after implementing)
Part I: Simulate RNA-seq expression dataset with teatment-specific mean shifts¶
# Simulate an RNA-seq expression dataset with treatment-specific mean shifts
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def simulate_dataset(n_samples=60, n_genes=100, n_shift=10, shift_size=3.0, noise_sd=0.2, missing_p=0.0, seed=None):
rng = np.random.RandomState(seed)
samples = [f"S{i+1:02d}" for i in range(n_samples)]
genes = [f"Gene{g+1:05d}" for g in range(n_genes)]
conditions = ["Control"]*(n_samples//2) + ["Treatment"]*(n_samples - n_samples//2)
df_meta = pd.DataFrame({"sample": samples, "condition": conditions}).set_index("sample")
baseline_log = rng.normal(loc=1.5, scale=0.6, size=n_genes)
per_sample_noise = rng.normal(0, noise_sd, size=(n_samples, n_genes))
expr = np.exp(baseline_log + per_sample_noise)
# shifted genes
shift_idx = rng.choice(n_genes, size=n_shift, replace=False)
treatment_mask = np.array(df_meta["condition"] == "Treatment")
expr[np.ix_(treatment_mask, shift_idx)] *= np.exp(shift_size)
# missingness
if missing_p > 0:
mask = rng.rand(n_samples, n_genes) < missing_p
expr[mask] = np.nan
df_expr = pd.DataFrame(expr, index=samples, columns=genes)
true_genes = np.array(genes)[shift_idx].tolist()
return df_expr, df_meta, true_genes
simData=simulate_dataset(
n_samples=100, n_genes=1000, n_shift=10,
shift_size=4, noise_sd=0.2,
missing_p=0.05, seed=44
)
df_expr=simData[0]
df_meta=simData[1]
true_genes=simData[2]
df_expr.head()
| Gene00001 | Gene00002 | Gene00003 | Gene00004 | Gene00005 | Gene00006 | Gene00007 | Gene00008 | Gene00009 | Gene00010 | ... | Gene00991 | Gene00992 | Gene00993 | Gene00994 | Gene00995 | Gene00996 | Gene00997 | Gene00998 | Gene00999 | Gene01000 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S01 | 2.703283 | 13.925822 | 8.355345 | 2.074732 | 1.812259 | 1.667333 | 12.409141 | 5.019294 | 4.089832 | 8.536002 | ... | 1.565173 | 3.476716 | 5.094963 | 2.314342 | 5.183206 | 14.306271 | 2.359026 | 4.629968 | 3.497944 | 1.437714 |
| S02 | 2.949598 | 11.232153 | 13.433076 | 1.189799 | 1.979275 | 1.579362 | 13.044639 | 2.338486 | 3.986011 | 5.690107 | ... | NaN | NaN | 4.486724 | 2.591446 | 3.469119 | 11.405596 | 2.140216 | 4.702188 | 3.271724 | 2.471256 |
| S03 | 2.466550 | 12.327434 | 7.087745 | 1.596691 | 1.836656 | 1.453915 | 18.294748 | 4.847640 | 3.509057 | 6.524596 | ... | 2.570030 | 1.807175 | 4.083114 | NaN | 3.105927 | 12.160771 | NaN | 4.152616 | 3.307645 | 1.607030 |
| S04 | 4.380852 | 11.139415 | 10.147750 | 1.459345 | 1.414977 | 1.367162 | 12.137341 | 5.354844 | 6.099626 | 4.729234 | ... | 2.121303 | 3.450361 | 6.553294 | 4.162107 | 2.856044 | 15.102878 | NaN | NaN | 3.355429 | 1.463993 |
| S05 | 3.188087 | 9.918421 | 8.002543 | 1.912038 | 1.280222 | 1.185378 | 15.780576 | 4.396150 | 3.322813 | 6.121251 | ... | 2.551916 | 3.400628 | 4.730747 | 2.872398 | 3.376408 | 11.082408 | 2.853187 | 4.947910 | 3.224200 | 1.623399 |
5 rows × 1000 columns
true_genes
['Gene00963', 'Gene00755', 'Gene00883', 'Gene00639', 'Gene00826', 'Gene00486', 'Gene00900', 'Gene00641', 'Gene00228', 'Gene00331']
df_meta.iloc[25:35]
| condition | |
|---|---|
| sample | |
| S26 | Control |
| S27 | Control |
| S28 | Control |
| S29 | Control |
| S30 | Control |
| S31 | Control |
| S32 | Control |
| S33 | Control |
| S34 | Control |
| S35 | Control |
Questions¶
- Explain how the treatment shift works in the code above? Add code below to find out what are the shifted genes?
- Save the index of the shifted gene into a csv file (ShiftedGenes.csv) in a local folder. We will use this file later.
Answer for Q1¶
The simulation code multiplies the expression values for the selected genes in Treatment samples by np.exp(shift_size), i.e. it applies a multiplicative fold-change of exp(shift_size) only for those genes and only in Treatment samples (before any missingness is applied). Because the underlying values are generated from expr = np.exp(baseline_log + noise), multiplying by np.exp(shift_size) is equivalent to adding shift_size on the log-scale.
## Answer for Q2 code
df_true = pd.DataFrame({"shifted_gene": true_genes})
df_true.to_csv("true_shifted_genes.csv", index=False)
print("Saved true shifted genes to 'true_shifted_genes.csv'")
Saved true shifted genes to 'true_shifted_genes.csv'
Part 2 — Gene filtering¶
Implement filter_genes(df_expr, max_missing=0.8).
Return a tuple (df_filtered, gene_summary_df) where gene_summary_df contains:
mean_expressionmissingness(fraction missing)pass_filters(boolean)
def filter_genes(df_expr, max_missing=0.8):
"""
Filter genes by mean expression and missingness.
Returns (df_filtered, gene_summary_df)
"""
mean_expr = df_expr.mean(axis=0)
missingness = df_expr.isna().mean(axis=0)
pass_filters = (missingness <= max_missing)
summary = pd.DataFrame({
"mean_expression": mean_expr,
"missingness": missingness,
"pass_filters": pass_filters
})
df_filtered = df_expr.loc[:, pass_filters.values]
return df_filtered, summary
df_expr_filtered=filter_genes(df_expr, max_missing=0.8)[0]
df_expr_filtered.shape
(100, 1000)
df_summ=filter_genes(df_expr, max_missing=0.8)[1]
df_summ.head()
| mean_expression | missingness | pass_filters | |
|---|---|---|---|
| Gene00001 | 2.965357 | 0.03 | True |
| Gene00002 | 9.982291 | 0.03 | True |
| Gene00003 | 9.490975 | 0.06 | True |
| Gene00004 | 1.778047 | 0.08 | True |
| Gene00005 | 1.923846 | 0.08 | True |
Part 3 — log2 normalization¶
Implement log2_normalize_expression(df_expr):
Steps:
1. For each gene, compute mean ignoring NaN
Return a DataFrame of same shape as input.
import numpy as np
import pandas as pd
def log2_normalize_expression(df_expr, pseudocount=1.0):
"""
Log2-normalize gene expression counts.
Parameters
----------
df_expr : pd.DataFrame
Expression matrix (samples x genes or genes x samples)
pseudocount : float
Value added to counts to avoid log(0)
Returns
-------
pd.DataFrame
Log2-transformed expression matrix of same shape as input
"""
return np.log2(df_expr + pseudocount)
df_expr_norm=log2_normalize_expression(df_expr_filtered)
df_expr_norm.head()
| Gene00001 | Gene00002 | Gene00003 | Gene00004 | Gene00005 | Gene00006 | Gene00007 | Gene00008 | Gene00009 | Gene00010 | ... | Gene00991 | Gene00992 | Gene00993 | Gene00994 | Gene00995 | Gene00996 | Gene00997 | Gene00998 | Gene00999 | Gene01000 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S01 | 1.888805 | 3.899738 | 3.225791 | 1.620461 | 1.491729 | 1.415398 | 3.745145 | 2.589594 | 2.347618 | 3.253385 | ... | 1.359056 | 2.162441 | 2.607617 | 1.728722 | 2.628355 | 3.936051 | 1.748043 | 2.493127 | 2.169266 | 1.285529 |
| S02 | 1.981706 | 3.612606 | 3.851307 | 1.130798 | 1.574961 | 1.367015 | 3.811948 | 1.739194 | 2.317886 | 2.742029 | ... | NaN | NaN | 2.455945 | 1.844565 | 2.159990 | 3.632919 | 1.650864 | 2.511516 | 2.094818 | 1.795458 |
| S03 | 1.793500 | 3.736327 | 3.015738 | 1.376674 | 1.504191 | 1.295085 | 4.270136 | 2.547854 | 2.172826 | 2.911614 | ... | 1.835936 | 1.489119 | 2.345713 | NaN | 2.037708 | 3.718172 | NaN | 2.365305 | 2.106899 | 1.382407 |
| S04 | 2.427835 | 3.601627 | 3.478681 | 1.298274 | 1.272009 | 1.243159 | 3.715601 | 2.667857 | 2.827743 | 2.518342 | ... | 1.642148 | 2.153922 | 2.917106 | 2.367960 | 1.947122 | 4.009247 | NaN | NaN | 2.122815 | 1.300998 |
| S05 | 2.066291 | 3.448692 | 3.170333 | 1.542029 | 1.189174 | 1.127883 | 4.068720 | 2.431930 | 2.111970 | 2.832131 | ... | 1.828597 | 2.137709 | 2.518723 | 1.953227 | 2.129747 | 3.594836 | 1.946052 | 2.572383 | 2.078678 | 1.391437 |
5 rows × 1000 columns
Part 4 — Treatment effect exploration¶
Using the filtered and standardized data, explore genes with treatment-associated mean shifts.
Tasks¶
- Compute group means per gene for Control and Treatment:
mean_control = ...
mean_treatment = ...
Compute the difference
delta = mean_treatment - mean_controlfor each gene.Compute the t-test statistics and p value for each gene.
Report the top 20 genes by p values.
Create Volcano plot
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests
def group_mean_differences(df_expr, df_meta, group_col="group",
control_label="control", treatment_label="treatment"):
"""
Compute per-gene group means, delta, t-test statistics, and p-values.
Assumes df_expr is shaped: samples x genes (rows = samples, columns = genes)
and df_meta is indexed by sample IDs (matching df_expr.index).
Returns a DataFrame indexed by gene with columns:
mean_control, mean_treatment, delta, t_stat, p_value, p_adj
"""
# --- align metadata to expression rows (samples) ---
# Reindex rather than .loc to allow detecting missing samples
df_meta = df_meta.reindex(df_expr.index)
if group_col not in df_meta.columns:
raise KeyError(f"Metadata does not contain column '{group_col}'")
# detect missing metadata rows
if df_meta[group_col].isna().any():
missing = df_meta[df_meta[group_col].isna()].index.tolist()
raise ValueError(f"Missing group labels for samples: {missing}")
# make labels case-insensitive strings
groups = df_meta[group_col].astype(str).str.lower()
control_label = control_label.lower()
treatment_label = treatment_label.lower()
if control_label not in groups.unique() or treatment_label not in groups.unique():
raise ValueError(f"Expected group labels '{control_label}' and '{treatment_label}' "
f"in metadata column '{group_col}'. Found: {sorted(groups.unique())}")
control_samples = groups[groups == control_label].index.tolist()
treatment_samples = groups[groups == treatment_label].index.tolist()
if len(control_samples) == 0 or len(treatment_samples) == 0:
raise ValueError("One of the groups has zero samples after filtering.")
# --- vectorized means per gene (columns are genes) ---
mean_control = df_expr.loc[control_samples].mean(axis=0) # Series indexed by gene
mean_treatment = df_expr.loc[treatment_samples].mean(axis=0)
delta = mean_treatment - mean_control
# --- per-gene Welch t-tests (loop over genes / columns) ---
t_stats = []
p_vals = []
# iterate over genes (columns)
for gene in df_expr.columns:
ctrl_vals = df_expr.loc[control_samples, gene].values
trt_vals = df_expr.loc[treatment_samples, gene].values
# small-sample NaN guard: if all-NaN in either group, produce p=1 and t=nan
if (np.all(np.isnan(ctrl_vals))) or (np.all(np.isnan(trt_vals))):
t_stats.append(np.nan)
p_vals.append(1.0)
continue
t_stat, p_val = ttest_ind(trt_vals, ctrl_vals, equal_var=False, nan_policy="omit")
if np.isnan(p_val):
p_val = 1.0
t_stats.append(t_stat)
p_vals.append(p_val)
results = pd.DataFrame({
"mean_control": mean_control,
"mean_treatment": mean_treatment,
"delta": delta,
"t_stat": t_stats,
"p_value": p_vals
}, index=df_expr.columns) # index = genes
# BH FDR correction, safe for any NaNs replaced by 1 above
results["p_adj"] = multipletests(results["p_value"], method="fdr_bh")[1]
return results
result=group_mean_differences(df_expr_norm, df_meta, group_col="condition")
result.head()
| mean_control | mean_treatment | delta | t_stat | p_value | p_adj | |
|---|---|---|---|---|---|---|
| Gene00001 | 1.965577 | 1.982009 | 0.016431 | 0.400105 | 0.689995 | 0.980785 |
| Gene00002 | 3.421483 | 3.428782 | 0.007300 | 0.117269 | 0.906900 | 0.987245 |
| Gene00003 | 3.360063 | 3.380528 | 0.020465 | 0.409158 | 0.683377 | 0.980785 |
| Gene00004 | 1.470501 | 1.456684 | -0.013818 | -0.380341 | 0.704589 | 0.980785 |
| Gene00005 | 1.518749 | 1.554387 | 0.035638 | 0.946918 | 0.346256 | 0.911328 |
top10 = (
result
.sort_values("p_value", ascending=True)
.head(10)
)
top10
| mean_control | mean_treatment | delta | t_stat | p_value | p_adj | |
|---|---|---|---|---|---|---|
| Gene00331 | 2.408285 | 7.912500 | 5.504215 | 112.513229 | 1.299061e-101 | 1.299061e-98 |
| Gene00900 | 2.457207 | 8.073242 | 5.616036 | 122.856351 | 5.950546e-100 | 2.975273e-97 |
| Gene00228 | 3.076924 | 8.649999 | 5.573074 | 102.696595 | 9.160968e-96 | 3.053656e-93 |
| Gene00826 | 2.683420 | 8.149848 | 5.466428 | 92.615616 | 6.425737e-95 | 1.606434e-92 |
| Gene00755 | 2.526643 | 8.041864 | 5.515221 | 91.532234 | 2.524452e-94 | 5.048904e-92 |
| Gene00963 | 2.535840 | 7.929154 | 5.393314 | 97.945220 | 1.303224e-93 | 2.172040e-91 |
| Gene00641 | 1.842058 | 7.108171 | 5.266113 | 96.959633 | 6.885896e-92 | 9.836994e-90 |
| Gene00486 | 2.538174 | 8.066318 | 5.528144 | 95.975268 | 1.616442e-89 | 1.905236e-87 |
| Gene00639 | 2.118075 | 7.455710 | 5.337634 | 104.945633 | 1.714713e-89 | 1.905236e-87 |
| Gene00883 | 1.833715 | 7.146268 | 5.312553 | 96.348524 | 1.524332e-80 | 1.524332e-78 |
true_genes
['Gene00963', 'Gene00755', 'Gene00883', 'Gene00639', 'Gene00826', 'Gene00486', 'Gene00900', 'Gene00641', 'Gene00228', 'Gene00331']
result.loc[true_genes].sort_values("p_value")
| mean_control | mean_treatment | delta | t_stat | p_value | p_adj | |
|---|---|---|---|---|---|---|
| Gene00331 | 2.408285 | 7.912500 | 5.504215 | 112.513229 | 1.299061e-101 | 1.299061e-98 |
| Gene00900 | 2.457207 | 8.073242 | 5.616036 | 122.856351 | 5.950546e-100 | 2.975273e-97 |
| Gene00228 | 3.076924 | 8.649999 | 5.573074 | 102.696595 | 9.160968e-96 | 3.053656e-93 |
| Gene00826 | 2.683420 | 8.149848 | 5.466428 | 92.615616 | 6.425737e-95 | 1.606434e-92 |
| Gene00755 | 2.526643 | 8.041864 | 5.515221 | 91.532234 | 2.524452e-94 | 5.048904e-92 |
| Gene00963 | 2.535840 | 7.929154 | 5.393314 | 97.945220 | 1.303224e-93 | 2.172040e-91 |
| Gene00641 | 1.842058 | 7.108171 | 5.266113 | 96.959633 | 6.885896e-92 | 9.836994e-90 |
| Gene00486 | 2.538174 | 8.066318 | 5.528144 | 95.975268 | 1.616442e-89 | 1.905236e-87 |
| Gene00639 | 2.118075 | 7.455710 | 5.337634 | 104.945633 | 1.714713e-89 | 1.905236e-87 |
| Gene00883 | 1.833715 | 7.146268 | 5.312553 | 96.348524 | 1.524332e-80 | 1.524332e-78 |
true_set = set(true_genes)
true_set
{'Gene00228',
'Gene00331',
'Gene00486',
'Gene00639',
'Gene00641',
'Gene00755',
'Gene00826',
'Gene00883',
'Gene00900',
'Gene00963'}
de_set = set(top10.index)
de_set
{'Gene00228',
'Gene00331',
'Gene00486',
'Gene00639',
'Gene00641',
'Gene00755',
'Gene00826',
'Gene00883',
'Gene00900',
'Gene00963'}
union = len(true_set & de_set)
union
10
# Code for Volcano plot
import matplotlib.pyplot as plt
def volcano_plot(results, alpha=0.05):
plt.figure(figsize=(6, 5))
plt.scatter(
results["delta"],
-np.log10(results["p_value"]),
c=(results["p_value"] < alpha),
alpha=0.5
)
plt.axhline(-np.log10(alpha), linestyle="--")
plt.axvline(0, linestyle="--", color="grey")
plt.xlabel("log2 fold change (Treatment − Control)")
plt.ylabel("-log10(pvalue)")
plt.title("Volcano plot (t-test)")
plt.show()
volcano_plot(result)
Questions¶
- How many genes in the top 10 are the true shifted genes in ShiftedGenes.csv created in Quesiton 2?
- Report the false positve rate and false negative rate (with p values $<0.05$)
null_genes = result.index.difference(true_genes)
fp = (result.loc[null_genes, "p_value"] < 0.05).sum()
fpr = fp / len(null_genes)
fpr
np.float64(0.0595959595959596)
fn = (result.loc[true_genes, "p_value"] >= 0.05).sum()
fn
fnr = fn / len(true_genes)
fnr
np.float64(0.0)
Answer to Q3:¶
All top 10 genes are the true shifted genes.
Answer to Q4:¶
False postive rate is $\approx 0.06$. False negative rate is 0.
Part 5 - False Positive Rate¶
Question¶
- If you reset the shift_size=0 parameter in the function `simulate_dataset' what do you think the false positive rate should be why?
Answer to Q5¶
The result should match the type-I error level (0.05) — so with 1,000 genes you’d expect about 50 false positives on average. If we do not perform multiple-testing correction, false positive rate should be 0.05 (as demonstrated in the answer to Question 4).