RNA-seq Expression Homework — Treatment Shift Version¶
Due Date: Feb 02, 2026¶
RNA-seq Expression Homework (Treatment) — Instructor Solutions
** Author: John Doe
In this assignment, we will
(1) Simulate gene expression data
(2) Implement function to compute gene-level summaries
(3) Filter genes, and normalize expression matrices.
(4) Perform differetial gene analysis using Student t-test
(5) Use volcano plot to inspect the result.
Do not change function names or arguments¶
How to use:
- Run the simulation cell to create an example dataset (
df_expr,df_meta) with a subset of genes shifted in the Treatment group. - Implement the functions in the cells marked
TODO.
Learning Objectives¶
- Practice NumPy and pandas operations
- Filter low-quality features
- Detect genes with treatment-associated mean shifts
Sections
- Simulated data with treatment effect (run to get
df_expranddf_meta) - Implementations (fill in the
TODOcells) - Exploration & tests (run these after implementing)
Part I: Simulate RNA-seq expression dataset with teatment-specific mean shifts¶
# Simulate an RNA-seq expression dataset with treatment-specific mean shifts
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def simulate_dataset(n_samples=60, n_genes=100, n_shift=10, shift_size=3.0, noise_sd=0.2, missing_p=0.0, seed=None):
rng = np.random.RandomState(seed)
samples = [f"S{i+1:02d}" for i in range(n_samples)]
genes = [f"Gene{g+1:05d}" for g in range(n_genes)]
conditions = ["Control"]*(n_samples//2) + ["Treatment"]*(n_samples - n_samples//2)
df_meta = pd.DataFrame({"sample": samples, "condition": conditions}).set_index("sample")
baseline_log = rng.normal(loc=1.5, scale=0.6, size=n_genes)
per_sample_noise = rng.normal(0, noise_sd, size=(n_samples, n_genes))
expr = np.exp(baseline_log + per_sample_noise)
# shifted genes
shift_idx = rng.choice(n_genes, size=n_shift, replace=False)
treatment_mask = np.array(df_meta["condition"] == "Treatment")
expr[np.ix_(treatment_mask, shift_idx)] *= np.exp(shift_size)
# missingness
if missing_p > 0:
mask = rng.rand(n_samples, n_genes) < missing_p
expr[mask] = np.nan
df_expr = pd.DataFrame(expr, index=samples, columns=genes)
true_genes = np.array(genes)[shift_idx].tolist()
return df_expr, df_meta, true_genes
simData=simulate_dataset(
n_samples=100, n_genes=1000, n_shift=10,
shift_size=4, noise_sd=0.2,
missing_p=0.05, seed=44
)
df_expr=simData[0]
df_meta=simData[1]
true_genes=simData[2]
df_expr.head()
| Gene00001 | Gene00002 | Gene00003 | Gene00004 | Gene00005 | Gene00006 | Gene00007 | Gene00008 | Gene00009 | Gene00010 | ... | Gene00991 | Gene00992 | Gene00993 | Gene00994 | Gene00995 | Gene00996 | Gene00997 | Gene00998 | Gene00999 | Gene01000 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S01 | 2.703283 | 13.925822 | 8.355345 | 2.074732 | 1.812259 | 1.667333 | 12.409141 | 5.019294 | 4.089832 | 8.536002 | ... | 1.565173 | 3.476716 | 5.094963 | 2.314342 | 5.183206 | 14.306271 | 2.359026 | 4.629968 | 3.497944 | 1.437714 |
| S02 | 2.949598 | 11.232153 | 13.433076 | 1.189799 | 1.979275 | 1.579362 | 13.044639 | 2.338486 | 3.986011 | 5.690107 | ... | NaN | NaN | 4.486724 | 2.591446 | 3.469119 | 11.405596 | 2.140216 | 4.702188 | 3.271724 | 2.471256 |
| S03 | 2.466550 | 12.327434 | 7.087745 | 1.596691 | 1.836656 | 1.453915 | 18.294748 | 4.847640 | 3.509057 | 6.524596 | ... | 2.570030 | 1.807175 | 4.083114 | NaN | 3.105927 | 12.160771 | NaN | 4.152616 | 3.307645 | 1.607030 |
| S04 | 4.380852 | 11.139415 | 10.147750 | 1.459345 | 1.414977 | 1.367162 | 12.137341 | 5.354844 | 6.099626 | 4.729234 | ... | 2.121303 | 3.450361 | 6.553294 | 4.162107 | 2.856044 | 15.102878 | NaN | NaN | 3.355429 | 1.463993 |
| S05 | 3.188087 | 9.918421 | 8.002543 | 1.912038 | 1.280222 | 1.185378 | 15.780576 | 4.396150 | 3.322813 | 6.121251 | ... | 2.551916 | 3.400628 | 4.730747 | 2.872398 | 3.376408 | 11.082408 | 2.853187 | 4.947910 | 3.224200 | 1.623399 |
5 rows × 1000 columns
df_meta.iloc[25:35]
| condition | |
|---|---|
| sample | |
| S26 | Control |
| S27 | Control |
| S28 | Control |
| S29 | Control |
| S30 | Control |
| S31 | Control |
| S32 | Control |
| S33 | Control |
| S34 | Control |
| S35 | Control |
Questions¶
- Explain how the treatment shift works in the code above? Add code below to find out what are the shifted genes?
- Save the gene names of the shifted genes into a csv file (ShiftedGenes.csv) in a local folder. We will use this file later.
Part 2 — Gene filtering¶
Implement filter_genes(df_expr, max_missing=0.8).
Return a tuple (df_filtered, gene_summary_df) where gene_summary_df contains:
missingness(fraction missing)pass_filters(boolean)
def filter_genes(df_expr, max_missing=0.8):
"""
Filter genes by missingness.
Returns (df_filtered, gene_summary_df)
"""
# TODO: implement using missing rate
raise NotImplementedError()
Part 3 — log2 normalization¶
The funciton below performs log2 normalization.
import numpy as np
import pandas as pd
def log2_normalize_expression(df_expr, pseudocount=1.0):
"""
Log2-normalize gene expression counts.
Parameters
----------
df_expr : pd.DataFrame
Expression matrix (samples x genes or genes x samples)
pseudocount : float
Value added to counts to avoid log(0)
Returns
-------
pd.DataFrame
Log2-transformed expression matrix of same shape as input
"""
return np.log2(df_expr + pseudocount)
Part 4 — Treatment effect exploration¶
Using the filtered and normalized data, explore genes with treatment-associated mean shifts.
Tasks¶
- Compute group means per gene for Control and Treatment:
mean_control = ...
mean_treatment = ...
Compute the difference
delta = mean_treatment - mean_controlfor each gene.Compute the t-test statistics and p value for each gene.
Report the top 10 genes by p values.
Create Volcano plot
## skeleton function for differential expression
def group_mean_differences(df_expr, df_meta, group_col="group"):
"""
Return a DataFrame with columns:
- mean_control
- mean_treatment
- log2 fold change (mean_treatment - mean_control)
- t_stat
- p_value
"""
# TODO: compute per-group means, delta, t-test, and p values.
raise NotImplementedError()
# Code for Volcano plot
import matplotlib.pyplot as plt
def volcano_plot(results, alpha=0.05):
plt.figure(figsize=(6, 5))
plt.scatter(
results["delta"],
-np.log10(results["p_value"]),
c=(results["p_value"] < alpha),
alpha=0.5
)
plt.axhline(-np.log10(alpha), linestyle="--")
plt.axvline(0, linestyle="--", color="grey")
plt.xlabel("log2 fold change (Treatment − Control)")
plt.ylabel("-log10(p_value)")
plt.title("Volcano plot (t-test)")
plt.show()
Questions¶
- How many genes in the top 10 are the true shifted genes in ShiftedGenes.csv created in Quesiton 2?
- Report the false positve rate and false negative rate (with p values $<0.05$)
Part 5 - False Positive Rate¶
Question¶
- If you reset the shift_size=0 parameter in the function `simulate_dataset' what do you think the false positive rate should be why?
Notes & Next steps¶
In this homework, we simply normalized the data with log2 transformation. In practice, normalization could involve other preprocessing such as library size and composition bias correction, and batch effect adjustment. In addition, routine pactices utilize more advanced methods in R packages such as DEseq or edgeR with multiple testing correction such as false discovery rate (FDR) control. For best practice workflow (in R) check out https://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html