RNA-seq Expression Homework — Treatment Shift Version¶

Due Date: Feb 02, 2026¶

RNA-seq Expression Homework (Treatment) — Instructor Solutions

** Author: John Doe

In this assignment, we will

(1) Simulate gene expression data

(2) Implement function to compute gene-level summaries

(3) Filter genes, and normalize expression matrices.

(4) Perform differetial gene analysis using Student t-test

(5) Use volcano plot to inspect the result.

Do not change function names or arguments¶

How to use:

  1. Run the simulation cell to create an example dataset (df_expr, df_meta) with a subset of genes shifted in the Treatment group.
  2. Implement the functions in the cells marked TODO.

Learning Objectives¶

  • Practice NumPy and pandas operations
  • Filter low-quality features
  • Detect genes with treatment-associated mean shifts

Sections

  1. Simulated data with treatment effect (run to get df_expr and df_meta)
  2. Implementations (fill in the TODO cells)
  3. Exploration & tests (run these after implementing)

Part I: Simulate RNA-seq expression dataset with teatment-specific mean shifts¶

In [1]:
# Simulate an RNA-seq expression dataset with treatment-specific mean shifts
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def simulate_dataset(n_samples=60, n_genes=100, n_shift=10, shift_size=3.0, noise_sd=0.2, missing_p=0.0, seed=None):
    rng = np.random.RandomState(seed)
    samples = [f"S{i+1:02d}" for i in range(n_samples)]
    genes = [f"Gene{g+1:05d}" for g in range(n_genes)]
    conditions = ["Control"]*(n_samples//2) + ["Treatment"]*(n_samples - n_samples//2)
    df_meta = pd.DataFrame({"sample": samples, "condition": conditions}).set_index("sample")
    baseline_log = rng.normal(loc=1.5, scale=0.6, size=n_genes)
    per_sample_noise = rng.normal(0, noise_sd, size=(n_samples, n_genes))
    expr = np.exp(baseline_log + per_sample_noise)
    # shifted genes
    shift_idx = rng.choice(n_genes, size=n_shift, replace=False)
    treatment_mask = np.array(df_meta["condition"] == "Treatment")
    expr[np.ix_(treatment_mask, shift_idx)] *= np.exp(shift_size)
    # missingness
    if missing_p > 0:
        mask = rng.rand(n_samples, n_genes) < missing_p
        expr[mask] = np.nan
    df_expr = pd.DataFrame(expr, index=samples, columns=genes)
    true_genes = np.array(genes)[shift_idx].tolist()
    return df_expr, df_meta, true_genes
In [2]:
simData=simulate_dataset(
            n_samples=100, n_genes=1000, n_shift=10,
            shift_size=4, noise_sd=0.2,
            missing_p=0.05, seed=44
)
df_expr=simData[0]
df_meta=simData[1]
true_genes=simData[2]

df_expr.head()
Out[2]:
Gene00001 Gene00002 Gene00003 Gene00004 Gene00005 Gene00006 Gene00007 Gene00008 Gene00009 Gene00010 ... Gene00991 Gene00992 Gene00993 Gene00994 Gene00995 Gene00996 Gene00997 Gene00998 Gene00999 Gene01000
S01 2.703283 13.925822 8.355345 2.074732 1.812259 1.667333 12.409141 5.019294 4.089832 8.536002 ... 1.565173 3.476716 5.094963 2.314342 5.183206 14.306271 2.359026 4.629968 3.497944 1.437714
S02 2.949598 11.232153 13.433076 1.189799 1.979275 1.579362 13.044639 2.338486 3.986011 5.690107 ... NaN NaN 4.486724 2.591446 3.469119 11.405596 2.140216 4.702188 3.271724 2.471256
S03 2.466550 12.327434 7.087745 1.596691 1.836656 1.453915 18.294748 4.847640 3.509057 6.524596 ... 2.570030 1.807175 4.083114 NaN 3.105927 12.160771 NaN 4.152616 3.307645 1.607030
S04 4.380852 11.139415 10.147750 1.459345 1.414977 1.367162 12.137341 5.354844 6.099626 4.729234 ... 2.121303 3.450361 6.553294 4.162107 2.856044 15.102878 NaN NaN 3.355429 1.463993
S05 3.188087 9.918421 8.002543 1.912038 1.280222 1.185378 15.780576 4.396150 3.322813 6.121251 ... 2.551916 3.400628 4.730747 2.872398 3.376408 11.082408 2.853187 4.947910 3.224200 1.623399

5 rows × 1000 columns

In [3]:
df_meta.iloc[25:35]
Out[3]:
condition
sample
S26 Control
S27 Control
S28 Control
S29 Control
S30 Control
S31 Control
S32 Control
S33 Control
S34 Control
S35 Control

Questions¶

  1. Explain how the treatment shift works in the code above? Add code below to find out what are the shifted genes?
  2. Save the gene names of the shifted genes into a csv file (ShiftedGenes.csv) in a local folder. We will use this file later.

Part 2 — Gene filtering¶

Implement filter_genes(df_expr, max_missing=0.8).

Return a tuple (df_filtered, gene_summary_df) where gene_summary_df contains:

  • missingness (fraction missing)
  • pass_filters (boolean)
In [4]:
def filter_genes(df_expr, max_missing=0.8):
    """
    Filter genes by missingness.
    Returns (df_filtered, gene_summary_df)
    """
    # TODO: implement using missing rate
    raise NotImplementedError()

Part 3 — log2 normalization¶

The funciton below performs log2 normalization.

In [6]:
import numpy as np
import pandas as pd

def log2_normalize_expression(df_expr, pseudocount=1.0):
    """
    Log2-normalize gene expression counts.

    Parameters
    ----------
    df_expr : pd.DataFrame
        Expression matrix (samples x genes or genes x samples)
    pseudocount : float
        Value added to counts to avoid log(0)

    Returns
    -------
    pd.DataFrame
        Log2-transformed expression matrix of same shape as input
    """
    return np.log2(df_expr + pseudocount)

Part 4 — Treatment effect exploration¶

Using the filtered and normalized data, explore genes with treatment-associated mean shifts.

Tasks¶

  1. Compute group means per gene for Control and Treatment:
mean_control = ...
mean_treatment = ...
  1. Compute the difference delta = mean_treatment - mean_control for each gene.

  2. Compute the t-test statistics and p value for each gene.

  3. Report the top 10 genes by p values.

  4. Create Volcano plot

In [8]:
## skeleton function for differential expression 
def group_mean_differences(df_expr, df_meta, group_col="group"):
    """
    Return a DataFrame with columns:
    - mean_control
    - mean_treatment
    - log2 fold change (mean_treatment - mean_control)
    - t_stat
    - p_value
    """
    # TODO: compute per-group means, delta, t-test, and p values. 
    raise NotImplementedError()
In [9]:
# Code for Volcano plot 
import matplotlib.pyplot as plt

def volcano_plot(results, alpha=0.05):
    plt.figure(figsize=(6, 5))
    plt.scatter(
        results["delta"],
        -np.log10(results["p_value"]),
        c=(results["p_value"] < alpha),
        alpha=0.5
    )
    plt.axhline(-np.log10(alpha), linestyle="--")
    plt.axvline(0, linestyle="--", color="grey")
    plt.xlabel("log2 fold change (Treatment − Control)")
    plt.ylabel("-log10(p_value)")
    plt.title("Volcano plot (t-test)")
    plt.show()

Questions¶

  1. How many genes in the top 10 are the true shifted genes in ShiftedGenes.csv created in Quesiton 2?
  2. Report the false positve rate and false negative rate (with p values $<0.05$)

Part 5 - False Positive Rate¶

Question¶

  1. If you reset the shift_size=0 parameter in the function `simulate_dataset' what do you think the false positive rate should be why?

Notes & Next steps¶

In this homework, we simply normalized the data with log2 transformation. In practice, normalization could involve other preprocessing such as library size and composition bias correction, and batch effect adjustment. In addition, routine pactices utilize more advanced methods in R packages such as DEseq or edgeR with multiple testing correction such as false discovery rate (FDR) control. For best practice workflow (in R) check out https://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html

In [ ]: