RNA-seq Expression Homework — Treatment Shift Version¶

RNA-seq Expression Homework (Treatment) — Instructor Solutions. Author: John Doe

In this assignment, we will:

(1) Simulate gene expression data.

(2) Implement functions to compute gene-level summaries.

(3) Filter genes and normalize expression matrices.

(4) Perform differential gene expression analysis.

(5) Use a volcano plot to inspect the results.

Do not change function names or arguments¶

How to use:

  1. Run the simulation cell to create an example dataset (df_expr, df_meta) with a subset of genes shifted in the Treatment group.
  2. Implement the functions in the cells marked TODO.

Learning Objectives¶

  • Practice NumPy and pandas operations
  • Filter low-quality features
  • Detect genes with treatment-associated mean shifts

Sections

  1. Simulated data with treatment effect (run to get df_expr and df_meta)
  2. Implementations (fill in the TODO cells)
  3. Exploration & tests (run these after implementing)

Part I: Simulate RNA-seq expression dataset with treatment-specific mean shifts¶

In [1]:
# Simulate an RNA-seq expression dataset with treatment-specific mean shifts
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def simulate_dataset(n_samples=60, n_genes=100, n_shift=10, shift_size=3.0, noise_sd=0.2, missing_p=0.0, seed=None):
    """
    Simulate a samples x genes expression matrix in which n_shift randomly
    chosen genes are multiplied by exp(shift_size) in Treatment samples.
    Returns (df_expr, df_meta, true_genes).
    """
    rng = np.random.RandomState(seed)
    samples = [f"S{i+1:02d}" for i in range(n_samples)]
    genes = [f"Gene{g+1:05d}" for g in range(n_genes)]
    conditions = ["Control"]*(n_samples//2) + ["Treatment"]*(n_samples - n_samples//2)
    df_meta = pd.DataFrame({"sample": samples, "condition": conditions}).set_index("sample")
    baseline_log = rng.normal(loc=1.5, scale=0.6, size=n_genes)
    per_sample_noise = rng.normal(0, noise_sd, size=(n_samples, n_genes))
    expr = np.exp(baseline_log + per_sample_noise)
    # shifted genes
    shift_idx = rng.choice(n_genes, size=n_shift, replace=False)
    treatment_mask = np.array(df_meta["condition"] == "Treatment")
    expr[np.ix_(treatment_mask, shift_idx)] *= np.exp(shift_size)
    # missingness
    if missing_p > 0:
        mask = rng.rand(n_samples, n_genes) < missing_p
        expr[mask] = np.nan
    df_expr = pd.DataFrame(expr, index=samples, columns=genes)
    true_genes = np.array(genes)[shift_idx].tolist()
    return df_expr, df_meta, true_genes
In [2]:
df_expr, df_meta, true_genes = simulate_dataset(
    n_samples=100, n_genes=1000, n_shift=10,
    shift_size=4, noise_sd=0.2,
    missing_p=0.05, seed=44
)
df_expr.head()
Out[2]:
Gene00001 Gene00002 Gene00003 Gene00004 Gene00005 Gene00006 Gene00007 Gene00008 Gene00009 Gene00010 ... Gene00991 Gene00992 Gene00993 Gene00994 Gene00995 Gene00996 Gene00997 Gene00998 Gene00999 Gene01000
S01 2.703283 13.925822 8.355345 2.074732 1.812259 1.667333 12.409141 5.019294 4.089832 8.536002 ... 1.565173 3.476716 5.094963 2.314342 5.183206 14.306271 2.359026 4.629968 3.497944 1.437714
S02 2.949598 11.232153 13.433076 1.189799 1.979275 1.579362 13.044639 2.338486 3.986011 5.690107 ... NaN NaN 4.486724 2.591446 3.469119 11.405596 2.140216 4.702188 3.271724 2.471256
S03 2.466550 12.327434 7.087745 1.596691 1.836656 1.453915 18.294748 4.847640 3.509057 6.524596 ... 2.570030 1.807175 4.083114 NaN 3.105927 12.160771 NaN 4.152616 3.307645 1.607030
S04 4.380852 11.139415 10.147750 1.459345 1.414977 1.367162 12.137341 5.354844 6.099626 4.729234 ... 2.121303 3.450361 6.553294 4.162107 2.856044 15.102878 NaN NaN 3.355429 1.463993
S05 3.188087 9.918421 8.002543 1.912038 1.280222 1.185378 15.780576 4.396150 3.322813 6.121251 ... 2.551916 3.400628 4.730747 2.872398 3.376408 11.082408 2.853187 4.947910 3.224200 1.623399

5 rows × 1000 columns

In [3]:
true_genes
Out[3]:
['Gene00963',
 'Gene00755',
 'Gene00883',
 'Gene00639',
 'Gene00826',
 'Gene00486',
 'Gene00900',
 'Gene00641',
 'Gene00228',
 'Gene00331']
In [4]:
df_meta.iloc[25:35]
Out[4]:
condition
sample
S26 Control
S27 Control
S28 Control
S29 Control
S30 Control
S31 Control
S32 Control
S33 Control
S34 Control
S35 Control

Questions¶

  1. Explain how the treatment shift works in the code above. Add code below to identify the shifted genes.
  2. Save the names of the shifted genes into a CSV file (ShiftedGenes.csv) in a local folder. We will use this file later.

Answer for Q1¶

The simulation code multiplies the expression values for the selected genes in Treatment samples by np.exp(shift_size), i.e. it applies a multiplicative fold-change of exp(shift_size) only for those genes and only in Treatment samples (before any missingness is applied). Because the underlying values are generated from expr = np.exp(baseline_log + noise), multiplying by np.exp(shift_size) is equivalent to adding shift_size on the log-scale.
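This equivalence is easy to verify numerically; a minimal sketch with a single made-up expression value:

```python
import numpy as np

# Toy check: multiplying an expression value by exp(shift_size) is the
# same as adding shift_size on the (natural) log scale.
shift_size = 3.0
x = np.exp(1.5 + 0.1)              # one made-up baseline expression value
shifted = x * np.exp(shift_size)   # apply the treatment shift

print(np.log(shifted) - np.log(x))  # recovers shift_size (3.0)
```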

In [5]:
## Answer for Q2 code 
df_true = pd.DataFrame({"shifted_gene": true_genes})
df_true.to_csv("ShiftedGenes.csv", index=False)

print("Saved true shifted genes to 'ShiftedGenes.csv'")
Saved true shifted genes to 'ShiftedGenes.csv'

Part 2 — Gene filtering¶

Implement filter_genes(df_expr, max_missing=0.8).

Return a tuple (df_filtered, gene_summary_df) where gene_summary_df contains:

  • mean_expression
  • missingness (fraction missing)
  • pass_filters (boolean)
In [6]:
def filter_genes(df_expr, max_missing=0.8):
    """
    Filter genes by mean expression and missingness.
    Returns (df_filtered, gene_summary_df)
    """
    mean_expr = df_expr.mean(axis=0)
    missingness = df_expr.isna().mean(axis=0)
    pass_filters = (missingness <= max_missing)
    summary = pd.DataFrame({
        "mean_expression": mean_expr,
        "missingness": missingness,
        "pass_filters": pass_filters
    })
    df_filtered = df_expr.loc[:, pass_filters.values]
    return df_filtered, summary
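Before running the filter on the real matrix, the missingness rule can be seen on a tiny made-up matrix (the gene names here are hypothetical, not from the simulated dataset):

```python
import numpy as np
import pandas as pd

# Toy matrix (3 samples x 3 genes) illustrating the missingness filter.
toy = pd.DataFrame({
    "GeneA": [1.0, 2.0, 3.0],          # 0/3 missing
    "GeneB": [np.nan, np.nan, 5.0],    # 2/3 missing (~0.67)
    "GeneC": [np.nan, np.nan, np.nan]  # 3/3 missing (1.0)
}, index=["S1", "S2", "S3"])

missingness = toy.isna().mean(axis=0)   # fraction of NaN per gene
keep = missingness <= 0.8               # same rule as in filter_genes
print(toy.loc[:, keep].columns.tolist())  # ['GeneA', 'GeneB']
```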
In [7]:
df_expr_filtered=filter_genes(df_expr, max_missing=0.8)[0]
df_expr_filtered.shape
Out[7]:
(100, 1000)
In [8]:
df_summ=filter_genes(df_expr, max_missing=0.8)[1]
df_summ.head()
Out[8]:
mean_expression missingness pass_filters
Gene00001 2.965357 0.03 True
Gene00002 9.982291 0.03 True
Gene00003 9.490975 0.06 True
Gene00004 1.778047 0.08 True
Gene00005 1.923846 0.08 True

Part 3 — log2 normalization¶

Implement log2_normalize_expression(df_expr):

Steps:

1. Add a pseudocount (default 1.0) to every value to avoid log(0).

2. Apply a log2 transform; NaN values stay NaN.

Return a DataFrame of the same shape as the input.

In [9]:
import numpy as np
import pandas as pd

def log2_normalize_expression(df_expr, pseudocount=1.0):
    """
    Log2-normalize gene expression counts.

    Parameters
    ----------
    df_expr : pd.DataFrame
        Expression matrix (samples x genes or genes x samples)
    pseudocount : float
        Value added to counts to avoid log(0)

    Returns
    -------
    pd.DataFrame
        Log2-transformed expression matrix of same shape as input
    """
    return np.log2(df_expr + pseudocount)
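A quick illustration of why the pseudocount matters (toy values, not assignment data):

```python
import numpy as np
import pandas as pd

# log2(0) is -inf, but log2(0 + 1) = 0, so zero counts map to zero
# instead of blowing up; NaN values simply stay NaN.
toy = pd.DataFrame({"g1": [0.0, 3.0, np.nan]})
print(np.log2(toy + 1.0)["g1"].tolist())  # [0.0, 2.0, nan]
```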
In [10]:
df_expr_norm=log2_normalize_expression(df_expr_filtered)
df_expr_norm.head()
Out[10]:
Gene00001 Gene00002 Gene00003 Gene00004 Gene00005 Gene00006 Gene00007 Gene00008 Gene00009 Gene00010 ... Gene00991 Gene00992 Gene00993 Gene00994 Gene00995 Gene00996 Gene00997 Gene00998 Gene00999 Gene01000
S01 1.888805 3.899738 3.225791 1.620461 1.491729 1.415398 3.745145 2.589594 2.347618 3.253385 ... 1.359056 2.162441 2.607617 1.728722 2.628355 3.936051 1.748043 2.493127 2.169266 1.285529
S02 1.981706 3.612606 3.851307 1.130798 1.574961 1.367015 3.811948 1.739194 2.317886 2.742029 ... NaN NaN 2.455945 1.844565 2.159990 3.632919 1.650864 2.511516 2.094818 1.795458
S03 1.793500 3.736327 3.015738 1.376674 1.504191 1.295085 4.270136 2.547854 2.172826 2.911614 ... 1.835936 1.489119 2.345713 NaN 2.037708 3.718172 NaN 2.365305 2.106899 1.382407
S04 2.427835 3.601627 3.478681 1.298274 1.272009 1.243159 3.715601 2.667857 2.827743 2.518342 ... 1.642148 2.153922 2.917106 2.367960 1.947122 4.009247 NaN NaN 2.122815 1.300998
S05 2.066291 3.448692 3.170333 1.542029 1.189174 1.127883 4.068720 2.431930 2.111970 2.832131 ... 1.828597 2.137709 2.518723 1.953227 2.129747 3.594836 1.946052 2.572383 2.078678 1.391437

5 rows × 1000 columns

Part 4 — Treatment effect exploration¶

Using the filtered and log2-normalized data, explore genes with treatment-associated mean shifts.

Tasks¶

  1. Compute group means per gene for Control and Treatment:
mean_control = ...
mean_treatment = ...
  2. Compute the difference delta = mean_treatment - mean_control for each gene.

  3. Compute the t-test statistic and p-value for each gene.

  4. Report the top 10 genes by p-value.

  5. Create a volcano plot.

In [11]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

def group_mean_differences(df_expr, df_meta, group_col="group",
                           control_label="control", treatment_label="treatment"):
    """
    Compute per-gene group means, delta, t-test statistics, and p-values.

    Assumes df_expr is shaped: samples x genes (rows = samples, columns = genes)
    and df_meta is indexed by sample IDs (matching df_expr.index).

    Returns a DataFrame indexed by gene with columns:
      mean_control, mean_treatment, delta, t_stat, p_value, p_adj
    """
    # --- align metadata to expression rows (samples) ---
    # Reindex rather than .loc to allow detecting missing samples
    df_meta = df_meta.reindex(df_expr.index)

    if group_col not in df_meta.columns:
        raise KeyError(f"Metadata does not contain column '{group_col}'")

    # detect missing metadata rows
    if df_meta[group_col].isna().any():
        missing = df_meta[df_meta[group_col].isna()].index.tolist()
        raise ValueError(f"Missing group labels for samples: {missing}")

    # make labels case-insensitive strings
    groups = df_meta[group_col].astype(str).str.lower()
    control_label = control_label.lower()
    treatment_label = treatment_label.lower()

    if control_label not in groups.unique() or treatment_label not in groups.unique():
        raise ValueError(f"Expected group labels '{control_label}' and '{treatment_label}' "
                         f"in metadata column '{group_col}'. Found: {sorted(groups.unique())}")

    control_samples = groups[groups == control_label].index.tolist()
    treatment_samples = groups[groups == treatment_label].index.tolist()

    if len(control_samples) == 0 or len(treatment_samples) == 0:
        raise ValueError("One of the groups has zero samples after filtering.")

    # --- vectorized means per gene (columns are genes) ---
    mean_control = df_expr.loc[control_samples].mean(axis=0)    # Series indexed by gene
    mean_treatment = df_expr.loc[treatment_samples].mean(axis=0)
    delta = mean_treatment - mean_control

    # --- per-gene Welch t-tests (loop over genes / columns) ---
    t_stats = []
    p_vals = []

    # iterate over genes (columns)
    for gene in df_expr.columns:
        ctrl_vals = df_expr.loc[control_samples, gene].values
        trt_vals = df_expr.loc[treatment_samples, gene].values

        # small-sample NaN guard: if all-NaN in either group, produce p=1 and t=nan
        if (np.all(np.isnan(ctrl_vals))) or (np.all(np.isnan(trt_vals))):
            t_stats.append(np.nan)
            p_vals.append(1.0)
            continue

        t_stat, p_val = ttest_ind(trt_vals, ctrl_vals, equal_var=False, nan_policy="omit")
        if np.isnan(p_val):
            p_val = 1.0
        t_stats.append(t_stat)
        p_vals.append(p_val)

    results = pd.DataFrame({
        "mean_control": mean_control,
        "mean_treatment": mean_treatment,
        "delta": delta,
        "t_stat": t_stats,
        "p_value": p_vals
    }, index=df_expr.columns)   # index = genes

    # BH FDR correction, safe for any NaNs replaced by 1 above
    results["p_adj"] = multipletests(results["p_value"], method="fdr_bh")[1]

    return results
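For reference (not part of the required solution), the per-gene loop can be replaced by a single vectorized call, since `scipy.stats.ttest_ind` accepts 2-D arrays and an `axis` argument; a sketch on synthetic noise arrays:

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic (samples x genes) arrays standing in for the two groups.
rng = np.random.default_rng(0)
ctrl = rng.normal(0.0, 1.0, size=(50, 200))
trt = rng.normal(0.0, 1.0, size=(50, 200))

# One Welch t-test per column (gene), all computed in one call.
t_stats, p_vals = ttest_ind(trt, ctrl, axis=0, equal_var=False,
                            nan_policy="omit")
print(t_stats.shape, p_vals.shape)  # one statistic and p-value per gene
```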
In [12]:
result=group_mean_differences(df_expr_norm, df_meta, group_col="condition")
result.head()
Out[12]:
mean_control mean_treatment delta t_stat p_value p_adj
Gene00001 1.965577 1.982009 0.016431 0.400105 0.689995 0.980785
Gene00002 3.421483 3.428782 0.007300 0.117269 0.906900 0.987245
Gene00003 3.360063 3.380528 0.020465 0.409158 0.683377 0.980785
Gene00004 1.470501 1.456684 -0.013818 -0.380341 0.704589 0.980785
Gene00005 1.518749 1.554387 0.035638 0.946918 0.346256 0.911328
In [13]:
top10 = (
    result
    .sort_values("p_value", ascending=True)
    .head(10)
)

top10
Out[13]:
mean_control mean_treatment delta t_stat p_value p_adj
Gene00331 2.408285 7.912500 5.504215 112.513229 1.299061e-101 1.299061e-98
Gene00900 2.457207 8.073242 5.616036 122.856351 5.950546e-100 2.975273e-97
Gene00228 3.076924 8.649999 5.573074 102.696595 9.160968e-96 3.053656e-93
Gene00826 2.683420 8.149848 5.466428 92.615616 6.425737e-95 1.606434e-92
Gene00755 2.526643 8.041864 5.515221 91.532234 2.524452e-94 5.048904e-92
Gene00963 2.535840 7.929154 5.393314 97.945220 1.303224e-93 2.172040e-91
Gene00641 1.842058 7.108171 5.266113 96.959633 6.885896e-92 9.836994e-90
Gene00486 2.538174 8.066318 5.528144 95.975268 1.616442e-89 1.905236e-87
Gene00639 2.118075 7.455710 5.337634 104.945633 1.714713e-89 1.905236e-87
Gene00883 1.833715 7.146268 5.312553 96.348524 1.524332e-80 1.524332e-78
In [14]:
true_genes
Out[14]:
['Gene00963',
 'Gene00755',
 'Gene00883',
 'Gene00639',
 'Gene00826',
 'Gene00486',
 'Gene00900',
 'Gene00641',
 'Gene00228',
 'Gene00331']
In [15]:
result.loc[true_genes].sort_values("p_value")
Out[15]:
mean_control mean_treatment delta t_stat p_value p_adj
Gene00331 2.408285 7.912500 5.504215 112.513229 1.299061e-101 1.299061e-98
Gene00900 2.457207 8.073242 5.616036 122.856351 5.950546e-100 2.975273e-97
Gene00228 3.076924 8.649999 5.573074 102.696595 9.160968e-96 3.053656e-93
Gene00826 2.683420 8.149848 5.466428 92.615616 6.425737e-95 1.606434e-92
Gene00755 2.526643 8.041864 5.515221 91.532234 2.524452e-94 5.048904e-92
Gene00963 2.535840 7.929154 5.393314 97.945220 1.303224e-93 2.172040e-91
Gene00641 1.842058 7.108171 5.266113 96.959633 6.885896e-92 9.836994e-90
Gene00486 2.538174 8.066318 5.528144 95.975268 1.616442e-89 1.905236e-87
Gene00639 2.118075 7.455710 5.337634 104.945633 1.714713e-89 1.905236e-87
Gene00883 1.833715 7.146268 5.312553 96.348524 1.524332e-80 1.524332e-78
In [16]:
true_set = set(true_genes)
true_set
Out[16]:
{'Gene00228',
 'Gene00331',
 'Gene00486',
 'Gene00639',
 'Gene00641',
 'Gene00755',
 'Gene00826',
 'Gene00883',
 'Gene00900',
 'Gene00963'}
In [18]:
de_set = set(top10.index)
de_set
Out[18]:
{'Gene00228',
 'Gene00331',
 'Gene00486',
 'Gene00639',
 'Gene00641',
 'Gene00755',
 'Gene00826',
 'Gene00883',
 'Gene00900',
 'Gene00963'}
In [19]:
overlap = len(true_set & de_set)
overlap
Out[19]:
10
In [20]:
# Code for Volcano plot 
import matplotlib.pyplot as plt

def volcano_plot(results, alpha=0.05):
    plt.figure(figsize=(6, 5))
    plt.scatter(
        results["delta"],
        -np.log10(results["p_value"]),
        c=(results["p_value"] < alpha),
        alpha=0.5
    )
    plt.axhline(-np.log10(alpha), linestyle="--")
    plt.axvline(0, linestyle="--", color="grey")
    plt.xlabel("log2 fold change (Treatment − Control)")
    plt.ylabel("-log10(pvalue)")
    plt.title("Volcano plot (t-test)")
    plt.show()
In [21]:
volcano_plot(result)
[Figure: volcano plot of delta vs -log10(p-value); genes with p < 0.05 highlighted]

Questions¶

  1. How many genes in the top 10 are among the true shifted genes in ShiftedGenes.csv created in Question 2?
  2. Report the false positive rate and false negative rate (at p-value $<0.05$).
In [33]:
null_genes = result.index.difference(true_genes)
fp = (result.loc[null_genes, "p_value"] < 0.05).sum()
fpr = fp / len(null_genes)
fpr
Out[33]:
np.float64(0.0595959595959596)
In [36]:
fn = (result.loc[true_genes, "p_value"] >= 0.05).sum()
fnr = fn / len(true_genes)
fnr
Out[36]:
np.float64(0.0)

Answer to Q3:¶

All top 10 genes are the true shifted genes.

Answer to Q4:¶

False positive rate is $\approx 0.06$. False negative rate is 0.

Part 5 - False Positive Rate¶

Question¶

  1. If you set shift_size=0 in the function `simulate_dataset`, what do you expect the false positive rate to be, and why?

Answer to Q5¶

With no true shifts, every gene's p-value is a null p-value, so at a threshold of p < 0.05 the false positive rate should match the type-I error level of approximately 0.05. With 1,000 genes, that means roughly 50 false positives on average if no multiple-testing correction is applied, consistent with the answer to Question 4.
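This expectation can be checked with a self-contained null simulation (pure noise, no shifted genes; the array shapes mirror the dataset above, but the code does not reuse `simulate_dataset`):

```python
import numpy as np
from scipy.stats import ttest_ind

# Pure-noise expression matrix: no gene is truly shifted.
rng = np.random.default_rng(0)
n_samples, n_genes = 100, 1000
expr = rng.normal(0.0, 1.0, size=(n_samples, n_genes))
ctrl, trt = expr[:50], expr[50:]

# Welch t-test per gene; under the null, p-values are ~uniform,
# so about 5% should fall below 0.05 by chance alone.
_, p_vals = ttest_ind(trt, ctrl, axis=0, equal_var=False)
fpr = (p_vals < 0.05).mean()
print(fpr)  # expected to be close to 0.05
```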