{ "cells": [ { "cell_type": "markdown", "id": "8ea0e822", "metadata": { "user_expressions": [] }, "source": [ "# RNA-seq Expression Homework — Treatment Shift Version\n", "# Due Date: Feb 02, 2026\n", "\n", "RNA-seq Expression Homework (Treatment) — Instructor Solutions\n", "\n", "**Author:** John Doe\n", "\n", "In this assignment, we will:\n", "\n", "(1) Simulate gene expression data\n", "\n", "(2) Implement functions to compute gene-level summaries\n", "\n", "(3) Filter genes and normalize expression matrices\n", "\n", "(4) Perform differential gene expression analysis using Student's t-test\n", "\n", "(5) Use a volcano plot to inspect the results\n", "\n", "Do not change function names or arguments\n", "---\n", "\n", "**How to use:** \n", "1. Run the simulation cell to create an example dataset (`df_expr`, `df_meta`) with a subset of genes shifted in the Treatment group. \n", "2. Implement the functions in the cells marked `TODO`. \n" ] }, { "cell_type": "markdown", "id": "2d80fb78", "metadata": { "user_expressions": [] }, "source": [ "\n", "## Learning Objectives\n", "\n", "- Practice NumPy and pandas operations\n", "- Filter low-quality features\n", "- Detect genes with treatment-associated mean shifts\n", "---\n", "\n", "**Sections**\n", "1. Simulated data with treatment effect (run to get `df_expr` and `df_meta`) \n", "2. Implementations (fill in the `TODO` cells) \n", "3. 
Exploration & tests (run these after implementing)\n" ] }, { "cell_type": "markdown", "id": "d747b5c6-1215-4273-ae40-9d4b9e3c450e", "metadata": { "user_expressions": [] }, "source": [ "# Part I: Simulate RNA-seq expression dataset with treatment-specific mean shifts" ] }, { "cell_type": "code", "execution_count": 1, "id": "7a1b9694", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Simulate an RNA-seq expression dataset with treatment-specific mean shifts\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "\n", "def simulate_dataset(n_samples=60, n_genes=100, n_shift=10, shift_size=3.0, noise_sd=0.2, missing_p=0.0, seed=None):\n", "    rng = np.random.RandomState(seed)\n", "    samples = [f\"S{i+1:02d}\" for i in range(n_samples)]\n", "    genes = [f\"Gene{g+1:05d}\" for g in range(n_genes)]\n", "    # first half of the samples are Control, the rest are Treatment\n", "    conditions = [\"Control\"]*(n_samples//2) + [\"Treatment\"]*(n_samples - n_samples//2)\n", "    df_meta = pd.DataFrame({\"sample\": samples, \"condition\": conditions}).set_index(\"sample\")\n", "    # per-gene baseline on the log scale plus per-sample noise, then exponentiate\n", "    baseline_log = rng.normal(loc=1.5, scale=0.6, size=n_genes)\n", "    per_sample_noise = rng.normal(0, noise_sd, size=(n_samples, n_genes))\n", "    expr = np.exp(baseline_log + per_sample_noise)\n", "    # shifted genes: scale Treatment samples by exp(shift_size)\n", "    shift_idx = rng.choice(n_genes, size=n_shift, replace=False)\n", "    treatment_mask = np.array(df_meta[\"condition\"] == \"Treatment\")\n", "    expr[np.ix_(treatment_mask, shift_idx)] *= np.exp(shift_size)\n", "    # missingness: set each entry to NaN with probability missing_p\n", "    if missing_p > 0:\n", "        mask = rng.rand(n_samples, n_genes) < missing_p\n", "        expr[mask] = np.nan\n", "    df_expr = pd.DataFrame(expr, index=samples, columns=genes)\n", "    true_genes = np.array(genes)[shift_idx].tolist()\n", "    return df_expr, df_meta, true_genes" ] }, { "cell_type": "code", "execution_count": 2, "id": "6909c89e-f3e4-484e-996d-502cf838d3f8", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
| | Gene00001 | Gene00002 | Gene00003 | Gene00004 | Gene00005 | Gene00006 | Gene00007 | Gene00008 | Gene00009 | Gene00010 | ... | Gene00991 | Gene00992 | Gene00993 | Gene00994 | Gene00995 | Gene00996 | Gene00997 | Gene00998 | Gene00999 | Gene01000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S01 | 2.703283 | 13.925822 | 8.355345 | 2.074732 | 1.812259 | 1.667333 | 12.409141 | 5.019294 | 4.089832 | 8.536002 | ... | 1.565173 | 3.476716 | 5.094963 | 2.314342 | 5.183206 | 14.306271 | 2.359026 | 4.629968 | 3.497944 | 1.437714 |
| S02 | 2.949598 | 11.232153 | 13.433076 | 1.189799 | 1.979275 | 1.579362 | 13.044639 | 2.338486 | 3.986011 | 5.690107 | ... | NaN | NaN | 4.486724 | 2.591446 | 3.469119 | 11.405596 | 2.140216 | 4.702188 | 3.271724 | 2.471256 |
| S03 | 2.466550 | 12.327434 | 7.087745 | 1.596691 | 1.836656 | 1.453915 | 18.294748 | 4.847640 | 3.509057 | 6.524596 | ... | 2.570030 | 1.807175 | 4.083114 | NaN | 3.105927 | 12.160771 | NaN | 4.152616 | 3.307645 | 1.607030 |
| S04 | 4.380852 | 11.139415 | 10.147750 | 1.459345 | 1.414977 | 1.367162 | 12.137341 | 5.354844 | 6.099626 | 4.729234 | ... | 2.121303 | 3.450361 | 6.553294 | 4.162107 | 2.856044 | 15.102878 | NaN | NaN | 3.355429 | 1.463993 |
| S05 | 3.188087 | 9.918421 | 8.002543 | 1.912038 | 1.280222 | 1.185378 | 15.780576 | 4.396150 | 3.322813 | 6.121251 | ... | 2.551916 | 3.400628 | 4.730747 | 2.872398 | 3.376408 | 11.082408 | 2.853187 | 4.947910 | 3.224200 | 1.623399 |
5 rows × 1000 columns

| sample | condition |
|---|---|
| S26 | Control |
| S27 | Control |
| S28 | Control |
| S29 | Control |
| S30 | Control |
| S31 | Control |
| S32 | Control |
| S33 | Control |
| S34 | Control |
| S35 | Control |