{ "cells": [ { "cell_type": "markdown", "id": "8ea0e822", "metadata": { "user_expressions": [] }, "source": [ "# RNA-seq Expression Homework — Treatment Shift Version\n", "# Due Date: Feb 02, 2026\n", "\n", "RNA-seq Expression Homework (Treatment) — Instructor Solutions\n", "\n", "** Author: John Doe\n", "\n", "In this assignment, we will\n", "\n", "(1) Simulate gene expression data\n", "\n", "(2) Implement function to compute gene-level summaries\n", "\n", "(3) Filter genes, and normalize expression matrices.\n", "\n", "(4) Perform differetial gene analysis using Student t-test \n", "\n", "(5) Use volcano plot to inspect the result. \n", "\n", "Do not change function names or arguments\n", "---\n", "\n", "**How to use:** \n", "1. Run the simulation cell to create an example dataset (`df_expr`, `df_meta`) with a subset of genes shifted in the Treatment group. \n", "2. Implement the functions in the cells marked `TODO`. \n" ] }, { "cell_type": "markdown", "id": "2d80fb78", "metadata": { "user_expressions": [] }, "source": [ "\n", "## Learning Objectives\n", "\n", "- Practice NumPy and pandas operations\n", "- Filter low-quality features\n", "- Detect genes with treatment-associated mean shifts\n", "---\n", "\n", "**Sections**\n", "1. Simulated data with treatment effect (run to get `df_expr` and `df_meta`) \n", "2. Implementations (fill in the `TODO` cells) \n", "3. Exploration & tests (run these after implementing)\n" ] }, { "cell_type": "markdown", "id": "d747b5c6-1215-4273-ae40-9d4b9e3c450e", "metadata": { "user_expressions": [] }, "source": [ "# Part I: Simulate RNA-seq expression dataset with teatment-specific mean shifts" ] }, { "cell_type": "code", "execution_count": 1, "id": "7a1b9694", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Simulate an RNA-seq expression dataset with treatment-specific mean shifts\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "\n", "def simulate_dataset(n_samples=60, n_genes=100, n_shift=10, shift_size=3.0, noise_sd=0.2, missing_p=0.0, seed=None):\n", " rng = np.random.RandomState(seed)\n", " samples = [f\"S{i+1:02d}\" for i in range(n_samples)]\n", " genes = [f\"Gene{g+1:05d}\" for g in range(n_genes)]\n", " conditions = [\"Control\"]*(n_samples//2) + [\"Treatment\"]*(n_samples - n_samples//2)\n", " df_meta = pd.DataFrame({\"sample\": samples, \"condition\": conditions}).set_index(\"sample\")\n", " baseline_log = rng.normal(loc=1.5, scale=0.6, size=n_genes)\n", " per_sample_noise = rng.normal(0, noise_sd, size=(n_samples, n_genes))\n", " expr = np.exp(baseline_log + per_sample_noise)\n", " # shifted genes\n", " shift_idx = rng.choice(n_genes, size=n_shift, replace=False)\n", " treatment_mask = np.array(df_meta[\"condition\"] == \"Treatment\")\n", " expr[np.ix_(treatment_mask, shift_idx)] *= np.exp(shift_size)\n", " # missingness\n", " if missing_p > 0:\n", " mask = rng.rand(n_samples, n_genes) < missing_p\n", " expr[mask] = np.nan\n", " df_expr = pd.DataFrame(expr, index=samples, columns=genes)\n", " true_genes = np.array(genes)[shift_idx].tolist()\n", " return df_expr, df_meta, true_genes" ] }, { "cell_type": "code", "execution_count": 2, "id": "6909c89e-f3e4-484e-996d-502cf838d3f8", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Gene00001Gene00002Gene00003Gene00004Gene00005Gene00006Gene00007Gene00008Gene00009Gene00010...Gene00991Gene00992Gene00993Gene00994Gene00995Gene00996Gene00997Gene00998Gene00999Gene01000
S012.70328313.9258228.3553452.0747321.8122591.66733312.4091415.0192944.0898328.536002...1.5651733.4767165.0949632.3143425.18320614.3062712.3590264.6299683.4979441.437714
S022.94959811.23215313.4330761.1897991.9792751.57936213.0446392.3384863.9860115.690107...NaNNaN4.4867242.5914463.46911911.4055962.1402164.7021883.2717242.471256
S032.46655012.3274347.0877451.5966911.8366561.45391518.2947484.8476403.5090576.524596...2.5700301.8071754.083114NaN3.10592712.160771NaN4.1526163.3076451.607030
S044.38085211.13941510.1477501.4593451.4149771.36716212.1373415.3548446.0996264.729234...2.1213033.4503616.5532944.1621072.85604415.102878NaNNaN3.3554291.463993
S053.1880879.9184218.0025431.9120381.2802221.18537815.7805764.3961503.3228136.121251...2.5519163.4006284.7307472.8723983.37640811.0824082.8531874.9479103.2242001.623399
\n", "

5 rows × 1000 columns

\n", "
" ], "text/plain": [ " Gene00001 Gene00002 Gene00003 Gene00004 Gene00005 Gene00006 \\\n", "S01 2.703283 13.925822 8.355345 2.074732 1.812259 1.667333 \n", "S02 2.949598 11.232153 13.433076 1.189799 1.979275 1.579362 \n", "S03 2.466550 12.327434 7.087745 1.596691 1.836656 1.453915 \n", "S04 4.380852 11.139415 10.147750 1.459345 1.414977 1.367162 \n", "S05 3.188087 9.918421 8.002543 1.912038 1.280222 1.185378 \n", "\n", " Gene00007 Gene00008 Gene00009 Gene00010 ... Gene00991 Gene00992 \\\n", "S01 12.409141 5.019294 4.089832 8.536002 ... 1.565173 3.476716 \n", "S02 13.044639 2.338486 3.986011 5.690107 ... NaN NaN \n", "S03 18.294748 4.847640 3.509057 6.524596 ... 2.570030 1.807175 \n", "S04 12.137341 5.354844 6.099626 4.729234 ... 2.121303 3.450361 \n", "S05 15.780576 4.396150 3.322813 6.121251 ... 2.551916 3.400628 \n", "\n", " Gene00993 Gene00994 Gene00995 Gene00996 Gene00997 Gene00998 \\\n", "S01 5.094963 2.314342 5.183206 14.306271 2.359026 4.629968 \n", "S02 4.486724 2.591446 3.469119 11.405596 2.140216 4.702188 \n", "S03 4.083114 NaN 3.105927 12.160771 NaN 4.152616 \n", "S04 6.553294 4.162107 2.856044 15.102878 NaN NaN \n", "S05 4.730747 2.872398 3.376408 11.082408 2.853187 4.947910 \n", "\n", " Gene00999 Gene01000 \n", "S01 3.497944 1.437714 \n", "S02 3.271724 2.471256 \n", "S03 3.307645 1.607030 \n", "S04 3.355429 1.463993 \n", "S05 3.224200 1.623399 \n", "\n", "[5 rows x 1000 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "simData=simulate_dataset(\n", " n_samples=100, n_genes=1000, n_shift=10,\n", " shift_size=4, noise_sd=0.2,\n", " missing_p=0.05, seed=44\n", ")\n", "df_expr=simData[0]\n", "df_meta=simData[1]\n", "true_genes=simData[2]\n", "\n", "df_expr.head()" ] }, { "cell_type": "code", "execution_count": 3, "id": "8391e752-2782-4f74-bea2-0b664864d26f", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
condition
sample
S26Control
S27Control
S28Control
S29Control
S30Control
S31Control
S32Control
S33Control
S34Control
S35Control
\n", "
" ], "text/plain": [ " condition\n", "sample \n", "S26 Control\n", "S27 Control\n", "S28 Control\n", "S29 Control\n", "S30 Control\n", "S31 Control\n", "S32 Control\n", "S33 Control\n", "S34 Control\n", "S35 Control" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_meta.iloc[25:35]" ] }, { "cell_type": "markdown", "id": "7b2fc98b-2c6f-474c-afc7-12b9b748e2fb", "metadata": { "user_expressions": [] }, "source": [ "## Questions\n", "1. Explain how the treatment shift works in the code above? Add code below to find out what are the shifted genes?\n", "2. Save the gene names of the shifted genes into a csv file (ShiftedGenes.csv) in a local folder. We will use this file later. " ] }, { "cell_type": "markdown", "id": "57ed1cf1", "metadata": { "user_expressions": [] }, "source": [ "## Part 2 — Gene filtering\n", "\n", "Implement `filter_genes(df_expr, max_missing=0.8)`.\n", "\n", "Return a tuple `(df_filtered, gene_summary_df)` where `gene_summary_df` contains:\n", "- `missingness` (fraction missing)\n", "- `pass_filters` (boolean)\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "29b537b2", "metadata": {}, "outputs": [], "source": [ "def filter_genes(df_expr, max_missing=0.8):\n", " \"\"\"\n", " Filter genes by missingness.\n", " Returns (df_filtered, gene_summary_df)\n", " \"\"\"\n", " # TODO: implement using missing rate\n", " raise NotImplementedError()\n" ] }, { "cell_type": "markdown", "id": "fb9de52b", "metadata": { "tags": [], "user_expressions": [] }, "source": [ "## Part 3 — log2 normalization \n", "The funciton below performs log2 normalization." ] }, { "cell_type": "code", "execution_count": 6, "id": "7fa5a4de", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "def log2_normalize_expression(df_expr, pseudocount=1.0):\n", " \"\"\"\n", " Log2-normalize gene expression counts.\n", "\n", " Parameters\n", " ----------\n", " df_expr : pd.DataFrame\n", " Expression matrix (samples x genes or genes x samples)\n", " pseudocount : float\n", " Value added to counts to avoid log(0)\n", "\n", " Returns\n", " -------\n", " pd.DataFrame\n", " Log2-transformed expression matrix of same shape as input\n", " \"\"\"\n", " return np.log2(df_expr + pseudocount)" ] }, { "cell_type": "markdown", "id": "f2ee96b9", "metadata": { "tags": [], "user_expressions": [] }, "source": [ "## Part 4 — Treatment effect exploration\n", "\n", "Using the filtered and normalized data, explore genes with treatment-associated mean shifts.\n", "\n", "### Tasks\n", "\n", "1. Compute group means per gene for Control and Treatment:\n", "```python\n", "mean_control = ...\n", "mean_treatment = ...\n", "```\n", "\n", "2. Compute the difference `delta = mean_treatment - mean_control` for each gene.\n", "\n", "3. Compute the t-test statistics and p value for each gene. \n", "\n", "4. Report the top 10 genes by p values.\n", "\n", "5. Create Volcano plot\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "54c5f2a5", "metadata": {}, "outputs": [], "source": [ "## skeleton function for differential expression \n", "def group_mean_differences(df_expr, df_meta, group_col=\"group\"):\n", " \"\"\"\n", " Return a DataFrame with columns:\n", " - mean_control\n", " - mean_treatment\n", " - log2 fold change (mean_treatment - mean_control)\n", " - t_stat\n", " - p_value\n", " \"\"\"\n", " # TODO: compute per-group means, delta, t-test, and p values. \n", " raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": 9, "id": "586b61e2-bf4d-4043-9c64-3af012c7f3be", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Code for Volcano plot \n", "import matplotlib.pyplot as plt\n", "\n", "def volcano_plot(results, alpha=0.05):\n", " plt.figure(figsize=(6, 5))\n", " plt.scatter(\n", " results[\"delta\"],\n", " -np.log10(results[\"p_value\"]),\n", " c=(results[\"p_value\"] < alpha),\n", " alpha=0.5\n", " )\n", " plt.axhline(-np.log10(alpha), linestyle=\"--\")\n", " plt.axvline(0, linestyle=\"--\", color=\"grey\")\n", " plt.xlabel(\"log2 fold change (Treatment − Control)\")\n", " plt.ylabel(\"-log10(p_value)\")\n", " plt.title(\"Volcano plot (t-test)\")\n", " plt.show()" ] }, { "cell_type": "markdown", "id": "19768938-8cfa-46f8-ae14-21c912fe55e0", "metadata": { "user_expressions": [] }, "source": [ "## Questions\n", "3. How many genes in the top 10 are the true shifted genes in ShiftedGenes.csv created in Quesiton 2?\n", "4. Report the false positve rate and false negative rate (with p values $<0.05$)\n" ] }, { "cell_type": "markdown", "id": "7f81e964-3772-4a84-a12e-93ce1e229e6e", "metadata": { "user_expressions": [] }, "source": [ "## Part 5 - False Positive Rate" ] }, { "cell_type": "markdown", "id": "96edb19f-6b76-4737-b420-3ba72079df55", "metadata": { "user_expressions": [] }, "source": [ "## Question\n", "5. If you reset the shift_size=0 parameter in the function `simulate_dataset' what do you think the false positive rate should be why?" ] }, { "cell_type": "markdown", "id": "76713434-491c-4500-94d5-a4edb94a2c97", "metadata": { "user_expressions": [] }, "source": [ "## Notes & Next steps\n", "In this homework, we simply normalized the data with log2 transformation. In practice, normalization could involve other preprocessing such as library size and composition bias correction, and batch effect adjustment. In addition, routine pactices utilize more advanced methods in R packages such as DEseq or edgeR with multiple testing correction such as false discovery rate (FDR) control. \n", "For best practice workflow (in R) check out \n", "https://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html" ] }, { "cell_type": "code", "execution_count": null, "id": "980d09a6-6ec2-4ab0-b90a-76c652456dc4", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python (base)", "language": "python", "name": "base" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 5 }