{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8ea0e822",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "# RNA-seq Expression Homework — Treatment Shift Version\n",
    "# Due Date: Feb 02, 2026\n",
    "\n",
    "RNA-seq Expression Homework (Treatment) — Instructor Solutions\n",
    "\n",
    "** Author: John Doe\n",
    "\n",
    "In this assignment, we will\n",
    "\n",
    "(1) Simulate gene expression data\n",
    "\n",
    "(2) Implement function to compute gene-level summaries\n",
    "\n",
    "(3) Filter genes, and normalize expression matrices.\n",
    "\n",
    "(4) Perform differetial gene analysis using Student t-test  \n",
    "\n",
    "(5) Use volcano plot to inspect the result. \n",
    "\n",
    "Do not change function names or arguments\n",
    "---\n",
    "\n",
    "**How to use:**  \n",
    "1. Run the simulation cell to create an example dataset (`df_expr`, `df_meta`) with a subset of genes shifted in the Treatment group.  \n",
    "2. Implement the functions in the cells marked `TODO`.  \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d80fb78",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "\n",
    "## Learning Objectives\n",
    "\n",
    "- Practice NumPy and pandas operations\n",
    "- Filter low-quality features\n",
    "- Detect genes with treatment-associated mean shifts\n",
    "---\n",
    "\n",
    "**Sections**\n",
    "1. Simulated data with treatment effect (run to get `df_expr` and `df_meta`)  \n",
    "2. Implementations (fill in the `TODO` cells)  \n",
    "3. Exploration & tests (run these after implementing)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d747b5c6-1215-4273-ae40-9d4b9e3c450e",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "# Part I: Simulate RNA-seq expression dataset with teatment-specific mean shifts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "7a1b9694",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Simulate an RNA-seq expression dataset with treatment-specific mean shifts\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "\n",
    "def simulate_dataset(n_samples=60, n_genes=100, n_shift=10, shift_size=3.0, noise_sd=0.2, missing_p=0.0, seed=None):\n",
    "    rng = np.random.RandomState(seed)\n",
    "    samples = [f\"S{i+1:02d}\" for i in range(n_samples)]\n",
    "    genes = [f\"Gene{g+1:05d}\" for g in range(n_genes)]\n",
    "    conditions = [\"Control\"]*(n_samples//2) + [\"Treatment\"]*(n_samples - n_samples//2)\n",
    "    df_meta = pd.DataFrame({\"sample\": samples, \"condition\": conditions}).set_index(\"sample\")\n",
    "    baseline_log = rng.normal(loc=1.5, scale=0.6, size=n_genes)\n",
    "    per_sample_noise = rng.normal(0, noise_sd, size=(n_samples, n_genes))\n",
    "    expr = np.exp(baseline_log + per_sample_noise)\n",
    "    # shifted genes\n",
    "    shift_idx = rng.choice(n_genes, size=n_shift, replace=False)\n",
    "    treatment_mask = np.array(df_meta[\"condition\"] == \"Treatment\")\n",
    "    expr[np.ix_(treatment_mask, shift_idx)] *= np.exp(shift_size)\n",
    "    # missingness\n",
    "    if missing_p > 0:\n",
    "        mask = rng.rand(n_samples, n_genes) < missing_p\n",
    "        expr[mask] = np.nan\n",
    "    df_expr = pd.DataFrame(expr, index=samples, columns=genes)\n",
    "    true_genes = np.array(genes)[shift_idx].tolist()\n",
    "    return df_expr, df_meta, true_genes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "6909c89e-f3e4-484e-996d-502cf838d3f8",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Gene00001</th>\n",
       "      <th>Gene00002</th>\n",
       "      <th>Gene00003</th>\n",
       "      <th>Gene00004</th>\n",
       "      <th>Gene00005</th>\n",
       "      <th>Gene00006</th>\n",
       "      <th>Gene00007</th>\n",
       "      <th>Gene00008</th>\n",
       "      <th>Gene00009</th>\n",
       "      <th>Gene00010</th>\n",
       "      <th>...</th>\n",
       "      <th>Gene00991</th>\n",
       "      <th>Gene00992</th>\n",
       "      <th>Gene00993</th>\n",
       "      <th>Gene00994</th>\n",
       "      <th>Gene00995</th>\n",
       "      <th>Gene00996</th>\n",
       "      <th>Gene00997</th>\n",
       "      <th>Gene00998</th>\n",
       "      <th>Gene00999</th>\n",
       "      <th>Gene01000</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>S01</th>\n",
       "      <td>2.703283</td>\n",
       "      <td>13.925822</td>\n",
       "      <td>8.355345</td>\n",
       "      <td>2.074732</td>\n",
       "      <td>1.812259</td>\n",
       "      <td>1.667333</td>\n",
       "      <td>12.409141</td>\n",
       "      <td>5.019294</td>\n",
       "      <td>4.089832</td>\n",
       "      <td>8.536002</td>\n",
       "      <td>...</td>\n",
       "      <td>1.565173</td>\n",
       "      <td>3.476716</td>\n",
       "      <td>5.094963</td>\n",
       "      <td>2.314342</td>\n",
       "      <td>5.183206</td>\n",
       "      <td>14.306271</td>\n",
       "      <td>2.359026</td>\n",
       "      <td>4.629968</td>\n",
       "      <td>3.497944</td>\n",
       "      <td>1.437714</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S02</th>\n",
       "      <td>2.949598</td>\n",
       "      <td>11.232153</td>\n",
       "      <td>13.433076</td>\n",
       "      <td>1.189799</td>\n",
       "      <td>1.979275</td>\n",
       "      <td>1.579362</td>\n",
       "      <td>13.044639</td>\n",
       "      <td>2.338486</td>\n",
       "      <td>3.986011</td>\n",
       "      <td>5.690107</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4.486724</td>\n",
       "      <td>2.591446</td>\n",
       "      <td>3.469119</td>\n",
       "      <td>11.405596</td>\n",
       "      <td>2.140216</td>\n",
       "      <td>4.702188</td>\n",
       "      <td>3.271724</td>\n",
       "      <td>2.471256</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S03</th>\n",
       "      <td>2.466550</td>\n",
       "      <td>12.327434</td>\n",
       "      <td>7.087745</td>\n",
       "      <td>1.596691</td>\n",
       "      <td>1.836656</td>\n",
       "      <td>1.453915</td>\n",
       "      <td>18.294748</td>\n",
       "      <td>4.847640</td>\n",
       "      <td>3.509057</td>\n",
       "      <td>6.524596</td>\n",
       "      <td>...</td>\n",
       "      <td>2.570030</td>\n",
       "      <td>1.807175</td>\n",
       "      <td>4.083114</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.105927</td>\n",
       "      <td>12.160771</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4.152616</td>\n",
       "      <td>3.307645</td>\n",
       "      <td>1.607030</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S04</th>\n",
       "      <td>4.380852</td>\n",
       "      <td>11.139415</td>\n",
       "      <td>10.147750</td>\n",
       "      <td>1.459345</td>\n",
       "      <td>1.414977</td>\n",
       "      <td>1.367162</td>\n",
       "      <td>12.137341</td>\n",
       "      <td>5.354844</td>\n",
       "      <td>6.099626</td>\n",
       "      <td>4.729234</td>\n",
       "      <td>...</td>\n",
       "      <td>2.121303</td>\n",
       "      <td>3.450361</td>\n",
       "      <td>6.553294</td>\n",
       "      <td>4.162107</td>\n",
       "      <td>2.856044</td>\n",
       "      <td>15.102878</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.355429</td>\n",
       "      <td>1.463993</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S05</th>\n",
       "      <td>3.188087</td>\n",
       "      <td>9.918421</td>\n",
       "      <td>8.002543</td>\n",
       "      <td>1.912038</td>\n",
       "      <td>1.280222</td>\n",
       "      <td>1.185378</td>\n",
       "      <td>15.780576</td>\n",
       "      <td>4.396150</td>\n",
       "      <td>3.322813</td>\n",
       "      <td>6.121251</td>\n",
       "      <td>...</td>\n",
       "      <td>2.551916</td>\n",
       "      <td>3.400628</td>\n",
       "      <td>4.730747</td>\n",
       "      <td>2.872398</td>\n",
       "      <td>3.376408</td>\n",
       "      <td>11.082408</td>\n",
       "      <td>2.853187</td>\n",
       "      <td>4.947910</td>\n",
       "      <td>3.224200</td>\n",
       "      <td>1.623399</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 1000 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     Gene00001  Gene00002  Gene00003  Gene00004  Gene00005  Gene00006  \\\n",
       "S01   2.703283  13.925822   8.355345   2.074732   1.812259   1.667333   \n",
       "S02   2.949598  11.232153  13.433076   1.189799   1.979275   1.579362   \n",
       "S03   2.466550  12.327434   7.087745   1.596691   1.836656   1.453915   \n",
       "S04   4.380852  11.139415  10.147750   1.459345   1.414977   1.367162   \n",
       "S05   3.188087   9.918421   8.002543   1.912038   1.280222   1.185378   \n",
       "\n",
       "     Gene00007  Gene00008  Gene00009  Gene00010  ...  Gene00991  Gene00992  \\\n",
       "S01  12.409141   5.019294   4.089832   8.536002  ...   1.565173   3.476716   \n",
       "S02  13.044639   2.338486   3.986011   5.690107  ...        NaN        NaN   \n",
       "S03  18.294748   4.847640   3.509057   6.524596  ...   2.570030   1.807175   \n",
       "S04  12.137341   5.354844   6.099626   4.729234  ...   2.121303   3.450361   \n",
       "S05  15.780576   4.396150   3.322813   6.121251  ...   2.551916   3.400628   \n",
       "\n",
       "     Gene00993  Gene00994  Gene00995  Gene00996  Gene00997  Gene00998  \\\n",
       "S01   5.094963   2.314342   5.183206  14.306271   2.359026   4.629968   \n",
       "S02   4.486724   2.591446   3.469119  11.405596   2.140216   4.702188   \n",
       "S03   4.083114        NaN   3.105927  12.160771        NaN   4.152616   \n",
       "S04   6.553294   4.162107   2.856044  15.102878        NaN        NaN   \n",
       "S05   4.730747   2.872398   3.376408  11.082408   2.853187   4.947910   \n",
       "\n",
       "     Gene00999  Gene01000  \n",
       "S01   3.497944   1.437714  \n",
       "S02   3.271724   2.471256  \n",
       "S03   3.307645   1.607030  \n",
       "S04   3.355429   1.463993  \n",
       "S05   3.224200   1.623399  \n",
       "\n",
       "[5 rows x 1000 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "simData=simulate_dataset(\n",
    "            n_samples=100, n_genes=1000, n_shift=10,\n",
    "            shift_size=4, noise_sd=0.2,\n",
    "            missing_p=0.05, seed=44\n",
    ")\n",
    "df_expr=simData[0]\n",
    "df_meta=simData[1]\n",
    "true_genes=simData[2]\n",
    "\n",
    "df_expr.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "8391e752-2782-4f74-bea2-0b664864d26f",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>condition</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sample</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>S26</th>\n",
       "      <td>Control</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S27</th>\n",
       "      <td>Control</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S28</th>\n",
       "      <td>Control</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S29</th>\n",
       "      <td>Control</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S30</th>\n",
       "      <td>Control</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S31</th>\n",
       "      <td>Control</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S32</th>\n",
       "      <td>Control</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S33</th>\n",
       "      <td>Control</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S34</th>\n",
       "      <td>Control</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S35</th>\n",
       "      <td>Control</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       condition\n",
       "sample          \n",
       "S26      Control\n",
       "S27      Control\n",
       "S28      Control\n",
       "S29      Control\n",
       "S30      Control\n",
       "S31      Control\n",
       "S32      Control\n",
       "S33      Control\n",
       "S34      Control\n",
       "S35      Control"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_meta.iloc[25:35]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b2fc98b-2c6f-474c-afc7-12b9b748e2fb",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "## Questions\n",
    "1. Explain how the treatment shift works in the code above? Add code below to find out what are the shifted genes?\n",
    "2. Save the gene names of the shifted genes into a csv file (ShiftedGenes.csv) in a local folder. We will use this file later. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57ed1cf1",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "## Part 2 — Gene filtering\n",
    "\n",
    "Implement `filter_genes(df_expr, max_missing=0.8)`.\n",
    "\n",
    "Return a tuple `(df_filtered, gene_summary_df)` where `gene_summary_df` contains:\n",
    "- `missingness` (fraction missing)\n",
    "- `pass_filters` (boolean)\n"
   ]
  },
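  {
   "cell_type": "markdown",
   "id": "c41d7a92",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "As a rough illustration of the two summary columns described above, the sketch below computes per-gene missingness on a small made-up matrix and keeps the genes at or under a threshold. The toy data, the variable names, and the choice of `<=` for the comparison are assumptions for illustration, not the required `filter_genes` implementation.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy matrix (samples x genes); not the homework data\n",
    "toy = pd.DataFrame(\n",
    "    {\n",
    "        'GeneA': [1.0, 2.0, 3.0, 4.0],\n",
    "        'GeneB': [np.nan, np.nan, np.nan, 4.0],\n",
    "        'GeneC': [1.0, np.nan, 3.0, 4.0],\n",
    "    },\n",
    "    index=['S01', 'S02', 'S03', 'S04'],\n",
    ")\n",
    "\n",
    "# Fraction of missing values per gene (column-wise mean of the NaN mask)\n",
    "missingness = toy.isna().mean(axis=0)\n",
    "\n",
    "# Assumption: a gene passes when its missingness does not exceed the threshold\n",
    "max_missing = 0.5\n",
    "summary = pd.DataFrame({\n",
    "    'missingness': missingness,\n",
    "    'pass_filters': missingness <= max_missing,\n",
    "})\n",
    "\n",
    "# Keep only the columns (genes) that pass the filter\n",
    "toy_filtered = toy.loc[:, summary['pass_filters']]\n",
    "print(summary)\n",
    "print(list(toy_filtered.columns))\n",
    "```\n",
    "\n",
    "Here `GeneB` is missing in three of four samples (missingness 0.75), so it is the only gene dropped at a threshold of 0.5."
   ]
  },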
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "29b537b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "def filter_genes(df_expr, max_missing=0.8):\n",
    "    \"\"\"\n",
    "    Filter genes by missingness.\n",
    "    Returns (df_filtered, gene_summary_df)\n",
    "    \"\"\"\n",
    "    # TODO: implement using missing rate\n",
    "    raise NotImplementedError()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fb9de52b",
   "metadata": {
    "tags": [],
    "user_expressions": []
   },
   "source": [
    "## Part 3 — log2 normalization \n",
    "The funciton below performs log2 normalization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "7fa5a4de",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "def log2_normalize_expression(df_expr, pseudocount=1.0):\n",
    "    \"\"\"\n",
    "    Log2-normalize gene expression counts.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    df_expr : pd.DataFrame\n",
    "        Expression matrix (samples x genes or genes x samples)\n",
    "    pseudocount : float\n",
    "        Value added to counts to avoid log(0)\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    pd.DataFrame\n",
    "        Log2-transformed expression matrix of same shape as input\n",
    "    \"\"\"\n",
    "    return np.log2(df_expr + pseudocount)"
   ]
  },
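  {
   "cell_type": "markdown",
   "id": "e03b58f1",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "A quick check on made-up values may help show what the pseudocount does: with `pseudocount=1.0`, a zero maps to 0 instead of negative infinity, and missing values stay missing. The toy matrix below is hypothetical.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy values chosen so that value + 1 is a power of two\n",
    "toy = pd.DataFrame({'GeneA': [0.0, 1.0, 3.0], 'GeneB': [7.0, np.nan, 15.0]})\n",
    "\n",
    "# Same transform as log2_normalize_expression with pseudocount=1.0\n",
    "toy_log2 = np.log2(toy + 1.0)\n",
    "print(toy_log2)\n",
    "```\n",
    "\n",
    "The result has the same shape as the input, and the `NaN` entry in `GeneB` is carried through unchanged."
   ]
  },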
  {
   "cell_type": "markdown",
   "id": "f2ee96b9",
   "metadata": {
    "tags": [],
    "user_expressions": []
   },
   "source": [
    "## Part 4 — Treatment effect exploration\n",
    "\n",
    "Using the filtered and normalized data, explore genes with treatment-associated mean shifts.\n",
    "\n",
    "### Tasks\n",
    "\n",
    "1. Compute group means per gene for Control and Treatment:\n",
    "```python\n",
    "mean_control = ...\n",
    "mean_treatment = ...\n",
    "```\n",
    "\n",
    "2. Compute the difference `delta = mean_treatment - mean_control` for each gene.\n",
    "\n",
    "3. Compute the t-test statistics and p value for each gene. \n",
    "\n",
    "4. Report the top 10 genes by p values.\n",
    "\n",
    "5. Create Volcano plot\n"
   ]
  },
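  {
   "cell_type": "markdown",
   "id": "9a6f2d47",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "The tasks above can be sketched on a tiny made-up dataset. This is one possible approach, assuming SciPy is installed and using `scipy.stats.ttest_ind` with `equal_var=True` for Student's t-test; the toy data and variable names are hypothetical, and the homework function may be organized differently.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from scipy import stats  # assumption: SciPy is available\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "samples = [f'S{i + 1:02d}' for i in range(20)]\n",
    "meta = pd.DataFrame({'condition': ['Control'] * 10 + ['Treatment'] * 10}, index=samples)\n",
    "\n",
    "# Hypothetical log2-scale toy data: GeneB is shifted up by 2 in Treatment\n",
    "expr = pd.DataFrame(rng.normal(5.0, 0.3, size=(20, 2)), index=samples, columns=['GeneA', 'GeneB'])\n",
    "expr.loc[meta['condition'] == 'Treatment', 'GeneB'] += 2.0\n",
    "\n",
    "# Split samples (rows) by group label\n",
    "ctrl = expr.loc[meta['condition'] == 'Control']\n",
    "trt = expr.loc[meta['condition'] == 'Treatment']\n",
    "\n",
    "# Student's t-test per gene (equal variances assumed); NaNs are skipped\n",
    "t_stat, p_value = stats.ttest_ind(trt, ctrl, axis=0, equal_var=True, nan_policy='omit')\n",
    "\n",
    "results = pd.DataFrame({\n",
    "    'mean_control': ctrl.mean(axis=0),\n",
    "    'mean_treatment': trt.mean(axis=0),\n",
    "    'delta': trt.mean(axis=0) - ctrl.mean(axis=0),\n",
    "    't_stat': np.asarray(t_stat),\n",
    "    'p_value': np.asarray(p_value),\n",
    "})\n",
    "print(results.sort_values('p_value'))\n",
    "```\n",
    "\n",
    "Because the inputs are already on the log2 scale, the difference of group means in `delta` can be read as a log2 fold change, which is the quantity `volcano_plot` puts on the x-axis."
   ]
  },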
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "54c5f2a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "## skeleton function for differential expression \n",
    "def group_mean_differences(df_expr, df_meta, group_col=\"group\"):\n",
    "    \"\"\"\n",
    "    Return a DataFrame with columns:\n",
    "    - mean_control\n",
    "    - mean_treatment\n",
    "    - log2 fold change (mean_treatment - mean_control)\n",
    "    - t_stat\n",
    "    - p_value\n",
    "    \"\"\"\n",
    "    # TODO: compute per-group means, delta, t-test, and p values. \n",
    "    raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "586b61e2-bf4d-4043-9c64-3af012c7f3be",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Code for Volcano plot \n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def volcano_plot(results, alpha=0.05):\n",
    "    plt.figure(figsize=(6, 5))\n",
    "    plt.scatter(\n",
    "        results[\"delta\"],\n",
    "        -np.log10(results[\"p_value\"]),\n",
    "        c=(results[\"p_value\"] < alpha),\n",
    "        alpha=0.5\n",
    "    )\n",
    "    plt.axhline(-np.log10(alpha), linestyle=\"--\")\n",
    "    plt.axvline(0, linestyle=\"--\", color=\"grey\")\n",
    "    plt.xlabel(\"log2 fold change (Treatment − Control)\")\n",
    "    plt.ylabel(\"-log10(p_value)\")\n",
    "    plt.title(\"Volcano plot (t-test)\")\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19768938-8cfa-46f8-ae14-21c912fe55e0",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "## Questions\n",
    "3. How many genes in the top 10 are the true shifted genes in ShiftedGenes.csv created in Quesiton 2?\n",
    "4. Report the false positve rate and false negative rate (with p values $<0.05$)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f81e964-3772-4a84-a12e-93ce1e229e6e",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "## Part 5 -  False Positive Rate"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96edb19f-6b76-4737-b420-3ba72079df55",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "## Question\n",
    "5. If you reset the shift_size=0 parameter in the function `simulate_dataset' what do you think the false positive rate should be why?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76713434-491c-4500-94d5-a4edb94a2c97",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "## Notes & Next steps\n",
    "In this homework, we simply normalized the data with log2 transformation. In practice, normalization could involve other preprocessing such as library size and composition bias correction, and batch effect adjustment.  In addition, routine pactices utilize more advanced methods in R packages such as DEseq or edgeR with multiple testing correction such as false discovery rate (FDR) control. \n",
    "For best practice workflow (in R) check out \n",
    "https://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html"
   ]
  },
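  {
   "cell_type": "markdown",
   "id": "5d8e17b3",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "To make the FDR idea concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment in NumPy. The helper name `benjamini_hochberg` and the example p-values are hypothetical, and the sketch assumes the input contains no missing values.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def benjamini_hochberg(p_values):\n",
    "    # Hypothetical helper for illustration: returns BH-adjusted p-values\n",
    "    p = np.asarray(p_values, dtype=float)\n",
    "    m = p.size\n",
    "    order = np.argsort(p)\n",
    "    # Scale each sorted p-value by m / rank\n",
    "    scaled = p[order] * m / np.arange(1, m + 1)\n",
    "    # Enforce monotonicity from the largest rank downwards, then cap at 1\n",
    "    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)\n",
    "    # Put the adjusted values back in the original input order\n",
    "    adjusted = np.empty(m)\n",
    "    adjusted[order] = adjusted_sorted\n",
    "    return adjusted\n",
    "\n",
    "\n",
    "p_vals = [0.001, 0.008, 0.039, 0.041, 0.6]\n",
    "adj = benjamini_hochberg(p_vals)\n",
    "print(adj)\n",
    "```\n",
    "\n",
    "Genes whose adjusted value falls below the chosen FDR level (for example 0.05) would be reported as significant. In this toy input, two of the four raw p-values under 0.05 no longer pass after adjustment."
   ]
  },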
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "980d09a6-6ec2-4ab0-b90a-76c652456dc4",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (base)",
   "language": "python",
   "name": "base"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}