Homework: Introductory GWAS Analysis¶
Due Date: Feb 16, 2026¶
Goal: Run a reproducible GWAS analysis on simulated genotype data, evaluate basic QC, perform a per-SNP association test with covariates, visualize results, and write a short interpretation.
Learning objectives
- Understand genotype simulation and basic quality control (minor allele frequency, MAF, missingness).
- Run PCA to capture population structure and include PCs as covariates.
- Perform per-SNP linear regression GWAS with covariates.
Instructions
- Run each cell in order in JupyterLab. The notebook will create a
gwas_outputs/folder with CSVs and PNGs. - Modify parameters (e.g.,
n_samples,n_snps, or number of causal SNPs) if you want to explore effects on power, but stick to the approximate time budget.
Grading rubric (100 pts)
- Notebook runs without errors and outputs saved: 40 pts
- Correct GWAS implementation and inclusion of covariates: 25 pts
- QC, PCA, and reasonable plotting: 15 pts
- Short written interpretation, discussing results & limitations: 20 pts
Good luck — ask if you get stuck!
# Setup and imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.decomposition import PCA
import os
np.random.seed(2025)
os.makedirs('gwas_outputs', exist_ok=True)
print('Environment ready. Outputs will be written to gwas_outputs/')
Environment ready. Outputs will be written to gwas_outputs/
Part 1:¶
Generate Synthetic SNP Genotype Data
# Setup and imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.decomposition import PCA
import os
np.random.seed(2025)
os.makedirs('gwas_outputs', exist_ok=True)
print('Environment ready. Outputs will be written to gwas_outputs/')
Environment ready. Outputs will be written to gwas_outputs/
# Generate synthetic genotype data
n_samples = 50
n_snps = 500
samples = [f"S{i+1:02d}" for i in range(n_samples)]
pop_labels = np.array(['A'] * (n_samples//2) + ['B'] * (n_samples - n_samples//2))
baseline_af = np.random.uniform(0.05, 0.5, size=n_snps)
popB_shift = np.random.normal(loc=0.0, scale=0.08, size=n_snps)
af_A = np.clip(baseline_af, 0.001, 0.999)
af_B = np.clip(baseline_af + popB_shift, 0.001, 0.999)
genotypes = np.full((n_samples, n_snps), np.nan)
for v in range(n_snps):
pA = af_A[v]
pB = af_B[v]
probsA = [ (1-pA)**2, 2*pA*(1-pA), pA**2 ]
probsB = [ (1-pB)**2, 2*pB*(1-pB), pB**2 ]
genotypes[:n_samples//2, v] = np.random.choice([0,1,2], size=n_samples//2, p=probsA)
genotypes[n_samples//2:, v] = np.random.choice([0,1,2], size=n_samples - n_samples//2, p=probsB)
genotypes[0:5, 0:5]
array([[0., 0., 1., 0., 1.],
[0., 1., 1., 0., 2.],
[0., 1., 1., 0., 0.],
[0., 2., 0., 0., 1.],
[0., 1., 2., 0., 0.]])
# Simulate covariates: age and sex (binary)
age = np.random.normal(50, 10, size=n_samples).astype(int)
sex = np.random.binomial(1, 0.5, size=n_samples)
# Create phenotype: a polygenic trait with 10 causal SNPs
causal_idx = np.random.choice(n_snps, size=10, replace=False)
beta_snps = np.zeros(n_snps)
beta_snps[causal_idx] = np.random.normal(0.3, 0.1, size=10)
# Genetic effect
# mean-impute genotypes for calculation
G_imp = np.where(np.isnan(genotypes), np.nanmean(genotypes, axis=0), genotypes)
polygenic = G_imp.dot(beta_snps)
# Covariate effects
beta_age = 0.02
beta_sex = -0.1
beta_pop = 0.4
# Environmental noise
env = np.random.normal(0, 1.0, size=n_samples)
## convert population label to numeric
pop_numeric = np.array([0 if x == 'A' else 1 for x in pop_labels])
y = 0.5 + polygenic + beta_age * age + beta_sex * sex + beta_pop * pop_numeric + env
df_meta = pd.DataFrame({'sample': samples, 'age': age, 'sex': sex, 'pop': pop_labels, 'y': y}).set_index('sample')
df_meta.to_csv('gwas_outputs/phenotypes.csv')
print('Phenotypes saved: gwas_outputs/phenotypes.csv')
# ~2% missing
mask_missing = np.random.rand(n_samples, n_snps) < 0.02
genotypes[mask_missing] = np.nan
snps_ids = [f'rs{100000+i}' for i in range(n_snps)]
df_geno = pd.DataFrame(genotypes, index=samples, columns=snps_ids)
print('Generated genotype matrix shape:', df_geno.shape)
df_geno.head()
Phenotypes saved: gwas_outputs/phenotypes.csv Generated genotype matrix shape: (50, 500)
| rs100000 | rs100001 | rs100002 | rs100003 | rs100004 | rs100005 | rs100006 | rs100007 | rs100008 | rs100009 | ... | rs100490 | rs100491 | rs100492 | rs100493 | rs100494 | rs100495 | rs100496 | rs100497 | rs100498 | rs100499 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S01 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 1.0 | NaN | 2.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| S02 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 0.0 | 1.0 | 1.0 | 2.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| S03 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| S04 | 0.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 | 1.0 |
| S05 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
5 rows × 500 columns
Questions¶
- What does the variable `pop_labels' represents in the simulation code above? How does it affect the genotype data being generated?
- How is the phenotype (y) generated in the simulation above? What are the important factors contribute to the value of y in the simulation?
- What is the structure of the genotype matrix (df_genotype)? what does each row represent? How about columns? What do the values in the cells mean?
- How are the genotypes generated in the code above? Explain what is Hardy–Weinberg equilibrium and how does it related to how the genotypes being generated in the simulation?
Answer t Question 1: pop_labels represents labels for the two populations. The way it affects genotype data being generated is that for group "A", the probability of having the alternate allele for SNP $v$ (denoted $p_v^{(A)}$) is simulated from the reference $p_v^{(A)} \sim Unif(0.05, 0.5)$. But for group "B", we have the probability of having the alternate allele for SNP $v$ is given by $p_v^{(B)} = p_v^{(A)} + \text{rnorm}(1, 0, 0.08)$.
Answer to Question 2: The phenotype $(y)$ is generated as $$y_i = 0.5 + \sum_{v=1}^{500}G_{i,v}\beta_v + \text{age}_i \beta_{age} + \mathcal{1}\left(\text{sex}_i== \text{male}\right) \beta_{sex} + \mathcal{1}\left(\text{pop}_i==B\right)\beta_{pop} + \text{env}_i$$ where $G_{i,v} \in \{0,1,2\}$ denotes the genotype of individual $i$ at SNP $v$, encoded as the number of alternate allele copies. Also, $$\begin{aligned}\beta_v &= \begin{cases} \sim \mathcal{N}(0.3, 0.1^2), & \text{if } v \text{ is one of the 10 causal SNPs}, \\ 0, & \text{ else } \end{cases} \\ \beta_{age} &= 0.02 \\ \beta_{sex} &= -0.1 \\ \beta_{pop} &= 0.4\\ env_i &\sim N(0,1).\end{aligned}$$ So as we can see, being in population "A" or "B" has a large impact on $y$ (with coeff $0.4$), sex has a moderate impact as well (coeff $-0.1$), and since $\beta_{v}$ for the causal SNPs comes from $N(0.3, 0.1^2)$, then the sum of those causal SNP effects has the potential to be quite large.
Answer to Question 3: We have $50$ rows representing $50$ samples, and we have $500$ columns representing $500$ SNPs. Entry $i, v$ in the matrix represents if sample $i$ has $0, 1, $ or $2$ copies of the alternate allele for SNP $v$.
Answer to Question 4: In the code above, genotypes are simulated assuming Hardy–Weinberg equilibrium (HWE) within each population. Under HWE, if the allele frequency of a variant is p, the expected genotype frequencies in a large, randomly mating population with no selection, mutation, or migration are: \begin{eqnarray*} && p^2 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \mbox {for genotype 0 (homozygous reference)} \\ && 2(1-p)\times p \ \ \ \ \mbox {for genotype 1 (heterozygous)} \\ && (1-p)^2 \ \ \ \ \ \ \ \ \ \ \ \mbox {for genotype 2 (homozygous alternate)}. \\ \end{eqnarray*} These probabilities are used to randomly sample diploid genotypes for each individual at each variant.
import numpy as np
def compute_maf(genotypes):
"""
Compute minor allele frequency (MAF) for a single SNP.
Parameters
----------
genotypes : array-like
Genotypes coded as 0, 1, 2 (alt allele count).
Missing values can be np.nan.
Returns
-------
maf : float
Minor allele frequency, or np.nan if no valid genotypes.
"""
g = np.asarray(genotypes, dtype=float)
g = g[~np.isnan(g)] # remove missing
if g.size == 0:
return np.nan
# total number of alternate alleles
alt_allele_count = g.sum()
# total number of alleles
total_alleles = 2 * g.size
af = alt_allele_count / total_alleles
maf = min(af, 1 - af)
return maf
maf = df_geno.apply(compute_maf, axis=0)
maf
rs100000 0.061224
rs100001 0.377551
rs100002 0.340000
rs100003 0.180000
rs100004 0.153061
...
rs100495 0.360000
rs100496 0.071429
rs100497 0.438776
rs100498 0.010417
rs100499 0.132653
Length: 500, dtype: float64
Questions:¶
- What is an allele? Explain how allele frequencies are calculated. What is minor allele frequency (MAF)?
- Generate a plot for the MAF distribution.
Answer to Question 5:
An allele is a different version of a gene found at a specific location (called a locus) on a chromosome. For example, if a gene determines eye color, one allele might code for brown eyes and another allele might code for blue eyes.
Since humans are diploid (we have two copies of each chromosome), we inherit:
One allele from our mother
One allele from our father
So each person has two alleles per gene (except for genes on sex chromosomes in males).
maf.dropna().plot.hist(bins=30)
plt.xlabel("MAF")
plt.title("MAF distribution")
plt.show()
# 1) Compute missingness for each SNP and save as gwas_outputs/variant_qc.csv
missingness = df_geno.isna().mean(axis=0)
missingness.name = "prop_missing"
print(missingness.head())
qc = pd.DataFrame({
"MAF": maf,
"missingness": missingness
})
qc.to_csv("gwas_outputs/variant_qc.csv")
rs100000 0.02 rs100001 0.02 rs100002 0.00 rs100003 0.00 rs100004 0.02 Name: prop_missing, dtype: float64
# 2) Mean-impute genotypes and run PCA on a subset of SNPs to get PCs (save to gwas_outputs/pcs.csv)
# mean impute means nas get replaced by the mean of all non-na entries in the column
G_imp = df_geno.fillna(df_geno.mean(axis=0))
# pca wants centered
G_centered = G_imp - G_imp.mean(axis=0)
# to get the subset, we want to choose the SNPs that dont have too low of MAF or too high of missingness
good_snps = maf[(maf > 0.05) & (missingness < 0.05)].index
G_centered = G_centered.loc[:,good_snps]
print(G_centered.shape)
# run PCA
pca = PCA(n_components=4)
pcs = pca.fit_transform(G_centered.values)
pc_df = pd.DataFrame(pcs, index=G_centered.index, columns=['PC1','PC2','PC3','PC4'])
print(pc_df.head())
pc_df.to_csv("gwas_outputs/pcs.csv")
(50, 434)
PC1 PC2 PC3 PC4
S01 2.883814 1.777985 3.227693 -1.363845
S02 2.232474 -0.799068 0.523152 1.308283
S03 3.039008 4.965095 -0.791743 0.538847
S04 1.287036 -1.235998 -1.985315 -1.384421
S05 2.072885 -1.522972 -4.730467 -2.029128
# 3) Plot PC1 vs PC2 colored by population and save as gwas_outputs/pca.png
plt.figure(figsize=(6, 5))
for label in df_meta['pop'].unique():
idx = df_meta['pop'] == label
plt.scatter(
pc_df.loc[idx, 'PC1'],
pc_df.loc[idx, 'PC2'],
label=label,
s=30,
alpha=0.8
)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA: PC1 vs PC2')
plt.legend(title='Population', loc='lower right')
plt.tight_layout()
plt.show()
Questions¶
- Report the first5 rows and first 5 columns of the object `qc'.
qc.iloc[0:5, 0:5]
| MAF | missingness | |
|---|---|---|
| rs100000 | 0.061224 | 0.02 |
| rs100001 | 0.377551 | 0.02 |
| rs100002 | 0.340000 | 0.00 |
| rs100003 | 0.180000 | 0.00 |
| rs100004 | 0.153061 | 0.02 |
### Impute Missing Values and Center genotype for PCA analysis
G_imp = df_geno.fillna(df_geno.mean(axis=0))
G_centered = G_imp - G_imp.mean(axis=0)
# PCA on mean-imputed genotypes (for population structure)
# perform PCA on a random subset of SNPs for speed
pca = PCA(n_components=4)
pcs = pca.fit_transform(G_centered.values)
pc_df = pd.DataFrame(pcs, index=samples, columns=['PC1','PC2','PC3','PC4'])
pc_df.head()
| PC1 | PC2 | PC3 | PC4 | |
|---|---|---|---|---|
| S01 | 3.022323 | 2.958171 | -2.997434 | 2.000000 |
| S02 | 2.579011 | 0.195340 | -1.161110 | -1.865826 |
| S03 | 3.639002 | 4.192559 | 2.081473 | 0.728862 |
| S04 | 1.436168 | -1.578011 | 1.235982 | 0.333851 |
| S05 | 2.060543 | -4.638060 | 2.448404 | 2.724931 |
df_meta.head()
| age | sex | pop | y | |
|---|---|---|---|---|
| sample | ||||
| S01 | 51 | 1 | A | 3.786828 |
| S02 | 34 | 0 | A | 2.322089 |
| S03 | 56 | 0 | A | 1.878572 |
| S04 | 36 | 1 | A | 2.316062 |
| S05 | 51 | 0 | A | 2.656425 |
plt.figure(figsize=(6, 5))
for label in df_meta['pop'].unique():
idx = df_meta['pop'] == label
plt.scatter(
pc_df.loc[idx, 'PC1'],
pc_df.loc[idx, 'PC2'],
label=label,
s=30,
alpha=0.8
)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA: PC1 vs PC2')
plt.legend(title='Population', loc='lower right')
plt.tight_layout()
plt.show()
Questions¶
- Explain what is PCA (principal component analysis) in simple terms and explain what does the PCA plot above demonstrate?
Answer to Question 5:
In a 2 dimensional example, each point is one individual (or sample), and the axes represent variables (for example, SNP1 and SNP2).
Often: the data cloud is tilted and the variation is not aligned with the original axes
PCA rotates the coordinate system so that:
The new first axis (PC1) points in the direction where the data varies the most. The second axis (PC2) points in the next largest direction of variation
Each new axis is perpendicular (orthogonal) to the others
The separation along PC1 suggests that population membership is the primary
source of genetic variation in this dataset.
Notes & Next steps¶
- This assignment uses simulated data
- For real GWAS, use PLINK, Hail, or SAIGE for large-scale analyses.