Homework: Introductory GWAS Analysis¶

Due Date: Feb 16, 2026¶

Goal: Run a reproducible GWAS analysis on simulated genotype data, evaluate basic QC, perform a per-SNP association test with covariates, visualize results, and write a short interpretation.

Learning objectives

  • Understand genotype simulation and basic quality control (minor allele frequency, MAF, missingness).
  • Run PCA to capture population structure and include PCs as covariates.
  • Perform per-SNP linear regression GWAS with covariates.

Instructions

  • Run each cell in order in JupyterLab. The notebook will create a gwas_outputs/ folder with CSVs and PNGs.
  • Modify parameters (e.g., n_samples, n_snps, or number of causal SNPs) if you want to explore effects on power, but stick to the approximate time budget.

Grading rubric (100 pts)

  • Notebook runs without errors and outputs saved: 40 pts
  • Correct GWAS implementation and inclusion of covariates: 25 pts
  • QC, PCA, and reasonable plotting: 15 pts
  • Short written interpretation, discussing results & limitations: 20 pts

Good luck — ask if you get stuck!

In [1]:
# Setup and imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.decomposition import PCA
import os

np.random.seed(2025)
os.makedirs('gwas_outputs', exist_ok=True)
print('Environment ready. Outputs will be written to gwas_outputs/')
Environment ready. Outputs will be written to gwas_outputs/

Part 1: Generate Synthetic SNP Genotype Data¶

In [6]:
# Generate synthetic genotype data
n_samples = 50
n_snps = 500
samples = [f"S{i+1:02d}" for i in range(n_samples)]
pop_labels = np.array(['A'] * (n_samples//2) + ['B'] * (n_samples - n_samples//2))

baseline_af = np.random.uniform(0.05, 0.5, size=n_snps)
popB_shift = np.random.normal(loc=0.0, scale=0.08, size=n_snps)
af_A = np.clip(baseline_af, 0.001, 0.999)
af_B = np.clip(baseline_af + popB_shift, 0.001, 0.999)

genotypes = np.full((n_samples, n_snps), np.nan)
for v in range(n_snps):
    pA = af_A[v]
    pB = af_B[v]
    probsA = [ (1-pA)**2, 2*pA*(1-pA), pA**2 ]
    probsB = [ (1-pB)**2, 2*pB*(1-pB), pB**2 ]
    genotypes[:n_samples//2, v] = np.random.choice([0,1,2], size=n_samples//2, p=probsA)
    genotypes[n_samples//2:, v] = np.random.choice([0,1,2], size=n_samples - n_samples//2, p=probsB)
genotypes[0:5, 0:5]
Out[6]:
array([[0., 2., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [2., 0., 0., 0., 1.],
       [2., 1., 1., 1., 0.],
       [2., 0., 0., 0., 1.]])
In [7]:
# Simulate covariates: age and sex (binary)
age = np.random.normal(50, 10, size=n_samples).astype(int)
sex = np.random.binomial(1, 0.5, size=n_samples)

# Create phenotype: a polygenic trait with 10 causal SNPs
causal_idx = np.random.choice(n_snps, size=10, replace=False)
beta_snps = np.zeros(n_snps)
beta_snps[causal_idx] = np.random.normal(0.3, 0.1, size=10)

# Genetic effect
# mean-impute genotypes for calculation
G_imp = np.where(np.isnan(genotypes), np.nanmean(genotypes, axis=0), genotypes)
polygenic = G_imp.dot(beta_snps)
In [8]:
# Covariate effects
beta_age = 0.02
beta_sex = -0.1
beta_pop = 0.4

# Environmental noise
env = np.random.normal(0, 1.0, size=n_samples)
## convert population label to numeric 
pop_numeric = np.array([0 if x == 'A' else 1 for x in pop_labels])

y = 0.5 + polygenic + beta_age * age + beta_sex * sex + beta_pop * pop_numeric + env
In [10]:
df_meta = pd.DataFrame({'sample': samples, 'age': age, 'sex': sex, 'pop': pop_labels, 'y': y}).set_index('sample')
df_meta.to_csv('gwas_outputs/phenotypes.csv')
print('Phenotypes saved: gwas_outputs/phenotypes.csv')

# ~2% missing
mask_missing = np.random.rand(n_samples, n_snps) < 0.02
genotypes[mask_missing] = np.nan

snps_ids = [f'rs{100000+i}' for i in range(n_snps)]
df_geno = pd.DataFrame(genotypes, index=samples, columns=snps_ids)

print('Generated genotype matrix shape:', df_geno.shape)
df_geno.head()
Phenotypes saved: gwas_outputs/phenotypes.csv
Generated genotype matrix shape: (50, 500)
Out[10]:
rs100000 rs100001 rs100002 rs100003 rs100004 rs100005 rs100006 rs100007 rs100008 rs100009 ... rs100490 rs100491 rs100492 rs100493 rs100494 rs100495 rs100496 rs100497 rs100498 rs100499
S01 0.0 2.0 0.0 0.0 0.0 0.0 2.0 1.0 0.0 0.0 ... 0.0 2.0 1.0 0.0 NaN 1.0 0.0 1.0 NaN 1.0
S02 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
S03 NaN 0.0 0.0 0.0 1.0 1.0 1.0 0.0 2.0 1.0 ... 0.0 0.0 1.0 0.0 2.0 1.0 0.0 1.0 1.0 NaN
S04 2.0 1.0 1.0 1.0 0.0 0.0 2.0 NaN 1.0 0.0 ... 1.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
S05 2.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 ... 0.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0

5 rows × 500 columns

Questions¶

  1. What does the variable `pop_labels` represent in the simulation code above? How does it affect the genotype data being generated?
  2. How is the phenotype (`y`) generated in the simulation above? What are the important factors that contribute to the value of `y`?
  3. What is the structure of the genotype matrix (`df_geno`)? What does each row represent? What about each column? What do the values in the cells mean?
  4. How are the genotypes generated in the code above? Explain what Hardy–Weinberg equilibrium is and how it relates to the way genotypes are generated in the simulation.
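As a hint for question 4: the simulation draws each genotype (0, 1, or 2 alternate-allele copies) from the genotype probabilities that Hardy–Weinberg equilibrium implies for an allele frequency p. A minimal sketch of those probabilities (the `hwe_probs` helper is for illustration only, not part of the notebook):

```python
def hwe_probs(p):
    """Hardy-Weinberg genotype probabilities for alt-allele frequency p.

    Returns [P(0 copies), P(1 copy), P(2 copies)]
          = [(1-p)^2, 2p(1-p), p^2], which always sums to 1.
    """
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]

# e.g. p = 0.2 gives probabilities of roughly [0.64, 0.32, 0.04]
print(hwe_probs(0.2))
```

These are exactly the `probsA`/`probsB` lists passed to `np.random.choice` in the genotype-generation loop above.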
In [11]:
import numpy as np

def compute_maf(genotypes):
    """
    Compute minor allele frequency (MAF) for a single SNP.

    Parameters
    ----------
    genotypes : array-like
        Genotypes coded as 0, 1, 2 (alt allele count).
        Missing values can be np.nan.

    Returns
    -------
    maf : float
        Minor allele frequency, or np.nan if no valid genotypes.
    """
    g = np.asarray(genotypes, dtype=float)
    g = g[~np.isnan(g)]  # remove missing

    if g.size == 0:
        return np.nan

    # total number of alternate alleles
    alt_allele_count = g.sum()
    # total number of alleles
    total_alleles = 2 * g.size

    af = alt_allele_count / total_alleles
    maf = min(af, 1 - af)

    return maf
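As a quick sanity check, `compute_maf` behaves as expected on tiny hand-built vectors (the function's logic is restated inline here so the snippet runs on its own):

```python
import numpy as np

def compute_maf(genotypes):
    # Same logic as the function above: drop NaNs, count alt alleles,
    # divide by total alleles (2 per sample), fold to the minor allele.
    g = np.asarray(genotypes, dtype=float)
    g = g[~np.isnan(g)]
    if g.size == 0:
        return np.nan
    af = g.sum() / (2 * g.size)
    return min(af, 1 - af)

# 3 valid genotypes -> 6 alleles; alt count = 0 + 1 + 2 = 3; af = 0.5
print(compute_maf([0, 1, 2, np.nan]))  # 0.5
# all homozygous alt -> af = 1.0, so the minor allele frequency is 0.0
print(compute_maf([2, 2, 2]))  # 0.0
```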
In [95]:
maf = df_geno.apply(compute_maf, axis=0)
maf
Out[95]:
rs100000    0.061224
rs100001    0.377551
rs100002    0.340000
rs100003    0.180000
rs100004    0.153061
              ...   
rs100495    0.360000
rs100496    0.071429
rs100497    0.438776
rs100498    0.010417
rs100499    0.132653
Length: 500, dtype: float64

Questions:¶

  1. What is an allele? Explain how allele frequencies are calculated. What is minor allele frequency (MAF)?
  2. Generate a plot for the MAF distribution.
In [14]:
# TASK: Compute QC metrics and run PCA
# 1) Compute missingness for each SNP and save as gwas_outputs/variant_qc.csv
# 2) Mean-impute genotypes and run PCA on a subset of SNPs to get PCs (save to gwas_outputs/pcs.csv)
# 3) Plot PC1 vs PC2 colored by population and save as gwas_outputs/pca.png
#
# --- TODO: Computing missingness for each snp ---
### YOUR CODE HERE ###

# Save QC (once implemented)
# qc = pd.DataFrame({'MAF': maf, 'missingness': missingness})
# qc.head()

Questions¶

  1. Report the first 5 rows of the object `qc`.
In [15]:
### Impute Missing Values and Center genotype for PCA analysis
G_imp = df_geno.fillna(df_geno.mean(axis=0))
G_centered = G_imp - G_imp.mean(axis=0)
In [16]:
# PCA on mean-imputed genotypes (for population structure)
# with only 500 SNPs we run PCA on the full matrix; on larger data, subset SNPs for speed
pca = PCA(n_components=4)
pcs = pca.fit_transform(G_centered.values)

pc_df = pd.DataFrame(pcs, index=samples, columns=['PC1','PC2','PC3','PC4'])
pc_df.head()
Out[16]:
PC1 PC2 PC3 PC4
S01 -1.888524 -3.009312 0.707643 0.659718
S02 -2.375150 4.459011 -3.793163 1.617335
S03 -5.630537 -2.389041 -3.736545 -2.093097
S04 -1.966514 0.735002 -4.722171 0.654252
S05 -2.190321 0.636415 3.232525 -0.872838
In [17]:
df_meta.head()
Out[17]:
age sex pop y
sample
S01 45 0 A 1.592461
S02 45 1 A 3.078932
S03 48 1 A 0.203718
S04 45 1 A 2.332965
S05 54 0 A 2.585327
In [18]:
plt.figure(figsize=(6, 5))
for label in df_meta['pop'].unique():
    idx = df_meta['pop'] == label
    plt.scatter(
        pc_df.loc[idx, 'PC1'],
        pc_df.loc[idx, 'PC2'],
        label=label,
        s=30,
        alpha=0.8
    )

plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA: PC1 vs PC2')
plt.legend(title='Population',  loc='lower right')
plt.tight_layout()
plt.show()
(Figure: PCA scatter plot of PC1 vs PC2, samples colored by population.)

Questions¶

  1. Explain what PCA (principal component analysis) is in simple terms, and explain what the PCA plot above demonstrates.
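One way to reason about the plot is to check how much genotype variance each PC captures; scikit-learn's `PCA` exposes this as `explained_variance_ratio_`. A minimal sketch on a toy matrix (the random genotypes below are hypothetical, not the notebook's data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# hypothetical toy genotype matrix: 20 samples x 100 SNPs coded 0/1/2
G = rng.integers(0, 3, size=(20, 100)).astype(float)
G_centered = G - G.mean(axis=0)  # center each SNP, as in the notebook

pca = PCA(n_components=4)
pcs = pca.fit_transform(G_centered)

# fraction of total genotype variance captured by each PC, in decreasing order
print(pca.explained_variance_ratio_)
```

With real population structure (as in the simulation above), PC1 typically captures a visibly larger share of variance than the remaining PCs, which is why populations A and B separate along it.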

Notes & Next steps¶

  • This assignment uses simulated data.
  • For real GWAS, use PLINK, Hail, or SAIGE for large-scale analyses.
In [ ]: