llm · advanced

Fine-Tune and Serve a Domain-Specific AI Course Assistant

Start from a real, pretrained GPT-2, fine-tune it on real AI/ML Q&A data using the exact SFT recipe from Module 32, then serve it with genuine decoding control and a production streaming endpoint.

8-10 hours end to end · Deep Learning

Problem Statement

Every LLM lesson in this course, from Module 28's Transformer through Module 32's RLHF, built and trained models at toy scale for a specific reason: to prove every mechanism from first principles without needing a datacenter. Industry teams rarely train a large language model from random weights for a specific product need. That is what pretraining (Module 31) exists to do once, at enormous scale, so that every downstream team can instead start from an already-pretrained checkpoint and fine-tune it, exactly as Module 32 taught. This capstone does precisely that: take a publicly available pretrained model (GPT-2 small, 124 million parameters), fine-tune it into a useful AI/ML course assistant that answers student questions about the material in this course, then deploy it as a callable API. This is the complete industry workflow the course has been building toward since Module 28.

Dataset

A curated AI/ML Q&A dataset, built from this course's own content plus a public technical Q&A corpus

A combination of two real sources: (1) instruction-response pairs manually derived from this Deep Learning course's own lesson content, turning each module's 'what' and 'why' explanations into natural student questions and clear answers, and (2) a filtered subset of the Stack Exchange Data Science / Machine Learning Q&A dump, restricted to well-answered, highly-upvoted questions on core deep learning topics, to add real-world phrasing variety beyond this course's own writing style.

~3,000-5,000 instruction-response pairs after cleaning and filtering, a few MB as plain text

Source: course-derived pairs (self-authored) plus the Stack Exchange Data Dump, publicly available under a Creative Commons license

Architecture Decisions

The single most important decision in this project is choosing to fine-tune GPT-2 small rather than pretrain anything from scratch. This is the standard industry pattern Module 32 built the mechanism for, and it is the only choice that is feasible on a 16GB local machine. GPT-2 small's 124 million parameters, loaded via Hugging Face's transformers library rather than this course's own from-scratch Module 28 implementation, are few enough to fine-tune in full on CPU (slowly) or quickly on any consumer GPU with 6GB or more VRAM, while still being a production-grade Transformer architecture rather than a toy. The choice to use Hugging Face's library, rather than this course's hand-built GPTModel from Module 31, mirrors how real teams work: understanding the architecture from first principles (Modules 28-31) is what makes using a battle-tested, optimized library implementation an informed choice rather than a black box.
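The feasibility claim above can be sanity-checked with rough arithmetic. The sketch below assumes 32-bit weights and an Adam-style optimizer that keeps two running averages per parameter (both are assumptions, not settings stated in this project), and it ignores activation memory, which grows with batch size and sequence length.

```python
# Back-of-the-envelope training memory for GPT-2 small, excluding activations.
PARAMS = 124_000_000   # parameter count stated in the text
BYTES_FP32 = 4         # assumption: weights stored as 32-bit floats


def gib(num_bytes: float) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 1024**3


weights = PARAMS * BYTES_FP32
gradients = PARAMS * BYTES_FP32          # one gradient value per parameter
adam_moments = 2 * PARAMS * BYTES_FP32   # assumption: two moment buffers per parameter

total = weights + gradients + adam_moments
print(f"weights only:          {gib(weights):.2f} GiB")
print(f"weights + grads + opt: {gib(total):.2f} GiB (before activations)")
```

Under these assumptions the fixed cost is roughly 2 GiB, which is why the remaining budget on a 6GB GPU or a 16GB machine is mostly spent on activations, and why the batch size is the main knob to turn.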

Built On

  • Module 28: The Transformer architecture, now used via an optimized library implementation instead of the from-scratch version
  • Module 29: Tokenization, using GPT-2's actual byte-level BPE tokenizer rather than the character-level toy version
  • Module 31: Pretraining, whose output (a checkpoint like GPT-2) this project starts from instead of retraining
  • Module 32: Supervised fine-tuning, applied here with real data instead of a 3-example toy dataset
  • Module 33-34: KV caching (used automatically by the library) and decoding strategy choices, now consequential at this model size
  • Module 37: FastAPI serving, extended here to support streaming token-by-token output

Step 1: Building a Real Instruction-Tuning Dataset

Following Module 32 Lesson 1's exact principle, this dataset consists of (instruction, response) pairs, but at a scale and diversity meaningfully beyond that module's 3-example illustration. Course-derived pairs are generated systematically from every module's existing content (a genuine practical reuse of everything built across this entire course), while Stack Exchange pairs are filtered specifically for quality: only questions with an accepted answer and a meaningful upvote count are kept, since low-quality or contested Q&A pairs would directly teach the model to produce equally low-quality responses.
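Because the same "Q: ... A: ..." layout has to appear both in the training data and in every prompt sent to the model later, it can help to keep the formatting in one place. The helper names below (`format_prompt`, `format_example`) are hypothetical and not part of the original scripts; this is a minimal sketch of the idea rather than the project's own code.

```python
EOS = "<|endoftext|>"  # GPT-2's end-of-text token, as used in this step's dataset


def format_prompt(question: str) -> str:
    """Text the model sees before it starts answering."""
    return f"Q: {question.strip()}\nA:"


def format_example(instruction: str, response: str) -> str:
    """Full training example: prompt, a space, the response, then EOS."""
    return f"{format_prompt(instruction)} {response.strip()}{EOS}"


example = format_example("What is SFT?", "Supervised fine-tuning on instruction-response pairs.")
print(example)
```

Reusing `format_prompt` at inference time guards against a subtle mismatch (for example a missing newline before "A:") between what the model was trained on and what it is asked to complete.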

Two Real Sources, Combined Into One Curated Dataset

Course-derived pairs give consistent, accurate explanations matching this course's own teaching style. Filtered Stack Exchange pairs add real-world question phrasing variety. Both are formatted identically before training.

[Diagram: Building the Instruction-Tuning Dataset. Course-derived pairs from this course's own module content (consistent, accurate, on-topic) and filtered Stack Exchange pairs (accepted and highly upvoted only, adding real-world phrasing variety) feed into one combined, formatted dataset.]
01_build_instruction_dataset.py
```python
import json
import pandas as pd

# --- PART A: GENERATE PAIRS FROM THIS COURSE'S OWN CONTENT ---
# In practice, this reads every module's actual "what" and "why" fields
# and reformats them into natural question-answer pairs -- shown here
# with representative examples for clarity
course_derived_pairs = [
    {
        "instruction": "Why does a single neuron fail to solve XOR?",
        "response": "A single neuron always draws exactly one straight line to separate its outputs, "
                    "no matter how its weights are tuned. XOR's four points sit so that no single "
                    "straight line can separate the two classes correctly, so a single neuron cannot "
                    "solve it regardless of training.",
    },
    {
        "instruction": "What problem does batch normalization solve?",
        "response": "As data flows through a deep network, each layer's input distribution keeps "
                    "shifting as earlier layers' weights update during training. Batch normalization "
                    "stabilizes this by normalizing each layer's inputs to have consistent statistics, "
                    "which allows higher learning rates and faster, more stable training.",
    },
    {
        "instruction": "Why do transformers use positional encoding?",
        "response": "Self-attention treats its input as a set of vectors with no inherent notion of "
                    "order -- the same word produces an identical output regardless of its position "
                    "in a sentence. Positional encoding adds a position-dependent signal to each "
                    "word's embedding before self-attention, giving the model a way to distinguish "
                    "positions.",
    },
    # ... continues for every module's core concepts, systematically generated
]

# --- PART B: FILTER STACK EXCHANGE FOR HIGH-QUALITY, RELEVANT PAIRS ---
def filter_stackexchange_pairs(raw_posts_df, min_score=10, relevant_tags=None):
    """Keep only accepted-answer questions with a meaningful score,
    on topics this course actually covers.

    Assumes the dump's columns have been renamed to snake_case
    (accepted_answer_id, score, tags) after loading.
    """
    if relevant_tags is None:
        relevant_tags = ["neural-network", "deep-learning", "cnn", "rnn",
                         "transformers", "backpropagation", "gradient-descent"]

    has_accepted_answer = raw_posts_df["accepted_answer_id"].notna()
    meets_score_threshold = raw_posts_df["score"] >= min_score
    matches_relevant_tags = raw_posts_df["tags"].apply(
        # Answer rows carry no tags (NaN), so guard before the substring check
        lambda tags: isinstance(tags, str) and any(tag in tags for tag in relevant_tags)
    )

    filtered = raw_posts_df[has_accepted_answer & meets_score_threshold & matches_relevant_tags]
    print(f"Filtered {len(raw_posts_df)} raw posts down to {len(filtered)} high-quality pairs")
    return filtered

# raw_stackexchange_df = pd.read_xml("Posts.xml")  # actual Stack Exchange data dump format
# filtered_pairs = filter_stackexchange_pairs(raw_stackexchange_df)

# --- COMBINING AND FORMATTING, MODULE 32 LESSON 1's EXACT MARKER STYLE ---
all_pairs = course_derived_pairs   # + [{"instruction": ..., "response": ...} for filtered rows]

formatted_dataset = []
for pair in all_pairs:
    formatted_text = f"Q: {pair['instruction']}\nA: {pair['response']}<|endoftext|>"
    formatted_dataset.append({"text": formatted_text})

with open("instruction_dataset.jsonl", "w", encoding="utf-8") as f:
    for example in formatted_dataset:
        f.write(json.dumps(example) + "\n")

print(f"Total training examples: {len(formatted_dataset)}")
print("Saved to instruction_dataset.jsonl, ready for Step 2's fine-tuning")
```

Gotchas

  • The <|endoftext|> marker is GPT-2's actual special token marking the end of a training example. Using this exact token, rather than an invented placeholder, matters because it is already present in GPT-2's tokenizer vocabulary and in its pretrained behavior around sequence boundaries.
  • Filtering Stack Exchange data by score and accepted-answer status is a direct, practical application of Module 32's data quality principle: a fine-tuning dataset directly shapes the model's behavior, so including poorly-answered or contested questions would teach the model to imitate exactly that lower quality.
  • This dataset is deliberately small by industry LLM standards (thousands, not millions, of pairs). That is appropriate for a focused, domain-specific assistant fine-tuned from an already-capable pretrained model, which needs far less data to adapt its existing knowledge than pretraining from scratch would.
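The filter in this step keeps question rows, but each question still has to be joined to the body of its accepted answer to form a pair. The sketch below shows one way that join might look, using plain dictionaries keyed by the attribute names found in the data dump's Posts.xml (`Id`, `PostTypeId`, `AcceptedAnswerId`, `Score`, `Title`, `Body`). The function names and the regex-based HTML stripping are illustrative assumptions, not the project's actual pipeline.

```python
import html
import re


def strip_html(body: str) -> str:
    """Crude tag removal; a real pipeline would likely use an HTML parser."""
    text = re.sub(r"<[^>]+>", " ", body)
    return re.sub(r"\s+", " ", html.unescape(text)).strip()


def pair_questions_with_answers(posts, min_score=10):
    """Join each sufficiently upvoted question to its accepted answer."""
    answers_by_id = {p["Id"]: p for p in posts if p["PostTypeId"] == "2"}
    pairs = []
    for post in posts:
        if post["PostTypeId"] != "1":          # 1 = question, 2 = answer
            continue
        answer = answers_by_id.get(post.get("AcceptedAnswerId"))
        if answer is None or int(post["Score"]) < min_score:
            continue
        pairs.append({
            "instruction": strip_html(post["Title"]),
            "response": strip_html(answer["Body"]),
        })
    return pairs


# Tiny made-up sample; XML attributes arrive as strings.
sample_posts = [
    {"Id": "1", "PostTypeId": "1", "AcceptedAnswerId": "2", "Score": "15",
     "Title": "Why use ReLU?", "Body": "<p>Question body</p>"},
    {"Id": "2", "PostTypeId": "2", "Score": "20",
     "Body": "<p>It avoids <em>saturation</em> for positive inputs.</p>"},
    {"Id": "3", "PostTypeId": "1", "Score": "40",
     "Title": "No accepted answer yet", "Body": "<p>Question body</p>"},
]
pairs = pair_questions_with_answers(sample_posts)
print(pairs)
```

The resulting dictionaries have the same `instruction` and `response` keys as the course-derived pairs, so they can be appended to the combined list before formatting.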

Step 2: Fine-Tuning a Real Pretrained GPT-2

This step uses Hugging Face's transformers library to load a pretrained GPT-2 small checkpoint and continue training it on Step 1's dataset, now through a production-grade training utility rather than a hand-written loop. One caveat: Module 32 Lesson 1's response-only loss masking is not what the script in this step does out of the box. DataCollatorForLanguageModeling with mlm=False computes the next-token loss over every token in a block, question and answer alike, so reproducing response-only masking would require building the labels yourself. Using a library here, after building the entire mechanism by hand in Modules 28-32, is a deliberate pedagogical point: because this course built the equivalent from scratch, you know what Trainer and DataCollatorForLanguageModeling do internally, and that is what turns using them from trusting a black box into an informed engineering decision.
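For reference, response-only loss masking can be sketched independently of any training library. The idea is to set the label of every prompt token to an ignore value so that only the response (and the end-of-text token) contributes to the loss. The sketch below uses -100, which is the default `ignore_index` of PyTorch's cross-entropy loss; the token ids are made up for illustration, and `build_masked_example` is a hypothetical helper name.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_masked_example(prompt_ids, response_ids, eos_id):
    """Concatenate prompt and response, supervising only the response and EOS."""
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id]
    return input_ids, labels


# Made-up ids standing in for a tokenized "Q: ...\nA:" prompt and its answer.
prompt_ids = [48, 25, 4162, 198, 32, 25]
response_ids = [1212, 318, 1521]
input_ids, labels = build_masked_example(prompt_ids, response_ids, eos_id=50256)

print(input_ids)
print(labels)
```

Labels are kept the same length as the inputs here on the assumption that the model shifts them by one position internally when computing the loss, as Hugging Face's causal language models do. A custom collator would then pad `input_ids` with the pad token and `labels` with the ignore value.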

Continuing Training From a Real Pretrained Checkpoint

GPT-2's 124 million pretrained parameters, already capable of fluent English, continue training on the domain-specific instruction dataset. This is Module 32's SFT principle, now at real model scale.

[Diagram: Pretrained GPT-2 → Fine-Tuned Course Assistant. Left: GPT-2 small, pretrained, 124M parameters, with real English fluency, loaded from Hugging Face and not retrained. Right, after SFT (Module 32): fine-tuned on course Q&A with the loss masked to response tokens, retaining general fluency while gaining domain focus. Fine-tuning adapts existing knowledge; it does not teach English or reasoning from zero.]
02_fine_tune_gpt2.py
```python
from transformers import (
    GPT2LMHeadModel, GPT2Tokenizer,
    TextDataset, DataCollatorForLanguageModeling,
    Trainer, TrainingArguments,
)

model_name = "gpt2"   # GPT-2 small, 124M parameters, a REAL pretrained checkpoint

tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no dedicated pad token by default
# Caveat: DataCollatorForLanguageModeling sets labels at pad-token positions to -100,
# so with pad == eos the <|endoftext|> targets are excluded from the loss as well.
# If the fine-tuned model fails to stop cleanly, a dedicated pad token is one option.

model = GPT2LMHeadModel.from_pretrained(model_name)

print(f"Loaded REAL pretrained GPT-2: {model.num_parameters():,} parameters")
print("This checkpoint already understands English fluently -- fine-tuning")
print("adapts it toward this specific domain, following Module 32's exact")
print("principle, rather than teaching it language from scratch.\n")

# Prepare the dataset in the plain-text format transformers expects for
# causal language modeling fine-tuning.
# Note: TextDataset is deprecated in recent transformers releases in favor of
# the datasets library; it also concatenates the file and cuts it into
# fixed-size blocks, so a block can span more than one Q/A example.
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="instruction_dataset_plain.txt",   # instruction_dataset.jsonl converted to one text-per-line
    block_size=256,
)

# DataCollatorForLanguageModeling with mlm=False handles next-token
# prediction batching automatically -- the SAME self-supervised
# objective built by hand in Module 31 Lesson 1. It supervises every
# token in a block, not only the response tokens.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)

training_args = TrainingArguments(
    output_dir="./gpt2-course-assistant",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,   # kept small for feasibility on 16GB RAM
    save_steps=500,
    save_total_limit=2,
    learning_rate=5e-5,   # a conservative rate, appropriate for fine-tuning an already-capable model
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

print("=== FINE-TUNING GPT-2 ON THE COURSE ASSISTANT DATASET ===\n")
trainer.train()

model.save_pretrained("./gpt2-course-assistant-final")
tokenizer.save_pretrained("./gpt2-course-assistant-final")
print("\nFine-tuned model saved to ./gpt2-course-assistant-final")
```

Gotchas

  • per_device_train_batch_size=4 is deliberately small. GPT-2 small uses meaningfully more memory per example than the toy models built throughout this course, and this batch size is chosen to remain feasible on a 16GB RAM machine without a dedicated high-VRAM GPU.
  • learning_rate=5e-5 is conservative because this is fine-tuning an already highly capable pretrained model. A learning rate that is too high risks the catastrophic forgetting problem Module 19 Lesson 3 measured directly, here degrading GPT-2's general fluency rather than a vision backbone's learned features.
  • DataCollatorForLanguageModeling with mlm=False configures the same self-supervised next-token objective built by hand in Module 31. This library call is not a different task; it is the identical training objective, using tested, optimized code instead of a from-scratch loop.
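The fine-tuning script reads `instruction_dataset_plain.txt`, while Step 1 writes `instruction_dataset.jsonl`, so a small conversion sits between the two. The sketch below is one possible version of it; the function name `jsonl_to_plain_text` is hypothetical, and the demo runs on a throwaway directory so it does not depend on Step 1's output being present.

```python
import json
import tempfile
from pathlib import Path


def jsonl_to_plain_text(jsonl_path, text_path):
    """Copy each record's "text" field into a plain-text file; return the count."""
    count = 0
    with open(jsonl_path, encoding="utf-8") as src, \
         open(text_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # tolerate blank lines
            dst.write(json.loads(line)["text"] + "\n")
            count += 1
    return count


# Self-contained demo; in the project the two paths would be
# instruction_dataset.jsonl and instruction_dataset_plain.txt.
with tempfile.TemporaryDirectory() as tmp:
    src_path = Path(tmp) / "instruction_dataset.jsonl"
    dst_path = Path(tmp) / "instruction_dataset_plain.txt"
    sample = {"text": "Q: What is SFT?\nA: Supervised fine-tuning.<|endoftext|>"}
    src_path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    written = jsonl_to_plain_text(src_path, dst_path)
    plain_text = dst_path.read_text(encoding="utf-8")

print(written)
print(plain_text)
```

Each example already contains a newline between its question and answer, so the output is not strictly one example per line. The `<|endoftext|>` marker, not the line break, is what delimits examples in the resulting file.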

Step 3: Choosing Real Decoding Parameters at Real Model Scale

With a toy character-level model in Module 34, decoding strategy differences were measurable but modest. At GPT-2's scale and vocabulary, the choice between greedy decoding, temperature, and top-p sampling has visible, qualitative effects on response quality. This step runs the same prompt through several decoding configurations and compares the generated text directly, rather than assuming Module 34's toy-scale conclusions transfer unchanged to a real model.

03_decoding_configuration.py
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("./gpt2-course-assistant-final")
model = GPT2LMHeadModel.from_pretrained("./gpt2-course-assistant-final")
model.eval()

prompt = "Q: Why does batch normalization help training?\nA:"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

print("=== COMPARING DECODING STRATEGIES ON A REAL FINE-TUNED MODEL ===\n")

print("--- Greedy decoding (Module 34 Lesson 1) ---")
greedy_output = model.generate(
    input_ids, max_new_tokens=60, do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

# generate() may apply a default top_k when sampling (50 in many releases);
# top_k=0 switches it off so only the settings listed here are in effect.
print("\n--- Temperature 0.7 + top-p 0.9 (Module 34 Lesson 2-3) ---")
sampled_output = model.generate(
    input_ids, max_new_tokens=60, do_sample=True,
    temperature=0.7, top_p=0.9, top_k=0,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(sampled_output[0], skip_special_tokens=True))

print("\n--- High temperature 1.5, no filtering (Module 34 Lesson 1's tail-risk case) ---")
high_temp_output = model.generate(
    input_ids, max_new_tokens=60, do_sample=True,
    temperature=1.5, top_k=0,   # top_k=0 so this case really is unfiltered
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(high_temp_output[0], skip_special_tokens=True))

print("""
=== THE REAL-SCALE OBSERVATION ===

At this real model scale, greedy decoding often produces safe but
repetitive phrasing, the moderate temperature+top-p configuration
typically produces the most coherent, natural-sounding response,
and the high-temperature configuration frequently degrades into
less coherent, sometimes off-topic text -- exactly Module 34
Lesson 1's tail-risk concern, now visibly manifesting in real
generated language rather than only in a measured probability
distribution.

The production API in Step 4 uses the moderate configuration
(temperature=0.7, top_p=0.9) as its default, based on this direct
comparison rather than an assumed default.
""")
```

Gotchas

  • do_sample=False and do_sample=True are Hugging Face's actual parameter names for switching between Module 34 Lesson 1's greedy decoding and sampling-based approaches. The underlying mechanism is identical to what this course built from scratch, just exposed through a library's API.
  • pad_token_id=tokenizer.eos_token_id is passed explicitly because GPT-2 has no dedicated padding token. Without it, generate() falls back to the end-of-text token on its own and logs a warning. Reusing the end-of-text token for padding is the standard convention, not a workaround specific to this project.
  • Results will vary between runs when do_sample=True, just as Module 34's own sampling-based methods are non-deterministic by design. Running this comparison multiple times is worth doing before settling on final production defaults.
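The effect of the temperature values compared in this step can be seen on a handful of numbers before involving the model at all. The sketch below divides a made-up set of logits by the temperature and applies a softmax, which is the standard definition of temperature scaling; the logits are illustrative and not taken from GPT-2.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize with a stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0, 0.0]   # made-up scores for four candidate tokens
for temperature in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures move mass toward the tail. On a vocabulary of tens of thousands of tokens, that tail is where the off-topic continuations come from.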

Step 4: Serving With Streaming Responses

The final deployment extends Module 37's FastAPI pattern with one new, production-relevant feature: streaming the response token by token as it's generated, rather than waiting for the entire response to finish before returning anything. This is how chat products like ChatGPT and Claude display text as it generates, giving users immediate feedback instead of a long, silent wait for a full response.
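The difference between streaming and waiting comes down to whether the server hands back a generator or a finished string. The sketch below uses a stand-in token source instead of a model (the names `fake_token_source` and `respond_blocking` are hypothetical) and records the order of events to show that, with a generator, each token can be sent before the next one has been produced.

```python
def fake_token_source(tokens, log):
    """Stand-in for a model's decoding loop: yields one token at a time."""
    for token in tokens:
        log.append(f"generated {token!r}")
        yield token


def respond_blocking(tokens, log):
    """Collect every token first, then hand back one complete string."""
    return "".join(fake_token_source(tokens, log))


# Streaming: consuming the generator interleaves generation and sending.
events = []
for chunk in fake_token_source(["Batch", " norm", " helps"], events):
    events.append(f"sent {chunk!r}")

# Blocking: nothing can be sent until all tokens exist.
blocking_log = []
full_text = respond_blocking(["Batch", " norm", " helps"], blocking_log)

print(events)
print(full_text)
```

A streaming HTTP response works the same way: the framework pulls from the generator and writes each yielded chunk to the connection as it arrives.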

Streaming Response: Tokens Sent as They're Generated

Rather than waiting for the complete response before replying, each token is sent to the client the moment it's generated, giving the user immediate, visible progress. This is the standard pattern behind real chat interfaces.

[Diagram: Streaming vs Waiting for the Full Response. Without streaming, the user waits in silence for the entire response to finish. With streaming (this project), each token is sent as it's generated, giving visible progress and a real chat UX. The diagram also notes that KV caching (Module 33) keeps each new token fast to generate.]
04_serve_streaming_assistant.py
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

app = FastAPI(title="AI Course Assistant")

tokenizer = None
model = None

# Newer FastAPI releases recommend a lifespan handler over on_event,
# which still works but is marked as deprecated.
@app.on_event("startup")
def load_model():
    global tokenizer, model
    tokenizer = GPT2Tokenizer.from_pretrained("./gpt2-course-assistant-final")
    model = GPT2LMHeadModel.from_pretrained("./gpt2-course-assistant-final")
    model.eval()
    print("Fine-tuned course assistant loaded, ready to stream responses.")

class QuestionRequest(BaseModel):
    question: str

def generate_streaming_response(question: str):
    prompt = f"Q: {question}\nA:"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    generated_ids = input_ids
    # This loop re-runs the model on the full sequence at every step, so it
    # does NOT reuse Module 33's KV cache; generate() would do that internally.
    with torch.no_grad():
        for _ in range(150):
            outputs = model(generated_ids)
            next_token_logits = outputs.logits[0, -1] / 0.7   # temperature, per Step 3's chosen default
            filtered_logits = top_p_filter(next_token_logits, p=0.9)   # Module 34 Lesson 3's technique
            probabilities = torch.softmax(filtered_logits, dim=-1)
            next_token = torch.multinomial(probabilities, num_samples=1)

            if next_token.item() == tokenizer.eos_token_id:
                break

            generated_ids = torch.cat([generated_ids, next_token.unsqueeze(0)], dim=1)
            # Byte-level BPE can split one multi-byte character across tokens,
            # so decoding a single token may occasionally emit a replacement character.
            token_text = tokenizer.decode(next_token)
            yield token_text   # SEND this token to the client immediately, don't wait for the rest

def top_p_filter(logits, p):
    """Module 34 Lesson 3's top-p filtering, applied here to a real
    50,000+ token vocabulary instead of the toy example's 10 tokens."""
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
    sorted_indices_to_remove = cumulative_probs > p
    # Shift the mask right by one so the token that crosses the threshold is
    # kept; otherwise the surviving probability mass can fall well below p.
    sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].clone()
    sorted_indices_to_remove[0] = False
    filtered = logits.clone()   # avoid mutating the caller's tensor
    filtered[sorted_indices[sorted_indices_to_remove]] = float("-inf")
    return filtered

@app.post("/ask")
async def ask_assistant(request: QuestionRequest):
    return StreamingResponse(
        generate_streaming_response(request.question),
        media_type="text/plain",
    )

# Run with: uvicorn 04_serve_streaming_assistant:app --host 0.0.0.0 --port 8000
# Test with: curl -N -X POST http://localhost:8000/ask -H "Content-Type: application/json" \
#            -d '{"question": "What is the vanishing gradient problem?"}'
```

Gotchas

  • The -N flag in the curl test command disables curl's output buffering. Without it, curl would wait for the full response before displaying anything, hiding the streaming behavior even though the server is sending tokens incrementally.
  • This implementation calls model(generated_ids) fresh at each step for clarity, matching Module 33 Lesson 1's naive baseline explanation. An optimized production version would use Hugging Face's built-in generate(..., streamer=...) utility, which uses Module 33 Lesson 2's KV cache internally rather than reprocessing the growing sequence from scratch at each step.
  • top_p_filter here is applied to a vocabulary of over 50,000 real tokens, not the 10-token toy example from Module 34. The algorithm is identical, but real timing and real output quality only become visible at this scale, which is why this capstone exists rather than stopping at Module 34's toy demonstration.
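Top-p filtering is easiest to check by hand on a tiny distribution. The sketch below follows one common formulation of nucleus sampling: keep the smallest set of highest-probability tokens whose cumulative mass reaches p, including the token that crosses the threshold, then renormalize. The probabilities and the helper names (`top_p_keep`, `renormalize`) are made up for illustration.

```python
def top_p_keep(probs, p):
    """Indices of the smallest high-probability set whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:      # the token that crosses the threshold is kept
            break
    return kept


def renormalize(probs, kept):
    """Rescale the surviving probabilities so they sum to 1."""
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


probs = [0.05, 0.5, 0.15, 0.3]          # made-up next-token probabilities
kept = top_p_keep(probs, p=0.9)
print(kept)                              # → [1, 3, 2]
print(renormalize(probs, kept))
```

With p=0.9, the two most likely tokens cover only 0.8 of the mass, so the third is kept as well and only the 0.05 tail token is dropped. Excluding the token that crosses the threshold would leave just 0.8 of the mass here, noticeably less than the requested 0.9.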