AI Training Exercise

Welcome to the AI Training Exercise

This tool walks you through building a real machine learning model, a spam classifier, from scratch. No programming knowledge is needed. Each step mirrors what a data scientist actually does.

Why does this matter? Machine learning models are increasingly used to detect threats, filter content, and flag anomalies across security and business contexts. Understanding how they are built helps you ask better questions: What data was used? Which signals matter? How reliable is the result? Working through the process hands-on makes those questions concrete.

Classic software approach

A developer writes explicit rules: "if the message contains 'winner' and a URL, mark it as spam." These rules are fast and predictable, but they must be written and maintained by hand. Attackers can evade them simply by rewording their messages.
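For readers curious what such a rule looks like in code, here is a minimal sketch of the "winner plus URL" rule quoted above. The function name and the URL pattern are assumptions made for illustration; the exercise does not show its own rules.

```python
import re

# Hypothetical pattern for spotting a link; real filters use stricter checks.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)

def rule_based_is_spam(message: str) -> bool:
    """Hand-written rule: the word 'winner' together with a URL means spam."""
    has_winner = re.search(r"\bwinner\b", message, re.IGNORECASE) is not None
    has_url = URL_PATTERN.search(message) is not None
    return has_winner and has_url

# The rule catches the original wording...
print(rule_based_is_spam("You are a WINNER! Claim at http://example.com"))
# ...but a one-character rewording slips past it.
print(rule_based_is_spam("You are a w1nner! Claim at http://example.com"))
```

The second call shows the weakness described above: the rule only matches the exact wording its author anticipated.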


Machine learning approach

Instead of writing rules, you collect labelled examples and let the model find the patterns itself. Classic code still measures properties of each message (that is what Step 4, Select Features, does), but the model decides how to weigh them based on the training data.

What you will do

Inspect

Review 60 labelled messages and spot data quality problems.

Clean

Fix missing labels, remove duplicates, and correct mislabels.

Features

Choose which properties of a message the model learns from.

Train

Configure and train a neural network. Watch it learn in real time.

Test

Try the finished model on new messages and see how it performs.

Step 2: Inspect Data

Before training any model, you need to understand your data. Review the dataset below. Some rows have problems. Can you spot them? The summary panel shows what was found automatically.

Where do the labels come from? Every message already has a label: Spam or Not Spam. Labels can be applied in two ways:

By hand. A human reviews each message and assigns a label. This is called data annotation. It is reliable but slow: it can take hours or weeks of work depending on dataset size.
By a program. A developer writes rules such as keyword lists, regular expressions, and sender blocklists that automatically assign labels. This is faster, but the labels are only as good as the rules, and edge cases are easy to miss.

Either way, the quality of the labels directly determines how well the model learns. A wrong label teaches the model the wrong thing, which is exactly why the next step exists.
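As a rough illustration of the kind of automatic check a summary panel could run, the sketch below flags rows with a missing label and rows that repeat an earlier message. The sample rows and the function name find_issues are hypothetical, not the exercise's actual data or code. Mislabels are not covered, because spotting them usually takes a human reviewer.

```python
# Hypothetical rows; the exercise's real dataset has 60 messages.
rows = [
    {"message": "Win a free prize now", "label": "Spam"},
    {"message": "Lunch at noon?", "label": "Not Spam"},
    {"message": "Win a free prize now", "label": "Spam"},
    {"message": "Your invoice is attached", "label": None},
]

def find_issues(rows):
    """Return one list of issue tags per row, in the same order as rows."""
    seen = set()
    issues = []
    for row in rows:
        tags = []
        if row["label"] not in ("Spam", "Not Spam"):
            tags.append("missing label")
        # Compare messages ignoring case and surrounding whitespace.
        key = row["message"].strip().lower()
        if key in seen:
            tags.append("duplicate")
        seen.add(key)
        issues.append(tags)
    return issues

for number, tags in enumerate(find_issues(rows), start=1):
    print(number, tags or "ok")
```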

#	Message	Label	Issues

Step 3: Clean Data

Fix all data quality issues before training. Use the controls to correct mislabels, remove duplicates, and assign missing labels. All issues must be resolved before you can proceed.
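In code, the same clean-up might look like the sketch below. It assumes a human reviewer has already decided the right label for each problem row and passes those decisions in as corrections; the sample rows, the function name clean, and the corrections mapping are all hypothetical.

```python
# Hypothetical rows with one mislabel, one duplicate, and one missing label.
rows = [
    {"message": "Win a free prize now", "label": "Not Spam"},   # mislabelled
    {"message": "Lunch at noon?", "label": "Not Spam"},
    {"message": "win a free prize now ", "label": "Spam"},      # duplicate of row 0
    {"message": "Your invoice is attached", "label": None},     # missing label
]

def clean(rows, corrections):
    """Apply reviewer-supplied labels and keep only the first copy of each message.

    corrections maps a row index to the label a human reviewer chose.
    """
    cleaned = []
    seen = set()
    for index, row in enumerate(rows):
        key = row["message"].strip().lower()
        if key in seen:
            continue  # duplicate: the first copy is already kept
        seen.add(key)
        label = corrections.get(index, row["label"])
        if label not in ("Spam", "Not Spam"):
            # Mirrors the rule that every issue must be resolved before training.
            raise ValueError(f"row {index} still has no valid label")
        cleaned.append({"message": row["message"], "label": label})
    return cleaned

cleaned = clean(rows, corrections={0: "Spam", 3: "Not Spam"})
for row in cleaned:
    print(row["label"], "|", row["message"])
```

Raising an error on an unresolved row is a design choice for this sketch: it refuses to hand half-cleaned data to training rather than silently dropping the row.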

#	Message	Label	Action

Step 4: Select Features

Features are the measurable properties of a message that the model will learn from. Select at least 2 features. The preview panel shows computed values for a sample of your data.
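To make "measurable properties" concrete, here is a small sketch that turns a message into numbers. The four features shown are assumptions chosen for illustration; the exercise's actual feature list may differ.

```python
import re

def extract_features(message: str) -> dict:
    """Measure a few simple properties of a message (illustrative choices only)."""
    letters = [c for c in message if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    return {
        "length": len(message),
        "exclamation_marks": message.count("!"),
        "has_url": 1 if re.search(r"https?://|www\.", message, re.IGNORECASE) else 0,
        # Share of letters that are upper case; 0.0 when there are no letters.
        "uppercase_ratio": upper / len(letters) if letters else 0.0,
    }

print(extract_features("FREE!!! http://x.co"))
print(extract_features("See you at lunch."))
```

Note that the code only measures; it never decides whether a value means spam. That decision is left to the model.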

Available Features

Live Preview (5 sample rows)

Step 5: Train the Model

Configure and train a neural network on your cleaned data. Watch the loss and accuracy curves update in real time as the model learns.

What is an epoch? One epoch means the model has seen every training example once. Running multiple epochs gives it more chances to learn. However, too many epochs can cause overfitting, where the model memorises the training data and performs poorly on new messages. Watch the accuracy curve: once it flattens or starts to drop, more epochs are unlikely to help.
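The idea of an epoch can be sketched as a loop inside a loop. The example below is a simplification: a single-neuron model (logistic regression) stands in for the exercise's neural network, and the two features and six training examples are made up. Accuracy is measured on the training examples themselves, so this sketch cannot reveal overfitting; that would need messages held back from training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Probability that a message is spam, according to the current settings."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def train(examples, epochs, learning_rate=0.1):
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    accuracy_per_epoch = []
    for _ in range(epochs):
        # One epoch: every training example is visited exactly once.
        for features, label in examples:
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
        correct = sum(
            (predict(weights, bias, f) >= 0.5) == (y == 1) for f, y in examples
        )
        accuracy_per_epoch.append(correct / len(examples))
    return weights, bias, accuracy_per_epoch

# Hypothetical features per message: (exclamation marks, contains a URL).
examples = [
    ((3, 1), 1), ((2, 1), 1), ((4, 0), 1),   # spam
    ((0, 0), 0), ((0, 1), 0), ((1, 0), 0),   # not spam
]
weights, bias, history = train(examples, epochs=20)
for epoch, acc in enumerate(history, start=1):
    print(f"epoch {epoch:2d}: accuracy {acc:.2f}")
```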

What is learning rate? After each example, the model nudges its internal settings slightly to reduce its mistakes. The learning rate controls how big those nudges are. Too high: the model overshoots and never settles on a good answer. Too low: training is very slow and may not converge. Medium is a safe default for most cases.
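The effect of the learning rate shows up even on a toy problem. The sketch below is not the exercise's model: it uses gradient descent to find the value of w that minimises (w - 3)^2, whose best answer is 3, and the three rates are arbitrary picks to show each behaviour.

```python
def descend(learning_rate, steps=20, start=0.0):
    """Minimise (w - 3)**2 by repeatedly stepping against the gradient."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)          # slope of (w - 3)**2 at the current w
        w -= learning_rate * gradient   # the "nudge", scaled by the learning rate
    return w

for rate in (0.001, 0.1, 1.1):
    w = descend(rate)
    print(f"learning rate {rate}: w = {w:.3f}, distance from 3 = {abs(w - 3):.3f}")
```

With the smallest rate, w has barely moved after 20 steps. With the middle rate it lands close to 3. With the largest rate each step jumps past 3 by more than the last, so the distance grows instead of shrinking.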

What are loss and accuracy? Loss measures how wrong the model's predictions are on average, so lower is better. Think of it as a penalty score: a perfect prediction scores 0, and the more wrong a prediction is, the higher it scores. Accuracy is simply the percentage of examples the model classified correctly. Ideally, loss falls and accuracy rises as training progresses. If both stop improving, the model has learned as much as it can from this data with the current features and settings.
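Both numbers can be computed in a few lines. The exercise does not say which loss it uses; the sketch below assumes binary cross-entropy, a common choice for two-class problems, and a 0.5 cut-off for calling a message spam. The probabilities are made-up model outputs.

```python
import math

def binary_cross_entropy(probabilities, labels, eps=1e-12):
    """Average penalty: near 0 for confident correct answers, large for confident wrong ones."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1 - eps)   # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of messages whose predicted class matches the label."""
    correct = sum((p >= threshold) == (y == 1)
                  for p, y in zip(probabilities, labels))
    return correct / len(labels)

labels = [1, 0, 1, 0]                # 1 = Spam, 0 = Not Spam
good = [0.9, 0.1, 0.8, 0.2]          # hypothetical outputs of a well-trained model
bad = [0.1, 0.9, 0.2, 0.8]           # hypothetical outputs of a badly trained model

for name, probs in (("good", good), ("bad", bad)):
    print(name, "loss:", round(binary_cross_entropy(probs, labels), 3),
          "accuracy:", accuracy(probs, labels))
```

Accuracy only counts right and wrong answers, while loss also reflects how confident the model was, which is why the two curves can move at different speeds.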

Epochs: 20

Learning Rate

Ready to train.

Loss

Accuracy

Step 6: Test the Model

Try your trained model on new messages. Type or paste a message below, then click Classify to see the prediction.

Top Feature Values for This Message

Preset Test Messages