5 Machine Learning Projects You Can Build Using Open-Source Cybersecurity Data

BetterMind Labs
13 hours ago

Most high school students who say they are interested in AI have a GitHub with a half-finished tutorial. That is not a project. That is not a portfolio. And increasingly, that is not enough.

The students getting noticed by top CS programs are not the ones who finished a Coursera course. They are the ones who took a real dataset, built something functional, and can walk an admissions officer through what they made and why it matters. Cybersecurity is one of the most overlooked entry points for this kind of work, and the open-source data is sitting right there waiting to be used. Here is what five real students built, and how you can do the same.

Why Cybersecurity Datasets Are Ideal for Machine Learning Projects

Cybersecurity data is structured, publicly available, and genuinely high-stakes. That combination is rare.

Unlike toy datasets built for classroom exercises, cybersecurity data has noise, imbalance, and real-world complexity. Phishing URLs, network traffic logs, transaction records, and malware signatures all require you to actually think about your model, not just run it.

Some of the best open-source sources to start with:

CICIDS 2017/2018 (Canadian Institute for Cybersecurity) covers network intrusion detection with labeled attack types
Kaggle's Credit Card Fraud Dataset has 284,807 transactions, of which only 492 (about 0.17%) are fraudulent, making it a strong class imbalance challenge
PhishTank offers verified phishing URLs updated in real time
EMBER (Endgame Malware BEnchmark for Research) is a malware classification dataset with feature extraction already done
MITRE ATT&CK provides structured threat intelligence data for more advanced projects

The point is not to pick the hardest dataset. The point is to pick one that lets you build something you can actually explain, demo, and document.

5 Real Student Projects Built on Open-Source Cybersecurity Data

These are not hypothetical. These are projects built by real students, most of them in structured mentorship programs during the summer.

1. Phishing and Social Engineering Detection (Verifeye)

https://www.youtube.com/watch?v=KML-QuNUzCA

Built by Sushant Purunu

Sushant built Verifeye, a web-based AI application that lets users paste a suspicious message or URL and get an immediate risk assessment. The app walks users through a short guided survey, then uses Google Gemini AI to identify threat indicators, assign a risk level, and recommend next steps.

What makes this project strong is its accessibility. Verifeye is not built for security engineers. It is built for everyday users who do not know what a phishing attempt looks like. That design decision, explaining who you are building for and why, is what separates a real project from a class assignment.

Datasets and tools to replicate this: PhishTank for URL data, Gemini API for classification, Streamlit for the interface.

What you learn: Prompt engineering, API integration, UI/UX for non-technical users, risk classification logic.
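
To make the risk classification logic more concrete, here is a minimal sketch that pulls a few lexical features out of a URL using only the Python standard library. The function name `url_features`, the chosen features, and the keyword list are illustrative assumptions, not Verifeye's actual implementation. Features like these could be used to train a classifier on PhishTank URLs or passed along as extra context to a language model.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical keyword list for illustration; tune it against real data.
SUSPICIOUS_WORDS = ("login", "verify", "secure", "account", "update")


def url_features(url: str) -> dict:
    """Return simple lexical features often used as phishing signals."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # A raw IP address in place of a domain name is a common warning sign.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    lowered = url.lower()
    return {
        "length": len(url),
        "uses_https": parsed.scheme == "https",
        "host_is_ip": host_is_ip,
        "num_subdomains": 0 if host_is_ip else max(host.count(".") - 1, 0),
        "has_at_symbol": "@" in url,
        "suspicious_words": sum(word in lowered for word in SUSPICIOUS_WORDS),
    }


features = url_features("http://192.168.0.1/secure-login/verify.php")
print(features)
```

No single feature proves a URL is malicious. The value comes from combining them and checking the result against labeled examples.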

2. Cyber Threat Pattern Detection (Ventura AI)

https://www.youtube.com/watch?v=0QR8J99qty4

Built by Bharath Chowlur

Bharath built Ventura AI, a Streamlit app where you paste or upload text or an image and the model flags cyber-attack patterns like SQL injection or DDoS attempts. It returns a structured verdict, assigns a threat type, and explains findings in plain English.

What is technically interesting here is the session tracking. Every scan is logged by UUID in a JSON history file. Users can give feedback on each result, and the app can generate deeper reports on demand. That feedback loop is what makes this a real system rather than a one-off demo.
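
The logging pattern described above can be sketched in a few lines. This is an assumption-laden illustration rather than Ventura AI's actual code: the function names `log_scan` and `add_feedback`, the record fields, and the file layout (one JSON list of records) are all hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path


def log_scan(history_path: Path, user_input: str, verdict: str) -> str:
    """Append one scan record to a JSON history file and return its ID."""
    # Load earlier records, or start fresh if the file does not exist yet.
    if history_path.exists():
        history = json.loads(history_path.read_text(encoding="utf-8"))
    else:
        history = []

    scan_id = str(uuid.uuid4())
    history.append({
        "id": scan_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": user_input,
        "verdict": verdict,
        "feedback": None,  # filled in later if the user rates the result
    })
    history_path.write_text(json.dumps(history, indent=2), encoding="utf-8")
    return scan_id


def add_feedback(history_path: Path, scan_id: str, feedback: str) -> bool:
    """Attach user feedback to an existing scan; return False if not found."""
    history = json.loads(history_path.read_text(encoding="utf-8"))
    for record in history:
        if record["id"] == scan_id:
            record["feedback"] = feedback
            history_path.write_text(json.dumps(history, indent=2), encoding="utf-8")
            return True
    return False
```

Rewriting the whole file on every scan is fine for a demo with a handful of users. A busier app would likely move to a database so that simultaneous writes do not overwrite each other.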

Datasets and tools to replicate this: CICIDS dataset for network traffic patterns, Gemini for language model inference, Streamlit for deployment.

What you learn: Multi-modal input handling, session management, structured output design, feedback loops.

If you are still figuring out what kind of project you want to build, this list of passion project ideas for high school students is a good place to start thinking.

3. Health Misinformation Detection

https://www.youtube.com/watch?v=obarCLt3nVY

Built by Anjali Kumar

Anjali's project sits at the intersection of cybersecurity and public health, specifically the problem of false medical claims spreading online. Users upload a PDF or text block, and the app highlights misleading or false health claims, explains its reasoning, and links out to credible sources for the user to verify.

This is a sophisticated framing. Misinformation is a security problem. It manipulates users the same way phishing does. Anjali's project makes that connection explicit.

Datasets and tools to replicate this: PubMed abstracts for ground truth, Gemini for reasoning, Streamlit for upload and display logic.

What you learn: Document parsing, claim extraction, multi-label classification, citation and sourcing in AI outputs.

4. Malware and Bug Detection in Web Requests

https://www.youtube.com/watch?v=BHSX4Hc9zDk

Built by Jovan Tran

Jovan built an AI web application that analyzes web requests for potential bugs, malware, or infections. The core idea is straightforward: take incoming request data, run it through a classification model, and flag anything that looks like a threat.

Projects like this are particularly strong for admissions because they mirror what security engineers actually do in enterprise environments. It is not a toy problem.

Datasets and tools to replicate this: EMBER for malware features, CICIDS for request-level traffic data, XGBoost or a simple neural net for classification.

What you learn: Feature engineering on raw request data, binary and multi-class classification, false positive management.
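
Feature engineering on raw request data means turning each request into numbers a model can consume. The sketch below shows one possible approach using only the standard library. The function name `request_features`, the seven features, and the two regular expressions are illustrative assumptions, not Jovan's implementation, and the patterns are far from exhaustive.

```python
import re
from urllib.parse import parse_qsl, unquote_plus, urlsplit

# Illustrative patterns only; a real project would learn from labeled data.
SQL_PATTERN = re.compile(r"\b(union|select|drop|insert|or\s+1\s*=\s*1)\b", re.IGNORECASE)
SCRIPT_PATTERN = re.compile(r"<\s*script", re.IGNORECASE)


def request_features(method: str, target: str) -> list[float]:
    """Convert one HTTP request line into a fixed-length numeric vector."""
    parts = urlsplit(target)
    decoded_query = unquote_plus(parts.query)
    return [
        float(method.upper() == "POST"),
        float(len(target)),
        float(len(parse_qsl(parts.query, keep_blank_values=True))),
        float(target.count("%")),  # heavy percent-encoding can hide payloads
        float(sum(decoded_query.count(ch) for ch in "'\";<>")),
        float(bool(SQL_PATTERN.search(decoded_query))),
        float(bool(SCRIPT_PATTERN.search(decoded_query))),
    ]


benign = request_features("GET", "/search?q=shoes&page=2")
attack = request_features("GET", "/item?id=1%27%20OR%201%3D1--")
```

Vectors like these can be stacked into a table and handed to XGBoost or a small neural net. Hand-written patterns also make a useful baseline for judging whether the trained model adds anything.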

5. Credit Card Fraud Detection

https://www.youtube.com/watch?v=YQe6I8CB4D4

Built by Himaghna Roy

Himaghna built a model to detect fraudulent credit card transactions. The use case is specifically designed for bulk verification, helping businesses scan large volumes of transactions and flag misconduct at scale.

Fraud detection is one of the canonical machine learning problems, but what makes a student project stand out is not just that you ran a model. It is that you thought about the class imbalance problem, chose evaluation metrics that actually matter (precision-recall over accuracy), and designed for real-world use rather than benchmark performance.
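
A small worked example shows why accuracy is misleading on imbalanced data. The labels below are made up for illustration (990 legitimate transactions, 10 fraudulent), and the helper names `confusion_counts` and `summarize` are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = fraud)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Return accuracy, precision, and recall, treating 0/0 as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up labels: 990 legitimate transactions and 10 fraudulent ones.
y_true = [0] * 990 + [1] * 10

# A "model" that never flags anything still looks excellent on accuracy.
always_legit = [0] * 1000
print(summarize(y_true, always_legit))  # (0.99, 0.0, 0.0)

# A model that catches 8 of 10 frauds at the cost of 20 false alarms.
flagging = [1] * 20 + [0] * 970 + [1] * 8 + [0] * 2
print(summarize(y_true, flagging))
```

The do-nothing model scores 99% accuracy while catching zero fraud. The second model has lower accuracy but is the only one of the two that is useful, which is exactly what precision and recall reveal.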

Datasets and tools to replicate this: Kaggle Credit Card Fraud Dataset, Scikit-learn for modeling, SMOTE or undersampling for class imbalance, a Streamlit dashboard for results display.

What you learn: Imbalanced classification, SMOTE, threshold tuning, evaluation metric selection.
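
Random undersampling is the simpler of the two rebalancing options mentioned above, and it can be sketched without any extra libraries. The function name `undersample` and the toy data are assumptions for illustration. In practice, a maintained library implementation is usually the better choice.

```python
import random


def undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    fraud = [i for i, y in enumerate(labels) if y == 1]
    legit = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((fraud, legit), key=len)

    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)  # avoid handing the model all of one class first
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data: 1,000 single-feature rows, 10 of them labeled as fraud.
rows = [[float(i)] for i in range(1000)]
labels = [1 if i % 100 == 0 else 0 for i in range(1000)]

balanced_rows, balanced_labels = undersample(rows, labels)
print(len(balanced_rows), sum(balanced_labels))  # 20 10
```

Rebalance only the training split. The test set should keep the original class ratio, otherwise the reported precision will not reflect what the model does on real traffic.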

For students wondering how these kinds of projects affect college applications, this piece on why passion projects can make or break your college application is worth reading.

How BetterMind Labs Students Build Projects Like These

Every project above was built by a student in BetterMind Labs' AI program. That is not a coincidence.

BetterMind Labs runs 4-week fully online summer cohorts with a 1:3 expert mentorship ratio. Students do not watch lectures and take quizzes. They build. The program is structured around real project milestones: scoping, data preparation, model building, evaluation, and deployment. Each student leaves with a working application, a documented capstone, and a mentor who has watched them build from scratch.

That last part matters for admissions. A letter of recommendation from a mentor who supervised your technical work is meaningfully different from one written by a teacher who gave you an A. Admissions officers at research universities know the difference.

The students featured in this article built phishing detectors, fraud detection tools, threat detection apps, and misinformation classifiers. They are not just interested in AI. They have proof.

Case Study: How Merwan Indukuri Built FraudDetect AI

https://www.youtube.com/watch?v=FmS9yBWraIg

Merwan came into the BetterMind Labs program with an interest in AI and a clear sense that invoice fraud was a problem worth solving. Invoice fraud is a costly, recurring problem for businesses, but most small companies do not have the tools to catch it early.

His project, FraudDetect AI, is a web app that uses Gemini 1.5 Vision to analyze uploaded invoices for signs of fraud. It processes the document, runs AI inference, and returns an assessment with supporting reasoning.

Technically, Merwan handled three distinct pieces: the Streamlit interface, the Gemini integration, and the file processing and analytics layer. That is a full-stack AI application built by a high school student over four weeks.

What changed for Merwan was not just the project. It was the clarity. He came in curious about AI. He left knowing he could build production-grade tools that solve real business problems. That shift in self-perception is what structured mentorship actually produces.

You can read more about how students like Merwan build real cybersecurity applications in this detailed breakdown of how an 11th grader built a real cyber threat detection app.


Frequently Asked Questions

Can I build these projects without a mentor?

You can start. Most students can get a dataset loaded and a basic model running on their own. But the gap between a working notebook and a deployed, documented, admissions-ready project is where mentorship makes the real difference. Knowing which decisions matter and why requires someone who has done this before.

Do I need to know Python before starting a cybersecurity ML project?

Comfortable Python basics are enough to begin. You do not need to know machine learning theory upfront. Most of these projects teach you the framework as you build. Starting with a defined problem and a real dataset is more effective than studying in the abstract first.

What makes a cybersecurity ML project actually strong for college applications?

Three things: a clearly defined problem, a working deployment, and documentation that explains your reasoning. Any student can say they are interested in AI. The ones who stand out can show a live app, explain the dataset they used, and describe what they got wrong the first time and how they fixed it.

Are programs like BetterMind Labs worth it for building these projects?

For students who are serious about outcomes, yes. The combination of structured milestones, expert mentorship, and a capstone document produces something qualitatively different from self-directed learning. BetterMind Labs is designed specifically for students who want portfolio-ready results, not just exposure.

One Last Thing

Open-source cybersecurity data is not a shortcut. It is a starting point. What matters is what you build on top of it and whether you can explain it to someone who is deciding your future.

The students in this article did not get lucky. They worked on real problems, with real mentors, over a focused period of time. The result is not a line on a resume. It is a project you can demo, a story you can tell, and a capability you can build on.

That is what changes things.

Explore more at bettermindlabs.org.