Building a Privacy-Preserving Fraud Detection System with Federated Learning

Nov 1, 2025

This blog post is a condensed, reflective version of my undergraduate dissertation project completed at Vishwakarma Institute of Information Technology, Pune. I worked on this with my incredible team members Sushant Kuratkar, Pratik Nule, and Prateek Mazumder under the guidance of Prof. Geetanjali Yadav. While the original dissertation was a comprehensive 60-page technical document, I have distilled it here into the core concepts, implementation details, and honest reflections on what we learned.

Interestingly, the evaluation metrics turned out near-perfect, though this was incidental and not the main goal. The primary aim of the project was to explore feasibility and demonstrate that the concept could be implemented effectively, rather than to optimize for performance.


Introduction: The Problem That Started It All

When I first learned about decentralized finance (DeFi), I was fascinated by its promise of removing intermediaries and enabling peer-to-peer transactions. However, as I dug deeper, I discovered a critical challenge that traditional risk management systems simply could not address: how do you detect fraud and assess risk across multiple autonomous entities without compromising data privacy?

This question led me and my team to develop a federated learning-based framework for DeFi risk management. Over the course of this project, I learned that the most innovative solutions often emerge when privacy constraints force you to think differently about collaboration.

I want to be upfront about the scope of this work: this was primarily a proof-of-concept academic project designed to explore whether federated learning could viably address DeFi risk management challenges. Our focus was on demonstrating the core mechanics of privacy-preserving collaborative learning rather than building a production-ready system. As you will see throughout this blog, this meant we made certain tradeoffs in evaluation rigor and system robustness that I would approach differently in a real-world deployment. But those limitations themselves became valuable learning experiences.

Why Traditional Risk Management Fails in DeFi

Traditional financial institutions rely on centralized risk assessment models. They aggregate sensitive data into a single repository, analyze it, and make decisions. This approach has three fundamental problems when applied to DeFi:

Privacy violations: Centralized data collection exposes systems to breaches and violates user confidentiality. In an era of GDPR and increasing data protection regulations, this is simply unacceptable.

Incompatibility with decentralization: DeFi operates on distributed, trustless architectures. Centralizing data contradicts the core philosophy of blockchain-based systems.

Single point of failure: When one entity controls all the data, that entity becomes a vulnerability. If compromised, the entire system collapses.

I realized that we needed an entirely different approach: one that could learn from distributed data without ever seeing it directly.

Enter Federated Learning

Federated learning offers an elegant solution to this paradox. Instead of bringing data to the model, we bring the model to the data. Here is how it works:

  1. Multiple entities (which we call “client nodes”) each hold their own private transaction data
  2. A central server distributes a machine learning model to all clients
  3. Each client trains the model locally on their private data
  4. Clients send only model updates (weights and gradients) back to the server, never raw data
  5. The server aggregates these updates to create an improved global model
  6. This process repeats iteratively

The beauty of this approach is that sensitive financial data never leaves its original location. Each participant contributes to collective intelligence while maintaining complete control over their information.
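To make the aggregation step concrete, here is a minimal, framework-agnostic sketch of FedAvg's weighted averaging. The data structures and the toy example are illustrative assumptions, not our exact implementation (our system relied on a framework's built-in FedAvg, described below):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted-average a list of client models (FedAvg).

    client_weights: one list of NumPy arrays per client, holding that
                    client's layer weights.
    client_sizes:   number of local training samples per client, used
                    as the averaging weight.
    """
    total = sum(client_sizes)
    num_layers = len(client_weights[0])
    averaged = []
    for layer in range(num_layers):
        # Sum each client's layer weights, scaled by its share of the data.
        layer_avg = sum(
            (size / total) * weights[layer]
            for weights, size in zip(client_weights, client_sizes)
        )
        averaged.append(layer_avg)
    return averaged

# Toy example: three clients, one weight matrix each.
clients = [[np.ones((2, 2)) * k] for k in (1.0, 2.0, 3.0)]
sizes = [100, 200, 700]
print(fedavg(clients, sizes)[0])  # weighted mean, dominated by client 3
```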

Designing the System Architecture

The Core Components

I structured the system around three main components:

Client Nodes: These represent individual DeFi platforms or financial institutions. Each node stores its transaction data locally, trains models independently, and communicates only parameter updates.

Central Server: This orchestrates the federated learning process using the Flower framework. It manages training rounds, aggregates model updates using Federated Averaging (FedAvg), and distributes the improved global model back to clients.
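For orientation, the server setup in Flower looks roughly like the sketch below. Exact arguments vary across Flower versions, so treat this as an illustration of a 1.x-style configuration rather than our verbatim code:

```python
import flwr as fl

# FedAvg strategy: wait for all three simulated clients each round and
# weight their updates by local dataset size (handled inside FedAvg).
strategy = fl.server.strategy.FedAvg(
    fraction_fit=1.0,        # use every available client for training
    min_fit_clients=3,
    min_available_clients=3,
)

# Run 10 communication rounds, matching the setup described in this post.
fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=10),
    strategy=strategy,
)
```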

Neural Network Model: I designed a Multi-Layer Perceptron called FraudDetectionNet with two hidden layers (64 and 32 neurons) using ReLU activation functions. The output layer uses a sigmoid activation for binary classification, determining whether a transaction is fraudulent or legitimate.
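The network is small enough to write out in full. Here is a sketch in PyTorch; the input dimension is a placeholder, since the real value depends on the engineered feature set:

```python
import torch
import torch.nn as nn

class FraudDetectionNet(nn.Module):
    """MLP with two hidden layers (64 and 32 units), ReLU activations,
    and a sigmoid output for binary fraud classification."""

    def __init__(self, input_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid(),  # probability that the transaction is fraudulent
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = FraudDetectionNet(input_dim=30)  # 30 is a placeholder feature count
```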

Figure 10.1 - High-Level System Architecture showing server and multiple client nodes

The Data Challenge

One of the most significant challenges I faced was class imbalance. In fraud detection, fraudulent transactions are rare, typically less than 1% of all transactions. If you train a model on this imbalanced data, it will simply learn to classify everything as non-fraudulent and still achieve 99% accuracy, which is useless.

To address this, I implemented SMOTE (Synthetic Minority Over-sampling Technique) during preprocessing. This technique generates synthetic samples of the minority class (fraudulent transactions) by creating new instances along the line segments connecting existing minority class examples. This gave our model enough fraud examples to actually learn meaningful patterns.
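In code, the resampling step is a one-liner with imbalanced-learn. The sketch below uses a synthetic stand-in dataset; the key detail, which I come back to in the reflections later, is that SMOTE should only ever see the training split:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for the real transaction features: a ~1% "fraud" class.
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.99, 0.01], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample only the training split; the test set keeps its natural,
# heavily skewed class distribution so evaluation stays honest.
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_train_res))
```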

The Ethereum Dataset and Feature Engineering

For this project, I used an Ethereum transaction dataset from Kaggle containing anonymized blockchain transactions. The data included temporal patterns, transaction values, gas prices, and smart contract interactions.

The feature engineering process involved extracting risk indicators that could signal fraudulent behavior:

  • Transaction frequency patterns
  • Value fluctuations
  • Network interaction behaviors
  • Temporal anomalies

I standardized all numerical features using StandardScaler to ensure no single feature dominated the learning process due to its scale. This preprocessing pipeline was crucial: I saved the fitted scaler object to ensure consistent scaling across all clients during inference.
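The scaling-and-persistence step looked roughly like the following sketch (variable names, array shapes, and the file name are illustrative):

```python
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-ins for the engineered transaction features.
X_train = np.random.rand(1000, 20)
X_test = np.random.rand(200, 20)

# Fit the scaler on training data only, then reuse the same fitted
# object everywhere so every client scales features identically.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Persist the fitted scaler so clients can load it at inference time.
joblib.dump(scaler, "scaler.joblib")
loaded_scaler = joblib.load("scaler.joblib")
```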

Implementation Deep Dive

Federated Learning Workflow Sequence Diagram

The Training Process

I ran the federated learning process for 10 communication rounds. In each round:

  1. The server selected participating clients (in our case, all three simulated clients)
  2. Each client received the current global model parameters
  3. Clients trained locally for 5 epochs with a batch size of 32
  4. Clients transmitted their model updates back to the server
  5. The server aggregated updates using weighted averaging based on dataset sizes
  6. The updated global model was distributed back to all clients

Throughout this process, I monitored several key metrics to ensure the model was learning effectively.
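For reference, the local training step each client performs (step 3 above) reduces to a standard PyTorch loop. This sketch uses random stand-in data and hard-codes the 5 epochs and batch size of 32 from our setup:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in local data: 1,000 transactions with 30 placeholder features.
X = torch.rand(1000, 30)
y = torch.randint(0, 2, (1000, 1)).float()
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(  # same shape as FraudDetectionNet
    nn.Linear(30, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Local training: 5 epochs per communication round.
for epoch in range(5):
    for features, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()

# Only the updated parameters (never the data) are sent back to the server.
update = [p.detach().numpy() for p in model.parameters()]
```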

Results That Exceeded Expectations

The results genuinely surprised me. The federated learning model achieved:

  • AUC-ROC: 0.9970
  • F1 Score: 0.9814
  • Precision: 0.9976
  • Recall: 0.9656
  • Accuracy: 0.9901

To validate these results, I trained a centralized model on the same data (combined from all clients) as a benchmark. The federated model not only matched but in some metrics slightly exceeded the centralized model’s performance.

This was a pivotal moment in the project. It demonstrated that you do not have to sacrifice model quality for privacy. The federated approach achieved comparable performance while keeping all sensitive data decentralized.

Understanding the Metrics

The high recall (96.56%) was particularly important for fraud detection. This means the model correctly identified over 96% of actual fraudulent transactions. In financial systems, missing fraud (false negatives) can result in significant losses.

The high precision (99.76%) indicated that when the model flagged a transaction as fraudulent, it was almost always correct. This minimizes false alarms, which is critical for operational efficiency: you do not want fraud analysts investigating thousands of legitimate transactions.
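All of these metrics come straight from scikit-learn; here is a minimal sketch with placeholder labels and predictions:

```python
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)

# y_true: ground-truth labels, y_prob: the model's fraud probabilities.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.2, 0.4, 0.9, 0.8, 0.3, 0.6, 0.05]
y_pred = [int(p >= 0.5) for p in y_prob]

print("AUC-ROC  :", roc_auc_score(y_true, y_prob))
print("F1       :", f1_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
```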

Deep Dive: Model Performance Across Training Rounds

One of the most fascinating aspects of federated learning is watching the global model evolve over successive communication rounds. Each round represents a cycle where clients train locally, send updates, and receive an improved global model back. Let me walk you through what actually happened during our 10 training rounds.

Global Model Evolution

Charts: accuracy progression, AUC-ROC, F1 score, loss trajectory, precision, and recall for the global model across the 10 communication rounds.

Client-Specific Performance Analysis

While the global model metrics tell one story, examining individual client performance reveals the heterogeneity inherent in federated learning systems.

Chart: performance variance across the three client nodes over training rounds.

The Federated vs. Centralized Showdown

The differences are remarkably small. In fact, the federated model slightly outperformed the centralized model in several metrics (AUC, F1, Recall, Accuracy). This is counterintuitive; you would typically expect federated learning to perform slightly worse due to the challenges of aggregating models trained on different data distributions.

The fact that our federated model matched or exceeded centralized performance could indicate:

  1. The FedAvg algorithm is highly effective for this type of data
  2. Our data partitioning created client splits that were still relatively similar (IID)
  3. The differences are within noise margins and not statistically significant
  4. There might be evaluation issues affecting both models similarly

What These Results Actually Mean

Looking at all these metrics together, here is my honest interpretation:

The Good:

  • The federated learning framework successfully trained a model without centralizing data
  • Performance was competitive with centralized training, proving the concept works
  • The model showed consistent improvement across rounds, indicating effective aggregation
  • Different clients contributed meaningfully despite performance variance

The Concerning:

  • Near-perfect metrics (especially the 100% scores for Clients 1 and 2) are red flags
  • The extremely low final loss suggests potential overfitting
  • The small performance variance between federated and centralized models is suspiciously good
  • Real-world fraud detection systems rarely achieve 99%+ precision and 96%+ recall simultaneously

The Realistic:

  • This was an academic proof-of-concept, not a production-ready system
  • The results validate that federated learning can work for fraud detection
  • The methodology needs refinement to ensure robust evaluation
  • The framework is sound even if the specific metrics need more rigorous validation

Key Takeaways for Practitioners

If you are considering implementing federated learning for fraud detection:

  1. Start with a centralized baseline: Always compare against centralized performance to quantify the privacy-performance tradeoff
  2. Monitor client-specific metrics: Understanding variance across clients reveals data heterogeneity issues
  3. Use multiple evaluation metrics: Accuracy alone is meaningless for imbalanced problems
  4. Watch for convergence patterns: Different metrics converge at different rates
  5. Be skeptical of perfect results: In real-world ML, perfection usually indicates problems
  6. Test on truly held-out data: Your evaluation set should come from a different time period or source

Technical Challenges I Faced

Suspiciously Perfect Metrics

The biggest challenge I encountered was not during implementation, but during results analysis. When Clients 1 and 2 both achieved perfect 100% scores across all metrics with near-zero loss (0.00157 and 0.00140 respectively), I knew something was off. In real-world machine learning, especially fraud detection, you simply do not get perfect performance.

This raised several questions: Did we have data leakage? Was our test set too small or too easy? Did we accidentally use the same data for training and testing? These are the kinds of issues that are easy to miss when you are rushing to complete a project, but they fundamentally undermine the validity of results.

Class Imbalance and SMOTE Side Effects

While SMOTE (Synthetic Minority Over-sampling Technique) helped us address the severe class imbalance in fraud detection, it likely created its own problems. By generating synthetic fraud samples through interpolation between existing fraud cases, we may have made the classification task artificially easier.

The model might have learned to identify these synthetic patterns rather than real fraud characteristics. This could explain why our metrics were so high: we were essentially testing on data that was mathematically similar to our training augmentations. A production system would face real fraud that does not follow neat interpolated patterns.

Data Distribution Challenges

Although we simulated three separate clients, we partitioned data from a single source dataset. This meant our data was likely more IID (Independent and Identically Distributed) than real-world federated scenarios would be. True DeFi platforms would have fundamentally different:

  • User populations
  • Transaction patterns
  • Types of fraud attempts
  • Volume distributions

Client 3’s slightly worse performance (97.87% AUC vs. 100% for Clients 1 and 2) gave us a glimpse of what heterogeneous data might look like, but we did not truly test the system under extreme non-IID conditions.
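If I were simulating clients again, I would induce label skew explicitly rather than splitting uniformly. One common approach (not what we actually did) is a Dirichlet partition, sketched below with placeholder labels:

```python
import numpy as np

def dirichlet_partition(labels, num_clients=3, alpha=0.5, seed=0):
    """Split sample indices across clients with label skew.

    Lower alpha -> more skewed (non-IID) label distributions per client.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        cls_idx = rng.permutation(np.where(labels == cls)[0])
        # Draw per-client proportions for this class and split accordingly.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        splits = (np.cumsum(proportions)[:-1] * len(cls_idx)).astype(int)
        for client, chunk in enumerate(np.split(cls_idx, splits)):
            client_indices[client].extend(chunk.tolist())
    return client_indices

labels = np.random.randint(0, 2, size=1000)
parts = dirichlet_partition(labels)
print([len(p) for p in parts])  # uneven, label-skewed client shares
```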

Evaluation Methodology Gaps

Looking back, I realize we made several evaluation mistakes:

Single Dataset Split: We used one dataset and split it into train/test. Real-world validation should use transactions from different time periods or different blockchain networks entirely.

No Temporal Validation: Fraud patterns evolve over time. We should have trained on older transactions and tested on newer ones to simulate real deployment.

Potential Data Leakage: If we applied our StandardScaler or SMOTE before splitting data, information from the test set could have leaked into training. This would artificially inflate all our metrics.
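The fix for the leakage risk is purely a matter of ordering: split first, then fit every data-dependent transform on the training portion only. A sketch of the safe ordering, using a synthetic stand-in dataset:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=5_000, n_features=20, weights=[0.99, 0.01], random_state=0
)

# 1. Split first, so the test set never influences any fitted transform.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Fit the scaler on training data only; apply (not fit) it to the test set.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 3. Oversample the training set only; never resample the test set.
X_train, y_train = SMOTE(random_state=0).fit_resample(X_train, y_train)
```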

Critical Reflection: What Could Have Gone Wrong

Looking back at this project with a more critical eye, I need to acknowledge several potential issues that could have inflated our results:

Possible Overfitting: The near-perfect metrics (99.7% AUC-ROC, 99.76% precision) are suspiciously high for a real-world fraud detection system. This could indicate overfitting, where the model memorized patterns specific to our dataset rather than learning generalizable fraud detection strategies. In production, the model would likely perform worse on truly unseen transaction patterns.

SMOTE’s Double-Edged Sword: While SMOTE helped address class imbalance, it may have introduced its own problems. By generating synthetic fraud samples through interpolation, we might have created unrealistic transaction patterns that are easier to classify than real fraudulent transactions. The model could have learned to identify these synthetic patterns rather than actual fraud characteristics. A better approach might have been to use ensemble methods designed for imbalanced data or cost-sensitive learning that penalizes false negatives more heavily.

Data Leakage Concerns: I need to be honest here. There is a possibility of data leakage in our preprocessing pipeline. If we applied SMOTE before splitting the data into training and test sets, or if the same StandardScaler was fitted on the entire dataset before splitting, our test metrics would be artificially inflated. The model would have seen information about the test distribution during training, making evaluation metrics unreliable.

Limited Dataset Diversity: We used a single Ethereum transaction dataset and simply partitioned it across simulated clients. Real-world DeFi platforms would have fundamentally different transaction patterns, user behaviors, and fraud types. Our results do not account for the extreme non-IID data that would exist across actual institutions.

Evaluation on Synthetic Splits: The test set we used came from the same distribution as our training data. In reality, fraud patterns evolve over time. A robust evaluation would require testing on transactions from a later time period or from completely different blockchain networks to assess true generalization.

Small-Scale Simulation: With only three simulated clients, we did not encounter many of the real challenges of federated learning: client dropout, extreme data heterogeneity, Byzantine attacks, or communication constraints. Our clean academic setup does not reflect the messiness of production systems.

Perfect Scores Raise Red Flags: The fact that two out of three clients achieved perfect 100% scores across all metrics is highly suspicious. This almost certainly indicates problems with our evaluation methodology rather than genuinely perfect model performance.

What I Would Do Differently

If I were to redo this project with the knowledge I have now:

  1. Implement proper cross-validation: Use time-series cross-validation where training data comes from earlier time periods and test data from later periods, mimicking real-world deployment.

  2. Use class weights instead of SMOTE: Configure the loss function to penalize misclassifying fraud more heavily, rather than creating synthetic samples (see the sketch after this list).

  3. Add a holdout validation set: Keep a completely separate dataset that is never touched during development to get honest performance metrics at the end.

  4. Test on multiple datasets: Evaluate the federated model on transaction data from different blockchain networks or DeFi protocols to assess true generalization.

  5. Implement adversarial validation: Check whether a model can distinguish between training and test data; if it can, that indicates distribution shift or leakage.

  6. More realistic client simulation: Use actual data from different DeFi protocols rather than artificially partitioning one dataset.
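For point 2, here is a minimal sketch of the cost-sensitive alternative in PyTorch: BCEWithLogitsLoss with pos_weight upweights the rare fraud class without synthesizing any samples. The 99:1 ratio below is a placeholder for an approximately 1% fraud rate:

```python
import torch
import torch.nn as nn

# Roughly (# negative / # positive) samples; placeholder for a ~1% fraud rate.
pos_weight = torch.tensor([99.0])

# Note: the model should output raw logits (no final Sigmoid) with this loss.
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.tensor([[2.0], [-1.5], [0.3]])   # raw model outputs
labels = torch.tensor([[1.0], [0.0], [1.0]])    # 1 = fraud
print(criterion(logits, labels))                # fraud errors cost ~99x more
```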

Lessons Learned

Privacy and performance are not mutually exclusive: Even with the caveats above, the core insight remains valid. Federated learning can achieve competitive performance while providing strong privacy guarantees. The framework itself is sound even if our specific metrics were inflated.

Data preprocessing is critical but dangerous: SMOTE and other preprocessing techniques can dramatically impact results, sometimes in misleading ways. Understanding exactly when and how to apply these techniques is crucial.

Simplicity enables scale: Starting with a simple MLP architecture and straightforward FedAvg aggregation made the system easier to debug and iterate on. Complexity should be added incrementally as needed.

Evaluation metrics matter, but so does evaluation methodology: In imbalanced classification problems, accuracy alone is misleading. However, even sophisticated metrics like AUC-ROC can be misleading if the evaluation setup has fundamental flaws.

Honesty about limitations is a strength: In machine learning, acknowledging what you do not know and what could have gone wrong demonstrates maturity and understanding. Real-world ML is messy, and pretending otherwise does not help anyone.

Future Enhancements

While I am proud of what we accomplished, there are several directions I would explore to make this system production-ready:

Differential Privacy Integration: Implement local differential privacy to add mathematically provable privacy guarantees to each client’s contributions.
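As a very rough sketch of what that could look like on the client side (this is not a complete DP mechanism, and the clipping norm and noise scale are arbitrary placeholders), each client would clip its update's norm and add Gaussian noise before transmission:

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_std=0.1, seed=None):
    """Clip an update's L2 norm and add Gaussian noise (DP-style sketch).

    A real deployment would calibrate noise_std to a target (epsilon, delta)
    and track the privacy budget across rounds; this only shows the shape
    of the mechanism.
    """
    rng = np.random.default_rng(seed)
    flat = np.concatenate([layer.ravel() for layer in update])
    scale = min(1.0, clip_norm / (np.linalg.norm(flat) + 1e-12))
    noisy = []
    for layer in update:
        clipped = layer * scale          # clip the whole update's norm
        noisy.append(clipped + rng.normal(0.0, noise_std, size=layer.shape))
    return noisy

update = [np.random.randn(64, 30), np.random.randn(64)]
private_update = privatize_update(update, seed=0)
```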

Scalability Testing: Evaluate the system with hundreds or thousands of clients to understand real-world scalability limits.

Asynchronous Federated Learning: Remove the requirement for synchronous client participation, allowing clients to contribute updates whenever they are available.

Advanced Aggregation Strategies: Explore alternatives to FedAvg that are more robust to non-IID data and adversarial clients, such as FedProx or Krum.

Blockchain Integration: Deploy smart contracts to manage model versioning, client contributions, and incentive mechanisms in a truly decentralized manner.

Conclusion

This project transformed my understanding of what is possible when privacy constraints force creative thinking. Federated learning is not just a technical solution to a privacy problem; it represents a fundamental shift in how we think about collaborative machine learning.

In the context of DeFi and financial systems more broadly, this approach enables institutions to collaborate on fraud detection without exposing sensitive customer data. It aligns with the core principles of decentralization while delivering the model performance that real-world applications demand.

The techniques I explored here (federated averaging, SMOTE for class imbalance, careful feature engineering, and comprehensive evaluation) are applicable far beyond fraud detection. Any domain where data is sensitive, distributed, or regulated could benefit from this approach: healthcare, telecommunications, IoT, and more.

As I continue my journey in AI engineering, I carry forward the lessons from this project: that the most elegant solutions often emerge from embracing constraints rather than fighting them, and that privacy and performance can coexist when we design systems thoughtfully.


Technical Stack: Python, PyTorch, Flower Framework, Scikit-learn, Pandas, NumPy, Flask, Matplotlib

Dataset: Ethereum Fraud Detection Dataset (Kaggle)

Code: Available on GitHub

Team: Ayush Patne, Sushant Kuratkar, Pratik Nule, Prateek Mazumder