By Prince Grover, Zheng Li, Jianbo Liu, Jakub Zablocki, Hao Zhou, Julia Xu and Anqi Cheng

The Fraud Dataset Benchmark (FDB) is a compilation of publicly available datasets relevant to fraud detection (arXiv link). FDB aims to cover a wide variety of fraud detection tasks, ranging from card-not-present transaction fraud and bot attacks to malicious traffic, loan risk, and content moderation. The Python-based data loaders in FDB provide dataset loading, standardized train-test splits, and performance evaluation metrics. The goal of our work is to give researchers working in the field of fraud and abuse detection a standardized set of benchmarking datasets and evaluation tools for their experiments. Using the FDB tools, we evaluate 4 AutoML pipelines, including AutoGluon, H2O, Amazon Fraud Detector, and Auto-sklearn, across 9 different fraud detection datasets and discuss the results.

## Datasets used in FDB

Brief summary of the datasets used in FDB. Each dataset is described in detail in the Data Sources section.
## Installation

### Requirements

### Step 1: Set up Kaggle CLI

Follow the instructions in the How to Use Kaggle guide. Remember to download the authentication token from "My Account" on Kaggle and save it where the guide specifies.

### Step 2: Clone Repo

Once the Kaggle CLI is set up and installed, clone the GitHub repo.

### Step 3: Install

Once the repo is cloned, run the install command from your terminal.

## FraudDatasetBenchmark Usage

Usage is straightforward: you create a `FraudDatasetBenchmark` object for the dataset of interest.
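To illustrate the pattern, the sketch below uses a hypothetical stand-in class (`ToyFraudLoader` is not the real `FraudDatasetBenchmark` API, whose constructor arguments and attributes are not shown in this README): a loader takes a dataset key and returns a standardized, reproducible train-test split.

```python
import random
from dataclasses import dataclass

@dataclass
class ToyFraudLoader:
    """Hypothetical stand-in for the loader pattern described above;
    the real FraudDatasetBenchmark class and its arguments may differ."""
    key: str                   # dataset identifier (illustrative)
    test_fraction: float = 0.2
    seed: int = 42             # fixed seed => standardized, reproducible split

    def load(self):
        # The real loader downloads and parses a public dataset; here we
        # fabricate labeled rows so the split logic is runnable on its own.
        rng = random.Random(self.seed)
        rows = [{"amount": rng.uniform(1.0, 500.0),
                 "is_fraud": int(rng.random() < 0.05)} for _ in range(1000)]
        rng.shuffle(rows)
        n_test = int(len(rows) * self.test_fraction)
        return rows[n_test:], rows[:n_test]  # (train, test)

train, test = ToyFraudLoader(key="example_dataset").load()
print(len(train), len(test))  # 800 200
```

Because the split is driven by a fixed seed, every run (and every benchmarked model) sees exactly the same train-test partition, which is what makes cross-model comparisons on the benchmark meaningful.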
A notebook template that loads datasets with the FDB data loader is available at scripts/examples/Test_FDB_Loader.ipynb.

## Reproducibility

Reproducibility scripts are available at scripts/reproducibility/, in respective folders for afd, autogluon and h2o. Each folder also has a README with steps to reproduce.

## Benchmark Results
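The baseline evaluations below report AUC-ROC. As a self-contained reference (this is not FDB's own evaluation code), the metric can be computed directly from its rank-statistic definition: the probability that a randomly chosen positive outscores a randomly chosen negative, counting ties as one half.

```python
def auc_roc(labels, scores):
    """AUC-ROC via the rank-statistic definition: P(pos score > neg score),
    with ties counted as 1/2. O(P*N) pairwise loop, fine for small examples."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

AUC-ROC is threshold-free, which is why it suits fraud benchmarks: the extreme class imbalance typical of fraud data makes accuracy at any single threshold uninformative.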
### ROC Curves

The numbers in the legend represent AUC-ROC for the different models in our baseline AutoML evaluations.

## Data Sources
## Citation
## License

This project is licensed under the MIT-0 License.

## Acknowledgement

We thank the creators of all datasets used in the benchmark and the organizations that have helped host the datasets and make them widely available for research purposes.