A Foundation
Team consisting of a Mastercard Senior Software Engineer with an IISc MS degree, specializing in enterprise Java, Spring Boot, microservices, and secure cloud architectures.
Project Description
Project Submission: Credit Foundation Model (CFM)
Executive Summary
Typical consumer and mortgage credit risk modeling relies on static snapshots (e.g., credit scores, debt-to-income ratios) or crude lagging indicators (e.g., 30/60/90 days-past-due status). In the trillion-dollar structured finance (MBS/ABS) market, this lack of granular, dynamic visibility translates directly into mispriced risk, inefficient capital allocation, and late-stage default notices.
The Credit Foundation Model (CFM) is an end-to-end self-supervised deep learning pipeline and live dashboard that treats credit performance as a token-sequence modeling problem, much as modern LLMs treat text. By encoding dynamic loan histories (repayments, days-past-due (DPD) transitions, forbearance, cure events) into dense sequence embeddings, CFM enables financial institutions to predict default and delinquency months in advance, perform ESG-sensitive pricing, and optimize portfolio management.
- Direct Solution to a Trillion-Dollar Challenge
Structured finance markets (e.g., residential mortgage-backed securities or RMBS) bundle millions of individual loans. Traditional pricing models fail to capture micro-level changes in borrower behavior.
• Dynamic Event Tokenization: CFM converts monthly payment histories into structured tokens (incorporating seasoning, DPD status, balance changes, and restructuring).
• Early Warning Performance: By applying self-supervised pre-training to these event sequences, the model learns deep behavioral representations. In our downstream evaluation, CFM embeddings flag defaults 1 to 6 months earlier than standard credit scoring methods, safeguarding capital for ABS issuers and warehouse lenders.
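To make the event tokenization idea concrete, here is a minimal sketch of how one month of loan performance could be turned into a token and mapped to an integer id. The field names, bucket boundaries, and token strings below are hypothetical and chosen only for illustration; they are not the vocabulary of the actual Loan Event Tokenizer.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MonthlyRecord:
    """One month of loan performance (hypothetical schema for illustration)."""
    loan_age_months: int    # seasoning
    days_past_due: int      # DPD at month end
    balance: float          # outstanding balance at month end
    modified: bool = False  # restructuring / forbearance flag


def seasoning_bucket(age: int) -> str:
    # Assumed bucket edges: under 1 year, 1 to 3 years, 3 years and beyond.
    if age < 12:
        return "SEAS_LT12"
    if age < 36:
        return "SEAS_12_35"
    return "SEAS_36P"


def dpd_bucket(days_past_due: int) -> str:
    # Assumed convention: under 30 days late still counts as current.
    if days_past_due < 30:
        return "CUR"
    if days_past_due < 60:
        return "DPD30"
    if days_past_due < 90:
        return "DPD60"
    return "DPD90P"


def balance_bucket(prev: Optional[float], curr: float) -> str:
    if prev is None:
        return "BAL_NA"  # first observed month, no change to compare
    if curr < prev:
        return "BAL_DOWN"
    if curr > prev:
        return "BAL_UP"
    return "BAL_FLAT"


def tokenize_loan(history: List[MonthlyRecord]) -> List[str]:
    """Turn a chronologically ordered history into one token per month."""
    tokens: List[str] = []
    prev_balance: Optional[float] = None
    for rec in history:
        parts = [
            seasoning_bucket(rec.loan_age_months),
            dpd_bucket(rec.days_past_due),
            balance_bucket(prev_balance, rec.balance),
        ]
        if rec.modified:
            parts.append("MOD")
        tokens.append("|".join(parts))
        prev_balance = rec.balance
    return tokens


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Assign integer ids in first-seen order; ids 0 and 1 are reserved."""
    vocab: Dict[str, int] = {"[PAD]": 0, "[MASK]": 1}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab
```

Composite tokens keep one position per month, so sequence length equals loan age in months. The trade-off is a vocabulary that grows with the product of the bucket counts.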
- Working Application Completeness
The CFM prototype is a fully functional end-to-end pipeline with no placeholder components. It integrates the following components into a single runtime framework:
• Step 1 (Ingest, Validate, Train Baseline): Zero-copy ingestion of performance data via DuckDB, an 18-point Gold data validation suite, and a baseline XGBoost classifier.
• Step 2 (Tokenize Sequences): A custom Loan Event Tokenizer that constructs dense sequence histories (sequences.parquet).
• Step 3 (Foundation Model Training): A modular deep learning engine in PyTorch supporting multiple architectures (LSTM, TFT, PatchTST, and a custom Hybrid TFT+PatchTST model). It executes self-supervised pre-training and multi-task fine-tuning (default and cure prediction) with FP16 Automatic Mixed Precision (AMP).
• Step 4 (Downstream Evaluation & Dashboard): An evaluation engine comparing Handcrafted Features, Embeddings-Only, Hybrid, Linear Probes, and PyTorch DNNs side-by-side on metrics like PR-AUC, Gini, Brier Score, and Expected Calibration Error (ECE).
• Interactive UI: A unified web dashboard serving real-time logs, metrics tables, and dynamic charts (early-warning decay, post-calibration scatter plots, and ESG-fairness slices).
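Several of the Step 4 metrics are short enough to write out by hand. The following is a dependency-free sketch of the Brier Score, Expected Calibration Error with equal-width bins, and Gini (computed as 2 x ROC-AUC - 1). The function names and the choice of 10 bins are assumptions for illustration; the evaluation engine itself may compute these differently (for example via Scikit-Learn).

```python
from typing import List, Sequence, Tuple


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 10
) -> float:
    """Weighted average gap between mean predicted probability and observed
    event rate, over equal-width probability bins."""
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for _, p in members) / len(members)
        event_rate = sum(y for y, _ in members) / len(members)
        ece += (len(members) / n) * abs(event_rate - mean_prob)
    return ece


def roc_auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative.
    O(n_pos * n_neg) pairwise version; fine for small samples only."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def gini(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Gini coefficient as used in credit scoring: 2 * AUC - 1."""
    return 2.0 * roc_auc(y_true, y_prob) - 1.0
```

Gini and PR-AUC measure ranking quality, while Brier and ECE measure whether the predicted probabilities can be taken at face value, which is why the comparison reports both kinds.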
- Business Model Feasibility
CFM addresses a high-value B2B enterprise market. It is structured to scale as an API-first platform:
• Integration: Can be deployed on-premise or in private clouds (via APIs) to integrate directly into underwriting and servicing platforms.
• Value Proposition: We estimate that a 0.5% improvement in default capture rate can translate to tens of millions of dollars in saved recovery costs and lower risk premiums on a $1B mortgage portfolio.
• ESG Integration: Servicers can dynamically adjust outreach policies based on ESG performance (using energy performance labels to identify transition risk).
Technical Excellence & Tech Stack Leverage
The application is built on a high-performance stack optimized for financial tabular sequences:
Core Frameworks & Libraries
• PyTorch (with FP16 AMP): Powers the deep learning training loop. FP16 mixed precision accelerates training and reduces memory usage on local GPU hardware (such as laptop GPUs).
• DuckDB: Serves as the high-speed, local analytical storage layer. Enables zero-copy, highly efficient SQL joins between Parquet embeddings and metadata tables.
• FastAPI: The backend web server hosting endpoints to dynamically run pipelines, monitor status, and return serialized metrics.
• Scikit-Learn: Used for metrics calculation and Platt scaling calibration. We worked around compatibility limits in newer versions by wrapping the pre-trained model in sklearn.frozen.FrozenEstimator, so it can be calibrated on validation data without triggering a complete refit.
• XGBoost: Trained as the baseline model and downstream evaluator.
• Vanilla HTML5 / CSS3 / JS: Serving a lightweight, responsive dashboard using HSL CSS custom properties (sleek dark mode, neon highlights), grid systems, micro-animations, and Chart.js for visualization.
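The calibration step mentioned under Scikit-Learn rests on a simple idea: the pre-trained model stays frozen, and only a two-parameter sigmoid is fitted on its held-out scores. The sketch below illustrates that idea in plain Python with gradient descent. It is not the sklearn.frozen.FrozenEstimator API and not the project's code; the function names, learning rate, and iteration count are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    # Numerically stable for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mean_log_loss(labels: Sequence[int], probs: Sequence[float]) -> float:
    eps = 1e-12
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def fit_platt(
    scores: Sequence[float],
    labels: Sequence[int],
    lr: float = 0.1,
    n_iter: int = 2000,
) -> Tuple[float, float]:
    """Fit p = sigmoid(a * score + b) on held-out (score, label) pairs.

    Only a and b are learned; the model that produced the scores is
    never touched, which is the point of calibrating a frozen model.
    """
    a, b = 1.0, 0.0  # start from the identity mapping on logits
    n = len(scores)
    for _ in range(n_iter):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # d(log loss)/d(logit)
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(scores: Sequence[float], a: float, b: float) -> List[float]:
    """Apply the fitted sigmoid to raw scores (logits)."""
    return [sigmoid(a * s + b) for s in scores]
```

In use, `scores` would be the frozen model's raw outputs on the validation split, and the fitted `(a, b)` would then be applied to its outputs on new data.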
Pushing Technical Boundaries
- Hybrid TFT-PatchTST Architecture: Our leading foundation architecture combines a Variable Selection Network (VSN) (from Temporal Fusion Transformers) to learn feature importances per month with Patching + Transformer Encoders (from PatchTST) to capture long-term sequence dependencies.
- Event Vocab Mapping: Represents time-series sequences as discrete vocabularies mapped to financial states, which allows standard NLP-style self-supervised pre-training (masked token prediction).
- Imbalanced Loss Optimization: Combines Focal Loss for rare default events with standard Binary Cross-Entropy (BCE) for cure events in a joint multi-task fine-tuning objective.
- Strict OOT Splitting: Enforces out-of-time (OOT) validation splits (pre-2024 vs. post-2025) to avoid temporal data leakage, matching strict institutional model risk validation standards.
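The joint objective described under Imbalanced Loss Optimization can be written out per example. This is a framework-free scalar sketch, assuming the common focal-loss form with a focusing parameter gamma and class weight alpha, and a simple weighted sum of the two task losses. The default values and the `cure_weight` parameter are hypothetical; the PyTorch training code may weight or reduce the terms differently.

```python
import math
from typing import Sequence

EPS = 1e-7  # clamp to keep log() finite


def bce(p: float, y: int) -> float:
    """Binary cross-entropy for one prediction p and label y in {0, 1}."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one prediction.

    p_t is the probability assigned to the true class. The factor
    (1 - p_t) ** gamma shrinks the loss of well-classified examples so
    that the rare, hard default cases dominate the gradient.
    """
    p = min(max(p, EPS), 1.0 - EPS)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def joint_loss(
    default_probs: Sequence[float],
    default_labels: Sequence[int],
    cure_probs: Sequence[float],
    cure_labels: Sequence[int],
    cure_weight: float = 1.0,  # hypothetical task-balancing weight
    gamma: float = 2.0,
    alpha: float = 0.25,
) -> float:
    """Mean focal loss on the default head plus weighted mean BCE on the cure head."""
    n = len(default_labels)
    focal_term = sum(
        focal_loss(p, y, gamma, alpha) for p, y in zip(default_probs, default_labels)
    ) / n
    bce_term = sum(bce(p, y) for p, y in zip(cure_probs, cure_labels)) / n
    return focal_term + cure_weight * bce_term
```

With gamma set to 0 and alpha set to 0.5, the focal term reduces to half of plain BCE, which is a convenient sanity check when wiring up the two heads.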
Prior Work
There is no prior work. The entire project was built during the hackathon.
Team
Products & Tools
Additional Links
Credit Foundation Model - codebase