Hackathon Showcase 1st Place Winner

A Foundation

Team consisting of a Mastercard Senior Software Engineer with an IISc MS degree, specializing in enterprise Java, Spring Boot, microservices, and secure cloud architectures.

1 member

Project Description

Project Submission: Credit Foundation Model (CFM)
Executive Summary
Typical consumer and mortgage credit risk modeling relies on static snapshots (e.g., credit scores, debt-to-income ratios) or crude lagging indicators (e.g., 30/60/90 days-past-due status). In the trillion-dollar structured finance (MBS/ABS) market, this lack of granular, dynamic visibility translates directly into mispriced risk, inefficient capital allocation, and late-stage default notices.
The Credit Foundation Model (CFM) is an end-to-end self-supervised deep learning pipeline and live dashboard that re-imagines credit performance as a sequence tokenization problem, similar to the way modern LLMs model text. By encoding dynamic loan histories (repayments, DPD transitions, forbearance, cure events) into dense sequence embeddings, CFM enables financial institutions to predict default and delinquency months in advance, perform ESG-sensitive pricing, and optimize portfolio management.

Direct Solution to a Trillion-Dollar Challenge

Structured finance markets (e.g., residential mortgage-backed securities or RMBS) bundle millions of individual loans. Traditional pricing models fail to capture micro-level changes in borrower behavior.
• Dynamic Event Tokenization: CFM converts monthly payment histories into structured tokens (incorporating seasoning, DPD status, balance changes, and restructuring).
• Early Warning Performance: By applying self-supervised pre-training to these event sequences, the model learns deep behavioral representations. Downstream evaluation indicates that CFM embeddings flag defaults 1 to 6 months earlier than standard credit scoring methods, safeguarding capital for ABS issuers and warehouse lenders.
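To make the tokenization idea concrete, here is a minimal sketch of how monthly records could be turned into event tokens and integer IDs. The MonthlyRecord schema, the bucket boundaries, the token format, and the reserved [PAD]/[MASK] IDs are all illustrative assumptions, not the project's actual tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonthlyRecord:
    """One month of loan performance (hypothetical schema)."""
    loan_age_months: int  # seasoning
    days_past_due: int
    balance: float
    restructured: bool = False


def dpd_bucket(days_past_due):
    """Map days past due onto 30/60/90 delinquency buckets."""
    if days_past_due < 30:
        return "CURRENT"
    if days_past_due < 60:
        return "DPD30"
    if days_past_due < 90:
        return "DPD60"
    return "DPD90P"


def seasoning_bucket(loan_age_months):
    """Coarse loan-age buckets (boundaries are assumptions)."""
    if loan_age_months <= 12:
        return "AGE_0_12"
    if loan_age_months <= 36:
        return "AGE_13_36"
    return "AGE_37P"


def tokenize(history):
    """Turn a chronological list of MonthlyRecord into string tokens."""
    tokens = []
    prev_balance = None
    for rec in history:
        if prev_balance is None or rec.balance == prev_balance:
            change = "BAL_FLAT"
        elif rec.balance < prev_balance:
            change = "BAL_DOWN"
        else:
            change = "BAL_UP"
        status = "MOD" if rec.restructured else "NOMOD"
        tokens.append("|".join([
            seasoning_bucket(rec.loan_age_months),
            dpd_bucket(rec.days_past_due),
            change,
            status,
        ]))
        prev_balance = rec.balance
    return tokens


def build_vocab(token_sequences):
    """Assign integer IDs in first-seen order; 0 and 1 are reserved."""
    vocab = {"[PAD]": 0, "[MASK]": 1}
    for seq in token_sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


history = [
    MonthlyRecord(1, 0, 100000.0),
    MonthlyRecord(2, 0, 99500.0),
    MonthlyRecord(3, 35, 99500.0),        # missed payment
    MonthlyRecord(4, 0, 99000.0, True),   # cured after restructuring
]
tokens = tokenize(history)
vocab = build_vocab([tokens])
token_ids = [vocab[t] for t in tokens]
```

Combining several attributes into one composite token keeps the vocabulary small and interpretable; a real tokenizer might instead emit one token per attribute or use finer buckets.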

Working Application Completeness

The CFM prototype is a fully functional, production-ready pipeline with zero placeholders. It integrates the following components into a single run-time framework:
• Step 1 (Ingest, Validate, Train Baseline): Zero-copy ingestion of performance data via DuckDB, an 18-point Gold data validation suite, and a baseline XGBoost classifier.
• Step 2 (Tokenize Sequences): A custom Loan Event Tokenizer that constructs dense sequence histories (sequences.parquet).
• Step 3 (Foundation Model Training): A modular deep learning engine in PyTorch supporting multiple architectures (LSTM, TFT, PatchTST, and a custom Hybrid TFT+PatchTST model). It executes self-supervised pre-training and multi-task fine-tuning (default and cure prediction) with FP16 Automatic Mixed Precision (AMP).
• Step 4 (Downstream Evaluation & Dashboard): An evaluation engine comparing Handcrafted Features, Embeddings-Only, Hybrid, Linear Probes, and PyTorch DNNs side-by-side on metrics like PR-AUC, Gini, Brier Score, and Expected Calibration Error (ECE).
• Interactive UI: A unified web dashboard serving real-time logs, metrics tables, and dynamic charts (early-warning decay, post-calibration scatter plots, and ESG-fairness slices).
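For readers less familiar with the evaluation metrics named in Step 4, the sketch below computes the Brier score, Expected Calibration Error, and Gini coefficient in plain Python. It is an illustration of the standard definitions, not the project's evaluation engine; the equal-width binning and the default of 10 bins are assumptions.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average gap between observed rate and mean prediction per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += len(members) / total * abs(observed - predicted)
    return ece


def roc_auc(y_true, y_score):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def gini(y_true, y_score):
    """Gini coefficient as commonly used in credit scoring: 2 * AUC - 1."""
    return 2.0 * roc_auc(y_true, y_score) - 1.0


y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
scores = {
    "brier": brier_score(y, p),
    "ece": expected_calibration_error(y, p),
    "gini": gini(y, p),
}
```

The pairwise AUC loop is quadratic in the sample size, which is fine for illustration; on portfolio-scale data a rank-based implementation from a metrics library would be the practical choice.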

Business Model Feasibility

CFM addresses a high-value B2B enterprise market. It is structured to scale as an API-first platform:
• Integration: Can be deployed on-premise or in private clouds (via APIs) to integrate directly into underwriting and servicing platforms.
• Value Proposition: On a $1B mortgage portfolio, even a 0.5% improvement in default capture rate can translate into tens of millions of dollars in saved recovery costs and lower risk premiums.
• ESG Integration: Servicers can dynamically adjust outreach policies based on ESG performance (using energy performance labels to identify transition risk).

Technical Excellence & Tech Stack Leverage
The application is built on a high-performance stack optimized for financial tabular sequences:
Core Frameworks & Libraries
• PyTorch (with FP16 AMP): Powering the deep learning training loop. FP16 mixed precision accelerates training and dramatically reduces memory usage on local GPU hardware (such as laptop GPUs).
• DuckDB: Serves as the high-speed, local analytical storage layer. Enables zero-copy, highly efficient SQL joins between Parquet embeddings and metadata tables.
• FastAPI: The backend web server hosting endpoints to dynamically run pipelines, monitor status, and return serialized metrics.
• Scikit-Learn: Used for metrics calculation and Platt scaling calibration. We worked around a compatibility change in newer versions by wrapping pre-trained models in sklearn.frozen.FrozenEstimator, which lets us calibrate them on validation data without triggering a complete refit.
• XGBoost: Trained as the baseline model and downstream evaluator.
• Vanilla HTML5 / CSS3 / JS: Serving a lightweight, responsive dashboard using HSL CSS custom properties (sleek dark mode, neon highlights), grid systems, micro-animations, and Chart.js for visualization.
Pushing Technical Boundaries

Hybrid TFT-PatchTST Architecture: Our leading foundation architecture combines a Variable Selection Network (VSN) (from Temporal Fusion Transformers) to learn feature importances per month with Patching + Transformer Encoders (from PatchTST) to capture long-term sequence dependencies.
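As a rough illustration of the patching half of this design, the sketch below splits a monthly sequence into fixed-length, possibly overlapping windows, each of which would then be embedded as a single transformer input. The function name, patch length, and stride are assumptions, and the sketch omits the end-padding that PatchTST-style models typically apply.

```python
def make_patches(series, patch_len, stride):
    """Split a sequence into windows of patch_len, advancing by stride."""
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    return [
        series[start:start + patch_len]
        for start in range(0, last_start + 1, stride)
    ]


# 12 months of (already encoded) monthly values for one loan.
months = list(range(12))
overlapping = make_patches(months, patch_len=4, stride=2)
disjoint = make_patches(months, patch_len=4, stride=4)
```

Patching shortens the sequence the transformer attends over (here 12 monthly steps become 5 or 3 patches), which is what makes long loan histories cheaper to model.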
Event Vocab Mapping: Representing time-series sequences as discrete vocabularies mapped to financial states, allowing standard NLP-style self-supervised pre-training (Masked Token Prediction).
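A minimal sketch of how inputs and labels for masked token prediction could be prepared from a sequence of token IDs. The reserved IDs (0 for padding, 1 for the mask token) and the 15% masking rate are assumptions borrowed from common NLP practice, not values confirmed by the project; the -100 label matches the default ignore_index of PyTorch's CrossEntropyLoss.

```python
import random

PAD_ID = 0       # assumed padding ID
MASK_ID = 1      # assumed mask-token ID
IGNORE = -100    # label for positions that do not contribute to the loss


def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked token prediction.

    Masked positions carry MASK_ID in inputs and the original ID in labels;
    all other positions keep their ID and get the IGNORE label.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in token_ids:
        if tok != PAD_ID and rng.random() < mask_prob:
            inputs.append(MASK_ID)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE)
    return inputs, labels


sequence = [2, 3, 4, 5, 0, 0]  # four event tokens followed by padding
inputs, labels = mask_tokens(sequence, mask_prob=0.5, rng=random.Random(0))
```

Padding is never masked, so the model is only ever asked to reconstruct real loan events.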
Imbalanced Loss Optimization: Combines Focal Loss for rare default events with standard Binary Cross-Entropy (BCE) for cure events in a joint multi-task fine-tuning objective.
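The joint objective can be illustrated in plain Python as focal loss on the default head plus BCE on the cure head. The gamma, alpha, and cure_weight values below are conventional defaults chosen for illustration; the project's actual hyperparameters and weighting are not stated.

```python
import math


def _clamp(p, eps=1e-7):
    """Keep probabilities away from 0 and 1 so log() stays finite."""
    return min(max(p, eps), 1.0 - eps)


def bce(p, y):
    """Binary cross-entropy for one prediction p and label y in {0, 1}."""
    p = _clamp(p)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def focal(p, y, gamma=2.0, alpha=0.25):
    """Focal loss: down-weights examples the model already gets right."""
    p = _clamp(p)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def joint_loss(default_probs, default_labels, cure_probs, cure_labels,
               cure_weight=1.0):
    """Mean focal loss (default task) + weighted mean BCE (cure task)."""
    default_term = sum(
        focal(p, y) for p, y in zip(default_probs, default_labels)
    ) / len(default_labels)
    cure_term = sum(
        bce(p, y) for p, y in zip(cure_probs, cure_labels)
    ) / len(cure_labels)
    return default_term + cure_weight * cure_term


loss = joint_loss(
    default_probs=[0.9, 0.2], default_labels=[1, 0],
    cure_probs=[0.6, 0.3], cure_labels=[1, 0],
)
```

With gamma set to 0 and alpha to 0.5, the focal term reduces to half of plain BCE, which is a quick sanity check when wiring this into a training loop.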
Strict OOT Splitting: Enforces out-of-time (OOT) validation splits (pre-2024 vs. post-2025) to avoid temporal data leakage, matching strict institutional model risk validation standards.
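An out-of-time split of this kind can be sketched as a pair of date cutoffs with an optional buffer period in between. The cutoff dates used below are one reading of "pre-2024 vs. post-2025", and the as_of field name is hypothetical.

```python
from datetime import date


def oot_split(rows, train_before, test_from, date_key="as_of"):
    """Split rows by observation date.

    Rows dated before train_before go to training, rows dated on or after
    test_from go to testing, and anything in between is dropped as a buffer.
    """
    if test_from < train_before:
        raise ValueError("test window must not start before training ends")
    train = [r for r in rows if r[date_key] < train_before]
    test = [r for r in rows if r[date_key] >= test_from]
    return train, test


rows = [
    {"loan_id": "A", "as_of": date(2023, 6, 1)},
    {"loan_id": "B", "as_of": date(2024, 6, 1)},   # falls in the buffer
    {"loan_id": "C", "as_of": date(2025, 3, 1)},
]
train, test = oot_split(
    rows, train_before=date(2024, 1, 1), test_from=date(2025, 1, 1)
)
```

Because every training row precedes every test row in calendar time, the model is never evaluated on a period it has already seen.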

Prior Work

There is no prior work. The entire project was built during the hackathon.

Team

Anusha G.

Products & Tools

AI Tinkerers Channel FINOS Hypoport NVIDIA

Additional Links

https://github.com/AnushaFreshStart/A-foundation-credit-model/tree/main

Credit Foundation Model - codebase
