Headline numbers
296,457
File instances across 5 mature OSS systems — 50–330× larger than PROMISE / AEEEM benchmarks. 10-year commit window.
0.8998
AUC of the best classifier (Random Forest, Paper A). XGBoost a close second at 0.8955.
43.82%
Defects captured in the top 10% of risk-ranked files — a 4.37× lift over uniform random and 81.4% of the oracle ceiling.
68%
Effort reduction from manual test authoring to RAITG (Paper E): 184h → 58.9h across 312 requirements.
94.1%
Requirement coverage achieved by the LLM + rule-verifier pipeline, up from 71.2% manual baseline.
96.3%
First-pass verification rate for generated tests — the deterministic rule check passes on the first try.
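For readers who want to check figures like these, here is a minimal sketch of how a "defects captured in the top 10%" number, its lift over random, and an effort reduction can be computed. The function names and the toy scores are hypothetical, used only for illustration; this is not the papers' evaluation code.

```python
def capture_at_top_fraction(scores, defects, fraction=0.10):
    """Share of all defects found in the top `fraction` of files ranked by risk."""
    # Rank files from highest to lowest predicted risk.
    ranked = sorted(zip(scores, defects), key=lambda pair: pair[0], reverse=True)
    # Inspect at least one file, even for very small inputs.
    cutoff = max(1, int(len(ranked) * fraction))
    captured = sum(d for _, d in ranked[:cutoff])
    total = sum(defects)
    return captured / total if total else 0.0


def lift_over_random(capture, fraction=0.10):
    """Uniform random inspection of `fraction` of files finds about `fraction` of defects."""
    return capture / fraction


def effort_reduction(before_hours, after_hours):
    """Relative saving when moving from `before_hours` to `after_hours`."""
    return (before_hours - after_hours) / before_hours


# Toy data (assumed): 10 files, 2 of them defective.
scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.05]
defects = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]

capture = capture_at_top_fraction(scores, defects)
print(f"capture@10%: {capture:.2f}, lift: {lift_over_random(capture):.1f}x")

# Effort figures quoted above for Paper E: 184h manual, 58.9h with RAITG.
print(f"effort reduction: {effort_reduction(184.0, 58.9):.0%}")
```

With the Paper E hours, the last line reproduces the 68% effort reduction quoted above.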
Where the research lives in production
A direct line from each paper to a running service or deployment.
- Paper A → the production defect-prediction model in TestForge AI is gb-paper1-v4-fixed_age: Gradient Boosting + SMOTE, AUC 0.8917 on the corrected dataset.
- Paper C → the FastAPI /predict microservice and the GitHub Actions integration that turn that model into pre-merge risk advice.
- Paper E → the test-generation service in TestForge AI that emits Playwright + BDD specs from English requirements.
- Paper D → the self-healing runtime that recovers broken Selenium locators using DOM similarity and tree-ensemble ranking.
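As a rough illustration of the DOM-similarity idea behind the self-healing runtime, the sketch below scores candidate elements against the attributes recorded for a locator that no longer resolves. It covers only a simple similarity feature using Python's standard difflib; the tree-ensemble ranking from Paper D is not reproduced, and the function names, attribute keys, and sample elements are all assumptions.

```python
from difflib import SequenceMatcher


def attribute_similarity(old, candidate):
    """Mean string similarity over the attributes recorded for the old element."""
    if not old:
        return 0.0
    ratios = [
        # A missing attribute on the candidate is compared as an empty string.
        SequenceMatcher(None, value, candidate.get(key, "")).ratio()
        for key, value in old.items()
    ]
    return sum(ratios) / len(ratios)


def best_candidate(old, candidates):
    """Pick the candidate element most similar to the broken locator's target."""
    return max(candidates, key=lambda c: attribute_similarity(old, c))


# Hypothetical snapshot of the element the broken locator used to match.
old_element = {"tag": "button", "id": "submit-btn", "text": "Submit"}

# Hypothetical elements found in the current DOM.
current_dom = [
    {"tag": "a", "id": "help-link", "text": "Help"},
    {"tag": "button", "id": "submit-button", "text": "Submit"},
]

healed = best_candidate(old_element, current_dom)
print(healed["id"])
```

In a fuller pipeline, a score like this would be one input feature among several to a learned ranker rather than the final decision.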