AI-Driven Software Quality Engineering: System Architecture

Layered architecture from data sources through ML and runtime services to CI/CD integration, with all six research papers anchored to the components they produce.

[Figure: layered system architecture diagram. Boxes per layer, with the paper each one is anchored to:]

1 · Data sources: Git commit logs (Elasticsearch, Spring Boot, Hadoop); issue / defect history (keyword labels, bug tags, JIRA); live DOM snapshots (Selenium / Playwright sessions); natural-language requirements (312 requirements, 3 domains); test-run telemetry (failures, stack traces, timings).

2 · Feature extraction & dataset: Repository Analytics Engine (commit_count, unique_developers, churn, lines_added, lines_deleted, file_age_days, commit_frequency, bug-fixed in v4); Unified Dataset (296,457 file instances, 5 OSS repos, 18.61% defect ratio, 10-year window; Paper A §3); DOM Feature Extractor (8 feature families, including id-stability, role/aria, structural, text, depth, attribute Jaccard; Paper D); Requirement Pre-processor (RAITG prompt templates, domain glossary, acceptance-criteria normalisation; Paper E).

3 · ML / LLM core: Defect-Prediction Models (LR, DT, RF, GB, XGB, MLP; RF AUC 0.8998, best; XGB AUC 0.8955; stratified 5-fold CV, SMOTE; Papers A and B); Cross-Repo Transfer (leave-one-repository-out; cross AUC 0.867, F1 0.631; AUC–F1 asymmetry analysis; defect-rate mismatch driver; Paper B); Locator-Ranking Models (tree ensemble over 8 features; 2,400 mutation events; 7 refactor classes; heuristic + ML hybrid; Paper D); LLM + Rule Verifier (prompt-engineered generation; deterministic rule check; symbolic mutation indicators; 96.3% first-pass verify; Paper E).

4 · Runtime services (FastAPI): Risk Prediction API (/predict, gb-paper1-v4, top-k coverage endpoint; Paper C); Test Prioritisation (top 10% → 43.82% of defects; 4.37× risk-vs-random lift; Paper C); Self-Healing Runtime (DOM-similarity recovery; confidence-gated heal/flag; Paper D); Test Generation Service (requirements → executable tests; 94.1% requirement coverage; Paper E); Defect-Attribution Service (SHAP × test-failure fusion; triage suggestions; vision; Paper F).

5 · CI/CD & developer surface: GitHub Actions (PR & merge gating, advisory); Risk Dashboard (quality metrics, drilldowns); CLI & IDE Plugin (pre-commit risk hints); Reports & Audit Log (versioned prompts, receipts).

AI governance rail: payload controls, prompt versioning, privacy hashing, audit trails. The diagram also shows a telemetry and feedback loop.

Layer-by-layer walkthrough

Each layer publishes a stable contract to the one above it; each box names the paper(s) where its design is justified.

1 · Data sources

Five upstream sources feed the platform: Git commit logs (history across Elasticsearch, Spring Boot, Hadoop, Kafka, Express), issue / defect history (keyword labelling on bug tags), live DOM snapshots from Selenium and Playwright sessions, natural-language requirements (312 across three domains), and test-run telemetry from real CI executions.

Sources are intentionally heterogeneous: the platform's job is to fuse them into a single risk model and a single triage signal.

2 · Feature extraction & dataset

The Repository Analytics Engine produces the seven process metrics powering Paper A: commit_count, unique_developers, code_churn, lines_added, lines_deleted, file_age_days, and commit_frequency. The May 2026 rebuild corrected the file_age_days sign bug that previously contaminated 40% of rows, lifting commit_frequency from a near-zero contributor to a meaningful secondary signal.
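As a rough illustration of how the seven metrics could be derived from per-file commit records, here is a minimal sketch. The `FileCommit` record, the `process_metrics` function, and the definitions of `code_churn` (added plus deleted lines) and `commit_frequency` (commits per day of file age) are assumptions for illustration; the source does not give the engine's formulas or say what the sign bug actually was. The guard on negative ages only shows one way a sign error could be caught early.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FileCommit:
    """One commit touching a file (hypothetical record shape)."""
    author: str
    day: date
    lines_added: int
    lines_deleted: int


def process_metrics(commits: list[FileCommit], snapshot: date) -> dict[str, float]:
    """Compute the seven process metrics for a single file."""
    if not commits:
        raise ValueError("file has no commits")
    first_seen = min(c.day for c in commits)
    # Age is snapshot minus first commit; a negative value signals bad input.
    file_age_days = (snapshot - first_seen).days
    if file_age_days < 0:
        raise ValueError("snapshot predates the first commit")
    added = sum(c.lines_added for c in commits)
    deleted = sum(c.lines_deleted for c in commits)
    return {
        "commit_count": len(commits),
        "unique_developers": len({c.author for c in commits}),
        "code_churn": added + deleted,  # assumed definition
        "lines_added": added,
        "lines_deleted": deleted,
        "file_age_days": file_age_days,
        # Assumed definition: commits per day, with a one-day floor
        # so that files created on the snapshot day do not divide by zero.
        "commit_frequency": len(commits) / max(file_age_days, 1),
    }
```

Because `commit_frequency` in this sketch is derived from `file_age_days`, an error in the age column would propagate directly into the frequency column, which is one plausible reading of why fixing the former changed the latter's usefulness.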

In parallel the DOM feature extractor (Paper D) emits eight feature families and the requirement pre-processor (Paper E) normalises requirements via the RAITG prompt schema and domain glossary.
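To make one of the DOM feature families concrete, the sketch below computes attribute Jaccard, which the architecture diagram lists among the extractor's features. Treating each element as a set of (attribute, value) pairs is an assumption, as are the function names `attribute_jaccard` and `best_candidate`; Paper D may define the feature differently.

```python
def attribute_jaccard(old: dict[str, str], new: dict[str, str]) -> float:
    """Jaccard similarity over the (attribute, value) pairs of two elements."""
    a, b = set(old.items()), set(new.items())
    if not a and not b:
        return 1.0  # two attribute-less elements are treated as identical
    return len(a & b) / len(a | b)


def best_candidate(
    target: dict[str, str], candidates: list[dict[str, str]]
) -> tuple[int, float]:
    """Return (index, similarity) of the candidate closest to the target."""
    if not candidates:
        raise ValueError("no candidates to compare against")
    scored = [(attribute_jaccard(target, c), i) for i, c in enumerate(candidates)]
    similarity, index = max(scored, key=lambda s: s[0])
    return index, similarity


# A button whose id was renamed in a refactor but whose other attributes survived.
before = {"id": "submit", "class": "btn primary", "type": "submit"}
after = {"id": "submit-btn", "class": "btn primary", "type": "submit"}
print(attribute_jaccard(before, after))  # 2 shared pairs out of 4 distinct pairs
```

A single similarity like this is only one input; the source describes the ranking as a tree ensemble over eight features combined with heuristics.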

3 · ML / LLM core

Four model families share a uniform training pipeline (stratified 5-fold CV, SMOTE oversampling, and paired t-test / Wilcoxon significance testing):

Defect-prediction models: six classifiers benchmarked head-to-head, with Random Forest best at AUC 0.8998 (Paper A).
Cross-repository transfer: leave-one-repository-out evaluation, with cross AUC 0.867 and cross F1 0.631 (Paper B).
Locator-ranking models: tree ensembles over 2,400 mutation events spanning seven refactor classes (Paper D).
LLM + rule verifier: prompt-engineered generation with deterministic rule checks and symbolic mutation indicators (Paper E).
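The leave-one-repository-out protocol used for cross-repository transfer can be sketched as a split generator: each repository is held out once as the test set while the model trains on all the others. The function name `leave_one_repo_out` and the index-based return shape are assumptions for illustration, and the within-repository pipeline (stratified folds, SMOTE) is not shown here.

```python
from collections.abc import Iterator, Sequence


def leave_one_repo_out(
    repos: Sequence[str],
) -> Iterator[tuple[str, list[int], list[int]]]:
    """Yield (held_out_repo, train_indices, test_indices), one split per repository.

    `repos[i]` is the repository that row i of the dataset came from.
    """
    for held_out in sorted(set(repos)):
        train = [i for i, r in enumerate(repos) if r != held_out]
        test = [i for i, r in enumerate(repos) if r == held_out]
        yield held_out, train, test


# Toy example with four rows drawn from three repositories.
rows = ["kafka", "hadoop", "kafka", "express"]
for name, train_idx, test_idx in leave_one_repo_out(rows):
    print(name, train_idx, test_idx)
```

Since no row from the held-out repository ever appears in training, the resulting scores measure transfer to an unseen codebase and not memorisation of repository-specific patterns.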

4 · Runtime services (FastAPI)

Five microservices expose the trained models behind versioned HTTP endpoints: the risk-prediction API; the test-prioritisation service (the top 10% by predicted risk captures 43.82% of defects, a 4.37× lift over random; Paper C); the self-healing runtime; the test-generation service (94.1% requirement coverage, 96.3% first-pass verification; Paper E); and the defect-attribution service (Paper F, vision).
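The top-k capture and lift figures reported for test prioritisation can be computed along the following lines. This is a sketch under stated assumptions: `top_k_capture` is a hypothetical helper, lift is taken to be the captured share divided by the share of items inspected, and the cut-off is rounded to a whole item. Paper C may define or round these quantities differently.

```python
def top_k_capture(
    scores: list[float], labels: list[int], fraction: float = 0.10
) -> tuple[float, float]:
    """Return (share of defects found in the top `fraction` by score, lift over random)."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and of equal length")
    total_defects = sum(labels)
    if total_defects == 0:
        raise ValueError("no defective items to capture")
    # Number of items inspected, rounded to the nearest whole item (at least one).
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    captured = sum(label for _, label in ranked[:k])
    capture = captured / total_defects
    # Inspecting k of n items at random is expected to capture k/n of the defects.
    lift = capture / (k / len(scores))
    return capture, lift


# Ten items, two defective; inspect the top 20% by predicted risk.
risk = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
defective = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(top_k_capture(risk, defective, fraction=0.2))
```

In the toy example the top two items contain one of the two defects, so half the defects are captured by inspecting a fifth of the items.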

The production model in service is gb-paper1-v4-fixed_age — Gradient Boosting with SMOTE, trained on 386,076 instances post-correction, AUC 0.8917.

5 · CI/CD & developer surface

The platform meets developers where they already work: a GitHub Actions integration with advisory and gating rollouts, a risk dashboard, a CLI / IDE plugin for pre-commit risk hints, and versioned reports and audit logs, including the immutable prompt receipts that the AI governance layer requires.

AI governance

AI governance is a cross-cutting concern, not a layer. It implements payload controls, content redaction, author-privacy hashing, prompt versioning, and audit trails. Every model call, whether to the defect classifier, the LLM, or the attribution service, passes through this rail and produces a verifiable receipt.
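One way author-privacy hashing and verifiable receipts could fit together is sketched below using only Python's standard `hashlib`, `hmac`, and `json` modules. Everything platform-specific here is an assumption: the function names, the receipt fields, the use of a keyed HMAC-SHA256, and the example prompt version string are illustrative and not taken from the source.

```python
import hashlib
import hmac
import json


def _sign(body: dict, secret: bytes) -> str:
    """HMAC-SHA256 over a canonical (sorted-key) JSON encoding of the body."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hmac.new(secret, encoded, hashlib.sha256).hexdigest()


def hash_author(email: str, secret: bytes) -> str:
    """Keyed hash: the same author always maps to the same opaque token."""
    normalised = email.strip().lower().encode("utf-8")
    return hmac.new(secret, normalised, hashlib.sha256).hexdigest()


def make_receipt(model_id: str, prompt_version: str, payload: dict, secret: bytes) -> dict:
    """Build a receipt that records what was called without storing the payload."""
    body = {
        "model_id": model_id,
        "prompt_version": prompt_version,
        # Only a digest of the payload is kept, never the payload itself.
        "payload_sha256": hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode("utf-8")
        ).hexdigest(),
    }
    return {**body, "signature": _sign(body, secret)}


def verify_receipt(receipt: dict, secret: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    body = {k: v for k, v in receipt.items() if k != "signature"}
    return hmac.compare_digest(_sign(body, secret), str(receipt.get("signature", "")))


key = b"demo-key-not-for-production"
receipt = make_receipt(
    "gb-paper1-v4-fixed_age",
    "prompt-v1",  # hypothetical version label
    {"author": hash_author("dev@example.com", key), "code_churn": 120},
    key,
)
print(verify_receipt(receipt, key))
```

Signing a canonical encoding means any later edit to the receipt's fields invalidates the signature, which is the property an audit trail needs; key management and storage immutability are outside this sketch.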

Why this shape? The architecture is deliberately ML-agnostic at the service boundary — the same FastAPI contract serves the current Gradient Boosting model and would serve a fine-tuned transformer if and when one outperforms it on calibration, not just on raw AUC.
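The model-agnostic boundary can be illustrated with a structural interface: any object that exposes the same method can sit behind the same handler. This sketch deliberately avoids FastAPI itself; `RiskModel`, `handle_predict`, the two toy models, and the response fields are all hypothetical stand-ins, not the platform's actual contract.

```python
from dataclasses import dataclass
from typing import Protocol


class RiskModel(Protocol):
    """Anything with a model_id and a predict_risk method satisfies the contract."""
    model_id: str

    def predict_risk(self, features: dict[str, float]) -> float: ...


@dataclass
class ChurnHeuristic:
    """Toy stand-in; not the platform's Gradient Boosting classifier."""
    model_id: str = "toy-churn-v0"

    def predict_risk(self, features: dict[str, float]) -> float:
        return min(1.0, features.get("code_churn", 0.0) / 1000.0)


@dataclass
class ConstantModel:
    """Second toy model with a completely different internal logic."""
    model_id: str = "toy-constant-v0"
    risk: float = 0.5

    def predict_risk(self, features: dict[str, float]) -> float:
        return self.risk


def handle_predict(model: RiskModel, features: dict[str, float]) -> dict:
    """Response shape stays fixed regardless of which model is plugged in."""
    risk = model.predict_risk(features)
    if not 0.0 <= risk <= 1.0:
        raise ValueError("model returned a risk outside [0, 1]")
    return {"model_id": model.model_id, "risk": risk}


for candidate in (ChurnHeuristic(), ConstantModel()):
    print(handle_predict(candidate, {"code_churn": 250.0}))
```

Swapping the model changes only the `model_id` and the number returned; callers that depend on the response shape are unaffected, which is the property the paragraph above describes.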