
Explainable AI

Why explainability matters

  • Modern deep models often function as black boxes, prioritizing predictive performance over transparency.

  • Transparency is crucial where models impact society (health, credit, justice, safety).

  • A typical ML pipeline maps inputs (image/text/tabular) through a function $f_\theta(\cdot)$ to an output $\hat{y}$; errors backpropagate to optimize $\theta$. Understanding which features influence $\hat{y}$, and why, motivates XAI.

Two core challenges in Explainable AI (XAI)

  • Missingness problem. How should we represent a feature being “absent”? Choice of baseline or masking strongly affects explanations.

  • Correlation problem. Real features are dependent. Methods that vary one feature while holding others fixed may produce unrealistic inputs and misleading attributions.

Nomenclature and taxonomy

  • Intrinsic vs post hoc. Intrinsic (glass box) models are interpretable by design; post hoc methods explain already-trained black-box models.

  • Local vs global. Local methods explain a specific prediction; global methods summarize model behavior across the dataset.

  • Surrogates. A simple, interpretable model $g$ approximates a complex $f$ (locally or globally) to obtain human-readable explanations.

  • Counterfactuals. “What minimal change to $x$ would flip the decision?” Provides actionable insight for an instance.

  • Data modalities. Images, text, and tabular data call for different masking/baseline strategies and surrogates.

  • Attributions. A heatmap that shows which pixels in an image contributed the most to a model's decision.

Glass-box models

Linear models

Linear regression. With standardized features, coefficients indicate relative importance.

  • Prediction: $\hat{y}=\beta_0+\sum_{j=1}^p \beta_j x_j$.

  • Fit by least squares (MSE):

    $$\min_{\beta_0,\ldots,\beta_p}\ \sum_{i=1}^n \Bigl(y^{(i)}-\beta_0-\sum_{j=1}^p\beta_j x^{(i)}_j\Bigr)^2.$$

Logistic regression. Probabilistic classification with sigmoid/softmax link.

  • Binary: $P(y=1\mid x)=\sigma(z)$, where $z=\beta_0+\sum_j \beta_j x_j$ and $\sigma(z)=\tfrac{1}{1+e^{-z}}$.

  • Fit by minimizing the negative log-likelihood:

    $$\min_{\beta}\ -\sum_{i=1}^n \Bigl[y^{(i)}\log\sigma(z^{(i)})+(1-y^{(i)})\log\bigl(1-\sigma(z^{(i)})\bigr)\Bigr].$$

Interpretation caveats.

  • Standardization helps compare $|\beta_j|$ across features.

  • Signs: $\beta_j > 0$ means increasing $x_j$ increases $\hat{y}$ (or the log-odds).
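
As a rough sketch (assuming scikit-learn and its built-in breast-cancer dataset, purely for illustration), the following standardizes features, fits a logistic regression, and ranks features by $|\beta_j|$:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small tabular classification dataset.
data = load_breast_cancer()
X, y, feature_names = data.data, data.target, data.feature_names

# Standardize features so coefficient magnitudes are comparable.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Global explanation: rank features by |beta_j| on the standardized scale.
coefs = model.named_steps["logisticregression"].coef_.ravel()
order = np.argsort(np.abs(coefs))[::-1]
for idx in order[:5]:
    print(f"{feature_names[idx]:>25s}  beta = {coefs[idx]:+.3f}")
```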

Decision trees (and forests)

Decision trees. Interpretable by rules/paths from root to leaf. For classification, each leaf stores a class probability; the prediction comes from the leaf reached by the instance.

Purity via entropy (classification):

$$H(D) = -\sum_{c=1}^{k} p(C_c)\log_2 p(C_c).$$

Information gain for attribute $A$:

$$\mathrm{IG}(D,A) = H(D) - \sum_{t\in\text{splits}} \frac{|D_t|}{|D|}\, H(D_t).$$

Growing trees. Greedy splits that maximize information gain until stopping criteria (max depth, min leaf size, purity). Watch for overfitting; forests (bagging, random feature subsets) improve generalization and support tasks like outlier detection (e.g., Isolation Forest).
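
A small numpy sketch of the entropy and information-gain formulas above, on a hypothetical toy split:

```python
import numpy as np

def entropy(labels):
    """H(D) = -sum_c p(C_c) log2 p(C_c) over the class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute):
    """IG(D, A) = H(D) - sum_t |D_t|/|D| * H(D_t), splitting on attribute values."""
    total = entropy(labels)
    weighted = 0.0
    for value in np.unique(attribute):
        subset = labels[attribute == value]
        weighted += len(subset) / len(labels) * entropy(subset)
    return total - weighted

# Toy example: 8 instances with a binary class label and a binary attribute.
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
a = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(f"H(D)     = {entropy(y):.3f}")
print(f"IG(D, A) = {information_gain(y, a):.3f}")
```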

Explainability.

  • Global: feature importances, prototypical rules.

  • Local: path-level rules for a prediction.

Perturbation-based explanations: occlusion

Idea. Measure output change when parts of the input are masked.

  • Sliding window occlusion trades off window size and stride; results can vary notably.

  • Adaptive occlusion seeks the smallest region that preserves the prediction:

    $$\mathcal{L}(\mathbf{m}) = \bigl|\hat{y}(\mathbf{x})-\hat{y}(\mathbf{x}\odot\mathbf{m})\bigr| + \alpha \sum_{i}|m_i|,$$

    with mask $\mathbf{m}\in[0,1]^d$ and elementwise product $\odot$. Smaller masks with similar scores are preferred.

Randomized occlusion. Sample many random masks, score, normalize, and aggregate into a heatmap.
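
A minimal sliding-window occlusion sketch in PyTorch, assuming a generic classifier `model`, an `image` tensor of shape (3, H, W), and a zero baseline for the masked patch (all of these are placeholders):

```python
import torch

@torch.no_grad()
def occlusion_map(model, image, target_class, window=16, stride=8, baseline=0.0):
    """Heatmap of score drops when square patches are replaced by the baseline."""
    model.eval()
    _, H, W = image.shape
    original = model(image.unsqueeze(0))[0, target_class].item()
    heat = torch.zeros(H, W)
    counts = torch.zeros(H, W)
    for top in range(0, H - window + 1, stride):
        for left in range(0, W - window + 1, stride):
            occluded = image.clone()
            occluded[:, top:top + window, left:left + window] = baseline
            score = model(occluded.unsqueeze(0))[0, target_class].item()
            # Larger drop => more decision evidence in this patch.
            heat[top:top + window, left:left + window] += original - score
            counts[top:top + window, left:left + window] += 1
    return heat / counts.clamp(min=1)
```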

Caveats.

  • Can resemble segmentation rather than decision evidence.

  • Computationally expensive (many forward passes).

  • Sensitive to the choice of baseline (missingness).

Class Activation Mapping (CAM and Grad-CAM)

CAM (for specific CNN architectures).

  • Requires a global average pooling just before a linear classifier (softmax).

  • For class $c$, the map is the weighted sum of the final conv feature maps:

    $$M_c=\mathrm{ReLU}\!\left(\sum_{k} w^{c}_{k}\,A^{k}\right),$$

    where $A^k$ are feature maps and $w_k^c$ are classifier weights. ReLU highlights positively contributing regions.

Grad-CAM (architecture-agnostic for CNNs).

  • Uses gradients to obtain weights $\alpha_k^c$ for the last conv layer’s feature maps:

    $$M_c=\mathrm{ReLU}\!\left(\sum_{k} \alpha^{c}_{k}\,A^{k}\right),\qquad \alpha^{c}_{k}\propto \text{global average of } \frac{\partial \hat{y}_c}{\partial A^{k}}.$$
  • Removes CAM's architectural constraint, but resolution depends on the final conv layer; heatmap intensities are not comparable across images.
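
A compact Grad-CAM sketch in PyTorch using hooks; `model`, `image`, and `conv_layer` (e.g., `model.layer4` in a torchvision ResNet) are assumptions, not fixed API names:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, conv_layer):
    """Grad-CAM: ReLU of the gradient-weighted sum of the last conv feature maps."""
    activations, gradients = {}, {}

    def fwd_hook(module, inputs, output):
        activations["A"] = output

    def bwd_hook(module, grad_input, grad_output):
        gradients["dA"] = grad_output[0]

    h1 = conv_layer.register_forward_hook(fwd_hook)
    h2 = conv_layer.register_full_backward_hook(bwd_hook)

    model.eval()
    score = model(image.unsqueeze(0))[0, target_class]
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()

    A = activations["A"]                          # shape (1, K, h, w)
    dA = gradients["dA"]                          # same shape
    alpha = dA.mean(dim=(2, 3), keepdim=True)     # global-average-pooled gradients
    cam = F.relu((alpha * A).sum(dim=1, keepdim=True))
    # Upsample to the input resolution for visualization.
    cam = F.interpolate(cam, size=image.shape[1:], mode="bilinear", align_corners=False)
    return cam.squeeze().detach()
```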

Surrogate explanations: LIME (local, model-agnostic)

Goal. Approximate a black-box $f$ near an instance $x$ by a simple surrogate model $g$ in an interpretable feature space $z'$ (e.g., superpixels, word indicators).

  • Sample perturbations around $x$ in $z'$-space, map back to the original space $z$, and evaluate $f(z)$.

  • Fit $g$ with locality weights $\pi_x(z)$:

    $$\min_{g\in\mathcal{G}}\ \sum_{i}\ \pi_x(z_i)\,\bigl(f(z_i)-g(z'_i)\bigr)^2\ +\ \Omega(g),$$

    where $\Omega(g)$ penalizes complexity (sparsity).
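
A minimal LIME-style sketch for tabular data, assuming a black-box `predict_fn` and a per-feature `background` row used to represent missingness (image LIME follows the same pattern with binary masks over superpixels):

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_tabular(predict_fn, x, background, n_samples=2000, kernel_width=0.75, alpha=1e-2):
    """Local surrogate: binary z' toggles each feature between x and a background value."""
    rng = np.random.default_rng(0)
    d = x.shape[0]
    # Interpretable representation: z'_j = 1 keeps x_j, z'_j = 0 swaps in the background value.
    z_prime = rng.integers(0, 2, size=(n_samples, d))
    z = np.where(z_prime == 1, x, background)
    preds = predict_fn(z)                          # black-box evaluations f(z)
    # Locality kernel pi_x on the interpretable space (distance from the all-ones vector).
    dist = np.sqrt(((z_prime - 1) ** 2).sum(axis=1)) / np.sqrt(d)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # Weighted linear surrogate g; the ridge penalty plays the role of Omega(g).
    surrogate = Ridge(alpha=alpha)
    surrogate.fit(z_prime, preds, sample_weight=weights)
    return surrogate.coef_                         # per-feature local attributions
```

A sparser complexity penalty (e.g., Lasso or top-$k$ feature selection) is an equally common choice for $\Omega(g)$.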

Design choices.

  • Interpretable binary representation $z'$ (e.g., superpixel on/off).

  • Locality kernel $\pi_x$ (e.g., exponential kernel in interpretable space).

  • Fidelity vs. simplicity trade-off via $\Omega(g)$.

Limitations.

  • Faithfulness depends on sampling scheme, kernel, and representation of missingness.

  • Correlated features and unrealistic perturbations can mislead.

Plot-based global summaries: PDP and ICE

Partial dependence (PDP). For feature $i$,

$$\mathrm{PD}_i(x_i) = \mathbb{E}_{x_{-i}}\bigl[f(x_i, x_{-i})\bigr].$$

Averages out other features to show the marginal effect of $x_i$.

Individual conditional expectation (ICE).

  • For a fixed instance $x^{(j)}$, trace $f(x_i, x_{-i}^{(j)})$ as $x_i$ varies (see the sketch after this list).

  • Reveals heterogeneity that PDP might hide.
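
A numpy sketch of both curves, assuming a generic `predict_fn` mapping an (n, p) array to n scores:

```python
import numpy as np

def pdp_ice(predict_fn, X, feature, grid_size=20):
    """ICE curves f(x_i, x_{-i}^{(j)}) on a grid of x_i values; the PDP is their mean."""
    grid = np.linspace(X[:, feature].min(), X[:, feature].max(), grid_size)
    ice = np.empty((X.shape[0], grid_size))
    for k, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value        # vary x_i, keep all other features fixed
        ice[:, k] = predict_fn(X_mod)
    pdp = ice.mean(axis=0)               # average over instances -> partial dependence
    return grid, ice, pdp
```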

Caveats.

  • If features are correlated, varying $x_i$ alone may create unrealistic inputs (correlation problem).

  • Aggregation across the dataset can hide important local behaviors.

Good practice and common pitfalls

  • Always consider the missingness and correlation problems when designing explanations.

  • Prefer data-aware perturbations and realistic baselines/masks.

  • Validate explanations qualitatively and, where possible, quantitatively (e.g., deletion/insertion tests, sanity checks).

  • Combine complementary views: intrinsic models, local surrogates, perturbation maps, and global plots.

Explainable AI II

Gradient-based attribution: vanilla gradients and Gradient × Input

Setup. Let a trained model produce a scalar score $f(x)$ for input $x \in \mathbb{R}^d$ (e.g., the logit for a chosen class).

  • Vanilla gradients (saliency): use the input gradient $\nabla_x f(x)$ as an importance signal for each feature $i$:

    $$S_i(x) \;=\; \frac{\partial f(x)}{\partial x_i}.$$

    This measures local sensitivity of $f$ to infinitesimal changes in $x_i$.

  • Gradient × Input: scale sensitivity by the feature value to indicate direction and magnitude:

    $$M_i(x) \;=\; x_i \cdot \frac{\partial f(x)}{\partial x_i}.$$

    This heuristic often sharpens attribution by weighting gradients with input intensity.
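
A PyTorch sketch of both signals, assuming a differentiable `model` that returns class logits for a batched input:

```python
import torch

def saliency_and_grad_x_input(model, image, target_class):
    """Vanilla gradient S(x) = df/dx and Gradient x Input M(x) = x * df/dx."""
    model.eval()
    x = image.clone().detach().requires_grad_(True)
    score = model(x.unsqueeze(0))[0, target_class]   # scalar class score f(x)
    score.backward()
    saliency = x.grad.detach()
    grad_x_input = (x * x.grad).detach()
    return saliency, grad_x_input
```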

Known issues. Raw input gradients can be noisy and reflect high-frequency, local properties of piecewise-linear networks. They may also suffer from saturation (e.g., ReLU plateaus or sigmoid saturation), where important features can have near-zero gradients even though the prediction relies on them. This motivates path-based methods that probe sensitivity away from the saturated neighborhood around $x$.

Integrated Gradients (IG)

Goal. Attribute the prediction at $x$ relative to a baseline (reference) $x'$ by aggregating gradients along a path from $x'$ to $x$. The standard straight-line path gives attributions

$$\phi_i^{IG}\!\left(f, x, x'\right) \;=\; (x_i - x_i') \int_{\alpha = 0}^{1} \frac{\partial f\!\left(x' + \alpha (x - x')\right)}{\partial x_i}\, d\alpha.$$

Intuition. Starting from $x'$ (representing feature missingness or neutral input), gradually morph into $x$; accumulate each feature’s gradient contribution along the way. This alleviates local saturation around $x$ and captures how changes along the entire path affect $f$.

Discrete approximation. Using $m$ steps with points $x^{(k)} = x' + \tfrac{k}{m}(x - x')$:

$$\phi_i^{IG} \approx (x_i - x_i') \cdot \frac{1}{m}\sum_{k=1}^{m} \frac{\partial f\!\left(x^{(k)}\right)}{\partial x_i}.$$
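
A direct PyTorch translation of this approximation, assuming the same `model`/`image` setup as above and a caller-supplied `baseline` of the same shape:

```python
import torch

def integrated_gradients(model, image, baseline, target_class, steps=50):
    """Riemann approximation of IG along the straight-line path from baseline to image."""
    model.eval()
    total_grads = torch.zeros_like(image)
    for k in range(1, steps + 1):
        point = (baseline + (k / steps) * (image - baseline)).detach().requires_grad_(True)
        score = model(point.unsqueeze(0))[0, target_class]
        score.backward()
        total_grads += point.grad.detach()
    # (x - x') times the average gradient along the path
    return (image - baseline) * total_grads / steps
```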

Baseline choice matters. Common baselines include a black/white image, blurred input, uniform random, or Gaussian noise references. Each choice encodes a notion of “feature missingness” and can bias the attribution. In practice, one should justify the baseline and, when possible, use multiple references or data-driven baselines.

Shapley values for feature attribution

Cooperative game view. Features are “players” that form coalitions; the model score $f$ is the “value” of a coalition. The Shapley value for feature $i$ is the average marginal contribution of $i$ over all subsets $S$ that exclude $i$:

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\bigl(|N|-|S|-1\bigr)!}{|N|!}\,\bigl[f(S\cup\{i\}) - f(S)\bigr].$$

Combinatorics: Exact computation is intractable for high-dimensional inputs due to the $2^{|N|}$ subsets.

Monte Carlo approximations sample permutations or coalitions and estimate marginal effects. For tabular data, a common strategy replaces “missing” features with values from a background dataset to keep inputs realistic (evaluate the model with and without the feature and average the difference).
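
A permutation-sampling sketch of this idea for tabular data (names like `predict_fn` and `background` are placeholders):

```python
import numpy as np

def shapley_sampling(predict_fn, x, background, n_permutations=200, seed=0):
    """Monte Carlo Shapley: average marginal contributions over random feature orderings."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    phi = np.zeros(d)
    for _ in range(n_permutations):
        order = rng.permutation(d)
        reference = background[rng.integers(len(background))].copy()
        current = reference.copy()              # start with all features "missing"
        prev = predict_fn(current[None, :])[0]
        for i in order:                         # add features one by one in this order
            current[i] = x[i]
            new = predict_fn(current[None, :])[0]
            phi[i] += new - prev                # marginal contribution of feature i
            prev = new
    return phi / n_permutations
```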

Relation to Expected Gradients (EG, below): EG can be viewed as a Shapley-inspired expectation for continuous inputs (e.g., images), where the background distribution substitutes for discrete coalition masking.

Axioms for Shapley values

Shapley values are characterized by the following axioms:

  • Efficiency: attributions sum to the difference between the output for the full coalition and the baseline value.

  • Symmetry: two features with identical marginal contributions to every coalition receive equal attribution.

  • Dummy (null player): a feature that never changes the output receives zero attribution.

  • Additivity (linearity): the attribution for a sum of two games equals the sum of the attributions for each game.

Expected Gradients (EG)

Practical recipe (averaging IG over baselines sampled from a background dataset $\mathcal{D}$):

  1. Sample baselines $x' \sim \mathcal{D}$.

  2. For each $x'$, sample $\alpha \in [0,1]$ and evaluate the IG integrand along the straight-line path from $x'$ to $x$.

  3. Average the contributions to obtain $\mathrm{EG}(x)$.

This links path-integral attributions to data-aware notions of missingness and connects to Shapley-style expectations in image space.
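
A sketch of the recipe in PyTorch, assuming `model`, an `image` tensor, and a `background` tensor of reference images to draw baselines from:

```python
import torch

def expected_gradients(model, image, background, target_class, n_samples=100):
    """EG: average the IG integrand over baselines x' ~ D and positions alpha ~ U(0,1)."""
    model.eval()
    attribution = torch.zeros_like(image)
    for _ in range(n_samples):
        x_ref = background[torch.randint(len(background), (1,)).item()]  # 1. sample baseline
        alpha = torch.rand(1).item()                                     # 2. sample path position
        point = (x_ref + alpha * (image - x_ref)).detach().requires_grad_(True)
        score = model(point.unsqueeze(0))[0, target_class]
        score.backward()
        attribution += (image - x_ref) * point.grad.detach()             # IG integrand
    return attribution / n_samples                                       # 3. average
```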

Counterfactual explanations (multi-objective view)

Definition. Given an instance $x$ with an undesirable prediction $f(x)$, a counterfactual is a nearby $x'$ that achieves a desired outcome (e.g., flip a class) while satisfying constraints (plausibility/feasibility).

Typical objectives.

  • Proximity: make $x'$ close to $x$ (small $\|x'-x\|$).

  • Sparsity: change as few features as possible.

  • Plausibility/feasibility: $x'$ should be realistic and respect constraints (immutability, valid ranges).

  • Validity: $f(x')$ attains the target outcome.

Optimization. Many methods frame counterfactual search as a multi-objective problem and seek a diverse set of Pareto-optimal solutions. Selection emphasizes both performance (target achievement) and diversity among viable changes.
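
As a simplified single-objective sketch of this idea for a differentiable classifier and a tabular feature vector `x` (multi-objective methods generalize this to Pareto sets of diverse candidates):

```python
import torch

def counterfactual_search(model, x, target_class, steps=500, lr=0.05, lam=0.1):
    """Find x' near x whose predicted class becomes target_class (validity + proximity)."""
    x_cf = x.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([x_cf], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(x_cf.unsqueeze(0))[0]
        # Validity: push the target class score up; Proximity: stay close to x.
        loss = -logits[target_class] + lam * torch.norm(x_cf - x, p=1)
        loss.backward()
        optimizer.step()
        if logits.argmax().item() == target_class:
            break                              # stop once the decision flips
    return x_cf.detach()
```

The $\ell_1$ proximity term also encourages sparsity; plausibility and feasibility constraints would add further terms or projections.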

Strengths & limitations.

  • Clear, actionable “what-to-change” guidance; can work with black-box models (just need $f(\cdot)$).

  • Hyper-local: insightful for an instance, but less suited for global model understanding without aggregation.

Practical notes and recap

  • Gradients are fast but can be noisy and saturate.

  • Integrated Gradients mitigate saturation by aggregating along a path from a baseline; results depend on the baseline choice.

  • Expected Gradients average IG over baselines and path positions using a background dataset, reducing baseline bias.

  • Shapley values provide principled attributions with fairness axioms; use Monte Carlo approximations with realistic background data.

  • Counterfactuals offer actionable, instance-level recourses via multi-objective optimization (proximity, sparsity, plausibility, validity).