Explainable AI
Why explainability matters¶
Modern deep models often function as black boxes, prioritizing predictive performance over transparency.
Transparency is crucial where models impact society (health, credit, justice, safety).
A typical ML pipeline maps inputs $x$ (image/text/tabular) through a function $f_\theta$ to an output $\hat{y}$; errors backpropagate to optimize $\theta$. Understanding which features influence $\hat{y}$, and why, motivates XAI.
Two core challenges in Explainable AI (XAI)¶
Missingness problem. How should we represent a feature being “absent”? Choice of baseline or masking strongly affects explanations.
Correlation problem. Real features are dependent. Methods that vary one feature while holding others fixed may produce unrealistic inputs and misleading attributions.
Nomenclature and taxonomy¶
Intrinsic vs post hoc. Intrinsic (glass box) models are interpretable by design; post hoc methods explain already-trained black-box models.
Local vs global. Local methods explain a specific prediction; global methods summarize model behavior across the dataset.
Surrogates. A simple, interpretable model approximates a complex model $f$ (locally or globally) to obtain human-readable explanations.
Counterfactuals. “What minimal change to $x$ would flip the decision?” Provides actionable insight for an instance.
Data modalities. Images, text, and tabular data call for different masking/baseline strategies and surrogates.
Attributions. A heatmap that shows which pixels in an image contributed the most to the model’s decision.
Glass-box models¶
Linear models¶
Linear regression. With standardized features, coefficients indicate relative importance.
Prediction: $\hat{y} = w^\top x + b = \sum_j w_j x_j + b$.
Fit by least squares (MSE): $\min_{w,b}\ \tfrac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^2$.
Logistic regression. Probabilistic classification with sigmoid/softmax link.
Binary: $p(y{=}1 \mid x) = \sigma(z)$, where $z = w^\top x + b$ and $\sigma(z) = \dfrac{1}{1 + e^{-z}}$.
Fit by minimizing the negative log-likelihood: $-\sum_i \big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big]$.
Interpretation caveats.
Standardization helps compare coefficients across features.
Signs: $w_j > 0$ means increasing $x_j$ increases $\hat{y}$ (or the log-odds).
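A minimal sketch, assuming a small synthetic dataset, of how standardized coefficients can be read as relative importances (variable names and data are illustrative, not from the lecture):

```python
# Sketch: coefficients of standardized linear/logistic models as importances.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                                   # three features
y_reg = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)
y_clf = (X[:, 0] + 0.3 * X[:, 2] > 0).astype(int)

X_std = StandardScaler().fit_transform(X)                       # standardize so weights are comparable

lin = LinearRegression().fit(X_std, y_reg)
log = LogisticRegression().fit(X_std, y_clf)

print("linear coefficients  :", lin.coef_)                      # sign/magnitude = direction/importance
print("log-odds coefficients:", log.coef_[0])                   # effect on the log-odds per std. dev.
```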
Decision trees (and forests)¶
Decision trees. Interpretable by rules/paths from root to leaf. For classification, each leaf stores a class probability; the prediction comes from the leaf reached by the instance.
Purity via entropy (classification): $H(S) = -\sum_{c} p_c \log_2 p_c$.
Information gain for attribute $A$: $\mathrm{IG}(S, A) = H(S) - \sum_{v \in \mathrm{vals}(A)} \frac{|S_v|}{|S|}\, H(S_v)$.
Growing trees. Greedy splits that maximize information gain until stopping criteria (max depth, min leaf size, purity). Watch for overfitting; forests (bagging, random feature subsets) improve generalization and support tasks like outlier detection (e.g., Isolation Forest).
Explainability.
Global: feature importances, prototypical rules.
Local: path-level rules for a prediction.
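A minimal sketch, assuming the Iris dataset, of both views on a fitted tree: global impurity-based importances and the local root-to-leaf path for one instance.

```python
# Sketch: global and local explanations from a decision tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Global view: feature importances and the full rule set.
print(dict(zip(data.feature_names, tree.feature_importances_)))
print(export_text(tree, feature_names=data.feature_names))

# Local view: the nodes visited by one instance (its path-level rule).
x = data.data[:1]
node_ids = tree.decision_path(x).indices
print("visited nodes:", node_ids, "-> prediction:", tree.predict(x))
```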
Perturbation-based explanations: occlusion¶
Idea. Measure output change when parts of the input are masked.
Sliding window occlusion trades off window size and stride; results can vary notably.
Adaptive occlusion seeks the smallest region that preserves the prediction, e.g.
$m^{*} = \arg\min_{m} \|m\|_1 \ \ \text{s.t.}\ \ f_c(x \odot m) \approx f_c(x)$, with mask $m$ and elementwise product $\odot$. Smaller masks with similar scores are preferred.
Randomized occlusion. Sample many random masks, score, normalize, and aggregate into a heatmap.
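A minimal sketch of the sliding-window variant, assuming a `model` callable that maps a batch of images (N, H, W, C) to class probabilities and a scalar `baseline` that encodes “missing” (both are assumptions for illustration):

```python
# Sketch: sliding-window occlusion heatmap.
import numpy as np

def occlusion_map(model, image, target_class, window=8, stride=4, baseline=0.0):
    H, W, _ = image.shape
    base_score = model(image[None])[0, target_class]
    heat = np.zeros((H, W))
    counts = np.zeros((H, W))
    for top in range(0, H - window + 1, stride):
        for left in range(0, W - window + 1, stride):
            occluded = image.copy()
            occluded[top:top + window, left:left + window, :] = baseline
            score = model(occluded[None])[0, target_class]
            # Importance = drop in the class score when this patch is hidden.
            heat[top:top + window, left:left + window] += base_score - score
            counts[top:top + window, left:left + window] += 1
    return heat / np.maximum(counts, 1)
```

Window size, stride, and the baseline value are exactly the knobs the caveats below refer to.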
Caveats.
Can resemble segmentation rather than decision evidence.
Computationally expensive (many forward passes).
Sensitive to the choice of baseline (missingness).
Class Activation Mapping (CAM and Grad-CAM)¶
CAM (for specific CNN architectures).
Requires a global average pooling layer just before the linear classifier (softmax).
For class $c$, the map is the weighted sum of the final conv feature maps: $M_c(i, j) = \sum_k w_k^c\, A_k(i, j)$,
where $A_k$ are the feature maps and $w_k^c$ are the classifier weights for class $c$. Applying a ReLU highlights positively contributing regions.
Grad-CAM (architecture-agnostic for CNNs).
Uses gradients to obtain weights for the last conv layer’s feature maps: $\alpha_k^c = \frac{1}{Z}\sum_{i,j}\frac{\partial y^c}{\partial A_k(i, j)}$, and the map is $\mathrm{ReLU}\!\big(\sum_k \alpha_k^c A_k\big)$.
Removes CAM’s architectural constraint, but the resolution is limited by the spatial size of the final conv layer, and heatmap intensities are not comparable across images.
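A minimal Grad-CAM sketch in PyTorch, assuming `model` is a CNN and `target_layer` is its last convolutional layer (both names are assumptions); it implements the $\alpha_k^c$ weighting above via hooks.

```python
# Sketch: Grad-CAM via forward/backward hooks on the last conv layer.
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, x, class_idx):
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        score = model(x)[0, class_idx]                 # scalar class score, batch size 1
        model.zero_grad()
        score.backward()
        A, G = acts["a"], grads["g"]                   # feature maps and their gradients
        weights = G.mean(dim=(2, 3), keepdim=True)     # alpha_k^c: average gradient per map
        cam = F.relu((weights * A).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return (cam / (cam.max() + 1e-8)).squeeze().detach()
    finally:
        h1.remove()
        h2.remove()
```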
Surrogate explanations: LIME (local, model-agnostic)¶
Goal. Approximate a black-box model $f$ near an instance $x$ by a simple surrogate model $g$ in an interpretable feature space $z$ (e.g., superpixels, word indicators).
Sample perturbations $z'$ around $x$ in $z$-space, map them back to the original space as $x'$, and evaluate $f(x')$.
Fit $g$ with locality weights $\pi_x(z')$: $g = \arg\min_{g \in G} \sum_{z'} \pi_x(z')\,\big(f(x') - g(z')\big)^2 + \Omega(g)$,
where $\Omega(g)$ penalizes complexity (sparsity).
Design choices.
Interpretable binary representation (e.g., superpixel on/off).
Locality kernel (e.g., exponential kernel in interpretable space).
Fidelity vs simplicity trade-off via the complexity penalty $\Omega(g)$.
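A minimal LIME-style sketch for tabular data, assuming a black-box `predict` that returns the probability of the class of interest and a `baseline` vector used when a feature is switched “off” (all names are assumptions):

```python
# Sketch: local weighted sparse surrogate in a binary on/off feature space.
import numpy as np
from sklearn.linear_model import Ridge

def lime_tabular(predict, x, baseline, n_samples=1000, kernel_width=0.75, seed=0):
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    Z = rng.integers(0, 2, size=(n_samples, d))          # interpretable on/off masks
    X_pert = np.where(Z == 1, x, baseline)               # map back to the input space
    y = predict(X_pert)                                  # black-box evaluations
    dist = np.sqrt(((Z - 1) ** 2).sum(axis=1)) / np.sqrt(d)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)   # locality kernel pi_x
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
    return surrogate.coef_                                # local feature importances
```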
Limitations.
Faithfulness depends on sampling scheme, kernel, and representation of missingness.
Correlated features and unrealistic perturbations can mislead.
Plot-based global summaries: PDP and ICE¶
Partial dependence (PDP). For feature $j$, $\mathrm{PD}_j(v) = \frac{1}{n}\sum_{i=1}^{n} f\big(v,\, x^{(i)}_{-j}\big)$.
Averages out the other features to show the marginal effect of $x_j$.
Individual conditional expectation (ICE).
For a fixed instance $x^{(i)}$, trace $f\big(v,\, x^{(i)}_{-j}\big)$ as $v$ varies.
Reveals heterogeneity that PDP might hide.
Caveats.
If features are correlated, varying $x_j$ alone may create unrealistic inputs (correlation problem).
Aggregation across the dataset can hide important local behaviors.
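A minimal sketch of both plots, assuming a `predict` callable that maps an (n, d) array to predictions:

```python
# Sketch: ICE curves per instance and the PDP as their average.
import numpy as np

def ice_and_pdp(predict, X, feature, grid=None):
    if grid is None:
        grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 20)
    ice = np.empty((X.shape[0], grid.size))
    for k, v in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = v            # force the feature to the grid value
        ice[:, k] = predict(X_mod)       # one ICE curve per instance (rows)
    pdp = ice.mean(axis=0)               # PDP = average of the ICE curves
    return grid, ice, pdp
```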
Good practice and common pitfalls¶
Always consider the missingness and correlation problems when designing explanations.
Prefer data-aware perturbations and realistic baselines/masks.
Validate explanations qualitatively and, where possible, quantitatively (e.g., deletion/insertion tests, sanity checks).
Combine complementary views: intrinsic models, local surrogates, perturbation maps, and global plots.
Explainable AI II¶
Gradient-based attribution: vanilla gradients and Gradient × Input¶
Setup. Let a trained model produce a scalar score $f_c(x)$ for input $x$ (e.g., the logit for a chosen class $c$).
Vanilla gradients (saliency): use the input gradient as an importance signal for each feature $x_i$: $A_i(x) = \frac{\partial f_c(x)}{\partial x_i}$.
This measures the local sensitivity of $f_c$ to infinitesimal changes in $x_i$.
Gradient × Input: scale the sensitivity by the feature value to indicate direction and magnitude: $A_i(x) = x_i \cdot \frac{\partial f_c(x)}{\partial x_i}$.
This heuristic often sharpens attribution by weighting gradients with input intensity.
Known issues. Raw input gradients can be noisy and reflect high-frequency, local properties of piecewise-linear networks. They may also suffer from saturation (e.g., ReLU plateaus or sigmoid saturation), where important features can have near-zero gradients even though the prediction relies on them. This motivates path-based methods that probe sensitivity away from the saturated neighborhood around $x$.
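A minimal PyTorch sketch of both attributions, assuming `model` returns class logits and `x` is a single input with a leading batch dimension of 1:

```python
# Sketch: vanilla gradient (saliency) and Gradient x Input.
import torch

def saliency_and_grad_x_input(model, x, class_idx):
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[0, class_idx]            # scalar score f_c(x)
    score.backward()
    grad = x.grad.detach()                    # vanilla gradient (saliency)
    grad_x_input = grad * x.detach()          # Gradient x Input
    return grad, grad_x_input
```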
Integrated Gradients (IG)¶
Goal. Attribute the prediction at $x$ relative to a baseline (reference) $x'$ by aggregating gradients along a path from $x'$ to $x$. The standard straight-line path gives attributions $\mathrm{IG}_i(x) = (x_i - x'_i)\int_0^1 \frac{\partial f_c\big(x' + \alpha (x - x')\big)}{\partial x_i}\, d\alpha$.
Intuition. Starting from $x'$ (representing feature missingness or a neutral input), gradually morph it into $x$; accumulate each feature’s gradient contribution along the way. This alleviates local saturation around $x$ and captures how changes along the entire path affect $f_c$.
Discrete approximation. Using $m$ steps with points $x^{(k)} = x' + \tfrac{k}{m}(x - x')$: $\mathrm{IG}_i(x) \approx (x_i - x'_i)\,\frac{1}{m}\sum_{k=1}^{m} \frac{\partial f_c\big(x^{(k)}\big)}{\partial x_i}$.
Baseline choice matters. Common baselines include a black/white image, blurred input, uniform random, or Gaussian noise references. Each choice encodes a notion of “feature missingness” and can bias the attribution. In practice, one should justify the baseline and, when possible, use multiple references or data-driven baselines.
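A minimal PyTorch sketch of the discrete approximation above, assuming `model` returns class logits; the default all-zero baseline is only one of the possible references discussed.

```python
# Sketch: Integrated Gradients via a Riemann sum along the straight-line path.
import torch

def integrated_gradients(model, x, class_idx, baseline=None, steps=50):
    baseline = torch.zeros_like(x) if baseline is None else baseline
    total = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = baseline + (k / steps) * (x - baseline)   # point on the path
        point = point.clone().detach().requires_grad_(True)
        score = model(point)[0, class_idx]
        score.backward()
        total += point.grad.detach()
    return (x - baseline) * total / steps                 # (x - x') * average gradient
```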
Shapley values for feature attribution¶
Cooperative game view. Features are “players” that form coalitions; the model score is the “value” $v(S)$ of a coalition $S$. The Shapley value for feature $i$ is the average marginal contribution of $i$ over all subsets $S \subseteq N \setminus \{i\}$ that exclude $i$: $\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!}\,\big(v(S \cup \{i\}) - v(S)\big)$.
Combinatorics: exact computation is intractable for high-dimensional inputs because there are $2^{|N|}$ coalitions to evaluate.
Monte Carlo approximations sample permutations or coalitions and estimate marginal effects. For tabular data, a common strategy replaces “missing” features with values from a background dataset to keep inputs realistic (evaluate the model with and without feature $i$ present and average the difference).
Relation to EG: Expected Gradients (EG, introduced below) can be viewed as a Shapley-inspired expectation for continuous inputs (e.g., images), where the background distribution substitutes for discrete coalition masking.
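A minimal Monte Carlo sketch for tabular data, assuming `predict` returns a scalar score per row and `background` is a reference dataset used to impute “missing” features (both names are assumptions); it estimates marginal contributions over sampled permutations.

```python
# Sketch: permutation-sampling estimate of Shapley values.
import numpy as np

def shapley_mc(predict, x, background, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    phi = np.zeros(d)
    for _ in range(n_perm):
        order = rng.permutation(d)
        ref = background[rng.integers(len(background))].copy()  # random reference row
        current = ref.copy()
        prev_score = predict(current[None])[0]
        for i in order:                      # add features to the coalition one by one
            current[i] = x[i]
            score = predict(current[None])[0]
            phi[i] += score - prev_score     # marginal contribution of feature i
            prev_score = score
    return phi / n_perm
```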
Axioms for Shapley values¶
Shapley values are characterized by the following axioms:
Efficiency: attributions sum to $v(N) - v(\varnothing)$, i.e., the full prediction relative to the empty coalition.
Symmetry: features with identical marginal contributions receive equal attributions.
Dummy (null player): a feature that never changes $v$ receives zero attribution.
Additivity (linearity): attributions for a sum of games equal the sum of the attributions.
Expected Gradients (EG)¶
Practical recipe:
Sample baselines $x' \sim D$ from a background dataset $D$.
For each $x'$, sample a path position $\alpha \sim U(0, 1)$ and evaluate the IG integrand along the straight-line path from $x'$ to $x$.
Average the contributions to obtain $\mathrm{EG}_i(x) = \mathbb{E}_{x' \sim D,\ \alpha \sim U(0,1)}\!\left[(x_i - x'_i)\,\frac{\partial f_c\big(x' + \alpha (x - x')\big)}{\partial x_i}\right]$.
This links path-integral attributions to data-aware notions of missingness and connects to Shapley-style expectations in image space.
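A minimal PyTorch sketch of the recipe, assuming `model` returns class logits, `x` is a single unbatched input, and `background` is a tensor of reference inputs (the data-aware baseline):

```python
# Sketch: Expected Gradients as a Monte Carlo average of the IG integrand.
import torch

def expected_gradients(model, x, class_idx, background, n_samples=100):
    attr = torch.zeros_like(x)
    for _ in range(n_samples):
        ref = background[torch.randint(len(background), (1,))][0]   # sample a baseline
        alpha = torch.rand(1).item()                                 # sample a path position
        point = (ref + alpha * (x - ref)).clone().detach().requires_grad_(True)
        score = model(point[None])[0, class_idx]
        score.backward()
        attr += (x - ref) * point.grad.detach()                      # IG integrand at this sample
    return attr / n_samples
```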
Counterfactual explanations (multi-objective view)¶
Definition. Given an instance $x$ with an undesirable prediction $f(x)$, a counterfactual is a nearby $x^{\mathrm{cf}}$ that achieves a desired outcome (e.g., flips the class) while satisfying constraints (plausibility/feasibility).
Typical objectives.
Proximity: make $x^{\mathrm{cf}}$ close to $x$ (small $\|x^{\mathrm{cf}} - x\|$).
Sparsity: change as few features as possible.
Plausibility/feasibility: $x^{\mathrm{cf}}$ should be realistic and respect constraints (immutability, valid ranges).
Validity: $f(x^{\mathrm{cf}})$ attains the target outcome.
Optimization. Many methods frame counterfactual search as a multi-objective problem and seek a diverse set of Pareto-optimal solutions. Selection emphasizes both performance (target achievement) and diversity among viable changes.
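A minimal sketch of a simpler, scalarized (single-objective) alternative to the multi-objective search described above, assuming a differentiable `model` that returns class logits; it trades validity against L1 proximity and omits plausibility constraints for brevity.

```python
# Sketch: gradient-based counterfactual search (scalarized objective).
import torch

def counterfactual(model, x, target_class, lam=0.1, steps=500, lr=0.05):
    x_cf = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_cf], lr=lr)
    target = torch.tensor([target_class])
    for _ in range(steps):
        opt.zero_grad()
        validity = torch.nn.functional.cross_entropy(model(x_cf[None]), target)
        proximity = torch.norm(x_cf - x, p=1)       # L1 keeps the change sparse
        loss = validity + lam * proximity
        loss.backward()
        opt.step()
    # In practice one would also check that the prediction actually flips and
    # that immutable features / valid ranges are respected.
    return x_cf.detach()
```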
Strengths & limitations.
Clear, actionable “what-to-change” guidance; can work with black-box models (query access to $f$ is enough).
Hyper-local: insightful for an instance, but less suited for global model understanding without aggregation.
Practical notes and recap¶
Gradients are fast but can be noisy and saturate.
Integrated Gradients mitigate saturation by aggregating along a path from a baseline; results depend on the baseline choice.
Expected Gradients average IG over baselines and path positions using a background dataset, reducing baseline bias.
Shapley values provide principled attributions with fairness axioms; use Monte Carlo approximations with realistic background data.
Counterfactuals offer actionable, instance-level recourse via multi-objective optimization (proximity, sparsity, plausibility, validity).