Identifying Interactions at Scale for Language Models
Introduction to the Interpretability of Language Models
Understanding machine learning systems, especially Large Language Models (LLMs), is a major challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process more transparent for both model developers and end users. This is a crucial step towards creating more reliable and secure AI.
Analyzing LLMs from Different Angles
To grasp the behavior of these complex systems, it is vital to analyze them from various perspectives. Three key approaches emerge: feature attribution, which identifies input elements influencing a prediction; data attribution, which links model behavior to significant training examples; and mechanistic interpretability, which examines the functions of the internal components of the model. Each of these perspectives highlights the inherent complexity of analyzing large-scale models.
Complexity at Scale: A Persistent Challenge
The main difficulty is that model behavior cannot be reduced to isolated components; it emerges from complex dependencies among them. To achieve strong performance, these models synthesize intricate relationships between features and extract shared patterns from diverse training examples. Interpretability methods must therefore capture these influential interactions. But the number of potential interactions grows exponentially with the number of features, training data points, and model components (with just 30 features there are already over a billion feature subsets to consider), making exhaustive analysis practically impossible.
SPEX and ProxySPEX: Promising Algorithms
In this context, we introduce SPEX (Spectral Explainer) and ProxySPEX, algorithms designed to identify these critical interactions at scale. Central to our approach is the concept of ablation, which measures the influence of an element by observing the changes that occur when it is removed.
Attribution through Ablation
- Feature Attribution: We mask or remove specific segments of the input prompt and measure the resulting shift in predictions.
- Data Attribution: We train models on different subsets of the training set, assessing how the model’s output on a test point shifts in the absence of specific training data.
- Model Component Attribution: We intervene on the model’s forward pass by removing the influence of specific internal components, determining which internal structures are responsible for the model’s prediction.
In each case, the goal is the same: to isolate the drivers of a decision by systematically perturbing the system, hoping to uncover key interactions. However, each ablation incurs a significant cost, whether through expensive inference calls or retraining.
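In code, the feature-attribution flavor of ablation can be sketched as follows. Here `model_score` is a hypothetical toy stand-in for an expensive LLM call, and the `[MASK]` placeholder scheme is only illustrative:

```python
def model_score(tokens):
    # Toy scoring function standing in for an LLM: it rewards one token
    # individually and a pair of tokens jointly (an interaction).
    score = 0.0
    if "fever" in tokens:
        score += 1.0
    if "fever" in tokens and "rash" in tokens:
        score += 2.0
    return score

def ablate(tokens, index, placeholder="[MASK]"):
    """Return a copy of the prompt with one token masked out."""
    masked = list(tokens)
    masked[index] = placeholder
    return masked

def single_feature_attributions(tokens):
    """Importance of each token = drop in score when that token is masked."""
    base = model_score(tokens)
    return {t: base - model_score(ablate(tokens, i))
            for i, t in enumerate(tokens)}

prompt = ["patient", "has", "fever", "and", "rash"]
attributions = single_feature_attributions(prompt)
# Masking "fever" removes both its individual and joint contribution,
# so it receives the largest score.
```

Each entry in `attributions` costs one extra model call; with real LLMs, every such call is an expensive inference, which is exactly why the number of ablations matters.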
The SPEX Framework: Efficiency and Innovation
The SPEX framework stands out for its use of signal processing and coding theory, enabling interaction discovery at scales far beyond previous methods. It exploits a key structural observation, that most interactions are not influential, to recast an intractable search problem as a tractable sparse recovery problem. Using strategically chosen ablations, SPEX aggregates many candidate interactions into each measurement, then applies efficient decoding algorithms to disentangle these combined signals and isolate the specific interactions driving the model's behavior.
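To see why sparsity helps, consider a toy function over a handful of binary "keep/ablate" features. The sketch below exhaustively computes its Boolean Fourier (Walsh-Hadamard) coefficients, which is only feasible because n is tiny; SPEX's contribution is recovering the few large coefficients without this exhaustive pass. The function `f` and its structure are invented for illustration:

```python
from itertools import product

def fourier_coefficients(f, n):
    """Exhaustive Boolean (Walsh-Hadamard) transform of f over {0,1}^n.

    The coefficient for subset S is the average of f(x) * (-1)^(sum of x_i
    for i in S). For LLM-scale n this is intractable (2^n model calls);
    SPEX instead recovers only the few large coefficients from
    strategically chosen ablations.
    """
    points = list(product([0, 1], repeat=n))
    coeffs = {}
    for subset in product([0, 1], repeat=n):  # indicator tuple for S
        total = 0.0
        for x in points:
            parity = sum(xi for xi, si in zip(x, subset) if si) % 2
            total += f(x) * (-1) ** parity
        coeffs[subset] = total / len(points)
    return coeffs

# Toy "model": depends only on feature 0 and the pair (1, 2).
def f(x):
    return 2.0 * x[0] + 3.0 * x[1] * x[2]

coeffs = fourier_coefficients(f, n=4)
# Of the 16 coefficients, only a handful are nonzero: the spectrum is sparse.
large = {s: c for s, c in coeffs.items() if abs(c) > 1e-9}
```

In this toy case only 5 of 16 coefficients are nonzero; real model outputs are not exactly sparse, but SPEX's premise is that a small number of coefficients dominate.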
ProxySPEX: An Additional Advancement
With ProxySPEX, we identified another structural property common in complex machine learning models: hierarchy. When a higher-order interaction is significant, its lower-order subsets tend to be significant as well. Exploiting this property lets ProxySPEX match SPEX's performance with roughly 10 times fewer ablations, dramatically reducing computational cost.
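The hierarchy property suggests an apriori-style search: evaluate an order-k interaction only if all of its (k-1)-subsets already proved significant. The sketch below is a simplified illustration of that pruning idea, not ProxySPEX itself; `evaluate` and the importance table are hypothetical stand-ins for real ablation measurements:

```python
from itertools import combinations

# Hypothetical importance oracle: in practice each call would require
# ablations of the model; here it is a toy lookup table.
TRUE_IMPORTANCE = {
    frozenset({"a"}): 0.9,
    frozenset({"b"}): 0.8,
    frozenset({"a", "b"}): 0.7,
}

def evaluate(interaction):
    return TRUE_IMPORTANCE.get(frozenset(interaction), 0.01)

def hierarchical_search(features, max_order, threshold=0.1):
    """Evaluate an order-k interaction only when every (k-1)-subset was
    itself significant, pruning the exponential candidate space."""
    found = {frozenset({f}) for f in features if evaluate({f}) > threshold}
    current = set(found)
    for order in range(2, max_order + 1):
        candidates = set()
        for s in current:
            for f in features:
                cand = s | {f}
                if len(cand) == order and all(
                        frozenset(sub) in found
                        for sub in combinations(cand, order - 1)):
                    candidates.add(cand)
        current = {c for c in candidates if evaluate(c) > threshold}
        found |= current
        if not current:
            break
    return found

result = hierarchical_search(["a", "b", "c", "d"], max_order=3)
# Only {a}, {b}, and {a, b} survive; interactions involving the
# insignificant features c and d are never evaluated.
```

The saving comes from the candidates that are never evaluated at all: any set containing an insignificant feature is pruned before it costs an ablation.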
Feature Attributions: A Powerful Tool
Feature attribution techniques assign importance scores to input features based on their influence on the model’s output. For instance, if an LLM is used to make a medical diagnosis, this approach can pinpoint the specific symptoms that led the model to its conclusion. While attributing importance to individual features is valuable, the true power of sophisticated models lies in their ability to capture complex relationships between features.
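A minimal toy example shows why individual scores can fall short: if a model's output is the XOR of two binary features, each feature's average marginal effect is zero, yet the pair clearly matters. The "model" here is an invented stand-in, not any particular LLM:

```python
# Toy model whose output hinges entirely on a two-feature interaction.
def model(x1, x2):
    return x1 ^ x2  # XOR

def marginal_effect(feature_index):
    """Average change in output from flipping one feature,
    over both settings of the other feature."""
    total = 0
    for other in (0, 1):
        if feature_index == 0:
            total += model(1, other) - model(0, other)
        else:
            total += model(other, 1) - model(other, 0)
    return total / 2

def interaction_effect():
    """Second difference: how x2's effect changes with x1."""
    return (model(1, 1) - model(1, 0)) - (model(0, 1) - model(0, 0))

m0, m1 = marginal_effect(0), marginal_effect(1)
inter = interaction_effect()
# m0 and m1 are both zero, while the interaction term is not:
# per-feature scores alone would call both features unimportant.
```

This is the failure mode interaction-aware methods like SPEX are built to catch.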
Conclusion: Towards New Applications
SPEX and ProxySPEX pave the way for new applications in feature, data, and model component attribution. By enhancing our understanding of key interactions, these frameworks promise to transform how we engage with language models, making their decisions more transparent and reliable.
To learn more about applying these techniques to your project or business, please feel free to contact me.