AI Atlas: Creating an Attentive Hybrid: From Hawk to Griffin
AI breakthroughs, concepts, and techniques that are tangibly valuable, specific, and actionable. Written by Glasswing Founder and Managing Partner, Rudina Seseri
In my last two posts, I dove into the roots of the current wave in generative AI and explained how it all traces back to a paper authored at Google in 2017: “Attention is All You Need.” The “attention” mechanism that this paper described unlocked the ability now common to large language models, particularly those utilizing transformers, to compare each word of a sentence to every other word, allowing AI models to capture context behind entire passages.
However, while transformer- and attention-based models deliver immense technological value, running them at scale carries significant computational costs. Furthermore, their infamous tendency to hallucinate, or confidently output incorrect information, cannot be solved simply by making a model larger, i.e., training it on more data. These barriers have inspired research into alternative AI architectures and techniques that aim for similarly high performance.
This post is the last (for now!) of a multi-part dive into potential alternatives to the current transformer- and attention-based standard in large language models, such as OpenAI’s GPT. Previously, I discussed the innovative models Hyena and Mamba. This week, I return the focus to where everything started, Google, and explore two major developments that Google DeepMind announced this month: Hawk and Griffin.
🗺️ What are Hawk and Griffin?
Hawk and Griffin are two models developed by researchers at Google DeepMind to address the limitations of typical recurrent neural networks (RNNs), a type of machine learning architecture that is particularly effective at processing sequential data such as text or time series. Because RNNs have a clear temporal flow, where future data is influenced by past data, they can produce inferences quickly and with decision-making that is easier to track and interpret relative to other architectures. However, RNNs are complex and slow to train and, like transformers, struggle to retain information over long sequences, which makes them difficult to scale.
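The temporal flow described above can be made concrete with a toy recurrent step. This is a minimal illustrative sketch (random untrained weights, not any model from the paper): the state `h` is updated one timestep at a time, so each output carries information from everything seen so far.

```python
import numpy as np

def rnn_states(xs, d=4, seed=0):
    """Plain RNN pass: the hidden state h is updated sequentially, so each
    state depends on all earlier inputs (toy sketch, not a trained model)."""
    rng = np.random.default_rng(seed)
    Wh = rng.normal(size=(d, d)) * 0.1   # state-to-state weights
    Wx = rng.normal(size=(d, d)) * 0.1   # input-to-state weights
    h = np.zeros(d)
    states = []
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x)     # new state mixes past state and new input
        states.append(h)
    return np.array(states)
```

Because each state depends only on the previous state and the current input, inference is fast and the flow of information is easy to trace, which is the interpretability advantage mentioned above.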
To address the issue of scaling, Hawk applies gated logic to retain important features and forget unnecessary details, similar to a Long Short-Term Memory network (LSTM). As with LSTMs, this mechanism enables Hawk to learn long-range dependencies more effectively and to avoid issues like vanishing gradients, where the training signal shrinks as it propagates back through long sequences and the model stops learning from distant context.
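The gating idea can be sketched in a few lines. This is a simplified illustration with random weights, not DeepMind's actual recurrence from the paper: a sigmoid gate in (0, 1) decides, per dimension, how much of the running state to keep versus overwrite with new input.

```python
import numpy as np

def gated_recurrence(xs, d=4, seed=0):
    """Toy gated recurrent pass: a learned gate interpolates between keeping
    the old state and writing new information (illustrative only; not the
    exact formulation used by Hawk)."""
    rng = np.random.default_rng(seed)
    Wg = rng.normal(size=(d, d))         # gate weights
    Wx = rng.normal(size=(d, d))         # input weights
    h = np.zeros(d)
    for x in xs:
        gate = 1.0 / (1.0 + np.exp(-(Wg @ x)))      # sigmoid: keep fraction in (0, 1)
        h = gate * h + (1.0 - gate) * np.tanh(Wx @ x)  # keep old state vs. take new info
    return h
```

When the gate stays near 1, the state is carried forward almost unchanged, which is how such models remember information across long spans without gradients vanishing through repeated squashing.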
Griffin, on the other hand, reintroduces the attention mechanism into Hawk’s architecture, producing a hybrid model that can handle even longer sequences and improves interpretability by focusing on the relevant portions of an input sequence. Griffin matched the performance of Meta’s LLaMA-2 despite being trained on significantly less data, and has demonstrated the ability to extrapolate to longer sequences than those encountered during training.
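The hybrid idea can be sketched by combining a recurrent path with a local attention path. This is a drastically simplified, hypothetical illustration (Griffin's actual design interleaves residual blocks of gated recurrences and local attention): the recurrence carries state across the whole sequence, while attention over a small sliding window sharpens nearby context.

```python
import numpy as np

def local_attention(xs, window=3):
    """Each position attends only to itself and the previous window-1
    positions -- a sliding-window form of attention (illustrative sketch)."""
    T, d = xs.shape
    out = np.zeros_like(xs)
    for t in range(T):
        ctx = xs[max(0, t - window + 1): t + 1]      # local context window
        scores = ctx @ xs[t] / np.sqrt(d)            # scaled dot-product scores
        w = np.exp(scores - scores.max())
        w /= w.sum()                                 # softmax weights
        out[t] = w @ ctx                             # weighted sum of context
    return out

def hybrid_layer(xs, keep=0.5):
    """Hypothetical Griffin-style mix: a recurrent pass spans the full
    sequence, local attention handles nearby detail; outputs are combined."""
    rec = np.zeros_like(xs)
    h = np.zeros(xs.shape[1])
    for t, x in enumerate(xs):
        h = keep * h + (1.0 - keep) * x              # simple linear recurrence
        rec[t] = h
    return rec + local_attention(xs)                 # combine the two paths
```

Restricting attention to a fixed window keeps its cost linear in sequence length, while the recurrence provides the long-range memory, which is the division of labor the paragraph above describes.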
🤔 What is the significance of these models and what are their limitations?
Both models have been shown to achieve remarkable performance relative to transformers. Like Mamba, Hawk demonstrates that RNNs can be creatively leveraged to process data quickly while being trained on a fraction of the data that comparable transformers require. Griffin, meanwhile, extends these advancements even further by highlighting the potential synergy between RNNs and attention mechanisms: attention is very good at picking out important details, and the recurrent state can then retain those details over longer spans and retrieve them at faster speeds.
- Performance: Achieving parity with models such as LLaMA-2 at a fraction of the data cost is exciting for enterprises looking to scale models on large future datasets.
- Quicker responses: Hawk and Griffin demonstrated improved latency over transformers, which means that outputs can be generated in less time.
- Interpretability: Because RNNs process data in sequence, their decision-making is easier to explain than that of transformers, where inputs are condensed and outputs are obfuscated.
However, Hawk and Griffin are still in the testing stages, and plenty of questions remain around the finer details of their operation. Google DeepMind has yet to open-source the models, so for now one can only draw inferences from the research they published.
- Training efficiency: Training RNNs can be difficult because of their complex structures and incorporation of long-term data. Neither Hawk nor Griffin introduces significant innovations to address this. The paper outlines strategies for distributed training that may alleviate the hurdle, but more testing is required to demonstrate such a solution in practice.
- Ability to scale: Hawk and Griffin have shown stellar progress on relatively small datasets, but will need to be tested at larger scales to see if this performance remains consistent.
- Potential for error: Without an open model being available for testing, it is still unclear how accurately Hawk and Griffin perform on real-world error benchmarks.
🛠️ Applications of Hawk and Griffin
Hawk and Griffin, with their enhanced sequence modeling capabilities, have many potential real-world applications including:
- Real-time systems: The improved latency of Hawk and Griffin creates an opportunity for more responsive AI systems, from rapid data anomaly detection to autonomous robotic agents.
- Conversational agents: Unlocking the ability to retain information over longer periods, alongside quicker response times, could make chatbots and synthetic speech more human-like.
- Time series analysis: Beyond speech and text, RNNs are particularly well suited to sequence analysis, such as analyzing financial data or employee productivity over time.
Stay up-to-date on the latest AI news by subscribing to Rudina’s AI Atlas.