Mistral Models Explained: Performance, Innovation, and Open Source – Deeper Insights

Have you been hearing a lot about Mistral and their models? If you want to know more about Mistral, what the fuss is about, and how to use them, then we’ve got you covered.

We will give a brief overview of Mistral, what their models are, and touch on what technical advances they are using.

What is Mistral?

There has been a lot of buzz in the AI world about Mistral – a France-based company famous for its Large Language Model services and open source models.

It is heralded by many as a European challenger in a GenAI landscape dominated by the US and (to a lesser extent) China. The company’s contributions to the state of the art have been substantial, spanning a top-end OpenAI-style API service as well as open source model releases.

Mistral was valued at $6 billion in its most recent fundraising round.

This extraordinary success is epitomised by their two flagship families of models:

  • Mistral (“Mistral-7B”)
  • Mixtral (“Mixtral-8x7B” and latterly “Mixtral-8x22B”)

Why are people excited?

While a new model seems to be an everyday occurrence in the GenAI world, these two hit differently for three main reasons:

  • They perform really well
  • They are truly open source (released under the very permissive Apache 2.0 licence)
  • They are regularly updated and iterated

On that last point – we are now on the third improved version of Mistral-7B in just nine months (v0.3 adds advanced features like function calling), and Mixtral now has a larger, more powerful sibling in Mixtral-8x22B.
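As a quick taste of that function-calling support, here is a minimal sketch of a tool-augmented request to Mistral’s hosted API. It assumes the OpenAI-style chat completions endpoint and payload shape that Mistral documents; the model name and the `get_weather` tool are illustrative placeholders, so check the current API reference before relying on this:

```python
# Minimal sketch of function calling against Mistral's hosted API.
# Assumes the OpenAI-style /v1/chat/completions endpoint and the
# "mistral-small-latest" model name -- verify both against current docs.
import os
import requests

API_URL = "https://api.mistral.ai/v1/chat/completions"  # assumed endpoint
headers = {"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"}

payload = {
    "model": "mistral-small-latest",  # assumed model name
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool we define ourselves
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

response = requests.post(API_URL, headers=headers, json=payload, timeout=30)
# If the model decides to call the tool, the reply contains a tool_calls
# entry with the function name and JSON arguments for us to execute.
print(response.json()["choices"][0]["message"])
```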

These models cover different use cases:

  • “Mistral” → a lightweight but capable model, suitable for many tasks as a de-facto baseline.
  • “Mixtral” → a much larger but highly efficient SOTA set of models (closing the gap to GPT-4), suitable for pushing open source to its limits.

Why are Mistral’s models so good?

Surprisingly for many, there are empirical scaling laws, investigated by Google DeepMind, OpenAI and others, that predict roughly how capable a model will be given its data size, model size, and training compute. This means that if you have fixed the model size (e.g. 7 billion parameters for Mistral-7B) and set a GPU training budget (not an issue for a company like Mistral), then all that remains is working out your data quantity and quality.
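To make that concrete, here is a toy calculation using the Chinchilla-style scaling law from DeepMind (Hoffmann et al., 2022), which models loss as a function of parameter count and token count. The constants below are the published Chinchilla fits, but treat the numbers as a rough sketch rather than a recipe:

```python
# Toy illustration of a Chinchilla-style scaling law (Hoffmann et al., 2022):
# predicted loss L(N, D) = E + A / N**alpha + B / D**beta,
# where N = parameters and D = training tokens.
# Constants are the published Chinchilla fits; treat results as a rough guide.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

# Fix the model size at 7B (like Mistral-7B) and vary the data budget:
for tokens in (0.3e12, 1e12, 3e12):
    print(f"{tokens/1e12:.1f}T tokens -> loss ~ {predicted_loss(7e9, tokens):.3f}")

# The rule of thumb from the same paper: compute-optimal training uses
# roughly 20 tokens per parameter (about 140B tokens for a 7B model),
# though "over-training" on far more data keeps improving the model.
```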

Mistral have been tight-lipped about the data used to pre-train their models – this is the often overlooked secret sauce for creating high quality models of any size. We can assume that Mistral spends a lot of time and money curating large, high quality datasets – typically by sifting web data down to many billions of tokens (many millions of pages) that have been semi-automatically filtered and checked by subject matter experts.

Mistral have shared some of the other advances they are using in their promotional material and technical reports, but to understand them we’ll have to back up a second and look at what’s going on behind the scenes in LLMs.

How do LLMs work, in layman’s terms?

Firstly, LLMs operate by taking a piece of text, breaking it into smaller chunks called tokens (roughly words, or pieces of words), and then running these tokens through their huge neural network.
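To give a rough feel for this step, here is a deliberately simplified tokeniser sketch. Real models use learned subword schemes such as byte-pair encoding with vocabularies of tens of thousands of entries; the tiny hand-written vocabulary below is purely illustrative:

```python
# Deliberately simplified tokeniser: real models use subword algorithms
# (e.g. byte-pair encoding) learned from data, not a hand-written vocabulary.
TOY_VOCAB = {"rock": 0, "music": 1, "is": 2, "a": 3, "stone": 4, "<unk>": 5}

def tokenize(text: str) -> list[int]:
    # Lowercase, split on whitespace, and map each word to a token id.
    return [TOY_VOCAB.get(word, TOY_VOCAB["<unk>"]) for word in text.lower().split()]

print(tokenize("Rock is music"))    # [0, 2, 1]
print(tokenize("Rock is a stone"))  # [0, 2, 3, 4]
# The model never sees text directly -- only these integer id sequences,
# which index into learned embedding vectors inside the network.
```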

As we all know, the meaning of a word depends on the context it appears in: “Rock” could mean a stone, or it could mean a genre of music. The same applies to LLMs – they need to understand the context of words and how they relate to each other.

This is essentially handled in two ways:

  1. The “Attention Mechanism”, where the LLM compares each word to the other words in the text.
  2. Training many, many parameters that can model complex relationships between words.

The Attention Mechanism ensures that the LLM can understand the true meaning of each word in context, and hence understand the full piece of text. It is crucial to why the Transformer architecture gets such good results, but it is also famously a bottleneck: comparing every token to every other token means the computation grows quadratically with the length of the text. If models don’t understand context properly then predictions go awry, yet paying for that understanding is expensive.
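A minimal sketch of the core computation (scaled dot-product attention, written here in NumPy rather than taken from any particular model) makes the bottleneck visible: the score matrix compares every token with every other token, so its size grows quadratically with the input length.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d) arrays of query/key/value vectors, one per token."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (seq_len, seq_len): every token vs every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V  # context-aware mixture of value vectors

seq_len, d = 6, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (6, 8) -- but the intermediate scores matrix was (6, 6),
                  # i.e. O(n^2) in sequence length, hence the bottleneck
```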

So there are two key issues Mistral is looking to alleviate:

  1. Attention Mechanism (contextual relations between words) is a bottleneck.
  2. LLMs are huge and hence computationally expensive and slow to use.

How exactly do these models overcome these problems?

For point 1 (the Attention Mechanism is a bottleneck), Mistral-7B leverages a Google-pioneered variant of attention called “Grouped-Query Attention” (GQA), in which several query heads share a single set of key/value projections, greatly reducing the computation and memory requirements. GQA is a careful balancing act, since reducing computation can degrade quality. It acts as a compromise between two extremes: Multi-Query Attention (all query heads share one key/value head – roughly 10x more efficient but with degraded performance) and Multi-Head Attention (the canonical version, which is the least efficient but most performant).

Source: https://arxiv.org/pdf/2305.13245#page=2
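Here is a toy NumPy sketch of the GQA idea (not Mistral’s actual implementation): with 8 query heads but only 2 key/value heads, each group of 4 query heads reuses the same key/value projections, shrinking the key/value parameters and cache roughly 4x.

```python
import numpy as np

def grouped_query_attention(x, Wq, Wk, Wv, n_q_heads=8, n_kv_heads=2):
    """Toy single-layer GQA: n_q_heads query heads share n_kv_heads K/V heads."""
    seq_len, d_model = x.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads  # query heads per shared K/V head

    # Project once; Wk/Wv are (d_model, n_kv_heads * d_head) -- smaller than Wq.
    Q = (x @ Wq).reshape(seq_len, n_q_heads, d_head)
    K = (x @ Wk).reshape(seq_len, n_kv_heads, d_head)
    V = (x @ Wv).reshape(seq_len, n_kv_heads, d_head)

    outputs = []
    for h in range(n_q_heads):
        kv = h // group  # which shared K/V head this query head uses
        scores = Q[:, h] @ K[:, kv].T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        outputs.append(w @ V[:, kv])
    return np.concatenate(outputs, axis=-1)  # (seq_len, d_model)

seq_len, d_model, d_head = 4, 64, 8
rng = np.random.default_rng(1)
x = rng.normal(size=(seq_len, d_model))
Wq = rng.normal(size=(d_model, 8 * d_head))  # full set of query projections
Wk = rng.normal(size=(d_model, 2 * d_head))  # only 2 K/V heads: 4x fewer K/V params
Wv = rng.normal(size=(d_model, 2 * d_head))
print(grouped_query_attention(x, Wq, Wk, Wv).shape)  # (4, 64)
```

Setting n_kv_heads=1 recovers Multi-Query Attention, and n_kv_heads=n_q_heads recovers Multi-Head Attention – GQA simply interpolates between the two extremes described above.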

For point 2 (LLMs are huge and slow), Mixtral leverages a Mixture of Experts (MoE) architecture to reduce GPU demands. In this architecture, each layer contains several smaller “expert” sub-networks, and a Router network gates each token to the most relevant expert(s). The end result is that only a subset of these experts is used for each token during inference.

Source: https://developer.nvidia.com/blog/applying-mixture-of-experts-in-llm-architectures/

For example, if there are 8 experts and a token is routed only to Expert 1 and Expert 3, then that token activates just 2/8 = 25% of the expert parameters (this is exactly Mixtral’s setup: 8 experts, with the top 2 chosen per token).

This gives the usual advantages of larger models – many parameters that capture complex meanings – while reducing computation, because the model is sparse: large parts of it are “switched off” for each token.
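Here is a toy sketch of that routing step, mirroring the 8-expert, top-2 setup described above (heavily simplified – real MoE layers batch tokens, add load-balancing losses, and use proper feed-forward experts):

```python
import numpy as np

def moe_layer(token, experts, W_router, top_k=2):
    """Route one token through its top_k of len(experts) expert networks."""
    logits = token @ W_router  # one score per expert
    top = np.argsort(logits)[-top_k:]  # indices of the top_k experts
    gate = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen experts
    # Only the selected experts run -- the other experts' parameters stay idle.
    return sum(g * experts[i](token) for g, i in zip(gate, top))

d = 16
rng = np.random.default_rng(2)
# 8 "experts": each is just a random linear map standing in for a feed-forward block.
weights = [rng.normal(size=(d, d)) for _ in range(8)]
experts = [lambda t, W=W: t @ W for W in weights]
W_router = rng.normal(size=(d, 8))

token = rng.normal(size=d)
out = moe_layer(token, experts, W_router, top_k=2)
print(out.shape)  # (16,) -- computed using only 2 of the 8 experts' parameters
```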

Want to get hands on with Mistral or other AI models?

Using open source models gives access to high quality, state-of-the-art performance. However, leveraging them properly and efficiently is an ongoing task that requires true expertise and experience. We’ve helped dozens of clients use LLMs to achieve their business aims.

Reach out to us if you would like to get started with Mistral and more to unlock the power of your data.