When Quantum Computers Will Be Useful for Quantum NLP

Dr. Christopher Erickson

Large Language Models are powerful tools for solving sequence-generation problems. However, the compute required to train and run inference with such models grows quadratically in the sequence length, which limits their accessibility. Furthermore, the main difference in scale between different LLMs is the number of parameters and layers, which is chiefly a function of the embedding dimension. Quantum computing can address both of these challenges.

In this work we first analyse a quantum linearization of the attention mechanism. We use a specific realization, but our analysis applies to any algorithm which encodes the entire sequence within a single 'shot' of the quantum processing unit. Next, we remark on an empirical study of different putative quantum kernel methods for NISQ devices, and analyse the throughput of current quantum computers and of classical simulators of quantum computers. Finally, we describe a quantum simulator for QML that we developed, compare it to commercially available QPU compute, and give a set of suggestions for quantum computing hardware companies looking to be competitive in this space.

Linear Quantum Attention

Quantum and classical processing are frequently powerful for different sub-tasks of a problem.

The main bottleneck in current classical machine-learning methods is matrix multiplication: in the context of LLMs using the attention mechanism, specifically, it is taking the inner product of two different embeddings of the sequence. The time it takes to perform such an inner product (assuming the values are encoded as the amplitudes of different states) grows only linearly on a quantum computer.
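As a concrete (and purely illustrative) sketch, the snippet below simulates the swap test, a standard circuit for estimating the inner product of two amplitude-encoded states. The swap test here is a stand-in rather than the specific realization we analyse, and the embeddings are made-up numbers; note that amplitude encoding packs an N-dimensional vector into only log2(N) qubits.

import numpy as np

def amplitude_encode(x):
    # Pack a length-N vector into the amplitudes of a log2(N)-qubit state.
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x)

def swap_test_overlap(psi, phi, shots=1000, seed=None):
    # The swap test's ancilla reads 0 with probability (1 + |<psi|phi>|^2)/2,
    # so the squared inner product is recovered from the observed frequency.
    rng = np.random.default_rng(seed)
    p0 = 0.5 * (1.0 + np.dot(psi, phi) ** 2)  # exact ancilla statistics
    zeros = rng.binomial(shots, p0)           # finite-shot sampling noise
    return 2.0 * zeros / shots - 1.0

query = amplitude_encode([0.3, 1.2, -0.5, 0.7])  # toy "query" embedding
key = amplitude_encode([0.1, 1.0, -0.4, 0.9])    # toy "key" embedding
print(swap_test_overlap(query, key), np.dot(query, key) ** 2)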

However, nonlinear operations, such as ReLU, are only possible with 'data re-uploading' on a quantum computer. Essentially, the no-cloning theorem implies that one must encode a state twice to calculate the square of the amplitude of a given state.
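A short numpy illustration of this point, assuming nothing beyond amplitude encoding: the amplitudes of a single encoding are linear in the data, and the only way to obtain products of amplitudes (the monomials a nonlinearity needs) is to prepare additional copies of the encoding.

import numpy as np

psi = np.array([0.6, 0.8])  # one encoding: amplitudes are linear in the data

# "Re-uploading" the data prepares |psi> (x) |psi>, whose amplitudes are the
# pairwise products psi_i * psi_j -- exactly the degree-2 terms a squared
# amplitude requires. No unitary can do this in place: that would clone psi.
psi_sq = np.kron(psi, psi)
print(psi_sq)               # [0.36 0.48 0.48 0.64]

# An order-s nonlinearity needs s copies, multiplying the qubit count by s.
s, psi_s = 3, psi
for _ in range(s - 1):
    psi_s = np.kron(psi_s, psi)   # degree-3 monomials psi_i psi_j psi_k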

One can combine these two facts to analyse the number of qubits and the number of gates required to approximate the attention mechanism on a quantum computer. Here we give our values for a specific realization; however, the trend is universal.

Calculating minimum circuit sizes in qubits for different sequence lengths, nonlinear orders, and embedding dimensions, we found that the order of nonlinearity (s), the sequence length (l), and the embedding dimension (N) are related to the number of qubits (n) by:

For example, the headline qubit numbers are:

To restate, we could run (any layer of) a model significantly larger than GPT-4 on the entire British Library within a single forward call of a 1024-qubit QPU, if it had a logical error rate low enough, and a gate speed fast enough, to sample circuits with a depth of 10^16 gates.

Perhaps more feasibly, a poor approximation of (one layer of) a state-of-the-art LLM may become possible within two years. A good approximation may become possible within four, assuming that the logical error rate is low enough to allow sampling of these circuits.

Quantum Kernel Methods and Throughput

There are nearer-term possibilities for QLLMs, however. While we are in the noisy intermediate-scale quantum (NISQ) regime, we encode every element of our sequence separately and take the inner products using quantum kernel methods, such as the Quantum Kernel Self-Attention Mechanism (or Network, QKSAN).
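To make the shape of such a scheme concrete, here is a classically simulated toy of a quantum kernel over a sequence. The one-qubit angle encoding (feature_map) is our own illustrative choice, not QKSAN's actual circuit; the key point is that each kernel entry corresponds to one circuit run on a QPU.

import numpy as np

def feature_map(x):
    # Toy one-qubit angle encoding: one sequence element -> one small state.
    return np.array([np.cos(x / 2), np.sin(x / 2)])

def quantum_kernel(seq):
    # K[i, j] = |<phi(x_i)|phi(x_j)>|^2. Every entry is a separate circuit
    # (a separate QPU job), so a sequence of length l costs ~l^2 evaluations.
    states = np.stack([feature_map(x) for x in seq])
    return (states @ states.T) ** 2

seq = np.array([0.1, 0.7, 1.3, 2.0])
print(quantum_kernel(seq))   # a 4x4 attention-style kernel matrix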

There are numerous proposals with different data encodings, different parameter encodings, and different inner-product calculations. With all of these methods, however, the chief limitation is throughput (understood as the number of quantum operations, or "QuOps", that can be performed per second). This is an underlying hardware limitation, and the applicability of any quantum machine-learning method is hamstrung by the fact that most modern QPU offerings are less powerful than modern GPU offerings.

A direct comparison between QuOps and FLOPs requires subtlety, since it depends not only on how the simulated quantum state is stored, but also on the data within the state. For the quantum kernels, we analysed the relative QPU and GPU resources available to us in several different cases and concluded that, at the time of writing, GPUs are simply faster.

Modern state-of-the-art GPUs have ~0.5 TB of VRAM and can sustain ~100 TFLOPS (for 64-bit floats, which is a higher precision than is normally used in LLMs, but is required for simulating QPUs). Modern GPU clusters can contain ~1000 state-of-the-art GPUs, but the quantum simulators currently on the market are optimized for at most ~10 GPUs.

A state-vector representation of a quantum state takes 2^(n+1) floats, where n is the number of qubits. Every operator applies in parallel to the entire state, and therefore requires ~2^(n+3) FLOP for one-qubit operations and ~2^(n+5) FLOP for two-qubit operations.
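The sketch below makes that cost model concrete: a minimal (unoptimized) numpy state-vector step that applies a single one-qubit gate across all 2^n amplitudes at once.

import numpy as np

def apply_one_qubit_gate(state, gate, target, n):
    # `state` holds 2^n complex amplitudes (2^(n+1) floats at 64-bit precision).
    # A 2x2 gate touches every amplitude once, hence the ~2^(n+3) FLOP per gate.
    psi = state.reshape([2] * n)
    psi = np.moveaxis(psi, target, 0)               # expose the target qubit
    psi = np.tensordot(gate, psi, axes=([1], [0]))  # batched 2x2 matmul
    psi = np.moveaxis(psi, 0, target)
    return psi.reshape(-1)

n = 10
state = np.zeros(2 ** n, dtype=complex)
state[0] = 1.0                                      # |00...0>
hadamard = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
state = apply_one_qubit_gate(state, hadamard, target=3, n=n)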

We can store a state of log2(10^12) − 2 ≈ 37 qubits in a full state-vector representation on a 0.5 TB GPU. For our circuits we targeted a 32-qubit QPU, for which every one-qubit and two-qubit operation takes ~0.1 TFLOP and ~0.4 TFLOP respectively. So a circuit with 1000 one-qubit gates and 500 two-qubit gates requires (0.1×1000 + 0.4×500) TFLOP / 100 TFLOPS ≈ 3 seconds to simulate at full precision.
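For the record, the arithmetic, using the figures just quoted:

# 32-qubit circuit, FP64 state-vector simulation on one ~100 TFLOPS GPU.
one_q_tflop, two_q_tflop = 0.1, 0.4          # TFLOP per gate at n = 32
gates_1q, gates_2q = 1000, 500
gpu_tflops = 100.0

total_tflop = one_q_tflop * gates_1q + two_q_tflop * gates_2q
print(total_tflop / gpu_tflops, "seconds")   # -> 3.0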

Current quantum hardware can perform ~10^7 gates/second, so a QPU should take only ~10^-4 seconds to complete the same calculation. Therefore, if you compare a single shot of a 32-qubit QPU to the processing power of a current GPU using this technique, the QPU wins by several orders of magnitude. However, this ignores the fact that QPUs must repeat the same circuit many times, since a QPU only outputs n bits of information per shot. The process of repeating shots to better approximate the pre-collapse quantum state is called tomography, and the circuits we require for our QLLM need roughly 1000 shots, meaning that the QPU is only ~10^1 times faster.

There are practical constraints on this as well. Current QPUs also have an additional overhead between shots, a cooldown time of ~10^-3 seconds, which reduces their speed advantage to roughly parity with a state-vector simulation.
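Putting the last two paragraphs' numbers together:

gate_rate = 1e7      # gates/second on current quantum hardware
gates = 1500         # the same 1000 + 500 gate circuit as above
shots = 1000         # tomography repetitions
cooldown = 1e-3      # seconds of reset overhead between shots

qpu_seconds = shots * (gates / gate_rate + cooldown)
print(qpu_seconds, "seconds")   # ~1.15 s, vs ~3 s for the GPU simulation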

The overhead necessary to perform inference on an LLM (that is, the number of 'jobs' we require with the QKSAN mechanism) goes like the square of the context length of the LLM. ChatGPT, for example, has a context length of ~10^5 and would require ~10^13 shots, which is ~300 years per token on a single modern QPU!
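The 300-year figure follows directly:

context_len = 1e5                        # ~ChatGPT-scale context window
jobs = context_len ** 2                  # QKSAN kernel entries per token
total_shots = jobs * 1000                # 1000 shots per entry -> 1e13

seconds = total_shots * 1e-3             # dominated by the per-shot cooldown
print(seconds / (3600 * 24 * 365), "years per token")   # ~317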

Our Quantum Simulator

To train our QLLM, we therefore required several simulator improvements.

First, we implemented vectorization of a tensor-network simulator so that we could simulate millions of circuits simultaneously. In a realistic case, we were able to simulate ~10^6 circuits in parallel in ~10^3 seconds on a single 0.5 TB GPU. For the shallow circuits relevant to QLLMs, our simulator was therefore ~10^3 times faster than the available QPU compute. Second, we implemented a novel method for calculating gradients in a constant multiple of the time and memory of inference.
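Our production simulator is tensor-network based, but the batching idea can be sketched with a plain state-vector example: give every circuit its own leading batch axis, so a single einsum advances the whole batch in one GPU-shaped tensor operation. The shapes and gate choice below are illustrative only.

import numpy as np

def batched_one_qubit_gate(states, gates, target, n):
    # states: (batch, 2^n) -- one state vector per circuit in the batch.
    # gates:  (batch, 2, 2) -- a (possibly different) gate for each circuit.
    batch = states.shape[0]
    psi = states.reshape((batch,) + (2,) * n)
    psi = np.moveaxis(psi, 1 + target, 1)          # target qubit to the front
    psi = psi.reshape(batch, 2, -1)                # (batch, 2, 2^(n-1))
    psi = np.einsum('bij,bjk->bik', gates, psi)    # one call, whole batch
    psi = psi.reshape((batch,) + (2,) * n)
    psi = np.moveaxis(psi, 1, 1 + target)
    return psi.reshape(batch, -1)

n, batch = 8, 10_000                               # 10k shallow circuits
states = np.zeros((batch, 2 ** n), dtype=complex)
states[:, 0] = 1.0
thetas = np.random.uniform(0, np.pi, size=batch)   # per-circuit parameters
gates = np.zeros((batch, 2, 2), dtype=complex)
gates[:, 0, 0] = np.cos(thetas / 2)                # a batch of RY rotations
gates[:, 0, 1] = -np.sin(thetas / 2)
gates[:, 1, 0] = np.sin(thetas / 2)
gates[:, 1, 1] = np.cos(thetas / 2)
states = batched_one_qubit_gate(states, gates, target=0, n=n)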

Further hardware advancements in QPUs could bring this down to a speed advantage of ~10^2 (simulator over QPU), but further software advancements to the simulator could bring it back up to ~10^4 against the improved hardware.

Quantum computing is therefore currently in a transitional period: the advantage of GPUs over QPUs is chipped away by a factor of two with every new qubit available in QPUs. At the moment, however, small QPUs are significantly slower than our simulator. Most of the 'killer apps' of quantum computers require highly error-corrected systems, and most roadmaps list improvements in the error correction of their logical qubits. For real-world applications of QML methods, however, more, faster qubits with less error correction are preferable.

Important considerations for QPU roadmaps

Our analysis is highly sensitive to the size and latency of QPUs. While a 32-qubit machine with a throughput of 10^6 shots/second is roughly equivalent to a state-of-the-art GPU, a 33-qubit machine is worth two such GPUs, a 34-qubit machine four, a 35-qubit machine eight, and so on: an extra factor of two is compounded with each additional qubit.
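A two-line check of the compounding, using the cost model from the previous section (state size 2^(n+1) floats and per-gate cost 2^(n+3) FLOP, both doubling with each qubit, while a QPU shot takes the same time regardless of n):

base_qubits = 32                # roughly one state-of-the-art GPU, per above
for n in range(33, 37):
    print(n, "qubits ~", 2 ** (n - base_qubits), "GPUs")   # 2, 4, 8, 16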

This is the basis of Neven's law, which states that the growth of available quantum compute is double-exponential in time. This can be restated as: the growth in the number of logical qubits is at least exponential in time (Hartnett, 2019). Our simulator is therefore, at best, a transitional tool; only modest growth in quantum computers is required before it stops being useful. To run the Linear Quantum Attention mechanism, and quantum AI solutions more generally, there are several critical things which should be added to quantum computing companies' roadmaps:

If QPUs can achieve the latter two of these benchmarks, they will outperform GPUs at inference time when running the same models! This is the power of quantum operations: they are so inherently parallel that a QPU operating at much lower frequencies can be competitive with GPUs running at much higher clock rates.

This depth of analysis wouldn't have been possible without OQC (Oxford Quantum Circuits). Our team would like to extend a big "thank you" to OQC for being so open with us about their QPU stack; it is worth calling out that this openness was significantly greater than we received from anyone else in the industry. With this collaborative approach, we've been able to bring them along on our journey as we built the world's first Quantum LLM.

Relevant references:

Hartnett, K. (2019). "Does Neven's Law Describe Quantum Computing's Rise?", Quanta Magazine, 18 June 2019.

