simons blog

Calculating the Fibonacci numbers on GPU

In this blogpost we will show how to calculate the Fibonacci sequence very quickly on the GPU. We will employ Thrust, an NVIDIA library that uses concepts from modern C++ to make GPU programming easy.

Introduction

Scan is one of the fundamental examples of a parallelizable algorithm. If you are interested in the foundations of the algorithm, I refer you to a previous blogpost where I implemented scan in CUDA.

The scan is the operation which transforms

x = [x_1, ..., x_N]

to

y = [y_1, ..., y_N].

We call the scan inclusive iff

y_i = x_0 ⊕ x_1 ⊕ ... ⊕ x_i

and exclusive iff

y_i = x_0 ⊕ x_1 ⊕ ... ⊕ x_{i-1}

where ⊕ is an associative binary operator and x_0 is a given initial element.
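
For example, with ⊕ = +, x = [1, 2, 3] and initial element x_0 = 0, the inclusive scan yields y = [1, 3, 6] and the exclusive scan yields y = [0, 1, 3].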

Simple exclusive scan in Thrust

This is how to perform an exclusive scan using Thrust.

  thrust::device_vector<int> in{1,2,3};
  thrust::device_vector<int> out(3);

By default Thrust applies the plus operator. Here we use 0 as the initial element x_0.

  thrust::exclusive_scan(thrust::device,
    in.begin(),
    in.end(),
    out.begin(),
    0
    );

This will output

out: 0;1;3;
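
For reference, here is a minimal, self-contained sketch of this first example (the host copy and the printing loop are my own additions; it assumes a working nvcc/Thrust setup):

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/scan.h>
#include <thrust/execution_policy.h>
#include <iostream>

int main() {
  thrust::device_vector<int> in{1, 2, 3};
  thrust::device_vector<int> out(3);

  // Exclusive scan with the default plus operator and 0 as the initial element.
  thrust::exclusive_scan(thrust::device,
    in.begin(),
    in.end(),
    out.begin(),
    0
    );

  // Copy back to the host and print: out: 0;1;3;
  thrust::host_vector<int> h_out = out;
  std::cout << "out: ";
  for (int v : h_out) std::cout << v << ";";
  std::cout << std::endl;
  return 0;
}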

We can similarly apply the multiply operator:

  thrust::exclusive_scan(thrust::device,
    in.begin(),
    in.end(),
    out.begin(),
    1,
    [=]__host__ __device__(int cur, int prev) { return cur * prev; }
    );

where we use a simple lambda function to define the binary operation. This will output

out: 1;1;2;
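
A practical note: since the binary operator is a __host__ __device__ lambda defined in host code, compiling this with nvcc typically requires the --extended-lambda flag (--expt-extended-lambda on older CUDA versions).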

Matrix scan operation

Thrust is very flexible, and matrix multiplication is associative. This allows us to perform a scan over matrices!

// 2 x 2 Matrix
// a00, a01, a10, a11
using Matrix = thrust::tuple<int, int, int, int>;
...
Matrix scale_two{thrust::make_tuple(2, 0, 0, 2)};
Matrix identity_matrix{thrust::make_tuple(1, 0, 0, 1)};
...
thrust::exclusive_scan(thrust::device,
    in.begin(),
    in.end(),
    out.begin(),
    identity_matrix,
    matmul_lambda
    );
...

This will output

out:
[1, 0, 0, 1]
[2, 0, 0, 2]
[4, 0, 0, 4]
[8, 0, 0, 8]
[16, 0, 0, 16]

At each step our scan operation is simply a multiplication by the diagonal matrix with 2 on the diagonal, so the result at step n is the diagonal matrix with 2^n on the diagonal. It is easy to verify on paper that this is what we expect.

The matmul can be easily defined as a lambda as follows:

  // Multiply the current matrix from the left onto the previous one as the scan op.
  auto matmul_lambda = [=]__host__ __device__(const Matrix &x, const Matrix &y) {
    auto a00 = thrust::get<0>(x); auto a01 = thrust::get<1>(x);
    auto a10 = thrust::get<2>(x); auto a11 = thrust::get<3>(x);
    auto b00 = thrust::get<0>(y); auto b01 = thrust::get<1>(y);
    auto b10 = thrust::get<2>(y); auto b11 = thrust::get<3>(y);

    auto c00 = a00 * b00 + a01 * b10;
    auto c01 = a00 * b01 + a01 * b11;
    auto c10 = a10 * b00 + a11 * b10;
    auto c11 = a10 * b01 + a11 * b11;

    return thrust::make_tuple(
      c00,
      c01,
      c10,
      c11
    );
  };
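
Putting these pieces together, a self-contained version of the powers-of-two example could look roughly as follows (the choice of N = 5 and the printing loop are my own additions; compile with the extended-lambda flag mentioned above):

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/scan.h>
#include <thrust/execution_policy.h>
#include <thrust/tuple.h>
#include <iostream>

// 2 x 2 Matrix
// a00, a01, a10, a11
using Matrix = thrust::tuple<int, int, int, int>;

int main() {
  const int N = 5;
  Matrix scale_two{thrust::make_tuple(2, 0, 0, 2)};
  Matrix identity_matrix{thrust::make_tuple(1, 0, 0, 1)};

  // N copies of the "times two" matrix as input.
  thrust::device_vector<Matrix> in(N, scale_two);
  thrust::device_vector<Matrix> out(N);

  // 2 x 2 matrix multiplication as the scan operator.
  auto matmul_lambda = [=] __host__ __device__(const Matrix &x, const Matrix &y) {
    auto a00 = thrust::get<0>(x); auto a01 = thrust::get<1>(x);
    auto a10 = thrust::get<2>(x); auto a11 = thrust::get<3>(x);
    auto b00 = thrust::get<0>(y); auto b01 = thrust::get<1>(y);
    auto b10 = thrust::get<2>(y); auto b11 = thrust::get<3>(y);
    return thrust::make_tuple(a00 * b00 + a01 * b10, a00 * b01 + a01 * b11,
                              a10 * b00 + a11 * b10, a10 * b01 + a11 * b11);
  };

  thrust::exclusive_scan(thrust::device,
    in.begin(),
    in.end(),
    out.begin(),
    identity_matrix,
    matmul_lambda
    );

  // Copy back and print one matrix per line: [1, 0, 0, 1], [2, 0, 0, 2], ...
  thrust::host_vector<Matrix> h_out = out;
  std::cout << "out:" << std::endl;
  for (const Matrix &m : h_out) {
    std::cout << "[" << thrust::get<0>(m) << ", " << thrust::get<1>(m) << ", "
              << thrust::get<2>(m) << ", " << thrust::get<3>(m) << "]" << std::endl;
  }
  return 0;
}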

Fibonacci in terms of matrices

We are now ready to calculate the Fibonacci numbers on GPU. Note the following identity, which you can convince yourself of by calculating a few powers on a piece of paper:

Q^n = [[1, 1], [1, 0]]^n = [[F_{n+1}, F_n], [F_n, F_{n-1}]]

where F_n denotes the n-th Fibonacci number (F_0 = 0, F_1 = 1).

That means we can calculate F_n by performing the scan with the matrix Q and obtain the Fibonacci number by taking one of the antidiagonal entries.

const int N = 32;
...
Matrix fibonacci{thrust::make_tuple(1, 1, 1, 0)};
Matrix identity_matrix{thrust::make_tuple(1, 0, 0, 1)};
thrust::device_vector<Matrix> in(N, fibonacci);
thrust::device_vector<Matrix> out(N);
...
thrust::exclusive_scan(thrust::device,
    in.begin(),
    in.end(),
    out.begin(),
    identity_matrix,
    matmul_lambda
    );
...
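
The final "..." leaves out the output step. One way to print the numbers is to copy the scanned matrices back to the host and read off the top-right entry, thrust::get<1>, which by the identity above is exactly F_n. A possible continuation of the snippet (my own sketch, reusing Matrix and the includes from the example above):

  // Copy the scanned matrices back and print F_0, F_1, ..., F_{N-1}.
  thrust::host_vector<Matrix> h_out = out;
  std::cout << "out:" << std::endl;
  for (const Matrix &m : h_out) {
    // The top-right entry of Q^n is F_n.
    std::cout << thrust::get<1>(m) << ";";
  }
  std::cout << std::endl;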

This will output

out:
0;1;1;2;3;5;8;13;21;34;55;89;144;233;377;610;987;1597;2584;4181;6765;10946;17711;28657;46368;75025;121393;196418;317811;514229;832040;1346269;

which are indeed the correct results, as we can verify, for example, using Wolfram Alpha.

Going large

To calculate extremely large Fibonacci numbers we need to avoid integer overflow. A simple way to do that is to take everything modulo a given number inside the matrix multiplication. We can then verify, using Wolfram Alpha, that we obtain the correct result modulo this number. Using the power of the GPU we are able to calculate F_99999999 in only 17 milliseconds on my consumer GPU, an NVIDIA GeForce RTX 3060 Mobile. You can verify the correctness yourself by checking out my code on GitHub. I obtain as a result

Last element of out (F(99999999) mod 9837):
7558

which can again be verified using Wolfram Alpha.
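
For completeness, here is what the modular scan operator might look like: a minimal sketch where every entry of the product is reduced modulo the chosen number (the modulus 9837 matches the result above; the lambda name is my own). It is passed to thrust::exclusive_scan in place of matmul_lambda.

  const int MOD = 9837;

  // Same 2 x 2 matrix multiplication as before, but every entry is reduced
  // mod MOD so the intermediate values never overflow.
  auto matmul_mod_lambda = [=] __host__ __device__(const Matrix &x, const Matrix &y) {
    auto a00 = thrust::get<0>(x); auto a01 = thrust::get<1>(x);
    auto a10 = thrust::get<2>(x); auto a11 = thrust::get<3>(x);
    auto b00 = thrust::get<0>(y); auto b01 = thrust::get<1>(y);
    auto b10 = thrust::get<2>(y); auto b11 = thrust::get<3>(y);

    return thrust::make_tuple((a00 * b00 + a01 * b10) % MOD,
                              (a00 * b01 + a01 * b11) % MOD,
                              (a10 * b00 + a11 * b10) % MOD,
                              (a10 * b01 + a11 * b11) % MOD);
  };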

Conclusion

I hope this blogpost illustrated what a powerful operation Scan is and how easy it is to implement quite complicated operations using Thrust. The idea of calculating the Fibonacci numbers using Scan comes from an exercise given by Guy Blelloch. I suggest checking this paper out for a deeper dive into the Scan operation.