
An introduction to Thrust

Thrust is the C++ parallel algorithms library that inspired the introduction of parallel algorithms into the C++ Standard Library. It offers an abstraction layer to CUDA programmers and a convenient way of prototyping algorithms before implementing the kernels from scratch.
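
To get a feeling for the style, here is a minimal sketch (not part of the segmented sum example below) that fills a vector on the GPU and sums it with a single Thrust call:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>

#include <iostream>

int main() {
  // Fill a device vector with 0, 1, ..., 9 on the GPU
  thrust::device_vector<int> v(10);
  thrust::sequence(v.begin(), v.end());

  // Sum it up on the GPU with a single algorithm call
  int sum = thrust::reduce(v.begin(), v.end());
  std::cout << "sum = " << sum << std::endl; // prints 45
}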

Segmented sum

In this blogpost we will calculate the segmented sum as an example. Imagine you have N measurements per period and P periods. You now want to sum up the measurements within each period.
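
To make the task concrete, a sequential CPU reference could look like the following sketch (the function name segmented_sum_cpu is just for illustration). It assumes, as in the rest of this post, that the measurements are stored contiguously, period after period:

#include <vector>

// sums[p] = sum of the N measurements belonging to period p
std::vector<int> segmented_sum_cpu(int N, int P, const std::vector<int> &v) {
  std::vector<int> sums(P, 0);
  for (int p = 0; p < P; p++) {
    for (int i = 0; i < N; i++) {
      sums[p] += v[p * N + i];
    }
  }
  return sums;
}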

Writing a CUDA kernel that performs this task and generalizes well to arbitrary N and P is not completely trivial. We can therefore use Thrust to prototype such a kernel before implementing it ourselves.

Host code

We can initialize a random vector and copy it to the device in thrust as follows:

  int length_period = 1 << 10;
  int num_periods = 1 << 18;

  // Initialize a random vector with given
  // period length and number of periods on host
  thrust::host_vector<int> v_host(length_period * num_periods);
  thrust::generate(v_host.begin(), v_host.end(), my_rand);

  // Init vector on device and copy over from host
  thrust::device_vector<int> v_device(length_period * num_periods);
  thrust::copy(v_host.begin(), v_host.end(), v_device.begin());

where

// Returns a uniformly distributed random integer in [0, 10]
int my_rand(void) {
  static thrust::default_random_engine rng;
  static thrust::uniform_int_distribution<int> dist(0, 10);

  return dist(rng);
}
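
As a side note, the explicit thrust::copy is not strictly necessary: constructing the device vector directly from the host vector should perform the same host-to-device copy, as in this small sketch:

  // Construct the device vector from the host vector;
  // Thrust performs the host-to-device copy for us
  thrust::device_vector<int> v_device = v_host;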

From here we can call and benchmark the kernel as follows:

  // Call function & measure time
  auto begin = std::chrono::high_resolution_clock::now();

  auto segmented_sums_device = naive_segmented_sum(length_period, num_periods, v_device);

  auto end = std::chrono::high_resolution_clock::now();
  const double seconds = std::chrono::duration<double>(end - begin).count();
  const double gigabytes =
      static_cast<double>(v_device.size() * sizeof(int)) / 1024 / 1024 / 1024;
  const double throughput = gigabytes / seconds;

  std::cout << "Seconds = " << seconds << std::endl;
  std::cout << "Bandwidth = " << throughput << std::endl;

Note that proper benchmarking should include warmup runs and average over several repetitions. Here we skip that to keep the code concise.

Note that it was brought to my attention by Pauleonix that we should use std::chrono::steady_clock instead to measure performance. See here for an answer by the creator of chrono.
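
A minimal sketch of how the timing above could look with std::chrono::steady_clock and one warmup call (still without averaging, to stay short):

  // Warmup call so the measurement below does not include one-time costs
  naive_segmented_sum(length_period, num_periods, v_device);

  auto begin = std::chrono::steady_clock::now();

  auto segmented_sums_device =
      naive_segmented_sum(length_period, num_periods, v_device);

  auto end = std::chrono::steady_clock::now();
  const double seconds = std::chrono::duration<double>(end - begin).count();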

As usual in GPU programming, we then copy the content of our vector back to the host as follows:

  thrust::host_vector<int> segmented_sums_host(num_periods);
  thrust::copy(segmented_sums_device.begin(), segmented_sums_device.end(), segmented_sums_host.begin());

Naive kernel

A naive kernel in Thrust could look as follows.

thrust::device_vector<int>
naive_segmented_sum(int length_period, int num_periods,
                    const thrust::device_vector<int> &v_device) {
  // Segmented sum: Sum over each period.
  thrust::device_vector<int> segmented_sums(num_periods);

  // Take raw pointer to the vector
  const int *v_ptr = thrust::raw_pointer_cast(v_device.data());

  // compute segmented sum
  thrust::tabulate(thrust::device, segmented_sums.begin(), segmented_sums.end(),
                   [=] __host__ __device__(int period_id) {
                     int sum = 0;
                     for (int i = 0; i < length_period; i++) {
                       sum += v_ptr[period_id * length_period + i];
                     }
                     return sum;
                   });

  return segmented_sums;
}

Note that, according to the Thrust docs, tabulate fills the range [first, last) with the value of a function applied to each element’s index. Here, for each period_id, we loop over the elements of the corresponding period, access them via a raw pointer to the vector, and sum them up.
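
To see tabulate in isolation, here is a small sketch that fills a vector with the squares of its indices (like the kernels in this post, it relies on nvcc's extended lambdas):

#include <thrust/device_vector.h>
#include <thrust/tabulate.h>

int main() {
  thrust::device_vector<int> v(5);

  // v[i] = i * i, so v becomes {0, 1, 4, 9, 16}
  thrust::tabulate(v.begin(), v.end(),
                   [] __host__ __device__(int i) { return i * i; });
}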

After looking at the kernel above for a moment, it should be obvious that it will perform poorly. GPU kernels perform best when doing as much work as possible in parallel, and the parallelization above is poor because each thread calculates the sum of its period sequentially. Additionally, the memory accesses period_id * length_period + i are strided: in every loop iteration neighboring threads touch elements that are length_period apart, so the accesses cannot be coalesced.

Indeed, when measuring the performance of the above kernel we only achieve 55 GB/s, which is poor even for my consumer GPU with its maximum bandwidth of 332 GB/s.

Iterators

A simple iterator in C++ could look as follows:

#include <array>
#include <cstdio>
#include <tuple>

struct zip_iterator 
{
  int *a;
  int *b;

  std::tuple<int, int> operator[](int i) 
  {
    return {a[i], b[i]};
  }
};

int main() 
{
  std::array<int, 3> a{ 0, 1, 2 };
  std::array<int, 3> b{ 5, 4, 2 };

  zip_iterator it{a.data(), b.data()};

  std::printf("it[0]: (%d, %d)\n", std::get<0>(it[0]), std::get<1>(it[0])); // prints (0, 5)
  std::printf("it[1]: (%d, %d)\n", std::get<0>(it[1]), std::get<1>(it[1])); // prints (1, 4)
}

The struct zip_iterator overloads the [] operator. When the iterator is accessed with [i], it returns a tuple initialized with a[i] and b[i].

The tuple is not materialized up front; it is only constructed on access, via cheap pointer reads.

Thrust uses such iterators heavily to perform operations efficiently.
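
Thrust ships a zip iterator of its own. A small sketch of how it can be used on plain std::vectors:

#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

#include <cstdio>
#include <vector>

int main() {
  std::vector<int> a{ 0, 1, 2 };
  std::vector<int> b{ 5, 4, 2 };

  // Combine the two sequences into a single iterator over tuples
  auto zip = thrust::make_zip_iterator(thrust::make_tuple(a.begin(), b.begin()));

  auto t0 = *zip;       // (0, 5)
  auto t1 = *(zip + 1); // (1, 4)

  std::printf("(%d, %d)\n", thrust::get<0>(t0), thrust::get<1>(t0));
  std::printf("(%d, %d)\n", thrust::get<0>(t1), thrust::get<1>(t1));
}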

Reduce by key

A better approach than the above is the following:

thrust::device_vector<int>
naive_segmented_sum(int length_period, int num_periods,
                    const thrust::device_vector<int> &v_device) {
  // Segmented sum: Sum over each period.
  thrust::device_vector<int> segmented_sums(num_periods);

  // Calculate period id in advance
  thrust::device_vector<int> period_ids(v_device.size());
  thrust::tabulate(thrust::device, period_ids.begin(), period_ids.end(),
                   [=] __host__ __device__(int i) {
                     return i / length_period;
                   });
  
  // Reduce for each period
  thrust::reduce_by_key(thrust::device,
                        period_ids.begin(), period_ids.end(), // input keys
                        v_device.begin(),                     // input values
                        thrust::make_discard_iterator(),      // output keys
                        segmented_sums.begin());              // output values

  return segmented_sums;
}

This kernel uses reduce_by_key, a Thrust function that reduces consecutive values sharing the same key. We first use tabulate to fill the key vector with the period id of each element.
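
To make the semantics of reduce_by_key concrete, here is a small sketch on toy data: consecutive elements that share the same key are reduced into a single output value.

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

#include <iostream>
#include <vector>

int main() {
  std::vector<int> h_keys{ 0, 0, 0, 1, 1, 2 };
  std::vector<int> h_values{ 1, 2, 3, 4, 5, 6 };

  thrust::device_vector<int> keys(h_keys.begin(), h_keys.end());
  thrust::device_vector<int> values(h_values.begin(), h_values.end());

  thrust::device_vector<int> out_keys(3);
  thrust::device_vector<int> out_sums(3);

  // Sums consecutive values with equal keys:
  // out_keys = {0, 1, 2}, out_sums = {6, 9, 6}
  thrust::reduce_by_key(keys.begin(), keys.end(), values.begin(),
                        out_keys.begin(), out_sums.begin());

  for (int i = 0; i < 3; i++)
    std::cout << out_keys[i] << ": " << out_sums[i] << std::endl;
}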

Note that in our kernel we don't need the keys anymore after the reduction, so we provide a discard_iterator for the output keys, which can be thought of as something like this:

struct wrapper
{
   void operator=(int value) {
      // discard value
   }
};

struct discard_iterator 
{
  wrapper operator[](int i) 
  {
    return {};
  }
};

int main() 
{
  discard_iterator it{};

  it[0] = 10;
  it[1] = 20;
}

Note however that the above approach has a weakness. It allocates another vector of the same size as the input (period_ids) and writes all the keys out to global memory. This is potentially expensive because memory traffic is often what limits performance on a GPU.

Therefore the performance gain is modest: we measure 67 GB/s on my GPU.

Use iterator for key

The kernel below avoids the extra memory traffic for calculating the keys mentioned above:

thrust::device_vector<int>
naive_segmented_sum(int length_period, int num_periods,
                    const thrust::device_vector<int> &v_device) {
  // Segmented sum: Sum over each period.
  thrust::device_vector<int> segmented_sums(num_periods);

  // Calculate period id in advance
  auto count_it = thrust::make_counting_iterator(0);
  auto period_id_it = thrust::make_transform_iterator(
    // iterator to the beginning of the input sequence
    count_it, 
    // capture constant in the lambda by value with `[name]`
    [length_period]__host__ __device__(int i) { 
      // transformation of each element
      return i / length_period; 
    });
  
  // Reduce for each period
  thrust::reduce_by_key(thrust::device,
                        period_id_it, period_id_it + v_device.size(), // input keys
                        v_device.begin(),                             // input values
                        thrust::make_discard_iterator(),              // output keys
                        segmented_sums.begin());                      // output values

  return segmented_sums;
}

Instead of calculating the keys and writing them to an array, we construct a fancy iterator that produces them on the fly. This avoids the additional memory traffic!

We first create a counting_iterator. According to the Thrust docs, counting_iterator is an iterator which represents a pointer into a range of sequentially changing values.

We can imagine it as follows:

#include <iostream>

struct count_iterator 
{
  int operator[](int i) 
  {
    return i;
  }
};

int main() 
{
  count_iterator it{};

  std::cout << it[0] << ";" << it[1] << std::endl;
}

Note that we don't need to allocate any memory to obtain an iterator that just counts up!

We then pass this iterator to make_transform_iterator. This returns an iterator that simply applies the lambda function

[length_period] __host__ __device__ (int i) {
  // transformation of each element
  return i / length_period;
}

to each element of the given iterator (in our case the trivial counting iterator).
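
In isolation, the combination of the two iterators behaves roughly like in this sketch (using a period length of 4 for illustration; again it relies on nvcc's extended lambdas):

#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

#include <iostream>

int main() {
  const int length_period = 4;

  auto count_it = thrust::make_counting_iterator(0);
  auto period_id_it = thrust::make_transform_iterator(
      count_it,
      [length_period] __host__ __device__(int i) { return i / length_period; });

  // Prints 0 0 0 0 1 1 1 1 -- no memory is allocated for the keys
  for (int i = 0; i < 8; i++)
    std::cout << *(period_id_it + i) << " ";
  std::cout << std::endl;
}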

We then pass this iterator to reduce_by_key and therefore avoid the expensive memory traffic.

This kernel gives us 226 GB/s. This is not overwhelming but pretty good considering we only wrote a few lines of code.

It has the additional advantage that it is portable, maintainable and general. I imagine the above kernel doesn't achieve peak performance because we don't apply techniques like thread coarsening, which are often important for reaching peak performance in memory-bound kernels. I chose however to stop at this point because this blogpost should just be an introduction to Thrust.

Conclusion

We saw that Thrust offers us a way to easily prototype CUDA kernels. To go deeper into the topic I recommend this excellent lecture by NVIDIA.