GPU Tutorial: PFB

From Casper





Tutorial 5: Heterogeneous Instrumentation Tutorial

Authors: Michael Kandrashoff, Jayanth Chennamangalam (Version 1)

Expected Completion Time: 4 hours



Audience

This tutorial is intended for those who are familiar with CASPER tools, understand CASPER-based design at the level of Tutorial 2, and are familiar with basic CUDA programming in C.

Introduction

In this tutorial, you will learn how to build an astronomical signal-processing instrument - specifically, a polyphase filtering spectrometer - using both FPGA and Graphics Processing Unit (GPU) technologies. A typical heterogeneous instrument consists of three parts - an ADC-FPGA system (an <TODO: ADC> and a ROACH, in this case), a PC-based data-recording mechanism, and a backend processing system that uses a GPU. Likewise, this tutorial is divided into three sections. In the first section (FPGA Design), you will build an FPGA design using Simulink and CASPER tools, in the second section (Data Acquisition), you will learn how to capture data on a 10GbE link and record it on a PC, and in the third section (Polyphase Filter Bank Spectrometer), you will write CUDA kernels that will perform polyphase filtering and spectrometry.

Setup

If you are working on this tutorial at a CASPER Workshop, you should have received separate network connection and log-in instructions. If you are working through this tutorial outside of a workshop, you will need to install and configure the necessary Matlab and Xilinx software tools as well as the most current version of the CASPER libraries. You will also need an NVIDIA CUDA-capable GPU with compute capability 2.x and CUDA Toolkit 4.0.

FPGA Design

For this tutorial, you will be using a Simulink design built around a 10 Gb Ethernet Block. Theoretically, you should be able to build such a design after the completion of Tutorial 2. However, for the sake of time and convenience, an mdl file and a bof file are included for a packet transmitter designed by Aaron Parsons. Here is an outline of its components:

[Figure: Simulink model of the packet transmitter (Pack Model.png)]

ten_GbE: The 10Gb Ethernet Block. For details, go to Tutorial 2 or the 10GbE info page (linked above).
tx_dest_ip/port: Registers holding the desired fabric IP/port. These are best configured using KATCP.
bram_data_msb/lsb: BRAMs holding the data you wish to transmit. Each BRAM holds 32 bits of data.
tx_enable: Register holding a boolean value that determines whether the ten_GbE block is transmitting or not.
bram_len: Determines how many BRAM words to transmit (i.e., how large the packet is).
tx_efficiency_32_7: This register takes an unsigned 7-bit number specifying, in 1/128ths of the clock rate, the rate at which BRAM words are emptied into the ten_GbE block. It therefore determines the output data rate of the FPGA (e.g., if the clock rate is 200 MHz and tx_efficiency_32_7 is set to 89, the data rate is 64 bits × 0.2 GHz × 89/128 ≈ 8.9 Gb/s, or roughly 9 Gb/s).

Data Acquisition

Now you will bridge the gap between the FPGA and the CPU. First, you have to configure the FPGA to transmit data. Using the design outlined in the last section, set the tx_efficiency_32_7 register to 89 using write_int within KATCP, and set bram_len to around 1000. This will send 8 kB packets of data to the CPU at a rate of about 9 Gb/s.

For actual astronomy, the most efficient framework for handling the data sent by the FPGA is a single program with several subroutines: one that carries the data to the CPU, one that copies the data into the processing unit, and one that takes the data out of the processing unit and either writes it to disk or transmits it to another computer over a network, with each routine running on a separate thread, if not a separate processor. For this tutorial, you will use a receive program written in C, with two threads running simultaneously while a host thread starts them, stops them, and cleans up after all of the data are gathered and stored. You will store these data in five 2 GB files labelled "file0001" through "file0005". The program is here; the rest of this section highlights attributes of the program that should be included in any real-time, high-throughput data-gathering application.

  1. Ring Buffer
    The ring buffer is critical if you don't wish to lose packets within your data-handling routine. It serves as a holding area for data in between processes. Its brilliance (not to mention its name) comes from the fact that it doesn't have an "end": it can be written to and read from continuously for indefinite periods of time, because chunks of data that have already been read are allowed to be overwritten. The problem at hand is how to synchronise the reads and writes so that chunks are neither read twice nor overwritten before they can be read. This can be done easily in C using semaphores. A POSIX semaphore (declared in semaphore.h and usable from any pthread) is an adjustable counter that every thread can access and modify. When a thread calls sem_wait() on a semaphore whose count is zero, it blocks until another thread calls sem_post() to free it. In the C code you will be using, there are two semaphores, one for the write thread and one for the read thread. When the write thread writes a chunk into the buffer, it lowers its own semaphore while raising the semaphore of the read thread, allowing it to read another block of data. The buffer itself is an array of pointers, each pointing to a chunk of data. While this may not be the quickest or most efficient form of a ring buffer, it is very illustrative for the purposes of this tutorial.
  2. Write Thread
    The write thread takes the data from the network interface card and copies it into the ring buffer.
  3. Read Thread
    The read thread creates the output files and writes data from the ring buffer into them.

Compiling and Running the Code
This code is easy to compile and run: download the file and compile it with your favourite C compiler, making sure to link the pthread library (the -lpthread flag in gcc, for example). Then just execute it and wait for the files to be written.

Polyphase Filter Bank Spectrometer

At this point, you will have a set of files, named "file0001" through "file0005", that contain time-domain data. The next step is to do the actual spectrometry.

A spectrometer is a mechanism that converts time domain data to the frequency domain, outputting spectra at regular intervals of time. A simple spectrometer is made up of a Discrete Fourier Transform (DFT) -- usually, a Fast Fourier Transform (FFT) -- operation performed on the time series data, resulting in output spectra that may or may not be added up (accumulated) for a given time duration, to increase the signal-to-noise ratio. Such a basic spectrometer suffers from a couple of inherent drawbacks of the DFT, namely, spectral leakage and scalloping loss. The Polyphase Filter Bank (PFB) is a signal-processing technique that reduces these drawbacks.

In a nutshell, what a PFB spectrometer does is this: Instead of taking an N-point transform directly, a block of data of size N x P = M is read, and multiplied point-by-point with a window function (in other words, the data is 'weighted'). The shape of the window function determines the shape of the single-bin frequency response. Since we wish the single-bin frequency response to resemble a rectangular function as much as possible, we choose its Fourier Transform pair, the sinc function, as our window function. Once the multiplication is done, the block of data is split into P subsets of length N each, and added point-by-point. This array is then passed to a regular DFT routine to get an N-point transform that exhibits less leakage. This is graphically depicted in the figure below. For a more detailed background on the PFB technique, see the PFB memo.

Polyphase filtering
Graphical depiction of polyphase filtering. x(i) is a time series of length M = 1024 samples, multiplied point-by-point with the window function w(i) (a sinc function), also of the same length. The product is split into P = 4 blocks of length N = 256 samples each, and summed. This summed array of length N = 256 samples, shown at the bottom, on the right, is then input to a routine that takes a 256-point Fourier Transform. (Image courtesy: Dale Gary)

In this section of the tutorial, you will be given a skeleton CUDA/C code that reads data from the input files, and plots the data from an output memory buffer. The objective of this section of the tutorial is to write CUDA kernel-related code, namely, kernel launch parameter calculation, and the actual kernels.

The basic code flow is depicted below:


                                                                     Initialisation
                           +------------------+                      +--------------------------------------------+
                           |                  |                      |1. Get device properties                    |
                           |  Initialisation  |                      |2. Calculate kernel launch parameters       |
                           |                  |                      |3. Load filter coefficients to device memory|
                           +---------+--------+                      |4. Create FFT plan                          |
                                     |                               +--------------------------------------------+
             +-----------------------+
             |                       |
             |          +------------+------------+
             |          |                         |
             |          |  Copy time series data  |
             |          |   from host to device   |
             |          |                         |
             |          +------------+------------+
             |                       |
             |                       |
             |           +-----------+-----------+
             |           |                       |
             |           |       Perform         |
             |           |  polyphase filtering  |
             |           |                       |
             |           +-----------+-----------+
             |                       |
             |                       |
             |            +----------+----------+
             |            |                     |
             |            |      Perform        |
             |            |  Fourier Transform  |
             |            |                     |
             |            +----------+----------+
             |                       |
             |                       |
             |           +-----------+-----------+
             |           |                       |
             |           |  Accumulate spectrum  |
             |           |                       |
             |           +-----------+-----------+
             |                       |
             |                       |
             |        +--------------+--------------+
             |        |                             |
             |        |  Copy accumulated spectrum  |
              |        |     from device to host     |
             |        |                             |
             |        +--------------+--------------+
             |                       |
             |                       |
             +-----------------------+


Directory Structure

Download the problem set and the solution set (given for reference).

Search for the string 'Task <X>' (where <X> = A, B, and so on, as given below) in the problem set source files to see the tasks you need to do (computing kernel launch parameters, writing the CUDA kernel code for PFB, FFT and accumulation, etc.).

   bin/             : Directory in which the output binaries will be created
   tut5_fileread.cu : Data-file-reading routines
   tut5_fileread.h  : Header for data-file-reading routines
   tut5_kernels.cu  : CUDA kernels
   tut5_kernels.h   : Header for CUDA kernels
   tut5_main.cu     : Top-level file
   tut5_main.h      : Top-level header file
   tut5_plot.cu     : Plotting routines
   tut5_plot.h      : Header for plotting routines
   Makefile         : The makefile
   tut5_gencoeff.py : Python script to generate filter coefficients

Tasks

Task A. Initialisation

File: tut5_main.cu

The initialisation routine Init() performs all start-up related operations. In addition to allocation of memory used by the various data arrays, this routine does the following operations that you will code:

  1. Get device properties
    cudaError_t cudaGetDeviceProperties(struct cudaDeviceProp* prop, int device)
    This function populates the fields in the structure variable prop with device properties. The property we are interested in is an integer field named maxThreadsPerBlock, which contains the maximum number of threads a threadblock can have.
  2. Calculate kernel launch parameters
    Kernel launch parameters are usually CUDA structures of type dim3 that tell the GPU the number of threads per block and the number of blocks per grid to use when launching the kernel. It also specifies the layout of the block and grid. For example, here is how you create a MxNxP threadblock:
    dim3 dimBlock;
    ...
    dimBlock.x = M;
    dimBlock.y = N;
    dimBlock.z = P;
    For this spectrometer, the simplest layout is a one-dimensional threadblock, having a length equal to the number of points in the transform, N. If the number of points in the transform is more than cudaDeviceProp::maxThreadsPerBlock, then we need to set the length to the maximum allowed value, and use multiple such blocks, as shown below:
    dimBlock.x = prop.maxThreadsPerBlock;
    dimBlock.y = 1;
    dimBlock.z = 1;
    dimGrid.x = N / dimBlock.x;
    dimGrid.y = 1;
    cudaDeviceProp::maxThreadsPerBlock is, in practice, a power of 2. The assumption in this piece of code is that the number of points in the transform, N, is also a power of 2, and that N ≥ cudaDeviceProp::maxThreadsPerBlock.
  3. Load filter coefficients to device memory
    cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
    Loading the filter coefficients to device memory is a regular host-to-device memory copy operation, and just involves an invocation of this function, with kind being cudaMemcpyHostToDevice.
  4. Create FFT plan
    cufftResult cufftPlanMany(cufftHandle *plan, int rank, int *n, int *inembed, int istride, int idist, int *onembed, int ostride, int odist, cufftType type, int batch)
    This function -- a Beta feature of the CUFFT 4.0 library -- is used to create an FFT plan that enables multiple Fourier Transforms to be performed simultaneously. A brief description of some of the arguments is given below.
    • rank is the dimension of the transform
    • n is the number of points in the FFT
    • inembed is the storage dimension of the input data, which, in this case, is equal to the number of points in the FFT
    • istride is the distance between two successive input elements. In our case, the data is packed thus:
      [Figure: input data packing format (Gpu tutorial input data format.gif)]
    Here, S1 to Sn are n sub-bands, whose data is packed in a completely interleaved fashion. One 'element' is one complex time sample, consisting of both the real and imaginary parts. The distance between two such successive elements is n times the number of polarisations (2), or 2n.
    • idist is the distance between the first elements of two consecutive batches in the input data. For the n-sub-band data shown above, cufftPlanMany creates plans to perform 2n (the number of polarisations times the number of sub-bands) FFTs in parallel. Each FFT operation is a 'batch', so the distance between the first elements of two consecutive batches would be just one complex time sample.
    • onembed is similar to inembed, for the output data.
    • ostride is similar to istride, for the output data.
    • odist is similar to idist, for the output data.
    • type is the kind of Fourier Transform to be performed. The only supported type, which meets our requirements, is CUFFT_C2C, the complex-to-complex Fourier Transform.
    • batch is the number of FFTs performed in parallel, which is 2n.

Task B. Copy Time Series Data from Host to Device

File: tut5_fileread.cu

This task is already done for you.

cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)

Copying time series data to device memory is a regular host-to-device memory copy operation, and just involves an invocation of this function, with kind being cudaMemcpyHostToDevice.

Task C. Perform Polyphase Filtering

File: tut5_kernels.cu

In this section, you write code for an 8-tap polyphase filter. This function is of the following form.

   __global__ void DoPFB(char4 *pc4Data, float4 *pf4FFTIn, float *pfPFBCoeff);

The input and output data are of special CUDA data types. As the name implies, char4 is a struct that encapsulates four char variables, named x, y, z, and w. For example, pc4Data->x gives you the first char, and so on. float4 has a similar structure, encapsulating four floating-point variables.

Since our input data is packed as shown in the figure in Task A.4., pc4Data->x will contain Re(X), pc4Data->y will contain Im(X), pc4Data->z will contain Re(Y), and pc4Data->w will contain Im(Y).

The first step in writing any CUDA kernel is calculating the thread ID. Since we are using one-dimensional threadblocks in a one-dimensional grid, the thread ID calculation is as follows.

   int i = (blockIdx.x * blockDim.x) + threadIdx.x;

Since we are performing an N-point transform, the input to the PFB function is of length N x P, where the number of taps, P = 8, in this case. We need the value of N in the kernel for indexing into the data arrays. There are two ways to do this: copy the value of N into device memory prior to invoking the kernel and use it here, or, as has been done in the solution set, compute it on the fly. The kernel launch parameters have been selected such that the dimension of a threadblock times the dimension of the grid gives you N, as shown below.

   int iNFFT = (gridDim.x * blockDim.x);

We need some memory to hold the output of the PFB operation, and the best data type that we can use in this case is a float4 - floating-point, because the next stage (CUFFT) requires that data type as input, and specifically float4 because each time sample, as described before, is composed of real and imaginary values for both X and Y polarisations. One thing to note is that this variable needs to be declared using a CUDA constructor to ensure proper alignment.

   float4 f4PFBOut = make_float4(0.0, 0.0, 0.0, 0.0);

The PFB algorithm is as follows, in pseudo code.

   for each tap
       compute index into data arrays
       out_re(x) += in_re(x) * coefficient
       out_im(x) += in_im(x) * coefficient
       out_re(y) += in_re(y) * coefficient
       out_im(y) += in_im(y) * coefficient

Here, coefficient is given in pfPFBCoeff. For optimisation, each coefficient repeats 4 times, so that even though the index is incremented within the loop, the coefficient is the same for all four multiplications.

Task D. Perform Fourier Transform

File: tut5_fileread.cu

cufftResult cufftExecC2C(cufftHandle plan, cufftComplex *idata, cufftComplex *odata, int direction)

This function executes all the single-precision complex-to-complex Fourier Transforms in parallel.

Task E. Accumulate Spectrum

File: tut5_kernels.cu

The kernel you write should perform the following computations, and add these values to corresponding accumulators.

  • A = Re²(X) + Im²(X)
  • B = Re²(Y) + Im²(Y)
  • C = Re(XY*) = Re(X)Re(Y) + Im(X)Im(Y)
  • D = Im(XY*) = Im(X)Re(Y) − Re(X)Im(Y)

Here, A and B are the powers in X and Y polarisations, respectively. The Stokes parameters of the signal can (later, offline) be calculated from these four values as follows:

I = A + B

Q = A − B

U = 2C

V = 2D

For writing the kernel, follow the general instructions given in Task C. The kernel is of the form:

   __global__ void Accumulate(float4 *pf4FFTOut, float4 *pf4SumStokes);

Task F. Copy Accumulated Spectrum from Device to Host

File: tut5_main.cu

cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)

Copying the computed spectra back to host memory is a regular device-to-host memory copy operation, and just involves an invocation of this function, with kind being cudaMemcpyDeviceToHost.

Conclusion

By the end of this tutorial, you will have learned how to construct a heterogeneous astronomical signal-processing instrument that uses an FPGA design to packetise time-domain data and send it to a PC over 10GbE, acquires the data on the PC, and generates polyphase-filtered spectra on a GPU.