Implementing Large Integer Multiplication with cuFFT

Keywords: less

Preface

In some cases, we may need to use integers far beyond the range of built-in integers, such as public key encryption. If the most primitive vertical formula is used, then the time complexity is T(n2), where n is the number of digits multiplied by two integers. Use Karatsuba algorithm Optimized, the time complexity can be reduced to T (nlog23)T(n1.585). If FFT is used, it can be optimized to T(nlogn).

principle

Integer Multiplication and Polynomials

An integer, we can express it as a polynomial. For example, a decimal integer of length n, a0a1a2a3...an_1, is written as a polynomial:

A(x)=a0xn−1+a1xn−2+...+an−2x+an−1

The integer 2345, written as a polynomial, is:
A(x)=2x3+3x2+4x+5

In this way, integer multiplication becomes polynomial multiplication. For example, 2345*123 becomes:
×+=2x5+2x5+4x4+3x4+7x4+2x3+6x3+6x3+4x3+16x3+3x2+x2+9x2+8x2+5x222x2+4x+2x+12x+10x22x+531515

(It will be found that some coefficients in the result polynomial are greater than 10. As long as carry processing is done, 288435 can be obtained, which is the result of 2345*123.)

Obviously, the time complexity of this direct calculation is T(n2) (assuming both polynomials are n-1).

For simplicity, the coefficients of polynomials can also be extracted and expressed as vectors: (2, 7, 16, 22, 22, 15), which is the coefficient representation of polynomials.

For an n-1 degree polynomial A(x), it is determined by N coefficients a0,a1,a2,...,an_1. If we evaluate n different x, we can get n different points: (x0, A (x0), (x1, A (x1), (xn_1,A(xn_1), (xn_1).
Conversely, as long as we know n different points on y=A(x), we can get n coefficients of A(x) by interpolating n points.

The unique n-1 degree polynomial can be determined by n different points as well as n coefficients. This method of representing polynomials by n different points is called the point-value representation of polynomials.

Polynomial Multiplication and Convolution

The coefficients of the polynomial multiplication result C(x)=A(x) B(x) are formulated as follows:

ci=∑k=0iakbi−k

If you have studied signal processing, you may find this formula familiar. Yes, that's the convolution formula. Polynomial multiplication is actually convolution.

For example, (2,3,4,5) convolutes with (1,2,3):

2 3 4 5
3 2 1 (2∗1)
2 3 4 5
3 2 1 (2,2×2+3×1)
2 3 4 5
3 2 1 (2,7,2×3+3×2+4×1)
2 3 4 5
3 2 1 (2,7,16,3×3+4×2+5×1)
2 3 4 5
3 2 1 (2,7,16,22,4×3+5×2)
2 3 4 5
3 2 1 (2,7,16,22,22,5×3)=(2,7,16,22,22,15)

Convolution and Fast Fourier Transform

There are many ways to calculate convolution. Here we mainly introduce the method of using fast Fourier transform. This method makes use of the following conclusions:

Convolution in time domain is equivalent to point multiplication in frequency domain - > polynomial multiplication in coefficient expression is equivalent to point multiplication in point value expression

Let n-1 degree polynomials A(x) and B(x) have n coefficients, and 2n-2 degree polynomials C(x)=A(x) B(x), which have 2n-1 coefficients.

Evaluate A(x), B(x) to get A(x), B(x):

A(x):(x0,A(x0)),(x1,A(x1)),...(x2n−2,A(x2n−2))B(x):(x0,B(x0)),(x1,B(x1)),...(x2n−2,B(x2n−2))

Because C(xk)=A(xk) B (xk), then there are
C(x):(x0,A(x0)∗B(x0)),(x1,A(x1)∗B(x1)),...(x2n−2,A(x2n−2)∗B(x2n−2))

That is to say, the calculation of A(x) B(x) under the representation of point value is point multiplication. The time complexity of point multiplication of 2n points is T(n). It is much less complex than the polynomial multiplication T(n2) expressed by coefficients.

Therefore, the steps of calculating polynomial multiplication by point value representation are as follows:

  1. Evaluation. The polynomials A(x),B(x) represented by coefficients are evaluated to obtain A(x),B(x) represented by point values.
  2. Dot multiplication. Under the point value representation, the point value representation of C(x) is obtained by multiplying A(x),B(x).
  3. Interpolation. The point representing C(x) is interpolated to get C(x) expressed by coefficients.

We know that the second step of point multiplication time complexity is T(n), the key lies in the first and third steps of the algorithm.

If we directly evaluate a polynomial of length n times to get n points, then the time complexity is T(n2), which is the same as the complexity of direct multiplication of polynomials.

But if we use fast Fourier transform to evaluate, then the time complexity is T(nlogn).
Similarly, the inverse fast Fourier transform can be used to interpolate, and the time complexity is also T(nlogn).

The principle of fast Fourier transform is not introduced here. It is so important that you can find a detailed introduction to it in almost every book on digital signal processing.

cuFFT implementation

cuFFT

CuFFT is a CUDA-based FFT library developed by nVidia. Using cuFFT, FFT operation can be accelerated by graphics card.

Improvement

cuFFT is based on floating-point operation. It is difficult to avoid errors when converting floating-point numbers to integers. For example, the following codes are converted to decimal systems. The last few bits of the results are often wrong, which is caused by errors.

A better solution is to use Fast Fourier Transform (FNT) based on modulo n additive group, which is called Fast Number Theory Transform (FNT) at this time. It does not involve floating-point numbers at all, and there is no error problem.

Complete code

The nvcc compilation needs to add the parameter - lcufft. When using Visual Studio, add cufft.lib to the additional dependencies of the project.

#include <device_launch_parameters.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include <vector>
#include <cmath>

const auto BATCH = 1;

__global__ void ComplexPointwiseMulAndScale(cufftComplex *a, cufftComplex *b, int size)
{
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    float scale = 1.0f / (float)size;
    cufftComplex c;
    for (int i = threadID; i < size; i += numThreads)
    {
        c = cuCmulf(a[i], b[i]);
        b[i] = make_cuFloatComplex(scale*cuCrealf(c), scale*cuCimagf(c));
    }
}

__global__ void ConvertToInt(cufftReal *a, int size)
{
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    auto b = (int*)a;
    for (int i = threadID; i < size; i += numThreads)
        b[i] = static_cast<int>(round(a[i]));
}

std::vector<int> multiply(const std::vector<float> &a, const std::vector<float> &b)
{
    const auto NX = a.size();
    cufftHandle plan_a, plan_b, plan_c;
    cufftComplex *data_a, *data_b;
    std::vector<int> c(a.size() + 1);
    c[0] = 0;

    //Allocate graphics card memory and initialize, assuming sizeof(int)==sizeof(float), sizeof(cufftComplex)==2*sizeof(float)
    cudaMalloc((void**)&data_a, sizeof(cufftComplex) * (NX / 2 + 1) * BATCH);
    cudaMalloc((void**)&data_b, sizeof(cufftComplex) * (NX / 2 + 1) * BATCH);
    cudaMemcpy(data_a, a.data(), sizeof(float) * a.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(data_b, b.data(), sizeof(float) * b.size(), cudaMemcpyHostToDevice);
    if (cudaGetLastError() != cudaSuccess) { fprintf(stderr, "Cuda error: Failed to allocate\n"); return c; }

    if (cufftPlan1d(&plan_a, NX, CUFFT_R2C, BATCH) != CUFFT_SUCCESS) { fprintf(stderr, "CUFFT error: Plan creation failed"); return c; }
    if (cufftPlan1d(&plan_b, NX, CUFFT_R2C, BATCH) != CUFFT_SUCCESS) { fprintf(stderr, "CUFFT error: Plan creation failed"); return c; }
    if (cufftPlan1d(&plan_c, NX, CUFFT_C2R, BATCH) != CUFFT_SUCCESS) { fprintf(stderr, "CUFFT error: Plan creation failed"); return c; }

    //Converting A(x) to Frequency Domain
    if (cufftExecR2C(plan_a, (cufftReal*)data_a, data_a) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT error: ExecR2C Forward failed");
        return c;
    }

    //Converting B(x) to Frequency Domain
    if (cufftExecR2C(plan_b, (cufftReal*)data_b, data_b) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT error: ExecR2C Forward failed");
        return c;
    }

    //Point multiplication
    ComplexPointwiseMulAndScale<<<NX / 256 + 1, 256>>>(data_a, data_b, NX);

    //Converting C(x) back to time domain
    if (cufftExecC2R(plan_c, data_b, (cufftReal*)data_b) != CUFFT_SUCCESS)
    {
        fprintf(stderr, "CUFFT error: ExecC2R Forward failed");
        return c;
    }

    //Converting the results of floating-point numbers to integers
    ConvertToInt<<<NX / 256 + 1, 256>>>((cufftReal*)data_b, NX);

    if (cudaDeviceSynchronize() != cudaSuccess) 
    {
        fprintf(stderr, "Cuda error: Failed to synchronize\n");
        return c;
    }

    cudaMemcpy(&c[1], data_b, sizeof(float) * b.size(), cudaMemcpyDeviceToHost);

    cufftDestroy(plan_a);
    cufftDestroy(plan_b);
    cufftDestroy(plan_c);
    cudaFree(data_a);
    cudaFree(data_b);
    return c;
}


int main(int argc, char **argv) 
{
    //Set base
    const auto base = 10;

    //999 * 9
    std::vector<float> a{ 0, 9, 9, 9 }; 
    std::vector<float> b{ 0, 0, 0, 9 };

    auto c = multiply(a, b);

    for (auto i : c)
        printf("%d ", i);
    printf("\n");

    //Processing carry
    for (int i = c.size() - 1; i > 0; i--)
    {
        if (c[i] >= base)
        {
            c[i - 1] += c[i] / base;
            c[i] %= base;
        }
    }

    //Remove excess zeros
    c.pop_back();
    auto i = 0;
    if (c[0] == 0)
        i++;

    //To output the final result, we need to change the mode of output, such as the decimal system is "% 2d" and the decimal system is "% 3d".
    for (; i < c.size(); i++)
        printf("%d", c[i]);
    printf("\n");

    return 0;
}

Reference

Introduction to Algorithms
cuFFT documentation

Posted by xentia on Fri, 31 May 2019 10:20:49 -0700