Preface
In some cases, we may need to use integers far beyond the range of built-in integers, such as public key encryption. If the most primitive vertical formula is used, then the time complexity is T(n2), where n is the number of digits multiplied by two integers. Use Karatsuba algorithm Optimized, the time complexity can be reduced to T (nlog23)T(n1.585). If FFT is used, it can be optimized to T(nlogn).
principle
Integer Multiplication and Polynomials
An integer, we can express it as a polynomial. For example, a decimal integer of length n, a0a1a2a3...an_1, is written as a polynomial:
The integer 2345, written as a polynomial, is:
In this way, integer multiplication becomes polynomial multiplication. For example, 2345*123 becomes:
(It will be found that some coefficients in the result polynomial are greater than 10. As long as carry processing is done, 288435 can be obtained, which is the result of 2345*123.)
Obviously, the time complexity of this direct calculation is T(n2) (assuming both polynomials are n-1).
For simplicity, the coefficients of polynomials can also be extracted and expressed as vectors: (2, 7, 16, 22, 22, 15), which is the coefficient representation of polynomials.
For an n-1 degree polynomial A(x), it is determined by N coefficients a0,a1,a2,...,an_1. If we evaluate n different x, we can get n different points: (x0, A (x0), (x1, A (x1), (xn_1,A(xn_1), (xn_1).
Conversely, as long as we know n different points on y=A(x), we can get n coefficients of A(x) by interpolating n points.
The unique n-1 degree polynomial can be determined by n different points as well as n coefficients. This method of representing polynomials by n different points is called the point-value representation of polynomials.
Polynomial Multiplication and Convolution
The coefficients of the polynomial multiplication result C(x)=A(x) B(x) are formulated as follows:
If you have studied signal processing, you may find this formula familiar. Yes, that's the convolution formula. Polynomial multiplication is actually convolution.
For example, (2,3,4,5) convolutes with (1,2,3):
2 | 3 | 4 | 5 | |||
3 | 2 | 1 | (2∗1) | |||
2 | 3 | 4 | 5 | |||
3 | 2 | 1 | (2,2×2+3×1) | |||
2 | 3 | 4 | 5 | |||
3 | 2 | 1 | (2,7,2×3+3×2+4×1) | |||
2 | 3 | 4 | 5 | |||
3 | 2 | 1 | (2,7,16,3×3+4×2+5×1) | |||
2 | 3 | 4 | 5 | |||
3 | 2 | 1 | (2,7,16,22,4×3+5×2) | |||
2 | 3 | 4 | 5 | |||
3 | 2 | 1 | (2,7,16,22,22,5×3)=(2,7,16,22,22,15) |
Convolution and Fast Fourier Transform
There are many ways to calculate convolution. Here we mainly introduce the method of using fast Fourier transform. This method makes use of the following conclusions:
Convolution in time domain is equivalent to point multiplication in frequency domain - > polynomial multiplication in coefficient expression is equivalent to point multiplication in point value expression
Let n-1 degree polynomials A(x) and B(x) have n coefficients, and 2n-2 degree polynomials C(x)=A(x) B(x), which have 2n-1 coefficients.
Evaluate A(x), B(x) to get A(x), B(x):
Because C(xk)=A(xk) B (xk), then there are
That is to say, the calculation of A(x) B(x) under the representation of point value is point multiplication. The time complexity of point multiplication of 2n points is T(n). It is much less complex than the polynomial multiplication T(n2) expressed by coefficients.
Therefore, the steps of calculating polynomial multiplication by point value representation are as follows:
- Evaluation. The polynomials A(x),B(x) represented by coefficients are evaluated to obtain A(x),B(x) represented by point values.
- Dot multiplication. Under the point value representation, the point value representation of C(x) is obtained by multiplying A(x),B(x).
- Interpolation. The point representing C(x) is interpolated to get C(x) expressed by coefficients.
We know that the second step of point multiplication time complexity is T(n), the key lies in the first and third steps of the algorithm.
If we directly evaluate a polynomial of length n times to get n points, then the time complexity is T(n2), which is the same as the complexity of direct multiplication of polynomials.
But if we use fast Fourier transform to evaluate, then the time complexity is T(nlogn).
Similarly, the inverse fast Fourier transform can be used to interpolate, and the time complexity is also T(nlogn).
The principle of fast Fourier transform is not introduced here. It is so important that you can find a detailed introduction to it in almost every book on digital signal processing.
cuFFT implementation
cuFFT
CuFFT is a CUDA-based FFT library developed by nVidia. Using cuFFT, FFT operation can be accelerated by graphics card.
Improvement
cuFFT is based on floating-point operation. It is difficult to avoid errors when converting floating-point numbers to integers. For example, the following codes are converted to decimal systems. The last few bits of the results are often wrong, which is caused by errors.
A better solution is to use Fast Fourier Transform (FNT) based on modulo n additive group, which is called Fast Number Theory Transform (FNT) at this time. It does not involve floating-point numbers at all, and there is no error problem.
Complete code
The nvcc compilation needs to add the parameter - lcufft. When using Visual Studio, add cufft.lib to the additional dependencies of the project.
#include <device_launch_parameters.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include <vector>
#include <cmath>
const auto BATCH = 1;
__global__ void ComplexPointwiseMulAndScale(cufftComplex *a, cufftComplex *b, int size)
{
const int numThreads = blockDim.x * gridDim.x;
const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
float scale = 1.0f / (float)size;
cufftComplex c;
for (int i = threadID; i < size; i += numThreads)
{
c = cuCmulf(a[i], b[i]);
b[i] = make_cuFloatComplex(scale*cuCrealf(c), scale*cuCimagf(c));
}
}
__global__ void ConvertToInt(cufftReal *a, int size)
{
const int numThreads = blockDim.x * gridDim.x;
const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
auto b = (int*)a;
for (int i = threadID; i < size; i += numThreads)
b[i] = static_cast<int>(round(a[i]));
}
std::vector<int> multiply(const std::vector<float> &a, const std::vector<float> &b)
{
const auto NX = a.size();
cufftHandle plan_a, plan_b, plan_c;
cufftComplex *data_a, *data_b;
std::vector<int> c(a.size() + 1);
c[0] = 0;
//Allocate graphics card memory and initialize, assuming sizeof(int)==sizeof(float), sizeof(cufftComplex)==2*sizeof(float)
cudaMalloc((void**)&data_a, sizeof(cufftComplex) * (NX / 2 + 1) * BATCH);
cudaMalloc((void**)&data_b, sizeof(cufftComplex) * (NX / 2 + 1) * BATCH);
cudaMemcpy(data_a, a.data(), sizeof(float) * a.size(), cudaMemcpyHostToDevice);
cudaMemcpy(data_b, b.data(), sizeof(float) * b.size(), cudaMemcpyHostToDevice);
if (cudaGetLastError() != cudaSuccess) { fprintf(stderr, "Cuda error: Failed to allocate\n"); return c; }
if (cufftPlan1d(&plan_a, NX, CUFFT_R2C, BATCH) != CUFFT_SUCCESS) { fprintf(stderr, "CUFFT error: Plan creation failed"); return c; }
if (cufftPlan1d(&plan_b, NX, CUFFT_R2C, BATCH) != CUFFT_SUCCESS) { fprintf(stderr, "CUFFT error: Plan creation failed"); return c; }
if (cufftPlan1d(&plan_c, NX, CUFFT_C2R, BATCH) != CUFFT_SUCCESS) { fprintf(stderr, "CUFFT error: Plan creation failed"); return c; }
//Converting A(x) to Frequency Domain
if (cufftExecR2C(plan_a, (cufftReal*)data_a, data_a) != CUFFT_SUCCESS)
{
fprintf(stderr, "CUFFT error: ExecR2C Forward failed");
return c;
}
//Converting B(x) to Frequency Domain
if (cufftExecR2C(plan_b, (cufftReal*)data_b, data_b) != CUFFT_SUCCESS)
{
fprintf(stderr, "CUFFT error: ExecR2C Forward failed");
return c;
}
//Point multiplication
ComplexPointwiseMulAndScale<<<NX / 256 + 1, 256>>>(data_a, data_b, NX);
//Converting C(x) back to time domain
if (cufftExecC2R(plan_c, data_b, (cufftReal*)data_b) != CUFFT_SUCCESS)
{
fprintf(stderr, "CUFFT error: ExecC2R Forward failed");
return c;
}
//Converting the results of floating-point numbers to integers
ConvertToInt<<<NX / 256 + 1, 256>>>((cufftReal*)data_b, NX);
if (cudaDeviceSynchronize() != cudaSuccess)
{
fprintf(stderr, "Cuda error: Failed to synchronize\n");
return c;
}
cudaMemcpy(&c[1], data_b, sizeof(float) * b.size(), cudaMemcpyDeviceToHost);
cufftDestroy(plan_a);
cufftDestroy(plan_b);
cufftDestroy(plan_c);
cudaFree(data_a);
cudaFree(data_b);
return c;
}
int main(int argc, char **argv)
{
//Set base
const auto base = 10;
//999 * 9
std::vector<float> a{ 0, 9, 9, 9 };
std::vector<float> b{ 0, 0, 0, 9 };
auto c = multiply(a, b);
for (auto i : c)
printf("%d ", i);
printf("\n");
//Processing carry
for (int i = c.size() - 1; i > 0; i--)
{
if (c[i] >= base)
{
c[i - 1] += c[i] / base;
c[i] %= base;
}
}
//Remove excess zeros
c.pop_back();
auto i = 0;
if (c[0] == 0)
i++;
//To output the final result, we need to change the mode of output, such as the decimal system is "% 2d" and the decimal system is "% 3d".
for (; i < c.size(); i++)
printf("%d", c[i]);
printf("\n");
return 0;
}
Reference
Introduction to Algorithms
cuFFT documentation