- Define network
Defining a network for INT8 is exactly the same as defining a network for any other precision. The weights are imported as FP32, and the builder calibrates the network to find appropriate quantization factors that reduce the network to INT8 precision. This example uses NvCaffeParser to import the network:
const nvcaffeparser1::IBlobNameToTensor* blobNameToTensor = parser->parse(
    locateFile(mParams.prototxtFileName, mParams.dataDirs).c_str(),
    locateFile(mParams.weightsFileName, mParams.dataDirs).c_str(),
    *network,
    dataType == DataType::kINT8 ? DataType::kFLOAT : dataType);
- Set calibrator
Calibration is an additional step required when building a network for INT8. The application must provide sample input, i.e. calibration data, to TensorRT. TensorRT then runs inference in FP32 and collects statistics about the intermediate activation tensors, which it uses to build the reduced-precision INT8 engine.
Create an INT8 calibrator:
std::unique_ptr<IInt8Calibrator> calibrator;
config->setAvgTimingIterations(1);
config->setMinTimingIterations(1);
config->setMaxWorkspaceSize(1_GiB);
(1) Calibration data
if (dataType == DataType::kINT8)
{
    MNISTBatchStream calibrationStream(mParams.calBatchSize, mParams.nbCalBatches, "train-images-idx3-ubyte",
        "train-labels-idx1-ubyte", mParams.dataDirs);
}
The MNISTBatchStream class provides helper methods for retrieving batch data. The calibrator uses a batch stream object to fetch batch data during calibration. A typical batch stream interface looks like this:
class IBatchStream
{
public:
    virtual void reset(int firstBatch) = 0;
    virtual bool next() = 0;
    virtual void skip(int skipCount) = 0;
    virtual float* getBatch() = 0;
    virtual float* getLabels() = 0;
    virtual int getBatchesRead() const = 0;
    virtual int getBatchSize() const = 0;
    virtual nvinfer1::Dims getDims() const = 0;
};
Note: the calibration data must be representative of the input provided to TensorRT at runtime; for example, for an image classification network it should not contain images from only a small number of categories. For ImageNet, calibrating with approximately 500 images is sufficient.
(2) Calibrator interface
The application must implement the "IInt8Calibrator" interface to provide calibration data and helper methods for reading and writing the calibration table file. TensorRT provides four implementations of IInt8Calibrator:
IInt8EntropyCalibrator
IInt8EntropyCalibrator2
IInt8MinMaxCalibrator
IInt8LegacyCalibrator
This example uses IInt8EntropyCalibrator2:
calibrator.reset(new Int8EntropyCalibrator2<MNISTBatchStream>(
    calibrationStream, 0, mParams.networkName.c_str(), mParams.inputTensorNames[0].c_str()));
The calibrator object requires a calibration batch stream.
To perform calibration, the interface must provide implementations of "getBatchSize()" and "getBatch()" that retrieve data from the batch stream object.
The builder starts by calling the getBatchSize() method to get the batch size of the calibration set.
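In the Int8EntropyCalibrator2 shipped with the samples, this override simply forwards the batch stream's batch size. A minimal sketch is shown below; the member name mStream (the batch stream held by the calibrator) is an assumption taken from the sample's EntropyCalibrator.h:
int getBatchSize() const override
{
    // Batch size of the calibration set, as reported by the batch stream.
    return mStream.getBatchSize();
}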
config->setInt8Calibrator(calibrator.get());
It then repeatedly calls the "getBatch()" method to get batches from the application until the method returns false. Each calibration batch must contain the number of images specified as the batch size.
float* getBatch() override { return mData.data() + (mBatchCount * mBatchSize * samplesCommon::volume(mDims)); }
float* getBatch() override { return mBatch.data(); }
while (batchStream.next())
{
    assert(mParams.inputTensorNames.size() == 1);
    if (!processInput(buffers, batchStream.getBatch()))
    {
        return false;
    }
    ...
}
For each input tensor, a pointer to the input data in GPU memory must be written into the bindings array. The names array contains the names of the input tensors; the position of each tensor in the bindings array matches the position of its name in the names array, and both arrays have size nbBindings. Because the calibration step is time-consuming, the application can implement writeCalibrationCache() to write the calibration table to an appropriate location for reuse in future runs, and readCalibrationCache() to read the calibration table file back from that location. During calibration, the builder checks whether a calibration cache exists by calling readCalibrationCache(). The builder recalibrates only if the cache does not exist or is incompatible with the current TensorRT version or with the calibrator variant that generated it.
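For reference, a minimal sketch of the calibrator-side getBatch() override follows. It assumes the calibrator owns a batch stream (mStream), a preallocated device buffer large enough for one calibration batch (mDeviceInput), the number of elements per batch (mInputCount), and the input tensor name it was constructed with (mInputBlobName); these member names follow the sample's EntropyCalibrator.h but are assumptions here, and CHECK is the samples' CUDA error-checking macro:
bool getBatch(void* bindings[], const char* names[], int nbBindings) override
{
    if (!mStream.next())
    {
        // No more calibration batches; signal the builder to stop calling getBatch().
        return false;
    }
    // Copy the current host-side batch into the device buffer bound to the input tensor.
    CHECK(cudaMemcpy(mDeviceInput, mStream.getBatch(), mInputCount * sizeof(float), cudaMemcpyHostToDevice));
    assert(!strcmp(names[0], mInputBlobName));
    bindings[0] = mDeviceInput; // one input tensor, so one binding entry
    return true;
}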
(3) Calibration file
The calibration file stores an activation scale for each tensor in the network. The scale is computed from the dynamic range produced by the calibration algorithm: scale = abs(maximum dynamic range) / 127.0f.
The calibration file is named "CalibrationTable<NetworkName>", where "<NetworkName>" is the name of your network, such as "MNIST". The file is located in the 'TensorRT-x.x.x.x/data/mnist' directory, where 'x.x.x.x' is the version of TensorRT you installed.
If the "CalibrationTable<NetworkName>" file is not found, the builder runs the calibration algorithm again to create it. The contents of the calibration table look like this:
TRT-7000-EntropyCalibration2    // <TensorRT version>-<calibration algorithm>
// <layer name>: <floating-point activation scale determined for each tensor during calibration>
data: 3c008912
conv1: 3c88edfc
pool1: 3c88edfc
conv2: 3ddc858b
pool2: 3ddc858b
ip1: 3db6bd6e
ip2: 3e691968
prob: 3c010a14
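Each value is the per-tensor activation scale written as the hexadecimal bit pattern of a 32-bit float. The following stand-alone snippet is purely illustrative (not part of the sample) and shows how such an entry can be decoded back into the float scale it encodes:
#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>

// Decode a calibration-table entry such as "3c008912" into the float scale it encodes.
float decodeScale(const std::string& hex)
{
    uint32_t bits = static_cast<uint32_t>(std::stoul(hex, nullptr, 16));
    float scale;
    std::memcpy(&scale, &bits, sizeof(scale)); // reinterpret the bit pattern as an IEEE-754 float
    return scale;
}

int main()
{
    std::cout << decodeScale("3c008912") << std::endl; // prints approximately 0.00785
    return 0;
}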
The "CalibrationTable" file is generated during the build phase of running the calibration algorithm. After creating the calibration file, you can read the file for subsequent operation without running the calibration again. You can provide an implementation for readCalibrationCache() to load calibration files from the desired location. If the read calibration file is compatible with the calibrator type (used to generate the file) and TensorRT version, the generator will skip the calibration step and use each tensor scale value in the calibration file.
- Configure the builder
config->setAvgTimingIterations(1);  // Set the average number of timing iterations
config->setMinTimingIterations(1);  // Set the minimum number of timing iterations
config->setMaxWorkspaceSize(1_GiB); // Set the maximum workspace size
// Allow FP16 precision when building for kHALF
if (dataType == DataType::kHALF)
{
    config->setFlag(BuilderFlag::kFP16);
}
// Allow INT8 precision when building for kINT8
if (dataType == DataType::kINT8)
{
    config->setFlag(BuilderFlag::kINT8);
}
// Set the maximum batch size
builder->setMaxBatchSize(mParams.batchSize);
if (dataType == DataType::kINT8)
{
    // Create the calibration batch stream
    MNISTBatchStream calibrationStream(mParams.calBatchSize, mParams.nbCalBatches, "train-images-idx3-ubyte",
        "train-labels-idx1-ubyte", mParams.dataDirs);
    // Create the calibrator
    calibrator.reset(new Int8EntropyCalibrator2<MNISTBatchStream>(
        calibrationStream, 0, mParams.networkName.c_str(), mParams.inputTensorNames[0].c_str()));
    // Pass the calibrator object to the builder configuration
    config->setInt8Calibrator(calibrator.get());
}
- Build engine
mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(
    builder->buildEngineWithConfig(*network, *config), samplesCommon::InferDeleter());
- Run engine
The input and output remain 32-bit floating-point numbers:
bool SampleINT8::infer(std::vector<float>& score, int firstScoreBatch, int nbScoreBatches)
{
    float ms{0.0f};
    // Allocate host and device memory buffers
    samplesCommon::BufferManager buffers(mEngine, mParams.batchSize);
    // Create the execution context
    auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
    if (!context)
    {
        return false;
    }
    MNISTBatchStream batchStream(mParams.batchSize, nbScoreBatches + firstScoreBatch, "train-images-idx3-ubyte",
        "train-labels-idx1-ubyte", mParams.dataDirs);
    batchStream.skip(firstScoreBatch);
    // Get the output dimensions
    Dims outputDims = context->getEngine().getBindingDimensions(
        context->getEngine().getBindingIndex(mParams.outputTensorNames[0].c_str()));
    int outputSize = samplesCommon::volume(outputDims);
    int top1{0}, top5{0};
    float totalTime{0.0f};
    while (batchStream.next())
    {
        assert(mParams.inputTensorNames.size() == 1);
        // Read input data into the managed buffers
        if (!processInput(buffers, batchStream.getBatch()))
        {
            return false;
        }
        // Copy the input data from host to device
        buffers.copyInputToDevice();
        cudaStream_t stream;
        CHECK(cudaStreamCreate(&stream));
        cudaEvent_t start, end;
        CHECK(cudaEventCreateWithFlags(&start, cudaEventBlockingSync));
        CHECK(cudaEventCreateWithFlags(&end, cudaEventBlockingSync));
        cudaEventRecord(start, stream);
        // Run inference
        bool status = context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), stream, nullptr);
        if (!status)
        {
            return false;
        }
        cudaEventRecord(end, stream);
        cudaEventSynchronize(end);
        cudaEventElapsedTime(&ms, start, end);
        cudaEventDestroy(start);
        cudaEventDestroy(end);
        totalTime += ms;
        // Copy the output from device to host
        buffers.copyOutputToHost();
        CHECK(cudaStreamDestroy(stream));
        // This sample reports the Top-1 and Top-5 metrics for FP32 and INT8 precision,
        // as well as for FP16 if the hardware natively supports it. These figures should be within 1% of each other.
        top1 += calculateScore(buffers, batchStream.getLabels(), mParams.batchSize, outputSize, 1);
        top5 += calculateScore(buffers, batchStream.getLabels(), mParams.batchSize, outputSize, 5);
        if (batchStream.getBatchesRead() % 100 == 0)
        {
            sample::gLogInfo << "Processing next set of max 100 batches" << std::endl;
        }
    }
    int imagesRead = (batchStream.getBatchesRead() - firstScoreBatch) * mParams.batchSize;
    score[0] = float(top1) / float(imagesRead);
    score[1] = float(top5) / float(imagesRead);
    sample::gLogInfo << "Top1: " << score[0] << ", Top5: " << score[1] << std::endl;
    sample::gLogInfo << "Processing " << imagesRead << " images averaged " << totalTime / imagesRead
                     << " ms/image and " << totalTime / batchStream.getBatchesRead() << " ms/batch." << std::endl;
    return true;
}
- Verify output
7.5.2 sample_int8
#include "common/BatchStream.h" #include "common/EntropyCalibrator.h" #include "common/argsParser.h" #include "common/buffers.h" #include "common/common.h" #include "common/logger.h" #include "common/logging.h" #include "NvCaffeParser.h" #include "NvInfer.h" #include <cuda_runtime_api.h> #include <cstdlib> #include <fstream> #include <iostream> #include <sstream> const std::string gSampleName = "TensorRT.sample_int8"; struct SampleINT8Params : public samplesCommon::CaffeSampleParams { int nbCalBatches; int calBatchSize; std::string networkName; }; class SampleINT8 { template <typename T> using SampleUniquePtr = std::unique_ptr<T,samplesCommon::InferDeleter>; public: SampleINT8(const SampleINT8Params& params) : mParams(params) , mEngine(nullptr) { initLibNvInferPlugins(&sample::gLogger.getTRTLOgger(),""); } bool build(DataType dataType); bool isSupported(DataType dataType); bool infer(std::vector<float>& score,int firstScoreBatch, int nbScoreBatches); bool teardown; private: SampleINT8Params mParams; std::shared_ptr<nvinfer1::ICudaEngine> mEngine; nvinfer1::Dims mInputDims; bool constructNetwork(SampleUniquePtr<nvinfer1::IBuilder>& builder, SampleUniquePtr<nvinfer1::INetworkDefinition>& network, SampleUniquePtr<nvinfer1::IBuilderConfig>& config, SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser, DataType datattype); bool processInput(const samplesCommon::BufferManager& buffers,const float* data); int calculateScore(const samplesCommon::BufferManager& buffers,float* labels,int batchSize,int outputSize,int threshold); }; bool SampleINT8::build(DataType dataType) { auto builder = ampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger())); if(!builder) { return false; } auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetwork()); if(!network) { return false; } auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig()); if(!config) { return false; } auto parser = SampleUniquePtr<nvcaffeparser1::ICaffeParser>(nvcaffeparser1::createCaffeParser()); if(!parser) { return false; } if((dataType == DataType::kINT8 && ! builder->platformHaFastInt8()) || (dataType == DataType::kHALF && !builder->platformHasFatDp16())) { return false; } auto constructed = constructNetwork(builder,network,config,parser,dataType); if(!constructed) { return false; } assert(network->getNbInputs() == 1); mInputDims = network->getInput(0)->getDimensions(); assert(mInputDims.nbDims == 3); return true; } bool SampleINT8::isSupported(DataType dataType) { auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger())); if(!builder) { return false; } if((dataType == DataType::kINT8 && !builder->platformHaFastInt8()) || (dataType == DataType::kHALF && !builder->platformHasFatFp16())) { return false; } return true; } bool SampleINT8::constructNetwork(SampleUniquePtr<nvinfer1::IBuilder> &builder, SampleUniquePtr<nvinfer1::INetwokDefinition> &network, SampleUniquePtr<nvinfer1::IBuilderCon> &config, int &parser, int datattype) { mEngine = nullptr; const nvcaffeparser1;:IBlobNameToTensor* blobNameToTensor = parser->parse(locateFile(mParams.prototxtFileName,mParams.dataDirs).c_str(), locateFile(mParams.weightsFileName,mParams.dataDirs).c_str(),*network, dataType == DtatType::kINT8 ? 
DataType::kFLOAT : dataType); for(auto & s: mPrams.outputTensorNames) { network->markOutput(*blobNameToTensor->find(s.c_str())); } std::unique_ptr<IInt8Calibrator> calibrator; config->setAvgTimingIterations(1); config->setMinTimingIterations(1); config->setMaxWorkspaceSize(1_GiB); if(dataType == DataType::kHALF) { config->setFlag(BuilderFlag::kFP16); } if(dataType == DataType::kINT8) { config->setFlag(BuilderFlag::kINT8); } builder->setMaxBatchSize(mParams.batchSize); if(dataType == DataType::kINT8) { MNISTBatchStream calibrationStream(mParams.calBatchSize,mParams.nbCalBatches,"train-images-idx3-ubyte", "train-labels-idx1-ubyte",mParams.dataDirs); calibrator.reset(new Int8EntropyCalibrator2<MNISTBatchStream>( calibrationStream,0,mParams.networkName.c_str(),mParams.inputTensorNames[0].c_str())); config->setInt8Calibrator(calibrator.get()); } if(mParams.dlaCore >= 0) { samplesCommon::enableDLA(builder.get(),config.get(),mParams.dlaCore); if(mParams.batchSize > builder->getMaxDLABatchSize()) { sample::gLogError << "Requested batch size "<<mParams.batchSIze << "is greater than the max DLA batch size of "<<builder->getMaxDLABatchSize() << ". Reducing batch size accordingly."<<std::endl; return false; } } mEngine = std::shared_ptr<nvinfer1::ICudaEngine>( builder->buildEngineWithConfig(*network,*config),samplesCommon::InferDeleter()); if(!mEngine) { return false; } return true; } bool SampleINT8::infer(std::vector<float>& score,int firstScoreBatch, int nbScoreBatches) { float ms{0.0f}; samplesCommon::BufferManager buffers(mEngine,mParams.batchSize); auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext()); if(!context) { return false; } MNISTBatchStream batchStream(mParams.batchSize,nbScoreBatches+firstScoreBatch,"train-images-idx3-ubyte", "train-labels-idx1-ubyte",mPArams.dataDirs); batchStream.skip(firstScoreBatch); Dims outputDims = context->getEngine().getBindingDimensions( context->getEngine().getBindingIndex(mParams.outputTensorNames[0].c_str())); int outputSize = samplesCommon::volume(outputDims); int top1{0},top5{0}; float totalTime{0.0f}; while(batchStream.next()) { assert(mParams.inputTensorNames.size() == 1); if(!processInput(buffers,batchStream.getBatch())) { return false; } buffers.copyInputToDevice(); cudaStream_t stream; CHECK(cudaSTreamCreate(&stream)); cudaEvent_t start,end; CHECK(cudaEventCreateWithFlags(&start,cudaEventBlockingSync)); CHECK(cudaEventCreatWithFlags(&end,cudaEventBlockingSync)); cudaEventRecord(start,stream); bool status = context->enqueue(mParams.batchSize,buffers.getDeviceBindings().data(),stream,nullptr); if(!status) { return false; } cudaEventRecord(end,stream); cudaEventSynchronize(end); cudaEventElapsedTime(&ms,start,end); cudaEventDestory(start); cudaEventDestory(end); totalTime += ms; buffers.copyOutputToHost(); CHECK(cudaStreamDestory(stream)); top1 += calculateScore(buffers,batchStream.getLabels(),mParams.batchSize,outputSize,1); top5 += calculateScore(buffers,batchStream.getLabels(),mPArams.batchSIze,outputSize,5); if(batchStream.getBatchesRead() % 100 ==0) { sample::gLogInfo <<"Processing next set of max 100 batches"<<std::endl; } } int imagesRead = (batchStream.getBatchesRead() - firstScoreBatch) * mParams.batchSize; score[0] = float(top1) / float(imagesRead); score[1] = float(top5) / float(imagesRead); sample::gLogInfo <<"Top1: "<<score[0]<<", Top5: "<<score[1]<<std::endl; sample::gLogInfo <<"Processing "<<imagesRead<<" images averaged "<<totalTime/imagesRead <<" ms/image and " <<totalTime / 
batchStream.getBatchesRead() <<" ms/batch."<<std::endl; return true; } bool SampleINT8::teardown() { nvcaffeparser1::shutdownProtobufLibrary(); return true; } bool SampleINT8::processInput(const samplesCommon::BufferManager& buffers, const float* data) { float* hostDataBuffer = static_cast<float*>(buffers.getHstBuffer(mParams.inputTensorNames[0])); std::memcpy(hostDataBuffer,data,mParams.batchSize*samplesCommon::volume(mInputDims)*sizeof(float)); return true; } int SampleINT8::calculatScore( const samplesCommon::BufferManager& buffers,float* labels,int batchSize,int outputSize,int threshold) { float* probs = static_cast<float*>(buffers.getHostBuffer(mParams.outputTensorNames[0])); int success = 0; for(int i=0;i<batchSize;i++) { float *prob = probs + outputSize*i, correct = prob[(int) labels[i]]; int buffer = 0; for(int j=0;j<outputSize;j++) { if(prob[j]>=correct) { better++; } } if(better <= threshold) { success++; } } return success; } SampleINT8Params initializeSampleParams(const samplesCommon::Args& args,int batchSize) { SampleINT8Params params; params.dataDirs = args.dataDirs; params.dataDirs.emplace_back("data/"); params.batchSize = batchSize; params.dlaCore = args.useDLACore; params.nbCalBAtches=10; params.calBatchSize = 50; params.inputTensorNames.push_back("data"); params.outputTensorNames.push_back("prob"); params.prototxtFileName = "deploy.prototxt"; params.weightsFileName = "mnist_lenet.caffemodel"; params.networkName = "mnist"; return params; } void printHelpInfo() { std::cout << "Usage: ./sample_int8 [-h or --help] [-d or --datadir=<path to data directory>] " "[--useDLACore=<int>]" << std::endl; std::cout << "--help, -h Display help information" << std::endl; std::cout << "--datadir Specify path to a data directory, overriding the default. This option can be used " "multiple times to add multiple directories." << std::endl; std::cout << "--useDLACore=N Specify a DLA engine for layers that support DLA. Value can range from 0 to n-1, " "where n is the number of DLA engines on the platform." << std::endl; std::cout << "batch=N Set batch size (default = 32)." << std::endl; std::cout << "start=N Set the first batch to be scored (default = 16). All batches before this batch will " "be used for calibration." << std::endl; std::cout << "score=N Set the number of batches to be scored (default = 1800)." 
<< std::endl; } int main(int argc,char** argv) { if(argc >= 2 && (!strncmp(argv[1],"--help",6) || !strncmp(argv[1],"-h",2))) { printHelpInfo(); return EXIT_SUCCESS; } int natchSize = 32; int firstScoreBatch = 16; int nbSxoreBatches = 18000; for(int i=1;i<argc;++i) { if(!strncmp(argv[i],"batch=",6)) { batchSize = atoi(argv[i]+6); } else if(!strncmp(argv[i],"start = ",6)) { firstScoreBatch = atoi(argv[i]+6); } else if(!strncmp(argv[i],"score=",6)) { nbScoreBatches = atoi(argv[i]+6); } } if(batchSize >128) { sample::gLogError <<"Please provide batch size <= 128"<<std::endl; return EXIT_FAILURE; } if ((firstScoreBatch + nbScoreBatches) * batchSize > 60000) { sample::gLogError << "Only 60000 images available" << std::endl; return EXIT_FAILURE; } samplesCommon::Args args; samplesCommon::parseArgs(args, argc, argv); SampleINT8 sample(initializeSampleParams(args, batchSize)); auto sampleTest = sample::gLogger.defineTest(gSampleName, argc, argv); sample::gLogger.reportTestStart(sampleTest); sample::gLogInfo << "Building and running a GPU inference engine for INT8 sample" << std::endl; std::vector<std::string> dataTypeNames = {"FP32", "FP16", "INT8"}; std::vector<std::string> topNames = {"Top1", "Top5"}; std::vector<DataType> dataTypes = {DataType::kFLOAT, DataType::kHALF, DataType::kINT8}; std::vector<std::vector<float>> scores(3, std::vector<float>(2, 0.0f)); for (size_t i = 0; i < dataTypes.size(); i++) { sample::gLogInfo << dataTypeNames[i] << " run:" << nbScoreBatches << " batches of size " << batchSize << " starting at " << firstScoreBatch << std::endl; if (!sample.build(dataTypes[i])) { if (!sample.isSupported(dataTypes[i])) { sample::gLogWarning << "Skipping " << dataTypeNames[i] << " since the platform does not support this data type." << std::endl; continue; } return sample::gLogger.reportFail(sampleTest); } if (!sample.infer(scores[i], firstScoreBatch, nbScoreBatches)) { return sample::gLogger.reportFail(sampleTest); } } auto isApproximatelyEqual = [](float a, float b, double tolerance) { return (std::abs(a - b) <= tolerance); }; const double tolerance{0.01}; const double goldenMNIST{0.99}; if ((scores[0][0] < goldenMNIST) || (scores[0][1] < goldenMNIST)) { sample::gLogError << "FP32 accuracy is less than 99%: Top1 = " << scores[0][0] << ", Top5 = " << scores[0][1] << "." << std::endl; return sample::gLogger.reportFail(sampleTest); } for (unsigned i = 0; i < topNames.size(); i++) { for (unsigned j = 1; j < dataTypes.size(); j++) { if (scores[j][i] != 0.0f && !isApproximatelyEqual(scores[0][i], scores[j][i], tolerance)) { sample::gLogError << "FP32(" << scores[0][i] << ") and " << dataTypeNames[j] << "(" << scores[j][i] << ") " << topNames[i] << " accuracy differ by more than " << tolerance << "." << std::endl; return sample::gLogger.reportFail(sampleTest); } } } if (!sample.teardown()) { return sample::gLogger.reportFail(sampleTest); } return sample::gLogger.reportPass(sampleTest); }