Snapdragon Neural Processing Engine SDK
Reference Guide

| Version | Date | Description |
|---|---|---|
| 1.42.0 | August 2020 | Removed V60 DSP libs from SNPE SDK. Enabled the AIP runtime support for generating the intermediate outputs from HTA with online compiler. Enabled multithreading for the re-quantize process in DSP runtime. Added optional parameter to set the hysteresis period for sustained high and burst profiles in DSP runtime. Added support for opaque float concat operation in SNPE DSP concat layer. Fixed bug in UserBufferTF8 where retrieving the encoding would always return null. Fixed box decoder performance issue on mobilenet v2 ssd model for DSP runtime. Fixed tanh performance issue by replacing QuantizedTanh_8_ref with QuantizedTanh_8 op in DSP runtime. |
| 1.41.0 | July 2020 | Added MatMul support on the CPU runtime. Added support for new version of 7250 with integrated PMIC module. User Defined Operations (UDO) with weight parameters have been added to demonstrate both quantization and network execution on CPU and DSP runtime cores respectively. Optimized tile Op in DSP runtime, that used 2d memcpy for w-d plane tiling and HVX for tiling along depth. Fixed stack overflow issue in concat layer in DSP runtime. Fixed issue with input for multibatch in DSP runtime. Fixed issue in TF converter that prevented FusedBatchNorm operations from being merged into previous Convolution layer. Fixed DSP crash issue due to stack overflow in Concat layer preparation. |
| 1.40.0 | June 2020 | Added DSP Graph Caching support for AIP models with HVX subnets. Upgraded DSP to use Hexagon SDK 3.5.2 toolchain. Added support for 16 bit UDO layers in DSP. Added support for large average pooling, reduce_mean layer and improved elementwise_mul support for larger tensor size. Fixed the issue with buffer ordering during the execution of batched models on AIP runtime. Fixed issue with SsdDetectionOut when number of classes is only 1. Fixed accuracy issue with Correlation 1D op. Fixed improper processing when 16bit input quantization is used in certain cases. Fixed scaling logic in convert_16 op. |
| 1.39.1 | May 2020 | Fixed the performance regression of Mobilenet SSD model on AIP runtime. |
| 1.39.0 | May 2020 | The SNPE license (LICENSE.pdf) has been updated, please review it for more details. Additionally the REDIST.txt has been removed, as redistribution is covered in the license. Added graph caching support which improves init times for DSP & AIP networks. (DSP subnet within AIP is not supported.) Optimized Prelu to reduce saturation loss during re-quantization at prelu by using cubic approximation in AIP runtime. Fixed the input conversions to allocate the required buffers during initialization itself, to improve the inference time for AIP runtime. Fixed potential bug with freeing threads in DSP runtime. Added additional logging messages for debugging in DSP runtime. Added support for the AIP runtime in the SNPE sample "snpe-sample". Added support for BBox transform layer in Caffe2 converter. Added new opset support in the ONNX converter: ArgMax, ArgMin, Concat, PRelu, ReduceMean, ReduceMax, ReduceMin, ReduceSum, Squeeze, Unsqueeze, MatMul, Flatten, Max, Split, Clip. Added support for the fixed-point version of the MobileNetV3 model with H-Swish neuron in TF converter. Improved support of resizing in Crop layer for TF and Caffe converter by introducing new “counts” parameter. Fixed issue of incorrect UDO tensor datatype in quantizer. Fixed issue with setting the performance profile mode for HTA from AIP runtime in multi-threading use cases that could cause performance to drop. Fixed issue with snpe_bench.py memory profiling. |
| 1.38.0 | April 2020 | Enabled FC/MatMul to use VTCM if available in DSP. Optimized 16-bit MeanVarianceNormalize in DSP runtime. Added support for batchwise scalar divide operation in DSP runtime. Optimized Hard-swish operator for mobilenetV3. Added support for EltwiseMin layer for ONNX converter and CPU runtime. Added support for Onnx BatchNorm layer (OpVer 9, 12) in Onnx Converters. Caffe preprocessing subtract_mean layer is added. If specified, converter will enable preprocessing specified by a data layer transform_param subtract_mean. ONNX softmax converter support only existed for rank <= 2. Support for tensors rank <= 4 was added. Enabled the end-user / developer to request the use of an unsigned process domain to avoid the requirement of signed libraries for SNPE execution on 8250 and newer devices. Removed autoquantization for classes output in MultiClassNMS layer and added support for float addition in ElementwiseOp layer to handle this case. Fixed the issue with enabling stats for AIP runtime on models where number of layers in HTA subnet is more than SNPE layers. Fixed the output conversions to allocate the required buffers during initialization itself in AIP runtime, to improve the inference time. Enabled honoring of padding information from the HTA driver which is pre-computed by AIP runtime earlier, to unblock execution of more models. Fixed the issue with output buffer id while converting depth2space to deconv on HTA. Fixed a bug during graph transformation while folding the batchnorm on HTA. Increased DCVS relaxed sleep latency duration, which lets the power system know that the CDSP can go to a deeper sleep state. If there is no active request for inferencing, it is better for the system to go into a deeper sleep state. |
| 1.37.0 | March 2020 | Enabled the online compiler support for HTA 1.x family of devices. AIP performance profiles behavior is aligned with the DSP runtime for reduced power consumption in case of inference inactivity. ONNX Converter: Added support for Onnx Pad layer (OpVer 11). Added support for the h-swish layer used by MobileNet V3. Removed support for the Generate Proposals, ROI Align, and ROI Proposal layers. Added improved support for the reporting of Exceptions in the Java API. Updated the DSP UDO header file to be compatible with SNPE 1.37.0. The DSP UDO support is updated to be compatible with Hexagon SDK 3.5.1. The network creation action was moved onto another thread to avoid impacting the affinity for the main thread of the calling program. snpe-dlc-info: Fixed a MACs calculation error for the deconvolution layer. Avoid crash on SDM845 and other v65 targets when unable to retrieve VTCM memory. Fixed an issue in the TensorFlow converter where the weights in the Fully Connected layer were incorrectly transposed. Fixed the support for using DSP UDO with the AIP runtime. Previously, the UDO packages would not be properly loaded in the AIP runtime. Fixed DiagLog data for a UDO on GPU, where it did not report proper values for start and stop. Enabled support for Keras batchnorm with empty mean and variance by setting them to default values. Fixed a memory leak when using IsRuntimeAvailable() with the VOLATILE_CHECK for the DSP runtime. |
| 1.36.0 | February 2020 | Added Java API extension to register UDO package with SNPE. snpe-dlc-info now prints the command-line that was used to quantize the DLC if applicable. Added support to handle UDO layers with multiple TF8 outputs with different quantization parameters. Added support for an additional profiling level (moderate) for SNPE benchmarking script and associated snpe-net-run executable for tracking initialization time metrics. Upgraded DSP to use Hexagon SDK 3.5.1 toolchain. Extend Platform Validator to detect HTA API version. Add VOLATILE_CHECK Mode for SNPE DSP Runtime Checking to query runtime availability in each call instead of giving cached result. Performance modes like LOW_POWER_SAVER, HIGH_POWER_SAVER, LOW_BALANCED added for CPU runtime. Fixed bug with propagation of model version during conversion. Fixed the issue with selecting the correct output shape during graph transformation while inserting 1x1 conv2d for different input format. Fixed the issue with allocation of layer descriptor while loading the network on HTA. |
| 1.35.0 | January 2020 | Introduce the User-Defined Operations (UDO) feature. Added support for SDM720G/SM7125. Added support to snpe-throughput-net-run for UserBuffer input tensors (both INT8 and INT16). Input batching support is added for networks that can run completely on AIP runtime. Add support for the tf.stack and tf.unstack ops to the DSP and CPU runtimes. Add support for the tf.stack, tf.unstack, tf.floor, tf.minimum to the TF converter. Fixed some small memory leaks that are seen when repeatedly calling dlopen()/dlclose() on libSNPE.so. Updated the Deconvolution operation on DSP with a new kernel that improves performance on various kernel sizes and strides. Fix ssd_detection CDSP crash on DSP runtime. Updated the HTA to partition the input layer, if it has a connection to a layer that is not included in the same partition. Improved the tiling configuration support for depth wise convolution layer. |
| 1.34.0 | January 2020 | Initial support for ops with 16-bit activations using HTA in both snpe-dlc-quantize and in the SNPE AIP runtime. New option for snpe-net-run to automatically turn unconsumed tensors of the network (tensors that are not inputs to a layer) into network outputs. Fixed inconsistent results on SM8250 in certain cases for depthwise convolutions. Add support for the depth2space operation on the GPU. Using optimized Softmax implementation in AIP networks when input activation has more than 5000 elements. Truncate detection output on DSP to return valid data only. Ensure weights are properly flushed to DDR for use during inference in the DSP runtime. Fix support for NV21 encoding in the DSP runtime. |
| 1.33.2 | November 2019 | Address accuracy issues for Deconvolution in the AIP runtime. Changed behavior of Crop layer resize, so it retains the number of copied elements on each dimension. Make quantizer --override_params work for AIP. Reordered PerformanceProfile_t to be ABI compatible with 1.32.0. Using optimized Softmax implementation in AIP networks when input activation has more than 5000 elements. |
| 1.33.1 | November 2019 | Fixed a build issue that incorrectly removed Symphony. |
| 1.33.0 | November 2019 | New performance modes have been added. snpe-dlc-info adds a summary of the layer types in use in the model. |
| 1.32.0 | Oct 2019 | Add Caffe MVN Layer support in the Caffe Converter, CPU Runtime, and DSP Runtime. snpe-dlc-quantize: Enable the use of quantization parameters calculated during training when using dlc quantizer. To override the SNPE generated quantization parameters pass --override_params to snpe-dlc-quantize. Removed deprecated command line arguments from converters. All three converters now require passing -i/--input_network for model input paths. snpe-dlc-diff: Added command-line option [--diff_by_id/-i] to snpe-dlc-diff. This option allows users to compare 2 models in order (sorted by id). Added support for L2Norm layer to TensorFlow converter. Optimized the DSP performance for the 'Space To Depth' layer. Add support in the Java API for setInitCacheEnabled(), and setStorageDirectory() to enable DLC caching support. Allow graceful recovery after a fastrpc error: recreate the userPD after the cDSP crashes so that the user can continue on the SNPE process with subsequent instances, instead of having to close the SNPE process. Note: all the instances associated with the previous userPD will be lost. snpe-dlc-viewer: Associate each layer type to a fixed color for consistency when using snpe-dlc-viewer. Split the SNPE isRuntimeAvailable method into two separate functions to improve backward compatibility with existing client binaries that were built against the older signature. TF Converter: Fix Elementwise Broadcast support. ONNX Converter: Fixed bug where output dimension was incorrect when keep_dims parameter was set to False for Argmax, ReduceSum and ReduceMax. ONNX Converter: Fixed bug where pad attribute was not properly parsed for Deconv Op. Caffe Converter: Fixed bug when converting SSD-based models when using Python 3. TF Converter: Fixed bug where converter was removing const Op input to reshape op when passed through identity op(s), i.e. const -> identity -> reshape. Fixed bug where getOutputSize() would give the wrong result on output tensors in UserBuffer mode. |
| 1.31.0 | September 2019 | New patterns were added to enable running the CLE algorithm on more op patterns and model architectures. Added Tensorflow converter support for Caffe-style SSD networks. Added support for HeatmapMaxKeypoint layer in the CPU runtime. Added support for ROI Align layer in CPU runtime. Added initial L2Norm layer support in CPU runtime. No support for axis parameter yet: normalization is performed along the inner-most dimension of the input tensor. Support for single-input Concatenation layers was added to CPU, GPU and DSP. Changed determination of number of batch dimensions in the Fully Connected layer so rank greater than 1 is always assumed to mean that there is 1 batch dimension. Removed constraint on the LSTM layer in the GPU runtime that prevented batch mode operation. Added support for Leaky-RELU in the TensorFlow converter. Both the actual Leaky-Relu op and the elementwise op representation are supported and map to SNPE's Prelu op. Added Argmax support to the Caffe converter, and optimized performance on the DSP runtime. Added new column to snpe-dlc-info that displays the supported runtimes for each layer. Fixed an edge case where in certain conditions OpenCL would return CL_INVALID_WORK_GROUP_SIZE. Made isRuntimeAvailable Java API thread-safe. Replace unstable image from sample Android classifier application data set with an image that is more consistent. |
| 1.30.0 | August 2019 | Documentation has been added to reflect the new common converter command line options for input processing; Converters now propagate required batchnorm information for performing quantization optimizations; Support for the new bias correction quantization optimization which adjusts biases by analyzing float vs quantized activation errors and adjusting the model to compensate; ONNX converter now filters single input Concats as no-ops, as SNPE didn’t support them; Converter input processing now uniformly handles different input types and encodings; ONNX converter now supports the ConvTranspose ‘output_padding’ attribute by adding an additional pad layer after the ConvTranspose op; Integrates the latest flatbuffer 1.11 library which brings speed improvements and options for model size reduction; GPU size limitations with the ArgMax op (when setting the keepDims op attribute to false) can be worked around by enabling CPU fallback; Fixed DSP error with MobileNet SSD on QCS403 and QCS405; Fixed the issue with partitioning of deconv layer in HTA |
| 1.29.0 | July 2019 | Added support for dlc reorder tool; Optimization of HTA d32 conversions; Added tf space_to_depth op for SNPE CPU and DSP runtime; Benchmarking scripts enhanced for showing further breakdown of execution time across various components; Added support for additional ONNX binary element-wise ops; Optimized deconv layer for improving performance; Fixed an issue related to runtime error in DSP runtime; Performance Optimization of SNPE GPU Runtime for Shufflenet V2 by using profiling level config |
| 1.28.0 | June 2019 | Added an optional argument to isRuntimeAvailable for the DSP runtime so that it doesn't activate the DSP; Allow UB_T8 and UB_FLOAT output for snpe-net-run; Added a new command line option for snpe-dlc-diff to check layer names; Updated the --dlc argument to --output_path for snpe-caffe-to-dlc to align with the ONNX converter; Added --dry_run argument to snpe-onnx-to-dlc to allow evaluation for successful conversion on an ONNX model; Added support for the gather op in the DSP runtime; Added support to convert the TF MobileNet-V1-FPN-SSD model; Fixed a memory leak in the DSP runtime that is seen when repeatedly loading and unloading a network; Addressed issues on V66 DSPs related to acquiring VTCM memory; Fixed an issue related to multiple inputs for the Caffe converter; Fixed an issue in the TF converter related to element-wise sum and the atrous parameter; Fixed an issue in the TF converter related to tf.crop_and_resize when there are only 2 inputs; Fixed additional cases of uncaught exceptions with the aarch64-android-clang6.0 platform |
| 1.27.0 | May 2019 | Added new API support for setting output tensor names to snpeBuilder and to fetch output tensor names for a given output layer name; Improved the peak memory usage with DLC v3 format; Fixed a few issues with performance and runtime failures on DSP runtime; Fixed a few issues and improved error handling for platform validator; Fixed the issues with Pooling and Instance norm layers of Tensorflow converter; Removed *-android-gcc4.9 platform support. This compiler has been retired for the Android NDK, so all support is transitioning to using Clang for Android; Removed arm-linux-gcc4.8hf platform. The development platform has been retired |
| 1.26.0 | Apr 2019 | Added support for the ONNX Gather Op in the ONNX Converter and CPU runtime; Optimized DeConvolution Layer for the DSP runtime; Support for tf.nn.moments in the TF converter, CPU and DSP runtimes; Added TF Reflect Pad support for the DSP runtime; Add symmetric quantizer option in snpe-dlc-quantize; Add support for batch > 1 when using the Scale Layer on the DSP runtime; Updated Platform Validator python script to be OS-independent; Added additional optimizations for HTA input conversion; |
| 1.25.0 | Mar 2019 | Updated DLC format to improve load time performance and memory consumption. Old DLCs will continue to work as is, but new DLCs generated from 1.25 will use the new format; Added support for optimized MultiClassNms and ArgMax ops on DSP runtime; Added option to request larger memory allocations on the DSP for improved init time, at the expense of more memory use; Improved concurrency for multiple SNPE objects running simultaneously on DSP; Improvements when using priority control on DSP; Added support for channel shuffle and ArgMax in the ONNX converter; Support multiple subnets within the AIP runtime |
| 1.24.0 | Feb 2019 | Adding setProfilingLevel API support for AIP and CPU runtimes; Various stability issues on AIP runtimes are addressed; Added support for Snapdragon 712; Support multiple inputs and multiple outputs on each SNPE AIP subnet |
| 1.23.0 | Jan 2019 | Upgrade to Android NDK r17c to build SNPE; Improving initialization and de-initialization times; Various DSP timing fixes; Addressed some DSP concurrency edge cases that could impact output values; TF converter support for non max suppression, crop and resize Ops |
| 1.22.0 | Nov 2018 | Support for several new ops on DSP runtime; Upgrade to Android NDK r16b to build SNPE; setProfilingLevel API support in DSP runtime; Added new tool snpe-throughput-net-run |
| 1.21.0 | Oct 2018 | Tensorflow converter and CPU runtime support for various ops; DSP runtime support for Eltwise Realdiv and Square ops; GPU support for resize_align_corners layer |
| 1.20.0 | Sep 2018 | Support for QCS605 LE platform; NDK version upgrade to r14b; Tensorflow converter support for elementwise sqrt and softmax with dimension > 2; Platform validation command line tool |
| 1.19.0 | Aug 2018 | ELU op support for Tensorflow/Onnx Converters and CPU/GPU runtimes; BoxWithNMSLimit and BBoxTransform ops support in caffe2 converter; Support for Caffe Power Layer in GPU |
| 1.18.0 | Jul 2018 | Support for pad and elementwise subtraction on GPU; ONNX converter support for shape and pad ops; Tensorflow converter support for additional ops |
| 1.17.0 | Jun 2018 | Support for Scale Layer in Caffe converter and DSP runtime, DSP support for batch>1 and ChannelShuffle, Updated SDK examples for Inception v3 2016 model |
| 1.16.2 | May 2018 | Remove linkage to libstdc++.so in DSP loader libraries |
| 1.16.1 | May 2018 | Remove linkage to libstdc++.so, DSP runtime fixes, fix for 1D BatchNorm |
| 1.16.0 | May 2018 | Batch>1 support (except DSP runtime); layer optimizations for DSP runtime; Caffe2 ChannelShuffle support (except DSP runtime) |
| 1.15.2 | Mar 2018 | Fix for GPU runtime memory leak and reshape to/from 1D |
| 1.15.1 | Apr 2018 | Fix for converter for instance normalization followed by scale |
| 1.15.0 | Apr 2018 | Support for instance normalization for Caffe and Caffe2, MobilenetSSD (Caffe) |
| 1.14.1 | Mar 2018 | Minor fixes |
| 1.14.0 | Mar 2018 | ONNX converter (alpha), multiple enhancements and fixes |
| 1.13.0 | Feb 2018 | GPU and DSP v65 performance improvements. GPU floating point 16 support. |
| 1.12.0 | Jan 2018 | Support for Android LLVM/libc++, MobilenetSSD (TensorFlow) |
| 1.10.1 | Dec 2017 | Fix a bug in the DSP runtime when using mixed userbuffer input types |
| 1.10.0 | Dec 2017 | Support for Mobilenet on DSP, enhanced DSP runtime, Snapdragon Flight Board, updates for UserBuffers |
| 1.8.0 | Nov 2017 | Mobilenet support on CPU and GPU, support for Snapdragon 636 and Android 64 bit |
| 1.6.0 | Oct 2017 | Support for Snapdragon 450, minor updates and fixes |
| 1.4.0 | Aug 2017 | Support for Snapdragon 630, FasterRCNN and ADSP on AGL |
| 1.2.2 | July 2017 | QDN release |
| 1.2.0 | June 2017 | Beta Caffe2 Converter |
| 1.0.2 | May 2017 | Support for 820AGL platform, Snapdragon 660, and Compute DSP on Android |
| 1.0.1 | Apr 2017 | Documentation update only |
| 1.0 | Apr 2017 | |