QML

OpenMP

QML uses the OpenMP runtime for parallelism. More information about OpenMP can be found at http://openmp.org/wp/. In particular, OpenMP environment variables will affect the behavior of QML.

QML Directory Structure

The QML directory structure is organized in the following way:

/<plat>/<arch>/<interface>/<toolchain>/

Where possible values for each directory are:

plat: android, linux, macos, windows

arch: arm32, aarch64, x64

interface: lp64, ilp64

toolchain: ndk-r10e, ndk-r11, gcc-4.x, gcc-5.x, gcc-4.9-ubuntu-12.04, msvc, clang

Once the platform, architecture, interface, and supported toolchain are selected, the final directory will contain the following subdirectories:

/examples

/include

/lib
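The layout above can be sketched as a small shell helper that assembles the path and rejects component names that are not listed. This is only an illustration: the function name qml_dir and the install root /opt/qml are hypothetical, the toolchain name is passed through unchecked, and not every combination of components necessarily ships in a given package.

```shell
# Hypothetical helper (not part of QML): prints
#   <root>/<plat>/<arch>/<interface>/<toolchain>
# after checking plat, arch and interface against the values listed above.
qml_dir() {
    qml_root=$1
    qml_plat=$2
    qml_arch=$3
    qml_iface=$4
    qml_toolchain=$5

    case $qml_plat in
        android|linux|macos|windows) ;;
        *) echo "unknown plat: $qml_plat" >&2; return 1 ;;
    esac
    case $qml_arch in
        arm32|aarch64|x64) ;;
        *) echo "unknown arch: $qml_arch" >&2; return 1 ;;
    esac
    case $qml_iface in
        lp64|ilp64) ;;
        *) echo "unknown interface: $qml_iface" >&2; return 1 ;;
    esac

    # The toolchain directory name is used as given.
    printf '%s/%s/%s/%s/%s\n' \
        "$qml_root" "$qml_plat" "$qml_arch" "$qml_iface" "$qml_toolchain"
}

# Example: locate the Linux x64 LP64 build for gcc-5.x under an assumed root.
qml_dir /opt/qml linux x64 lp64 gcc-5.x
```

The resulting directory is where the examples, include, and lib subdirectories would be found.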

ILP64 and LP64

The BLAS specification uses 32-bit integers for everything from the dimensions of matrices to the stride between elements in a dot product. This is sufficient for most problems and for 32-bit architectures. However, certain classes of problems require more bits to express values such as the length of the input vectors when calling AXPY or DOT. This has led the community to support both ILP64 and LP64, even on 64-bit architectures. QML also supports both, with the appropriate libraries in either the ilp64 or the lp64 folder. To aid development on diverse platforms, qml_types.h provides the type qml_long. If the header from the ilp64 folder is used, qml_long is 32 bits on 32-bit platforms and 64 bits on 64-bit platforms. If the header from the lp64 folder is used, qml_long is 32 bits on all platforms.
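As a rough illustration of when the ilp64 folder becomes necessary on a 64-bit platform, the sketch below compares a vector length or matrix dimension against the largest value a 32-bit integer can hold. It assumes the 32-bit integers are signed, so the limit is 2^31 - 1 = 2147483647; the function name qml_interface_for is hypothetical.

```shell
# Largest value of a signed 32-bit integer (assumption: the integers are signed).
INT32_MAX=2147483647

# Hypothetical helper: prints the interface folder able to express the
# given length or dimension on a 64-bit platform.
qml_interface_for() {
    if [ "$1" -gt "$INT32_MAX" ]; then
        echo ilp64
    else
        echo lp64
    fi
}

qml_interface_for 1000000       # fits in 32 bits
qml_interface_for 3000000000    # needs a 64-bit qml_long
```

Note that on 32-bit platforms qml_long remains 32 bits with either header, so such lengths cannot be expressed there.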

Using QML on Android

In order to run QML on Android, the appropriate QML shared library will need to be copied to the device in addition to the program that uses QML. If the device is rooted, the appropriate place for the 32-bit (armv7-a) version of QML is /system/vendor/lib/. The 64-bit (aarch64/armv8-a) version of QML belongs in /system/vendor/lib64/. Additionally, the 32-bit (armv7-a) version of QML is compiled for Android API level 19 (Android 4.4), whereas the 64-bit (aarch64/armv8-a) version of QML is compiled for Android API level 21 (Android 5.0).
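A minimal sketch of the copy step on a rooted device follows. The helper name qml_android_libdir and the variable QML_LIB are hypothetical; QML_LIB stands for the path to the QML shared library in your package, whose file name is not given here. The adb commands are shown as comments because they need a connected device, and remounting may or may not be required depending on the device.

```shell
# Hypothetical helper: maps the QML arch directory name to the
# on-device library directory described above.
qml_android_libdir() {
    case $1 in
        arm32)   echo /system/vendor/lib ;;
        aarch64) echo /system/vendor/lib64 ;;
        *)       echo "unsupported Android arch: $1" >&2; return 1 ;;
    esac
}

# Possible use on a rooted device (requires adb and a connected device):
#   adb root && adb remount
#   adb push "$QML_LIB" "$(qml_android_libdir aarch64)/"
```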

Environment Variables

Three environment variables are available to control the threadpool used by QML:

OMP_NUM_THREADS=<N>
Controls the number of threads in the OpenMP threadpool at launch time.
QML_NUM_THREADS=<N>
Controls the degree of parallelism of QML functions without affecting the size of the OpenMP threadpool. The number of tasks launched by QML is the minimum of QML_NUM_THREADS and OMP_NUM_THREADS.
QML_HET_RATIO=<little:big[:prime]>
Controls the work distribution ratio, in percent, of parallel operations such as GEMM. The first number applies to the first CPU cluster (e.g. the LITTLE cluster), the second to the big cluster, and the optional third to the prime cluster, if present. The numbers must be integers and must sum to 100. If the value is badly formatted, the default partitioning is used, which may not be optimal for all SoCs.
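The rules stated for these variables can be checked before launching a program. The sketch below is not QML's own parser; both function names are hypothetical, and the ratio check assumes the numbers are non-negative decimal integers. Since a badly formatted QML_HET_RATIO falls back to the default partitioning, a check like this may help catch typos early.

```shell
# Hypothetical helper: prints min(QML_NUM_THREADS, OMP_NUM_THREADS),
# i.e. the number of tasks QML launches according to the text above.
# Usage: qml_effective_tasks <QML_NUM_THREADS> <OMP_NUM_THREADS>
qml_effective_tasks() {
    if [ "$1" -lt "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# Hypothetical helper: succeeds only if the argument looks like
# little:big or little:big:prime with integers that sum to 100.
qml_het_ratio_ok() {
    printf '%s\n' "$1" | awk -F: '
        {
            ok = (NF == 2 || NF == 3)
            for (i = 1; i <= NF; i++) {
                if ($i !~ /^[0-9]+$/) ok = 0   # reject empty or non-numeric fields
                sum += $i
            }
            exit !(ok && sum == 100)           # exit status 0 means valid
        }'
}

# Example: validate a ratio before exporting it.
if qml_het_ratio_ok "30:70"; then
    export QML_HET_RATIO=30:70
fi
```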

On x86-based systems, oversubscription can be a problem for compute-bound workloads because it causes cache thrashing, which severely degrades performance. To prevent this, set the number of threads with OMP_NUM_THREADS or QML_NUM_THREADS to the number of physical cores present rather than the default number of reported cores.

Example (Linux x86)

Assume an x86-based machine that supports hyperthreading. The machine has X real cores, but the system reports N = 2X cores. On modern Linux systems, the first N/2 reported cores are typically distinct physical cores, and the next N/2 are their hyperthread siblings. To avoid oversubscription for compute-bound workloads (matrix multiply or GEMM), the threadpool should be no larger than the number of real cores on the machine. The threads should also not migrate in the middle of a computation, as that causes cache thrashing as well. To achieve both, we combine a tool provided on Linux (taskset) with the environment variables used by OpenMP. For instance, assume the machine has 4 real cores. To invoke a program that calls QML GEMM, we would do the following:

OMP_NUM_THREADS=4 taskset -c 0-3 <program>
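When the number of real cores is not known in advance, it can be derived from lscpu instead of being hard-coded. The sketch below is one possible approach, not something QML provides: the function name count_physical_cores is hypothetical, and it counts the unique core,socket pairs in the parseable output of lscpu. The commented launch line additionally relies on the core numbering described above, which should be verified on the target machine.

```shell
# Hypothetical helper: reads "lscpu -p=CORE,SOCKET" style output on stdin
# and prints the number of distinct physical cores.
count_physical_cores() {
    # Skip comment lines and blank lines, count each core,socket pair once.
    awk '!/^#/ && NF && !seen[$0]++ { n++ } END { print n + 0 }'
}

# Possible use on a Linux machine with lscpu and taskset installed:
#   cores=$(lscpu -p=CORE,SOCKET | count_physical_cores)
#   OMP_NUM_THREADS=$cores taskset -c "0-$((cores - 1))" <program>
```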

Additional Information

Some additional information can be found in the release notes.

Questions and Problems

Please send any suggestions, questions, bugs, or integration issues to qml@qti.qualcomm.com.