core (opencl): CLBlast integration via dynamic loading #25568

Open
wants to merge 7 commits into
base: 4.x

fengyuentau
Member

@fengyuentau fengyuentau commented May 10, 2024

Second commit is all about auto-generated code.

Usage

Get CLBlast:

git clone https://github.com/CNugteren/CLBlast
cmake -B build -S CLBlast -DCMAKE_INSTALL_PREFIX=build/install
cmake --build build --target install -j8

Test with this patch:

git clone https://github.com/fengyuentau/opencv
cd opencv
git checkout clblast_integration

export CLBLAST_INSTALL_DIR=/abs/path/to/CLBLAST-build/install
cmake -B build -DWITH_OPENCL=ON .
cmake --build build --target opencv_test_core opencv_perf_core -j8

export LD_LIBRARY_PATH=/abs/path/to/CLBLAST-build/install/lib # Use DYLD_LIBRARY_PATH on macOS
./build/bin/opencv_test_core --gtest_filter="*OCL_*Gemm*"
./build/bin/opencv_perf_core --gtest_filter="*OCL_GemmFixture_Gemm*"

Performance

Usage example:

python opencv/modules/ts/misc/summary.py opencv_perf_core.gtx1080ti.xml opencv_perf_core.gtx1080ti.clblast.xml

Khadas VIM4 (8GB mem, 32GB disk space) with Mali G52 r1p0

Geometric mean (ms)

                        Name of Test                            opencv            opencv                opencv
                                                                 perf              perf                  perf
                                                             core.mali-g52 core.mali-g52.clblast core.mali-g52.clblast
                                                                                                          vs
                                                                                                        opencv
                                                                                                         perf
                                                                                                     core.mali-g52
                                                                                                      (x-factor)
Gemm::OCL_GemmFixture::(640x640, 0, 32FC1)                      40.510            24.351                 1.66
Gemm::OCL_GemmFixture::(640x640, 0, 32FC2)                      99.486            160.065                0.62
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC1)               42.447            23.615                 1.80
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC2)               102.918           100.531                1.02
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC1)      43.153            24.388                 1.77
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC2)      103.611           99.365                 1.04
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC1)               42.378            25.265                 1.68
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC2)               102.890           156.734                0.66
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC1)      38.043            21.727                 1.75
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC2)      94.870            150.405                0.63
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC1)               36.955            21.274                 1.74
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC2)               93.829            153.478                0.61
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC1)                    290.018           147.040                1.97
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC2)                    776.815           592.293                1.31
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC1)             294.465           146.519                2.01
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC2)             784.642           588.987                1.33
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC1)    295.935           145.909                2.03
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC2)    788.559           590.310                1.34
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC1)             294.613           148.811                1.98
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC2)             784.563           594.052                1.32
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC1)    280.617           137.701                2.04
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC2)    758.959           571.672                1.33
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC1)             278.827           136.011                2.05
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC2)             755.590           567.531                1.33
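
A note on reading these tables: judging from the numbers, the x-factor column appears to be the baseline time divided by the CLBlast time, so values above 1 mean CLBlast is faster. The sketch below recomputes it for single rows. `xFactor` and `round2` are hypothetical helpers written for illustration and are not part of `summary.py`.

```cpp
#include <cmath>
#include <stdexcept>

// Hypothetical helper (not part of OpenCV's summary.py): speedup of the
// candidate run relative to the baseline run. Values above 1 mean the
// candidate (here: the CLBlast build) is faster.
double xFactor(double baselineMs, double candidateMs)
{
    if (!(baselineMs > 0.0) || !(candidateMs > 0.0))
        throw std::invalid_argument("timings must be positive");
    return baselineMs / candidateMs;
}

// Round to two decimals, matching the precision shown in the tables.
double round2(double value)
{
    return std::round(value * 100.0) / 100.0;
}
```

For the first Mali G52 row, `round2(xFactor(40.510, 24.351))` gives 1.66, and for the second row `round2(xFactor(99.486, 160.065))` gives 0.62, matching the table.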

Macbook Air M1 (16GB mem, 512GB disk space)

Accuracy problem with scale >= 1280, but it is OK with scale = 1024.

Geometric mean (ms)

                        Name of Test                         opencv      opencv          opencv
                                                              perf        perf            perf
                                                             core.m1 core.m1.clblast core.m1.clblast
                                                                                           vs
                                                                                         opencv
                                                                                          perf
                                                                                         core.m1
                                                                                       (x-factor)
Gemm::OCL_GemmFixture::(640x640, 0, 32FC1)                    2.248       2.257           1.00
Gemm::OCL_GemmFixture::(640x640, 0, 32FC2)                    9.272       9.889           0.94
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC1)             2.438       2.714           0.90
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC2)             9.434       9.708           0.97
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC1)    2.910       2.764           1.05
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC2)   10.068       8.795           1.14
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC1)             2.585       2.812           0.92
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC2)             9.563       9.202           1.04
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC1)    2.756       2.568           1.07
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC2)    9.506       9.080           1.05
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC1)             2.887       2.640           1.09
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC2)             9.897       9.642           1.03
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC1)                 25.201      23.861           1.06
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC2)                 107.464     107.136          1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC1)          26.242      26.826           0.98
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC2)          108.138     108.599          1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC1) 27.284      27.497           0.99
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC2) 107.704     108.396          0.99
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC1)          26.712      26.136           1.02
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC2)          108.275     108.282          1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC1) 26.257      27.556           0.95
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC2) 109.048     109.098          1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC1)          25.408      25.929           0.98
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC2)          108.337     107.886          1.00

PC with i7-12700K (64GB mem, 1T disk space) with Intel(R) UHD Graphics 770

Accuracy problem with complex (type CV_32FC2).

Geometric mean (ms)

                        Name of Test                           opencv          opencv              opencv
                                                                perf            perf                perf
                                                             core.uhd770 core.uhd770.clblast core.uhd770.clblast
                                                                                                     vs
                                                                                                   opencv
                                                                                                    perf
                                                                                                 core.uhd770
                                                                                                 (x-factor)
Gemm::OCL_GemmFixture::(640x640, 0, 32FC1)                      1.191           1.185               1.01
Gemm::OCL_GemmFixture::(640x640, 0, 32FC2)                      9.739           9.740               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC1)               1.522           1.525               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC2)               9.859           9.851               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC1)      3.854           3.866               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC2)      9.948           9.919               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC1)               1.522           1.522               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC2)               9.863           9.803               1.01
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC1)      1.536           1.529               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC2)      9.819           9.810               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC1)               1.177           1.178               1.00
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC2)               9.735           9.735               1.00
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC1)                    9.300           9.314               1.00
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC2)                   77.424          77.427               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC1)            12.225          12.245               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC2)            78.307          78.342               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC1)   30.315          30.172               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC2)   78.971          79.028               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC1)            11.066          10.987               1.01
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC2)            78.249          78.211               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC1)   11.065          11.014               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC2)   78.147          78.144               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC1)             9.368           9.342               1.00
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC2)            77.360          77.323               1.00

PC with GTX 1080 Ti (12GB gpu mem, CUDA 12.3)

Geometric mean (ms)

                        Name of Test                             opencv             opencv                 opencv
                                                                  perf               perf                   perf
                                                             core.gtx1080ti core.gtx1080ti.clblast core.gtx1080ti.clblast
                                                                                                             vs
                                                                                                           opencv
                                                                                                            perf
                                                                                                       core.gtx1080ti
                                                                                                         (x-factor)
Gemm::OCL_GemmFixture::(640x640, 0, 32FC1)                       0.338              0.310                   1.09
Gemm::OCL_GemmFixture::(640x640, 0, 32FC2)                       0.650              0.483                   1.34
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC1)                0.443              0.308                   1.44
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC2)                0.822              0.484                   1.70
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC1)       0.545              0.287                   1.90
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC2)       0.976              0.517                   1.89
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC1)                0.435              0.292                   1.49
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC2)                0.819              0.499                   1.64
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC1)       0.399              0.294                   1.35
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC2)       0.756              0.503                   1.50
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC1)                0.337              0.309                   1.09
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC2)                0.659              0.482                   1.37
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC1)                     2.211              1.349                   1.64
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC2)                     4.375              3.551                   1.23
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC1)              2.413              0.979                   2.47
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC2)              4.838              3.054                   1.58
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC1)     2.531              1.203                   2.10
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC2)     5.295              3.501                   1.51
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC1)              2.352              1.380                   1.70
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC2)              4.795              3.876                   1.24
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC1)     2.294              1.391                   1.65
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC2)     4.892              3.904                   1.25
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC1)              2.031              1.202                   1.69
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC2)              4.235              3.545                   1.19

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=Linux OpenCL

@fengyuentau
Member Author

Observed problems:

  1. On Intel i7-12700K with Intel(R) UHD Graphics 770: CLBlast has an accuracy problem with complex matrices (type CV_32FC2).
  2. On Apple M1: CLBlast has an accuracy problem if scale >= 1280, but it is OK with scale = 1024.
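
One common way to check for this kind of accuracy problem is to compare the GPU result against a plain CPU reference and look at the largest element-wise difference. The sketch below is only an illustration under that assumption: `gemmReference` and `maxAbsDiff` are hypothetical helpers, not the functions used by the OpenCV tests, and it covers only square, row-major, single-channel float matrices.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical reference: C = A * B for row-major n x n float matrices,
// computed with a naive triple loop on the CPU.
std::vector<float> gemmReference(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 std::size_t n)
{
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    return C;
}

// Largest element-wise absolute difference between two results.
// Returns +infinity when the sizes differ, so a mismatch never passes.
double maxAbsDiff(const std::vector<float>& x, const std::vector<float>& y)
{
    if (x.size() != y.size())
        return std::numeric_limits<double>::infinity();
    double worst = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        worst = std::max(worst, std::fabs(static_cast<double>(x[i]) - y[i]));
    return worst;
}
```

A result downloaded from the device would then be accepted when `maxAbsDiff(gpuResult, gemmReference(A, B, n))` stays below a tolerance chosen for the matrix size and value range.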

Contributor

@opencv-alalek opencv-alalek left a comment

Dynamic loading makes sense if there is strong API versioning in used project.
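
To make the versioning concern concrete, here is a minimal POSIX-only sketch of dynamic loading with an explicit symbol check. It is not the code from this PR: `DynLib` and `hasAllSymbols` are hypothetical names, and the library file name you pass in (for example `libclblast.so`) is an assumption about the install. The symbol names `CLBlastSgemm` and `CLBlastCgemm` come from CLBlast's C API header `clblast_c.h`.

```cpp
#include <dlfcn.h>   // POSIX dlopen / dlsym / dlclose
#include <cstddef>
#include <string>

// Hypothetical RAII wrapper around a dynamically loaded library.
class DynLib
{
public:
    explicit DynLib(const std::string& path)
        : handle_(dlopen(path.c_str(), RTLD_LAZY | RTLD_LOCAL)) {}
    ~DynLib() { if (handle_) dlclose(handle_); }
    DynLib(const DynLib&) = delete;
    DynLib& operator=(const DynLib&) = delete;

    bool loaded() const { return handle_ != nullptr; }

    // Returns nullptr when the library or the symbol is missing.
    void* symbol(const char* name) const
    {
        return handle_ ? dlsym(handle_, name) : nullptr;
    }

private:
    void* handle_;
};

// True only if the library loaded AND exports every required symbol.
// This checks presence only; it cannot detect a changed signature,
// which is why stable API versioning matters for this approach.
bool hasAllSymbols(const DynLib& lib, const char* const* names, std::size_t count)
{
    if (!lib.loaded())
        return false;
    for (std::size_t i = 0; i < count; ++i)
        if (lib.symbol(names[i]) == nullptr)
            return false;
    return true;
}
```

A caller would construct `DynLib lib("libclblast.so")`, check `hasAllSymbols` for the entry points it needs, and fall back to the built-in OpenCL kernels when the check fails. On older glibc versions this needs linking with `-ldl`.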

modules/core/include/opencv2/core/opencl/ocl_defs.hpp Outdated
if(WITH_CLBLAST)
find_path(CLBLAST_INCLUDE_DIR
NAMES clblast_c.h
HINTS ENV CLBLAST_INSTALL_DIR
Contributor

If we want to always use "dynamic loading" then it makes sense to place this header into 3rdparty/include

Member Author

Yes, it may work, but the CLBlast version is bumped occasionally, along with tuned results for different devices. If we put a fixed header in 3rdparty/include, I guess it won't block a newer library unless this file changes significantly, right?

Contributor

There is just no reason to use DIFFERENT versions of

  • header clblast_c.h
  • autogenerated files with functions from another version of clblast_c.h

modules/core/src/matmul.dispatch.cpp Outdated
modules/core/src/matmul.dispatch.cpp Outdated
modules/core/src/matmul.dispatch.cpp Outdated
modules/core/src/matmul.dispatch.cpp
(const cl_mem)B.handle(ACCESS_READ), offsetB, ldb,
(float)beta,
(cl_mem)D.handle(ACCESS_RW), offsetC, ldc,
&queue, NULL);
Contributor

What about async processing?

Member Author

Do you mean processing several calls (to Sgemm, for example) asynchronously?

modules/core/src/ocl.cpp Outdated
modules/core/src/opencl/runtime/opencl_clblast.cpp Outdated
@@ -415,6 +415,9 @@ OCV_OPTION(WITH_OPENCLAMDFFT "Include AMD OpenCL FFT library support" ON
OCV_OPTION(WITH_OPENCLAMDBLAS "Include AMD OpenCL BLAS library support" ON
VISIBLE_IF NOT ANDROID AND NOT IOS AND NOT XROS AND NOT WINRT
VERIFY HAVE_CLAMDBLAS)
OCV_OPTION(WITH_CLBLAST "Include CLBlast library support" ON
VISIBLE_IF TRUE
VERIFY HAVE_CLBLAST)
Contributor

VERIFY

This would break existing build configurations with ENABLE_CONFIG_VERIFICATION

/cc @mshabunin

Contributor

I've checked build with clblast and verification, it works fine:

cmake \
        -DCMAKE_INSTALL_PREFIX=install \
        -DWITH_QT=ON \
        -DWITH_1394=OFF \
        -DWITH_JASPER=OFF \
        -DWITH_OPENCLAMDFFT=OFF \
        -DWITH_OPENCLAMDBLAS=OFF \
        -DWITH_LAPACK=OFF \
        -DWITH_CLBLAST=ON \
        -DENABLE_CONFIG_VERIFICATION=ON \
        ../opencv

...

--   OpenCL:                        YES (CLBlast INTELVA)
--     Include path:                /work/opencv/3rdparty/include/opencl/1.2 /usr/include
--     Link libraries:              Dynamic load

...

-- Verifying WITH_CLBLAST=ON => 'HAVE_CLBLAST'=TRUE

...

BTW, I observe several identical warnings in matmul.dispatch.cpp (GCC 11, Ubuntu 22):

/opencv/modules/core/src/matmul.dispatch.cpp:147:31: warning: type qualifiers ignored on cast result type [-Wignored-qualifiers]
  147 |                               (const cl_mem)A.handle(ACCESS_READ), offsetA, lda,

...

/opencv/modules/core/src/matmul.dispatch.cpp:184:31: warning: type qualifiers ignored on cast result type [-Wignored-qualifiers]
  184 |                               (const cl_mem)B.handle(ACCESS_READ), offsetB, ldb,
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Member Author

BTW, I observe several identical warnings in matmul.dispatch.cpp (GCC 11, Ubuntu 22):

/opencv/modules/core/src/matmul.dispatch.cpp:147:31: warning: type qualifiers ignored on cast result type [-Wignored-qualifiers]
  147 |                               (const cl_mem)A.handle(ACCESS_READ), offsetA, lda,

...

/opencv/modules/core/src/matmul.dispatch.cpp:184:31: warning: type qualifiers ignored on cast result type [-Wignored-qualifiers]
  184 |                               (const cl_mem)B.handle(ACCESS_READ), offsetB, ldb,
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Warnings are resolved by removing const.
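
For context on why dropping `const` is safe: OpenCL's `cl_mem` is a typedef for a pointer to an opaque struct, so `const cl_mem` makes the handle itself const rather than the memory it refers to, and a top-level `const` on a cast result is discarded anyway. The sketch below illustrates this with stand-in types so that it compiles without the OpenCL headers. `FakeMemObject`, `FakeMem`, `fromRaw` and `asRaw` are hypothetical names.

```cpp
#include <type_traits>

// Stand-in for OpenCL's handle type. The real cl_mem is likewise a
// typedef for a pointer to an opaque struct.
struct FakeMemObject;
typedef FakeMemObject* FakeMem;

// const on the typedef qualifies the pointer itself, not the pointee...
static_assert(std::is_same<const FakeMem, FakeMemObject* const>::value,
              "const FakeMem is a const pointer");
// ...so it is NOT a pointer to a const object.
static_assert(!std::is_same<const FakeMem, const FakeMemObject*>::value,
              "const FakeMem is not pointer-to-const");

// The fix used in the PR amounts to casting to the plain handle type.
// A void* (like the value returned by handle()) converts directly.
FakeMem fromRaw(void* raw)
{
    return static_cast<FakeMem>(raw);
}

// A by-value parameter ignores top-level const on the argument anyway.
void* asRaw(FakeMem mem)
{
    return static_cast<void*>(mem);
}
```

Since the value is passed by value into the BLAS call, the qualifier never had any effect, which is what `-Wignored-qualifiers` reports.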

@vpisarev
Contributor

@fengyuentau, from the patch I can conclude that we need only a small portion of CLBlast. Can we extract a subset of CLBlast, put it into opencv/3rdparty and link it to OpenCV (i.e. not use dynamic loading, which is much less convenient for end users)? Also, I believe we need to solve the problems with Mac and Intel somehow. I remember you said (and I also see it in the performance charts) that the current Intel version of gemm in OpenCV is faster than CLBlast; maybe we should keep the Intel version.

@asmorkalov
Contributor

@fengyuentau Thanks a lot for the effort! The PR was discussed at the OpenCV Core team meeting, and the conclusion is the following:

  • Do not implement dynamic dependency for now.
  • Use find_package or an alternative to find CLBlast as a dependency and build against an external library instance.
  • Do not put it into 3rdparty for now, since we have trouble with the most popular platforms: Intel and Apple ARM.

@fengyuentau
Member Author

we have troubles with the most popular platforms: Intel and Apple ARM.

I ran several tests on the CLBlast accuracy problem. It turns out that CLBlast gives incorrect results when it uses the tuning results for these platforms, and gives correct results once those tuning results are reverted. See my repo for testing: https://github.com/fengyuentau/test-clblast.

@asmorkalov asmorkalov added this to the 4.11.0 milestone Jun 3, 2024
@asmorkalov
Contributor

@fengyuentau What is the PR status? What are the next steps here?

@fengyuentau
Member Author

we have troubles with the most popular platforms: Intel and Apple ARM.

I ran several tests on the CLBlast accuracy problem. It turns out that CLBlast gives incorrect results when it uses the tuning results for these platforms, and gives correct results once those tuning results are reverted. See my repo for testing: https://github.com/fengyuentau/test-clblast.

Upstream has fixed the accuracy problem both on Intel GPU and Apple M1. Performance results are updated.

@fengyuentau
Member Author

@fengyuentau What is the PR status? What are the next steps here?

@asmorkalov We may need to discuss once more whether the integration should be done via dynamic loading, since the library itself is updated quite often with tuned parameters for different platforms. It has stable APIs, and if the integration is done via dynamic loading, users only need to upgrade CLBlast and do not need to rebuild OpenCV.
