Thanks for the feedback @tornikeo; indeed, it's better to have docs covering CuPy internals. Here are quick answers:
> How does the translation of a Python function to a CUDA `__global__` function occur?
This depends on the function. Some are backed by `ElementwiseKernel`, which is translated like this. Some are backed by `RawModule` (i.e., raw CUDA code), like this. `cupyx.jit` translates a user's Python function to CUDA source by traversing its AST.
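As a toy illustration of the AST-traversal idea (this is a sketch, not CuPy's actual translator; the kernel signature and indexing scheme here are my own simplifications), a minimal translator could walk the function's AST and rewrite parameter references into per-thread element accesses:

```python
import ast

class IndexParams(ast.NodeTransformer):
    """Rewrite references to kernel parameters (e.g. `x`) into
    per-thread element accesses (e.g. `x[i]`)."""
    def __init__(self, params):
        self.params = set(params)

    def visit_Name(self, node):
        if node.id in self.params:
            return ast.Subscript(
                value=ast.Name(id=node.id, ctx=ast.Load()),
                slice=ast.Name(id="i", ctx=ast.Load()),
                ctx=ast.Load(),
            )
        return node

def translate(source):
    """Turn a simple `def f(a, b): return <expr>` into a CUDA
    __global__ kernel string (toy sketch, single return expression only)."""
    fn = ast.parse(source).body[0]                  # the FunctionDef node
    params = [a.arg for a in fn.args.args]
    ret_expr = IndexParams(params).visit(fn.body[0].value)
    ast.fix_missing_locations(ret_expr)
    body = ast.unparse(ret_expr)                    # e.g. "x[i] + y[i]"
    args = ", ".join(f"const float* {p}" for p in params)
    return (
        f"__global__ void {fn.name}({args}, float* out, int n) {{\n"
        f"    int i = blockDim.x * blockIdx.x + threadIdx.x;\n"
        f"    if (i < n) out[i] = {body};\n"
        f"}}\n"
    )

cuda_src = translate("def add(x, y):\n    return x + y\n")
print(cuda_src)
```

The real `cupyx.jit` handles control flow, typing, and many more node kinds, but the core mechanism is the same: visit AST nodes and emit corresponding CUDA source.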
> What happens with the variables referenced from the outer context of the `cuda.jit`-ed functions? How are they made available to each thread?
`cupyx.jit` requires all referenced variables to be given as inputs.
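The constraint can be checked with plain CPython introspection. The sketch below (my own illustration, not CuPy's actual validation code) rejects a function that closes over an enclosing-scope variable, which is the kind of capture that must instead be passed as an explicit argument:

```python
def check_no_captures(func):
    """Reject functions that capture variables from an enclosing scope.
    Sketch of the constraint described above: everything a kernel reads
    must arrive as an explicit input."""
    free = func.__code__.co_freevars
    if free:
        raise ValueError(f"captured variables must be passed as inputs: {free}")
    return func

def kernel_ok(x, scale):        # scale is an explicit input -> accepted
    return x * scale

def make_bad():
    s = 2.0
    def kernel_bad(x):          # captures s from the enclosing scope -> rejected
        return x * s
    return kernel_bad
```

Passing `scale` as an argument means each thread receives it as a kernel parameter rather than relying on Python closure state, which has no meaning on the device.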
> What is the real number and size of parameters sent to each thread?
The block size is 128 for `ElementwiseKernel`s and 512 for `ReductionKernel`s.
> At which stage does the `nvcc` get called? With what args?
For most functions, NVRTC (think of it as a library version of `nvcc`) is called for compilation, which happens on the first invocation.
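The compile-on-first-invocation behavior amounts to a cache keyed by kernel source: the first lookup pays the compilation cost, later lookups reuse the result. A minimal in-memory sketch (CuPy additionally keeps an on-disk cache of compiled binaries; that detail is omitted here, and the `_compile` stand-in below is not a real NVRTC call):

```python
class KernelCache:
    """Sketch of compile-on-first-use caching."""
    def __init__(self):
        self._cache = {}
        self.compile_count = 0   # instrumentation for the sketch

    def _compile(self, source):
        # Stand-in for invoking NVRTC on the CUDA source.
        self.compile_count += 1
        return f"<binary for source hash {hash(source)}>"

    def get(self, source):
        # Compile only if this exact source has not been seen before.
        if source not in self._cache:
            self._cache[source] = self._compile(source)
        return self._cache[source]
```

This is why the first call to a CuPy operation can be noticeably slower than subsequent calls with the same dtypes and shapes: the compilation happens once, then the cached kernel is reused.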
Description
There is no documentation on how CuPy works, end-to-end. Explanations for:

- how the translation of a Python function to a CUDA `__global__` function occurs,
- what happens with the variables referenced from the outer context of `cuda.jit`-ed functions, and how they are made available to each thread,
- at which stage `nvcc` gets called, and with what arguments,

would greatly help incoming developers to see kernel issues before they arise. For example: why do I get a 1024 block size, but not a 512 block size?
Idea or request for content
No response