GPU-memory-access-visualization

A single header CUDA library that allows logging individual memory accesses of a GPU kernel with as few changes to the code as possible. An export to JSON together with a web-based visualization is included and allows for easy analysis of the memory access patterns.
View Documentation for the visualization · Report Bug · Request Feature

About The Project

[Screenshot: Application Preview]

This repository contains a single header CUDA library that allows logging individual memory accesses of a GPU kernel with as few changes to the code as possible. Internally, the library uses CUDA Unified Memory to store the memory access information, so please make sure your target architecture supports this feature. The library was tested on a GeForce RTX 2070 Ti and on a GTX 1060.
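
If you are unsure whether your device supports Unified Memory, you can query it through the standard CUDA runtime API (this check is independent of the library):

// Returns 1 if the device supports allocating managed (unified) memory
int managedSupported = 0;
cudaDeviceGetAttribute(&managedSupported, cudaDevAttrManagedMemory, 0 /* device id */);
if (managedSupported == 0) {
    // Device 0 does not support Unified Memory, so the library cannot log on it
}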

The overall design goal was to require as few changes to the existing code as possible. The concrete usage is shown in the Usage section.

The library takes care of storing all memory accesses made through the provided data structure (which behaves almost exactly like a normal array). Besides that, the library provides functionality to write this data to the filesystem. By default, an extremely basic HTML output is provided together with a JSON output. The JSON output can be used to create a custom visualization; this repository already includes one application which can visualize this data in a browser. You can find that application in the html folder together with documentation on how to use it. For convenience, the releases section already contains a pre-built version of the application.

Should these data formats not be sufficient for your needs, you can easily pass a custom callback function to the library. This callback function is called for every memory access and can be used to store the data in any format you like. To get an idea of how to use it (as it is not documented), you can take a look inside the generateTemplatedOutput function.

(back to top)

Usage

This section shows the most basic usage of the library with CUDA code. If you instead want to know how to use the visualization, please take a look at the documentation here.

For complete working examples, you can also take a look at the examples folder. It contains multiple simple CUDA applications that already use the library (files ending with _original are the CUDA files before they were adapted to use the library).

To use the library, just include the header file cuda_mav.cuh in your project. The library is a single header file and does not require any additional files to be included.
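
For example (the include path below is a hypothetical project layout, not prescribed by the library):

// In your .cu file:
#include "cuda_mav.cuh"

// Build as usual; only the include path needs to point at the header, e.g.:
// nvcc -I./include my_kernel.cu -o my_kernel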

Simple Example

Example Kernel

Let's assume we have the following code right now:

__global__ void kernel(int* data, int* result) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    result[idx] = data[idx] * 2;
}

int main() {
    // Data on the host
    int* h_data = new int[100];
    int* h_result = new int[100];
    
    // Data on the device
    int* d_data;
    int* d_result;
    
    // Allocate memory on the device
    cudaMalloc(&d_data, 100 * sizeof(int));
    cudaMalloc(&d_result, 100 * sizeof(int));
    
    // Copy data to the device
    cudaMemcpy(d_data, h_data, 100 * sizeof(int), cudaMemcpyHostToDevice);
    
    // Execute the kernel
    kernel<<<10, 10>>>(d_data, d_result);
    
    // Copy data back to the host
    cudaMemcpy(h_result, d_result, 100 * sizeof(int), cudaMemcpyDeviceToHost);
    
    // Free the device memory
    cudaFree(d_data);
    cudaFree(d_result);
    
    // Free the host memory
    delete[] h_data;
    delete[] h_result;
}

We can leave the old code almost entirely intact; we only need to slightly change the kernel and the kernel call.

Creating an instance to store the memory

The first thing you need to do is create a new instance of CudaMemAccessStorage. This class stores all the memory access information, and you need to pass in the expected number of memory accesses it should hold. This is mainly limited by your available host memory, as the library internally uses CUDA Unified Memory to store the data.

// We are using auto with make_unique here to automatically free the memory when the scope ends
// You can also use a normal pointer and free it manually
auto memAccessStorage = std::make_unique<CudaMemAccessStorage<int>>(100*2);
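
For the example kernel, the 100*2 above is easy to derive: the launch uses 10 blocks with 10 threads each, so 100 threads run, and every thread performs one read (data[idx]) and one write (result[idx]), giving 200 expected accesses.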

Wrapping the data arrays

After that, we need to wrap our data arrays in the custom class CudaMemAccessLogger, providing the data pointer, the length of the array, a custom description / name for the visualization, and a reference to a CudaMemAccessStorage instance. This class then intercepts all accesses to the original data using a proxy, which forwards the memory access information to the CudaMemAccessStorage instance.

// Wrap the (device) data arrays in the CudaMemAccessLogger class
// Dereference the smart pointer to pass the storage by reference
auto data = CudaMemAccessLogger<int>(d_data, 100, "Input data", *memAccessStorage);
auto result = CudaMemAccessLogger<int>(d_result, 100, "Result data", *memAccessStorage);
// Note: CudaMemAccessLogger itself does not allocate any memory, so the plain
// stack objects shown here are fine and no smart pointer is needed

Changing the kernel

Next we need to change the kernel slightly to take in the CudaMemAccessLogger instances instead of the original data arrays.

// This is all you need to change
__global__ void kernel(CudaMemAccessLogger<int> data, CudaMemAccessLogger<int> result) {

Additionally, we now need to change the call to the kernel to pass in the wrapped data arrays instead of the original ones.

kernel<<<10, 10>>>(data, result);

Getting the data

Now the code continues to work as expected, but you still need to retrieve the stored access data.

For this you can use the generateJSONOutput function, which is what you will want in most cases, as it produces a single JSON file that you can then drag and drop into the web-based visualization.

// Get the data from the storage
// Make sure the kernel has finished executing before calling this function
memAccessStorage->generateJSONOutput("./my_data_output.json");

If you instead want to use the HTML template, which directly embeds the data in the HTML file (warning: loading is then much slower), the code would look like this:

memAccessStorage->generateTemplatedOutput("./path_to_template_file.html", "./path_to_output_file.html", CudaMemAccessStorage<int>::parseDataForJSPage);
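
The third argument, parseDataForJSPage, is presumably also where the custom callback mentioned in the About section would be plugged in: passing your own function instead of it lets you control the generated output format.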

Full example

Click below to open the spoiler and see the full example code.

Full example code
#include <memory>
#include "cuda_mav.cuh"

__global__ void kernel(CudaMemAccessLogger<int> data, CudaMemAccessLogger<int> result) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    result[idx] = data[idx] * 2;
}

int main() {
    // Data on the host
    int* h_data = new int[100];
    int* h_result = new int[100];
    
    // Data on the device
    int* d_data;
    int* d_result;
    
    // Allocate memory on the device
    cudaMalloc(&d_data, 100 * sizeof(int));
    cudaMalloc(&d_result, 100 * sizeof(int));
    
    // Copy data to the device
    cudaMemcpy(d_data, h_data, 100 * sizeof(int), cudaMemcpyHostToDevice);
    
    auto memAccessStorage = std::make_unique<CudaMemAccessStorage<int>>(100*2);
    auto data = CudaMemAccessLogger<int>(d_data, 100, "Input data", *memAccessStorage);
    auto result = CudaMemAccessLogger<int>(d_result, 100, "Result data", *memAccessStorage);
    
    // Execute the kernel
    kernel<<<10, 10>>>(data, result);
    
    // Copy data back to the host
    cudaMemcpy(h_result, d_result, 100 * sizeof(int), cudaMemcpyDeviceToHost);
    
    // Store the memory access data
    memAccessStorage->generateJSONOutput("./my_data_output.json");
    
    // Free the device memory
    cudaFree(d_data);
    cudaFree(d_result);
    
    // Free the host memory
    delete[] h_data;
    delete[] h_result;
}

As you can see, only a handful of lines were added (the two includes, the storage instance, the two wrappers, and the output call), and only two lines were changed (the kernel signature and the kernel launch). The remaining code is the same as before.

(back to top)

Gotchas

There are some things you need to watch out for when using this library. The two main ones concern how the submitted arrays are wrapped.

Passing a CudaMemAccessLogger pointer to a kernel

If you pass a pointer to a CudaMemAccessLogger instance instead of the instance itself, you need to dereference the pointer before using it; otherwise the subscript operator of the wrapper is not called and the wrong memory is accessed.

The previously shown example code would then look like this:

__global__ void kernel(CudaMemAccessLogger<int>* data, CudaMemAccessLogger<int>* result) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    (*result)[idx] = (*data)[idx] * 2;
}
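
For completeness, here is a minimal sketch of a matching host side. This is not from the original example and the allocation strategy is an assumption: the pointer handed to the kernel must itself be dereferenceable on the device, which CUDA Unified Memory provides.

#include <new> // placement new

// Allocate space for the loggers in Unified Memory, then construct them in place
CudaMemAccessLogger<int>* dataPtr;
cudaMallocManaged(&dataPtr, sizeof(CudaMemAccessLogger<int>));
new (dataPtr) CudaMemAccessLogger<int>(d_data, 100, "Input data", *memAccessStorage);

CudaMemAccessLogger<int>* resultPtr;
cudaMallocManaged(&resultPtr, sizeof(CudaMemAccessLogger<int>));
new (resultPtr) CudaMemAccessLogger<int>(d_result, 100, "Result data", *memAccessStorage);

kernel<<<10, 10>>>(dataPtr, resultPtr);

// ... after the kernel: destroy the loggers and free the managed memory
dataPtr->~CudaMemAccessLogger();
cudaFree(dataPtr);
resultPtr->~CudaMemAccessLogger();
cudaFree(resultPtr);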

Doing operations besides just assignment

The wrapper class only implements assignment from the templated type or from another instance of the wrapper class. This means you can only use the = operator. If you, for example, want to use the ++ operator, you have to change your kernel from:

__global__ void kernel(CudaMemAccessLogger<int> data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    data[idx]++;
}

to this:

__global__ void kernel(CudaMemAccessLogger<int> data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    data[idx] = data[idx] + 1;
}
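
The same rewrite applies to other read-modify-write operators such as += or -= (assuming, as above, that only plain assignment is implemented):

// data[idx] += 5;           // does not compile with the wrapper
data[idx] = data[idx] + 5;   // equivalent: explicit read, then plain assignment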

Synchronizing the device

The library does not synchronize the device after a kernel call. This means that before you retrieve the data from the device, you need to synchronize manually. You do this either by explicitly calling cudaDeviceSynchronize() before using any of the CudaMemAccessStorage output functions, or by placing the call to, for example, generateJSONOutput after a synchronous memory operation like cudaMemcpy.
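
Both variants, using the example from above, look like this:

// Variant 1: synchronize explicitly before exporting
kernel<<<10, 10>>>(data, result);
cudaDeviceSynchronize();
memAccessStorage->generateJSONOutput("./my_data_output.json");

// Variant 2: a synchronous memory operation such as cudaMemcpy
// already blocks until the kernel has finished
kernel<<<10, 10>>>(data, result);
cudaMemcpy(h_result, d_result, 100 * sizeof(int), cudaMemcpyDeviceToHost);
memAccessStorage->generateJSONOutput("./my_data_output.json");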

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MPL-2.0 License. See LICENSE.md for more information.

(back to top)
