Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions Related to the "Abnormal" Memory Usage of Binaryen #6239

Open
mobsceneZ opened this issue Jan 25, 2024 · 11 comments
Open

Questions Related to the "Abnormal" Memory Usage of Binaryen #6239

mobsceneZ opened this issue Jan 25, 2024 · 11 comments

Comments

@mobsceneZ
Copy link
Contributor

Hi developers, I am incorporating Binaryen as third-party library into my fuzzing framework. In short, Binaryen is used to parse input Wasm module, do some mutation and generation work and emit a new testcase back in this case.

Everything went fine until I deployed it on a cloud server. The memory usage of my fuzzing framework kept going on and eventually kernel's OOM Killer killed my fuzzing process.

It really took me a lot of effort to figure out the reason: during the mutation or generation stage, our work will generate block/loop names for BinaryenBlock()/BinaryenLoop() C API, which will further call IString::interned to update some static variables:

static std::string_view interned(std::string_view s, bool reuse = true);

The problem is that fuzzing may generate different block/loop names for different input Wasm module, and these names become useless when we have finished the mutation/generation stage for this specific module, but the corresponding block/loop names are not erased from these static variables accordingly! Eventually, it results in excessive memory occupation for maintaining these meaningless names.

So, I wonder if there is any way to erase elements inside these static/global variables on a per-module basis, so that the overall memory usage is affordable.

@kripken
Copy link
Member

kripken commented Jan 25, 2024

Atm Binaryen is optimized for short-running processes, like calling wasm-opt from the commandline to optimize a module. After the optimization the process ends, which is how toolchains (Emscripten, J2CL, Kotlin, etc.) use Binaryen. The specific issue is that we intern strings for speed, and we never remove them (until the process ends). That is simple and efficient, but if you have a very long running process then it will accumulate memory.

Perhaps we could add an option to not intern strings. That would just need a replacement file for istring.h. It would be slower, but avoid this overhead. However, if you can restart your fuzzer process periodically that would be simpler.

@tlively
Copy link
Member

tlively commented Jan 25, 2024

We similarly intern types and never delete them, but we do have a function called destroyAllTypesForTestingPurposesOnly that deletes all the interned types. It's wildly unsafe to use, but I could see it being useful to call between independent fuzzer iterations. Perhaps we could have a similar function for the interned strings?

@mobsceneZ
Copy link
Contributor Author

Atm Binaryen is optimized for short-running processes, like calling wasm-opt from the commandline to optimize a module. After the optimization the process ends, which is how toolchains (Emscripten, J2CL, Kotlin, etc.) use Binaryen. The specific issue is that we intern strings for speed, and we never remove them (until the process ends). That is simple and efficient, but if you have a very long running process then it will accumulate memory.

Perhaps we could add an option to not intern strings. That would just need a replacement file for istring.h. It would be slower, but avoid this overhead. However, if you can restart your fuzzer process periodically that would be simpler.

Hey kripken, I think it may not be necessary to implement a new option that support not interning strings. Instead, like @tlively said in the comment, I wonder if it's acceptable to export a C API to clear up these interned types or strings which could make things easier.
Besides, I'm afraid that restarting fuzzer process may not be a good idea for the effectiveness of fuzzing. Another option for me is to maintain a block/loop name pool where each name can be reused across modules/functions, which as I presume could effectively reduce the memory usage. But it will depend on whether "exporting a C API" option is acceptable for Binaryen community :).

@mobsceneZ
Copy link
Contributor Author

We similarly intern types and never delete them, but we do have a function called destroyAllTypesForTestingPurposesOnly that deletes all the interned types. It's wildly unsafe to use, but I could see it being useful to call between independent fuzzer iterations. Perhaps we could have a similar function for the interned strings?

Yes, functions like destroyAllTypesForTestingPurposesOnly are exactly what I need. I'm curious about why this function is "wildly unsafe to use", is is related to the thread safety?

@kripken
Copy link
Member

kripken commented Jan 26, 2024

I'm not opposed to such an API.

@tlively
Copy link
Member

tlively commented Jan 26, 2024

We similarly intern types and never delete them, but we do have a function called destroyAllTypesForTestingPurposesOnly that deletes all the interned types. It's wildly unsafe to use, but I could see it being useful to call between independent fuzzer iterations. Perhaps we could have a similar function for the interned strings?

Yes, functions like destroyAllTypesForTestingPurposesOnly are exactly what I need. I'm curious about why this function is "wildly unsafe to use", is is related to the thread safety?

Yeah, it's not safe for any thread to be doing anything with types while another thread is calling this function. But more generally, any types from before the call cannot be used after the call, which is a lifetime issue that code using types normally wouldn't need to think about.

@MaxGraey
Copy link
Contributor

MaxGraey commented Jan 26, 2024

How about to use arena or bump allocator for interner? I guess generated names could be quite long and continuously grow in "converge" mode, which may caused to OOM, especially for wasm32 binaryen builds. It would be nice to be able to do a reset for interner state.

@mobsceneZ
Copy link
Contributor Author

We similarly intern types and never delete them, but we do have a function called destroyAllTypesForTestingPurposesOnly that deletes all the interned types. It's wildly unsafe to use, but I could see it being useful to call between independent fuzzer iterations. Perhaps we could have a similar function for the interned strings?

Yes, functions like destroyAllTypesForTestingPurposesOnly are exactly what I need. I'm curious about why this function is "wildly unsafe to use", is is related to the thread safety?

Yeah, it's not safe for any thread to be doing anything with types while another thread is calling this function. But more generally, any types from before the call cannot be used after the call, which is a lifetime issue that code using types normally wouldn't need to think about.

You're right, if there are multiple threads then things could get messy. I do not use multiple threads in my fuzzing framework, so a clean up function is perfectly suitable for me. If that's ok for you to have this kind of C API, I will try adding one and issue a pull request ;).

@mobsceneZ
Copy link
Contributor Author

How about to use arena or bump allocator for interner? I guess generated names could be quite long and continuously grow in "converge" mode, which may caused to OOM especially on for wasm32 binaryen builds. It would be nice to be able to do a reset for interner state.

Hey MaxGraey, I think for this case arena allocator is a pretty nice choice, but I'm not an expert in designing such memory allocators so it may depends on developers.

@tlively
Copy link
Member

tlively commented Jan 29, 2024

@mobsceneZ, sounds good! I look forward to the PR.

@kripken
Copy link
Member

kripken commented Jan 29, 2024

A bump allocator is separately a good idea here. We can use the existing MixedArena which gives exactly what we want. A separate PR with that would also be welcome here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants