dqn_torch_test build failure #1216

Closed
tacertain opened this issue Apr 26, 2024 · 6 comments

Comments

@tacertain
Contributor

I am trying to build OpenSpiel for the first time, using a new Ubuntu 22.04 install under WSL2. I am building with CUDA and libtorch. After a bunch of tinkering, I have gotten to the point where everything builds and only a single test fails:

terminate called after throwing an instance of 'c10::Error'
  what():  masked_fill_ only supports boolean masks, but got mask with dtype int
Exception raised from masked_fill_impl_cpu at ../aten/src/ATen/native/TensorAdvancedIndexing.cpp:1910 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f3af2ac42ac in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f3af2a6dcbc in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libc10.so)
frame #2: <unknown function> + 0x20b2d4d (0x7f3ad8cb2d4d in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #3: at::native::masked_fill__cpu(at::Tensor&, at::Tensor const&, c10::Scalar const&) + 0x49 (0x7f3ad8cb2df9 in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #4: at::_ops::masked_fill__Scalar::call(at::Tensor&, at::Tensor const&, c10::Scalar const&) + 0x16f (0x7f3ad99ad1df in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #5: at::native::masked_fill(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0xd1 (0x7f3ad8cd6ae1 in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x306bcbb (0x7f3ad9c6bcbb in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #7: at::_ops::masked_fill_Scalar::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x92 (0x7f3ad993e592 in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x4d4fd2c (0x7f3adb94fd2c in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x4d5038e (0x7f3adb95038e in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #10: at::_ops::masked_fill_Scalar::call(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x17a (0x7f3ad99a4c9a in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x3fc95f (0x56439005495f in /home/certain/GitHub/open_spiel/build/algorithms/dqn_torch/dqn_torch_test)
frame #12: <unknown function> + 0x3fa7c1 (0x5643900527c1 in /home/certain/GitHub/open_spiel/build/algorithms/dqn_torch/dqn_torch_test)
frame #13: <unknown function> + 0x814cd (0x56438fcd94cd in /home/certain/GitHub/open_spiel/build/algorithms/dqn_torch/dqn_torch_test)
frame #14: <unknown function> + 0x29d90 (0x7f3a80c77d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #15: __libc_start_main + 0x80 (0x7f3a80c77e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #16: <unknown function> + 0x80c95 (0x56438fcd8c95 in /home/certain/GitHub/open_spiel/build/algorithms/dqn_torch/dqn_torch_test)

At some point when I was having more issues, I did get some warning about uint8 vectors being deprecated in some context and to use bools, but I didn't write it down - sorry.

Is this something I should worry about, or something you want me to troubleshoot more directly? I will say that the whole CUDA/PyTorch/WSL marriage seems a little janky, so I am totally willing to believe I screwed something up there.

@lanctot
Collaborator

lanctot commented Apr 27, 2024

Wow, that's amazing. I'm happy that you got this far. We have not been actively supporting libtorch-based code for a while, so it's nice to know it mostly still works (or at least compiles!), and under WSL with CUDA no less.

Ok, so the error reminded me of something I fixed on the Python side when we moved to PyTorch 2. We had to change the way we were doing masks to use booleans. Take a look at the changes in dqn.py here: https://github.com/google-deepmind/open_spiel/pull/1141/files

Maybe we now need to change that on the C++ side. My guess is you're using a LibTorch version that is in line with PyTorch 2, but the last time we tested anything with LibTorch it was likely 1.10 (based on the link here).

If that's the case, then maybe we're in luck and the fix might be as easy as it was on the Python side, but I'm not sure how to do it. But, if you're not planning to use C++ DQN then you're probably fine to ignore it. It'd be great to get the libtorch code working with V2 though.
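
For anyone landing here with the same error, below is a small library-free C++ sketch of the idea behind the message: a masked fill that only accepts a boolean mask, plus a conversion from a 0/1 integer mask. It deliberately does not use libtorch, and the names `ToBoolMask`, `MaskedFill`, and `MaskIllegalActions` are hypothetical, not OpenSpiel or libtorch functions. It is not necessarily what the eventual fix looks like. With real tensors the analogous step would presumably be casting the mask to a boolean dtype before calling `masked_fill`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: turn a 0/1 integer legal-actions mask into a boolean
// mask (true = legal). Any nonzero entry counts as legal.
std::vector<bool> ToBoolMask(const std::vector<int>& int_mask) {
  std::vector<bool> bool_mask(int_mask.size());
  for (std::size_t i = 0; i < int_mask.size(); ++i) {
    bool_mask[i] = (int_mask[i] != 0);
  }
  return bool_mask;
}

// Library-free stand-in for a masked fill: returns a copy of `values` where
// every position with mask[i] == true is overwritten by `fill`.
std::vector<float> MaskedFill(const std::vector<float>& values,
                              const std::vector<bool>& mask, float fill) {
  assert(values.size() == mask.size());
  std::vector<float> out = values;
  for (std::size_t i = 0; i < out.size(); ++i) {
    if (mask[i]) out[i] = fill;
  }
  return out;
}

// Overwrites the Q-values of illegal actions with `illegal_value`, so that an
// argmax over the result avoids them when `illegal_value` is very negative.
std::vector<float> MaskIllegalActions(const std::vector<float>& q_values,
                                      const std::vector<int>& legal_int_mask,
                                      float illegal_value) {
  std::vector<bool> illegal = ToBoolMask(legal_int_mask);
  illegal.flip();  // true now marks the illegal actions
  return MaskedFill(q_values, illegal, illegal_value);
}
```

The point of keeping the conversion in one helper is that the integer mask can stay as it is everywhere else, and only the call site that needs a boolean mask changes.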

@tacertain
Contributor Author

Well, the most obvious translation of the Python change doesn't seem to be what's needed, as the mask is already defined to be a bool (at least at first glance): https://github.com/tacertain/open_spiel/blob/1208f832568063c36d0e3076069e103d5b00cf5b/open_spiel/algorithms/dqn_torch/dqn.cc#L170

I will poke around some more tomorrow. I'm going to try to build unoptimized with symbols, etc., to get a better stack trace. I am not familiar with cmake (other than as a naive user), so if there are any pointers to making it build that way, lemme know. Otherwise I'll probably just try to get the compilation lines out and run them by hand with the right flags.

@tacertain
Contributor Author

I tried changing the build type to Debug in build_and_run_tests.sh, but I still didn't get the symbols in the stack trace. Going to fall back to trying to run the compilation by hand.

@lanctot
Collaborator

lanctot commented Apr 27, 2024

Hi @tacertain,

You need to set this environment variable (BUILD_TYPE), which is checked here:

if(${BUILD_TYPE} STREQUAL "Debug")

But then you have to entirely get rid of the build/ directory and redo from scratch (CMake has to run on a fresh empty build dir).

Then, you need to run dqn_torch_test within gdb. You might not get line numbers inside the libtorch functions unless the library has been built with debug symbols.

@lanctot
Collaborator

lanctot commented Apr 27, 2024

I think you can set the build type to debug directly in CMakeLists.txt too, which might be easier.

@lanctot
Collaborator

lanctot commented May 13, 2024

Fixed by #1219 which is now merged. Thanks @tacertain!

@lanctot lanctot closed this as completed May 13, 2024