
Ensure safety of indirect dispatch #5714

Open
wants to merge 5 commits into base: trunk
Conversation

teoxoy
Member

@teoxoy teoxoy commented May 17, 2024

Ensure safety of indirect dispatch by injecting a compute shader that validates the content of the indirect buffer.

Part of #2431.
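The injected WGSL shader itself isn't shown in this excerpt. As a rough illustration only, here is a plain-Rust sketch of the check such a validation pass performs on the three `u32` values of an indirect dispatch buffer; the zeroing policy (turning an out-of-range dispatch into a no-op) is an assumption here, and the PR's actual shader and behavior may differ:

```rust
/// Hypothetical sketch (not the PR's actual WGSL shader): validate the
/// three workgroup counts read from an indirect buffer against the
/// device's `max_compute_workgroups_per_dimension` limit. If any
/// dimension exceeds the limit, zero the dispatch so it becomes a no-op
/// instead of undefined behavior.
fn validate_dispatch(src: [u32; 3], max_per_dimension: u32) -> [u32; 3] {
    if src.iter().all(|&d| d <= max_per_dimension) {
        src
    } else {
        [0, 0, 0]
    }
}

fn main() {
    // 65535 is the default limit value mentioned later in the thread.
    let max = 65_535;
    assert_eq!(validate_dispatch([1, 2, 3], max), [1, 2, 3]);
    assert_eq!(validate_dispatch([70_000, 1, 1], max), [0, 0, 0]);
}
```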

@teoxoy teoxoy requested a review from a team as a code owner May 17, 2024 15:38
@teoxoy
Member Author

teoxoy commented May 17, 2024

Hmm, I'm not sure how we should deal with the wgsl feature not being available.

@cwfitzgerald
Member

Let's just require it in core, honestly.

We could build a naga module in a build script, serialize it and store that in the binary, but that's wayyyy too much work for no real gain

@cwfitzgerald cwfitzgerald self-requested a review May 18, 2024 02:26
@teoxoy
Member Author

teoxoy commented May 20, 2024

We could build a naga module in a build script, serialize it and store that in the binary, but that's wayyyy too much work for no real gain

That's what I thought as well.

Requiring it in wgpu-core will make wgpu's wgsl feature meaningless as well. Removing the wgsl feature from both wgpu-core and wgpu will effectively revert #2890.

That PR was prefaced with:

It also might turn out entirely unnecessary if the compile-time savings aren't worth the added complexity.

This was also mentioned in #2549 (comment):

If you want to provide more concrete numbers on the binary size (that could be saved), that would help.

@daxpedda were the benefits of #2890 substantial in terms of compile time and binary size?

Member

@jimblandy jimblandy left a comment


In our office hours meeting this morning, we came up with a better way to test, by requesting a device with a lower limit than the adapter offers. Testing this way would give a more positive indication that the limit is being imposed.

@daxpedda
Contributor

daxpedda commented May 20, 2024

If you want to provide more concrete numbers on the binary size (that could be saved), that would help.

@daxpedda were the benefits of #2890 substantial in terms of compile time and binary size?

I just tried a minimal example when using GLSL shaders on Wasm with WebGL.
Using fat LTO, a single codegen unit, reference types and wasm-opt for all scenarios.

| | O3 (O4 for wasm-opt) | Os | Oz |
| --- | --- | --- | --- |
| without wgsl | 3 081 862 | 2 070 971 | 1 871 247 |
| with wgsl | 3 428 880 | 2 306 315 | 2 084 794 |
| diff | 347 018 (11.26%) | 235 344 (11.36%) | 213 547 (11.41%) |

Compile times are similar. Total compile time with O3 goes from 12.21s to 13.48s, so again around a 10% increase.
For Naga alone (according to Cargo timings) I'm getting an increase from 7.74s to 9.67s, so around 25% longer.

Keep in mind that this is a minimal example and might not be representative of a full-blown application.

@teoxoy
Member Author

teoxoy commented May 21, 2024

I would say 10% is significant, especially for web apps, but I'd be curious how much it decreases with larger applications.

It's unfortunate though that since WebGL doesn't support indirect calls, this increase doesn't bring anything of value for this use case.

Maybe we can feature gate the validation on the wgsl feature and enable wgc/wgsl when not(target_arch = "wasm32") in wgpu?

The downside being that wgpu's wgsl feature won't be respected (wgsl input will always be enabled) on native platforms (non-web).
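A sketch of what that gating could look like in wgpu's Cargo.toml (hypothetical: the actual dependency alias, path, and feature wiring in the workspace may differ):

```toml
# Hypothetical sketch: enable wgpu-core's `wgsl` feature only on non-wasm
# targets, where indirect dispatch (and thus the injected validation
# shader) actually exists.
[target.'cfg(not(target_arch = "wasm32"))'.dependencies.wgc]
package = "wgpu-core"
path = "../wgpu-core"
features = ["wgsl"]
```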

@teoxoy teoxoy force-pushed the validate-indirect-compute branch from 018b23b to a5bebb0 Compare May 21, 2024 12:14
@teoxoy
Member Author

teoxoy commented May 21, 2024

@jimblandy I lowered the required limit to 10 since the testing infra is not set up to require limits based on what the adapter supports. I think 10 should be safe; the default value is 65535.

@teoxoy teoxoy requested a review from jimblandy May 21, 2024 12:20
@teoxoy teoxoy force-pushed the validate-indirect-compute branch 4 times, most recently from ac3f089 to 36281af Compare May 21, 2024 14:35
@teoxoy
Member Author

teoxoy commented May 21, 2024

I added another test for the num_workgroups builtin (expected to fail on dx12 - for now).

@teoxoy
Member Author

teoxoy commented May 24, 2024

I wanted to get CI working so I just implemented #5714 (comment) for now.

@teoxoy
Member Author

teoxoy commented May 24, 2024

Filed #5739 for the metal issue which I've worked around in this PR.

@teoxoy
Member Author

teoxoy commented May 24, 2024

The wgpu_test::compute_pass_resource_ownership::compute_pass_resource_ownership test is reliably failing on all platforms.

This is because the internal call to device_create_bind_group tries to get the buffer via its id, which should no longer be used.

I was hoping to get away with calling wgpu-core's public APIs to implement this validation, but it seems that won't work.

by injecting a compute shader that validates the content of the indirect buffer
…ed to `min_uniform_buffer_offset_alignment`

also adds missing indirect buffer offset validation
@teoxoy teoxoy force-pushed the validate-indirect-compute branch from b57350e to e0bd41a Compare May 27, 2024 14:25
@teoxoy
Member Author

teoxoy commented May 28, 2024

Actually, I don't think the test is correct. Shouldn't our Drop implementation behave the same as calling .destroy() from the spec?

The destroyed property is checked by all Queue commands (notably .submit() in this case); the command only succeeding if none of the resources referenced by the command buffer have been tagged as destroyed.

The test was recently added in #5570.

Added a new test that ensures that resources passed into a ComputePass can be dropped before executing the compute pass.
Proved that the test would crash before these changes and passes with them.

@Wumpf could you chime in on this? Did we use to crash or raise a validation error?

@cwfitzgerald
Member

Actually, I don't think the test is correct. Shouldn't our Drop implementation behave the same as calling .destroy() from the spec?

No, drop just removes the user's handle to it, doesn't make access to the resource invalid.

Member

@jimblandy jimblandy left a comment


Some thoughts so far:

4 review threads on tests/tests/dispatch_workgroups_indirect.rs (all resolved; 2 outdated)
@teoxoy
Member Author

teoxoy commented May 29, 2024

Actually, I don't think the test is correct. Shouldn't our Drop implementation behave the same as calling .destroy() from the spec?

No, drop just removes the user's handle to it, doesn't make access to the resource invalid.

I see, I will try to convert the code doing bindgroup creation to use Arcs instead of Ids (moving in the direction of #5121) to see how feasible that is.

Member

@cwfitzgerald cwfitzgerald left a comment


Performance/code simplification thoughts.

Dispatches are unfortunately not trivial to validate all at once :(

Comment on lines +99 to +101
#[cfg(feature = "wgsl")]
pub(crate) dispatch_indirect_validation_pipeline:
Mutex<Option<(id::ComputePipelineId, id::BindGroupLayoutId)>>,
Member


Why can't we create this on device start?

Member Author


We can, and it will simplify things - will do that.

Member Author


This is turning out to be difficult because I have to convert all the code to hal code.

Member Author


I think I found a way to get this working: first initialize it with None, and later, once we have the Arc&lt;Device&gt; and can use all its methods, set it to Some.

let aligned_offset = offset
- offset % device.limits.min_storage_buffer_offset_alignment as u64;

let (bind_group_id, error) = self.device_create_bind_group::<A>(
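The snippet above rounds the indirect offset down to a multiple of `min_storage_buffer_offset_alignment` before binding, since storage-buffer binding offsets must be aligned. A self-contained sketch of that arithmetic (the `remainder` half, the distance the shader would have to index past the bound start, is my inference and not shown in the excerpt):

```rust
/// Split an arbitrary byte offset into an aligned binding offset plus the
/// remainder within the bound range. `alignment` stands in for the
/// device's `min_storage_buffer_offset_alignment`.
fn split_offset(offset: u64, alignment: u64) -> (u64, u64) {
    let aligned = offset - offset % alignment; // multiple of `alignment`
    let remainder = offset - aligned;          // distance past the bound start
    (aligned, remainder)
}

fn main() {
    assert_eq!(split_offset(300, 256), (256, 44));
    assert_eq!(split_offset(512, 256), (512, 0));
}
```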
Member


When do these bind groups/buffers get dropped? As it looks now they will exist forever.

Member Author


That's a good point. I was thinking it would happen eventually since we use Arcs, but I missed that we still have the registry and Ids 😅. I think what I have to do is just remove the id from the registry.

Comment on lines +976 to +985
let (dst_buffer_id, error) = self.device_create_buffer::<A>(
device_id,
&crate::resource::BufferDescriptor {
label: None,
size: 4 * 3,
usage: wgt::BufferUsages::INDIRECT | wgt::BufferUsages::STORAGE,
mapped_at_creation: false,
},
None,
);
Member


Creating a buffer and bind group every dispatch is going to be very expensive. In this case we can make a single buffer/bind group per queue and a single bind group per indirect command buffer. We'd trade one bind group for two, but it would mean we never have to create resources at runtime.

I really wish we had root-uavs/bda so we didn't need a bind group here.

Member Author

@teoxoy teoxoy May 30, 2024


Wouldn't a single dst buffer (with its own bind group) per queue become a bottleneck when a user issues 2 dispatchIndirect commands back to back, each with its own indirect buffer? With only one dst buffer, the 2 internal validation dispatches can't run in parallel, and I think the 2nd dispatchIndirect can't start before the first one has finished either.

&self,
device_id: DeviceId,
) -> Result<
(id::ComputePipelineId, id::BindGroupLayoutId),
Member


Could we wrap this functionality in an encapsulated type with constructor and execute functions, preferably in its own file?

Comment on lines +1921 to +1929
buffer.usage
// Allow indirect buffers to be used as storage buffers, allowing
// us to bind them and validate their content. Note that we already
// pass this usage to hal at buffer creation.
| if buffer.usage.contains(wgt::BufferUsages::INDIRECT) {
wgt::BufferUsages::STORAGE
} else {
wgt::BufferUsages::empty()
},
Member


Do we need to be passing these into check_buffer_usage? If the user does something wrong with usage, they will get an error message that includes a buffer usage they did not request. We should just assume our code is right (or strict_assert maybe).

Member Author


Good point!

4 participants