TPU v3-8 CrossReplicaSum_33 Error #68210
Comments
What's even weirder is that I'm defining the input sizes explicitly, like so:
Config.COMPUTED_BATCH_SIZE = 128
with strategy.scope():
    model = my_model()
    input_shapes = [
        [Config.COMPUTED_BATCH_SIZE, 192],
        [Config.COMPUTED_BATCH_SIZE, 192],
        [COMPUTED_CHANNELS, 105, 129, 100],
        [COMPUTED_CHANNELS, 105, 129, 100],
        [COMPUTED_CHANNELS, 105, 129, 100],
        [Config.COMPUTED_BATCH_SIZE, 70],
        [Config.COMPUTED_BATCH_SIZE, 320]
    ]
    model.build(input_shape=input_shapes)
Ignore this since it appears that Hidden Size * 2 is also 256. So it could be a transpose inside a Dense layer.
Edit: I've tracked it down to this code:
hidden_size = 128
self.descriptor_embedding = layers.Dense(
    hidden_size * 2,  # 256
    activation='relu',
    input_shape=(Config.COMPUTED_BATCH_SIZE, 70)
)
learned_descriptors = tf.expand_dims(
    self.descriptor_embedding(descriptors),
    1
)  # [BS, 1, HS * 2]
It seems to be an issue with
@vivekjoshy, tf.reshape(tf.image.decode_jpeg(image, channels=3), [256, 256, 3]), class_idx
Kindly find the gist of it here.
Fixed by passing in explicit shapes everywhere.
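For anyone landing here later, this is a minimal sketch of what "explicit shapes everywhere" can look like in a tf.data input pipeline. It is not the code from this issue: BATCH_SIZE mirrors Config.COMPUTED_BATCH_SIZE, while fix_shapes, the toy dataset, and the label tensor are hypothetical names made up for illustration. The assumption is that the unknown dimensions came from the input pipeline, which may not be the whole story in a larger code base.

```python
import tensorflow as tf

BATCH_SIZE = 128  # mirrors Config.COMPUTED_BATCH_SIZE from the thread


def fix_shapes(descriptors, label):
    # Hypothetical helper: assert fully static shapes on every element,
    # so no dimension reaches the model as None.
    descriptors = tf.ensure_shape(descriptors, [BATCH_SIZE, 70])
    label = tf.ensure_shape(label, [BATCH_SIZE])
    return descriptors, label


# Toy data standing in for the real inputs (assumption: 70 descriptor features).
features = tf.zeros([256, 70])
labels = tf.zeros([256], dtype=tf.int32)

ds = tf.data.Dataset.from_tensor_slices((features, labels))
# drop_remainder=True keeps the batch dimension static instead of None.
ds = ds.batch(BATCH_SIZE, drop_remainder=True).map(fix_shapes)

print(ds.element_spec)
```

Printing ds.element_spec before and after such a change is a quick way to spot any remaining None dimensions.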
Issue type
Support
Have you reproduced the bug with TensorFlow Nightly?
No
Source
source
TensorFlow version
2.15.0
Custom code
Yes
OS platform and distribution
No response
Mobile device
No response
Python version
3.10
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
I'm getting this error when I run model.fit.
I have a very large code base and am unable to reproduce it. I was using the AUC metric but removed it, as suggested in #33890. The issue still persists.
Another relevant issue is #41590.
I have attached an MVCE to reproduce the error, but knowing that explicit shapes are needed, as explained in #41590, is still not enough, since I don't know where I'm using a transpose. I need a stack trace that points to the place in the code that causes the issue.
Standalone code to reproduce the issue
https://colab.research.google.com/drive/1bYuuwG0pFnIQe1X6jA7FJMLjZtvAmqlj?usp=sharing
Relevant log output
No response