Broken tests in Selenoid with TestNG threads when Chrome node version is 124 #1423

Open
farious2009 opened this issue May 18, 2024 · 7 comments


farious2009 commented May 18, 2024

Hi Team,

I am opening this issue in the hope that you can help clarify what is wrong with our runs: whether the problem is in Selenoid, in Chrome, or somewhere else.

We run a lot of tests with TestNG 7.x + Selenide 7.x (with Configuration.timeout = 45000 and Configuration.pageLoadTimeout = 60000) + Java 17. The jobs use Selenium Grid and Selenoid with Chrome node version 115, and they run in parallel without problems in dind on 16 cores, with up to 14 threads depending on the suite.

The test suites are simple. This example runs in parallel with 4 threads, plus a couple more tests with parallel="none" within the same suite:
[image: suite]
The base class closes the driver after every test method:
@AfterMethod(alwaysRun = true)
Selenide.closeWebDriver();

The problem starts when we switch to a newer version of Chrome: the higher the version, the more tests (and, importantly, threads) fail at random and are marked as broken, with different errors such as
session timeouts
or
invalid session id
or
org.openqa.selenium.WebDriverException: Command duration or timeout: 1 milliseconds
(there may be others, but I noted these as examples and attached some of them below)

The main point is that when one test fails, all the subsequent tests in the same thread also fall like dominoes within a very short period of time.

I have tried versions 122 and 124. Where Chrome 115 correctly marks tests as failed and Allure takes a snapshot with the actual error after a failed step, with Chrome 124 a random number of tests are marked as broken (and it may even happen to the same test in the same places, which makes me think it might be a problem with the node). I cannot blame our tests either, as they work fine with 115 on both Selenoid and Selenium Grid.

I have played with timeouts: -timeout 2m, -timeout 5m; -service-startup-timeout was around 1m20s (I probably tried doubling all 4 timeout options, with no luck). To be honest, all tests run against Chrome 115 without any additional settings at all, so I was skeptical that my changes would help, and they did not (unless the newer nodes are heavier and require more tuning).

What is weird is that after the first test is broken, the second test starts either after the 45 sec timeout defined in Selenide or after ~5 sec (and it will be broken regardless); the next test starts after 5 sec again and fails, and this continues until the thread is finished (shown below). But we do not have those 5 sec anywhere in our settings: I have no clue where they come from. I was expecting the same behavior regardless of the node version, with all tests isolated: if one test fails, it is marked as failed, the session is closed, and the next test is started on a separate node, so that we get a normal yellow-green Allure report rather than one with broken tests.

Chrome 115

The Allure snapshots below are probably from runs with a limit of 14 and thread-count="4":
[image: 115]
[image: 115_]

Allure results
[image: allure_110]

Chrome 124

Test 1 (the first yellow one in a thread)
The number of affected threads (yellow ones) can vary:
[image: test1]
More than one broken thread (as an example that this is random):
[image: many_threads]
Test 1 content (from the snapshot with 1 thread):
[image: test1con]
Selenoid logs (test 1)
I was able to find only these 3 entries for the session ID, but you will see more attempts to use the same ID in the following tests below:
15:13:22 [31218] [SESSION_CREATED] [8f113074857c76c329dffdc1b432def2] [1] [1.44s]
15:14:24 [31299] [SESSION_TIMED_OUT] [8f113074857c76c329dffdc1b432def2]
15:14:24 [34422] [SESSION_DELETED] [8f113074857c76c329dffdc1b432def2]
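
As a rough check of the three log lines above, the gap between SESSION_CREATED and SESSION_TIMED_OUT can be computed directly. The sketch below is an illustration only: the class and method names are hypothetical, and the pattern assumes the trimmed line shape quoted here (time, request id, event, session id), which may differ from a full Selenoid log line. For these lines the gap is 62 seconds, which is close to a one-minute idle timeout; which -timeout value was actually in effect for this particular run is not stated, so that reading is an assumption.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SelenoidSessionGap {
    // Assumed line shape, taken from the excerpt above: time, request id, event, session id.
    private static final Pattern LINE = Pattern.compile(
            "^(\\d{2}:\\d{2}:\\d{2}) \\[(\\d+)\\] \\[([A-Z_]+)\\] \\[([0-9a-f]+)\\].*$");

    /** Maps each event name to the time of its first occurrence for the given session id. */
    static Map<String, LocalTime> eventsFor(String sessionId, List<String> lines) {
        Map<String, LocalTime> events = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (m.matches() && m.group(4).equals(sessionId)) {
                events.putIfAbsent(m.group(3), LocalTime.parse(m.group(1)));
            }
        }
        return events;
    }

    /**
     * Seconds between SESSION_CREATED and SESSION_TIMED_OUT, or -1 if either event is missing.
     * The excerpt has no dates, so a session that crosses midnight is not handled.
     */
    static long secondsUntilTimeout(String sessionId, List<String> lines) {
        Map<String, LocalTime> events = eventsFor(sessionId, lines);
        LocalTime created = events.get("SESSION_CREATED");
        LocalTime timedOut = events.get("SESSION_TIMED_OUT");
        if (created == null || timedOut == null) {
            return -1;
        }
        return Duration.between(created, timedOut).getSeconds();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "15:13:22 [31218] [SESSION_CREATED] [8f113074857c76c329dffdc1b432def2] [1] [1.44s]",
                "15:14:24 [31299] [SESSION_TIMED_OUT] [8f113074857c76c329dffdc1b432def2]",
                "15:14:24 [34422] [SESSION_DELETED] [8f113074857c76c329dffdc1b432def2]");
        System.out.println(secondsUntilTimeout("8f113074857c76c329dffdc1b432def2", lines));
    }
}
```

Running the same check over a whole log would show whether every broken thread begins with a session that sat idle for about the same length of time.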

Test 2 (the second yellow in a thread)

[image: test2]
[image: test2con]
Note that the session IDs are the same: it seems like all tests start with the same session?
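
One way to confirm this observation without comparing screenshots by eye is to group tests by the session ID they reported and list any ID that appears in more than one test. The following is a minimal sketch under stated assumptions: the class name, the test labels and the second session ID are hypothetical placeholders, and the (test, session ID) pairs would have to be collected from the error messages or the report by some other means.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SharedSessionCheck {
    /**
     * Takes {testName, sessionId} pairs and returns only the session ids
     * that were reported by more than one test, with the tests that used them.
     */
    static Map<String, List<String>> sharedSessions(List<String[]> testAndSession) {
        Map<String, List<String>> bySession = new LinkedHashMap<>();
        for (String[] pair : testAndSession) {
            bySession.computeIfAbsent(pair[1], id -> new ArrayList<>()).add(pair[0]);
        }
        // Drop session ids used by a single test: those are properly isolated.
        bySession.values().removeIf(tests -> tests.size() < 2);
        return bySession;
    }

    public static void main(String[] args) {
        // Test labels and the last session id are hypothetical; the first id is the one from the logs.
        List<String[]> observed = List.of(
                new String[] {"test1", "8f113074857c76c329dffdc1b432def2"},
                new String[] {"test2", "8f113074857c76c329dffdc1b432def2"},
                new String[] {"test4", "8f113074857c76c329dffdc1b432def2"},
                new String[] {"otherTest", "hypothetical-session-id"});
        sharedSessions(observed).forEach((id, tests) ->
                System.out.println(id + " used by " + tests));
    }
}
```

If isolation works as expected, the returned map is empty; any entry in it points at tests that were sent to a session another test had already used.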

Test 4 (test 3 looks similar to test 2)

[image: 3test]
[image: test3cont]

I found only 1 SESSION_TIMED_OUT entry in the Selenoid logs, for the session shared by all these tests, 8f113074857c76c329dffdc1b432def2. The rest look like duplicates and are even grouped under one Allure category, as here:
[image: alltests]

Grid with Chrome 125

The Grid compose file has no additional timeout options at all, only 14 nodes in parallel. We have different jobs running the same suites on Selenoid and on Grid, but the output varies drastically.

This is a recent test run against Chrome 125 on Grid, where beforeMethod failed after a 1m timeout, but at least a snapshot is taken and we are able to see the error:
[image: before]
[image: grid_125]
[image: timeline]

Also, as I have already mentioned, those 5 sec timeouts in the yellow threads above look weird, and they obviously do not come from Selenide or from the Chrome capabilities.
I am not 100% sure that Chrome 115 is stable, but we are currently on it, and I will see if I can switch Grid to Chrome 125 after its recent release.

What I have not tried yet is to create a Chrome image with the latest chromedriver, but I do not think it would help.

Is there a big difference in internal settings between Selenoid and Grid when it comes to nodes? Is there a way to stop the test/node within a defined timeout and not start a new test with (probably) the same ID? It looks to me like isolation does not work properly, but I could be wrong.

I really like Selenoid and I would like to switch to it because of its video feature, but I am currently blocked by unstable results. If you have any advice, based on the info above, on how to set it up given the timeouts and settings provided, or on whom I should contact, it would be appreciated.

Thanks.
Kind regards

@vania-pooh
Member

@farious2009 actually we noticed that some Chrome / chromedriver versions are from time to time less responsive than other versions. Once the Chrome team fixes bugs in Chrome / chromedriver, within 1-2 versions everything could become more stable again. I would rerun the freezing / slow test cases with video recording enabled to see what's happening inside. Also I would check that your application does not rely on some new APIs when available.

@farious2009
Author

I tried checking the videos, but they did not even start recording: the files kept the selenoid_ prefix and a small size.

Do you think it would be worth asking the Selenide devs, as maybe that framework uses something from the updated Chromium under the hood (I mention the entire engine, as Edge behaves in the same way)?

@vania-pooh
Member

@farious2009 if your videos are not renamed to <session-id>.mp4, then this is a Selenoid misconfiguration. Selenoid images contain both the webdriver binary and the browser binary, so I don't think the Selenide guys will help. Selenide is just a Selenium helper library on top of the official Java Selenium client.

@farious2009
Author

Thanks. I have asked about the same thing under Selenium Docker, because I am guessing it is not related to Selenoid only: versions 124-125 are failing there as well, and the behavior looks similar (though the visible errors are different). Again, it varies between Chrome versions.

@farious2009
Author

And, by the way, regarding video: the videos themselves are recorded and I can open them, but only for normal tests; the broken ones, which are what I am asking about here, are left with the selenoid_... name, as you said, and cannot be opened (which is probably expected by the logic).

@farious2009
Author

I have a feeling it is a node issue, but I am not an expert here, of course; it is only what I have noticed in common.

@cybertuck

cybertuck commented May 29, 2024

@vania-pooh @farious2009 Same error for me when I switched from Chrome 105.0 to 123.0. I'm using selenoid 4.18.0, Java 17.
1-2 out of 10 executions fail intermittently, even for 1 test in 1 thread!

org.openqa.selenium.WebDriverException:

Command duration or timeout: 3 milliseconds
Build info: version: '4.18.0', revision: 'b6bf9de7cc*'
System info: os.name: 'Windows Server 2022', os.arch: 'amd64', os.version: '10.0', java.version: '17.0.10'
Driver info: org.openqa.selenium.remote.RemoteWebDriver
Command: [efc1fa36585d0beb85bf3d84efd13228, quit {}]
Capabilities {acceptInsecureCerts: true, browserName: chrome, browserVersion: 123.0.6312.58, chrome: {chromedriverVersion: 123.0.6312.58 (6b4b19e9dfbb..., userDataDir: /tmp/.org.chromium.Chromium...}, fedcm:accounts: true, goog:chromeOptions: {debuggerAddress: localhost:35191}, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: linux, proxy: Proxy(), setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:extension:minPinLength: true, webauthn:extension:prf: true, webauthn:virtualAuthenticators: true}
Session ID: efc1fa36585d0beb85bf3d84efd13228
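
When many such traces accumulate, it can help to pull out which command failed, on which session, and how quickly. In the trace above the failing command is quit and the reported duration is 3 milliseconds, which may suggest that the session was already gone by the time the driver was closed, though that reading is an assumption. The sketch below uses only the message text as quoted here; the class and field names are hypothetical, and other Selenium client versions may word the message differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WebDriverErrorSummary {
    // Patterns follow the message text quoted above, not any documented format.
    private static final Pattern DURATION =
            Pattern.compile("Command duration or timeout: (\\d+) milliseconds");
    private static final Pattern COMMAND =
            Pattern.compile("Command: \\[([^,\\]]+), (\\w+)");

    /** Session id, command name and reported duration taken from one exception message. */
    static final class Summary {
        final String sessionId;
        final String command;
        final long millis;

        Summary(String sessionId, String command, long millis) {
            this.sessionId = sessionId;
            this.command = command;
            this.millis = millis;
        }
    }

    /** Returns a summary if both the duration line and the command line are present. */
    static Optional<Summary> parse(String message) {
        Matcher duration = DURATION.matcher(message);
        Matcher command = COMMAND.matcher(message);
        if (!duration.find() || !command.find()) {
            return Optional.empty();
        }
        return Optional.of(new Summary(
                command.group(1), command.group(2), Long.parseLong(duration.group(1))));
    }

    public static void main(String[] args) {
        String message = String.join("\n",
                "Command duration or timeout: 3 milliseconds",
                "Command: [efc1fa36585d0beb85bf3d84efd13228, quit {}]",
                "Session ID: efc1fa36585d0beb85bf3d84efd13228");
        parse(message).ifPresent(s ->
                System.out.println(s.command + " on " + s.sessionId + " took " + s.millis + " ms"));
    }
}
```

Counting the summaries per command across a run would show whether the broken tests fail mostly on quit or on the first command of the next test.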
