Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

investigate SCHED0011 failure on odroid_xu4 #1130

Open
lsf37 opened this issue Nov 9, 2023 · 8 comments
Open

investigate SCHED0011 failure on odroid_xu4 #1130

lsf37 opened this issue Nov 9, 2023 · 8 comments
Labels
MCS issues about the mixed-criticality system config

Comments

@lsf37
Copy link
Member

lsf37 commented Nov 9, 2023

This could be just a fluke, but I have not seen this failure before, so we should investigate. The scheduler accuracy test failed for config ODROID_XU4_debug_MCS_clang_32:

  	<testcase classname="sel4test" name="SCHED0011">
  Running test SCHED0011 (Test scheduler accuracy)
  		<error>Check diff(103054000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103034000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103056000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103054000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103050000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103050000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103058000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103044000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103041000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103039000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  Test SCHED0011 failed
  		<failure type="failure">result == SUCCESS at line 296 of file /github/workspace/projects/sel4test/apps/sel4test-driver/src/testtypes.c</failure>
  		<error>result == SUCCESS at line 217 of file /github/workspace/projects/sel4test/apps/sel4test-driver/src/main.c</error>
  	</testcase>

The corresponding chunk of code in the test is here:

        uint64_t start = sel4test_timestamp(env);
        seL4_Yield();
        uint64_t end = sel4test_timestamp(env);
        /* calculate diff in ns */
        uint64_t diff = (end - start);
        uint64_t period_ns = period * NS_IN_US;
        if (i > 0) {
            test_geq(diff, period_ns - 2 * NS_IN_MS);
            test_leq(diff, period_ns + 2 * NS_IN_MS);
            if (diff > period_ns) {
                ZF_LOGD("Too late: by %llu us", diff - period_ns);
            } else if (diff < period_ns) {
                ZF_LOGD("Too soon: by %llu us", period_ns - diff);
            }
        }

The bit that fails is test_leq(diff, period_ns + 2 * NS_IN_MS);

@lsf37 lsf37 added the MCS issues about the mixed-criticality system config label Nov 9, 2023
@axel-h
Copy link
Member

axel-h commented Nov 9, 2023

The seL4_Yield() from line 1016 could move into the loop (see https://github.com/seL4/sel4test/blob/master/apps/sel4test-tests/src/tests/scheduler.c#L1016-L1023) to guarantee we have a full time slice to get the time? This might avoid a race condition, but that should rarely hit.
What seem odd is that it's constantly slightly more than 3 ms. Give the board is fast enough, I'd expect to see not 103.044.000us but something like 100.044.000us there.

@Indanz
Copy link
Contributor

Indanz commented Nov 11, 2023

Moving the yield won't help this particular issue though, as the time is too long, not too short.

Do we even now for sure that the clock used for scheduling and the clock used for timing are correct and synchronous enough to make valid measurements?

This is again on the somewhat dodgy odroidxu4_1, can't we use odroidxu4_2 for CI instead?

Looking at the raw log I'm not convinced that something weird isn't going on:

2023-11-09T08:15:27.7834661Z Running test SCHED0011 (Test scheduler accuracy)
2023-11-09T08:15:28.4092443Z 		<error>Check diff(103054000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:28.8247532Z 		<error>Check diff(103034000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:29.2414247Z 		<error>Check diff(103056000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:29.6576481Z 		<error>Check diff(103054000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:30.0737663Z 		<error>Check diff(103050000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:30.4905678Z 		<error>Check diff(103050000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:30.9075889Z 		<error>Check diff(103058000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:31.3243316Z 		<error>Check diff(103044000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:31.7411949Z 		<error>Check diff(103041000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:32.1575577Z 		<error>Check diff(103039000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:32.1577427Z Test SCHED0011 failed
2023-11-09T08:15:32.1579569Z 		<failure type="failure">result == SUCCESS at line 296 of file /github/workspace/projects/sel4test/apps/sel4test-driver/src/testtypes.c</failure>
2023-11-09T08:15:32.1581885Z 		<error>result == SUCCESS at line 217 of file /github/workspace/projects/sel4test/apps/sel4test-driver/src/main.c</error>
2023-11-09T08:15:32.1585043Z 	</testcase>
2023-11-09T08:15:32.1586060Z 	<testcase classname="sel4test" name="SCHED0012">

The period is 100 ms, so the whole test should take about 1 second, it takes more than 4 seconds instead.

A successful run looks like:

2023-11-08T16:15:30.2571784Z Running test SCHED0011 (Test scheduler accuracy)
2023-11-08T16:15:31.2987898Z Test SCHED0011 passed

@canarysnort01
Copy link
Contributor

FYI, I've seen this test fail when LibUtilsDefaultZfLogLevel is increased. I assume the extra prints take long enough to exceed the margin.

@axel-h
Copy link
Member

axel-h commented Nov 11, 2023

The printing happens after the measurement in each loop, so this wont affect it.

@canarysnort01
Copy link
Contributor

canarysnort01 commented Nov 11, 2023

I've definitely seen that flag reliably trigger this failure on another platform. If that's not supposed to happen I'll repro it again and file an issue about it.

@canarysnort01
Copy link
Contributor

canarysnort01 commented Nov 11, 2023

I also think some of the timer drivers in libplatsupport might introduce some rounding error when converting from/to timer ticks and nanoseconds because they use a pre-calculated number of nanoseconds per timer tick, which due to using integer arithmetic is truncated to a whole integer.

I don't know if this is a problem on this platform or if it would be significant enough to cause any issues, but it seems to me this small error at the nanosecond scale might get large for longer time periods.

@canarysnort01
Copy link
Contributor

The printing happens after the measurement in each loop, so this wont affect it.

Ah, the failure I saw was in a different test, SCHED0021.

@Indanz
Copy link
Contributor

Indanz commented Nov 11, 2023

The timestamp is added by the logging system and includes the loop time, so it just shows a ridiculously slow (buggy?) UART driver (2 ms per character).

The quality of the timer drivers in libplatsupport is low, but mostly bad overflow handling (I have a local patch to fix it, but didn't get to it yet). The meson timer assumes an 1 MHz clock it seems, if a rounding error is unlikely.

According to the schematics, there's only a 24MHz clock and a 32kHz clock input, so it's unlikely that a different clock source is being used either.

3% overhead for "something" is ridiculous.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MCS issues about the mixed-criticality system config
Projects
None yet
Development

No branches or pull requests

4 participants