In a busy loop, two continuously getting time encounters a big time gap

In a busy loop, some codes contiguously get the current time twice and count their time interval. The time interval is supposed to be small, since the codes belong to a SCHED_FIFO process with highest priority(99) and could not be interrupted by other processes while running.

rdtsc instruction is used to get the current time from userspace, and cpuid is used to serialize instructions.

the codes are as follow:

  #define RDTSCP(u64) do {                                    \
      unsigned int hi, lo;                                    \
      __asm__ __volatile__ ("RDTSC\n\t" : "=a" (lo), "=d" (hi)); \
      u64 = ((uint64_t)hi << 32) | lo;                        \

  unsigned int usecs = 20;  //20us
  int tsum = 0;
  for(long long ii = 0; ii < OUTER_LOOP; ++ii) {
      for (long long jj = 0; jj < INNER_LOOP; ++jj) {
          __asm__ __volatile__ ("CPUID");
          RDTSCP(begin);   // get time
          __asm__ __volatile__ ("CPUID");
          RDTSCP(end);     // get time as well 
          __asm__ __volatile__ ("CPUID");
          temp = end-begin;
          if (temp < min) min = temp;  // minimum time interval changes
          if (temp > max) max = temp, flag=true;  // maximum time interval changes   

      if (flag) {
          printf("min: %ld\t\tmax: %ld \t(clocks) \n", min, max);

In observation of several hours, the biggest time interval in my PC (CPU i9-9900k, 3600MHz) is 578112 (clocks), which is about 160us. This value is extraordinary IMHO.

My question is why there is such a big time interval between these two sequential rdtsc instructions and what causes it ?

The compiler seems to be innocent and translated assembly codes are as follow:

0x5555555554fb <main+594>       mfence                    
0x5555555554fe <main+597>       cpuid                     
0x555555555500 <main+599>       rdtsc                     
0x555555555502 <main+601>       mov    %eax,-0xe0(%rbp)   
0x555555555508 <main+607>       mov    %edx,-0xdc(%rbp)   
0x55555555550e <main+613>       mov    -0xdc(%rbp),%ea
0x555555555514 <main+619>       shl    $0x20,%ra      
0x555555555518 <main+623>       mov    %rax,%rd       
0x55555555551b <main+626>       mov    -0xe0(%rbp),%ea
0x555555555521 <main+632>       or     %rdx,%ra       
0x555555555524 <main+635>       mov    %rax,-0xa0(%rbp)   
0x55555555552b <main+642>       cpuid                     
0x55555555552d <main+644>       rdtsc                     
0x55555555552f <main+646>       mov    %eax,-0xd8(%rbp)   
0x555555555535 <main+652>       mov    %edx,-0xd4(%rbp)   
0x55555555553b <main+658>       mov    -0xd4(%rbp),%ea
0x555555555541 <main+664>       shl    $0x20,%ra      
0x555555555545 <main+668>       mov    %rax,%rd       
0x555555555548 <main+671>       mov    -0xd8(%rbp),%ea
0x55555555554e <main+677>       or     %rdx,%ra       
0x555555555551 <main+680>       mov    %rax,-0x98(%rbp)   
0x555555555558 <main+687>       cpuid                     
0x55555555555a <main+689>       mfence 

1 answer

  • answered 2021-11-23 02:12 minhuw

    Though with the highest scheduling priority, a userspace process may still be preempted by a kernel thread. You may

    • monitor the kernel scheduler with perf sched to track kernel schedule events and figure out which thread uses the core.
    • Reduce the kernel noise in the test core to reduce performance variance:
      • Isolate the test core from the scheduler entirely with boot option isolcpus
      • prevent the kernel from sending schedule interrupts with boot option nohz_full
      • move rcu callback to other cores with boot option rcu_nocbs.
      • my own experience: avoid use the first and the last core (0 and N).

How many English words
do you know?
Test your English vocabulary size, and measure
how many words do you know
Online Test
Powered by Examplum