tech-toolchain archive

Re: pkg/50939: Bug in GCC optimization causing i386 net-snmpd to crash



Some time ago, David Holland wrote:
  This sounds like it is overwriting its stack, probably in the mem_mib
  call. Then when it returns from the mem_mib call it manages to go to
  the wrong place. Can you check in the debugger if this is the case?

  What gets trashed if you overwrite the stack can depend heavily on
  compiler optimizations, so it's not necessarily a gcc bug.

  I don't see anything obviously wrong with the code, but that isn't
  conclusive.

  Also, is this happening on real i386, or in a 32-bit chroot on an
  amd64? Might also be a problem with the compat32 sysctl().

I have reproduced this on NetBSD 7.1 on a real i386 machine.

The problem appears to be a compiler bug. Consider the following code, from the middle of netsnmp_cpu_arch_load:

        for (i = 0; i < cpu_num; i++) {
            netsnmp_cpu_info  *ncpu = netsnmp_cpu_get_byIdx( i, 1 );
            size_t j = i * CPUSTATES;
            ncpu->user_ticks = (unsigned long long)ncpu_stats[j + CP_USER];
            ncpu->nice_ticks = (unsigned long long)ncpu_stats[j + CP_NICE];
            ncpu->sys2_ticks = (unsigned long long)ncpu_stats[j + CP_SYS]+cpu_stats[j + CP_INTR];
            ncpu->kern_ticks = (unsigned long long)ncpu_stats[j + CP_SYS];
            ncpu->idle_ticks = (unsigned long long)ncpu_stats[j + CP_IDLE];
            ncpu->intrpt_ticks = (unsigned long long)ncpu_stats[j + CP_INTR];
        }
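
To test the loop in isolation, it can be lifted into a small standalone file. The following is only a sketch under stated assumptions: `cpu_info_sketch`, `cpu_table`, `get_by_idx` and `load_ticks` are hypothetical stand-ins for the net-snmp structures, the CP_* values are copied from NetBSD's <sys/sched.h>, and both input arrays are assumed to hold cpu_num * CPUSTATES elements (the real declarations are not shown in the excerpt above and may differ).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CPU state indices as on NetBSD; repeated so the sketch builds anywhere. */
enum { CP_USER = 0, CP_NICE = 1, CP_SYS = 2, CP_INTR = 3, CP_IDLE = 4,
       CPUSTATES = 5 };

enum { MAX_CPUS = 8 };           /* hypothetical table size */

/* Hypothetical stand-in for netsnmp_cpu_info: only the tick fields. */
typedef struct {
    unsigned long long user_ticks, nice_ticks, sys2_ticks;
    unsigned long long kern_ticks, idle_ticks, intrpt_ticks;
} cpu_info_sketch;

cpu_info_sketch cpu_table[MAX_CPUS];

/* Hypothetical stand-in for netsnmp_cpu_get_byIdx(idx, 1). */
cpu_info_sketch *get_by_idx(int idx)
{
    assert(idx >= 0 && idx < MAX_CPUS);
    return &cpu_table[idx];
}

/*
 * Same loop body as the excerpt. ASSUMPTION: ncpu_stats and cpu_stats
 * both have at least cpu_num * CPUSTATES elements.
 */
void load_ticks(const uint64_t *ncpu_stats, const uint64_t *cpu_stats,
                int cpu_num)
{
    for (int i = 0; i < cpu_num; i++) {
        cpu_info_sketch *ncpu = get_by_idx(i);
        size_t j = (size_t)i * CPUSTATES;
        ncpu->user_ticks = (unsigned long long)ncpu_stats[j + CP_USER];
        ncpu->nice_ticks = (unsigned long long)ncpu_stats[j + CP_NICE];
        ncpu->sys2_ticks = (unsigned long long)ncpu_stats[j + CP_SYS]
                           + cpu_stats[j + CP_INTR];
        ncpu->kern_ticks = (unsigned long long)ncpu_stats[j + CP_SYS];
        ncpu->idle_ticks = (unsigned long long)ncpu_stats[j + CP_IDLE];
        ncpu->intrpt_ticks = (unsigned long long)ncpu_stats[j + CP_INTR];
    }
}
```

Calling load_ticks() with cpu_num = 2 and checking that all six fields of cpu_table[1] are filled in gives a simple pass/fail check that can be repeated at different optimisation levels.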

This is translated into the following block of code (disassembled by gdb). The block is entered via a conditional branch from elsewhere, if cpu_num > 0.

   0xbba64c88 <+1039>:  movl   $0x1,0x4(%esp)
   0xbba64c90 <+1047>:  movl   $0x0,(%esp)
   0xbba64c97 <+1054>:  call   0xbba09460 <netsnmp_cpu_get_byIdx@plt>
   0xbba64c9c <+1059>:  mov    (%edi),%edx
   0xbba64c9e <+1061>:  mov    0x4(%edi),%ecx
   0xbba64ca1 <+1064>:  mov    %edx,0x2008(%eax)
   0xbba64ca7 <+1070>:  mov    %ecx,0x200c(%eax)
   0xbba64cad <+1076>:  mov    0x8(%edi),%edx
   0xbba64cb0 <+1079>:  mov    0xc(%edi),%ecx
   0xbba64cb3 <+1082>:  mov    %edx,0x2010(%eax)
   0xbba64cb9 <+1088>:  mov    %ecx,0x2014(%eax)
   0xbba64cbf <+1094>:  mov    0x10(%edi),%edx
   0xbba64cc2 <+1097>:  mov    0x14(%edi),%ecx
   0xbba64cc5 <+1100>:  add    0x54(%esp),%edx
   0xbba64cc9 <+1104>:  adc    0x58(%esp),%ecx
   0xbba64ccd <+1108>:  mov    %edx,0x2068(%eax)
   0xbba64cd3 <+1114>:  mov    %ecx,0x206c(%eax)
   0xbba64cd9 <+1120>:  mov    0x10(%edi),%edx
   0xbba64cdc <+1123>:  mov    0x14(%edi),%ecx
   0xbba64cdf <+1126>:  mov    %edx,0x2030(%eax)
   0xbba64ce5 <+1132>:  mov    %ecx,0x2034(%eax)
   0xbba64ceb <+1138>:  mov    0x20(%edi),%edx
   0xbba64cee <+1141>:  mov    0x24(%edi),%ecx
   0xbba64cf1 <+1144>:  mov    %edx,0x2020(%eax)
   0xbba64cf7 <+1150>:  mov    %ecx,0x2024(%eax)
   0xbba64cfd <+1156>:  mov    0x18(%edi),%edx
   0xbba64d00 <+1159>:  mov    0x1c(%edi),%ecx
   0xbba64d03 <+1162>:  mov    %edx,0x2038(%eax)
   0xbba64d09 <+1168>:  mov    %ecx,0x203c(%eax)
   0xbba64d0f <+1174>:  mov    -0x258(%ebx),%eax
   0xbba64d15 <+1180>:  mov    (%eax),%eax
   0xbba64d17 <+1182>:  cmp    $0x1,%eax
   0xbba64d1a <+1185>:  jle    0xbba64ace <netsnmp_cpu_arch_load+597>
   0xbba64d20 <+1191>:  movl   $0x1,0x4(%esp)
   0xbba64d28 <+1199>:  movl   $0x1,(%esp)
   0xbba64d2f <+1206>:  call   0xbba09460 <netsnmp_cpu_get_byIdx@plt>
   0xbba64d34 <+1211>:  mov    0x28(%edi),%edx
   0xbba64d37 <+1214>:  mov    0x2c(%edi),%ecx
   0xbba64d3a <+1217>:  mov    %edx,0x2008(%eax)
   0xbba64d40 <+1223>:  mov    %ecx,0x200c(%eax)
   0xbba64d46 <+1229>:  mov    0x30(%edi),%esi
   0xbba64d49 <+1232>:  mov    0x34(%edi),%edi
   0xbba64d4c <+1235>:  mov    %esi,0x2010(%eax)
   0xbba64d52 <+1241>:  mov    %edi,0x2014(%eax)

The branch to 0xbba64ace is a branch back to continue the normal execution of the code, where free(...) is called and life carries on.
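
The displacements off %edi in the listing can be mapped back to source indices with a little arithmetic. This is a sketch resting on two assumptions: %edi points at the start of ncpu_stats, and each element is 8 bytes (each counter is moved as two 32-bit halves). `decode_disp` is a hypothetical helper, not part of net-snmp.

```c
#include <assert.h>
#include <stddef.h>

/* CPU state indices as on NetBSD (<sys/sched.h>). */
enum { CP_USER, CP_NICE, CP_SYS, CP_INTR, CP_IDLE, CPUSTATES };

/* ASSUMPTION: 8-byte counters, as suggested by the paired 32-bit moves. */
enum { ELEM_SIZE = 8 };

/*
 * Convert a byte displacement from the start of ncpu_stats into the
 * CPU number and CP_* state index it corresponds to.
 */
void decode_disp(size_t disp, size_t *cpu, size_t *state)
{
    assert(disp % ELEM_SIZE == 0);      /* must land on an element boundary */
    size_t idx = disp / ELEM_SIZE;      /* flat index: i * CPUSTATES + state */
    *cpu = idx / CPUSTATES;
    *state = idx % CPUSTATES;
}
```

Under these assumptions 0x18(%edi) decodes to CPU 0, CP_INTR, while 0x28(%edi) and 0x30(%edi) decode to CPU 1, CP_USER and CP_NICE. That is consistent with the code from +1211 onwards being the start of a second loop iteration.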

Note that the compiler appears to have partially unrolled the loop: the second iteration stores only user_ticks and nice_ticks, and then the block of code simply ends. The next block of code happens to be the error path for sysctl(mem_mib, ...) failing, which logs "sysctl vm.vm_meter failed". This appears to be purely coincidental, and the real failure here is that execution just falls off the end of this half-finished loop unrolling.

   0xbba64d58 <+1247>:  call   0xbba0abf0 <__errno@plt>
   0xbba64d5d <+1252>:  mov    (%eax),%eax
   0xbba64d5f <+1254>:  mov    %eax,0x8(%esp)
   0xbba64d63 <+1258>:  lea    -0x41e78(%ebx),%eax
   0xbba64d69 <+1264>:  mov    %eax,0x4(%esp)
   0xbba64d6d <+1268>:  movl   $0x3,(%esp)
   0xbba64d74 <+1275>:  call   0xbba0af70 <snmp_log@plt>
   0xbba64d79 <+1280>:  jmp    0xbba649cd <netsnmp_cpu_arch_load+340>

A machine with only one CPU would be spared this fate, since it exits the loop after the first iteration (the cmp $0x1,%eax / jle pair at +1182) and never reaches the incomplete second iteration. The problem should therefore be reproducible on any NetBSD/i386 machine with at least 2 CPUs.

Obviously in the short term, the package will need to work around this by disabling optimisation, but this is clearly something the compiler is getting wrong.
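
Besides lowering the optimisation level for the whole package, GCC can also be told to skip optimisation for a single function. The following is a sketch of that alternative, not the change made to the package: `NO_OPTIMIZE` and `sum_ticks` are hypothetical names used only for illustration. GCC's documentation describes the optimize attribute as intended for debugging rather than production use, so this is better suited to narrowing down the problem than to a permanent fix.

```c
#include <assert.h>
#include <stddef.h>

/*
 * GCC-specific: compile the annotated function as if with -O0.
 * Other compilers (including clang) get an empty macro.
 */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_OPTIMIZE __attribute__((optimize("O0")))
#else
#define NO_OPTIMIZE
#endif

/* Hypothetical example function: sum n tick counters without optimisation. */
NO_OPTIMIZE
unsigned long long sum_ticks(const unsigned long long *ticks, size_t n)
{
    assert(ticks != NULL || n == 0);
    unsigned long long total = 0;
    for (size_t i = 0; i < n; i++)
        total += ticks[i];
    return total;
}
```

If the crash disappears once only the affected function is marked this way, that narrows the problem down to how that one function is optimised.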


