Recently We had some strange segmentation fault in one of our server app running on x86_64.
We had this segmentation fault one in thousand operations, We could not reproduce this at all in our dev/load testing environments.
We somehow got the 'stack trace' while one of the segmentation fault was happening.
But we were concerned about the validity of this stack trace as few of the symbols had the strange values, So we thought it could be a 'stack or heap corruption'.
But we could observe 'similar' call stacks in all crashes.
So We started with the function at the top of the stack.
The failure is as follows,
top_level_function calls 'ssl_var_lookup'(Function implemented by mod_ssl).
It looks as follows,
char *x = ssl_var_lookup(blah, blah);
We could print the value of 'x' using '%x'. The moment we dereference the address held in 'x' we get segmentation fault. We started printing byte by byte the memory pointed by 'x'. We get segmentation fault while accessing the first byte itself. We printed the address and its contents from inside 'ssl_var_lookup' which eventually returned to 'top_level_function' and we could do so without any segmentation fault from inside 'ssl_var_lookup'.
The addresses remain same both inside(just before return statement) and what
'top_level_function' receives.
Then what is the problem?
- Faulty memory chip?
- Strange bug in 'apr_pool' that we use to allocate memory?
- Faulty intel
- Faulty libc
- Faulty kernel.
- Anything else we are unaware of and able to blame.
Oh, Then it striked us when 'looking' at the addresses returned by 'ssl_var_lookup'.
We could see segmentation fault only when the address is of '8' hex digits.
i.e we could see segmentation fault if x is '0xabcd1234' not when
'0xabcd123'(See seven digits) or '0x6781ab' or anything less.
This striked us to think something in the lines of 'high memory address'.
So we instrumented the code in such a way to leak '10MB' before this
'ssl_var_lookup' call so that 'ssl_var_lookup' would return
larger addresses consistently and we could see the failure in controlled
environments.
YES, with '10M' leak we could ALWAYS trigger segmentation fault with the same
stack trace.
By closely executing the app assembly instruction by instruction and observing
the registers we could get the sense of what is happening.
We further instrumented our code to prove our findings.
Mistake in our earlier debugging:
We used '%x' to print the address whereas we should have used '%lx' on 64 bit
systems, if we have done that we could have got the problem without assembly
debugging.
Now 'What is the problem?'.
'ssl_var_lookup' returns a pointer which '64 bit address' in x86_64, whereas 'top_level_function' receives it as '32 bit address' which is fine if the address is having upper 32 bits as zeros. So whenever memory usage by application is high'ssl_var_lookup' returns 'true' 64 bit address which our top_level_function
partially reads and segfaults.
Should it be taught something about '64 bit' or keep teaching it whenever we
deploy on new intel archs?
No.
Why top_level_function behaves crankily?
First 'ssl_var_lookup' is not a public function(I mean no prototype definition
in mod_ssl's headers.).
Actually 'mod_ssl' uses apr's 'optional function' infrastructure to mangle its
function like 'apr_ofn_ssl_var_lookup' or something. So if the caller want to make use it they have to use the similar mechanism.
mod_ssl's source which defines 'ssl_var_lookup' does not put a 'static'
marker so the symbol is exported and hence our code is able to make a call with
its unmangled name.
'C' compiler assumes the return type of the function to be 'int'(32 bit) if prototype is not defined.
'C' compiler gives a warning when it assumes like this warning reads as follows, "warning: initialization makes pointer from integer without a cast".
So here it takes only lower 32 bits of the returned address from the
un-prototyped 'ssl_var_lookup' which may give a value which can be bigger than
32 bits.
Moral of this story: Pay attention to compiler warnings to save on Debugging costs.