My learnings

Friday, December 2, 2011

I love BSNL GPRS

Three months ago I ported my mobile number to BSNL.

I simply love the GPRS bandwidth of BSNL.

Earlier I used Airtel GPRS for 6 months. It worked only on the mobile itself; you cannot tether it. I tried Bluetooth tethering once, and it was pathetically slow.

For 2 months I used Uninor too. I did not try tethering with it, but I guess it would also be slow.

With both Airtel and Uninor, my mobile GPRS usage was at most 50 MB per month.

For the past 3 months I have simply tethered my BSNL GPRS. It works fantastically. I have stopped using my home broadband (I use it only when I need the highest speeds).

I guess my bandwidth usage is 1-2 GB per month, maybe more. I am not sure how to measure it (probably I need to sum the ifconfig output).

For instance, I ran yum update to pull in the 2 weeks of updates that had accumulated for Fedora 16 since its release. It took just 30 minutes (including the install/upgrade) for a 110 MB download.

I could see the download progress meter showing 27 KBps.

You know what? I pay just Rs 98 per month for this high-speed 3 GB internet plan, with free roaming across India.

For instance, I had a nice 20-minute Skype call over GPRS while I was roaming.

BSNL ROCKS...

I just checked my monthly GPRS usage: it is over 1.5 GB, while I could only consume 50 MB on Airtel.

Wednesday, July 1, 2009

Pay attention to compiler warnings to save debugging cost

Recently we had a strange segmentation fault in one of our server apps running on x86_64.

The segmentation fault occurred about once in a thousand operations, and we could not reproduce it at all in our dev/load-testing environments.

We somehow got a stack trace from one of the segmentation faults.

But we doubted the validity of this stack trace, as a few of the symbols had strange values, so we thought it could be stack or heap corruption.

But we observed similar call stacks in all the crashes.

So we started with the function at the top of the stack.

The failure is as follows.
top_level_function calls 'ssl_var_lookup' (a function implemented by mod_ssl).
The call looks like this:
char *x = ssl_var_lookup(blah, blah);

We could print the value of 'x' using '%x'. But the moment we dereferenced the address held in 'x', we got a segmentation fault. We tried printing the memory pointed to by 'x' byte by byte, and got the segmentation fault on the very first byte. We then printed the address and its contents from inside 'ssl_var_lookup', just before it returned to 'top_level_function', and there we could do so without any segmentation fault.

The printed address was the same both inside 'ssl_var_lookup' (just before the return statement) and in what 'top_level_function' received.

Then what is the problem?
- Faulty memory chip?
- Strange bug in the 'apr_pool' that we use to allocate memory?
- Faulty Intel CPU?
- Faulty libc?
- Faulty kernel?
- Anything else we were unaware of and could blame?

Oh, then it struck us while looking at the addresses returned by 'ssl_var_lookup'.

We saw the segmentation fault only when the address had 8 hex digits.

i.e. we saw the segmentation fault if x was '0xabcd1234', but not when it was '0xabcd123' (note: seven digits) or '0x6781ab' or anything shorter.

This made us think along the lines of 'high memory addresses'.

So we instrumented the code to leak 10 MB before this 'ssl_var_lookup' call, so that 'ssl_var_lookup' would consistently return larger addresses and we could see the failure in controlled environments.

YES, with the 10 MB leak we could ALWAYS trigger the segmentation fault with the same stack trace.

By stepping through the app one assembly instruction at a time and observing the registers, we got a sense of what was happening.

We further instrumented our code to prove our findings.

Mistake in our earlier debugging:
We used '%x' to print the address, whereas we should have used '%lx' on 64-bit systems. Had we done that, we could have found the problem without assembly debugging.

Now, what is the problem?

'ssl_var_lookup' returns a pointer, which is a 64-bit address on x86_64, whereas 'top_level_function' receives it as a 32-bit value. That is fine as long as the upper 32 bits of the address are zero. So whenever the application's memory usage is high, 'ssl_var_lookup' returns a 'true' 64-bit address, of which our top_level_function reads only part, and it segfaults.

Should it be taught something about 64 bit, or do we keep teaching it whenever we deploy on new Intel architectures?
No.

Why does top_level_function behave so crankily?
First, 'ssl_var_lookup' is not a public function (I mean there is no prototype for it in mod_ssl's headers).

Actually 'mod_ssl' uses APR's 'optional function' infrastructure, which mangles the function name into something like 'apr_ofn_ssl_var_lookup'. So a caller who wants to use it has to go through the same mechanism.

The mod_ssl source that defines 'ssl_var_lookup' does not mark it 'static', so the symbol is exported and hence our code is able to call it by its unmangled name.

If no prototype is in scope, the C compiler assumes the function returns 'int' (32 bit).
The compiler gives a warning when it makes this assumption; it reads: "warning: initialization makes pointer from integer without a cast".

So here it takes only the lower 32 bits of the address returned by the un-prototyped 'ssl_var_lookup', even though the address may be wider than 32 bits.

Moral of the story: pay attention to compiler warnings to save on debugging costs.

Tuesday, March 18, 2008

How to debug a segfaulting Apache module?

Recently I had a strange segmentation fault in one of my Apache modules.

I wanted to get closer to the point of the segmentation fault.
First let me share the ineffective approaches, and then the effective one.
Ineffective approaches:

1. Logged lots of printfs to a text file to trace the flow of execution.
2. Attached the debugger to the 'apache listen process' with:

set follow-fork-mode child

'2' looks elegant, but it did not work for me, because one cannot be sure which child process will serve the request.

Effective approach:
Configured my Apache to have only one child (worker process) at any time:

StartServers 1
MinSpareServers 1
MaxSpareServers 0
MaxClients 150
MaxRequestsPerChild 0

With this configuration you will see only 2 Apache processes at any time:
one owned by 'root', the listener process (which we do not care about), and another owned by the apache user (which we do care about).

Attach a debugger to the process owned by 'apache'.
Set the relevant breakpoints, watchpoints, etc.
Allow the process to continue.
Make an HTTP request.
You will get control in your debugger.

Even if you run Apache as a non-root user it does not matter; it is still easy to identify the worker process.
The Apache process that remains alive across requests is not a worker.