Coasting in Neutral: July 2008

Here's something I learned recently and found quite interesting. A neat way to calculate the value of pi using estimation is to take a unit circle inside a square of side 2 units and take random shots on the surface of the square.

Now, the ratio of the area between the circle and square can be given by,

pi*r^2/L^2 = hits/shots

Thus, we can estimate the value of pi as,

pi=4*hits/shots

Now, I wrote a little script in Python which does just that. I also put in the value of pi to 10 decimal places and wrote out the error produced. Initially, I didn't take a circle of unit radius but that just produced too much error:

import math;
import random;
import sys;

#calculate the value of pi using probability
maxtries = float(sys.argv[1]);

radius = 1.0;
side = 2.0 * radius;
tries = 0.0;
hits = 0.0;
pi = 3.1415926536;

while (tries < maxtries):
tries = tries + 1;
#get a point within the square

rx = random.random()*side;
ry = random.random()*side;

#distance from center of circle
dist = math.sqrt(math.pow(rx - radius, 2.0) + math.pow(ry - radius, 2.0));
if (dist <= radius): hits = hits + 1;

#estimated value of pi
piest = (hits*4/tries);
print "Radius: %d, Tries: %d, Hits: %d, pi: %1.10f, Error:%1.10f\n" % (radius, tries, hits, piest, piest - pi);

My output looks something like:

~$ python pi.py 10.0
Radius: 1, Tries: 10, Hits: 6, pi: 2.4000000000, Error:-0.7415926536

~$ python pi.py 100.0
Radius: 1, Tries: 100, Hits: 83, pi: 3.3200000000, Error:0.1784073464

~$ python pi.py 1000.0
Radius: 1, Tries: 1000, Hits: 790, pi: 3.1600000000, Error:0.0184073464

~$ python pi.py 10000.0
Radius: 1, Tries: 10000, Hits: 7822, pi: 3.1288000000, Error:-0.0127926536

~$ python pi.py 100000.0
Radius: 1, Tries: 100000, Hits: 78591, pi: 3.1436400000, Error:0.0020473464

~$ python pi.py 1000000.0
Radius: 1, Tries: 1000000, Hits: 786368, pi: 3.1454720000, Error:0.0038793464

~$ python pi.py 10000000.0
Radius: 1, Tries: 10000000, Hits: 7853556, pi: 3.1414224000, Error:-0.0001702536

~$ python pi.py 100000000.0
Radius: 1, Tries: 100000000, Hits: 78539696, pi: 3.1415878400, Error:-0.0000048136

~$ python pi.py 1000000000.0
Radius: 1, Tries: 1000000000, Hits: 785382068, pi: 3.1415282720, Error:-0.0000643816

The last observation is the most interesting. It ran for a few hours on my machine but did absolutely nothing to improve the error. In fact, the error actually increased!

About 6 months back I worked on an interesting project for school where we were developing a checkpointing application. My goal was to port the software onto Ubuntu where it some reason would not work. I initially put a query on ubuntuforums.org to which I never replied and people have queried me on and off about whether I did get an answer to this issue. So before the project information gets deleted from my school wiki, I'm going to put the full problem statement and solution here.

Checkpointing is a technique for storing the state of a process in order to recover it at a later point. Checkpointing is used where there are long running processes, saving the state of scientific applications to execute different conditions from that state or cases where debugging of processes is required. The different approaches to checkpointing include:

a. Kernel Level Checkpointing
b. User Level Checkpointing
c. Application Level Checkpointing

In Kernel Level Checkpointing, the operating system is modified to support checkpointing for
applications. This is the most transparent but the least efficient of the approaches. Here, the operating system does not know anything about the application and simply dumps the memory into a file for later restart. An issue with this approach is that the changes required in the kernel to support checkpointing will be different for each OS. User level checkpointing allows the user to link the application to the checkpointing framework in order to make the application checkpointable. With application level checkpointing, the developer of the application implements checkpointing within the application itself as the person has detailed knowledge about the application. At the same time this approach has the least amount of transparency and it is very difficult to migrate the checkpointing approach from one application to another.

In DMTCP (Distributed MultiThreaded CheckPointing) we study a user-level checkpointing technique used to restart or migrate a process. DMTCP works at the socket level so it can checkpoint applications without requiring modification to either the operating system or the application. DMTCP operates using a checkpoint control process which sends messages to a checkpoint manager thread about each process. The manager then uses signals to gain control of other threads before checkpointing. Our term project will be to study DMTCP and to test it for different application and to extend DMTCP to applications using LAM/MPI.

The first problem that we tackled was to port mtcp to Ubuntu. We noticed a problem when we first tried to run dmtcp on Ubuntu 7.10. When trying to run just mtcp mtcp_restore used to give a segmentation fault.

After debugging through the code, we traced the problem to the mtcp_sys_munmap() call in mtcp_restatic.c. Looking at the memory maps gave us something like this:

08048000-080be000 r-xp 00000000 08:07 594167 /home/vsriram/dmtcp/mtcp/mtcp_restore
080be000-080c0000 rw-p 00075000 08:07 594167 /home/vsriram/dmtcp/mtcp/mtcp_restore
080c0000-080e3000 rw-p 080c0000 00:00 0 [heap]
4001e000-4002c000 rwxp 4001e000 00:00 0
bfa6d000-bfa83000 rw-p bfa6d000 00:00 0 [stack]
ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

The problem occurred when unmapping the heap area of the memory. We figured that the error would be either with the Ubuntu Linux Kernel or with glibc. This is because mtcp works fine in other Linux distros like Suse or Red Hat and even Linux versions prior to 6.10. The only other calls were to the Linux Kernel and glibc libraries. The next steps were to put printk statements in the kernel and see where it gives an error or to compile a vanilla glibc and see if the error was still reproducing using the glibc libraries.

We first approached the problem by trying to re-compile glibc. glibc would not compile easily on Ubuntu and it gave an undefined reference errorwhich we could not resolve even by setting the CFLAGS environment variable to -fno-stack-protector. So, this approach was abandoned in favor of recompiling the kernel and putting printk statements in the kernel code. After doing this and recompiling the kernel, we noticed that the sys_munmap() system call returns without any issues. This meant that the problem lay outside the system call.

As no system calls (even printf) can be called after the return from the sys_munmap() sytem call we decided to put while(1); statements to see till what points the application would continue (or in this case hang) without breaking. While doing this we could go upto the highest_userspace_address() call in code. The first thought was that the function does not exist in the address any more and this was leading to a segmentation fault.

We then recompiled mtcp_restatic.c to an assembly file (by using gcc -S) and put the assembly code for the while(1); loop in the mtcp_restatic.s file and recompiled the assembly code to an object file. This was to see if the program counter could enter the highest_userspace_address() method.

Assembly code for while(1); loop is shown below:

.L170
jmp .L170

At this point we found that the error occurs during the reference to %gs:20 which is a segment register referencing a memory location which is in the heap. Googling this lead to an interesting webpage where they talked about stack protection in Ubuntu which might be cause of this issue.

We read about Stack smashing protection where canary values are placed between a buffer and control data on the stack to monitor buffer overflows. When the buffer overflows, the first data to be corrupted will be the canary, and a failed verification of the canary data is therefore an alert of an overflow, which can then be handled. The canary values are inserted in by gcc while compiling the code.

The assembly code below sets the canary value at the beginning of the function.

movl %gs:20, %eax
movl %eax, -8(%ebp)

The assembly code below checks if the value of the canary remained the same. This code is placed at the end of the function.

movl -8(%ebp), %edx
xorl %gs:20, %edx
je .L179
call __stack_chk_fail

After the canary checking assembly code was removed from the assembly code and compiled to object code, the highest_userspace_address function worked fine but it failed further on. This validated the hypothesis that the segmentation fault was because of the reference to the memory location in the heap which by then was already unmapped.

Adding -fno-stack-protector to gcc (in the Makefile) before compiling the entire mtcp code removes the default stack protection and mtcp works fine after that.

Coasting in Neutral

Saturday, July 26, 2008

Estimating the value of pi

Wednesday, July 23, 2008

glibc on Ubuntu

I Support