Sir Charles Babbage's computational engine, and the Navy's first automatic digital computers, were designed to compute ballistics tables. Today, ballistics tables are important both for the army and navy, and in path planning for space exploration.

This program computes the destination location for ore launched by a lunar mass driver. The ore experiences drag as it encounters mining dust in flight. The launcher's maximum range is 25 miles, approximately a one minute flight time.

typedef float real; /* "real" is our numeric type (could also be double, __m128) */This code is also on the powerwall in "~olawlor/441_demo/hw4/".

/* Coefficients used in one path computation. */

class path_coeff {

public:

real drag; /* multiply by velocity to get drag-induced acceleration */

real g; /* gravitational acceleration, miles/sec/sec */

real dt; /* timestep, in seconds */

};

/*

Run one ballistic path computation.

Y axis is height. X axis is distance downrange.

Projectile starts at (0,0) miles with initial velocity vx,vy miles/second.

Projectile's terminal (y==0) range in miles is returned in final_x,

and total flight time in seconds is returned in final_t.

*/

void follow_path(const path_coeff &coeff,

real vx,real vy,

real *final_x,real *final_t)

{

real x=(real)0.0, y=(real)0.0, t=(real)0.0;

real dt=coeff.dt; /* timestep */

bool done=false;

while (!done) {

real ax=-coeff.drag*vx, ay=-coeff.drag*vy-coeff.g; /* acceleration */

vx+=dt*ax; vy+=dt*ay; /* velocity */

if (y+dt*vy<0.0) { /* we're about to hit the ground */

*final_x=x;

*final_t=t;

done=true;

}

x+=dt*vx; y+=dt*vy; /* position */

t+=dt;

}

}

/* Return the initial velocity for path i */

void path_initial(int i,real *vx,real *vy) {

*vx=0.73-i*0.005;

*vy=0.01+i*0.0005;

}

/* Stores a set of computed flight paths */

class path_table {

public:

enum {n=100};

real x[n], t[n]; /* range (miles) and time (seconds) for path */

};

/* Build and print a set of computed ballistics paths */

void fill_table(const path_coeff &coeff,path_table &table) {

for (int i=0;i<path_table::n;i++) {

real vx,vy;

path_initial(i,&vx,&vy);

follow_path(coeff,vx,vy,&table.x[i],&table.t[i]);

}

}

void print_table(path_table &table) {

for (int i=0;i<path_table::n;i++) {

printf("Path %3d: x=%.3f miles, t=%.3f seconds\n",

i,table.x[i],table.t[i]);

}

}

path_coeff coeff;

int time_fill(void) {

path_table table;

fill_table(coeff,table);

return 0;

}

int foo(void) {

coeff.drag=0.007; /* drag due to ore collisions with mining dust */

coeff.g=1/6.0*(32/5280.0); /*lunar gravity acceleration, miles/s^2*/

coeff.dt=0.001; /*timestep, in seconds: more accurate at 0.0001 or smaller*/

printf("fill_table takes %.3f ms\n",time_function(time_fill)*1.0e3);

path_table table;

fill_table(coeff,table);

print_table(table);

return 0;

}

"follow_path" is where most of the time is spent. The "time_function" call above is NetRun-specific; feel free to replace it with an MPI timer call in the MPI version.

For maximum accuracy, they'd like to run this same program at dt=0.0001 or smaller, and for bidding purposes, they'd like the results to be returned as fast as possible. Of course, the results still have to be printed the same way, and they have to be right!

For the homework, please:

- Determine how this program's answer and run time changes as you
vary the timestep coeff.dt. Note that NetRun will only let the
code run for so long, so smaller dt's may get killed off by
NetRun. Very small dt's may cause rounding error to dominate the
answer.

- Optimize this program using OpenMP. Luckily, there are no global variables in this program. However, due to the different path lengths, load balance between threads can be a serious performance problem; I recommend "schedule(dynamic,1)". Feel free to use either NetRun (quad core; switch to the "OpenMP" language), the powerwall (dual core; compile with ~olawlor/441_demo/hw4/ makefile or "g++-4.2 -msse3 -fopenmp main.cpp -o main"), or use your own machine for performance tests.
- Optimize this program using SSE. Note that "follow_path" doesn't do any 4-input arithmetic, so it'll be tough to optimize follow_path much without changing the interface. Feel free to change the call to follow_path! I recommend making a new "__m128" version of follow_path that traces 4 paths at once--the only tricky part about doing this is the "hit the ground" check, which can be transformed into an SSE-style parallel if.
- Optimize this program using MPI. You'll need to do your
performance runs on the powerwall. Feel free to start with your
SSE version, or the original program. You'll need to collect up
the final answers onto one machine to be printed. As with OpenMP,
load balance is a serious problem; one solution is to have process p
compute the iterations where i%size==p (called "round robin"
decomposition). Because they'll be computing tables repeatedly on
the fly, you can ignore the MPI startup and shutdown time--MPI_Wtime()
is the important time.