Image

Imagejspence wrote in Imagecpp

Alignment

Recently, I wrote a tool while thinking about performance problems on the P4; it's an alignment problem catcher for x86 machines. Links:

Unix
Win32.

To understand what's going on here, you'll probably need a little theory first. If you just want to use those two tools to do alignment checking, feel free to go nuts with the cut and paste.

Basically, the tools enable what's called the AC bit in the processor. When the AC bit is on, you should get signals or exceptions when you do an unaligned access (if your operating system supports it). If you don't know what I'm talking about, keep reading.



Consider your computer's memory. This is where all your stuff goes while you're using the computer to work on it: your email, web pages, code, game maps, etc all get stored in RAM while the processor does stuff to the data to make pretty pictures or let you add and remove words from whatever you're writing.

On the P4 based system I'm writing this on, the processor talks to a chip on the motherboard called the north bridge, which in turn talks to those sticks of RAM with your computer's memory on it [1]. The north bridge contains something called a "memory controller" which is responsible for getting your CPU and your RAM to talk to each other.

For technical reasons, those physical sticks of RAM are internally divided into little buckets of bytes; your processor requests those buckets via the memory controller. When the processor asks the the memory controller to read and write memory, the memory controller figures out which bucket of memory contains what the processor wants, then it retrieves the entire bucket and hands it over to the processor. Note that it retrieves the entire bucket, even if the bucket contains more than what the processor wanted.

Got that? You should be picturing buckets full of memory flying around on your motherboard between your RAM and your CPU, like an hyperactive 800 megahertz chinese fire brigade. Think about it for a minute.

Right, moving on.

The alignment problem is illustrated with this example: suppose the processor wants to do something to a chunk of memory which is located partly in one bucket and partly in another bucket. To work on the entire chunk, the memory controller has to fetch two buckets to get the whole thing, which takes twice as long as if the chunk were located in only one bucket. More buckets means more work for the memory controller, which means your program runs slower.

To avoid this, your goal as a programmer is to make sure that your data is aligned. Aligned is just a fancy term that means your data fits in as few buckets as possible. [2]

Here's an example in C:

struct recipe {
  int recipe_id;
  char chefs_initials[3];
  short cooking_time; /* in minutes */
};

In memory, this would look like this:
                    BUCKET 1 (00-07)                    |       BUCKET 2 (08-16)
                                                        |
   00     01     02     03     04     05     06     07  |  08     09 ...
+------+------+------+------+------+------+------+------+------+-----...
|         recipe_id         |   chefs_initials   | cooking_time|
+------+------+------+------+------+------+------+------+------+-----...

Note that cooking time is located in both bucket 1 and bucket 2. If these were actual memory addresses, the address of cooking_time would probably be something like 0x804c707. In that case, assuming 8 byte buckets like on a P4, bucket 1 would be addresses 0x804c700-0x804c707, and bucket 2 would be addresses 0x804c708-0x804c710. Because cooking_time is a short, it's two bytes long, and so it's in both bucket 1 and bucket 2.

Now suppose that you have a bajillion recipies in memory and you want to update the cooking_time field in every one of them because someone entered the cooking time in seconds instead of minutes. Oops! You're doing work on twice the number of memory buckets than you really need to! Had you defined your structure like this, so cooking_time was located in only one bucket, it would take only half the time to do it:

struct recipe {
  int recipe_id;
  short cooking_time; /* in minutes */
  char chefs_initials[3];
};

                    BUCKET 1 (00-07)                    |       BUCKET 2 (08-16)
                                                        |
   00     01     02     03     04     05     06     07  |  08     09 ...
+------+------+------+------+------+------+------+------+------+-----...
|         recipe_id         | cooking_time|   chefs_initials   | 
+------+------+------+------+------+------+------+------+------+-----...

If you've been paying attention, you might have noticed that after we rearranged the structure, we now have the same problem with the chefs_initials field: it's in both bucket 1 and bucket 2. The usual solution to this problem is to add what's called a pad field between cooking_time and chefs_initials:

struct recipe {
  int recipe_id;
  short cooking_time; /* in minutes */
  char pad[2];
  char chefs_initials[3];
};

 
                    BUCKET 1 (00-07)                    |           BUCKET 2 (08-16)
                                                        |
   00     01     02     03     04     05     06     07  |  08     09     10     11 ...
+------+------+------+------+------+------+------+------+------+------+------+-----...
|         recipe_id         | cooking_time| **  PAD  ** |   chefs_initials   |
+------+------+------+------+------+------+------+------+------+------+------+-----...

So now it takes up more space in memory because of the pad field, but your code will run faster because an access to any field only requires an access to one bucket. This is an example of what's called a time-space tradeoff, where you use more space to go faster, or you use less space but your code is slower. There is no one answer to whether to you should favor time or space when faced with this kind of decision; you have to think about your software's goals and your user's requirements and do what meets their needs the best.

[1] If I had a different processor, like the AMD Athlon64, the processor would be directly connected to the memory. This is because in that case, the circuits that would normally go in the north bridge (which are collectively called "the memory controller") are already in the Athlon64 processor itself, so there isn't any need for a seperate memory controller in the north bridge.

[2] If you've taken a course on discrete mathematics or studied number theory, you may recognize this similar to the 2k and 2k+1 definitions of even and odd integers, or perhaps as an integer version of a packing problem involving elements of identical size. A more formal way of defining proper alignment is you want your objects to be located at some address A in memory, where A = s * k, k is an integer and s is the size of your object. An even more simple restatement of this (using the quotient-remainder theorem) is A mod s should be 0.