Tuesday, November 20, 2012

C Pointers Explained, Really

While I was in college, a friend of mine complained that he was confused while programming in C, struggling to learn the syntax for pointers.

He gave the example of something like: *x=**p++ being ugly and unreadable, with too many operations layered on each other, making it hard to tell what was happening.  He said he had done a bit of programming with assembly language, but he wasn't accustomed to the nuances of C.

I wrote the following explanation on our student message board, and I got a lot of good feedback.  Some people said that they had been programming in C for years, but not until they read my post did they finally understand pointers.  So here it is, unearthed from my backups and slightly edited.  I hope it helps someone again...


Message 1956 (8 left): Thu Jan 25 1990  2:44am
From: Bill! (easterb@ucscb)
Subject: Okay


Well, if you know assembly, you have a head start
on many of the cis freshpersons here.  You at least know
about memory maps:  RAM is a long long array of bytes.
It helped me to learn about pointers if I kept this in mind.
For some reason, books and instructors talking about
pointers want to overlook this.


When I have some code:

main()
{
int n;
int *p;


There is a place in my memory that looks like this:

           :
Address:   :
        |-----|
  0x5100|     |   n is an integer, one machine word big
        |-----|
  0x5104|     |   p is a pointer, also one word big
        |-----|
  0x5108|     |   other unused memory
        |-----|
           :
           :

Let's give these variables some values.
I set n to be the number 151.

        n = 151;

I set the pointer p to point to the integer n.

        p = &n;

That says, "the value of the variable p is assigned the
address of the variable n".

             :
Address:     :     Value at that address:
          |----|
  0x5100  | 151|  n
          |----|
  0x5104  |5100|  p
          |----|
  0x5108  |   ?|
          |----|
             :
             :

Now I want to print out the value of n, by two ways.

        printf("n is %d.\n", n);
        printf("n is %d.\n", *p);

The * operator says, "give me the object at the following address."
The object's type is the type that the pointer was declared as.
So, since we declared "int *p", the object pointed at will be
_assumed_ by C to be an int.  In this case, we were careful to
make this coincide with what we were pointing at.

Now I want to print out the memory address of n.

        printf("n is located at $%x.\n", &n);
        printf("n is located at $%x.\n", p);

The & operator says, "tell me the address where the following object
starts."  In this case, it is hex 5100 (I put a '$' before it, to
conform to the Assembly notation I am used to).
Notice the _value_ of p is an address.

Hm.  Does p have an address?  Sure.  It is a variable, and all
variables have their own address.  The address of p is hex 5104.

        printf("p is located at $%x.\n", &p);

Here we are taking the address of a pointer variable, 
using the & operator.

main()
{
char name[] = "Bill";
char *p;
int *q;

Now we have an array to play with.  Here's how memory looks now:

        |---|
 0x5100 |'B'|  "name" is an address constant that has value hex 5100
        |---|
 0x5101 |'i'|  char: 1 byte
        |---|
 0x5102 |'l'|  char: 1 byte
        |---|
 0x5103 |'l'|  char: 1 byte
        |---|
 0x5104 |\0 |  char: 1 byte
        |---|
 0x5105 |   |  p is a pointer: 1 word
        |---|
 0x5109 |   |  q is a pointer: 1 word
        |---|

        p = name;

We set p to the value of name.  Now p has value hex 5100 too.
We can use the * dereferencing operator on p, and get the
character 'B' as a result.

Now what happens if I do this:

        ++p;

The pointer p is incremented.  What value does it have now?
Hex 5101.  Pretty simple.

Now let's try something irresponsible:

        q = name;

But q is a pointer to int!  If we dereference q, it will take
the word (typically 4 bytes) beginning at address "name" (which
is hex 5100) and try to convert it to an int.  'B', 'i', 'l', 'l'
converted to an int will be some large number, dependent on the
byte ordering of your machine's architecture.  On ucscb,
it becomes 1114205292.  (to see how, line up the binary representation
of the ascii values for those 4 characters, and then run the 32 bits
together, and convert that resultant binary number as an integer.)

What we have just seen here is a key issue of pointers that I
mentioned earlier:  C assumes that what they are pointing at
is an object of the type that the pointer was designed to point at.
It is up to the programmer to make sure this happens correctly.

        ++q;

The int pointer is incremented.  What value does it have now?
Hex 5104.  Huh?!?  The answer is simple if you accept the above
paragraph.  It gets incremented by the size of the object it
_thinks_ it is pointing at.  It's an int pointer, so incrementing
it makes it advance a number of bytes equal to the size of an int.

Now print the dereferenced value of q (i.e. the value of the object
q is pointing to).  Well, it's pointing at a null byte, and then
the first 3 bytes of the char *p.  Now we're all messed up.
Nice going.  Try to convert _that_ to an integer representation.
Well actually, C will do it happily.  But it'll be another weird 
number.


main()
{
int n;

        n = 151;
        f(n);
}

f(x)
int x;
{
        printf("%d.\n", x);
}

Here is a simple program that passes an int "by value".
That is, it copies the value of n into the new variable x!

        |---|
 0x5100 |151|  n is an integer
        |---|
 0x5104 |151|  x is another integer
        |---|

When we mention x, we are using the value at location 5104,
and we can change it, read it, whatever, and it won't affect n,
the int at location 5100.

But what if we want to have f() modify the value and then
have that new value be available in main()?  C does this by
passing the variable "by reference".

main()
{
int n;

        n = 151;
        f(&n);
}

f(x)
int *x;
{
        printf("%d.\n", *x);
        *x = 451;
}

Pass the _address_ of n, and declare x as a _pointer_ to int.
Actually, this is still passing by value, but the value being
passed is the address, not the number.

        |----|
 0x5100 | 151|  n is an integer
        |----|
 0x5104 |5100|  x is a pointer to int
        |----|

Now in f(), when we make use of *x, we are referring to the
value at location 5100.  This is the location of n.
After the assignment "*x = 451;", this is what we have:

        |----|
 0x5100 | 451|  n is an integer
        |----|
 0x5104 |5100|  x is a pointer to int
        |----|

x still points to location 5100, but we have changed the value
of the object at that location.




Well, those are the basics.
You mentioned things like "*x=**p++" being ugly and unreadable.
Well, yeah, but here is a diagram that may help:

        |----|  here is a word in memory with initial value 0. 
 0x5100 |   0|  no variable name
        |----|
 0x5104 |  12|  here is a value, a word in memory.  no variable name.
        |----|
 0x5108 |5104|  Here is an int pointer, pointing at the previous word.
        |----|
  0x510c |5108|  here is p, a pointer to int pointer.
         |----|
  0x5110 |5100|  here is x, a pointer.  guess where it's pointing.
        |----|

First let's see what p and x were declared as:
int *x;    /* pointer to int */
int **p;   /* pointer to pointer.  
              The subordinate pointer is a pointer to int.*/

You should know now what "*x" means.  It means, "the value of location 5100."
And you know what "*p" means, "the value of location 5108".
Now that value is another address!  Okay, let's dereference that
address: "**p" and we find (by the declaration) an int.

Now "*x = **p" looks like, "this int at 5100 gets the value of
that int at 5104."

And what does "**p++" mean?  Well, the postfix ++ binds tighter than
the unary *, so this is equivalent to:  *( *( p++ ) )
Or, "pointer to pointer to int, and by the way, after we're done,
p has been incremented.  But we looked where it was pointing
before it got incremented, so we don't care.  Let the next statement
worry about it."



This content is copyright 2012 by Bill Karwin.  I'll share it under the terms of the Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0 Unported.

Wednesday, March 24, 2010

Rendering Trees with Closure Tables

I got a comment from a reader about the Naive Trees section of my presentation SQL Antipatterns Strike Back. I've given this presentation at the MySQL Conference & Expo in the past.

I'd also like to mention that I've developed these ideas into a new book, SQL Antipatterns: Avoiding the Pitfalls of Database Programming. The book is now available in Beta and for pre-order from Pragmatic Bookshelf.

Here's the reader's question:
I would like to ask if there's a way I can dump all the hierarchies in a single query using a closure table? For example I have a following tree:

rootree
- 1stbranch
- midbranch
  - corebranch
    - leafnodes
- lastbranch
  - lastleaf

and I want to display it like:

rootree -> 1stbranch
rootree -> midbranch
rootree -> midbranch -> corebranch
rootree -> midbranch -> corebranch -> leafnodes
rootree -> lastbranch
rootree -> lastbranch -> lastleaf

The Closure Table is a design for representing trees in a relational database by storing all the paths between tree nodes. Using the reader's example, one could define and populate two tables like this:
drop table if exists closure;
drop table if exists nodes;

create table nodes (
node int auto_increment primary key,
label varchar(20) not null
);

insert into nodes (node, label) values
(1, 'rootree'),
(2, '1stbranch'),
(3, 'midbranch'),
(4, 'corebranch'),
(5, 'leafnodes'),
(6, 'lastbranch'),
(7, 'lastleaf');

create table closure (
ancestor int not null,
descendant int not null,
primary key (ancestor, descendant),
foreign key (ancestor) references nodes(node),
foreign key (descendant) references nodes(node)
);

insert into closure (ancestor, descendant) values
(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (1,7),
(2,2),
(3,3), (3,4), (3,5),
(4,4), (4,5),
(5,5),
(6,6), (6,7),
(7,7);
What we need to do is find all the descendants of the root node 1, then for each of these descendant nodes, list its ancestors in order, separated by an arrow. We can use MySQL's useful GROUP_CONCAT() function to build this list for us.
select group_concat(n.label order by n.node separator ' -> ') as path
from closure d
join closure a on (a.descendant = d.descendant)
join nodes n on (n.node = a.ancestor)
where d.ancestor = 1 and d.descendant != d.ancestor
group by d.descendant;

Here's the output in the MySQL client. It looks like what the reader asked for:
+-------------------------------------------------+
| path                                            |
+-------------------------------------------------+
| rootree -> 1stbranch                            |
| rootree -> midbranch                            |
| rootree -> midbranch -> corebranch              |
| rootree -> midbranch -> corebranch -> leafnodes |
| rootree -> lastbranch                           |
| rootree -> lastbranch -> lastleaf               |
+-------------------------------------------------+
I do assume for the purposes of ordering that all of a node's ancestors have a lower node number.  You could alternatively add a pathlength column to the closure table and sort by that.

The Closure Table design is nice compared to the Nested Sets (or Preorder Traversal) design, because it supports the use of referential integrity.  With the right indexes, the EXPLAIN report shows that the MySQL query optimizer does a pretty good job on it (I've omitted a few columns for brevity):
+-------+--------+-------------------+--------------------------+
| table | type   | ref               | Extra                    |
+-------+--------+-------------------+--------------------------+
| d     | range  | NULL              | Using where; Using index |
| a     | ref    | test.d.descendant |                          |
| n     | eq_ref | test.a.ancestor   |                          |
+-------+--------+-------------------+--------------------------+

Thursday, May 21, 2009

EAV FAIL


[Photo: the roadside sign at New Cuyama, California: "Population 562 / Ft. above sea level 2150 / Established 1951 / Total 4663"]

The photo above illustrates (by counter-example) an important characteristic of a normalized database: each logical "type" of attribute belongs in a separate column.

Just because three values happen to be numeric doesn't mean it makes sense to SUM() them together. But if dissimilar attributes are stored in the same column, it's tempting to treat them as compatible in this way.

This also shows a fallacy of the Entity-Attribute-Value antipattern. In this design, all attribute values are stored in a single column.

CREATE TABLE EntityAttributeValue (
entity        VARCHAR(20) NOT NULL,
attribute     VARCHAR(20) NOT NULL,
value         VARCHAR(1000) NOT NULL,
PRIMARY KEY (entity, attribute)
);

INSERT INTO EntityAttributeValue (entity, attribute, value)
VALUES
('New Cuyama', 'Population',          '562'),
('New Cuyama', 'Ft. above sea level', '2150'),
('New Cuyama', 'Established',         '1951');

SELECT SUM(value) FROM EntityAttributeValue
WHERE entity = 'New Cuyama';


The Entity-Attribute-Value design does not support or conform to rules of database normalization.

To be clear, the proper way to design a database is to put different attributes in different columns. Use column names, not strings, to identify the attributes.

CREATE TABLE Cities (
 city_id          SERIAL PRIMARY KEY,
 city_name        VARCHAR(100) NOT NULL,
 population       INT UNSIGNED NOT NULL,
 feet_altitude    SMALLINT UNSIGNED NOT NULL,
 year_established SMALLINT UNSIGNED NOT NULL
);

Friday, May 23, 2008

ActiveRecord does not suck

I've been reading a few blog postings such as Kore Nordmann's "ActiveRecord sucks" and Mike Seth's "ActiveRecord sucks, but Kore Nordmann is wrong".

ActiveRecord is fine.  It is a tool that does just what it's designed to do.  What sucks is when developers try to make it do other things than what it's intended to do.

I worked for Zend, managing the Zend Framework project through its 1.0 release.  I also completed the implementation and documentation of Zend_Db and its related components. To set the record straight, Zend_Db does not implement the ActiveRecord pattern. It implements the Table Data Gateway and Row Data Gateway patterns, which taken together offer similar value as the ActiveRecord pattern.

I totally agree with Mike Seth that MVC should not be taken as "ActiveRecord-View-Controller." I tried to make this point in documentation, screencasts, conference presentations, and in many mailing list messages, but I met with little success.

Unfortunately, the Ruby on Rails drumbeat that Models are simply database table wrappers has established momentum.  The term "Model" has (incorrectly) become synonymous in many developers' minds with "ActiveRecord."  Since Models by this definition are tied to database access and largely implement various query techniques, Ruby on Rails development forces you to write a large amount of code in the controller classes that should properly be written in Model classes.

This has a few consequences. Unit-testing controller classes becomes very complex, since that's where the majority of your application code resides.  To test a controller class you need to mock HTTP requests, and sift through HTML output.  That is workable, but it's a lot of extra effort, given how much important, complex code must be tested through the controller.  If the application code were separated into a true Model, then unit-testing the controller would simply be testing whether the HTTP request had been communicated to the Model correctly.  Testing the behavior of the Model would be much more straightforward unit-testing of a class API in isolation, requiring no mock HTTP requests or scraping of HTML output.

Also, unit-testing Rails-style Model classes is difficult, since the Model is coupled with the database.  We start to see unfortunate things like database fixtures as being necessary before you can execute the simplest tests against your Model class.  This makes testing Models time-consuming, error-prone, and slow.

If developers were to separate Models and Controllers properly, and separate data access components from Models, unit-testing all of these classes could be done more simply, and with greater isolation from other classes.  This makes it easier to diagnose defects, when they occur.  Isn't this the point of unit tests?

A Model is a class that provides a logical component of your application domain.  Models are products of OO design, which is a development activity I see get very little attention in the developer blogosphere or the developer tools market.  Developers seem more enchanted by evangelizing their favorite code editor or debugger, or by squeezing more performance out of their database, than by mental analysis to make sure their OO architecture is modeling its application requirements well.

A single Model class may be backed by a database table, or multiple database tables, or perhaps even no database tables.  Data persistence should be an internal implementation detail within a Model; the external API of the Model class should reflect its logical OO requirements, not the physical database structure.

(update) What I often tell people is that the relationship between a Model and an ORM class should be "HAS-A" rather than "IS-A."  The latter is the assumption of Rails and other frameworks who are enticed by ActiveRecord.  If the Model uses ORM objects instead of inheriting from ORM classes, then you can design the Model to contain all data and code for the domain it's supposed to model -- even if it takes multiple database tables to represent it.

Many developers have complained that Zend Framework provides no base Model class. Of course it doesn't provide a base Model class! That's your job. This complaint is like saying that Microsoft Word sucks because it doesn't provide finished documents for you.

So I wouldn't say ActiveRecord sucks.  I would say that people are expecting ActiveRecord to magically solve their OO design for them, in a false quest to avoid doing the design themselves.