Do references get updated when Garbage Collectors move data in heap?

Problem :

I read that GCs (garbage collectors) move data in the heap for performance reasons. I don’t quite understand why, since it is random-access memory; maybe it is for better sequential access. What I wonder is whether references on the stack get updated when such a move occurs in the heap. Or maybe the offset address remains the same and only other parts of the data get moved by the garbage collector; I am not sure.

I think this question pertains to an implementation detail, since not all garbage collectors may perform such an optimization, or they may do it without updating references (if that is a common practice among garbage collector implementations). I would like an overall answer specific to the CLR (Common Language Runtime) garbage collector, though.

I was also reading Eric Lippert’s “References are not addresses” article here, and the following paragraph confused me a little bit:

If you think of a reference as actually being an opaque GC handle then
it becomes clear that to find the address associated with the handle
you have to somehow “fix” the object. You have to tell the GC “until
further notice, the object with this handle must not be moved in
memory, because someone might have an interior pointer to it”. (There
are various ways to do that which are beyond the scope of this
screed.)

It sounds like, for reference types, we don’t want the data to be moved. Then what else do we store in the heap that we can move around for performance optimization? Maybe the type information we store there? By the way, in case you wonder what that article is about: Eric Lippert compares references to pointers a little bit and tries to explain why it may be wrong to say that references are just addresses, even though that is how C# implements them.

And also, if any of my assumptions above is wrong, please correct me.

Solution :

Yes, references get updated during a garbage collection. Necessarily so, because objects are moved when the heap is compacted. Compacting serves two major purposes:

  • It makes programs more efficient by using the processor’s data caches more efficiently. That is a very, very big deal on modern processors: RAM is exceedingly slow compared to the execution engine, by a fat two orders of magnitude. The processor can be stalled for hundreds of instructions when it has to wait for RAM to supply a variable value.
  • It solves the fragmentation problem that heaps suffer from. Fragmentation occurs when a small object that is surrounded by live objects is released. That leaves a hole that cannot be used for anything but an object of equal or smaller size, which is bad for both memory usage efficiency and processor efficiency. Note that the LOH, the Large Object Heap in .NET, does not get compacted by default and therefore suffers from this fragmentation problem. There are many questions about that at SO.
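Both points, and the reference update they force, can be seen in a toy model. The following is only a sketch in standard C++, not how the CLR lays out its heap: `Slot`, `largestFreeRun`, `compact` and `runCompactionDemo` are hypothetical names, and a "reference" is modelled as a slot index rather than a real address.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: a "heap" of equal-sized slots, not the CLR's real layout.
struct Slot {
    bool live;   // false = garbage or a hole
    int  value;  // payload standing in for object data
};

// Largest run of consecutive free slots, i.e. the biggest object
// (measured in slots) that could still be allocated contiguously.
std::size_t largestFreeRun(const std::vector<Slot>& heap) {
    std::size_t best = 0, run = 0;
    for (const Slot& s : heap) {
        run = s.live ? 0 : run + 1;
        if (run > best) best = run;
    }
    return best;
}

// Slide live slots toward index 0, then patch every root.
// Roots hold slot indices (standing in for addresses) of live slots.
void compact(std::vector<Slot>& heap, std::vector<std::size_t*>& roots) {
    std::vector<std::size_t> forward(heap.size());  // old index -> new index
    std::size_t next = 0;
    for (std::size_t i = 0; i < heap.size(); ++i) {
        if (!heap[i].live) continue;
        forward[i] = next;
        heap[next] = heap[i];  // safe: next <= i, so nothing live is overwritten
        ++next;
    }
    for (std::size_t i = next; i < heap.size(); ++i) heap[i].live = false;
    for (std::size_t* r : roots) *r = forward[*r];  // the "reference update"
}

struct CompactionResult {
    std::size_t freeBefore, freeAfter, refAfter;
    int valueAfter;
};

CompactionResult runCompactionDemo() {
    // Pattern: live, dead, live, dead, live, dead.
    std::vector<Slot> heap = {{true, 10}, {false, 11}, {true, 12},
                              {false, 13}, {true, 14}, {false, 15}};
    std::size_t ref = 4;  // "reference" to the object holding 14
    std::vector<std::size_t*> roots = {&ref};

    CompactionResult r;
    r.freeBefore = largestFreeRun(heap);  // only one-slot holes exist
    compact(heap, roots);
    r.freeAfter  = largestFreeRun(heap);  // holes merged into one run
    r.refAfter   = ref;                   // the root now names the new slot
    r.valueAfter = heap[ref].value;       // and still reaches the same data
    return r;
}
```

Before compaction, half of this toy heap is free, yet nothing larger than one slot fits. Afterwards the free space is contiguous, the live data sits together, and the root had to be rewritten from slot 4 to slot 2 to keep pointing at the same object.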

In spite of Eric’s didactic framing, an object reference really is just an address: a pointer, exactly the same kind you’d use in a C or C++ program. Very efficient, necessarily so. All the GC has to do after moving an object is update the address stored in that pointer to the object’s new location. The CLR also permits allocating handles to objects, which are extra references. They are exposed as the GCHandle type in .NET, but they are only necessary if the GC needs help determining whether an object should stay alive or should not be moved. That is only relevant if you interop with unmanaged code.

What is not so simple is finding that pointer again. The CLR is heavily invested in ensuring that this can be done reliably and efficiently. Such pointers can be stored in many different places. The easier ones to find are object references stored in a field of an object, a static variable or a GCHandle. The hard ones are pointers stored on the processor stack or in a CPU register, which happens for method arguments and local variables, for example.

One guarantee that the CLR needs to provide to make that happen is that the GC can always reliably walk the stack of a thread, so that it can find the local variables stored in a stack frame. It then needs to know where to look in such a stack frame, and that’s the job of the JIT compiler. When it compiles a method, it doesn’t just generate the machine code for the method, it also builds a table that describes where those pointers are stored. You’ll find more details about that in this post.
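Why such a table is needed can be illustrated with a small model in standard C++. This is a sketch under stated assumptions, not the CLR's actual data structures: `PointerMap`, `Frame`, `relocate` and `runStackDemo` are hypothetical names, and a stack frame is reduced to a vector of machine words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model: the per-method table says which frame slots hold references.
struct PointerMap {
    std::vector<std::size_t> refSlots;
};

// A stack frame as raw machine words plus the table for its method.
struct Frame {
    std::vector<std::uintptr_t> slots;
    const PointerMap* map;
};

// After one object moved from oldAddr to newAddr, rewrite only the slots
// the table marks as references. Returns how many slots were patched.
std::size_t relocate(std::vector<Frame>& stack,
                     std::uintptr_t oldAddr, std::uintptr_t newAddr) {
    std::size_t patched = 0;
    for (Frame& f : stack) {
        for (std::size_t i : f.map->refSlots) {
            if (f.slots[i] == oldAddr) {
                f.slots[i] = newAddr;
                ++patched;
            }
        }
    }
    return patched;
}

struct StackDemoResult {
    std::uintptr_t intSlot, refSlot;
    std::size_t patched;
};

StackDemoResult runStackDemo() {
    PointerMap map;
    map.refSlots = {1};  // slot 1 is a reference, slot 0 is a plain integer

    Frame f;
    f.slots = {0x1000, 0x1000};  // the integer happens to equal the address
    f.map = &map;
    std::vector<Frame> stack = {f};

    StackDemoResult r;
    r.patched = relocate(stack, 0x1000, 0x2000);
    r.intSlot = stack[0].slots[0];  // untouched: not listed in the table
    r.refSlot = stack[0].slots[1];  // rewritten to the new address
    return r;
}
```

Both slots hold the same bit pattern, so the value alone cannot tell the collector which one is a pointer. Only the table lets it patch the reference and leave the integer alone.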

Looking at C++/CLI in Action, there’s a section about interior pointers vs pinning pointers:

C++/CLI provides two kinds of pointers that work around this problem.
The first kind is called an interior pointer, which is updated by the
runtime to reflect the new location of the object that’s pointed to
every time the object is relocated. The physical address pointed to by
the interior pointer never remains the same, but it always points to
the same object. The other kind is called a pinning pointer, which
prevents the GC from relocating the object; in other words, it pins
the object to a specific physical location in the CLR heap. With some
restrictions, conversions are possible between interior, pinning, and
native pointers.

From that, you can conclude that reference types do move in the heap and their addresses do change. After the mark phase, the surviving objects get compacted inside the heap, thus actually moving to new addresses. The CLR is responsible for keeping track of where references and interior pointers are stored, with the help of internal tables, and for updating them after a move, so that when accessed, they still point to the valid location of the object.
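The difference between the two pointer kinds can be modelled in a few lines of standard C++. This is a simplified sketch, not the runtime's implementation: `Obj`, `moveObject` and `runInteriorDemo` are hypothetical names, and addresses are plain numbers chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: an object is an address range that may be pinned.
struct Obj {
    std::size_t addr;
    std::size_t size;
    bool pinned;
};

// Try to move an object. A pinned object stays where it is. Otherwise
// every interior pointer that falls inside the object is shifted so it
// keeps its offset from the start of the object.
bool moveObject(Obj& o, std::size_t newAddr,
                std::vector<std::size_t*>& interiorPtrs) {
    if (o.pinned) return false;
    for (std::size_t* p : interiorPtrs) {
        if (*p >= o.addr && *p < o.addr + o.size) {
            *p = newAddr + (*p - o.addr);
        }
    }
    o.addr = newAddr;
    return true;
}

struct InteriorDemoResult {
    bool movedFree, movedPinned;
    std::size_t ptrIntoFree, ptrIntoPinned;
};

InteriorDemoResult runInteriorDemo() {
    Obj freeObj   = {1000, 16, false};
    Obj pinnedObj = {2000, 16, true};
    std::size_t pFree   = freeObj.addr + 8;    // points at a field
    std::size_t pPinned = pinnedObj.addr + 8;  // same, but target is pinned
    std::vector<std::size_t*> ptrs = {&pFree, &pPinned};

    InteriorDemoResult r;
    r.movedFree   = moveObject(freeObj, 200, ptrs);
    r.movedPinned = moveObject(pinnedObj, 300, ptrs);
    r.ptrIntoFree   = pFree;    // follows the object to 200 + 8
    r.ptrIntoPinned = pPinned;  // unchanged, the object never moved
    return r;
}
```

The pointer into the movable object changes its numeric value but still refers to the same field, which mirrors what the interior pointer in the quoted passage does. The pointer into the pinned object keeps its value because the object is never relocated.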

There’s an example taken from here:

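#include <stdio.h>

// C++/CLI: compile with /clr. A ref struct is allocated on the CLR heap.
ref struct CData
{
    int age;
};

int main()
{
    for(int i=0; i<100000; i++) // ((1)) fill the heap with orphan objects
        gcnew CData();

    CData^ d = gcnew CData();
    d->age = 100;

    interior_ptr<int> pint = &d->age; // ((2)) tracked by the GC

    printf("%p %d\r\n",pint,*pint);

    for(int i=0; i<100000; i++) // ((3)) allocate until a collection occurs
        gcnew CData();

    printf("%p %d\r\n",pint,*pint); // ((4)) new address, same value
    return 0;
}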

Which is explained:

In the sample code, you create 100,000 orphan CData objects ((1)) so
that you can fill up a good portion of the CLR heap. You then create a
CData object that’s stored in a variable and ((2)) an interior pointer
to the int member age of this CData object. You then print out the
pointer address as well as the int value that is pointed to. Now,
((3)) you create another 100,000 orphan CData objects; somewhere along
the line, a garbage-collection cycle occurs (the orphan objects
created earlier ((1)) get collected because they aren’t referenced
anywhere). Note that you don’t use a GC::Collect call because that’s
not guaranteed to force a garbage-collection cycle. As you’ve already
seen in the discussion of the garbage-collection algorithm in the
previous chapter, the GC frees up space by removing the orphan objects
so that it can do further allocations. At the end of the code (by
which time a garbage collection has occurred), you again ((4)) print
out the pointer address and the value of age. This is the output I got
on my machine (note that the addresses will vary from machine to
machine, so your output values won’t be the same):

012CB4C8 100
012A13D0 100
