One section claims "Physical Subtyping is Broken", where "physical subtyping" is defined as "the struct-based implementation of inheritance in C." I assume this means the typical pattern of:
typedef struct {
  int base_member_1;
  int base_member_2;
} Base;
typedef struct {
  Base base;
  int derived_member_1;
} Derived;
The article claims physical subtyping is broken because casting between pointer types results in undefined behavior. The article gives this example:
I agree this example is broken, but casting between pointer types in this way is totally unnecessary for C-based inheritance. You can do upcasts and downcasts that are totally legal:
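For concreteness, here is a minimal sketch of what such legal casts can look like, reusing the Base/Derived structs from above. The helper names (derived_as_base, base_as_derived, roundtrip_sum) are made up for illustration and are not from the article.

```c
typedef struct {
    int base_member_1;
    int base_member_2;
} Base;

typedef struct {
    Base base;              /* must be the first member */
    int derived_member_1;
} Derived;

/* Upcast: take the address of the embedded Base. No cast is needed. */
Base *derived_as_base(Derived *d) {
    return &d->base;
}

/* Downcast: only valid if b really is the `base` member of a Derived. */
Derived *base_as_derived(Base *b) {
    return (Derived *)b;
}

/* Hypothetical round trip: write through the Base pointer, then read
 * both the base and derived members back through the Derived pointer. */
int roundtrip_sum(int value) {
    Derived d = { { 1, 2 }, value };
    Base *b = derived_as_base(&d);
    b->base_member_1 = 10;
    Derived *back = base_as_derived(b);
    return back->derived_member_1 + back->base.base_member_1;
}
```

The downcast is only as safe as the caller's knowledge that the Base really lives inside a Derived; nothing in the code checks that.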
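So I don't think the article has proved that "Physical Subtyping is Broken."
The next section says that "Chunking Optimizations Are Broken," because code like this is illegal:
While this is true, such optimizations are generally unnecessary. For example, write this instead as:
If you compile this on an architecture like x86 that truly allows unaligned reads, you'll see that modern compilers do the "chunking optimization" for you: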
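A minimal sketch of the memcpy-based version, for reference. The name copy_8_bytes comes from later in this thread, but the signature here is my assumption; whether a given compiler turns the call into a single 8-byte load and store depends on the target and optimization level.

```c
#include <string.h>

/* Copies exactly 8 bytes from src to dst. Works for any alignment and
 * any object type, because memcpy accesses the bytes as characters.
 * Signature assumed for illustration. */
void copy_8_bytes(void *dst, const void *src) {
    memcpy(dst, src, 8);
}
```

Because the size is a compile-time constant, the compiler is free to pick whatever load/store width is best for the target, which is the point of leaving the chunking to it.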
It says next that "int8_t and uint8_t Are Not Necessarily Character Types." That is indeed a good point and probably not well-known. So I agree this is something people should keep in mind. But most of this article is warning against practices that are generally unnecessary and known to be bad C in 2016.
It's true that a lot of legacy code-bases still break these rules. But many are cleaning up their act, fixing practices that were never correct but used to work. For example, here is Python fixing its API to comply with strict aliasing, and this is from almost 10 years ago: https://www.python.org/dev/peps/pep-3123/
I'm with you. I wrote the code that breaks this stuff in gcc (and the implementation of struct-sensitive pointer analysis).
It explicitly and deliberately follows the first member rule, as it should :)
In C++, this is covered by 6.5/7, and allowed because it's a type compatible with the effective type of the object (in a standard layout class, a pointer to a structure object points to the initial member).
I understand the upcast (which is certainly legal but it forces the casting code to know the depth of the inheritance hierarchy - as in &derived->base1.base2), but what's the argument making the downcast back to Derived legal C? (I honestly wonder; personally I either compile with -fno-strict-aliasing or trust my tests to validate the build...)
> I understand the upcast (which is certainly legal but it forces the casting code to know the depth of the inheritance hierarchy - as in &derived->base1.base2)
If this is inconvenient, just casting directly to Base pointer is also legal.
> but what's the argument making the downcast back to Derived legal C?
The justification comes from this part of the C standard (C99 6.7.2.1 p13):
Within a structure object, the non-bit-field
members and the units in which bit-fields reside
have addresses that increase in the order in which
they are declared. A pointer to a structure
object, suitably converted, points to its initial
member (or if that member is a bit-field, then to
the unit in which it resides), and vice versa.
There may be unnamed padding within a structure
object, but not at its beginning.
It follows that:
Derived *d = GetDerived();
// This is legal: a pointer to Derived, suitably converted,
// points to its initial member "base":
Base *base = (Base*)d;
// This is also legal: a pointer to Derived.base, the initial
// member of Derived, suitably converted, points to Derived.
Derived *d2 = (Derived*)base;
About the memcpy stuff, when you consider the whole picture, this is ridiculous though. There should be no reason for any sane implementation to ever do nasty stuff about
Now I don't remember the article where I saw that, but technically, given the current orientation of compiler writers, there are some even more ridiculous situations. Take (a<<n) | (a>>(32-n)): an obvious direct translation (already quite efficient) has the desired effect on all current architectures, and yet given the current orientation of compiler writers I would not like to see that code AT ALL unless it is proved that n is always between 1 and 31 inclusive. And now if they want to restore any kind of efficiency after all that madness, they have to implement yet another convoluted peephole optimization. Stupid. Give me the original intent of the language back, because virtually everybody is using it like that, consciously or not, and that will just not change.
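For what it's worth, the rotate can be written so that no shift count ever reaches 32. This is a sketch of the commonly used masking idiom; the name rotl32 is mine, and whether it compiles down to a single rotate instruction depends on the compiler and target.

```c
#include <stdint.h>

/* Rotate a left by n bits. Both shift counts are masked into 0..31,
 * so the expression is defined for every n, including 0 and 32. */
uint32_t rotl32(uint32_t a, unsigned n) {
    n &= 31;                             /* reduce the count to 0..31 */
    return (a << n) | (a >> (-n & 31));  /* -n & 31 equals (32 - n) % 32 */
}
```

The unsigned negation is well defined (it wraps), which is what makes the second mask safe.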
How about the fact that the addresses might not be aligned?
How about the fact that there is no reason to write that if what you actually mean is memcpy(dst, src, 8)? Chunking yourself is a premature optimization that the compiler is in a better position to actually perform.
typedef struct {
  int x;
} Base;
typedef struct {
  Base base;
  int y;
} Derived;
int f(Base* b, Derived* d) {
  b->x = 0;
  d->base.x = 1;
  return b->x;
}
Notice that if we are accessing the base members of "d", we are still accessing them through a struct of type "Base" (d->base.x). If we compile this with strict aliasing, you can see the output is allowing that the two might alias (while this isn't a proof, it's a strong indication that this is aliasing-correct).
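To see what "the two might alias" means at run time, here is a small sketch that calls f() both ways. The two caller functions are hypothetical names I added; f() itself is the function from above.

```c
typedef struct {
    int x;
} Base;

typedef struct {
    Base base;
    int y;
} Derived;

int f(Base *b, Derived *d) {
    b->x = 0;
    d->base.x = 1;   /* may overwrite b->x if b points into *d */
    return b->x;
}

/* b and d refer to the same object, so the second store is visible. */
int call_with_alias(void) {
    Derived d = { { 0 }, 0 };
    return f(&d.base, &d);
}

/* b and d refer to unrelated objects, so b->x keeps its value. */
int call_without_alias(void) {
    Base b = { 0 };
    Derived d = { { 0 }, 0 };
    return f(&b, &d);
}
```

A compiler that assumed b and d could never alias would be entitled to return 0 unconditionally; the fact that it reloads b->x is what the generated code shows.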
6.5 p7: An object shall have its stored value accessed only by an lvalue expression that has one of
the following types:
- an aggregate or union type that includes one of the aforementioned types among its
members (including, recursively, a member of a subaggregate or contained union)
If we talk in terms of concepts that exist in the C standard, we would say that you can't cast a pointer to pointer-to-X unless it actually points to an X.
The reason your example is illegal is that you are casting to pointer-to-"struct derived", but the thing being pointed to is not actually a "struct derived."
The "physical subtyping" pattern works because the C standard says that a pointer to a struct, suitably converted, also points to its first member. So a pointer-to-Derived, converted to a pointer-to-Base, points at Derived's first member. But a pointer-to-Base doesn't point at a Derived unless that object actually is a Derived. So the downcast is only legal if the object actually is a Derived.
Looks like your other comment hit the max reply depth so this will need to finish up, but in any case I don't agree with your reading of the vice versa.
It may be that the aliasing rules are also required to fully justify my conclusion (ie. a Base can't have its stored value accessed via a pointer-to-derived due to the aliasing rules). But I have a very high degree of confidence in the conclusion itself. I think that you will find that your compiler implements the behavior I have described.
There isn't actually a depth limit (or if there is we haven't hit it yet :). HackerNews just hides the "reply" link for 5 minutes or so to cool down flamewars.
You can work around this by clicking on the link for the post itself (ie. "3 minutes ago") which allows you to reply immediately.
I'm not making a one-way argument. If the underlying object actually is a Derived, you can freely cast between pointer-to-Base and pointer-to-Derived. That is what "and vice versa" means.
But if the object isn't actually a Derived, you can't cast to pointer-to-Derived:
Derived derived;
Derived *pDerived = &derived;
// This is legal because it's equivalent to:
// Base *pb = &derived.base;
//
// ie. there actually is a Base object there that the
// pointer is pointing to.
Base *pBase = (Base*)pDerived;
// This is legal because pBase points to the initial member
// of a Derived. So, suitably converted, it points at the
// Derived.
//
// The key point is that there actually is a Derived object
// there that we are pointing at.
pDerived = (Derived*)pBase;
Base base;
// This is illegal, because this base object is not actually
// a part of a larger Derived object, it's just a Base.
// So we have a pDerived that doesn't actually point at a
// Derived object -- this is illegal.
pDerived = (Derived*)&base;
// Imagine if the above were actually legal -- this would
// reference unallocated memory!
pDerived->some_derived_member = 5;
To my mind, 'illegal' means that the compiler will complain. In this case, I don't even see weird, scary UB; this is just a case of the standard being completely unable to say anything about what will happen.
After spending too much of my life chasing these bugs, here the compiler will do exactly what you told it to, which probably means making your day miserable.
> Actually, casts to/from char * are always defined in C (chars are always assumed to alias).
Not true. The standard says you can access any object's value via the char type, but not the reverse. You can't cast a character array to any type and dereference it.
> The author was talking about "chunking" non-char units.
Sure, but you can call my copy_8_bytes() function like so legally:
Yes, I believe it would be. That's a good point, now that you mention it -- I have code that does just that, and I hadn't realized it's probably undefined.
The "effective type" (this is a term defined in the standard) of the char array elements would be "char", whereas the memory returned from malloc() is considered to be an object that initially has no effective type. I don't know of any way to take the char array and "erase" its effective type so that it can be used generically, like the value returned from malloc().
This is one reason among many why people who write serious low-level code (e.g. game developers) think all the new aliasing rules are completely bonkers.
We implement our own allocators all the time. If you can't even do such a basic thing legally, then the rules are obvious nonsense.
Just make a fast allocator that uses the heap instead of the stack. You only need to malloc once, and it can be used for any type since, as you pointed out, its effective type can be changed.
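Roughly what I have in mind, as a sketch only: a bump allocator over a single malloc'd block. The Arena type and the arena_* names are invented for this example, and it assumes the requested alignment is a power of two no larger than what malloc guarantees.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    unsigned char *start;  /* block obtained from a single malloc call */
    size_t size;           /* total bytes in the block */
    size_t used;           /* bytes handed out so far */
} Arena;

/* Returns 1 on success, 0 if malloc failed. */
int arena_init(Arena *a, size_t size) {
    a->start = malloc(size);
    a->size = size;
    a->used = 0;
    return a->start != NULL;
}

/* Hands out n bytes aligned to `align` (a power of two), or NULL if
 * the arena is full. The memory has no declared type, so the first
 * store through the returned pointer sets its effective type. */
void *arena_alloc(Arena *a, size_t n, size_t align) {
    size_t offset = (a->used + align - 1) & ~(align - 1);
    if (offset > a->size || n > a->size - offset)
        return NULL;
    a->used = offset + n;
    return a->start + offset;
}

/* Releases the whole block at once. */
void arena_free_all(Arena *a) {
    free(a->start);
    a->start = NULL;
    a->size = 0;
    a->used = 0;
}
```

Individual allocations are never freed; the whole arena is released in one go, which is what makes it fast.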
Sometimes, especially in embedded systems, it is useful to have a bunch of statically allocated heaps. You can see them in a memory map, and the linker will tell you if they don't fit in memory.
There is also the case where you have some raw data from a file or network, that you want to re-interpret as a struct. That is always dangerous with endianness and struct padding, but it is a very common practice. You could always memcpy from a char array to a struct, but that can waste memory.
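The memcpy route looks roughly like this. PacketHeader and read_header are hypothetical names for illustration, and the sketch does nothing about endianness or padding, so it only round-trips data produced with the same struct layout.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical wire header, used only for this example. */
typedef struct {
    uint16_t kind;
    uint16_t length;
    uint32_t sequence;
} PacketHeader;

/* Copies a header out of a raw byte buffer instead of casting the
 * buffer pointer. Returns 1 on success, 0 if the buffer is too short.
 * buf may have any alignment. */
int read_header(const unsigned char *buf, size_t len, PacketHeader *out) {
    if (len < sizeof *out)
        return 0;
    memcpy(out, buf, sizeof *out);
    return 1;
}
```

The cost is one extra copy of the header; the payload can stay in the original buffer.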
"More importantly, they allocate a data-dependent amount of stack space that can trigger difficult-to-find memory overwriting bugs: "It ran fine on my machine, but dies mysteriously in production"."