c++

undefined-behavior

language-lawyer

unions

I was under the impression that accessing a union member other than the last one set is UB, but I can't seem to find a solid reference (other than answers claiming it's UB but without any support from the standard).

So, is it undefined behavior?

Solution 1

The confusion is that C explicitly permits type-punning through a union, whereas C++ has no such permission.

6.5.2.3 Structure and union members

95) If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called type punning). This might be a trap representation.

The situation with C++:

9.5 Unions [class.union]

In a union, at most one of the non-static data members can be active at any time, that is, the value of at most one of the non-static data members can be stored in a union at any time.

C++ later has language permitting the use of unions containing structs with common initial sequences; this doesn't however permit type-punning.

To determine whether union type-punning is allowed in C++, we have to search further. Recall that C99 is a normative reference for C++11 (and C99 has similar language to C11 permitting union type-punning):

3.9 Types [basic.types]

4 - The object representation of an object of type T is the sequence of N unsigned char objects taken up by the object of type T, where N equals sizeof(T). The value representation of an object is the set of bits that hold the value of type T. For trivially copyable types, the value representation is a set of bits in the object representation that determines a value, which is one discrete element of an implementation-defined set of values. (See footnote 42.)
42) The intent is that the memory model of C++ is compatible with that of ISO/IEC 9899 Programming Language C.

It gets particularly interesting when we read

3.8 Object lifetime [basic.life]

The lifetime of an object of type T begins when: storage with the proper alignment and size for type T is obtained, and if the object has non-trivial initialization, its initialization is complete.

So for a primitive type (which ipso facto has trivial initialization) contained in a union, the lifetime of the object encompasses at least the lifetime of the union itself. This allows us to invoke

3.9.2 Compound types [basic.compound]

If an object of type T is located at an address A, a pointer of type cv T* whose value is the address A is said to point to that object, regardless of how the value was obtained.

Assuming that the operation we are interested in is type-punning i.e. taking the value of a non-active union member, and given per the above that we have a valid reference to the object referred to by that member, that operation is lvalue-to-rvalue conversion:

4.1 Lvalue-to-rvalue conversion [conv.lval]

A glvalue of a non-function, non-array type T can be converted to a prvalue. If T is an incomplete type, a program that necessitates this conversion is ill-formed. If the object to which the glvalue refers is not an object of type T and is not an object of a type derived from T, or if the object is uninitialized, a program that necessitates this conversion has undefined behavior.

The question then is whether an object that is a non-active union member is initialized by a store to the active union member. As far as I can tell, it is not. Access to a union through a non-active member is therefore defined, and defined to follow the object and value representation, only if:

  • a union is copied into char array storage and back (3.9:2), or
  • a union is bytewise copied to another union of the same type (3.9:3), or
  • a union is accessed across language boundaries by a program element conforming to ISO/IEC 9899 (so far as that is defined) (3.9:4 note 42).

Access without one of the above interpositions is undefined behaviour. This has implications for the optimisations allowed to be performed on such a program, as the implementation may of course assume that undefined behaviour does not occur.

That is, although we can legitimately form an lvalue to a non-active union member (which is why assigning to a non-active member without construction is ok), that member is considered to be uninitialized.
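As a rough sketch of the byte-copy route (3.9:2) mentioned in the list above, the bytes can be copied with std::memcpy instead of reading a non-active member. The helper names float_bits and round_trip_through_bytes are made up for illustration, and the code assumes a 32-bit float; which bit pattern you actually get depends on the platform's float format.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the object representation of a float as a
// 32-bit unsigned integer by copying bytes, without touching any union.
std::uint32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Hypothetical helper: copies a float into unsigned char storage and back,
// which 3.9:2 allows for trivially copyable types.
float round_trip_through_bytes(float f) {
    unsigned char buf[sizeof f];
    std::memcpy(buf, &f, sizeof f);  // object -> char storage
    float g;
    std::memcpy(&g, buf, sizeof g);  // char storage -> object
    assert(std::memcmp(&f, &g, sizeof f) == 0);  // same object representation
    return g;
}
```

Compilers typically optimise the memcpy calls away, so this should not cost more than the union version it replaces.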

Solution 2

The C++11 standard says it this way

9.5 Unions

In a union, at most one of the non-static data members can be active at any time, that is, the value of at most one of the non-static data members can be stored in a union at any time.

If only one value is stored, how can you read another? It just isn't there.
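One common way to live with that rule is to remember which member was written last and only read that one. A minimal sketch, where the type IntOrFloat and its member names are invented for this example and not taken from the standard:

```cpp
#include <cassert>

// Hypothetical tagged wrapper: `active` records which member was last
// written, so a read never touches the non-active member.
struct IntOrFloat {
    enum class Active { Int, Float } active;
    union {
        int i;
        float f;
    };

    void set(int v)   { i = v; active = Active::Int; }
    void set(float v) { f = v; active = Active::Float; }

    // Each accessor checks the tag before reading the matching member.
    int as_int() const     { assert(active == Active::Int);   return i; }
    float as_float() const { assert(active == Active::Float); return f; }
};
```

With this, calling as_int() after set(2.5f) trips the assertion in a debug build instead of silently reading a member that is not active.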


The gcc documentation lists this under Implementation defined behavior

  • A member of a union object is accessed using a member of a different type (C90 6.3.2.3).

The relevant bytes of the representation of the object are treated as an object of the type used for the access. See Type-punning. This may be a trap representation.

indicating that this is not required by the C standard.


2016-01-05: Through the comments I was linked to C99 Defect Report #283 which adds a similar text as a footnote to the C standard document:

78a) If the member used to access the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). This might be a trap representation.

Not sure if it clarifies much though, considering that a footnote is not normative for the standard.

Solution 3

I think the closest the standard comes to saying it's undefined behavior is where it defines the behavior for a union containing a common initial sequence (C99, §6.5.2.3/5):

One special guarantee is made in order to simplify the use of unions: if a union contains several structures that share a common initial sequence (see below), and if the union object currently contains one of these structures, it is permitted to inspect the common initial part of any of them anywhere that a declaration of the complete type of the union is visible. Two structures share a common initial sequence if corresponding members have compatible types (and, for bit-fields, the same widths) for a sequence of one or more initial members.

C++11 gives similar requirements/permission at §9.2/19:

If a standard-layout union contains two or more standard-layout structs that share a common initial sequence, and if the standard-layout union object currently contains one of these standard-layout structs, it is permitted to inspect the common initial part of any of them. Two standard-layout structs share a common initial sequence if corresponding members have layout-compatible types and either neither member is a bit-field or both are bit-fields with the same width for a sequence of one or more initial members.

Though neither states it directly, these both carry a strong implication that "inspecting" (reading) a member is "permitted" only if 1) it is (part of) the member most recently written, or 2) is part of a common initial sequence.
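To make the common-initial-sequence case concrete, here is a small sketch. The types IntMsg, FloatMsg and Msg and the kind constants are hypothetical names chosen for illustration; both structs are standard-layout and start with an int, so that first member is the common initial sequence.

```cpp
#include <cassert>

const int kIntKind = 1;    // hypothetical discriminator values
const int kFloatKind = 2;

// Two standard-layout structs sharing the common initial sequence { int kind; }.
struct IntMsg   { int kind; int   value; };
struct FloatMsg { int kind; float value; };

union Msg {
    IntMsg   as_int;
    FloatMsg as_float;
    // Each constructor makes exactly one member active.
    Msg(IntMsg m)   : as_int(m) {}
    Msg(FloatMsg m) : as_float(m) {}
};

// Reads the shared first member through as_int, whichever struct is active.
// This is the "inspect the common initial part" permission quoted above.
int kind_of(const Msg& m) {
    int kind = m.as_int.kind;
    assert(kind == kIntKind || kind == kFloatKind);
    return kind;
}
```

Reading m.as_int.value while as_float is active would fall outside that permission, since the second members have different types and are not part of the common initial sequence.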

That's not a direct statement that doing otherwise is undefined behavior, but it's the closest of which I'm aware.

Solution 4

Something not yet mentioned by the other answers is footnote 37 in paragraph 21 of section 6.2.5:

Note that aggregate type does not include union type because an object with union type can only contain one member at a time.

This requirement seems to clearly imply that you must not write to one member and read from another. In this case it might be undefined behavior by lack of specification.

Solution 5

I will explain this with an example.
Assume we have the following union:

union A{
   int x;
   short y[2];
};

I will assume that sizeof(int) gives 4 and that sizeof(short) gives 2.
When you write union A a = {10}, that will create a new variable of type A and put the value 10 in it.

Your memory should look like this (remember that all of the union members share the same location; the diagram assumes a big-endian byte order):

       |                   x                   |
       |        y[0]       |       y[1]        |
       -----------------------------------------
   a-> |0000 0000|0000 0000|0000 0000|0000 1010|
       -----------------------------------------

As you can see, the value of a.x is 10, the value of a.y[1] is 10, and the value of a.y[0] is 0.

Now, what will happen if I do this?

a.y[0] = 37;

Our memory will look like this:

       |                   x                   |
       |        y[0]       |       y[1]        |
       -----------------------------------------
   a-> |0000 0000|0010 0101|0000 0000|0000 1010|
       -----------------------------------------

This will turn the value of a.x into 2424842 (in decimal).
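The arithmetic behind that number can be checked without touching a union at all. This is only a sketch of the layout drawn above (y[0] in the two most significant bytes, y[1] in the two least significant); combine_halves and check_diagram_values are names invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// Combines two 16-bit halves laid out as in the diagram: `high` occupies the
// two leftmost (most significant) bytes, `low` the two rightmost.
std::uint32_t combine_halves(std::uint16_t high, std::uint16_t low) {
    return (static_cast<std::uint32_t>(high) << 16) | low;
}

// Re-derives the two values discussed in the text.
void check_diagram_values() {
    assert(combine_halves(0, 10) == 10u);         // before: y[0] = 0,  y[1] = 10
    assert(combine_halves(37, 10) == 2424842u);   // after:  y[0] = 37, y[1] = 10
}
```

Here 37 * 65536 + 10 = 2424842. On a little-endian machine the two halves would sit the other way round, so the same union write would produce a different value of a.x.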

Now, if your union has a float or a double, your memory map will be more of a mess, because of the way floating-point numbers are stored. More info can be found here.