"Усердие все превозмогает!"
К. Прутков, Мысли и афоризмы, I, 84
In C and C++ a static variable can be defined in a function scope:
int foo() {
static int counter = 1;
printf("foo() has been called %i times.\n", counter++);
...
}
Technically, this defines counter
as an object of static storage duration that is allocated not within the function activation frame (which is typically on the stack, but can be on the heap for a coroutine), but as a global object. This is often used to shift computational cost out of the hot path, by precomputing some state and storing it in a static object.
When exactly a static object is initialised?
For C this question is vacuous, because the initialiser must be a compile-time constant, so the actual value of the static object is embedded in the compiled binary and is always valid.
C++ has a bizarrely complicated taxonomy of initialisations. There is static initialisation, which roughly corresponds to C initialisation, subdivided into constant-initialisation and zero-initialisation. Then there is dynamic initialisation, further divided into unordered, partially-ordered and ordered categories. None of these, however, captures our case: for block-local variables, the Standard has a special sub-section in "Declaration statement" [stmt.dcl.4]:
Dynamic initialization of a block-scope variable with static storage duration or thread storage duration is performed the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization. If the initialization exits by throwing an exception, the initialization is not complete, so it will be tried again the next time control enters the declaration. If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization. [85] If control re-enters the declaration recursively while the variable is being initialized, the behavior is undefined. [ Example:
int foo(int i) {
static int s = foo(2*i); // undefined behavior: recursive call
return i+1;
}
— end example ]
[0]
85) The implementation must not introduce any deadlock around execution of the initializer. Deadlocks might still be caused by the program logic; the implementation need only avoid deadlocks due to its own synchronization operations.
For example in
struct Bar {
Bar() : var(1) {}
int var;
};
int foo(int x) {
static Bar b{};
return b.var + 1;
}
the constructor for b
should be called exactly once when foo()
is called the first time. This initialisation semantics is very close (sans the exceptions part) to pthread_once
(). It is clear that the compiler must add some sort of an internal flag to check whether the initialisation has already been performed and some synchronisation object to serialise concurrent calls to foo()
[godbolt]:
foo(int):
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], edi
movzx eax, BYTE PTR guard variable for foo(int)::b[rip]
test al, al
sete al
test al, al
je .L3
mov edi, OFFSET FLAT:guard variable for foo(int)::b
call __cxa_guard_acquire
test eax, eax
setne al
test al, al
je .L3
mov edi, OFFSET FLAT:foo(int)::b
call Bar::Bar() [complete object constructor]
mov edi, OFFSET FLAT:guard variable for foo(int)::b
call __cxa_guard_release
.L3:
mov eax, DWORD PTR foo(int)::b[rip]
add eax, 1
leave
ret
This corresponds roughly to the following code:
int foo(int x) {
static Bar b{};
static std::atomic<int> __b_guard = 0;
if (__cxa_guard_acquire(&__b_guard) != 0) {
new (&b) Bar{}; /* Construct b in-place. */
__cxa_guard_release(&__b_guard)
}
return b.var + 1;
}
Here __b_guard
(guard variable for foo(int)::b
in assembly) is the flag variable added by the compiler. __cxa_guard_acquire
() is a suprisingly complex function, which includes its own synchronisation mechanism implemented directly on top of the raw Linux futex syscall.
Even after the static variable has been initialised, the overhead of accessing it is still considerable: a function call to __cxa_guard_acquire()
, plus atomic_load_explicit(&__b_guard, memory_order::acquire)
in __cxa_guard_acquire()
. On ARM, such atomic load incurs a memory barrier---a fairly expensive operation.
Can this additional cost be reduced? Yes, in fact it can be completely eliminated, making block-level static variables exactly as efficient as file-level ones. For this we need a certain old, but little-known feature of UNIX linkers. From GNU binutils documentation (beware than in the old versions the terminating symbol is mistakenly referred to as __end_SECNAME
):
If an output section’s name is the same as the input section’s name and is representable as a C identifier, then the linker will automatically PROVIDE
two symbols: __start_SECNAME
and __stop_SECNAME
, where SECNAME
is the name of the section. These indicate the start address and end address of the output section respectively. Note: most section names are not representable as C identifiers because they contain a ‘.’ character.
(Solaris linker calls them "Encapsulation Symbols", see here.)
The idea is the following: instead of defining a block-level static instance of Bar
, define a trivially-initialisable object of a size sufficient to hold an instance of Bar
in a dedicated section STATIC_Bar
, via (more or less portable) __attribute__((section))
. Only such place-holder objects and nothing else are placed in this section. Then, during global static initialisation, scan the resulting array of place-holder objects from __start_STATIC_Bar
to __stop_STATIC_Bar
and initialise Bar instances in-place. Assuming that functions where static Bar
s are defined are not themselves called during global static initialisation, this would initialise everything correctly: by the time foo()
is called, its b
has already been initialised.
Something like this:
#include <stdio.h>
#include <new> /* For placement new. */
#define FAST_STATIC(T) \
*({ \
struct placeholder { \
alignas(T) char buf[sizeof(T)]; \
}; \
static constinit placeholder ph __attribute__((section ("STATIC_" #T))) {{}}; \
reinterpret_cast<T *>(ph.buf); \
})
template <typename T> static int section_init(T *start, T *stop)
{
for (T *s = start; s < stop; ++s)
new (s) T; /* Construct in-place. */
return 0;
}
#define FAST_STATIC_INIT(T) \
extern "C" T __start_STATIC_ ## T; \
extern "C" T __stop_STATIC_ ## T; \
static int _init_ ## T = section_init<T>(&__start_STATIC_ ## T, \
&__stop_STATIC_ ## T);
struct Bar {
Bar() : var(1) {}
int var;
};
int foo(int x) {
Bar &b0 = FAST_STATIC(Bar);
Bar &b1 = FAST_STATIC(Bar);
return b0.var + b1.var + 1;
}
FAST_STATIC_INIT(Bar);
int main(int argc, char **argv) {
return printf("%i\n", foo(argc)); /* Prints "3". */
}
Check the resulting assembly [godbolt]:
foo(int)::ph:
.zero 4
foo(int)::ph:
.zero 4
foo(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov QWORD PTR [rbp-8], OFFSET FLAT:foo(int)::ph
mov QWORD PTR [rbp-16], OFFSET FLAT:foo(int)::ph
mov rax, QWORD PTR [rbp-8]
mov edx, DWORD PTR [rax]
mov rax, QWORD PTR [rbp-16]
mov eax, DWORD PTR [rax]
add eax, edx
add eax, 1
pop rbp
ret
Voilà! The calls to __cxa_guard_acquire()
are gone, yet b0
and b1
are initialised before foo()
is called, just as we want. But not so fast, it's C++.
Let's add another static Bar instance, this time in an inline function:
int inline baz(int x) {
Bar &b = FAST_STATIC(Bar);
return b.var * x;
}
GCC reports [godbolt]:
<source>:9:38: error: 'ph' causes a section type conflict with 'ph' in section 'STATIC_Bar'
(clang works fine [godbolt], by the way.)
The problem is that in addition to name, sections output by the compiler also have attributes. The compiler selects the attributes based on the properties of the scope where the symbol (to which __attribute__((section))
is applied) is defined. Inline functions force a different attribute selection (similarly do template members), and the linker ends up with multiple sections with the same name, but conflicting attributes. See stackoverflow for details.
As it is, FAST_STATIC()
is usable, but section attribute conflicts put awkward resrictions on its applicability. Is this the best we can do? For some time I thought that it is, but then I realised that there is another way to specify the section in which the variable is located: the .pushsection
directive of the embedded assembler (do not be afraid, we will use only portable part).
If you do something like
__asm__(".pushsection STATIC_Bar,\"aw\",@progbits\n" \
".quad " symbol "\n" \
".popsection\n")
then the address of the symbol
is placed in STATIC_Bar
section with the specified attributes.
All we need is something like
#define FAST_STATIC(T) \
*({ \
struct placeholder { \
alignas(T) char buf[sizeof(T)]; \
}; \
static constinit placeholder ph {{}}; \
__asm__(".pushsection STATIC_" #T ",\"aw\",@progbits\n" \
".quad ph\n" \
".popsection\n"); \
reinterpret_cast<T *>(ph.buf); \
})
and we are good (section_init()
needs to be fixed a bit, because STATIC_Bar
now contains pointers, not instances). But not so fast, it's C++. This does not even compile [godbolt]:
ld: /tmp/ccZRzXXj.o:(STATIC_Bar+0x0): undefined reference to `ph'
ld: /tmp/ccZRzXXj.o:(STATIC_Bar+0x8): undefined reference to `ph'
ld: /tmp/ccZRzXXj.o:(STATIC_Bar+0x10): undefined reference to `ph'
collect2: error: ld returned 1 exit status
Execution build compiler returned: 1
When you define static constinit placeholder ph
, the actual name the compiler uses for the symbol is not ph
it is the mangled version of something like foo(int)::ph
that we saw in the assembly listing above. There is no ph
for .quad ph
to resolve to.
OK. Are we stuck now? In fact not. You can instruct the compiler to use a particular symbol name, instead of the mangled one. With
int foo asm ("bar") = 2;
the compiler will use "bar" as the symbol name for foo
(both gcc and clang support this).
Of course if we just do
static constinit placeholder ph asm("ph") {{}};
we fall in the opposite trap of having multiple definitions for "ph". We need to define unique names for our symbols, but there is more or less standard trick for this, based on __COUNTER__
macro. We also need a couple of, again standard, macros for concatenation and stringification. The final version looks like this:
#define CAT0(a, b) a ## b
#define CAT(a, b) CAT0(a, b)
#define STR0(x) # x
#define STR(x) STR0(x)
#define FAST_STATIC_DO(T, id) \
*({ \
struct placeholder { \
alignas(T) char buf[sizeof(T)]; \
}; \
static constinit placeholder id asm(STR(id)) {{}}; \
__asm__(".pushsection STATIC_" #T ",\"aw\",@progbits\n" \
".quad " STR(id) "\n" \
".popsection\n"); \
reinterpret_cast<T *>(id.buf); \
})
#define FAST_STATIC(T) FAST_STATIC_DO(T, CAT(ph_, __COUNTER__))
template <typename T> static int section_init(T **start, T **stop)
{
for (T **s = start; s < stop; ++s)
new (*s) T; /* Construct in-place. */
return 0;
}
#define FAST_STATIC_INIT(T) \
extern "C" T *__start_STATIC_ ## T; \
extern "C" T *__stop_STATIC_ ## T; \
static int _init_ ## T = section_init<T>(&__start_STATIC_ ## T, \
&__stop_STATIC_ ## T);
The resulting assembly for foo()
and foo_init()
[godbolt] accesses statics directly:
foo(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov QWORD PTR [rbp-8], OFFSET FLAT:ph_0
mov QWORD PTR [rbp-16], OFFSET FLAT:ph_1
mov rax, QWORD PTR [rbp-8]
mov edx, DWORD PTR [rax]
mov rax, QWORD PTR [rbp-16]
mov eax, DWORD PTR [rax]
add eax, edx
add eax, 1
pop rbp
ret
foo_inline(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov QWORD PTR [rbp-8], OFFSET FLAT:ph_2
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
imul eax, DWORD PTR [rbp-20]
pop rbp
ret
Finally we won!
"Бывает, что усердие превозмогает и рассудок"
К. Прутков, Мысли и афоризмы, II, 27
P.S. The actual implementation requires more bells and whistles. Parameters need to be passed to constructors, they can be stored within the placeholder. Typenames are not necessarily valid identifiers (think A::B::foo<T>
), so the section name needs to be a separate parameter, etc., but the basic idea should be clear.
P.P.S. I have a similar story about optimising access to thread-local variables, involving C++20 constinit
and __attribute__((tls_model("initial-exec")))
.