Opened 2 years ago

Closed 22 months ago

#64368 closed defect (fixed)

cmake @3.22.1_0: "Error running link command: Invalid argument" when run under trace mode

Reported by: chrstphrchvz (Christopher Chavez) Owned by: michaelld (Michael Dickens)
Priority: Normal Milestone:
Component: ports Version: 2.7.1
Keywords: tracemode Cc: ryandesign (Ryan Carsten Schmidt), l2dy (Zero King), cooljeanius (Eric Gallager)
Port: cmake

Description (last modified by chrstphrchvz (Christopher Chavez))

I observe that when building ports under trace mode, the file descriptors given to processes can reach/exceed the default value of FD_SETSIZE, which is 1024 when not manually defined before including <sys/select.h>. CMake—and possibly other programs—are compiled without manually defining FD_SETSIZE on macOS, and without defining _DARWIN_UNLIMITED_SELECT or _DARWIN_C_SOURCE. This leads to at least two problems using CMake under trace mode: Error running link command: Invalid argument being output; and out-of-bounds memory accesses, one effect being corruption of the Invalid argument message.

More detailed description:

If a program is compiled without defining _DARWIN_UNLIMITED_SELECT (on macOS 10.5 Leopard and later, and before including <sys/time.h> for more recent macOS versions) or _DARWIN_C_SOURCE (unsure which macOS versions), then calling select() for more than 1024 file descriptors fails, and errno is set to EINVAL.

The select() call in kwsysProcessWaitForPipe() in ProcessUNIX.c (inherited from KWSys—Kitware System Library) is affected by this limitation: select() will fail when max+1 exceeds 1024. One use of the affected call is when running cmake -E cmake_link_script. If a trace mode build uses enough file descriptors to cause the select() call to fail, then strerror(errno) is used to get the error message (for EINVAL, it is "Invalid argument"), which CMake eventually prints out as:

Error running link command: Invalid argument

Using the FD_CLR()/FD_SET()/FD_ISSET() macros with too high of a file descriptor is known to cause out-of-bounds accesses (on various platforms—not just macOS), as sizeof(fd_set) is directly dependent on FD_SETSIZE. On macOS, these macros implement bounds checking, but only for an fd_set residing on the stack, and only on very recent macOS versions. The relevant fd_set in CMake is a member of a structure allocated on the heap, and so corruption/bogus reads do occur for file descriptors which are too high: one example is corruption to the error message buffer located only a few slots after the fd_set in the same structure, causing CMake to instead output e.g.

Error running link command: Invalid `rgument

or

Error running link command: In^Valid argument

where ^V is the ASCII SYN character.

Some programs (e.g. Tcl, in tclUnixPort.h) manually define FD_SETSIZE to OPEN_MAX (from <sys/syslimits.h>). Currently, CMake only defines FD_SETSIZE manually (in ProcessUNIX.c) for Cygwin.

I observe that defining FD_SETSIZE to OPEN_MAX and defining _DARWIN_UNLIMITED_SELECT in ProcessUNIX.c prevents select() from failing with EINVAL when called for more than 1024 file descriptors.

  • Source/kwsys/ProcessUNIX.c

    old new  
    4040#if defined(__CYGWIN__)
    4141/* Increase the file descriptor limit for select() before including
    4242   related system headers. (Default: 64) */
    4343#  define FD_SETSIZE 16384
     44#elif defined(__APPLE__)
     45/* Increase the file descriptor limit for select() before including
     46   related system headers. (Default: 1024) */
     47#  define _DARWIN_UNLIMITED_SELECT
     48#  include <limits.h> /* OPEN_MAX */
     49#  define FD_SETSIZE OPEN_MAX
    4450#endif
    4551
    4652#include <assert.h>    /* assert */

I believe that both of these definitions are necessary to avoid issues. I observe that if _DARWIN_UNLIMITED_SELECT is defined but FD_SETSIZE is left at its default value, the fd_set structure is still not large enough to avoid out-of-bounds accesses by FD_SET()/FD_CLR()/FD_ISSET(); and the select() call, which previously would fail, now erroneously returns early (because fd_set is not large enough to indicate which file to block on), causing CMake to poll/busy-wait rather than block as intended, but without causing the build to fail. (I wonder if it is possible for the opposite to happen, where select() erroneously blocks on a file descriptor that fd_set is too small to describe, and cause the build to stall indefinitely; this would resemble unresolved trace mode issues I have noticed and reported for other ports.)

The port I used to investigate this issue is mysql8, but I hesitate to claim that making these changes to CMake would fix #63455, at least for anyone other than myself, since I’ve already led others who are much more knowledgable to believe that that issue is due to outdated libtool. I also don’t know if this fixes #55086.

I don’t know whether these changes are something upstream CMake/KWSys should incorporate, because I don’t know whether this is necessarily a bug in CMake, i.e. whether CMake (particularly when running cmake_link_script or other command needing select()) is normally able to get a file descriptor 1024 or higher, and encounter this issue outside of trace mode. But I do find that it is possible, at least on certain recent macOS versions, to naïvely compile a simple program which successfully opens more than 1024 files without defining FD_SETSIZE or _DARWIN_UNLIMITED_SELECT.

I don’t know if file descriptors growing to unusually large values under trace mode is expected behavior. If it isn’t—e.g. due to file descriptor leak in trace mode itself, or not cleaning up after programs with file descriptor leaks—then I don’t know whether it is causing “Too many open files” errors I sometimes observe under trace mode, but have neglected to report. I also don’t know if this issue should instead indicate something missing from the design of trace mode, like abstraction of file descriptor values. And I don’t know if this issue should be considered grave enough to condemn previous trace mode builds because FD_SET()/FD_CLR()/FD_ISSET() may have been silently causing programs to misbehave due to invalid memory accesses.

One idea for making this issue easier to reproduce and find affected programs would be to reserve file descriptors under 1024 at the beginning of trace mode, forcing programs to use file descriptors 1024 and higher. Also note that parallel building—and the number of processors if enabled—are factors in triggering this issue, as these affects the order in which steps in a build are done, and in turn how large the file descriptor values are at a certain point in the build.

Change History (11)

comment:1 Changed 2 years ago by chrstphrchvz (Christopher Chavez)

Description: modified (diff)

comment:2 Changed 2 years ago by ryandesign (Ryan Carsten Schmidt)

Cc: ryandesign added

comment:3 Changed 2 years ago by neverpanic (Clemens Lang)

Trace mode should only use one file descriptor per thread in addition to what is being used without trace mode. I doubt that CMake's linking tests would usually use 1024 file descriptors, so that means there's probably a leak somewhere, possibly related to dynamically creating and destroying threads.

I'm in favor of sending this patch upstream, since from what I understand this will also lead to memory corruption if a CMake build ever tried to legitimately open more than 1024 files, right?

Thanks for the extensive debugging. I had seen those error messages before and traced them to the invocation of select(), but couldn't figure out what was wrong.

comment:4 Changed 2 years ago by neverpanic (Clemens Lang)

Maybe we should just give up on persisting sockets in darwintrace code, and open new ones whenever socket communication is needed. That would probably avoid this entire problem, since our interposing functions all run to completion, so there should never be an open socket left over. Maybe I'll find a few minutes to benchmark how bad this would be from a performance point of view.

comment:5 Changed 2 years ago by chrstphrchvz (Christopher Chavez)

Description: modified (diff)

Adjusted patch: POSIX would say to include <limits.h> for OPEN_MAX (which is also where Tcl actually picks it up from), not <sys/syslimits.h>.

Submitted upstream: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/6909

comment:6 Changed 2 years ago by neverpanic (Clemens Lang)

Thanks!

I've meanwhile spent a little bit of time to look at the performance difference of just not keeping the file descriptors in darwintrace.dylib open, and it turns out that that really doesn't make a big difference one way or the other. So in addition to this being fixed in CMake, I'll probably follow up with a change that modifies trace mode to hopefully no longer create that many file descriptors.

comment:7 Changed 2 years ago by neverpanic (Clemens Lang)

https://github.com/macports/macports-base/pull/255 may fix the file descriptor leak, turns out I forgot an fclose(3) in basically every exec.

comment:8 Changed 2 years ago by l2dy (Zero King)

Cc: l2dy added

comment:9 Changed 2 years ago by chrstphrchvz (Christopher Chavez)

This ticket can be closed when cmake is updated to 3.23.0 as it incorporates the patch proposed here.

comment:10 Changed 2 years ago by cooljeanius (Eric Gallager)

Cc: cooljeanius added

comment:11 Changed 22 months ago by neverpanic (Clemens Lang)

Resolution: fixed
Status: assignedclosed

[661ded822f7411509d5a1869ab6ad76006aae497/macports-ports] updates to 3.23.2, so we can consider this solved.

Additionally, [34dcfa69c2eabc1f7eeb7658264d3ee3bb626ef6/macports-base] is also likely to fix this.

Note: See TracTickets for help on using tickets.