Opened 10 years ago

Closed 10 years ago

#45309 closed defect (fixed)

p11-kit child atfork handler causes gimp forks to crash due to gegl/glib2 dlclose(), no gimp plugins available

Reported by: anddam (Andrea D'Amore) Owned by: ryandesign (Ryan Carsten Schmidt)
Priority: Normal Milestone:
Component: ports Version:
Keywords: Cc: cooljeanius (Eric Gallager), eborisch (Eric A. Borisch), jon.fairbairn@…, ewen-naos-nz (Ewen McNeill), kurtjaeke@…, mdkcore@…, p-bro, wdc@…, su-v, machsna (J. ‘mach’ Wust), ClintO (Clint O), Crosmatron, dershow, lalakerhater@…, skymoo (Adam Mercer), jpenney (Jason Penney), jeremyhu (Jeremy Huddleston Sequoia), dtakahashi@…, EnekoGotzon
Port: gegl p11-kit gimp2 glib2

Description

GIMP doesn't load its plug-ins at boot, it prints out a dbus-related error on standard error

Error spawning command line `launchctl getenv DBUS_LAUNCHD_SESSION_BUS_SOCKET': Child process killed by signal 11

and then

LibGimpBase-WARNING **: gimp: gimp_wire_read(): error

The same symptoms have been confirmed by the port maintainer.

Running gimp with --verbose provides a pair of

Querying plug-in: '/opt/local/lib/gimp/2.0/plug-ins/blinds'
Terminating plug-in: '/opt/local/lib/gimp/2.0/plug-ins/blinds'

output lines per each plug-in, this results in a GIMP instance with no plug-in that is of very little help, among other features it won't read/write any format other than its own XCF.

On my 10.9.5 system with dbus-system daemon and dbus-session agent from MacPorts running I have a DBUS_LAUNCHD_SESSION_BUS_SOCKET env variable in shell but not in launchd environment, this is probably why GIMP is complaining about the getenv at start.

I'm attaching full standard output and error.

Attachments (8)

gimp-out.txt (31.8 KB) - added by anddam (Andrea D'Amore) 10 years ago.
standard output with --verbose flag
gimp-err.txt (13.4 KB) - added by anddam (Andrea D'Amore) 10 years ago.
standard error with --verbose flag
gimp-lldb-b-workq.txt (13.9 KB) - added by zhu527812567@… 10 years ago.
lldb transcript, break at workq_open and workq_return
gimp-lldb-transcript-pthread_atfork.txt (12.6 KB) - added by dtakahashi@… 10 years ago.
lldb transcript breaking at pthread_atfork()
gspawn.c-print-sleep.diff (1.2 KB) - added by ewen-naos-nz (Ewen McNeill) 10 years ago.
glib/gspawn.c(): fork_exec_with_pipes() debug print patch
dtruss-gimp-2014-11-02.txt (535.9 KB) - added by ewen-naos-nz (Ewen McNeill) 10 years ago.
Output of "sudo dtruss -f -n gimp-2.8 2>&1"
main.c (1.8 KB) - added by dtakahashi@… 10 years ago.
modified main.c of the gimp-2.8.14
glib2-45309.patch (2.3 KB) - added by jeremyhu (Jeremy Huddleston Sequoia) 10 years ago.
glib2-45309.patch

Download all attachments as: .zip

Change History (68)

Changed 10 years ago by anddam (Andrea D'Amore)

Attachment: gimp-out.txt added

standard output with --verbose flag

Changed 10 years ago by anddam (Andrea D'Amore)

Attachment: gimp-err.txt added

standard error with --verbose flag

comment:1 Changed 10 years ago by dbevans (David B. Evans)

Port: gimp gimp-app removed
Status: newassigned
Version: 2.3.1

Thanks for opening this ticket and confirming the problem. It appears there are two separate issues involved here, first the dbus problem and secondly the gimp plugin one.

The dbus issue seems to be with the GDBus API provided by the GIO component of glib2. The stock code expects the environment variable DBUS_SESSION_BUS_ADDRESS to be set a session debus address of the form 'unix:path=<path to dbus session socket>'. We have added custom code that attempts to obtain the socket path using the the launchctl command mentioned in the error message if DBUS_SESSION_BUS_ADDRESS is not set.

On my 10.9.5 system, manually running the command yields the socket address as follows.

$ launchctl getenv DBUS_LAUNCHD_SESSION_BUS_SOCKET
/tmp/launch-vW5khj/unix_domain_listener

I can override the GIMP dbus error by setting DBUS_SESSION_BUS_ADDRESS with this value

export DBUS_SESSION_BUS_ADDRESS="unix:path=/tmp/launch-vW5khj/unix_domain_listener"

You can get more debugging output from glib2 by setting the additional env variable

export G_DBUS_DEBUG=address

This produces the additional output

GDBus-debug:Address: In g_dbus_address_get_for_bus_sync() for bus type 'session'
GDBus-debug:Address: env var DBUS_SESSION_BUS_ADDRESS is not set
GDBus-debug:Address: env var DBUS_SYSTEM_BUS_ADDRESS is not set
GDBus-debug:Address: env var DBUS_STARTER_BUS_TYPE is not set
GDBus-debug:Address: launchctl command line: `launchctl getenv DBUS_LAUNCHD_SESSION_BUS_SOCKET'
GDBus-debug:Address: Cannot look-up address bus type 'session': Error spawning command line `launchctl getenv DBUS_LAUNCHD_SESSION_BUS_SOCKET': Child process killed by signal 11

This give me an idea of where to look in glib2 to see what's going on.

No idea what's going on with the plugin load failures as yet. Will look at it today.

comment:2 Changed 10 years ago by cooljeanius (Eric Gallager)

Cc: egall@… added

Cc Me!

comment:3 Changed 10 years ago by eborisch (Eric A. Borisch)

Cc: eborisch@… added

Cc Me!

comment:4 Changed 10 years ago by eborisch (Eric A. Borisch)

Same symptoms on 10.6.8, FWIW.

comment:5 in reply to:  4 Changed 10 years ago by dbevans (David B. Evans)

Replying to eborisch@…:

Same symptoms on 10.6.8, FWIW.

I've reproduced the same symptoms with fresh builds on SL, Mavericks and even Yosemite Beta 5 so it looks like OS version is not an issue.

comment:6 Changed 10 years ago by dbevans (David B. Evans)

Status:

So far I've been able to confirm that the dbus issue and the failing plugins have a common element: they both involve spawning a child process that terminates prematurely.

In the dbus case, g_dbus_address_get_for_bus_sync() attempts to execute the launchctl command in a spawned child process which reports killed by signal 11. However, this only happens when running gimp. For example, running eog (or most other GNOME apps) the child process is spawned successfully and the correct result returned.

$ eog
GDBus-debug:Address: In g_dbus_address_get_for_bus_sync() for bus type 'session'
GDBus-debug:Address: env var DBUS_SESSION_BUS_ADDRESS is not set
GDBus-debug:Address: env var DBUS_SYSTEM_BUS_ADDRESS is not set
GDBus-debug:Address: env var DBUS_STARTER_BUS_TYPE is not set
GDBus-debug:Address: launchctl command line: `launchctl getenv DBUS_LAUNCHD_SESSION_BUS_SOCKET'
GDBus-debug:Address: launchctl stdout:
  0000: 2f 74 6d 70  2f 6c 61 75  6e 63 68 2d  37 4d 46 69    /tmp/launch-7MFi
  0010: 64 37 2f 75  6e 69 78 5f  64 6f 6d 61  69 6e 5f 6c    d7/unix_domain_l
  0020: 69 73 74 65  6e 65 72 0a  00                          istener..
GDBus-debug:Address: Returning address 'unix:path=/tmp/launch-7MFid7/unix_domain_listener' for bus type 'session'

In the case of the GIMP plugins, the failure point occurs when GIMP is querying the individual plugins for their characteristics. The plugin is executed as a child process which communicates bidirectionally with the main program using a pair of pipes. After sending a query command to each plugin, the main process attempts to read an expected response but immediately gets an EOF instead and the plugin is skipped. A crash report for the child is produced as follows:

Process:         gimp-2.8 [62774]
Path:            /opt/local/bin/gimp-2.8
Identifier:      gimp-2.8
Version:         ???
Code Type:       X86-64 (Native)
Parent Process:  gimp-2.8 [62731]
Responsible:     Terminal [361]
User ID:         501

Date/Time:       2014-10-09 12:14:46.255 -0700
OS Version:      Mac OS X 10.9.5 (13F34)
Report Version:  11
Anonymous UUID:  25BD4C19-81DA-575C-A9EB-3C11B6BF2F84


Crashed Thread:  0  Dispatch queue: com.apple.main-thread

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x00000001118b69b5

VM Regions Near 0x1118b69b5:
    __LINKEDIT             0000000110f36000-0000000110f47000 [   68K] r--/rwx SM=COW  /opt/local/lib/*.dylib
--> 
    MALLOC_TINY            00007ff479400000-00007ff479800000 [ 4096K] rw-/rwx SM=COW  

Application Specific Information:
*** multi-threaded process forked ***
crashed on child side of fork pre-exec

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   ???                           	0x00000001118b69b5 0 + 4589316533
1   libsystem_pthread.dylib       	0x00007fff85d06f5b _pthread_fork_child_postinit + 69
2   libsystem_c.dylib             	0x00007fff85c0317e fork + 29
3   libglib-2.0.0.dylib           	0x000000010f692067 fork_exec_with_pipes + 695 (gspawn.c:1337)
4   libglib-2.0.0.dylib           	0x000000010f691293 g_spawn_async_with_pipes + 851 (gspawn.c:661)
5   libglib-2.0.0.dylib           	0x000000010f690f26 g_spawn_async + 198 (gspawn.c:146)
6   gimp                          	0x000000010e078486 gimp_plug_in_open + 1894 (gimpplugin.c:363)
7   gimp                          	0x000000010e0826a1 gimp_plug_in_manager_call_query + 641 (gimppluginmanager-call.c:75)
8   gimp                          	0x000000010e087e9e gimp_plug_in_manager_query_new + 382 (gimppluginmanager-restore.c:355)
9   gimp                          	0x000000010e087688 gimp_plug_in_manager_restore + 536 (gimppluginmanager-restore.c:112)
10  gimp                          	0x000000010e0b4a71 gimp_real_restore + 97 (gimp.c:715)
11  libgobject-2.0.0.dylib        	0x000000010f559674 g_cclosure_marshal_VOID__POINTER + 244 (gmarshal.c:1201)
12  libgobject-2.0.0.dylib        	0x000000010f553398 g_type_class_meta_marshal + 120 (gclosure.c:966)
13  libgobject-2.0.0.dylib        	0x000000010f55288a g_closure_invoke + 522 (gclosure.c:773)
14  libgobject-2.0.0.dylib        	0x000000010f57583b signal_emit_unlocked_R + 2155 (gsignal.c:3591)
15  libgobject-2.0.0.dylib        	0x000000010f5774ee g_signal_emit_valist + 6046 (gsignal.c:3309)
16  libgobject-2.0.0.dylib        	0x000000010f577e94 g_signal_emit + 356 (gsignal.c:3365)
17  gimp                          	0x000000010e0b1849 gimp_restore + 1449 (gimp.c:1013)
18  gimp                          	0x000000010dd56704 app_run + 708 (app.c:221)
19  gimp                          	0x000000010dd5a017 main + 1255 (main.c:474)
20  libdyld.dylib                 	0x00007fff85d005fd start + 1

Thread 0 crashed with X86 Thread State (64-bit):
  rax: 0x0000000000000004  rbx: 0x0000000000000028  rcx: 0x00000001118b69b5  rdx: 0x0000000000000000
  rdi: 0x0000000000000103  rsi: 0x0000000000000403  rbp: 0x00007fff51eb5440  rsp: 0x00007fff51eb5418
   r8: 0x0000000000000003   r9: 0x0000000000000000  r10: 0x00000000000002bc  r11: 0x0000000000000246
  r12: 0x000000010e0786b0  r13: 0x0000000000000000  r14: 0x000000010e6d01e0  r15: 0x0000000000000001
  rip: 0x00000001118b69b5  rfl: 0x0000000000010202  cr2: 0x00000001118b69b5
  
Logical CPU:     3
Error Code:      0x00000014
Trap Number:     14

Again spawning the child process has failed (SIGSEGV).

No idea what's causing the problem and why it is specific to GIMP at this point. If any one else has an idea about what's going on here or better how to fix it please let me know. I'm heading out for the weekend but should get back to this on Sunday.

comment:7 Changed 10 years ago by neverpanic (Clemens Lang)

On OS X, you cannot call non-async-safe functions in a child process fork(2)ed from a multithreaded process before calling execve(2). This looks like some post-fork-phtread callback code that calls stuff it shouldn't call. A valid solution for this kind of problem is to replace the fork(2) and execve(2) combination with posix_spawn(2).

Please make sure upstream is aware of this problem.

For further reference, for example: http://lists.gnu.org/archive/html/coreutils/2013-11/msg00083.html. Googling for

*** multi-threaded process forked ***

also gets you lots of similar hits.

Last edited 10 years ago by neverpanic (Clemens Lang) (previous) (diff)

comment:8 Changed 10 years ago by ewen-naos-nz (Ewen McNeill)

I ran into this same problem on Mavericks. After spending a while trying reactivating older ports (eg, older gtk2, older glib2 2.40) -- which didn't seem to solve the problem, and involved lots of rebuilding things depending on them, I started looking at the glib2 source.

In particular the verbose error output says:

Error spawning command line `launchctl getenv DBUS_LAUNCHD_SESSION_BUS_SOCKET': Child process killed by signal 11
Dynamic session lookup supported but failed: launchctl terminated abnormally without any error message
Not enough memory

which implies that maybe it's a memory allocation being failed. So I started following the glib2 call chain for running a process, from the launchctl example above (seemed easier than trying to find where the plugins get loaded, since I imagine they're going through a common glib2 function.

The call path seems to be:

  1. g_spawn_command_line_sync()
  2. g_spawn_sync()
  3. fork_exec_with_pipes()
  4. do_exec()
  5. Optionally: (* child_setup)(user_data)
  6. g_execute()

and AFAICT in g_execute() (in glib/gspawn.c) it looks like there is a memory allocation performed if the thing being run does not have a full path (around line 1721, glib 2.42):

      freeme = name = g_malloc (pathlen + len + 1);

which is consistent with the command line printed out (running bare launchctl).

Memory allocations are one of those things which are typically on the "do not do between fork() and exec()." Although sometimes the rules for single threaded and multi-threaded applications are different about some of these things. (Hard to be sure if that will allocate memory or not, since it goes through all the glib2 memory wrappers too.)

So just as another option to consider for, eg, pthread atfork related things.

Possibly if someone figures out when it last worked (which is why I was trying downgrading glib2/gtk2/etc) then it'll be more obvious why it has stopped working.

Ewen

comment:9 Changed 10 years ago by jon.fairbairn@…

Cc: jon.fairbairn@… added

Cc Me!

comment:10 Changed 10 years ago by ewen-naos-nz (Ewen McNeill)

Cc: macports@… added

Cc Me!

comment:11 Changed 10 years ago by kurtjaeke@…

Cc: kurtjaeke@… added

Cc Me!

comment:12 in reply to:  8 Changed 10 years ago by hohro.st@…

I've also encountered this problem.

Possibly if someone figures out when it last worked (which is why I was trying downgrading glib2/gtk2/etc) then it'll be more obvious why it has stopped working.

There was no such problem at least 13 days ago (on Oct. 5). I think it occurred around the day when this ticket was opened.

comment:13 Changed 10 years ago by mdkcore@…

Cc: mdkcore@… added

Cc Me! (I also have the same problem, OS X 10.9.5)

Last edited 10 years ago by mdkcore@… (previous) (diff)

comment:14 Changed 10 years ago by zhu527812567@…

I encountered this problem when I was using OS X 10.7.5 (~Oct. 13)

I found that something in fork_exec_with_pipes()@glib2 is not allowed. comment:8 indicates the possible offender is g_malloc. I tried to revert these new versions but old versions (1 month ago) did not work. I remembered that I updated my xcode. What about gcc provided by macports? Not work. gcc-apple-4.2 was also used but still not work.

Thus I tried to implement something using posix_spawn but that syscall paniced my kernel. After consulting some Apple SAs, I believed it was CVE-2013-3954 and fixed in 10.9. I decided to update my system entirely. After the upgrade I reinstalled these ports and the problem still exists. On the other hand, posix_spawn does not panic. Thus for 10.9 and above posix_spawn things may be an incomplete alternative.

It seems that we'd better report this issue to upstream glib. This issue seems similar to ours: https://bugzilla.gnome.org/show_bug.cgi?id=738620 .

Thanks.

Last edited 10 years ago by ryandesign (Ryan Carsten Schmidt) (previous) (diff)

comment:15 in reply to:  14 Changed 10 years ago by ewen-naos-nz (Ewen McNeill)

Replying to zhu527812567@…:

It seems that we'd better report this issue to upstream glib. This issue seems similar to ours: https://bugzilla.gnome.org/show_bug.cgi?id=738620 .

That does seem wise. Thanks for pointing out that glib ticket, which does seem similar to me too. Since I've already got an account on the Gnome bug tracker I've added a "BTW" pointer to this MacPorts ticket there with this text:

Potentially related, over at the MacPorts (OS X) site there is ticket #45309 for Gimp failing to launch DBUS or detect any plugins since new versions of glib/gtk/gimp were incorporated into MacPorts about 11 days ago (so around 2014-10-10 ish). On the DBUS launch there's a "Not enough memory" error report. It's worked fine for years prior to that.

As far as I can tell gimp detects plugins by launching something, via glib functions, to auto-discover them (and the DBUS case is definitely supposed to be a program launch). Based on the debugging to date it looks like something in the fork() child is failing before it is able to get as far as calling exec*() -- but we have not yet determined precisely what. Memory allocation is one plausible theory, especially given the "Not enough memory" report.

Ticket URL: https://trac.macports.org/ticket/45309

Seems to affect OS X 10.6.8, 10.7, 10.9 and 10.10 (beta) at least.

It appears that on OS X the relevant glib functions do work in some other gtk applications (eg, eog is mentioned in the ticket), but not in gimp. My hunch is that single threaded applications have different malloc-after-fork rules than multi-threaded applications.

But I suspect if anyone gets closer to narrowing down the trigger event on OS X then it'd also be worth creating our own upstream ticket with more specific detail. And/or if we can narrow it down to a specific upstream version change. Particularly since the tone of that upstream ticket is basically "following POSIX is hard" (which it is, to follow strictly) and suggesting that they "need" to do things that do not follow POSIX.

Ewen

comment:16 Changed 10 years ago by ewen-naos-nz (Ewen McNeill)

Random thought, triggered by an off-hand comment a glib developer made on the glib ticket about it being theoretically possible to write a non-threaded gtk application: maybe what changed is the _order_ in which things are done at startup (ie, maybe in gimp or gtk). Maybe there is now a thread being spawned *before* DBUS launch/plugin discovery, where as previously those were happening before any other threads got started? Certainly to the OS one looks like "single threaded program wants to run something" and the other looks like "multi-threaded program wants to run something", which in several POSIX compliant OS get treated differently. (Not sure what effect this would have later on, eg, using plugins -- but if, eg, the malloc() for the full path name being discovered is the trigger, it might be that gimp is calling it with a full path name -- thus skipping that malloc -- when it uses the plugin.)

Ewen

comment:17 Changed 10 years ago by p-bro

Cc: p.bro@… added

Cc Me!

Changed 10 years ago by zhu527812567@…

Attachment: gimp-lldb-b-workq.txt added

lldb transcript, break at workq_open and workq_return

comment:18 Changed 10 years ago by zhu527812567@…

well ... the multi-thread issue (attachment ticket:45309:gimp-lldb-b-workq.txt ):
frame #20: 0x0000000100e5d03f libpango-1.0.0.dylib`pango_language_get_default + 65
calls CFLocaleCopyPreferredLanguages ---> __workq_open
and then thread 10, 11, 12 ....

The crash in dbus seems to be anothor issue.

Last edited 10 years ago by ryandesign (Ryan Carsten Schmidt) (previous) (diff)

comment:19 Changed 10 years ago by wdc@…

Cc: wdc@… added

Cc Me!

comment:20 Changed 10 years ago by wdc@…

I too am experiencing this problem with meld under MacOS 10.9:

wdc-home-2:gnucash wdc$ !meld
meld JTROS_Cash.csv.save JTROS_Cash.csv
Couldn't bind the translation domain. Some translations won't work.
'module' object has no attribute 'bind_textdomain_codeset'

** (process:65721): WARNING **: Trying to register gtype 'GMountMountFlags' as enum when in fact it is of type 'GFlags'

** (process:65721): WARNING **: Trying to register gtype 'GDriveStartFlags' as enum when in fact it is of type 'GFlags'

** (process:65721): WARNING **: Trying to register gtype 'GSocketMsgFlags' as enum when in fact it is of type 'GFlags'
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gobject/__init__.py:115: Warning: attempting to add an interface (GtkEditable) to class (HistoryFileEntry) after class_init
  type_register(cls, namespace.get('__gtype_name__'))
Dynamic session lookup supported but failed: launchd did not provide a socket path, verify that org.freedesktop.dbus-session.plist is loaded!
Traceback (most recent call last):
  File "/opt/local/bin/meld", line 178, in <module>
    main()
  File "/opt/local/bin/meld", line 164, in main
    already_running, dbus_app = meld.dbus_service.setup(app)
  File "/opt/local/lib/meld/meld/dbus_service.py", line 56, in setup
    bus = dbus.SessionBus()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/dbus/_dbus.py", line 211, in __new__
    mainloop=mainloop)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/dbus/_dbus.py", line 100, in __new__
    bus = BusConnection.__new__(subclass, bus_type, mainloop=mainloop)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/dbus/bus.py", line 122, in __new__
    bus = cls._new_for_bus(address_or_type, mainloop=mainloop)
dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoMemory: Not enough memory
wdc-home-2:gnucash wdc$ ps axwww | grep dbus
65716   ??  Ss     0:00.01 /opt/local/bin/dbus-daemon --nofork --session
66022 s001  S+     0:00.01 grep dbus
Last edited 10 years ago by ryandesign (Ryan Carsten Schmidt) (previous) (diff)

comment:21 Changed 10 years ago by su-v

Cc: suv-sf@… added

Cc Me!

comment:22 Changed 10 years ago by ewen-naos-nz (Ewen McNeill)

Some thoughts on debugging this, in case someone has more time: immediately after the fork() point there are two processes, the parent and the child. Whatever is going wrong is clearly in the child process, and it seems to fail quickly enough that it's tricky to get a debugger to attach:

  • "LANG=C lldb gimp-2.8" in one window
  • "b fork", then "run", wait for breakpoint to hit
  • "LANG=C lldb" in another window, and "process attach -w -n gimp-2.8" to get it to attach to the next gimp-2.8 process that appears
  • "c" in the first window, so the fork() happens

but AFAICT the child exits (due to the error we're trying to debug) before the lldb has time to do anything meaningful with it.

I did try installing gdb, for "set follow-fork-mode child" but even after doing the code signing dance it appears OS X doesn't have kernel support needed for "follow-fork-mode" to work automatically, so gdb has to be used in basically the same way, which also doesn't have time to work. Plus gdb didn't seem to find any symbols, where as lldb at least had function names, so I think gdb is a deadend.

So... I think the only meaningful way to debug this would be a code change in the child side of the fork() (ie, in glib's fork_exec_with_pipes()), which made the child wait a lot longer before it did something that caused it to exit again (ie, immediately at the top of the child part of the function). Say, sleeping for 5 seconds. Which would allow time to attach the debugger to the child (as above), break out to the debugger, and set some more breakpoints, then continue/step through the child side of the code.

It'll take a bit of time to change the glib code, recompile/install, and step through, but at least we'd know which particular bit of the child code is failing which would, eg, make it feasible to report upstream more specifically. I'm unlikely to have time to do this before late next week, but will have a go late next week if no one else has time to try before then.

Ewen

comment:23 Changed 10 years ago by ewen-naos-nz (Ewen McNeill)

Additional though: it looks like both the DBUS one and the plugins ones are segfaulting (signal 11), so it's possible to get a core dump:

ulimit -c unlimited
gimp-2.8

stop it after the first few have failed. The core files are in /cores/core.PID. Pick one then do:

ewen@ashram:~$ lldb gimp-2.8 -c /cores/core.1118
(lldb) target create "gimp-2.8" --core "/cores/core.1118"
warning: (x86_64) /cores/core.1118 load command 298 LC_SEGMENT_64 has a fileoff + filesize (0x20929000) that
extends beyond the end of the file (0x20928000), the segment will be truncated to match
Core file '/cores/core.1118' (x86_64) was loaded.
Process 0 stopped
* thread #1: tid = 0x0000, 0x000000010f16b9b5, stop reason = signal SIGSTOP
    frame #0: 0x000000010f16b9b5
-> 0x10f16b9b5:  addb   %al, (%rax)
   0x10f16b9b7:  addb   %al, (%rax)
   0x10f16b9b9:  addb   %al, (%rax)
   0x10f16b9bb:  addb   %al, (%rax)
(lldb) print $rax
(unsigned long) $0 = 4
(lldb) 

So it's failing accessing memory at the address in rax, which is 0x4 (confirmed by "register read --all"), which is almost certainly "memory allocation failed, and we didn't catch that, and then we offset by 4". But it's non-trivial without debugging symbols to figure out precisely what this point is (my theory is still on the malloc() I found by source inspection earlier... but no time to dig further now).

Ewen

comment:24 Changed 10 years ago by machsna (J. ‘mach’ Wust)

Cc: j_mach_wust@… added

Cc Me!

comment:25 in reply to:  23 Changed 10 years ago by zhu527812567@…

Replying to macports@…:

For the core files, the backtrace prints something like (frame 0 is 0x00000001069589e0 on my machine)

  * frame #0: 0x00000001069589e0
    frame #1: 0x00007fff98a4b17e libsystem_c.dylib`fork + 29
    frame #2: 0x000000010471bf43 libglib-2.0.0.dylib`fork_exec_with_pipes + 277
    frame #3: 0x000000010471ba5d libglib-2.0.0.dylib`g_spawn_sync + 334

OR lldb reports error: core file does not contain 0x1069589e0

The debugger cannot find 0x00000001069589e0 . One possible cause is that the address does not point to valid executable instructions at all.

-> 0x1069589e0:  addb   %al, (%rax)
   0x1069589e2:  addb   %al, (%rax)
   0x1069589e4:  addb   %al, (%rax)
   0x1069589e6:  addb   %al, (%rax)
   .... (many identical instructions)

The machine code (binary representation) for the instruction is binary zero, so maybe a zero page

(lldb) dis -s 0x0000000106958000
   0x106958000:  addb   %al, (%rax)
   0x106958002:  addb   %al, (%rax)
   0x106958004:  addb   %al, (%rax)
   0x106958006:  addb   %al, (%rax)
   .... 

well, this means that fork() returns to an invalid address page (which is filled with zero). Not only this page, the pages before and after are all zeros. In this case we cannot debug the child :( , since the kernel does not return to the correct position in the child.

comment:26 Changed 10 years ago by ClintO (Clint O)

Cc: clint.olsen@… added

Cc Me!

comment:27 Changed 10 years ago by Crosmatron

Cc: crosma@… added

Cc Me!

comment:28 Changed 10 years ago by dershow

Cc: dersh@… added

Cc Me!

comment:29 Changed 10 years ago by lalakerhater@…

Cc: lalakerhater@… added

Cc Me!

comment:30 Changed 10 years ago by lalakerhater@…

Just wanted to add the phrase "Unknown file type" to this ticket in case others are searching for it. That's the message the UI will return when trying to open files other than .xcf.

comment:31 Changed 10 years ago by skymoo (Adam Mercer)

Cc: ram@… added

Cc Me!

comment:32 Changed 10 years ago by jpenney (Jason Penney)

Cc: jpenney@… added

Cc Me!

comment:33 Changed 10 years ago by wdc@…

I poked at this problem a little more. At the present time I am SUCCESSFULLY running meld and dbus. I'm not seeing the "DBus.Error.NoMemory" error.

To those experiencing this problem I have one question: Have you performed a system restart since installing dbus for the first time? After poking rather a lot at what might be different between my failed and working states, that seems to be the primary difference. I did NOT restart after installing dbus the first time.

I did other stuff, and I'm asking others in another thread if they think the other stuff I did mattered. I think it didn't and that the restart did matter. What do you other folks say?

comment:34 Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

_pthread_fork_child_postinit basically just loops over a bunch of atfork handlers that are registered with pthreads and executes them and then resets the pthreads library to its initial state.

I suspect that something is calling pthread_atfork() to register an atfork handler and and the child handler is causing the crash. I suggest that you break on pthread_atfork and see what's doing that.

comment:35 Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

Cc: jeremyhu@… added

Cc Me!

comment:36 Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

FWIW, gimp seems to run fine with nothing printed to stout on my machine.

comment:37 Changed 10 years ago by dtakahashi@…

I still have the same problem on 10.8.5. I tried system reboot after the port upgrade (which restarts dbus), but it did not aid the problem including dbus error, so causes of those two issues may not be the same.

comment:38 Changed 10 years ago by dtakahashi@…

Cc: dtakahashi@… added

Cc Me!

Changed 10 years ago by dtakahashi@…

lldb transcript breaking at pthread_atfork()

comment:39 Changed 10 years ago by ewen-naos-nz (Ewen McNeill)

FTR, on 10.9 I'm also seeming only the same four pthread_atfork() calls as dtakahashi added at comment:38 (gimp-lldb-transcript-pthread_atfork.txt), although I'm getting somewhat fewer symbols resolved in lldb in the "bt". Anyway the four calls seem to be:

  1. ImageLoaderMachO::doImageInit()
  2. ImageLoaderMachO::doModInitFunctions()
  3. getpwuid()
  4. getpwuid()

The first two seem to be tied up with loading the MachO (executable/libraries) images into memory, and the last two seem to be deep in the callpath of getpwuid(). All of which seem to be inside the OS libraries, rather than inside GIMP/GTK/glib. So I suspect it's unlikely they've changed recently to cause this to suddenly break like this.

There aren't any more pthread_atfork() calls registered before we start getting DBUS dynamic session lookup failures and plugin discovery failures. So I still currently suspect it's something in the actual child path after fork(), rather than an _atfork() handler.

Ewen

Last edited 10 years ago by ryandesign (Ryan Carsten Schmidt) (previous) (diff)

comment:40 Changed 10 years ago by ewen-naos-nz (Ewen McNeill)

After spending a bunch of time rebuilding glib2 with various print debugs (and some sleep(5) calls), I take it back. As best I can tell the child is not making it to the next line of code in gspawn.c:fork_exec_with_pipes() -- a print immediately after the fork() runs in the parent, but not obviously in the child. With the patch I was trying (I'll attach in a minute), output looks like this:

ewen@ashram:~$ gimp-2.8
About to fork in fork_exec_with_pipes()
Immediately after fork in fork_exec_with_pipes(), pid=66894
Error spawning command line `launchctl getenv DBUS_LAUNCHD_SESSION_BUS_SOCKET': Child process killed by signal 11
Dynamic session lookup supported but failed: launchctl terminated abnormally without any error message
Not enough memory
About to fork in fork_exec_with_pipes()
Immediately after fork in fork_exec_with_pipes(), pid=66901

(gimp-2.8:66893): LibGimpBase-WARNING **: gimp-2.8: gimp_wire_read(): error

The pid=NNNNN is the return value from fork(), so the fact we're getting a non-zero value means that print is coming from the parent process (if it was from the child process it'd say "pid=0"). I believe the gimp_wire_read() is just a result of the child suddenly going away, and thus the pipe to it being closed.

In the hope of getting some idea of what was happening in the child, I found some hints on dtrace on OS X and tried using "dtruss" to see what was happening, viz:

lldb /opt/local/bin/gimp-2.8
b main
run

to start a gimp-2.8 process and pause it, then in another window:

sudo dtruss -f -n gimp-2.8 2>&1

to trace (and follow children) of the "gimp-2.8" processes, and then "c" in the "lldb" session.

After letting the gimp-2.8 process run for a little bit, I had a look for fork() calls in the dtruss() output. The most interesting bit is this:

66945/0x224bb8:  fork()          = 0 0
66945/0x224bb8:  thread_selfid(0x7FFF7536B310, 0x7FFF87915E2E, 0x1)              = 2247608 0
66945/0x224bb8:  bsdthread_register(0x7FFF8AB13FBC, 0x7FFF8AB13FAC, 0x2000)              = -1 Err#22

and:

66947/0x224ccd:  fork()          = 0 0
66947/0x224ccd:  thread_selfid(0x7FFF7536B310, 0x7FFF87915E2E, 0x1)              = 2247885 0
66947/0x224ccd:  bsdthread_register(0x7FFF8AB13FBC, 0x7FFF8AB13FAC, 0x2000)              = -1 Err#22

and so on. It appears basically all of the children follow that pattern of three syscalls -- a return from fork(), a thread_selfid(), and then a bsdthread_register() which then appears to fail. I suspect that's not good, but I've been able to find out very little about what bsdthread_register() actually does. (I'll also attach that full dtruss output in a minute.)

The only other time those PIDs are seen is when the parent wait4() collects the return code. So they're definitely vanishing pretty soon after that.

At this point I think we either (a) need a much smaller reproducible test case or (b) someone who knows more about the internals of OS X (or both). It's possible there's some other _atfork() style handler causing havoc, but off hand I don't know other ways those might be registered -- and if it'd cause bsdthread_register() to fail.

Ewen

Changed 10 years ago by ewen-naos-nz (Ewen McNeill)

Attachment: gspawn.c-print-sleep.diff added

glib/gspawn.c(): fork_exec_with_pipes() debug print patch

Changed 10 years ago by ewen-naos-nz (Ewen McNeill)

Attachment: dtruss-gimp-2014-11-02.txt added

Output of "sudo dtruss -f -n gimp-2.8 2>&1"

comment:41 Changed 10 years ago by ewen-naos-nz (Ewen McNeill)

Finally in case someone else wants to try the patch-glib-and-debug thing, the approximate idea is:

sudo port extract glib2
sudo port patch glib2
cd $(port dir glib2)
sudo vi Portfile
# add: "revison 1"
sudo port configure glib2

cd $(port dir glib2)
cd work/glib-*/glib
# patch gspawn.c with gspawn.c-print-sleep.diff or otherwise edit

sudo port build glib2
sudo port destroot glib2
sudo port -f -k install glib2
# Expect to see --->  Installing glib2 @2.42.0_1

beware that MacPorts will do a clean for you if you bump revision and then do "port configure", so keep a copy of your hacky local changes to copy back. Also beware that you need the (local) revision number to be higher than what is installed to get a new version installed; it'll be ignored otherwise. (I just kept reactivating my upstream _0 and uninstalling my local _1, then rebuilding the _1 with local patch, and then reinstalling the local _1; probably the wrong approach, but I didn't want to run through a bunch of revisions and get out of sync with MacPorts, and it seems to just be a integer "local" revision -- ie MacPorts revision, rather than user revision.)

Ewen

comment:42 Changed 10 years ago by ewen-naos-nz (Ewen McNeill)

FWIW, it looks like bsdthread_register() Err#22 (EINVAL) might be a dtruss artefact (which is someone talking about using dtruss to trace go programs, apparently they were looping in the error handler in that case).

But the fact it's not making it to the next line of code still implies it's something in the _atfork() handlers, not in the glib fork_exec_with_pipes() code.

Ewen

comment:43 Changed 10 years ago by EnekoGotzon

Cc: enekogotzon@… added

Cc Me!

comment:44 Changed 10 years ago by dtakahashi@…

I investigated the issue from the gimp. I attach the simplified main.c, which can reproduce "the death of lanchctl" (at least here). As far as I tried, the combination of two functions, g_option_context_add_group (context, gegl_get_option_group ()); and g_option_context_parse_strv (context, &argv, &error1); seems to be important for killing launchctl.

Changed 10 years ago by dtakahashi@…

Attachment: main.c added

modified main.c of the gimp-2.8.14

comment:45 in reply to:  39 Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

Replying to macports@…:

FTR, on 10.9 I'm also seeming only the same four pthread_atfork() calls as dtakahashi added at comment:38 (gimp-lldb-transcript-pthread_atfork.txt), although I'm getting somewhat fewer symbols resolved in lldb in the "bt". Anyway the four calls seem to be:

  1. ImageLoaderMachO::doImageInit()
  2. ImageLoaderMachO::doModInitFunctions()
  3. getpwuid()
  4. getpwuid()

The first two seem to be tied up with loading the MachO (executable/libraries) images into memory, and the last two seem to be deep in the callpath of getpwuid(). All of which seem to be inside the OS libraries, rather than inside GIMP/GTK/glib. So I suspect it's unlikely they've changed recently to cause this to suddenly break like this.

Your analysis is a tad off here.

The first one is part of the initialization of CoreFoundation. The last two happen *AFTER* the fork and are thus not relevant. The second one is surely the culprit:

(lldb) bt
49	* thread #1: tid = 0x39e8, 0x00007fff92e420ff libsystem_c.dylib`pthread_atfork, queue = 'com.apple.main-thread, stop reason = breakpoint 1.1
50	    frame #0: 0x00007fff92e420ff libsystem_c.dylib`pthread_atfork
51	    frame #1: 0x00007fff92e3bfc0 libsystem_c.dylib`pthread_once + 87
52	    frame #2: 0x00007fff5fc13378 dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 236
53	    frame #3: 0x00007fff5fc13762 dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 46
54	    frame #4: 0x00007fff5fc1006e dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&) + 380
55	    frame #5: 0x00007fff5fc0ffc4 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&) + 210
56	    frame #6: 0x00007fff5fc0ffc4 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&) + 210
57	    frame #7: 0x00007fff5fc0ffc4 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&) + 210
58	    frame #8: 0x00007fff5fc0ffc4 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&) + 210
59	    frame #9: 0x00007fff5fc0ffc4 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&) + 210
60	    frame #10: 0x00007fff5fc0ffc4 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&) + 210
61	    frame #11: 0x00007fff5fc0feba dyld`ImageLoader::runInitializers(ImageLoader::LinkContext const&, ImageLoader::InitializerTimingList&) + 54
62	    frame #12: 0x00007fff5fc04e38 dyld`dyld::runInitializers(ImageLoader*) + 89
63	    frame #13: 0x00007fff5fc0a87c dyld`dlopen + 581
64	    frame #14: 0x00007fff952cfdd8 libdyld.dylib`dlopen + 57
65	    frame #15: 0x000000010124b319 libgmodule-2.0.0.dylib`g_module_open + 1382
66	    frame #16: 0x00000001011fc6ca libgegl-0.2.0.dylib`gegl_module_open + 20
67	    frame #17: 0x00000001011fc2c6 libgegl-0.2.0.dylib`gegl_module_load + 84
68	    frame #18: 0x00000001011fc231 libgegl-0.2.0.dylib`gegl_module_new + 152
69	    frame #19: 0x00000001011fcf57 libgegl-0.2.0.dylib`gegl_module_db_module_initialize + 157
70	    frame #20: 0x00000001011fbf5f libgegl-0.2.0.dylib`gegl_datafiles_read_directories + 653
71	    frame #21: 0x00000001011df7c3 libgegl-0.2.0.dylib`gegl_post_parse_hook + 587
72	    frame #22: 0x000000010149166e libglib-2.0.0.dylib`g_option_context_parse + 2092
73	    frame #23: 0x0000000101492076 libglib-2.0.0.dylib`g_option_context_parse_strv + 52
74	    frame #24: 0x000000010000fde8 gimp`main + 529
75	    frame #25: 0x00007fff952d07e1 libdyld.dylib`start + 1

It is being called by one of the initializers due to the dlopen().

I suspect that something later dlclose()d that library leaving us with a dead pointer.

Perhaps you can break on dlclose() after that pthread_atfork to see if my hunch is correct.

comment:46 Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

Also, it looks like you don't have debug symbols for your MacPorts code. For the relevant ports, I suggest that you set the optimize flags to '-O0 -g3' and build from source with -k. Also, ensure that installed libraries aren't stripped. You should then have dSYMs and source available for the MacPorts libraries

comment:47 Changed 10 years ago by ewen-naos-nz (Ewen McNeill)

Prompted by jeremyhu's analysis at comment:45, I noticed that the second pthread_atfork() call is inside libgegl, which is an image handling library -- at that I had an older version of that available. So I tried using the older version:

ewen@ashram:~$ port installed | grep gegl
  gegl @0.2.0_12+python27+x11
  gegl @0.2.0_13+python27+x11 (active)
ewen@ashram:~$ sudo port activate gegl @0.2.0_12+python27+x11
--->  Computing dependencies for gegl
--->  Deactivating gegl @0.2.0_13+python27+x11
--->  Cleaning gegl
--->  Activating gegl @0.2.0_12+python27+x11
--->  Cleaning gegl
ewen@ashram:~$

and after doing that, it's then possible to run GIMP with many fewer errors -- and it can actually load non XCF images (eg, it loaded a JPEG image for me) so it looks like the plugins work).

ewen@ashram:~$ /opt/local/bin/gimp-2.8
GEGL-/opt/local/include/glib-2.0/glib/gmessages.h-Message: Module '/opt/local/lib/gegl-0.2/ff-load.so' load error: dlopen(/opt/local/lib/gegl-0.2/ff-load.so, 10): Library not loaded: /opt/local/lib/libavformat.55.dylib
  Referenced from: /opt/local/lib/gegl-0.2/ff-load.so
  Reason: image not found
While parsing XMP metadata:
Error: No XMP packet found
ewen@ashram:~$ 

So it appears we have at least reached a workaround:

sudo port activate gegl @0.2.0_12+python27+x11

assuming you still have the older version around, and that:

  1. something changed between gegl 0.2.0_12 and 0.2.0_13 which matters
  2. something changed in gegl's build environment recently which matters
  3. not being able to load libavformat matters somehow (I suspect it's not loading because it refers to a different libavformat filename)

A very quick play with lldb breaking on dlclose() suggested that dlclose() was being called in that gegl path, so the trigger in the _atfork() might be something else. But I don't have much time to play with it today, so haven't investigated in any detail.

Thanks jeremyhu for pointing us in the right direction.

Ewen

comment:48 Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

So the reason this workaround is successful is because you're getting an error even earlier. The dlopen() is failing (due to the upgrade of ffmpeg making a dependency unavailable) and thus pthread_atfork() is never being called to register the atfork handler, and thus that atfork handler is not causing the crash on the child side of the fork because it is never called.

something changed between gegl 0.2.0_12 and 0.2.0_13 which matters

No, it was just a revbump to force a rebuild with newer ffmpeg (r125491)

something changed in gegl's build environment recently which matters

Nope.

not being able to load libavformat matters somehow (I suspect it's not loading because it refers to a different libavformat filename)

Yes. Not being able to load libavformat prevents pthread_atfork from registering the failing atfork handler.

comment:49 Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

A very quick play with lldb breaking on dlclose() suggested that dlclose() was being called in that gegl path,

Yep, I suspect that is resulting in the atfork handler getting removed from the process, but we still have a dangling pointer to it. Show the backtrace to see who is calling dlclose(). Assuming you have debug symbols, it should point you to the correct line in source as well which will be helpful.

so the trigger in the _atfork() might be something else.

I'm not really sure what you mean by this. Can you please elaborate?

From my analysis, we essentially have:

  1. gegl is dlopen()ed.
  2. Something in gegl or its dependencies registers a child atfork handler.
  3. Something is dlclose()d.
  4. fork() is called.
  5. The child side tries to call the atfork handler and crashes.

This would be expected if the registered atfork handler is inside of the code that was dlclose()d.

Also, if you set your optimization flags to -O0, you will avoid tail call optimizations, so we'll likely see the missing stack frames between:

	    frame #0: 0x00007fff92e420ff libsystem_c.dylib`pthread_atfork
	    frame #1: 0x00007fff92e3bfc0 libsystem_c.dylib`pthread_once + 87
	    frame #2: 0x00007fff5fc13378 dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 236

which will tell us exactly what library initializer is doing the pthread_atfork.

comment:50 Changed 10 years ago by ewen-naos-nz (Ewen McNeill)

Thank you jerermyhu for the explanation (comment:49) -- it seems like a plausible theory for what is going wrong (ie, trying to call code that is no longer loaded). We appear to have confirm that it is this second pthread_atfork() that is the cause -- ie in gegl or one of its dependencies -- so that'd be the place to investigate next when someone has time (which is not me today :-) ). Rebuilding relevant libraries (glib, gegl, maybe others) with debug symbols (and no optimisation) would be a place to start.

My hunch is that dtakahashi's minimal example (comment:44) was also linked against the same things as GIMP, so probably has the same trigger.

For completeness:

so the trigger in the _atfork() might be something else.

I'm not really sure what you mean by this. Can you please elaborate?

I was assuming you were curious about the dlclose() for a lack of matching dlopen()/dlclose() pair (in this "discovery" phase), so since I seemed to find a dlclose() was speculating that maybe there was some other trigger for it. But the dangling reference into the now-dlclosed() library, called in the child at fork time, seems much more plausible.

Ewen

comment:51 Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

Yeah, it looks like this is a bug in gegl.

It iterates over everything in /opt/local/lib/gegl-0.2 doing a dlopen(), some processing, and then a dlclose(). Since the issue seems to go away when you can't load /opt/local/lib/gegl-0.2/ff-load.so, I suspect that the initializer and atfork handler is somewhere in /opt/local/lib/gegl-0.2/ff-load.so or its dependencies.

Try updating back to the current version of gegl and just moving /opt/local/lib/gegl-0.2/ff-load.so to /tmp. I suspect you will be able to launch gimp just fine.

comment:52 Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

Also, I've got quite a number of outdated ports on my machine as I was on vacation for a month and just got back. That's probably why I'm not seeing the issue. I'll update all my ports (which will probably take a day or so) and try to reproduce, but I'm fairly certain we've discovered the bug. We should just not dlclose() inside of gegl. Note that having dlopen() without a dlclose() is fine. You don't end up with two copies of the library loaded if you have two dlopen()s.

comment:53 Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

Reproduced after just a few ports were updated:

drwxr-xr-x   4 root admin   136 Nov  2 13:09 gettext
drwxr-xr-x   4 root admin   136 Nov  2 13:10 gdbm
drwxr-xr-x   4 root admin   136 Nov  2 13:11 libxml2
drwxr-xr-x   4 root admin   136 Nov  2 13:19 openssl
drwxr-xr-x   4 root admin   136 Nov  2 13:20 libedit
drwxr-xr-x   4 root admin   136 Nov  2 13:20 sqlite3
drwxr-xr-x   3 root admin   102 Nov  2 13:32 db48
drwxr-xr-x   4 root admin   136 Nov  2 13:32 python_select
drwxr-xr-x   4 root admin   136 Nov  2 13:37 python27
drwxr-xr-x   4 root admin   136 Nov  2 13:37 py27-libxml2
drwxr-xr-x   4 root admin   136 Nov  2 13:37 itstool
drwxr-xr-x   4 root admin   136 Nov  2 13:37 gtk-doc
drwxr-xr-x   4 root admin   136 Nov  2 13:38 libtool
drwxr-xr-x   4 root admin   136 Nov  2 13:38 libpng
drwxr-xr-x   4 root admin   136 Nov  2 13:38 py27-setuptools
drwxr-xr-x   4 root admin   136 Nov  2 13:39 gmake
drwxr-xr-x   4 root admin   136 Nov  2 13:40 p11-kit

Here's the stack with the call to pthread_atfork:

Process 28367 stopped
* thread #1: tid = 0x74a62f, 0x00007fff9223cb0f libsystem_pthread.dylib`pthread_atfork, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
    frame #0: 0x00007fff9223cb0f libsystem_pthread.dylib`pthread_atfork
libsystem_pthread.dylib`pthread_atfork:
-> 0x7fff9223cb0f:  pushq  %rbp
   0x7fff9223cb10:  movq   %rsp, %rbp
   0x7fff9223cb13:  pushq  %r15
   0x7fff9223cb15:  pushq  %r14
(lldb) bt
* thread #1: tid = 0x74a62f, 0x00007fff9223cb0f libsystem_pthread.dylib`pthread_atfork, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00007fff9223cb0f libsystem_pthread.dylib`pthread_atfork
    frame #1: 0x0000000108c087e3 libp11-kit.0.dylib`p11_library_init_impl + 147 at library.c:124
    frame #2: 0x00007fff9223caf4 libsystem_pthread.dylib`__pthread_once_handler + 65
    frame #3: 0x00007fff88a28f13 libsystem_platform.dylib`_os_once + 73
    frame #4: 0x00007fff9223ca93 libsystem_pthread.dylib`pthread_once + 57
    frame #5: 0x0000000108bbd7bb libp11-kit.0.dylib`_p11_kit_init + 27 at util.c:253
...
    frame #19: 0x00007fff84950857 libdyld.dylib`dlopen(path=0x0000000108bbd7a0, mode=146938488) + 59 at dyldAPIsInLibSystem.cpp:1371
    frame #20: 0x000000010110b48b libgmodule-2.0.0.dylib`g_module_open + 1490
    frame #21: 0x00000001010c516d libgegl-0.2.0.dylib`gegl_module_open + 20
    frame #22: 0x00000001010c4d57 libgegl-0.2.0.dylib`gegl_module_load + 84
    frame #23: 0x00000001010c4cc2 libgegl-0.2.0.dylib`gegl_module_new + 155
    frame #24: 0x00000001010c5a01 libgegl-0.2.0.dylib`gegl_module_db_module_initialize + 156
    frame #25: 0x00000001010c49f8 libgegl-0.2.0.dylib`gegl_datafiles_read_directories + 626
    frame #26: 0x00000001010a79a9 libgegl-0.2.0.dylib`gegl_post_parse_hook + 588
    frame #27: 0x0000000101333fdb libglib-2.0.0.dylib`g_option_context_parse + 2265
    frame #28: 0x0000000101334ac9 libglib-2.0.0.dylib`g_option_context_parse_strv + 52
    frame #29: 0x000000010000f91b gimp-2.8`main + 537
    frame #30: 0x00007fff849515c9 libdyld.dylib`start + 1

which points to p11-kit as the cause

comment:54 Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

Port: gegl p11-kit added; gimp2 removed

While I think gegl should not dlclose() the modules it opens (because dlclose() always risks dangling pointers like this), I'm curious why p11-kit is doing this. All it's doing is counting the number of times a process has forked.

static void
count_forks (void)
{
        /* Thread safe, executed in child, one thread exists */
        p11_forkid++;
}

void
p11_library_init_impl (void)
{
        p11_debug_init ();
        p11_debug ("initializing library");
        p11_mutex_init (&p11_library_mutex);
        pthread_key_create (&thread_local, free);
        p11_message_storage = thread_local_message;

        pthread_atfork (NULL, NULL, count_forks);
}

comment:55 Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

Port: gimp2 added
Summary: gimp2 @2.8.14 on 10.9 - doesn't load plug-insp11-kit child atfork handler causes gimp to crash due to gegl dlclose()

comment:56 Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

Summary: p11-kit child atfork handler causes gimp to crash due to gegl dlclose()p11-kit child atfork handler causes gimp forks to crash due to gegl dlclose(), no gimp plugins available

comment:57 Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

It looks like pthread_atfork has been in p11-kit since 2011, so I suspect that p11-kit is a new dependency being brought in. In any event, gegl should just be updated to not dlclose()

comment:58 Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

Port: glib2 added
Summary: p11-kit child atfork handler causes gimp forks to crash due to gegl dlclose(), no gimp plugins availablep11-kit child atfork handler causes gimp forks to crash due to gegl/glib2 dlclose(), no gimp plugins available

Actually glib2 is the one that actually does the dlclose(). gegl does the g_module_close.

Perhaps we should just skip the dlclose() in gmodule (in glib2). The thing is, I don't know how many things rely on g_module_close() actually unmapping the file when the gmodule refcount gets to 0, but then again, things should not be expecting that they are the only reference to a dylib.

comment:59 Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

Owner: changed from devans@… to ryandesign@…
Status: assignednew

Over to ryan as the maintainer of glib2. Patch proposal will be attached shortly.

Changed 10 years ago by jeremyhu (Jeremy Huddleston Sequoia)

Attachment: glib2-45309.patch added

glib2-45309.patch

comment:60 Changed 10 years ago by ryandesign (Ryan Carsten Schmidt)

Resolution: fixed
Status: newclosed

Thanks! r127768

Note: See TracTickets for help on using tickets.