Opened 5 years ago

Closed 4 years ago

#59497 closed defect (fixed)

openssh @8.1p1: sshd only works in debug mode

Reported by: davidfavor (David Favor)
Owned by: Mihai Moldovan <ionic@…>
Priority: Normal
Milestone:
Component: ports
Version: 2.6.2
Keywords:
Cc: Ionic (Mihai Moldovan)
Port: openssh

Description

Recent upgrade of openssh began producing odd sshd behavior...

This works as expected...

/opt/local/sbin/sshd -p 22 -f /opt/local/etc/ssh/sshd_config -d -E /var/log/sshd.log

This fails...

/opt/local/sbin/sshd -p 22 -f /opt/local/etc/ssh/sshd_config -E /var/log/sshd.log

Log file shows only...

reseed_prngs: RAND_bytes failed [preauth]

The sshd process continues to run, just refuses any connections with the reseed_prngs error message.

It would be great if someone could explain how to fix this.

Thanks!

Change History (17)

comment:1 Changed 5 years ago by davidfavor (David Favor)

Looks like detaching sshd into the background is the problem.

Never seen this before.

comment:2 Changed 5 years ago by davidfavor (David Favor)

Maybe this is the problem...

imac> /opt/local/sbin/sshd -v
/opt/local/sbin/sshd: illegal option -- v
OpenSSH_8.1p1, OpenSSL 1.1.1d  10 Sep 2019

imac> /opt/local/sbin/sshd -T -f /opt/local/etc/ssh/sshd_config
sshd: no hostkeys available -- exiting.

comment:3 Changed 5 years ago by jmroot (Joshua Root)

Cc: Ionic added
Port: openssh added
Summary: "sshd only works in debug mode" → "openssh @8.1p1: sshd only works in debug mode"

comment:4 Changed 5 years ago by davidfavor (David Favor)

Output using DEBUG3 in config file...

debug1: fd 8 clearing O_NONBLOCK
debug1: Forked child 12919.
debug3: send_rexec_state: entering fd = 11 config len 378
debug3: ssh_msg_send: type 0
debug3: send_rexec_state: done
debug1: rexec start in 8 out 8 newsock 8 pipe 10 sock 11
debug1: inetd sockets after dupping: 5, 5
debug3: BSM audit: connection from 192.168.1.226 port 58625
debug3: BSM audit: iptype 4 machine ID e201a8c0 00000000 00000000 00000000
Connection from 192.168.1.226 port 58625 on 192.168.1.226 port 22
debug1: Local version string SSH-2.0-OpenSSH_8.1
debug1: Remote protocol version 2.0, remote software version OpenSSH_8.1
debug1: match: OpenSSH_8.1 pat OpenSSH* compat 0x04000000
debug2: fd 5 setting O_NONBLOCK
debug3: ssh_sandbox_init: preparing Darwin sandbox
debug2: Network child is on pid 12920
debug3: preauth child monitor started
debug3: ssh_sandbox_child: starting Darwin sandbox [preauth]
reseed_prngs: RAND_bytes failed [preauth]
debug1: do_cleanup [preauth]
debug1: monitor_read_log: child log fd closed
debug3: mm_request_receive entering
debug1: do_cleanup
debug1: Killing privsep child 12920
debug1: audit_event: unhandled event 12

comment:5 Changed 4 years ago by davidfavor (David Favor)

Completely nuked openssh + removed /opt/local/etc/ssh + reinstalled.

Same problem occurs.

comment:6 Changed 4 years ago by Ionic (Mihai Moldovan)

Resolution: invalid
Status: new → closed

This does not seem to be a packaging issue.

The openssh port does not ship default config files - only example files. If needed, you are supposed to copy and edit them.

It also does not generate host keys after installation - which is why your instance of sshd seems to fail.

You will need to generate host keys, e.g., via /opt/local/bin/ssh-keygen -A.

Closing as invalid.

comment:7 Changed 4 years ago by davidfavor (David Favor)

Did a key generation + unload + load.

Same problem exists.

Note: I've been using MacPorts for years. Installed on many machines. I've always installed openssh + did an initial port load openssh, then sshd simply worked. Maybe this has changed.

If there are required openssh setup steps, could someone point me to the related URL? I can't seem to find any sshd setup conversation anywhere.

Thanks!

comment:8 Changed 4 years ago by Ionic (Mihai Moldovan)

Resolution: invalid
Status: closed → reopened

Hm, or maybe something is broken.

I've generated the host keys, synced the sshd_config file with sshd_config.example and loaded the service.

That did start up, and sshd is listening for connections, but I'm likewise seeing errors on reseed_prngs when connecting to the machine on port 2222.

What's your OS version?

comment:9 Changed 4 years ago by Ionic (Mihai Moldovan)

This issue is weird...

OpenSSL seems to report a seeding error when calling RAND_bytes(), but checking RAND_status() right after the RAND_seed() call returns 1, indicating that the (default) DRBG has been seeded with enough data.

Specifically, this:

error:2406E06E:random number generator:RAND_DRBG_reseed:error retrieving entropy

I guess I'll have to dig deeper.

comment:10 Changed 4 years ago by Ionic (Mihai Moldovan)

Yep, weird indeed.

I cleared the (OpenSSL) error stack before calling RAND_seed() and checked it afterwards - the seeding operation really does seem to fail as well. I don't understand why RAND_status() would return 1 in this case, but aside from that, something seems to be really messed up.

I can only guess at this point, but my best guess would be that this is related to sandboxing or privilege dropping.

It really doesn't happen in debug mode - the OpenSSL error stack remains empty.

I'll test disabling the sandbox and privilege separation tomorrow (if that's even possible).

comment:11 Changed 4 years ago by Ionic (Mihai Moldovan)

Disabling the sandbox or privilege separation is not possible since 7.5 - the option to toggle this was removed in that version.

Regardless, I tested a build without the hpn, gsskex and Apple Keychain integration patches. Same symptom.

I then went ahead and disabled the two sandbox patches, but left the launchd and pam patches applied. Got a password prompt from a detached sshd.

Whatever is breaking OpenSSL, it must be something within these patches.

Last edited 4 years ago by Ionic (Mihai Moldovan)

comment:12 Changed 4 years ago by Ionic (Mihai Moldovan)

Okay, here we go:

OpenSSH already has support for the Apple sandbox, although its default setup seems to be too restrictive for Apple itself for some reason. Hence, they patch it to include and read a custom-crafted profile and so do we.

The debug mode disables any child forking, and hence also privilege separation, which explains why that worked.

With privilege separation enabled, the child spawned by sshd chroots into some specific directory.

In vanilla OpenSSH, sshd enables the sandbox in the child process after reseeding the OpenSSL RNG and chrooting to that directory.

However, since Apple (and we) use a special profile file, they (and we) enable the sandbox first, then do all the other things. It's mostly just a code move, but an important one, because a chrooted child would never be able to read the special profile file residing outside of the chroot.

All of this has been done in exactly the same fashion for years (a decade or longer) and it never failed.

I'm still clueless why it started to fail. There aren't any obvious code changes that would cause this (at least not within OpenSSH) and my experiments also didn't shed light on this.

For instance, I essentially turned the sandbox into "transparency" mode by just blindly allowing everything. Didn't change a thing. I enabled debugging within the sandbox so that each violation would be logged. Not a thing.

It doesn't look like the sandbox is prohibiting anything, but yet OpenSSL gets into some confused state it can't recover from once the sandbox is turned on. And not even that is true.

As previously explained, the vanilla work flow is like this:

spawn child -> do a lot of other work -> reseed -> chroot -> enable sandbox

Contrast this to the (recently breaking) Apple-patched work flow:

spawn child -> do a lot of other work -> enable sandbox -> reseed -> chroot

If I modify this slightly like this:

spawn child -> do a lot of other work -> reseed -> enable sandbox -> reseed -> chroot

everything seems to work just fine.

This doesn't make sense to me. I understand that a reseeding operation before enabling the sandbox works just fine... essentially because it also does so in vanilla OpenSSH. What I cannot wrap my head around is that subsequent reseeding operations also work just fine after enabling the sandbox.

So, I have a workaround, but I don't want to blindly commit this to a security-critical package until I really understand what is going on.

So far, I have only briefly skimmed the OpenSSL (not -SSH) source code and didn't get into the nitty-gritty details of reseeding, including fetching random data from the system, but it looks like I have to in order to understand what it's doing and why it thinks that it can't gather system entropy.

To that end, I wondered whether sandboxing could change access to (already opened) file descriptors or would be ignorant to that, but that (changing access) doesn't seem to be the case. Hence, should OpenSSL already have an open file descriptor to, say, /dev/random, that FD shouldn't be affected by enabling the sandbox retrospectively. This is a commonly used technique, cf. Chromium.

OpenSSH 8.1p1 introduced a set of more complex IPC between master and child processes by means of not only opening up pipes between the processes, but also sending some data over them. This more complex handling is really the only actual change from 7.9p1 to 8.1p1, but at the same time doesn't explain any of the things experienced.

Further down into the rabbit hole of OpenSSL-debugging it is, then, I guess.

Last edited 4 years ago by Ionic (Mihai Moldovan)

comment:13 Changed 4 years ago by Ionic (Mihai Moldovan)

I finally understood what is going on, hooray. Leaving this here for future generations.

The sandbox never really had a role to play in this issue. Rather, it was a combination of OpenSSH castrating itself and the OpenSSL crypto core being rewritten in 1.1.1* and functioning completely differently compared to older releases (such as 1.1.0*). The sandbox would have affected it, but it never came to that.

What OpenSSL 1.1.1 uses, compared to older versions, is an "AES-CTR DRBG according to NIST standard SP 800-90Ar1". It also introduced crypto objects chaining, such that each random number generator object can be hooked up to another via parenting. They also introduced two global instances of this DRBG - one used for generating random numbers for use with public keys, the other one for generating random data for use with private keys. This makes the code more complicated, but trust me, that's actually a good thing!

Each DRBG has a specific state it is in (uninitialized, ready, error) and a few pools with random data - for seeding, additional data and getting actual randomness out of it.

When a DRBG is created (internally or externally, though for OpenSSH it's really an internal implementation detail in OpenSSL), the code is creating a seed pool - initially comprised of seeding data the application provides - and then tries to get more entropy from the system to add to this pool. This means that a bad seed does not necessarily compromise the random number generator used by OpenSSL, which sounds good!

When it's reseeded or random data is requested by the application, the internal state is checked. If it's not READY but ERROR, the DRBG is restarted (uninitialized and initialized again) in order to clear the error state - including, if applicable, its parent DRBG instances.
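The restart behaviour described here can be pictured with a small toy model. This is only a sketch under simplifying assumptions, not OpenSSL's code: the toy_* names, the entropy callbacks and the three-value state enum are hypothetical, and parent chaining is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The three states mentioned in the comment (names are hypothetical). */
enum toy_state { TOY_UNINITIALISED, TOY_READY, TOY_ERROR };

/* Hypothetical callback: returns true if system entropy could be fetched. */
typedef bool (*toy_entropy_fn)(void);

struct toy_drbg {
	enum toy_state state;
	toy_entropy_fn get_entropy;
};

/* Two stand-in entropy sources for experimenting with the model. */
bool entropy_available(void) { return true; }
bool entropy_unavailable(void) { return false; }

/* Initialization needs system entropy; without it the DRBG ends up in ERROR. */
static bool toy_instantiate(struct toy_drbg *d)
{
	d->state = d->get_entropy() ? TOY_READY : TOY_ERROR;
	return d->state == TOY_READY;
}

/* Restart: clear an ERROR state by uninitializing, then initialize again. */
static bool toy_restart(struct toy_drbg *d)
{
	if (d->state == TOY_ERROR)
		d->state = TOY_UNINITIALISED;
	if (d->state == TOY_UNINITIALISED)
		return toy_instantiate(d);
	return true;
}

/* A request for random data first checks the state and restarts if needed. */
bool toy_generate(struct toy_drbg *d)
{
	assert(d != NULL && d->get_entropy != NULL);
	if (d->state != TOY_READY && !toy_restart(d))
		return false;	/* still no entropy: the request fails */
	return true;		/* real code would produce random bytes here */
}
```

In this model, a generator whose entropy callback keeps failing never leaves the ERROR state, no matter how often random data is requested. As soon as the callback succeeds, the next request restarts it and returns to READY.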

So... why does this fail in a forked OpenSSH child?

As already explained, during initialization, system entropy is fetched through different means. On OS X/macOS, these means consist of:

  1. using the getentropy system call to fill a buffer with random bytes (but THAT one is only available on 10.12 and higher!) XOR
  2. reading random data from system devices like /dev/urandom, /dev/random, /dev/hwrng, /dev/srandom and something else I've forgotten IFF they exist and can be opened successfully. Crucially, they are only opened once and the file descriptor left open for additional, later access if reading from the device actually returned useful data. XOR
  3. generating entropy via the RDTSC method that reads a high-resolution timer within the CPU XOR generating entropy via the RDSEED/RDRAND CPU instruction(s).

There is no other entropy source defined in OpenSSL 1.1.1. For OS X/macOS this list is shortened further, because:

  • the RDTSC method is forcefully disabled within OpenSSL (quote: "IMPORTANT NOTE: It is not currently possible to use this code because we are not sure about the amount of randomness it provides. Some SP900 tests have been run, but there is internal skepticism. So for now this code is not used.")
  • the RDSEED/RDRAND functions are implemented, but not enabled by default and we don't enable them. That's probably fine, because using a default-disabled function set in a security-related application feels weird.

Additionally, both these methods would only be usable on x86_64 (or maybe also x86) CPUs, which would leave out ppc ones for good.

To recap, on 10.11 and below, the only entropy sources usable by OpenSSL are the system devices /dev/urandom and /dev/random.

These would work fine, but OpenSSH pulls an additional trigger after enabling the sandbox:

	/*
	 * The kSBXProfilePureComputation still allows sockets, so
	 * we must disable these using rlimit.
	 */
	rl_zero.rlim_cur = rl_zero.rlim_max = 0;
	if (setrlimit(RLIMIT_FSIZE, &rl_zero) == -1)
		fatal("%s: setrlimit(RLIMIT_FSIZE, { 0, 0 }): %s",
			__func__, strerror(errno));
	if (setrlimit(RLIMIT_NOFILE, &rl_zero) == -1)
		fatal("%s: setrlimit(RLIMIT_NOFILE, { 0, 0 }): %s",
			__func__, strerror(errno));
	if (setrlimit(RLIMIT_NPROC, &rl_zero) == -1)
		fatal("%s: setrlimit(RLIMIT_NPROC, { 0, 0 }): %s",
			__func__, strerror(errno));

This code has been in there for longer than a decade as well and what it does is:

  1. disabling creating new files with a file size greater than zero (so essentially writing any data to files... and sockets(?))
  2. disabling OPENING any files or sockets to begin with
  3. disabling spawning additional processes

That's generally fine, because the forked child is only used for authentication and gets all its internal state from the parent instance it was forked from. It doesn't need to create additional files or network sockets and this makes the process more robust to outside tinkering by buffer overflows or the like. The sandbox also plays a big role in that hardening, of course.

However, you might have noticed a conflict here: thusly spawned processes may not open any new files, but OpenSSL 1.1.1* might need to (and, on older systems, must) open system crypto devices to garner entropy. Boom.

This also explains why reseeding the DRBG(s) prior to enabling the sandbox works and continues to work afterwards: the first operation succeeds, opens the crypto device and leaves it open, keeping the file descriptor around. Subsequent reseeding operations can then continue to use it.
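The effect can be reproduced outside of sshd. The following is a minimal sketch, not OpenSSL's or OpenSSH's actual code: toy_reseed(), toy_forbid_new_fds() and reseed_after_limit() are hypothetical names, the "entropy source" is just a cached descriptor for /dev/urandom, and only the RLIMIT_NOFILE part of the quoted sandbox code is applied. The experiment runs in a forked child so that the calling process keeps its own limits.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for OpenSSL's cached random-device descriptor. */
static int cached_fd = -1;

/*
 * Read one byte from /dev/urandom, opening the device on first use and
 * keeping the descriptor around. On a read failure the descriptor is
 * dropped again. Returns 0 on success, -1 on failure.
 */
static int toy_reseed(void)
{
	unsigned char byte;

	if (cached_fd == -1)
		cached_fd = open("/dev/urandom", O_RDONLY);
	if (cached_fd == -1)
		return -1;	/* e.g. EMFILE once RLIMIT_NOFILE is zero */
	if (read(cached_fd, &byte, 1) != 1) {
		close(cached_fd);
		cached_fd = -1;
		return -1;
	}
	return 0;
}

/* Same limit as in the quoted code: no new descriptors may be opened. */
static int toy_forbid_new_fds(void)
{
	struct rlimit rl_zero;

	rl_zero.rlim_cur = rl_zero.rlim_max = 0;
	return setrlimit(RLIMIT_NOFILE, &rl_zero);
}

/*
 * Fork a child, optionally reseed once, apply the limit, then reseed.
 * Returns 0 if the reseed after the limit worked, 1 if it failed and
 * -1 if the experiment itself could not be set up.
 */
int reseed_after_limit(int reseed_first)
{
	int status;
	pid_t pid;

	assert(reseed_first == 0 || reseed_first == 1);
	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		if (reseed_first && toy_reseed() != 0)
			_exit(2);
		if (toy_forbid_new_fds() != 0)
			_exit(2);
		_exit(toy_reseed() == 0 ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}
```

On a system with a readable /dev/urandom, reseed_after_limit(1) should return 0 (the descriptor was opened before the limit and stays usable), while reseed_after_limit(0) should return 1 (the first open() happens after the limit and is refused).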

But... why did this work for such a long time without generating errors?

Previous OpenSSL versions (1.1.0 and older) are scary. They also initialize a random number generator if it wasn't previously initialized when requesting random data and that operation would generally also pull in system entropy via system crypto devices on OS X/macOS, but... failures to do so are non-fatal. That state is never recorded properly. Additionally, the random seed and random data in general seem to be getting hashed in previous versions in order to fill the pool. Also, failures to fill the pool with system entropy do not necessarily need to lead to failures when fetching random data within the application, since previous OpenSSL versions also mix some "pseudo-random" data like the PID, user ID and current timestamp into the pool unconditionally. And the random pool data also seems to be getting hashed when requesting it in the application...

Since OpenSSH only ever requests one byte of random data, that might be just enough to satisfy the condition.

As far as I can tell, the error condition was just masked by OpenSSL's previous implementation.


Now that we know what is going wrong, the remaining question is how to fix it.

Calling the reseed function prior to enabling the sandbox is a valid workaround. By doing this, OpenSSL will open a file descriptor to some crypto device (typically /dev/urandom) and cache it. As long as the device returns data, it shouldn't get closed, so we can continue to use it in the process. The caveat with that approach is that, should the device block at some point and NOT return more data, the file descriptor will be closed and OpenSSL will not be able to reopen it again. That normally shouldn't be the case for /dev/urandom, so I don't see this as a huge drawback.

Alternatively, I could relax the number of open files limitation (to what level, though?) and add a sandbox exception for the crypto devices. That would probably also work, but relax the security limitations a bit too much - i.e., the process could suddenly open and read other files as well. For this reason, I don't like that solution.

I'll probably commit a fix with the first implementation tomorrow.

comment:14 Changed 4 years ago by mouse07410 (Mouse)

Who disabled RDSEED/RDRAND, and why? I understand that some people don't trust it, and the world stock of lithium is limited, so not everybody who needs it may get it. But still, the strength of an RNG is equal to the strength of its strongest component. Meaning - if you combine output from several generators, your resulting randomness will be as good as the best of them.

Do yourself a favor and re-enable it.

comment:15 Changed 4 years ago by Ionic (Mihai Moldovan)

As far as I've seen, it's disabled by default in the OpenSSL upstream configuration. I didn't find a configure option to even enable it while quickly grepping the source code, but it looks like passing --with-rand-seed=os,rdcpu (or something similar) would do that. However, the upstream default is just os. Plus, like I said, it would only work on Intel CPUs AFAIK, but we also have to care for the PowerPC faction. It wouldn't even help universally, but admittedly in most cases.

I'm also pretty sure that mixing entropy of different qualities actually degrades the overall quality, but don't quote me on that. :)

And lastly... OpenSSL doesn't really mix them all together. It picks the first method available and working. The other methods are only tried in case of errors or if no entropy is coming out any longer.

I'm not saying that you don't have a point, but you'd have to discuss that with the OpenSSL port maintainers.

comment:16 Changed 4 years ago by mouse07410 (Mouse)

As a cryptographer I assure you that mixing randomness from different sources only improves it.

comment:17 Changed 4 years ago by Mihai Moldovan <ionic@…>

Owner: set to Mihai Moldovan <ionic@…>
Resolution: fixed
Status: reopened → closed

In 4adcdc8b8e606bb55f83b33a2c60362292a99066/macports-ports (master):

net/openssh: fix sshd failure in non-debug mode. Revbump.

This commit message is essentially just a copy of the source code comment.

ssh_sandbox_child() has the side-effect of disabling opening new files.
This is a security precaution to prevent the child process from leaking
data or opening new sockets, but clashes with newer OpenSSL
implementations.

Generally, OpenSSL wants to read new entropy from the system for each
reseeding operation (and, by extension, through any operation that might
trigger an internal reseeding, like requesting random bytes).

The current OpenSSL port only enables the default set of system entropy
- which means reading in data from crypto devices like /dev/{,u,s}random
and /dev/hwrng.

To speed things up, OpenSSL tries to open file descriptors to the listed
devices and caches the result, i.e., the open file descriptor. Those are
normally kept open UNLESS a reading error occurred OR no random bytes
were returned.

In a quite scary move, OpenSSL versions prior to 1.1.1 didn't fail when
getting system entropy wasn't successful and also added some
"pseudo-random" data like the PID, user id and current time to the
entropy pool, which was often enough to seed the PRNG.

More recent versions have a rewritten PRNG/DRBG core and, crucially,
stricter rules when it comes to acquiring system entropy - this is now
strictly required and no other data is mixed into the pool.

OpenSSH generally tries (or intends) to leave crypto devices (which
should be one of the earliest opened devices) alone and not close their FD on
re-exec, but that doesn't seem to work. Although OpenSSL is initialized
very early in the main() call chain, which SHOULD lead to open file
descriptors to crypto devices, on a typical OS X/macOS system,
/dev/urandom is opened as FD 6, which is above any FD that would be
preserved after a re-exec operation.

This leads to the child process having no open file descriptors to
/dev/urandom, activating the sandbox, setting the number of open files
to zero and subsequently effectively breaking OpenSSL 1.1.1+.

We'll work around that by reseeding the PRNGs before enabling the
sandbox, which has the side-effect of opening a file descriptor to
/dev/urandom and keeping it open.

There is a slight catch: errors in reading from the FD or a read count
of zero (i.e., the device not returning any data) will lead to the FD
being closed again without a way to be re-opened.

We can take this risk, as this should realistically not happen. Even if
it does, that only means that the child process will fail to read random
data and hence terminate with an error - showing the same symptoms the
workaround is intended to fix, but nothing worse.

Fixes: #59497
