Opened 5 years ago

Closed 11 months ago

Last modified 11 months ago

#58932 closed enhancement (fixed)

py-tensorflow: reduce build time

Reported by: ryandesign (Ryan Carsten Schmidt)
Owned by: emcrisostomo (Enrico Maria Crisostomo)
Priority: Normal
Milestone:
Component: ports
Version:
Keywords:
Cc: cjones051073 (Chris Jones), mascguy (Christopher Nielsen)
Port: py-tensorflow

Description

A successful build of a py-tensorflow subport takes over 3 hours on the buildbot. Since the port has 4 subports, that means that a version update of this port makes the buildbot unavailable for any other build for over 12 hours. Is there any way to make the port take a more reasonable time to build?

Attachments (1)

py39-tensorflow1-10.13.png (125.3 KB) - added by ryandesign (Ryan Carsten Schmidt) 3 years ago.


Change History (23)

comment:1 Changed 5 years ago by ryandesign (Ryan Carsten Schmidt)

Type: defect → enhancement

comment:2 Changed 5 years ago by cjones051073 (Chris Jones)

Sorry, no. It's a big port, and given how the bazel build system works, it insists on building a number of its own dependencies. I have looked into this already, and what is there is the best I can do.

comment:3 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Individual py-tensorflow and py-tensorflow1 subports are now taking 6, 12, or even 23 hours to build. Some builds are even getting terminated by buildbot because they did not print any output for an hour.

I noticed while logged in to one of the buildbot workers while one of the tensorflow subports was building that one of the many clang processes was taking over 5 GB of memory. I have also seen messages on the buildbot workers' screens saying they have run out of application memory. This suggests that we could improve the build time by giving the buildbot workers more memory.

When I set up the buildbot system in late 2016 the workers were running off SSDs and had 8 GB memory each (except 10.6 i386 which is limited to 4 GB). As workers for new versions of macOS have been added in the years since then, I've had to reduce the memory given to some of them. In addition, we've had SSD failures and some buildbot workers are temporarily running on hard disks.

I am currently researching replacement SSDs and intend to rearrange the worker VMs so that fewer VMs are on each server and so that we can give 9 or 10 GB of memory to the workers for at least the more recent OS versions. If we need even more memory than that, there is room in the servers to add more.

comment:4 Changed 4 years ago by cjones051073 (Chris Jones)

On my 2015 MBPro here it probably takes something in the region of 4 hours to build. So those times over 12 hours definitely point to something else going on, and I agree it sounds like memory issues on the buildbots, as not having any feedback during the build for 1 hour is also not normal.

I don't think there will be much I can do to reduce the memory usage of individual clang processes (and yes, for complex build units 5 GB is sadly not uncommon), but I could look into whether the parallelism in the builds can be reduced. That won't help with the overall build time, but it would mean the machine is less stressed at any given point....

comment:5 Changed 4 years ago by cjones051073 (Chris Jones)

Bazel has a number of flags to control the number of parallel jobs and also the memory utilisation

https://github.com/tensorflow/models/issues/195

I’ll look into using these.

What would be a reasonable number of jobs to limit the buildbots to?

Last edited 4 years ago by cjones051073 (Chris Jones)

comment:6 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

How much memory does your MBP have? I'm guessing more than 8 GB.

Hopefully the tensorflow build process is already respecting the value of build.jobs, but I'm not sure what a reasonable number of parallel jobs is for tensorflow. MacPorts sets build.jobs on the assumption that each compiler process might use up to 1 GB of memory and that there should be an available core for each compiler process, so on a machine with 8 cores and 8 GB RAM it will allow 8 parallel jobs. This seems to work for most ports, but there is the odd port that requires much more memory to build, like tensorflow. If bazel can be told to keep its memory usage within the amount of memory that the machine has, that could help. Otherwise, it might be a start if the portfile could set build.jobs to a fraction of itself, for example:

if {${build.jobs} > 1} {
    build.jobs      [expr {${build.jobs} / 2}]
}

That would reserve 2 GB per job instead of 1. Or a more complicated formula could be used based on the amount of RAM in the machine if you can figure out a simple way to determine what that is and what a good formula would be.
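
As a rough illustration of the RAM-based formula mentioned above, here is a minimal Python sketch (not Portfile Tcl, and not anything MacPorts base does). The function name `jobs_for` and the default 2 GB-per-job budget are assumptions made for the example; it simply caps the job count by both the core count and the available memory.

```python
def jobs_for(cores: int, ram_gb: float, gb_per_job: float = 2.0) -> int:
    """Return a parallel job count limited by cores and by RAM.

    gb_per_job is an assumed memory budget per compiler process,
    not a value that MacPorts itself uses.
    """
    if cores < 1 or gb_per_job <= 0:
        raise ValueError("cores must be >= 1 and gb_per_job must be > 0")
    # How many jobs fit in memory at the assumed per-job budget.
    by_memory = int(ram_gb // gb_per_job)
    # Never drop below one job, even on a very small machine.
    return max(1, min(cores, by_memory))
```

With 8 cores and 8 GB, a 2 GB budget yields 4 jobs, the same result as halving build.jobs; a 5 GB budget, roughly the size of the clang process noted in comment:3, yields a single job.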

comment:7 Changed 4 years ago by cjones051073 (Chris Jones)

bazel has a quite involved built-in resource management system that monitors RAM and CPU usage. It's just that its defaults appear to be a bit too hungry for the resources the buildbots have (most probably the RAM per core is a bit lower than anticipated). It also has a bunch of flags to limit things, like

 --local_ram_resources=HOST_RAM*0.75 --local_cpu_resources=HOST_CPUS*.75

which should help.
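
To make the two flags concrete, here is a minimal Python sketch (an illustration only, not part of the port). The function names are hypothetical. The first assembles the flag strings in the form shown above; the second estimates the resulting RAM budget by plain multiplication, and Bazel's own rounding of these expressions may differ.

```python
from typing import List


def bazel_resource_flags(ram_fraction: float, cpu_fraction: float) -> List[str]:
    """Build the two resource-limiting flags for the given fractions."""
    for fraction in (ram_fraction, cpu_fraction):
        if not 0 < fraction <= 1:
            raise ValueError("fractions must be in the range (0, 1]")
    return [
        f"--local_ram_resources=HOST_RAM*{ram_fraction}",
        f"--local_cpu_resources=HOST_CPUS*{cpu_fraction}",
    ]


def approx_ram_budget_mb(host_ram_mb: int, ram_fraction: float) -> int:
    """Estimate the RAM budget in MB (simple multiplication, an assumption)."""
    return int(host_ram_mb * ram_fraction)
```

For an 8 GB worker (8192 MB), a fraction of 0.75 works out to roughly 6144 MB available to the build.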

One other thing though - MacPorts doesn't just set build.jobs to the number of cores a machine has. It actually sets it to the number of hardware threads available, *including* hyper-threading, which for many machines means 2 × the number of cores. E.g. on my MacBook Pro here, which has 4 actual CPU cores, it's set to 8. MacPorts just uses

 > sysctl hw.activecpu
hw.activecpu: 8

What does this return on the buildbot VMs? Does it return the actual number of physical cores you have assigned each VM, or something else?

In a lot of cases setting build.jobs to the number of logical CPUs, taking hyper-threading into account, is probably a good idea (it's what they are there for). But perhaps on the buildbots it would be better to use

 > sysctl hw.physicalcpu
hw.physicalcpu: 4

to limit things to the physical cores.
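
The comparison between the two counters can be sketched in Python as follows. This is only an illustration: the helper names are hypothetical, and the code parses `sysctl` output passed in as text rather than running `sysctl` itself, so it works with output pasted from any machine.

```python
def parse_sysctl(output: str) -> dict:
    """Parse lines such as 'hw.activecpu: 8' into {'hw.activecpu': 8}."""
    values = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        # Keep only well-formed 'name: integer' lines.
        if sep and value.strip().isdigit():
            values[name.strip()] = int(value.strip())
    return values


def has_extra_hardware_threads(values: dict) -> bool:
    """True when the active CPU count exceeds the physical core count."""
    return values["hw.activecpu"] > values["hw.physicalcpu"]
```

Fed the two outputs quoted above (8 active CPUs, 4 physical), the second helper returns True, which is the case where limiting jobs to the physical count would make a difference.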

comment:8 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

The buildbot workers are virtual machines running in VMware ESXi on 2009 Xserves. The Xserves each have two quad-core processors. I've configured the VMs to have 8 virtual cores each. To the VMs, it appears this way:

$ sysctl hw.activecpu
hw.activecpu: 8
$ sysctl hw.physicalcpu
hw.physicalcpu: 8

VMware virtualizes and manages it all somehow so that all the VMs share the available CPU resources.

comment:9 Changed 4 years ago by cjones051073 (Chris Jones)

https://github.com/macports/macports-ports/commit/1c003e9a9f540adc6b16d9e44bfea5a1059150d0

Hopefully one or more of the changes there will be enough to help, but let's see...

Version 0, edited 4 years ago by cjones051073 (Chris Jones)

comment:10 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Regarding this line that you added:

build.jobs [sysctl hw.physicalcpu]

I would suggest putting that inside the build { ... } block. Otherwise sysctl hw.physicalcpu will get run in all sorts of situations where it is not needed, such as when running port info or port livecheck or portindex.

Also note that users are able to modify build.jobs in macports.conf or on the command line. For example a user could run sudo port build py38-tensorflow build.jobs=1. Since the purpose of this line is to reduce the number of jobs, you might want to verify that you will in fact be reducing it before changing it:

build {
    set physicalcpus [sysctl hw.physicalcpu]
    if {${build.jobs} > ${physicalcpus}} {
        build.jobs ${physicalcpus}
    }
    ...
}

comment:11 Changed 4 years ago by cjones051073 (Chris Jones)

Last edited 4 years ago by ryandesign (Ryan Carsten Schmidt)

comment:12 Changed 4 years ago by cjones051073 (Chris Jones)

The changes seem to have helped. 10.13 has now built the py{37,38} ports in a 'reasonable' time. Ryan, do you agree the resource utilisation on the buildbots was OK? If so, please close this.

comment:13 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

I don't know if these changes helped or whether it was increasing the RAM on the 10.13 builder to 10 GB or a combination of the two. Let's see what happens with the next version update.

comment:14 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

One datapoint: building py27-tensorflow1 @1.15.3 took 8 hours 12 minutes on the newly resurrected Yosemite buildbot worker even though that worker had 9GB RAM and an SSD.

Last edited 4 years ago by ryandesign (Ryan Carsten Schmidt)

comment:15 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)

Taking a look at the Activity Monitor on the 10.13 buildbot worker during this build of py39-tensorflow1, which has taken 6 hours so far, I see that it is swapping: the VM has 8 GB RAM and 8 cores, and the 5 clang processes and one java process that have been started occupy a total of 10.7 GB of virtual memory. The build is starting too many RAM-hungry processes. Since it is known that the build uses more RAM per compiler process than MacPorts estimates, the build should reduce the number of jobs as I suggested in comment:6.
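
The arithmetic behind that observation can be sketched in Python. This is a rough estimate with a hypothetical function name, and it treats the reported virtual memory as if it were all resident, which overstates the real pressure.

```python
import math


def memory_pressure(total_gb: float, process_count: int, ram_gb: float) -> dict:
    """Summarise how far a set of processes overshoots the available RAM."""
    if process_count < 1 or total_gb <= 0:
        raise ValueError("need at least one process using some memory")
    average = total_gb / process_count
    return {
        "average_gb_per_process": average,
        # Memory demanded beyond what the machine has (0 if it fits).
        "overcommit_gb": max(0.0, total_gb - ram_gb),
        # Processes of average size that would fit without swapping.
        "processes_that_fit": math.floor(ram_gb / average),
    }
```

For the figures above (10.7 GB across 6 processes on an 8 GB VM) this gives about 1.8 GB per process, about 2.7 GB of overcommit, and room for 4 such processes.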

Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)

Attachment: py39-tensorflow1-10.13.png added

comment:16 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)

Perhaps we can enhance base to help choose a better value for build.jobs.

comment:17 Changed 3 years ago by cjones051073 (Chris Jones)

I've reduced the resource utilisation a bit further in the bazel PG. Hopefully this will help.

comment:18 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)

We are not the only ones having trouble understanding how to wrangle the tensorflow build process into available resources: https://github.com/tensorflow/tensorflow/issues/42066

It should not be this hard. This software is terrible.

comment:19 Changed 3 years ago by cjones051073 (Chris Jones)

Looks like there are yet more flags we can try in

https://docs.bazel.build/versions/master/memory-saving-mode.html

I've never met any other build system that's as hard as bazel to work out how to configure....

comment:20 Changed 11 months ago by mascguy (Christopher Nielsen)

Cc: mascguy added

comment:21 Changed 11 months ago by mascguy (Christopher Nielsen)

Resolution: fixed
Status: assigned → closed

Now that all of the buildbots for 10.14 and later have 16 vCPUs and 16GB of RAM, we no longer need to limit the Bazel build to 50% of CPU capacity. That change was rolled out with the update two weeks ago, via commit of PR 15397 - py-tensorflow: Update to version 2.12.0, Add Python 310

comment:22 Changed 11 months ago by jmroot (Joshua Root)

I'll take your word for it, but I note that tensorflow builds are still taking 8-10 hours (even without finishing successfully).
