Opened 20 months ago

Last modified 6 weeks ago

#58932 assigned enhancement

py-tensorflow: reduce build time

Reported by: ryandesign (Ryan Schmidt)
Owned by: emcrisostomo (Enrico Maria Crisostomo)
Priority: Normal
Milestone:
Component: ports
Version:
Keywords:
Cc: cjones051073 (Chris Jones)
Port: py-tensorflow

Description

A successful build of a py-tensorflow subport takes over 3 hours on the buildbot. Since the port has 4 subports, that means that a version update of this port makes the buildbot unavailable for any other build for over 12 hours. Is there any way to make the port take a more reasonable time to build?

Attachments (1)

py39-tensorflow1-10.13.png (125.3 KB) - added by ryandesign (Ryan Schmidt) 6 weeks ago.


Change History (20)

comment:1 Changed 20 months ago by ryandesign (Ryan Schmidt)

Type: defect → enhancement

comment:2 Changed 20 months ago by cjones051073 (Chris Jones)

Sorry, no. It's a big port, and given how the bazel build system works, it insists on building a number of its own dependencies. I have looked into this already, and what is there is the best I can do.

comment:3 Changed 12 months ago by ryandesign (Ryan Schmidt)

Individual py-tensorflow and py-tensorflow1 subports are now taking 6, 12, or even 23 hours to build. Some builds are getting terminated by buildbot because they did not print any output for an hour.

While logged in to one of the buildbot workers during a tensorflow subport build, I noticed that one of the many clang processes was using over 5 GB of memory. I have also seen messages on the buildbot workers' screens saying they have run out of application memory. This suggests that we could improve the build time by giving the buildbot workers more memory.

When I set up the buildbot system in late 2016 the workers were running off SSDs and had 8 GB memory each (except 10.6 i386 which is limited to 4 GB). As workers for new versions of macOS have been added in the years since then, I've had to reduce the memory given to some of them. In addition, we've had SSD failures and some buildbot workers are temporarily running on hard disks.

I am currently researching replacement SSDs and intend to rearrange the worker VMs so that fewer VMs are on each server and so that we can give 9 or 10 GB of memory to the workers for at least the more recent OS versions. If we need even more memory than that, there is room in the servers to add more.

comment:4 Changed 12 months ago by cjones051073 (Chris Jones)

On my 2015 MBPro here it takes something in the region of 4 hours to build. So those times over 12 hours definitely point to something else going on, and I agree it is probably memory issues on the buildbots; not producing any output for an hour during the build is also not normal.

I don't think there is much I can do to reduce the memory usage of individual clang processes (and yes, for complex build units 5 GB is sadly not uncommon), but I could look into reducing the parallelism of the builds. That won't help the overall build time, but it would mean the machine is less stressed at any given point...

comment:5 Changed 12 months ago by cjones051073 (Chris Jones)

Bazel has a number of flags to control the number of parallel jobs and also the memory utilisation:

https://github.com/tensorflow/models/issues/195

I’ll look into using these.

What would be a reasonable number of jobs to limit things to on the buildbots?

Last edited 12 months ago by cjones051073 (Chris Jones)

comment:6 Changed 12 months ago by ryandesign (Ryan Schmidt)

How much memory does your MBP have? I'm guessing more than 8 GB.

Hopefully the tensorflow build process is already respecting the value of build.jobs, but I'm not sure what a reasonable number of parallel jobs is for tensorflow. MacPorts sets build.jobs on the assumption that each compiler process might use up to 1 GB of memory and that there should be an available core for each compiler process, so on a machine with 8 cores and 8 GB RAM it will allow 8 parallel jobs. This works for most ports, but there is the odd port, like tensorflow, that requires much more memory to build. If bazel can be told to keep its memory usage within the amount of memory the machine has, that could help. Otherwise, it might be a start if the portfile set build.jobs to a fraction of itself, for example:

if {${build.jobs} > 1} {
    build.jobs      [expr {${build.jobs} / 2}]
}

That would reserve 2 GB per job instead of 1. Or a more complicated formula could be used, based on the amount of RAM in the machine, if you can figure out a simple way to determine how much RAM there is and what a good formula would be.
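
For example, something like the following rough, untested sketch; it assumes roughly 2 GB of RAM per compile job, and that [sysctl hw.memsize] (which reports installed RAM in bytes on macOS) is usable here the same way [sysctl hw.physicalcpu] is:

# Untested sketch: cap build.jobs at one job per ~2 GB of installed RAM.
set ram_gb [expr {[sysctl hw.memsize] / (1024 * 1024 * 1024)}]
set jobs_by_ram [expr {max(1, ${ram_gb} / 2)}]
if {${build.jobs} > ${jobs_by_ram}} {
    build.jobs ${jobs_by_ram}
}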

comment:7 Changed 12 months ago by cjones051073 (Chris Jones)

Bazel has a quite involved inbuilt resource management system that monitors RAM and CPU usage. It's just that its defaults appear to be a bit too hungry for the resources the buildbots have (most probably the RAM per core is a bit lower than it anticipates). It also has a bunch of flags to limit things, like

 --local_ram_resources=HOST_RAM*0.75 --local_cpu_resources=HOST_CPUS*.75

which should help.
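
As a sketch of how those might be wired in (untested; exactly how the flags reach bazel depends on how the port and portgroup invoke it, build.args-append is just the generic MacPorts hook):

# Sketch: ask bazel to keep itself to ~75% of the host's RAM and CPUs.
build.args-append \
    --local_ram_resources=HOST_RAM*0.75 \
    --local_cpu_resources=HOST_CPUS*0.75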

One other thing, though: MacPorts doesn't just set build.jobs to the number of physical cores a machine has. It actually sets it to the number of hardware threads available, *including* hyperthreading, which for many machines means twice the number of cores. E.g. on my MacBook Pro here, which has 4 actual CPU cores, it's set to 8. MacPorts just uses

 > sysctl hw.activecpu
hw.activecpu: 8

What does this return on the buildbot VMs? Does it return the actual number of physical cores you have assigned to each VM, or something else?

In a lot of cases setting build.jobs to the number of logical CPUs, taking hyperthreading into account, is probably a good idea (it's what they are there for). But perhaps on the buildbots it would be better to use

 > sysctl hw.physicalcpu
hw.physicalcpu: 4

to limit things to the physical cores.

comment:8 Changed 12 months ago by ryandesign (Ryan Schmidt)

The buildbot workers are virtual machines running in VMware ESXi on 2009 Xserves. The Xserves each have two quad-core processors. I've configured the VMs to have 8 virtual cores each. To the VMs, it appears this way:

$ sysctl hw.activecpu
hw.activecpu: 8
$ sysctl hw.physicalcpu
hw.physicalcpu: 8

VMware virtualizes and manages it all somehow so that all the VMs share the available CPU resources.

comment:9 Changed 12 months ago by cjones051073 (Chris Jones)

[1c003e9a9f540adc6b16d9e44bfea5a1059150d0/macports-ports]

Hopefully one or more of the changes there will be enough to help, but let's see...

Last edited 12 months ago by ryandesign (Ryan Schmidt)

comment:10 Changed 12 months ago by ryandesign (Ryan Schmidt)

Regarding this line that you added:

build.jobs [sysctl hw.physicalcpu]

I would suggest putting that inside the build { ... } block. Otherwise sysctl hw.physicalcpu will get run in all sorts of situations where it is not needed, such as when running port info or port livecheck or portindex.

Also note that users are able to modify build.jobs in macports.conf or on the command line. For example, a user could run sudo port build py38-tensorflow build.jobs=1. Since the purpose of this line is to reduce the number of jobs, you might want to verify that you will in fact be reducing it before changing it:

build {
    set physicalcpus [sysctl hw.physicalcpu]
    if {${build.jobs} > ${physicalcpus}} {
        build.jobs ${physicalcpus}
    }
    ...
}

comment:11 Changed 12 months ago by cjones051073 (Chris Jones)

Last edited 12 months ago by ryandesign (Ryan Schmidt)

comment:12 Changed 12 months ago by cjones051073 (Chris Jones)

The changes seem to have helped. 10.13 has now built the py{37,38} ports in a 'reasonable' time. Ryan, do you agree the resource utilisation on the buildbots was OK? If so, please close this.

comment:13 Changed 12 months ago by ryandesign (Ryan Schmidt)

I don't know if these changes helped or whether it was increasing the RAM on the 10.13 builder to 10 GB or a combination of the two. Let's see what happens with the next version update.

comment:14 Changed 11 months ago by ryandesign (Ryan Schmidt)

One datapoint: building py27-tensorflow1 @1.15.3 took 8 hours 12 minutes on the newly resurrected Yosemite buildbot worker, even though that worker had 9 GB RAM and an SSD.

Last edited 11 months ago by ryandesign (Ryan Schmidt)

comment:15 Changed 6 weeks ago by ryandesign (Ryan Schmidt)

Taking a look at Activity Monitor on the 10.13 buildbot worker during this build of py39-tensorflow1, which has taken 6 hours so far, I see that it is swapping: the VM has 8 GB RAM and 8 cores, and there are 5 clang processes and one java process occupying a total of 10.7 GB of virtual memory. The build is starting too many RAM-hungry processes. Since we know the build uses more RAM per compiler process than MacPorts estimates, the build should reduce the number of jobs as I suggested in comment:6.

Changed 6 weeks ago by ryandesign (Ryan Schmidt)

Attachment: py39-tensorflow1-10.13.png added

comment:16 Changed 6 weeks ago by ryandesign (Ryan Schmidt)

Perhaps we can enhance base to help choose a better value for build.jobs.

comment:17 Changed 6 weeks ago by cjones051073 (Chris Jones)

I've reduced the resource utilisation a bit further in the bazel portgroup. Hopefully this will help.

comment:18 Changed 6 weeks ago by ryandesign (Ryan Schmidt)

We are not the only ones having trouble working out how to keep the tensorflow build process within the available resources: https://github.com/tensorflow/tensorflow/issues/42066

It should not be this hard. This software is terrible.

comment:19 Changed 6 weeks ago by cjones051073 (Chris Jones)

Looks like there are yet more flags we can try in

https://docs.bazel.build/versions/master/memory-saving-mode.html

I've never met another build system that's as hard as bazel to work out how to configure...
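
From a quick read of that page (worth double-checking against the linked docs before relying on it), the relevant options look like the following; they trade incremental-rebuild speed for lower memory use during the build:

# Sketch: bazel's memory-saving options; they discard in-memory analysis
# state during/after the build, so rebuilds are slower but peak RAM is lower.
build.args-append \
    --discard_analysis_cache \
    --notrack_incremental_state \
    --nokeep_state_after_build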
