Break the lock on a job if it's still queued and the runner that it
triggered was assigned to another job. This may cause leftover runners
to be created, but we scale those down in ~3 minutes.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
* Removes completed jobs from the db
* Skip ensure min idle runners for pools with min idle runners set to 0
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
* removes an extra loop. The fetch tools loop does the same job
* add a lot of log messages
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
Lock operations per instance name. This should prevent goroutines from
trying to update the same instance concurrently when operations are slow.
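A minimal sketch of per-name locking, using only the standard library. The keyMutex type and its methods are hypothetical names chosen for illustration and are not taken from garm; the point is only that each instance name gets its own mutex, so a slow operation on one instance does not block the others.

```go
package main

import (
	"fmt"
	"sync"
)

// keyMutex is a hypothetical helper that keeps one mutex per instance name.
type keyMutex struct {
	mu    sync.Mutex             // guards the locks map
	locks map[string]*sync.Mutex // one mutex per instance name
}

func newKeyMutex() *keyMutex {
	return &keyMutex{locks: make(map[string]*sync.Mutex)}
}

// lockFor returns the mutex for name, creating it on first use.
func (k *keyMutex) lockFor(name string) *sync.Mutex {
	k.mu.Lock()
	defer k.mu.Unlock()
	m, ok := k.locks[name]
	if !ok {
		m = &sync.Mutex{}
		k.locks[name] = m
	}
	return m
}

// TryLock reports whether the caller acquired the lock for name.
// It never blocks, so a busy instance can simply be skipped.
func (k *keyMutex) TryLock(name string) bool {
	return k.lockFor(name).TryLock()
}

// Unlock releases the lock for name. It must only be called by the
// holder of that lock.
func (k *keyMutex) Unlock(name string) {
	k.lockFor(name).Unlock()
}

func main() {
	km := newKeyMutex()
	fmt.Println(km.TryLock("runner-1")) // first caller gets the lock
	fmt.Println(km.TryLock("runner-1")) // second caller is refused
	fmt.Println(km.TryLock("runner-2")) // other instances are unaffected
	km.Unlock("runner-1")
	fmt.Println(km.TryLock("runner-1")) // available again after Unlock
}
```

This sketch never removes entries from the map; a long-running service would probably want to drop the entry once an instance is deleted.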
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
This commit adds:
* more granular loops for various operations
* update go-github to latest version
* skip trying to fetch runner info for canceled or skipped jobs
* loops use waitgroups to signal exit
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
This commit:
* swaps WaitGroups with errgroups
* wraps errgroup.Wait() in a select to prevent situations in which an
operation takes a long time and prevents garm from being restarted.
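A rough sketch of the "Wait() inside a select" idea. To stay within the standard library, a sync.WaitGroup stands in for the errgroup here; with an errgroup, the wait argument would be the group's Wait method. The names waitWithTimeout and errWaitTimeout are made up for this example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errWaitTimeout is a hypothetical sentinel returned when waiting takes too long.
var errWaitTimeout = errors.New("timed out waiting for workers")

// waitWithTimeout runs wait in its own goroutine and gives up after d.
// The workers are not cancelled; the caller just stops waiting for them.
func waitWithTimeout(wait func() error, d time.Duration) error {
	done := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() { done <- wait() }()

	timer := time.NewTimer(d)
	defer timer.Stop()

	select {
	case err := <-done:
		return err
	case <-timer.C:
		return errWaitTimeout
	}
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(200 * time.Millisecond) // simulated slow operation
	}()

	wait := func() error { wg.Wait(); return nil }

	// Too short: we stop waiting and can proceed with shutdown.
	fmt.Println(waitWithTimeout(wait, 20*time.Millisecond))
	// Long enough: the worker finishes and Wait returns normally.
	fmt.Println(waitWithTimeout(wait, time.Second))
}
```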
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
* When a runner fails to set up the github agent, we reap it after the
pool timeout is reached.
* add a retry in the userdata when configuring the runner agent
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
Providers may return only 3 possible statuses:
* InstanceRunning
* InstanceError
* InstanceStopped
Every other status is reserved for the controller to set. Provider
responses will be split from the instance response in a future commit.
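The restriction above could be enforced with a small check like the one below. Only the three constant names come from this commit message; the InstanceStatus type, the string values and the IsValidProviderStatus function are assumptions made for illustration.

```go
package main

import "fmt"

// InstanceStatus and the string values are illustrative assumptions.
type InstanceStatus string

const (
	InstanceRunning InstanceStatus = "running"
	InstanceError   InstanceStatus = "error"
	InstanceStopped InstanceStatus = "stopped"
)

// IsValidProviderStatus reports whether status is one of the three
// statuses a provider is allowed to return. Anything else is reserved
// for the controller.
func IsValidProviderStatus(status InstanceStatus) bool {
	switch status {
	case InstanceRunning, InstanceError, InstanceStopped:
		return true
	default:
		return false
	}
}

func main() {
	// "pending_delete" is a made-up example of a controller-only status.
	for _, s := range []InstanceStatus{InstanceRunning, "pending_delete"} {
		fmt.Printf("%s allowed from provider: %v\n", s, IsValidProviderStatus(s))
	}
}
```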
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
Use an errgroup to wait for all instance deletion operations before
returning. Log any failure.
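As an illustration of the pattern (not the actual garm code), the sketch below deletes instances concurrently, waits for all of them and logs each failure. It uses sync.WaitGroup plus errors.Join from the standard library instead of errgroup, so unlike errgroup's Wait it returns every failure rather than only the first. deleteAll, the del callback and errProviderTimeout are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync"
)

// errProviderTimeout is a made-up error used by the demo below.
var errProviderTimeout = errors.New("provider timeout")

// deleteAll calls del for every name concurrently, waits for all calls
// to finish, logs each failure and returns the failures joined together.
// It returns nil when every deletion succeeded.
func deleteAll(names []string, del func(name string) error) error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex // guards errs
		errs []error
	)
	for _, name := range names {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			if err := del(name); err != nil {
				log.Printf("failed to delete instance %s: %v", name, err)
				mu.Lock()
				errs = append(errs, fmt.Errorf("deleting %s: %w", name, err))
				mu.Unlock()
			}
		}(name)
	}
	wg.Wait()
	return errors.Join(errs...)
}

func main() {
	del := func(name string) error {
		if name == "runner-2" {
			return errProviderTimeout
		}
		return nil
	}
	err := deleteAll([]string{"runner-1", "runner-2", "runner-3"}, del)
	fmt.Println(err)
}
```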
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
* cleanupOrphanedGithubRunners() now uses errgroup to parallelize and
report errors when removing runners from the provider.
* retryFailedInstancesForOnePool() now uses errgroup
* Removed some setPoolRunningState calls. Those errors should be handled
in the loop where they eventually bubble up.
* Added a number of timeouts in the LXD provider for delete and list
instances. This provider should be converted into an external
provider.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
The external provider needs a simple way to indicate certain types of
errors. Duplicate and not-found errors are two such examples.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
* Add Run() helper for external providers
* Make GARM_CONTROLLER_ID env var common to all commands
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
This interface is similar to the common.Provider interface, but lacks
the AsParams() function. Decoupling the external provider interface from
the internal provider interface allows us to account for any
particularities that may appear between them.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
The execution package is a common package that can be used by external
providers to load environment variables and stdin into a coherent struct
that can be consumed by the various commands that need to execute as
part of the provider.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
The params package should not depend on config. It should be
consumable by external applications that wish to interact with
garm, and it makes no sense to pull in the config package just for some
constants. As such, the following changes have been made:
* Moved some types from config to params
* Moved defaults into a new leaf package called appdefaults
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
This change renames the module from "garm" to "github.com/cloudbase/garm".
This makes it easier for external applications to consume public
functions defined in garm without having to resort to a replace
directive in go.mod.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
Add a grace period for idle runners of 5 minutes. A new idle runner will
not be taken into consideration for scale-down unless it's older than 5
minutes. This should prevent situations where the scaleDown() routine,
which runs every minute, evaluates candidates for reaping and
erroneously counts a newly created runner as well. The in_progress hook
that transitions an idle runner to "active" may arrive long after the
"queued" hook has spun up a runner.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
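The grace period for idle runners described above can be sketched as a simple filter. The runner struct, idleGracePeriod and scaleDownCandidates are hypothetical names for illustration; only the 5-minute threshold and the "older than" rule come from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// idleGracePeriod mirrors the 5-minute grace period from the commit message.
const idleGracePeriod = 5 * time.Minute

// runner is a hypothetical, trimmed-down runner record.
type runner struct {
	Name      string
	Idle      bool
	CreatedAt time.Time
}

// scaleDownCandidates returns the idle runners that are strictly older
// than the grace period at time now. Newer idle runners are skipped
// because their in_progress hook may not have arrived yet.
func scaleDownCandidates(runners []runner, now time.Time) []runner {
	var out []runner
	for _, r := range runners {
		if !r.Idle {
			continue // busy runners are never scale-down candidates
		}
		if now.Sub(r.CreatedAt) <= idleGracePeriod {
			continue // still inside the grace period
		}
		out = append(out, r)
	}
	return out
}

func main() {
	now := time.Now()
	runners := []runner{
		{Name: "old-idle", Idle: true, CreatedAt: now.Add(-10 * time.Minute)},
		{Name: "new-idle", Idle: true, CreatedAt: now.Add(-1 * time.Minute)},
		{Name: "old-busy", Idle: false, CreatedAt: now.Add(-10 * time.Minute)},
	}
	for _, r := range scaleDownCandidates(runners, now) {
		fmt.Println(r.Name)
	}
}
```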