Some provider binaries may need additional environment variables set
(e.g. Kubernetes, where client-go depends on the KUBERNETES_SERVICE_ variables).
It should be possible to define a list of environment variables that are
passed through to the provider binary when it is executed.
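As a rough illustration only (the function, allow-list, and provider path below are hypothetical, not GARM's actual implementation), passing through an operator-defined allow-list of variables could look like this:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// buildProviderEnv copies only the allow-listed variables from the parent
// environment: exact names, or prefix matches for entries ending in "_"
// (so "KUBERNETES_SERVICE_" picks up everything client-go needs).
func buildProviderEnv(allowed []string) []string {
	var env []string
	for _, kv := range os.Environ() {
		name := strings.SplitN(kv, "=", 2)[0]
		for _, pattern := range allowed {
			if name == pattern ||
				(strings.HasSuffix(pattern, "_") && strings.HasPrefix(name, pattern)) {
				env = append(env, kv)
				break
			}
		}
	}
	return env
}

func main() {
	// Hypothetical provider binary and allow-list taken from the provider config.
	cmd := exec.Command("/opt/garm/providers/garm-provider-k8s", "CreateInstance")
	cmd.Env = buildProviderEnv([]string{"KUBERNETES_SERVICE_", "PATH", "HOME"})
	out, err := cmd.CombinedOutput()
	fmt.Println(string(out), err)
}
```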
This branch adds the ability to forcefully remove a runner from GARM.
When the operator wishes to manually remove a runner, the workflow is as
follows:
* Check that the runner exists in GitHub. If it does, attempt to
remove it. An error here indicates that the runner may be processing
a job. In this case, we don't continue and the operator gets immediate
feedback from the API.
* Mark the runner in the database as pending_delete
* Allow the consolidate loop to reap it from the provider and remove it
from the database.
Removing the instance from the provider is async. If the provider errors out,
GARM will keep trying to remove it in perpetuity until the provider succeeds.
In situations where the provider is misconfigured, this will never happen, leaving
the instance in a permanent state of pending_delete.
A provider may fail for various reasons: credentials have expired, the
API endpoint has changed, the provider is misconfigured, or the operator may
have simply removed it from the config before cleaning up the runners. While some
cases are recoverable, some are not. We cannot have a situation in which we cannot
clean up resources in GARM because of a misconfiguration.
This change adds the pending_force_delete instance status. Instances marked with
this status will be removed from GARM even if the provider reports an error.
The GARM CLI has been modified to give new meaning to the --force-remove-runner
option. This option is no longer mandatory. Instead, setting it will mark
the runner with the new pending_force_delete status, while omitting it will mark
the runner with the old pending_delete status.
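As a sketch of the resulting behavior (the status strings match the ones described above, but the function names and reaping logic here are illustrative, not GARM's exact code):

```go
package runner

// chooseDeleteStatus maps the CLI flag to the status the runner is marked with.
func chooseDeleteStatus(forceRemove bool) string {
	if forceRemove {
		return "pending_force_delete"
	}
	return "pending_delete"
}

// shouldRemoveFromDB sketches the consolidate loop's decision: a provider
// error only blocks cleanup for plain pending_delete, never for force delete.
func shouldRemoveFromDB(status string, providerErr error) bool {
	if providerErr == nil {
		return true
	}
	return status == "pending_force_delete"
}
```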
Fixes: #160
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
This update pulls in the latest version of garm-provider-common, which removes
its dependency on go-github, making future updates much less painful.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
Add info metrics about providers, enterprises, organizations,
repositories, and pools.
Also expose most of the configurable pool information as metrics,
e.g. the maximum number of runners as garm_pool_max_runners.
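For illustration, a gauge like garm_pool_max_runners could be exposed with the Prometheus Go client roughly as follows (the label set and listen address are assumptions):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// garm_pool_max_runners, labeled per pool; the label names are illustrative.
var poolMaxRunners = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "garm_pool_max_runners",
		Help: "Maximum number of runners configured for a pool.",
	},
	[]string{"pool_id", "provider"},
)

func main() {
	prometheus.MustRegister(poolMaxRunners)
	// In GARM this value would be refreshed from the pool configuration.
	poolMaxRunners.WithLabelValues("example-pool-id", "lxd").Set(10)

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":9997", nil)
}
```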
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
We must create the DB entry for a runner with a JIT config included. Adding it later
via an update runs the risk of having the consolidate loop pick up the incomplete instance.
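A minimal sketch of the ordering concern, with a hypothetical store interface and params struct (not GARM's actual types): the JIT config travels with the create call, so the consolidate loop never observes a half-initialized row.

```go
package runner

// CreateInstanceParams is illustrative; the point is that JITConfiguration
// is part of the initial insert rather than a later update.
type CreateInstanceParams struct {
	Name             string
	PoolID           string
	JITConfiguration []byte
}

type Store interface {
	CreateInstance(params CreateInstanceParams) error
}

func createJITRunner(db Store, params CreateInstanceParams) error {
	// One insert containing everything the consolidate loop needs;
	// no create-then-update window where the instance looks incomplete.
	return db.CreateInstance(params)
}
```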
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
When using JIT runners, we register the runner on GitHub before we get
a chance to spin up the instance in the provider. In such cases, we end
up with a runner in "offline" state while we're creating the actual resource
that will embody the runner. This change will give runners a chance to come
online before GARM tries to clean them up.
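A sketch of the grace-period idea (the threshold and field names are assumptions, not GARM's exact values): an offline runner only becomes eligible for cleanup once it has been around long enough to have come online.

```go
package runner

import "time"

// shouldReapOfflineRunner leaves a freshly registered, still-offline JIT
// runner alone until the grace period has elapsed.
func shouldReapOfflineRunner(registeredAt time.Time, gracePeriod time.Duration) bool {
	return time.Since(registeredAt) > gracePeriod
}
```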
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
We need to abstract away the tools struct and not have garm-provider-common
depend on go-github just for that one struct. It makes it hard to update
go-github without updating garm-provider-common first and then all the rest
of the providers.
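The shape of the abstraction might look something like the struct below (the field names are assumptions modeled on the runner download metadata, not necessarily the exact garm-provider-common definition); providers then consume this type instead of importing go-github:

```go
package params

// Tools describes a runner tools download without referencing go-github types.
type Tools struct {
	OS             string `json:"os"`
	Architecture   string `json:"architecture"`
	DownloadURL    string `json:"download_url"`
	Filename       string `json:"filename"`
	SHA256Checksum string `json:"sha_256_checksum"`
}
```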
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
This change adds a metadata endpoint that returns a list of root CA
certificates a runner must install in order to be able to validate all
relevant API endpoints it may require. This includes any GHES API that
runs on a self-signed certificate.
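A rough sketch of such an endpoint (the route, bundle path, and response format here are assumptions, not GARM's actual metadata API):

```go
package main

import (
	"net/http"
	"os"
)

// certBundleHandler returns the PEM bundle a runner should add to its trust
// store before talking to GARM or a GHES instance with a self-signed cert.
func certBundleHandler(w http.ResponseWriter, r *http.Request) {
	bundle, err := os.ReadFile("/etc/garm/ca-bundle.pem")
	if err != nil {
		http.Error(w, "no CA bundle configured", http.StatusNotFound)
		return
	}
	w.Header().Set("Content-Type", "application/x-pem-file")
	_, _ = w.Write(bundle)
}

func main() {
	http.HandleFunc("/api/v1/metadata/system/cert-bundle", certBundleHandler)
	_ = http.ListenAndServe(":9998", nil)
}
```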
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
* Update the garm-provider-common and go-github packages.
* Update sqlToParamsInstance to return an error when unmarshaling fails.
This change is needed to pull in the new Seal/Unseal functions in common.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
Ensure that there is a foreign key constraint between runners and jobs.
Once a runner is associated with a job, we want the job to be removed along
with the runner.
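As an illustration with GORM tags (the model fields are simplified, not GARM's exact schema), the cascade looks roughly like this:

```go
package models

import "gorm.io/gorm"

// Deleting an Instance also deletes the WorkflowJob rows that reference it,
// via the foreign key with OnDelete:CASCADE.
type Instance struct {
	gorm.Model
	Name string
	Jobs []WorkflowJob `gorm:"foreignKey:InstanceID;constraint:OnDelete:CASCADE"`
}

type WorkflowJob struct {
	gorm.Model
	InstanceID uint
	RunnerName string
}
```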
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
Instead of asking for all the GARM instances from the API, and then
filtering them by `poolID`, we can ask the API to return only the
instances that are in the `poolID` pool.
Therefore, we only need to count the instances that are running and idle.
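In illustrative terms (the client interface and method names are assumptions, not the actual GARM client API), the change is:

```go
package e2e

// Instance and InstancesClient stand in for the GARM client types.
type Instance struct {
	PoolID       string
	Status       string
	RunnerStatus string
}

type InstancesClient interface {
	ListAllInstances() ([]Instance, error)
	ListPoolInstances(poolID string) ([]Instance, error)
}

// Before: fetch every instance and filter by pool on the client side.
func poolInstancesOld(c InstancesClient, poolID string) ([]Instance, error) {
	all, err := c.ListAllInstances()
	if err != nil {
		return nil, err
	}
	var out []Instance
	for _, inst := range all {
		if inst.PoolID == poolID {
			out = append(out, inst)
		}
	}
	return out, nil
}

// After: the API returns only the pool's instances, so the caller is left
// with just counting the ones that are running and idle.
func poolInstancesNew(c InstancesClient, poolID string) ([]Instance, error) {
	return c.ListPoolInstances(poolID)
}
```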
Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
If we fail to get the tools for one pool, GARM fails to start due to the pool
manager startup timeout. Launch the initial tools update function as a
goroutine and return from Start(). If it fails, it will retry, and we won't
block GARM from starting.
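A minimal sketch of the non-blocking startup, assuming a retry interval and helper names that are not GARM's exact ones:

```go
package pool

import "time"

type poolManager struct {
	quit chan struct{}
}

// updateTools would fetch the runner tools list from GitHub.
func (p *poolManager) updateTools() error { return nil }

// Start returns immediately; the first tools fetch happens in a goroutine
// that retries until it succeeds, so a slow or failing fetch can no longer
// trip the pool manager startup timeout.
func (p *poolManager) Start() error {
	go func() {
		for {
			if err := p.updateTools(); err == nil {
				return
			}
			select {
			case <-p.quit:
				return
			case <-time.After(time.Minute): // illustrative retry interval
			}
		}
	}()
	return nil
}
```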
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
The variable `runningIdleCount` would get incremented for instances
in every pool, instead of only for the pool we are interested in.
This change fixes that.
Also, adjust the logging message when an error occurs in the
timeout-exceeded scenario.
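The gist of the fix, in illustrative form (field names assumed):

```go
package e2e

type poolInstance struct {
	PoolID       string
	Status       string
	RunnerStatus string
}

// countRunningIdle only counts instances belonging to the pool we are
// waiting on; the missing PoolID guard was what inflated the count.
func countRunningIdle(instances []poolInstance, poolID string) int {
	count := 0
	for _, inst := range instances {
		if inst.PoolID != poolID {
			continue
		}
		if inst.Status == "running" && inst.RunnerStatus == "idle" {
			count++
		}
	}
	return count
}
```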
Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
* General cleanup of the integration tests' Go code. Move the
`e2e.go` codebase into its own package and separate files.
* Reduce the overall log spam from the integration tests output.
* Add a final GitHub workflow step that stops the GARM server and cleans up
any orphaned GitHub resources.
* Add a `TODO` to implement cleanup of orphaned GitHub webhooks.
This is useful if the webhook uninstall failed.
* Add a `TODO` for extra checks still missing from the GitHub webhooks
install / uninstall logic.
Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>