Commit graph

132 commits

Author SHA1 Message Date
Gabriel Adrian Samfira
0dd4f38691 Update go-github
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-12-18 16:20:44 +00:00
Gabriel Adrian Samfira
459906d97e Prevent abusing the GH API
On large deployments with many jobs, we cannot check each job that
we recorded in the DB against the GH API.

Before this change, if a job was updated more than 10 minutes ago,
garm would check against the GH API if that job still existed. While
this approach allowed us to maintain a consistent view over which jobs
still exist and which are stale, it had the potential of spamming the
GH API, leading to rate limiting.

This change uses the scale-down loop as an indicator for job staleness.

If a job remains in queued state in our DB, but has disappeared from GH
or was serviced by another runner and we never got the hook (garm was down
or GH had an issue - happened in the past), then garm will spin up a new
runner for it. If that runner or any other runner is scaled down, we check
if we have jobs in the queue that should have matched that runner. If we did,
there is a high chance that the job no longer exists in GH and we can remove
the job from the queue.

Of course, there is a chance that GH is having issues and the job is never
pushed to the runner, but we can't really account for everything. In this case
I'd rather avoid rate limiting ourselves.
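
The heuristic described above can be sketched roughly as follows. This is only an illustration under assumptions: the `Job` and `Runner` types, the label-matching rule and the `pruneStaleJobs` helper are hypothetical names and are not taken from the garm code base.

```go
package main

import "fmt"

// Job is a hypothetical, simplified view of a job recorded in the DB.
type Job struct {
	ID     int64
	Status string   // for example "queued" or "in_progress"
	Labels []string // labels the job requested
}

// Runner is a hypothetical, simplified view of a runner being scaled down.
type Runner struct {
	Name   string
	Labels []string
}

// matches reports whether the runner carries every label the job asked for.
func matches(job Job, runner Runner) bool {
	have := make(map[string]struct{}, len(runner.Labels))
	for _, l := range runner.Labels {
		have[l] = struct{}{}
	}
	for _, l := range job.Labels {
		if _, ok := have[l]; !ok {
			return false
		}
	}
	return true
}

// pruneStaleJobs splits the queue into jobs to keep and jobs to remove.
// A queued job that the scaled-down runner could have serviced is treated
// as stale, without asking the GH API about it.
func pruneStaleJobs(queue []Job, scaledDown Runner) (kept, removed []Job) {
	for _, job := range queue {
		if job.Status == "queued" && matches(job, scaledDown) {
			removed = append(removed, job)
			continue
		}
		kept = append(kept, job)
	}
	return kept, removed
}

func main() {
	queue := []Job{
		{ID: 1, Status: "queued", Labels: []string{"linux", "x64"}},
		{ID: 2, Status: "queued", Labels: []string{"windows"}},
		{ID: 3, Status: "in_progress", Labels: []string{"linux"}},
	}
	idle := Runner{Name: "runner-a", Labels: []string{"linux", "x64", "self-hosted"}}

	kept, removed := pruneStaleJobs(queue, idle)
	fmt.Printf("kept=%d removed=%d\n", len(kept), len(removed))
}
```

The trade-off is the one stated above: a job that GH is merely slow to dispatch may be dropped from the queue, in exchange for making no per-job API calls.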

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-12-15 22:41:50 +00:00
Gabriel Adrian Samfira
85968598b0 Add option to disable JIT config
This change adds a flag on providers that allows users to disable JIT
configuration even when it's available. For context, JIT is available
on github.com and any GHES instance >=3.10.

This option is a stopgap measure for providers that have not yet been
updated to use JIT configs instead of runner registration tokens.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-12-11 12:37:33 +00:00
Gabriel Adrian Samfira
d09f12dfd8 Add force delete runner
This branch adds the ability to forcefully remove a runner from GARM.

When the operator wishes to manually remove a runner, the workflow is as
follows:

* Check that the runner exists in GitHub. If it does, attempt to
  remove it. An error here indicates that the runner may be processing
  a job. In this case, we don't continue and the operator gets immediate
  feedback from the API.
* Mark the runner in the database as pending_delete
* Allow the consolidate loop to reap it from the provider and remove it
  from the database.

Removing the instance from the provider is async. If the provider errs out,
GARM will keep trying to remove it in perpetuity until the provider succeeds.

In situations where the provider is misconfigured, this will never happen, leaving
the instance in a permanent state of pending_delete.

A provider may fail for various reasons. Either credentials have expired, the
API endpoint has changed, the provider is misconfigured or the operator may just
have removed it from the config before cleaning up the runners. While some cases
are recoverable, some are not. We cannot have a situation in which we cannot clean
resources in garm because of a misconfiguration.

This change adds the pending_force_delete instance status. Instances marked with
this status will be removed from GARM even if the provider reports an error.

The GARM cli has been modified to give new meaning to the --force-remove-runner
option. This option in the CLI is no longer mandatory. Instead, setting it will mark
the runner with the new pending_force_delete status. Omitting it will mark the runner
with the old status of pending_delete.
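
A minimal sketch of how the two statuses could differ during reaping is shown below. The status strings come from the text above; the `Instance` type, the `reap` and `statusForDelete` helpers and the provider stand-ins are hypothetical and are not the actual GARM implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// InstanceStatus mirrors the two statuses named in the commit message.
type InstanceStatus string

const (
	PendingDelete      InstanceStatus = "pending_delete"
	PendingForceDelete InstanceStatus = "pending_force_delete"
)

// Instance is a hypothetical, simplified DB record.
type Instance struct {
	Name   string
	Status InstanceStatus
}

var errProviderDown = errors.New("provider unavailable")

// brokenProvider and healthyProvider stand in for a provider delete call.
func brokenProvider(name string) error  { return errProviderDown }
func healthyProvider(name string) error { return nil }

// statusForDelete maps the CLI force flag to the status set on the runner.
func statusForDelete(force bool) InstanceStatus {
	if force {
		return PendingForceDelete
	}
	return PendingDelete
}

// reap asks the provider to delete the instance and reports whether the
// DB record may be removed. With pending_force_delete the provider error
// is still returned for logging, but it no longer blocks the removal.
func reap(inst Instance, del func(string) error) (removeFromDB bool, err error) {
	if inst.Status != PendingDelete && inst.Status != PendingForceDelete {
		return false, nil
	}
	err = del(inst.Name)
	if err == nil {
		return true, nil
	}
	if inst.Status == PendingForceDelete {
		return true, err
	}
	return false, err // left in place; a later pass retries
}

func main() {
	for _, force := range []bool{false, true} {
		inst := Instance{Name: "runner-a", Status: statusForDelete(force)}
		remove, err := reap(inst, brokenProvider)
		fmt.Printf("status=%s remove=%v err=%v\n", inst.Status, remove, err)
	}
}
```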

Fixes: #160

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-10-12 06:15:36 +00:00
Gabriel Adrian Samfira
26dbc3d8e5 Update garm-provider-common
This update pulls in the latest version of garm-provider-common which removes
its dependency on go-github, making future updates much less painful.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-10-09 10:55:11 +00:00
Gabriel Adrian Samfira
019948acbe Add JIT config as part of instance create
We must create the DB entry for a runner with a JIT config included. Adding it later
via an update runs the risk of having the consolidate loop pick up the incomplete instance.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-09-24 13:51:17 +00:00
Gabriel Adrian Samfira
4bedb1dd63 Fix URLs for enterprises
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-09-24 13:51:17 +00:00
Gabriel Adrian Samfira
5f2cb19503 Use accessors when getting response values
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-09-24 13:51:17 +00:00
Gabriel Adrian Samfira
1268507ce6 Add jit config routes
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-09-24 13:50:20 +00:00
Gabriel Adrian Samfira
5214aca228 Add jit config for new runner
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-09-24 13:49:57 +00:00
Gabriel Adrian Samfira
6dea1c1937 Add temporary replace to fork
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-09-24 13:49:56 +00:00
Gabriel Adrian Samfira
d5f8cf079e Ignore instances that are still being created from reaping
When using JIT runners, we register the runner on GitHub before we get
a chance to spin up the instance in the provider. In such cases, we end
up with a runner in "offline" state while we're creating the actual resource
that will embody the runner. This change will give runners a chance to come
online before garm tries to clean them up.
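
One way to picture the check is sketched below. It assumes a status-based rule; the status names `pending_create` and `creating`, the `instance` type and the `reapable` helper are assumptions made for illustration and are not confirmed by the commit.

```go
package main

import "fmt"

// instance is a hypothetical, simplified record linking a provider
// resource to the runner registered on GitHub.
type instance struct {
	Name         string
	Status       string // assumed provider-side lifecycle status
	RunnerOnline bool   // whether GitHub reports the runner as online
}

// reapable reports whether an offline runner may be cleaned up.
// Instances that are still being created are skipped so that a JIT
// runner registered ahead of its resource gets a chance to come online.
func reapable(inst instance) bool {
	switch inst.Status {
	case "pending_create", "creating":
		return false
	}
	return !inst.RunnerOnline
}

func main() {
	instances := []instance{
		{Name: "runner-a", Status: "creating", RunnerOnline: false},
		{Name: "runner-b", Status: "running", RunnerOnline: false},
		{Name: "runner-c", Status: "running", RunnerOnline: true},
	}
	for _, inst := range instances {
		fmt.Printf("%s reapable=%v\n", inst.Name, reapable(inst))
	}
}
```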

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-09-24 13:49:06 +00:00
Gabriel Adrian Samfira
fc77a4b735 Update go-github and garm-provider-common
We need to abstract away the tools struct and not have garm-provider-common
depend on go-github just for that one struct. It makes it hard to update
go-github without updating garm-provider-common first and then all the rest
of the providers.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-09-24 07:56:56 +00:00
Gabriel Adrian Samfira
a26907fb91 Add root CA bundle metadata URL
This change adds a metadata endpoint that returns a list of root CA
certificates a runner must install in order to be able to validate all
relevant API endpoints it may require. This includes any GHES API that
runs on a self-signed certificate.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-08-28 09:44:18 +00:00
Gabriel Adrian Samfira
d700b790ac Update garm-provider-common and go-github
* Updates the garm-provider-common and go-github packages.
* Update sqlToParamsInstance to return an error when unmarshaling

This change is needed to pull in the new Seal/Unseal functions in common.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-08-28 08:13:44 +00:00
Gabriel Adrian Samfira
baa7df65a4 Fix garm pool manager startup
If we fail to get the tools for one pool, garm fails to start due to pool
manager startup timeout. Launch the initial tools update function as a
goroutine and return from Start(). If it fails, it will retry, and we won't
block garm from starting.
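
The startup pattern described above can be sketched as follows. Only `Start()` is named in the commit; the `poolManager` fields, the retry interval and the `runStartupDemo` helper are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// poolManager is a hypothetical, stripped-down pool manager.
type poolManager struct {
	fetchTools func() error  // stand-in for the tools download call
	retryEvery time.Duration // assumed retry interval
	toolsReady chan struct{} // closed once tools were fetched
}

// Start returns immediately. The initial tools update runs in its own
// goroutine, so a failing fetch cannot cause a startup timeout.
func (p *poolManager) Start(ctx context.Context) error {
	go p.updateToolsUntilSuccess(ctx)
	return nil
}

// updateToolsUntilSuccess retries the fetch until it works or ctx ends.
func (p *poolManager) updateToolsUntilSuccess(ctx context.Context) {
	for {
		if err := p.fetchTools(); err == nil {
			close(p.toolsReady)
			return
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(p.retryEvery):
		}
	}
}

// runStartupDemo starts a manager whose first two fetches fail and
// reports how many attempts were made and whether tools became ready.
func runStartupDemo() (attempts int32, ready bool) {
	var calls int32
	pm := &poolManager{
		retryEvery: 10 * time.Millisecond,
		toolsReady: make(chan struct{}),
		fetchTools: func() error {
			if atomic.AddInt32(&calls, 1) < 3 {
				return errors.New("tools unavailable")
			}
			return nil
		},
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if err := pm.Start(ctx); err != nil {
		return 0, false
	}
	select {
	case <-pm.toolsReady:
		ready = true
	case <-time.After(2 * time.Second):
	}
	return atomic.LoadInt32(&calls), ready
}

func main() {
	attempts, ready := runStartupDemo()
	fmt.Printf("attempts=%d ready=%v\n", attempts, ready)
}
```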

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-08-25 08:57:24 +00:00
Gabriel Adrian Samfira
93bfb6fe07 ping the webhook after creation
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-08-22 09:39:02 +03:00
Gabriel Adrian Samfira
d57e488f12 Return details in case PAT does not have access
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-08-22 09:39:02 +03:00
Gabriel Adrian Samfira
779afe980e Add webhook show, return info and some fixes
* Added a webhook show command. This gives us info about the webhook and
  if it is installed.
* Return webhook info when installing the webhook
* Small typo fixes.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-08-22 09:39:01 +03:00
Gabriel Adrian Samfira
6051fa016c Return bad request if hook already installed
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-08-22 09:39:01 +03:00
Gabriel Adrian Samfira
dbd41f518d Add CLI webhook enablement
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-08-22 09:39:01 +03:00
Gabriel Adrian Samfira
7ce3f007b0 Add functions to (un)install webhooks for orgs and repos
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-08-22 09:39:01 +03:00
Gabriel Adrian Samfira
e33b64aacb Providers now return ProviderInstance{}
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-23 12:47:56 +00:00
Gabriel Adrian Samfira
e775c9c11d Move most of util package
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-22 22:39:17 +00:00
Gabriel Adrian Samfira
ed651bb7d0 Move errors to external package
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-22 22:26:47 +00:00
Gabriel Adrian Samfira
da13cec2de Move code to external package
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-21 15:34:18 +00:00
Gabriel Adrian Samfira
3d26900d32 Set credentials in pool manager
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-04 22:11:45 +00:00
Gabriel Adrian Samfira
a41eeb6f1e Update comment on function
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-04 10:48:14 +00:00
Gabriel Adrian Samfira
6c06afb8e8 Don't add additional labels to GH runner
For now, the additional labels would only contain the job ID that triggered
the creation of the runner. It does not make sense to add this label to the
actual runner that registers against GitHub. We can simply use it internally
by fetching it from the DB.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-03 07:48:22 +00:00
Gabriel Adrian Samfira
0ab8f73bb4 Use r.log()
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-03 07:46:20 +00:00
Gabriel Adrian Samfira
f92ac2a74f Lower backoff timer to 1 minute
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-03 07:46:20 +00:00
Gabriel Adrian Samfira
7f510ec40a Check if we have a recorded job
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-03 07:46:20 +00:00
Gabriel Adrian Samfira
a526c1024c Various fixes
* enable foreign key constraints on sqlite
* on delete cascade for addresses and status messages
* add debug server config option
* fix rr allocation

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-03 07:46:20 +00:00
Gabriel Adrian Samfira
f7cf6bb619 increase backoff to 30 seconds
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-03 07:46:20 +00:00
Gabriel Adrian Samfira
3796c25228 Amend some log messages
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-03 07:46:20 +00:00
Gabriel Adrian Samfira
bf90eb323a Add back update locks
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-03 07:46:20 +00:00
Gabriel Adrian Samfira
b52f107bde Update log messages
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-03 07:46:20 +00:00
Gabriel Adrian Samfira
c04a93dde9 Add basic round robin for pools
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-03 07:46:20 +00:00
Gabriel Adrian Samfira
4b9c20e1be Reduce timeout to 10 seconds
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-03 07:46:20 +00:00
Gabriel Adrian Samfira
28360fd662 Do not record jobs not meant for us
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-03 07:46:20 +00:00
Gabriel Adrian Samfira
a15a91b974 Break lock and lower scale down timeout
Break the lock on a job if it's still queued and the runner that it
triggered was assigned to another job. This may cause leftover runners
to be created, but we scale those down in ~3 minutes.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-03 07:46:20 +00:00
Gabriel Adrian Samfira
b6a02db446 Remove completed jobs and slight optimization
* Removes completed jobs from the db
* Skip ensure min idle runners for pools with min idle runners set to 0

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-03 07:46:20 +00:00
Gabriel Adrian Samfira
5153738359 Small fixes
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-03 07:46:20 +00:00
Gabriel Adrian Samfira
fbffd8157b Add job tracking
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-07-03 07:46:20 +00:00
Gabriel Adrian Samfira
67b871488d Log the actual error
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-06-30 09:14:10 +03:00
Gabriel Adrian Samfira
0a27acd818 Remove extra loop and add logging
* removes an extra loop. The fetch tools loop does the same job
* add a lot of log messages

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-06-30 08:52:16 +03:00
Gabriel Adrian Samfira
7358beb2b9 Merge Unlock() and UnlockAndDelete()
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-06-30 08:48:29 +03:00
Gabriel Adrian Samfira
1edb9247a8 Add per instance mux
Lock operations per instance name. This should avoid goroutines trying
to update the same instance when operations may be slow.
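
A keyed mutex of this kind might look like the sketch below. The `keyMux` type and its methods are hypothetical names, not the ones used in garm, and this version never deletes entries, so it only suits a bounded set of instance names.

```go
package main

import (
	"fmt"
	"sync"
)

// keyMux hands out one mutex per instance name.
type keyMux struct {
	mu    sync.Mutex             // guards the map itself
	locks map[string]*sync.Mutex // one lock per instance name
}

func newKeyMux() *keyMux {
	return &keyMux{locks: make(map[string]*sync.Mutex)}
}

// lockFor returns the mutex for name, creating it on first use.
func (k *keyMux) lockFor(name string) *sync.Mutex {
	k.mu.Lock()
	defer k.mu.Unlock()
	l, ok := k.locks[name]
	if !ok {
		l = &sync.Mutex{}
		k.locks[name] = l
	}
	return l
}

// Lock blocks until the caller owns the lock for the named instance.
func (k *keyMux) Lock(name string) { k.lockFor(name).Lock() }

// Unlock releases the lock for the named instance.
func (k *keyMux) Unlock(name string) { k.lockFor(name).Unlock() }

// TryLock acquires the lock only if it is free (needs Go 1.18 or newer).
func (k *keyMux) TryLock(name string) bool { return k.lockFor(name).TryLock() }

// concurrentUpdates runs n goroutines that all update the same instance
// and returns how many updates were applied under the per-instance lock.
func concurrentUpdates(n int) int {
	mux := newKeyMux()
	var wg sync.WaitGroup
	updates := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mux.Lock("runner-a")
			defer mux.Unlock("runner-a")
			updates++ // safe: serialized by the lock for "runner-a"
		}()
	}
	wg.Wait()
	return updates
}

func main() {
	fmt.Println("updates applied:", concurrentUpdates(100))
}
```

Operations on different instance names do not block each other, which is the point of locking per name rather than with one global mutex.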

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-06-23 15:43:31 +00:00
Gabriel Adrian Samfira
a9cf5127a9 More granular loops, update go-github
This commit adds:

  * more granular loops for various operations
  * update go-github to latest version
  * skip trying to fetch runner info for canceled or skipped jobs
  * loops use waitgroups to signal exit

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-06-23 08:16:41 +00:00
Gabriel Adrian Samfira
4921692ee2 Wrap errgroup in select
This commit:

  * swaps WaitGroups with errgroups
  * wraps errgroup.Wait() in a select to prevent situations in which an
    operation takes a long time and prevents garm from being restarted.
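
The select wrapper can be sketched as below. To keep the example free of external modules it takes any `func() error`, so a method value such as `g.Wait` from `golang.org/x/sync/errgroup` could be passed in; the `waitWithTimeout` helper, the error value and the timeouts are assumptions, not the code from this commit.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errWaitTimeout = errors.New("timed out waiting for workers")

// waitWithTimeout runs wait in its own goroutine and gives up after
// timeout, so a slow operation cannot block shutdown forever.
func waitWithTimeout(wait func() error, timeout time.Duration) error {
	// Buffered so the goroutine can finish even if nobody is listening
	// any more after a timeout.
	done := make(chan error, 1)
	go func() { done <- wait() }()

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errWaitTimeout
	}
}

// sleepThenReturn builds a stand-in for a group's Wait method.
func sleepThenReturn(d time.Duration, err error) func() error {
	return func() error {
		time.Sleep(d)
		return err
	}
}

// demoFast waits on workers that finish well within the timeout.
func demoFast() error {
	return waitWithTimeout(sleepThenReturn(5*time.Millisecond, nil), time.Second)
}

// demoSlow waits on workers that outlive the timeout.
func demoSlow() error {
	return waitWithTimeout(sleepThenReturn(500*time.Millisecond, nil), 20*time.Millisecond)
}

func main() {
	fmt.Println("fast:", demoFast())
	fmt.Println("slow:", demoSlow())
}
```

Note that on timeout the workers keep running in the background; the wrapper only stops the caller from waiting on them.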

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-06-23 01:07:55 +03:00