Commit graph

139 commits

Author SHA1 Message Date
Gabriel
8e13588edd
Merge pull request #314 from mercedes-benz/improve_error_message
Improve error messages in garm log
2024-11-26 10:44:41 +02:00
Michael Kuhnt
d6de59619d commit suggestion 2024-11-22 16:49:56 +01:00
Michael Kuhnt
8a31d81faf ignore workflow_jobs without labels 2024-11-22 11:48:59 +01:00
Fabian Fulga
dcff6f9854 Add getProviderBaseParams function in basePoolManager 2024-09-02 15:25:44 +03:00
Fabian Fulga
03f280da59 Version provider interface 2024-08-21 16:14:38 +03:00
Gabriel Adrian Samfira
cc6e985629 Fix: Scope entities to endpoint
This change scopes all github entities to a github endpoint, allowing
users to have the same repo/org/enterprise created for each endpoint.

This way, if your username is the same on github.com and on your GHES
server, and you have the same repository name or org in both places,
GARM can now handle that situation.

This change also fixes a leaky watcher in the pool manager.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-07-29 17:35:57 +00:00
Gabriel Adrian Samfira
2554f70b89 Replace time.After with time.NewTimer
Improper use of time.After can lead to memory leaks if the timer never
gets a chance to fire.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-07-05 12:55:35 +00:00
Gabriel Adrian Samfira
892a62bfe4 Allow configuration of job backoff interval
GARM has a backoff interval when consuming queued jobs. This backoff
is intended to allow any potential idle runners to pick up a job before
GARM attempts to spin up a new one. This change allows users to set a
custom backoff interval or disable it altogether by setting it to 0.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-07-01 10:27:31 +00:00
Gabriel Adrian Samfira
daaca0bd8f Use watcher and get rid of RefreshState()
This change uses the database watcher to watch for changes to the
github entities, credentials and controller info.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-06-21 13:47:48 +00:00
Gabriel Adrian Samfira
1dfa74efd8 Lower the log level of ignored jobs
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-06-20 15:30:25 +00:00
Gabriel Adrian Samfira
8d57fc8fa2 Add rudimentary database watcher
Adds a simple database watcher. At this point it's just one process, but
the plan is to allow different implementations that inform the local running
workers of changes that have occured on entities of interest in the database.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-06-14 19:47:12 +00:00
Mario Constanti
b4e7dead1c fix: check if runner name is empty and return
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-06-05 13:48:53 +02:00
Mario Constanti
dc74c45317 fix: remove unnecessary github api call
There are only a few cases, where we get a job information from github
where the runner name is not set.

For all this cases we do not need to check github API at all because
these jobs are never ever get scheduled to a runner:

job.Action is:

* queued:
  a queued job is just queued and not scheduled to a runner so we do
  not get a runner name from the GH API
* completed:
  when conclusion=cancelled|failure github never scheduled the job to a
  runner and with that we do not get a runner name from the GH API

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-06-05 12:37:20 +02:00
Mario Constanti
7adc48c75f fix: use the american english type of cancelled
github is sending job events where conclusion=cancelled is spelled in american english.

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-06-05 11:57:33 +02:00
Gabriel Adrian Samfira
cb4d56773f Remove some code, move some code around
Remove code that was just wrapping other functions at this point, and
move some code around. We need to get a better idea what is actually
still needed in the pool manager, to begin to refactor it into something
that can scale out.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-04-01 14:52:37 +00:00
Gabriel Adrian Samfira
36288c65e6 Slightly simplify code
Change instance DB functions from querying by ID to querying by name. Names
are unique in GARM, so we might as well use the name instead of the ID and
spare ourselves the extra query to get the ID when a qorkflow comes in.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-30 18:22:06 +00:00
Gabriel Adrian Samfira
f9f545f060 Remove duplicate code
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-29 18:50:04 +00:00
Gabriel Adrian Samfira
9384e37bb1 Fix tests
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-28 18:23:49 +00:00
Gabriel Adrian Samfira
0152b21529 Implement some common logic for pool creation
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-28 10:09:20 +00:00
Gabriel Adrian Samfira
39f1be5512 Fix JIT config with empty runner group name
When no runner group is set, do not attempt to resolve the runner group.
Looking for an empty runner group will just return a not found error, which
will make GARM fall back to registration token.

This change fixes that.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-25 18:53:53 +00:00
Gabriel Adrian Samfira
f0080047a3 Remove superfluous function
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-18 10:56:49 +00:00
Gabriel Adrian Samfira
56da6a4437 Slightly better UX when dealing with webhooks
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-18 10:19:16 +00:00
Gabriel Adrian Samfira
9259f84e56 Fix getting webhook URL info
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-18 09:53:34 +00:00
Gabriel Adrian Samfira
cfb68f8928 Check webhook secret for entity
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-18 09:39:07 +00:00
Gabriel Adrian Samfira
fa75ecfa8e Dedupe more code
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-17 10:59:09 +00:00
Gabriel Adrian Samfira
b550d0c5b9 remove extra function
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-17 10:28:35 +00:00
Gabriel Adrian Samfira
1734e6f87c Deduplicate code
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-17 10:21:41 +00:00
Gabriel Adrian Samfira
234f71d9d1 Rename PoolType to GithubEntityType
We'll use GithubEntityType throughout the codebase to determine the
type of operation that is about to take place, so this won't belimited
to determining only pool type. We'll also use this to dedupe the label
scope as well.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-17 06:58:03 +00:00
Gabriel Adrian Samfira
206fe42c73 Remove unused code, update test
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-15 15:48:53 +00:00
Gabriel Adrian Samfira
d7ea80a657 Remove log message
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-15 08:12:16 +00:00
Gabriel Adrian Samfira
ce3c917ae5 Add pool balancing strategy
This change adds the ability to specify the pool balancing strategy to
use when processing queued jobs. Before this change, GARM would round-robin
through all pools that matched the set of tags requested by queued jobs.

When round-robin (default) is used for an entity (repo, org or enterprise)
and you have 2 pools defined for that entity with a common set of tags that
match 10 jobs (for example), then those jobs would trigger the creation of
a new runner in each of the two pools in turn. Job 1 would go to pool 1,
job 2 would go to pool 2, job 3 to pool 1, job 4 to pool 2 and so on.

When "stack" is used, those same 10 jobs would trigger the creation of a
new runner in the pool with the highest priority, every time.

In both cases, if a pool is full, the next one would be tried automatically.

For the stack case, this would mean that if pool 2 had a priority of 10 and
pool 1 would have a priority of 5, pool 2 would be saturated first, then
pool 1.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-14 20:04:34 +00:00
Gabriel Adrian Samfira
7d33e0f0cf Add job info in runner list
This change adds information about the job a runner is currently handling.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-11 15:46:18 +00:00
Gabriel Adrian Samfira
9a6770c3a3 Allow bypassing Unauthorized error when deleting runner
This change allows users to bypass GitHub Unauthorized errors when removing
github runners. This means that removing runners will now be possible even
if the pool manager is stopped.

There is a new flag added to the runner rm command and to the API that
tells GARM to bypass pool being stopped and any 401 error returned by
GitHub.

This means you will be able to remove the runners from garm and your
provider, but will mean that the runner will still exist in github as
"offline" if the credentials are not updated or the runner manually removed.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-10 15:21:39 +00:00
Gabriel Adrian Samfira
cbb2134f0e Add GitHub App support
This change adds the ability to use GitHub Apps to authenticate against the
GitHub API. This gives us a larger quota for API requests (15k vs 5k for PATs).

Also, each GitHub App has its own quota, whereas PATs share the same user quota.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-01 19:47:50 +00:00
Mario Constanti
3fd09f6dcd fix: assignOp linter finding
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
b0e3f78fbb fix: godoc linter warnings (TODOs)
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
f6404456b9 fix: indent-error-flow linter findings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
e5ed45c258 fix: unnecessary conversion linter findings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
3b9f8b555b fix: var-naming linter findings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
bd0b27ab10 fix: gci section warnings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
8fc001f5f6 fix: misspell linter warnings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
b36b5137b6 feat: count github api calls
introduce metrics counter for github api calls

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-21 14:22:45 +01:00
Gabriel Adrian Samfira
43b96c543d Add some logging
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-01-30 09:37:26 +00:00
Gabriel Adrian Samfira
e441b6ce89 Switch to log/slog
This change switches GARM to the new structured logging standard
library. This will allow us to set log levels and reduce some of
the log spam.

Given that we introduced new knobs to tweak logging, the number of
config options for logging now warrants it's own section.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-01-05 23:46:40 +00:00
Gabriel Adrian Samfira
0dd4f38691 Update go-github
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-12-18 16:20:44 +00:00
Gabriel Adrian Samfira
459906d97e Prevent abusing the GH API
On large deployments with many jobs, we cannot check each job that
we recorded in the DB against the GH API.

Before this change, if a job was updated more than 10 minutes ago,
garm would check against the GH api if that job still existed. While
this approach allowed us to maintain a consistent view over which jobs
still exist and which are stale, it had the potential of spamming the
GH API, leading to rate limiting.

This change uses the scale-down loop as an indicator for job staleness.

If a job remains in queued state in our DB, but has dissapeared from GH
or was serviced by another runner and we never got the hook (garm was down
or GH had an issue - happened in the past), then garm will spin up a new
runner for it. If that runner or any other runner is scaled down, we check
if we have jobs in the queue that should have matched that runner. If we did,
there is a high chance that the job no longer exists in GH and we can remove
the job from the queue.

Of course, there is a chance that GH is having issues and the job is never
pushed to the runner, but we can't really account for everything. In this case
I'd rather avoid rate limiting ourselves.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-12-15 22:41:50 +00:00
Gabriel Adrian Samfira
85968598b0 Add option to disable JIT config
This change adds a flag on providers that allows users to disable JIT
configuration even when it's available. For context, JIT is available
on github.com and any GHES instance >=3.10.

This option is a stopgap measure for providers that have not yet been
updated to use JIT configs instead of runner registration tokens.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-12-11 12:37:33 +00:00
Gabriel Adrian Samfira
d09f12dfd8 Add force delete runner
This branch adds the ability to forcefully remove a runner from GARM.

When the operator wishes to manually remove a runner, the workflow is as
follows:

* Check that the runner exists in GitHub. If it does, attempt to
  remove it. An error here indicates that the runner may be processing
  a job. In this case, we don't continue and the operator gets immediate
  feedback from the API.
* Mark the runner in the database as pending_delete
* Allow the consolidate loop to reap it from the provider and remove it
  from the database.

Removing the instance from the provider is async. If the provider errs out,
GARM will keep trying to remove it in perpetuity until the provider succedes.

In situations where the provider is misconfigured, this will never happen, leaving
the instance in a permanent state of pending_delete.

A provider may fail for various reasons. Either credentials have expired, the
API endpoint has changed, the provider is misconfigured or the operator may just
have removed it from the config before cleaning up the runners. While some cases
are recoverable, some are not. We cannot have a situation in which we cannot clean
resources in garm because of a misconfiguration.

This change adds the pending_force_delete instance status. Instances marked with
this status, will be removed from GARM even if the provider reports an error.

The GARM cli has been modified to give new meaning to the --force-remove-runner
option. This option in the CLI is no longer mandatory. Instead, setting it will mark
the runner with the new pending_force_delete status. Omitting it will mark the runner
with the old status of pending_delete.

Fixes: #160

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-10-12 06:15:36 +00:00
Gabriel Adrian Samfira
26dbc3d8e5 Update garm-provider-common
This update pulls in the latest version of garm-provider-common which removes
its dependency on go-github, making future updates much less painful.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-10-09 10:55:11 +00:00
Gabriel Adrian Samfira
019948acbe Add JIT config as part of instance create
We must create the DB entry for a runner with a JIT config included. Adding it later
via an update runs the risk of having the consolidate loop pick up the incomplete instance.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-09-24 13:51:17 +00:00