Commit graph

258 commits

Author SHA1 Message Date
Gabriel Adrian Samfira
1734e6f87c Deduplicate code
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-17 10:21:41 +00:00
Gabriel Adrian Samfira
234f71d9d1 Rename PoolType to GithubEntityType
We'll use GithubEntityType throughout the codebase to determine the
type of operation that is about to take place, so this won't belimited
to determining only pool type. We'll also use this to dedupe the label
scope as well.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-17 06:58:03 +00:00
Gabriel Adrian Samfira
206fe42c73 Remove unused code, update test
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-15 15:48:53 +00:00
Gabriel Adrian Samfira
ac29af6eff Add some unit tests
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-15 14:35:23 +00:00
Gabriel Adrian Samfira
d7ea80a657 Remove log message
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-15 08:12:16 +00:00
Gabriel Adrian Samfira
cdfda0321a Fix balancer type validation
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-15 07:26:04 +00:00
Gabriel Adrian Samfira
ce3c917ae5 Add pool balancing strategy
This change adds the ability to specify the pool balancing strategy to
use when processing queued jobs. Before this change, GARM would round-robin
through all pools that matched the set of tags requested by queued jobs.

When round-robin (default) is used for an entity (repo, org or enterprise)
and you have 2 pools defined for that entity with a common set of tags that
match 10 jobs (for example), then those jobs would trigger the creation of
a new runner in each of the two pools in turn. Job 1 would go to pool 1,
job 2 would go to pool 2, job 3 to pool 1, job 4 to pool 2 and so on.

When "stack" is used, those same 10 jobs would trigger the creation of a
new runner in the pool with the highest priority, every time.

In both cases, if a pool is full, the next one would be tried automatically.

For the stack case, this would mean that if pool 2 had a priority of 10 and
pool 1 would have a priority of 5, pool 2 would be saturated first, then
pool 1.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-14 20:04:34 +00:00
Gabriel Adrian Samfira
7d33e0f0cf Add job info in runner list
This change adds information about the job a runner is currently handling.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-11 15:46:18 +00:00
Gabriel Adrian Samfira
9a6770c3a3 Allow bypassing Unauthorized error when deleting runner
This change allows users to bypass GitHub Unauthorized errors when removing
github runners. This means that removing runners will now be possible even
if the pool manager is stopped.

There is a new flag added to the runner rm command and to the API that
tells GARM to bypass pool being stopped and any 401 error returned by
GitHub.

This means you will be able to remove the runners from garm and your
provider, but will mean that the runner will still exist in github as
"offline" if the credentials are not updated or the runner manually removed.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-10 15:21:39 +00:00
Gabriel Adrian Samfira
4668461603 Expose the credential type through the API
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-02 17:04:27 +00:00
Gabriel Adrian Samfira
cbb2134f0e Add GitHub App support
This change adds the ability to use GitHub Apps to authenticate against the
GitHub API. This gives us a larger quota for API requests (15k vs 5k for PATs).

Also, each GitHub App has its own quota, whereas PATs share the same user quota.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-03-01 19:47:50 +00:00
Mario Constanti
d332e7a8ca fix: gocritic linter finding
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 17:40:15 +01:00
Mario Constanti
4c81c16505 fix: goconst linter findings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 17:39:02 +01:00
Mario Constanti
9e7ac60c09 fix: gosec linter finding
we have to keep crypto/sha1 as github is still using it

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 17:34:12 +01:00
Mario Constanti
fd0550eb7f fix: godoc linter findings (TODO comments)
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 17:33:19 +01:00
Mario Constanti
ee3a670456 fix: var-naming linter findings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 17:28:39 +01:00
Mario Constanti
7221812dfa fix: remove unused cobra args
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 17:20:05 +01:00
Mario Constanti
9f5c38ef2d fix: unused-parameter linter findings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 16:54:38 +01:00
Mario Constanti
b8a9b6c89b fix: ignore testing package typechecks
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
3fd09f6dcd fix: assignOp linter finding
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
9f405e0e8f fix: ifElseChain linter findings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
b0e3f78fbb fix: godoc linter warnings (TODOs)
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
acc17eafcd fix: receiver-naming linter findings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
f6404456b9 fix: indent-error-flow linter findings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
e5ed45c258 fix: unnecessary conversion linter findings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
0ab86a7e51 fix: unused-parameter linter findings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
3b9f8b555b fix: var-naming linter findings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
bd0b27ab10 fix: gci section warnings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
8fc001f5f6 fix: misspell linter warnings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
b3854eaf18 fix: whitespace linter warnings
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 15:06:53 +01:00
Mario Constanti
d68cc3bf05 fix: add missing metrics for few gh api callS
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-22 05:57:42 +01:00
Mario Constanti
b36b5137b6 feat: count github api calls
introduce metrics counter for github api calls

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-21 14:22:45 +01:00
Mario Constanti
0faf0927bc feat: count external provider operations
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-21 14:21:41 +01:00
Mario Constanti
b1cbfac08a fix: switch to context.Background() for adminctx
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-20 16:42:10 +01:00
Mario Constanti
2a3e4d6563 fix: pass context.TODO by getting admin context
fix linter warnings

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-20 16:39:52 +01:00
Mario Constanti
0a53b8f6d8 fix: stop metrics collector ticker on ctx.Done
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-20 14:57:32 +01:00
Mario Constanti
17d74dfbf0 chore: rework prometheus metrics registration
fail if metric registration panics

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-20 14:27:27 +01:00
Mario Constanti
97f172eb51 fix: improve metrics collection loop
by adding the context from main and make auth.GetAdminContext accepting a
context we are now able to stop the metrics collection loop once the
context is canceled

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-20 06:33:21 +01:00
Mario Constanti
1d8d9459eb chore: refactor metrics endpoint
refactoring is needed to make the metrics package usable from within the
runner package for further metrics.

This change also makes the metric-collector independent from requests to
the /metrics endpoint

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
2024-02-19 16:22:32 +01:00
Gabriel Adrian Samfira
43b96c543d Add some logging
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-01-30 09:37:26 +00:00
Gabriel Adrian Samfira
5b735eaaf4 Small adjustments
This change increases the tools refresh interval to 5 minutes, cleans
up the websocket code a bit, augments the error message that may be returned
when trying to delete a runner in an invalid state and removes a log message
that does not add much value.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-01-12 19:53:27 +00:00
Gabriel Adrian Samfira
61e97f0896 Append pool_type and pool_mgr info to logs
Pool managers will have 2 fields identifying which manager generated
the log line.

In the future, we will add tracking ids in various cases, allowing
us to track down issues faster.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-01-06 00:21:50 +00:00
Gabriel Adrian Samfira
e441b6ce89 Switch to log/slog
This change switches GARM to the new structured logging standard
library. This will allow us to set log levels and reduce some of
the log spam.

Given that we introduced new knobs to tweak logging, the number of
config options for logging now warrants it's own section.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-01-05 23:46:40 +00:00
Gabriel Adrian Samfira
2a5e2409b2 Add system-info instance callback
Allow runners to update their own system information. Runners can now send
back os_name, os_version and agent_id back as part of a POST to
CALLBACK_URL/system-info/.

The goal is to get better info in regard to the actual OS that's running
and to move the agent_id from the status updates to the system-info callback.

The status updates should be used only to send back info about the status of
the installation process.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2024-01-04 15:23:43 +00:00
Gabriel Adrian Samfira
0dd4f38691 Update go-github
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-12-18 16:20:44 +00:00
Gabriel Adrian Samfira
affb56f9a0 Remove the LXD internal provider
Canonical have relicensed the LXD project to AGPLv3. This means that we can
no longer update the go LXD client without re-licensing GARM as AGPLv3. This
is not desirable or possible.

The existing code seems to be Apache 2.0 and all code that has already been
contributed seems to stay as Apache 2.0, but new contributions from Canonical
employees will be AGPLv3.

We cannot risc including AGPLv3 code now or in the future, so we will separate
the LXD provider into its own project which can be AGPLv3. GARM will simply
execute the external provider.

If the client code of LXD will ever be split from the main project and re-licensed
as Apache 2.0 or a compatible license, we will reconsider adding it back as a
native provider. Although in the long run, I believe external providers will
be the only option as they are easier to write, easier to maintain and safer to
ship (a bug in the provider does not crash GARM itself).

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-12-18 12:16:48 +00:00
Gabriel Adrian Samfira
459906d97e Prevent abusing the GH API
On large deployments with many jobs, we cannot check each job that
we recorded in the DB against the GH API.

Before this change, if a job was updated more than 10 minutes ago,
garm would check against the GH api if that job still existed. While
this approach allowed us to maintain a consistent view over which jobs
still exist and which are stale, it had the potential of spamming the
GH API, leading to rate limiting.

This change uses the scale-down loop as an indicator for job staleness.

If a job remains in queued state in our DB, but has dissapeared from GH
or was serviced by another runner and we never got the hook (garm was down
or GH had an issue - happened in the past), then garm will spin up a new
runner for it. If that runner or any other runner is scaled down, we check
if we have jobs in the queue that should have matched that runner. If we did,
there is a high chance that the job no longer exists in GH and we can remove
the job from the queue.

Of course, there is a chance that GH is having issues and the job is never
pushed to the runner, but we can't really account for everything. In this case
I'd rather avoid rate limiting ourselves.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-12-15 22:41:50 +00:00
Gabriel Adrian Samfira
85968598b0 Add option to disable JIT config
This change adds a flag on providers that allows users to disable JIT
configuration even when it's available. For context, JIT is available
on github.com and any GHES instance >=3.10.

This option is a stopgap measure for providers that have not yet been
updated to use JIT configs instead of runner registration tokens.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-12-11 12:37:33 +00:00
Mario Constanti
215bd71855 feat: passthrough additional env vars to provider
as some provider binaries probably need additional environment variables
set (e.g kubernetes as client-go depends on KUBERNETES_SERVICE_ vars) it
should be possible to define a list of environment variables which
should get bypassed into the provider binary execution
2023-12-01 11:54:34 +01:00
Gabriel Adrian Samfira
d09f12dfd8 Add force delete runner
This branch adds the ability to forcefully remove a runner from GARM.

When the operator wishes to manually remove a runner, the workflow is as
follows:

* Check that the runner exists in GitHub. If it does, attempt to
  remove it. An error here indicates that the runner may be processing
  a job. In this case, we don't continue and the operator gets immediate
  feedback from the API.
* Mark the runner in the database as pending_delete
* Allow the consolidate loop to reap it from the provider and remove it
  from the database.

Removing the instance from the provider is async. If the provider errs out,
GARM will keep trying to remove it in perpetuity until the provider succedes.

In situations where the provider is misconfigured, this will never happen, leaving
the instance in a permanent state of pending_delete.

A provider may fail for various reasons. Either credentials have expired, the
API endpoint has changed, the provider is misconfigured or the operator may just
have removed it from the config before cleaning up the runners. While some cases
are recoverable, some are not. We cannot have a situation in which we cannot clean
resources in garm because of a misconfiguration.

This change adds the pending_force_delete instance status. Instances marked with
this status, will be removed from GARM even if the provider reports an error.

The GARM cli has been modified to give new meaning to the --force-remove-runner
option. This option in the CLI is no longer mandatory. Instead, setting it will mark
the runner with the new pending_force_delete status. Omitting it will mark the runner
with the old status of pending_delete.

Fixes: #160

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2023-10-12 06:15:36 +00:00