We'll use GithubEntityType throughout the codebase to determine the
type of operation that is about to take place, so this won't be limited
to determining only the pool type. We'll also use it to dedupe the label
scope.
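For illustration, such an entity type could be sketched as follows (names
are approximations, not the exact definitions in the codebase):

    package params

    // GithubEntityType identifies which kind of GitHub entity an
    // operation targets. Illustrative sketch only.
    type GithubEntityType string

    const (
        GithubEntityTypeRepository   GithubEntityType = "repository"
        GithubEntityTypeOrganization GithubEntityType = "organization"
        GithubEntityTypeEnterprise   GithubEntityType = "enterprise"
    )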
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
This change adds the ability to specify the pool balancing strategy to
use when processing queued jobs. Before this change, GARM would round-robin
through all pools that matched the set of tags requested by queued jobs.
When round-robin (default) is used for an entity (repo, org or enterprise)
and you have 2 pools defined for that entity with a common set of tags that
match 10 jobs (for example), then those jobs would trigger the creation of
a new runner in each of the two pools in turn. Job 1 would go to pool 1,
job 2 would go to pool 2, job 3 to pool 1, job 4 to pool 2 and so on.
When "stack" is used, those same 10 jobs would trigger the creation of a
new runner in the pool with the highest priority, every time.
In both cases, if a pool is full, the next one would be tried automatically.
For the stack case, this means that if pool 2 had a priority of 10 and
pool 1 had a priority of 5, pool 2 would be saturated first, then
pool 1.
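A rough sketch of how the two strategies could order the candidate pools
for a queued job (the type, field names and strategy values here are
illustrative, not the actual implementation):

    package pool

    import "sort"

    // Pool is a minimal stand-in; only the fields needed for this sketch.
    type Pool struct {
        ID       string
        Priority uint
    }

    // orderPools returns the candidate pools in the order they should be
    // tried for a queued job, depending on the balancing strategy.
    func orderPools(pools []Pool, strategy string, rrIndex int) []Pool {
        if len(pools) == 0 {
            return nil
        }
        ordered := make([]Pool, len(pools))
        copy(ordered, pools)
        if strategy == "stack" {
            // Stack: always try the highest priority pool first; lower
            // priority pools only get jobs once the ones above are full.
            sort.SliceStable(ordered, func(i, j int) bool {
                return ordered[i].Priority > ordered[j].Priority
            })
            return ordered
        }
        // Round-robin (default): rotate the starting pool for every job.
        start := rrIndex % len(ordered)
        rotated := make([]Pool, 0, len(ordered))
        rotated = append(rotated, ordered[start:]...)
        rotated = append(rotated, ordered[:start]...)
        return rotated
    }

In both cases, a pool that is already full is skipped and the next one in
the resulting order is tried.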
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
This change allows users to bypass GitHub Unauthorized errors when removing
GitHub runners. This means that removing runners will now be possible even
if the pool manager is stopped.
A new flag added to the runner rm command and to the API tells GARM to
ignore the pool manager being stopped and any 401 error returned by
GitHub.
You will be able to remove the runners from GARM and your provider, but
the runners will still exist in GitHub as "offline" unless the credentials
are updated or the runners are manually removed.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
This change adds the ability to use GitHub Apps to authenticate against the
GitHub API. This gives us a larger quota for API requests (15k vs 5k for PATs).
Also, each GitHub App has its own quota, whereas PATs share the same user quota.
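As a rough illustration, a GitHub App credentials section could look along
these lines (key names are an approximation of the config schema, not the
authoritative reference):

    [[github]]
      name = "my_app_credentials"
      description = "GitHub App used by GARM"
      auth_type = "app"
      [github.app]
        app_id = 123456
        installation_id = 7890123
        private_key_path = "/etc/garm/app_private_key.pem"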
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
By passing the context from main and making auth.GetAdminContext accept a
context, we are now able to stop the metrics collection loop once the
context is canceled.
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
Refactoring is needed to make the metrics package usable from within the
runner package for further metrics.
This change also makes the metric collector independent from requests to
the /metrics endpoint.
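A minimal sketch of the resulting pattern, assuming a ticker-driven
collector that refreshes metrics in the background (instead of on every
scrape) and honors the context handed down from main:

    package metrics

    import (
        "context"
        "log"
        "time"
    )

    // CollectLoop refreshes metrics on a timer rather than on every scrape
    // of /metrics, and stops once the context passed down from main is
    // canceled. Illustrative sketch only.
    func CollectLoop(ctx context.Context, interval time.Duration, collect func(context.Context) error) {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                // Shutdown requested from main; stop collecting.
                return
            case <-ticker.C:
                if err := collect(ctx); err != nil {
                    log.Printf("collecting metrics: %v", err)
                }
            }
        }
    }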
Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
This change increases the tools refresh interval to 5 minutes, cleans
up the websocket code a bit, augments the error message that may be returned
when trying to delete a runner in an invalid state and removes a log message
that does not add much value.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
Log lines emitted by pool managers will now carry 2 fields identifying
which manager generated them.
In the future, we will add tracking IDs in various cases, allowing
us to track down issues faster.
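For illustration, attaching those fields with slog could look like this
(the field names and values are assumptions):

    package main

    import "log/slog"

    func main() {
        // A child logger carrying the pool manager's identity; every log
        // line emitted through it includes these two fields.
        poolLogger := slog.Default().With(
            slog.String("pool_mgr", "example-org/example-repo"),
            slog.String("pool_type", "repository"),
        )
        poolLogger.Info("starting pool manager")
    }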
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
This change switches GARM to the new structured logging standard
library. This will allow us to set log levels and reduce some of
the log spam.
Given that we introduced new knobs to tweak logging, the number of
config options for logging now warrants its own section.
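Purely as an illustration, the new section could look along these lines
(option names are an approximation):

    [logging]
      enable_log_streamer = true
      log_format = "text"   # or "json"
      log_level = "info"    # debug, info, warning, error
      log_source = false    # include the file:line of the log call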
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
Allow runners to update their own system information. Runners can now send
back os_name, os_version and agent_id as part of a POST to
CALLBACK_URL/system-info/.
The goal is to get better info regarding the actual OS that's running
and to move the agent_id from the status updates to the system-info callback.
The status updates should be used only to send back info about the status of
the installation process.
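For illustration, the callback could be exercised roughly like this
(CALLBACK_TOKEN stands in for the instance's callback credentials and the
payload values are made up; only the endpoint and field names come from
the description above):

    curl -X POST "$CALLBACK_URL/system-info/" \
      -H "Authorization: Bearer $CALLBACK_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"os_name": "ubuntu", "os_version": "22.04", "agent_id": 113}'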
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
Canonical have relicensed the LXD project under AGPLv3. This means that we can
no longer update the Go LXD client without re-licensing GARM as AGPLv3. This
is not desirable or possible.
The existing code seems to be Apache 2.0 and all code that has already been
contributed seems to stay as Apache 2.0, but new contributions from Canonical
employees will be AGPLv3.
We cannot risk including AGPLv3 code now or in the future, so we will separate
the LXD provider into its own project, which can be AGPLv3. GARM will simply
execute the external provider.
If the LXD client code is ever split from the main project and re-licensed
under Apache 2.0 or a compatible license, we will reconsider adding it back as a
native provider. In the long run, though, I believe external providers will
be the only option, as they are easier to write, easier to maintain and safer to
ship (a bug in the provider does not crash GARM itself).
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
On large deployments with many jobs, we cannot check each job that
we recorded in the DB against the GH API.
Before this change, if a job was updated more than 10 minutes ago,
GARM would check against the GH API whether that job still existed. While
this approach allowed us to maintain a consistent view of which jobs
still exist and which are stale, it had the potential of spamming the
GH API, leading to rate limiting.
This change uses the scale-down loop as an indicator of job staleness.
If a job remains in queued state in our DB, but has disappeared from GH
or was serviced by another runner and we never got the hook (GARM was down
or GH had an issue - this has happened in the past), then GARM will spin up a new
runner for it. When that runner or any other runner is scaled down, we check
whether we have jobs in the queue that should have matched that runner. If we do,
there is a high chance that the job no longer exists in GH and we can remove
the job from the queue.
Of course, there is a chance that GH is having issues and the job is never
pushed to the runner, but we can't really account for everything. In this case
I'd rather avoid rate limiting ourselves.
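A rough sketch of the idea; the types and the deleteJob helper are
illustrative stand-ins, not the actual GARM code:

    package pool

    // Job and Pool are minimal stand-ins for the sake of this sketch.
    type Job struct {
        ID     int64
        Labels []string
    }

    type Pool struct {
        Tags []string
    }

    // matches reports whether the pool's tags cover every label the job requested.
    func (p Pool) matches(labels []string) bool {
        tags := make(map[string]bool, len(p.Tags))
        for _, t := range p.Tags {
            tags[t] = true
        }
        for _, l := range labels {
            if !tags[l] {
                return false
            }
        }
        return true
    }

    // reapStaleJobs runs as part of the scale-down path: queued jobs that
    // this pool should already have serviced are very likely gone from GH,
    // so they are removed from the DB without querying the GH API.
    func reapStaleJobs(pool Pool, queued []Job, deleteJob func(int64) error) error {
        for _, job := range queued {
            if !pool.matches(job.Labels) {
                continue
            }
            if err := deleteJob(job.ID); err != nil {
                return err
            }
        }
        return nil
    }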
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
This change adds a flag on providers that allows users to disable JIT
configuration even when it's available. For context, JIT is available
on github.com and any GHES instance >=3.10.
This option is a stopgap measure for providers that have not yet been
updated to use JIT configs instead of runner registration tokens.
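As an illustration, a provider definition could gain an option along these
lines (the exact key name may differ):

    [[provider]]
      name = "my_external_provider"
      provider_type = "external"
      description = "Provider that still relies on registration tokens"
      # Fall back to runner registration tokens even when the target
      # github.com or GHES (>=3.10) instance supports JIT configs.
      disable_jit_config = true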
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
As some provider binaries probably need additional environment variables
set (e.g. Kubernetes, since client-go depends on the KUBERNETES_SERVICE_
variables), it should be possible to define a list of environment variables
that will be passed through to the provider binary execution.
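For illustration, the external provider section could allow something like
this (the key name and matching semantics are assumptions):

    [[provider]]
      name = "k8s_external"
      provider_type = "external"
      [provider.external]
        provider_executable = "/opt/garm/providers/garm-provider-k8s"
        config_file = "/etc/garm/provider-k8s.toml"
        # Environment variables (or prefixes) from GARM's own environment
        # that are passed through to the provider binary, e.g. the
        # KUBERNETES_SERVICE_* variables client-go relies on.
        environment_variables = ["KUBERNETES_SERVICE_"]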
This branch adds the ability to forcefully remove a runner from GARM.
When the operator wishes to manually remove a runner, the workflow is as
follows:
* Check that the runner exists in GitHub. If it does, attempt to
remove it. An error here indicates that the runner may be processing
a job. In this case, we don't continue and the operator gets immediate
feedback from the API.
* Mark the runner in the database as pending_delete
* Allow the consolidate loop to reap it from the provider and remove it
from the database.
Removing the instance from the provider is async. If the provider errs out,
GARM will keep trying to remove it in perpetuity until the provider succeeds.
In situations where the provider is misconfigured, this will never happen, leaving
the instance in a permanent state of pending_delete.
A provider may fail for various reasons: credentials have expired, the
API endpoint has changed, the provider is misconfigured, or the operator may just
have removed it from the config before cleaning up the runners. While some cases
are recoverable, some are not. We cannot have a situation in which we cannot clean
up resources in GARM because of a misconfiguration.
This change adds the pending_force_delete instance status. Instances marked with
this status will be removed from GARM even if the provider reports an error.
The GARM CLI has been modified to give new meaning to the --force-remove-runner
option. The option is no longer mandatory. Instead, setting it will mark
the runner with the new pending_force_delete status. Omitting it will mark the runner
with the old pending_delete status.
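For example (the runner name is illustrative):

    # Normal delete (pending_delete); fails fast if GitHub reports the
    # runner may still be processing a job.
    garm-cli runner rm my-runner-0

    # Force delete (pending_force_delete); the runner is removed from GARM
    # even if the provider keeps returning errors.
    garm-cli runner rm my-runner-0 --force-remove-runner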
Fixes: #160
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>