
Conversation

@timgrein (Contributor) commented Jan 17, 2025

This PR combines the approaches described in the linked PRs (I've described each idea in isolation in its own PR):

Some important notes:

  • The functionality is guarded by a separate feature flag, inference_cluster_aware_rate_limiting (see the sketch below)
  • This functionality is only enabled for the elastic inference provider in combination with the sparse_embedding task type
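
For context, a minimal sketch of how such a feature-flag guard is typically declared (a hedged illustration assuming Elasticsearch's FeatureFlag utility; the holder class name is hypothetical, while the flag name and the constant name both appear in this PR's diff):

    // Hedged sketch; the holder class name is hypothetical.
    import org.elasticsearch.common.util.FeatureFlag;

    public class InferenceAPIClusterAwareRateLimitingFeature {
        // Guard for node-local, cluster-aware rate limiting; disabled unless the
        // corresponding feature-flag system property is set on the node.
        public static final FeatureFlag INFERENCE_API_CLUSTER_AWARE_RATE_LIMITING_FEATURE_FLAG =
            new FeatureFlag("inference_cluster_aware_rate_limiting");
    }

Call sites then check INFERENCE_API_CLUSTER_AWARE_RATE_LIMITING_FEATURE_FLAG.isEnabled(), as in the determineRouting excerpt further down.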

The combined high-level overview looks like the following:

[Image: combined high-level overview of node-local, cluster-aware rate limiting for the Inference API]
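
Since the overview image itself isn't reproduced here, a hedged sketch of the mechanism it illustrates: the cluster-wide rate limit for a (service, task type) pair is divided across the nodes assigned to that pair, and each node then enforces only its share locally, so no per-request cross-node coordination is needed. All numbers and names below are illustrative:

    // Illustrative arithmetic only, not the PR's actual API.
    long clusterWideRequestsPerSecond = 1_000; // configured cluster-wide limit
    int numAssignedNodes = 4;                  // nodes assigned to, e.g., elastic/sparse_embedding

    // Each assigned node rate-limits locally at its share; in aggregate the
    // cluster stays within the configured limit.
    long nodeLocalRequestsPerSecond = clusterWideRequestsPerSecond / numAssignedNodes; // 250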

@timgrein added the >non-issue, auto-backport (Automatically create backport pull requests when merged), v9.0.0, and v8.18.0 labels on Jan 17, 2025
@elasticsearchmachine added the needs:triage (Requires assignment of a team area label) label on Jan 17, 2025
@timgrein timgrein changed the title [DRAFT][Inference API] Add node-local rate limiting for the inference API [Inference API] Add node-local rate limiting for the inference API Jan 17, 2025
@timgrein timgrein marked this pull request as draft January 17, 2025 16:18

List<DiscoveryNode> assignedNodes = new ArrayList<>();

// TODO: here we can probably be smarter: if |num nodes in cluster| > |num nodes per task types|
@timgrein (Contributor, Author) commented:

This is something I kept out of scope for this PR for now, as we only need it once we support multiple services and/or task types
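
To make the TODO above concrete, here is a hedged sketch of the simple assignment strategy this comment implies (clusterNodes and MAX_NODES_PER_GROUPING are hypothetical; only List<DiscoveryNode> assignedNodes appears in the diff): with a single supported service and task type, up to a fixed number of nodes is taken from the cluster, and the "smarter" follow-up would spread multiple task types across disjoint node subsets.

    // Hedged sketch, not the PR's actual code.
    int cap = Math.min(clusterNodes.size(), MAX_NODES_PER_GROUPING); // hypothetical cap
    List<DiscoveryNode> assignedNodes = new ArrayList<>(clusterNodes.subList(0, cap));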

@timgrein added the :ml (Machine learning) and Team:ML (Meta label for the ML team) labels and removed the needs:triage (Requires assignment of a team area label) label on Jan 23, 2025
@elasticsearchmachine (Collaborator) commented:

Hi @timgrein, I've created a changelog YAML for you.
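
For reference, Elasticsearch changelog entries are small YAML files under docs/changelog/. Based on this PR's number, title, and labels, the generated file plausibly looks like the following (a hedged reconstruction, not the verbatim file):

    pr: 120400
    summary: Add node-local rate limiting for the inference API
    area: Machine Learning
    type: feature
    issues: []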

}

private NodeRoutingDecision determineRouting(String serviceName, Request request, UnparsedModel unparsedModel) {
    if (INFERENCE_API_CLUSTER_AWARE_RATE_LIMITING_FEATURE_FLAG.isEnabled() == false) {
@timgrein (Contributor, Author) commented:

Not strictly necessary, but we can keep it for now and remove it once the feature flag is removed
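
For readers of the excerpt above, a plausible shape of the guarded early return (hedged; NodeRoutingDecision comes from the diff, but the factory method name is an assumption): with the flag disabled, the new routing logic is skipped entirely and the request is handled on the local node, i.e. pre-PR behavior.

    // Assumed completion of the guard shown in the diff above.
    if (INFERENCE_API_CLUSTER_AWARE_RATE_LIMITING_FEATURE_FLAG.isEnabled() == false) {
        return NodeRoutingDecision.handleLocally(); // hypothetical factory method
    }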

@demjened demjened merged commit a40370a into elastic:main Jan 29, 2025
16 checks passed
@elasticsearchmachine (Collaborator) commented:

💚 Backport successful

Status   Branch   Result
✅       8.x      #121251

elasticsearchmachine pushed a commit that referenced this pull request Jan 30, 2025
…API (#120400) (#121251)

* [Inference API] Add node-local rate limiting for the inference API (#120400)

* Add node-local rate limiting for the inference API

* Fix integration tests by using new LocalStateInferencePlugin instead of InferencePlugin and adjust formatting.

* Correct feature flag name

* Add more docs, reorganize methods and make some methods package private

* Clarify comment in BaseInferenceActionRequest

* Fix wrong merge

* Fix checkstyle

* Fix checkstyle in tests

* Check that the service we want to read the rate limit config for actually exists

* [CI] Auto commit changes from spotless

* checkStyle apply

* Update docs/changelog/120400.yaml

* Move rate limit division logic to RequestExecutorService

* Spotless apply

* Remove debug sout

* Adding a few suggestions

* Adam feedback

* Fix compilation error

* [CI] Auto commit changes from spotless

* Add BWC test case to InferenceActionRequestTests

* Add BWC test case to UnifiedCompletionActionRequestTests

* Update x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/common/InferenceServiceNodeLocalRateLimitCalculator.java

Co-authored-by: Adam Demjen <demjened@gmail.com>

* Update x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/common/InferenceServiceNodeLocalRateLimitCalculator.java

Co-authored-by: Adam Demjen <demjened@gmail.com>

* Remove addressed TODO

* Spotless apply

* Only use new rate limit specific feature flag

* Use ThreadLocalRandom

* [CI] Auto commit changes from spotless

* Use Randomness.get()

* [CI] Auto commit changes from spotless

* Fix import

* Use ConcurrentHashMap in InferenceServiceNodeLocalRateLimitCalculator

* Check for null value in getRateLimitAssignment and remove AtomicReference

* Remove newAssignments

* Up the default rate limit for completions

* Put deprecated feature flag back in

* Check feature flag in BaseTransportInferenceAction

* spotlessApply

* Export inference.common

* Do not export inference.common

* Provide noop rate limit calculator, if feature flag is disabled

* Add proper dependency injection

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Co-authored-by: Jonathan Buttner <jonathan.buttner@elastic.co>
Co-authored-by: Adam Demjen <demjened@gmail.com>

* Use .get(0) as getFirst() doesn't exist in 8.18 (probably JDK difference?)

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Co-authored-by: Jonathan Buttner <jonathan.buttner@elastic.co>
Co-authored-by: Adam Demjen <demjened@gmail.com>