[Inference API] Add node-local rate limiting for the inference API #120400
Conversation
```java
List<DiscoveryNode> assignedNodes = new ArrayList<>();

// TODO: here we can probably be smarter: if |num nodes in cluster| > |num nodes per task types|
```
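One way the TODO above could play out is to partition cluster nodes across task types instead of assigning every node to every task type. This is a minimal hedged sketch, not the PR's implementation: the class name `NodePartitionSketch`, the method `assignNodes`, and the use of plain `String` node IDs are all illustrative stand-ins.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the TODO: when |nodes in cluster| > |task types|,
// give each task type a disjoint, roughly equal share of the nodes via
// round-robin, rather than assigning all nodes to all task types.
public class NodePartitionSketch {

    public static Map<String, List<String>> assignNodes(List<String> nodes, List<String> taskTypes) {
        Map<String, List<String>> assignments = new HashMap<>();
        for (String taskType : taskTypes) {
            assignments.put(taskType, new ArrayList<>());
        }
        // Node i goes to task type (i % numTaskTypes).
        for (int i = 0; i < nodes.size(); i++) {
            assignments.get(taskTypes.get(i % taskTypes.size())).add(nodes.get(i));
        }
        return assignments;
    }

    public static void main(String[] args) {
        Map<String, List<String>> a = assignNodes(
            List.of("n1", "n2", "n3", "n4", "n5"),
            List.of("sparse_embedding", "completion")
        );
        System.out.println(a.get("sparse_embedding")); // [n1, n3, n5]
        System.out.println(a.get("completion"));       // [n2, n4]
    }
}
```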
This is something I kept out of this PR's scope for now, as we only need it once we support multiple services and/or task types.
Hi @timgrein, I've created a changelog YAML for you.
```java
}

private NodeRoutingDecision determineRouting(String serviceName, Request request, UnparsedModel unparsedModel) {
    if (INFERENCE_API_CLUSTER_AWARE_RATE_LIMITING_FEATURE_FLAG.isEnabled() == false) {
```
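The snippet above is only the head of the routing method. As a self-contained sketch of the decision it appears to make, the following uses simplified stand-ins: a boolean for the feature flag, `String` node IDs instead of `DiscoveryNode`, and a local `NodeRoutingDecision` record; none of these signatures are taken from the PR itself.

```java
import java.util.List;
import java.util.Random;

// Hedged sketch of a cluster-aware routing decision: with the feature flag
// off (or no nodes assigned for the service), fall back to handling the
// request locally; otherwise pick a random responsible node and reroute
// unless that node happens to be the local one.
public class RoutingSketch {

    public record NodeRoutingDecision(boolean handleLocally, String targetNode) {}

    public static NodeRoutingDecision determineRouting(boolean featureFlagEnabled,
                                                       String localNode,
                                                       List<String> assignedNodes) {
        // Flag off or service/task type unsupported: keep the old behavior.
        if (featureFlagEnabled == false || assignedNodes == null || assignedNodes.isEmpty()) {
            return new NodeRoutingDecision(true, localNode);
        }
        // Random choice spreads requests evenly across the responsible nodes.
        String target = assignedNodes.get(new Random().nextInt(assignedNodes.size()));
        return new NodeRoutingDecision(target.equals(localNode), target);
    }
}
```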
Not strictly necessary, but we can keep it for now and remove it once the feature flag is removed.
💚 Backport successful
…API (#120400) (#121251)

* [Inference API] Add node-local rate limiting for the inference API (#120400)
* Add node-local rate limiting for the inference API
* Fix integration tests by using new LocalStateInferencePlugin instead of InferencePlugin and adjust formatting.
* Correct feature flag name
* Add more docs, reorganize methods and make some methods package private
* Clarify comment in BaseInferenceActionRequest
* Fix wrong merge
* Fix checkstyle
* Fix checkstyle in tests
* Check that the service we want to the read the rate limit config for actually exists
* [CI] Auto commit changes from spotless
* checkStyle apply
* Update docs/changelog/120400.yaml
* Move rate limit division logic to RequestExecutorService
* Spotless apply
* Remove debug sout
* Adding a few suggestions
* Adam feedback
* Fix compilation error
* [CI] Auto commit changes from spotless
* Add BWC test case to InferenceActionRequestTests
* Add BWC test case to UnifiedCompletionActionRequestTests
* Update x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/common/InferenceServiceNodeLocalRateLimitCalculator.java (Co-authored-by: Adam Demjen <demjened@gmail.com>)
* Update x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/common/InferenceServiceNodeLocalRateLimitCalculator.java (Co-authored-by: Adam Demjen <demjened@gmail.com>)
* Remove addressed TODO
* Spotless apply
* Only use new rate limit specific feature flag
* Use ThreadLocalRandom
* [CI] Auto commit changes from spotless
* Use Randomness.get()
* [CI] Auto commit changes from spotless
* Fix import
* Use ConcurrentHashMap in InferenceServiceNodeLocalRateLimitCalculator
* Check for null value in getRateLimitAssignment and remove AtomicReference
* Remove newAssignments
* Up the default rate limit for completions
* Put deprecated feature flag back in
* Check feature flag in BaseTransportInferenceAction
* spotlessApply
* Export inference.common
* Do not export inference.common
* Provide noop rate limit calculator, if feature flag is disabled
* Add proper dependency injection
* Use .get(0) as getFirst() doesn't exist in 8.18 (probably JDK difference?)

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Co-authored-by: Jonathan Buttner <jonathan.buttner@elastic.co>
Co-authored-by: Adam Demjen <demjened@gmail.com>
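One commit above mentions moving "rate limit division logic" into RequestExecutorService. The idea, sketched here under assumptions rather than taken from the PR's code, is that when N nodes share responsibility for a service, each node enforces 1/N of the overall limit so the cluster-wide effective rate stays roughly constant. The class and method names below are hypothetical.

```java
// Hedged sketch of per-node rate limit division: divide the cluster-wide
// requests-per-time-unit limit evenly across the nodes responsible for a
// service. The real implementation may round or distribute remainders
// differently.
public class RateLimitDivisionSketch {

    public static long perNodeLimit(long clusterWideLimit, int responsibleNodes) {
        if (responsibleNodes <= 0) {
            throw new IllegalArgumentException("need at least one responsible node");
        }
        // Integer division, floored at 1 so every node can make some progress.
        return Math.max(1, clusterWideLimit / responsibleNodes);
    }

    public static void main(String[] args) {
        System.out.println(perNodeLimit(10_000, 4)); // 2500
        System.out.println(perNodeLimit(3, 10));     // 1 (floor)
    }
}
```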
This PR combines the approaches described in the PRs below (each idea is described in isolation in its own PR):
Some important notes:
- The new behavior is gated behind the `inference_cluster_aware_rate_limiting` feature flag
- It currently applies to the `elastic` inference provider in combination with the `sparse_embedding` task type

The combined high-level overview looks like the following: