KAFKA-17431: Support invalid static configs for KRaft so long as dynamic configs are valid #18949
Conversation
A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers.
def voterNode(id: Int, listener: ListenerName): Option[Node]

def getRecordSerde: RecordSerde[T]
In Kafka we generally don't put `get` in front of getters, so this method should just be `recordSerde`.
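For illustration, the preferred shape looks like this (a minimal sketch; both the trait name and the `RecordSerde` stand-in here are illustrative, not the real Kafka types):

```scala
// Sketch of the naming convention only; SerdeProvider and this RecordSerde
// stand-in are made up for illustration.
trait RecordSerde[T]

trait SerdeProvider[T] {
  def recordSerde: RecordSerde[T]       // preferred: no 'get' prefix
  // def getRecordSerde: RecordSerde[T] // the form flagged above
}
```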
val clientMetricsReceiverPlugin = new ClientMetricsReceiverPlugin()
config.dynamicConfig.initialize(Some(clientMetricsReceiverPlugin))
DynamicBrokerConfig.readDynamicBrokerConfigsFromSnapshot(raftManager, config, quotaManagers)
This will be too late to address some of the use-cases we're concerned about, right?
It would be better to load all the config key/value pairs from the snapshot and overwrite the static configs with them, before the first time we create a KafkaConfig object.
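Roughly, the suggested ordering would look like the following sketch (`readConfigsFromLatestSnapshot` and `effectiveStartupProps` are hypothetical helpers, not existing methods):

```scala
import java.util.Properties

// Hypothetical helper: decode the ConfigRecords in the newest metadata
// snapshot. Stubbed here for illustration only.
def readConfigsFromLatestSnapshot(metadataLogDir: String): Map[String, String] =
  Map.empty

// Merge the snapshot overrides into the static properties *before* the first
// KafkaConfig is ever constructed, so a stale static value (e.g. an old
// keystore path) is never validated on its own.
def effectiveStartupProps(staticProps: Properties, metadataLogDir: String): Properties = {
  val merged = new Properties()
  merged.putAll(staticProps)
  readConfigsFromLatestSnapshot(metadataLogDir).foreach {
    case (key, value) => merged.setProperty(key, value)
  }
  merged
}
```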
> This will be too late to address some of the use-cases we're concerned about, right?
Could you illuminate some of these cases for me? It seems to me like we're doing this early enough in brokerServer's startup. Are we instead concerned that bad static configs will prevent even SharedServer#startup, which sets up KRaft and metadata publishers, from completing successfully? I guess I'm just confused as to how a node can even get into a state where the static configs would crash SharedServer#startup, but somehow also have valid dynamic configs?
I'm a bit confused as to what you're proposing. It sounds like this loading of config key/value pairs from the snapshot should occur before we construct the KafkaConfig in SharedServer's initialization (i.e. before we call SharedServer#start, which initializes raftManager)? If so, that means we can't use raftManager to actually perform the snapshot read, right?
> Could you illuminate some of these cases for me?
One use case is if we have changed the network configuration in such a way that the old static configuration isn't valid. The SSL keystore file being moved is a good example. In that case we would not want to start up with the old static configuration.
> If so, that means we can't use raftManager to actually perform the snapshot read, right?
It shouldn't be necessary to use raftManager to read the snapshot, since the snapshot is just a file. Maybe we could have some static utility method which finds the last snapshot file (it's just a matter of finding the one that sorts last in the folder...).
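A minimal sketch of such a utility, assuming KRaft snapshot files end in `.checkpoint` and that their zero-padded offset/epoch file names sort lexicographically in logical order (both assumptions should be verified against the real layout):

```scala
import java.nio.file.{Files, Path, Paths}
import scala.jdk.CollectionConverters._
import scala.util.Using

// Find the snapshot file that sorts last in the metadata log directory.
def latestSnapshotFile(metadataLogDir: String): Option[Path] = {
  val dir = Paths.get(metadataLogDir)
  if (!Files.isDirectory(dir)) None
  else Using.resource(Files.list(dir)) { stream =>
    stream.iterator().asScala
      .filter(_.getFileName.toString.endsWith(".checkpoint"))
      .toSeq
      .sortBy(_.getFileName.toString)
      .lastOption
  }
}
```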
The test I wrote does "move" the SSL keystore file. By invalidating the static ssl configs on one of the nodes after shutting it down, we can successfully complete brokerServer#startup() with my changes. Without them, we fail during startup.
If we want full parity with how ZK worked, we need to load the dynamic configurations prior to sending out the initial dynamic configurations. In ZK mode we actually fetched the broker configuration from ZK prior to doing this. It's possible we could do this in a different way but I'm not confident that it will solve all the possible cases.
> we need to load the dynamic configurations prior to sending out the initial dynamic configurations
I'm a bit confused about what this means. Where in sharedServer do we "send out initial dynamic configurations"?
From a high level, I think the current implementation is doing the equivalent in KRaft: the KRaft layer on the broker, which reads the snapshot, is acting like ZK (although possibly stale) as the store for broker configs (correct me if I'm wrong; I'll also look more into this tomorrow).
The only config I see that is used in sharedServer/raftManager that is also part of DynamicBrokerConfig#AllDynamicConfigs is the LISTENER_SECURITY_PROTOCOL_MAP_CONFIG. This config is read when building the KRaft network client for the broker during startup, but only the entry for the controller listener is used. From my understanding, the only way to invalidate this static config if it was previously valid would be to mess with server.properties directly and then restart the broker. Is that also a case we want to cover with this change?
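For context, the lookup has roughly this shape (illustrative values only, not the actual startup code):

```scala
// Illustrative only: parse a listener.security.protocol.map value and pick
// out the single entry the broker's KRaft network client consults at startup.
val listenerMap = "BROKER:SSL,CONTROLLER:SSL,EXTERNAL:SASL_SSL" // example value
val protocolByListener: Map[String, String] =
  listenerMap.split(",").map { entry =>
    val Array(listener, protocol) = entry.split(":")
    listener -> protocol
  }.toMap
// Only the controller listener's entry matters during startup:
val controllerProtocol = protocolByListener("CONTROLLER")
```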
> I'm a bit confused about what this means. Where in sharedServer do we "send out initial dynamic configurations"?
SharedServer doesn't. This happens from BrokerServer or ControllerServer.
> From a high level, I think the current implementation is doing the equivalent in KRaft, since the KRaft layer on the broker, which reads the snapshot, is acting like ZK (although possibly stale) as the store for broker configs.
If you are confident that this will handle the case that originally motivated this JIRA then I'm OK with doing this for now. As you recall, the case was the broker SSL keystore file being statically set to a path that no longer existed.
> The only config I see that is used in sharedServer/raftManager that is also part of DynamicBrokerConfig#AllDynamicConfigs is the LISTENER_SECURITY_PROTOCOL_MAP_CONFIG. This config is read when building the KRaft network client for the broker during startup, but only the entry for the controller listener is used. From my understanding, the only way to invalidate this static config if it was previously valid would be to mess with server.properties directly and then restart the broker. Is that also a case we want to cover with this change?
No. That configuration is static and cannot be dynamically changed.
> If you are confident that this will handle the case that originally motivated this JIRA then I'm OK with doing this for now. As you recall, the case was the broker SSL keystore file being statically set to a path that no longer existed.
The test I wrote shows that the implementation changes address this case (see the sketch after this list), since it does the following:
- updates the dynamic configs with the current valid static configs, which include the keystore file
- makes sure the broker's most recent snapshot has that config update
- shuts down the broker
- sets the static keystore location config to an invalid file path
- verifies we can start the broker again (with my change this test passes, and without it the test throws a NoSuchFileException for the invalid file path)
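In outline, the first step could look something like this (a sketch, not the PR's actual test code; the listener name "external" and the helper name are invented, and the snapshot/restart steps are cluster-harness specific, so they appear only as comments):

```scala
import java.util.{Collection => JCollection, Collections}
import org.apache.kafka.clients.admin.{Admin, AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource

// Dynamically persist the currently valid keystore path so a ConfigRecord
// for it lands in the broker's metadata snapshot.
def setDynamicKeystoreLocation(admin: Admin, brokerId: Int, keystorePath: String): Unit = {
  val broker = new ConfigResource(ConfigResource.Type.BROKER, brokerId.toString)
  val ops: JCollection[AlterConfigOp] = Collections.singletonList(
    new AlterConfigOp(
      new ConfigEntry("listener.name.external.ssl.keystore.location", keystorePath),
      AlterConfigOp.OpType.SET))
  admin.incrementalAlterConfigs(Collections.singletonMap(broker, ops)).all().get()
}
// Then: wait until the latest metadata snapshot contains the ConfigRecord,
// shut the broker down, point the *static* ssl.keystore.location at a missing
// file, and assert that startup still succeeds.
```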
batch.forEach(record => {
  if (record.message().apiKey() == MetadataRecordType.CONFIG_RECORD.id) {
    val configRecord = record.message().asInstanceOf[ConfigRecord]
    if (DynamicBrokerConfig.AllDynamicConfigs.contains(configRecord.name())) {
This is not quite right, because you aren't distinguishing between broker configurations and other kinds of configuration. You're also assuming that if you find a broker configuration, it applies to this node, which may not be the case.
> This is not quite right, because you aren't distinguishing between broker configurations and other kinds of configuration.
Yeah, I looked at DynamicConfigPublisher. I think when I get the ConfigRecord, I need to check that its resourceType is BROKER and only put those configs in dynamicBrokerConfigs. Then whether we update the cluster defaults or the per-broker configs is based on whether resourceName is empty or contains the broker's id.
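Roughly, that routing would look like this sketch (method and parameter names are illustrative, not the final diff; null values are handled in the follow-up below):

```scala
import java.util.Properties
import org.apache.kafka.common.config.ConfigResource
import org.apache.kafka.common.metadata.ConfigRecord

// Keep only BROKER-typed records, then split cluster defaults (empty
// resource name) from this broker's own per-broker overrides.
def routeConfigRecord(record: ConfigRecord,
                      brokerId: Int,
                      dynamicDefaultConfigs: Properties,
                      dynamicPerBrokerConfigs: Properties): Unit = {
  if (record.resourceType() == ConfigResource.Type.BROKER.id()) {
    if (record.resourceName().isEmpty)
      dynamicDefaultConfigs.put(record.name(), record.value())
    else if (record.resourceName() == brokerId.toString)
      dynamicPerBrokerConfigs.put(record.name(), record.value())
  }
}
```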
That is correct. For pedantic correctness, you should also handle the case where we're setting it to null (by deleting it) although I don't expect that to occur in a snapshot.
The approach discussed in my previous comment covers this case, since we're not checking the value() field of the ConfigRecord to determine whether or not we put the key/value pair into the props passed into processConfigChanges. I assume processConfigChanges handles this null value case already.
> I assume processConfigChanges handles this null value case already.
Sorry, but this would be an incorrect assumption! We don't seem to have special handling for null values in the code that translates the dynamicBrokerConfigs map into a Properties object.
Let's just handle this properly by removing the key/value pair from that map when a null value shows up.
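Concretely, the handling described here amounts to (a sketch; the method name is illustrative):

```scala
import java.util.Properties
import org.apache.kafka.common.metadata.ConfigRecord

// A null value in a ConfigRecord means the key was deleted, so remove it
// from the accumulated map rather than storing a null (Properties, being a
// Hashtable, rejects null values anyway).
def applyConfigRecord(record: ConfigRecord, props: Properties): Unit = {
  if (record.value() == null) props.remove(record.name())
  else props.put(record.name(), record.value())
}
```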
LGTM
LGTM
  })
}
val configHandler = new BrokerConfigHandler(config, quotaManagers)
configHandler.processConfigChanges("", dynamicPerBrokerConfigs)
dynamicPerBrokerConfigs should be replaced by dynamicDefaultConfigs here. I have opened https://issues.apache.org/jira/browse/KAFKA-19642 to fix it.
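The corrected ordering would presumably look like this (a sketch reusing the names from the hunk above; assumes an empty resource name denotes the cluster-wide default in this code path):

```scala
// Sketch of the KAFKA-19642 fix: apply cluster-wide defaults (empty resource
// name) first, then this broker's own per-broker overrides.
configHandler.processConfigChanges("", dynamicDefaultConfigs)
configHandler.processConfigChanges(config.brokerId.toString, dynamicPerBrokerConfigs)
```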