Question
Ernst & Young
ZA
Last activity: 21 May 2020 18:50 EDT
Stream node remains in "JOINING_FAILED" status
Hi All
We have a cluster in Production with one node on the Stream tab of the "Decisioning: Services" landing page remaining in JOINING_FAILED status (see the attached screenshot). We traced it to this snippet in the Kafka server.log file:
...snip...
Does anyone have an idea of what the problem is here? We know next to nothing about Kafka.
Regards,
Johan
Pegasystems Inc.
FR
Hello,
Is this a new environment? Is this the first time you have seen this issue? How long have you been running this Production system, and have there been any changes?
TechMahindra
IN
Hi Johan,
Try stopping that JVM and check whether any active processes are still running on the app server after the JVM stops.
If there are any, kill those processes and start the JVM again. Also, if the system was upgraded, make sure you clean the pr_sys_statusnodes table.
Hope this helps; let me know how it goes.
Regards,
Anandh
Ernst & Young
ZA
Thank you for your response, Anandh. We've been having this problem since last Friday. We've been in production since June with twice-monthly deployments. There was no deployment last week. I didn't clean the pr_sys_statusnodes table since we have not upgraded. Should I delete the row related to the machine with the problem? I tried your suggestion of stopping the JVM and killing any other Java processes after the JVM stopped. There was one, but killing it and restarting didn't help. On the other machines there are two other Java processes, which makes sense since Kafka and Cassandra are both running fine over there.
Pegasystems Inc.
IN
Hi Johan,
I hope all the nodes are able to communicate with each other. Please try to ping from one node to the others and check the response. If they can talk to each other, stop all the nodes, clear the pr_sys_statusnodes table, and restart the servers node by node: first the util nodes, followed by the stream node, and then the web nodes (an example cleanup query is sketched after this post).
Please let me know if the issue still persists after that.
Thanks,
Abhinav
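For illustration only, the pr_sys_statusnodes cleanup suggested above might look like the SQL below, run while all nodes are stopped. The table name comes from this thread; verify it against your own schema (and back up anything you are unsure about) before deleting.
-- With all nodes stopped, first inspect the current node registrations.
SELECT * FROM pr_sys_statusnodes;
-- Clear the stale registrations, then restart the util, stream, and web nodes in that order.
DELETE FROM pr_sys_statusnodes;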
Ernst & Young
ZA
Hi Abhinav
Sorry it took so long to get back to this. Hectic weekend with Black Friday and all. We have permission to restart all the nodes tonight. I will let you know if it helped. All the nodes can ping each other.
Regards,
Johan
Pegasystems Inc.
IN
Hi Johan,
Can you please provide an update? Did it work?
Thanks,
Abhinav
Ernst & Young
ZA
Hi Abhinav
The restart was postponed to last night to coincide with other downtime. We took the nodes down and when we looked at the table it was empty. We brought the nodes back up and the Kafka server is still down. The four records in the table are back. So no, it didn't work.
Regards,
Johan
Pegasystems Inc.
IN
Hi Johan,
Okay, that means something is wrong with Kafka itself. Can you please share the error logs that were generated during Kafka startup?
Thanks,
Abhinav
Ernst & Young
ZA
Hi Abhinav
In my very first post I included an extract from that log file. I'm attaching a copy of the latest log file.
Regards,
Johan
Ernst & Young
ZA
SR logged: SR-D67690
Infosys
NL
How did you fix the issue?
Accepted Solution
Updated: 6 May 2020 7:55 EDT
Ernst & Young
ZA
I was instructed by Pega support in response to SR-D67690 to place the following in prconfig of the problem node:
<env name="dsm/services/stream/server_properties/broker.id" value="8" />
As far as I could gather, this is supposed to force the broker id. It wasn't sufficient, though; the node kept starting up with the wrong id. I eventually traced it to a db table that still had an entry for the node with the wrong broker id. After I deleted that entry, the Kafka node started functioning correctly again, and it resolved a LOT of stability issues. Unfortunately I do not recall the name of the db table and I no longer have contact with the client. I don't think it was pr_sys_statusnodes, though.
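The table name is not recalled above, but the last reply in this thread points at pr_data_stream_nodes. Purely as an illustration (the table, column, and broker path are taken from that later reply and may not match every environment), checking for a stale broker registration before deleting it might look like this:
-- Illustration only: verify the table and column names against your own schema first.
-- Look for a stream-node registration that still carries the old broker id.
SELECT * FROM pr_data_stream_nodes;
-- If a row for the problem node shows a broker id other than the one forced in prconfig,
-- remove it (example broker path shown) and restart that node.
DELETE FROM pr_data_stream_nodes WHERE PYNODEID LIKE '/brokers/ids/1005%';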
Pegasystems Inc.
US
Hi @JohanH55,
I've updated your thread to include your SR details. I also updated your Products and Versions based on that SR.
If the Versions are not correct, please click the Edit icon and update them to the proper versions.
Thank you for using the Pega Collaboration Center!
Marissa | Senior Moderator | Pega Collaboration Center
Standard and Poors
US
We received the exceptions below while restarting one of the nodes, and the restart keeps failing due to Kafka exceptions.
ERROR Error while creating ephemeral at /brokers/ids/1005, node already exists and owner '-23442432' does not match current session '1235353453576456' (kafka.zk.KafkaZkClient$CheckedEphemeral).
ERROR [KafkaServer id=1004] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38) [kafka_2.11- Proprietary information hidden.jar]
Solution:
Deleted the node-related information from the tables below, then restarted the node, and it started fine:
DELETE FROM pr_sys_statusnodes where PYNODENAME='node ID'
DELETE FROM pr_data_stream_nodes where PYNODEID like '/brokers/ids/1005%'
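A cautious way to apply the same fix is to confirm exactly which rows match before deleting anything. The statements below use the same tables and predicates as above, with 'node ID' and the broker path left as placeholders for the failing node:
-- Confirm what will be removed before running the DELETE statements above.
SELECT * FROM pr_sys_statusnodes WHERE PYNODENAME = 'node ID';
SELECT * FROM pr_data_stream_nodes WHERE PYNODEID LIKE '/brokers/ids/1005%';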