Question


ASM
ID
Last activity: 27 May 2024 10:34 EDT
How does the nodetool cleanup command work on a DDS node?
We are facing an issue in our production environment where our DDS node disk space is full.
We also received the warning below in the log, and the free disk space is not showing in the UI: "At least one of the following disks must be healthy: /dev/mapper/ Available space on disk decreased to 603 GB when it must be at least 633 GB."
Based on the article below we can clean up the data on the DDS node, but it does not mention what kind of data will be removed from the DDS node.
https://docs-previous.pega.com/decision-management/87/verifying-available-disk-space.
So we have the following questions regarding the nodetool cleanup command:
- What kind of data will be removed from the DDS node?
- Will removing data impact decisioning in Pega?
- Is there any other way to maintain the DDS node disk space?
- How does Pega decide which data to remove from the DDS node?
- Is there any retention policy on the data in the DDS node?
Thank you
***Edited by Moderator Marije to add Case tags INC-B16862 ***


Pegasystems Inc.
US
Running nodetool cleanup results in the removal of data that the node no longer owns after new nodes have been added to a cluster. The nodetool cleanup process involves compaction of the existing data on disk, which uses some IO and CPU. In most cases this IO and CPU usage is not high and does not impact the decisioning applications; however, it is always a good idea to run the command on a non-production environment first to gauge the impact that you may see in production.
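If you only need to reclaim space for a specific keyspace or table, the command can also be scoped. The lines below are a sketch only; the keyspace and table names are placeholders and the exact options can vary by Cassandra version (see ./nodetool help cleanup):
- ./nodetool cleanup (all keyspaces on the node)
- ./nodetool cleanup <keyspace> (a single keyspace)
- ./nodetool cleanup <keyspace> <table> (a single table)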
You may not see a significant decrease in used disk space if new nodes have not been introduced to your cluster recently, and you may need to check whether the data you have is valid using other methods. It should be noted that Pega does not directly decide what data should be removed from a DDS node; that determination is made by the Apache Cassandra NoSQL datastore.

Under normal operation, data is retained within Cassandra on the DDS node for some time after it is deleted or its time to live (TTL) expires. All Cassandra data is written to immutable files called SSTables. As part of ongoing maintenance called compaction, these files are combined and a new SSTable is written. Deleted and expired records are discarded during compaction once they reach a certain age (gc_grace_seconds). The normal trigger for a compaction is when 4 SSTables are of a similar size. With larger TTL values the compaction of older, larger SSTables can become less frequent, hence taking up more disk space.
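If you want to check the settings mentioned above for a particular table, they can be read from the schema metadata in cqlsh on Cassandra 3.0 or later. This is only a sketch; the keyspace and table names are placeholders for your own DDS keyspace:
SELECT gc_grace_seconds, default_time_to_live, compaction
FROM system_schema.tables
WHERE keyspace_name = '<keyspace>' AND table_name = '<table>';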
The nodetool cfstats command (tablestats in Cassandra versions 4 and greater) reports an Average tombstones per slice metric that can be used to identify whether you have a buildup of tombstone records (deleted or expired records). If this metric has a high value for a particular table (let's say greater than 10), you should work with GCS to understand whether it is appropriate to reduce the table's gc_grace_seconds value or to perform a full manual compaction.
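As a sketch, the metric can be checked for a single table with something like the line below (keyspace and table names are placeholders; use cfstats or tablestats depending on what your Cassandra version provides):
- ./nodetool tablestats <keyspace>.<table> | grep -i "tombstones per slice"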
It is also possible that the data stored is valid and/or that the overhead of data retained as part of the compaction process is normal. In that case, additional disk space may need to be added to the nodes.
Hopefully this addresses your questions.
Updated: 24 Jan 2024 5:10 EST


ASM
ID
@baylp, Thank you for the answer.
I have checked the SSTables in our environment. There are large maximum tombstones per slice values, for example 924, and the affected keyspaces/tables are default Pega tables.
For a few of the tables I can see that the TTL is 0. What will happen to this data if we perform nodetool cleanup? Is there any order in which we should run these 4 nodetool commands: nodetool compact, nodetool cleanup, nodetool garbagecollect, nodetool repair? Thank you


Ford Motor Company
US
Please run the commands below and share the output:
- ./nodetool info
- ./nodetool status
- ./nodetool describecluster
Updated: 5 Feb 2024 21:07 EST


ASM
ID
Dear @SUMAN_GUMUDAVELLY, I have attached the details below. In these details I can see a huge amount of data in the data folder, and when I break it down, most of it is in the pxdecisionresults table, which is also a table from Pega.


Pegasystems Inc.
US
A default TTL value of 0 means that if an explicit TTL is not supplied when a record is written, the record will never expire. It looks like your largest table is pxdecisionresults. You should consider running nodetool garbagecollect and nodetool cleanup for that table to completely remove any expired records and tombstones.
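As a sketch of those two commands scoped to that single table (the 'data' keyspace name is taken from the directory listing you shared and may differ in your environment; please verify it and try this on a lower environment first):
- ./nodetool garbagecollect data pxdecisionresults
- ./nodetool cleanup data pxdecisionresults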


ASM
ID
@baylp We have run nodetool cleanup and the free disk space did not increase. We also ran nodetool repair and garbagecollect, but they failed because there is not enough free disk space. Any suggestions on how we can delete this data without affecting decisioning in Pega?
Updated: 2 May 2024 1:56 EDT


Pegasystems Inc.
US
It looks like you may need to increase the disk space at this point, as the pxdecisionresults table is taking the majority of the space. Do you set a time to live (TTL) on the records in your pxdecisionresults table? If not, going forward it may make sense to have the records expire after a prescribed time. If you do not set a TTL for the records in this table, you may wish to terminate the nodes, manually delete SSTables that are older than a particular date, and restart the nodes.
Please try any recommendations on a lower environment before applying them to production.
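As an illustrative sketch only, the older SSTable data files for that table could first be listed (not deleted) with something like the command below. The data directory path, keyspace name, and the 90-day threshold are assumptions rather than values from this thread, so please confirm them with GCS before acting on the output:
- find <cassandra_data_dir>/<keyspace>/pxdecisionresults-*/ -name '*-Data.db' -mtime +90 -ls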
Updated: 20 Mar 2024 22:30 EDT


ASM
ID
@baylp We have increased the disk space and performed nodetool repair and GC, and the disk usage was reduced. But we have checked the TTL on the pxdecisionresults table and it is set to 0. What is the recommended TTL for this table? And is there a guide to changing a table's TTL in Cassandra?


Pegasystems Inc.
US
With respect to setting the TTL for pxDecisionResults and the recommended value, can you please raise an incident with GCS? We should not need to change the setting directly inside Cassandra, and GCS should be able to advise on how it can be achieved at the product level.
Many thanks,
Paul


Pegasystems Inc.
GB
@MesdiS16679174 I am unable to find any support ticket logged for this.
Has your original question been answered in this forum post?
If so, can you please hit 'Accept Solution' against the reply which answered your main concerns?


ASM
ID
@MarijeSchillern We are still trying to solve the issue on our side. We have already opened an SR for this and are waiting on a GCS response.


ASM
ID
@baylp We tried to run nodetool repair and garbagecollect but they always failed. We also asked GCS why the nodes go down while nodetool garbagecollect is running, but they suggested we consult our DataStax team, which did not solve our problem. Since there is no DataStax expert on our side, we need to check it manually.


Pegasystems Inc.
GB
@MesdiS16679174 please provide the INC support ticket ID.


ASM
ID
This is the SR ID: #INC-B16862
Accepted Solution
Updated: 23 May 2024 7:36 EDT


Pegasystems Inc.
GB
@MesdiS16679174 I can see that INC-B16862 has been resolved with the following conclusion:
Repair will cause a GC pause. This is DataStax Cassandra behavior. Please work with your DataStax administrator for further assistance.
The recommendation is that GC and repair are not run simultaneously. There is no need to explicitly run GC; the JVM will take care of that internally.
Issue primary reason:
Too frequent GC and repair runs causing instability.
Explanation:
The frequency of running the repair operation has to be evaluated judiciously. If it is run too frequently, it can impact regular operations.
In the article it is only a recommendation to run incremental repairs, i.e. 'nodetool repair -inc -par'.
We do not suggest running nodetool garbagecollect every week.
There is no need to explicitly run GC on the cluster.


ASM
ID
@MarijeSchillern Previously we didn't run any repairs, which is why the data grew so large. That's why we added more disk space, so that we could run garbagecollect and repair to reduce the size. But after we added the disk space and tried to run garbagecollect and repair, we always got an error in the system (they failed) and the disk usage was not reduced. Currently I am handing this issue over to my colleagues due to another issue; it seems it was not explained properly to GCS.