@lukasgraef
Contributor

@lukasgraef lukasgraef commented Oct 19, 2025

Describe the PR

When analyzing the heap dump created by running CPD against https://github.com/alsa-project/alsa-firmware/blob/master/emu/emu0404_netlist.h (mentioned in #6145) with Eclipse's Memory Analyzer, there's a clear culprit for the huge memory consumption in CPD: net.sourceforge.pmd.cpd.MatchCollector#tokenMatchSets

tokenMatchSets is a Map whose values are a large number of Set<Integer> instances, which can take up most of the available heap space. There's a drop-in replacement for Set<Integer> here that takes up significantly less memory: java.util.BitSet.
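
To illustrate the idea, here is a minimal sketch (not the actual PMD change) of how a set of non-negative integers maps onto a BitSet. The class and method names are hypothetical and exist only for this example. A HashSet<Integer> keeps a boxed Integer plus a hash-table entry per element, while a BitSet keeps one bit per possible index in a long[]:

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: names are hypothetical, this is not PMD code.
class TokenIndexSets {

    // Copies a set of non-negative indices into a BitSet (one bit per index).
    static BitSet toBitSet(Set<Integer> indices) {
        BitSet bits = new BitSet();
        for (int index : indices) {
            bits.set(index); // throws IndexOutOfBoundsException for negative values
        }
        return bits;
    }

    // Converts back by walking the set bits in ascending order.
    static Set<Integer> toSet(BitSet bits) {
        Set<Integer> result = new HashSet<>();
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            result.add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> boxed = new HashSet<>(Set.of(3, 64, 65, 1000));
        BitSet bits = toBitSet(boxed);

        System.out.println(bits.get(64));              // membership test, like Set.contains
        System.out.println(bits.cardinality());        // element count, like Set.size
        System.out.println(toSet(bits).equals(boxed)); // round trip keeps the same members
    }
}
```

The common set operations have direct counterparts (set/add, get/contains, cardinality/size), but BitSet does not implement the Set interface, so call sites have to be adapted.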

When using PMD 7.17, I wasn't able to run CPD on emu0404_netlist.h with --minimum-tokens 100, even with a max heap size of 8GB.

Using my locally built SNAPSHOT with BitSet, I was able to analyze emu0404_netlist.h while consuming only about 750MB of memory.

In this scenario, this simple replacement reduces the memory footprint of CPD to less than a tenth of what 7.17 needs.

Related issues

Ready?

  • Added unit tests for fixed bug/feature
  • Passing all unit tests
  • Complete build ./mvnw clean verify passes (checked automatically by github actions)
  • Added (in-code) documentation (if needed)

@pmd-actions-helper
Contributor

pmd-actions-helper bot commented Oct 19, 2025

Documentation Preview

No regression tested rules have been changed.

(comment created at 2025-10-29 12:01:44+00:00 for 4346bca)

Member

@oowekyala oowekyala left a comment

I could reproduce your results on my machine. I was skeptical at first because I thought BitSet would grow too large if the integers that need to be stored are very large, but that doesn't look to be the case.
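
For context on that concern, here is a small sketch (the class name is made up for this example) of how a BitSet's storage depends on the highest index set rather than on the number of elements. It only demonstrates the general behavior of java.util.BitSet and says nothing about the index values CPD actually stores:

```java
import java.util.BitSet;

// Illustrative sketch only: shows how BitSet storage scales with the highest set index.
class BitSetGrowth {

    // Number of 64-bit words needed to represent all bits currently set.
    static int wordsInUse(BitSet bits) {
        return bits.toLongArray().length;
    }

    public static void main(String[] args) {
        BitSet small = new BitSet();
        small.set(5);          // one element, low index

        BitSet sparse = new BitSet();
        sparse.set(1_000_000); // still one element, but a high index

        System.out.println(wordsInUse(small));  // 1 word (8 bytes)
        System.out.println(wordsInUse(sparse)); // 15626 words (about 122 KiB)
    }
}
```

So a single very large value makes a BitSet far larger than a one-element HashSet, which is why the measurements in this PR matter more than the theoretical worst case.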

Thanks for suggesting this change! This is a great improvement.

@oowekyala oowekyala added this to the 7.18.0 milestone Oct 21, 2025
Member

@adangel adangel left a comment

Thanks!

I have now finished my own testing (#6145 (comment)).

BitSet seems to use slightly more memory on a "normal" project without many overlapping duplicates (case 1). For the single file emu0404_netlist.h (case 2), on the other hand, it uses less, so the analysis now passes (I can confirm your ~750MB; you probably used Java 25?).
It's still not enough to analyze the whole project (case 3), so we'll need more changes in this area. The dedup solution looks promising in combination with BitSet.

So, overall, it's an improvement and I'm going to merge it for 7.18.0.

@adangel adangel merged commit a5a9c59 into pmd:main Oct 29, 2025
1 check passed
magwas pushed a commit to magwas/pmd that referenced this pull request Nov 1, 2025