Skip to content

Conversation

@Ayush2k02
Copy link

Second approach for fixing this bug . Will be tested and benchmarked against #3539 to see if this is better

Closes #3533

Comment on lines 62 to 65
pub image_data: Vec<(u64, Image<Color>)>,
/// HashMap for O(1) lookup: content_hash -> (id, full_image_for_collision_check)
/// Used in deduplicating images efficiently
image_lookup: HashMap<u64, Vec<(u64, Image<Color>)>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ends up storing the image data twice? I don't think that this is desirable since images can have large memory requirements.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmnmn, think I'll have the map reference the vector

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't be allowed due to rusts ownership rules, but you can store an Arc<Vec<…>> in both. But I'm a bit confused why you have to do this at all, I would expect that just storing (u64, Image<Color>) in a HashSet should be all that we need, right? The hash set / hash map already implement the eq check on hash collisions

Copy link
Author

@Ayush2k02 Ayush2k02 Jan 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes total sense - was redundantly reimplementing what HashMap does automatically. One follow-up question though , went with the manual approach because I was trying to preserve source_node_id when available (to avoid regenerating IDs for existing images). Your solution generates new UUIDs on every HashMap miss , is the source node Id not relevant here ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The image source is just cheaper to compute if it exists, but if it does not exist the code should behave the same way. If the source ids are the same, we know that the images will also be the same without having to check them. We could also extend this with a second hashmap from source_id to image id and always first try to find the source id in the source_id hashmap (if we have a source id) and if not look up the full image. That should give us optimal perf characteristics, but I think I still prefer this simpler solution if the perf is good enough

@TrueDoctor
Copy link
Member

Alright, I've now pushed another suggestion. If the perf of this is acceptable, I think this is going to be the simplest solution which is hopefully also relatively easy to maintain. @0HyperCube thoughts?

@Ayush2k02
Copy link
Author

This is so much more cleaner , wow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rasterize node: the rasterized image is leaking from one document to another

3 participants