Skip to content

Conversation

@d-w-moore
Copy link
Collaborator

@d-w-moore d-w-moore commented May 19, 2025

Wherein we close down threads in an orderly way, so that things don't leave things to be disposed in the wrong order for the ever persnickety SSL shutdown logic.

Experiments show that SIGTERM actually does induce the Python interpreter to shut down non-daemonic threads, so installing a signal handler for that may not be necessary in the end.

@d-w-moore d-w-moore changed the title [_722] fix segfault and hung threads on SIGINT during parallel get [#722] fix segfault and hung threads on KeyboardIinterrupt during parallel get May 19, 2025
@d-w-moore d-w-moore self-assigned this May 19, 2025
@d-w-moore d-w-moore marked this pull request as draft May 19, 2025 17:09
@d-w-moore
Copy link
Collaborator Author

After a bit of manual testing, will attempt to make a proper test for SIGINT and SIGTERM to ensure things are left in an ok state.

@d-w-moore
Copy link
Collaborator Author

d-w-moore commented Jun 5, 2025

A GUI for example that maintains background asynch parallel transfers using PRC could trap and guard against Ctrl-C thusly:

from irods.parallel import abort_asynchronous_transfers
signal(SIGINT, lambda *_:exit(0 if abort_asynchronous_transfers() else 0))

Update: abort_asynchronous_transfers has transformed. It is now abort_parallel_transfers and may now be used to abort the current (just interrupted) synchronous transfer as well as all pending background ones. See the README updates in this pull request.

@d-w-moore d-w-moore force-pushed the segfault_parallel_io_722.m branch from abafff5 to fb36836 Compare June 6, 2025 13:48
Copy link
Contributor

@alanking alanking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable. Just a couple of things in the test

@korydraughn
Copy link
Contributor

Looks like we have a conflict.

Seems this PR is close to completion?

@alanking
Copy link
Contributor

alanking commented Dec 5, 2025

Just checking to see if this PR is still being considered for 3.3.0.

@d-w-moore
Copy link
Collaborator Author

Just checking to see if this PR is still being considered for 3.3.0.

Will check its currency to see whether the segfault is still a concern. If so, then I think we can consider it for this release.

@d-w-moore d-w-moore force-pushed the segfault_parallel_io_722.m branch from fb36836 to 481952c Compare December 12, 2025 15:25
@korydraughn
Copy link
Contributor

What's the status of this PR?

@d-w-moore
Copy link
Collaborator Author

What's the status of this PR?

I believe it's almost ready. I want to look over it once more.

Copy link
Contributor

@alanking alanking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awaiting signal that this is ready

@d-w-moore
Copy link
Collaborator Author

Awaiting signal that this is ready

I've added a test (first draft, will run soon) that interrupts a put . We weren't testing that previously.

finish put test

debug(parallel)

debug(put-test)

behaves better if we add mgr to list sooner?

experimental changes ACTIVE_PATH

paths active

make return values consisten from io_multipart_*()

print debug on abort

almost there?

move statement where transfer_managers is updated

rework abort_transfer fn slightly

handle logic for prematurely shutdown executor
@d-w-moore
Copy link
Collaborator Author

Now open to comment ... once more. These changes are final in my mind, as far as the significant changes to library functionality.

@d-w-moore
Copy link
Collaborator Author

I can squash the last 6 commits or so, if that helps reviewers.

@korydraughn
Copy link
Contributor

Yes please.

transfer_managers: weakref.WeakKeyDictionary["_Multipart_close_manager", Any] = weakref.WeakKeyDictionary()

def abort_parallel_transfers(dry_run=False, filter_function=None):
"""'cls' should be tuple to extract the current synchronous transfer."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is/was 'cls'?

if File is None:
File = gen_file_handle()
try:
f = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is f? It appears to be unused.

Comment on lines +532 to +533
# TODO - examine this experimentally restored code, as
# library should react to these two exception types(and perhaps others) by quitting all transfer threads
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an issue we can link to with this TODO? To which two exception types is it referring?

Comment on lines +654 to +659
# if queueLength > 0:
(futures, mgr, chunk_notify_queue) = retval
# else:
# futures = retval
# TODO: investigate: Huh? Why were we zeroing out total_bytes when there is no progress queue?
#chunk_notify_queue = total_bytes = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to keep these commented-out bits? And if so, can we get an issue number for the TODO?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this TODO can go.

Comment on lines +23 to +46
def wait_till_true(function, timeout=LARGE_TEST_TIMEOUT, msg = ''):
"""Wait for test purposes until a condition becomes true , as determined by the
return value of the provided test function.
By default, we wait at most LARGE_TEST_TIMEOUT seconds for the function to return true, and then
quit or time out. Alternatively, a timeout of None translates as a request never to time out.
If the msg value passed in is a nonzero-length string, it can be used to raise a timeout exception;
otherwise timing out causes a normal exit, relaying as the return value the last value returned
from the test function.
"""
start_time = time.clock_gettime_ns(time.CLOCK_BOOTTIME)
while not (truth_value := function()):
if (
timeout is not None
and (time.clock_gettime_ns(time.CLOCK_BOOTTIME) - start_time) * 1e-9
> timeout
):
if msg:
raise TimeoutError(msg)
else:
break
time.sleep(_clock_polling_interval)
return truth_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a common place where we can put this? It's duplicated here and in test_signal_handling_in_multithread_get.py.

Comment on lines +129 to +131
# Tell the parent process the name of the local file being "get"ted (got) from iRODS
print(local_path)
sys.stdout.flush()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be used as signal in the test on line 70 that the transfer threads have spawned. Is that understanding correct? If so, let's explicitly mention that in the comment as well because that seems really important.

abort_parallel_transfers()
exit(128+sig)

signal.signal(signal.SIGTERM, handler)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the KeyboardInterrupt handling below the "signal handler" for SIGINT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@alanking
Copy link
Contributor

Also, take a peek at the 2 failing checks to see if there's anything actionable.

@d-w-moore d-w-moore force-pushed the segfault_parallel_io_722.m branch from 4b0458e to 14037f9 Compare January 15, 2026 21:23
@d-w-moore
Copy link
Collaborator Author

d-w-moore commented Jan 15, 2026

Sorry about the delay. (I see some reviews are already made.) I have just submitted a squash - but no actual changes of any kind since the last note I posted above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants