
feat: Allow study progress to be saved & resumed#106

Merged
p-e-w merged 5 commits into p-e-w:master from anrp:anrp/save-and-resume-v2
Jan 23, 2026

Conversation

@anrp
Contributor

@anrp anrp commented Jan 15, 2026

@gemini-code-assist
Contributor

Summary of Changes

Hello @anrp, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of optimization studies by implementing a comprehensive checkpointing and resumption system. It leverages Optuna's JournalStorage to persist study progress, allowing users to seamlessly continue interrupted or long-running experiments. The changes include new configuration options for managing checkpoint files, intelligent handling of study identification through parameter hashing, and dynamic adjustment of Optuna's sampler behavior to correctly account for startup trials upon resumption.

Highlights

  • Study Checkpointing and Resumption: Introduced the ability to save and resume the progress of an optimization study, preventing loss of work due to interruptions or crashes.
  • Configurable Checkpoint Files: Added new settings (study_checkpoint_file and study_autoresume) to control where study progress is saved and whether to automatically resume.
  • Robust Study Identification: Implemented a mechanism to generate a unique study name by hashing key configuration parameters, ensuring that resumed studies are consistent with their original setup.
  • Dynamic Startup Trial Management: Adjusted Optuna's optimization process to correctly handle the n_startup_trials (random trials) when resuming a study, ensuring proper exploration even after an interruption.
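
As a rough illustration of the "Robust Study Identification" highlight, the sketch below derives a study name by hashing a few configuration values. The helper name `study_name_from_settings`, the chosen keys, and the `study-` prefix are all assumptions made for illustration; the PR's actual hashing scheme is not shown in this conversation.

```python
import hashlib
import json


def study_name_from_settings(settings: dict, keys: list[str]) -> str:
    """Derive a deterministic study name from selected settings (hypothetical helper)."""
    # Only hash the settings that are meant to define the study's identity.
    relevant = {key: settings[key] for key in keys}
    # sort_keys makes the serialization independent of dict ordering.
    payload = json.dumps(relevant, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return f"study-{digest}"


# Hypothetical settings; the real settings object has more fields.
settings = {"model": "example/model", "n_trials": 200, "n_startup_trials": 60}
identity_keys = ["model", "n_startup_trials"]

name_a = study_name_from_settings(settings, identity_keys)
# Changing a setting that is not part of the identity leaves the name unchanged.
name_b = study_name_from_settings({**settings, "n_trials": 400}, identity_keys)
# Changing an identity setting produces a different name.
name_c = study_name_from_settings({**settings, "model": "other/model"}, identity_keys)
```

Which settings belong in the identity is a design choice: anything left out of `identity_keys` can change between sessions without the study being treated as a different one.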

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable feature to save and resume study progress, which is crucial for long-running optimization tasks. The implementation correctly uses Optuna's JournalStorage for persistence, and the logic for handling resumed studies, including the continuation of startup trials, is well-implemented. I've identified a critical typo that would prevent the code from running, along with a medium-severity style guide violation. After addressing these points, this will be a solid contribution.

@anrp anrp force-pushed the anrp/save-and-resume-v2 branch from af67313 to b49e0ce Compare January 15, 2026 14:03
@anrp anrp changed the title Allow study progress to be saved & resumed feat: Allow study progress to be saved & resumed Jan 15, 2026
@anrp anrp mentioned this pull request Jan 15, 2026
@anrp anrp force-pushed the anrp/save-and-resume-v2 branch from b49e0ce to 3c37150 Compare January 15, 2026 14:12
remaining_trials = settings.n_trials - start_index
if remaining_trials > 0 and random_trials_to_run > 0:
    if start_index > 0:
        print(f"Running additional {random_trials_to_run} random trials")
Owner

This should not be displayed in the frontend. Sampling is an implementation detail that the user can do nothing with.
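
For reference, the arithmetic behind the quoted lines can be sketched as follows. This assumes `start_index` is the number of trials already recorded in the checkpoint and that `random_trials_to_run` is whatever remains of the startup budget; the helper name `plan_resume` and that formula are assumptions, not the PR's code. In line with the review comment, it returns the numbers rather than printing them.

```python
def plan_resume(n_trials: int, n_startup_trials: int, start_index: int) -> tuple[int, int]:
    """Return (remaining_trials, random_trials_to_run) for a resumed study (hypothetical)."""
    # Trials still to run after the ones already stored in the checkpoint.
    remaining_trials = max(0, n_trials - start_index)
    # Startup (random) trials not yet used up, capped by what is left overall.
    random_trials_to_run = min(remaining_trials, max(0, n_startup_trials - start_index))
    return remaining_trials, random_trials_to_run
```

With 200 trials configured, 60 of them random, and 25 already completed, this yields 175 remaining trials, 35 of which would still be sampled randomly.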

@anrp
Contributor Author

anrp commented Jan 16, 2026

Ultimately, from all of this discussion: would you like "configuration option for directory", "all settings saved in study name so that it can't restart from a different place", and "no logic for more random tries"? That converts this to a resume-only capability (you can extend the number of tries interactively, and that's it). Maybe just don't include n_trials in the study name, and read it from the study file (resuming) or the command line (initial)?

@p-e-w
Owner

p-e-w commented Jan 17, 2026

We need to decide what we actually want here. My idea was to have a system that can recover from a crash by resuming where it left off (e.g. after another program used up VRAM and led to an OOM), or allow the user to stop the run and continue the next day. In that case, it's obvious what should happen: All settings are restored, and the trials continue from the cutoff point.

But it seems that you and @spikymoth have a more ambitious vision: That the user should be able to stop the run and resume it with different settings. I agree that this ability would be nice to have in principle, but it seems very difficult to implement in a way that's correct, non-confusing, and maintainable.

@anrp
Contributor Author

anrp commented Jan 17, 2026

The only setting that really makes sense to change when resuming is n_trials, and that can effectively already be overridden with a resume-only capability via the interactive menu option. I would be fine with that, because it does simplify every other part of this.

@p-e-w
Owner

p-e-w commented Jan 17, 2026

Here's a sketch of how I think a reasonable implementation might look:

  1. The study progress is journaled to a file with the same name as the model. The complete settings object is stored in there.
  2. When Heretic is started and a journal file matching the model name already exists, the user is offered the choice between resuming the study or starting a new study.
  3. If the user chooses to resume the study, the entire settings object is replaced with the one from the journal file.
  4. If the user chooses to start a new study, the existing journal file is deleted and replaced with a journal file corresponding to the current run.
  5. When the run is complete, the journal file is deleted.
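
The five steps as proposed in this comment can be sketched roughly as below. This is a minimal illustration that uses a plain JSON file in place of the Optuna journal; the function names, the file layout, and the `resume` flag (standing in for the user's choice in step 2) are all hypothetical.

```python
import json
from pathlib import Path


def journal_path(directory: Path, model: str) -> Path:
    # Step 1: the file is named after the model. Model IDs such as "org/name"
    # contain slashes, so flatten them for use as a file name (an assumption).
    return directory / (model.replace("/", "--") + ".json")


def start_or_resume(directory: Path, settings: dict, resume: bool) -> dict:
    """Return the settings the run should use, following steps 1-4 (hypothetical)."""
    path = journal_path(directory, settings["model"])
    if path.exists() and resume:
        # Step 3: the stored settings replace the current ones entirely.
        return json.loads(path.read_text())["settings"]
    # Step 4 (or a first run): drop any old journal and start a fresh one
    # that stores the complete settings of the current run.
    path.unlink(missing_ok=True)
    path.write_text(json.dumps({"settings": settings, "trials": []}))
    return settings


def finish(directory: Path, model: str) -> None:
    # Step 5: remove the journal once the run is complete.
    journal_path(directory, model).unlink(missing_ok=True)
```

Because the stored settings win on resume, any values passed on the command line for a resumed run are ignored in this sketch.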

@anrp
Contributor Author

anrp commented Jan 17, 2026

1-4 SGTM but I'd like to push back on 5 being automatic, since it's sometimes useful to just be able to drop back in to the chat interface to test things i.e. I'd like to keep the few-KB result file which represents the $hours of computation. Thoughts?

@spikymoth
Contributor

I think a "Delete study log? (y/n)" question at the end if you select "None (exit program)" would make sense, personally. And I agree that n_trials is the main setting that makes sense to allow changing between sessions (which should only require setting exclude=True on that Field() to avoid serialization). There's a case to be made for some other settings, but I think n_trials is the main one for just picking up a promising study again later.

@p-e-w
Owner

p-e-w commented Jan 18, 2026

> 1-4 SGTM but I'd like to push back on 5 being automatic, since it's sometimes useful to just be able to drop back in to the chat interface to test things i.e. I'd like to keep the few-KB result file which represents the $hours of computation.

Okay, but how exactly will this work? Let's say the study is complete and we keep the journal. Now Heretic is run again with the same model, and the user chooses to "resume" the study. But the study is already complete, and the number of completed trials is equal to the number of trials to complete. Now what? We just tell them that the run they just started is already complete, and show the trial selection menu with the Pareto front?

> And I agree that n_trials is the main setting that makes sense to allow changing between sessions (which should only require setting exclude=True on that Field() to avoid serialization).

But we already have that functionality. When the Pareto front is displayed, one of the options is "Continue optimization (run more trials)". This was implemented in #76.

@anrp
Contributor Author

anrp commented Jan 18, 2026

> [...] We just tell them that the run they just started is already complete, and show the trial selection menu with the Pareto front?

is necessary if you want to

> [...] "Continue optimization (run more trials)". This was implemented in #76.

after deciding to exit.
Maybe print a distinct message about no work actually happening, but basically, yes.

@p-e-w
Owner

p-e-w commented Jan 18, 2026

Ok, how about this:

Instead of asking the user whether to delete the journal file when they exit (which interrupts the exit process and might be the wrong time to make that decision), we prompt them when the program starts?

If the previous run is complete:

You have already processed this model. How would you like to proceed?
[1] Show the results from the previous run, allowing you to export models, or to run additional trials.
[2] Ignore the previous run and start from scratch. This will delete the checkpoint file and all results from the previous run.

If the previous run is incomplete:

You have already processed this model, but the run was interrupted. How would you like to proceed?
[1] Continue the previous run from where it stopped.
[2] Ignore the previous run and start from scratch. This will delete the checkpoint file and all results from the previous run.
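
The choice between the two prompts above could be sketched like this. It assumes a run counts as complete once the number of completed trials reaches the configured number of trials, as discussed earlier in the thread; the function name `startup_prompt` is hypothetical, and the message text is copied from this comment.

```python
def startup_prompt(completed_trials: int, n_trials: int) -> tuple[str, list[str]]:
    """Pick the startup question and its two options for an existing checkpoint (hypothetical)."""
    restart = (
        "Ignore the previous run and start from scratch. "
        "This will delete the checkpoint file and all results from the previous run."
    )
    if completed_trials >= n_trials:
        # The previous run finished: offer its results instead of resuming.
        question = "You have already processed this model. How would you like to proceed?"
        first = (
            "Show the results from the previous run, allowing you to export models, "
            "or to run additional trials."
        )
    else:
        # The previous run was cut short: offer to continue it.
        question = (
            "You have already processed this model, but the run was interrupted. "
            "How would you like to proceed?"
        )
        first = "Continue the previous run from where it stopped."
    return question, [first, restart]
```

Keeping the destructive option in the same position in both menus means the second choice always has the same effect, whichever prompt is shown.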

@anrp anrp force-pushed the anrp/save-and-resume-v2 branch from 3c37150 to ba1f085 Compare January 18, 2026 13:05
@anrp
Contributor Author

anrp commented Jan 18, 2026

I like that, basically moves the delete question to startup time. Implemented, PTAL.

@anrp anrp force-pushed the anrp/save-and-resume-v2 branch from ba1f085 to 2d1fdbb Compare January 18, 2026 15:11
@anrp anrp force-pushed the anrp/save-and-resume-v2 branch from 2d1fdbb to 19d9f52 Compare January 18, 2026 15:24
@p-e-w
Owner

p-e-w commented Jan 18, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable feature to save and resume study progress, which is a great improvement for long-running optimization tasks. The implementation is mostly solid, leveraging Optuna's journaling storage. I've found one area in src/heretic/main.py that could be improved for robustness and efficiency by handling potential missing keys in study attributes and avoiding redundant API calls. My suggestion addresses a potential crash and makes the code cleaner. The rest of the changes are well-implemented and consistent with the new feature.

@anrp anrp force-pushed the anrp/save-and-resume-v2 branch from 19d9f52 to 73d7dba Compare January 18, 2026 15:50
@anrp anrp force-pushed the anrp/save-and-resume-v2 branch 2 times, most recently from f95b0ce to a68fc36 Compare January 18, 2026 16:10
@anrp anrp force-pushed the anrp/save-and-resume-v2 branch 5 times, most recently from ca72b30 to 083a6b3 Compare January 19, 2026 13:34
@anrp anrp force-pushed the anrp/save-and-resume-v2 branch from 083a6b3 to d1d0fb6 Compare January 20, 2026 15:12
@anrp anrp force-pushed the anrp/save-and-resume-v2 branch from d1d0fb6 to e38921c Compare January 20, 2026 17:58
@anrp anrp force-pushed the anrp/save-and-resume-v2 branch from e38921c to 6483012 Compare January 20, 2026 18:08
@anrp anrp force-pushed the anrp/save-and-resume-v2 branch from d83a056 to 3b63699 Compare January 20, 2026 19:10
Owner

@p-e-w p-e-w left a comment

I think the logic is sound now.

@p-e-w
Owner

p-e-w commented Jan 22, 2026

/gemini review

@gemini-code-assist
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@p-e-w
Owner

p-e-w commented Jan 22, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable feature allowing study progress to be saved and resumed, which significantly improves the user experience by preventing loss of work. The changes are well-structured, touching .gitignore for checkpoint exclusion, config.py for new settings, and main.py and utils.py for the core logic of saving, loading, and resuming studies. The refactoring of settings source customization in config.py is also a good improvement. Overall, the implementation seems robust and directly addresses the stated goal of the pull request.

@p-e-w p-e-w merged commit ebc22c2 into p-e-w:master Jan 23, 2026
4 checks passed
@p-e-w
Owner

p-e-w commented Jan 23, 2026

Yup, that looks good now. Merged!

@p-e-w p-e-w mentioned this pull request Jan 26, 2026

Development

Successfully merging this pull request may close these issues.

State save possible?

3 participants