I have a failed experiment (consisting of many trials), some of which succeeded and some of which failed. I'm using the
get_experiments() function to retrieve the failed experiment, reopen it, and continue logging to it. However, the experiment status always stays "failed", and we can no longer track the experiment's progress through stdout and stderr.
Does anyone have an idea how we can change the experiment status and also be able to stop the experiment again with Neptune?
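For context, the reopen-and-continue pattern I'm using looks roughly like this, against the legacy neptune-client (pre-1.0) API. The project name and experiment id are placeholders for my own:

```python
# Sketch of reopening a failed experiment and appending new metric values.

def resume_logging(project, exp_id, new_points):
    """Fetch an existing experiment by id and append metric values to it.

    get_experiments(id=...) returns a list of matches, so take the first.
    """
    exp = project.get_experiments(id=exp_id)[0]
    for name, x, y in new_points:
        exp.log_metric(name, x, y)
    return exp

# Live usage (needs NEPTUNE_API_TOKEN and the legacy neptune-client installed):
#   import neptune
#   project = neptune.Session().get_project("neptune-ml/credit-default-prediction")
#   resume_logging(project, "CRED-184", [("loss", 100, 0.42)])
```

The new values do show up, but the status stays "failed" and stdout/stderr capture does not resume.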
Thanks for this. What you can do right now is log new metrics/text/images. However, the status will stay the same, and a reopened experiment will not change stdout/stderr or hardware monitoring metrics.
We are still investigating how to improve this experience. We have two ideas in mind:
- introduce fully-fledged reopening of the experiment.
- introduce linear linking between experiments - so that each experiment can be linked with the preceding one.
Just wanted to ask what sort of need you want to address? It would greatly help us shape this feature in the right direction.
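In the meantime, the second idea can be approximated by hand with today's API by storing the predecessor's id as a property on the new experiment. A minimal sketch; the property name and tag here are my own conventions, not an official feature:

```python
# Hand-rolled "linear linking": record which experiment this one continues.

def create_linked_experiment(project, previous_exp_id):
    """Create a new experiment that records its predecessor's id as a property."""
    return project.create_experiment(
        properties={"previous_experiment": previous_exp_id},
        tags=["continuation"],
    )

# Live usage (legacy neptune-client; project/id are placeholders):
#   import neptune
#   project = neptune.Session().get_project("neptune-ml/credit-default-prediction")
#   new_exp = create_linked_experiment(project, "CRED-184")
```

Both the property and the tag are visible in the experiments table, so the chain of retries can be followed in the UI.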
Thank you for your quick answer.
The first approach is more convenient for me. However, I'm now trying to link my two experiments as you mentioned in the second approach.
I'm still looking for alternatives to add the logs of the failed experiment (charts and monitoring) to the newly created one. At least I can guarantee, for now, that the status is "running" and that I can stop the experiment when needed.
For now I can confirm that we are considering how to re-open experiments properly. We are in the research/ideation phase.
We share the view that it would be better UX overall.
There is one trick that can help you copy logs from the failed experiment to the new one:
```python
import neptune

project = neptune.Session().get_project('neptune-ml/credit-default-prediction')

# get_experiments() returns a list of matches, so take the first one
old_exp = project.get_experiments('CRED-184')[0]
new_exp = project.create_experiment()

# Re-log every numeric channel from the old experiment into the new one
for name, channel in old_exp.get_channels().items():
    if channel.channelType == 'numeric':
        metric = old_exp.get_numeric_channels_values(name)
        for _, row in metric.iterrows():
            new_exp.log_metric(name, row['x'], row[name])

new_exp.stop()
```
Hope this helps
Thank you for the information. I had already implemented that, and also had to fetch the artefacts and hardware utilisation; fortunately it is working for now.
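For reference, the artefact copying can be sketched like this. I'm assuming the legacy client's download_artifact()/log_artifact() methods; the file names below are placeholders:

```python
import os

def copy_artifacts(old_exp, new_exp, filenames, workdir):
    """Download each artefact from the old experiment and re-log it to the new one."""
    for name in filenames:
        old_exp.download_artifact(name, destination_dir=workdir)
        new_exp.log_artifact(os.path.join(workdir, name))

# Live usage (old_exp/new_exp fetched as in the snippet above; paths are placeholders):
#   copy_artifacts(old_exp, new_exp, ["model.pkl", "config.yaml"], "/tmp/artifacts")
```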
Unfortunately, I couldn't find a way to fetch the whole stderr/stdout; the logs there are important to retrieve. Even when using
project.get_leaderboard(), it only fetches the last value of each log.
Thank you for your efforts