Neptune and hyperparameter search with Tune

Hi,

I am using Neptune with neptune-tensorboard for a machine learning project.
I would like to use tune for hyperparameter search for my model. For a grid search, tune launches runs (sequentially or in parallel) with different hyperparameters and writes the checkpoints & results to separate directories, so that TensorBoard recognizes the distinct runs.

My problem is that, since Neptune isn’t based on directories, the logs of the different runs overwrite each other and we end up with a single curve per logged metric, as if only one run had happened.

Do you know how to avoid this issue?
The only way I see would be to stop using neptune-tensorboard and write a custom logger from scratch that logs the metrics from the different runs under unique names. This way, each metric would be displayed as a distinct plot.

Many thanks,

Hi @thomas,

Thanks for reaching out!

I think a simple fix would be to first run your experiments with tune and log them using TensorBoard logging (no Neptune code at this point).
Then, you use our integration with TensorBoard to sync your runs to Neptune (for analysis, comparison, sharing, etc.). So instead of integrating in the code, you sync from the command line:
neptune tensorboard /path/to/logdir --project USERNAME/PROJECT_NAME
This command searches the directory and logs all experiments it finds.
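
For example, the first step could look roughly like this (just a minimal sketch: I assume the default local_dir, so the TensorBoard event files land under ~/ray_results, and the objective and metric names are made up):

import random

from ray import tune

def objective(config):
    # dummy training loop that only reports a metric; the default Tune loggers
    # write the TensorBoard event files for each trial into its own directory
    for step in range(10):
        tune.track.log(mean_loss=config["lr"] * random.random())

tune.run(
    objective,
    name="tb_only_runs",
    config={"lr": tune.grid_search([0.01, 0.1])},
)

and afterwards, from the command line:

neptune tensorboard ~/ray_results/tb_only_runs --project USERNAME/PROJECT_NAME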

What do you think?

Best,
Kamil

Hi @thomas

I checked some examples of Tune loggers, and it looks like you can write a simple NeptuneLogger that gets the job done.

I wrote something quickly.
Here is a full example:

import time
import random

from ray import tune
from ray.tune.logger import Logger, DEFAULT_LOGGERS
import neptune

from ray.tune.result import TRAINING_ITERATION

class NeptuneLogger(Logger):
    """Neptune logger for Tune.

    Requires the trial config to contain the Neptune project name,
    or the proper environment variables to be set.
    """
    def _init(self):
        from neptune.sessions import Session

        # one Neptune experiment is created per Tune trial
        project = Session().get_project(self.config.get("neptune_project_name"))
        exp = project.create_experiment(name=self.config.get("neptune_experiment_name"))

        self.exp = exp

    def on_result(self, result):
        # numeric results go to metric charts, strings to text logs,
        # everything is indexed by the training iteration
        for name, value in result.items():
            if isinstance(value, (int, float)):
                self.exp.log_metric(name, x=result.get(TRAINING_ITERATION), y=value)
            elif isinstance(value, str):
                self.exp.log_text(name, x=result.get(TRAINING_ITERATION), y=value)
            else:
                continue

    def close(self):
        self.exp.stop()

def easy_objective(config):
    for i in range(20):
        result = dict(
            timesteps_total=i,
            mean_loss=(config["height"] - 14)**2 - abs(config["width"] - 3))
        tune.track.log(**result)
        time.sleep(0.02)
    tune.track.log(done=True)


if __name__ == "__main__":
    
    PROJECT_NAME = 'jakub-czakon/examples'
    
    trials = tune.run(
        easy_objective,
        name="neptune",
        num_samples=5,
        loggers=DEFAULT_LOGGERS + (NeptuneLogger, ),
        config={
            "neptune_project_name": PROJECT_NAME, 
            "neptune_experiment_name": "tune test runs",
            "width": tune.sample_from(
                lambda spec: 10 + int(90 * random.random())),
            "height": tune.sample_from(lambda spec: int(100 * random.random()))
        })

I am not quite sure what TRAINING_ITERATION is here; I just copied a similar logger and adjusted it.

I hope this helps!

PS
You can find us/me more often on our community spectrum chat.

Hi @kamil.kaczmarek and @jakub_czakon, thanks for your help and for the effort.

Jakub’s solution works well indeed. The only drawback I see comes from the weird state created by aborting one of the experiments from Neptune. Since one tune.run command creates multiple experiments, aborting one of them from Neptune leaves the others in a “Running” but “Not Responding” state, because aborting one experiment kills the whole tune process.

An alternative would be to log all the trials into one Neptune experiment and prevent metrics from overwriting each other by prepending the id (or name) of the trial to the log_name, but this method has many drawbacks as well…
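
For illustration, here is a rough sketch of what I mean (NeptunePrefixLogger is just a placeholder name, and I assume the single Neptune experiment was already created before tune.run with neptune.init / neptune.create_experiment):

import os

import neptune
from ray.tune.logger import Logger
from ray.tune.result import TRAINING_ITERATION

class NeptunePrefixLogger(Logger):
    """Logs every trial into the single, globally active Neptune experiment,
    prefixing metric names with the trial's log directory name."""

    def _init(self):
        # the trial log directory name is unique per trial, so it can serve as the prefix
        self.prefix = os.path.basename(self.logdir.rstrip("/"))

    def on_result(self, result):
        step = result.get(TRAINING_ITERATION)
        for name, value in result.items():
            if isinstance(value, (int, float)):
                # log to the experiment created before tune.run was called
                neptune.log_metric("{}/{}".format(self.prefix, name), x=step, y=value)

Each trial then shows up as its own set of curves (e.g. <trial_dir>/mean_loss) inside the same experiment.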

Thanks again,
Thomas

Aborting an experiment from Neptune aborts the script that started it.
However, that script also spun off the other experiments, so they get terminated too.
But since self.exp.stop() is never invoked for them, they are left there, not responding.

Now, there are 3 questions:

  • Would you like to abort all experiment runs from an experiment group that tune started?
    If so, then you can very easily in the UI filter all experiments with the experiment_name/experiment_group_id of your tune group and abort them.
  • Would you like to abort just this particular experiment and keep others going?
  • (Separate topic but still relevant) Would you like to monitor this meta-experiment as it was done here for another framework?

Hi Jakub,

Would you like to abort all experiment runs from an experiment group that tune started? If so, then you can very easily in the UI filter all experiments with the experiment_name / experiment_group_id of your tune group and abort them

Yes, aborting the other experiments of the same tune run from the UI is fairly easy (in reality they have already been aborted, but clicking the “Abort” icon fixes the label from “Not Responding” to “Aborted”). What do you refer to when you say “experiment_group_id”?

Would you like to abort just this particular experiment and keep others going?

I am fine with aborting all the experiments at once. I don’t think that tune allows users to abort a single trial anyway.

(Separate topic but still relevant) Would you like to monitor this meta-experiment as it was done here for another framework?

Yes, I intend to build a summarizer to compute and monitor some aggregated results from all the experiments of the group. I guess this would be done after tune.run finishes, by creating a separate Neptune experiment, computing the intended values, and logging them to Neptune?

Thanks,
T

Sorry for the late response, @thomas, I missed the notification.

@kamil.kaczmarek
Do you think we can produce a fix helping with going from Not Responding to Aborted?

Hmm, you can definitely do it after tune.run finishes by fetching all experiments with a given name:

from neptune.sessions import Session
project = Session().get_project('PROJECT/NAME')

experiments = project.get_experiments(name=['experiment_group_name'])

results = []
for exp in experiments:
    # get_numeric_channels_values returns a pandas DataFrame per experiment
    result = exp.get_numeric_channels_values('score_you_care_about')
    results.append(result)

...
#create some plots etc
...

summary_exp = neptune.create_experiment(name='summary_exp')
summary_exp.log_image('summary_charts', 'summary_img.png')
for metric in results_metrics:
    summary_exp.log_metric('summary_metrics', metric)

You could also try and pass a neptune callback to monitor some things as those are running but I would have to go deeper into ray.tune to understand if it allows you to do that.

Hi @thomas and @jakub_czakon,

Thanks for the info. I’ll check with the engineers on the behavior of changing the experiment status.

Best,
Kamil

Hi @jakub_czakon, I hope you are well!

I am following up on this thread and the discussion we had during your visit to Berlin in early 2020.
We are finally refactoring our Neptune logging following your recommendation to log tune trials in separate Neptune experiments.

FYI, the default trial executor of tune prevents setting the correct status for Neptune experiments. I opened a ticket for them here. As a workaround, it is easily fixed by inheriting from their base class and redefining the _stop_trial method.
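
Roughly, the workaround looks like this (a hedged sketch: the exact _stop_trial signature differs between Ray versions, so extra arguments are forwarded blindly, and how the trial's Neptune experiment handle is stored, trial.neptune_exp below, is a placeholder):

from ray.tune.ray_trial_executor import RayTrialExecutor

class NeptuneAwareTrialExecutor(RayTrialExecutor):
    def _stop_trial(self, trial, error=False, **kwargs):
        # close the trial's Neptune experiment before tune tears the trial down,
        # so it does not end up stuck in a "Running / Not Responding" state
        exp = getattr(trial, "neptune_exp", None)  # placeholder: however you keep the handle
        if exp is not None:
            exp.stop()
        return super()._stop_trial(trial, error=error, **kwargs)

It is then passed to tune.run via the trial_executor argument.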

Almost everything is working properly. My last struggle is that tune uses print statements to stream info to stdout (bad practice, but we can’t change it…). Every time a trial starts, the associated Neptune experiment redirects stdout by changing the value of sys.stdout directly. The tune logs are common to all trials, so I would like them to be uploaded to all Neptune experiments.

The only idea I have found so far is to write a custom StdStreamWithMultipleUploads that would collect the channel writers of the Neptune experiments associated with all the trials and write to each of them in its self.write method. It’s a bit hacky, since I need to fetch protected attributes of the Neptune experiment objects to do this, but I haven’t found anything better yet.
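
For reference, this is roughly what I have in mind (plain Python; the Neptune channel writers are treated as opaque file-like objects here, since they are internal):

import sys

class StdStreamWithMultipleUploads:
    """Tee-like stream: everything written to it is forwarded to several underlying streams."""

    def __init__(self, *streams):
        # e.g. the original sys.stdout plus one Neptune channel writer per running trial
        self._streams = streams

    def write(self, data):
        for stream in self._streams:
            stream.write(data)

    def flush(self):
        for stream in self._streams:
            if hasattr(stream, "flush"):
                stream.flush()

# usage: sys.stdout = StdStreamWithMultipleUploads(sys.__stdout__, writer_for_exp_1, writer_for_exp_2)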

Any idea on how to upload stdout to several experiments properly?

Many thanks!