Neptune and hyperparameter search with Tune

Hi,

I am using Neptune with neptune-tensorboard for a machine learning project.
I would like to use tune for hyperparameter search for my model. For a grid search, tune launches runs (sequentially or in parallel) with different hyperparameters and writes the checkpoints & results to separate directories so that TensorBoard recognizes the distinct runs.

My problem is that, since Neptune isn’t based on directories, the logs of the different runs overwrite each other and we end up with a single curve per logged metric, as if only one run had happened.

Do you know how to avoid this issue?
The only way I see would be to stop using neptune-tensorboard and write a custom logger from scratch that logs the metrics from the different runs under unique names. That way, each metric would be displayed as a distinct plot.

Many thanks,

Hi @thomas,

Thanks for reaching out!

I think that a simple fix would be to first run your experiments with tune and log them with TensorBoard only (no Neptune code at this point).
Then, you use our integration with TensorBoard to sync your runs to Neptune (for analysis, comparison, sharing, etc.). So instead of integrating in the code, you sync from the command line:
neptune tensorboard /path/to/logdir --project USERNAME/PROJECT_NAME
This command searches the directory and logs all the experiments it finds.
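
For example, step one might look roughly like this (the paths, run name, and toy objective below are just placeholders; by default Tune writes TensorBoard event files into per-trial subdirectories under local_dir):

from ray import tune

def objective(config):
    # Placeholder training function: report a dummy metric for a few steps.
    for step in range(20):
        tune.track.log(mean_loss=(config["lr"] - 0.1) ** 2)

tune.run(
    objective,
    name="grid_example",               # becomes a subdirectory of local_dir
    local_dir="/path/to/logdir",       # the directory you sync afterwards
    config={"lr": tune.grid_search([0.01, 0.1, 1.0])})

Once that run finishes, the sync command above should pick up each trial’s subdirectory as a separate Neptune experiment.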

What do you think?

Best,
Kamil

Hi @thomas

I checked some examples of Tune loggers, and it looks like you can write a simple NeptuneLogger that gets the job done.

I wrote something quickly.
Here is a full example:

import time
import random

from ray import tune
from ray.tune.logger import Logger, DEFAULT_LOGGERS
import neptune

from ray.tune.result import TRAINING_ITERATION

class NeptuneLogger(Logger):
    """Ray Tune logger that sends trial results to Neptune.

    Expects the Tune config to contain neptune_project_name and
    neptune_experiment_name; the API token is taken from the
    NEPTUNE_API_TOKEN environment variable.
    """

    def _init(self):
        from neptune.sessions import Session

        # One Neptune experiment is created per Tune trial.
        project = Session().get_project(self.config.get("neptune_project_name"))
        self.exp = project.create_experiment(
            name=self.config.get("neptune_experiment_name"))

    def on_result(self, result):
        # Log numeric results as metrics and strings as text,
        # all against the current training iteration.
        step = result.get(TRAINING_ITERATION)
        for name, value in result.items():
            if isinstance(value, (int, float)):
                self.exp.log_metric(name, x=step, y=value)
            elif isinstance(value, str):
                self.exp.log_text(name, x=step, y=value)
                
    def close(self):
        self.exp.stop()

def easy_objective(config):
    for i in range(20):
        result = dict(
            timesteps_total=i,
            mean_loss=(config["height"] - 14)**2 - abs(config["width"] - 3))
        tune.track.log(**result)
        time.sleep(0.02)
    tune.track.log(done=True)


if __name__ == "__main__":
    
    PROJECT_NAME = 'jakub-czakon/examples'
    
    trials = tune.run(
        easy_objective,
        name="neptune",
        num_samples=5,
        loggers=DEFAULT_LOGGERS + (NeptuneLogger, ),
        config={
            "neptune_project_name": PROJECT_NAME, 
            "neptune_experiment_name": "tune test runs",
            "width": tune.sample_from(
                lambda spec: 10 + int(90 * random.random())),
            "height": tune.sample_from(lambda spec: int(100 * random.random()))
        })

I am not quite sure what TRAINING_ITERATION means here; I just copied a similar logger and adjusted it.

I hope this helps!

PS
You can find us/me more often on our community spectrum chat.

Hi @kamil.kaczmarek and @jakub_czakon, thanks for your help and for the effort.

Jakub’s solution works well indeed. The only drawback I see is the weird state created by aborting one of the experiments from Neptune: since a single tune.run command creates multiple experiments, aborting one of them from Neptune kills the whole tune process and leaves the other experiments in a Running but not Responding state.

An alternative would be to log all the trials into one Neptune experiment and prevent metrics from overwriting each other by prepending the id (or name) of the trial to the log_name, but this method has many drawbacks as well…
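
Roughly what I had in mind, just as a sketch (I am assuming here that Tune hands the Trial object to the logger with a trial_id attribute, and that the loggers run in the driver process so they can share one experiment; the class and helper names are mine):

from ray.tune.logger import Logger
from ray.tune.result import TRAINING_ITERATION
from neptune.sessions import Session

_SHARED_EXP = None  # a single Neptune experiment shared by all trials' loggers


def _get_shared_experiment(project_name, experiment_name):
    # Lazily create the one shared experiment on first use.
    global _SHARED_EXP
    if _SHARED_EXP is None:
        project = Session().get_project(project_name)
        _SHARED_EXP = project.create_experiment(name=experiment_name)
    return _SHARED_EXP


class SingleExperimentNeptuneLogger(Logger):
    def _init(self):
        self.exp = _get_shared_experiment(
            self.config.get("neptune_project_name"),
            self.config.get("neptune_experiment_name"))
        # Assumption: Tune passes the Trial object and it exposes trial_id.
        self.prefix = self.trial.trial_id if self.trial else "trial"

    def on_result(self, result):
        step = result.get(TRAINING_ITERATION)
        for name, value in result.items():
            if isinstance(value, (int, float)):
                # Prefix with the trial id so curves from different trials
                # don't overwrite each other in the single experiment.
                self.exp.log_metric("{}_{}".format(self.prefix, name), x=step, y=value)

    def close(self):
        # The shared experiment is stopped once, after tune.run finishes.
        pass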

Thanks again,
Thomas

Aborting an experiment from Neptune aborts the script that started it.
However, that script also spun off the other experiments, so they get terminated too.
But since self.exp.stop() is never invoked for them, they are left there, not responding.
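
A partial workaround, if the driver process gets the chance to shut down cleanly (atexit does not run on a hard kill), could be to keep track of the experiments the logger opens and stop whatever is still open in an atexit hook. A rough sketch, not tested:

import atexit

_OPEN_EXPERIMENTS = []  # experiments created by NeptuneLogger._init


def _stop_open_experiments():
    # Best-effort cleanup so experiments don't stay "Not Responding" in Neptune.
    for exp in _OPEN_EXPERIMENTS:
        try:
            exp.stop()
        except Exception:
            pass


atexit.register(_stop_open_experiments)

# In NeptuneLogger._init, after creating the experiment:
#     _OPEN_EXPERIMENTS.append(self.exp)
# and in close():
#     _OPEN_EXPERIMENTS.remove(self.exp)
#     self.exp.stop()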

Now, there are 3 questions:

  • Would you like to abort all experiment runs from an experiment group that tune started?
    If so, then you can very easily filter all experiments in the UI by the experiment_name/experiment_group_id of your tune group and abort them.
  • Would you like to abort just this particular experiment and keep others going?
  • (Separate topic but still relevant) Would you like to monitor this meta-experiment as it was done here for another framework?

Hi Jakub,

Would you like to abort all experiment runs from an experiment group that tune started? If so, then you can very easily filter all experiments in the UI by the experiment_name / experiment_group_id of your tune group and abort them

Yes, aborting the other experiments of the same tune run from the UI is fairly easy (in reality they’ve already been aborted; clicking the “Abort” icon just fixes the label from “Not Responding” to “Aborted”). What do you refer to when you say “experiment_group_id”?

Would you like to abort just this particular experiment and keep others going?

I am fine with aborting all the experiments at once. I don’t think that tune allows users to abort a single trial anyway.

(Separate topic but still relevant) Would you like to monitor this meta-experiment as it was done here for another framework?

Yes, I intend to build a summarizer to compute and monitor some aggregated results from all the experiments in the group. I guess this would be done after tune.run finishes, by creating a separate Neptune experiment, computing the intended values, and logging them to Neptune?

Thanks,
T

Sorry for the late response, @thomas, I missed the notification.

@kamil.kaczmarek
Do you think we can produce a fix that helps move experiments from Not Responding to Aborted?

Hmm, you can definitely do it after tune.run finishes by fetching all experiments with a given name:

import neptune
from neptune.sessions import Session

project = Session().get_project('PROJECT/NAME')

experiments = project.get_experiments(name=['experiment_group_name'])

results = []
for exp in experiments:
    result = exp.get_numeric_channels_values('score_you_care_about')
    results.append(result)

...
#create some plots etc
...

neptune.init('PROJECT/NAME')  # initialize the client before creating the summary experiment
summary_exp = neptune.create_experiment(name='summary_exp')
summary_exp.log_image('summary_charts', 'summary_img.png')
for metric in results_metrics:  # aggregates you computed from `results` above
    summary_exp.log_metric('summary_metrics', metric)

You could also try to pass a Neptune callback to monitor some things while the trials are running, but I would have to dig deeper into ray.tune to see whether it allows that.

Hi @thomas and @jakub_czakon,

Thanks for the info. I’ll check with our engineers about the behavior of changing experiment status.

Best,
Kamil