Distributed Hyperparameter Optimization

Do you like creating neural networks but hate the time it takes to optimize them, in particular the process of tuning hyperparameters? Don't you hate burning out your RAM while training the 90th variation of your model in your quest for better accuracy? Does code like this make your eyes bleed? (Real code from one of my friends' projects:)

for nn in test_num_nodes_per_hidden_layer:
    for a in test_activation:
        for ki in test_kernel_initializer:
            for of in test_optimizer_function:
                for lf in test_loss_function:
                    for df in test_discount_factor:
                        for r in test_reward:
                            for p in test_punishment:
                                for dr in test_draw_reward:
                                    for tfq in test_transfer_frequency:
                                        for ed in test_epsilon_decay:
                                            for edt in test_epsilon_decay_type:
                                                for me in test_min_epsilon:
                                                    for i in test_include_whos_turn:
                                                        for ms in test_memory_size:
                                                            for (
                                                                rbs
                                                            ) in test_replay_batch_size:
                                                                n = "{}".format(
                                                                    datetime.datetime.now().timestamp()
                                                                )
                                                                entry = pandas.DataFrame.from_dict(
                                                                    {
                                                                        "name": [n],
                                                                        "test_include_whos_turn": [
                                                                            i
                                                                        ],
                                                                        "test_num_nodes_per_hidden_layer": [
                                                                            nn
                                                                        ],
                                                                        "test_activation": [
                                                                            a
                                                                        ],
                                                                        "test_kernel_initializer": [
                                                                            ki
                                                                        ],
                                                                        "test_learning_rate": [
                                                                            of.learning_rate
                                                                        ],
                                                                        "test_optimizer_function": [
                                                                            of
                                                                        ],
                                                                        "test_loss_function": [
                                                                            lf
                                                                        ],
                                                                        "test_discount_factor": [
                                                                            df
                                                                        ],
                                                                        "test_reward": [
                                                                            r
                                                                        ],
                                                                        "test_punishment": [
                                                                            p
                                                                        ],
                                                                        "test_draw_reward": [
                                                                            dr
                                                                        ],
                                                                        "test_transfer_frequency": [
                                                                            tfq
                                                                        ],
                                                                        "test_epsilon_decay": [
                                                                            ed
                                                                        ],
                                                                        "test_epsilon_decay_type": [
                                                                            edt
                                                                        ],
                                                                        "test_memory_size": [
                                                                            ms
                                                                        ],
                                                                        "test_replay_batch_size": [
                                                                            rbs
                                                                        ],
                                                                    }
                                                                )
                                                                results = pandas.concat(
                                                                    [results, entry],
                                                                    ignore_index=True,
                                                                )
                                                                entry.to_csv(
                                                                    "results.csv",
                                                                    mode="a",
                                                                    header=False,
                                                                )
                                                                m = model.Model(
                                                                    model_name=n,
                                                                    model=model.create_model(
                                                                        include_whos_turn=i,
                                                                        num_nodes_per_hidden_layer=nn,
                                                                        activation=a,
                                                                        kernel_initializer=ki,
                                                                        optimizer_function=of,
                                                                        loss_function=lf,
                                                                    ),
                                                                    discount_factor=df,
                                                                    num_episodes=test_num_episodes,
                                                                    include_whos_turn=i,
                                                                    reward=r,
                                                                    punishment=p,
                                                                    draw_reward=dr,
                                                                    transfer_frequency=tfq,
                                                                    epsilon_decay=ed,
                                                                    epsilon_decay_type=edt,
                                                                    min_epsilon=me,
                                                                )
                                                                m.train(
                                                                    with_output=False,
                                                                    save_plots=True,
                                                                )

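For the record, a nesting tower like the one above can be flattened into a single loop with itertools.product. This is only a sketch: the parameter lists and the training call are hypothetical placeholders, not the project's actual names.

```python
import itertools

# Hypothetical stand-ins for a few of the real hyperparameter lists above.
grid = {
    "num_nodes_per_hidden_layer": [32, 64],
    "activation": ["relu", "tanh"],
    "epsilon_decay": [0.99, 0.995],
}

# itertools.product yields every combination of the value lists,
# replacing one level of nesting per hyperparameter.
combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

for params in combos:
    # train_model(**params)  # placeholder for the real training call
    pass
```

The same sixteen hyperparameters would still produce the same exponential number of combinations, but at least the code stays flat no matter how many parameters you add.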
Us too! That's why we created this tool that allows broke students (just like you) to distribute your AI training workloads over all of your friends' computers, without selling your organs to AWS to afford a few hours of SageMaker!

Each client needs Python installed and must run the bash install script, which installs the necessary ML dependencies. Then all they need to run is the worker.py file, and they will automatically start receiving hyperparameters to optimize! Once they're connected, a WebSocket server written in Go puts all of your little minions to work. Hard work! As soon as a worker finishes its job, it is immediately sent the next set of hyperparameters to train. Finally, a convenient frontend written in React lets you configure which combinations of hyperparameters you want to optimize and trigger the start of training. It also provides real-time feedback with information delivered directly from the Go WebSocket server.
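The wire format between the Go server and the Python workers can be as simple as JSON messages tagged with a type field. The field names below are illustrative assumptions, not the project's exact schema.

```python
import json

def encode_job(job_id, hyperparams):
    """Serialize a job assignment the server might push to a worker.

    Both job_id and the hyperparams dict are hypothetical fields,
    shown only to illustrate a type-tagged JSON protocol.
    """
    return json.dumps({"type": "job", "id": job_id, "hyperparams": hyperparams})

def decode_message(raw):
    """Parse an incoming message and dispatch on its type tag."""
    msg = json.loads(raw)
    if msg["type"] == "job":
        return ("train", msg["id"], msg["hyperparams"])
    if msg["type"] == "result":
        return ("record", msg["id"], msg["accuracy"])
    raise ValueError("unknown message type: " + msg["type"])
```

Because JSON has native support in Go (encoding/json), Python (json), and JavaScript, a type-tagged scheme like this lets all three sides of the system agree on message shapes without any shared code.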

We initially tried implementing a genetic algorithm in Python instead of the brute-force "grid search" approach, hoping it would converge on the best hyperparameters; in practice, though, we found the simple grid search more effective, as the genetic algorithm was too unstable to converge within a reasonable amount of time. Shout-out to Adrian for spending a painful 12 hours debugging it and trying to make it work!
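For the curious, the genetic-algorithm idea looked roughly like this. This is a simplified sketch under stated assumptions: the fitness function here is a toy stand-in, whereas the real version scored each candidate by actually training a model, which is exactly where the instability crept in.

```python
import random

def evolve(grid, fitness, generations=20, pop_size=10, mutation_rate=0.3, seed=0):
    """Toy genetic search over a hyperparameter grid.

    grid: dict mapping hyperparameter name -> list of allowed values.
    fitness: callable scoring a candidate dict (higher is better).
    """
    rng = random.Random(seed)
    sample = lambda: {k: rng.choice(v) for k, v in grid.items()}
    population = [sample() for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the fitter half of the population as survivors.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Crossover: each gene comes from one of the two parents.
            child = {k: rng.choice([a[k], b[k]]) for k in grid}
            # Mutation: occasionally re-roll one hyperparameter.
            if rng.random() < mutation_rate:
                k = rng.choice(list(grid))
                child[k] = rng.choice(grid[k])
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)
```

With a cheap, deterministic fitness function this converges quickly; with a noisy one (like validation accuracy after a short training run), selection keeps getting fooled by lucky runs, which is the instability described above.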

Briefly, some challenges we ran into were:

  1. Learning concurrency patterns while attempting to write efficient Go
  2. Designing our own protocol for WebSocket messages
  3. Transferring data across three different programming languages through WebSockets
  4. Hosting the final project