
Thursday 17 October 2024

How to Plan and Run Machine Learning Experiments Systematically

 Machine learning experiments can take a long time. Hours, days, and even weeks in some cases.

This gives you a lot of time to think and plan for additional experiments to perform.

In addition, the average applied machine learning project may require tens to hundreds of discrete experiments in order to find a data preparation, model, and model configuration that gives good or great performance.

The drawn-out nature of the experiments means that you need to carefully plan and manage the order and type of experiments that you run.

You need to be systematic.

In this post, you will discover a simple approach to plan and manage your machine learning experiments.

With this approach, you will be able to:

  • Stay on top of the most important questions and findings in your project.
  • Keep track of what experiments you have completed and would like to run.
  • Zoom in on the data preparations, models, and model configurations that give the best performance.

Let’s dive in.

How to Plan and Run Machine Learning Experiments Systematically. Photo by Qfamily, some rights reserved.

Confusion of Hundreds of Experiments

I like to run experiments overnight. Lots of experiments.

This is so that when I wake up, I can check results, update my ideas of what is working (and what is not), and kick off the next round of experiments, then spend some time analyzing the findings.

I hate wasting time.

And I hate running experiments that do not get me closer to the goal of finding the most skillful model, given the time and resources I have available.

It is easy to lose track of where you’re up to. Especially after you have results, analysis, and findings from hundreds of experiments.

Poor management of your experiments can lead to bad situations where:

  • You’re watching experiments run.
  • You’re trying to come up with good ideas of experiments to run right after a current batch has finished.
  • You run an experiment that you had already run before.

You never want to be in any of these situations!

If you are on top of your game, then:

  • You know exactly what experiments you have run at a glance and what the findings were.
  • You have a long list of experiments to run, ordered by their expected payoff.
  • You have the time to dive into the analysis of results and think up new and wild ideas to try.

But how can we stay on top of hundreds of experiments?

Design and Run Experiments Systematically

One way that I have found to help me be systematic with experiments on a project is to use a spreadsheet.

Manage the experiments you have done, that are running, and that you want to run in a spreadsheet.

It is simple and effective.

Simple

It is simple in that I or anyone can access it from anywhere and see where we’re at.

I use Google Docs to host the spreadsheet.

There’s no code. No notebook. No fancy web app.

Just a spreadsheet.

Effective

It’s effective because it only contains the information needed with one line per experiment and one column for each piece of information to track on the experiment.

Experiments that are done can be separated from those that are planned.

Only planned experiments are set up and run, and their order ensures that the most important experiments are run first.

You will be surprised at how much such a simple approach can free up your time and get you thinking deeply about your project.

Example Spreadsheet

Let’s look at an example.

We can imagine a spreadsheet with the columns below.

These are just examples from the last project I worked on. I recommend adapting them to your own needs.

  • Sub-Project: A subproject may be a group of ideas you are exploring, a technique, a data preparation, and so on.
  • Context: The context may be the specific objective such as beating a baseline, tuning, a diagnostic, and so on.
  • Setup: The setup is the fixed configuration of the experiment.
  • Name: The name is the unique identifier, perhaps the filename of the script.
  • Parameter: The parameter is the thing being varied or looked at in the experiment.
  • Values: The value is the value or values of the parameter that are being explored in the experiment.
  • Status: The status is the status of the experiment, such as planned, running, or done.
  • Skill: The skill is the North Star metric that really matters on the project, like accuracy or error.
  • Question: The question is the motivating question the experiment seeks to address.
  • Finding: The finding is the one line summary of the outcome of the experiment, the answer to the question.

To make this concrete, below is a screenshot of a Google Doc spreadsheet with these column headings and a contrived example.

Systematic Experimental Record (example spreadsheet)
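For illustration only, a contrived row (not taken from the screenshot) might read: Sub-Project: data scaling; Context: beat baseline; Setup: 10-fold cross-validation; Name: scaling_standardize.py; Parameter: input scaling; Values: none, normalize, standardize; Status: done; Skill: accuracy; Question: Does scaling the inputs improve skill over the baseline?; Finding: standardizing the inputs improves on the unscaled baseline.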

I cannot say how much time this approach has saved me, or how many assumptions it has proven wrong in the pursuit of top results.

In fact, I’ve discovered that deep learning methods are often quite hostile to assumptions and defaults. Keep this in mind when designing experiments!

Get The Most Out of Your Experiments

Below are some tips that will help you get the most out of this simple approach on your project.

  • Brainstorm: Make the time to frequently review findings and list new questions and experiments to answer them.
  • Challenge: Challenge assumptions and challenge previous findings. Play the scientist and design experiments that would falsify your findings or expectations.
  • Sub-Projects: Consider the use of sub-projects to structure your investigation where you follow leads or investigate specific methods.
  • Experimental Order: Use the row order as a priority to ensure that the most important experiments are run first.
  • Deeper Analysis: Save deeper analysis of results and aggregated findings to another document; the spreadsheet is not the place.
  • Experiment Types: Don’t be afraid to mix in different experiment types such as grid searches, spot checks, and model diagnostics.

You will know that this approach is working well when:

  • You are scouring API documentation and papers for more ideas of things to try.
  • You have far more experiments queued up than resources to run them.
  • You are thinking seriously about hiring a ton more EC2 instances.

Summary

In this post, you discovered how you can effectively manage hundreds of experiments that have run, are running, and that you want to run in a spreadsheet.

You discovered that a simple spreadsheet can help you:

  • Keep track of what experiments you have run and what you discovered.
  • Keep track of what experiments you want to run and what questions they will answer.
  • Zoom in on the most effective data preparation, model, and model configuration for your predictive modeling problem.

Do you have any questions about this approach? Have you done something similar yourself?
Let me know in the comments below.

Wednesday 16 October 2024

10 Command Line Recipes for Deep Learning on Amazon Web Services

 Running large deep learning processes on Amazon Web Services EC2 is a cheap and effective way to learn and develop models.

For just a few dollars you can get access to tens of gigabytes of RAM, tens of CPU cores, and multiple GPUs. I highly recommend it.

If you are new to EC2 or the Linux command line, there are a suite of commands that you will find invaluable when running your deep learning scripts in the cloud.

In this tutorial, you will discover my private list of the 10 commands I use every time I use EC2 to fit large deep learning models.

After reading this post, you will know:

  • How to copy your data to and from your EC2 instances.
  • How to set up your scripts to run for days, weeks, or months safely.
  • How to monitor processes, the system, and GPU performance.

    Overview

    The commands presented in this post assume that your AWS EC2 instance is already running.

    For consistency, a few other assumptions are made:

    • Your server IP address is 54.218.86.47; change this to the IP address of your server instance.
    • Your username is ec2-user; change this to your user name on your instance.
    • Your SSH key is located in ~/.ssh/ and has the filename aws-keypair.pem; change this to your SSH key location and filename.
    • You are working with Python scripts.

      1. Log in from Your Workstation to the Server

      You must log into the server before you can do anything useful.

      You can log in easily using the SSH secure shell.

      I recommend storing your SSH key in your ~/.ssh/ directory with a useful name. I use the name aws-keypair.pem. Remember: the file must have the permissions 600.
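      If needed, a quick way to set those permissions on the key file is:

          chmod 600 ~/.ssh/aws-keypair.pem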

      The following command will log you into your server instance. Remember to change the username and IP address to your relevant username and server instance IP address.
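      For example, using the key location, username, and IP address assumed in the overview above, the login command would look like this:

          ssh -i ~/.ssh/aws-keypair.pem ec2-user@54.218.86.47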

      2. Copy Files from Your Workstation to the Server

      You copy files from your workstation to your server instance using secure copy (scp).

      The example below, run on your workstation, will copy the script.py Python script in the local directory on your workstation to your server instance.
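      A command along these lines, run from your workstation with the same assumed key and address, would copy script.py to the home directory on the server:

          scp -i ~/.ssh/aws-keypair.pem script.py ec2-user@54.218.86.47:~/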

      3. Run Script as Background Process on the Server

      You can run your Python script as a background process.

      Further, you can run it in such a way that it will ignore signals from other processes, ignore any standard input (stdin), and forward all output and errors to a log file.

      In my experience, all of this is required for long-running scripts for fitting large deep learning models.

      This assumes you are running the script.py Python script located in the /home/ec2-user/ directory and that you want the output of this script forwarded to the file script.py.log located in the same directory.
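      One way to achieve this combination with nohup and output redirection is:

          nohup python /home/ec2-user/script.py > /home/ec2-user/script.py.log < /dev/null 2>&1 &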

      Tune for your needs.

      If this is your first experience with nohup, you can learn more here:

      If this is your first experience with redirecting standard input (stdin), standard output (stdout), and standard error (stderr), you can learn more here:

      4. Run Script on a Specific GPU on the Server

      I recommend running multiple scripts at the same time, if your AWS EC2 instance can handle it for your problem.

      For example, your chosen EC2 instance may have 4 GPUs, and you could choose to run one script on each.

      With CUDA, you can specify which GPU device to use with the environment variable CUDA_VISIBLE_DEVICES.

      We can use the same command above to run the script and specify the specific GPU device to use as follows:
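      For example, to pin the same background command to the first GPU (device 0), something like this should work:

          CUDA_VISIBLE_DEVICES=0 nohup python /home/ec2-user/script.py > /home/ec2-user/script.py.log < /dev/null 2>&1 &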

      If you have 4 GPU devices on your instance, you can specify CUDA_VISIBLE_DEVICES=0 to CUDA_VISIBLE_DEVICES=3.

      I expect this would work for the Theano backend, but I have only tested it with the TensorFlow backend for Keras.

      You can learn more about CUDA_VISIBLE_DEVICES in the post:

      5. Monitor Script Output on the Server

      You can monitor the output of your script while it is running.

      This may be useful if you output a score each epoch or after each algorithm run.

      This example will list the last few lines of your script log file and update the output as new lines are added to the script.
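      One way to do this is with tail in follow mode, assuming the log file name used above:

          tail -f script.py.log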

      Amazon may aggressively close your terminal if the screen does not get new output in a while.

      An alternative is to use the watch command. I have found Amazon will keep this terminal open:
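      For example, the following re-runs tail on the log file every two seconds (the default watch interval):

          watch "tail script.py.log"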

      I have found that standard output (stdout) from Python scripts does not appear to be updated frequently.

      I don’t know if this is an EC2 thing or a Python thing. This means you may not see the output in the log updated often. It seems to be buffered and output when the buffer hits fixed sizes or at the end of a run.
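      If buffering is the cause, one option worth trying is to run the script with python -u (or set PYTHONUNBUFFERED=1) so that output is flushed to the log as it is produced.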

      Do you know more about this?
      Let me know in the comments below.

      6. Monitor System and Process Performance on the Server

      It is a good idea to monitor the EC2 system performance. Especially the amount of RAM you are using and have left.

      You can do this using the top command that will update every few seconds.
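      For example, running top with no arguments gives a continuously updating view of processes, CPU, and memory:

          top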

      You can also monitor the system and just your process, if you know its process identifier (PID).
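      For example, assuming a hypothetical process identifier of 1244, you can limit top to just that process:

          top -p 1244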

      7. Monitor GPU Performance on the Server

      It is a good idea to keep an eye on your GPU performance.

      Again, keep an eye on GPU utilization and GPU RAM usage, and on which GPUs are busy, especially if you plan on running multiple scripts in parallel.

      You can use the nvidia-smi command to keep an eye on GPU usage. I like to use the watch command that keeps the terminal open and clears the screen for each new result.
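      Combining the two might look like this:

          watch "nvidia-smi"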

      8. Check What Scripts Are Still Running on the Server

      It is also important to keep an eye on which scripts are still running.

      You can do this with the ps command.

      Again, I like to use the watch command to keep the terminal open.
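      For example, assuming your jobs are Python scripts, something like this will list the matching processes and refresh the view:

          watch "ps -ef | grep python"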

      9. Edit a File on Server

      I recommend not editing files on the server unless you really have to.

      Nevertheless, you can edit a file in place using the vi editor.

      The example below will open your script in vi.
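      For example, assuming the script lives in your home directory on the server:

          vi ~/script.py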

      Of course, you can use your favorite command line editor, like emacs; this note is really for you if you are new to the Unix command line.

      If this is your first exposure to vi, you can learn more here:

      10. Download Files from the Server to Your Workstation

      I recommend saving your model and any results and graphs explicitly to new and separate files as part of your script.

      You can download these files from your server instance to your workstation using secure copy (scp).

      The example below is run from your workstation and will copy all PNG files from your home directory to your workstation.
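      A command along these lines, run from your workstation with the assumed key and address, should do it (the quotes stop your local shell from expanding the wildcard):

          scp -i ~/.ssh/aws-keypair.pem "ec2-user@54.218.86.47:~/*.png" .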

      Additional Tips and Tricks

      This section lists some additional tips when working heavily on AWS EC2.

      • Run multiple scripts at a time. I recommend selecting hardware that has multiple GPUs and running multiple scripts at a time to make full use of the platform.
      • Write and edit scripts on your workstation only. Treat EC2 as a pseudo-production environment and only ever copy scripts and data there to run. Do all development on your workstation and write small tests of your code to ensure it will work as expected.
      • Save script outputs explicitly to a file. Save results, graphs, and models to files that can be downloaded later to your workstation for analysis and application.
      • Use the watch command. Amazon aggressively kills terminal sessions that have no activity. You can keep an eye on things using the watch command, which sends data frequently enough to keep the terminal open.
      • Run commands from your workstation. Any of the commands listed above intended to be run on the server can also be run from your workstation by prefixing the command with “ssh -i ~/.ssh/aws-keypair.pem ec2-user@54.218.86.47” and quoting the command you want to run; an example is shown below this list. This can be useful to check in on processes throughout the day.
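      For example, a hypothetical one-off check on running Python processes, issued from your workstation, might look like this:

          ssh -i ~/.ssh/aws-keypair.pem ec2-user@54.218.86.47 "ps -ef | grep python"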

      Summary

      In this tutorial, you discovered the 10 commands that I use every time I am training large deep learning models on AWS EC2 instances with GPUs.

      Specifically, you learned:

      • How to copy your data to and from your EC2 instances.
      • How to set up your scripts to run for days, weeks, or months safely.
      • How to monitor processes, the system, and GPU performance.

      Do you have any questions?
      Ask your questions in the comments below and I will do my best to answer.
