CMPE591: Deep Learning in Robotics

Preparing the Environment

We recommend creating a virtual environment for this course. You can use Anaconda or Miniconda (a smaller distribution) for this purpose. Alternatively, you can use Mamba (a faster drop-in replacement for conda). After downloading the installer for your choice, run it with the following command in your terminal:
$ bash <downloaded_script>.sh
After the installation, you can create a virtual environment by running the following command:
# For Anaconda
$ conda create -n <virtual_environment_name> python=3.9
$ conda activate <virtual_environment_name>

# For Mamba
$ mamba create -n <virtual_environment_name> python=3.9
$ mamba activate <virtual_environment_name>
You will need to run mamba activate <virtual_environment_name> (or the conda equivalent) every time you open a new terminal. You can deactivate the virtual environment by running mamba deactivate (or conda deactivate).

We will use MuJoCo and dm_control for our simulation environment. You can install them by running:
$ pip install dm_control  # dm_control automatically installs mujoco
Also install some other dependencies:
$ pip install git+
$ pip install pyyaml
$ mamba install numpy  # or conda install numpy
Additionally, we will use PyTorch for training our models; check out the PyTorch installation instructions for your platform. After installing PyTorch, clone the homework repository by running:
$ git clone
Homeworks will be released in the src folder of the repository. You can run the demo code by running:
$ cd
$ python
You should see the demo output.

homework1 and mujoco_menagerie will be common throughout the homeworks, and homework<x>.py will be added each week. It is suggested that you use Visual Studio Code with GitHub Copilot for easier development (though double-check everything that Copilot suggests). GitHub Copilot is free for students.

Homework 1 (Training a DNN using PyTorch)

In this homework, you will train a deep neural network that estimates the object's position given the executed action and the state (a top-down view of the environment). Below are some example states.
There are two object types (cube and sphere) with random sizes between 2 cm and 3 cm, and the robot pushes the object in one of four main directions, chosen at random. The resulting object position depends on the object's type and size. Assuming that you have already cloned the repository, you can run the following code to sample data:
import numpy as np
from homework1 import Hw1Env

env = Hw1Env(render_mode="gui")
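for _ in range(100):
    # Start from a fresh scene; assumes Hw1Env exposes a Gym-style reset().
    env.reset()
    action_id = np.random.randint(4)  # one of the four push directions
    _, img_before = env.state()  # image before the push
    # Execute the sampled action; assumes Hw1Env exposes step(action_id)
    # as Hw2Env does. Without it, both images would show the same scene.
    env.step(action_id)
    pos_after, img_after = env.state()  # object position and image after the push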
You might also want to check the main part of the homework source file to see how to collect data with multiple processes. Sample 1000 data points and train
  1. A plain multi-layer perceptron (MLP)
  2. A convolutional neural network (CNN)
using PyTorch.
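
The two models above could be sketched as follows. This is only a starting point under stated assumptions, not the reference solution: the image shape (3, 128, 128), the 2-D position target, the one-hot action encoding, and all class and function names are hypothetical choices, so check them against what env.state() actually returns.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ACTIONS = 4               # four push directions, as in the sampling code
IMG_SHAPE = (3, 128, 128)   # assumed image shape; verify with env.state()


class PositionMLP(nn.Module):
    """Flattens the image, appends a one-hot action, and regresses (x, y)."""

    def __init__(self, hidden=256):
        super().__init__()
        in_dim = IMG_SHAPE[0] * IMG_SHAPE[1] * IMG_SHAPE[2] + N_ACTIONS
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # assumed 2-D object position
        )

    def forward(self, img, action_id):
        onehot = F.one_hot(action_id, N_ACTIONS).float()
        return self.net(torch.cat([img.flatten(1), onehot], dim=1))


class PositionCNN(nn.Module):
    """Convolutional features, pooled, then combined with the action."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),    # 128 -> 64
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),  # 32 -> 16
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (B, 128)
        )
        self.head = nn.Sequential(
            nn.Linear(128 + N_ACTIONS, 64), nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, img, action_id):
        onehot = F.one_hot(action_id, N_ACTIONS).float()
        return self.head(torch.cat([self.conv(img), onehot], dim=1))


def train_step(model, optimizer, img, action_id, target_pos):
    """One gradient step on the mean squared position error."""
    optimizer.zero_grad()
    loss = F.mse_loss(model(img, action_id), target_pos)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Both models take a float image batch and an integer (int64) tensor of action ids, so the same train_step can be reused to compare them.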


Instead of predicting the object's position, predict the raw image of the environment after the action.
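
One possible way to set up image prediction is a small convolutional encoder-decoder that receives the action as extra feature channels. The sketch below is an assumption-laden illustration: the layer sizes, the way the action is injected, the class name ImagePredictor, and the premise that images are scaled to [0, 1] are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ACTIONS = 4  # four push directions, as in the sampling code


class ImagePredictor(nn.Module):
    """Predicts the image after the action from the image before it."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),   # H -> H/2
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),  # H/2 -> H/4
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64 + N_ACTIONS, 32, 4, 2, 1), nn.ReLU(),  # H/4 -> H/2
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid(),            # H/2 -> H
        )

    def forward(self, img, action_id):
        feat = self.encoder(img)
        onehot = F.one_hot(action_id, N_ACTIONS).float()
        # Broadcast the one-hot action over the spatial grid of the features.
        action_map = onehot[:, :, None, None].expand(
            -1, -1, feat.shape[2], feat.shape[3])
        return self.decoder(torch.cat([feat, action_map], dim=1))
```

The output has the same shape as the input image, so a pixel-wise loss such as F.mse_loss(model(img_before, action_id), img_after) can be used for training.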

Homework 2 (Deep Q Network)

In this homework, you will train a deep Q network (DQN) that learns to push the object to the desired position. There have been a couple of updates in the environment file, so make sure to pull the latest version of the repository by running git pull. You can run the following code to interact with the environment (also see the updated homework file after pulling the latest version):
import numpy as np

from homework2 import Hw2Env

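N_ACTIONS = 8  # number of discrete actions (example value)
env = Hw2Env(n_actions=N_ACTIONS, render_mode="gui")
for episode in range(10):
    env.reset()  # start each episode from a fresh state (assumes a Gym-style reset())
    done = False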
    cum_reward = 0.0
    while not done:
        action = np.random.randint(N_ACTIONS)
        state, reward, is_terminal, is_truncated = env.step(action)
        done = is_terminal or is_truncated
        cum_reward += reward
    print(f"Episode={episode}, reward={cum_reward}")
If you want to work on a remote machine with no screen, make sure you set the following environment variable:
export MUJOCO_GL=egl
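
If you prefer not to rely on the shell configuration, the same variable can be set from Python. This is a minimal sketch; it assumes the assignment runs before dm_control is imported, since dm_control selects its rendering backend when it is imported.

```python
import os

# Select the EGL (headless) rendering backend.
# This must come before any `import dm_control` in your script.
os.environ["MUJOCO_GL"] = "egl"
```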
The reward is defined as 1/distance(ee, obj) + 1/distance(obj, goal), where ee is the end-effector position, obj is the object position, and goal is the goal position. Tuning hyperparameters can be tricky, so you can use the following architecture and hyperparameters:
    Conv2d(3, 32, 4, 2, 1), ReLU(),  # (-1, 3, 128, 128) -> (-1, 32, 64, 64)
    Conv2d(32, 64, 4, 2, 1), ReLU(),  # (-1, 32, 64, 64) -> (-1, 64, 32, 32)
    Conv2d(64, 128, 4, 2, 1), ReLU(),  # (-1, 64, 32, 32) -> (-1, 128, 16, 16)
    Conv2d(128, 256, 4, 2, 1), ReLU(),  # (-1, 128, 16, 16) -> (-1, 256, 8, 8)
    Conv2d(256, 512, 4, 2, 1), ReLU(),  # (-1, 256, 8, 8) -> (-1, 512, 4, 4)
    Avg([2, 3]),  # average pooling over the spatial dimensions (-1, 512, 4, 4) -> (-1, 512)
    Linear(512, N_ACTIONS)
GAMMA = 0.99
EPSILON_DECAY = 0.999 # multiply epsilon by 0.999 every EPSILON_DECAY_ITER updates
EPSILON_DECAY_ITER = 10 # decay epsilon every 10 updates
MIN_EPSILON = 0.1 # minimum epsilon
UPDATE_FREQ = 4 # update the network every 4 steps
TARGET_NETWORK_UPDATE_FREQ = 100 # update the target network every 100 steps
This set is not definitive, but it seems to converge to a good policy. Feel free to share your good set of hyperparameters with the class. Plot the reward over episodes and add it to your submission. You can use high_level_state to get a higher-level state, instead of raw pixels. This might speed up your experimentation as you would not need to train a convolutional network.
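
The architecture and hyperparameters listed above could be written in PyTorch roughly as follows. Treat this as a sketch rather than the reference implementation: Avg([2, 3]) is rendered here as a small hypothetical SpatialMean module, and make_q_net, select_action, decay_epsilon, and dqn_loss are illustrative helper names that are not part of the repository. The smooth L1 (Huber) loss is one common choice for DQN, not something the homework prescribes.

```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialMean(nn.Module):
    """Averages over the spatial dimensions: (B, C, H, W) -> (B, C)."""

    def forward(self, x):
        return x.mean(dim=[2, 3])


def make_q_net(n_actions):
    """Builds the convolutional Q-network from the listing above."""
    return nn.Sequential(
        nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
        nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
        nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
        nn.Conv2d(128, 256, 4, 2, 1), nn.ReLU(),
        nn.Conv2d(256, 512, 4, 2, 1), nn.ReLU(),
        SpatialMean(),
        nn.Linear(512, n_actions),
    )


def select_action(q_net, state, epsilon, n_actions):
    """Epsilon-greedy: random action with probability epsilon, else argmax Q."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())


def decay_epsilon(epsilon, n_updates, decay=0.999, decay_iter=10, min_eps=0.1):
    """Multiplies epsilon by `decay` every `decay_iter` updates, floored at `min_eps`."""
    if n_updates % decay_iter == 0:
        epsilon = max(min_eps, epsilon * decay)
    return epsilon


def dqn_loss(q_net, target_net, s, a, r, s_next, done, gamma=0.99):
    """One-step TD loss; `a` is an int64 tensor, `done` is a 0/1 float tensor."""
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target from the (periodically synced) target network.
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return F.smooth_l1_loss(q, target)
```

In a training loop, the target network would be synced with target_net.load_state_dict(q_net.state_dict()) every TARGET_NETWORK_UPDATE_FREQ steps.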

Homework 3 (Policy Gradient Methods)

In this homework, you will train either a vanilla policy gradient or a proximal policy optimization (PPO) model to learn to push the object to the desired position. The environment is the same as in the previous homework, except that the actions are now continuous. Some boilerplate code is provided to collect data in multiple processes. Note that there might be more efficient implementations, and you are absolutely free not to use the provided code. In fact, you are free to use high-level implementations such as Stable Baselines. However, you will need to define the environment yourself (which is not very hard). Make sure to pull the latest version of the repository by running git pull. Stable hyperparameters will be shared soon. Feel free to share your good set of hyperparameters with the class. Plot the reward over episodes and add it to your submission. As in HW2, you can use high_level_state to get a higher-level state instead of raw pixels.
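
For the vanilla policy gradient option, the core of the update could look like the sketch below. It assumes a low-dimensional state (for example the output of high_level_state) and a diagonal Gaussian policy. GaussianPolicy, discounted_returns, and reinforce_loss are hypothetical names, and the observation and action dimensions must be taken from the actual environment.

```python
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy with a state-independent log standard deviation."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())


def discounted_returns(rewards, gamma=0.99):
    """Computes G_t = r_t + gamma * G_{t+1} for one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return torch.tensor(returns[::-1], dtype=torch.float32)


def reinforce_loss(policy, obs, actions, rewards, gamma=0.99):
    """REINFORCE loss for one episode with at least two steps."""
    returns = discounted_returns(rewards, gamma)
    # Normalizing the returns is a common variance-reduction heuristic.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Sum log-probabilities over action dimensions (independent Gaussians).
    log_prob = policy.dist(obs).log_prob(actions).sum(dim=-1)
    return -(log_prob * returns).mean()
```

During data collection, actions are drawn with policy.dist(obs).sample(); the loss is then minimized with any standard optimizer such as torch.optim.Adam.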

Homework 4 (Offline RL)

In one short paragraph, please explain the advantages of offline RL over online RL in [1], a work in which the data was not obtained from external resources but was collected by the researchers themselves.

[1] Kalashnikov D, Irpan A, Pastor P, Ibarz J, Herzog A, Jang E, Quillen D, Holly E, Kalakrishnan M, Vanhoucke V, Levine S. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, 2018 Oct 23 (pp. 651-673). PMLR.

Homework 5 (Learning from Demonstration with CNMPs)

In this homework, you will collect demonstrations that consist of $(t, e_y, e_z, o_y, o_z)$, where $e$ and $o$ are the Cartesian coordinates of the end-effector and the object, with subscripts denoting the relevant axis. The code for collecting demonstrations is provided. The robot randomly moves its end-effector in the y-z plane, sometimes hitting the object and sometimes not. The height of the object is random and is also provided by the environment. You will train a CNMP with the following dataset: $\{(t, e_y, e_z, o_y, o_z)_i, h_i\}_{i=0}^{N}$, where $h$ is the height of the object. Here, $t$ will be the query dimension, $h$ will be the condition given to the decoder, and the other dimensions will be the target dimensions while training the CNMP. In other words, given several context points (with all dimensions provided), the model will be asked to predict the end-effector and object positions given the time and the height of the object.
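
The input and output layout described above could be sketched as follows. This is a minimal illustration under assumptions, not the reference model: the layer sizes, the mean aggregation of context encodings, the softplus parameterization of the standard deviation, and the names CNMP and nll_loss are choices made here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_QUERY, D_TARGET, D_COND = 1, 4, 1  # t | (e_y, e_z, o_y, o_z) | h


class CNMP(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Encoder sees full context points (t, e_y, e_z, o_y, o_z).
        self.encoder = nn.Sequential(
            nn.Linear(D_QUERY + D_TARGET, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        # Decoder sees the aggregated representation, the query time t,
        # and the object height h; it outputs a mean and std per target dim.
        self.decoder = nn.Sequential(
            nn.Linear(hidden + D_QUERY + D_COND, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * D_TARGET),
        )

    def forward(self, context, query_t, h):
        # context: (B, n_ctx, 5), query_t: (B, n_q, 1), h: (B, 1)
        r = self.encoder(context).mean(dim=1, keepdim=True)  # (B, 1, hidden)
        n_q = query_t.shape[1]
        r = r.expand(-1, n_q, -1)
        h = h.unsqueeze(1).expand(-1, n_q, -1)
        out = self.decoder(torch.cat([r, query_t, h], dim=-1))
        mean, raw_std = out.chunk(2, dim=-1)
        std = F.softplus(raw_std) + 1e-4  # keep the std strictly positive
        return mean, std


def nll_loss(mean, std, target):
    """Negative log-likelihood of the targets under the predicted Gaussians."""
    return -torch.distributions.Normal(mean, std).log_prob(target).mean()
```

During training, a random number of context points and query times would be sampled from each demonstration, with the targets taken from the same demonstration at the query times.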