Machine and deep learning tooling is excitingly accessible and fun for a developer to work with. There are a couple of ways to develop and play with machine learning code:
- Jupyter notebook running locally or on your server
- Jupyter environment/scripting in the cloud (e.g. Kaggle)
- Scripting/developing locally
There might be more options I am unaware of, or more likely I just cannot find them using simple search engine queries.
I tried a couple of those above - Jupyter notebook, scripting locally and Kaggle. The approach you take usually depends on what you want to achieve. If you want to learn, then starting with a cloud solution is the shortest path to enjoyment; Kaggle is probably one of the best choices, as you can see what others do, fork their kernels and learn by example.
Whilst playing and learning with machine learning is undeniably a great way to spend your time, it is a bit more exciting to use those skills and build a production ready model for others to consume. For that I pick scripting - writing Python scripts which are later deployed to production and exposed via a browser interface.
Note: if you prefer looking at code then check out the implementation supporting this article in the GitHub repo ivarprudnikov/char-rnn-tensorflow
Objectives
- Prepare Python scripts for training & generating
- Store training data & training parameters
- Train models in browser interface
- Generate text based on previously trained model
- Wrap implementation in Docker
- Use CI service to build Docker image for the registry
- Run Docker image on AWS and expose it to the WWW
Text generator
As it happens there are examples of simple text generators on the web. One of them led me to the character-level generator detailed in Andrej Karpathy’s blog post “The Unreasonable Effectiveness of Recurrent Neural Networks (RNN)”, published in mid-2015. This RNN example relieves me of having to explain how and why it works, and it is generally concise in its implementation.
That solves part of the problem, but this article uses TensorFlow while Karpathy’s implementation is written in Torch. Fortunately I found a TensorFlow reimplementation of char-rnn in another GitHub repository, hzy46/Char-RNN-TensorFlow, which I forked and tweaked a bit.
Training
Happily char-rnn-tensorflow is capable of training the model when you pass it a text file as an argument, amongst other options:
python train.py \
--input_file data/shakespeare.txt \
--name shakespeare \
--num_steps 50 \
--num_seqs 32 \
--learning_rate 0.01 \
--max_steps 20000
The above will produce and save a couple of TensorFlow checkpoints, but it will take some time. Training time varies depending on the hardware you have at your disposal. The lengthy training process forces us to use an asynchronous API when exposing this feature via HTTP, so that the user gets notified after submitting a request to train on given data and the process finishes in the background.
Generating text
After training completes and checkpoints have been saved we can call another script to generate text for us:
python sample.py \
--converter_path model/shakespeare/converter.pkl \
--checkpoint_path model/shakespeare/ \
--max_length 1000
The above script needs a path to the checkpoint data to pull variable values from it and use them in the rebuilt model. It also uses the vocabulary file which was dumped during training and includes the characters used for text generation.
Web application
Given we have Python scripts for training and character generation, it is possible to wrap those in a simple app exposed to the public internet. The user will be able to upload and train on her own sample, then generate some text after training finishes.
To build the app I’ll use Node.js with the Express framework. There will be a couple of publicly accessible paths to deal with uploading training data, training on that data and generating text out of it. To keep this exercise simple there are no accounts and no security, so users will also be able to see each other’s submissions.
Uploading data
The simplest approach to the upload problem is a basic HTML form. The user should have a UTF-8 encoded plain text file available on her system, and the app should be able to render the form and process the form POST with multipart/form-data. Ideally the uploaded plain text files would be stored on AWS S3 or a similar service to leverage almost infinite scalability, but for this exercise I’ll keep them on the same server that runs the application and executes the Python scripts.
Render upload form
Basic knowledge of the Express framework is assumed here. The example below expects a view engine, along with the default views directory, to be set so it is able to render HTML. The code in the git repository is a bit different and uses more features.
router.get('/upload', (req, res) => {
// render upload.html/upload.ejs file
res.render('upload')
})
<!-- excerpt from upload html/ejs file -->
<form action="/upload" method="post" enctype="multipart/form-data">
<fieldset>
<legend>Training data</legend>
<label for="customFile">Choose training data file</label>
<input name="file" type="file" id="customFile" required>
<small class="form-text text-muted">
File should be UTF-8 plain/text containing data you want to use for training.
</small>
</fieldset>
<button type="submit">Upload</button>
</form>
Process form POST
In order to process multipart form request I chose to use Busboy dependency.
$ npm i -S busboy
The original filename is not used; instead it is replaced with train.txt, which is the same for every upload. To distinguish the files they live in separate directories named after a generated id, which in this example is a timestamp. After a successful upload the user is redirected.
router.post('/upload', (req, res) => {
// generate id from the current timestamp (as a string for path.join below)
const id = String(Date.now())
// prepare dependency used to process request
const busboy = new Busboy({
headers: req.headers,
limits: {
fileSize: 1024 * 50, // bytes
files: 1 // only one file per request
}
})
let fileStream = null
let filePath = null
let folderPath = path.join('uploads', id)
// handle multipart file
busboy.on('file', (fieldName, file, fileName) => {
if (fileName) {
fs.mkdirSync(folderPath)
filePath = path.join(folderPath, 'train.txt')
fileStream = file.pipe(fs.createWriteStream(filePath))
}
});
// redirect on success, otherwise render same page with error message
busboy.on('finish', () => {
res.set({Connection: 'close'});
if (!fileStream) {
res.render('upload', Object.assign(res.locals, {
error: "Cannot save given training data"
}));
} else {
fileStream.on('finish', async () => {
res.redirect('/')
})
fileStream.on('error', () => {
res.render('upload', Object.assign(res.locals, {
error: "Error occurred while saving file"
}));
})
}
})
// pipe request stream to our dependency
req.pipe(busboy)
})
One problem is almost solved: the training file is ready to be used.
Storage
We could store everything in the filesystem but eventually that gets quite complicated. Initially I thought an example without a database would be more readable, but the effect was the opposite as soon as I wanted to render more details in HTML. It is useful to track when the user uploads data and when training starts and ends; even giving names to those training jobs is useful. Storing all of that in the filesystem seemed dull and verbose to implement.
Apart from those details it is also necessary to store logs, which are produced when training on the uploaded data. I chose to store them in the database after reading a fairly old Travis CI blog post, Solving the Puzzle of Scalable Log Processing.
Model
There are 2 things I want to store:
- model represents a training job and contains details such as id, the time it was created, training parameters, whether the user uploaded text data, whether training is complete, etc.
- log is part of the model, but instead of one big field containing all the text it is split into lines, which makes it easier to insert new data.
Database
Now that there is a relationship between the models and the structure is known in advance - and is unlikely to change - it is sensible to choose a relational database. I believe the lowest common denominator here is MySQL. A document store such as Mongo does not make much sense, not only because of the relationship but mostly due to the nature of the log data, which will drip in line by line.
create table model (
id varchar(255) not null,
created_at timestamp DEFAULT CURRENT_TIMESTAMP,
updated_at timestamp DEFAULT CURRENT_TIMESTAMP
ON UPDATE CURRENT_TIMESTAMP,
name varchar(255) not null,
train_params json not null,
has_data tinyint not null default false,
is_in_progress tinyint not null default false,
is_complete tinyint not null default false,
training_pid varchar(255),
primary key (id)
)
ENGINE = InnoDB
DEFAULT CHARACTER SET utf8
COLLATE utf8_general_ci;
alter table model
add constraint unique_id unique (id);
alter table model
add constraint unique_pid unique (training_pid);
create table model_log (
model_id varchar(255) not null,
position int not null,
chunk text not null
)
ENGINE = InnoDB
DEFAULT CHARACTER SET utf8
COLLATE utf8_general_ci;
alter table model_log
add index FK_MODEL_LOG_MODEL (model_id),
add constraint FK_MODEL_LOG_MODEL foreign key (model_id) references model (id);
A quite simple schema encapsulates what I mentioned before: table model will contain metadata about training and model_log will contain chunks of log text. The position column in model_log is necessary, in theory, to guarantee the order of log entries.
In order for Node.js to connect to the database we’ll need the relevant dependency:
$ npm i -S mysql
To verify that the connection works we could try listing models:
const mysql = require('mysql')
const pool = mysql.createPool({
connectionLimit: 10,
connectTimeout: 20 * 1000,
acquireTimeout: 20 * 1000,
timeout: 10 * 1000,
host: process.env.MYSQL_HOST || 'localhost',
user: process.env.MYSQL_USER || 'root',
password: process.env.MYSQL_PASSWORD || '',
database: process.env.MYSQL_DATABASE || 'rnn_generator',
port: process.env.MYSQL_PORT || 3306,
// ssl: "Amazon RDS", will be necessary later
})
pool.query("select * from model order by updated_at desc limit ? offset ?", [10, 0], (error, rows) => {
console.log(rows)
})
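Similarly, reading a model’s log back in the right order relies on the position column; a quick sketch using the same pool (the model id here is hypothetical):
pool.query(
  "select chunk from model_log where model_id = ? order by position asc",
  ["1539165957842"], // hypothetical model id
  (error, rows) => {
    if (error) throw error
    // stitch the chunks back into one log text
    console.log(rows.map((r) => r.chunk).join(''))
  }
)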
Training parameters
Before being able to run those scripts we need to pass some arguments to them, and before that the user ought to be able to tweak those arguments in the UI. The arguments are exposed as a web form and upon POST the values are stored in the database. You can see all the training options used in the web form in the git repository. As an example, the form looks similar to the following:
<form action="/<%= locals.model.id %>/options" method="post" enctype="multipart/form-data">
<h1 class="h3 text-center mb-3">Update training options</h1>
<% if(locals.errors){ %>
<p class="alert alert-warning">
Form contains errors
</p>
<% } %>
<fieldset>
<legend>Training options</legend>
<div class="form-group row">
<label for="num_seqs" class="col-sm-6 col-form-label">Number of seqs in one batch</label>
<div class="col-sm-6">
<input name="num_seqs" placeholder="Default: 32" type="number" id="num_seqs"
class="form-control <% if(fieldErr("num_seqs")){ %>is-invalid<% } %>"
min="1"
value="<%=fieldData("num_seqs")%>">
<% if(fieldErr("num_seqs")){ %>
<div class="invalid-feedback"><%= fieldErr("num_seqs") %></div>
<% } %>
</div>
</div>
<!-- More fields in Github repository -->
<hr class="mt-5">
<div class="text-right">
<a href="/model/<%= locals.model.id %>" class="btn btn-outline-secondary">Cancel</a>
<button type="submit" class="btn btn-primary">Update</button>
</div>
</fieldset>
</form>
Above you can see I am using variable interpolation <%= variable %>, which is part of the ejs templating engine I have configured the Express app with. This form expects model (the one returned from MySQL) to be passed to the renderer, along with data which holds the form field values. It also expects the helper functions fieldData(), a shortcut to access values in the data object, and fieldErr(), which checks if there is an error for a given form field. Rendering of the above looks like:
router.get('/:id/options', checkPathParamSet("id"), loadInstanceById(), (req, res) => {
res.render('training_options', Object.assign(res.locals, {
data: JSON.parse(req.instance.train_params),
model: req.instance
}))
})
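The fieldData() and fieldErr() helpers are not shown in this excerpt. Here is a minimal sketch of how they could be exposed to the views via res.locals - the repository may wire them up differently:
// Hypothetical middleware putting form helpers on res.locals;
// `data` and `errors` are the objects passed to res.render() above
router.use((req, res, next) => {
  res.locals.fieldData = (name) =>
    (res.locals.data && res.locals.data[name] != null) ? res.locals.data[name] : ''
  res.locals.fieldErr = (name) =>
    res.locals.errors ? res.locals.errors[name] : null
  next()
})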
Here checkPathParamSet and loadInstanceById are helper middleware functions that are reused in other methods as well:
function checkPathParamSet(paramName) {
return (req, res, next) => {
if (!req.params[paramName]) {
res.render('404')
return
}
next()
}
}
function loadInstanceById() {
return async (req, res, next) => {
let instance = await db.findModel(req.params.id)
if (!instance) {
res.render('404')
} else {
req.instance = instance;
next()
}
}
}
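loadInstanceById() in turn relies on db.findModel(), which is not listed here. A minimal promise-based sketch on top of the pool created earlier - the real db module in the repository also contains updateModel(), insertLogEntry() and similar helpers:
// Sketch: promisified lookup of a single model row by id
function findModel(id) {
  return new Promise((resolve, reject) => {
    pool.query("select * from model where id = ? limit 1", [id], (error, rows) => {
      if (error) return reject(error)
      resolve(rows && rows[0])
    })
  })
}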
Schema validation
Storage of the training parameters is pretty much straightforward; their validation is not. To make sure we comply with the Python script API, those requirements need to be formalized somewhere other than the script itself and validated against before storage. This could be an ad hoc implementation, but the validation ought to be reusable before calling the script as well, and reading code is harder than looking at a schema. I am using JSON Schema for parameter definition and validation; the example below shows only a couple of fields:
{
"$id": "generator/schema/training/options.json",
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"num_seqs": {
"type": "integer",
"minimum": 1,
"description": "number of seqs in one batch"
},
"num_steps": {
"type": "integer",
"minimum": 1,
"description": "length of one seq"
}
},
"additionalProperties": false
}
There are a couple of JSON schema validators in the wild but I chose ajv as it supports the latest schema drafts. Let’s see how the validation logic is implemented:
const Ajv = require('ajv')
// above schema sample
const trainOptionsSchema = require("train_arguments_schema.json")
const ajv = new Ajv({allErrors: true, coerceTypes: true, removeAdditional: true})
const validator = ajv.compile(trainOptionsSchema)
function chackTrainParams(params) {
if (validator(params)) {
return null
}
let errors = {}
// validator shows path of failing leaf starting with a dot
// changing it to simplify rendering of error messages
validator.errors.forEach((err) => {
let keyWithoutLeadingDot = err.dataPath.replace(/^\./, "");
errors[keyWithoutLeadingDot] = err.message
})
return errors
}
With the above implementation it is easy enough to check if there are any errors in a given object:
chackTrainParams({num_seqs: 0})
// returns
// { num_seqs: 'should be >= 1' }
Storage
Previously, for the document upload, I used Busboy, but using it with plain form fields is a bit verbose. Gladly there is a wrapper around that dependency called multer which puts form fields into req.body. For form field submissions without files, use the multerUpload.none() middleware.
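For completeness, the multerUpload instance used below needs no special configuration; a minimal sketch:
// Sketch: a default multer instance is enough here,
// .none() accepts text-only multipart forms (no files)
const multer = require('multer')
const multerUpload = multer()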
router.post('/:id/options', checkPathParamSet("id"), loadInstanceById(), multerUpload.none(), asyncErrHandler.bind(null, async (req, res) => {
let model = req.instance
// filter out empty values
let params = Object.keys(req.body).reduce((memo, val) => {
if (req.body[val] != null && req.body[val] !== "") {
memo[val] = req.body[val]
}
return memo
}, {})
let errors = chackTrainParams(params)
if (errors) {
return res.render('training_options', Object.assign(res.locals, {
error: "Found " + Object.keys(errors).length + " errors",
errors: errors,
data: params,
model: req.instance
}))
}
await db.updateModel(model.id, {
train_params: JSON.stringify(params)
})
res.redirect(`${req.baseUrl}/${model.id}`)
}))
Another thing above is the async error handler, which has to be written explicitly because Express does not yet handle rejected promises in route handlers:
const asyncErrHandler = (asyncFn, req, res) => asyncFn(req, res)
.catch(err => {
console.log((new Date()).toISOString(), "[ERROR]", util.inspect(err))
res.status(500).render('500')
});
Running Python scripts
All the ingredients are ready to be used: training data is uploaded to a location on disk and training parameters are stored in the database. It is now necessary to use those details and run the forked Python scripts. I’ve already written about how to run Python scripts from within a Node.js application, so I will not go into much detail here; do skim through “Using Python scripts in Node.js server” though.
Training
First I need action buttons in the UI to start/stop the training process; upon click they hit the appropriate path handler in the application, which does the rest. The buttons are only exposed under certain conditions - start when training data is uploaded, stop only when training is in progress:
<% if(model.is_in_progress){ %>
<form class="d-inline-block" action="/model/<%= model.id %>/stop" method="post">
<button type="submit" class="btn btn-sm btn-danger">Stop training</button>
</form>
<% } else if (model.has_data) { %>
<form class="d-inline-block" action="/model/<%= model.id %>/start" method="post">
<button type="submit" class="btn btn-sm btn-outline-primary">Start training</button>
</form>
<% } %>
Before executing the training script we need to be sure it is not already running, which is achieved by checking if training_pid is present on the model instance. It is also necessary to clean up any existing log entries, because the relationship prohibits more than one log representation per training run - one model, one log. Then the script is started in a separate process and events coming from it are both stored in the database and sent to a websocket connection to be rendered in real time. Websockets will not be covered here as they are part of the other article I mentioned above, “Using Python scripts in Node.js server”
router.post('/:id/start', checkPathParamSet("id"), loadInstanceById(), asyncErrHandler.bind(null, async (req, res) => {
let model = req.instance
// if already in progress just redirect back
if (model.training_pid) {
return res.redirect(`${req.baseUrl}/${model.id}`)
}
const params = JSON.parse(model.train_params || "{}")
// clear any existing logging
await db.deleteLogEntries(model.id)
// launch script process
let subprocess
try {
subprocess = await trainModel(model.id, params)
} catch (err) {
// cleanup if error
await db.setModelTrainingStopped(model.id)
return res.render('show', Object.assign(res.locals, {
model: req.instance,
error: util.inspect(err)
}))
}
// obtain a running websocket instance
const wss = req.app.get(WEBSOCKET)
// store logs and also spit them out to the websocket to render in real time
let chunkPosition = 1
subprocess.stdout.on('data', async (data) => {
let logEntry = {
model_id: model.id,
chunk: data + "",
position: chunkPosition
}
await db.insertLogEntry(logEntry)
wss.broadcast(JSON.stringify(logEntry))
chunkPosition++
});
subprocess.stderr.on('data', async (data) => {
let logEntry = {
model_id: model.id,
chunk: `Error: ${data}`,
position: chunkPosition
}
await db.insertLogEntry(logEntry)
wss.broadcast(JSON.stringify(logEntry))
chunkPosition++
});
// mark model as in progress
await db.setModelTrainingStarted(model.id, subprocess.pid);
// get back to page it was initiated on
res.redirect(`${req.baseUrl}/${model.id}`)
}))
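The matching /stop handler behind the “Stop training” button is not listed in the article. A minimal sketch, assuming the pid stored in training_pid is enough to signal the process - the exit handler registered in trainModel() below takes care of the pid file and database cleanup:
// Hypothetical stop handler, the repository's version may differ
router.post('/:id/stop', checkPathParamSet("id"), loadInstanceById(), asyncErrHandler.bind(null, async (req, res) => {
  let model = req.instance
  if (model.training_pid) {
    try {
      // politely ask the training process to terminate
      process.kill(parseInt(model.training_pid, 10), 'SIGTERM')
    } catch (err) {
      // process is already gone, make sure the flags reflect that
      await db.setModelTrainingStopped(model.id)
    }
  }
  res.redirect(`${req.baseUrl}/${model.id}`)
}))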
The script is launched in the trainModel() function, which returns a Promise. In addition to the checks in the router handler above, trainModel() double checks the stored training parameters and merges them with defaults. The existence of the training data and the pid file is also checked before spawning a new Python process.
function trainModel(submissionId, params) {
// defaults
let args = {
num_seqs: 32,
num_steps: 50,
lstm_size: 128,
num_layers: 2,
use_embedding: false,
embedding_size: 128,
learning_rate: 0.001,
train_keep_prob: 0.5,
max_steps: 1000,
save_every_n: 1000,
log_every_n: 100,
max_vocab: 3500
}
return new Promise(function (resolve, reject) {
if (!submissionId) return reject("submissionId required");
if (typeof params === "object") {
let errors = chackTrainParams(params)
if (errors) {
return reject(errors)
} else {
Object.assign(args, params)
}
}
const folderPath = path.join(UPLOADS_PATH, submissionId)
const trainFilePath = path.join(folderPath, TRAIN_FILENAME)
if (!fs.existsSync(trainFilePath))
return reject("missing training data file")
const trainPidPath = path.join(folderPath, TRAIN_PID_FILENAME)
const modelDir = path.join(GENERATOR_PATH, MODEL_DIR, submissionId)
// remove any existing checkpoints
rimraf.sync(modelDir)
mkdirp.sync(modelDir)
let spawnArgs = [
"-u",
path.join(GENERATOR_PATH, 'train.py'),
"--input_file", trainFilePath, // utf8 encoded text file
"--name", submissionId // name of the model
]
Object.keys(args).forEach((k) => {
if (k != null && args[k] != null) {
spawnArgs.push(`--${k}`)
spawnArgs.push(args[k])
}
})
// run script
const subprocess = spawn('python', spawnArgs, {
stdio: ['ignore', "pipe", "pipe"]
});
// store pid file
fs.writeFileSync(trainPidPath, String(subprocess.pid))
// cleanup in case of error
subprocess.on("error", () => {
rimraf.sync(trainPidPath)
setModelTrainingStopped(submissionId)
})
subprocess.on("exit", () => {
rimraf.sync(trainPidPath)
setModelTrainingStopped(submissionId)
})
resolve(subprocess);
})
}
The above will spawn a Python process which will execute the training script I mentioned in the beginning. Some variables are not defined in this excerpt, but you can always look at the source code to see what values they hold.
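For orientation, here is roughly what those constants hold. GENERATOR_PATH matches the /app/generator path set in the Dockerfile later on and TRAIN_FILENAME matches the train.txt name used during upload; the rest are assumptions:
// Assumed configuration values - the repository defines its own
const UPLOADS_PATH = path.join(__dirname, '..', 'uploads') // assumption
const TRAIN_FILENAME = 'train.txt' // matches the upload handler
const TRAIN_PID_FILENAME = 'train.pid' // assumption
const GENERATOR_PATH = process.env.GENERATOR_PATH || '/app/generator' // set in the Dockerfile
const MODEL_DIR = 'model' // assumption, mirrors the model/<name>/ path used by sample.py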
Generating sample
This is similar to training; first I need a “button” in the UI:
<% if(model.is_complete){ %>
<button id="sampleBtn" type="button" class="btn btn-sm btn-primary"
onclick="generateSample()">Generate sample
</button>
<% } %>
<div id="sample" class="my-3" style="display: none">
<p class="font-weight-bold">Generated sample</p>
<div id="sampleOutput" class="bg-dark text-light p-3 small"></div>
</div>
<script>
function generateSample() {
$("#sampleOutput").empty()
$("#sample").show()
$("#sampleBtn").prop("disabled", true)
$.get("/model/<%= model.id %>/sample")
.done(function (data) {
$("#sampleOutput").text(data)
})
.fail(function (data) {
$("#sampleOutput").text(JSON.stringify(data) || "Error occurred")
})
.always(function () {
$("#sampleBtn").prop("disabled", false)
})
}
</script>
Instead of making a form post I’m using some ajax here with help from jQuery. For anyone wondering why I did not use a front end framework: it is to keep things simple.
The router handles the /sample endpoint, which in turn checks if the model finished training via its is_complete flag. It is also necessary to use the same parameters as were used for training, to make sure the correct model representation is created before extracting a text sample.
router.get('/:id/sample', checkPathParamSet("id"), loadInstanceById(), asyncErrHandler.bind(null, async (req, res) => {
let model = req.instance
if (!model.is_complete) {
res.status(400).send({error: "Not ready yet"})
return
}
// use same params from training
const trainingParams = JSON.parse(model.train_params)
let args = [
'lstm_size',
'num_layers',
'use_embedding',
'embedding_size'
].reduce((memo, key) => {
if (trainingParams[key] != null)
memo[key] = trainingParams[key]
return memo;
}, {})
// TODO somebody add start_string and max_length from query params
let subprocess
try {
subprocess = await sampleModel(model.id, args)
} catch (err) {
return res.status(400).send({error: util.inspect(err)})
}
subprocess.stderr.on('data', (data) => {
console.log(`Error: ${data}`)
});
res.set('Content-Type', 'text/plain');
subprocess.stdout.pipe(res)
}))
As you can see it is missing the start_string and max_length arguments; completing this and adding input fields to the UI is left for someone else, for now I’ve set some defaults for it to work. The sampleModel() function is similar to trainModel() so it is not necessary to repeat it here; you can check it out in the source code, and a rough sketch follows.
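Still, to make the flow easier to follow, here is a sketch of sampleModel(), assuming it mirrors trainModel() and invokes sample.py with the converter and checkpoint paths shown at the beginning of the article:
// Sketch only - the repository's implementation may differ
function sampleModel(submissionId, params) {
  return new Promise(function (resolve, reject) {
    if (!submissionId) return reject("submissionId required")
    const modelDir = path.join(GENERATOR_PATH, MODEL_DIR, submissionId)
    let spawnArgs = [
      "-u",
      path.join(GENERATOR_PATH, 'sample.py'),
      "--converter_path", path.join(modelDir, 'converter.pkl'),
      "--checkpoint_path", modelDir,
      "--max_length", "1000" // default, see the TODO above
    ]
    // append the lstm_size/num_layers/... arguments collected in the route
    Object.keys(params || {}).forEach((k) => {
      if (params[k] != null) {
        spawnArgs.push(`--${k}`, String(params[k]))
      }
    })
    resolve(spawn('python', spawnArgs, {stdio: ['ignore', 'pipe', 'pipe']}))
  })
}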
Docker
The implementation uses a mix of languages, which makes it a challenge to run consistently across different machines. There are a couple of issues I’d like to solve with Docker. Firstly, the correct dependencies have to be used for both the Python scripts and the Node.js server; then I’d like to run it all with one command; and it all needs to be wrapped up to run in a production environment. To solve the dependency issue one could use virtualenv and npm, to run it all in one command a developer could write a shell script, but to run it all in a production environment you need some orchestration. All of these issues are easily resolved with Docker.
Choosing a base Docker image here is a challenge, as I could not find one with both Node.js and Python support; you have to choose one or the other and install the missing components. I chose a Python image as the base, as it is more important to make certain the scripts run in a consistent environment, and I could not guarantee it would always be the same if Python was installed from scratch at every deployment. Installing Node.js on the other hand is not hard at all when using nvm, and besides, a developer usually locks the dependencies with package-lock.json. Furthermore, if anything goes wrong with the server it will be easier to spot than digging through logs to see if the scripts are failing.
The directory structure is split into server and generator; the former contains the Node.js server and the latter the scripts.
repo
+- generator/
+- server/
\ Dockerfile
The Dockerfile would be simple enough if not for the part that installs Node.js:
FROM python:3.6
# replace shell with bash so we can source files
RUN rm /bin/sh && ln -s /bin/bash /bin/sh
# Install node.js
ENV NODE_VERSION 10.12.0
ENV NVM_DIR /usr/local/nvm
RUN mkdir -p $NVM_DIR
# install nvm
# https://github.com/creationix/nvm#install-script
RUN curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.11/install.sh | bash
# install node and npm
RUN echo "source $NVM_DIR/nvm.sh && \
nvm install $NODE_VERSION && \
nvm alias default $NODE_VERSION && \
nvm use default" | bash
# add node and npm to path so the commands are available
ENV NODE_PATH $NVM_DIR/versions/node/v$NODE_VERSION/lib/node_modules
ENV PATH $NVM_DIR/versions/node/v$NODE_VERSION/bin:$PATH
# confirm installation
RUN node -v
RUN npm -v
# prepare workdir
ENV APP_PATH /app
ENV SERVER_PATH /app/server
ENV GENERATOR_PATH /app/generator
RUN mkdir -p $SERVER_PATH $GENERATOR_PATH
# separate installation of python deps and copying python assets
# to make sure docker caches installation step
COPY generator/requirements.txt $GENERATOR_PATH/requirements.txt
RUN pip install -r $GENERATOR_PATH/requirements.txt
# separate installation of npm modules
# to make sure docker caches installation step
COPY server/package.json $SERVER_PATH/package.json
COPY server/package-lock.json $SERVER_PATH/package-lock.json
RUN cd $SERVER_PATH && npm install
COPY . $APP_PATH
WORKDIR $SERVER_PATH
EXPOSE 8080
CMD ["npm", "start"]
If you are not very familiar with Docker, all it does is install Node.js along with nvm, then copy over the files and install the Python dependencies for the scripts:
COPY generator/requirements.txt $GENERATOR_PATH/requirements.txt
RUN pip install -r $GENERATOR_PATH/requirements.txt
and then Node.js server dependencies:
COPY server/package.json $SERVER_PATH/package.json
COPY server/package-lock.json $SERVER_PATH/package-lock.json
RUN cd $SERVER_PATH && npm install
To run the above you have to have Docker installed on your machine and MySQL running with the rnn_generator database created and the schema applied to it:
- build the container image:
docker build -t foobar .
- run the built image:
docker run --rm -ti \
  -e "MYSQL_HOST=docker.for.mac.localhost" \
  -e "MYSQL_USER=root" \
  -e "MYSQL_PASSWORD=" \
  -e "MYSQL_DATABASE=rnn_generator" \
  -p 8080:8080 foobar
docker.for.mac.localhost is meant for macOS users; if you use something else, look for the equivalent value in the Docker documentation.
Deployment to AWS
This whole exercise was not only about being able to run those scripts locally, but also about deploying them to an environment close to production. For this reason I went with what I usually work with, which is AWS. I say close to production because I do not intend to spend money on this example implementation; it runs on tiny resources.
I need a couple of pieces of infrastructure to make it all work:
- database - RDS
- Docker container registry - Elastic Container Registry (ECR)
- Docker runner - Elastic Beanstalk (EB)
I could have chosen Elastic Container Service (ECS) to run the Docker image on, but I got scared off trying to use it, as there were so many options compared to the EB launch configuration.
Dockerfiles can be built on EB directly, but then the image is built at deployment time, which might take a really long time before the instance becomes ready to respond to requests. It is much better to build the image on a CI server and push it to a registry before using it.
CI server
This whole example is hosted on GitHub, which is quite easy to integrate with Travis CI - free when used with public repositories. I used it for building the Docker image and pushing it to ECR in AWS. The CI configuration is relatively simple, not counting the shell script I had to assemble for it to build and push the image to AWS. It runs a Docker build every time a new git commit is pushed to GitHub, and deploys only on the master branch.
sudo: required
language: python
services:
- docker
env:
global:
- DOCKER_REPO=ivarprudnikov/rnn-generator
- AWS_ACCOUNT_ID=<redacted>
- EB_REGION="eu-west-1"
- EB_APP="<elastic beanstalk app name>"
- EB_ENV="<elastic beanstalk environment name>"
- S3_BUCKET="<s3 bucket name the zipped app will be uploaded to>"
- secure: <encrypted secret>
- secure: <encrypted access key>
before_install:
- pip install awscli
- export PATH=$PATH:$HOME/.local/bin
script:
- docker build -t $DOCKER_REPO .
deploy:
provider: script
script: bash docker_push.sh
on:
branch: master
The deployment script is assembled from versions I found in the wild internets; it relies on the installed awscli to tag the build that was just made and then push it to the registry.
#!/bin/bash -e
TIMESTAMP=$(date '+%Y%m%d%H%M%S')
VERSION="${TIMESTAMP}-${TRAVIS_COMMIT}"
REGISTRY_URL=${AWS_ACCOUNT_ID}.dkr.ecr.${EB_REGION}.amazonaws.com
SOURCE_IMAGE="${DOCKER_REPO}"
TARGET_IMAGE="${REGISTRY_URL}/${DOCKER_REPO}"
TARGET_IMAGE_LATEST="${TARGET_IMAGE}:latest"
TARGET_IMAGE_VERSIONED="${TARGET_IMAGE}:${VERSION}"
aws configure set default.region ${EB_REGION}
# Push image to ECR
###################
$(aws ecr get-login --no-include-email)
# update latest version
docker tag ${SOURCE_IMAGE} ${TARGET_IMAGE_LATEST}
docker push ${TARGET_IMAGE_LATEST}
# push new version
docker tag ${SOURCE_IMAGE} ${TARGET_IMAGE_VERSIONED}
docker push ${TARGET_IMAGE_VERSIONED}
# ...
I push 2 tags here: the latest one and a versioned one, which allows me to specify the exact version when deploying to Elastic Beanstalk. Keep in mind that ECR has limits on the maximum number of tags and images. Deployment to EB requires pushing a zip file with just the Dockerrun.aws.json in it, which in turn tells EB what Docker image to use:
{
"AWSEBDockerrunVersion": "2",
"volumes": [],
"containerDefinitions": [
{
"name": "generator",
"image": "<TARGET_IMAGE>",
"essential": true,
"memoryReservation": 96,
"portMappings": [
{
"hostPort": 80,
"containerPort": 8080
}
],
"mountPoints": []
}
]
}
The json file contains <TARGET_IMAGE>, which is replaced in the deployment script with the versioned image name:
# ...
ZIP="${VERSION}.zip"
# Deploy new version to Elasticbeanstalk
########################################
# Interpolate Dockerrun.aws.json and also create backup .bak file
sed -i.bak "s#<TARGET_IMAGE>#$TARGET_IMAGE_VERSIONED#" Dockerrun.aws.json
# Zip application
zip -r ${ZIP} Dockerrun.aws.json
# Copy application version over to S3
aws s3 cp ${ZIP} s3://${S3_BUCKET}/${ZIP}
# Create a new application version with the zipped up Dockerrun file
aws elasticbeanstalk create-application-version --application-name ${EB_APP} \
--version-label ${VERSION} --source-bundle S3Bucket=${S3_BUCKET},S3Key=${ZIP}
# Update the environment to use the new application version
aws elasticbeanstalk update-environment --environment-name ${EB_ENV} \
--version-label ${VERSION}
Elastic Beanstalk
Part of the CI configuration pushes 2 things to AWS: the freshly built Docker image and a new EB application version. This is almost everything we need, except setting up the EB environment itself, which proved to be a bit tricky and time consuming, so I’ll outline some important bits I got caught on.
To start I had to set up an AWS RDS MySQL database, which in turn resides in a VPC. Then I had to enable networking in the EB application, which requires enabling a LoadBalancer; this allows setting the VPC details necessary for your EC2 instance to be created in that same VPC so it can access the database. Apart from the VPC, which is out of scope here, I needed to make sure my application uses the correct image and is correctly mapped in the LoadBalancer.
When creating the app make sure to select Multi-container Docker running on 64bit Amazon Linux, as the previously mentioned Dockerrun.aws.json uses "AWSEBDockerrunVersion": "2".
Summary
Right, this is a long one and it took me some time to build. Eventually I’ve made it available to the public at rnn-generator.dasmicrobot.com, but I have little confidence in it working for more than one user, as the resources allocated to that instance are miserable.
This exercise explored a naive way of running TensorFlow powered scripts via a webapp. Even this basic functionality required a wide array of development techniques to make it successful:
- Node.js powered web server
- Tensorflow powered recurrent neural net implementation
- Websockets to see live progress in UI
- Docker to make sure the application is running in predictable environment
- CI server to automate deployments
- Docker container registry to store images
- Elastic Beanstalk to run Docker container images
- RDS to store sequential data
- Bootstrap framework to assemble UI in html
If you asked me why there is no React in this list, I might do something stupid.
Improvements
The current implementation is clunky and hardly scalable, as both the training script and the webserver live on the same box. Making sure the scripts run on something with a GPU would be a great start; I would probably try AWS Lambda or one of Amazon’s machine learning products.
After splitting out the Python scripts it would be much easier to concentrate on implementing a sort of REST API, so that the website could use more ajax instead of old school forms.
The UI could also use more pictures of unicorns, to make sure the MVP attracts its first million in days and not years.
Thank you for reading.
Source code
Source code is available in the GitHub repo ivarprudnikov/char-rnn-tensorflow