Tensorflow Programs and Tutorials

This repository collects some toy experiments built on TensorFlow (a popular machine learning framework) to introduce deep learning concepts used in image recognition and language modeling. I summarize them in three parts:

  • Convolutional Neural Network for handwritten digits recognition
  • LSTM-based character level sequence to sequence generation
  • Question pair classification with RNN


Convolutional Neural Networks

In this tutorial, the author builds a convolutional neural network in TensorFlow and trains it on MNIST digits. I will not go through all the code details and will only introduce the most important steps.

Data Extraction

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

The code above loads the MNIST data; TensorFlow ships a built-in helper function for downloading and reading it.

Network Construction

def conv2d(x, W):
  return tf.nn.conv2d(input=x, filter=W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

The 2D convolutional layer and the max pooling layer are the two most common layers in CNNs for 2D image recognition. The code above defines a 2D convolutional layer (see the conv2d() function) and a max pooling layer (see the max_pool_2x2() function).

For tf.nn.conv2d() function, there are some parameters to specify:

  1. The first is the input, which must be a tensor of shape [batch, in_height, in_width, in_channels].
  2. The second is the filter, which must have the same type as the input, with shape [filter_height, filter_width, in_channels, out_channels].
  3. The third is strides, a 1D tensor of length 4 that gives the step of the sliding window along each dimension of the input.
  4. The fourth is padding, which has two modes, “SAME” and “VALID”.
    1. For “SAME”, the output size is computed as: out_height = ceil(float(in_height) / float(strides[1])), out_width = ceil(float(in_width) / float(strides[2])).
    2. For “VALID”, the output size is computed as: out_height = ceil(float(in_height - filter_height + 1) / float(strides[1])), out_width = ceil(float(in_width - filter_width + 1) / float(strides[2])).
    3. Here the author uses “SAME” with a stride of 1 in conv2d so that the output keeps the spatial size of the input; the spatial size is reduced only by the max_pool_2x2 operation.
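
The two padding formulas above can be checked with a few lines of plain Python. This is only a sketch of the size arithmetic for one spatial dimension; the helper name conv_output_size is made up for this illustration and is not part of TensorFlow.

```python
import math

def conv_output_size(in_size, filter_size, stride, padding):
    """Output length along one spatial dimension for 'SAME' or 'VALID' padding."""
    if padding == "SAME":
        return math.ceil(in_size / stride)
    if padding == "VALID":
        return math.ceil((in_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

# A 28x28 MNIST image and a 5x5 filter with stride 1
print(conv_output_size(28, 5, 1, "SAME"))   # 28
print(conv_output_size(28, 5, 1, "VALID"))  # 24
# A 2x2 pooling window with stride 2
print(conv_output_size(28, 2, 2, "SAME"))   # 14
```

With “SAME” and a stride of 1 the size is unchanged, which is why only the pooling layers shrink the feature maps in this network.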

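The tf.nn.max_pool() function takes similar parameters: ksize gives the size of the pooling window along each dimension of the input (taking the place of the filter), while strides and padding are used in the same way as in tf.nn.conv2d().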
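
To make the pooling operation concrete, here is a minimal plain-Python sketch of 2x2 max pooling with stride 2 on a single-channel feature map. The function name max_pool_2x2_lists and the sample values are invented for this illustration, and the sketch assumes the height and width are even.

```python
def max_pool_2x2_lists(grid):
    """2x2 max pooling with stride 2 on a 2D list with even height and width."""
    pooled = []
    for r in range(0, len(grid), 2):
        row = []
        for c in range(0, len(grid[0]), 2):
            window = [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]]
            row.append(max(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 6, 1, 1],
]
pooled = max_pool_2x2_lists(feature_map)
print(pooled)  # [[4, 5], [6, 7]]
```

Each output value is the maximum of one non-overlapping 2x2 window, so the height and width are halved.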

Here is the code for a simple convolutional network construction:

sess = tf.InteractiveSession()
x = tf.placeholder("float", shape = [None, 28,28,1]) #shape in CNNs is always None x height x width x color channels
y_ = tf.placeholder("float", shape = [None, 10]) #shape is always None x number of classes
#First Conv and Pool Layers
W_conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))
b_conv1 = tf.Variable(tf.constant(.1, shape = [32])) #shape of the bias just has to match output channels of the filter
h_conv1 = tf.nn.conv2d(input=x, filter=W_conv1, strides=[1, 1, 1, 1], padding='SAME') + b_conv1
h_conv1 = tf.nn.relu(h_conv1)
h_pool1 = tf.nn.max_pool(h_conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
#Second Conv and Pool Layers
W_conv2 = tf.Variable(tf.truncated_normal([5, 5, 32, 64], stddev=0.1))
b_conv2 = tf.Variable(tf.constant(.1, shape = [64]))
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)
#First Fully Connected Layer
W_fc1 = tf.Variable(tf.truncated_normal([7 * 7 * 64, 1024], stddev=0.1))
b_fc1 = tf.Variable(tf.constant(.1, shape = [1024]))
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
#Dropout Layer
keep_prob = tf.placeholder("float")
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
#Second Fully Connected Layer
W_fc2 = tf.Variable(tf.truncated_normal([1024, 10], stddev=0.1))
b_fc2 = tf.Variable(tf.constant(.1, shape = [10]))
#Final Layer
y = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
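
The 7 * 7 * 64 in the first fully connected layer follows from the layer sizes. The sketch below traces the shapes in plain Python, assuming “SAME” padding everywhere, stride 1 for the convolutions and stride 2 for the pooling layers, as in the code above; the helper same_size is hypothetical.

```python
import math

def same_size(in_size, stride):
    # With 'SAME' padding the output size depends only on the stride
    return math.ceil(in_size / stride)

height = width = 28
channels = 1
shapes = [("input", (height, width, channels))]

for name, out_channels in [("1", 32), ("2", 64)]:
    # Convolution: stride 1 keeps the spatial size, channels become out_channels
    height, width = same_size(height, 1), same_size(width, 1)
    channels = out_channels
    shapes.append(("conv" + name, (height, width, channels)))
    # Pooling: stride 2 halves the spatial size
    height, width = same_size(height, 2), same_size(width, 2)
    shapes.append(("pool" + name, (height, width, channels)))

flat_size = height * width * channels
for layer, shape in shapes:
    print(layer, shape)
print("flattened:", flat_size)  # flattened: 3136
```

The last pooling output is 7 x 7 x 64, so each example is flattened to 3136 values before the fully connected layer.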

Two things in the network code are worth noting. The first is tf.placeholder(): it is simply a node that we will feed data into later (during training). It lets us create our operations and build the computation graph without needing the actual data; the graph is filled with real values when it is run. The second is tf.Variable(): it is a tensor whose values persist across steps and can change, so it is trainable during optimization (think of SGD updates to the weight matrices).

tf.placeholder() takes the following parameters: the first is dtype, the type of the elements in the tensor; the second is shape, the shape of the tensor; the third is name, an optional name for the node. Names help with debugging and logging when building a complex model graph. Combined with the scope mechanism, names are also what make it possible to share and reuse variables, which is extremely useful for some tasks (e.g. RNN-based machine translation, generative adversarial models, etc.).

For tf.Variable(), the first parameter is the initial value, which can be a tensor. Note that the initial value must have a shape specified. A custom variable name can also be given through the optional name argument (useful if you want to share the variable later).


Here is the code for training:

crossEntropyLoss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels = y_, logits = y))
trainStep = tf.train.AdamOptimizer().minimize(crossEntropyLoss)
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
#Initialize all variables before training
sess.run(tf.global_variables_initializer())
batchSize = 50
for i in range(1000):
    batch = mnist.train.next_batch(batchSize)
    trainingInputs = batch[0].reshape([batchSize,28,28,1])
    trainingLabels = batch[1]
    if i%10 == 0:
        #merged (a merged summary op) and writer (a summary file writer) are assumed to be defined elsewhere; they are not shown in this post
        summary = sess.run(merged, {x: trainingInputs, y_: trainingLabels, keep_prob: 1.0})
        writer.add_summary(summary, i)
    if i%100 == 0:
        trainAccuracy = accuracy.eval(session=sess, feed_dict={x:trainingInputs, y_: trainingLabels, keep_prob: 1.0})
        print("step %d, training accuracy %g"%(i, trainAccuracy))
    #One optimization step, with dropout enabled
    trainStep.run(session=sess, feed_dict={x: trainingInputs, y_: trainingLabels, keep_prob: 0.5})

There are also two points worth noting. First, the graph has to be launched in a session, through sess.run() or the equivalent eval() and run() methods, which turns the symbolic graph construction into actual computation. Second, tf.global_variables_initializer() must be run before training in order to initialize all the variables declared earlier.
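
The loss in the training code combines a softmax with a cross-entropy and then averages over the batch. The plain-Python sketch below shows that computation for one-hot labels; it illustrates the math only and is not how TensorFlow implements it internally. The function names are made up for this example.

```python
import math

def softmax_cross_entropy(logits, one_hot_label):
    """Cross-entropy between a one-hot label and softmax(logits) for one example."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    log_probs = [v - log_sum_exp for v in logits]
    return -sum(t * lp for t, lp in zip(one_hot_label, log_probs))

def mean_loss(batch_logits, batch_labels):
    """Average the per-example losses, as tf.reduce_mean does over the batch."""
    losses = [softmax_cross_entropy(l, t) for l, t in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)

label = [0, 1, 0, 0]
uniform = softmax_cross_entropy([0.0, 0.0, 0.0, 0.0], label)     # equals log(4)
confident = softmax_cross_entropy([0.0, 10.0, 0.0, 0.0], label)  # close to 0
print(uniform, confident)
print(mean_loss([[0.0, 0.0, 0.0, 0.0], [0.0, 10.0, 0.0, 0.0]], [label, label]))
```

An uninformative prediction over four classes costs log(4), while a confident correct prediction costs almost nothing, which is what the optimizer pushes towards.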


LSTM-based Character Level Sequence to Sequence Generation

In this tutorial, the author first pre-processes the text to clean it at the character level (removing punctuation, parentheses, question marks, etc. and keeping only the alphanumeric characters) before building the network and starting training. Here I will only introduce how to construct a Char-RNN model in TensorFlow, without going through the details of data pre-processing. See the following code:

import numpy as np

#prepare the dataset of input to output pairs encoded as integers
#(allText, nChars, charToInt and nVocab come from the pre-processing step)
dataX = []
dataY = []
for i in range(0, nChars - seqLength, 1):
    seq_in = allText[i:i + seqLength]
    seq_out = allText[i + seqLength]
    dataX.append([charToInt[char] for char in seq_in])
    dataY.append(charToInt[seq_out])
nExamples = len(dataX)
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (nExamples, seqLength, 1))
# normalize
X = X / float(nVocab)
# one hot encode the output variable
y = np.zeros([nExamples, nVocab])
for i, example in enumerate(dataY):
    lis = np.zeros(nVocab)
    lis[example] = 1
    y[i] = lis

At this point, X is the input data of shape [nExamples, seqLength, DimFeatures], and y is the target of shape [nExamples, VocabSize] (one-hot encoded). Both are fed into the placeholders during training.
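
The sliding-window construction is easier to see on a tiny input. The following plain-Python sketch applies the same idea to a short string; the text "hello world" and the window length of 4 are arbitrary choices for the illustration, not values from the tutorial.

```python
text = "hello world"
seq_length = 4

# Build the character vocabulary and its integer encoding
chars = sorted(set(text))
char_to_int = {ch: i for i, ch in enumerate(chars)}
n_vocab = len(chars)

# Each window of seq_length characters predicts the character that follows it
data_x, data_y = [], []
for i in range(len(text) - seq_length):
    window = text[i:i + seq_length]
    target = text[i + seq_length]
    data_x.append([char_to_int[ch] for ch in window])
    data_y.append(char_to_int[target])

# One-hot encode the targets
one_hot_y = [[1 if j == label else 0 for j in range(n_vocab)] for label in data_y]

print(len(data_x), n_vocab)                         # 7 8
print(text[0:seq_length], "->", text[seq_length])   # hell -> o
```

An 11-character text with a window of 4 gives 7 training examples, each paired with a one-hot target over the 8 distinct characters.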

Define Char-RNN Model:

batchSize = 24
lstmUnits = 48
iterations = 100000
numDimensions = 1
numClasses = nVocab

labels = tf.placeholder(tf.float32, [None, numClasses])
input_data = tf.placeholder(tf.float32, [None, seqLength, numDimensions])

lstmCell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)
lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob=0.85)
value, _ = tf.nn.dynamic_rnn(lstmCell, input_data, dtype=tf.float32)

weight = tf.Variable(tf.truncated_normal([lstmUnits, numClasses]))
bias = tf.Variable(tf.constant(0.1, shape=[numClasses]))
value = tf.transpose(value, [1, 0, 2])
last = tf.gather(value, int(value.get_shape()[0]) - 1)
prediction = (tf.matmul(last, weight) + bias)

Here, the author uses tf.contrib.rnn.BasicLSTMCell() to create a basic (unidirectional) LSTM cell, and then uses the tf.nn.dynamic_rnn() function to obtain the LSTM output at each time step together with the final state.

tf.contrib.rnn.BasicLSTMCell() takes parameters such as: num_units, the number of units in the LSTM cell; and reuse, whether to reuse variables in an existing scope (the default is None).

For tf.nn.dynamic_rnn(), there are also some parameters we need to set:

  1. The first is an instance of an RNN cell; here it is lstmCell (see the code above).
  2. The second is the RNN input, which should have the shape [batch_size, time_step, numDims].
  3. The return value of tf.nn.dynamic_rnn() is a tuple (outputs, state), where outputs is a tensor of shape [batch_size, time_step, lstm_output_dim] and state is the final state. For an LSTM cell, the final state holds the cell state and the hidden state, each of shape [batch_size, lstm_state_dim].

Since the outputs have shape [batch_size, time_step, lstm_output_dim], getting the output of the last time step requires transposing them first and then taking the last time_step, as follows:

value = tf.transpose(value, [1, 0, 2])
last = tf.gather(value, int(value.get_shape()[0]) - 1)
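
The effect of the transpose followed by the gather can be mimicked with nested Python lists. This is only an illustration with made-up numbers (batch of 2, 3 time steps, 2 features); it does not use TensorFlow.

```python
# outputs[batch][time] is a feature vector; here batch=2, time=3, features=2
outputs = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]],
]

# Counterpart of transpose(value, [1, 0, 2]): make time the leading axis
time_major = [list(step) for step in zip(*outputs)]

# Counterpart of gather(value, num_steps - 1): pick the last time step
last = time_major[len(time_major) - 1]
print(last)  # [[0.5, 0.6], [1.5, 1.6]]

# Same result without transposing: take the last step of each sequence
assert last == [sequence[-1] for sequence in outputs]
```

The result has one feature vector per example in the batch, i.e. shape [batch_size, lstm_output_dim], which is what the final matrix multiplication expects.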

From there it is a multi-class classification problem, optimized in the same way as in the first section.

Question Classification with Bi-LSTM

In this part, the author performs a question pair classification task on the newly released Quora dataset. The data looks like this:


[Figure: sample rows of the Quora question pair dataset]

is_duplicate=0 means the two questions are different; is_duplicate=1 means they share the same meaning.

[Figure: a question pair merged into a single tensor]

The author uses the structure above to merge a question pair into a single tensor. Take the first question, “How can I be a good geologist?”: it has 8 tokens (seven words plus the question mark). Each token is represented as a 50-D feature vector (the author used 50-D GloVe word vectors), so the question is an 8×50 tensor. The second question, “What should I do to be a great geologist”, is represented as a 10×50 tensor, so the merged pair is an 18×50 tensor. The text is also cleaned beforehand, e.g.:

import re

def cleanSentences(string):
    # Non-string entries (e.g. missing values) become a blank sentence
    if not isinstance(string, str):
        return " "
    string = string.lower()
    string = re.sub('([.,!?()])', r' \1 ', string) # Separates punctuation from the word
    return string

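The function above lowercases the text and separates punctuation from words, so that each punctuation mark becomes its own token when the sentence is split.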
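
As a quick check of the token count mentioned earlier, the sketch below applies the same regular expression to the first example question. It is a standalone Python 3 variant written for this illustration; the name clean_sentence is not from the tutorial.

```python
import re

def clean_sentence(text):
    """Lowercase the text and put spaces around . , ! ? ( ) so they split off as tokens."""
    if not isinstance(text, str):
        return " "
    return re.sub(r"([.,!?()])", r" \1 ", text.lower())

tokens = clean_sentence("How can I be a good geologist?").split()
print(tokens)       # ['how', 'can', 'i', 'be', 'a', 'good', 'geologist', '?']
print(len(tokens))  # 8
```

The question mark ends up as a separate token, which is why the question is represented by 8 vectors rather than 7.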

Data Construction

The input and target arrays are allocated as follows:

numClasses = 2
X = np.zeros((numTrainExamples + numTestExamples, maxSeqLength), dtype='int64')
Y = np.zeros((numTrainExamples + numTestExamples, numClasses), dtype='int32')

In order to save memory, X has the shape (nExamples, maxSeqLength) rather than (nExamples, maxSeqLength, word2vecDim): it stores one integer word index per position instead of a full word vector. Y has the shape (nExamples, 2) and is one-hot encoded; following the code below, [0, 1] means is_duplicate=1 and [1, 0] means is_duplicate=0 (a binary classification problem). The following code constructs X and Y:

exampleCounter = 0
for index, row in df.iterrows():
    firstQuestion = cleanSentences(row['question1'])
    secondQuestion = cleanSentences(row['question2'])
    firstQuestionSplit = firstQuestion.split()
    secondQuestionSplit = secondQuestion.split()
    indexCounter = 0
    for word in firstQuestionSplit:
        try:
            X[exampleCounter][indexCounter] = wordsList.index(word)
        except ValueError:
            X[exampleCounter][indexCounter] = 399999 #Vector for unknown words
        indexCounter = indexCounter + 1
    for word in secondQuestionSplit:
        try:
            X[exampleCounter][indexCounter] = wordsList.index(word)
        except ValueError:
            X[exampleCounter][indexCounter] = 399999 #Vector for unknown words
        indexCounter = indexCounter + 1
    if (row['is_duplicate'] == 1):
        Y[exampleCounter] = [0,1]
    else:
        Y[exampleCounter] = [1,0]
    exampleCounter = exampleCounter + 1
#Save both matrices to disk
np.save('Data/xMatrix.npy', X)
np.save('Data/yMatrix.npy', Y)
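
The word-to-index step can also be sketched in isolation. Note that this example swaps in a dictionary lookup instead of the list.index() call with try/except used in the tutorial, and it adds truncation at the maximum length, which the tutorial code does not do. The miniature vocabulary, the unknown_index value and the encode() helper are all hypothetical.

```python
# Hypothetical miniature vocabulary; the tutorial's word list is far larger
words_list = ["how", "can", "i", "be", "a", "good", "geologist", "?"]
unknown_index = len(words_list)  # stand-in for the tutorial's 399999

# A dict gives constant-time lookups, unlike list.index()
word_to_index = {word: i for i, word in enumerate(words_list)}

def encode(tokens, max_seq_length):
    """Map tokens to indices, padding with 0 and truncating at max_seq_length."""
    row = [0] * max_seq_length
    for position, token in enumerate(tokens[:max_seq_length]):
        row[position] = word_to_index.get(token, unknown_index)
    return row

row = encode(["how", "can", "i", "be", "a", "great", "geologist", "?"], 10)
print(row)  # [0, 1, 2, 3, 4, 8, 6, 7, 0, 0]
```

The word "great" is not in the miniature vocabulary, so it maps to the unknown index, and the unused positions at the end stay at 0.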

The construction loop fills each row of X with the word indices of one question pair; a lookup table then maps those indices to real word vectors, which form the input to the Bi-LSTM system. The Bi-LSTM model is constructed as follows:


labels = tf.placeholder(tf.float32, [batchSize, numClasses])
input_data = tf.placeholder(tf.int32, [batchSize, maxSeqLength])
keep_prob = tf.placeholder(tf.float32)
seq_len = tf.placeholder(tf.int32, [None])

data = tf.Variable(tf.zeros([batchSize, maxSeqLength, numDimensions]),dtype=tf.float32)
data = tf.nn.embedding_lookup(wordVectors,input_data)

lstm_fw_cell = tf.contrib.rnn.BasicLSTMCell(lstmUnits, forget_bias=1.0, state_is_tuple=True)
lstm_bw_cell = tf.contrib.rnn.BasicLSTMCell(lstmUnits, forget_bias=1.0, state_is_tuple=True)
lstm_fw_cell = tf.contrib.rnn.DropoutWrapper(lstm_fw_cell, keep_prob)
lstm_bw_cell = tf.contrib.rnn.DropoutWrapper(lstm_bw_cell, keep_prob)
value, states = tf.nn.bidirectional_dynamic_rnn(cell_fw=lstm_fw_cell, cell_bw=lstm_bw_cell, inputs=data,sequence_length=seq_len, dtype=tf.float32)

value = tf.concat(value, 2)
hiddenUnits = 32

weight = tf.Variable(tf.truncated_normal([2*lstmUnits, hiddenUnits]))
bias = tf.Variable(tf.constant(0.1, shape=[hiddenUnits]))
value = tf.transpose(value, [1, 0, 2])
last = tf.gather(value, int(value.get_shape()[0]) - 1)
fc1 = (tf.matmul(last, weight) + bias)

weight2 = tf.Variable(tf.truncated_normal([hiddenUnits, numClasses]))
bias2 = tf.Variable(tf.constant(0.1, shape=[numClasses]))
prediction = (tf.matmul(fc1, weight2) + bias2)

correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)

sess = tf.InteractiveSession()
saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())

for i in range(iterations):
    #Next batch of question pairs
    nextBatch, nextBatchLabels = getBatch()
    train_seq_len = np.ones(batchSize) * maxSeqLength
    sess.run(optimizer, {input_data: nextBatch, labels: nextBatchLabels, keep_prob: 0.7, seq_len: train_seq_len})
    if (i % 100 == 0):
        #The evaluated op is assumed to be accuracy, matching the variable name
        trainingAccuracy = sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels, keep_prob: 1.0, seq_len: train_seq_len})
        print('The training accuracy at iteration %d is %g' % (i, trainingAccuracy))

After the tf.nn.embedding_lookup(wordVectors, input_data) layer, the input of shape [batch_size, maxSeqLength] becomes a tensor of shape [batch_size, maxSeqLength, wordDim], which can be regarded as the input data to the system.
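
The shape change produced by the embedding lookup can be illustrated with nested lists. The sketch below uses a made-up table of four 3-D vectors instead of the 50-D GloVe vectors, and the embedding_lookup function here is a plain-Python stand-in, not the TensorFlow op.

```python
# Hypothetical 4-word table with 3-D vectors (the tutorial uses 50-D GloVe vectors)
word_vectors = [
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

def embedding_lookup(table, index_batch):
    """Replace every index in a [batch, seq_len] list with its row of the table."""
    return [[table[i] for i in sequence] for sequence in index_batch]

input_indices = [[1, 3, 0], [2, 2, 1]]  # shape [2, 3]
embedded = embedding_lookup(word_vectors, input_indices)

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)           # (2, 3, 3)
print(embedded[0][1])  # [0.7, 0.8, 0.9]
```

Every integer index is replaced by its vector, so a [batch, seq_len] input gains a third dimension of size wordDim.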

The most interesting part here is the tf.nn.bidirectional_dynamic_rnn() function, which takes both a forward LSTM cell and a backward LSTM cell:

  1. The first parameter is an instance of an RNN cell, used for the forward direction.
  2. The second parameter is also an instance of an RNN cell, used for the backward direction.
  3. The third parameter is the input, with shape [batch_size, time_step, wordDim].
  4. The fourth parameter is sequence_length, a vector of shape [batch_size] that holds the actual length of each sequence in the batch; in this example every entry is simply set to max_seq_length.
  5. The last parameter is dtype, the data type of the initial state and outputs; it is optional when initial states are provided.

tf.nn.bidirectional_dynamic_rnn() returns a tuple (outputs, output_states). Here outputs is itself a tuple (output_fw, output_bw), where output_fw is a tensor of shape [batch_size, max_seq_len, cell_fw.output_size] and output_bw is a tensor of shape [batch_size, max_seq_len, cell_bw.output_size]. output_states is also a tuple (output_state_fw, output_state_bw), which gives the final forward and backward states of the Bi-LSTM.

Because the returned outputs are a tuple, they still need to be concatenated with tf.concat(outputs, 2) (along the third dimension) for further computation. The optimization procedure is the same as before (a multi-class classification problem).
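
Concatenating along the third dimension joins the forward and backward feature vectors at every time step. The plain-Python sketch below shows this on tiny made-up values; concat_last_axis is a hypothetical helper written for the illustration.

```python
def concat_last_axis(output_fw, output_bw):
    """Join two [batch, time, units] nested lists along the feature axis."""
    return [
        [fw_step + bw_step for fw_step, bw_step in zip(fw_seq, bw_seq)]
        for fw_seq, bw_seq in zip(output_fw, output_bw)
    ]

# batch=1, time=2, units=2 in each direction
output_fw = [[[1, 2], [3, 4]]]
output_bw = [[[5, 6], [7, 8]]]

merged = concat_last_axis(output_fw, output_bw)
print(merged)  # [[[1, 2, 5, 6], [3, 4, 7, 8]]]
```

The feature size doubles, which is why the first weight matrix in the model has 2*lstmUnits rows.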


This repository gives some interesting applied machine learning toy examples that aim to teach the most popular deep learning concepts using the TensorFlow framework. Here, I have tried to explain how to build the models and how some important built-in functions work. Hopefully it makes sense and gives you a better understanding of TensorFlow.

Author: Shawn Yan | Editor: Qintong Wu
