Character-level text classification: CNN

The code associated with this post can be found in text-classification.


Introduction

We will be implementing a convolutional neural network in Keras for character-level text classification. The implementation will be done in two flavors: one with an embedding layer transforming the input from 70-dimensional one-hot encoded arrays down to low-dimensional real-valued arrays (16-dimensional in the code shown below), and one without the embedding. The setup will follow the steps and architecture outlined in Character-level Convolutional Networks for Text Classification, but as a shallower implementation consisting of three convolutional layers. The code in the associated repository can easily be adapted for deeper implementations, although parameter tuning will pose a challenge beyond the scope of this post.

Notable libraries

  1. Keras (TensorFlow backend)
  2. Pickle
  3. Pandas
  4. Numpy

Data

The code is general enough to be used with a wide range of datasets, but in the current post we will be using the Rotten Tomatoes movie review dataset (included in the repo). The dataset contains 5000 movie reviews, half of positive sentiment and half of negative sentiment (a two-class problem). The task at hand is to classify the input strings into one of the two classes, feeding the network arrays of arrays, each element representing a character from the review.
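To get a feel for the data before any preprocessing, the CSV can be inspected with Pandas. This is just a quick sketch; the text and label column names are the ones used by the loading code further down.

import pandas as pd

table = pd.read_csv('data/rt-polarity.csv')
print(table.shape)                       # (number of reviews, 2)
print(table.label.value_counts())        # the two classes should be balanced
print(table.text.str.len().describe())   # character-length distribution of the reviews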

Preprocessing

All preprocessing steps are summarized in one function ‘char_preproc’, which in turn calls a set of other functions (check the code repo for details).

  1. Clean text strings (remove all characters except letters, numbers and basic punctuation, and convert to lower case).
  2. Transform each text string to an array of characters.
  3. Tokenize (transform each character to an integer reference to a vocabulary of 69 unique character entries, represented by integers 1 to 69).
  4. Pad all reviews to the same length (append integer 0’s to the end of each array, so that each array is exactly 250 elements long; in the rare case of a review containing more than 250 characters, truncate to 250).

For the case where we use an embedding layer, the preprocessing ends here. It should be emphasized, though, that the addition of an embedding layer was experimental: I never managed to make it improve performance or reduce training times (it only gave slightly faster convergence). With the right parameter tuning and/or pre-trained embeddings, one might at least be able to reduce training times by feeding a smaller representation to the convolutional layers.

For the default case of running the setup by feeding the input directly into the first convolutional layer, we one-hot encode the data.
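The binarize_tokenized helper that performs this step lives in the repo; a minimal sketch of what such a one-hot encoding could look like (same name and signature as the call in char_preproc, but not necessarily the exact repo implementation) is shown here. It turns the padded (num_reviews, 250) token array into a (num_reviews, 250, 70) array, matching the Input(shape=(250, 70)) used in the non-embedding model.

import numpy as np

def binarize_tokenized(seq, vocab_len):
    # seq: int array of shape (num_reviews, 250) with token ids 0..vocab_len-1,
    # where 0 is the padding id
    binarized = np.zeros((seq.shape[0], seq.shape[1], vocab_len), dtype=np.float32)
    positions = np.arange(seq.shape[1])
    for i, review in enumerate(seq):
        # one column per character position; padding ids end up in column 0
        binarized[i, positions, review] = 1.0
    return binarized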

Loading data and running preprocessing steps

In order to avoid running the preprocessing every time, we use the function ‘load_processed_data’ as our point of entry, which conveniently controls the saving and loading of preprocessed data.

import os
import pickle
import numpy as np
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# cleanup_col, binarize_tokenized and Dataset are helpers defined in the repo

def load_processed_data(load=True, binarize=False):
    table = None

    if os.path.isfile('data/processed/data-ready.pkl') and load:
        print("data exists - loading")

        with open('data/processed/data-ready.pkl', 'rb') as file:
            data = pickle.load(file)
    else:
        print("reading raw data and preprocessing..")
        table = pd.read_csv('data/rt-polarity.csv')
        data = char_preproc(table.text, table.label, 70, binarize)

        with open('data/processed/data-ready.pkl', 'wb') as file:
            pickle.dump(data, file)

    return (data, table)

def char_preproc(X, Y, vocab_len, binarize=False):
    # -----------------------------
    # preproc X's------------------

    # cleanup
    X = cleanup_col(X, numbers=True)
    # split in arrays of characters
    char_arrs = [[x for x in y] for y in X]

    # tokenize
    tokenizer = Tokenizer(char_level=True)
    tokenizer.fit_on_texts(char_arrs)

    # token sequences
    seq = tokenizer.texts_to_sequences(X)

    # pad to same length
    seq = pad_sequences(seq, maxlen=250, padding='post', truncating='post', value=0)

    # one-hot encode if requested
    if binarize:
        X = binarize_tokenized(seq, vocab_len)
    else:
        X = seq

    # ----------------------------
    # preprocess Y's and return data

    # one-hot encode Y's
    Y = np.array([[1, 0] if x == 1 else [0, 1] for x in Y])

    # generate and return final dataset
    data = Dataset(X, Y, shuffle=True, testsize=0.1)

    return data

Training

The training is done with a setup of three 1D convolutional layers with 8-character-wide kernels. The first two are followed by max-pooling layers with a pool size of 2. One fully connected layer containing 1024 neurons with a ReLU activation feeds the output layer. Parameters and architecture are controlled in the settings section at the beginning of the file.

import os
import numpy as np
from datetime import datetime

from keras.models import Model
from keras.layers import Input, Embedding, Conv1D, MaxPooling1D, Flatten, Dense, Dropout
from keras import regularizers
from keras.optimizers import RMSprop
from keras.callbacks import TensorBoard, EarlyStopping, ModelCheckpoint

# settings ---------------------
# ------------------------------

EMBEDDING = True
TYPE = 'embedding' if EMBEDDING else 'standard'
MODELPATH ='models/char-conv-' + TYPE + '-{epoch:02d}-{val_acc:.3f}-{val_loss:.3f}.hdf5'
FILTERS = 500
LR = 0.0001 if EMBEDDING else 0.00001

CONV = [
    {'filters':500, 'kernel':8, 'strides':1, 'padding':'same', 'reg': 0, 'pool':2},
    {'filters':500, 'kernel':8, 'strides':1, 'padding':'same', 'reg': 0, 'pool':2},
    {'filters':500, 'kernel':8, 'strides':1, 'padding':'same', 'reg': 0, 'pool':''}
]



# generate dataset -------------
# ------------------------------

data, table = load_processed_data(False, not EMBEDDING)
print("input shape: ", np.shape(data.x_train))

The convolutional layers are added through a loop that allows for quick changes in the architecture by editing the settings.

# model architecture ------------------------------------------
# -------------------------------------------------------------


# input and embedding ----------
# ------------------------------

if EMBEDDING:

    inputlayer = Input(shape=(250,))
    network = Embedding(70, 16, input_length=250)(inputlayer)

else:
    inputlayer = Input(shape=(250 ,70))
    network = inputlayer

# convolutional layers ---------
# ------------------------------

for C in CONV:

    # conv layer
    network = Conv1D(filters=C['filters'], kernel_size=C['kernel'], \
                     strides=C['strides'], padding=C['padding'], activation='relu', \
                     kernel_regularizer=regularizers.l2(C['reg']))(network)

    if type(C['pool']) != int:
        continue

    # pooling layer
    network = MaxPooling1D(C['pool'])(network)

# fully connected --------------
# ------------------------------
network = Flatten()(network)
network = Dense(1024, activation='relu')(network)
network = Dropout(0)(network)

# output
ypred = Dense(2, activation='softmax')(network)

A cross-entropy loss function and the RMSprop optimizer are used for the optimization, and the training is done in batches of 50.
Callbacks for early stopping (patience of 10: training stops if no improvement in validation accuracy is observed for 10 epochs), model checkpointing and logging (TensorBoard) are added for training control and saving of results.

The training (current results) was done on an NVIDIA GTX 1080 Ti GPU, and training time was seldom above 15 s per epoch.

# training ----------------------------------------------------
# -------------------------------------------------------------


# callbacks --------------------
# ------------------------------

# tensorboard

TB_DIR = 'logs/' + datetime.now().strftime("%Y-%m-%d %H:%M:%S") + '_' + TYPE

os.makedirs(TB_DIR)
tensorboard = TensorBoard(log_dir=TB_DIR)

# early stopping and checkpoint
estopping = EarlyStopping(monitor='val_acc', patience=10)
checkpoint = ModelCheckpoint(filepath=MODELPATH, save_best_only=True)

# model-------------------------
# ------------------------------

optimizer = RMSprop(lr=LR)


model = Model(inputs=inputlayer, outputs=ypred)
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer,
              metrics=['acc'])

print(TB_DIR)
print(model.summary())

# fit and run ------------------
# ------------------------------
try:
    hist = model.fit(data.x_train,
                     data.y_train,
                     validation_data=(data.x_test, data.y_test),
                     epochs=500,
                     batch_size=50,
                     shuffle=False,
                     verbose=2,
                     callbacks=[checkpoint, estopping, tensorboard])

except KeyboardInterrupt:    
    print("training terminated by user")

 

Results

 

Validation accuracy for the setup with (blue) and without (purple) an embedding layer.

With minimal parameter tuning, the validation set accuracy reaches 72% for the version without an embedding layer and 70% for the version with one, after about 20 epochs (the graph above is smoothed, which shifts the curves forward in time). The addition of an embedding layer leads, in this case, to slightly faster convergence, although the best accuracy is obtained by the version without it. Compared with what can be achieved using, say, a word-level convolutional network (~80%), these results are still a bit behind, but one must keep in mind that we have been feeding the network character by character, and with only three convolutional layers we are still able to reach quite interesting results!
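Once training has finished, the best checkpoint can be reloaded for evaluation or inference. A minimal sketch; the checkpoint filename below is hypothetical and only follows the MODELPATH template from the settings:

from keras.models import load_model

# re-create the one-hot encoded data and load a saved checkpoint
data, _ = load_processed_data(False, True)
model = load_model('models/char-conv-standard-20-0.720-0.610.hdf5')  # hypothetical filename

loss, acc = model.evaluate(data.x_test, data.y_test, verbose=0)
print("held-out accuracy: {:.3f}".format(acc))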

Two-stream convolutional network for predicting social matches in LinkedIn data

Code associated with this post can be found in two-stream-cnn.


Introduction

In this post we will explore the feasibility of using a two-stream convolutional network to predict user-to-user interest in a small social network (Lunchback), using only text (LinkedIn description and tags) as input.
The objective is to create a recommendation engine that, for each user h in the network, can recommend a set of users g_{i} that h is likely to want to interact with.

In more general terms, we’ll be exploring some simple examples of using a two-stream convolutional network for learning a similarity (matching) function f(X_1, X_2) that describes some (potentially complex) relation between the sets of data X_1 and X_2. The input data could be anything from image data to natural text. The relations, in turn, could range from a fuzzy identity function (are these the same object?) to something complex like ‘do these two commenters agree or not’.


We will touch upon a couple of concepts used in more modern approaches to natural language processing and try to piece it all together towards a ranking engine. The approach will be vanilla-flavored: rather than pushing it to the maximum by investigating multiple network architectures (and optimizing model parameters), the focus will be on identifying pitfalls, reasoning around these, and getting an overall understanding of the approach.

All code can be found in this Github repo, although the only provided dataset is the one for the MNIST example. The main dataset belongs to Lunchback and may be added at a later stage in a hashed form.

Inspiration

Libraries

  • TensorFlow (framework of choice)
  • Gensim (NLP library used for the text preprocessing / word2vec)
  • Pandas (data frames)
  • Numpy (numerical calculations and math operations)

Lunchback and the dataset

The social network

Lunchback is a Swedish app startup founded in 2015 and launched in 2016. It is still in its early stages of development, but with a proven small-scale track record of its concept: connecting professionals over business lunches.

This is how it works:

 

  • Members sign up with their LinkedIn profile and input additional information about their skill set, interests, and the types of profiles they wish to come in contact with. You might, for example, be a tech entrepreneur looking for investors and developers, or a marketing professional looking for legal advice.
  • Members open up lunch spots in their calendar, basically telling the network: “Hey, I’m open for lunches this and that day. Send me requests!”
  • Members can search through the profiles and request to invite someone to lunch. The user that makes the request pays for the lunch.

 


Building a dataset

We have some choices in what we wish to predict. For instance, we could go for predicting matches of the form ‘User1 requests User2, and User2 accepts’, i.e. basically user combinations that have led to lunches. We could also go for predicting ‘User1 requests User2’ (regardless of the response). The second alternative trains our model to find relevant alternatives that could catch the user’s interest, while the first alternative also optimizes towards alternatives that are likely to end up in a final match.

I chose to go for the less limiting alternative of predicting requests at this stage, mainly in order not to limit the size of the dataset further (although a clear majority of all requests lead to full matches). This leaves us with a directed graph where each edge represents ‘User1 requests User2’:


The dataset is rather small (~1200 requests – only positive class) compared to the tens of thousands up to millions of training examples one usually encounters in similar tasks. This will of course bring its own challenges, which will be shown. You have reason to be skeptical.

Another challenge is that we only have the positive class, i.e. cases where a lunch request was sent (indicating user interest). We don’t have any indication of disinterest. With n users in the network, we have n(n-1) potential edges in our graph. Assuming that all requests that were not sent are indicators of irrelevance would be naive, so we will inevitably include some noise when building the negative class.

With that said, as a first approach we build the negative class by sampling randomly from the set of user pairs that do not have the above property of ‘User1 requested User2’. We end up with a balanced dataset of ~2250 examples (a rough sketch of this sampling step follows the class breakdown below).

 

  • positive class: ~1125 user pairs (request sent)
  • negative class: ~1125 user pairs (random pairs)
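As a rough sketch of this sampling step (the user1_id/user2_id column names are hypothetical; the repo may structure the request data differently):

import numpy as np
import pandas as pd

def sample_negative_pairs(requests, n_samples, seed=0):
    # requests: data frame of observed 'User1 requests User2' edges,
    # with hypothetical columns user1_id and user2_id
    rng = np.random.RandomState(seed)
    users = pd.unique(requests[['user1_id', 'user2_id']].values.ravel())
    positive = set(zip(requests.user1_id, requests.user2_id))

    negatives = set()
    while len(negatives) < n_samples:
        u1, u2 = rng.choice(users, size=2, replace=False)
        if (u1, u2) not in positive:      # only keep directed pairs we have never observed
            negatives.add((u1, u2))       # the set also removes duplicate draws

    return pd.DataFrame(sorted(negatives), columns=['user1_id', 'user2_id'])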

The data

For each user, there is a text string which is a concatenation of their LinkedIn headline information (usually their current role description, department and company) and a series of ‘skill’ tags labeling their interests and experience. We concatenate the different strings for each user with the tags first and then the role information. In the table below, one such user-pair data point is shown (after a preprocessing step where we have removed special characters etc.).

 

user1_string:
design management marketing resourcing teambuilding it business investment leadership entrepreneurship sw development product owner project manager ericsson ab and cs project manager lunchback ab product development

user2_string:
business marketing management teambuilding finance law it inception marketing how to bootstrap a startup lean startup methodology idea generation lean startup approach public speaking marketing strategy growth growth hacking startup life startup marketing big ideas ceo at lunchback

 

Two stream network architecture

I will not go deeply into the details of TensorFlow, convolutional neural networks (CNN’s) or the theory of word embeddings in this post. Readers that want some more background information could start here and here. For a great post on using CNN’s for text-classification, check implementing-a-cnn-for-text-classification-in-tensorflow. For a good intro to the general aspects of convolutional networks, read http://cs231n.github.io/convolutional-networks/.

We’ll save the parts on word embeddings and text preprocessing for later, since the first example implementation will not require anything text-related and will revolve around training a similarity function on the MNIST dataset. The overall architecture will be the same for both the text input and the MNIST example; in both cases the data will be arranged as 2-d arrays, which in turn will be arranged in arrays containing all training examples. How to turn natural text input into 2d-arrays will be shown further down.


Layers

There are a number of ways to set up the architecture for extracting features from the two streams and combining them into a common decision stream. One way is to feed the inputs X_1, X_2 as two channels into the same convolutional layer. This limits us in a major sense though; the output from the layer is processed as a linear combination of the two channels (with no guarantee that this is a useful relation in the general case). On the positive side, it reduces the number of parameters trained and thus the risk of over-fitting. We can also reduce the parameter set by employing a parameter/feature-sharing scheme between the streams. For clarity and simplicity I choose to go with two parallel streams with separate sets of weights, and rely on regularization for controlling over-fitting.

 

 


 

  1. Two parallel convolutional layers to extract features from each input-stream. Two sets of parameters to train.
  2. Two parallel pooling layers (max-pooling) to down-sample the feature set.
  3. A shared dropout layer for the concatenated pooling layer output.
  4. A fully connected layer that takes a flattened 1d tensor with the concatenated outputs from the previous layers.
  5. Another dropout layer as a part of the regularization strategy.
  6. An output layer with shape (num-fully-connected-neurons, num-classes)

 

We build three Python classes: one for the convolutional layers, one for the fully connected layers and one for the max-pooling layers (which also applies dropout to its output).

Convolutional layers

The ‘convLayer’ class is initialized with a few parameters defining its properties. The most important is its shape, where the first two elements hold the shape of the convolutional filters (the window that scans over our two-dimensional data input). The third element holds the number of channels we input (had it been color images, each image would be a collection of 3 color channels); in our case, there will be one channel only. The fourth and last element holds the number of filters we are applying, i.e. the number of feature extractors we use to scan over our 2d data volume. Further, strides determines how we increment the filters in our scan over the 2d data, and padding how we handle the edges. See the links above to clear up the details if you need.

The class handles everything-layer related such as initiating the weights and biases and applying the nonlinear activation. We will go with the relu-activation-function, which takes the linear output z_i = W_{ij}x_j + b_i from neuron i and returns a_i = max(0, z_i).

import numpy as np
import tensorflow as tf


class convLayer(object):
    '''
    A convolutional layer with a 4d input and 4d output volume
    shape = [nrow, ncol, channels, nfilters]
    strides = [1, nrow, ncol, 1]
    '''

    def __init__(self, shape, strides, name, padding="SAME"):
        self.shape = shape
        self.strides = strides
        self.name = name
        self.padding = padding

        # set weight var
        winit = tf.truncated_normal(shape, stddev=0.1)
        self.weight = tf.Variable(winit, name="w_{}".format(name))
        # set bias var
        binit = tf.constant(0.1, shape=[shape[-1]])
        self.bias = tf.Variable(binit, name="b_{}".format(name))

    def layer(self, x_in):
        weighted = tf.nn.conv2d(x_in, self.weight, strides=self.strides,
                                padding=self.padding)
        weighted_bias = tf.add(weighted, self.bias)

        return weighted_bias

    def layer_relu(self, x_in):
        # layer activation with relu
        activation = tf.nn.relu(self.layer(x_in))
        return activation

Max-pooling layers

The max-pooling layers down-sample the output from the convolutional layers by scanning the output volume with a window of shape ksize and returning the maximum value for each such window. This reduces the number of inputs to the following layer, has a regularizing effect, and helps reduce training time by reducing the number of parameters to be trained (by reducing complexity in the following layers). For a quick summary, check this Quora question.

class maxPool(object):
    def __init__(self, ksize, strides, padding = 'SAME'):
        self.ksize = ksize
        self.strides = strides
        self.padding = padding

    def pool(self, x, keep_prob):
        pooled = tf.nn.dropout(tf.nn.max_pool(x, ksize=self.ksize, strides=self.strides,
                                padding = self.padding), keep_prob)
        self.outdim = pooled.get_shape().as_list()
        self.numout = np.prod(self.outdim[-3:])
        return pooled

Fully connected layers (with dropout)

The fully connected layer is initialized by its shape = [dimensionality of its input, number of hidden units]. The dimensionality of the input is determined by the preceding layer, and the number of hidden units is a parameter to be decided when choosing the network architecture. In our experiments, the number of hidden units will be set to between 128 and 512 for the layer following the convolutional layers, and to 2 for the output layer (corresponding to the two output classes).

class fullLayer(object):
    '''
    A traditional fully connected layer with a 2d input and 2d output volume
    '''

    def __init__(self, shape, name):
        self.shape = shape
        self.name = name
        # set weight var
        winit = tf.truncated_normal(shape, stddev=0.1)
        self.weight = tf.Variable(winit, name="w_{}".format(name))
        # set bias var
        binit = tf.constant(0.1, shape=[shape[-1]])
        self.bias = tf.Variable(binit, name="b_{}".format(name))

    def layer(self, x_in, flat=True):
        if flat:
            x_in = tf.reshape(x_in, [-1, self.shape[0]])

        weighted = tf.matmul(x_in, self.weight)
        weighted_bias = tf.add(weighted, self.bias)

        return weighted_bias

    def layer_relu(self, x_in, keep_prob=1):
        # layer activations with relu and dropout
        activation = tf.nn.dropout(tf.nn.relu(self.layer(x_in)), keep_prob)
        return activation

    def layer_sigmoid(self, x_in, keep_prob=1):
        # layer activations with sigmoid and dropout
        activation = tf.nn.dropout(tf.sigmoid(self.layer(x_in)), keep_prob)
        return activation

The network

We put our layers together, instantiate the layer classes and build our network in a function. The output-layer activations and the sum of the regularization terms are returned. The network function takes data in the form of tensors (multidimensional arrays) as input, together with a dictionary of parameters controlling some aspects of the architecture (filter shapes, strides, number of neurons in the fully connected layers etc.). The keep_prob argument is used to control the dropout layers.

def network(x1, x2, keep_prob, params):
    
    x1 = tf.expand_dims(x1, 3)
    x2 = tf.expand_dims(x2, 3)
    
    # convolutional ---------------------------------             
    # -----------------------------------------------
    
    with tf.name_scope("convlayer_1"):
        # convolutional layer ----------------------
        # initialize convolutional layer
        convlayer1 = convLayer(shape=params['conv']['shape'],
                               strides=params['conv']['strides'], name="conv", padding='SAME')
        h_conv1 = convlayer1.layer_relu(x1)
    
        
    with tf.name_scope("poollayer_1"):
        # pooling layer ----------------------------
        poollayer1 = maxPool(ksize=params['pool']['ksize'], strides=params['pool']['strides'])
        h_pool1 = poollayer1.pool(h_conv1, keep_prob)
    
        print("Pool layer 1 output shape: {}".format(h_pool1.get_shape()))
    # -----------------------------------------------------------------------------------
    
    with tf.name_scope("convlayer_2"):
        # convolutional layer ----------------------
        # initialize convolutional layer
        convlayer2 = convLayer(shape=params['conv']['shape'],
                              strides=params['conv']['strides'], name="conv", padding='SAME')
        h_conv2 = convlayer2.layer_relu(x2)
        
    
    with tf.name_scope("poollayer_2"):
        # pooling layer ----------------------------
        poollayer2 = maxPool(ksize=params['pool']['ksize'], strides=params['pool']['strides'])
        h_pool2 = poollayer2.pool(h_conv2, keep_prob)
        
        print("Pool layer 2 output shape: {}".format(h_pool2.get_shape()))
    # fully connected -------------------------------             
    # -----------------------------------------------
    
    # concatenate output of the two parallel conv layers      
    full_in = tf.concat(3, [h_pool1, h_pool2])
    num_full_in = poollayer1.numout + poollayer2.numout
    
    
    with tf.name_scope("fullayer_combined"):
        # fully connected layer---------------------
        # initialize hidden fully connected layer
        fullayer = fullLayer(shape=[num_full_in, params['full']], name="full")
        h_full = fullayer.layer_relu(full_in, keep_prob)
        
    
    with tf.name_scope("outlayer"):
        # output layer -----------------------------
        # initialize output layer
        outlayer = fullLayer(shape=params["vout"], name="out")
        v_out = outlayer.layer(h_full)
        # ------------------------------------------

        l2_loss = tf.nn.l2_loss(convlayer1.weight) + tf.nn.l2_loss(convlayer2.weight) + \
                  tf.nn.l2_loss(fullayer.weight) + tf.nn.l2_loss(outlayer.weight)

    return v_out, l2_loss

 

Optimization

In order to train our network, we need a couple more things, like a cost function to minimize. There is some freedom in choosing which cost (loss) function to use, but nowadays most standard implementations tend to use the cross-entropy loss function. For a deeper dive into the motivation, check out this section in Michael Nielsen’s web book on deep learning.
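Concretely, the cost minimized below is the batch-averaged cross-entropy between the softmax of the network output and the one-hot labels, plus the weighted L2 term:

cost = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i} y_{n,i} \log \hat{y}_{n,i} + \lambda \cdot l2\_loss

where \hat{y}_n is the softmax of the network output for example n, y_n the one-hot label, N the batch size and \lambda the reg_param passed to optimization_ops.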

We further need to compute gradients (in a separate step, for visualization) and an optimizer to perform the stochastic gradient descent step (here via the Adam optimizer). We collect all such computations inside the optimization_ops function. Its arguments include the network output ‘y’ and the true class labels ‘Y’ for computing the cost function, and it returns the train step to be executed on the TensorFlow graph.

def optimization_ops(y, Y, l2_loss, reg_param = 0.0, learningrate = 0.001):
    '''Generates optimization related operations'''
    with tf.name_scope('cost'):
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y, Y)) + reg_param * l2_loss

    with tf.name_scope('optimization'):
        optimize = tf.train.AdamOptimizer(learningrate)
        grads_vars = list(zip(tf.gradients(cost, tf.trainable_variables()), tf.trainable_variables()))
        train_step = optimize.apply_gradients(grads_and_vars=grads_vars)
 
    return(cost, grads_vars, train_step)

 

Training

Next, we put all steps together. We set our parameters controlling various aspects of the architecture, training and regularization. We initialize placeholders for the input data, call the network and optimization_ops functions and initiate some performance metrics.

NETWORK_PARAMS = {
    "conv" : {"shape" : [FILTER_HEIGHT, FILTER_WIDTH, 1, NUM_FILTERS], "strides" : [1,1,1,1]},
    "pool" : {"ksize" : [1,POOL_HEIGHT,POOL_WIDTH,1], "strides" : [1,POOL_HEIGHT,POOL_WIDTH,1]}, 
    "full" : NUM_HIDDEN_UNITS,
    "vout" : [NUM_HIDDEN_UNITS, 2]
}

# ---------------------------------------------------
# build network -------------------------------------


# input train vars
with tf.name_scope('input'):
    x1 = tf.placeholder(tf.float32, [None, 28, 28], name="X1") # add dim for multiple channels
    x2 = tf.placeholder(tf.float32, [None, 28, 28], name="X2")
    y  = tf.placeholder(tf.float32, [None, 2], name="Y")

    keep_prob = tf.placeholder(tf.float32)

with tf.name_scope('network'):
    yh, l2_loss = network(x1, x2, keep_prob, NETWORK_PARAMS)

with tf.name_scope('opt-ops'):
    cost, grads_vars, train_step = optimization_ops(yh, y, l2_loss, REG_PARAM, LEARNING_RATE)

with tf.name_scope('metrics'):        
    # accuracy        
    correct_prediction = tf.equal(tf.argmax(yh, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    
    # auc
    pr = tf.cast(tf.argmax(tf.nn.softmax(yh), 1),tf.float32)
    yl = tf.cast(tf.argmax(y, 1),tf.float32)
    auc, update_op = tf.contrib.metrics.streaming_auc(pr,yl)

In order to visualize various training-related quantities, such as loss, accuracy and the weights associated with the different network layers, in TensorBoard (TensorFlow’s visualization tool), we define a number of summaries for the information to be exported. Run TensorBoard from your command-line interface:

tensorboard --logdir ./summaries

# ---------------------------------------------------
# set tensorboard ops  ------------------------------

init_all = tf.initialize_all_variables()
init_loc = tf.initialize_local_variables()

# cost summary
tf.scalar_summary('cost', cost)

# acc summary
tf.scalar_summary('accuracy', accuracy)

# weights and biases summary
for var in tf.trainable_variables():
    tf.histogram_summary(var.name, var)

# gradient summary
for grad, var in grads_vars:
    tf.histogram_summary(var.name + '/gradient', grad)

merged_summaries = tf.merge_all_summaries()

Final training loop

The data input is contained in the data object and is fed batch-wise to the TensorFlow session. The training batches are constructed by defining a generator that takes the training data as well as arguments determining the batch size and the number of epochs to run the training process.

batches = two_stream_batches(data.x1_train, data.x2_train, data.y_train, 100, 10)

The batch generator.

def two_stream_batches(X1, X2, Y, batch_size, num_epochs, shuffle=True):

    data_size = len(Y)
    num_batches_per_epoch = int(data_size / batch_size) + 1

    for epoch in range(num_epochs):
        # Shuffle the data at each epoch
        if shuffle:
            shuffle_indices = np.random.permutation(np.arange(data_size))
            X1 = X1[shuffle_indices]
            X2 = X2[shuffle_indices]
            Y = Y[shuffle_indices]

        for batch_num in range(num_batches_per_epoch):
            start_index = batch_num * batch_size
            end_index = min((batch_num + 1) * batch_size, data_size)

            yield X1[start_index:end_index], X2[start_index:end_index], Y[start_index:end_index]

  1. For each batch (a training step) data is sent into the session (the TF-graph), an optimization step is performed (train_step from optimization_ops), summaries are calculated and exported.
  2. For each test_interval, train and test cost and accuracy are printed on screen (and saved in arrays for potential future use – not implemented).
  3. For each save_interval, the current state of the model is saved (to be used in early stopping, future predictions and to resume training, in the case when training has to be stopped temporarily).

The modelSaver class sets up a directory for keeping saved models and saves the current state.

class modelSaver(object):
    def __init__(self, mdir, max_save):
        self.mdir = mdir
        if not os.path.exists(mdir):
            os.makedirs(mdir)
        self.saver = tf.train.Saver(max_to_keep=max_save)
        
    def save(self, sess, name, current_step, print_info):
        savestring = self.mdir + name 
        self.saver.save(sess, savestring, global_step=current_step)
        if print_info:
            print("Checkpoint saved at: {}".format(self.mdir))

Training loop

Note that at this point we’re just collecting summaries for the training data. It’s a simple step to add a summary export for the test runs too (see the sketch after the loop).

with tf.Session() as sess:
    # initialize variables, sumaries and checkpoint saver
    sess.run([init_all, init_loc])
    summary_writer = tf.train.SummaryWriter("./summaries", graph=tf.get_default_graph())
    modelsaver = modelSaver('./models/', 100)

    # step counter and buffers for the validation metrics
    # (test_interval, save_interval and KEEP_PROB come from the settings section)
    step = 1
    acc_test, acc_train = [], []

    print("----------------------- Training starts -------------------------")

    batches = two_stream_batches(data.x1_train, data.x2_train, data.y_train, 100, 10)
    
    for batch in batches:
        x1_batch, x2_batch, y_batch = batch
        
        # perform training step
        _, summary = sess.run([train_step, merged_summaries], 
                              feed_dict={x1: x1_batch, x2: x2_batch, y: y_batch, keep_prob: KEEP_PROB})
        # write summaries
        summary_writer.add_summary(summary, step)

        # calculate validation metrics
        if step % test_interval == 0 or step == 1:
            
            te_acc, te_cost = sess.run([accuracy, cost], 
                                        feed_dict={x1: data.x1_test, x2: data.x2_test, y: data.y_test, keep_prob: 1})
            tr_acc, tr_cost = sess.run([accuracy, cost], 
                                        feed_dict={x1: x1_batch, x2: x2_batch, y: y_batch, keep_prob: 1})
            
            acc_test.append(te_acc)
            acc_train.append(tr_acc)
            
            time = datetime.now().replace(microsecond=0)
            
            print("step: {}, {}, train_acc: {}, test_acc: {}, train_loss: {}, test_loss: {}".format(
                step, time,  round(tr_acc, 3), round(te_acc, 3), round(tr_cost, 3), round(te_cost, 3)))
        
        # save checkpoints
        if step % save_interval == 0:
            modelsaver.save(sess, 'latest-run', current_step = step, print_info = False)   
            
        step += 1
        
    # final predictions run on separate set    
    predictions = sess.run(tf.nn.softmax(yh), 
                           feed_dict={x1: data.x1_test, x2: data.x2_test, y: data.y_test, keep_prob: 1})
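As noted above, exporting summaries for the validation runs too is a small addition. One way to do it (a sketch, using the same old-style summary API as the rest of the code):

# a second writer for validation summaries
test_writer = tf.train.SummaryWriter("./summaries/test", graph=tf.get_default_graph())

# inside the `if step % test_interval == 0` block, also fetch merged_summaries
# on the validation feed and write them under the same step
te_acc, te_cost, te_summary = sess.run([accuracy, cost, merged_summaries],
                                       feed_dict={x1: data.x1_test, x2: data.x2_test,
                                                  y: data.y_test, keep_prob: 1})
test_writer.add_summary(te_summary, step)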

Test of concept: Matching MNIST-digits

We started out with a concept, an architecture and a goal of creating a recommendation engine based on text input. The last couple of parts of this post have been about putting together the code for running the training algorithm and executing the neural network. We have so far stopped short of the natural language processing that was promised in the beginning. The reason is that the architecture under investigation is a general one, and it is instructive to test it out on a different dataset (a completely different type of data altogether), just to have a controlled environment with well-known attributes and to gather some knowledge of the general problem: training a similarity function over two sets of data.

We’ll run through this part quickly so we can move on. Up until now, all code is completely general and can, with some adjustments to the parameters, be used in either project. Bear with me and we’ll do some NLP very soon.

Goal

The MNIST dataset is a preprocessed dataset containing images of handwritten digits and a common starting point for anyone wishing to explore the field of image recognition. Vast amounts of example code, blog posts, scientific papers and chapters in computer vision / deep learning material have been devoted to it. The basic concept is always the same: given the dataset of handwritten digits (images), train a model that learns to predict the true label (the digit they represent).

We will do things differently and build a setup where we take pairs of MNIST images as input, feed them to our two-stream network and try to predict whether the two images are of the same class (whether the true labels match). This will be done without the algorithm ever knowing what the true labels are, just whether they match or not.

Note that this is for demonstration purposes only. In reality, faced with such a similarity function, the straightforward way would be to just run the two images through a single-stream prediction setup and then match the predicted labels instead. The value of this approach becomes apparent when the similarity function is complex and not separable into two single-stream evaluations.

The MNIST-dataset and the complementary array

We start by importing the data using TensorFlow’s input_data module for the MNIST tutorials. We reshape the images into 28×28-pixel images and feed them, together with the labels, into generate_complement in order to produce our final data input (explained below). Finally, the data object containing the train and validation datasets is created.

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=False)

X1 = mnist.train.images
X1 = np.reshape(X1, [-1, 28, 28])
Ytemp = mnist.train.labels

# Generate array X2 where each element is either a randomly
# selected non-matching digit or randomly selected matching digit
X2, Y = generate_complement(X1, Ytemp, 0.5)    

data = Dataset(X1, X2, Y, testsize = 0.2,  shuffle = True)

Starting with X_1 as the original images, we create a balanced dataset (class balance is controlled by the class_prob argument) with one-hot-encoded labels and a second image array X_2. X_2 has the property of containing (at the same index as X_1) either an image of the same digit (not the same image per se, but a different image of the same digit) or an image of a different digit. We’ll run the experiment with class_prob = 0.5.

def generate_complement(X, Y, class_prob):

    # select an index whose label in Y matches with p = 1 - class_prob
    # and a different label with p = class_prob (symmetric when class_prob = 0.5)
    new_indicies = [np.random.choice(np.where(Y == x)[0]) 
                 if np.random.uniform() > class_prob 
               else np.random.choice(np.where(Y != x)[0]) for x in Y]

    # generate dataset accordingly
    Ys = np.array([Y[x] for x in new_indicies])
    # one-hot encode classes ([1,0]: match, [0,1]: non-match)
    Ys = np.array([[1,0] if Y[x] == Ys[x] else [0,1] for x in range(len(Y))])
    Xs = np.array([X[x] for x in new_indicies])


    return (Xs, Ys)

Finally storing everything in an object for easy retrieval in training and evaluation.

class Dataset():
    
    def __init__(self, x1, x2, y, testsize = 0.2,  shuffle = True):
        
        leny = len(y)
        
        if shuffle == True:
            si = np.random.permutation(np.arange(leny))
            x1 = x1[si]
            x2 = x2[si]
            y  = y [si]
        
        if type(testsize) == int:
            testindex = testsize
        else:
            testindex = int(testsize*leny)

        self.x1_train = x1[testindex:]; self.x1_test = x1[:testindex]
        self.x2_train = x2[testindex:]; self.x2_test = x2[:testindex]
        self.y_train  = y [testindex:]; self.y_test  = y [:testindex]

Example data with one-hot-encoded labels.

 

Results

We run the training with parameters similar to those of the first layer in the standard example on the TensorFlow site, but of course with the additions relevant to our setup.

  • Learning rate 0.001
  • Batch size: 100
  • Filter size: 5×5
  • Max-pool ksize: 2×2
  • Number of hidden neurons: 512
  • Number of classes: 2
  • Number of epochs: 20
  • Dropout probability: 0.5
  • L2-regularization: None

After 20 epochs of training, our validation accuracy has reached 97.5%, giving us an error rate of 2.5%. Thinking of this accuracy as the square of the accuracy of running a single-stream classifier (and performing the matching on the predictions), the result is quite good (in the competing method, we’d have to multiply the accuracies of the two independent runs). With a_s^2 = a_t, where a_s is the accuracy of a normal single-stream classifier and a_t that of our two-stream version, this would be equivalent to a_s = \sqrt{a_t} = 0.987. With an additional set of convolutional and max-pool layers, this could possibly be bumped up to close to the state of the art (for CNNs).

 

Note that these results are calculated on the validation set, and not on the provided standard test set, but since the validation set is completely held out (no hyperparameter tuning), the results should not be prone to change.

Predicting the social matches

Let’s come back to our main project. Now, instead of sending in images of digits, we want to send in text strings containing the LinkedIn header information (position, team, profession, company) and the tags (skills, interests, experience) of the users. This data is in text form, so naturally we’ll need to process and transform it into a shape suitable for a convolutional network. The approach will be to transform the text into a vector representation using word embeddings (word2vec).

Word embeddings

I will not go through the technical details of word embeddings and the different word2vec algorithms, but the interested reader might find useful material here, here, here and here. To get a taste of it: from an algorithm’s perspective, each word in a text can be seen as a reference to a vocabulary. The word ‘engine’ might map to slot 679 in a vocabulary containing, say, 30 000 words. A simple representation that distinguishes ‘engine’ from all other words would be an array where all elements are set to 0 except element 679, which is 1. Each word would correspond to a different state in this 30000-dimensional space, all mutually exclusive (in fact orthogonal). A text containing 50 words could thus be represented by a very sparse matrix with 50 rows and 30000 columns. This is inefficient, and it fails to give any hints of structure in the language. It is just a way of distinguishing different words and putting them in a format that can be processed by standard libraries.

The first goal of a word embedding is to significantly compactify this space, from 30000 down to, say, 300 dimensions or even 100 or 30 (the embedding dimension). This is done by allowing the elements of the new vectors to take any real value (which can be scaled to stay within [0, 1]), and thus letting the words represent points in this lower-dimensional real-valued space (rather than the basis vectors of the discrete vocabulary space). The second goal is to preserve some of the semantic structure of the language by arranging the words so that words occurring in similar contexts end up close to each other. Thus, words like ‘great’ and ‘good’ should tend to correspond to vectors that are close (in Euclidean distance) but far away from the word ‘bad’. There are in fact a number of geometrical effects occurring in this space that can be used for semantic analysis in various NLP tasks.
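To make this concrete, here is a small self-contained example (a toy corpus, not the Lunchback data) of training a Gensim word2vec model and inspecting the resulting vectors. The model['word'] / most_similar style of access matches the older Gensim versions used in the code below; newer Gensim releases have renamed some of these arguments and accessors.

import gensim

# toy corpus: each document is a list of tokens, just like the user strings later on
docs = [['marketing', 'strategy', 'growth'],
        ['marketing', 'branding', 'growth'],
        ['python', 'tensorflow', 'deep', 'learning'],
        ['python', 'machine', 'learning']]

model = gensim.models.Word2Vec(docs, size=10, min_count=1, workers=1)

vec = model['marketing']                         # a 10-dimensional real-valued vector
print(vec.shape)                                 # (10,)
print(model.most_similar('marketing', topn=2))   # nearest words in the embedding space
                                                 # (not meaningful on a corpus this small)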

Preprocessing

We read the datasets (insert your own here if you want to try something else). Remember that one set includes pairs of strings where User1 has made an active request to User2, and the other contains random pairs of user strings. The data is read into Pandas data frames, merged, and cleaned.

import pandas as pd
import numpy as np
import re
import string  # used by the cleanup functions below
import tensorflow as tf
from tensorflow.contrib import learn
import matplotlib.pyplot as plt
import gensim
from datetime import datetime

# read data
lunchdat_pos = pd.read_csv('/data/matched_strings_requests.csv')
lunchdat_rnd = pd.read_csv('/data/random_strings_requests.csv')

# set target
lunchdat_pos['match'] = 1
lunchdat_rnd['match'] = 0

# create dataset ------------------------------------
# ---------------------------------------------------


# reference copy of dataset as a whole
lunchdat = pd.concat([lunchdat_pos[['user1_string', 'user2_string', 'match']]
                    , lunchdat_rnd[['user1_string', 'user2_string', 'match']]]
                    ).sample(frac=1).reset_index(drop=True)


# clean out special characters
X1 = np.array(cleanup_col(lunchdat.user1_string))
X2 = np.array(cleanup_col(lunchdat.user2_string))

# one-hot encode target array
Y  = np.array(pd.get_dummies(list(lunchdat.match)).as_matrix())

The functions for performing the cleaning. In this experiment we simply remove all special characters and numbers and convert all letters to lower case.

def cleanup_str(st, numbers = False):
    
    st = str(st)

    if numbers == True:
        keep = set(string.letters + string.digits + ' ')    
    else: 
        keep = set(string.letters + ' ')
    
    # clean string
    st = ''.join(x if x in keep else ' ' for x in st)
    # rem multiple spaces
    st = re.sub(' +',' ', st)

    return st.strip().lower()   

# mapper: cleanup a pd column or list of strings
def cleanup_col(col, numbers = False):
    
    col = map(lambda x: cleanup_str(x, numbers = numbers), col)
    return col

 

We need to set a constant, pre-defined string length (number of words) for the data, since the convolutional network expects input of constant shape. The distribution is quite fat-tailed, with most strings containing between 10 and 40 words; a few have over 100 words. More specifically, setting a ‘MAXLENGTH’ of 50 words will only truncate about 10% of the strings. The reduce_strings function below will also split each string on the space character into numpy arrays.
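A quick way to sanity-check that choice is to look at the word-count distribution of the cleaned strings before reduce_strings is applied (a sketch, reusing the X1/X2 arrays from the cleaning step above):

lengths = np.array([len(x.split(' ')) for x in np.concatenate([X1, X2])])
print("90th percentile of string length:", np.percentile(lengths, 90))
print("share truncated at 50 words:", np.mean(lengths > 50))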

# reduce stringlength -------------------------------
# ---------------------------------------------------

MAXLENGTH = 50

# X1 data
X1 = reduce_strings(X1, MAXLENGTH)
# X2 data
X2 = reduce_strings(X2, MAXLENGTH)

def reduce_strings(stringlist, maxlength, return_arrays = True):
    
    # if type(stringlist) != list:
    #    stringlist = list(stringlist)
    
    splitsreduce = [x[0:maxlength] for x in [x.split(' ') for x in stringlist]]
    
    if return_arrays:
        return splitsreduce
    
    shortstrings = [' '.join(x) for x in splitsreduce]
    return shortstrings

Next we generate the word2vec model and perform the transformation. The model is created using the word2vec procedure in the Gensim library (a large Python NLP library) and is built via the genwordvecs function.
The only arguments we will use are the embedding dimension and the minimum number of occurrences of a word required in order to include it (it is otherwise removed – this clears out some typos and specific names).

def genwordvecs(docs, emb_size, try_load=False, minc=1):

    # model name used both for loading and saving
    vmodel_name = "embedding_dim_{}_c_{}".format(emb_size, minc)

    if try_load == True:
        try:
            vmodel = gensim.models.Word2Vec.load('vmodels/'+vmodel_name)
            print('model loaded from disk')

            return vmodel
        except IOError:
            print('error loading model..')
            print('training word embeddings')

    vmodel = gensim.models.Word2Vec(docs, min_count=minc, size = emb_size, workers = 4)
    vmodel.save('vmodels/'+vmodel_name)

    return vmodel

vmodel = genwordvecs(np.concatenate([X1,X2], axis = 0), 
                     emb_size=EMBEDDING_SIZE, 
                     try_load=True, 
                     minc=5)

With the word2vec model in place, the next step is to transform our arrays of words into arrays of word vectors and to pad them all to the same length. Remember that we decided on a maximum number of words for the input earlier; for all cases with fewer words, we simply fill up with vectors of zeros.

def vec2pad(doc, max_length):
    
    doclength, embdim = np.shape(doc)
    # add zeros up the decided sequence length 
    if doclength < max_length:
        s = np.zeros([max_length - doclength, embdim]) 
        doc = np.concatenate((doc,s), axis = 0)
        
        return doc
    elif doclength == max_length:
        
        return doc
    else: 
        print("document is longer that the set max_length")
        
        return doc

def w2v_transform(string_arrays, model, max_length = None):
    
    # removes words that are not in the vocabulary and then transforms to vector form
    v2w_arrays = map(lambda x: model[[y for y in x if y in model]], string_arrays)
    # sets length limit and zero-vectors as padding
    if max_length != None:
        v2w_arrays = map(lambda x: vec2pad(x, max_length), v2w_arrays)
        
    return np.array(v2w_arrays)

We transform the data and produce the final dataset.

# X1 data
X1t = w2v_transform(X1, vmodel, MAXLENGTH)  
# X2 data
X2t = w2v_transform(X2, vmodel, MAXLENGTH)  

# final dataset object
data = Dataset(X1t, X2t, Y, testsize = 0.2,  shuffle = False)

Training and Results

The dataset now holds arrays of shape [number of examples, number of words, embedding dimension]. One often encounters embedding dimensions up to 300; the results here are based on an embedding into a 32-dimensional space (which is comparatively low). Definitive investigations of the sensitivity to this choice across different tasks and datasets have yet to be published. There are many other freedoms to consider, for example the use of multiple different filter sizes (number of consecutive words to consider), as in this post. For some guidance on how to decide on an architecture, check this sensitivity analysis.

We will go for a single filter size. The padding strategy will be one that ensures that the output from the convolutional layer keeps its original shape (the shape of the 2d data input), denoted ‘SAME’ in TensorFlow. The convolutional filters will span the full width of the word vectors and run over 4 words at a time (the filter size). Much can be done in terms of optimizing hyper-parameters and architecture, but as mentioned in the beginning, the goal of this post is to take a simple approach without too much focus on optimizing the setup to the fullest. A future post might handle that part. The parameters used are listed below (a sketch mapping them onto the code’s constants follows the list).

 

  • Learning rate 0.0001
  • Batch size: 100
  • Filter size: 4×32
  • Max-pool ksize: 2×32
  • Number of hidden neurons: 256
  • Number of classes: 2
  • Number of epochs: 40
  • Dropout probability: 0.5
  • L2-regularization: 0.5
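These values map onto the constants used in NETWORK_PARAMS and the training code earlier. A sketch of the corresponding settings block (the number of filters is not stated above, so NUM_FILTERS below is a hypothetical placeholder; the exact settings in the repo may differ):

EMBEDDING_SIZE   = 32       # word2vec embedding dimension
MAXLENGTH        = 50       # maximum number of words per user string
FILTER_HEIGHT    = 4        # words covered by each convolutional filter
FILTER_WIDTH     = 32       # filters span the full embedding dimension
NUM_FILTERS      = 16       # hypothetical placeholder
POOL_HEIGHT      = 2
POOL_WIDTH       = 32
NUM_HIDDEN_UNITS = 256
NUM_EPOCHS       = 40       # passed to two_stream_batches
BATCH_SIZE       = 100      # passed to two_stream_batches
LEARNING_RATE    = 0.0001
KEEP_PROB        = 0.5      # dropout keep probability
REG_PARAM        = 0.5      # L2 regularization weight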

With this setup we reach an accuracy of 76% after 40 epochs, which is encouraging. But there are things to consider!

Important note 1

Remember how we constructed the dataset; the negative class was not ‘negative’ per se! We just sampled from a uniform distribution over the users twice, once for User1 and once for User2, with the only restrictions that the user-pair combination is not in the positive set and that we do not have duplicates. This is not data constructed by a natural process, and it is far from being created in a way similar to the positive class. Specifically, we fail to take into account the distributions of users over being requesters and being requested. In the positive set, we’ll have a skewed distribution where some users are the network centers (our network is not homogeneous) and very active (many requests sent and/or many requests received). But this is not true for the negative class, which gives the network the freedom to learn specific user attributes, and thus: the network will over-fit to some degree to the validation set even though it has never seen it! Simply because we have introduced a distinction between the classes that is then transferred to the validation and test sets.

 

  1. A user that has been very active, either by sending or receiving requests, will end up (as ‘positive/match/[1,0]’) multiple times in both the training and validation splits.
  2. The network can take a shortcut and to some degree learn the attributes and features of the users in the positive class, giving the impression of learning generalizable features.

With this said, the size of this effect needs further investigation and may turn out to be anywhere between marginal and important, and with such a small dataset, extracting any kind of signal must be seen as a success.

Important note 2

I did some experiments to confine this effect at the other extreme by limiting the users in the negative class to the set of users in the positive class, forcing the distributions over requesters and requested users to be the same across the two classes. I did this by re-sampling a social network, moving around the edges in the graph (brute force, by randomly re-testing thousands of times) in such a way that the new set held completely new user pairs.

 

  1. The distribution over the new requesters in the new set was the same as in the positive set.
  2. The distribution over the new requested users in the new set was the same as in the positive set.
  3. Exactly the same users appear in both sets.
  4. None of the new negative pair combinations had occurred in the positive set.

 


Now, this ensures that the basic properties are transferred from the positive class to the negative class, but at the cost of losing all information on what attributes, skill sets, job positions etc. are in general interesting and popular, and it forces the convolutional network to only learn combinations of features from the two users that have led to a request being made. With such a small dataset, this setup should not be expected to generate anything useful, but with some heavy regularization I was able to at least get a small signal even in this case. The accuracy on the validation set peaked slightly above 55%.
I emphasize that this is an extreme approach that only makes sense in a case where the dataset is large and the graph is dense.
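One possible implementation of the brute-force rewiring described above (a sketch with hypothetical user1_id/user2_id columns, not necessarily the procedure actually used):

import numpy as np
import pandas as pd

def rewire_negative_edges(requests, max_tries=100000, seed=0):
    # Build a degree-preserving negative set by permuting the 'requested' column.
    # Both marginal distributions (requesters and requested users) are preserved
    # by construction; we reject permutations that recreate an observed request
    # or produce a self-pair. (For denser graphs, locally swapping only the
    # conflicting edges would converge faster than full rejection sampling.)
    rng = np.random.RandomState(seed)
    positive = set(zip(requests.user1_id, requests.user2_id))

    for _ in range(max_tries):
        shuffled = rng.permutation(requests.user2_id.values)
        pairs = list(zip(requests.user1_id.values, shuffled))
        if all(p not in positive and p[0] != p[1] for p in pairs):
            return pd.DataFrame(pairs, columns=['user1_id', 'user2_id'])

    raise RuntimeError("no valid rewiring found in {} tries".format(max_tries))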

 

Important note 3

Let’s say that by some tweaking and re-sampling of the negative class (one good approach would be to keep the requesters the same across the two sets, but with true vs randomly sampled requested users), we achieve an accuracy somewhere between the two extremes. Is that a good result? Most likely yes! Accuracy is not the best measure here; remember, our task at hand is not to determine the matching probability between two randomly selected user pairs, but to create a ranking engine that, among the remaining (n-1) users (or a subset thereof), can recommend a small set of good matches. In this case, accuracies even in the low 50’s could give a small set of recommendations with high precision (e.g. the 10 highest scoring ones). Better metrics for us would perhaps be AUC (ranking capability) during validation, and precision over the top 10 highest-ranked matches during final testing.
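As an illustration of the last point, here is a minimal sketch (a hypothetical helper, not part of the repo) of computing precision over the k highest-ranked pairs from the softmax output returned at the end of the training loop:

import numpy as np

def precision_at_k(scores, labels, k=10, match_col=0):
    # scores: softmax outputs of shape (n, 2)
    # labels: one-hot true labels of shape (n, 2)
    # match_col: the column that encodes the positive 'match' class
    #            (check how the labels were one-hot encoded above)
    ranked = np.argsort(-scores[:, match_col])[:k]   # indices of the k best-ranked pairs
    return np.mean(labels[ranked, match_col])        # share of true matches among them

# usage sketch: precision_at_k(predictions, data.y_test, k=10)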


Acknowledgements

Thanks to Jimmy Zhao and the Lunchback-team for letting me share this approach and the results. This was based on an early round of experiments but a version of the approach may be included in the final deployment.


Distribution

Feel free to use any images or code-syntax from this post in any way you like, but please link back to this post if you use it in a publication (profit or non profit / blog / comment / etc).

