Two stream convolutional network for predicting social matches in linkedin-data

Code associated with this post can be found in two-stream-cnn.


In this post we will explore the feasibility of using a two-stream convolutional network to predict user-to-user interest in a small social network (Lunchback), using only text (LinkedIn description and tags) as input.
The objective is to create a recommendation engine that, for each user h in the network, can recommend a set of users g_{i} that h is likely to want to interact with.

In more general terms, we’ll be exploring some simple examples of using a two-stream convolutional network to learn a similarity (matching) function f(X_1, X_2) that describes some (potentially complex) relation between the sets of data X_1 and X_2. The input data could be anything from image data to natural text input. The relations in turn could be anything ranging from a fuzzy identity function (are these the same object?) to something complex like "do these two commenters agree or not?".

We will touch upon a couple of concepts used in more modern approaches to natural language processing and try to piece it all together towards a ranking engine. The approach will be vanilla flavored, and rather than pushing it to the maximum by investigating multiple network architectures (and optimizing model parameters), the focus will be on identifying pitfalls, reasoning around these and getting an overall understanding of the approach.

All code can be found in this GitHub repo, although the only dataset provided is the one for the MNIST example. The main dataset belongs to Lunchback and may be added at a later stage in a hashed form.



  • Tensorflow (framework of choice)
  • Gensim (nlp library for the text-preprocessing, word2vec)
  • Pandas (data frames)
  • Numpy (numerical calculations and math operations)

Lunchback and the dataset

The social network

Lunchback is a Swedish app startup founded in 2015 and launched in 2016. It is still in its early stages of development but has a proven small-scale track record for its concept: connecting professionals over business lunches.

This is how it works:


  • Members sign up with their LinkedIn profile and input additional information about their skill-set, interests, and the types of profiles they wish to come in contact with. You might for example be a tech entrepreneur looking for investors and developers or a marketing professional looking for legal advice.
  • Members open up lunch spots in their calendar. Basically telling the network “Hey, I’m open for lunches this and that day. Send me requests!”
  • Members can search through the profiles and request to invite someone to lunch. The user that makes the request pays for the lunch.


Building a dataset

We have some choices in what we wish to predict. For instance, we could go for predicting matches of the form: User1 requests User2, and User2 accepts, i.e. user combinations that have led to lunches. We could also go for predicting User1 requests User2 (regardless of the response). The second alternative trains our model to find relevant alternatives that could catch the user’s interest, while the first alternative also optimizes towards alternatives that are likely to end up in a final match.

I chose to go for the less limiting alternative of predicting requests at this stage, mainly in order not to limit the size of the dataset further (although a qualified majority of all requests lead to full matches). This gives us a directed graph where each edge represents ‘User1 requests User2’:

The dataset is rather small (~1200 requests – only positive class) compared to the tens of thousands up to millions of training examples one usually encounters in similar tasks. This will of course bring its own challenges, which will be shown. You have reason to be skeptical.

Another challenge is that we only have the positive class, i.e. cases where a lunch request was sent (indicating user interest). We don’t have any indication of disinterest. With n users in the network, we have n(n-1) potential edges in our graph. Assuming that all requests that were not sent are indicators of irrelevance is just naive. We will inevitably include some noise when building the negative class.

With that said, as a first approach we build the negative class by sampling randomly from the set of user-pairs that do not have the above property of ‘User1 requested User2’. We end up with a balanced dataset of ~2250 examples.


  • positive class: ~1125 user-pairs (request-sent)
  • negative class: ~1125 user-pairs (random-pairs)
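
The random construction of the negative class described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name sample_negative_pairs and its arguments are hypothetical, and it assumes that pairs are ordered (requester, requested) tuples and that no positive pair is a self-pair.

```python
import random

def sample_negative_pairs(users, positive_pairs, n_samples, seed=0):
    """Draw ordered user pairs uniformly, skipping positives and duplicates."""
    rng = random.Random(seed)
    positive = set(positive_pairs)
    # Assumption: every positive pair is a valid (u != v) pair over `users`.
    max_pairs = len(users) * (len(users) - 1) - len(positive)
    if n_samples > max_pairs:
        raise ValueError("not enough non-positive pairs to sample from")
    negatives = set()
    while len(negatives) < n_samples:
        u, v = rng.sample(users, 2)  # two distinct users, order matters
        if (u, v) not in positive:
            negatives.add((u, v))  # the set silently drops duplicates
    return sorted(negatives)
```

Note that nothing in this sampling respects how active each user is in the positive class, which is exactly the weakness discussed later in the post.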

The data

For each user, there is a text string which is a concatenation of their Linkedin headline information (usually their current role description, department and company) and a series of ‘skill’-tags labeling their interests and experience. We concatenate the different strings for each user with the tags first and then the role information. In the table below one such user-pair data point is shown (after a preprocessing step where we have removed special characters etc).


design management marketing resourcing teambuilding it business investment leadership entrepreneurship sw development product owner project manager ericsson ab and cs project manager lunchback ab product developmentbusiness marketing management teambuilding finance law it inception marketing how to bootstrap a startup lean startup methodology idea generation lean startup approach public speaking marketing strategy growth growth hacking startup life startup marketing big ideas ceo at lunchback


Two stream network architecture

I will not go deeply into the details of TensorFlow, convolutional neural networks (CNN’s) or the theory of word embeddings in this post. Readers that want some more background information could start here and here. For a great post on using CNN’s for text-classification, check implementing-a-cnn-for-text-classification-in-tensorflow. For a good intro to the general aspects of convolutional networks, read

We’ll save the parts on word embeddings and text preprocessing for later, since the first example implementation will not require any text-related stuff, and will revolve around training a similarity function on the MNIST dataset. The overall architecture will be the same for the text input and the MNIST example; in both cases the data will be arranged as 2d arrays, which in turn will be arranged in arrays containing all training examples. How to turn natural text input into 2d arrays will be shown further down.


There are a number of ways to set up the architecture for extracting features from both streams and combining them into a common decision stream. One way is to feed the inputs X_1, X_2 as two channels into the same convolutional layer. This limits us in a major sense though; the output from the layer is processed as a linear combination of the two channels (with no guarantee that this is a useful relation in the general case). On the positive side, it reduces the number of parameters trained and thus the risk of over-fitting. We can also reduce the parameter set by employing a parameter/feature-sharing scheme between the streams. For clarity and simplicity I choose to go with two parallel streams with separate sets of weights, and rely on regularization for controlling over-fitting.




  1. Two parallel convolutional layers to extract features from each input-stream. Two sets of parameters to train.
  2. Two parallel pooling layers (max-pooling) to down-sample the feature set.
  3. A shared dropout layer for the concatenated pooling layer output.
  4. A fully connected layer that takes flattened 1d-tensor with concatenated outputs from the previous layers.
  5. Another dropout layer as a part of the regularization strategy.
  6. An output layer with shape (num-fully-connected-neurons, num-classes)
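
To make the layer list above concrete, the tensor shapes can be traced with a few lines of arithmetic. This is only a sketch under stated assumptions: ‘SAME’ padding in both the convolution and the pooling, convolution stride 1, pooling stride equal to ksize, and a hypothetical 32 filters per stream (the post does not state the filter count).

```python
import math

def same_out(size, stride):
    # With 'SAME' padding the spatial output size is ceil(size / stride).
    return math.ceil(size / stride)

def two_stream_shapes(height, width, num_filters, pool, hidden, num_classes):
    # Steps 1-2: each stream is convolved (stride 1) and then max-pooled.
    conv_h, conv_w = same_out(height, 1), same_out(width, 1)
    pool_h, pool_w = same_out(conv_h, pool[0]), same_out(conv_w, pool[1])
    per_stream = pool_h * pool_w * num_filters
    # Steps 3-4: the two pooled outputs are concatenated and flattened.
    flat = 2 * per_stream
    return {
        "conv": (conv_h, conv_w, num_filters),
        "pool": (pool_h, pool_w, num_filters),
        "flat": flat,
        "fc_weights": (flat, hidden),           # step 4
        "out_weights": (hidden, num_classes),   # step 6
    }

# MNIST-sized input; num_filters=32 is an assumed value.
shapes = two_stream_shapes(28, 28, num_filters=32, pool=(2, 2),
                           hidden=512, num_classes=2)
```

Under the same assumptions, a 50×32 text input pooled with ksize 2×32 would give a pooled shape of (25, 1, num_filters) per stream.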


We build three Python classes: one for creating the convolutional layers, one for the fully connected ones and one for the dropout layer.

Convolutional layers

The ‘convLayer’ class is initialized with a few parameters defining its properties. The most important here is its shape, where the first two elements hold the shape of the convolutional filters (the window that scans over our two-dimensional data input). The third element holds the number of input channels (had it been color images, each image would be a collection of 3 color channels). In our case, there will be one channel only. The fourth and last element holds the number of filters we are applying, i.e. the number of feature extractors we use to scan over our 2d data volume. Further, strides determines how we increment the filters in our scan over the 2d data, and padding determines how we handle the edges. See the links above if you need to clear up the details.

The class handles everything layer-related, such as initializing the weights and biases and applying the nonlinear activation. We will go with the relu activation function, which takes the linear output z_i = W_{ij}x_j + b_i from neuron i and returns a_i = max(0, z_i).
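
What such a layer computes can be illustrated with a naive NumPy version. This is a sketch for intuition only, not the TensorFlow-based class from the post: the function name is hypothetical, the channel dimension is dropped (single-channel input), and it assumes stride 1 with ‘SAME’-style zero padding.

```python
import numpy as np

def conv2d_same_relu(x, w, b):
    """x: (H, W) input, w: (kh, kw, num_filters), b: (num_filters,)."""
    kh, kw, num_filters = w.shape
    # Zero-pad so that the output keeps the input shape (stride 1).
    ph, pw = kh - 1, kw - 1
    xp = np.pad(x, ((ph // 2, ph - ph // 2), (pw // 2, pw - pw // 2)),
                mode="constant")
    out = np.zeros(x.shape + (num_filters,))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            window = xp[i:i + kh, j:j + kw]
            # z = sum(window * filter) + bias, for all filters at once
            out[i, j] = np.tensordot(window, w, axes=([0, 1], [0, 1])) + b
    return np.maximum(out, 0.0)  # relu: a = max(0, z)
```

The explicit loops are far too slow for training; they are there only to show how the filter window, the bias and the relu fit together.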

Max-pooling layers

The max-pooling layers down-sample the output from the convolutional layers by scanning the output volume with a window of shape ksize and returning the maximum value for each such window. This reduces the number of inputs to the following layer, which has a regularizing effect and helps reduce training time by reducing the number of parameters to be trained in the following layers. For a quick summary, check this Quora question.
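
A minimal NumPy sketch of the pooling operation, assuming non-overlapping windows (stride equal to ksize) and dropping any rows or columns that do not fill a whole window; the function name max_pool is illustrative:

```python
import numpy as np

def max_pool(x, ksize):
    """x: (H, W, C) feature maps; ksize: (kh, kw) window shape."""
    kh, kw = ksize
    h, w, c = x.shape
    # Keep only the part of the input that is covered by whole windows.
    x = x[:h - h % kh, :w - w % kw]
    # Split both spatial axes into (number of windows, window size).
    blocks = x.reshape(h // kh, kh, w // kw, kw, c)
    # Take the maximum inside each window, separately per channel.
    return blocks.max(axis=(1, 3))
```

With a 2×2 window, a 28×28 feature map becomes 14×14, so the fully connected layer that follows sees a quarter of the inputs.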

Fully connected layers (with dropout)

The fully connected layer is initialized by its shape = [dimensionality of its input, number of hidden units]. The dimensionality of the input is determined by the preceding layer, and the number of hidden units is a parameter to be chosen when deciding on the network architecture. In our experiments, the number of hidden units will be set to between 128 and 512 for the layer following the convolutional layers, and 2 for the output layer (corresponding to the two output classes).
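
As a rough illustration of a fully connected layer followed by dropout, here is a NumPy sketch. It assumes a relu activation and the common "inverted dropout" convention, where surviving units are rescaled by 1 / keep_prob; the function name and signature are hypothetical.

```python
import numpy as np

def dense_relu_dropout(x, w, b, keep_prob, rng):
    """x: (batch, n_in), w: (n_in, n_hidden), b: (n_hidden,)."""
    a = np.maximum(x @ w + b, 0.0)
    if keep_prob >= 1.0:
        return a  # evaluation mode: dropout is switched off
    # Zero each unit with probability 1 - keep_prob and rescale the
    # survivors so that the expected activation stays the same.
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob
```

During training one would pass keep_prob = 0.5, and keep_prob = 1.0 when evaluating on the validation set.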

The network

We put our layers together, instantiate the layer classes and build our network in a function. The output of the output layer and the sum of the regularization terms are returned. The network function takes data in the form of tensors (multidimensional arrays) as input, together with a dictionary of parameters controlling some aspects of the architecture (filter shapes, strides, number of neurons in the fully connected layers etc.). The keep_prob argument is used to control the dropout layers.



In order to train our network, we need a couple more things, such as a cost function to minimize. There is some freedom in choosing which cost (loss) function to use, but nowadays most standard implementations tend to use the cross-entropy loss function. For a deeper dive into the motivation, check out this section in Michael Nielsen’s web-book on deep learning.
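
For reference, the cross-entropy loss over softmax outputs can be written in a few lines of NumPy. This is a generic sketch of the loss itself (with one-hot labels assumed), not the TensorFlow op used in the post:

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """logits, labels: (batch, num_classes); labels are one-hot rows."""
    # Subtract the row-wise max before exponentiating, for stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Mean negative log-probability assigned to the true class.
    return float(-(labels * log_probs).sum(axis=1).mean())
```

A network that outputs equal logits for both classes gets a loss of ln 2 ≈ 0.693, which is a useful baseline to compare the first training steps against.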

We further need to compute gradients (in a separate step, for visualization) and an optimizer to perform the stochastic gradient descent step. We collect all such computations inside the optimization_ops function. Its arguments include the network output ‘y’ and the true class labels ‘Y’ for computing the cost function, and it returns the train step to be executed on the TensorFlow graph.



Next, we put all steps together. We set our parameters controlling various aspects of the architecture, training and regularization. We initialize placeholders for the input data, call the network and optimization_ops functions and initiate some performance metrics.

In order to be able to visualize various training related quantities such as loss, accuracy and the weights associated with the different network layers in Tensorboard (TensorFlow’s visualization tool), we define a number of summaries for the information to be exported. Run Tensorboard from your command-line interface:

Final training loop

The data input is contained in the data object and is fed batch-wise to the TensorFlow session. The training batches are constructed by defining a generator that takes in training data as well as arguments determining the batch size and the number of epochs to run the training process.

The batch generator.
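
A minimal sketch of such a generator, assuming the two input streams and the labels are NumPy arrays of equal length. The name batch_iter and the per-epoch reshuffling are assumptions for illustration, not necessarily what the repo does:

```python
import numpy as np

def batch_iter(x1, x2, y, batch_size, num_epochs, seed=0):
    """Yield (x1, x2, y) mini-batches, reshuffling the data every epoch."""
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(num_epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # The same indices are used for both streams and the labels,
            # so the pairs stay aligned. The last batch may be smaller.
            yield x1[idx], x2[idx], y[idx]
```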

  1. For each batch (a training step) data is sent into the session (the TF-graph), an optimization step is performed (train_step from optimization_ops), summaries are calculated and exported.
  2. For each test_interval, train and test cost and accuracy are printed on screen (saving them in arrays for potential future use is not implemented).
  3. For each save_interval, the current state of the model is saved (to be used in early stopping, future predictions and to resume training, in the case when training has to be stopped temporarily).

The modelSaver class sets up a directory for keeping saved models and saves the current state.

Training loop

Note that at this point we’re just collecting summaries for the training data. It’s a simple step to add a summary export step for the test runs too.

Test of concept: Matching MNIST-digits

We started out with a concept, an architecture and a goal of creating a recommendation engine based on text input. The last couple of parts of this post have been about putting together the code for running the training algorithm and executing the neural network. We have so far stopped short of the natural language processing that was promised in the beginning. The reason is that the architecture under investigation is a general one, and it is instructive to test it out on a different dataset (a completely different type of data altogether), just to have a controlled environment with well-known attributes in which to gather some knowledge of the general problem: training a similarity function over two sets of data.

We’ll run through this part quickly so we can move on. Up until now, all code is completely general and can, with some adjustments to the parameters, be used in either project. Bear with me and we’ll do some NLP very soon.


The MNIST dataset is a preprocessed dataset containing images of handwritten digits and a common starting point for anyone wishing to explore the field of image recognition. Vast amounts of example code, blog posts, scientific papers and chapters in computer vision / deep learning material have been devoted to the dataset. The basic concept is the same: given the dataset of handwritten digits (images), train a model that learns to predict the true label (the digit each image represents).

We will do things differently, and build a setup where we take pairs of MNIST images as input, feed them to our two-stream network and try to predict whether the two images are of the same class (whether the true labels match). This will be done without the algorithm ever knowing what the true labels are, just whether they match or not.

Note that this is for demonstration purposes only. In reality, faced with such a similarity function, the straightforward way would be to just run the two images through a single-stream prediction setup and then match the predicted labels instead. The value of this approach becomes apparent when the similarity function is complex and not separable into two single-stream evaluations.

The MNIST-dataset and the complementary array

We start by importing the data, using TensorFlow’s input_data for the MNIST tutorials. We reshape the images into 28×28 pixel images and feed them together with the labels into generate_complement in order to produce our final data input (explained below). Finally, the data object containing the train and validation datasets is created.

Starting with X_1 as the original images, we create a balanced dataset (class balance is controlled by the class_prob argument) with one-hot-encoded labels and a second image array X_2. X_2 has the property of containing, at the same index as X_1, either an image of the same digit (not the same image per se, but a different image of the same digit) or an image of a different digit. We’ll run the experiment with class_prob = 0.5.
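
The pairing logic can be sketched as below. This is a hypothetical stand-in written for illustration (the name make_pairs is not from the repo, and the real generate_complement may differ). It assumes every digit occurs at least twice, that at least two different digits are present, and that a match is encoded as [1, 0].

```python
import numpy as np

def make_pairs(images, labels, class_prob=0.5, seed=0):
    """Return (X1, X2, Y); Y[i] = [1, 0] if X1[i] and X2[i] show the same digit."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    x2 = np.empty_like(images)
    y = np.zeros((n, 2), dtype=np.float32)
    positions = np.arange(n)
    for i in range(n):
        same = rng.random() < class_prob
        if same:
            # Candidates: other images (not image i itself) of the same digit.
            pool = np.flatnonzero((labels == labels[i]) & (positions != i))
        else:
            # Candidates: any image of a different digit.
            pool = np.flatnonzero(labels != labels[i])
        x2[i] = images[rng.choice(pool)]
        y[i] = [1, 0] if same else [0, 1]
    return images, x2, y
```

The digit labels are only used while building the pairs; the network itself only ever sees X1, X2 and the match / no-match label Y.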

Finally, we store everything in an object for easy retrieval in training and evaluation.

Example data with one-hot-encoded labels.



We run the training with parameters similar to those of the first layer in the standard example on the TensorFlow site, but of course with the additions relevant to our setup.

  • Learning rate 0.001
  • Batch size: 100
  • Filter size: 5×5
  • Max-pool ksize: 2×2
  • Number of hidden neurons: 512
  • Number of classes: 2
  • Number of epochs: 20
  • Drop-off probability: 0.5
  • L2-regularization: None

After 20 epochs of training our validation accuracy has reached 97.5%, giving us an error rate of 2.5%. Thinking of this accuracy as the square of the accuracy of a single-stream classifier (with matching performed on its predictions), the result is quite good (in the competing method, we’d have to multiply the accuracies of the two independent runs). With a_s^2 = a_t, where a_s is the accuracy of a normal single-stream classifier and a_t that of our two-stream version, this would be equivalent to a_s = \sqrt{a_t} = \sqrt{0.975} ≈ 0.987. With an additional set of convolutional and max-pool layers, this could possibly be bumped up to close to the state of the art (for CNNs).


Note that these results are calculated on the validation set, and not on the provided standard test set, but with the validation set completely held out (no hyperparameter-tuning), the results should not be prone to changes.

Predicting the social matches

Let’s come back to our main project. Now, instead of sending in images of digits, we want to send in text strings containing the LinkedIn header information (position, team, profession, company) and the tags (skills, interests, experience) of the users. This data is in text form, and naturally we’ll need to process and transform it in a way that makes it suitable for a convolutional network. The approach will be to transform the text into a vector representation using word embeddings (word2vec).

Word embeddings

I will not go through the technical details of word embeddings and the different word2vec algorithms, but the interested reader might find useful material here, here, here and here. But to get a taste of it: from an algorithm’s perspective, each word in a text can be seen as a reference to a vocabulary. The word ‘engine’ might map to slot 679 in a vocabulary containing say 30 000 words. A simple representation that distinguishes ‘engine’ from all other words could be an array where all elements are set to ‘0’ except element 679, which is ‘1’. Each word would correspond to a different state in this 30000-dimensional space, all mutually exclusive (in fact orthogonal). A text containing 50 words could thus be represented by a very sparse matrix with 50 rows and 30000 columns. This is inefficient, and it fails to give any hints of structure in the language. It is just a way of distinguishing different words and putting them in a format that can be processed by standard libraries.
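
The sparse one-hot representation is easy to write down for a toy case. The four-word vocabulary below is made up for illustration; a real vocabulary would have tens of thousands of entries.

```python
import numpy as np

def one_hot_text(words, vocab):
    """Return a (len(words), len(vocab)) matrix with a single 1 per row."""
    mat = np.zeros((len(words), len(vocab)), dtype=np.int8)
    for row, word in enumerate(words):
        mat[row, vocab[word]] = 1
    return mat

# Toy vocabulary mapping each word to its slot.
vocab = {"engine": 0, "great": 1, "good": 2, "bad": 3}
m = one_hot_text(["great", "engine", "good"], vocab)
# Rows for different words are orthogonal: m @ m.T is the identity here,
# so 'great' is exactly as far from 'good' as it is from 'engine'.
```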

The first goal of a word embedding is to significantly compactify this space, from 30000 down to say 300 dimensions or even 100 or 30 (the embedding dimension). This is done by allowing the elements of the new vectors to take any real value (which can be scaled down to stay within [0, 1]), thus letting the words represent points in this lower-dimensional real-valued space (rather than the basis vectors of the discrete vocabulary space). The second goal is to preserve some of the semantic structure of the language by arranging the words so that words that occur in similar contexts end up close to each other. Thus words like ‘great’ and ‘good’ should tend to correspond to vectors that are close (in Euclidean distance) but far away from the word ‘bad’. There are in fact a bunch of geometrical effects occurring in this space that can be used for analysis of semantics in various NLP tasks.


We read the datasets (insert your own here if you want to try something else). Remember that one set includes pairs of strings where User1 has made an active request to User2, and the other consists of random pairs of user strings. The data is read into Pandas data frames, merged, and cleaned.

Next come the functions for performing the cleaning. In this experiment we simply remove all special characters and numbers and convert all letters to lower case.
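
A cleaning step of this kind might look as follows. This is a sketch under assumptions: the function name clean_string is illustrative, and the choice to keep non-ASCII letters (such as Swedish å, ä, ö) is mine; the original functions may treat them differently.

```python
import re

def clean_string(text):
    """Replace special characters and digits with spaces, then lower-case."""
    # [\W\d_] matches anything that is not a letter: punctuation, digits,
    # underscores and whitespace. Each run collapses to a single space.
    return re.sub(r"[\W\d_]+", " ", text).strip().lower()
```

For example, clean_string("CEO @ Lunchback AB, since 2016!") gives "ceo lunchback ab since".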


We need to set a constant pre-defined string length (number of words) for the data, since the convolutional network will expect a constant input shape. The distribution is quite fat-tailed, with most strings containing between 10 and 40 words. A few have over 100 words in them. More specifically, setting a ‘MAXLENGTH’ of 50 words will only truncate about 10% of the strings. The reduce_strings function below will also split each string on the space character into numpy arrays.

Next we generate the word2vec models and perform the transformation. The models are created using the word2vec procedure in the Gensim library (a large Python NLP library), which is called via the genwordvecs function.
The only arguments we will use are the embedding dimension and the minimum number of occurrences required for a word to be included (it is otherwise removed, which clears out some typos and specific names).

With the word2vec model in place, the next step is to transform our arrays of words into arrays of word vectors and to pad all of them to the same length. Remember that we decided on a maximum number of words for the input earlier. For all cases with fewer words, we’ll simply fill up with vectors of zeros.
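
The truncate-and-pad transformation can be sketched like this. Here vectors stands for any mapping from a word to its embedding vector (a plain dict in the test below, as a stand-in for a trained word2vec lookup), and the function name is hypothetical. Dropping out-of-vocabulary words before truncating is an assumption.

```python
import numpy as np

def to_padded_matrix(text, vectors, max_length, dim):
    """Map a cleaned string to a (max_length, dim) array of word vectors."""
    # Split on spaces, drop words without a vector, truncate to max_length.
    words = [w for w in text.split(" ") if w in vectors][:max_length]
    # Rows that are never filled stay zero and act as padding.
    out = np.zeros((max_length, dim), dtype=np.float32)
    for row, word in enumerate(words):
        out[row] = vectors[word]
    return out
```

Stacking the result for every user string gives the [number of examples, number of words, embedding dimension] arrays that are fed to each stream.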

We transform the data and produce the final dataset.

Training and Results

The dataset now holds arrays of shape [number of examples, number of words, embedding dimension]. One often encounters embedding dimensions up to 300. The results here are based on trying out an embedding in a 32-dimensional space (which is comparatively low). Definitive investigations of the sensitivity across different tasks and datasets are yet to be published. There are many other freedoms to consider, for example the use of multiple different filter sizes (number of consecutive words to consider) like in this post. For some guidance on how to decide on an architecture, check this sensitivity analysis.

We will go for a single filter size. The padding strategy will be one that ensures that the output from the convolutional layer keeps its original shape (the shape of the 2d data input), denoted as ‘SAME’ in TensorFlow. The convolutional filters will span the whole of the word vectors and run over 4 words at a time (the filter size). Much can be done in terms of optimizing hyper-parameters and architecture, but as mentioned in the beginning, the goal of this post is rather to take a simple approach without too much focus on optimizing the setup to the fullest. A future post might handle that part.


  • Learning rate 0.0001
  • Batch size: 100
  • Filter size: 4×32
  • Max-pool ksize: 2×32
  • Number of hidden neurons: 256
  • Number of classes: 2
  • Number of epochs: 40
  • Drop-off probability: 0.5
  • L2-regularization: 0.5

With this setup we reach an accuracy of 76% after 40 epochs, which is encouraging. But there are things to consider!

Important note 1

Remember how we constructed the dataset; the negative class was not ‘negative’ per se! We just sampled from a uniform distribution over the users twice, once for User1 and once for User2, with the only limitations being that the user-pair combination is not in the positive set and that we do not have duplicates. This is not data constructed by a natural process, and it is far from being created in a way similar to the positive class. Specifically, we fail to take into account the distributions of users over being requesters and being requested. In the positive set, we’ll have a skewed distribution where some users are network centers (our network is not homogeneous) and very active (many requests sent and/or many requests received). This is not true for the negative class, which gives the network freedom to learn specific user attributes. Thus, the network will over-fit to some degree to the validation set even though it has never seen it, just because we have introduced a distinction between the classes that we then transferred to the validation and test sets.


  1. A user that has been very active, either by sending or by receiving requests, will end up (as ‘positive/match/[1,0]’) multiple times in both the test and validation splits.
  2. The network can take a shortcut and to some degree learn the attributes and features of the users in the positive class, giving the impression of learning generalizable features.

With this said, the size of this effect needs further investigation and may turn out to be anywhere between marginal and important. With such a small dataset, extracting any kind of signal must be seen as a success.

Important note 2

I did some experiments to bound this effect from the other extreme by limiting the users in the negative class to the set of users in the positive class, and forcing the distributions over requesters and requested to be the same over the two classes. I did this by re-sampling the social network, moving around the edges in the graph (brute force, by randomly testing thousands of times) so that the new set held completely new user-pairs.


  1. The distribution over the new requesters in the new set was the same as in the positive set.
  2. The distribution over the new requested in the new set was the same as in the positive set.
  3. Exactly the same users in both sets.
  4. None of the new negative pair combinations had occurred in the positive set.
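
The crudest version of such a re-sampling is to keep the requester column fixed, shuffle the requested column, and retry until all four conditions above hold. The sketch below does exactly that; it is an illustration with a hypothetical function name, not the procedure actually used, and on a graph with strong hubs whole-permutation rejection may need to be replaced by local edge swaps to finish in reasonable time.

```python
import random

def resample_negative_edges(positive_edges, max_tries=10000, seed=0):
    """Shuffle the requested side until no pair is a positive, self or duplicate pair."""
    rng = random.Random(seed)
    positive = set(positive_edges)
    requesters = [u for u, _ in positive_edges]
    requested = [v for _, v in positive_edges]
    for _ in range(max_tries):
        # Permuting one column keeps both per-user counts unchanged.
        rng.shuffle(requested)
        candidate = list(zip(requesters, requested))
        valid = (len(set(candidate)) == len(candidate)
                 and all(u != v and (u, v) not in positive
                         for u, v in candidate))
        if valid:
            return candidate
    raise RuntimeError("no valid re-sampling found within max_tries")
```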


Now, this ensures that the basic properties are transferred from the positive class to the negative class, but at the cost of losing all information on which attributes, skill-sets, job positions etc. are in general interesting and popular. It forces the convolutional network to learn only combinations of features from the two users that have led to a request being made. With such a small dataset, this setup should not be expected to generate anything useful, but with some heavy regularization I was able to get at least a small signal even in this case. The accuracy on the validation set peaked slightly above 55%.
I emphasize that this is an extreme approach that only makes sense in a case where the dataset is large and the graph is dense.


Important note 3

Let’s say that, by some tweaking and re-sampling of the negative class (one good approach would be to keep the requesters the same across the two sets, but with true vs randomly sampled requested users), we achieve an accuracy somewhere between the two extremes. Is that a good result? Most likely yes! Accuracy is not the best measure here. Remember, our task at hand is not to determine the matching probability between two randomly selected users, but to create a ranking engine that, among the remaining (n-1) users (or a subset thereof), can recommend a small set of good matches. In this case, accuracies even in the low 50s could give a small set of recommendations with high precision (e.g. the 10 highest-scoring ones). Better metrics for us might be AUC (ranking capability) during validation, and precision over the top 10 highest-ranked matches during final testing.
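
Both metrics are simple to compute from a list of model scores and true labels. The following is a plain-Python sketch with illustrative function names; the AUC uses the pairwise definition (the probability that a random positive is scored above a random negative), which is fine for small validation sets.

```python
def precision_at_k(scores, labels, k):
    """Fraction of true matches among the k highest-scoring candidates."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)

def auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For a recommendation list, scores would be the network’s predicted match probability for each candidate g_{i} given a fixed user h, and k the number of recommendations shown.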


Thanks to Jimmy Zhao and the Lunchback-team for letting me share this approach and the results. This was based on an early round of experiments but a version of the approach may be included in the final deployment.


Feel free to use any images or code-syntax from this post in any way you like, but please link back to this post if you use it in a publication (profit or non profit / blog / comment / etc).



2 Comments on "Two stream convolutional network for predicting social matches in linkedin-data"


Hello, I am reading your tutorial which I found very useful. For fully connected layers, when relu activation is preferred shouldn’t the code be:

activation = tf.nn.dropout(tf.nn.relu(self.layer(x_in)), keep_prob)
instead of
activation = tf.nn.dropout(tf.sigmoid(self.layer(x_in)), keep_prob) ?

Please excuse my ignorance if I am mistaken, I am new in Deep learning.
Thank you!


Good catch! Thank you.
