Character-level text classification: CNN

The code associated with this post can be found in text-classification.


We will implement a convolutional neural network in Keras for character-level text classification. The implementation comes in two flavors: one with an embedding layer that transforms the 70-dimensional one-hot encoded input arrays down to 32-dimensional real-valued arrays, and one without the embedding. The setup will follow the steps and architecture outlined in Character-level Convolutional Networks for Text Classification, but as a shallower implementation consisting of three convolutional layers. The code in the associated repository can easily be adapted for deeper implementations, although parameter tuning will pose a challenge beyond the scope of this post.

Notable libraries

  1. Keras (TensorFlow backend)
  2. Pickle
  3. Pandas
  4. Numpy


The code is general enough to be used with a wide range of datasets, but in the current post we will use the Rotten Tomatoes movie review dataset (included in the repo). The dataset contains 5000 movie reviews, half of positive sentiment and half of negative sentiment (a two-class problem). The task at hand is to classify the input strings into one of the two classes, feeding the network arrays of arrays, each element representing a character from the review.


All preprocessing steps are summarized in one function ‘char_preproc’, which in turn calls a set of other functions (check the code repo for details).

  1. Clean text strings (remove all characters except letters, numbers and basic punctuation, and make into lower-case).
  2. Transform each text string to an array of characters.
  3. Tokenize (transform each character to an integer reference to a vocabulary of 69 unique character entries, represented by integers 1 to 69).
  4. Pad all reviews to the same length (append integer 0s to the end of each array so that each array is exactly 250 elements long; in the rare case of a review containing more than 250 characters, truncate to 250).
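The four steps above can be sketched as follows. This is a minimal illustration, not the repo's actual `char_preproc`: the vocabulary below is a smaller, illustrative character set (the repo uses 69 unique characters), and the function names are assumptions.

```python
import re

# Illustrative vocabulary mapped to integers starting at 1
# (0 is reserved for padding); the repo uses 69 unique characters.
VOCAB = "abcdefghijklmnopqrstuvwxyz0123456789 .,;:!?'\"()-"
CHAR_TO_INT = {c: i + 1 for i, c in enumerate(VOCAB)}

MAX_LEN = 250  # every review is padded/truncated to 250 characters


def clean(text):
    """Lower-case and keep only letters, digits and basic punctuation."""
    text = text.lower()
    return re.sub(r"[^a-z0-9 .,;:!?'\"()-]", "", text)


def tokenize(text):
    """Map each character of the cleaned string to its vocabulary index."""
    return [CHAR_TO_INT[c] for c in clean(text) if c in CHAR_TO_INT]


def pad(tokens, max_len=MAX_LEN):
    """Zero-pad to max_len; truncate longer reviews."""
    tokens = tokens[:max_len]
    return tokens + [0] * (max_len - len(tokens))


encoded = pad(tokenize("A GREAT movie!"))  # 250-element integer array
```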

For the case where we use an embedding layer, the preprocessing ends here. It should be emphasized, though, that the addition of an embedding layer was experimental: I never managed to make it improve performance or reduce training times (only slightly better convergence). With the right parameter tuning and/or pre-trained embeddings, however, one might at least reduce training times by feeding a smaller representation to the convolutional layers.
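For reference, the embedding variant can be sketched like this. The layer sizes follow the dimensions stated above (70 input classes, 32-dimensional output, 250-character reviews); the model structure is a sketch, not the repo's exact code.

```python
from tensorflow.keras import layers, models

# Sketch: integer-encoded input (token values 0-69) is mapped to
# 32-dimensional real-valued vectors before the convolutional layers.
model = models.Sequential([
    layers.Input(shape=(250,)),
    layers.Embedding(input_dim=70, output_dim=32),
    # ...convolutional layers would follow here
])
# Each 250-character review is now represented as a 250 x 32 matrix.
```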

For the default case of running the setup by feeding the input directly into the first convolutional layer, we one-hot encode the data.
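The one-hot step can be sketched with NumPy as below, operating on the padded integer arrays from the preprocessing. Note one assumption: this maps the padding token 0 to the first basis vector, whereas mapping padding to an all-zero row is an equally common choice.

```python
import numpy as np


def one_hot(padded, num_classes=70):
    """Turn integer token arrays (values 0-69) into 70-dim one-hot rows.

    Uses advanced indexing into an identity matrix; the padding token 0
    becomes the first basis vector (an all-zero row is an alternative).
    """
    return np.eye(num_classes, dtype=np.float32)[padded]


batch = one_hot(np.array([[1, 5, 0, 0]]))  # shape: (1, 4, 70)
```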

Loading data and running preprocessing steps

In order not to have to run the preprocessing every time, we use the function ‘load_preprocessed_data’ as our point of entry; it conveniently controls the saving and loading of preprocessed data.
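A cache-or-compute entry point of this kind can be sketched as follows, using Pickle as in the repo. The signature and file path are illustrative assumptions, not the repo's exact API.

```python
import os
import pickle


def load_preprocessed_data(cache_path, preprocess_fn, *args):
    """Load preprocessed data from disk if cached, else compute and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = preprocess_fn(*args)
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data
```

On the first call the preprocessing runs and its result is pickled; subsequent calls with the same cache path skip it entirely.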


The training is done on a setup with three convolutional layers (1D) with 8-character-wide kernels. The first two layers are followed by pooling layers with 2-character-wide filters. One fully connected layer of 1024 neurons with ReLU activation feeds the output layer. Parameters and architecture are controlled in the settings section at the beginning of the file.

The convolutional layers are added through a loop that allows for quick changes in the architecture by editing the settings.
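Such a settings-driven loop can be sketched as below, following the dimensions stated above (three Conv1D layers with kernel size 8, pooling after the first two, a 1024-unit dense layer). The filter counts and variable names are illustrative assumptions, not the repo's exact values.

```python
from tensorflow.keras import layers, models

# Illustrative settings: edit these lists to change the architecture.
CONV_FILTERS = [256, 256, 256]  # three convolutional layers (assumed widths)
KERNEL_SIZE = 8                 # 8-character-wide kernels
POOL_SIZE = 2                   # 2-character-wide pooling filters

model = models.Sequential()
model.add(layers.Input(shape=(250, 70)))  # one-hot encoded characters
for i, n_filters in enumerate(CONV_FILTERS):
    model.add(layers.Conv1D(n_filters, KERNEL_SIZE, activation="relu"))
    if i < len(CONV_FILTERS) - 1:  # pool only after the first two layers
        model.add(layers.MaxPooling1D(POOL_SIZE))
model.add(layers.Flatten())
model.add(layers.Dense(1024, activation="relu"))
model.add(layers.Dense(2, activation="softmax"))  # two-class output
```

Making the architecture deeper is then just a matter of appending entries to `CONV_FILTERS`.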

A cross-entropy loss function and the RMSprop optimizer are used for the optimization, and training is done in batches of 50.
Callbacks such as early stopping (patience of 10: training stops if no improvement in validation accuracy is observed for 10 epochs), model checkpointing and logging (TensorBoard) are added for training control and saving of results.
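Compilation and the callbacks might look like the sketch below. The checkpoint path, log directory and the stand-in model are assumptions for illustration; the repo's training script differs in detail.

```python
from tensorflow.keras import callbacks, layers, models

# Stand-in model so this sketch is self-contained; the real model is the
# convolutional network described above.
model = models.Sequential([
    layers.Input(shape=(250, 70)),
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),
])

# Cross-entropy loss with the RMSprop optimizer, as described above.
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

training_callbacks = [
    # Stop if validation accuracy has not improved for 10 epochs.
    callbacks.EarlyStopping(monitor="val_accuracy", patience=10),
    # Keep the best model seen so far (path is illustrative).
    callbacks.ModelCheckpoint("best_model.keras", save_best_only=True),
    # Log metrics for inspection in TensorBoard.
    callbacks.TensorBoard(log_dir="logs"),
]

# model.fit(x_train, y_train, batch_size=50, epochs=100,
#           validation_data=(x_val, y_val), callbacks=training_callbacks)
```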

The training (current results) was done on an NVIDIA GTX 1080 Ti GPU, and training time was seldom above 15 s per epoch.




Validation accuracy for setups with (blue) and without (purple) embedding layer.

With minimal parameter tuning, the validation set accuracy reaches 72% for the version without the embedding layer and 70% for the version with it, after about 20 epochs (the graph above is smoothed, leading to a shift forward in time). The addition of an embedding layer in this case leads to slightly faster convergence, although the best accuracy is obtained by the version without it. Compared with what can be achieved using, say, a word-level convolutional network (~80%), these results are still a bit behind, but keep in mind that we have been feeding the network character by character and, with only three convolutional layers, are still able to reach quite interesting results!

