The use of neural networks in music is not a new topic. In the field of audio retrieval, researchers have tried to use neural networks to model a variety of musical elements such as chords and base frequencies. As the author mentions, people started trying to solve speech recognition problems with neural networks as early as 1943. At that time, however, computational capabilities were not sufficient to yield good results, so the approach never gained popularity. Thanks to GPU computing and the availability of large amounts of data today, we are starting to see good results. The author therefore wanted to run a musical experiment with neural networks, as illustrated in Fig. 1, with the goal of achieving a neural translation of musical style. In this article, the author provides a very detailed analysis of why and how he selected his approach, and presents good results obtained with it.
First, I would like to introduce some background on both the music itself and the technology used for the implementation.
A. Music and Neural Networks
One of Google’s more famous projects, Google Magenta, builds AI composers that use neural networks to generate melodies, and it has produced some groundbreaking results. This demonstrated a successful musical application of neural networks. Google Magenta’s performance convinced the author that neural networks can also make interesting predictions about music.
The author then analyzes two important elements of music: the composition and the performance. The composition focuses on the music itself, meaning the musical notes that define a song. As the author mentions, you can think of this as the sheet music shown in Fig. 2.
But this is just the first thing the musician needs to do. The performance, which means how these notes are played by the performer, can be the soul of the song. Because different performers can perform the exact same song differently, the term musical style is used to describe an individual’s way of playing music.
Musical style is hard to define, because we cannot simply parameterize style the way we can with pitch and notes. If you’ve listened to a lot of classical piano music, it’s very clear that a novice pianist and an experienced pianist produce different ranges of dynamics, meaning variations in the “loudness” of the music. On a piano, a louder note is produced by striking the key harder. In music notation, these dynamics are indicated using Italian letters. These letters are called emoticons. Different people have different feelings, so each has their own emotional performance, which here means a unique set of dynamics. These dynamics can be a very important feature of style. The dynamics notations used in music are shown in Fig. 3.
People can also label a song with a genre such as Classical, Jazz, etc. Genres follow conventions that are tied to specific musical styles, so people can identify a style from some of its dynamics. That means genres can be used to categorize musical styles.
The works in references [3, 4, 5] try to generate the composition part, but not the performance part. So in this article, the author uses the MIDI file format to make the machine perform like a human by adding in dynamics.
The author designed and trained a model to synthesize the dynamics for sheet music, and showed two performances of the same sheet music: one performed by a human and the other generated by the machine. This is a blind test to see whether you can tell which one is performed by a human. The author mentions that fewer than half of the respondents gave the correct answer. I’ve also personally listened to the two performances. Unfortunately, I could easily tell which one was performed by the human, because the bot’s performance still contained some weird dynamics, and I am sure that a human wouldn’t perform like that. But it is still impressive.
B. Feedforward Neural Networks
The Feedforward Neural Network (FNN) is the most commonly used architecture. Neurons are connected in layers. The first layer is the input layer and the last layer is the output layer. The layers between these two are called hidden layers. Fig. 6 shows the architecture with only one hidden layer. There is an important assumption in an FNN: every input is independent of the rest.
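To make the independence assumption concrete, here is a minimal NumPy sketch of a forward pass through a network with one hidden layer. The layer sizes, the tanh activation, and the random weights are my own illustrative assumptions, not values from the article.

```python
import numpy as np

def feedforward(x, W_hidden, b_hidden, W_out, b_out):
    """Forward pass of a network with a single hidden layer."""
    hidden = np.tanh(W_hidden @ x + b_hidden)   # hidden layer activations
    return W_out @ hidden + b_out               # linear output layer

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 2                 # illustrative sizes (assumed)
W_hidden = rng.normal(size=(n_hidden, n_in))
b_hidden = np.zeros(n_hidden)
W_out = rng.normal(size=(n_out, n_hidden))
b_out = np.zeros(n_out)

x = rng.normal(size=n_in)
y_first = feedforward(x, W_hidden, b_hidden, W_out, b_out)

# Feed an unrelated input in between, then feed x again.
feedforward(rng.normal(size=n_in), W_hidden, b_hidden, W_out, b_out)
y_again = feedforward(x, W_hidden, b_hidden, W_out, b_out)
```

Because the function keeps no state between calls, `y_first` and `y_again` are identical: what the network saw earlier has no influence on the current output.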
C. Recurrent Neural Networks
The main limitation of a simple FNN is its lack of memory. This follows from the FNN’s assumption that inputs are independent of each other. Music, however, is almost always written with global structure: musicians write a sequence of notes based on the feeling they want to express, so the notes cannot be considered independently. Recurrent Neural Networks (RNNs) can solve this kind of problem. An RNN has a state and a feedback loop whose weight is called the recurrent weight. This structure takes the previous state into account when calculating the current output. This means RNNs have a short-term memory mechanism that remembers what was computed in the past and uses it to compute the current result. This mechanism is shown in Fig. 7.
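The recurrence can be sketched in a few lines of NumPy. This is a generic tanh RNN under my own assumptions (sizes, weight scale, zero initial state), not the author’s model; it only shows how the previous state enters the calculation of the current one.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_in, W_rec, b):
    """One time-step: the new state depends on the input AND the previous state."""
    return np.tanh(W_in @ x_t + W_rec @ h_prev + b)

def rnn_forward(xs, W_in, W_rec, b):
    """Run the RNN over a whole sequence and return the state at every time-step."""
    h = np.zeros(W_rec.shape[0])        # assumed zero initial state
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_in, W_rec, b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
n_in, n_hidden, T = 3, 5, 4             # illustrative sizes (assumed)
W_in = 0.5 * rng.normal(size=(n_hidden, n_in))
W_rec = 0.5 * rng.normal(size=(n_hidden, n_hidden))   # the recurrent weight
b = np.zeros(n_hidden)

xs = rng.normal(size=(T, n_in))
states = rnn_forward(xs, W_in, W_rec, b)

# Change only the FIRST input and run the sequence again.
xs_changed = xs.copy()
xs_changed[0] += 1.0
states_changed = rnn_forward(xs_changed, W_in, W_rec, b)
```

Comparing `states[-1]` with `states_changed[-1]` shows that a change at the first time-step still affects the last state, which is exactly the memory an FNN lacks.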
D. Bi-Directional Recurrent Neural Network
Musicians have the ability to look ahead when performing, which helps them prepare for the upcoming emoticons. A simple RNN, however, can only read inputs in order. So we need a structure that can access the upcoming time-steps. This is called a Bi-Directional RNN. This architecture combines two RNN layers, as shown in Fig. 8.
The first layer is called the forward layer, which processes the input sequence in the original order. The second layer is called the backward layer, which can process the input sequence in the reverse order. This seems to be a good choice for the author’s purpose.
The basic idea of the Bi-Directional Recurrent Neural Network is that each training sequence is presented to two RNNs, one running forward and one running backward, and both are connected to the same output layer. This structure gives the output layer complete information about both the past and the future of every point in the sequence. The pseudocode for the forward pass and the backward pass is shown below:
Forward pass:
for t = 1 to T do
    Forward pass for the forward hidden layer, storing activations at each timestep
for t = T to 1 do
    Forward pass for the backward hidden layer, storing activations at each timestep
for all t, in any order do
    Forward pass for the output layer, using the stored activations from both hidden layers

Backward pass:
for all t, in any order do
    Backward pass for the output layer, storing δ terms at each timestep
for t = T to 1 do
    BPTT backward pass for the forward hidden layer, using the stored δ terms from the output layer
for t = 1 to T do
    BPTT backward pass for the backward hidden layer, using the stored δ terms from the output layer
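The forward pass from the pseudocode above can be sketched in NumPy as follows. The simple tanh RNN layers, the sizes, and the random weights are assumptions of mine for illustration; the backward (BPTT) pass is left out.

```python
import numpy as np

def run_rnn(xs, W_in, W_rec):
    """Run a simple tanh RNN over xs, storing the state at every time-step."""
    h = np.zeros(W_rec.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_in @ x_t + W_rec @ h)
        states.append(h)
    return np.stack(states)

def birnn_forward(xs, fwd, bwd, W_out):
    h_fwd = run_rnn(xs, *fwd)                # forward hidden layer, t = 1 .. T
    h_bwd = run_rnn(xs[::-1], *bwd)[::-1]    # backward hidden layer, t = T .. 1, re-aligned
    both = np.concatenate([h_fwd, h_bwd], axis=1)
    return both @ W_out.T                    # output layer uses both stored activations

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 4, 2, 5        # illustrative sizes (assumed)
fwd = (0.5 * rng.normal(size=(n_hidden, n_in)), 0.5 * rng.normal(size=(n_hidden, n_hidden)))
bwd = (0.5 * rng.normal(size=(n_hidden, n_in)), 0.5 * rng.normal(size=(n_hidden, n_hidden)))
W_out = rng.normal(size=(n_out, 2 * n_hidden))

xs = rng.normal(size=(T, n_in))
outputs = birnn_forward(xs, fwd, bwd, W_out)

# Change only the LAST input: the first output changes too,
# because the backward layer carries information from the future.
xs_future = xs.copy()
xs_future[-1] += 1.0
outputs_future = birnn_forward(xs_future, fwd, bwd, W_out)
```

With the forward layer alone, the first state would be unaffected by a change to the last input; it is the backward layer that lets the output at the first time-step "look ahead".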
E. Long Short-Term Memory Network
For RNNs, one major problem is that they cannot remember long-term dependencies [7, 8]. To solve this, the Long Short-Term Memory network (LSTM) was introduced. As shown in Fig. 9, it adds a gating mechanism that controls how much needs to be remembered and how much needs to be forgotten.
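A single step of a standard LSTM cell can be sketched as below. The ordering of the gates inside the weight matrix and all sizes are my own assumptions; the point is only to show how the forget and input gates decide what the cell state keeps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to the four gate pre-activations."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:n])              # forget gate: how much old memory to keep
    i = sigmoid(z[n:2 * n])          # input gate: how much new content to write
    o = sigmoid(z[2 * n:3 * n])      # output gate: how much memory to expose
    g = np.tanh(z[3 * n:4 * n])      # candidate content
    c = f * c_prev + i * g           # new cell state (the long-term memory)
    h = o * np.tanh(c)               # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4                # illustrative sizes (assumed)
x_t = rng.normal(size=n_in)
h_prev = np.zeros(n_hidden)
c_prev = rng.normal(size=n_hidden)

# Zero weights with hand-set biases make the gate behaviour easy to see.
W = np.zeros((4 * n_hidden, n_hidden + n_in))
b_keep = np.zeros(4 * n_hidden)
b_keep[0:n_hidden] = 10.0                  # forget gate close to 1
b_keep[n_hidden:2 * n_hidden] = -10.0      # input gate close to 0
h, c_kept = lstm_step(x_t, h_prev, c_prev, W, b_keep)

b_forget = b_keep.copy()
b_forget[0:n_hidden] = -10.0               # forget gate close to 0
_, c_forgotten = lstm_step(x_t, h_prev, c_prev, W, b_forget)
```

With the forget gate open, `c_kept` stays almost equal to `c_prev`; with it closed, `c_forgotten` is almost zero. In a trained network these gate values are learned from the data rather than set by hand.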
Finally, we are ready to design the architecture. The author uses Bi-directional LSTMs in this article. There are two separate networks, one to realize the Genre, called GenreNet, and one to realize the style, called StyleNet.
Genres have definitive musical styles, so the author uses this characteristic to design a basic model to learn the dynamics of a song. There are two main layers in this model, as shown in Fig. 10: the Bi-Directional LSTM Layer and the Linear Layer.
The Bi-Directional LSTM layer combines the advantages of LSTMs, which provide the memory for learning dependencies, and of the Bi-Directional architecture, which allows the model to take the “future” into consideration. The output of this layer is fed into the next layer as its input. The linear layer is used to map that output, which lies in the range [-1, 1], to a larger range.
StyleNet is used to learn the more complex genres that GenreNet cannot be trained on. StyleNet contains GenreNet subnetworks that learn genre-specific style. These subnetworks share a layer called the interpretation layer. Sharing it reduces the number of parameters the network needs to learn, and it makes StyleNet work much like a translation tool that translates the music input into different styles. As can be seen in Fig. 11, it is a multitask learning model.
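The multitask structure can be sketched as one shared layer followed by one subnetwork per genre. To keep the sketch short, plain tanh dense layers stand in for the Bi-Directional LSTM layers, and the hidden size `N_HIDDEN` is hypothetical; only the 176-wide input, the 88-wide output, and the two genres come from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
N_INPUT, N_HIDDEN, N_OUTPUT = 176, 32, 88    # N_HIDDEN is a hypothetical size

def dense(n_in, n_out):
    """Random weights and zero bias for one dense layer (stand-in for a trained layer)."""
    return 0.1 * rng.normal(size=(n_in, n_out)), np.zeros(n_out)

# One interpretation layer shared by every genre.
shared = dense(N_INPUT, N_HIDDEN)

# One GenreNet-like subnetwork per genre: a hidden layer plus a linear layer.
genre_nets = {
    "classical": (dense(N_HIDDEN, N_HIDDEN), dense(N_HIDDEN, N_OUTPUT)),
    "jazz": (dense(N_HIDDEN, N_HIDDEN), dense(N_HIDDEN, N_OUTPUT)),
}

def stylenet_forward(notes, genre):
    """notes: array of shape (time_steps, 176). Returns (time_steps, 88) velocities."""
    W, b = shared
    interpretation = np.tanh(notes @ W + b)       # shared across genres
    (W1, b1), (W2, b2) = genre_nets[genre]
    hidden = np.tanh(interpretation @ W1 + b1)    # genre-specific (Bi-LSTM stand-in)
    return hidden @ W2 + b2                       # linear layer

notes = rng.integers(0, 2, size=(16, N_INPUT))    # fake binary note input
classical_out = stylenet_forward(notes, "classical")
jazz_out = stylenet_forward(notes, "jazz")
```

The same note input yields a different set of velocities for each genre, while the parameters of the shared layer are learned only once.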
In this article, the author uses music files in MIDI format, because this format preserves musical properties. There is a parameter called velocity that’s analogous to volume, but with a range of 0 – 127. Here, the author uses velocity to capture the dynamics.
The author created a piano dataset of 349 tracks, with the genres limited to Classical and Jazz. The MIDI files the author used are from the Yamaha Piano e-Competition. Human performances usually have at least 45 different velocities, as shown in Fig. 12. The author set 20 as the minimum number of different velocities a track must contain.
Finally, for the time signature, 4/4 is chosen in this article because it is the most common. This provides the author with more samples than other time signatures would.
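The velocity threshold can be sketched as a simple filter. The track representation below, a list of (pitch, velocity) pairs, and the function names are hypothetical; only the threshold of 20 comes from the article.

```python
def distinct_velocities(track):
    """Count the different velocity values used in a track."""
    return len({velocity for _pitch, velocity in track})

def has_enough_dynamics(track, min_velocities=20):
    """Keep a track only if it uses at least `min_velocities` different velocities."""
    return distinct_velocities(track) >= min_velocities

# Hypothetical tracks as lists of (midi_pitch, velocity) pairs.
flat_track = [(60 + i % 12, 64) for i in range(100)]                 # a single velocity
expressive_track = [(60 + i % 12, 30 + i % 50) for i in range(100)]  # 50 velocities

kept = [t for t in (flat_track, expressive_track) if has_enough_dynamics(t)]
```

Here only `expressive_track` survives the filter, since `flat_track` uses one velocity throughout.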
MIDI Encoding Scheme
After getting the piano dataset, the next step is to design the input and output matrices.
First, the dataset needs to be quantized, which allows the author to represent the notes with matrices. The price is that we lose the exact timing of the notes. Here, the author approximates the time-stamps of all the notes to the nearest 1/16th note, which is still fine enough to capture the notes. Fig. 13 shows the difference between the unquantized and the quantized representation of the notes.
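Quantization to the nearest 1/16th note can be sketched as follows. I assume that note times are measured in quarter-note beats, so that one beat holds four 1/16th notes; the note tuples are hypothetical.

```python
import math

STEPS_PER_BEAT = 4   # assuming times in quarter-note beats: four 1/16th notes per beat

def quantize(time_in_beats):
    """Snap a time to the index of the nearest 1/16th-note step (ties round up)."""
    return math.floor(time_in_beats * STEPS_PER_BEAT + 0.5)

def quantize_notes(notes):
    """notes: (midi_pitch, start, end) tuples with times measured in beats."""
    return [(pitch, quantize(start), quantize(end)) for pitch, start, end in notes]

# A slightly uneven human performance of two notes.
performed = [(60, 0.02, 0.98), (64, 1.13, 1.49)]
quantized = quantize_notes(performed)
# quantized == [(60, 0, 4), (64, 5, 6)]
```

The small timing deviations of the performance disappear, which is the loss of exact timing mentioned above, but every note now sits on an integer time-step that can index a matrix row.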
B. Input Matrix Representation
The input carries information on note pitch, start time, and end time. Each note can be in one of three states, so the author uses a two-element binary vector to represent them: “note on” is encoded as [1,1], “note sustained” as [0,1] and “note off” as [0,0].
The note pitch needs to be encoded as well. A matrix is created in which the first dimension corresponds to the MIDI pitch number and the second dimension to the quantized time-step (one 1/16th note). This can be seen in Fig. 14.
Fig 14. Input and Output Representation.
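A possible encoding of the input matrix is sketched below. I assume rows are time-steps and each of the 88 piano keys takes two adjacent columns for its state vector, which gives the 176-wide input; the exact column layout in the author’s code may differ.

```python
import numpy as np

LOWEST_PITCH = 21    # MIDI number of A0, the lowest key of an 88-key piano
N_KEYS = 88

def encode_input(notes, n_steps):
    """notes: (midi_pitch, start_step, end_step) tuples, end_step exclusive."""
    matrix = np.zeros((n_steps, 2 * N_KEYS), dtype=np.int8)   # all "note off" [0,0]
    for pitch, start, end in notes:
        col = 2 * (pitch - LOWEST_PITCH)              # assumed column layout
        matrix[start, col:col + 2] = [1, 1]           # "note on"
        matrix[start + 1:end, col:col + 2] = [0, 1]   # "note sustained"
    return matrix

# Middle C (60) held for four steps, E (64) played for one step.
notes = [(60, 0, 4), (64, 2, 3)]
input_matrix = encode_input(notes, n_steps=6)
```

For middle C, the pair of columns reads [1,1] at step 0, [0,1] at steps 1 to 3, and [0,0] from step 4 on.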
C. Output Matrix Representation
The output matrix carries the velocities of the input. The columns also represent pitch and the rows represent time-step, as shown in Fig. 14. The pitch dimension is only 88 notes wide, because this time we only need to represent velocity. The velocities are divided by the maximum velocity, 127, and at the end the output velocities are decoded back into a MIDI file.
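The scaling by 127 and the decoding step can be sketched like this. Storing the velocity only at the step where a note starts, and clipping out-of-range predictions when decoding, are assumptions of mine.

```python
import numpy as np

MAX_VELOCITY = 127
LOWEST_PITCH = 21    # MIDI number of A0, the lowest key of an 88-key piano
N_KEYS = 88

def encode_output(notes, n_steps):
    """notes: (midi_pitch, start_step, velocity). Velocities are scaled to [0, 1]."""
    matrix = np.zeros((n_steps, N_KEYS))
    for pitch, start, velocity in notes:
        matrix[start, pitch - LOWEST_PITCH] = velocity / MAX_VELOCITY
    return matrix

def decode_velocity(value):
    """Map a network output back to an integer MIDI velocity in 0..127."""
    return int(np.clip(round(value * MAX_VELOCITY), 0, MAX_VELOCITY))

notes = [(60, 0, 100), (64, 2, 45)]
output_matrix = encode_output(notes, n_steps=6)
recovered = decode_velocity(output_matrix[0, 60 - LOWEST_PITCH])
```

Decoding the stored value for middle C gives back the original velocity of 100, and predictions that fall slightly outside [0, 1] are clipped to the valid MIDI range.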
The author provides a lot of detail on the training process, but I’ll just highlight the key points. TensorFlow was used to build the model with two GenreNet units (Classical and Jazz), each GenreNet having 3 layers. The input is 176 nodes wide and one layer deep. Mini-batches of size 4 were chosen, and the learning rate was set to 0.001. The Adam optimizer was used to perform stochastic optimization. The author used 95% of the Classical songs and 95% of the Jazz songs for training, and the remaining 5% of each for validation. The dropout rate was set to 0.8 based on the experiments shown in Fig. 15. This dropout rate helps the model learn the underlying patterns. The model was trained for 160 epochs, and the final loss and validation loss were 0.0007 and 0.0011 respectively.
In Fig. 16, the author shows the training error and validation error.
The author also presents a lot of epoch snapshots from the training session, so we can see the difference between Classical and Jazz outputs. Fig. 17 is one of the snapshots.
The key to testing the results is to see whether StyleNet can generate human-like performances, and the author uses a Turing test to check this. The author created a survey called “Identify The Human”, with 9 questions in two parts, in which the participants listen to 10-second clips of music, as shown in Fig. 18.
The participants need to identify which performance is human. Fig. 19 shows that on average, 53% of participants could identify the human performance.
To make this survey more complete, the author added a new option called “Cannot Determine” to make sure the participants’ decisions were not guesses. As shown in Fig. 20, this time the author found that 46% of participants could identify the human. This means that the StyleNet model passes the Turing test and can generate performances indistinguishable from a human’s.
I’ve played the violin for 19 years, and I strongly agree with the author’s point that there are two important elements to music: the composition and the performance. And sometimes, the performance is the more difficult element. In my opinion, after hearing the music it generated, StyleNet’s performance is very impressive. I think that if the note quantization were made finer, for example to a 1/32 note or even a 1/64 note, it could achieve a better result. I also think a challenge lies in the fact that you cannot make sure StyleNet learns the correct style from the human’s performance. In the test the author provided, I could select the human’s performance correctly because there are still some tiny details in the notes that cannot be learnt by the network, and these details are the ones I use to judge whether a performance is from a human or not. One suggestion is to let musicians listen to the music generated by StyleNet; I think they could provide some helpful professional advice on possible improvements.
I also want to make a comparison between neural style transfer in computer vision [12,13] and the work proposed here. Their basic approach is similar: both use two networks corresponding to style and content. But for neural style transfer in images, the difference between these two networks (one for content reconstruction and the other for style reconstruction) is larger. The style reconstruction is generated from different subsets of the CNN layers, such as conv1_1, [conv1_1, conv2_1], [conv1_1, conv2_1, conv3_1], [conv1_1, conv2_1, conv3_1, conv4_1], and [conv1_1, conv2_1, conv3_1, conv4_1, conv5_1], to capture the spatial features with subsets of different sizes. In the case of music, I think the main idea of using an LSTM to generate music is to capture the features of styles and performances. It can be concluded as follows:
To summarize, a CNN is a typical neural network with depth in space, while an RNN is a neural network with depth in time. When using a CNN, we should focus on spatial mapping, and image data fits this setting particularly well. Music, however, is a time sequence, and thus we should use an RNN.
Colombo, Florian, Alexander Seeholzer, and Wulfram Gerstner. “Deep Artificial Composer: A Creative Neural Network Model for Automated Melody Generation.” International Conference on Evolutionary and Biologically Inspired Music and Art. Springer, Cham, 2017.
Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity.
D. Eck and J. Schmidhuber. A First Look at Music Composition using LSTM Recurrent Neural Networks.
Bob L. Sturm, Joao Felipe Santos, Oded Ben-Tal, and Iryna Korshunova. Music transcription modelling and composition using deep learning.
Composing Music With Recurrent Neural Networks.
M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks.
Ilya Sutskever. Training Recurrent Neural Networks.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. Understanding the exploding gradient problem.
Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory.
Alan M. Turing. Computing machinery and intelligence.
Jing, Yongcheng, et al. “Neural Style Transfer: A Review.” arXiv preprint arXiv:1705.04058 (2017).
Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. “A neural algorithm of artistic style.” arXiv preprint arXiv:1508.06576 (2015).
Technical Analyst: Shixin Gu | Reviewer: Joni Chung