I would like to share this with you, although I suspect that, as usual, very few people will have any interest in it. The reason I want to tell you about it is firstly that I think my solution to this sound balance problem is very unorthodox and fun, and secondly that I think the result was surprisingly good. It still amazes me how much I was able to improve the quality, or maybe more correctly the "perceivability", of the interview.
I have to say that I am a happy Linux user and that this little project was made on a Linux platform with basic open source software for audio and image processing.
Let's look at the problem.
It is an interview conducted over Skype, and the problem is very simple. One voice is the voice of the woman who asks the questions in the interview. Her voice comes through loud and clear. The other voice is the voice of the woman being interviewed. Her voice comes through so quietly that you can hardly hear it.
Here is a screenshot of the audio file opened in Audacity. It is obvious how one voice gives large fluctuations in the graph while the other gives only tiny fluctuations.
I wanted to see if I could balance the sound levels of the two voices in proportion to each other. The first thing I found out was that I could increase the volume of the quiet voice if, in Audacity (a sound editing program), I selected the low-volume piece and only that piece, the piece which in the illustration above I have identified as "Voice B". Then, after selecting it, I ran the so-called Normalize effect on it. After doing only that, the sound of voice B, the one that was so quiet, would suddenly come through loud and clear. I was so surprised and happy about this discovery that I made a short video about it.
So, if I could do this across the whole interview, which lasted about an hour, I would probably end up with a nicely balanced interview where you could hear both voices. I started doing it manually, but it was a laborious process, and at the same time it was difficult to keep it uniform, so I started thinking about whether the process could be automated. I looked around for an existing program that did the same thing and found one called The Levelator®. I tried it, but for some reason it did absolutely nothing to my audio file. As I thought more about it, I got a picture in my head: it would be cool if I could make two complementary soundtracks of the interview, one where only the strong voice is heard and there is silence in all the parts where the quiet voice is speaking, and one where only the quiet voice is heard and there is silence in all the places where the strong voice is speaking. Then I would be able to open the two soundtracks together in Audacity and treat them individually in relation to each other.
I thought it would be an easy task for a computer sound engineer to write a program that could do that: a program that could register when there was a period of tiny fluctuations in the graph and when there was a period of big fluctuations. But I am no computer sound engineer. I am not very strong in audio on the computer, but I have worked a lot with images and video, and as I thought about it, the idea of turning the sound stream into a video stream started to take form in my mind: the idea that I might be able to turn the sound into video, the video into a series of images, and then build an algorithm on measurements of the images.
For Linux/Unix there is a small command-line program for playing audio files. This program is called "play" and it is part of a program suite called SoX, which is used for command-line manipulation of audio files.
When "play" plays a sound file it also display some meta information while running. It shows a time indicator but it also shows a small sound level indicator.
Now, if I could film this little indicator at 24 frames per second while it was playing the interview, I would be able to print that video out as a series of images showing exactly how loud the sound is at any given time.
So I hope you are with me so far. The idea is that I make a little video recording (a screen recording) of the small volume indicator's movement while the interview is playing. This video I print out as a huge number of images, one for each frame in the recording. Each of these images contains accurate information about how loud the sound is at a given time in the recording. A human being can read this information from every single image in the series, but now I want to find a way for the computer to read or calculate this information on its own, so that the process can be automated. There are two parameters I would like the computer to figure out. One is a precise time. The other is the state of the sound level indicator at that particular time.
OK, first the time parameter, because that is the easiest. If I create a script that counts its way through the series of images, then the number it has reached can very easily be converted to a time measure. There are 24 images per second, so you just divide the number by 24 and you have the time in seconds.
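For example, with bc doing the division (the frame number is just an example):

# Convert a frame number to a time in seconds: there are 24 frames per
# second, so the frame number divided by 24 gives the time.
framenumber=363
echo "scale=2; $framenumber / 24" | bc    # prints 15.12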
But how do we get the computer to read the sound level indicator?
Netpbm is the name of a package of command-line programs for image manipulation. It is software I use again and again, and therefore I also know its possibilities really well.
I knew that in the netpbm program suite there is a program called "pnmcrop" which can remove all the same-colored edges of an image. So, if you have a black image with a white star in the middle, then "pnmcrop" removes all the black from the top, bottom, right and left sides of the image until it meets some of the white pixels of the star in the middle. Like this:
Before pnmcrop
After pnmcrop
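On the star example that is a single command (star.pnm is just a placeholder name):

# Remove the uniform black border from all four sides of the image.
pnmcrop star.pnm > star-cropped.pnm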
I also knew that the netpbm program suite has a program that can be used to measure how large an image is in pixels (pamfile).
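pamfile describes an image on a single line, including its size in pixels, so the width is easy to get at (the file name is again a placeholder):

# For a frame like this, pamfile prints something along the lines of
# "frame000042.ppm: PPM raw, 131 by 18  maxval 255",
# where 131 is the width in pixels and 18 is the height.
pamfile frame000042.ppm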
The neat thing about the netpbm program suite is that all the programs in it are command-line programs, which means that they can be put into a script and used to automate workflows.
So I record a little video of "play" playing the interview sound file. This can be done by making a screen recording, or screencast, and there is again an excellent command-line program that can be used for this purpose. This program is called ffmpeg.
In the picture below I define the area of the screen I want to record. As you can see, it is a very small area.
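In rough outline, the recording and the subsequent frame extraction can be done like this; the screen offset (+100,200), the capture size (200x20) and the file names are only placeholders that depend on where the little level meter sits on your own screen:

# Record a small rectangle of the X11 screen at 24 frames per second
# while "play" is running in that spot.
ffmpeg -f x11grab -framerate 24 -video_size 200x20 -i :0.0+100,200 meter.mkv

# Print the recording out as one image per frame (24 images per second).
mkdir -p frames
ffmpeg -i meter.mkv frames/frame%06d.ppm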
When this is written out as images instead of video it produces a huge number of images (24 per second). The images look like this:
Now I use pnmcrop on the five images above to cut all the black away from the right side, and only from the right side, of the images:
You can see that the images differ in width according to the reading of the little volume indicator.
As I said, there is also a program in the netpbm program suite that can measure a picture's width in pixels. So all of a sudden I have a setup that can give me an exact figure for the sound level, measured in pixels, at an exact point in time. So I would be able to create a simple conditional statement:
If the image is more than 314 pixels wide, it is voice A speaking. If it is less than 314 pixels wide, it is voice B speaking.
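Put together for a single frame it could look like this (the file name is a placeholder, and the awk part just picks out the number standing before the word "by" in pamfile's output):

# Crop the black away from the right of the frame, measure the width of
# what is left, and decide from the 314-pixel threshold which voice it is.
width=$(pnmcrop -black -right frame000042.ppm | pamfile | \
        awk '{for (i = 1; i < NF; i++) if ($(i+1) == "by") print $i}')
if [ "$width" -gt 314 ]; then
    echo "voice A"
else
    echo "voice B"
fi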
Now, 24 measurements per second is too high a level of detail for what I want to do, so I made a small script that collects the images twelve by twelve. Here I also use a program from the netpbm program suite (pnmcat) to join the images together twelve at a time.
Now each image represents half a second. That is a little easier to work with, and I can still apply the same algorithm: crop from the right and measure precisely what the sound level is for just that given half second.
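Gluing a batch of twelve frames together is a single pnmcat call (the file names are placeholders, and the brace expansion assumes bash):

# Join twelve consecutive frames side by side so that one combined image
# covers half a second of the recording.
pnmcat -lr frames/frame0000{01..12}.ppm > halfsecond000001.ppm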
So I made a few shell scripts to automate this.
The first script produces a list with an exact time measurement for every time the interview switches from voice A to voice B and vice versa:
5
8
9
14
19
23
24.5
26.5
29.5
31.5
32.5
36
36.5
43
43.5
54
55.5
58
77.5
78
192.5
etc...
The script is here if anyone wants to see how I do it in practice. (Sorry, the comments are in Danish. If you don't understand Danish you will have to use Google Translate or work the meaning out of the code itself.)
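In rough outline, though, the idea is something like this simplified sketch (not the script itself; the file names and the 314-pixel threshold are just the examples from above):

#!/bin/bash
# Walk through the half-second images in order, crop the black away from
# the right, measure the remaining width, and print the time whenever the
# speaking voice switches.
threshold=314     # pixel width that separates voice A from voice B
count=0
previous=""
for img in halfsecond*.ppm; do
    width=$(pnmcrop -black -right "$img" | pamfile | \
            awk '{for (i = 1; i < NF; i++) if ($(i+1) == "by") print $i}')
    if [ "$width" -gt "$threshold" ]; then
        current="A"
    else
        current="B"
    fi
    # each image covers half a second, so count / 2 is the time in seconds
    if [ -n "$previous" ] && [ "$current" != "$previous" ]; then
        echo "scale=1; $count / 2" | bc
    fi
    previous="$current"
    count=$((count + 1))
done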
The next script takes this list and produces from it the two previously mentioned complementary soundtracks. For this I use SoX and ffmpeg.
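Again only a simplified sketch of the idea, with placeholder names: switches.txt is the list from the first script, interview.wav is the original recording, and it is assumed that voice A is the one speaking at the very start. For every segment it copies either that piece of the original audio or an equally long piece of silence, and finally it concatenates the pieces into the two tracks.

#!/bin/bash
# Build the two complementary soundtracks from the list of switch times.
rate=$(soxi -r interview.wav)        # sample rate of the original
channels=$(soxi -c interview.wav)    # channel count of the original
bits=$(soxi -b interview.wav)        # bit depth of the original
length=$(soxi -D interview.wav)      # total length in seconds

start=0
voice="A"                            # assumption: voice A speaks first
i=0
rm -f segA_*.wav segB_*.wav
for end in $(cat switches.txt) $length; do
    n=$(printf "%05d" "$i")
    dur=$(echo "$end - $start" | bc)
    # the real audio goes on the track of the voice speaking in this segment
    sox interview.wav "seg${voice}_$n.wav" trim "$start" "$dur"
    # ... and an equally long piece of silence goes on the other track
    if [ "$voice" = "A" ]; then other="B"; else other="A"; fi
    sox -n -r "$rate" -c "$channels" -b "$bits" "seg${other}_$n.wav" trim 0 "$dur"
    start=$end
    voice=$other
    i=$((i + 1))
done

# Concatenate the pieces into the two complementary tracks.
sox segA_*.wav voiceA.wav
sox segB_*.wav voiceB.wav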
Finally I open the two audio tracks in Audacity, adjust them relative to each other, and then mix them down to one track again.
The result has been really good and the process has been really inspiring.
/Mikkel
Note to those who want to experiment with this in practice.
Although I have split the voices into two tracks, there are still some transitions where the strong voice bleeds onto the weak voice's track. In particular, I found that the shift from the strong voice to the weak one almost always puts a little less than a second of the strong voice onto the weak voice's track. To counteract this, I tried moving the timing of all the shifts one second forward in relation to the soundtrack. I did this by removing 24 images from the start of the process where the many small images are collected twelve by twelve. It was a very easy place to do it, because I only needed to change one number: the starting point in the script. I just counted 24 images further on. So if image 145 is the one that comes closest to the audio file's starting point, I instead start 24 images later in the series, so that image 169 becomes the starting point.
When the timing of the shifts was moved a second forward in relation to the audio file, there was much less bleed from the powerful voice onto the weak voice's soundtrack, but there was still some. I have found that the best results are achieved by leveling these "spots" down into alignment with the fluctuations of the rest of the quiet voice's track. I have done this manually in Audacity. It takes me about 20 minutes for an interview lasting about one hour. I zoom into the graph so much that 25 seconds fills my whole Audacity window. Then I use the arrow at the end of the scrollbar at the bottom of the Audacity window to move forward through the soundtrack. As soon as I see a fluctuation which extends above the average, very low level of the track, I select it and use
Effect -> Amplify
to turn down the volume at that spot to the level of the rest of the track. When I am done with this process, I have a track where all the fluctuations have the same low level and the waveform looks quite uniform, without anything sticking up anywhere. Then I run
Effect -> Normalize
on the track. It makes a really big difference. The entire track is boosted significantly, and I do not need to do anything else. The second track, with the powerful voice, I do not change at all. When I have manually eliminated all the "ridges" on the quiet track and run
Effect -> Normalize
on the entire track, I can just export the two tracks together, and then I have the finished audio file where both voices can be heard loud and clear.