You choose a row of pixels. The greyscale values of this row correlate with the audio waveform.
When you choose a frame that is more than 1 pixel thick, the values of the pixels lying one below the other are added together and their average is used.
Simply by moving the frame, you discover countless different sounds in a very intuitive and sensual way.
When you choose full-screen, you can work properly with silhouettes such as sine, square, or real-time silhouettes, taken for example from dancers with a connected camera.
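The averaging of a frame that is several pixels thick can be sketched roughly as follows. This is only an illustration, not the program's actual code: the function name `frame_to_waveform`, the list-of-rows image layout, and the linear mapping of greyscale 0..255 onto samples between -1.0 and 1.0 are all assumptions made for the example.

```python
def frame_to_waveform(pixels, top, height):
    """Average `height` rows starting at row `top`, column by column.

    `pixels` is a list of rows of greyscale values (0..255).
    The mapping to samples in -1.0..1.0 is an assumption of this sketch.
    """
    rows = pixels[top:top + height]
    if not rows:
        raise ValueError("frame lies outside the image")
    width = len(rows[0])
    samples = []
    for x in range(width):
        # Add the pixels lying one below the other, then take the average.
        mean = sum(row[x] for row in rows) / len(rows)
        # Map 0 to -1.0 and 255 to +1.0.
        samples.append(mean / 127.5 - 1.0)
    return samples


# A tiny hypothetical 4x3 greyscale image.
image = [
    [0, 255, 0, 255],
    [0, 255, 255, 255],
    [255, 255, 0, 0],
]

one_row = frame_to_waveform(image, top=0, height=1)   # frame 1 pixel thick
two_rows = frame_to_waveform(image, top=0, height=2)  # frame 2 pixels thick
```

Moving the frame corresponds to changing `top`; a thicker frame smooths the waveform because each sample becomes a mean over several rows.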

Loaded picture (a silhouette)...

...is converted to greyscale (here: full-screen)...

...is transformed into an audio waveform.
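The three steps pictured above (picture, greyscale, audio waveform) might be reproduced along these lines. It is a minimal sketch under stated assumptions: the Rec. 601 luma weights for the greyscale conversion, the repetition of one pixel row as a single wave cycle, and the 16-bit mono WAV output are choices made for this example, and the helper names `to_greyscale` and `row_to_wav_bytes` are hypothetical.

```python
import io
import struct
import wave


def to_greyscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to greyscale values 0..255.

    Uses Rec. 601 luma weights, which is an assumption of this sketch.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]


def row_to_wav_bytes(grey_row, cycles=100, sample_rate=44100):
    """Treat one greyscale row as a single wave cycle and repeat it."""
    # Map 0..255 to signed 16-bit samples (assumed linear mapping).
    samples = [int((v / 127.5 - 1.0) * 32767) for v in grey_row]
    frames = struct.pack("<%dh" % len(samples), *samples) * cycles
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(2)        # 16 bit
        w.setframerate(sample_rate)
        w.writeframes(frames)
    return buf.getvalue()


# A hypothetical one-row "silhouette": black, white, white, black.
silhouette = [[(0, 0, 0), (255, 255, 255), (255, 255, 255), (0, 0, 0)]]
grey = to_greyscale(silhouette)
wav_bytes = row_to_wav_bytes(grey[0], cycles=2)
```

Writing `wav_bytes` to a file with a `.wav` extension should give something playable; with a real image row of several hundred pixels, the pitch would depend on the row width and the sample rate.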