## Implementing a toy video encoder in the browser


Video encoding is an undocumented minefield, and the only way to learn where the mines are is to step on them yourself. When I first started, there were no good documents that gave me the big picture. The best I could find was the Daala blog series, but its main goal is introducing more advanced algorithms rather than teaching the basics. As someone who has navigated the minefield, I think I should write a beginner-friendly introduction. In this article, we will implement a toy video encoder that does not strictly follow any industrial standard, but will illustrate how video compression works under the hood.

Video compression is true magic. If you calculate the raw size of a ninety-minute 1080p 24fps movie, you get roughly 750 GiB; your hard drive would explode storing it uncompressed. Yet the size of an MP4 file of the same movie is perfectly manageable. The compression ratio of H264, which is already a previous-generation codec, is about 3000 to 0.4, or roughly 7500:1. To put that into perspective: suppose you use three bytes to represent one pixel's R, G and B values. After compression, the same space stores about 7500 pixels. And yet you probably won't notice any artifacts or quality issues in the encoded video.
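The arithmetic behind that raw-size figure is worth a quick check. This snippet (not part of the demo code below) just multiplies out the numbers:

```javascript
// Back-of-the-envelope check of the raw size quoted above:
// 90 minutes of uncompressed 1080p 24fps RGB video.
const width = 1920, height = 1080, fps = 24, seconds = 90 * 60;
const bytesPerPixel = 3; // one byte each for R, G and B
const rawBytes = width * height * bytesPerPixel * fps * seconds;
console.log((rawBytes / 2 ** 30).toFixed(0) + ' GiB'); // about 751 GiB
```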

The previous generation of video codecs is represented by H264 and VP8, while the current generation is HEVC (also called H265) and VP9. Today there is dedicated hardware on the GPU that handles video encoding and decoding, which is why, despite being computation-heavy, they have become so efficient. The next generation codec is AV1, which is still at an early stage without any hardware support yet.

A video codec works like a pipeline; it takes several steps to compress a video. The first step is color conversion. We are all familiar with RGB color for display. However, when it comes to compression, RGB is not the ideal color space, for two reasons. First, the R, G and B channels contain highly correlated information: the green image looks like a greenish version of the red image, and the blue image looks like a bluish version of it too. All three channels redundantly store the same underlying signal, the intensity of light. Second, it turns out that human eyes are more sensitive to intensity than to color. If we want to sacrifice some information for space without changing the image too much, color information is the one to trim down.

let canvas = getCanvas();
canvas.height = 160
let ctx = canvas.getContext('2d')

let newImg = new Image;
newImg.crossOrigin = "Anonymous";

let isRGB = true;
ctx.font = "30px Arial";
let drawImages = function () {
if (loaded) {
ctx.clearRect(0, 0, canvas.width, canvas.height);
ctx.drawImage(newImg, 0, 0, 128, 128);

    if (isRGB)
{
imageData = ctx.getImageData(0, 0, 128, 128);

for (let y = 0; y < 128; ++y)
{
for (let x = 0; x < 128; ++x)
{
imageData.data[(y * 128 + x) * 4 + 1] = 0
imageData.data[(y * 128 + x) * 4 + 2] = 0
}
}
ctx.putImageData(imageData, 138, 0);
ctx.fillText("R", 138 + 128 / 2 - 12, 158);

imageData = ctx.getImageData(0, 0, 128, 128);

for (let y = 0; y < 128; ++y)
{
for (let x = 0; x < 128; ++x)
{
imageData.data[(y * 128 + x) * 4 + 0] = 0
imageData.data[(y * 128 + x) * 4 + 2] = 0
}
}
ctx.putImageData(imageData, 138 * 2, 0);
ctx.fillText("G", 138 * 2 + 128 / 2 - 12, 158);

imageData = ctx.getImageData(0, 0, 128, 128);

for (let y = 0; y < 128; ++y)
{
for (let x = 0; x < 128; ++x)
{
imageData.data[(y * 128 + x) * 4 + 0] = 0
imageData.data[(y * 128 + x) * 4 + 1] = 0
}
}
ctx.putImageData(imageData, 138 * 3, 0);
ctx.fillText("B", 138 * 3 + 128 / 2 - 12, 158);
}
else
{
imageData = ctx.getImageData(0, 0, 128, 128);
for (let y = 0; y < 128; ++y)
{
for (let x = 0; x < 128; ++x)
{
let e = (y * 128 + x)
let intensity = 0.2126 * imageData.data[e * 4] + 0.7152 * imageData.data[e * 4 + 1] + 0.0722 * imageData.data[e * 4 + 2]

imageData.data[e * 4] = intensity
imageData.data[e * 4 + 1] = intensity
imageData.data[e * 4 + 2] = intensity
}
}
ctx.putImageData(imageData, 138, 0);
ctx.fillText("Y", 138 + 128 / 2 - 12, 158);

imageData = ctx.getImageData(0, 0, 128, 128);

for (let y = 0; y < 128; ++y)
{
for (let x = 0; x < 128; ++x)
{
let e = (y * 128 + x)
let u =  -0.09991 * imageData.data[e * 4] - 0.33609 * imageData.data[e * 4 + 1] + 0.436 * imageData.data[e * 4 + 2]
let v =  0.615 * imageData.data[e * 4] - 0.55861 * imageData.data[e * 4 + 1] - 0.05639 * imageData.data[e * 4 + 2]

imageData.data[e * 4] = 1.28033 * v + 128
imageData.data[e * 4 + 1] = -0.21482*u - 0.38059*v + 128
imageData.data[e * 4 + 2] = 2.12798 * u + 128
}
}
ctx.putImageData(imageData, 138 * 2, 0);
ctx.fillText("UV", 138 * 2 + 128 / 2 - 24, 158);
}
}


}

canvas.onresize = function (width, height) { drawImages() }

let UIComponents = function () {
this.image = "Lenna"
this.mode = "RGB"
}
let uimodel = new UIComponents();
let ui = createUi();
ui.add(uimodel, 'image', ['Lenna', 'Baboon', 'Fruits']).onChange(() => {
if (uimodel.image === 'Lenna') {
newImg.src = 'https://epiphany.pub/image/faa2ff7436b15877418252219f4f1042029a75e42a4851e9b8d6af949d1a6756/4c411b8a0b5207739f97e787d2af77ec9e1a1a47f117eb946ba1fcf51865d5f6/bfcf3757089ac1f583ad617b4d92e4a9.png'
} else if (uimodel.image === 'Baboon') {
newImg.src = 'https://epiphany.pub/image/faa2ff7436b15877418252219f4f1042029a75e42a4851e9b8d6af949d1a6756/4c411b8a0b5207739f97e787d2af77ec9e1a1a47f117eb946ba1fcf51865d5f6/674fd8e5a649dd8eb182d868b4c6d1b8.png'
} else if (uimodel.image === 'Fruits') {
newImg.src = 'https://epiphany.pub/image/faa2ff7436b15877418252219f4f1042029a75e42a4851e9b8d6af949d1a6756/4c411b8a0b5207739f97e787d2af77ec9e1a1a47f117eb946ba1fcf51865d5f6/115bc491e101002d9a0ff48536e0662d.png'
}
});

ui.add(uimodel, 'mode', ['RGB', 'YUV']).onChange(() => {
isRGB = (uimodel.mode === 'RGB');
drawImages();
});

The industry standard color format for image compression is YUV. In the YUV format, light intensity (channel Y) is separated from chrominance (channels U and V). The formula to convert RGB to YUV is the following:

$\begin{pmatrix} 0.2126 & 0.7152 & 0.0722 \\ -0.09991 & -0.33609 & 0.436 \\ 0.615 & -0.55861 & -0.05639 \end{pmatrix} *\begin{pmatrix} R\\G\\B \end{pmatrix}=\begin{pmatrix} Y\\U\\V \end{pmatrix}$

Once an image is converted to YUV, its color channels can be downsampled to reduce size. There are many ways of doing this, each identified by a name such as YUV444, YUV422, YUV420p, NV12, etc. If you ever use an encoding library, these names matter, because a format conversion is often inevitable to connect an encoding pipeline. For example, a camera might only output images in YUV422, while the video encoder you want to use might only accept YUV420p.

So what are those formats exactly? YUV444 means no downsampling on the color channels at all; it is simply the original YUV image. YUV422 means we shrink U and V by half along the x axis, so each U or V sample in a YUV422 image represents two horizontally adjacent samples of the original channel. YUV420p, the most common format, shrinks the original UV images along both the x and y axes, so the resulting UV planes are one quarter of their original size. A format name not only states how downsampling is performed, it also tells us how pixels are laid out in memory. There are two general layouts. One is called packed, as in RGB, where the values of all channels belonging to the same pixel stay together. The other is called planar, where we store an entire channel first, followed by the next channel, and so on. The YUV420p format, unlike RGB, is a planar format.
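To make the YUV420p layout concrete, here is a sketch (the helper name is ours, not part of the demos in this post) that packs full-resolution Y, U and V planes into a single planar YUV420p buffer, averaging each 2x2 chroma neighborhood:

```javascript
// Sketch: pack full-resolution Y, U, V planes into planar YUV420p.
// Assumes width and height are even.
function toYUV420p(y, u, v, width, height) {
  // Planar layout: full Y plane, then quarter-size U, then quarter-size V.
  const out = new Uint8Array(width * height + (width * height) / 2);
  out.set(y, 0);
  let o = width * height;
  for (const plane of [u, v]) {
    for (let yy = 0; yy < height; yy += 2) {
      for (let xx = 0; xx < width; xx += 2) {
        // one chroma sample per 2x2 block of original samples
        const a = plane[yy * width + xx];
        const b = plane[yy * width + xx + 1];
        const c = plane[(yy + 1) * width + xx];
        const d = plane[(yy + 1) * width + xx + 1];
        out[o++] = Math.round((a + b + c + d) / 4);
      }
    }
  }
  return out;
}
```

The Y plane keeps every sample; U and V each shrink to a quarter of their original size, so the whole buffer is 1.5 bytes per pixel instead of 3.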

The next step down the pipeline is image partitioning. We break the image into small patches called transform blocks. The transform block is the processing unit of the next stage, where we turn each block into math equations. Due to complexity, it is hard to find a set of good equations that represent the entire image; it is much simpler if we only work on one small area at a time. In a real encoder, block-based motion information is also analyzed and harvested to further improve the compression rate, but that is a more advanced topic we will omit for now.

Recent video encoders, such as H265, use adaptive block sizes. For flat areas with less details, block size is usually larger. For areas with small details, partitioning will continue to generate finer blocks, akin to a Quadtree. For simplicity, however, we only use a uniform size of 8x8 in our toy implementation.
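A uniform 8x8 partition is easy to sketch. The helper below (a hypothetical name, assuming a grayscale image whose width and height are multiples of 8) cuts a pixel array into row-major transform blocks:

```javascript
// Sketch: cut a grayscale image into 8x8 transform blocks.
function partition8x8(pixels, width, height) {
  const blocks = [];
  for (let by = 0; by < height; by += 8) {
    for (let bx = 0; bx < width; bx += 8) {
      const block = new Float64Array(64);
      for (let y = 0; y < 8; ++y)
        for (let x = 0; x < 8; ++x)
          block[y * 8 + x] = pixels[(by + y) * width + bx + x];
      blocks.push(block);
    }
  }
  return blocks; // row-major list of 64-sample blocks
}
```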

The next step is representing each block with frequencies. The motivation is to remove the high frequencies (the details) while keeping the low frequencies, which preserve most of the original image. This is similar to the Fourier Transform, but instead of sine functions we use cosines, hence the name Discrete Cosine Transform. DCT is favored over the Fourier Transform because it has higher energy compaction.

The image above gives a visual explanation of energy compaction. If we plot the histogram of a DCT result, the value distribution concentrates on a few low frequencies. This is a very good property for compression, because most of the values (coefficients) are close to zero and therefore disposable.

You may ask whether the energy compaction property of DCT holds for any input image. The answer is no (think white noise). But it generally holds for natural images, because natural images are very smooth locally. This also means we cannot apply DCT over and over to compress an image infinitely: after the first DCT, the result is no longer locally smooth.

Like the Fourier Transform, the goal of DCT is to represent the pixels as a linear combination of basis functions (cosines). DCT has a few variants; the one typically used for image compression is DCT-II, with its inverse, DCT-III, used for decoding. For DCT-II, the cosine functions are defined as follows:

$\cos\left(\frac{\pi}{N}\left(x+\frac{1}{2}\right)k\right),\quad x,k=0,\dots,N-1$

where N=8 is the size of the transform block and k defines the frequency of a cosine function.

let hsv2rgb = function (H, S, V) {
var R, G, B, C, Hi, X;

C = V * S;
Hi = Math.floor(H / 60);
X = C * (1 - Math.abs(((H / 60) % 2) - 1));

switch (Hi) {
case 0: R = C; G = X; B = 0; break;
case 1: R = X; G = C; B = 0; break;
case 2: R = 0; G = C; B = X; break;
case 3: R = 0; G = X; B = C; break;
case 4: R = X; G = 0; B = C; break;
case 5: R = C; G = 0; B = X; break;

default: R = 0; G = 0; B = 0; break;
}

return new THREE.Vector3(R, G, B);


};

let updateGeometry = function (geometry, geometry2, plotoffsetx, plotoffsety, isUpdate) {
var indices = []
var vertices2 = []
var vertices3 = []
var normalCal = []
var minZ = 10
var maxZ = -10

if (isUpdate)
{
vertices2 = geometry.attributes.position.array
vertices3 = geometry2.attributes.position.array
}
let h = 0
for (var y = 0; y < 65; y++)
for (var x = 0; x < 65; x++) {
let locY = y / 8 -4
let locX = x / 8 -4
let locZ = Math.cos((Math.PI/8)*(locY+4)*((locX+4)+0.5))

if (minZ > locZ)
minZ = locZ;

if (maxZ < locZ)
maxZ = locZ

if (isUpdate)
{
vertices2[h++] = locX
vertices2[h++] = locY
vertices2[h++] = locZ
}
else
{
vertices2.push(locX, locY, locZ)
}
normalCal.push(new THREE.Vector3(0, 0, 0))
}

h = 0

for (var y = 0; y < 64; y++) {
for (var x = 0; x < 64; ++x) {
let baseIdx = y * 65 + x
let ind1 = (y + 1) * 65 + x + 1
let ind2 = y * 65 + x + 1
let ind3 = (y + 1) * 65 + x

if (!isUpdate){
indices.push(baseIdx, ind2, ind1)
indices.push(baseIdx, ind1, ind3)
}
let basev = new THREE.Vector3(vertices2[baseIdx * 3], vertices2[baseIdx * 3 + 1], vertices2[baseIdx * 3 + 2])
let ind1v = new THREE.Vector3(vertices2[ind1 * 3], vertices2[ind1 * 3 + 1], vertices2[ind1 * 3 + 2])
let ind2v = new THREE.Vector3(vertices2[ind2 * 3], vertices2[ind2 * 3 + 1], vertices2[ind2 * 3 + 2])
let ind3v = new THREE.Vector3(vertices2[ind3 * 3], vertices2[ind3 * 3 + 1], vertices2[ind3 * 3 + 2])

let v2 = new THREE.Vector3();
v2.subVectors(ind2v, basev)
let v1 = new THREE.Vector3();
v1.subVectors(ind1v, basev)

let n1 = new THREE.Vector3;
n1.crossVectors(v2, v1)

if (!isUpdate)
{
vertices3.push(vertices2[baseIdx * 3], vertices2[baseIdx * 3 + 1], vertices2[baseIdx * 3 + 2])
vertices3.push(vertices2[ind3 * 3], vertices2[ind3 * 3 + 1], vertices2[ind3 * 3 + 2])
vertices3.push(vertices2[baseIdx * 3], vertices2[baseIdx * 3 + 1], vertices2[baseIdx * 3 + 2])
vertices3.push(vertices2[ind2 * 3], vertices2[ind2 * 3 + 1], vertices2[ind2 * 3 + 2])
}
else
{
vertices3[h++] = vertices2[baseIdx * 3]
vertices3[h++] = vertices2[baseIdx * 3 + 1]
vertices3[h++] = vertices2[baseIdx * 3 + 2]
vertices3[h++] = vertices2[ind3 * 3]
vertices3[h++] = vertices2[ind3 * 3 + 1]
vertices3[h++] = vertices2[ind3 * 3 + 2]
vertices3[h++] = vertices2[baseIdx * 3]
vertices3[h++] = vertices2[baseIdx * 3 + 1]
vertices3[h++] = vertices2[baseIdx * 3 + 2]
vertices3[h++] = vertices2[ind2 * 3]
vertices3[h++] = vertices2[ind2 * 3 + 1]
vertices3[h++] = vertices2[ind2 * 3 + 2]
}
//  break;
}
//  break;
}

let normals = []
let colors = []

if (isUpdate)
{
normals = geometry.attributes.normal.array
colors = geometry.attributes.color.array
}

let e = 0;
h = 0;

for (var i = 0; i < normalCal.length; ++i) {
normalCal[i].normalize();

if (!isUpdate)
{
normals.push(normalCal[i].x, normalCal[i].y, normalCal[i].z)
}
else
{
normals[e++] = normalCal[i].x
normals[e++] = normalCal[i].y
normals[e++] = normalCal[i].z
}
/*let cv = normalCal[i]
cv.normalize()
colors.push(cv.x, cv.y, cv.z)*/
let hue = (1 - (vertices2[i * 3 + 2] - minZ) / (maxZ - minZ)) * 240
//console.log(hue)
let color = hsv2rgb(hue, 1, 1);
if (!isUpdate)
{
colors.push(color.x, color.y, color.z)
}
else
{
colors[h++] = color.x
colors[h++] = color.y
colors[h++] = color.z
}
}

if (!isUpdate)
{
// itemSize = 3 because there are 3 values (components) per vertex
geometry.setIndex(indices)
}
else
{
geometry.attributes.position.needsUpdate = true;
geometry.attributes.normal.needsUpdate = true;
geometry.attributes.color.needsUpdate = true;
geometry.computeBoundingSphere();
geometry2.attributes.position.needsUpdate = true;
geometry2.computeBoundingSphere();
}


}

let vshader = 'void main() { gl_Position = projectionMatrix * modelViewMatrix * vec4(position, 1.0); gl_Position.z -= 0.01; }'

let fshader = 'void main() { gl_FragColor = vec4(0.3, 0.3, 0.3, 1.0); }'

let canvas = getCanvas()
canvas.width = 640
canvas.height = 480

const renderer = new THREE.WebGLRenderer({ canvas: canvas, antialias: true, alpha: true });

renderer.autoClear = false;

var scene = new THREE.Scene();
var camera = new THREE.PerspectiveCamera(60, 640 / 480, 0.1, 1000);
camera.position.z = 4;
//var camera = new THREE.OrthographicCamera( 6.4 / 2, 6.4 / -2, 4.8 / 2, 4.8 / - 2, 1, 1000 );

renderer.setClearColor("#FFFFFF");
renderer.setSize(640, 480);

let cameraDis = 9.5
let cameraFrom = new THREE.Vector3(7.988318693325925, 1.8385050955502407, 5.727710141714169)
cameraFrom.normalize()
cameraFrom.multiplyScalar(cameraDis)
let cameraTo = new THREE.Vector3(0, 0, 0)
let cameraUp = new THREE.Vector3(0, 0, 1)

camera.position.set(cameraFrom.x, cameraFrom.y, cameraFrom.z)
camera.up = cameraUp
camera.up.normalize()
camera.lookAt(cameraTo);

canvas.onresize = function (width, height) {
camera = new THREE.PerspectiveCamera(45, width / height, 0.1, 1000);
camera.position.z = 4;
//camera = new THREE.OrthographicCamera( 6.4 / 2, 6.4 / -2, 4.8 / 2, 4.8 / - 2, 1, 1000 );
renderer.setSize(width, height);

camera.position.set(cameraFrom.x, cameraFrom.y, cameraFrom.z)
camera.up = cameraUp
camera.up.normalize()
camera.lookAt(cameraTo);


}

var vertices4 = []

var geometry = new THREE.BufferGeometry();
var geometry2 = new THREE.BufferGeometry();

updateGeometry(geometry, geometry2, 0, 0, false)

var material = new THREE.MeshBasicMaterial({ vertexColors: THREE.VertexColors });
material.side = THREE.DoubleSide;

// material3 is used below but missing from the original listing; a plain line
// material is assumed here for the wireframe and bounding box
var material3 = new THREE.LineBasicMaterial({ color: 0x333333 });

// Add the surface mesh to the scene
var mesh = new THREE.Mesh(geometry, material);
scene.add(mesh);

var frame = new THREE.LineSegments(geometry2, material3)
scene.add(frame)

vertices4.push(-4.0, -4.0, -1.0)
vertices4.push(4.0, -4.0, -1.0)
vertices4.push(4.0, -4.0, -1.0)
vertices4.push(4.0, 4.0, -1.0)
vertices4.push(4.0, 4.0, -1.0)
vertices4.push(-4.0, 4.0, -1.0)
vertices4.push(-4.0, 4.0, -1.0)
vertices4.push(-4.0, -4.0, -1.0)

vertices4.push(-4.0, -4.0, 1.0)
vertices4.push(4.0, -4.0, 1.0)
vertices4.push(4.0, -4.0, 1.0)
vertices4.push(4.0, 4.0, 1.0)
vertices4.push(4.0, 4.0, 1.0)
vertices4.push(-4.0, 4.0, 1.0)
vertices4.push(-4.0, 4.0, 1.0)
vertices4.push(-4.0, -4.0, 1.0)
vertices4.push(-4.0, -4.0, 1.0)
vertices4.push(-4.0, -4.0, -1.0)
vertices4.push(4.0, 4.0, 1.0)
vertices4.push(4.0, 4.0, -1.0)
vertices4.push(-4.0, 4.0, 1.0)
vertices4.push(-4.0, 4.0, -1.0)
vertices4.push(4.0, -4.0, 1.0)
vertices4.push(4.0, -4.0, -1.0)

var geometry3 = new THREE.BufferGeometry();
geometry3.addAttribute('position', new THREE.Float32BufferAttribute(vertices4, 3));
var coord = new THREE.LineSegments(geometry3, material3)
scene.add(coord)

// Render Loop
var render = function () {
requestAnimationFrame(render);
renderer.clear();
renderer.render(scene, camera);
};

let onMouseDownPosition = { x: 0, y: 0 }
let isMouseDown = false;

function onMouseDown(event) {
onMouseDownPosition = { x: event.clientX, y: event.clientY }
isMouseDown = true;
}

function onTouchStart(event) {
onMouseDownPosition = { x: event.touches[0].clientX, y: event.touches[0].clientY }
isMouseDown = true;
}

function onMouseUp(event) { isMouseDown = false; }

let onMouseDownTheta = 0;
let onMouseDownPhi = 0;
let radius = 5;

function resetCamera(offsetX, offsetY) {
let vx = new THREE.Vector3
vx.crossVectors(cameraFrom, cameraUp).normalize()

let vy = new THREE.Vector3
vy.crossVectors(vx, cameraFrom).normalize()
vy.multiplyScalar(offsetY)
vx.multiplyScalar(offsetX)

// apply the drag offsets and put the camera back on a sphere of radius
// cameraDis (this step appears to be missing from the original listing)
cameraFrom.sub(vx).add(vy)

cameraUp.crossVectors(new THREE.Vector3(0, 0, 1), cameraFrom)
cameraUp.crossVectors(cameraFrom, cameraUp)
cameraUp.normalize()

cameraFrom.normalize()
cameraFrom.multiplyScalar(cameraDis)
camera.position.set(cameraFrom.x, cameraFrom.y, cameraFrom.z)
camera.up = cameraUp
camera.lookAt(cameraTo);
}

function onMouseMove(event) {

event.preventDefault();

if (isMouseDown) {

let offsetX = (event.clientX - onMouseDownPosition.x) / 10.0;
let offsetY = (event.clientY - onMouseDownPosition.y) / 10.0;

resetCamera(offsetX, offsetY)

onMouseDownPosition.x = event.clientX
onMouseDownPosition.y = event.clientY
}


}

function onTouchMove(event) {
event.preventDefault();

if (isMouseDown) {

let offsetX = (event.touches[0].clientX - onMouseDownPosition.x) / 10.0;
let offsetY = (event.touches[0].clientY - onMouseDownPosition.y) / 10.0;

resetCamera(offsetX, offsetY)

onMouseDownPosition.x = event.touches[0].clientX
onMouseDownPosition.y = event.touches[0].clientY
}


}

canvas.onmousemove = onMouseMove;
canvas.onmouseup = onMouseUp;
canvas.onmousedown = onMouseDown;
canvas.ontouchstart = onTouchStart;
canvas.ontouchmove = onTouchMove;
canvas.ontouchend = onMouseUp;

render()

We can plot the cosines as a function of x and k in 3D to get a feel for them. As you can see, when k increases, the frequency of the cosine function increases too. To perform the transform, we sample the values of the cosine functions and save the samples into a matrix, as shown in the image below. Then we multiply each row of the image (an 8x1 vector) by this matrix to get a new vector of the same size. Each entry of this new vector represents the energy, or coefficient, of the corresponding cosine function. The original row can now be represented as a linear combination of the cosines with those coefficients. After we have processed all the rows and have an intermediate transform result, we repeat the same process on each column of the intermediate result, because pixels along the y direction are also correlated.

Why does a simple matrix/vector multiplication transform pixels into a linear combination of cosines? The reason is not unlike the resolution of forces. In physics, a single force can be broken down into two or more forces that have different directions and, taken together, are equivalent to the single one. To break down a force, we take dot products between the force vector and the other directions; the results are the projections of the force onto those direction vectors. Similarly, if we view each row of the matrix as a vector, taking dot products between the matrix rows and the rows of the image yields the projections of the image onto the cosine functions. And the projections, taken together, are equivalent to the original image.
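We can sketch this projection view directly. The snippet below (a standalone sketch, separate from the interactive demo code) builds the 8x8 matrix of sampled cosines and computes a 1-D DCT of a row as eight dot products:

```javascript
// Sketch: 1-D DCT-II of one 8-sample row, as dot products with sampled cosines.
const N = 8;
const dctMatrix = [];
for (let k = 0; k < N; ++k) {
  dctMatrix.push([]);
  for (let x = 0; x < N; ++x)
    dctMatrix[k].push(Math.cos((Math.PI / N) * (x + 0.5) * k));
}
function dct1d(row) {
  // each coefficient is the projection of the row onto one cosine basis row
  return dctMatrix.map(basis =>
    basis.reduce((sum, c, i) => sum + c * row[i], 0));
}
```

For a constant row, only the first coefficient (the average, scaled by N) is non-zero, which is energy compaction at work.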

let canvas = getCanvas();
canvas.height = 512
let ctx = canvas.getContext('2d')

let newImg = new Image;
newImg.crossOrigin = "Anonymous";

let isRGB = true;
ctx.font = "30px Arial";
let rawRGB = null;

let cos = []
for (let y = 0; y < 8; ++y) {
cos.push([])
for (let x = 0; x < 8; ++x) {
let c = Math.cos(Math.PI / 8 * y * (x + 0.5))
cos[y].push(c)
}
}

let dctblock = function (rgb) {
let data = []
for (let e = 0; e < 64; ++e) {
let intensity = 0.2126 * rgb[e * 4] + 0.7152 * rgb[e * 4 + 1] + 0.0722 * rgb[e * 4 + 2]
data.push(intensity)
}

let rdata = []
clr()
log("DCT applied on each row:")
for (let h = 0; h < 8; ++h)
{
let pr = "| "
for (let r = 0; r < 8; ++r)
{
let cc = cos[r]
let rr = 0

for (let e = 0; e < 8; ++e)
{
rr += cc[e] * data[e + h * 8]
}
rdata.push(rr)
pr += rr.toFixed(2) + ' '
}
pr += "|"
log(pr)
}

log("\nDCT applied again on each column:")

let rdata2 = []
for (let h = 0; h < 8; ++h)
{
for (let r = 0; r < 8; ++r)
{
let cc = cos[r]
let rr = 0
for (let e = 0; e < 8; ++e)
{
rr += cc[e] * rdata[e * 8 + h]
}
rdata2.push(rr)
}
}

for (let h = 0; h < 8; ++h)
{
for (let r = 0; r < 8; ++r)
{
if (r > h)
{
let tmp = rdata2[h * 8 + r]
rdata2[h * 8 + r] = rdata2[h + r * 8]
rdata2[h + r * 8] = tmp
}
}
}

for (let h = 0; h < 8; ++h)
{
let pr = "| "
for (let r = 0; r < 8; ++r)
{
pr += rdata2[h*8+r].toFixed(2) + ' '
}
pr += "|"
log(pr)
}

log("\nAfter quantization:")

let quantization = [16, 11, 10, 16, 24, 40, 51, 61,
12, 12, 14, 19, 26, 58, 60, 55,
14, 13, 16, 24, 40, 57, 69, 56,
14, 17, 22, 29, 51, 87, 80, 62,
18, 22, 37, 56, 68, 109, 103, 77,
24, 35, 55, 64, 81, 104, 113, 92,
49, 64, 78, 87, 103, 121, 120, 101,
72, 92, 95, 98, 112, 100, 103, 99]

for (let h = 0; h < 8; ++h)
{
let pr = "| "
for (let r = 0; r < 8; ++r)
{
pr += Math.round(rdata2[h*8+r]/quantization[h*8+r]) + ' '
}
pr += "|"
log(pr)
}


}

let UIComponents = function () {
this.image = "Lenna"
this.offsetX = 0
this.offsetY = 0
this.DCT = function () { dctblock(rawRGB) }
}
let uimodel = new UIComponents();

const pixelW = 12

let drawImages = function () {
if (loaded) {
ctx.clearRect(0, 0, canvas.width, canvas.height);
ctx.drawImage(newImg, 0, 0, 512, 512);
let data = ctx.getImageData(uimodel.offsetX * 8, uimodel.offsetY * 8, 8, 8)
rawRGB = data.data

    let ox = 522
let oy = Math.min(canvas.height-pixelW*8, uimodel.offsetY*8 - pixelW*4)
oy = Math.max(0, oy)

for(let y = 0;y<8;++y){
for (let x =0;x<8;++x)
{
ctx.fillStyle = 'rgba('+data.data[(y*8+x)*4]+', '+data.data[(y*8+x)*4+1]+', '+data.data[(y*8+x)*4+2]+', 255)';
ctx.fillRect(ox+x*pixelW,oy+y*pixelW,pixelW,pixelW)
}
}

ctx.beginPath();
ctx.rect(uimodel.offsetX*8, uimodel.offsetY*8, 8, 8);
ctx.moveTo(uimodel.offsetX*8+8, uimodel.offsetY*8);
ctx.lineTo(ox, oy);
ctx.moveTo(uimodel.offsetX*8+8, uimodel.offsetY*8+8);
ctx.lineTo(ox, oy+pixelW*8);

for(let x = 0;x<9;++x)
{
ctx.moveTo(ox+x*pixelW,oy)
ctx.lineTo(ox+x*pixelW,oy+pixelW*8)
}

for(let y = 0;y<9;++y)
{
ctx.moveTo(ox,oy+y*pixelW)
ctx.lineTo(ox+8*pixelW,oy+pixelW*y)
}

ctx.stroke();
}


}

canvas.onresize = function (width, height) { drawImages(); }

let ui = createUi();
ui.add(uimodel, 'image', ['Lenna', 'Baboon', 'Fruits']).onChange(() => {
if (uimodel.image === 'Lenna') {
newImg.src = 'https://epiphany.pub/image/faa2ff7436b15877418252219f4f1042029a75e42a4851e9b8d6af949d1a6756/4c411b8a0b5207739f97e787d2af77ec9e1a1a47f117eb946ba1fcf51865d5f6/bfcf3757089ac1f583ad617b4d92e4a9.png'
} else if (uimodel.image === 'Baboon') {
newImg.src = 'https://epiphany.pub/image/faa2ff7436b15877418252219f4f1042029a75e42a4851e9b8d6af949d1a6756/4c411b8a0b5207739f97e787d2af77ec9e1a1a47f117eb946ba1fcf51865d5f6/674fd8e5a649dd8eb182d868b4c6d1b8.png'
} else if (uimodel.image === 'Fruits') {
newImg.src = 'https://epiphany.pub/image/faa2ff7436b15877418252219f4f1042029a75e42a4851e9b8d6af949d1a6756/4c411b8a0b5207739f97e787d2af77ec9e1a1a47f117eb946ba1fcf51865d5f6/115bc491e101002d9a0ff48536e0662d.png'
}
});
ui.add(uimodel, 'offsetX').min(0).max(63).step(1).listen();
ui.add(uimodel, 'offsetY').min(0).max(63).step(1).listen();
ui.add(uimodel, 'DCT')

let onMouseDown = function (e) {
const rect = canvas.getBoundingClientRect()
const x = e.clientX - rect.left
const y = e.clientY - rect.top

if (x >= 0 && x < 512)
{
if (y >= 0 && y < 512)
{
uimodel.offsetX = Math.floor(x / 8)
uimodel.offsetY = Math.floor(y / 8)
drawImages()
}
}


}

canvas.onmousedown = onMouseDown;

What's special about a DCT result (of a single row) is that it is always the first few coefficients (the lowest frequencies) that have the largest values. The higher a frequency is, the closer its coefficient is to zero. This reflects the locally smooth nature of natural images. The first coefficient is the most important one, because it encodes the average value of the pixels.

The above is a DCT experiment. You can select an 8x8 transform block with the mouse, and the DCT button on the right will perform the transform and print the results. As you can see, after DCT and quantization, most of the values are zero.

Despite being magical, the effect of DCT can be illustrated by a simplified example. Let’s assume the size of DCT matrix and the transform block is 2x2. Then the DCT Matrix is the following (scaled by $$\sqrt{2}$$):$$\begin{pmatrix} 1 & 1 \\ 1 & -1 \\ \end{pmatrix}$$. Given a row of pixels $$\begin{pmatrix} a \\ b \end{pmatrix}$$, we have:

$\begin{pmatrix} 1 & 1 \\ 1 & -1 \\ \end{pmatrix} *\begin{pmatrix} a\\ b \end{pmatrix}=\begin{pmatrix} a + b\\ a - b \end{pmatrix}$

After DCT, the new vector becomes $$\begin{pmatrix} a+b \\ a-b \end{pmatrix}$$, where $$a+b$$ can be viewed as the average (scaled by 2) of the two pixels and $$a-b$$ as their difference. Because of local smoothness, the value of $$a-b$$ should be close to zero. This is essentially how DCT "compacts energy": it encodes highly correlated values as an average plus a few deltas. Since the deltas are close to zero, they can be thrown away if needed. As an analogy, to measure the heights of identical twins, you don't need to measure both. You can measure one and take the other's height to be the same.

But without throwing away the near-zero coefficients, DCT by itself doesn't give us any compression. Therefore, in the next step, called quantization, we divide each entry of the DCT result by the corresponding value from a quantization table and round the result to the nearest integer, which turns most of the small coefficients into exact zeros. Here we will use the quantization table found in the JPEG standard. Exactly how these values were obtained is beyond the scope of this post.

$\begin{pmatrix} 16 & 11 & 10 & 16 & 24 & 40 & 51 & 61 \\ 12 & 12 & 14 & 19 & 26 & 58 & 60 & 55 \\ 14 & 13 & 16 & 24 & 40 & 57 & 69 & 56 \\ 14 & 17 & 22 & 29 & 51 & 87 & 80 & 62 \\ 18 & 22 & 37 & 56 & 68 & 109 & 103 & 77 \\ 24 & 35 & 55 & 64 & 81 & 104 & 113 & 92 \\ 49 & 64 & 78 & 87 & 103 & 121 & 120 & 101 \\ 72 & 92 & 95 & 98 & 112 & 100 & 103 & 99 \end{pmatrix}$
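Quantization itself is just an element-wise divide-and-round against this table. A minimal sketch (using the JPEG table above; the function names are ours):

```javascript
// Sketch: quantize / dequantize one 8x8 block of DCT coefficients
// with the JPEG luma quantization table.
const jpegQ = [
  16, 11, 10, 16, 24, 40, 51, 61,
  12, 12, 14, 19, 26, 58, 60, 55,
  14, 13, 16, 24, 40, 57, 69, 56,
  14, 17, 22, 29, 51, 87, 80, 62,
  18, 22, 37, 56, 68, 109, 103, 77,
  24, 35, 55, 64, 81, 104, 113, 92,
  49, 64, 78, 87, 103, 121, 120, 101,
  72, 92, 95, 98, 112, 100, 103, 99
];
function quantize(coeffs) {
  return coeffs.map((c, i) => Math.round(c / jpegQ[i]));
}
function dequantize(levels) {
  // lossy: the rounding above cannot be undone
  return levels.map((q, i) => q * jpegQ[i]);
}
```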

After quantization, most coefficients become zeros, with most of the non-zeros clustered in the top-left corner of the 8x8 matrix. We want to reorganize the numbers into a list such that the zeros stay together in the trailing part, because then we can remove the trailing zeros all at once rather than track the location of each individual zero. We do this by flattening the matrix in a zigzag manner.
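A sketch of the zigzag flattening (the function names are ours): it orders the 64 indices by anti-diagonal, alternating direction, then drops the run of trailing zeros:

```javascript
// Sketch: flatten an 8x8 block in zigzag order so trailing zeros cluster.
function zigzagOrder(n) {
  const idx = [];
  for (let y = 0; y < n; ++y)
    for (let x = 0; x < n; ++x) idx.push([y, x]);
  idx.sort((a, b) => {
    const da = a[0] + a[1], db = b[0] + b[1];
    if (da !== db) return da - db; // earlier anti-diagonal first
    // even diagonals run bottom-left to top-right (y decreasing), odd the reverse
    return da % 2 === 0 ? b[0] - a[0] : a[0] - b[0];
  });
  return idx.map(([y, x]) => y * n + x);
}
function zigzagFlatten(block) {
  const flat = zigzagOrder(8).map(i => block[i]);
  let end = flat.length;
  while (end > 0 && flat[end - 1] === 0) --end; // drop trailing zeros
  return flat.slice(0, end);
}
```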

Once trailing zeros are removed, the data can be further compressed using entropy coding, such as Huffman coding. The core idea is that values that appear more frequently in the data are encoded with fewer bits.
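As a sketch of the idea (a minimal textbook Huffman construction, not the exact coding scheme of any real codec), we can build a prefix code from symbol frequencies by repeatedly merging the two rarest nodes:

```javascript
// Sketch: build Huffman codes from a Map of symbol -> frequency.
// Frequent symbols end up near the root and get shorter codes.
function huffmanCodes(freq) {
  let nodes = [...freq.entries()].map(([sym, f]) => ({ f, sym }));
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.f - b.f);
    const [a, b] = nodes.splice(0, 2); // merge the two rarest nodes
    nodes.push({ f: a.f + b.f, left: a, right: b });
  }
  const codes = {};
  (function walk(node, prefix) {
    if (node.sym !== undefined) { codes[node.sym] = prefix || '0'; return; }
    walk(node.left, prefix + '0');
    walk(node.right, prefix + '1');
  })(nodes[0], '');
  return codes;
}
```

With frequencies like {0: 10, 1: 3, 2: 1}, the common symbol 0 gets a one-bit code while the rare symbols get longer ones.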

let canvas = getCanvas();
canvas.height = 512
let ctx = canvas.getContext('2d')

let newImg = new Image;
newImg.crossOrigin = "Anonymous";

let isRGB = true;
ctx.font = "30px Arial";
let rawRGB = null;

let cos = []
for (var y = 0; y < 8; ++y) {
cos.push([])
for (var x = 0; x < 8; ++x) {
let c = Math.cos(Math.PI / 8 * y * (x + 0.5))
cos[y].push(c)
}
}

let inversecos = []
for (var y = 0; y < 8; ++y) {
inversecos.push([])
for (var x = 0; x < 8; ++x) {
let c = Math.cos(Math.PI / 8 * x * (y + 0.5))
inversecos[y].push(c)
}
}

let quantization = [
16, 11, 10, 16, 24, 40, 51, 61,
12, 12, 14, 19, 26, 58, 60, 55,
14, 13, 16, 24, 40, 57, 69, 56,
14, 17, 22, 29, 51, 87, 80, 62,
18, 22, 37, 56, 68, 109, 103, 77,
24, 35, 55, 64, 81, 104, 113, 92,
49, 64, 78, 87, 103, 121, 120, 101,
72, 92, 95, 98, 112, 100, 103, 99
]

let drawBlock = function(){

    let ox = 522
let oy = 0

for (let y = 0; y < 8; ++y) {
for (let x = 0; x < 8; ++x) {
ctx.fillStyle = 'rgba(' + rawRGB.data[((y + uimodel.offsetY * 8) * 512 + x + uimodel.offsetX * 8) * 4] + ', ' + rawRGB.data[((y + uimodel.offsetY * 8) * 512 + x + uimodel.offsetX * 8) * 4 + 1] + ', ' + rawRGB.data[((y + uimodel.offsetY * 8) * 512 + x + uimodel.offsetX * 8) * 4 + 2] + ', 255)';
ctx.fillRect(ox + x * pixelW, oy + y * pixelW, pixelW, pixelW)
}
}

ctx.beginPath();

for (let x = 0; x < 9; ++x) {
ctx.moveTo(ox + x * pixelW, oy)
ctx.lineTo(ox + x * pixelW, oy + pixelW * 8)
}

for (let y = 0; y < 9; ++y) {
ctx.moveTo(ox, oy + y * pixelW)
ctx.lineTo(ox + 8 * pixelW, oy + pixelW * y)
}

ctx.stroke();


}

var dctblock = function (rawRGB, offsetX, offsetY) {
let data = []
for (let y = 0; y < 8; ++y) {
for (let x = 0; x < 8; ++x) {
let intensity = 0.2126 * rawRGB.data[((offsetY + y) * 512 + x + offsetX) * 4] + 0.7152 * rawRGB.data[((offsetY + y) * 512 + x + offsetX) * 4 + 1] + 0.0722 * rawRGB.data[((offsetY + y) * 512 + x + offsetX) * 4 + 2]
data.push(intensity)
}
}

let rdata = []
for (let h = 0; h < 8; ++h) {
for (let r = 0; r < 8; ++r) {
let cc = cos[r]
let rr = 0
for (let e = 0; e < 8; ++e) {
rr += cc[e] * data[e + h * 8]
}
rdata.push(rr)
}
}
let rdata2 = []
for (let h = 0; h < 8; ++h) {
for (let r = 0; r < 8; ++r) {
let cc = cos[r]
let rr = 0
for (let e = 0; e < 8; ++e) {
rr += cc[e] * rdata[e * 8 + h]
}
let ff = Math.round(rr / quantization[r * 8 + h]);
rdata2.push(ff)
}
}

for (let h = 0; h < 8; ++h) {
for (let r = 0; r < 8; ++r) {
let ff = Math.round(rdata2[h * 8 + r] * quantization[r * 8 + h]);
rdata2[h * 8 + r] = ff;
}
}

let rdata3 = []
for (let h = 0; h < 8; ++h) {
for (let r = 0; r < 8; ++r) {
let cc = inversecos[r]
let rr = 0
for (let e = 0; e < 8; ++e) {
if (e == 0) {
rr += rdata2[e + h * 8] / 2
} else {
rr += cc[e] * rdata2[e + h * 8]
}
}
rdata3.push(rr / 4)
}
}

for (let h = 0; h < 8; ++h) {
for (let r = 0; r < 8; ++r) {
if (r > h) {
let tmp = rdata3[h * 8 + r]
rdata3[h * 8 + r] = rdata3[h + r * 8]
rdata3[h + r * 8] = tmp
}
}
}

let rdata4 = []

for (let h = 0; h < 8; ++h) {
for (let r = 0; r < 8; ++r) {
let cc = inversecos[r]
let rr = 0
for (let e = 0; e < 8; ++e) {
if (e == 0) {
rr += rdata3[e + h * 8] / 2
} else {
rr += cc[e] * rdata3[e + h * 8]
}
}
rdata4.push(rr / 4)
}
}

let patch = ctx.getImageData(offsetX, offsetY, 8, 8)

for (let i = 0; i < rdata4.length; ++i) {
patch.data[i * 4] = rdata4[i];
patch.data[i * 4 + 1] = rdata4[i];
patch.data[i * 4 + 2] = rdata4[i];
}

ctx.putImageData(patch, offsetX, offsetY);
drawBlock()


}

let progress = 0
let run = false

let task = function () {
if (loaded && rawRGB && run) {
if (progress < 64 * 64) {
uimodel.offsetY = Math.floor(progress / 64)
uimodel.offsetX = progress % 64
dctblock(rawRGB, uimodel.offsetX * 8, uimodel.offsetY * 8)
progress++;
window.requestAnimationFrame(task);
}
}
}

let UIComponents = function() {
    this.image = 'Lenna'
    this.offsetX = 0
    this.offsetY = 0
    this.DCT = function() {
        if (run) {
            run = false
        } else {
            if (progress >= 64 * 64) {
                progress = 0
            }
            run = true
            task()
        }
    }
}

let uimodel = new UIComponents();

const pixelW = 12

let drawImages = function() {
    if (loaded) {
        ctx.clearRect(0, 0, canvas.width, canvas.height)
        ctx.drawImage(newImg, 0, 0, 512, 512)
        rawRGB = ctx.getImageData(0, 0, 512, 512)
    }
}

canvas.onresize = function (width, height) { drawImages() }

let ui = createUi()
ui.add(uimodel, 'image', ['Lenna', 'Baboon', 'Fruits']).onChange(() => {
    loaded = false
    run = false
    progress = 0
    if (uimodel.image === 'Lenna') {
        newImg.src = 'https://epiphany.pub/image/faa2ff7436b15877418252219f4f1042029a75e42a4851e9b8d6af949d1a6756/4c411b8a0b5207739f97e787d2af77ec9e1a1a47f117eb946ba1fcf51865d5f6/bfcf3757089ac1f583ad617b4d92e4a9.png'
    } else if (uimodel.image === 'Baboon') {
        newImg.src = 'https://epiphany.pub/image/faa2ff7436b15877418252219f4f1042029a75e42a4851e9b8d6af949d1a6756/4c411b8a0b5207739f97e787d2af77ec9e1a1a47f117eb946ba1fcf51865d5f6/674fd8e5a649dd8eb182d868b4c6d1b8.png'
    } else if (uimodel.image === 'Fruits') {
        newImg.src = 'https://epiphany.pub/image/faa2ff7436b15877418252219f4f1042029a75e42a4851e9b8d6af949d1a6756/4c411b8a0b5207739f97e787d2af77ec9e1a1a47f117eb946ba1fcf51865d5f6/115bc491e101002d9a0ff48536e0662d.png'
    }
})
ui.add(uimodel, 'offsetX').min(0).max(63).step(1).listen()
ui.add(uimodel, 'offsetY').min(0).max(63).step(1).listen()
ui.add(uimodel, 'DCT')

After entropy encoding, our toy encoder is done. Now let's look at how to decode the result. Roughly speaking, decoding is encoding in reverse. We need to recover the DCT-transformed blocks from the Huffman-encoded data, apply the inverse DCT to recover the YUV pixels, and finally upsample the UV channels and convert back to RGB.
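The final YUV-to-RGB step is a fixed linear transform. Here is a minimal sketch, assuming full-range BT.601 coefficients (other color matrices use slightly different constants):

```javascript
// Sketch of the final YUV -> RGB conversion, assuming full-range BT.601
// coefficients. U and V are centered at 128 for 8-bit samples.
function yuvToRgb(y, u, v) {
    const clamp = (x) => Math.max(0, Math.min(255, Math.round(x)))
    const r = y + 1.402 * (v - 128)
    const g = y - 0.344136 * (u - 128) - 0.714136 * (v - 128)
    const b = y + 1.772 * (u - 128)
    return [clamp(r), clamp(g), clamp(b)]
}

// A gray pixel has U = V = 128, so all three channels equal Y.
console.log(yuvToRgb(200, 128, 128)) // [200, 200, 200]
```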

The first and the last steps are trivial, so let's talk about how to do the inverse DCT. As mentioned above, the inverse of DCT-II is DCT-III, defined by the following cosines:

$\cos\left(\frac{\pi}{N}x\left(k+\frac{1}{2}\right)\right),\quad k = 0, \dots, N-1$

Notice that k and x are exchanged in DCT-III compared to DCT-II. Therefore if we sample the values of the cosines into a matrix, it will be a 90-degree counter-clockwise rotated version of the DCT-II matrix. But when carrying out the inverse DCT, we don’t just do matrix/vector multiplication. Instead, we treat the first term of the linear combination differently:

$X_k=\frac{1}{2}x_0+\sum_{n=1}^{N-1}x_n\cos\left(\frac{\pi}{N}n\left(k+\frac{1}{2}\right)\right),\quad k=0,\dots,N-1$

where $x_n$ is an entry of the coefficient vector. Notice that this is not a plain matrix/vector multiplication, as $x_0$ is handled differently.
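To make the formula concrete, here is a one-dimensional sketch for N = 8: an unnormalized DCT-II forward pass and the DCT-III inverse above. The 2/N scale on the inverse (my choice of normalization, not the only one) makes the pair an exact round trip:

```javascript
const N = 8

// Forward 1-D DCT-II (unnormalized): X_k = sum_n x_n cos(pi/N * (n+1/2) * k)
function dct1d(samples) {
    const out = []
    for (let k = 0; k < N; ++k) {
        let sum = 0
        for (let n = 0; n < N; ++n) {
            sum += samples[n] * Math.cos((Math.PI / N) * (n + 0.5) * k)
        }
        out.push(sum)
    }
    return out
}

// Inverse 1-D DCT (DCT-III): the first coefficient is halved rather than
// multiplied by a cosine, exactly as in the formula above.
function idct1d(coeffs) {
    const out = []
    for (let k = 0; k < N; ++k) {
        let sum = coeffs[0] / 2
        for (let n = 1; n < N; ++n) {
            sum += coeffs[n] * Math.cos((Math.PI / N) * n * (k + 0.5))
        }
        out.push((sum * 2) / N) // scale so that idct1d(dct1d(x)) recovers x
    }
    return out
}
```

Applying `idct1d(dct1d(x))` to any 8-sample vector returns the original samples up to floating-point error.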

The experiment above executes the entire encoding and decoding pipeline block by block. After each block is compressed and decompressed, we put it back on top of the original image for comparison. For simplicity, we only compress the Y channel, but the process generalizes to the color channels easily. As you can see, the recovered (grayscale) image doesn't have noticeable artifacts.

So far, we have talked about how to encode a single image. But what about videos? Well, a video is just a sequence of images displayed rapidly. Between two adjacent images, pixels don't change much. This temporal redundancy is something we can take advantage of. To encode a video sequence, the first frame is encoded as a base. Then, for each consecutive frame, we calculate the delta against the previous frame and encode only that delta, which is expected to be small thanks to the temporal smoothness. But after a while, numerical errors accumulate to the point that artifacts become visible, and we need to re-encode an entire frame as the new base. That's why a video has three frame types: I-frames, P-frames and B-frames. An I-frame, also called a key frame, can be decoded by itself. A P-frame, on the other hand, only encodes a delta, so it can't be decoded without some previous frames. A B-frame is even more complex: it needs not only previous frames, but also frames that come after it.
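The delta idea can be sketched in a few lines. This is illustrative only: real codecs predict with motion-compensated blocks rather than raw per-pixel differences, but the principle is the same.

```javascript
// Illustrative sketch of temporal delta coding. Frames are flat arrays of
// pixel values; a P-frame stores only differences against the previous
// reconstructed frame, which are small and cheap to entropy-code.
function encodePFrame(frame, prevFrame) {
    return frame.map((v, i) => v - prevFrame[i])
}

function decodePFrame(delta, prevFrame) {
    return delta.map((d, i) => prevFrame[i] + d)
}

const keyFrame = [100, 101, 102, 103]           // I-frame: stored as-is
const nextFrame = [100, 102, 102, 104]          // only small changes
const delta = encodePFrame(nextFrame, keyFrame) // [0, 1, 0, 1]
```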

In addition to frame types, we also have the concepts of Presentation Timestamp (PTS) and Decoding Timestamp (DTS). The Decoding Timestamp defines the decoding order, which matches the order in which frames are stored in the file. But this is not necessarily the right order for display: a B-frame that should be played early may be stored after some later frames, because it needs those later frames in order to be decoded.
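A tiny example makes the distinction concrete (a hypothetical frame list; real containers carry the timestamps per packet):

```javascript
// Frames in storage/decode order (sorted by DTS). The B-frame references
// both the I-frame before it and the P-frame after it in display time, so
// the P-frame must be decoded first even though it is shown later.
const decodeOrder = [
    { type: 'I', dts: 0, pts: 0 },
    { type: 'P', dts: 1, pts: 2 },
    { type: 'B', dts: 2, pts: 1 },
]

// The player re-sorts by PTS before displaying: I, B, P.
const displayOrder = [...decodeOrder].sort((a, b) => a.pts - b.pts)
```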

If you save compressed video data into a raw H264 file, it is usually playable. But we almost never use a raw H264 file directly. Instead, we use file formats like MP4, MOV, MKV and WEBM. These are container formats, within which there are a few media streams, and video is just one type of stream. There can also be audio, subtitles, and, in some special use cases, even GPS coordinates, depth, and stereo images. Therefore, as the last step of a real encoding process, we perform muxing to package the encoded video into a container format.

People often get confused by these container formats, as some MP4 files that are playable on one platform (for example, Linux) can't be played on another (say, iOS). That's because, internally, container files can use different compression standards (H264 vs. H265) and parameters (resolution, framerate, bitrate, etc.) that may not be supported by a picky platform.

Finally, my rant at the beginning about video being an undocumented minefield is mostly about video encoding libraries, but that's another story. In real life, you will almost certainly never encode a video from scratch. But I hope this little hands-on tutorial gives you a good picture of how video encoding works and helps you understand those encoding libraries. While writing this, I found a great introduction article on the same topic, which I wish I had when I first started.