Affine Transformation — why a 3×3 matrix for a 2D transformation

Assumption: you know the basics of linear transformation by matrix multiplication. If not, this video from 3Blue1Brown is a great intro.

Have you wondered why Affine Transformation typically uses a 3×3 matrix to transform a 2D image? It looks confusing at first, but it is actually a brilliant idea!

If you search around for articles on the topic, you will see that 3×3 matrices are used to perform transformations (scaling, translation, rotation, shearing) on images, which are 2D!

Credit: Cmglee at https://en.wikipedia.org/wiki/Affine_transformation

We know that the location of each pixel in a 2D image can be represented with a 2D vector [x,y], and that the image can be linearly transformed with a 2×2 matrix. So questions naturally arise. Where do the 3×3 matrices come from? A 3×3 matrix is not even compatible with a 2D vector! Why are we using a 3×3 matrix when it seems that a 2×2 can do the same?

In this article, I will answer the questions below:

  • Why does Affine Transformation use a 3×3 matrix to transform a 2D image? and
  • more confusing yet, why is the input transformation matrix of OpenCV’s Affine Transformation function cv2.warpAffine() 2×3 in shape?

In linear transformation, a 2×2 matrix is used to scale, shear, and rotate a 2D vector [x,y], which is also exactly what Affine Transformation does. Do you see what’s missing? Translation! Multiplying a 2D vector by a 2×2 matrix cannot achieve translation.

In linear transformation (scaling, shearing, and rotating), the basis vectors all stay anchored at the same origin (0,0) before and after the transformation. That means the point (0,0) never changes location. To translate the image to a different location, you need to add a vector after the matrix transformation. Therefore, the general expression for Affine Transformation is q = Ap + b, which written out is

[q₁]   [a₁₁  a₁₂] [p₁]   [b₁]
[q₂] = [a₂₁  a₂₂] [p₂] + [b₂]

[p₁, p₂] can be understood as the original location of one pixel of an image. [q₁, q₂] is the new location after the transformation.

When vector b [b₁, b₂] is [0,0], there is no translation. In this case, a 2×2 matrix A is indeed sufficient. When b is not [0,0], the image moves to a different location (i.e., translation). b₁ and b₂ here determine the new location. Specifically, b₁ determines how much the location moves along the x axis and b₂ determines how much it moves along the y axis. This looks a bit cumbersome, doesn’t it? It is a 2-step calculation: matrix multiplication + vector addition.
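To make the two steps concrete, here is a minimal numpy sketch; the particular numbers for A, b, and p are made up purely for illustration:

```python
import numpy as np

# Linear part: scale x by 2, leave y unchanged (illustrative values only)
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
# Translation part: shift 10 units along x and 5 units along y
b = np.array([10.0, 5.0])

p = np.array([3.0, 4.0])   # original pixel location [p1, p2]

# Step 1: matrix multiplication; Step 2: vector addition
q = A @ p + b              # -> array([16., 9.])
```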

KEY QUESTION: what if we can perform the entire transformation with just one matrix multiplication?

That’s exactly what a 3×3 matrix can do: it combines the multiplication by a 2×2 matrix and the addition of a 2D vector into a single multiplication by a 3×3 matrix!

Here is how it works. The original points on the 2D plane are padded with a 1 in the third axis, becoming (p₁, p₂, 1). This makes them points in 3D space, all lying on the plane where the third coordinate is always 1. The 2D image is still a 2D image; it has just been augmented into a 3D space. To visualize it:

When we shear the cube along the z axis, the image does not change shape or size, but it moves to a different location as seen from the x and y axes! That’s exactly how the 3×3 matrix transformation helps!

What happens to the transformation matrix then? It is expanded into this:

[q₁]   [a₁₁  a₁₂  a₁₃] [p₁]
[q₂] = [a₂₁  a₂₂  a₂₃] [p₂]
[1 ]   [ 0    0    1 ] [1 ]

The key points here:

  • The last row of the transformation matrix A, [0,0,1], makes sure that the points after the transformation still lie on the same z=1 plane.
  • a₁₃ and a₂₃ determine how much the image is shifted along the x and y axes, respectively. They are actually the same as b₁ and b₂ in the 2D vector above (the sketch after this list verifies the equivalence numerically).
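Here is a small numpy sketch of that equivalence. The particular values of A and b are arbitrary examples; the point is that padding p with a 1 and multiplying by the 3×3 matrix gives the same answer as the two-step calculation:

```python
import numpy as np

# Arbitrary example: a 30-degree rotation plus a translation of (5, -2)
theta = np.deg2rad(30)
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
b = np.array([5.0, -2.0])

p = np.array([3.0, 4.0])            # original pixel location [p1, p2]

# Two-step version: matrix multiplication + vector addition
q_two_step = A @ p + b

# One-step version: build the 3x3 matrix and act on the padded point (p1, p2, 1)
M = np.eye(3)
M[:2, :2] = A                       # a11, a12, a21, a22
M[:2, 2] = b                        # a13 = b1, a23 = b2
p_padded = np.append(p, 1.0)        # (p1, p2, 1)
q_one_step = (M @ p_padded)[:2]     # third component stays 1, so drop it

assert np.allclose(q_two_step, q_one_step)
```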

It is worth noting the two 0s in positions a₃₁ and a₃₂. They stay as 0 in Affine Transformation. If either of them is nonzero, the image will leave the z=1 plane, and when it is projected back onto the z=1 plane (dividing by the third coordinate), it no longer keeps the same shape. In that case, it is not an Affine Transformation anymore; it becomes a projective (perspective) transformation. That is why in OpenCV, the input transformation matrix for the Affine Transformation function cv2.warpAffine() is a 2×3 matrix: it only has 2 rows, because the third row [0, 0, 1] never changes!
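To tie this back to OpenCV, here is a minimal sketch of cv2.warpAffine() with a 2×3 matrix. It assumes OpenCV and numpy are installed, and uses a small synthetic image rather than a real photo:

```python
import numpy as np
import cv2

# Synthetic 100x100 grayscale image: a white square on a black background
img = np.zeros((100, 100), dtype=np.uint8)
img[30:70, 30:70] = 255

# 2x3 affine matrix: identity for the 2x2 linear part, plus a translation
# of 20 px along x and 10 px along y. These are just the top two rows of
# the 3x3 matrix; the third row [0, 0, 1] is implied and never passed in.
M = np.float32([[1, 0, 20],
                [0, 1, 10]])

# dsize is (width, height)
shifted = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))
```

Passing the full 3×3 matrix would be redundant, since cv2.warpAffine() can always assume the missing third row is [0, 0, 1].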

OK, a recap:

  1. Why does Affine Transformation typically use a 3×3 matrix to transform a 2D image?
    To save computation steps and, in my opinion, for elegance: it combines a two-step calculation into one matrix multiplication.
  2. Why is the input transformation matrix of OpenCV’s Affine Transformation function cv2.warpAffine() 2×3 (hopefully not confusing anymore)?
    The third row of the transformation matrix always stays [0,0,1], so there is no need to specify it in the function input.

Let me know if the explanation makes sense.

Bonus: ChatGPT’s answer:
