
Optimizing Vertex Shader Matrices

One of the most common tasks for a vertex shader is to convert a vertex position from local space to projected space. A simple approach is to store the matrices LocalToWorld, WorldToCamera, and CameraToProjected in constant buffers and let the vertex shader apply them one after another to produce the output position. However, doing this work in the vertex shader means the full chain of transforms is repeated for every vertex of a mesh, even though the combined matrix is identical for all of them! This can be optimized by calculating LocalToProjected once on the CPU and storing it in the constant buffer for the vertex shader to use at will.

On that note, we store LocalToProjected per draw call because LocalToWorld will be unique for every object that needs to be drawn.
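To make that concrete, here is a minimal C++ sketch of the idea. It assumes a column-vector convention (v' = M * v), and the `Matrix`, `Vec4`, `Translation`, `Scale`, `TransformPerVertex`, and `BuildLocalToProjected` names are hypothetical helpers written for this illustration, not code from an actual engine. It shows that pushing a vertex through the three matrices one at a time gives the same position as transforming it once by a LocalToProjected built on the CPU.

```cpp
#include <array>

// Hypothetical minimal types for illustration (column-vector convention: v' = M * v).
using Vec4 = std::array<float, 4>;

struct Matrix {
    float m[4][4];  // m[row][col]
};

// Standard 4x4 matrix product.
Matrix operator*(const Matrix& a, const Matrix& b) {
    Matrix r{};  // zero-initialized
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

// Matrix times column vector.
Vec4 operator*(const Matrix& a, const Vec4& v) {
    Vec4 r{};  // zero-initialized
    for (int i = 0; i < 4; ++i)
        for (int k = 0; k < 4; ++k)
            r[i] += a.m[i][k] * v[k];
    return r;
}

Matrix Translation(float x, float y, float z) {
    return Matrix{{{1.0f, 0.0f, 0.0f, x},
                   {0.0f, 1.0f, 0.0f, y},
                   {0.0f, 0.0f, 1.0f, z},
                   {0.0f, 0.0f, 0.0f, 1.0f}}};
}

Matrix Scale(float s) {
    return Matrix{{{s, 0.0f, 0.0f, 0.0f},
                   {0.0f, s, 0.0f, 0.0f},
                   {0.0f, 0.0f, s, 0.0f},
                   {0.0f, 0.0f, 0.0f, 1.0f}}};
}

// What the unoptimized shader effectively does for every single vertex:
// three matrix-vector transforms in a row.
Vec4 TransformPerVertex(const Matrix& cameraToProjected,
                        const Matrix& worldToCamera,
                        const Matrix& localToWorld,
                        const Vec4& positionLocal) {
    const Vec4 positionWorld = localToWorld * positionLocal;
    const Vec4 positionCamera = worldToCamera * positionWorld;
    return cameraToProjected * positionCamera;
}

// What the CPU can do once per draw call instead; the shader is then left
// with a single matrix-vector transform per vertex.
Matrix BuildLocalToProjected(const Matrix& cameraToProjected,
                             const Matrix& worldToCamera,
                             const Matrix& localToWorld) {
    return cameraToProjected * (worldToCamera * localToWorld);
}
```

Both paths agree because matrix multiplication is associative. The stand-in matrices here are simple translations and a scale rather than a real perspective projection, but the same reasoning applies to any 4x4 matrices.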

Matrix Multiplication Order

A further optimization when generating LocalToProjected is the order in which we multiply the matrices LocalToWorld, WorldToCamera, and CameraToProjected. An initial approach may be to simply apply them in transform order, resulting in this code:

Matrix LocalToCamera = WorldToCamera * LocalToWorld;
Matrix LocalToProjected = CameraToProjected * LocalToCamera;

A subtle problem arises: this order of operations requires two matrix multiplications for every object when computing LocalToProjected. Let's see what happens when we group the multiplications the other way:

Matrix WorldToProjected = CameraToProjected * WorldToCamera;
Matrix LocalToProjected = WorldToProjected * LocalToWorld;

Hmm, this still requires two matrix multiplications. However, the matrix WorldToProjected depends only on the camera, so it can be computed once and reused for every object that needs to be drawn! Thus, for N objects we cut down the number of matrix multiplications from 2N to N + 1. The WorldToProjected matrix could also be stored in a constant buffer for shaders to use because it will be the same for any vertex.
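The 2N versus N + 1 count can be checked with a small C++ sketch. Everything here is an assumption made for illustration: the `Matrix` type, the `Translation` helper, the `BuildNaive` and `BuildShared` functions, and the global `g_multiplyCount` counter that simply tallies how many 4x4 multiplications ran.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal matrix type (column-vector convention: v' = M * v).
struct Matrix {
    float m[4][4];  // m[row][col]
};

// Illustration only: counts every 4x4 matrix multiplication performed.
static std::size_t g_multiplyCount = 0;

Matrix operator*(const Matrix& a, const Matrix& b) {
    ++g_multiplyCount;
    Matrix r{};  // zero-initialized
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

Matrix Translation(float x, float y, float z) {
    return Matrix{{{1.0f, 0.0f, 0.0f, x},
                   {0.0f, 1.0f, 0.0f, y},
                   {0.0f, 0.0f, 1.0f, z},
                   {0.0f, 0.0f, 0.0f, 1.0f}}};
}

// First grouping: two multiplications for every object, 2N in total.
std::vector<Matrix> BuildNaive(const Matrix& cameraToProjected,
                               const Matrix& worldToCamera,
                               const std::vector<Matrix>& localToWorlds) {
    std::vector<Matrix> result;
    result.reserve(localToWorlds.size());
    for (const Matrix& localToWorld : localToWorlds) {
        const Matrix localToCamera = worldToCamera * localToWorld;
        result.push_back(cameraToProjected * localToCamera);
    }
    return result;
}

// Second grouping: WorldToProjected is computed once, then each object
// costs a single multiplication, N + 1 in total.
std::vector<Matrix> BuildShared(const Matrix& cameraToProjected,
                                const Matrix& worldToCamera,
                                const std::vector<Matrix>& localToWorlds) {
    const Matrix worldToProjected = cameraToProjected * worldToCamera;
    std::vector<Matrix> result;
    result.reserve(localToWorlds.size());
    for (const Matrix& localToWorld : localToWorlds) {
        result.push_back(worldToProjected * localToWorld);
    }
    return result;
}
```

Resetting `g_multiplyCount` to zero before each call and passing in 100 objects should leave the counter at 200 after `BuildNaive` and 101 after `BuildShared`, while both functions return the same matrices.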

Instruction Count

One way to see the results of this optimization is to look at the disassembly for a shader that transforms the vertex through each matrix itself rather than pulling LocalToProjected from a constant buffer. Below, we see that this kind of shader takes approximately 24 instruction slots.

mul r1.xyzw, r0.xxxx, cb2[0].xyzw
mul r2.xyzw, r0.yyyy, cb2[1].xyzw
add r1.xyzw, r1.xyzw, r2.xyzw
mul r2.xyzw, r0.zzzz, cb2[2].xyzw
add r1.xyzw, r1.xyzw, r2.xyzw
mul r0.xyzw, r0.wwww, cb2[3].xyzw
add r0.xyzw, r0.xyzw, r1.xyzw  // r0.x <- vertexPosition_world.x; r0.y <- vertexPosition_world.y; r0.z <- vertexPosition_world.z; r0.w <- vertexPosition_world.w

mul r1.xyzw, r0.xxxx, cb0[0].xyzw
mul r2.xyzw, r0.yyyy, cb0[1].xyzw
add r1.xyzw, r1.xyzw, r2.xyzw
mul r2.xyzw, r0.zzzz, cb0[2].xyzw
add r1.xyzw, r1.xyzw, r2.xyzw
mul r0.xyzw, r0.wwww, cb0[3].xyzw
add r0.xyzw, r0.xyzw, r1.xyzw  // r0.x <- vertexPosition_camera.x; r0.y <- vertexPosition_camera.y; r0.z <- vertexPosition_camera.z; r0.w <- vertexPosition_camera.w

mul r1.xyzw, r0.xxxx, cb0[4].xyzw
mul r2.xyzw, r0.yyyy, cb0[5].xyzw
add r1.xyzw, r1.xyzw, r2.xyzw
mul r2.xyzw, r0.zzzz, cb0[6].xyzw
add r1.xyzw, r1.xyzw, r2.xyzw
mul r0.xyzw, r0.wwww, cb0[7].xyzw
add o0.xyzw, r0.xyzw, r1.xyzw

ret
// Approximately 24 instruction slots used

Now let's look at the same shader that has been optimized to use a cached LocalToProjected matrix.

mov r0.xyz, v0.xyzx  // r0.x <- vertexPosition_local.x; r0.y <- vertexPosition_local.y; r0.z <- vertexPosition_local.z
mov r0.w, l(1.000000)  // r0.w <- vertexPosition_local.w

mul r1.xyzw, r0.xxxx, cb2[4].xyzw
mul r2.xyzw, r0.yyyy, cb2[5].xyzw
add r1.xyzw, r1.xyzw, r2.xyzw
mul r2.xyzw, r0.zzzz, cb2[6].xyzw
add r1.xyzw, r1.xyzw, r2.xyzw
mul r0.xyzw, r0.wwww, cb2[7].xyzw
add o0.xyzw, r0.xyzw, r1.xyzw

ret 
// Approximately 10 instruction slots used

We reduced the instruction count from 24 to 10, saving 14 instructions for every vertex!
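The repeating mul/add pattern in both listings is a single matrix-vector transform, and a rough C++ sketch makes the arithmetic easier to follow. The `Vec4`, `OpCounter`, `Splat`, `Mul`, `Add`, and `TransformLikeDisassembly` names are hypothetical and only mirror the disassembly; the sketch also assumes the four constant-buffer registers act as the columns of the matrix under a column-vector convention.

```cpp
#include <array>

using Vec4 = std::array<float, 4>;

// Illustration only: tallies vector mul and add operations.
struct OpCounter {
    int mul = 0;
    int add = 0;
};

// Broadcasts one component to all four lanes, like the r0.xxxx swizzle.
Vec4 Splat(float s) {
    return {s, s, s, s};
}

// Component-wise multiply, like one `mul` instruction.
Vec4 Mul(const Vec4& a, const Vec4& b, OpCounter& ops) {
    ++ops.mul;
    return {a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]};
}

// Component-wise add, like one `add` instruction.
Vec4 Add(const Vec4& a, const Vec4& b, OpCounter& ops) {
    ++ops.add;
    return {a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]};
}

// One transform, written in the same order as the disassembly:
// result = x * col0 + y * col1 + z * col2 + w * col3
Vec4 TransformLikeDisassembly(const std::array<Vec4, 4>& columns,
                              const Vec4& v,
                              OpCounter& ops) {
    Vec4 r1 = Mul(Splat(v[0]), columns[0], ops);        // mul r1, r0.xxxx, cb[0]
    Vec4 r2 = Mul(Splat(v[1]), columns[1], ops);        // mul r2, r0.yyyy, cb[1]
    r1 = Add(r1, r2, ops);                              // add r1, r1, r2
    r2 = Mul(Splat(v[2]), columns[2], ops);             // mul r2, r0.zzzz, cb[2]
    r1 = Add(r1, r2, ops);                              // add r1, r1, r2
    const Vec4 r0 = Mul(Splat(v[3]), columns[3], ops);  // mul r0, r0.wwww, cb[3]
    return Add(r0, r1, ops);                            // add out, r0, r1
}
```

Each transform costs 4 mul and 3 add, so 7 instructions. The cached version needs one transform plus the two mov instructions and ret, which gives the 10 slots reported above. Three chained transforms come to 21 instructions, and adding ret plus the same two mov instructions (an assumption, since the first listing does not show them) would account for the reported 24.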
