Something about the decal implementation I wrote for Gunbritt.
When I began I was quite befuddled. I didn’t have a good idea of how to project an image onto (possibly) several other models. I looked at a few articles around the internet which described how to do it using a camera. I wanted something simpler though! After some contemplation I began breaking the problem down into smaller pieces. To begin with, I decided to limit myself to the deferred rendered models, which meant I didn’t have to draw onto multiple models. I also realized that by using a simple bounding box as the in-world representation of the decal, rendering that box would always generate the correct pixels (and a few unnecessary ones).
This subsequently meant I could sample the position texture and check whether the position of that pixel was inside or outside the bounding box using distance. I could also cull pixels using the dot product of the surface normal and the projector’s forward vector, which removes drawing on back-facing surfaces and stretched-out decals.
This is basically just if (pixel inside bounds) multi = 1 else multi = 0. But without branching.
After that I simply normalized the x/y distance of the pixel from the bounding box center (the bounding box faces along the projection direction) and used that to sample the decal texture, which is then output to the already existing deferred diffuse texture.
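To make the steps above a bit more concrete, here is a rough CPU-side sketch of the per-pixel math, written in C++ rather than shader code. It is not the actual Gunbritt shader: the names (DecalProjector, ProjectDecal, Step), the minFacing threshold and the flipped v coordinate are all assumptions made for illustration. Step stands in for the shader intrinsic of the same name, which is what lets the real thing avoid branching.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline float Dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 Sub(const Vec3& a, const Vec3& b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }

// Mirrors the shader intrinsic step(edge, x): 1 when x >= edge, otherwise 0.
inline float Step(float edge, float x) { return x >= edge ? 1.0f : 0.0f; }

// Hypothetical projector description: an oriented box with orthonormal axes.
struct DecalProjector
{
    Vec3 center;
    Vec3 right, up, forward; // forward is the projection direction
    Vec3 halfExtents;        // half size along right, up and forward
};

struct DecalSample
{
    float u, v;  // coordinate used to sample the decal texture
    float multi; // 1 when the pixel receives the decal, otherwise 0
};

inline DecalSample ProjectDecal(const DecalProjector& box, const Vec3& pixelPosition,
                                const Vec3& pixelNormal, float minFacing = 0.2f)
{
    assert(box.halfExtents.x > 0.0f && box.halfExtents.y > 0.0f && box.halfExtents.z > 0.0f);

    // Distance from the box center along each box axis, normalized to [-1, 1] inside the box.
    const Vec3 offset = Sub(pixelPosition, box.center);
    const float x = Dot(offset, box.right) / box.halfExtents.x;
    const float y = Dot(offset, box.up) / box.halfExtents.y;
    const float z = Dot(offset, box.forward) / box.halfExtents.z;

    // "if (pixel inside bounds) multi = 1 else multi = 0", expressed as a product of steps.
    const float inside = Step(std::fabs(x), 1.0f) * Step(std::fabs(y), 1.0f) * Step(std::fabs(z), 1.0f);

    // Cull surfaces that face away from (or are nearly parallel to) the projection direction.
    const float facing = Step(minFacing, -Dot(pixelNormal, box.forward));

    // Assumption: v grows downwards in texture space, hence the flip.
    return { x * 0.5f + 0.5f, 0.5f - y * 0.5f, inside * facing };
}
```

The resulting multi value would then simply be multiplied with the sampled decal color before it is blended into the diffuse texture.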
And now with some randomized rotation..
A few notes:
This implementation is obviously only for deferred rendering. Given how simple it is, I’m sure it can be made to work with forward rendering though.
During the rendering of the bounding boxes (‘projectors’), depth writing must be disabled. Else things will get wacky : ).
During our first project we wanted to use animations. This wasn’t really the norm, as we were going to be issued an animation handler, but not until the next project. So the task of implementing our own system fell to me… Anyway, here’s a bunch of pictures & text concerning its implementation : ).
In our engine, we have a bunch of support classes for the animation system.
The Animator is the main calculator for bone transforms & such, and is responsible for importing the bone data from fbx files, and constructing internal representations of them.
The way I implemented additive blending was to calculate the relevant keyframe in the regular fashion, and then subtract each bone’s corresponding bind pose transform from it. What is left is then added on top of whatever is in the current node’s local transform. Of course, we had to have our animator André Rondahl make animations specifically intended for that purpose.
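As a small illustration of the additive step, here is a literal reading of the description above. It assumes bone transforms are stored as 4x4 matrices and that the subtraction and addition are done component by component; the text doesn’t say which representation the Animator actually uses, and the names (Matrix4, ApplyAdditive) are made up for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix4 = std::array<float, 16>;

inline Matrix4 Identity()
{
    Matrix4 m{}; // all zeros
    m[0] = m[5] = m[10] = m[15] = 1.0f;
    return m;
}

// Component-wise helpers (an assumption, see the text above).
inline Matrix4 Subtract(const Matrix4& a, const Matrix4& b)
{
    Matrix4 result{};
    for (std::size_t i = 0; i < result.size(); ++i)
        result[i] = a[i] - b[i];
    return result;
}

inline Matrix4 Add(const Matrix4& a, const Matrix4& b)
{
    Matrix4 result{};
    for (std::size_t i = 0; i < result.size(); ++i)
        result[i] = a[i] + b[i];
    return result;
}

// keyframe: the additive animation evaluated at the current time, one entry per bone.
// bindPose: the bind pose transform of each bone.
// local:    the local transforms already produced by the regular animation (modified in place).
inline void ApplyAdditive(const std::vector<Matrix4>& keyframe,
                          const std::vector<Matrix4>& bindPose,
                          std::vector<Matrix4>& local)
{
    assert(keyframe.size() == bindPose.size() && keyframe.size() == local.size());

    for (std::size_t bone = 0; bone < local.size(); ++bone)
    {
        // What is left after removing the bind pose is the pure offset of the additive animation...
        const Matrix4 delta = Subtract(keyframe[bone], bindPose[bone]);
        // ...which is layered on top of whatever the bone is already doing.
        local[bone] = Add(local[bone], delta);
    }
}
```

A bone that sits at its bind pose in the additive animation contributes a zero delta, so it leaves the underlying animation untouched.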
With that we could do cool things like this:
Animation is a container class used by Animator to store animations. It contains all the keyframe data, as well as the functionality for calculating keyframe transforms at a point in time & interpolating between them.
Finally, there’s the AnimationController. This is owned by every model instance and is responsible for storing instance related data such as active animations & current bone transforms, as well as bone attachments.
The animation controller has some interesting functionality. One would be that it allows dynamic attach & detach of other model instances to one of its (the Animator’s) bones.
Another one would be automatic merge blending. This is done using a stored history of active animations, that is then interpolated from.
Sometime during my work on the queue I realized that garbage collection would be a nightmare, as the ownership of the buffers is essentially shared between one producer and all consumers. I felt that figuring out when it was ok to destroy a buffer would require synchronization mechanisms that interfered with the general pop/push performance.
I had some hopes though, that there would exist a concurrent shared pointer in the std library that could solve my problem without being a performance burden. After some searching I found lots of references to it and half-implementations of it, but when it came down to it, these were just experimental bits & pieces that weren’t available for use (As I understand, in C++20, there will be a thread safe shared pointer in std, and there was, at some point, a working experimental implementation in the code).
So I figured, why not build my own? I tried once and gave up because I couldn’t figure out how to do the increments & decrements safely & in a lock free manner. However, when the time came for me to do my Specialization project at The Game Assembly, I wanted to give it another shot.
As I began I had two criteria for the finished product: It would be lock free, and access to the shared object would have near-same performance as a simple raw pointer.
As I began thinking about the lock free control mechanism for copying a pointer object I realized something I had half-learned in my previous attempt: I could not allow a thread access to the shared block haphazardly, as there would be no way to know when another thread was already in there decrementing it (potentially to zero). That meant that whatever control mechanism I invented, I had to put in the pointer objects themselves, locally.
My first breakthrough in the control mechanism was the realization that so long as the source of a copy operation hadn’t decremented its corresponding shared block, that shared block (and shared object) would be alive. This meant that if I could somehow make the source object stay alive until the increments had happened, I would be in the clear. This spawned a lot of my first ideas/drafts:
Store a sort of client list in the upper word of a pointer
Which would then be incremented on copy to be a sort of promise from the from object to increment the shared block ref counter on the client’s behalf.
This failed of course as there was no telling when the actual incrementation would happen, and a client might choose to decrement the shared block before its corresponding increment (potentially killing the shared block ahead of time).
The second breakthrough I had came when I was reading a paper on an implementation of a lock free arbitrary length word. The idea was that lock-freedom can be guaranteed if all threads involved in an operation have to help out / try to succeed in whatever is going on. This made me realize that I was sort of on the right track, but I would have to include not only the from-object’s user threads, but the to-object’s users as well. The solution: Make the to-object’s assigner help out with increments.
As well as the from object..
That’s basically the main idea of the mechanism. In short: Increment the COPY_REQUEST iterator which forces whichever thread wants to store a new value in the pointer object to help increment the shared block ref counter, as well as help out with the incrementation from the Copy-requester.
Fun bonus. To make this mechanism work I needed more storage space than just the 16 extra bits on top of a 64 bit pointer block. This led me to investigate Microsoft’s Interlocked operations, which do in fact support one atomic operation on 128 bits: _InterlockedCompareExchange128. This uses the underlying ‘lock cmpxchg16b’ instruction, which is the only widely available 128 bit atomic instruction. This meant I had to wrap a whole bunch of operations around it. It resulted in the AtomicOWord class. Quite the utility.
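For reference, the “16 extra bits” idea can be sketched with a plain 64 bit atomic. This is the smaller variant that turned out not to be enough, not the AtomicOWord class itself, and all names in it are hypothetical. It assumes a platform where user space pointers fit in the low 48 bits (as on typical x86-64 setups). The compare-exchange loop is the same kind of building block one would wrap around a 128 bit compare-exchange to get other operations out of it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumption: only the low 48 bits of a pointer carry address information.
constexpr std::uint64_t kPointerMask = (std::uint64_t(1) << 48) - 1;

inline std::uint64_t Pack(const void* pointer, std::uint16_t count)
{
    const std::uint64_t bits = reinterpret_cast<std::uintptr_t>(pointer);
    assert((bits & ~kPointerMask) == 0); // the upper 16 bits must be free for the counter
    return (std::uint64_t(count) << 48) | bits;
}

inline void* UnpackPointer(std::uint64_t word)
{
    return reinterpret_cast<void*>(static_cast<std::uintptr_t>(word & kPointerMask));
}

inline std::uint16_t UnpackCount(std::uint64_t word)
{
    return static_cast<std::uint16_t>(word >> 48);
}

// Atomically increments the 16 bit counter (wrapping at 65535) while leaving the
// pointer bits untouched. Returns the word as it looked before the increment.
inline std::uint64_t IncrementCount(std::atomic<std::uint64_t>& word)
{
    std::uint64_t expected = word.load(std::memory_order_relaxed);
    std::uint64_t desired = 0;
    do
    {
        desired = Pack(UnpackPointer(expected),
                       static_cast<std::uint16_t>(UnpackCount(expected) + 1));
        // On failure, expected is refreshed with the current value and we retry.
    } while (!word.compare_exchange_weak(expected, desired,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed));
    return expected;
}
```

Because pointer and counter live in the same word, a thread that swaps in a new pointer also sees (and replaces) the counter that belonged to the old one in a single atomic step.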
During my studies at The Game Assembly I’ve taken on a few roles. Most recently & during most of the second year I’ve been the Systems, Animations & Engine guy, responsible for expediting and maintaining various functionality that the other programmers need. I’ve also taken on various gameplay related tasks when needed.
Here’s a list of guys I’ve been in the past (And in some cases still am, to some degree):
Collision / World Interaction Guy
Mouse Guy (Yep, point & click needed a mouse guy)
In backwards chronological order here are the games I’ve worked on.
Some things I did:
Decals & Brains Implementation = )
Base Networking system, along with teammate
Base physX integration & system, along with teammates
Various: Drag effect for bombs, particle system upgrades (more movement patterns, distortion to angles over time, mesh particles)
It began as a hobby project at the beginning of the 2018 summer when I felt frustrated I (for several reasons) didn’t get to learn anything about threading. This was just after my first year at The Game Assembly. It was designed as an excuse to learn more about threading.
In the beginning it was a simple bounded multi-producer-multi-consumer structure using a circular buffer and a whole bunch of atomics to control the flow. After using it for a while I realized I could make it a lot faster by not having it be sequentially consistent, splitting production over a series of buffers. Both these versions can be viewed in the ‘Discarded’ folder at my github as StaticMultiThreadQueue.h and DataShuttle.h, along with a whole bunch of failed ideas.
As I experimented with different ideas I realized I could combine the best of the previous versions (sequential consistency but with separate producers) and get dynamic memory allocation to boot! This resulted in the final version, which was later expanded upon with exception safety and a lot of optimization.
To explain the inner workings a little bit:
Internally all the producers keep a slot in an array, with pointers referring to one (usually the front-most) of the buffers in their buffer list. A new producer begins by allocating its initial buffer, and placing it in an appropriate slot.
Once inside the active buffer, it will iterate a slot iterator and, in the event that slot is marked as empty, insert its data there.
In the event a slot is marked as non-empty, the push will fail and the producer will allocate a new buffer and push it to the front of its list (the push to that new buffer can then be guaranteed to succeed, barring an exception). No two producers will ever share a buffer.
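The producer side described above could be sketched roughly like this. It is a heavily simplified stand-in for the real queue, with made-up names (ProducerBuffer, QueueProducer): the consumer side is left out entirely, so nothing here ever marks a slot as empty again, and doubling the capacity of each new buffer is just an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// A fixed size circular buffer owned by exactly one producer.
template <class T>
class ProducerBuffer
{
public:
    explicit ProducerBuffer(std::size_t capacity) : mySlots(capacity) { assert(capacity > 0); }

    bool TryPush(const T& value)
    {
        // Iterate the slot iterator first, then look at the slot it landed on.
        Slot& slot = mySlots[myWriteIterator++ % mySlots.size()];
        if (slot.occupied.load(std::memory_order_acquire))
            return false; // wrapped around onto an entry that has not been consumed yet

        slot.value = value;
        slot.occupied.store(true, std::memory_order_release);
        return true;
    }

    std::size_t Capacity() const { return mySlots.size(); }

private:
    struct Slot
    {
        T value{};
        std::atomic<bool> occupied{ false };
    };

    std::vector<Slot> mySlots;
    std::size_t myWriteIterator = 0; // only touched by the owning producer
};

template <class T>
class QueueProducer
{
public:
    explicit QueueProducer(std::size_t initialCapacity)
    {
        myBuffers.push_back(std::make_unique<ProducerBuffer<T>>(initialCapacity));
    }

    void Push(const T& value)
    {
        if (myBuffers.front()->TryPush(value))
            return;

        // The active buffer is full: allocate a new one and make it the front of the list.
        auto next = std::make_unique<ProducerBuffer<T>>(myBuffers.front()->Capacity() * 2);
        const bool pushed = next->TryPush(value); // a fresh buffer only has empty slots
        assert(pushed);
        (void)pushed;
        myBuffers.insert(myBuffers.begin(), std::move(next));
    }

    std::size_t BufferCount() const { return myBuffers.size(); }

private:
    std::vector<std::unique_ptr<ProducerBuffer<T>>> myBuffers; // front is the active buffer
};
```

Since the write iterator and the buffer list belong to a single producer, the only shared state a push touches is the slot flag, which is what makes a wait-free push plausible.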
Consumers circle around the producer arrays searching for one that’s interesting. (The search begins from a private offset to keep things a bit spread out).
When a consumer finds one that has contents, it will store a ‘shortcut’ pointer directly to that buffer, and continue dequeueing straight from the stored buffer until it fails.
Inside the buffer, when a consumer needs to dequeue an element, it first iterates an atomic which indicates whether an entry is actually available. Afterwards it claims an actual read slot using a second iterator variable.
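A minimal sketch of that two-step dequeue, assuming a buffer that is filled once and never wraps around. The names are hypothetical, everything uses the default sequentially consistent ordering for simplicity, and handing a failed ticket back with a decrement is my guess at one way to do it, not necessarily what the real queue does.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

template <class T>
class ConsumableBuffer
{
public:
    explicit ConsumableBuffer(std::size_t capacity) : mySlots(capacity) {}

    // Single producer only: write the entry first, then publish it.
    bool TryPush(const T& value)
    {
        const std::size_t written = myWritten.load();
        if (written == mySlots.size())
            return false;
        mySlots[written] = value;
        myWritten.store(written + 1);
        return true;
    }

    // May be called from several consumers at once.
    bool TryPop(T& out)
    {
        // Step 1: iterate the availability counter to find out if there is an entry for us.
        const std::size_t ticket = myAvailabilityIterator.fetch_add(1);
        if (ticket >= myWritten.load())
        {
            myAvailabilityIterator.fetch_sub(1); // nothing there, hand the ticket back
            return false;
        }

        // Step 2: an entry is guaranteed to exist, so claim an actual read slot.
        const std::size_t slot = myReadIterator.fetch_add(1);
        out = mySlots[slot];
        return true;
    }

private:
    std::vector<T> mySlots;
    std::atomic<std::size_t> myWritten{ 0 };              // entries published by the producer
    std::atomic<std::size_t> myAvailabilityIterator{ 0 }; // entries promised to consumers
    std::atomic<std::size_t> myReadIterator{ 0 };         // read slots claimed so far
};
```

Splitting the check from the claim means a consumer that finds the buffer empty never disturbs the read iterator, so the consumers that did get a ticket always read distinct, already written slots.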
As for the size method: It simply circles through all the buffer lists, collecting local sizes. The reason I am not using an atomic counter in the main structure is optimization. It becomes a highly contended atomic between all consumers & producers, and affects performance noticeably as a result.
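In sketch form, with a hypothetical BufferNode standing in for the real buffer type, such a size method might look something like this. Note that the result is only a snapshot: producers and consumers may change the local sizes while the loop is running.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a buffer in a producer's buffer list.
struct BufferNode
{
    std::atomic<std::size_t> size{ 0 }; // local size, only contended within this buffer
    const BufferNode* next = nullptr;   // next (older) buffer of the same producer
};

// Walks every producer's buffer list and sums up the local sizes.
inline std::size_t ApproximateSize(const std::vector<const BufferNode*>& producerLists)
{
    std::size_t total = 0;
    for (const BufferNode* head : producerLists)
    {
        for (const BufferNode* node = head; node != nullptr; node = node->next)
            total += node->size.load(std::memory_order_relaxed);
    }
    return total;
}
```

The cost of a size query grows with the number of buffers, but push and pop never have to touch a counter shared by every thread.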
To list some of its positives
It’s really fast. I compared it against a couple of similar queues (Boost, moodycamel, Microsoft) and it beats them all in both single & multithreaded performance. (Not counting moodycamel’s bulk operations).
Basic exception safety.
Wait-free pushing & Lock-free popping
Things to improve upon in the future:
Currently there is no garbage collection — old used-up buffers just sit around collecting virtual dust, until they are destroyed with the rest of the structure.
Recycling of producer slots. In the event a thread is killed, the array slot it occupies goes unused. This could be improved upon by having a mechanism for packing together the remaining existing producers & decrementing the counter.
Recycling of object Ids. As both producers and consumers rely on the local object Id to store local pointers, the arrays used must grow when encountering an Id higher than the current capacity. This could be made more efficient by recycling the Ids of destroyed queues.