I help companies make great products.

Blog

Porting Existing Systems to Unity's DOTS: JTween

With game engines like Unity and Unreal coming out with new exciting features all the time, it can be tough to keep up with all of the latest technological advancements. One of the best ways of staying engaged and up-to-date in these circumstances is to have a small use-case or reason to learn the latest and greatest and immediately apply it.

Goals

I’d been interested in becoming more familiar with Unity’s Data-Oriented Technology Stack (DOTS) since it was first announced, but didn’t have any immediate need for it at the time. This changed during the last Ludum Dare game jam, when I was fixing a bug in a custom tween library I’d written. I’d written it at a time when I was getting annoyed with a third-party tween engine we were using at work and wanted a more reliable solution that I could tweak to fit my needs. Tweening is relatively straightforward, with most of its algorithms mapping over chunks of data and executing small bits of work, so I realized that porting this tween engine to the new DOTS stack should offer significant potential performance benefits without much in the way of drawbacks. My main goal in porting it to leverage DOTS was to reduce the CPU time spent on the main thread updating tween progress and applying it to each tween’s target, and to move that work off to worker threads.

Thought Process

My first move was to assess which components of DOTS should be leveraged; the Entity-Component System (ECS) seemed like it might be a good fit, allowing for a component per unique tween target type and offering a hybrid approach for tweening native Unity components like Transforms. However, after digging into it for a day or two, looking at the example projects and posts on the Unity forums, it dawned on me that ECS has not yet stabilized in terms of its feature set and API, and finding basic, consistent implementation information for something as simple as updating Transforms seemed insurmountable. The Job system and Burst compiler are better in this regard, though, and can accomplish the same goal of porting the tween system even without using ECS.

Points of Interest

Using some of the DOTS samples here, I was able to pick up the basics and started working on an MVP tween implementation. A couple of things quickly became apparent.

NativeCollections

Working with NativeCollections, the Unity C# interface for native memory collections that allows data to be passed back and forth between the main thread and Job worker threads, requires delicate care. There are many situations that result in a memory leak or a fatal error; in the editor, a leak produces endless warnings until it is restarted, or worse, a fatal exception crashes it outright. This becomes doubly important when user-facing events and callbacks come into play, since order-of-operations issues and errors in user code can leave internal state improperly cleaned up. Situations to watch out for include:

  • Not disposing of native collections. This needs to be taken into account based on the Allocator type (which determines how long the NativeCollection persists) as well as in OnDestroy, for those occasions where the Unity editor or app stops playing before the collection would have been cleaned up at runtime (see the sketch after this list).

  • Accessing an index out of range. Nothing too groundbreaking here, but an out-of-range error with a NativeCollection will likely be fatal and could potentially crash your app/editor, so treat these with special care.

  • TransformAccessArray API is limited: for NativeArray, we have a variety of options for getting and setting contents in performant ways, including slicing portions of existing arrays. For TransformAccessArray, we are limited to passing an entire array of Transforms or adding/removing a single Transform at a time. As there is more happening internally in this collection when elements are added or removed, adding/removing large numbers of elements at once can cause performance spikes; removal in particular popped up on my radar as a problem. To mitigate this during the cleanup of tweens, I amortize (spread out) the number of Transforms removed per frame so that cleanup does not result in performance spikes.
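
As a concrete illustration of the first point above, here is a minimal sketch (not JTween’s actual code) of guarding a persistently allocated collection against leaks when the editor or app stops playing; the TweenRunner name and the capacity of 1024 are illustrative.

using Unity.Collections;
using UnityEngine;

// Hypothetical component holding a persistently allocated NativeArray.
public class TweenRunner : MonoBehaviour
{
    // Allocator.Persistent means this memory lives until we dispose it ourselves.
    private NativeArray<float> _progress;

    private void Awake()
    {
        _progress = new NativeArray<float>(1024, Allocator.Persistent);
    }

    private void OnDestroy()
    {
        // Guard with IsCreated so we never dispose an uninitialized collection
        // (or double-dispose) when the editor or app stops playing unexpectedly.
        if (_progress.IsCreated)
        {
            _progress.Dispose();
        }
    }
}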

Profile Early and Often

Profile early and often as a DOTS implementation develops. I created a very basic implementation of a tween library running on the Job system and using the Burst compiler, alongside a tester class that attempted to break the implementation in a variety of ways, including its ability to scale, clean up, and handle user input via the public APIs. As development continued, I was able to leverage this tester class by profiling it at various levels of scale; it gave immediate insight into areas that were under-performing and steered the underlying implementation. This also led to some great insights/patterns for creating job-based systems as well as some general C# performance benefits.

Even if you aren’t immediately acting on Profiler captures, they will likely lead to creating tasks for future work. There were many times while profiling that I saw issues with my implementation that were not the immediate focus of what I was working on, but formed the basis for future refactors that would address them. It’s often better to start with a simpler implementation that might be deficient in terms of performance but accomplishes the functional goal, and then iterate to make it perform to expectation, than to try to do both at the same time.

Even at small numbers of calls, List<T>.Remove(T value) was far slower than List<T>.RemoveAt(int index), since Remove needs to find the index first and then internally just calls RemoveAt.
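
As a trivial illustration (using a List<int> as a stand-in for a list of tween handles, and assuming using System.Collections.Generic):

// Illustrative comparison only.
var activeTweens = new List<int> { 10, 20, 30 };

// Remove(T) must first locate the element with a linear scan before it can
// shift the trailing elements down...
activeTweens.Remove(20);

// ...while RemoveAt(int) skips the search entirely when the index is known.
activeTweens.RemoveAt(1);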

As an example of starting simple and refactoring for performance later: my first array shuffle implementation, whose goal was to keep one linear block of valid elements, iterated through the array from a removed element and moved each trailing element back one slot at a time, which was horrid for the number of iterations that would occur at larger scales. I noted this, though, and a week or so later, as part of a larger performance pass over a functional implementation, I optimized it by simply copying all of the trailing elements back one slot as a single block.
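
Roughly, the refactor looked like the sketch below (an illustration rather than JTween’s exact code, assuming using System for Array.Copy):

// Illustrative helper: removes the element at `removedIndex` from the first
// `count` valid elements of `tweens`, keeping the valid block contiguous, and
// returns the new count.
private static int RemoveAtKeepingLinearBlock(TweenLifetime[] tweens, int count, int removedIndex)
{
    // Naive version (what I started with): shift each trailing element back
    // one slot at a time.
    //     for (var i = removedIndex; i < count - 1; i++)
    //     {
    //         tweens[i] = tweens[i + 1];
    //     }

    // Refactored version: move the entire trailing block in a single copy.
    // Array.Copy handles the overlapping source/destination ranges correctly.
    Array.Copy(tweens, removedIndex + 1, tweens, removedIndex, count - removedIndex - 1);

    // The logical count of valid elements shrinks by one either way.
    return count - 1;
}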


Jobs with Shifting Sizes of Work can be Challenging

Creating a job-based system whose immediate working contents can expand and contract frequently is challenging, but it is required for a tween library whose usage can occur at any time and where tweens can be very short or long. Many of the Unity NativeCollections like NativeArray, as well as the Job interfaces, accept either a single array as a parameter or a numeric capacity that requires the user to manually copy the contents in via the index operator or an Add member. This conflicts with our desire to grow or shrink the size of the Job’s contents at will, but we can overcome it by using the fixed operator and Unity’s static UnsafeUtility class to copy only the slice of our managed array’s contents that we want worked on over the NativeArray’s contents. We can also employ the same technique in reverse to copy a NativeArray’s contents back over the managed array’s contents. Here is an example adapted from a Gist here by LotteMakesStuff, which was where I first saw this being employed:

public static unsafe void CopyTweenLifetimeDirectlyToNativeArray(
    TweenLifetime[] sourceArray,
    NativeArray<TweenLifetime> destinationArray,
    int length)
{
    // Pin the managed array so the GC cannot move it while we copy.
    fixed (void* arrayPointer = sourceArray)
    {
        // Copy only the first `length` elements into the NativeArray's buffer.
        UnsafeUtility.MemCpy(
            NativeArrayUnsafeUtility.GetUnsafeBufferPointerWithoutChecks(destinationArray),
            arrayPointer,
            length * UnsafeUtility.SizeOf<TweenLifetime>());
    }
}

public static unsafe void CopyNativeArrayDirectlyToTweenLifetime(
    NativeArray<TweenLifetime> sourceArray,
    TweenLifetime[] destinationArray)
{
    // Pin the managed destination array for the duration of the copy.
    fixed (void* arrayPointer = destinationArray)
    {
        // Copy the NativeArray's contents back over the managed array.
        UnsafeUtility.MemCpy(
            arrayPointer,
            NativeArrayUnsafeUtility.GetUnsafeBufferPointerWithoutChecks(sourceArray),
            sourceArray.Length * UnsafeUtility.SizeOf<TweenLifetime>());
    }
}

Copying in only slices of the data we want is predicated on all of the data elements we want worked on being in a single linear block of the array. A tween’s duration is assigned by its user, and a tween could potentially be removed at any point if stopped, so our strategy for keeping a linear block of data elements must be able to handle this. To enable linear blocks of valid data elements and support removal at any time, I shuffle the array’s contents in a naive way by copying the entire block of data elements starting just past the removed index back over the removed index. With batch tweens that have a shared duration, this is made even easier since we can shuffle entire blocks of tween data elements as they’re being recycled.

The advantage of being able to copy over a slice of an array, versus a fixed-size array made up of only the contents that need to be worked on, is that we can set a much greater capacity for these managed arrays, up to the maximum number of tweens we want to support, and avoid the GC cost of resizing them. Additional tween data elements can be added to the end of the current “length” of the array’s contents at any time, even while Jobs are being run.
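
To make the flow concrete, here is a rough sketch (not JTween’s actual implementation) of how the copy helpers above might be used each frame; the job struct, the field names, and the batch size of 64 are illustrative, and this assumes it lives in a MonoBehaviour-like runner with using directives for Unity.Jobs, Unity.Burst, Unity.Collections, and UnityEngine.

// Managed array sized to the maximum tween capacity; only the first
// _activeCount elements are currently in use. The NativeArray shares the same
// capacity and is allocated once with Allocator.Persistent.
private TweenLifetime[] _lifetimes;
private NativeArray<TweenLifetime> _nativeLifetimes;
private int _activeCount;

[BurstCompile]
private struct UpdateTweenLifetimesJob : IJobParallelFor
{
    public float DeltaTime;
    public NativeArray<TweenLifetime> Lifetimes;

    public void Execute(int index)
    {
        var lifetime = Lifetimes[index];
        // Advance the tween's elapsed time here; the exact fields depend on
        // how TweenLifetime is defined.
        Lifetimes[index] = lifetime;
    }
}

private void Update()
{
    // Copy only the active slice of the managed array into native memory.
    CopyTweenLifetimeDirectlyToNativeArray(_lifetimes, _nativeLifetimes, _activeCount);

    // Schedule the job over just the active elements, in batches of 64.
    var handle = new UpdateTweenLifetimesJob
    {
        DeltaTime = Time.deltaTime,
        Lifetimes = _nativeLifetimes
    }.Schedule(_activeCount, 64);

    handle.Complete();

    // Copy the results back over the managed array.
    CopyNativeArrayDirectlyToTweenLifetime(_nativeLifetimes, _lifetimes);
}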

The only NativeCollection type JTween uses that we cannot employ this technique with is TransformAccessArray; we cannot easily add contents to it en masse and are limited to adding and removing individual elements (a constructor exists that allows passing an array of Transforms, but without being able to specify a length or copy over a slice, it does not help us). This seems mostly due to TransformAccessArray internally sorting transforms and interacting more directly with the Unity engine than a NativeArray of arbitrary structs does. It is the only NativeCollection type that we persist across multiple frames and add/remove individual elements from, because it’s expensive to reallocate and re-populate it every frame and cheaper over the long run to set an initial, large capacity.
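
The amortized removal mentioned earlier might look roughly like this (a simplified sketch rather than JTween’s actual code; the pending-removal queue, the per-frame budget of 64, and the linear index lookup are all illustrative, and using directives for System.Collections.Generic, UnityEngine, and UnityEngine.Jobs are assumed):

// Completed tweens queue their Transform for removal instead of removing it
// immediately; each frame we remove at most a fixed number of them so a large
// batch finishing on the same frame cannot spike that frame.
private TransformAccessArray _transformAccessArray; // created once with a large capacity
private readonly Queue<Transform> _pendingRemovals = new Queue<Transform>();
private const int RemovalsPerFrame = 64;

private void LateUpdate()
{
    var budget = RemovalsPerFrame;
    while (budget-- > 0 && _pendingRemovals.Count > 0)
    {
        var transform = _pendingRemovals.Dequeue();

        // RemoveAtSwapBack changes the indices of other elements, so we look
        // the index up at removal time rather than caching it.
        for (var i = 0; i < _transformAccessArray.length; i++)
        {
            if (_transformAccessArray[i] == transform)
            {
                _transformAccessArray.RemoveAtSwapBack(i);
                break;
            }
        }
    }
}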

Order of operation matters

Creating a Job-based system with a simple, easy user-facing API and callbacks requires careful orchestration. There is a period of time during a frame when data has been copied to the worker threads and is being worked on by a Job, and changes made to the same data on the main thread during that window can have any number of undesired effects; i.e., we really only want to modify Job data when it is not being worked on by Jobs. To achieve this, all user actions affecting data that might be used by a Job are encapsulated in an action queue and resolved before/after the Jobs’ work. The same approach is used for tween callbacks; these are queued in such a way that they can be safely invoked after all sensitive NativeCollections or Job code has run. This prevents downstream errors in user code from causing cascading errors internally in JTween that might result in fatal errors for the Unity editor and/or application.

Since a user callback could result in anything, including error-prone code outside of our control, we queue callbacks up as we’re processing tween data for Jobs and invoke them in a safe context after we’ve already handled the more delicate NativeCollections and Job work.

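A minimal sketch of this queueing pattern might look like the following (not JTween’s actual API; Stop, RemoveTweenInternal, and RunTweenJobs are hypothetical names, and using directives for System, System.Collections.Generic, and UnityEngine are assumed):

// User-facing calls made while Jobs may be in flight are recorded as deferred
// actions; completion callbacks discovered while processing tween data are
// recorded as well. Both queues are only drained at points in the frame where
// no Job is touching the underlying data.
private readonly Queue<Action> _pendingUserActions = new Queue<Action>();
private readonly Queue<Action> _pendingCallbacks = new Queue<Action>();

public void Stop(int tweenHandle)
{
    // Defer the mutation instead of touching Job data immediately.
    _pendingUserActions.Enqueue(() => RemoveTweenInternal(tweenHandle));
}

private void Update()
{
    // 1. Apply queued user actions while no Jobs are running.
    while (_pendingUserActions.Count > 0)
    {
        _pendingUserActions.Dequeue().Invoke();
    }

    // 2. Copy data, schedule and complete Jobs, and copy results back,
    //    queueing any completion callbacks encountered along the way.
    RunTweenJobs();

    // 3. Invoke user callbacks last; an exception thrown here can no longer
    //    leave NativeCollections or Job state partially updated.
    while (_pendingCallbacks.Count > 0)
    {
        try
        {
            _pendingCallbacks.Dequeue().Invoke();
        }
        catch (Exception e)
        {
            Debug.LogException(e);
        }
    }
}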

Outcome

If the goal was to reduce CPU time on the main thread and shift it to worker threads via the DOTS stack, how does the DOTS-based JTween compare to my old tween library? I decided to profile them both. For the test, I start a large number of tweens on one frame (10,000 Transforms moving at once), all with the same duration so that they end on the same frame.

Old Tween Library

As we can see, my old tween library was not built well enough to handle this volume of tweens occurring at the same time. The editor/Windows standalone build hovers around 10-11 ms of total CPU time, while the Android build is significantly worse off at around 16-23 ms.

There are a few things I could do without DOTS to improve the performance of this on the main thread (using structs and arrays over classes and linked lists for tween data would allow for much faster looping), but there is likely a threshold for gains using C# improvements alone, and it doesn’t change the fact that this work still takes up time on the main thread alongside all of our other gameplay/engine logic.

JTween

The difference in CPU time taken on the main thread between my old tween library and JTween is roughly an order of magnitude. As most of the work on the main thread revolves around moving data to and from Jobs, and the actual processing of tween data occurs on worker threads, porting to DOTS has drastically reduced the main-thread CPU time to 0.75-1.5 ms for the editor/Windows standalone builds and 2.7-3 ms on Android. The Jobs processing tween data can also take advantage of Burst compiler optimizations to run even faster than they could have on the main thread alone. And these numbers are likely only to get better over time as DOTS improves.

Conclusion

All in all, my experience writing JTween has shown me that writing dynamic, performant systems utilizing DOTS is possible with a little extra work and well-thought-out design. Porting an existing system to leverage DOTS does not necessarily require significant API changes for users, but it does require internal architectural changes. JTween can be used for free in both personal and commercial projects and can be found here.

If JTween is a tool you’d like to see developed further and want to support (along with any of the other free tools I have released on GitHub), your bug reports and feature/pull requests are more than welcome, and if you’d like to donate to support my work you can do so at the link below.

ko-fi