At first glance, JavaScript may not seem like the ideal language for scripts or applications that manipulate large amounts of data. It's (mostly) single-threaded, limited in how much memory it can allocate, and powered by an improving but not quite industry-leading garbage collector. Of course, there are still good reasons to use JavaScript, such as the ease of running in web browsers, the ability to utilise an existing codebase, or the diverse ecosystem of packages available. This post goes over a few key problem areas and how you can avoid them.
Data IO
While this is a common optimisation rather than an obscure one, it's a critical first step for anyone new to using JavaScript for data-heavy tasks. JavaScript runs in a single thread, which means only one task can be executing at any given moment in your application. When working with data stored in files or on the web, reading and writing can be a massive bottleneck. Luckily, JavaScript is a concurrent language.
Using Promise.all() alongside multiple fetch or fs.readFile calls, you can tell JavaScript to queue the file or network calls and wait for them all to complete before continuing. For more information about this, check out Mozilla's guide on using promises. If you don't use a call to something like Promise.all here, the application will read each file or network request one after the other, which could be significantly slower depending on the circumstances.
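As a minimal sketch of what this looks like (the file names here are placeholders for illustration), several files can be read concurrently with fs.promises.readFile and Promise.all:

const fs = require('node:fs/promises');

// The file names are placeholders for illustration.
async function loadAll() {
    // Start all three reads at once, then wait for every one to finish.
    const [users, orders, products] = await Promise.all([
        fs.readFile('users.json', 'utf8'),
        fs.readFile('orders.json', 'utf8'),
        fs.readFile('products.json', 'utf8'),
    ]);
    return { users, orders, products };
}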
There are some downsides to keep in mind, however. Most runtimes and operating systems enforce limits on the number of files that can be open or network requests that can be in flight at once. While Node will handle this for files, network requests aren't as forgiving: in both Node and the browser, exceeding those limits often causes instability rather than a clear error. It's therefore essential not to put hundreds of network requests into a single Promise.all call. A helper function such as the following can limit the number of network requests running at any given time.
async function runWithLimit(workers, count = 10) {
    if (workers.length === 0) {
        return [];
    }
    // Results are written by index so the output order matches the input order,
    // regardless of which task finishes first.
    const results = new Array(workers.length);
    let next = 0;
    async function task() {
        // Each task keeps pulling the next unstarted worker until none remain.
        while (next < workers.length) {
            const index = next++;
            results[index] = await workers[index]();
        }
    }
    // Spin up `count` tasks that run concurrently.
    await Promise.all(new Array(count).fill(undefined).map(() => task()));
    return results;
}
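As a usage sketch (the URLs are placeholders, and a runtime with a global fetch such as a modern browser or Node 18+ is assumed), each request is wrapped in a function so it doesn't start until one of the helper's tasks picks it up:

const urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'];

// Each entry is a function, so the request doesn't start until a task calls it.
const workers = urls.map((url) => () => fetch(url).then((res) => res.json()));

runWithLimit(workers, 10).then((results) => console.log(results.length, 'responses'));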
Chunk data
When possible, operating on the smallest set of data possible is ideal. NodeJS, by default, has a low memory limit, and most browsers heavily limit the amount of memory a page can use. As such, processing the data in chunks can make data-heavy work feasible, though whether this is possible depends on what the operations are. In cases where the data can't be processed piecemeal but only a small subset of it is needed, one common technique is to read the file twice: on the first read, gather the required data, and then on the second read, do the actual processing. Doing this is slower due to the extra file read, but it prevents running out of memory.
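A minimal sketch of that two-pass idea in Node follows; the file path and the "is this line needed" check are placeholders for illustration.

const fs = require('node:fs');
const readline = require('node:readline');

async function twoPassProcess(path) {
    // Pass 1: only record which lines we actually need.
    const wanted = new Set();
    let lineNumber = 0;
    for await (const line of readline.createInterface({ input: fs.createReadStream(path) })) {
        if (line.includes('interesting')) {
            wanted.add(lineNumber);
        }
        lineNumber++;
    }

    // Pass 2: re-read the file and process only that subset,
    // so the full data set is never held in memory at once.
    const results = [];
    lineNumber = 0;
    for await (const line of readline.createInterface({ input: fs.createReadStream(path) })) {
        if (wanted.has(lineNumber)) {
            results.push(line.toUpperCase());
        }
        lineNumber++;
    }
    return results;
}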
Workers
While promises are good at reducing IO bottlenecks, they don't help at all with CPU-heavy tasks; if anything, the overhead of the promises slows the program down slightly. One way around this is to use a feature of JavaScript known as Web Workers. Workers allow you to run code on other threads, getting around the limitation of only doing one thing at a time. One major downside of workers in JavaScript is that they can't share data. You must send data to the worker via a message channel and receive the results back the same way, which can become messy for more complex setups.
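In Node, the analogous building block is the built-in worker_threads module. A minimal sketch of that message-passing flow looks like the following; the summing workload is just a stand-in for real CPU-heavy work.

const { Worker, isMainThread, parentPort, workerData } = require('node:worker_threads');

if (isMainThread) {
    // Main thread: hand a chunk of data to the worker and wait for its message.
    const worker = new Worker(__filename, { workerData: [1, 2, 3, 4, 5] });
    worker.on('message', (result) => console.log('result from worker:', result));
    worker.on('error', (err) => console.error(err));
} else {
    // Worker thread: do the CPU-heavy part (a stand-in sum here) and post it back.
    const total = workerData.reduce((sum, n) => sum + n, 0);
    parentPort.postMessage(total);
}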
A library I've used in the past to make this simpler is microjob on npm. It lets you send data as a parameter to a function and get the result back by treating the call to the other thread as a promise. The library is NodeJS-only and does not support browsers; when writing code for the browser, you must either use another library or use Web Workers directly.
Hidden Classes
While this is considered a micro-optimisation in most cases, for large-scale data manipulation in JavaScript this internal v8 (the JavaScript engine NodeJS and Chrome use) optimisation can be crucial. The v8 engine creates hidden classes to speed up property access. When the structure of an object changes, such as adding or removing a property, it must re-create these hidden classes. Doing this lowers the performance of object accesses by a small factor, which adds up when doing large amounts of data processing. To avoid it, keep the structure of your objects as rigid as possible. If you know you'll be adding another field later in the program, try setting it to a dummy value at the start.
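As a small illustration (the Point shape here is made up), initialising every property up front keeps all objects on the same shape, even when a value isn't known yet:

// Preferred: every object gets its full shape at creation time, so they all
// share one hidden class even though `label` isn't known yet.
function makePoint(x, y) {
    return { x, y, label: null };
}

// Slower pattern: adding `label` after creation changes the object's shape,
// forcing a hidden-class transition partway through the program.
function makePointLater(x, y) {
    const point = { x, y };
    // ...much later...
    point.label = 'set later';
    return point;
}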
Memory Allocations
One of the most significant issues with processing large amounts of data in NodeJS is hitting the garbage collector. Most garbage collectors optimise for gradual memory allocation over time rather than massive quantities in short bursts. JavaScript's array helper methods, such as map, filter, and flatMap, all return a copy of the original array, allocating more memory that the garbage collector must later free. Once memory usage gets over a certain amount, the garbage collector will pause the application while it frees memory. The time it takes to free memory is often non-trivial and can significantly slow down the application.
If an application has three chained map calls and a filter call on a 10,000-element array, three new copies of that array are created and immediately discarded. Cases like this are typical in data-heavy applications and add up very quickly. When possible, avoid throw-away memory allocations.
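As an illustration (the transformations are arbitrary), the chained version below allocates an intermediate array per step, while a single loop produces only the final result:

const items = Array.from({ length: 10000 }, (_, i) => i);

// Chained helpers: each call allocates a new 10,000-element array,
// and everything but the final result is immediately garbage.
const chained = items
    .map((n) => n + 1)
    .map((n) => n * 2)
    .map((n) => n - 3)
    .filter((n) => n % 2 === 0);

// Single pass: one output array and no intermediate copies.
const singlePass = [];
for (const n of items) {
    const value = (n + 1) * 2 - 3;
    if (value % 2 === 0) {
        singlePass.push(value);
    }
}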
Array mutation
In slow areas of the code where it is safe to modify the original array, writing a custom map function that doesn't create a copy can be beneficial.
function inlineMap(arr, f) {
    // Overwrites each element in place rather than allocating a new array.
    for (let i = 0; i < arr.length; i++) {
        arr[i] = f(arr[i]);
    }
    return arr;
}
This function is simple and doesn't support map features such as passing the index to the callback; however, it avoids allocating a new array entirely. You can add extra functionality if the application requires it. While this allocates less memory, it does modify the array passed to it, so you must take care that nothing needs the original data afterwards.
Another boost to apply here is mutating the objects inside the array as well. A typical map callback takes the current value and returns a new one. In cases where you're just modifying the original object, you can instead perform that modification and return the same object. Consider the following two examples: the first allocates a new object per element, as a typical map callback would, while the second mutates the original object.
// Allocates a brand-new object for every element, like a typical map callback.
inlineMap(data, (i) => ({
    id: i.id,
    data: i.data * 2,
}));

// Mutates the existing object and returns it, avoiding the extra allocation.
inlineMap(data, (i) => {
    i.data *= 2;
    return i;
});
One major caveat with this method is that it modifies the objects themselves, so the change is visible from every array that references them. If you copy an array and modify an object inside the copy, the modification will also show up in the original. These techniques sacrifice the readability and maintainability of the codebase, so you should avoid them in general code and reach for them only when necessary.
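To make that caveat concrete (the data here is arbitrary, and the inlineMap function from above is assumed to be in scope), a spread copy of an array still points at the same objects, so mutating through the copy changes what the original sees:

const original = [{ data: 1 }, { data: 2 }];
const copy = [...original]; // new array, but the same object references

inlineMap(copy, (i) => {
    i.data *= 2;
    return i;
});

console.log(original[0].data); // 2 — the "untouched" original was affected too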
For reference, here is a jsperf entry on the above inlineMap function. While it shows only a small gain over the native function in most modern browsers, the real benefit comes from the reduced garbage collection.
Will v8 fix this in the future?
This problem with chained array helper calls is unlikely to be fixed by future v8 optimisations because of how hard it is to determine when they are safe to apply. Languages such as Java and C# have proper stream systems in which a terminating operation enables better optimisations at the runtime level. The requirement to "collect" the results in Java, while lengthier to type, allows for optimisations that are not possible with APIs like JavaScript's.
Use a Profiler
Despite being the last point in this post, using a profiler is always the first thing you should do when optimising code. Optimisation is a data-driven process: if you don't know what is actually slow, you don't know what to fix. To quote Donald Knuth,
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimisation is the root of all evil (or at least most of it) in programming.
Once you have verified something is slow and modified the code, you must then also use a profiler to confirm that you have fixed the issue. If you haven't established a performance improvement with metrics, you have no way of knowing that you improved anything. An excellent tool for testing out differences in small chunks of code is jsperf, as linked earlier.
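For quick local comparisons in Node, a rough timing sketch like the following can be a starting point before reaching for a full profiler. It reuses the inlineMap function from earlier, and the numbers it prints are easily skewed by JIT warm-up and garbage collection, so treat them as indicative only.

const { performance } = require('node:perf_hooks');

// Assumes the inlineMap function from the "Array mutation" section is in scope.
const data = Array.from({ length: 1_000_000 }, (_, i) => ({ id: i, data: i }));

let start = performance.now();
data.map((i) => ({ id: i.id, data: i.data * 2 }));
console.log('map:', (performance.now() - start).toFixed(1), 'ms');

start = performance.now();
inlineMap(data, (i) => {
    i.data *= 2;
    return i;
});
console.log('inlineMap:', (performance.now() - start).toFixed(1), 'ms');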
Conclusion
Optimising software can be complex. It's essential to understand the common pitfalls, such as excessive memory allocation or loading files one at a time, and to validate performance issues before making changes. Optimising code without confirming that it's slow, and that your change actually improved it, is not only a waste of time but potentially a detriment to the overall quality of the codebase.
About the Author
Hi, I'm Maddy Miller, a Senior Software Engineer at Clipchamp at Microsoft. In my spare time I love writing articles, and I also develop the Minecraft mods WorldEdit, WorldGuard, and CraftBook. My opinions are my own and do not represent those of my employer in any capacity. Find out more.