moring 1 days ago [-]
The article shows nicely how "every byte matters" is false. First, it starts off by talking about the cost of a new field, when the actual topic is array-of-structs vs. struct-of-arrays. Then, this:
> How much of an impact can this have?
> Reading is:alive (1 byte) Across 1M Monsters
You aren't reading one byte here, you are reading 1M bytes! Of course, optimizing the access to 1M bytes is something to consider. Optimizing the access to one byte isn't.
The article is definitely worth reading IMHO, but it really needs a better headline!
jayd16 1 days ago [-]
Even more so, it shows that SoA data structure means you can add fields to your 1M monsters with little impact.
gmueckl 1 days ago [-]
This is valid for sequential scanning of the data. The CPU will fill whole cache lines at once with the arrays that do get used and the algorithm touches all the field instances in the array.
Now think about random access to single struct instances instead: the CPU loads a cache line worth of data for each field and uses only one element out of the whole cache line. This is much worse than a compact structure representation of the same data.
SoA is not universally better.
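To make the two access patterns concrete, here is a rough C++ sketch. The `Monster` fields, the 64-byte cache line, and the helper `lines_for_scan` are all assumptions for illustration, not anything taken from the article.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical monster record; field names are illustrative only.
struct Monster {                 // array-of-structs element
    float x, y, z;
    std::int32_t health;
    std::uint8_t is_alive;
};

struct MonstersSoA {             // struct-of-arrays: one array per field
    std::vector<float> x, y, z;
    std::vector<std::int32_t> health;
    std::vector<std::uint8_t> is_alive;
};

// Sequential scan of one hot field. The AoS version strides over whole
// records, so every cache line it pulls in is mostly unrelated fields.
std::size_t count_alive(const std::vector<Monster>& ms) {
    std::size_t n = 0;
    for (const Monster& m : ms) n += (m.is_alive != 0);
    return n;
}

// The SoA version reads one dense byte array and nothing else.
std::size_t count_alive(const MonstersSoA& ms) {
    std::size_t n = 0;
    for (std::uint8_t a : ms.is_alive) n += (a != 0);
    return n;
}

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

// Cache lines covered by a sequential scan over `count` elements spaced
// `stride` bytes apart, assuming the array starts on a line boundary.
constexpr std::size_t lines_for_scan(std::size_t count, std::size_t stride) {
    return (count * stride + kCacheLine - 1) / kCacheLine;
}
```

With a 1-byte stride, `lines_for_scan(1000000, 1)` gives 15,625 lines for the SoA scan. With `sizeof(Monster)` as the stride (typically 20 bytes on mainstream platforms) the AoS scan covers 312,500 lines. The random-access case flips the picture: reading all five fields of one monster touches one or two lines in AoS but five separate lines in SoA, with only one element used from each.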
Rendello 3 hours ago [-]
> SoA is not universally better.
This is an important part of Data-oriented Design: the representation of the data should be pragmatically tied to its access patterns, not dogma.
Richard Fabian's DoD book gives the example that (x,y,z) is almost always better as a classic array-of-structs rather than a struct-of-arrays, because if you're accessing one dimension, you probably want to process the other two dimensions at the same time:
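A small C++ sketch of that idea, with made-up names (`Vec3`, `Particles`, `translate`): the three coordinates stay together as one unit, while other per-entity fields can still live in their own arrays.

```cpp
#include <cstddef>
#include <vector>

// The three coordinates are nearly always used together, so they stay
// adjacent in memory as one small struct.
struct Vec3 { float x, y, z; };

// Hybrid layout (illustrative): SoA at the entity level, but position is
// stored as an array of Vec3 rather than three separate float arrays.
struct Particles {
    std::vector<Vec3> position;
    std::vector<float> mass;
};

// Touches x, y and z of each point in one pass over one array.
void translate(std::vector<Vec3>& pts, Vec3 d) {
    for (Vec3& p : pts) {
        p.x += d.x;
        p.y += d.y;
        p.z += d.z;
    }
}

float length_squared(const Vec3& v) {
    return v.x * v.x + v.y * v.y + v.z * v.z;
}
```

Whether this beats three separate arrays still depends on the access pattern. This only shows the layout the comment describes.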
This sounds similar to relational databases vs document oriented databases, at least when I briefly looked into databases like MongoDB when such things were all the rage 15-20 years ago.
For the internal web site that customer support people used, a document oriented database would be great because that wants to load everything about one customer and pretty much doesn't need anything else until the user is done supporting that customer.
For the dozens of periodic reports that needed to be generated, relational was way better. A given report generally only wanted a small amount of per customer data but wanted that for all customers.
A little bit of searching and LLM querying suggests that nowadays there are databases that are good at both kinds of tasks, in particular PostgreSQL with JSONB, at least at the scale we were looking at (maybe 30k or so customers), but maybe really big operations would need more specialized software.
tremon 1 days ago [-]
The Array-of-Structs vs Struct-of-Arrays organization is actually more similar to row-major ordering vs column-major ordering, i.e. the data structure that analysis databases use to optimize for aggregate calculations. Document databases are not really comparable because they don't impose structure on the data; with document databases you just have a tree of JSON elements, which is neither AoS nor SoA.
ncruces 10 hours ago [-]
Or, another name for the same thing: columnar storage vs. row storage.
jayd16 1 days ago [-]
No it's not always better and I didn't mean to imply it was. I was simply saying that the article argues against its title.
In both cases you want to think about locality of the next read and structure the data accordingly.
notatyrannosaur 1 days ago [-]
> you can add fields to your 1M monsters with little impact.
Great for this access pattern, but I wouldn't make a general statement like that. This is the same thing as row-oriented vs column-oriented databases, OLTP vs OLAP.
SoA is weak if you are adding/removing monsters more often than accessing a single "hot" field.
Altern4tiveAcc 1 days ago [-]
> SoA is weak if you are adding/removing monsters more often than accessing a single "hot" field.
Why is that? Genuinely curious. Does "weak" mean that it performs worse than AoS, or that the gains aren't as significant versus AoS?
tsimionescu 1 days ago [-]
It's because removing a monster with 20 fields from an SoA structure means resizing 20 arrays. Removing the same monster from an AoS array involves resizing a single array, which you're going to process in a very cache friendly way.
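As a sketch of what that looks like in C++ (a hypothetical three-field layout standing in for the 20 fields, names invented for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical three-field layout; the comment above imagines 20 arrays.
struct MonstersSoA {
    std::vector<float> x;
    std::vector<float> y;
    std::vector<std::int32_t> health;

    std::size_t size() const { return health.size(); }

    // Order-preserving removal: each array shifts its whole tail down by
    // one slot, so the O(n) shift is repeated once per field.
    void erase_ordered(std::size_t i) {
        assert(i < size());
        const auto off = static_cast<std::ptrdiff_t>(i);
        x.erase(x.begin() + off);
        y.erase(y.begin() + off);
        health.erase(health.begin() + off);
    }
};

// AoS equivalent: a single array and a single contiguous shift.
struct Monster {
    float x, y;
    std::int32_t health;
};

void erase_ordered(std::vector<Monster>& ms, std::size_t i) {
    assert(i < ms.size());
    ms.erase(ms.begin() + static_cast<std::ptrdiff_t>(i));
}
```

Both versions move the same number of bytes overall. The SoA version spreads that work over as many separate memory regions as there are fields.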
vouwfietsman 1 days ago [-]
I'm not sure why anybody would at the same time be implementing SoA AND resizing 20 arrays for a single delete; those things seem to be on opposite ends of the "I care about performance" spectrum.
tsimionescu 22 hours ago [-]
The point is that a simple SoA implementation requires this - each field in the monster struct is an item in 20 different arrays. So, removing one monster means removing that item from those 20 arrays.
Now, as others have suggested, you can have a more complex implementation, where instead of removing the monster's fields from those arrays, you just mark them as "dead" or whatever and then skip them when consuming the relevant arrays, with some relatively small extra bookkeeping overhead. Of course, this comes with its own drawbacks, especially if the number of monsters is very dynamic and you are memory constrained.
The point is not to say that SoA is never good for performance, it obviously and certainly is, probably even in most cases. It's just not always best for performance; that was all.
Altern4tiveAcc 1 days ago [-]
Assuming ordering isn't a concern, can't you just have a field called "removed" and skip those when iterating?
Or swap it with the last monster, and keep an index for the last monster alive.
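The swap-with-last idea could look roughly like this in C++. The helper `swap_remove` and the two-field `MonstersSoA` are invented for the sketch, and the vector's own size plays the role of the "last monster alive" index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Overwrite slot i with the last element, then drop the last slot.
// O(1) per array, but element order is not preserved.
template <typename T>
void swap_remove(std::vector<T>& v, std::size_t i) {
    assert(i < v.size());
    if (i + 1 != v.size()) v[i] = std::move(v.back());  // skip self-move
    v.pop_back();
}

// Hypothetical two-field layout for illustration.
struct MonstersSoA {
    std::vector<float> x;
    std::vector<std::int32_t> health;

    std::size_t size() const { return health.size(); }

    // Every array must apply the same move so index i keeps describing
    // one monster across all fields.
    void remove(std::size_t i) {
        swap_remove(x, i);
        swap_remove(health, i);
    }
};
```

The catch is that the monster that used to be last now lives at index `i`, so anything holding indices from outside has to be told about the move.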
tsimionescu 22 hours ago [-]
Sure, but these schemes might have their own drawbacks depending on the exact use case - especially if you have a very dynamic number of monsters and constantly add and remove them (say, some kind of bullet hell style game).
marcosdumay 1 days ago [-]
Then you have to read the "removed" field on every field read on every operation.
SoA is only useful when you don't read multiple fields for most operations.
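For reference, a minimal C++ sketch of the "removed" flag scheme being discussed, with invented names. Note how the loop has to read the flag array alongside the field it actually wants.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: one data field plus a tombstone flag per slot.
struct MonstersSoA {
    std::vector<std::int32_t> health;
    std::vector<std::uint8_t> removed;  // 1 = slot is dead, skip it

    // Removal just marks the slot; no array is resized or shifted.
    void remove(std::size_t i) {
        assert(i < removed.size());
        removed[i] = 1;
    }
};

// Every pass over `health` now also streams through `removed`.
std::int64_t total_health(const MonstersSoA& m) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < m.health.size(); ++i) {
        if (!m.removed[i]) sum += m.health[i];
    }
    return sum;
}
```

The dead slots keep occupying memory until something compacts the arrays, which is the drawback mentioned earlier for very dynamic populations.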
ablob 1 days ago [-]
Two fields should be fine, actually.
The way caches are organized you are very unlikely to thrash with the lookups (due to n-way associativity) while only keeping relevant data in the cache at the same time.
You still have roughly the following layout (in the cache), where A is the field and V is valid:
| A1 A2 A3 A4 | ... | V1 V2 V3 V4 | ...
The former access pattern still yields a clean cache layout where no unnecessary data is loaded (which is the most costly operation here by far) as opposed to
| A1 V1 B1 C1 | ... | A2 V2 B2 C2 | ...
In the general case there will exist a number of fields for which SOA layout will be worse if all are accessed close to each other, but for just a validity indicator this should not be the case. I think your statement is not wrong, but also not 100% correct.
This is on par with linear search being faster than binary search for small n. As soon as caches and branch prediction chime in, many rules of thumb just change. Most important, however, is that a distinction between small and large n basically _needs_ to happen at that point.
jayd16 1 days ago [-]
Presumably they're referring to resizing the arrays.
gmueckl 1 days ago [-]
Array resizing is avoidable with an embedded free list if ordering is of no concern.
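One way to read "embedded free list", sketched in C++ under my own assumptions: a dead slot reuses one of its own fields (here `health`) to store the index of the next free slot, so no side allocation is needed. `MonsterPool` and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct MonsterPool {
    static constexpr std::int32_t kNone = -1;

    // For a live slot this is the monster's health. For a free slot it is
    // reused to hold the index of the next free slot (the embedded list).
    std::vector<std::int32_t> health;
    std::vector<std::uint8_t> alive;
    std::int32_t free_head = kNone;

    // Reuse a free slot if one exists, otherwise grow the arrays.
    std::size_t create(std::int32_t hp) {
        if (free_head != kNone) {
            const auto i = static_cast<std::size_t>(free_head);
            free_head = health[i];  // pop the head of the free list
            health[i] = hp;
            alive[i] = 1;
            return i;
        }
        health.push_back(hp);
        alive.push_back(1);
        return health.size() - 1;
    }

    // Mark the slot dead and push it onto the free list. Nothing shifts.
    void destroy(std::size_t i) {
        assert(i < health.size() && alive[i]);
        alive[i] = 0;
        health[i] = free_head;
        free_head = static_cast<std::int32_t>(i);
    }
};
```

Indices stay stable for live monsters, at the price of holes that iteration has to skip via `alive`.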
setr 11 hours ago [-]
If you take out ordering, then lookups on your SoA are now a search, and n-field lookup on an entity is now a JOIN operation.
The smarter you get about it, the closer you get to an OLAP db
Which leads to my theory… I feel like Bevy could be implemented on top of an in-memory DuckDB and get away with it
Altern4tiveAcc 7 hours ago [-]
Depending on your access patterns, maybe you could have a hash table mapping entity IDs to indexes in your SoA. Perhaps that's viable if looking up a single entity is not typical of your use case?
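A rough C++ sketch of that hash table idea, combined with swap-with-last removal. All names are made up, and it assumes entity IDs are unique.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical layout: dense SoA arrays plus an id -> slot index map.
struct MonstersSoA {
    std::vector<std::uint32_t> id;      // entity id stored per slot
    std::vector<std::int32_t> health;
    std::unordered_map<std::uint32_t, std::size_t> index_of;

    // Assumes `entity` is not already present.
    void add(std::uint32_t entity, std::int32_t hp) {
        index_of[entity] = id.size();
        id.push_back(entity);
        health.push_back(hp);
    }

    // Swap-with-last removal; the moved entity's map entry is patched.
    bool remove(std::uint32_t entity) {
        const auto it = index_of.find(entity);
        if (it == index_of.end()) return false;
        const std::size_t i = it->second;
        const std::size_t last = id.size() - 1;
        index_of.erase(it);
        if (i != last) {
            id[i] = id[last];
            health[i] = health[last];
            index_of[id[i]] = i;  // key already exists, so no insertion
        }
        id.pop_back();
        health.pop_back();
        return true;
    }

    // Single-entity lookup: one hash probe, then one read per field.
    const std::int32_t* find_health(std::uint32_t entity) const {
        const auto it = index_of.find(entity);
        return it == index_of.end() ? nullptr : &health[it->second];
    }
};
```

Bulk passes still iterate the dense arrays directly and never touch the map, so the extra cost lands only on single-entity lookups and on add/remove.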
> Which leads to my theory… I feel like Bevy could be implemented on top of an in-memory DuckDB and get away with it
Haha, it certainly does sound viable.
keynha 1 days ago [-]
[dead]
celrod 1 days ago [-]
Yes. I think one of the big advantages of SoA is that you only pay for the fields you're currently using.
If you need a field somewhere, you can add it and only pay the cost of iterating it where you need it.
bronlund 1 days ago [-]
Every Struct Matters
noelwelsh 1 days ago [-]
The JVM is currently pretty bad for memory allocation. Every object (i.e. not a primitive) has a header that IIRC is 12 bytes. But there is good news in JVM land: this will be reduced to 8 bytes in the next JVM release, and Project Valhalla will give the tools to do away with headers entirely in some cases. Project Valhalla also has tools to manage off-heap memory, which is important in many cases.
The JVM is an odd place where it requires too much heap to compete with the AOT compiled languages, but its startup time is too slow compared to interpreted languages. I think these enhancements are essential to keep the platform relevant.
pron 1 days ago [-]
> Every object (i.e. not a primitive) has a header that IIRC is 12 bytes. But there is good news in JVM land: this will be reduced to 8 bytes in the next JVM release
Since JDK 25 it's already 64 bits with the `-XX:+UseCompactObjectHeaders` flag [1], but in JDK 27 it will be the default [2].
> where it requires too much heap to compete with the AOT compiled languages
Not to compete but to beat, and not too much, but the right amount. Low level languages are optimised for control, not performance (that control translates to better performance in smaller programs, and to worse performance in larger programs), and their particular constraints prevent them from enjoying certain important optimisations, especially those offered by JIT compilation and moving collectors, which remove some overheads that AOT compilers and free-list allocators incur. Their memory management is forced (by their constraints) to optimise for footprint rather than speed.
There are common misunderstandings about memory management and about why moving collectors were created: to reduce the CPU overheads of malloc/free, especially in large programs, in exchange for what is effectively free RAM. This is why moving collectors are chosen by the languages that are unconstrained enough to use them and have the resources to implement them (Java, .NET, V8). With the exception of Zig (and even there it requires some effort), it's hard for low level languages to use the basic optimisation that's behind moving collectors. I gave a talk about how moving collectors optimise memory management at the last Java One, and it should be available on YouTube soonish [3].
> but its startup time is too slow compared to interpreted languages
That hasn't been the case for some time. You are right, though, that startup/warmup time is worse than in AOT compiled languages, and that is the tradeoff of optimising JITs: reduce the overheads associated with AOT compilation in large programs in exchange for warmup.
Both startup and warmup have already been improved thanks to Project Leyden's "AOT cache" [4], but it will never be as low as C.
In general, the tradeoff is between optimisations that help large programs vs optimisations that help small programs.
[3]: I can't reproduce the full talk (which goes into the maths of memory management) here, but what happened with moving collectors was that until very recently (open source low-latency moving collectors are newer than ChatGPT), they required pauses and so weren't suitable for programs requiring low latencies. As a result, many developers either forgot or never learnt just how incredibly efficient moving collectors are. But the key is that because accessing RAM by necessity requires CPU, using CPU effectively captures RAM even if it's not used by the program. Bringing the CPU and RAM usage into a good balance is more efficient than trying to minimise one or the other. This is also the reason why hardware (physical or virtual) is packaged within a very narrow band of RAM/core ratios.
> In general, the tradeoff is between optimisations that help large programs vs optimisations that help small programs.
Do you have concrete examples of large scale Java programs that are significantly more performant than comparable programs in native languages like C++? My understanding was that this dynamic hadn't fundamentally changed much since the 2010s, when Java was able to occasionally edge out a win in 1-2 benchmarks and would lose handily in others. My experience is that large scale Java programs remain a bit of a bear even after significant optimization effort (e.g. Bazel).
There are of course plenty of optimizations the JVM does that aren't possible AOT, but that doesn't imply an automatic win at large scales, as Rust demonstrates.
pron 1 days ago [-]
> Do you have concrete examples of large scale Java programs that are significantly more performant than comparable programs in native languages like C++?
Yes. I was working in a place that made large sensor-fusion applications, air-traffic control applications, and logistical planning, each in the 2-8MLOC range. Over time, we ported all of them from C++ to Java because C++'s performance overheads were too annoying to work around.
Of course, in principle it's always possible to match and perhaps even exceed Java's performance in a low-level language, but in practice it becomes ever more difficult as the program grows (and the cost remains with maintenance forever). The reason is that as programs grow, patterns become less regular (e.g. the variance in object lifetimes grows), the need for concurrency grows (and so the need for sharing objects among threads and for lock free data structures), and more general constructs are used (e.g. more dynamic dispatch). Improvements in modern allocators, as well as LTO and PGO have helped, but not enough to match the extent of optimisations you can do once you're free of the design constraints of low-level control and the focus on the worst case.
Java's thesis (not initially, but from very early on) was to rely on optimisations that can't be effectively employed by low-level languages because of their constraints, such as efficient memory management that benefits from being able to move most pointers in a program, and highly aggressive speculative optimisations (that are nondeterministic and can fail, resulting in deoptimisation). These optimisations tend to be global, and so they don't restrict program structure much, keeping maintenance costs lower, but they do help the average case at the cost of harming the worst case, which is a tradeoff that programs written in low-level languages don't want, and of course, it doesn't give the low-level control that's the entire point of low-level languages. Proving that thesis took a while, and longer in some aspects than others (moving collectors that don't pause were first released to a wide audience three years ago).
Of course, the differences aren't huge because the hot paths are typically small enough that they can be improved without adding too much cost (and hot paths require some manual optimisation in all languages), but gaining some performance as a side effect of significantly lowering costs is nice.
> There are of course plenty of optimizations the JVM does that aren't possible AOT, but that doesn't imply an automatic win at large scales, as Rust demonstrates.
I don't know what it is that Rust demonstrates given how few large scale projects have chosen it, but I've seen nothing to indicate that it doesn't suffer from the same performance issues as C++ compared to Java. In fact, someone I know who works at one of the world's largest tech companies told me that his team lead really wanted to do something in Rust, so they ported a small-to-medium service from Java to Rust. The result was such a huge performance drop that it wouldn't meet their minimum requirements. They were then forced to spend an additional 6 to 12 months carefully hand-optimising their Rust code until it matches Java's performance, but the result is such that all future maintenance will be more expensive. This is the exact same pattern I've seen with C++.
It's interesting that 20 years ago the people who said Java can't beat C++ on performance were experienced low-level programmers who had little or no experience with Java (and they were also right on several axes at the time). Today the people who say that are those with little experience with low-level languages (and are under the impression that low level languages are universally fast), but they will eventually learn about their fundamental performance issues just as we did decades ago.
I think that Rust in particular has made people without much experience in low-level programming (among which Rust has made much more inroads than among those with a lot of experience in low-level programming) believe a certain story, namely that the problem with low level languages was memory safety and that that was the reason so many large programs switched to Java despite the performance sacrifices they had to make. Now that Rust fixes that problem, they can have their cake and eat it too! In reality, memory safety was indeed one of the several significant problems with low level languages that Java sought to fix, but another was the performance issues low level languages suffer from as they get large (making good performance ever more costly). The tradeoff isn't performance (in large programs there might even be a performance gain) but low-level control, as that is what low-level languages are about. That was what they offered back then, and it's still what they offer now. Rust was first designed twenty years ago, back when things still looked a certain way (which is why, IMO, it repeated most of C++'s design mistakes), but these days I think that a better, more modern design of low-level languages is more focused on control, leaving large programs to high-level languages. Lack of memory safety has, without a doubt, been one of the things that made low-level languages less palatable to "ordinary" applications, but it was far from the only one.
Anyway, I'm sure the debate of which is faster, C++ (/Rust/Zig) or Java, will continue, and frankly, due to the nature of modern hardware, compiler, and runtime optimisations these days (when the question of the cost of some individual operation is all but meaningless and our ability to extrapolate from the performance of one program to another is close to nil), it largely comes down to empirical questions such as which program patterns are more or less common in the field and in which domains, as there are code and workload patterns that could give an advantage to either one.
WhitneyLand 1 days ago [-]
”they ported a small-to-medium service from Java to Rust. The result was such a huge performance drop that it wouldn't meet their minimum requirements”
That result would say less about performance of languages than it would about competency of developers with a language.
I just don’t buy that a task could be assigned to two teams with comparable expertise and domain knowledge in Rust and Java, and have the Rust result be at a “huge” performance deficit.
No, I don’t believe that was an apples to apples comparison.
pron 1 days ago [-]
It may well be the case that it's not an apples-to-apples comparison, but as someone with over two decades of experience in both Java and C++, I find it not only unsurprising but a case of both Java and Rust doing exactly what they're designed to do.
Rust is designed to be a low-level language, i.e. a language with maximal control with all of its pros and cons (albeit with memory safety, which C++ doesn't have), while Java is designed to address the performance issues low level languages have, particularly as they get larger, due to their control constraints. Without such constraints, it is easier to offer better performance for less effort especially as programs grow.
In that particular program I was told that the differences were due to needing more locks in the Rust version. As has always been the case, they managed to achieve parity with much more effort (that is expected to continue over the lifetime of the software), but again, this is the explicit tradeoff of the approaches.
Thirty years ago, and even twenty years ago (when Rust was first being designed) many still believed that more control is the only path to good performance, even if it comes with a lot of effort. Today it's clear that it's not the only path, and the debate is mostly around which program and workload patterns that happen to work better with one approach or the other are more common.
wiseowise 11 hours ago [-]
> That result would say less about performance of languages than it would about competency of developers with a language.
> B-b-but skill issue!
That's one of the dimensions of the language too. Not only raw performance matters.
AlotOfReading 1 days ago [-]
> I don't know what it is that Rust demonstrates given that few large scale projects have chosen it, but I've seen nothing to indicate that it doesn't suffer from the same performance issues as C++ compared to Java.
The point of bringing up Rust is that it also gives the compiler much more information to optimize on than C++, but actual performance is comparable or slightly worse in most benchmarks because the quality of C++ codegen is so high. Some of those Rust advantages are exactly the same things that have been touted as major advantages for Java over C++, like escape analysis and lifetimes.
> Of course, in principle it's always possible to match and perhaps even exceed Java's performance in a low-level language, but in practice it becomes ever more difficult as the program grows (and the cost remains with maintenance forever).
Sure, which is why I asked for real examples of whatever you consider a "large scale" program. I wasn't able to find anything via search before I replied, and the wiki page on Java performance [0] is repeating what I understood.
> Some of those Rust advantages are exactly the same things that have been touted as major advantages for Java over C++, like escape analysis and lifetimes.
These aren't the biggest advantages. I would say that the biggest ones are aggressive speculative optimisations that allow inlining of virtual calls (by default, up to a depth of 15 calls) and the ability to freely move pointers, which allows alternatives to free-list-based memory management. Low-level languages can't afford pervasive speculative optimisation (as they're focused on the worst case) and can't allow most of their pointers to be moved (because they often share them directly with the hardware and/or device drivers).
> and the wiki page on Java performance [0] is repeating what I understood.
That may be because the information on that page seems to be up to date to 2011-2. Java is now on version 26, BTW.
AlotOfReading 1 days ago [-]
LLVM does speculative devirtualization as well these days, though it's not as aggressive as Hotspot. High-performance native code tries to avoid deep dynamic hierarchies anyway, so it's mitigated by cultural practices.
GCs are definitely a strong point for Java, but most high-performance code can be rewritten to avoid pummeling memory management. This used to be common for Java in financial applications, not sure if it still is.
C++ has evolved its own compacting GCs like oilpan [0] for applications where high performance is inherently tied to allocation. Oilpan runs into pointer issues and isn't remotely comparable to G1GC or ZGC, but I think the speed of V8 speaks for itself. Rust allows you to drop in non free-list based allocators and GCs (e.g. Bumpalo), but they're relatively immature.
> That may be because the information on that page seems to be up to date to 2011-2. Java is now on version 26, BTW.
The last time I dove into JVM internals was around the same time. I figured that someone who's worked with it more recently might have better examples than what's easily searchable.
> LLVM does speculative devirtualization as well these days, though it's not as aggressive as Hotspot. High-performance native code tries to avoid deep dynamic hierarchies anyway, so it's mitigated by cultural practices.
Sure, AOT compilation also didn't stand still, and overall I'd say that Java and low level languages are closer today than they were 20 or even 10 years ago on all fronts: both have improved in areas where they were behind.
> This used to be common for Java in financial applications, not sure if it still is.
Given that low-latency collectors are only 3 years old, I'm sure some existing Java applications still do it, but new ones no longer need to (and it may turn out to be counterproductive with the new collectors)
> Rust allows you to drop in non free-list based allocators and GCs (e.g. Bumpalo), but they're relatively immature.
The problem isn't the immaturity but the integration with the standard library that requires significant code changes (e.g. you need to use different string and collection implementations). However, even where there is good integration - as in the case of Zig - arenas impose limitations (due to the care that needs to be given to lifetime) that make the program less flexible. But yes, when all the stars are aligned, arenas can beat moving collectors (that's about the only thing that can), but moving collectors aren't standing still and resting on their laurels, either.
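C++ has a rough analogue of this integration cost, which may help illustrate the point. This is not the Rust/Bumpalo API, only a sketch using the standard `std::pmr` facilities from C++17, and `total_name_length` is an invented function. Note that the arena-backed containers are different types (`std::pmr::vector`, `std::pmr::string`) from the usual `std::vector` and `std::string`.

```cpp
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

// Builds `count` short-lived strings inside an arena and returns their
// total length. Nothing is freed individually: the whole arena is released
// at once when `arena` goes out of scope.
std::size_t total_name_length(std::size_t count) {
    std::byte buffer[4096];  // initial stack-backed block for the arena
    // Bump allocator: falls back to the default upstream resource if the
    // buffer runs out; deallocate() on it is a no-op.
    std::pmr::monotonic_buffer_resource arena(buffer, sizeof buffer);

    // The vector hands the same arena to each string it constructs.
    std::pmr::vector<std::pmr::string> names(&arena);
    names.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        // 40 characters, long enough to defeat small-string optimisation
        // on common implementations, so the text lands in the arena.
        names.emplace_back(std::size_t{40}, 'x');
    }

    std::size_t total = 0;
    for (const std::pmr::string& n : names) total += n.size();
    return total;
}
```

The lifetime constraint is visible here too: nothing allocated from `arena` may outlive the function, which is the kind of care the comment says arenas demand.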
> I figured that someone who's worked with it more recently might have better examples than what's easily searchable.
I don't know about a single unified resource, but you can find everything here: https://openjdk.org/jeps/0
JIT improvements are usually too low-level to merit a JEP, but all the major GC changes are there. For a taste of what's going on in the JIT these days, see this recent talk: https://youtu.be/J4O5h3xpIY8
gf000 1 days ago [-]
Slightly off topic -- java-related wiki pages are notoriously bad and possibly biased for some reason. They are laughably outdated and have a bunch of non-objective sentences that paint a much worse picture of the language than deserved.
I have even tried removing/rewriting some of the questionable sentences but my edits weren't accepted.
jandrewrogers 1 days ago [-]
I’ve done performance-engineering for decades in Java, C++, and C for both data analytics and supercomputing/HPC. Java performs significantly worse than C++ in all cases without exception. This is the result you should expect from first principles; something has gone horribly wrong with your software optimization if Java is faster than C++ or even Rust.
There are good reasons to use Java in environments that care about performance. Absolute performance can be traded for other concerns while still being good. It is why I did so much performance-engineering work in the language.
Most performance is architectural in nature. Extremely granular control of scheduling is a prerequisite. System languages provide that control if you want it, Java does not.
When you design software in Java, you accept that some software architectures are not available to you. If you care about performance, you would not port a software architecture optimized around the limitations of Java to a systems language.
pron 1 days ago [-]
> I’ve done performance-engineering for decades in Java, C++, and C for both data analytics and supercomputing/HPC. Java performs significantly worse than C++ in all cases without exception.
I've done similar work (not supercomputing/HPC, but yes for soft and hard realtime software, including safety-critical software) and I couldn't disagree more. Of course, we didn't get to write every program in both Java and C++, but the main question was how much effort it took to achieve the required performance. Over multiple projects it was clear that hitting the performance targets was, on the whole, significantly easier in Java.
> This is the result you should expect from first principles; something has gone horribly wrong with your software optimization if Java is faster than C++ or even Rust.
Strong disagreement here, but we need to be specific about what we mean when we say performance.
It is undoubtedly true that for every Java program there exists a C++ program with the same performance, and the proof is simple: every Java program is a C++ program with the classes being input. But that C++ program is close to 2MLOC long. The same could also be said about a C++ program vs. an Assembly program, as every C++ program could be written as an Assembly program.
But when I talk about performance, I refer to what I think most programmers care about when it comes to performance. Not how fast can a program hypothetically be given enough effort and expertise, but how fast can my program be in my budget.
Both speculative compiler optimisations and memory management optimisations are simply not an option for low level languages due to their constraints, and they are very powerful global optimisations. Given a lot of expertise and effort (that must continue throughout the software's lifetime, and often increases as it evolves) you can work around these limitations, but Java was designed so that you can benefit from them, which means more performance per unit of effort.
In large programs more general constructs (e.g. dynamic dispatch) and patterns (concurrency, great variance in object lifetime) grow in prevalence, and low level languages require more effort and discipline to work around their shortcomings in these areas. Optimising JITs that allow aggressive speculative optimisations and moving collectors were invented and adopted to address these shortcomings. You could claim that the advanced mechanisms that were developed to address C++'s performance issues have failed to achieve their goal, although it won't be easy and much of it comes down to empirical questions of which patterns arise more or less frequently in software, but given that this is what these mechanisms were at least intended to achieve, you certainly can't claim that they fail to do so "from first principles". Some compilation optimisations need speculation; some memory management optimisations need moving pointers. Not having these optimisations available in a program you can write without a lot of special effort cannot make it faster "from first principles".
So no, I don't believe at all that something has to go wrong for a Java program to be faster than a C++ program given a certain budget for the program. Indeed, in larger, more complex programs, I believe the very opposite is true. In most situations, if you get the same performance in C++ as you do in Java, then something has gone terribly wrong with your Java program.
As someone who's worked on a pretty famous JVM feature (virtual threads), I can tell you that we and the designers of low-level languages consciously make different performance tradeoffs because we optimise for different programs and people, and have different preferences when it comes to average case vs. worst case, but there is no universal dominance in performance to either one of these approaches over the other.
One obvious example was our decision to remove Unsafe from Java. Some Java developers voiced opposition, citing a program speed competition (the "one-billion-row challenge" [1]) where Unsafe improved the performance of an entry (which was later cloned and tweaked by others) by 25%. But we saw it as further motivation for the decision. Among over a dozen performance experts who submitted entries, only one was able to write a program efficient enough for Unsafe to make a big difference, and the variance in the results even among the top 20 or so entries was larger than Unsafe's improvement. By removing Unsafe, we would harm that one expert's program, but it would allow us to perform more aggressive constant-folding optimisations that would result in much greater performance improvements over the entire ecosystem. Even from a design philosophy perspective alone, this removal of control to the detriment of some programs "for the greater good" of performance over the entire ecosystem is almost unthinkable in low level languages, because control is what they're for. Did that decision make Java a faster or a slower language? That depends on how you look at performance.
If what you are saying is correct, the performance of Java has to be the best-kept secret in the industry. Because you are the only person I've ever heard making such claims seriously.
But this looks more like an apples-to-oranges comparison. You might be talking more about performance in complex business logic, while others are talking about performance in computation.
I can imagine that Java could be faster than C++ or Rust (for the same effort) when the number of distinct active tasks is large. But in more traditional performance-critical work, such as HPC or video game engines, there are usually only a limited number of distinct combinations of performance-critical tasks that can be active at the same time. Even if the codebase itself is huge, the performance-critical subset is simple, and the performance advantages from increased control over the execution are cheap.
pron 17 hours ago [-]
> the performance of Java has to be the best-kept secret in the industry
Is it, though? It's the first language of choice for a large number, if not most, of performance-critical applications.
> Because you are the only person I've ever heard making such claims seriously.
Your sources must be very limited, then, because in serious compiler and runtime design and memory management circles this is quite common. There is a debate, but it is an empirical one over whether the circumstances that favour Java over C++ are more or less common in practice or vice-versa. And again, given that it's the first language of choice in most performance-critical applications (and even if you don't believe it's number one, surely you agree it's in the top two or three) one or two more people probably think its performance is at least competitive with C++.
> But in more traditional performance-critical work, such as HPC or video game engines, there are usually only a limited number of distinct combinations of performance-critical tasks that can be active at the same time
I wouldn't say HPC and video game engines are "traditional performance critical work". Not because they're not performance critical, but because the range of performance critical programs is far larger - think bank card transaction processing; think mobile phone routing, and there are many more examples (also, AAA video game engines are indeed very traditional in their design and tech choices, but their performance-sensitivity these days is not so much around CPU-related optimisations but about scheduling the GPU, and their tech choices are much more constrained by the consoles they need to support than by performance).
jltsiren 16 hours ago [-]
It sounds like we are not even talking about the same thing when we talk about performance.
HPC and video game engines are examples of traditional performance-critical work. Performance-critical, because they typically run in a resource-constrained environment. (If they don't, the user is likely to request the system to do more work.) And traditional, because it's more about algorithmic performance than system performance. The kind of performance people cared about long before computers became capable enough to run complex software systems.
I would not consider card transaction processing performance-critical. The total number of transactions is very low relative to the amount of resources available to process them.
As for Java, it stopped being a general-purpose language a long time ago. Most people who care about the performance of the software they write don't consider it, because almost nobody in their field uses it or talks about it. If it's actually a good choice for performance-sensitive applications in those fields, the people who are using it have done a good job keeping it secret.
pron 8 hours ago [-]
You're right, because I certainly don't consider resource-constrained programs to be the only performance-sensitive applications. I consider an application performance-sensitive when it has severe performance requirements (either on throughput or latency or both) that aren't easily or sufficiently met with horizontal scaling. This typically involves situations where high volumes of data must flow and be processed on the same machine (my own journey with Java began when we ported a large C++ application that did distributed, soft-realtime sensor fusion, synchronised with atomic clocks, to Java, and it was very much performance-sensitive).
If you are running in a resource-constrained environment, you might have no choice but to have complete control over hardware resources, in which case you may need to use a low-level language, but your optimisation budget is very high. A different and more common case is where the hardware isn't too resource-constrained, but the performance requirements aren't easily met, either. In these situations, the performance challenge isn't necessarily to optimise at all costs, but to find a way to meet the performance requirement while staying within budget. In these areas, Java has already displaced C++, and continues to be the first language of choice.
Of course, the people who write such applications (in any language) don't often talk about their architecture, but here's one example when they do: https://www.infoq.com/presentations/java-robot-swarms/ In this case, as in many others, the performance requirements are strict (and aren't easily met with horizontal scaling), but the constraint under which they must be met isn't the hardware but the budget and speed of development/evolution.
More often, the performance challenge is how to get the best performance per unit of effort (while meeting the performance requirements, of course) rather than how to get the last 1-5% of performance at any cost. Or sometimes I put this question as not "how fast can a program be?" but "how fast can I practically make my program?"
The optimisations Java offers are precisely intended to maximise the latter, because that's exactly where low-level languages suffer performance shortcomings. They could get that performance or perhaps better with a lot more effort (that needs to be continuously spent throughout the software's lifetime), but many performance-sensitive applications don't have or would rather not spend the time, money, or expertise to do that, and are looking for the best performance per unit of effort.
imtringued 10 hours ago [-]
>I wouldn't say HPC and video game engines are "traditional performance critical work". Not because they're not performance critical, but because the range of performance critical programs is far larger - think bank card transaction processing; think mobile phone routing, and there are many more examples (also, AAA video game engines are indeed very traditional in their design and tech choices, but their performance-sensitivity these days is not so much around CPU-related optimisations but about scheduling the GPU, and their tech choices are much more constrained by the consoles they need to support than by performance).
In "business oriented" contexts, the usual culprits are database access and serialization/communication overheads. If you use Rust with serdes, you get access to one of the fastest ways to turn JSON documents into struct accessible data on the entire planet. The same implementation effort could be spent on any industry specific data formats.
I am struggling to think of any scenarios where Rust is supposed to be uniquely unsuited and Java would have an obvious win to make the broad and sweeping statements you've made.
If everything you said is true, people would be building JVM backends for C++/Rust the same way LLVM has been used as a backend and there would be constant discussions about JVM vs clang vs gcc. It just doesn't add up.
pron 9 hours ago [-]
> If you use Rust with serdes, you get access to one of the fastest ways to turn JSON documents into struct accessible data on the entire planet.
Yeah, because most people who choose Rust are those coming from JS, Python, or Ruby, and almost no one has written large systems in Rust yet, I see why you'd think that, because that's indeed the main challenge in the kind of programs normally written in JS, Python, or Ruby. In automation control, the bottleneck isn't the DB; in distributed sensor fusion the bottleneck isn't the DB; in telecom routing the bottleneck isn't the DB (I actually don't know what the bottleneck is in transaction processing, but I'm pretty sure it's not just the DB). These are just some areas where Java is the top choice.
> I am struggling to think of any scenarios where Rust is supposed to be uniquely unsuited and Java would have an obvious win to make the broad and sweeping statements you've made.
In all the same places where Java displaced C++ and continues to do so: large systems. I think few even consider Rust, TBH.
> If everything you said is true, people would be building JVM backends for C++/Rust the same way LLVM has been used as a backend and there would be constant discussions about JVM vs clang vs gcc. It just doesn't add up.
First, Java is far more popular than C++ (let alone Rust), so there would be little point (although there is an LLVM backend for the JVM, though I doubt many people use it). The people who want Java's benefits over C++'s benefits have been using Java for a long time now.
Second, you can't have a JVM backend for C++ and Rust and fully enjoy the performance benefits of Java, because the JVM's optimisations are enabled by the language not having the constraints that low-level languages have. The people who just need the performance choose Java anyway, and the people who choose low-level languages choose them because they need the control the JVM doesn't offer.
imtringued 10 hours ago [-]
I'm not sure I understand what exactly you're talking about. I personally moved away from Java to Rust, because of the obvious and immediate performance benefits and this is possible because Rust manages to stay safe despite the lack of a garbage collector.
SleepyMyroslav 6 hours ago [-]
I am not the GP poster. I find pron's points interesting even though I work in gamedev on game engines. If you don't mind, I will try to explain why I find them interesting. Since I have not worked on Rust systems I will stick to C++.
Note his example elsewhere in this discussion of 2 projects done at the same time in Java and Rust, and the complaint that the Rust system used too many locks. This can happen in C++ too. But why does it not happen in (my) practice? Because C++ evolved to not use locks in large-scale parallel systems. This has been said in mainstage conference keynotes since at least 2013 [1]. So there is "normal C++" and "C++ that works at large scale", and they are not the same language. The performance scales between them are many orders of magnitude. Imho it does not mean that Java anywhere near the best of what C++ can do. So here we are talking past each other. pron is correct that Java is not bad and you are correct that you have no reason to leave Rust.
> The performance scales between them are many orders of magnitude. Imho it does not mean that Java anywhere near the best of what C++ can do.
I don't think you're aware of where Java is today. Here's a recent talk about some of the issues we're working on now: https://youtu.be/J4O5h3xpIY8
I said that in the past the people who believed Java can't match or exceed C++'s performance were typically those with a lot of low-level programming experience and little or no experience with Java, while today it's mostly people with little experience with low-level programming, but I think you may be in the first group. To people in that group, the question I pose is: what is exactly that you'd think makes Java harder to compile in an optimised way than C++? That's not hard to answer for JS or Python, but you'll find that it is hard to answer for Java. (I don't have a question to ask the people in the second group because they are typically people who don't know much about software performance to begin with, don't have any informed intuition about it, and just say nonsensical things like "runtime overhead").
On the whole, the range of optimisations available to our compiler is larger than to a C++ compiler, and we have a wider selection of memory management optimisations, too (this matters mostly in large programs with a wide variety of object lifetimes).
So if you were to ask me why I would speculate that C++ can't be as well-optimised as Java, I could tell you that it's because it can't inline as aggressively and it can't move pointers (due to its constraints and intended domains).
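To make the inlining point concrete, here is a minimal sketch of the kind of call site where a speculating JIT has room to work. The `SpeculationDemo` class and its shapes are made up for illustration. The assumption, stated loosely, is that a profiling JIT such as HotSpot's may observe that only `Square` ever reaches the `s.area()` call, inline it behind a type check, and fall back (deoptimise) if a `Circle` turns up later. None of that is visible in the program's output; the code only shows the shape of the source.

```java
final class SpeculationDemo {
    interface Shape { double area(); }

    static final class Square implements Shape {
        private final double side;
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    static final class Circle implements Shape {
        private final double r;
        Circle(double r) { this.r = r; }
        public double area() { return Math.PI * r * r; }
    }

    // s.area() is a virtual call in the source. If profiling only ever sees
    // Square here, a JIT is free to inline Square.area() speculatively.
    static double totalArea(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            sum += s.area();
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] squares = { new Square(2), new Square(3) };
        System.out.println(totalArea(squares)); // 4.0 + 9.0

        // A second receiver type appearing later must still give correct
        // results, whatever the JIT assumed before.
        Shape[] mixed = { new Square(2), new Circle(1) };
        System.out.println(totalArea(mixed));
    }
}
```

An ahead-of-time compiler looking at `totalArea` alone generally has to assume any `Shape` may arrive, unless it can prove otherwise from the whole program.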
I think an answer for why Java wouldn't be as optimised as C++ could refer to things like "Java has an interpreter" (true, but that design was chosen to support more aggressive speculative optimisations in the compiler), or "Java has moving-tracing GCs" (true, and that was chosen because they offer an optimisation of memory management in a wide variety of situations). The JVM was designed to address specific performance shortcomings of low-level languages; true, they don't result in a win in all situations, and in some they even lose, but these mechanisms were chosen because they do win in many situations.
In general, when we (the JVM's developers) see something that C++ can do faster, we treat it as a performance bug and solve it. What John (the chief JVM architect) is talking about is related to the last area where Java suffers (arrays-of-structs) to which we'll start delivering the solution very soon.
There are some intentional performance-related tradeoffs that both our team and the C++/gcc/LLVM teams make, but they are about offering better or worse performance under different circumstances, and definitely not universally.
As an example I was personally involved with, the C++ team and us intentionally chose different approaches to coroutines that give better performance in some situations and worse in others, and we both opted to prioritise different situations (i.e. situations where cache misses are more or less likely).
In general, C++ offers better performance than Java in some programs, and the opposite is true in other programs. On average, their performance has come closer over the years, each improving the areas where they were weaker.
As to "the best of what C++ can do", it's hard to define, because, as I said, every Java program can be seen as a C++ program, so technically C++ can always match the performance of a Java program given enough effort and expertise. But when talking about performance, what's practically possible matters much more than what's hypothetically possible, and in those programs where Java wins, achieving the same performance in C++ is just far more costly.
But also, given that both languages can and do come close to the maximal hypothetical hardware performance, they're rarely too far apart (unless we're considering warmup time), and they're both very much "anywhere near" each other almost all the time.
SleepyMyroslav 4 hours ago [-]
As for my experience: yes, I have no Java experience and a long list of C++ projects.
> what is exactly that you'd think makes Java harder to compile in an optimised way than C++?
In games C++ is doing some simulations and data delivery for GPU. Code that does work on GPU is not mixed with rest of C++ code. So invoking Cuda (or the likes) in the middle of computation is a cheat code that Java does not have. Simulations on the CPU need to be efficiently parallel ( think 12 hardware threads for last gen or 4-6 threads for smaller platforms) and most likely specialized for hardware SIMD ( think AVX2 for last gen or SSE2 like for smaller platforms). To wrangle multi GB data efficiently a lot of compression/decompression and data structures are needed. Does Java still has overhead per class instance? It might force designs with arrays of primitive data types that are more verbose.
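For readers who haven't seen it, this is roughly what the "arrays of primitive data types" design looks like in Java today. It is only a sketch: `MonsterStore` and its fields are hypothetical, borrowing the article's monsters example. Each field lives in its own primitive array, so there is no per-object header, at the cost of the verbosity mentioned above.

```java
final class MonsterStore {
    // One primitive array per field (struct-of-arrays); index i is "monster i".
    private final float[] x;
    private final float[] y;
    private final int[] health;
    private final boolean[] alive;

    MonsterStore(int capacity) {
        x = new float[capacity];
        y = new float[capacity];
        health = new int[capacity];
        alive = new boolean[capacity];
    }

    void set(int i, float px, float py, int hp) {
        x[i] = px;
        y[i] = py;
        health[i] = hp;
        alive[i] = hp > 0;
    }

    // Scans only the `alive` array; positions and health are never touched.
    int countAlive() {
        int n = 0;
        for (boolean a : alive) {
            if (a) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        MonsterStore store = new MonsterStore(4);
        store.set(0, 1f, 2f, 10);
        store.set(1, 3f, 4f, 0);
        store.set(2, 5f, 6f, 7);
        System.out.println(store.countAlive()); // prints 2 (slots 1 and 3 are not alive)
    }
}
```

Every accessor has to be written against an index rather than an object reference, which is where the extra verbosity comes from.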
Add to that per-platform I/O and everything else. It means that games force people to unlearn everything the language ever taught about standard I/O. Even more so about being cross-platform. In C++ it means something completely different. In C++ you can't trust the language implementation vendor with anything. From your comment I assume that Java teams rely on the language implementation in lots of ways. In C++ being efficient means do it yourself. How efficient is our memory allocation? The answer can only be per engine/project. There is no 'average' because 'vendor provided' is bottom-of-the-barrel quality. No one is improving the vendor-provided implementation exactly because no one is expected to use it.
In short, there are many different C++s that are hard to compare. I can't see them being compared to each other, much less to other programming languages like Java. This might not be the answer you wanted, but that's all I have.
pron 3 hours ago [-]
> So invoking Cuda (or the likes) in the middle of computation is a cheat code that Java does not have.
> Does Java still has overhead per class instance? It might force designs with arrays of primitive data types that are more verbose.
That is the last area where Java is still behind but the work on arrays-of-structs (with no headers) is nearly complete. A first release of that is imminent.
> In C++ being efficient means do it yourself
Right, and that's precisely what I meant about low-level languages being optimised for control and not performance. You could do things at such a low level in Java, but the main problem is not the performance but that it's just less convenient than in C++.
Anyway, aside from some outdated (or soon-to-be-outdated) things, what you pointed out is mostly about lack of convenient direct low-level control rather than general performance, and that is exactly when low-level languages can be a better fit.
tealpod 1 days ago [-]
We compiled one of our Java apps to a native binary using GraalVM (for encryption and secret management needs). A side effect is that the Java native binary's performance is excellent, and app startup time is also significantly lower compared to the JVM version.
I am not sure how it compares with C++, Rust and Zig, but we made a benchmark with a similar Go binary, and the Java native version's performance (load tests) is similar to the Go binary's. Only the RAM usage of the Java native binary is 3 times that of the Go binary (and the JVM app took almost 10 times more RAM than the Go version).
pron 1 days ago [-]
The RAM difference is primarily because both Native Image (what you call Graal VM) and Go use much simpler and less efficient memory management techniques. HotSpot uses much more RAM by design as there are inefficiencies caused by using too little of it. Memory management - and especially very sophisticated approaches that are only used by the best resourced teams - is an especially misunderstood aspect.
I gave a talk on the subject that I hope will be published soon, and while I can't reproduce it here, let me give an example that offers some basic intuition. Imagine needing to do some computation in two ways on a machine with 1GB of free RAM. You could run for 10s, taking up 100% CPU and consuming 80MB of RAM, or for 9s, taking up 100% CPU and consuming 800MB of RAM. The second is more efficient, despite taking up 10x more RAM and saving "only" 10% of CPU, regardless of the relative cost of RAM and CPU. This is because taking up 100% of the CPU effectively captures 100% of RAM (as no other program can use it), so both programs capture the entire 1GB only the second one captures it for a second less. This scales to non extreme situations because accessing RAM requires CPU, so using CPU means capturing RAM whether you use it or not. So HotSpot uses it if it can use it to balance the CPU utilisation.
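The arithmetic in that example can be written out as a small sketch. It only encodes the example's own assumption (a job pinning the CPU at 100% leaves the rest of the machine's RAM unusable for that time); the `RamCapture` class and its method names are invented here, and 1GB is taken as 1024MB.

```java
final class RamCapture {
    // MB-seconds of RAM the job itself occupies (footprint x duration).
    static long footprintMbSeconds(long footprintMb, long seconds) {
        return footprintMb * seconds;
    }

    // MB-seconds of RAM unavailable to other work, under the example's
    // assumption that 100% CPU use captures the whole machine's RAM.
    static long capturedMbSeconds(long machineRamMb, long seconds) {
        return machineRamMb * seconds;
    }

    public static void main(String[] args) {
        long machineRamMb = 1024; // the machine with "1GB of free RAM"

        // Counting footprint only, the 800MB run looks 9x worse.
        System.out.println(footprintMbSeconds(80, 10));  // 800
        System.out.println(footprintMbSeconds(800, 9));  // 7200

        // Counting what is captured, the 800MB run comes out ahead.
        System.out.println(capturedMbSeconds(machineRamMb, 10)); // 10240
        System.out.println(capturedMbSeconds(machineRamMb, 9));  // 9216
    }
}
```

Which of the two accounting methods is the right one for a given machine is exactly what the replies below argue about.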
In some situations it may not matter, and I assume that if Native Image and Go work just as well for you, then the workload isn't very high, but under high workloads, this can matter a lot.
setr 11 hours ago [-]
> This is because taking up 100% of the CPU effectively captures 100% of RAM
Isn’t that only true though specifically at 100% CPU utilization?
If it were at 90% CPU, then you have no RAM capture, and then you can’t say anything about whether 80 or 800MB should be taken; it’s only a freebie if and only if literally no other program can do work on the machine.
I don’t see how you can map X% CPU utilization to Y% RAM capture.
Like a program could be network heavy, CPU light and mmaps a large file? Or streaming a file from disk with a constant memory allocation, but doing heavy nonstop CPU work.
The CPU / RAM capture ratio would be wildly different; the ideal for your program, while other competing programs of unknown behaviors exist, I don’t see any way for hotspot to approximate
pron 8 hours ago [-]
> Isn’t that only true though specifically at 100% CPU utilization?
No. Because any RAM access requires CPU, using up any CPU effectively captures some ability to use RAM.
> I don’t see how you can map X% CPU utilization to Y% RAM capture.
You're right that there isn't a fixed formula, but the most efficient balance can have a narrow range, because CPU and RAM are typically sold as a package with a rather narrow RAM/core ratio (usually between 0.5 and 4GB, where the lower end is usually when you have slow cores). This is also because of the intrinsic relationship of RAM and CPU.
> Like a program could be network heavy, CPU light and mmaps a large file? Or streaming a file from disk with a constant memory allocation, but doing heavy nonstop CPU work.
A program that is very CPU light can't make use of a lot of physical RAM at any one time (again, because using RAM requires CPU). One exception is caching, but memory access patterns for caching are easily detectable, and you can (and Java does) offer a different balance for them. I covered that in my talk, which will be eventually published on YouTube.
setr 5 hours ago [-]
> I covered that in my talk, which will be eventually published on YouTube.
Any idea how I get myself notified once it’s up? Or a YT account to poll
imtringued 9 hours ago [-]
>HotSpot uses much more RAM by design as there are inefficiencies caused by using too little of it.
Ah yes, the swapping induced by IntelliJ overflowing my system RAM is supposed to reduce the inefficiencies of using too little memory. Great...
Thanks pron, you've fully bought into all the JVM kool-aid talking points without ever trying to question them. One of the reasons I upgraded to 32 GB RAM in 2019 was to run a Minecraft modpack. Minecraft is one of the most memory intensive games I've ever played.
When you consider that the smallest cloud instances that cost $4 per month only give you like 512 MB of RAM and have refused to upgrade for at least a decade, the idea of using more than 512 MB to be "more efficient" is ridiculous. It raises your minimum costs to $10 per month.
>I gave a talk on the subject that I hope will be published soon, and while I can't reproduce it here, let me give an example that offers some basic intuition.
>Imagine needing to do some computation in two ways on a machine with 1GB of free RAM. You could run for 10s, taking up 100% CPU and consuming 80MB of RAM, or for 9s, taking up 100% CPU and consuming 800MB of RAM.
This is the "wasted RAM is unused RAM" mentality and it doesn't work, because you usually have multiple competing programs and when you run out of RAM, your system will start swapping. This will then require you to buy more RAM, leading to more leftover RAM, which is then wasted and gets consumed by the applications again. It's nonsense.
Then there is the fact that the vast majority, basically 99.9% of algorithms are not scalable in the naive way presented. Nobody will waste resources on writing the same algorithm twice for these two cases. Databases are usually designed to either be primarily file system backed or in-memory backed. They will use the extra memory to hold indices and let the OS do the caching or they will reserve all the memory up front, intentionally leaving nothing for other applications.
>The second is more efficient, despite taking up 10x more RAM and saving "only" 10% of CPU, regardless of the relative cost of RAM and CPU. This is because taking up 100% of the CPU effectively captures 100% of RAM (as no other program can use it), so both programs capture the entire 1GB only the second one captures it for a second less.
Ok, now you're just writing nonsense. Nowadays people have CPUs with multiple cores and use an OS with a scheduler. If you have two programs taking up 100% of the CPU, the OS will give each process some of the hardware resources. You can't just assume some 100% CPU blockage here just because it is convenient for your argument. It's especially dishonest since even a 99% CPU blockage basically makes your argument fall apart completely.
If you have two programs decide to 10x the memory consumption to save one second, you'll most likely run into swapping issues, which will actually lock up your system for several seconds at a time and if you're unlucky, the OOM killer strikes or the compositor freezes up and you have to reboot. You're saying that a 1 second savings is worth an endless amount of inconveniences.
>This scales to non extreme situations because accessing RAM requires CPU, so using CPU means capturing RAM whether you use it or not. So HotSpot uses it if it can use it to balance the CPU utilisation.
Again, this is completely incorrect in so many ways that you're bragging you know nothing about how modern computers work.
CPU cores have their own local memory resources called caches. Depending on how your code is written, you may tile your data so it fits entirely in cache and operate within the local memory.
When performing inter thread communication, there are often situations where the data often doesn't even get written and then loaded to main memory, since atomic operations can make use of the MESI cache coherency protocol to pull the data directly from another cores' cache.
Nowadays DMA is the standard way to perform large data transfers to hardware peripherals. If you load a file from an HDD, the SATA peripheral will communicate via DMA to copy whole sectors or file system blocks. The same applies to sending data to an SSD, network interface, GPU or basically anything else that performs bulk transfers (1 KiB+). The DMA engine is a separate component independent of the CPU and it may write data directly into cache as well.
Then there is the fact that RAM is a form of storage and storage is usually characterized by the fact that it takes up an area and said areas can be subdivided. When RAM is used, the portion of used RAM is considered blocked for the duration of how long it is stored, independently of whether it is accessed or not. This means that the most important objective is having sufficient amounts of RAM to store all data, not to occupy all of it preemptively even when it is not really needed.
The same can't be said of CPUs. Occupying the CPU usually means actively using the CPU. The only exception to this is things like spinlocks, which should be avoided like the plague. What occupies the CPU is determined by the OS, therefore your logic is backwards. It's not the program blocking the CPU and therefore blocking the memory. The OS decided to stop running your process to run another process. Progress is slowed down, but it is not blocked.
Actual blockage only occurs when two processes compete for a fixed resource so that it is not possible to run both processes simultaneously, so that one process has to be closed to run another process.
pron 7 hours ago [-]
> Ah yes, the swapping induced by IntelliJ overflowing my system RAM is supposed to reduce the inefficiencies of using too little memory. Great...
That's like me saying, oh great, so the swapping introduced by MS Word or Outlook shows just how efficient C++ is...
> Thanks pron, you've fully bought into all the JVM kool-aid talking points without ever trying to question them.
Oh I didn't just "buy" them. As a low-level programmer who's suffered for a long time from intrinsic inefficiencies in C++, I became a compiler and runtime engineer working on the JVM to solve the problems I had in C++.
> This is the "wasted RAM is unused RAM" mentality and it doesn't work, because you usually have multiple competing programs and when you run out of RAM, your system will start swapping
No, it's actually more involved and interesting than that, but you'll have to wait for my talk.
> Ok, now you're just writing nonsense. Nowadays people have CPUs with multiple cores and use an OS with a scheduler. If you have two programs taking up 100% of the CPU, the OS will give each process some of the hardware resources. You can't just assume some 100% CPU blockage here just because it is convenient for your argument
I didn't. I specifically said it was just an example to demonstrate the inter-relatedness of RAM and CPU since accessing RAM requires CPU. To understand why every single language that isn't limited by other constraints, and has the engineering resources to do so, uses the same basic memory management algorithm as Java, I guess you'll have to watch my talk when it's published.
> Again, this is completely incorrect in so many ways that you're bragging you know nothing about how modern computers work.
Wow. I guess it doesn't take much to be an engineer working on safety critical realtime applications and then on one of the world's most advanced optimising compilers and you can get pretty far without knowing how computers work.
> CPU cores have their own local memory resources called caches. Depending on how your code is written, you may tile your data so it fits entirely in cache and operate within the local memory.
The data you need to access at any one time and the overall memory consumption of your program are two very different things. Maybe you don't know this, but CPU caches don't work by caching a large contiguous portion of the address space.
> When performing inter thread communication, there are often situations where the data often doesn't even get written and then loaded to main memory, since atomic operations can make use of the MESI cache coherency protocol to pull the data directly from another cores' cache.
I find it hilarious that you're trying to teach me about MESI, given that designing algorithms and data structures that are efficient on top of MESI was one of my jobs [1], and I advised Intel on architecture, but okay, maybe I know nothing about computers, as you concluded from a paragraph where I tried to give people who may not be compiler or memory management experts some intuition about modern memory management design.
FYI, modern malloc/free allocators are also intentionally less footprint-optimised than older ones to get better performance (although they can't offer all the optimisations of moving collectors because they're not allowed to move pointers), but maybe none of the people writing the compilers or memory management mechanisms you use know computers as much as you do, and you know all there is to know.
This looks to be the end of the conversation now. Just wanted to drop in and thank you for your time commenting, pron.
The common discourse is that "XYZ language is close to the metal and therefore Blazing Fast (tm)"; people become tribalistic and forget that there are engineering considerations and trade-offs all the way down. I appreciate you making the argument for the JVM delivering performant code when a budget matters.
Oops, thank you, but I actually meant to link to this one about how Netflix uses it: https://youtu.be/4kEh8hxAP4U. But your link is good, too.
layer8 1 days ago [-]
What do you mean by “control”?
kakacik 1 days ago [-]
Most real-world use of the Java platform has next to 0 concerns like those. Some more niche use cases may benefit, good, but the overall success map isn't changing anytime soon. Reasons for its long term success lie elsewhere.
nitwit005 38 minutes ago [-]
While this is true, it is true because the applications where it might be a concern avoid using Java.
FartyMcFarter 1 days ago [-]
Android Java apps' memory consumption is definitely a relevant concern.
gf000 1 days ago [-]
It doesn't even run "JavaTM", but some bastard child that is in like ~5 years delay compared to OpenJDK.
wiseowise 10 hours ago [-]
It uses ART, which is not a Java platform.
re-thc 1 days ago [-]
Not true. Lots of large Java deployments with millions to billions in cloud spend. The Java part of it isn’t 0.
Memory isn’t free. CPU isn’t free.
gf000 1 days ago [-]
And java uses very little CPU compared to most other languages. It's right after manual memory managed languages like C/C++, and is the first managed language according to a paper about how "green" each language is.
But there is a semi-fundamental tradeoff here, you either use more CPU to use less memory or the reverse. Java can be dynamically configured for either end (though defaults to less CPU by not running the GC unnecessarily).
pron 1 days ago [-]
> The cost of each new field is rarely considered
Most developers, in Java and in most other languages, do not consider the cost of every field, but I can tell you that people who need micro-optimisations certainly do care, and in Java's standard library, a layout is very much a concern (except, as always, you want to optimise what really matters; there's no point in optimising something that is unlikely to be a hot spot in a real program). Sometimes, though, you want to intentionally spread out the layout to avoid cache line sharing when concurrency is involved. You will find such examples in the standard library, too.
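A minimal sketch of that kind of intentional spreading, assuming 64-byte cache lines. The class and field names are hypothetical and this is not how the standard library does it; the JVM is free to reorder fields, so whether the padding really keeps the two counters on separate lines has to be verified by measurement rather than assumed.

```java
public class PaddedCounters {
    /** A counter followed by seven unused longs (56 bytes) of padding. */
    static final class PaddedLong {
        volatile long value;
        long p1, p2, p3, p4, p5, p6, p7; // padding only, never read
    }

    /** Two threads each increment their own counter; returns the sum of both. */
    static long run(int iterations) {
        PaddedLong a = new PaddedLong();
        PaddedLong b = new PaddedLong();
        // Each counter has exactly one writer, so the non-atomic ++ loses no updates.
        Thread t1 = new Thread(() -> { for (int i = 0; i < iterations; i++) a.value++; });
        Thread t2 = new Thread(() -> { for (int i = 0; i < iterations; i++) b.value++; });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return a.value + b.value;
    }

    public static void main(String[] args) {
        System.out.println(run(1_000_000));
    }
}
```

The result is the same with or without the padding fields; only the timing under contention can differ, which is why this is something to profile rather than apply everywhere.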
re-thc 1 days ago [-]
> Most developers, in Java and in most other languages, do not consider the cost of every field
Are you saying most developers are bad? It’s the equivalent of most employees don’t consider the cost of every action to the employer and is how company spend blows up.
pron 1 days ago [-]
I'm saying that most developers aren't writing code where layout is a primary contributor to the program's performance. Even in performance-sensitive applications, only a minority of the team are working on the hot spots.
And speaking about costs, knowing what to optimise is the key to software performance. Improving the performance of an operation by 10000x will improve the performance of your program by less than 1% if the operation is only 1% of the profile to begin with. So I'm only saying that most developers don't work on code where the layout is very significant, but some certainly do.
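The arithmetic behind that claim can be sketched as follows. The method name is hypothetical, and the model assumes the rest of the program is unaffected by the change.

```java
public class ProfileShare {
    /**
     * Fraction of total running time saved when a part that takes `share`
     * of the time becomes `speedup` times faster and nothing else changes.
     */
    static double timeSaved(double share, double speedup) {
        return share * (1.0 - 1.0 / speedup);
    }

    public static void main(String[] args) {
        // An operation that is 1% of the profile, made 10000x faster:
        // the saving stays just under 1% of total running time.
        System.out.println(timeSaved(0.01, 10_000));
        // A hot path that is 90% of the profile, made only 2x faster.
        System.out.println(timeSaved(0.90, 2));
    }
}
```

Under these assumptions, a modest 2x gain on a 90% hot path saves far more total time than a 10000x gain on a 1% operation.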
re-thc 1 days ago [-]
> I'm saying that most developers aren't writing code where layout is a primary contributor to the program's performance.
I've heard this theory before. This isn't just about performance and I don't buy it.
I've seen too many examples of "this is just a temporary solution so it doesn't matter". >3 years later that "temporary solution" was still there and at the heart of many operations, yet it's now too hard and too costly to fix.
I've also seen the "this is a quick hack, no one uses it, it doesn't go through any hot paths, all good". You know what happens? Years later, every service literally goes through it. Again, it's too hard to fix.
In the real world these "theories" are really loose. The only fix is that everyone should be aware of what they are doing and do it properly. The "it might not happen" mindset is dangerous.
pron 1 days ago [-]
This has absolutely nothing to do with what I said. I wasn't referring to people who think that program performance doesn't matter (although I'm sure there are many of those) but to people working on code that either doesn't impact the overall program's performance much or it does but not due to layout. The number of developers working on code where layout is a major contributor to performance is relatively low, and this includes people working on programs where layout does impact performance significantly (because even in such a program, that particular hot path is not touched by every developer).
re-thc 1 days ago [-]
> but to people working on code that either doesn't impact the overall program's performance much or it does but not due to layout
And that's the problem. Who decides that? How do you know? That's my problem with it. Things always change. It's always temporary, not in the hot path, doesn't matter etc. until it does.
So what is considered "doesn't impact" often comes back to bite.
pron 1 days ago [-]
That is why profiling is the only way to good performance. It's what lets you know what matters, and it's the only thing that does or can. I've been doing low-level (as well as high level) programming for more than 25 years, and I don't know in advance what is more efficient than what. An operation that was inefficient in the program I wrote yesterday under high contention or bad branch prediction could be efficient in the program I'll write tomorrow. I can only know that if I profile my specific program (and when writing code for different architectures, I need to profile my program on all of them, because what's efficient on x86-64 may be inefficient on Aarch64 or vice-versa). The days we could tell that something is efficient or not, except for the obvious cases, are gone. Computers, at both the hardware and software infrastructure layers, don't work like that anymore.
If your profile shows you a hot path that's responsible for 90% of the time your program spends, any second optimising anything outside of it harms your performance, as it's a second spent on low ROI instead of high ROI.
NuclearPM 20 hours ago [-]
You do realize that code can be updated right?
gf000 1 days ago [-]
Then what is it that you are saying? That I should use JMH to determine the best layout for my helper class that will be initialized 3 times? Like most of the software (by line of code) is boring plumbing from one service to another with some dumb business logic sprinkled in. Something like a single config option for your database driver matters orders of magnitude more in many types of applications.
It's much more niche to work on stuff where such changes actually matter, like much much more people write boring CRUD backends than those who write physics simulators and audio processing pipelines combined.
re-thc 1 days ago [-]
Consider the cost of every field, of every action.
Understand the language, the memory model, etc. Don't do "it works on my machine". Understand the architecture, layout, implications etc.
E.g. if you need an int and not a long you should clearly use an int. Wait until you do this every time and things blow up and it's too "hard" to change.
It's called be aware of your actions. Take responsibility of what you do.
> It's much more niche to work on stuff where such changes actually matter,
Not true and that's why there's so much wastage.
A lot of things matter. I've seen more times than the other way that simple awareness and changes can pay for my salary, e.g. not updating to newer EC2 instances when they get released in AWS. Even in a mid size company that was hundreds to thousands in savings.
I've seen CI/CD pipelines where the developers never considered caching and it takes hours to run. It's not free. When every PR and update (hundreds a day) triggers a run it's a cost and a cost not just on machines but developer time waiting.
I can list a lot more examples and everyone in the chain can contribute.
pron 1 days ago [-]
> Consider the cost of every field, of every action.
This runs counter to most modern software performance principles. Thanks to modern hardware optimisations (cache hierarchy, ILP, branch prediction), modern compiler optimisations (aggressive inlining that leads to a much wider view), and increased concurrency, the notion of some action having a cost lost most meaning about 20 years ago, and increasingly since. Because how fast some action is now depends on a much broader context of what else is going on in the program (and the machine), action X can be faster than Y in one program and the same or slower than Y in another.
Because it's nearly impossible to generalise (and so what was true in your previous program may not be true in your current one unless they're nearly identical), the advice is to first profile your program so that you know how fast or slow different parts are in the context of your particular program and then to focus the optimisation efforts on the hot paths in your program. Otherwise, you may end up spending effort where it makes no difference, and this comes at the cost of optimising what matters, overall harming performance.
Taking responsibility means being smart about directing your resources to where they can have the most impact.
Retr0id 1 days ago [-]
Most likely they just have other priorities. A lot of code is not at all performance-sensitive, or is bottlenecked by some other factor.
perching_aix 1 days ago [-]
No, it means the opposite.
nathanielks 1 days ago [-]
If the previous commenter won't say that, I will
LoganDark 1 days ago [-]
It doesn't take a "bad developer" to not consider the cost of every field...
petra 1 days ago [-]
And probably, those optimizations could be automated by LLMs.
ChrisMarshallNY 1 days ago [-]
I started off with Machine Code, on a device with 256 bytes (not KB) of RAM. That was 256 bytes, to install the executable, reserve the stack, and set up the heap.
We often used bit (not byte) fields, to convey information.
Made life challenging.
However, being able to be sloppy has its definite advantages. It takes a long time to design highly-optimized stuff. If just declaring a couple of new properties takes thirty seconds, and designing a bitfield takes an hour, then we have some real cost-savings, there.
That said, it's easy to get crazy, these days. I just spent a couple of days, chasing down greedy memory hogs. These were operations that ate gigabytes of memory. I determined that the real culprit was actually Apple MapKit, and figured out a simple workaround, but it took a long time to get there. If I suspect the OS, then it's usually my fault, and trying everything before going back to the OS takes time.
Obscurity4340 1 days ago [-]
How do you deal with all the daemons and automatic crap that does this on Mac? Isn't it all reinforced by SIP?
ChrisMarshallNY 1 days ago [-]
I think all operating systems have these.
In this one case, allocating a MapView via storyboard, caused some kind of cascading strong reference stuff.
Simply allocating it programmatically, fixed it.
Took awhile to get there, though.
Obscurity4340 17 hours ago [-]
Can you elaborate? Did you turn SIP off or what did you do?
ChrisMarshallNY 16 hours ago [-]
No. No need. Just spent a lot of time, logging my code, and eliminating every possible leak.
Tedious, but there really wasn’t anything else I could do. Finding out about the programmatic solution was really just a wild guess.
forinti 1 days ago [-]
So if you need speed, you just have to swallow your OO programmer's pride and put your data in arrays.
jayd16 1 days ago [-]
If you have hot loops with millions of iterations at a time, structure your code accordingly. It's not anti-OO to choose the right data structure for the job.
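For readers who haven't seen the two layouts side by side, here is a rough Java sketch. All class and field names are hypothetical, not taken from the article, and whether the second form is actually faster depends on the access pattern and should be measured.

```java
public class MonsterLayouts {
    // Array-of-structs style: one object per monster, all fields together.
    static final class Monster {
        boolean alive;
        int health;
        double x, y;
        Monster(boolean alive) { this.alive = alive; }
    }

    static int countAliveAoS(Monster[] monsters) {
        int n = 0;
        for (Monster m : monsters) {
            if (m.alive) n++; // touches one field of each separately allocated object
        }
        return n;
    }

    // Struct-of-arrays style: one primitive array per field.
    static final class Monsters {
        final boolean[] alive;
        final int[] health;
        final double[] x, y;
        Monsters(int n) {
            alive = new boolean[n];
            health = new int[n];
            x = new double[n];
            y = new double[n];
        }
    }

    static int countAliveSoA(Monsters m) {
        int n = 0;
        for (boolean a : m.alive) {
            if (a) n++; // sequential scan over one contiguous array
        }
        return n;
    }
}
```

Both methods compute the same count; the difference is only in how much unrelated data gets pulled through the cache along the way.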
bob1029 1 days ago [-]
And avoid moving said data between physical threads as much as possible.
Most of the bottlenecks I see are not due to the organization of data. Unnecessary communication of data is the #1 offender.
burnt-resistor 1 days ago [-]
Working set and algorithm diagonalization (work independence) FTW. Immutable data structures and copying often helps to avoid cache invalidation penalties.
chadgpt3 21 hours ago [-]
If you had the right language you could use AoS syntax with SoA implementation. I heard Jai was going to have this feature?
kerblang 23 hours ago [-]
... IF that's your main performance problem.
I already know I'm dealing with huge perf issues caused by ORM & lazy-load semantics. I/O abuse is usually going to be so, so much worse than memory/cache issues. Java is mainly used for business information systems, where I/O is king. Plain vanilla memory abuse is also a big one.
But my main problem is a mgmt convinced the magic wand of AI will make all sorts of problems disappear, and it's going to take 5 years for them to realize nope.
It's still fun to learn about cache optimization though, esp. when someone makes it reasonably digestible like this. And maybe it also helps people to recognize that OOP is not some great over-arching zen truth of truths.
theandrewbailey 1 days ago [-]
Maybe someone can write an OO language where arrays of structs are automatically stored as structs of arrays.
Odin is heavily inspired by the lang he or she is referring to!
fp64 1 days ago [-]
A sibling comment also mentioned Jai. Not sure what I am missing that the original post was explicitly referring to Jai, some inside joke maybe?
I am sorry, I only know Odin. Jai is this cult on reddit/discord, right? You get access if you socialize enough or something? Not my thing. Not for a language.
theandrewbailey 1 days ago [-]
(original poster here)
I was just throwing out an idea. I had no idea there were already implementations! Because, to my knowledge, conventional popular languages like C/C++/C#/Java/JS/Python don't do that, and automatically doing that (under certain conditions) feels like an easy performance win.
jevndev 1 days ago [-]
For what it’s worth, a common example of the capabilities of c++26 reflection is exactly this use case. I can’t remember where I first saw it, but this article [0] showcases the technique pretty well. It’s opt-in so not the compiler optimization that you’re imagining but still neat that it’s possible
Ah. So, the context (Which I read too far into evidently): 1: One of Jai's initial primary marketing points was to address exactly this: SoA performance with AoS ergonomics. 2: Odin is (or was initially) inspired by Jai.
Yes we should end the hateful rhetoric of most and least significant bytes. Every Byte Matters.
diabllicseagull 1 days ago [-]
We'll get there, bit by bit.
1 days ago [-]
zabzonk 1 days ago [-]
We need an ending to byte-sizeism as well.
moi2388 1 days ago [-]
In combination with “What colour are your bits” I do not see this ending well..
agalunar 1 days ago [-]
Perhaps worth noting that the number of lines in a cache is often different than the number of rows, which can be relevant for some workloads.
The size of an ordinary cache is rows × ways × size(line), where rows = 2 ↑ num-idx-bits. For example, most Intel 64 and AMD 64 processors use log₂(size(page)) − log₂(size(line)) = 12 − 6 = 6 index bits for the L1 cache*, so an L1 cache with 8-way associativity is 64 sets × 8 lines/set × 64 bytes/line = 32 KB large, and an L1 cache with 12-way associativity is 64 × 12 × 64 = 48 KB large. I remember being surprised to learn that most processors have only 64 rows in the L1 cache!
*So that virtual indexes and physical indexes are identical (so that retrieval of the row can happen in parallel with TLB lookup).
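The same arithmetic as a small Java sketch (the method name is made up for illustration), using the 6 index bits and 64-byte lines from the example above:

```java
public class CacheSize {
    /** Cache size in bytes = rows * ways * line size, with rows = 2^indexBits. */
    static long cacheBytes(int indexBits, int ways, int lineBytes) {
        long rows = 1L << indexBits;
        return rows * ways * lineBytes;
    }

    public static void main(String[] args) {
        System.out.println(cacheBytes(6, 8, 64) / 1024);  // prints 32 (KB, 8-way)
        System.out.println(cacheBytes(6, 12, 64) / 1024); // prints 48 (KB, 12-way)
    }
}
```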
ssiddharth 1 days ago [-]
Slight tangent, but every ms, μs, and ns counts too. We've gotten awfully carefree with response times and wasted compute cycles.
nasretdinov 1 days ago [-]
Ideally you'd want to go further and actually store the is_alive as a bit mask and use SIMD instructions to filter out zeroes for example.
chadgpt3 21 hours ago [-]
Even without SIMD (not counting SWAR as SIMD) if you knew most entities were alive, you could zoom through the array checking 64 at a time until you found one that wasn't all 1s and then inspect more closely. Though maybe an index list is better if they're almost all alive.
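A minimal sketch of that 64-at-a-time scan in plain Java, assuming one bit per entity packed into a long[] (bit set means alive). The names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class AliveBitmask {
    /**
     * Returns the indices of dead entities. Bit (i % 64) of words[i / 64]
     * is 1 when entity i is alive; `count` is the number of real entities.
     */
    static List<Integer> findDead(long[] words, int count) {
        List<Integer> dead = new ArrayList<>();
        for (int w = 0; w < words.length; w++) {
            long zeros = ~words[w];       // 1 bits now mark dead entities
            if (zeros == 0) continue;     // fast path: all 64 are alive
            while (zeros != 0) {
                int bit = Long.numberOfTrailingZeros(zeros);
                int index = w * 64 + bit;
                if (index < count) dead.add(index); // ignore unused tail bits
                zeros &= zeros - 1;       // clear the lowest set bit
            }
        }
        return dead;
    }
}
```

When nearly everything is alive, almost every word takes the fast path and only the rare non-full word is inspected bit by bit.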
recursivedoubts 1 days ago [-]
When you are developing games, sometimes.
When you are developing most other applications every byte does not matter. What matters much more is overall system architecture, collapsing unnecessary abstraction layers that some developers (especially java developers) seem to love and optimizing your datastore access.
As always, profile profile profile.
A company I worked for spent a violent couple of man-decades flipping our proprietary scripting language from interpreted to bytecode generation, obviously with tons of bugs and subtle semantic changes, and it ended up boosting overall system performance by about 30%. We could have done nothing over that period of time and hardware advances would have made a bigger impact.
manoDev 1 days ago [-]
Tip: to get LN cache sizes on mac, the command is
Oh, I was just watching this yesterday and got a little re-energised about getting back to more active development of my DoD JS engine! Thanks!
setheron 1 days ago [-]
Add it to my watch list!
compiler-guy 1 days ago [-]
SoA can be a big win. But so can plain AoS, just depends on the access pattern.
Profiling important workloads matters. Without that everything else is guesswork.
coldcity_again 1 days ago [-]
I love to see stuff like this. As an active Vectrex gamedev and PC/Amiga sizecoder I strongly agree with the sentiment!
jadbox 23 hours ago [-]
Zig's MultiArrayList is a cool language feature to support collections of objects, and I wish more languages had first class support for it (without the overhead of copies).
PrathikArun 6 hours ago [-]
Wow looks great!!!
1 days ago [-]
rao-v 1 days ago [-]
Anyways find it odd that major languages don’t have a built in way of asking for an array of objects to be optimized as SoA or AoS
jayd16 1 days ago [-]
It doesn't quite make sense to keep object identity at the language level. Inherently the data in the arrays cannot be the same memory of the data in the objects fields.
To get the speed up, you can't just abstract it as an access pattern because it's tied to the specific way the memory is laid out.
If you were trying to make some kind of collection type that could be queried by both row and column, you would need to store it both ways at all times and also keep both representations in sync, which also defeats the purpose, somewhat.
I feel like if you're trying to do this pattern then it doesn't make sense to also keep the objects.
chadgpt3 21 hours ago [-]
Database structure in a programming language! I think I've heard this idea before but never seen it implemented. Define a table with rows and columns, give it some implementation hints and query it like "SELECT id FROM monsters WHERE alive=false" and the compiler could translate it several completely different ways depending on your layout.
rao-v 16 hours ago [-]
Yes! Exactly
rao-v 16 hours ago [-]
I was thinking about just asking the compiler to do the expensive reshuffle when I need it to, but you could go further and expect the compiler to figure out the likely access pattern and spend the transformation budget.
Heck memory is cheap (fine was cheap) give me a data structure that amortizes writes cleverly by maintaining both SoA and AoS at the same time
jayd16 4 hours ago [-]
Well the whole point is that memory (cache in this case) is not cheap at all. We have very little and the point is to do as little loading of what you don't care about as possible to keep the cache full of only exactly what is needed for the task.
How do you imagine it's possible to write to every SoA and every AoS and have that as cheap as only the first step?
yas_hmaheshwari 1 days ago [-]
Out of course: I had thought about reading an article about Iran war or some geo political news when I read fzakaria :-)
AxelWickman 1 days ago [-]
Cool read. The AoS vs SoA speaks for itself.
readthenotes1 1 days ago [-]
"In that time, you get used to huge classes. New functionality? Just add a new method and field to the class"
I guess this is one reason why object-orientation has such a bad reputation.
I once worked at a bank where the OO mentor had taught people that the only object they needed was "Tape" and have them replicate the structure of data on the old spooled tape reels.
The struct of arrays reminds me of this optimization.
burnt-resistor 1 days ago [-]
I'm curious if anyone has had to write a JNI extension for a hot (CPU, GPU, RAM) section the JVM was unable to effectively JIT and/or optimize enough.
chadgpt3 21 hours ago [-]
I think that's generally a pessimization because JNI has fairly high overhead.
burnt-resistor 20 hours ago [-]
So then, preferably the set of problems that take a long time to compute and does so independently.
RickJWagner 1 days ago [-]
That’s a great read. I wish more people wrote like that.
fdegmecic 1 days ago [-]
CppCon 2014: Mike Acton "Data-Oriented Design and C++"
Andrew Kelley: A Practical Guide to Applying Data Oriented Design (DoD)
you should check these two talks out then.
lionkor 1 days ago [-]
The first is quite famous in data oriented design/programming circles, the second one is up there, too. Both very much worth watching.
setheron 17 hours ago [-]
(author) thank you for the kind words.
coolThingsFirst 1 days ago [-]
Why doesn’t the machine fill up the other cache lines as well? Why is it only 64 bytes and then a miss?
masklinn 1 days ago [-]
They will absolutely do that (prefetching, they can even eagerly load what’s on the other side of a pointer).
However it requires additional hardware to recognize patterns which benefit from prefetching, and every time the CPU prefetches data which ends up not being used it has both burned energy and memory bandwidth, and evicted data from the cache which might be needed (cache pollution).
spiffyk 1 days ago [-]
A cache line is simply the unit of data a CPU cache works with (generally 64 bytes, because someone somewhere has probably determined that that is the best line size for general use), much like there are units of data like bytes (8 bits nowadays, but there have been weird ones historically), pages (varies between hardware as well, and may be OS-configurable), etc.
As TFA mentions, a CPU does some predictions about what cache lines to prefetch, e.g. when you do sequential reads. Moreover, the x86_64 instruction set provides a prefetch instruction through which you are able to give the CPU a hint "hey, I'm gonna be using this soon, prepare accordingly, pretty please".
Still, the utility of prefetching is diminished if you only use a single byte from each cache line, because the mechanism generally depends on you doing other work while the next cache line is being fetched. So really the best case scenario is to take as much time as possible to work with what is already fetched, so that there is time for the next unit of data to be fetched in the meantime.
Liquid_Fire 1 days ago [-]
It might sometimes prefetch the surrounding lines as well, but ultimately cache space is limited, so there is a trade-off. Every time you fill a line, you are throwing away something else that was cached there previously, which you may need again in the near future.
> How much of an impact can this have? > Reading is:alive (1 byte) Across 1M Monsters
You aren't reading one byte here, you are reading 1M bytes! Of course, optimizing the access to 1M bytes is something to consider. Optimizing the access to one byte isn't.
The article is definitely worth reading IMHO, but it really needs a better headline!
Now think about random access to single struct instances instead: the CPU loads a cache line worth of data for each field and uses only one element out of the whole cache line. This is much worse than a compact structure representation of the same data.
SoA is not universally better.
This is an important part of Data-oriented Design: the representation of the data should be pragmatically tied to its access patterns, not dogma.
Richard Fabian's DoD book gives the example that (x,y,z) is almost always better as a classic array-of-structs rather than a struct-of-arrays, because if you're accessing one dimension, you probably want to process the other two dimensions at the same time:
https://www.dataorienteddesign.com/dodbook/node9.html#SECTIO...
For the internal web site that customer support people used a document oriented database would be great because that wants to load everything about one customer and pretty much doesn't need anything else until the user is done supporting that customer.
For the dozens of periodic reports that needed to be generated relational was way better. A given report generally only wanted a small amount of per customer data but wanted that for all customers.
A little bit of searching and LLM querying suggests that nowadays there are databases that are good at both kinds of tasks, in particular Postgres with JSONB, at least at the scale we were looking at (maybe 30k or so customers), but maybe really big operations would need more specialized software.
In both cases you want to think about locality of the next read and structure the data accordingly.
Great for this access pattern, but I wouldn't make a general statement like that. This is the same thing as row-oriented vs column-oriented databases, OLTP vs OLAP. SoA is weak if you are adding/removing monsters more often than accessing a single "hot" field.
Why is that? Genuinely curious. Does "weak" mean that it performs worse than AoS, or that the gains aren't as significant versus AoS?
Now, as others have suggested, you can have a more complex implementation, where instead of removing the monster's fields from those arrays, you just mark them as "dead" or whatever and then skip them when consuming the relevant arrays, with some relatively small extra bookkeeping overhead. Of course, this comes with its own drawbacks, especially if the number of monsters is very dynamic and you are memory constrained.
The point is not to say that SoA is never good for performance, it obviously and certainly is, probably even in most cases. It's just not always best for performance, this was all.
Or swap it with the last monster, and keep an index for the last monster alive.
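A sketch of that swap-with-last removal on a struct-of-arrays pool, with hypothetical names and only two fields to keep it short. Note that the moved monster changes index, so anything holding indices from outside needs to cope with that.

```java
public class MonsterPool {
    final int[] health;
    final float[] x;
    int liveCount; // monsters at indices [0, liveCount) are alive

    MonsterPool(int capacity) {
        health = new int[capacity];
        x = new float[capacity];
    }

    /** Appends a monster and returns its index. */
    int spawn(int hp, float px) {
        health[liveCount] = hp;
        x[liveCount] = px;
        return liveCount++;
    }

    /** Removes monster i in O(1) by moving the last live monster into its slot. */
    void remove(int i) {
        int last = --liveCount;
        health[i] = health[last];
        x[i] = x[last];
    }
}
```

The live range stays dense, so hot loops over [0, liveCount) never have to skip dead entries, at the cost of not preserving order.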
SoA is only useful when you don't read multiple fields for most operations.
This is on par to linear search being faster than binary search for small n. As soon as caches and branch prediction chime in many rules of thumb just change. Most importantly, however, is that a distinction between small and large n basically _needs_ to happen at that point.
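One way such a small/large split can look in Java. The threshold is a made-up placeholder, not a recommendation; the right cut-over point depends on the hardware and data and has to be found by profiling.

```java
import java.util.Arrays;

public class HybridSearch {
    // Hypothetical cut-over point; measure before trusting any value here.
    static final int LINEAR_THRESHOLD = 32;

    /** Returns an index of key in the sorted array, or -1 if absent. */
    static int indexOf(int[] sorted, int key) {
        if (sorted.length <= LINEAR_THRESHOLD) {
            // Small n: a predictable sequential scan.
            for (int i = 0; i < sorted.length; i++) {
                if (sorted[i] == key) return i;
                if (sorted[i] > key) break; // sorted, so key cannot appear later
            }
            return -1;
        }
        // Large n: fall back to binary search.
        int pos = Arrays.binarySearch(sorted, key);
        return pos >= 0 ? pos : -1;
    }
}
```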
The smarter you get about it, the closer you get to an OLAP db
Which leads to my theory… I feel like Bevy could be implemented on top of an in-memory DuckDB and get away with it
> Which leads to my theory… I feel like Bevy could be implemented on top of an in-memory DuckDB and get away with it
Haha, it certainly does sound viable.
The JVM is an odd place where it requires too much heap to compete with the AOT compiled languages, but its startup time is too slow compared to interpreted languages. I think these enhancements are essential to keep the platform relevant.
Since JDK 25 it's already 64 bits with the `-XX:+UseCompactObjectHeaders` flag [1], but in JDK 27 it will be the default [2].
> where it requires too much heap to compete with the AOT compiled languages
Not to compete but to beat, and not too much, but the right amount. Low level languages are optimised for control, not performance (that control translates to better performance in smaller programs, and to worse performance in larger programs), and their particular constraints prevent them from enjoying certain important optimisations, especially those offered by JIT compilation and moving collectors, which remove some overheads that AOT compilers and free-list allocators incur. Their memory management is forced (by their constraints) to optimise for footprint rather than speed.
There are common misunderstandings about memory management and why moving collectors were created to reduce the CPU overheads of malloc/free, especially in large programs, in exchange for what is effectively free RAM. This is why moving collectors are chosen by the languages that are unconstrained enough to use them and have the resources to implement them (Java, .NET, V8). With the exception of Zig (and even there it requires some effort), it's hard for low level languages to use the basic optimisation that's behind moving collectors. I gave a talk about how moving collectors optimise memory management at the last Java One, and it should be available on YouTube soonish [3].
> but its startup time is too slow compared to interpreted languages
That hasn't been the case for some time. You are right, though, that startup/warmup time is worse than in AOT compiled languages, and that is the tradeoff of optimising JITs: reduce the overheads associated with AOT compilation in large programs in exchange for warmup.
Both startup and warmup have already been improved thanks to Project Leyden's "AOT cache" [4], but it will never be as low as C.
In general, the tradeoff is between optimisations that help large programs vs optimisations that help small programs.
[1]: https://openjdk.org/jeps/519
[2]: https://openjdk.org/jeps/534
[3]: I can't reproduce the full talk (which goes into the maths of memory management) here but what happened with moving collectors was that until very recently (open source low-latency moving collectors are newer than ChatGPT), they required pauses and so weren't suitable for programs requiring low latencies. As a result, many developers either forgot or never learnt just how incredibly efficient moving collectors are. But the key is that because accessing RAM by necessity requires CPU, using CPU effectively captures RAM even if it's not used by the program. Bringing the CPU and RAM usage into a good balance is more efficient than trying to minimise one or the other. This is also the reason why hardware (physical or virtual) is packaged within a very narrow band of RAM/core ratio.
[4]: https://www.youtube.com/watch
There are of course plenty of optimizations the JVM does that aren't possible AOT, but that doesn't imply an automatic win at large scales, as Rust demonstrates.
Yes. I was working in a place that made large sensor-fusion applications, air-traffic control applications, and logistical planning, each in the 2-8MLOC range. Over time, we ported all of them from C++ to Java because C++'s performance overheads were too annoying to work around.
Of course, in principle it's always possible to match and perhaps even exceed Java's performance in a low-level language, but in practice it becomes ever more difficult as the program grows (and the cost remains with maintenance forever). The reason is that as programs grow, patterns become less regular (e.g. the variance in object lifetimes grows), the need for concurrency grows (and so the need for sharing objects among threads and for lock free data structures), and more general constructs are used (e.g. more dynamic dispatch). Improvements in modern allocators, as well as LTO and PGO have helped, but not enough to match the extent of optimisations you can do once you're free of the design constraints of low-level control and the focus on the worst case.
Java's thesis (not initially, but from very early on) was to rely on optimisations that can't be effectively employed by low-level languages because of their constraints, such as efficient memory management that benefits from being able to move most pointers in a program, and highly aggressive speculative optimisations (that are nondeterministic and can fail, resulting in deoptimisation). These optimisations tend to be global, and so they don't restrict program structure much, keeping maintenance costs lower, but they do help the average case at the cost of harming the worst case, which is a tradeoff that programs written in low-level languages don't want, and of course, it doesn't give the low-level control that's the entire point of low-level languages. Proving that thesis took a while, and longer in some aspects than others (moving collectors that don't pause were first released to a wide audience three years ago).
Of course, the differences aren't huge because the hot paths are typically small enough that they can be improved without adding too much cost (and hot paths require some manual optimisation in all languages), but gaining some performance as a side effect of significantly lowering costs is nice.
> There are of course plenty of optimizations the JVM does that aren't possible AOT, but that doesn't imply an automatic win at large scales, as Rust demonstrates.
I don't know what it is that Rust demonstrates given how few large scale projects have chosen it, but I've seen nothing to indicate that it doesn't suffer from the same performance issues as C++ compared to Java. In fact, someone I know who works at one of the world's largest tech companies told me that his team lead really wanted to do something in Rust, so they ported a small-to-medium service from Java to Rust. The result was such a huge performance drop that it wouldn't meet their minimum requirements. They were then forced to spend an additional 6 to 12 months carefully hand-optimising their Rust code until it matched Java's performance, but the result is such that all future maintenance will be more expensive. This is the exact same pattern I've seen with C++.
It's interesting that 20 years ago the people who said Java can't beat C++ on performance were experienced low-level programmers who had little or no experience with Java (and they were also right on several axes at the time). Today the people who say that are those with little experience with low-level languages (and are under the impression that low level languages are universally fast), but they will eventually learn about their fundamental performance issues just as we did decades ago.
I think that Rust in particular has made people without much experience in low-level programming (among which Rust has made much more inroads than among those with a lot of experience in low-level programming) believe a certain story, namely that the problem with low level languages was memory safety and that that was the reason so many large programs switched to Java despite the performance sacrifices they had to make. Now that Rust fixes that problem, they can have their cake and eat it too! In reality, memory safety was indeed one of the several significant problems with low level languages that Java sought to fix, but another was the performance issues low level languages suffer from as they get large (making good performance ever more costly). The tradeoff isn't performance (in large programs there might even be a performance gain) but low-level control, as that is what low-level languages are about. That was what they offered back then, and it's still what they offer now. Rust was first designed twenty years ago, back when things still looked a certain way (which is why, IMO, it repeated most of C++'s design mistakes), but these days I think that a better, more modern design of low-level languages is more focused on control, leaving large programs to high-level languages. Lack of memory safety has, without a doubt, been one of the things that made low-level languages less palatable to "ordinary" applications, but it was far from the only one.
Anyway, I'm sure the debate of which is faster, C++ (/Rust/Zig) or Java, will continue, and frankly, due to the nature of modern hardware, compiler, and runtime optimisations these days (when the question of the cost of some individual operation is all but meaningless and our ability to extrapolate from the performance of one program to another is close to nil), it largely comes down to empirical questions such as which program patterns are more or less common in the field and in which domains, as there are code and workload patterns that could give an advantage to either one.
That result would say less about performance of languages than it would about competency of developers with a language.
I just don’t buy that a task could be assigned to two teams with comparable expertise and domain knowledge in Rust and Java, and have the Rust result be at a “huge” performance deficit.
No, I don’t believe that was an apples-to-apples comparison.
Rust is designed to be a low-level language, i.e. a language with maximal control with all of its pros and cons (albeit with memory safety, which C++ doesn't have), while Java is designed to address the performance issues low level languages have, particularly as they get larger, due to their control constraints. Without such constraints, it is easier to offer better performance for less effort especially as programs grow.
In that particular program I was told that the differences were due to needing more locks in the Rust version. As has always been the case, they managed to achieve parity with much more effort (that is expected to continue over the lifetime of the software), but again, this is the explicit tradeoff of the approaches.
Thirty years ago, and even twenty years ago (when Rust was first being designed) many still believed that more control is the only path to good performance, even if it comes with a lot of effort. Today it's clear that it's not the only path, and the debate is mostly around which program and workload patterns that happen to work better with one approach or the other are more common.
> B-b-but skill issue!
That's one of the dimensions of the language too. Not only raw performance matters.
[0] https://en.wikipedia.org/wiki/Java_performance
These aren't the biggest advantages. I would say that the biggest ones are aggressive speculative optimisations that allow inlining of virtual calls (by default, up to a depth of 15 calls) and the ability to freely move pointers, which allows alternatives to free-list-based memory management. Low-level languages can't afford pervasive speculative optimisation (as they're focused on the worst case) and can't allow most of their pointers to be moved (because they often share them directly with the hardware and/or device drivers).
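To make the first of those concrete, here is a minimal sketch of the kind of call site that speculative inlining targets. All names (`Shape`, `Circle`, `Square`, `totalArea`) are hypothetical, made up for illustration. If only one implementation is ever observed at `s.area()`, a JIT can inline it behind a cheap type check and deoptimise if another type shows up later. Whether a given JVM actually does that for this exact loop depends on the runtime and its settings, so treat this as the shape of the code, not a measured result.

```java
import java.util.List;

// Hypothetical types, for illustration only.
interface Shape {
    double area();
}

final class Circle implements Shape {
    private final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

final class Square implements Shape {
    private final double side;
    Square(double side) { this.side = side; }
    public double area() { return side * side; }
}

class SpeculationSketch {
    // s.area() is a virtual (interface) call. An AOT compiler must assume
    // any Shape can arrive here; a speculating JIT may bet on the types it
    // has actually seen so far and fall back if the bet turns out wrong.
    static double totalArea(List<Shape> shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            sum += s.area();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Only Square reaches the call site: a candidate for speculation.
        List<Shape> squares = List.of(new Square(2), new Square(3));
        System.out.println(totalArea(squares)); // prints 13.0

        // A second type appears later: the earlier bet no longer holds,
        // but the program's result is still correct.
        List<Shape> mixed = List.of(new Square(2), new Circle(1));
        System.out.println(totalArea(mixed));
    }
}
```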
> and the wiki page on Java performance [0] is repeating what I understood.
That may be because the information on that page seems to be up to date to 2011-2. Java is now on version 26, BTW.
GCs are definitely a strong point for Java, but most high-performance code can be rewritten to avoid pummeling memory management. This used to be common for Java in financial applications, not sure if it still is.
C++ has evolved its own compacting GCs like oilpan [0] for applications where high performance is inherently tied to allocation. Oilpan runs into pointer issues and isn't remotely comparable to G1GC or ZGC, but I think the speed of V8 speaks for itself. Rust allows you to drop in non free-list based allocators and GCs (e.g. Bumpalo), but they're relatively immature.
The last time I dove into JVM internals was around the same time. I figured that someone who's worked with it more recently might have better examples than what's easily searchable.
[0] https://chromium.googlesource.com/v8/v8/+/main/include/cppgc...
Sure, AOT compilation also didn't stand still, and overall I'd say that Java and low level languages are closer today than they were 20 or even 10 years ago on all fronts: both have improved in areas where they were behind.
> This used to be common for Java in financial applications, not sure if it still is.
Given that low-latency collectors are only 3 years old, I'm sure some existing Java applications still do it, but new ones no longer need to (and it may turn out to be counterproductive with the new collectors)
> Rust allows you to drop in non free-list based allocators and GCs (e.g. Bumpalo), but they're relatively immature.
The problem isn't the immaturity but the integration with the standard library that requires significant code changes (e.g. you need to use different string and collection implementations). However, even where there is good integration - as in the case of Zig - arenas impose limitations (due to the care that needs to be given to lifetime) that make the program less flexible. But yes, when all the stars are aligned, arenas can beat moving collectors (that's about the only thing that can), but moving collectors aren't standing still and resting on their laurels, either.
> I figured that someone who's worked with it more recently might have better examples than what's easily searchable.
I don't know about a single unified resource, but you can find everything here: https://openjdk.org/jeps/0
JIT improvements are usually too low-level to merit a JEP, but all the major GC changes are there. For a taste of what's going on in the JIT these days, see this recent talk: https://youtu.be/J4O5h3xpIY8
I have even tried removing/rewriting some of the questionable sentences but my edits weren't accepted.
There are good reasons to use Java in environments that care about performance. Absolute performance can be traded for other concerns while still being good. It is why I did so much performance-engineering work in the language.
Most performance is architectural in nature. Extremely granular control of scheduling is a prerequisite. System languages provide that control if you want it, Java does not.
When you design software in Java, you accept that some software architectures are not available to you. If you care about performance, you would not port a software architecture optimized around the limitations of Java to a systems language.
I've done similar work (not supercomputing/HPC, but yes for soft and hard realtime software, including safety-critical software) and I couldn't disagree more. Of course, we didn't get to write every program in both Java and C++, but the main question was how much effort it took to achieve the required performance. Over multiple projects it was clear that hitting the performance targets was, on the whole, significantly easier in Java.
> This is the result you should expect from first principles; something has gone horribly wrong with your software optimization if Java is faster than C++ or even Rust.
Strong disagreement here, but we need to be specific about what we mean when we say performance.
It is undoubtedly true that for every Java program there exists a C++ program with the same performance, and the proof is simple: every Java program is a C++ program with the classes being input. But that C++ program is close to 2MLOC long. The same could also be said about a C++ program vs. an Assembly program, as every C++ program could be written as an Assembly program.
But when I talk about performance, I refer to what I think most programmers care about when it comes to performance. Not how fast can a program hypothetically be given enough effort and expertise, but how fast can my program be in my budget.
Both speculative compiler optimisations and memory management optimisations are simply not an option for low level languages due to their constraints, and they are very powerful global optimisations. Given a lot of expertise and effort (that must continue throughout the software's lifetime, and often increases as it evolves) you can work around these limitations, but Java was designed so that you can benefit from them, which means more performance per unit of effort.
In large programs more general constructs (e.g. dynamic dispatch) and patterns (concurrency, great variance in object lifetime) grow in prevalence, and low level languages require more effort and discipline to work around their shortcomings in these areas. Optimising JITs that allow aggressive speculative optimisations and moving collectors were invented and adopted to address these shortcomings. You could claim that the advanced mechanisms that were developed to address C++'s performance issues have failed to achieve their goal, although it won't be easy and much of it comes down to empirical questions of which patterns arise more or less frequently in software, but given that this is what these mechanisms were at least intended to achieve, you certainly can't claim that they fail to do so "from first principles". Some compilation optimisations need speculation; some memory management optimisations need moving pointers. Not having these optimisations available in a program you can write without a lot of special effort cannot make it faster "from first principles".
So no, I don't believe at all that something has to go wrong for a Java program to be faster than a C++ program given a certain budget for the program. Indeed, in larger, more complex programs, I believe the very opposite is true. In most situations, if you get the same performance in C++ as you do in Java, then something has gone terribly wrong with your Java program.
As someone who's worked on a pretty famous JVM feature (virtual threads), I can tell you that we and the designers of low-level languages consciously make different performance tradeoffs because we optimise for different programs and people, and have different preferences when it comes to average case vs. worst case, but there is no universal dominance in performance to either one of these approaches over the other.
One obvious example was our decision to remove Unsafe from Java. Some Java developers voiced opposition, citing a program speed competition (the "one-billion-row challenge" [1]) where Unsafe improved the performance of an entry (which was later cloned and tweaked by others) by 25%. But we saw it as further motivation for the decision. Among over a dozen performance experts who submitted entries, only one was able to write a program efficient enough for Unsafe to make a big difference, and the variance in the results even among the top 20 or so entries was larger than Unsafe's improvement. By removing Unsafe, we would harm that one expert's program, but it would allow us to perform more aggressive constant-folding optimisations that would result in much greater performance improvements over the entire ecosystem. Even from a design philosophy perspective alone, this removal of control to the detriment of some programs "for the greater good" of performance over the entire ecosystem is almost unthinkable in low level languages, because control is what they're for. Did that decision make Java a faster or a slower language? That depends on how you look at performance.
[1]: https://github.com/gunnarmorling/1brc
But this looks more like an apples-to-oranges comparison. You might be talking more about performance in complex business logic, while others are talking about performance in computation.
I can imagine that Java could be faster than C++ or Rust (for the same effort) when the number of distinct active tasks is large. But in more traditional performance-critical work, such as HPC or video game engines, there are usually only a limited number of distinct combinations of performance-critical tasks that can be active at the same time. Even if the codebase itself is huge, the performance-critical subset is simple, and the performance advantages from increased control over the execution are cheap.
Is it, though? It's the first language of choice for a large number, if not most, performance-critical applications.
> Because you are the only person I've ever heard making such claims seriously.
Your sources must be very limited, then, because in serious compiler and runtime design and memory management circles this is quite common. There is a debate, but it is an empirical one over whether the circumstances that favour Java over C++ are more or less common in practice or vice-versa. And again, given that it's the first language of choice in most performance-critical applications (and even if you don't believe it's number one, surely you agree it's in the top two or three) one or two more people probably think its performance is at least competitive with C++.
> But in more traditional performance-critical work, such as HPC or video game engines, there are usually only a limited number of distinct combinations of performance-critical tasks that can be active at the same time
I wouldn't say HPC and video game engines are "traditional performance critical work". Not because they're not performance critical, but because the range of performance critical programs is far larger - think bank card transaction processing; think mobile phone routing, and there are many more examples (also, AAA video game engines are indeed very traditional in their design and tech choices, but their performance-sensitivity these days is not so much around CPU-related optimisations but about scheduling the GPU, and their tech choices are much more constrained by the consoles they need to support than by performance).
HPC and video game engines are examples of traditional performance-critical work. Performance-critical, because they typically run in a resource-constrained environment. (If they don't, the user is likely to request the system to do more work.) And traditional, because it's more about algorithmic performance than system performance. The kind of performance people cared about long before computers became capable enough to run complex software systems.
I would not consider card transaction processing performance-critical. The total number of transactions is very low relative to the amount of resources available to process them.
As for Java, it stopped being a general-purpose language a long time ago. Most people who care about the performance of the software they write don't consider it, because almost nobody in their field uses it or talks about it. If it's actually a good choice for performance-sensitive applications in those fields, the people who are using it have done a good job keeping it secret.
If you are running in a resource-constrained environment, you might have no choice but to have complete control over hardware resources, in which case you may need to use a low-level language, but your optimisation budget is very high. A different and more common case is where the hardware isn't too resource-constrained, but the performance requirements aren't easily met, either. In these situations, the performance challenge isn't necessarily to optimise at all costs, but to find a way to meet the performance requirement while staying within budget. In these areas, Java has already displaced C++, and continues to be the first language of choice.
Of course, the people who write such applications (in any language) don't often talk about their architecture, but here's one example when they do: https://www.infoq.com/presentations/java-robot-swarms/ In this case, as in many others, the performance requirements are strict (and aren't easily met with horizontal scaling), but the constraint under which they must be met isn't the hardware but the budget and speed of development/evolution.
More often, the performance challenge is how to get the best performance per unit of effort (while meeting the performance requirements, of course) rather than how to get the last 1-5% of performance at any cost. Or sometimes I put this question as not "how fast can a program be?" but "how fast can I practically make my program?"
The optimisations Java offers are precisely intended to maximise the latter, because that's exactly where low-level languages suffer performance shortcomings. They could get that performance or perhaps better with a lot more effort (that needs to be continuously spent throughout the software's lifetime), but many performance-sensitive applications don't have or would rather not spend the time, money, or expertise to do that, and are looking for the best performance per unit of effort.
In "business oriented" contexts, the usual culprits are database access and serialization/communication overheads. If you use Rust with Serde, you get access to one of the fastest ways on the entire planet to turn JSON documents into struct-accessible data. The same implementation effort could be spent on any industry-specific data formats.
I am struggling to think of any scenarios where Rust is supposed to be uniquely unsuited and Java would have an obvious win to make the broad and sweeping statements you've made.
If everything you said is true, people would be building JVM backends for C++/Rust the same way LLVM has been used as a backend and there would be constant discussions about JVM vs clang vs gcc. It just doesn't add up.
Yeah, because most people who choose Rust are those coming from JS, Python, or Ruby, and almost no one has written large systems in Rust yet, I see why you'd think that, because that's indeed the main challenge in the kind of programs normally written in JS, Python, or Ruby. In automation control, the bottleneck isn't the DB; in distributed sensor fusion the bottleneck isn't the DB; in telecom routing the bottleneck isn't the DB (I actually don't know what the bottleneck is in transaction processing, but I'm pretty sure it's not just the DB). These are just some areas where Java is the top choice.
> I am struggling to think of any scenarios where Rust is supposed to be uniquely unsuited and Java would have an obvious win to make the broad and sweeping statements you've made.
In all the same places where Java displaced C++ and continues to do so: large systems. I think few even consider Rust, TBH.
> If everything you said is true, people would be building JVM backends for C++/Rust the same way LLVM has been used as a backend and there would be constant discussions about JVM vs clang vs gcc. It just doesn't add up.
First, Java is far more popular than C++ (let alone Rust), so there would be little point (although there is an LLVM backend for the JVM, though I doubt many people use it). The people who want Java's benefits over C++'s benefits have been using Java for a long time now.
Second, you can't have a JVM backend for C++ and Rust and fully enjoy the performance benefits of Java, because the JVM's optimisations are enabled by the language not having the constraints that low-level languages have. The people who just need the performance choose Java anyway, and the people who choose low-level languages choose them because they need the control the JVM doesn't offer.
Note his example elsewhere in this discussion of 2 projects done at the same time in Java and Rust, and the complaint that the Rust system used too many locks. This can happen in C++ too. But why does it not happen in (my) practice? Because C++ evolved to not use locks in large-scale parallel systems. This has been said in mainstage conference keynotes since at least 2013 [1]. So there is "normal C++" and "C++ that works at large scale", and they are not the same C++ language. The performance scales between them differ by many orders of magnitude. Imho it does not mean that Java is anywhere near the best of what C++ can do. So here we are talking past each other. pron is correct that Java is not bad and you are correct that you have no reasons to leave Rust.
1. https://sean-parent.stlab.cc/presentations/2013-09-11-cpp-se...
I don't think you're aware of where Java is today. Here's a recent talk about some of the issues we're working on now: https://youtu.be/J4O5h3xpIY8
I said that in the past the people who believed Java can't match or exceed C++'s performance were typically those with a lot of low-level programming experience and little or no experience with Java, while today it's mostly people with little experience with low-level programming, but I think you may be in the first group. To people in that group, the question I pose is: what exactly is it that you think makes Java harder to compile in an optimised way than C++? That's not hard to answer for JS or Python, but you'll find that it is hard to answer for Java. (I don't have a question to ask the people in the second group because they are typically people who don't know much about software performance to begin with, don't have any informed intuition about it, and just say nonsensical things like "runtime overhead").
On the whole, the range of optimisations available to our compiler is larger than to a C++ compiler, and we have a wider selection of memory management optimisations, too (this matters mostly in large programs with a wide variety of object lifetimes).
So if you were to ask me why I would speculate that C++ can't be as well-optimised as Java, I could tell you that it's because it can't inline as aggressively and it can't move pointers (due to its constraints and intended domains).
I think an answer for why Java wouldn't be as optimised as C++ could refer to things like "Java has an interpreter" (true, but that design was chosen to support more aggressive speculative optimisations in the compiler), or "Java has moving-tracing GCs" (true, and that was chosen because they offer an optimisation of memory management in a wide variety of situations). The JVM was designed to address specific performance shortcomings of low-level languages; true, its mechanisms don't result in a win in all situations, and in some they even lose, but these mechanisms were chosen because they do win in many situations.
In general, when we (the JVM's developers) see something that C++ can do faster, we treat it as a performance bug and solve it. What John (the chief JVM architect) is talking about is related to the last area where Java suffers (arrays-of-structs) to which we'll start delivering the solution very soon.
There are some intentional performance-related tradeoffs that both our team and the C++/gcc/LLVM teams make, but they are about offering better or worse performance under different circumstances, and definitely not universally.
As an example I was personally involved with, the C++ team and we intentionally chose different approaches to coroutines that give better performance in some situations and worse in others, and we both opted to prioritise different situations (i.e. situations where cache misses are more or less likely).
In general, C++ offers better performance than Java in some programs, and the opposite is true in other programs. On average, their performance has come closer over the years, each improving the areas where they were weaker.
As to "the best of what C++ can do", it's hard to define, because, as I said, every Java program can be seen as a C++ program, so technically C++ can always match the performance of a Java program given enough effort and expertise. But when talking about performance, what's practically possible matters much more than what's hypothetically possible, and in those programs where Java wins, achieving the same performance in C++ is just far more costly.
But also, given that both languages can and do come close to the maximal hypothetical hardware performance, they're rarely too far apart (unless we're considering warmup time), and they're both very much "anywhere near" each other almost all the time.
> what exactly is it that you think makes Java harder to compile in an optimised way than C++?
In games C++ is doing some simulations and data delivery for the GPU. Code that does work on the GPU is not mixed with the rest of the C++ code. So invoking CUDA (or the like) in the middle of a computation is a cheat code that Java does not have. Simulations on the CPU need to be efficiently parallel ( think 12 hardware threads for last gen or 4-6 threads for smaller platforms) and most likely specialized for hardware SIMD ( think AVX2 for last gen or SSE2 like for smaller platforms). To wrangle multi-GB data efficiently, a lot of compression/decompression and data structures are needed. Does Java still have overhead per class instance? It might force designs with arrays of primitive data types that are more verbose.
Add to that per-platform I/O and everything else. It means that games force people to unlearn everything the language ever taught about standard I/O, and even more about being cross-platform. In C++ it means something completely different. In C++ you can't trust the language implementation vendor with anything. From your comment I assume that Java teams rely on the language implementation in lots of ways. In C++ being efficient means do it yourself. How efficient is our memory allocation? The answer can only be per engine/project. There is no 'average' because 'vendor provided' is bottom-of-the-barrel quality. No one is improving the vendor-provided version exactly because no one is expected to use it.
In short, there are many different C++s and they are hard to compare. I can't see them being compared to each other, much less to other programming languages like Java. This might not be the answer you wanted, but that's all I have.
It does (and has since JDK 22). But what we're working on now is JIT-compiling Java code to CUDA (not arbitrary code, but certainly code that's suitable for a kernel): https://openjdk.org/projects/babylon/articles/hat-matmul/hat...
> and most likely specialized for hardware SIMD ( think AVX2 for last gen or SSE2 like for smaller platforms)
Yep, we've had good SIMD support for a few years now. (https://javapro.io/2026/04/09/java-vector-api-faster-vector-...)
> Does Java still have overhead per class instance? It might force designs with arrays of primitive data types that are more verbose.
That is the last area where Java is still behind but the work on arrays-of-structs (with no headers) is nearly complete. A first release of that is imminent.
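Until then, the workaround the parent alludes to looks roughly like this: one primitive array per field instead of one object per element. `MonsterObj`, `Monsters` and `countAlive` are hypothetical names chosen for this sketch; it shows the layout only and makes no performance claim.

```java
// One object per monster: every instance carries an object header, and an
// array of these holds references rather than the data itself.
final class MonsterObj {
    float x, y;
    boolean alive;
}

// One primitive array per field: no per-element headers, and a scan over
// `alive` reads only that array. The cost is the extra verbosity of
// addressing a monster by index instead of by reference.
final class Monsters {
    final float[] x;
    final float[] y;
    final boolean[] alive;

    Monsters(int n) {
        x = new float[n];
        y = new float[n];
        alive = new boolean[n];
    }

    int countAlive() {
        int count = 0;
        for (boolean a : alive) {
            if (a) count++;
        }
        return count;
    }
}

class LayoutSketch {
    public static void main(String[] args) {
        Monsters m = new Monsters(1_000_000);
        // Mark every second monster alive.
        for (int i = 0; i < m.alive.length; i += 2) {
            m.alive[i] = true;
        }
        System.out.println(m.countAlive()); // prints 500000
    }
}
```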
> In C++ being efficient means do it yourself
Right, and that's precisely what I meant about low-level languages being optimised for control and not performance. You could do things at such a low level in Java, but the main problem is not the performance but that it's just less convenient than in C++.
Anyway, aside from some outdated (or soon-to-be-outdated) things, what you pointed out is mostly about lack of convenient direct low-level control rather than general performance, and that is exactly when low-level languages can be a better fit.
I am not sure how it compares with C++, Rust and Zig, but we benchmarked it against a similar Go binary: the Java native version's performance (load tests) is similar to the Go binary's. Only the RAM usage of the Java native binary is 3 times that of the Go binary (and the JVM app took almost 10 times more RAM than the Go version).
I gave a talk on the subject that I hope will be published soon, and while I can't reproduce it here, let me give an example that offers some basic intuition. Imagine needing to do some computation in two ways on a machine with 1GB of free RAM. You could run for 10s, taking up 100% CPU and consuming 80MB of RAM, or for 9s, taking up 100% CPU and consuming 800MB of RAM. The second is more efficient, despite taking up 10x more RAM and saving "only" 10% of CPU, regardless of the relative cost of RAM and CPU. This is because taking up 100% of the CPU effectively captures 100% of RAM (as no other program can use it), so both programs capture the entire 1GB; the second one just captures it for a second less. This scales to non-extreme situations because accessing RAM requires CPU, so using CPU means capturing RAM whether you use it or not. So HotSpot uses it if it can use it to balance the CPU utilisation.
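The arithmetic can be spelled out in a few lines. This sketch uses the numbers from the example, rounds 1GB to 1000MB to keep the figures tidy, and uses "MB-seconds" as a label invented for this illustration, not an established metric.

```java
class RamCaptureSketch {
    // RAM that no other program can use while this one holds 100% of the
    // CPU, multiplied by how long it holds it.
    static long capturedMbSeconds(long machineRamMb, long seconds) {
        return machineRamMb * seconds;
    }

    // RAM the program actually occupies, multiplied by its run time.
    static long occupiedMbSeconds(long usedRamMb, long seconds) {
        return usedRamMb * seconds;
    }

    public static void main(String[] args) {
        long machineRamMb = 1000; // assumption: 1GB rounded to 1000MB

        // Option A: 10s at 80MB. Option B: 9s at 800MB.
        System.out.println(occupiedMbSeconds(80, 10));  // prints 800
        System.out.println(occupiedMbSeconds(800, 9));  // prints 7200

        // What the rest of the machine loses is the same 1000MB either way,
        // only for a different length of time.
        System.out.println(capturedMbSeconds(machineRamMb, 10)); // prints 10000
        System.out.println(capturedMbSeconds(machineRamMb, 9));  // prints 9000
    }
}
```

Measured by what the program occupies, B looks nine times worse; measured by what everyone else is locked out of, B is 10% better.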
In some situations it may not matter, and I assume that if Native Image and Go work just as well for you, then the workload isn't very high, but under high workloads, this can matter a lot.
Isn’t that only true though specifically at 100% CPU utilization?
If it were at 90% CPU, then you have no RAM capture, and then you can’t say anything about whether 80 or 800MB should be taken; it’s only a freebie if and only if literally no other program can do work on the machine.
I don’t see how you can map X% CPU utilization to Y% RAM capture.
Like a program could be network heavy, CPU light and mmaps a large file? Or streaming a file from disk with a constant memory allocation, but doing heavy nonstop CPU work.
The CPU/RAM capture ratio would be wildly different in each case; I don’t see any way for HotSpot to approximate the ideal for your program while other competing programs of unknown behavior exist.
No. Because any RAM access requires CPU, using up any CPU effectively captures some ability to use RAM.
> I don’t see how you can map X% CPU utilization to Y% RAM capture.
You're right that there isn't a fixed formula, but the most efficient balance can have a narrow range, because CPU and RAM are typically sold as a package with a rather narrow RAM/core ratio (usually between 0.5 and 4GB, where the lower end is usually when you have slow cores). This is also because of the intrinsic relationship of RAM and CPU.
> Like a program could be network heavy, CPU light and mmaps a large file? Or streaming a file from disk with a constant memory allocation, but doing heavy nonstop CPU work.
A program that is very CPU light can't make use of a lot of physical RAM at any one time (again, because using RAM requires CPU). One exception is caching, but memory access patterns for caching are easily detectable, and you can (and Java does) offer a different balance for them. I covered that in my talk, which will eventually be published on YouTube.
Any idea how I get myself notified once it’s up? Or a YT account to poll?
Ah yes, the swapping induced by IntelliJ overflowing my system RAM is supposed to reduce the inefficiencies of using too little memory. Great...
Thanks pron, you've fully bought into all the JVM kool-aid talking points without ever trying to question them. One of the reasons I upgraded to 32 GB RAM in 2019 was to run a Minecraft modpack. Minecraft is one of the most memory intensive games I've ever played.
When you consider that the smallest cloud instances that cost $4 per month only give you like 512 MB of RAM and have refused to upgrade for at least a decade, the idea of using more than 512 MB to be "more efficient" is ridiculous. It raises your minimum costs to $10 per month.
>I gave a talk on the subject that I hope will be published soon, and while I can't reproduce it here, let me give an example that offers some basic intuition.
>Imagine needing to do some computation in two ways on a machine with 1GB of free RAM. You could run for 10s, taking up 100% CPU and consuming 80MB of RAM, or for 9s, taking up 100% CPU and consuming 800MB of RAM.
This is the "unused RAM is wasted RAM" mentality and it doesn't work, because you usually have multiple competing programs and when you run out of RAM, your system will start swapping. This will then require you to buy more RAM, leading to more leftover RAM, which is then wasted and gets consumed by the applications again. It's nonsense.
Then there is the fact that the vast majority, basically 99.9% of algorithms are not scalable in the naive way presented. Nobody will waste resources on writing the same algorithm twice for these two cases. Databases are usually designed to either be primarily file system backed or in-memory backed. They will use the extra memory to hold indices and let the OS do the caching or they will reserve all the memory up front, intentionally leaving nothing for other applications.
>The second is more efficient, despite taking up 10x more RAM and saving "only" 10% of CPU, regardless of the relative cost of RAM and CPU. This is because taking up 100% of the CPU effectively captures 100% of RAM (as no other program can use it), so both programs capture the entire 1GB; only the second one captures it for a second less.
Ok, now you're just writing nonsense. Nowadays people have CPUs with multiple cores and use an OS with a scheduler. If you have two programs taking up 100% of the CPU, the OS will give each process some of the hardware resources. You can't just assume some 100% CPU blockage here just because it is convenient for your argument. It's especially dishonest since even a 99% CPU blockage basically makes your argument fall apart completely.
If two programs each decide to 10x their memory consumption to save one second, you'll most likely run into swapping issues, which will actually lock up your system for several seconds at a time, and if you're unlucky, the OOM killer strikes or the compositor freezes up and you have to reboot. You're saying that a 1 second savings is worth an endless amount of inconvenience.
>This scales to non-extreme situations because accessing RAM requires CPU, so using CPU means capturing RAM whether you use it or not. So HotSpot uses it if it can use it to balance the CPU utilisation.
Again, this is completely incorrect in so many ways that you're bragging you know nothing about how modern computers work.
CPU cores have their own local memory resources called caches. Depending on how your code is written, you may tile your data so it fits entirely in cache and operate within the local memory.
When performing inter-thread communication, there are often situations where the data doesn't even get written and then loaded to main memory, since atomic operations can make use of the MESI cache coherency protocol to pull the data directly from another core's cache.
Nowadays DMA is the standard way to perform large data transfers to hardware peripherals. If you load a file from an HDD, the SATA peripheral will communicate via DMA to copy whole sectors or file system blocks. The same applies to sending data to an SSD, network interface, GPU or basically anything else that performs bulk transfers (1 KiB+). The DMA engine is a separate component independent of the CPU and it may write data directly into cache as well.
Then there is the fact that RAM is a form of storage and storage is usually characterized by the fact that it takes up an area and said areas can be subdivided. When RAM is used, the portion of used RAM is considered blocked for the duration of how long it is stored, independently of whether it is accessed or not. This means that the most important objective is having sufficient amounts of RAM to store all data, not to occupy all of it preemptively even when it is not really needed.
The same can't be said of CPUs. Occupying the CPU usually means actively using the CPU. The only exception to this is things like spinlocks, which should be avoided like the plague. What occupies the CPU is determined by the OS, therefore your logic is backwards. It's not the program blocking the CPU and therefore blocking the memory. The OS decided to stop running your process to run another process. Progress is slowed down, but it is not blocked.
Actual blockage only occurs when two processes compete for a fixed resource so that it is not possible to run both processes simultaneously, so that one process has to be closed to run another process.
That's like me saying, oh great, so the swapping introduced by MS Word or Outlook shows just how efficient C++ is...
> Thanks pron, you've fully bought into all the JVM kool-aid talking points without ever trying to question them.
Oh I didn't just "buy" them. As a low-level programmer who's suffered for a long time from intrinsic inefficiencies in C++, I became a compiler and runtime engineer working on the JVM to solve the problems I had in C++.
> This is the "unused RAM is wasted RAM" mentality and it doesn't work, because you usually have multiple competing programs and when you run out of RAM, your system will start swapping
No, it's actually more involved and interesting than that, but you'll have to wait for my talk.
> Ok, now you're just writing nonsense. Nowadays people have CPUs with multiple cores and use an OS with a scheduler. If you have two programs taking up 100% of the CPU, the OS will give each process some of the hardware resources. You can't just assume some 100% CPU blockage here just because it is convenient for your argument
I didn't. I specifically said it was just an example to demonstrate the inter-relatedness of RAM and CPU, since accessing RAM requires CPU. To understand why every single language that isn't limited by other constraints and has the engineering resources to do so uses the same basic memory management algorithm as Java, I guess you'll have to watch my talk when it's published.
> Again, this is completely incorrect in so many ways that you're bragging you know nothing about how modern computers work.
Wow. I guess it doesn't take much to be an engineer working on safety critical realtime applications and then on one of the world's most advanced optimising compilers, and you can get pretty far without knowing how computers work.
> CPU cores have their own local memory resources called caches. Depending on how your code is written, you may tile your data so it fits entirely in cache and operate within the local memory.
The data you need to access at any one time and the overall memory consumption of your program are two very different things. Maybe you don't know this, but CPU caches don't work by caching a large contiguous portion of the address space.
> When performing inter-thread communication, there are often situations where the data doesn't even get written and then loaded to main memory, since atomic operations can make use of the MESI cache coherency protocol to pull the data directly from another core's cache.
I find it hilarious that you're trying to teach me about MESI, given that designing algorithms and data structures that are efficient on top of MESI was one of my jobs [1], and I advised Intel on architecture, but okay, maybe I know nothing about computers, as you concluded from a paragraph where I tried to give people who may not be compiler or memory management experts some intuition about modern memory management design.
FYI, modern malloc/free allocators are also intentionally less footprint-optimised than older ones to get better performance (although they can't offer all the optimisations of moving collectors because they're not allowed to move pointers), but maybe none of the people writing the compilers or memory management mechanisms you use know computers as much as you do, and you know all there is to know.
[1]: I later even wrote, for a general audience, about data structures over distributed MESI (well, MOESI to be precise) protocols: https://highscalability.com/the-performance-of-distributed-d...
The common discourse is that "XYZ language is close to the metal and therefore Blazing Fast (tm)". People become tribalistic and forget that there are engineering considerations and trade-offs all the way down. I appreciate you making the argument for the JVM delivering performant code when a budget matters.
Memory isn’t free. CPU isn’t free.
But there is a semi-fundamental tradeoff here, you either use more CPU to use less memory or the reverse. Java can be dynamically configured for either end (though defaults to less CPU by not running the GC unnecessarily).
Most developers, in Java and in most other languages, do not consider the cost of every field, but I can tell you that people who need micro-optimisations certainly do care, and in Java's standard library, layout is very much a concern (except, as always, you want to optimise what really matters; there's no point in optimising something that is unlikely to be a hot spot in a real program). Sometimes, though, you want to intentionally spread out the layout to avoid cache line sharing when concurrency is involved. You will find such examples in the standard library, too.
Are you saying most developers are bad? It’s the equivalent of most employees not considering the cost of every action to the employer, which is how company spend blows up.
And speaking about costs, knowing what to optimise is the key to software performance. Improving the performance of an operation by 10000x will improve the performance of your program by less than 1% if the operation is only 1% of the profile to begin with. So I'm only saying that most developers don't work on code where the layout is very significant, but some certainly do.
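The 10000x figure above is Amdahl's law at work. A quick sketch of the arithmetic follows; the function name is made up for illustration, and the second case (a hot path that is 90% of the profile, sped up 2x) is a hypothetical comparison, not a number from the comment.

```python
def runtime_saved(fraction_of_profile, local_speedup):
    """Fraction of total runtime saved when one part of the program gets faster
    (Amdahl's law, with the original total runtime normalised to 1.0)."""
    new_runtime = (1.0 - fraction_of_profile) + fraction_of_profile / local_speedup
    return 1.0 - new_runtime


# The case from the comment: an operation that is 1% of the profile, made 10000x faster.
cold = runtime_saved(0.01, 10_000)

# Hypothetical contrast: a hot path that is 90% of the profile, made only 2x faster.
hot = runtime_saved(0.90, 2)

print(f"1% of profile, 10000x faster: {cold:.4%} of runtime saved")
print(f"90% of profile, 2x faster:    {hot:.4%} of runtime saved")
```

The first case can never save more than the 1% the operation took up in the first place, no matter how large the local speedup is.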
I've heard this theory before. This isn't just about performance and I don't buy it.
I've seen too many examples of "this is just a temporary solution, so it doesn't matter". More than 3 years later that "temporary solution" was still there and at the heart of many operations, yet it's now too hard and too costly to fix.
I've also seen the "this is a quick hack, no one uses it, it doesn't go through any hot paths, all good". You know what happens? Years later, every service literally goes through it. Again, it's too hard to fix.
In the real world these "theories" are really loose. The only fix is that everyone should be aware of what they are doing and do it properly. The "it might not happen" mindset is dangerous.
And that's the problem. Who decides that? How do you know? That's my problem with it. Things always change. It's always temporary, not in the hot path, doesn't matter, etc., until it does.
So what is considered "doesn't impact" often comes back to bite.
If your profile shows you a hot path that's responsible for 90% of the time your program spends, any second optimising anything outside of it harms your performance, as it's a second spent on low ROI instead of high ROI.
It's much more niche to work on stuff where such changes actually matter; far more people write boring CRUD backends than write physics simulators and audio processing pipelines combined.
Understand the language, the memory model, etc. Don't do "it works on my machine". Understand the architecture, layout, implications etc.
E.g. if you need an int and not a long you should clearly use an int. Wait until you do this every time and things blow up and it's too "hard" to change.
It's called being aware of your actions. Take responsibility for what you do.
> It's much more niche to work on stuff where such changes actually matter,
Not true and that's why there's so much wastage.
A lot of things matter. I've seen, more often than not, that simple awareness and changes can pay for my salary. One example of the waste: not updating to newer EC2 instances when they get released in AWS. Even in a mid-size company, fixing that was hundreds to thousands in savings.
I've seen CI/CD pipelines where the developers never considered caching and it takes hours to run. It's not free. When every PR and update (hundreds a day) triggers a run it's a cost and a cost not just on machines but developer time waiting.
I can list a lot more examples and everyone in the chain can contribute.
This runs counter to most modern software performance principles. Thanks to modern hardware optimisations (cache hierarchy, ILP, branch prediction), modern compiler optimisations (aggressive inlining that leads to a much wider view), and increased concurrency, the notion of some action having a cost lost most of its meaning about 20 years ago, and increasingly so since. Because how fast some action is now depends on a much broader context of what else is going on in the program (and the machine), action X can be faster than Y in one program and the same or slower than Y in another.
Because it's nearly impossible to generalise (and so what was true in your previous program may not be true in your current one unless they're nearly identical), the advice is to first profile your program so that you know how fast or slow different parts are in the context of your particular program and then to focus the optimisation efforts on the hot paths in your program. Otherwise, you may end up spending effort where it makes no difference, and this comes at the cost of optimising what matters, overall harming performance.
Taking responsibility means being smart about directing your resources to where they can have the most impact.
We often used bit (not byte) fields, to convey information.
Made life challenging.
However, being able to be sloppy has its definite advantages. It takes a long time to design highly-optimized stuff. If just declaring a couple of new properties takes thirty seconds, and designing a bitfield takes an hour, then we have some real cost-savings, there.
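For anyone who hasn't worked with them, a minimal sketch of the bit-flag style described above. The flag names and helper functions are hypothetical, chosen only for illustration.

```python
# Hypothetical flags: each one occupies a single bit of one small integer.
ALIVE = 1 << 0    # 0b001
HOSTILE = 1 << 1  # 0b010
FLYING = 1 << 2   # 0b100


def set_flag(flags, flag):
    """Return flags with the given bit turned on."""
    return flags | flag


def clear_flag(flags, flag):
    """Return flags with the given bit turned off."""
    return flags & ~flag


def has_flag(flags, flag):
    """True if the given bit is on."""
    return (flags & flag) != 0


state = 0
state = set_flag(state, ALIVE)
state = set_flag(state, FLYING)
state = clear_flag(state, ALIVE)
print(bin(state), has_flag(state, FLYING), has_flag(state, ALIVE))
```

Three booleans fit in three bits this way, but every read and write now goes through masking, which is the design time that separate properties let you skip.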
That said, it's easy to get crazy, these days. I just spent a couple of days, chasing down greedy memory hogs. These were operations that ate gigabytes of memory. I determined that the real culprit was actually Apple MapKit, and figured out a simple workaround, but it took a long time to get there. If I suspect the OS, then it's usually my fault, and trying everything before going back to the OS takes time.
In this one case, allocating a MapView via storyboard, caused some kind of cascading strong reference stuff.
Simply allocating it programmatically, fixed it.
Took a while to get there, though.
Tedious, but there really wasn’t anything else I could do. Finding out about the programmatic solution was really just a wild guess.
Most of the bottlenecks I see are not due to the organization of data. Unnecessary communication of data is the #1 offender.
I already know I'm dealing with huge perf issues caused by ORM & lazy-load semantics. I/O abuse is usually going to be so, so much worse than memory/cache issues. Java is mainly used for business information systems, where I/O is king. Plain vanilla memory abuse is also a big one.
But my main problem is a management convinced the magic wand of AI will make all sorts of problems disappear, and it's going to take 5 years for them to realize nope.
It's still fun to learn about cache optimization though, esp. when someone makes it reasonably digestible like this. And maybe it also helps people to recognize that OOP is not some great over-arching zen truth of truths.
mild /s
I am sorry, I only know Odin. Jai is this cult on reddit/discord, right? You get access if you socialize enough or something? Not my thing. Not for a language.
I was just throwing out an idea. I had no idea there were already implementations! Because, to my knowledge, conventional popular languages like C/C++/C#/Java/JS/Python don't do that, and automatically doing that (under certain conditions) feels like an easy performance win.
[0] https://brevzin.github.io/c++/2025/05/02/soa/
The size of an ordinary cache is rows × ways × size(line), where rows = 2 ↑ num-idx-bits. For example, most Intel 64 and AMD 64 processors use log₂(size(page)) − log₂(size(line)) = 12 − 6 = 6 index bits for the L1 cache*, so an L1 cache with 8-way associativity is 64 sets × 8 lines/set × 64 bytes/line = 32 KB large, and an L1 cache with 12-way associativity is 64 × 12 × 64 = 48 KB large. I remember being surprised to learn that most processors have only 64 rows in the L1 cache!
*So that virtual indexes and physical indexes are identical (so that retrieval of the row can happen in parallel with TLB lookup).
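The same arithmetic as a small sketch, assuming 4 KiB pages and 64-byte lines as in the figures above; the function and constant names are made up for illustration.

```python
PAGE_BYTES = 4096  # assumed 4 KiB page, i.e. 12 page-offset bits
LINE_BYTES = 64    # assumed 64-byte cache line, i.e. 6 line-offset bits


def log2_exact(n):
    """log2 of a power of two."""
    return n.bit_length() - 1


def l1_cache_bytes(index_bits, ways, line_bytes=LINE_BYTES):
    """Cache size = sets x ways x line size, with sets = 2 ** index_bits."""
    sets = 2 ** index_bits
    return sets * ways * line_bytes


# Index bits that fit inside the page offset: 12 - 6 = 6, so 64 sets.
index_bits = log2_exact(PAGE_BYTES) - log2_exact(LINE_BYTES)

for ways in (8, 12):
    print(f"{ways}-way: {l1_cache_bytes(index_bits, ways) // 1024} KB")
```

With the index bits pinned to the page offset, the only way to grow such a cache is to add ways, which matches the 32 KB and 48 KB sizes above.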
When you are developing most other applications every byte does not matter. What matters much more is overall system architecture, collapsing unnecessary abstraction layers that some developers (especially java developers) seem to love and optimizing your datastore access.
As always, profile profile profile.
A company I worked for spent a violent couple of man-decades flipping our proprietary scripting language from interpreted to bytecode generation, obviously with tons of bugs and subtle semantic changes, and it ended up boosting overall system performance by about 30%. We could have done nothing over that period of time and hardware advances would have made a bigger impact.
Profiling important workloads matters. Without that everything else is guesswork.
To get the speed up, you can't just abstract it as an access pattern because it's tied to the specific way the memory is laid out.
If you were trying to make some kind of collection type that could be queried by both row and column, you would need to store it both ways at all times and also keep both representations in sync, which also defeats the purpose, somewhat.
I feel like if you're trying to do this pattern then it doesn't make sense to also keep the objects.
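A rough sketch of what keeping both representations in sync would look like. The `Monster` fields and the class name are hypothetical, and plain Python lists do not give you the cache behaviour being discussed, so this only illustrates the bookkeeping cost, not the speed-up.

```python
from dataclasses import dataclass


@dataclass
class Monster:
    hp: int
    alive: bool


class DualLayoutMonsters:
    """Keeps an array-of-structs view and a struct-of-arrays view of the same data."""

    def __init__(self):
        self.rows = []   # AoS: one Monster object per entry
        self.hp = []     # SoA: one list per field
        self.alive = []

    def append(self, hp, alive):
        # Every insert has to touch both representations.
        self.rows.append(Monster(hp, alive))
        self.hp.append(hp)
        self.alive.append(alive)

    def set_alive(self, index, alive):
        # Every update is written twice, or the two views drift apart.
        self.rows[index].alive = alive
        self.alive[index] = alive

    def count_alive(self):
        # Column-style query: scans a single field only.
        return sum(self.alive)


monsters = DualLayoutMonsters()
monsters.append(hp=10, alive=True)
monsters.append(hp=0, alive=True)
monsters.set_alive(1, False)
print(monsters.count_alive(), monsters.rows[1])
```

Each write is paid once per representation, and mutating a `Monster` directly through `rows` would silently break the column view.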
Heck, memory is cheap (fine, was cheap): give me a data structure that amortizes writes cleverly by maintaining both SoA and AoS at the same time.
How do you imagine it's possible to write to every SoA and every AoS and have that as cheap as only the first step?
I guess this is one reason why object-orientation has such a bad reputation.
I once worked at a bank where the OO mentor had taught people that the only object they needed was "Tape" and have them replicate the structure of data on the old spooled tape reels.
The struct of arrays reminds me of this optimization.
Andrew Kelley: A Practical Guide to Applying Data Oriented Design (DoD)
you should check these two talks out then.
However it requires additional hardware to recognize patterns which benefit from prefetching, and every time the CPU prefetches data which ends up not being used it has both burned energy and memory bandwidth, and evicted data from the cache which might be needed (cache pollution).
As TFA mentions, a CPU does some predictions about what cache lines to prefetch, e.g. when you do sequential reads. Moreover, the x86_64 instruction set provides a prefetch instruction through which you are able to give the CPU a hint "hey, I'm gonna be using this soon, prepare accordingly, pretty please".
Still, the utility of prefetching is diminished if you only use a single byte from each cache line, because the mechanism generally depends on you doing other work while the next cache line is being fetched. So really the best case scenario is to take as much time as possible to work with what is already fetched, so that there is time for the next unit of data to be fetched in the meantime.