I worked on a port of a Konami game from PSX to PC in 1999-2000. The code (C) had lots of #ifdefs where inline assembly was placed, with the original "C" code kept alongside. It all seemed to be done by one specific programmer, and what really saved us in porting the game was that originally kept "C" code. My MIPS knowledge was never as good as my x86.
So yes, it was the norm back then. My second job (1998) was on a team that was going to do some software for Intel for the then-upcoming Katmai processor (Pentium III, it was). It had all the new fancy SIMD instructions. The software was supposed to be something like a media composer - you slap images, rotate them, etc., all in realtime using software rendering (GPUs were still relatively expensive).
I wrote a bilinear and bicubic texture mapper with marching squares for transparent areas. It was all in assembly, and I spent lots of time optimizing it. Back then we used Intel's VTune, and it was super-precise (for the processors back then) - how instructions were going to pipeline, how many cycles they would (supposedly) take, waits, etc. That helped a lot!
But the real lesson was that me and my manager - both claiming to be super good at assembly (after our recent achievements) - rewrote in assembly the sphere mapping code for a game another team was writing, but alas our assembly code (with no Katmai instructions) was slower than what the compiler did ;) - TBH, if we had done proper mipmapping and texture swizzling we would've fared better either way, but hey, demo coders were not always to be found so they had to rely on regular programmers like us!
flipcode keeps a lot of good articles, with lots of good assembly for that - https://www.flipcode.com/archives/articles.shtml - there were even better materials from earlier years, but can't find them.
Turbo/Borland Pascal were so awesome, because they (somehow) made inline assembly much easier to use than C/C++ did - though you had to know which registers you could touch and which you couldn't.
It was always so disappointing, spending hours coding up a tightly wound assembly language version of some inner loop that uses half the instructions generated by the C++ compiler, only to find that your slaved-over version is actually 5% slower. But OTOH... the thrill when it actually was faster!
This was back in the Pentium 4 era, where there were deep pipelines and oddities like some simple ALU instructions (ADD, SUB, etc.) taking 0.5 cycles(!), while others (ADC, SHR) took 4 cycles IIRC.
> uses half the instructions generated by the C++ compiler
is there a tool that could profile/predict ahead of time, so that one does not attempt to hand write assembly before knowing for sure it will beat the compiled version?
There was Intel VTune, which I heard was good, though I haven't used it myself. One difficulty is that there are many non-obvious and hard-to-predict factors that interact to produce pipeline stalls. Instructions had specified throughputs and latencies (throughput being the number of cycles before another independent instruction of that type could be initiated; latency being the number of cycles before its output could be used by another instruction), but that was only part of the story. Was that memory read from L1 cache? L2? Main memory? Is this conditional branch predictable? Which of the several applicable execution units will this micro-op get sent to? There were also occasional performance cliffs (alternating memory reads that were exactly some particular power of 2 apart would alias in the cache, leading to worst-case cache behaviour; tight loops that did not begin on a 16-byte boundary would confuse the instruction prefetcher on some CPUs...)
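To make the power-of-2 aliasing cliff concrete, here is a minimal sketch in Python. The cache geometry (64-byte lines, 64 sets, 8 ways, i.e. 32 KiB) is an assumption picked for illustration, not a claim about any specific CPU mentioned in this thread, and `cache_set` / `sets_touched` are made-up helper names.

```python
# Hypothetical L1 geometry, assumed for illustration only:
# 64-byte lines * 64 sets * 8 ways = 32 KiB.
LINE_SIZE = 64
NUM_SETS = 64
WAYS = 8


def cache_set(addr):
    """Set index a byte address maps to in a simple set-associative cache."""
    return (addr // LINE_SIZE) % NUM_SETS


def sets_touched(stride, count, base=0):
    """Number of distinct sets used by `count` reads spaced `stride` bytes apart."""
    return len({cache_set(base + i * stride) for i in range(count)})


# 16 reads exactly 4096 bytes apart all land in the same set, so they
# fight over only WAYS (8) slots and keep evicting each other.
print(sets_touched(4096, 16))        # 1

# Padding the stride by one cache line spreads the same reads over 16 sets.
print(sets_touched(4096 + 64, 16))   # 16
```

This is why padding a row or a struct by one cache line sometimes "mysteriously" fixes a slow loop: the data volume is the same, only the mapping to sets changes.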
I may be getting x86 CPU generations mixed up. But having wrestled with all this, I can certainly see the appeal of hand-optimising for older, simpler CPUs like the 6510 used in the C64, where things were a lot more deterministic.
VTune still exists and has been free for a few years. Neat thing with VTune is that it has support for a few runtimes, so it understands for example CPython internals to the point that stack-traces can be a mixture of languages. That's something only becoming available just now outside of VTune, like Python 3.12 has some hooks for Linux perf.
Note that there's also the open-source uiCA [0], which similarly predicts scheduling and overall throughput for a basic block. Their benchmarks claim it to be more accurate than IACA and other tools for newer Intel CPUs, but I wouldn't be qualified to judge those claims.
Yup. Back in the day VTune was useful and good, then I haven't used it for more than 20 years. It might be still good, but knowing how much more complicated the current CPU architecture is, and how much I've lost touch with low-level assembly I don't know if it's going to be useful to me. I'm relying now on profiling, and other programmers that have become way better in this than me to hear their opinion, and use that as basis (or others on the web).
Most of the time, some good optimized library would do pretty good.
Depending on the architecture, this varies from trivial to very hard to mostly data dependent. llvm-mca might be of interest.
One should be able to do a best-case calculation, mostly assuming caches hit and branch prediction gets the answer right. Register renaming manages to stay out of the way.
Getting more dubious, there is a statistical representation of program performance on unknown (or partially known) data. One might be able to estimate that usefully, though I haven't seen it done.
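A rough sketch of the best-case calculation described above, in Python. It assumes every load hits cache, every branch is predicted, and the only limits are the loop-carried dependency chain and the issue width. The `Op` type, the function name, and all the latency numbers are hypothetical and not taken from any real CPU's tables; tools like llvm-mca or uiCA model far more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    latency: int            # cycles until the result can be used
    on_carried_chain: bool  # True if the next iteration waits on this op


def best_case_cycles_per_iter(ops, issue_width):
    """Lower bound on cycles per loop iteration under ideal conditions."""
    # The loop-carried chain cannot be overlapped across iterations.
    latency_bound = sum(op.latency for op in ops if op.on_carried_chain)
    # The front end cannot issue more than issue_width ops per cycle.
    issue_bound = len(ops) / issue_width
    return max(latency_bound, issue_bound)


# Hypothetical dot-product loop body: acc += a[i] * b[i]
dot_loop = [
    Op("load a[i]", latency=4, on_carried_chain=False),
    Op("load b[i]", latency=4, on_carried_chain=False),
    Op("mul", latency=4, on_carried_chain=False),
    Op("add acc", latency=4, on_carried_chain=True),
]

print(best_case_cycles_per_iter(dot_loop, issue_width=4))  # 4
```

With these made-up numbers the loop is bound by the accumulator's add latency, not by instruction count, which is one reason a hand-written version with fewer instructions can fail to be any faster.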
This! I would spend hours unwrapping loops etc, optimizing register use and then profile my amazing x86 version only to find the compiler had found a better route. But you're right -- on the times when your code came out 2X, 5X, 10X quicker than the compiler.. that was what you lived for :)
Thanks!!! Yes it was MGS - the deal was with our studio to do the port from PSX -> PC and since Microsoft was the publisher, part of the deal was for Konami to port Age Of Empires (1? 2?) to PSX.
You’re also my hero. I remember being like… 13 and seeing a copy of MG:S at the store… for PC! I will freely admit it, I hid the copy behind the most boring games on the shelf, and came back and bought it a week later, it was still safely hidden. I distinctly remember the clerk that I paid was quite irritated, almost as if they couldn’t find where I hid it so they could buy it themselves.
In any event, that game, and that port, were a part of my childhood that I will always remember.
It was Age Of Empires 2, I had an EU copy. I bet there are some interesting stories regarding that port, too.
AFAIK the user interface for the win9x version was drawn using native GDI windows library, and yet the PS2 version, using a completely different rendering framework and architecture, sported the very same appearance, font glitches and all. I wonder if they actually wrote an emulation layer around that.
The AoE2 PS2 version actually had half-decent USB keyboard and mouse support. Back then USB keyboard/mice were very uncommon (it was all PS/2) but when we tried it, it actually worked.
I missed that other game forums of similar vintage are now gone.
The way you could write Assembly in Borland products, and to a lesser extent in VC++, was great.
I always feel like vomiting when looking at the mess UNIX compilers have come up with for it, instead of inferring how registers are being used, and the whole string concatenation stuff.
well I've now realized I've used the terms I now know, and looking at what I've read back then - it was roto zoomers (lol) and it looks like it was the "block" texture way instead of what's considered now the term - swizzling - https://www.khronos.org/opengl/wiki/Texture#Swizzle_mask
Isn't that a pretty common tactic? Write everything in C and then at some point in the dev cycle start identifying functions/features that can be asm'd for performance reasons?
I was actually trying to find it - there were lots of .txt files published back then - and there was one about texture mapping from 1992... 1994? - and it explained swizzling and why it was efficient with caches.
So it's not as neat as swizzling (quickly looking at it) - but essentially the same goal - keep pixels that have to be drawn together close together (e.g. in blocks). Mipmapping helps too, as then you don't "skip" too many pixels - and you gain both better quality and fewer cache misses.
You might be thinking of the article written by Pascal of Cubic Team. I think the archive is called pasroto.zip and it explained why rotozoomer performance would tank when the texture was rotated by 90 degrees (since you're stepping in the Y direction and blowing out your cache for every pixel). Really interesting at the time.
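To illustrate the cache effect discussed here, a small Python sketch comparing a plain row-major texture with a "block" (tiled) layout. The texture width, 8x8 tile size, 1 byte per texel and 64-byte cache line are all assumptions for the example, and the function names are made up; this is not the layout from pasroto.zip or from any particular engine.

```python
WIDTH = 256      # texels per row, 1 byte per texel (assumed)
TILE = 8         # 8x8 tile = 64 bytes = one cache line (assumed)
LINE_SIZE = 64


def linear_offset(x, y):
    """Plain row-major layout."""
    return y * WIDTH + x


def tiled_offset(x, y):
    """Block layout: texels of each 8x8 tile are stored contiguously."""
    tiles_per_row = WIDTH // TILE
    tile_index = (y // TILE) * tiles_per_row + (x // TILE)
    return tile_index * TILE * TILE + (y % TILE) * TILE + (x % TILE)


def lines_touched(offset_fn, coords):
    """Distinct cache lines hit while sampling the given texel coordinates."""
    return len({offset_fn(x, y) // LINE_SIZE for x, y in coords})


row = [(x, 0) for x in range(64)]      # stepping in X (0 degree rotation)
column = [(0, y) for y in range(64)]   # stepping in Y (90 degree rotation)

print(lines_touched(linear_offset, row))     # 1
print(lines_touched(linear_offset, column))  # 64
print(lines_touched(tiled_offset, row))      # 8
print(lines_touched(tiled_offset, column))   # 8
```

The linear layout is ideal in one direction and worst-case in the other, while the tiled layout costs the same either way, which is the trade a rotozoomer wants since the stepping direction changes with the rotation angle.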
For those interested, the competitive AoE2 scene is alive and well with one of its biggest tournaments, NAC5, going on now, an in-person LAN in Berlin.
> A key speedup technique AoE used was realized during discussions I had with iD software programmer and optimization guru Michael Abrash over lunch at Tia's Mexican Restaurant in Mesquite, TX.
How many freeform interactions like this did we lose because of the Internet's illusion of being connected?
Well, yes, but I also see their point (even if I'm not sure I agree with it): by being forced to meet IRL, you're also forced to make real contact and seem more likely to form strong relations, and it is off-the-record by default so you can share more things
Of course, the (massive) counterpoint is that you get to talk to way fewer people, particularly if they're more than half an hour traveling away. Quantity versus quality, but by having a lot of quantity through more diverse online interactions, you can find the conversations that have a lot of quality for you (overlapping fields of work, hobbies, or just a personality match).
Which is better? Probably something in the middle, where you hang out in chat rooms but are also conscious of the advantages of arranging to meet up. I do find it inspirational (too strong a word, but you get the idea) to hear of other times or cultures where things are done differently
I don't think so. I know everybody loves to say that IRC is better than [insert commercial chat application] but in this particular use case discord is superior to irc imo because of the voice/video chat features and greater convenience.
Discord has a million deserted "servers" with redundant general channels. Peak IRC had a dozen or so large networks that often bundled a lot of e.g. adjacent FOSS projects. The chances of running into someone interesting a la the article anecdote was just higher.
Discord is a massive net negative for chatting on the internet because of this flaw imho.
This wasn't my experience with IRC. Our channel only ever had thirty people tops before being swallowed by discord. It's got the same discovery/accessibility issues that the Windows vs Linux issue has. Want Windows/Discord? Google Windows/Discord, first result, done. Want Linux/IRC? Weeeell, first you're going to have to find a distro/client. There's no singular trusted expert and everyone disagrees on what's best, so you better pick one and pray it was the right choice. Oh, you'll also have to configure it a bit, or a lot if you chose the wrong one.
The thing is, my experience is different, but of course similarly anecdotal. I essentially got my entire professional network and career on IRC, and that includes contacts all over the modern tech stack and adjacent interests, being on just two IRC "servers". And I could connect many other people in the same way.
On Discord, there seem to be just more barriers against this. Getting someone into a new place doesn't just require hopping into a new channel with a single /join command, but an entire new "server" with a new crowd. There's more inhibition against that.
People may say that Discord and similar will compensate, greatly, as the number of interactions can grow a lot. On the other hand, I don't think the experience is comparable to fully focusing on the person you're eating in person with.
Discord is a poor replacement. I think it allows a lot of broad connections without depth which sounds good on the surface but unfortunately real insights require a bit of digging but the conversation has already moved ahead in the chatroom. That's why old school forums were actually a better kind of a discussion board but discord being free and not requiring technical setup won.
> People may say that Discord and similar will compensate, greatly, as the number of interactions can grow a lot.
Are Discord discussions indexed by any public search engines? What about communities that are invite-only (without much actual reason to be so)? What about community admins who decide to take their whole thing down, communities that break site-wide rules and get removed by site admins, etc? Does Discord Inc. make any commitments towards publishing discussions that have archival/historical value?
How much knowledge is already irrecoverably stuck in Slack's bit bucket, as people flocked away to the next walled-garden chat app?
NEVER FORGET WHAT THEY HAVE TAKEN FROM US. WE WERE ONCE A CULTURE. NOW WE ARE LOST, FOREVER. WHITHER OUR SENSE OF BEAUTY. (etc., etc., insert architecture pics to taste)
C'mon dude. The opportunity for people to talk to one another about this stuff is unimaginably better than it was back then. Like, here we are, right now, me telling you you're full of shit. What are the chances of us being able to do that in 1999?
(I'm sorry to be mean, but I remember 1999 very well and it was much harder to get good information about things, and discuss things with others interested in the same topics, than it is today. And it was already markedly better then than it was even 5 years prior to that!)
The fact that it was hard to get information meant the pieces that did break out were infinitely better. The people you could discuss things with were very into the things you were discussing.
A fair comment, unreasonably deaded. I was mean, and I did apologise for it, and I meant it - but I still did it, and that reflects poorly on me.
But: if you're going to make the case for the internet making us worse off, "people used to communicate with one another... LOOK AT WHAT WE HAVE LOST" is the worst angle possible. Good lord. I just can't even comprehend it.
I had plenty of online conversations on Usenet, IRC and email with guys like Abrash and Jez San etc. It was a much smaller community back then and there was no gatekeeping -- everyone's contact info was out there and inbox zero was still achievable.
I get what you're saying. There are a lot of tinier discords and chat rooms where people post technical points and other people chime in. There are even websites with these small places.
The challenge of course is finding your way there. They're not exactly discoverable, and unlike with a job, it's usually through some pretty odd connections that you end up there.
Actually no, it's not "as simple as that" when everyone except you doesn't take lunch hour and schedules meetings at noon. We have a right to the lunch hour.
Then decline those meetings. Doesn't mean that other people should have to waste an hour of their day that they could better spend at home in the evening.
Well, in certain online games the balancing can't just stay static.
But update sizes really are quite insane. One Baldur's Gate 3 update had 100GB and of course that required 100GB additional free space. At that point it was easier to just reinstall the whole game.
The difference in quality of the game designers who worked on the original AoE2 versus AoE2:DE is pretty apparent. I think what is most annoying to me is how much harder it is to parse the visual information of "what kind of unit am I fighting" now that the graphics are "better".
As a game dev in the mid-90s patches had just started to appear; you usually got them from a cover disk as most still did not have Internet access. The thought that a title I published might need to be patched after release was something that horrified me.
This was in 1999. C++ compilers have come a long way since then. While there are still opportunities for hand-written asm to go an order of magnitude faster than C++, they're mostly around manual vectorization where the auto-vectorizer fails.
Even intrinsics didn't necessarily work well. MSVC, in particular, was really, really bad back then with SIMD intrinsics -- any use of MMX or SSE intrinsics in VC6 would result in more than two-thirds of the generated code being move instructions, with a single value sometimes moved two or three times between ALU instructions. It was trivial to beat the compiler with hand-written assembly. MMX intrinsics were never fixed and SSE intrinsics weren't fixed until VS2010.
For scalar code, it was more that the CPUs got better, as out-of-order execution starting with the Pentium Pro made instruction scheduling less important. The original Pentium CPU was an in-order design with two pipes where the second V pipe had significant restrictions, which was harder for compilers to deal with than the PPro/PII and its decoding pattern.
Yes, younger devs who grew up on the myth of C and C++ being always fast have missed the days when inline Assembly had a higher line count than pure C and C++ code.
I have seen applications for MS-DOS effectively using C as a Macro Assembler: only the data structures and high-level logic were C, used as if they were Macro Assembler macros.
> Yes, younger devs grown up on the myth of C and C++ being always fast, have missed the days when inline Assembly was a higher count than pure C and C++ code.
And still is in VLC. (Okay, maybe not higher, but they do use a crapton of assembly in their decoders, and it does speed them up by a factor of 10 or so today.)
Video decoding has always been a prime example for SIMD stuff, however I wonder how much of that code VLC devs could wipe out, assuming hardware video decoding were available everywhere.
Compilers beat hand written Assembly for the general use cases.
Now beating special use cases, like using vector instructions to parallel process video decoding streams is another matter.
It is no accident that after all the efforts improving Java and .NET JITs for auto-vectorization across various vendors, both platforms now expose SIMD intrinsics as well.
The choices and resulting codegen are fairly different. Only one of them works "properly" as of today. Though I'm open to be proven wrong once Panama vectors get stable in the Java land.
They will only be out of preview when Valhalla ships, as per roadmap.
Then there is the whole issue of when they will reach other implementations beyond OpenJDK, especially a very important alternative implementation running on many phones across the globe.
Nevertheless the need to be explicitly allowed to write vector code is there.
Thanks for the explanation. Aside from vectorization is there anything else that handwritten assembly could be better at? Assuming modern CPUs and modern compilers.
Sure, people who are good at assembly can often do register allocation and instruction selection better for small snippets of code. Or optimize based on guarantees the compiler can’t see or know about.
I think AOE was using DirectX, so I assume that was one level up the stack, ie figuring out which sprites are visible to which extent by walking different data structures and then just throwing stuff at directdraw.
DirectDraw wasn't really meant as a drawing toolkit; you _did_ have blits, but they were not hardware accelerated AFAIK and not nearly flexible enough for what the article suggests (mirror, stipple, probably other stuff). In general, what DirectDraw gave you was a rectangle you could draw pixels into, and a way to get those pixels efficiently to the screen. In other words, more like a clean abstraction over the display driver.
DirectDraw did accelerate blits, but only simple cases, and even then only if the hardware and drivers supported it. Hazy memory is that accelerated blit support was sketchy for pre-3D GPUs, especially stretchblts. It also had significant overhead for drawing small sprites. DirectDraw was critical for video players at the time due to being much faster than GDI and even sometimes Direct3D.
Yeah, sounds about right. And yes, obviously it was much faster than GDI for games, since you had more or less direct access to a framebuffer. Not building up some device-independent bitmap and going through some slow path to convert it into the right format for the GPU (which wasn't called a GPU back then, of course).
Even the oldest S3 VLB cards performed 8 bit copies 3-4x faster than the CPU. Pretty much every VGA chipset from around 1993-94 onward had a 2D accelerator; the most popular were IBM 8514 derivatives, like those from S3 or ATI.
AoE2DE still ships with hand-written assembly, though not part of the game code itself. The executable is initially encrypted by an executable packer. It unpacks the game code at runtime - some functions even on-demand.
Not sure why they do this, but this even leaves all code as RWX (readable, writable and executable) which is highly insecure.
I'm reading a history of Borland and the author claims that the Turbo Pascal compiler was mostly written in assembly and was also used in Delphi 1.0. No one in Borland could make significant changes in the code so eventually they rewrote it for Delphi 2.0.
Not sure if it was all true, but it's very fascinating. I think there is a certain, unique character in programmers who went through the fire and storm of writing software in assembly language for non-trivial CPUs (like Pentium and up).
We were just coming out of the age of assembler. The background blitting is all assembler, but the characters are Direct3D. When I put the 3D engine in there it was a 3D software renderer I had written when I was 13 in C++ and x86 and I just pasted it into the game engine, but I think we took the software renderer out and just used Direct3D's renderer. GPUs were just coming onto the market, and I think the hope was that it would get a boost from a GPU, although I don't remember ever testing it on one. (I do remember going to gaming expos and the card manufacturers would just hand out GPUs to every dev that went to their booths...)
A few years ago I wrote a naive SLP drawing routine and it was very slow, and left me wondering what exactly they did to make it usable in AOE2. So this makes a lot of sense to me.
Fun little format to mess with if you find the docs for it - surprisingly not that difficult to implement (there used to be a wiki with the spec on it, might be gone now?)