Last piece of unintentional art I have. Not much else looks like art.
I think broken 10bit packed SIMD caused this as well.

Some more unintentional art.
The diagonal patterns were caused by Daala's DC prediction trying to cope after a desync.

While working on video and audio compression I sometimes make unintentional art.
Here's one, generated by broken packed 10bit UYVY SIMD.

The only people I've seen writing aarch64 SIMD in the multimedia community do it to get paid and are somewhat reluctant to help you. I can fully understand why. In contrast, you'll get swarmed with suggestions if you need help with x86.

On top of the time you've spent writing the code, you now have to spend as many hours doing trial-and-error tweaking to get at least a 3x speedup. That did it for me, as all the fun I've had writing SIMD is gone.

The last complaints may seem trivial, but nothing is worse than this: everything is so much slower on aarch64. ld1, st1, fadd, fmul, fmla, etc. A first attempt at trivial SIMD that gives you a 4x or 8x speedup on x86 ends up giving you barely a 2x speedup.

Assemblers apparently have some inconsistencies. With GAS, on non-Android systems, v0.4s[0] is acceptable syntax, but Android requires v0.s[0].

Although unzip, transpose and revN instructions do somewhat compensate, they're still not as flexible as shufps and the numerous pack/unpack instructions x86 has.

Next, there's no shufps equivalent in aarch64. You can't shuffle 32 bit values with an immediate. You need a spare vector reg, in which to load values, one byte at a time. You waste registers, and more importantly, cycles. Makes writing FFTs beyond tedious.
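
To make the difference concrete, here is a rough scalar model in plain C, not real intrinsics or assembly. It assumes the spare-register route is a two-register tbl, and the function names (shufps_model, tbl2_model, tbl_indices_from_imm) are made up for illustration. The point is only that one 8-bit immediate on x86 turns into 16 byte indices that have to live in a register on aarch64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of x86 shufps: the 8-bit immediate holds four 2-bit lane selectors.
 * Result lanes 0-1 are picked from a, lanes 2-3 from b. */
void shufps_model(float dst[4], const float a[4], const float b[4], uint8_t imm)
{
    float out[4];
    out[0] = a[(imm >> 0) & 3];
    out[1] = a[(imm >> 2) & 3];
    out[2] = b[(imm >> 4) & 3];
    out[3] = b[(imm >> 6) & 3];
    memcpy(dst, out, sizeof(out));
}

/* Model of a two-register aarch64 tbl: each output byte is selected by a
 * byte index into the 32-byte table {a, b}; out-of-range indices give 0. */
void tbl2_model(uint8_t dst[16], const uint8_t table[32], const uint8_t idx[16])
{
    uint8_t out[16];
    for (int i = 0; i < 16; i++)
        out[i] = idx[i] < 32 ? table[idx[i]] : 0;
    memcpy(dst, out, sizeof(out));
}

/* Expand a shufps-style immediate into the 16 byte indices tbl needs.
 * This table is what ends up occupying the spare vector register. */
void tbl_indices_from_imm(uint8_t idx[16], uint8_t imm)
{
    for (int lane = 0; lane < 4; lane++) {
        int sel  = (imm >> (2 * lane)) & 3;
        int base = (lane < 2 ? 0 : 16) + 4 * sel; /* a is bytes 0-15, b is 16-31 */
        assert(base + 3 < 32);
        for (int byte = 0; byte < 4; byte++)
            idx[4 * lane + byte] = (uint8_t)(base + byte);
    }
}
```

With the immediate 0x1B, shufps_model picks a[3], a[2], b[1], b[0], and tbl_indices_from_imm produces the byte indices 12..15, 8..11, 20..23, 16..19 that make tbl2_model return the same bytes.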

Whilst on the topic of IO, only load/store instructions can do IO. fmla, fmul, fadd, etc. do not take a memory location, only vectors. You need to waste a register and an instruction to load.

Meanwhile, x86 gives you [base_reg + offset_reg*[1,2,4,8] + const_offset] for free, with both offsets being signed. If you're iterating backwards you can even save a comparison and a register for the final count! x86 is so much more usable just because of this.
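
One way to read the backwards-iteration trick, sketched in plain C. The function name scale_floats and its parameters are hypothetical, and what a compiler actually emits for this loop is not guaranteed. The idea is that the index is the only loop state: it counts down to zero, so the decrement itself can decide when to stop, and both arrays are addressed as base + index*4 without a separate end pointer or count.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] = src[i] * gain, walking from the last element to the first.
 * The single index i serves both arrays and doubles as the loop counter,
 * so no extra register holds an end pointer or a final count. */
void scale_floats(float *dst, const float *src, float gain, ptrdiff_t n)
{
    assert(n >= 0);
    ptrdiff_t i = n;
    while (i > 0) {
        i--;                      /* on x86 the flags from this can drive the branch */
        dst[i] = src[i] * gain;   /* maps onto [base + i*4] addressing */
    }
}
```

Calling scale_floats(d, s, 2.0f, 4) with s = {1, 2, 3, 4} leaves d = {2, 4, 6, 8}.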

Even assuming you live in a world where all your data is contiguous and stored incrementally, there's still the issue where loading and incrementing more than 4 vector regs affects pipelining, since each load has to wait for the previous one to finish.

You can't offset the pointer when addressing either, so you have to waste a register to load the pointer in, and increment it. A counterargument is that aarch64 has 32 whole registers. No, it has 23 registers. 1 is zero, 8 must be (partially) preserved.

An argument I've often heard is that "you can load to a vector register and increment a pointer at the same time!", which is somehow supposed to be better than x86 addressing. It isn't. You're only allowed to increment by either 0 or exactly the number of bytes you loaded.

I forced myself to do aarch64 SIMD a few months ago. Some people have been claiming its nice, clean ISA will win over the dirty x86 ISA. I'm sure they've never handwritten SIMD before. I'll elaborate.

To their credit, they do provide a semaphore so you can properly synchronize the inevitable copies.
DRM/VAAPI doesn't give you anything so you have no idea whether the image is half-done or complete. You get either tearing or occasionally slow explicit sync for VAAPI.

To NVIDIA's shame, their Vulkan->CUDA interop is one way, one copy. You must copy the imported Vulkan image/buffer to do anything useful with it, like encoding.

The driver recovers flawlessly after a 10 second hang. Why fix bugs when you can just crash the program with minimal interference?

Want to crash an Intel GPU on Linux? You can, with this small shader that does nothing.
