Perhaps the best bet would be to try to replicate the slowdowns in an emulator to figure out what's going on? For example, by artificially slowing down memory transfer speeds?
sure thing, but sadly none of the emulators is accurate performance-wise.
and the Dreamcast/NAOMI DMA part is still a black box: the HOLLY chipset has a number of DMAs - G1/ATA (cart/GD-drive), G2 (sound RAM), video RAM access, etc. most of these use the SH-4 CPU's DMA channel 0 in DDT mode (on-demand data transfer mode, where the channel is controlled by an external device, here HOLLY).
but games might start several different types of DMA requests at once, and the available docs don't mention any restrictions on this. so how does all of this work at the same time over the single DMA channel 0? we don't know. the HOLLY chipset probably schedules all the requests somehow, divides them into smaller transfers and pushes them to the SH-4's DDT in mixed order.
but that's just my guesswork...
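To make that guess a bit more concrete, here is a toy Python sketch of one way several pending requests could share a single channel: cut each into fixed-size chunks and feed them round-robin. Everything in it is an assumption for illustration only: the `interleave` helper, the 32-byte chunk size, the round-robin policy and the request lengths are invented, and nothing here is taken from HOLLY documentation or measured on hardware.

```python
from collections import deque

# Assumed chunk size, purely hypothetical (not from any HOLLY documentation).
CHUNK_BYTES = 32


def interleave(requests, chunk=CHUNK_BYTES):
    """Round-robin the pending requests, one chunk at a time.

    requests: list of (name, length_in_bytes) tuples.
    Returns the order in which (name, offset, size) chunks would be
    pushed to the single shared channel under this toy policy.
    """
    # Skip empty requests so no zero-sized chunk is ever emitted.
    pending = deque((name, 0, length) for name, length in requests if length > 0)
    order = []
    while pending:
        name, offset, length = pending.popleft()
        size = min(chunk, length - offset)
        order.append((name, offset, size))
        offset += size
        if offset < length:
            # Unfinished request goes to the back of the queue.
            pending.append((name, offset, length))
    return order


# Three made-up requests started "at once" (lengths are arbitrary).
schedule = interleave([("G1", 96), ("G2", 64), ("VRAM", 32)])
for name, offset, size in schedule:
    print(f"{name:5s} offset={offset:3d} size={size}")
```

An emulator that wanted to experiment with this would also have to charge some time per chunk, and the real arbitration (priorities, chunk sizes, ordering) is exactly the part that is unknown.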
there is also the 200MHz SH-4 CPU itself, whose effective speed may vary from something like a "50MHz" equivalent (with many cache misses, slow MMIO or memory accesses, etc.) up to a "400MHz" equivalent if everything runs from cache and the code is optimized for superscalar execution (2 instructions in parallel).
so the effective CPU speed may vary by a factor of about 8, and none of the emulators even tries to emulate this; they just run the CPU core at some fixed clock...
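The arithmetic behind those "50MHz" and "400MHz" figures can be sketched with a simple cycles-per-instruction model. This is only a back-of-the-envelope illustration: the `effective_mhz` helper and the stall numbers are made up to reproduce the two endpoints, not measured on a real SH-4.

```python
CLOCK_MHZ = 200.0  # SH-4 core clock in the Dreamcast/NAOMI


def effective_mhz(clock_mhz, base_cpi, stall_cycles_per_instr):
    """Instructions retired per microsecond, read as a "MHz equivalent".

    base_cpi: cycles per instruction with no stalls
              (0.5 means two instructions issue every cycle).
    stall_cycles_per_instr: average extra cycles lost per instruction
              to cache misses, slow MMIO, etc. (hypothetical values).
    """
    return clock_mhz / (base_cpi + stall_cycles_per_instr)


# Best case: everything in cache, perfect dual issue.
best = effective_mhz(CLOCK_MHZ, base_cpi=0.5, stall_cycles_per_instr=0.0)

# Bad case: single issue plus 3 stall cycles per instruction on average.
worst = effective_mhz(CLOCK_MHZ, base_cpi=1.0, stall_cycles_per_instr=3.0)

print(best, worst, best / worst)
```

With these made-up inputs the model gives 400 and 50, i.e. the factor of 8. An emulator that runs the core at one fixed rate effectively picks a single point in that range.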