Blog Personal.


Reply to Duoae

Comment #1:

Hi Urian, I read your post after seeing the traffic to mine (through Google Translate, so I may misunderstand something). Very interesting and informative. I’d love to know why you think my understanding is incorrect.

Well… I don’t want to turn this into a stupid war between you and me.

From my own perspective, I think you missed the concept of memory interleaving, which is intrinsic to the operation of GDDR. In DDR (as per your first generic example) the interleaving occurs on the controller on the DRAM chip and is transparent to the UMC. Interleaved memory *must* be symmetrical in capacity for it to work (if it did not, we would have different-capacity DDR chips on those sticks and more graphics cards would sport odd RAM capacity configurations, which they don’t).

For GDDR (as per the JEDEC standard) interleaving occurs across the controllers (usually 64 bits wide, covering 2 chips @16bitx4 or 32bitx2). In this scenario (and in my example) your «A» is the transparent memory controller linking to each individual 64-bit-wide chip controller. There is no UMC because «A» becomes the UMC/DMA controller on the APU. The Northbridge does not necessarily direct traffic like it does on a desktop system.

In your example system, each chip is operating separately (which would cause an access penalty), but even discarding that, «A» would not be able to grant the GPU access to both 280 GB/s paths whenever anything else in the system is accessing it, same as in my examples. The effective bandwidth will be reduced, and if you time-average the access width you will never achieve 560 GB/s or 336 GB/s because there will always be traffic through DMA, GPU, & CPU.

My post was about how the data bus to the GDDR6 is used, and it is deliberately a very simple post so that everyone can understand it. In other words, I wanted to explain where the 560 GB/s and 336 GB/s bandwidth figures come from in the clearest way possible without ignoring how the architecture really works.

In each and every Zen-based SoC, if you want to access the memory space shared by the CPU and the peripherals, you have to go through the Data Fabric, and the Data Fabric uses the UMC, which transfers 32 bytes per memclk cycle. In other words, even if your memory offers more bandwidth, you can’t go beyond that limit when you access memory from the CPU memory space.
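As a rough illustration of that limit, the sketch below multiplies a memory clock by the 32 bytes per cycle mentioned above. The 1.75 GHz memclk is my assumption (it is the value that yields the 56 GB/s figure used later in this post), and the function name is only illustrative.

```python
def umc_bandwidth_gbs(memclk_ghz: float, bytes_per_cycle: int = 32) -> float:
    """Peak bandwidth through the UMC in GB/s: memclk times bytes per cycle."""
    return memclk_ghz * bytes_per_cycle


# Assumed memclk of 1.75 GHz (not stated in the post).
print(umc_bandwidth_gbs(1.75))  # 56.0
```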

The problem that I see is the term «channel», which means two different things in DDRn and GDDR6 under the same name, and of course people are now very lost and confused.

In DDRn every channel is assigned to a different chip; in GDDR6 you can connect both channels to the same chip, meaning that in reality GDDR6 is a dual-ported RAM. This is a change compared to GDDR5, which is single channel and single ported.

Since GDDR6 is dual ported, you can have two different devices or groups of devices connected to the memory, each group with its own datapath to the memory.

The most logical configuration is to attach the UMC to one of the two channels, but that leaves the other channel free. I believe that Arden and Sparkman are names for different versions of the Scarlett Engine; I remember Komachi leaking something about the memory configuration of Sparkman that seemed interesting to me.

Well, it seems that we have non-coherent access to memory in Arden/Sparkman. This is the same situation that we have in the current-gen consoles based on GDDR5 memory, where we have two memory spaces and two different datapaths to memory: the first one goes across the UNB/Northbridge/Data Fabric, the second one goes directly from the GPU.

[Diagram: PS4 and PS4 Pro]
[Diagram: Xbox One X]

This is nothing new. GPUs for PC have two different datapaths: the traditional one to their local memory (non-coherent) and the one to system memory, which on PC goes through the PCI Express bus and in an SoC goes through the Fusion Compute Link that connects the GPU with the CPU Northbridge.

Of course, using the FCL to access memory for rendering is a bad idea, since it doesn’t provide enough bandwidth and could become a bottleneck. The problem is that the UMC is a bottleneck, since it can’t use all the bandwidth of the GDDR6 channel.

The first part of the solution? Give the second channel exclusively to the GPU. But 280 GB/s is a very low bandwidth… How do we solve it? Well, since we have two memory spaces, let’s give the GPU access to 10 GB of the total memory for itself in the non-coherent memory space.

The CPU and the other clients of the Northbridge/Data Fabric can only directly access the 6 GB assigned to the coherent space, and they always access that space at 56 GB/s using one of the two GDDR6 channels. The GPU can also access that space at the same time using the second channel at full rate (280 GB/s). This is an upgrade over the current generation, where the GPU has to use the same datapath as the CPU to access the GDDR5, which is single channel and single ported.
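The arithmetic behind these figures can be sketched as follows. This is only a back-of-the-envelope check: the 14 Gbps per-pin data rate and the 320-bit and 192-bit bus widths are assumptions on my part, and the function name is illustrative.

```python
def bus_bandwidth_gbs(bus_width_bits: int, gbps_per_pin: float = 14.0) -> float:
    """Peak bandwidth in GB/s for a bus of the given width in bits."""
    return bus_width_bits * gbps_per_pin / 8


# Assumed bus widths: 320 bits for the 10 GB region, 192 bits for the 6 GB region.
full_bus = bus_bandwidth_gbs(320)    # whole bus
narrow_bus = bus_bandwidth_gbs(192)  # only six chips
one_channel = full_bus / 2           # one of the two channels of every chip

print(full_bus, narrow_bus, one_channel)  # 560.0 336.0 280.0
```

Under these assumptions, a single channel of every chip is a 160-bit path, which is where the 280 GB/s figure comes from.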

When the GPU needs to read the command list from the CPU, it reads it from the coherent space.

The 10 GB part is exclusively for the GPU and a few accelerators such as the Display Engines, Video Codec… The CPU can’t access it directly, so you need a DMA unit making copies from one space to the other. Since the SSD is in the coherent space, the system needs to copy the data from one space to the other when the GPU needs it.

When the Data Fabric/Northbridge wants to access memory and has translated the virtual address into a physical address, component «A» reads the physical address to know where the data should be read or written. In other words, for the first channel we have a bridge that gives access to the UMC or the GPU based on the physical memory that we want to access.
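One way to picture that bridge is as a simple address decoder. The sketch below is only illustrative: the address map (a 10 GB non-coherent region followed by a 6 GB coherent region) and the names `route`, `NON_COHERENT_BASE`, and `COHERENT_BASE` are hypothetical, since the real physical layout is not given here.

```python
GIB = 1 << 30

# Hypothetical physical address map (the real layout is not known).
NON_COHERENT_BASE = 0
NON_COHERENT_SIZE = 10 * GIB
COHERENT_BASE = NON_COHERENT_BASE + NON_COHERENT_SIZE
COHERENT_SIZE = 6 * GIB


def route(physical_address: int) -> str:
    """Return the path the bridge would select for a physical address."""
    if COHERENT_BASE <= physical_address < COHERENT_BASE + COHERENT_SIZE:
        return "UMC"  # coherent space, reached through the Data Fabric
    if NON_COHERENT_BASE <= physical_address < NON_COHERENT_BASE + NON_COHERENT_SIZE:
        return "GPU"  # non-coherent space, direct GPU path
    raise ValueError("address outside the 16 GB of memory")


print(route(0))         # GPU
print(route(12 * GIB))  # UMC
```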

I gave that bridge the name «A»…

That is all. I know it is very simplistic, but it is a limited explanation of both bandwidths in the Xbox Series X.

6 Comments
Dani

Now I have another question.

I understand that when the GPU accesses the optimal memory space it has a 320-bit bus and a bandwidth of 560 GB/s. But when it accesses the other 3.5 GB, it no longer has a 320-bit bus but a 192-bit one (6 GDDR6 modules), so… would the maximum bandwidth for the GPU be 336 GB/s if it accesses memory alone and 168 GB/s if it can only use one channel?

Dani

Of course, I love them. What I don’t reach is 100% understanding. Judging by your reply, it seems I fell well short of that 100%.

To be honest, I don’t understand what nonsense I said.

Dani

OK, I see it now. In my mental picture I had a different access path to the 2 GB modules. Never mind.

nolgan

Great as always, Urian. This is one of the two articles I wanted to read, even if I read it with a translator, although it doesn’t clear up some of my doubts.

PS: A RECOMMENDATION: does WordPress let you run a dual blog… in English and Spanish? Because if it does, being able to post your articles in English too would interest you. The English-speaking community is bigger and also demands these articles, maybe even more than the Spanish one. That way that community wouldn’t have to translate, and they would understand it better.

Thanks

Duoae

Thanks for the very detailed and clear explanation. I didn’t think we were getting into a fight? I just wanted to understand your points better. It doesn’t help me when people just say «you’re wrong» (as many people have done! 🙂). I see your diagrams and I can see nothing wrong with them. They are correct! The issue I had with your previous explanation was that it appeared that the bandwidth across a chip pin (16-bit) was counted twice. When the GPU has to access the 280 GB/s bandwidth through the UNB it is using the…