
Idea for Converter functions #1777

Open
jdmsolarius opened this issue Dec 24, 2024 · 3 comments

Comments

@jdmsolarius

jdmsolarius commented Dec 24, 2024

Is your feature request related to a problem? Please describe

The .ToBitmap code has a few issues, and I offer a potential solution. On a roughly 2-2.5 year old 12-core AMD CPU, the change I propose takes the conversion from 18 milliseconds down to ~1.9 milliseconds.

The two problems with the existing code are:

1. This is a highly parallel problem, but the current loop runs serially.
2. Creating several thousand row buffers is time consuming (assuming a 16-bit image).

Solution:

We already have a function that gives us a pointer to an area, so we should use that pointer instead of the buffers. In the RGB case we can also go two pixels at a time, packing six bytes into a ulong and assigning two pixels with a single store.

    using var pixels = image.GetPixelsUnsafe();
    var mapping = GetMapping(format);

    var bitmap = new Bitmap(image.Width, image.Height, format);
    // First pain point: no parallelization for a very parallel problem.
    for (var y = 0; y < image.Height; y++)
    {
        var row = new Rectangle(0, y, image.Width, 1);
        var data = bitmap.LockBits(row, ImageLockMode.WriteOnly, format);
        var destination = data.Scan0;

        // Second pain point: instead of getting a pointer to an area,
        // we materialize a buffer for every row.
        var bytes = pixels.ToByteArray(0, y, image.Width, 1, mapping);
        if (bytes is not null)
            Marshal.Copy(bytes, 0, destination, bytes.Length);

        bitmap.UnlockBits(data);
    }

    SetBitmapDensity(self, bitmap, useDensity);
    return bitmap;
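The two-pixels-per-store trick mentioned above can be seen in isolation. This is a standalone sketch (not from the library; names are invented here) that packs two 8-bit gray values into the six BGR bytes of a single ulong; on a little-endian machine the low three bytes are pixel 1.

```csharp
using System;

// Hypothetical demo (not Magick.NET code): replicate two gray bytes into
// six BGR bytes and pack them into one ulong for a single 8-byte store.
class PackDemo
{
    // Little-endian layout: bytes 0-2 = pixel 1 (B,G,R), bytes 3-5 = pixel 2.
    public static ulong PackTwoGrayPixelsBgr(byte g1, byte g2) =>
        ((ulong)g2 << 40) | ((ulong)g2 << 32) | ((ulong)g2 << 24) |
        ((ulong)g1 << 16) | ((ulong)g1 << 8) | g1;

    static void Main()
    {
        ulong packed = PackTwoGrayPixelsBgr(0x11, 0x22);
        byte[] bytes = BitConverter.GetBytes(packed); // little-endian on x86/x64
        // First three bytes belong to pixel 1, next three to pixel 2;
        // the top two bytes of the ulong are zero and get overwritten
        // by the next pair's store in the real loop.
        Console.WriteLine(BitConverter.ToString(bytes, 0, 6)); // 11-11-11-22-22-22
    }
}
```

Note that the store writes 8 bytes while only 6 are valid, which is why the real loop has to stop early enough that the 2-byte spill stays inside the row.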

Describe the solution you'd like
This is the code I use for images at the lab. Usually we don't have an alpha channel, but I wrote code that accounts for both 32bppRgb and 24bppRgb, so I go through and assign two pixels per ulong store. The biggest gains, though, are from parallel processing and from not materializing buffers (we never call .ToByteArray). Conceptually it's very simple: we just get a pointer to the unsafe pixel collection and start iterating.


This code works on a variety of images. I acknowledge it doesn't have all the error checking, and SetBitmapDensity isn't called, but the general idea is sound and could easily be applied to several other slower functions.

    public static unsafe Bitmap ToBitmapFast<TQuantumType>(this IMagickImage<TQuantumType> self)
        where TQuantumType : struct, IConvertible
    {
        IMagickImage<TQuantumType> image = self;

        PixelFormat format = self.HasAlpha ? PixelFormat.Format32bppArgb : PixelFormat.Format24bppRgb;
        string mapping = format == PixelFormat.Format24bppRgb ? "BGR" : "BGRA";

        int height = image.Height;
        int width = image.Width;
        using IUnsafePixelCollection<TQuantumType> pixels = image.GetPixelsUnsafe();
        Bitmap bitmap = new Bitmap(width, height, format);

        // Lock the entire bitmap once instead of once per row
        BitmapData data = bitmap.LockBits(
            new Rectangle(0, 0, width, height),
            ImageLockMode.WriteOnly,
            format);

        try
        {
            // Get the source pointer for the entire image (16-bit quantum data)
            nint pointer = pixels.GetAreaPointer(0, 0, width, height);
            if (pointer == 0)
                throw new InvalidOperationException("Invalid source pointer for the entire image.");

            ushort* srcPtr = (ushort*)pointer; // source is ushort quanta
            byte* destPtr = (byte*)data.Scan0; // destination is bytes

            int destinationStride = data.Stride;
            int channels = mapping.Length;
            // The BGR path assumes one gray quantum per source pixel (our lab
            // images are 16-bit grayscale); the BGRA path assumes four quanta.
            int srcChannels = channels == 4 ? 4 : 1;

            Parallel.For(0, height, row =>
            {
                ushort* srcRowPtr = srcPtr + row * width * srcChannels;
                byte* destRowPtr = destPtr + row * destinationStride;

                if (channels == 3) // BGR
                {
                    // Each ulong store writes 8 bytes but only 6 are valid (two
                    // BGR pixels), so the paired loop must stop while the 2-byte
                    // spill still lands inside this row's stride.
                    int col = 0;
                    for (; col + 1 < width && col * 3 + 8 <= destinationStride; col += 2)
                    {
                        // Load two ushort values and scale them down to bytes
                        byte normalizedValue1 = (byte)(srcRowPtr[col] >> 8);
                        byte normalizedValue2 = (byte)(srcRowPtr[col + 1] >> 8);

                        // Pack two BGR pixels into one ulong (little-endian:
                        // byte 0 is pixel 1 Blue, byte 5 is pixel 2 Red)
                        ulong packedPixels = ((ulong)normalizedValue2 << 40) | // pixel 2 - Red
                                             ((ulong)normalizedValue2 << 32) | // pixel 2 - Green
                                             ((ulong)normalizedValue2 << 24) | // pixel 2 - Blue
                                             ((ulong)normalizedValue1 << 16) | // pixel 1 - Red
                                             ((ulong)normalizedValue1 << 8)  | // pixel 1 - Green
                                             normalizedValue1;                 // pixel 1 - Blue

                        // Write both pixels with a single store
                        *(ulong*)(destRowPtr + col * 3) = packedPixels;
                    }

                    // Scalar tail: odd widths, plus any pair the 8-byte store could not cover
                    for (; col < width; col++)
                    {
                        byte normalizedValue = (byte)(srcRowPtr[col] >> 8);
                        destRowPtr[col * 3 + 0] = normalizedValue; // Blue
                        destRowPtr[col * 3 + 1] = normalizedValue; // Green
                        destRowPtr[col * 3 + 2] = normalizedValue; // Red
                    }
                }
                else // BGRA
                {
                    for (int col = 0; col < width; col++)
                    {
                        // Load four ushort values for one pixel and scale each to a byte
                        byte b = (byte)(srcRowPtr[col * 4 + 0] >> 8); // Blue
                        byte g = (byte)(srcRowPtr[col * 4 + 1] >> 8); // Green
                        byte r = (byte)(srcRowPtr[col * 4 + 2] >> 8); // Red
                        byte a = (byte)(srcRowPtr[col * 4 + 3] >> 8); // Alpha

                        // Pack BGRA and write the pixel as a single 32-bit store
                        uint packedPixel = ((uint)a << 24) | ((uint)r << 16) | ((uint)g << 8) | b;
                        *(uint*)(destRowPtr + col * 4) = packedPixel;
                    }
                }
            });
        }
        finally
        {
            // Unlock the bitmap
            bitmap.UnlockBits(data);
        }
        return bitmap;
    }

Describe alternatives you've considered

I tried to avoid using pointers and just run the code in parallel with a single buffer. The speed increase is only marginal; the problem is that we still need to call ToByteArray() and materialize an enormous byte array with millions of bytes, when we should really just be pointing at the bytes.

Additional context

This is something I actually use in production code and I would love to see instantaneous conversion as part of the library. Happy to contribute more if necessary.

@dlemstra
Owner

dlemstra commented Dec 24, 2024

Do you get the same performance boost without the parallel loop? I don't think I want to add that in this spot, but I could also make it an optional argument. I wonder what the performance boost would be without the parallel for loop.

@jdmsolarius
Author

jdmsolarius commented Dec 24, 2024

I do get a large performance boost without the parallel loop: it runs in 5 milliseconds without Parallel and 1-2 milliseconds with Parallel. So roughly half of the benefit seems to come from parallelization and half from getting rid of the buffers, perhaps 60-40%, given that the old version is ~12 milliseconds on my computer. To give another example of what I am talking about, we have leveraged `nint pointer = pixels.GetAreaPointer(0, 0, width, height);` to enormous benefit when leveling images. These numbers are for 16-bit TIFFs.

I suspect that there are a large number of cases where the overhead of calling the C library (switching contexts) is more expensive than just manipulating the memory in C#. These could be sped up even further by vectorizing the mathematics and leveraging System.Numerics.Vector.
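The System.Numerics.Vector idea can be sketched in isolation. This is a hypothetical helper (names invented here, not part of Magick.NET) that vectorizes only the clamp step of leveling over a raw ushort[] buffer; the in-between scaling step is deliberately left out to keep the sketch short.

```csharp
using System;
using System.Numerics;

// Hypothetical sketch: clamp pixels at or above the whitepoint to 65535 and
// at or below the blackpoint to 0, Vector<ushort>.Count pixels at a time.
static class LevelingVectorSketch
{
    public static void ClampLevels(ushort[] pixels, ushort blackpoint, ushort whitepoint)
    {
        var black = new Vector<ushort>(blackpoint);
        var white = new Vector<ushort>(whitepoint);
        var max = new Vector<ushort>(ushort.MaxValue);
        int i = 0;

        // Vector loop: 16 pixels per iteration on an AVX2 machine.
        for (; i <= pixels.Length - Vector<ushort>.Count; i += Vector<ushort>.Count)
        {
            var v = new Vector<ushort>(pixels, i);
            v = Vector.ConditionalSelect(Vector.GreaterThanOrEqual(v, white), max, v);
            v = Vector.ConditionalSelect(Vector.LessThanOrEqual(v, black), Vector<ushort>.Zero, v);
            v.CopyTo(pixels, i);
        }

        // Scalar tail for the remaining pixels.
        for (; i < pixels.Length; i++)
        {
            if (pixels[i] >= whitepoint) pixels[i] = ushort.MaxValue;
            else if (pixels[i] <= blackpoint) pixels[i] = 0;
        }
    }

    static void Main()
    {
        ushort[] px = { 0, 500, 30000, 60000, 65535 };
        ClampLevels(px, blackpoint: 1000, whitepoint: 50000);
        Console.WriteLine(string.Join(", ", px)); // 0, 0, 30000, 65535, 65535
    }
}
```

The scale step between the two thresholds would need widening to 32-bit lanes before multiplying, which is why it is omitted here.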

The function below runs at quadruple the speed of the current leveling function without Parallel. In parallel, it takes less than a second on an 8-core machine to level an image.

Example of our Leveling Extension

    public static unsafe void LevelingFast<TQuantumType>(
        this IMagickImage<TQuantumType> self,
        ushort blackpoint, ushort whitepoint)
        where TQuantumType : struct, IConvertible
    {
        // Guard: the level range must be non-empty
        if (whitepoint <= blackpoint)
            throw new ArgumentException("whitepoint must be greater than blackpoint.");

        IMagickImage<TQuantumType> image = self;

        int height = image.Height;
        int width = image.Width;

        using IUnsafePixelCollection<TQuantumType> pixels = image.GetPixelsUnsafe();
        nint pointer = pixels.GetAreaPointer(0, 0, width, height);

        if (pointer == 0)
            throw new InvalidOperationException("Invalid pointer for the entire image.");

        ushort* srcPtr = (ushort*)pointer;

        ushort maxVal = ushort.MaxValue; // 16-bit max
        double reciprocal = 1.0 / (whitepoint - blackpoint);

        Parallel.For(0, height, y =>
        {
            ushort* srcRowPtr = srcPtr + (y * width);

            for (int x = 0; x < width; x++)
            {
                ushort pixelValue = srcRowPtr[x];

                if (pixelValue >= whitepoint)
                {
                    pixelValue = maxVal;
                }
                else if (pixelValue <= blackpoint)
                {
                    pixelValue = 0;
                }
                else
                {
                    // Map (blackpoint, whitepoint) onto (0, 65535) with rounding
                    double normalized = (pixelValue - blackpoint) * reciprocal;
                    double scaled = normalized * maxVal + 0.5;
                    if (scaled > maxVal)
                        scaled = maxVal;

                    pixelValue = (ushort)scaled;
                }

                srcRowPtr[x] = pixelValue;
            }
        });
    }

@jdmsolarius
Author

Here is what I think is happening for simple functions (like leveling), which explains why the C# code is so much faster even though it's doing more or less the same thing.

  1. Call Setup (Managed → Unmanaged)
    Parameter Marshaling
    The .NET runtime needs to ensure that function arguments are in a format/layout that the native function expects. For simple types (like int, float, or pointers), this may be minimal. For arrays, structures, or strings, the runtime might need to copy or pin those objects in memory so the garbage collector won’t move them during the call.
    Calling Convention
    The runtime must set up the stack frame according to the C function’s calling convention (e.g., __cdecl, __stdcall). This includes placing parameters on the stack or in CPU registers as required by the target platform.
    Managed State Preservation
    The runtime keeps track of which objects are pinned and performs any required housekeeping to pause or adjust garbage collection so that it doesn’t interfere with the native call.

  2. Native Execution and lack of a JIT
    Once the call has been set up, control passes to the compiled C function in unmanaged memory. This part (running the C code itself) is usually very fast, provided the library function is optimized and compiled for performance.
    Just-in-time compilers can make optimizations that a C compiler cannot. Calling a compiled C function a million times in a loop will not improve the performance of that function; the same is not true under a JIT. For example, if the JIT sees only 16-bit TIFFs in a loop, it can optimize for that particular case.

  3. Return Transition (Unmanaged → Managed)
    Result Handling
    When the C function returns, the result (whether it is an integer, pointer, or complex structure) may need to be converted back into a managed form or pinned memory may need to be unpinned.
    Garbage Collector Awareness
    The runtime re-establishes the managed environment fully (unpins objects, potentially resumes GC operations) before continuing with the next managed instruction in C#.

  4. Why This Overhead Matters in Tight Loops
    If you’re calling a native function once or twice, the P/Invoke overhead is typically negligible.
    However, if you’re in an inner loop iterating over millions of pixels, repeatedly calling a C function thousands (or millions) of times, that overhead adds up fast.
    Each call has a “fixed cost” that does not shrink with function complexity. Even if your C function only does a small amount of work, the overhead of getting to that function and returning might be more expensive than the operation itself.

While calling C code is not as expensive as a full context switch, in many cases the overhead of doing so may be more expensive than the operation itself. I have no doubt that if I ran the C code directly against my image it would be just as fast as (or faster than) my C# code, but in a tight loop the JIT is able to optimize for specific cases.
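The fixed-cost point can be demonstrated without any native library at all: even a managed bulk-copy API shows the same shape when it is called once per element versus once per buffer. This micro-benchmark is illustrative only (names invented here, not from the issue), using Marshal.Copy as a stand-in for a per-pixel boundary crossing.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

// Illustrative micro-benchmark: move the same bytes into native memory
// once per element and then in one bulk call. The per-element version
// pays the call's fixed cost a million times.
static class FixedCostDemo
{
    public static (long perElementMs, long bulkMs) Run(int n)
    {
        byte[] managed = new byte[n];
        nint native = Marshal.AllocHGlobal(n);
        try
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < n; i++)
                Marshal.Copy(managed, i, native + i, 1); // one call per byte
            long perElement = sw.ElapsedMilliseconds;

            sw.Restart();
            Marshal.Copy(managed, 0, native, n);          // one call total
            long bulk = sw.ElapsedMilliseconds;
            return (perElement, bulk);
        }
        finally
        {
            Marshal.FreeHGlobal(native);
        }
    }

    static void Main()
    {
        var (per, bulk) = Run(1_000_000);
        Console.WriteLine($"per-element: {per} ms, bulk: {bulk} ms");
    }
}
```

Marshal.Copy is not a P/Invoke into a user library, so the gap here understates the real interop cost; it only shows how a fixed per-call price dominates when the per-call work is tiny.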
