Analysing large images with the Dask library

Mary Adewunmi
3 min read · May 31, 2021
Photo by National Cancer Institute on Unsplash

This is a short tutorial on how to load large images with Dask, a Python library for parallel computing.

I decided to write this tutorial when I saw that there was no straightforward guide to loading a large image dataset without the kernel crashing or running out of memory.

📌 Why did I choose to use Dask?

Dask enables the natural scaling of Pandas, Scikit-Learn, and NumPy workflows with little rewriting when dealing with large datasets on a single computer. It works effectively with these tools, copying the majority of their APIs and internal data structures. Furthermore, Dask collaborates with these libraries to ensure that they evolve in a consistent manner, reducing friction when moving from a local laptop to a multi-core workstation, and finally to a distributed cluster.

The full code can be found here.

📌Setting up the environment

1. Install the necessary libraries.

2. Create a temporary directory for temporary files, in order to keep intermediate data from large images on disk instead of in system memory.
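A minimal sketch of step 2, assuming Dask was installed in step 1 (for example with `pip install "dask[array]"`). The directory prefix `dask-scratch-` is a hypothetical name chosen for illustration, and this is one possible setup rather than the exact code from the original notebook.

```python
import os
import tempfile

import dask

# Hypothetical scratch location; any directory on a disk with free space works.
tmp_dir = tempfile.mkdtemp(prefix="dask-scratch-")

# Point Dask at this directory for the temporary files it writes.
dask.config.set({"temporary_directory": tmp_dir})

print("Dask temporary directory:", dask.config.get("temporary_directory"))
```

Setting the directory once at the top of the notebook keeps the rest of the code unchanged if the scratch location later moves to a larger disk.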

3. Import the dataset.

The dataset can be downloaded from https://www.kaggle.com/c/siim-isic-melanoma-classification/data

It was made publicly available by Kaggle and provided by the International Skin Imaging Collaboration (ISIC), an international initiative to improve melanoma diagnosis that is funded by the International Society for Digital Imaging of the Skin (ISDIS). The ISIC Archive houses the world's largest collection of high-resolution dermoscopic photographs of skin lesions. Contributors to the images include:

1. Dermatology Service, Melanoma Unit, Hospital Clínic de Barcelona, IDIBAPS, Universitat de Barcelona, Barcelona, Spain
2. Memorial Sloan Kettering Cancer Center, New York, NY
3. Department of Dermatology, Medical University of Vienna, Vienna, Austria
4. Melanoma Institute Australia, Sydney, Australia
5. The University of Queensland, Brisbane, Australia
6. Department of Dermatology, University of Athens Medical School

It has 9 classes of skin diseases: pigmented benign keratosis, melanoma, vascular lesion, actinic keratosis, squamous cell carcinoma, basal cell carcinoma, seborrheic keratosis, dermatofibroma and nevus.

The loading code divides the dataset into chunks and saves them in the temporary directory that we created earlier.
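The idea of loading in chunks and saving them to disk can be sketched as follows. This is an illustration under stated assumptions, not the original notebook code: small synthetic `.npy` files stand in for the real JPEG images, the names `work_dir` and `chunk_dir` are hypothetical, and the example creates its own temporary directory so that it runs on its own. With real images, `np.load` would be swapped for an image reader such as `skimage.io.imread`.

```python
import glob
import os
import tempfile

import dask
import dask.array as da
import numpy as np

# Hypothetical stand-in data: four small RGB "images" saved as .npy files.
work_dir = tempfile.mkdtemp(prefix="dask-images-")
height, width = 64, 48
rng = np.random.default_rng(0)
for i in range(4):
    image = rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)
    np.save(os.path.join(work_dir, f"img_{i}.npy"), image)

paths = sorted(glob.glob(os.path.join(work_dir, "img_*.npy")))

# Wrap the reader so that no file is opened until a result is requested.
lazy_load = dask.delayed(np.load)
lazy_arrays = [
    da.from_delayed(lazy_load(p), shape=(height, width, 3), dtype=np.uint8)
    for p in paths
]

# Stack into one (n_images, height, width, 3) array; each image is one chunk.
stack = da.stack(lazy_arrays, axis=0)

# Write the chunks to disk, one .npy file per chunk along axis 0.
chunk_dir = os.path.join(work_dir, "chunks")
da.to_npy_stack(chunk_dir, stack, axis=0)

# Reopen the saved chunks lazily, without reading them all into memory.
reloaded = da.from_npy_stack(chunk_dir)
print(reloaded.shape, reloaded.chunks)
```

Because each image is its own chunk, only the chunks needed for a given computation are read, which is what keeps memory use bounded on a large dataset.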

Images loaded with Dask can also be preprocessed using Dask's temporary storage: they can be converted to grayscale for distinct pixel information, augmented to enlarge the dataset, segmented, and filtered.
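As a hedged illustration of the grayscale step, the sketch below applies a weighted sum of the colour channels to a chunked array. The random input array and the helper name `to_grayscale` are assumptions made for the example; the weights are the luminance coefficients that scikit-image's `rgb2gray` uses.

```python
import dask.array as da
import numpy as np

# Hypothetical stand-in for a loaded image stack: 8 RGB images of 32x32 pixels,
# chunked two images at a time.
rng = np.random.default_rng(1)
rgb_np = rng.integers(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)
rgb = da.from_array(rgb_np, chunks=(2, 32, 32, 3))


def to_grayscale(images):
    """Weighted sum of the R, G and B channels stored on the last axis."""
    r, g, b = images[..., 0], images[..., 1], images[..., 2]
    return 0.2125 * r + 0.7154 * g + 0.0721 * b


gray = to_grayscale(rgb)   # still lazy: only the task graph is built here
result = gray.compute()    # the work runs chunk by chunk at this point
print(gray.shape, gray.chunks[0])
```

The same function works on a plain NumPy array, which makes it easy to check the Dask result against a small in-memory sample.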

Another beautiful thing about Dask is that, with a few lines of code, you can visualize the computation graph of your images and inspect how they are being preprocessed, along with the array shape, chunk size and data type. This is very useful for debugging when there is an error in the graph.
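A small sketch of this kind of inspection, assuming a lazily created array as a stand-in for a large image (the array size and tile size are hypothetical). Rendering the graph as a picture with `.visualize()` needs the optional graphviz dependency, so that call is left commented out.

```python
import dask.array as da

# Hypothetical stand-in for a large image: a lazy 4096x4096 RGB array
# split into 1024x1024 tiles. No pixel data is computed at this point.
image = da.zeros((4096, 4096, 3), dtype="uint8", chunks=(1024, 1024, 3))
processed = image.astype("float32") / 255.0

print("shape:     ", processed.shape)
print("chunk size:", processed.chunksize)
print("blocks:    ", processed.numblocks)
print("dtype:     ", processed.dtype)
print("graph keys:", len(processed.__dask_graph__()))

# Rendering the graph to a file requires graphviz to be installed:
# processed.visualize(filename="graph.svg")
```

Checking these attributes before calling `.compute()` is a cheap way to catch an unexpected chunk layout or data type early.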

📌Conclusion

Dask is a powerful tool in the hands of a machine learning analyst, especially when large image datasets are involved. It is cheap to use compared to cloud notebooks, and it can also be used with notebooks hosted on cloud servers, such as Jupyter Notebook, Saturn Cloud and the like.

📌 References

👉Photo by National Cancer Institute on Unsplash

👉 Datasets can be found here

👉Dask documentation

Any feedback or constructive criticism is welcome.

You can find me here.

Happy coding😉!

Mary Adewunmi

I am a data scientist/deep learning researcher with a focus on using deep learning/computer vision for medical image diagnosis.