Most websites use JavaScript (JS) to make dynamic content, which makes it a valuable attack vector against browsers, browser plug-ins, email clients, and other JS applications. Among common JS-based attacks are drive-by-download, cross-site scripting (XSS), cross-site request forgery (XSRF), and malvertising.

Most malicious JS code is **obfuscated**: it has been put through a sequence of confusing code transformations that preserve functionality while hiding intent. Conventional signature-based antivirus methods struggle against zero-day obfuscated scripts.

## Introduction

Most websites use JavaScript (JS) to generate dynamic content, which makes JS a valuable attack vector against browsers, browser plug-ins, email clients, and other JS applications. Common JS-based attacks include drive-by downloads, cross-site scripting (XSS), cross-site request forgery (XSRF), and malvertising (malicious advertising). Most malicious JS code is obfuscated to hide what it does and to evade signature-based security systems. In other words, obfuscation is a sequence of confusing code transformations that make the code hard to understand while preserving its functionality.

![](/images/2022-06-28-detect-malicious-javascript/1*YkmXBgkfe2cM9B3bUfEd0w.png)

*Example of randomization obfuscation*

Conventional antivirus products and Intrusion Detection Systems (IDS) employ heuristic-based and signature-based methods to detect malicious JS code, but this analysis can be ineffective against zero-day attacks. Machine learning (ML), which is being actively adopted across industries, has also found its place in cybersecurity and has shown its effectiveness against zero-day attacks. For detecting malicious JS code, there are several families of approaches: Natural Language Processing (NLP) techniques, standard ML on tabular data, and deep learning models.

The input data for ML models depends on which of two methods is used to analyze program behavior: static or dynamic code analysis. Static analysis works on the source code alone, without running it; for instance, this can be achieved by traversing the code's Abstract Syntax Tree. In contrast, dynamic analysis requires the code to be executed. In this post, we consider only static analysis.

In this article, we will look at some related work to get an idea of what researchers propose for detecting obfuscated JS code. We will then classify benign/malicious JS code snippets using a combination of NLP features and standard ML approaches.

## Common Approaches to Feature JavaScript Code

In [Detecting Obfuscated JavaScripts using Machine Learning](https://www.researchgate.net/publication/321805699_Detecting_Obfuscated_JavaScripts_using_Machine_Learning), the authors used a dataset of regular, minified, and obfuscated samples from the content delivery network jsDelivr and the Alexa top 500 websites, plus a set of malicious JavaScript samples from the Swiss Reporting and Analysis Centre for Information Assurance [MELANI](https://www.ncsc.admin.ch/ncsc/en/home.html). The authors showed that it is possible to distinguish between obfuscated and non-obfuscated scripts with precision and recall around 99%. The following set of features was used:

<BlogImage
  src="/images/2022-06-28-detect-malicious-javascript/1*GVILYOzLt7CU0KftJYZzvw.png"
  caption="Static features for JavaScript snippets, source: https://www.researchgate.net/publication/321805699_Detecting_Obfuscated_JavaScripts_using_Machine_Learning"
  alt="Static features for JavaScript snippets, source: [Detecting Obfuscated JavaScripts using Machine Learning](https://www.researchgate.net/publication/321805699_Detecting_Obfuscated_JavaScripts_using_Machine_Learning)"
  width={400}
  height={300}
  compact
  maxWidth="max-w-md"
/>

The extracted set of feature vectors was utilized to train and evaluate three different classifiers: Linear Discriminant Analysis (LDA), Random Forest (RF), and Support Vector Machine (SVM).

In “[A machine learning approach to detection of JavaScript-based attacks using AST features and paragraph vectors](https://www.sciencedirect.com/science/article/pii/S1568494619305022)”, the authors took a different approach to extracting features from JS code. They employed an [Abstract Syntax Tree (AST)](https://en.wikipedia.org/wiki/Abstract_syntax_tree) to represent code structure and used it as input to the Doc2Vec method. The training data consisted of drive-by-download data by Marionette for malicious JS code, and the JSUNPACK and Alexa top 100 website datasets for benign JS code. To construct the AST, the authors used [Esprima](https://esprima.org/demo/parse.html#), a syntactic and lexical analysis tool.

While the previous approaches rely on lexical and syntactic features, the approach considered in “[Malicious JavaScript Detection Based on Bidirectional LSTM Model](https://www.mdpi.com/2076-3417/10/10/3440)” leverages semantic information. Along with AST features, the authors constructed the Program Dependency Graph (PDG) and generated semantic slices of the JS code, which were transformed into numerical vectors. These vectors were then fed into a Bidirectional Long Short-Term Memory (BLSTM) neural network. The BLSTM model achieved 97.71% accuracy and a 98.29% F1-score.

| Approach | Description |
|:---|:---|
| Natural language | Treat JS as text: character statistics, entropies, special function counts |
| Lexical features | Regex + NLP methods (BoW, TF-IDF, Doc2Vec) |
| Syntactic features | Abstract Syntax Tree (AST) + NLP |
| Semantic features | AST → CFG → PDG → semantic slices → vectors |
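
To make the lexical row of the table concrete, here is a minimal sketch of a bag-of-tokens representation built with a regular expression. The token pattern and the `bag_of_tokens` name are illustrative assumptions, not taken from any of the cited papers; a real JS lexer would also handle strings, comments, and regex literals.

```python
import re
from collections import Counter

# Hypothetical, deliberately simple tokenizer: identifiers/keywords,
# integer literals, and single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+|[^\sA-Za-z0-9_$]")

def bag_of_tokens(js_source):
    """Count lexical tokens in a JS snippet without executing it."""
    return Counter(TOKEN_RE.findall(js_source))

snippet = "var a = eval(x); var b = eval(y);"
counts = bag_of_tokens(snippet)
print(counts["var"], counts["eval"], counts["("])  # 2 2 2
```

Counts like these can be fed to BoW or TF-IDF weighting in the same way as word counts in ordinary text.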

## Coding section: Classification of benign/malicious JS code

We use the **natural language** approach (the first row of the table above) for simplicity. The dataset comes from [Machine Learning for the Cybersecurity Cookbook](https://github.com/PacktPublishing/Machine-Learning-for-Cybersecurity-Cookbook/tree/master/Chapter03/Detecting%20Obfuscated%20Javascript).

```python
import os
import re
import math
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from tqdm import tqdm
from collections import Counter
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

warnings.filterwarnings('ignore')
sns.set_theme(font_scale = 2)

SEED = 0
JS_DATA_DIR = "./JavascriptSamples"
OBFUSCATED_JS_DATA_DIR = "./JavascriptSamplesObfuscated"
```

### Data loading

```python
filenames, scripts, labels = [], [], []
file_types_and_labels = [(JS_DATA_DIR, 0), (OBFUSCATED_JS_DATA_DIR, 1)]

for files_path, label in file_types_and_labels:
    files = os.listdir(files_path)
    for file in tqdm(files):
        file_path = os.path.join(files_path, file)
        try:
            with open(file_path, "r", encoding="utf8") as myfile:
                # flatten each script onto a single line
                script = myfile.read().replace("\n", "")
            filenames.append(file)
            scripts.append(script)
            labels.append(label)
        except Exception as e:
            # skip unreadable files (e.g. non-UTF-8 content) but report them
            print(e)
```

```python
df = pd.DataFrame(data=filenames, columns=['js_filename'])
df['js'] = scripts
df['label'] = labels

df.head()
```

![](/images/2022-06-28-detect-malicious-javascript/1*MtAxrM_3vuELvOJB3Wxu9Q.png)


### Data cleansing

```python
# removing empty scripts
df = df[df['js'] != '']

# removing duplicates (every copy of a duplicated script is dropped)
df = df[~df["js"].isin(df["js"][df["js"].duplicated()])].copy()

# Some obfuscated scripts ended up in the legitimate JS samples folder, so relabel them as 1.
# .loc avoids chained assignment, which may silently modify a copy.
df.loc[df["js_filename"].str.contains('obfuscated', regex=False), "label"] = 1

df.label.value_counts()
```

Label 0 is normal code; label 1 is obfuscated code.

![](/images/2022-06-28-detect-malicious-javascript/1*kJy9U8Llz17wgnaKrsZcVg.png)

![](/images/2022-06-28-detect-malicious-javascript/1*23VuEJhK4L1p4QYCBHcdZQ.png)


### Feature engineering

```python
df['js_length'] = df.js.apply(lambda x: len(x))
df['num_spaces'] = df.js.apply(lambda x: x.count(' '))

df['num_parenthesis'] = df.js.apply(lambda x: (x.count('(') + x.count(')')))
df['num_slash'] = df.js.apply(lambda x: x.count('/'))
df['num_plus'] = df.js.apply(lambda x: x.count('+'))
df['num_point'] = df.js.apply(lambda x: x.count('.'))
df['num_comma'] = df.js.apply(lambda x: x.count(','))
df['num_semicolon'] = df.js.apply(lambda x: x.count(';'))
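# note: \w matches letters, digits and underscore, so this counts word characters, not letters only
df['num_alpha'] = df.js.apply(lambda x: len(re.findall(re.compile(r"\w"),x)))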
df['num_numeric'] = df.js.apply(lambda x: len(re.findall(re.compile(r"[0-9]"),x)))

df['ratio_spaces'] = df['num_spaces'] / df['js_length']
df['ratio_alpha'] = df['num_alpha'] / df['js_length']
df['ratio_numeric'] = df['num_numeric'] / df['js_length']
df['ratio_parenthesis'] = df['num_parenthesis'] / df['js_length']
df['ratio_slash'] = df['num_slash'] / df['js_length']
df['ratio_plus'] = df['num_plus'] / df['js_length']
df['ratio_point'] = df['num_point'] / df['js_length']
df['ratio_comma'] = df['num_comma'] / df['js_length']
df['ratio_semicolon'] = df['num_semicolon'] / df['js_length']
```

![](/images/2022-06-28-detect-malicious-javascript/1*EG6A9k7S_1D-5jokHZYz9Q.png)

```python
def entropy(s):
    p, lns = Counter(s), float(len(s))
    return -sum( count/lns * math.log(count/lns, 2) for count in p.values())

df['entropy'] = df.js.apply(lambda x: entropy(x))

print("Mean entropy for obfuscated js:", df['entropy'][df["label"] == 1].mean())
print("Mean entropy for non-obfuscated js:", df['entropy'][df["label"] == 0].mean())
```
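
As a quick sanity check on the entropy feature, the sketch below evaluates the same Shannon entropy, $H(s) = -\sum_c p_c \log_2 p_c$ with $p_c$ the relative frequency of character $c$, on two toy strings. The strings are made up for illustration, and `math.log2` stands in for `math.log(x, 2)`.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Shannon entropy of the character distribution of s, in bits."""
    n = len(s)
    return sum(-(count / n) * math.log2(count / n) for count in Counter(s).values())

print(shannon_entropy("aaaa"))  # 0.0: a single repeated character carries no information
print(shannon_entropy("abcd"))  # 2.0: four equally likely characters need 2 bits each
```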

![](/images/2022-06-28-detect-malicious-javascript/1*jRq4TglluZWlo0BFtC2ZUQ.png)

For further feature ideas, I used the following list of JS functions that are frequently used in malicious JS code:

![](/images/2022-06-28-detect-malicious-javascript/1*PKYtUItFOFUS_ILMeKWB3A.png)

*Functions widely used in malicious JavaScript, source: [Malicious JavaScript Detection Based on Bidirectional LSTM Model](https://www.mdpi.com/2076-3417/10/10/3440)*

```python
# String Operation: substring(), charAt(), split(), concat(), slice(), substr()

df['num_string_oper'] = df.js.apply(lambda x: x.count('substring') +
                                            x.count('charAt') +
                                            x.count('split') +
                                            x.count('concat') +
                                            x.count('slice') +
                                            x.count('substr'))

df['ratio_num_string_oper'] = df['num_string_oper'] / df['js_length']

print("Mean string operations for obfuscated js:", df['num_string_oper'][df["label"] == 1].mean())
print("Mean string operations for non-obfuscated js:", df['num_string_oper'][df["label"] == 0].mean())

```
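
One caveat of plain `str.count` is overlap between names: every `substring` is also counted as a `substr`. If whole-identifier counts are preferred, a word-boundary regex is one option. The sketch below is an alternative under that assumption, not the method used for the results in this post; `count_calls` is a hypothetical helper.

```python
import re

STRING_OPS = ["substring", "charAt", "split", "concat", "slice", "substr"]

# \b ... \b matches whole identifiers only, so "substring" is not
# additionally counted as a "substr" hit.
STRING_OPS_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, STRING_OPS)) + r")\b")

def count_calls(js_source):
    """Count whole-word occurrences of the listed string operations."""
    return len(STRING_OPS_RE.findall(js_source))

sample = "a.substring(0, 2) + b.substr(1) + c.split(',')"
naive = sum(sample.count(name) for name in STRING_OPS)
print(naive, count_calls(sample))  # 4 3
```

The same pattern would apply to the other overlapping pairs in the feature lists, such as `escape`/`unescape` and `document.write`/`document.writeln`.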

![](/images/2022-06-28-detect-malicious-javascript/1*V-LGQ6CQiXKXv8xAPcJNdA.png)

```python
# Encoding Operation: escape(), unescape(), string(), fromCharCode()

df['num_encoding_oper'] = df.js.apply(lambda x: x.count('escape') +
                                        x.count('unescape') +
                                        x.count('string') +
                                        x.count('fromCharCode'))

df['ratio_num_encoding_oper'] = df['num_encoding_oper'] / df['js_length']

print("Mean encoding operations for obfuscated js:", df['num_encoding_oper'][df["label"] == 1].mean())
print("Mean encoding operations for non-obfuscated js:", df['num_encoding_oper'][df["label"] == 0].mean())
```

![](/images/2022-06-28-detect-malicious-javascript/1*AvbFHBR5txULLOorJxiljw.png)

```python
# URL Redirection: setTimeout(), location.reload(), location.replace(), document.URL(), document.location(), document.referrer()

df['num_url_redirection'] = df.js.apply(lambda x: x.count('setTimeout') +
                                          x.count('location.reload') +
                                          x.count('location.replace') +
                                          x.count('document.URL') +
                                          x.count('document.location') +
                                          x.count('document.referrer'))

df['ratio_num_url_redirection'] = df['num_url_redirection'] / df['js_length']

print("Mean URL redirections for obfuscated js:", df['num_url_redirection'][df["label"] == 1].mean())
print("Mean URL redirections for non-obfuscated js:", df['num_url_redirection'][df["label"] == 0].mean())
```

![](/images/2022-06-28-detect-malicious-javascript/1*fqIBIQXV3hvoTtcZhR9FLQ.png)

```python
# Specific Behaviors: eval(), setTime(), setInterval(), ActiveXObject(), createElement(), document.write(), document.writeln(), document.replaceChildren()

df['num_specific_func'] = df.js.apply(lambda x: x.count('eval') +
                                       x.count('setTime') +
                                       x.count('setInterval') +
                                       x.count('ActiveXObject') +
                                       x.count('createElement') +
                                       x.count('document.write') +
                                       x.count('document.writeln') +
                                       x.count('document.replaceChildren'))

df['ratio_num_specific_func'] = df['num_specific_func'] / df['js_length']

print("Mean specific functions for obfuscated js:", df['num_specific_func'][df["label"] == 1].mean())
print("Mean specific functions for non-obfuscated js:", df['num_specific_func'][df["label"] == 0].mean())
```

![](/images/2022-06-28-detect-malicious-javascript/1*UYeDSWlrFBLpABDXokSDxQ.png)


### Training a Random Forest Classifier

**Train/test data split**

```python
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 3:], df['label'],
                                                    stratify=df['label'],
                                                    test_size=0.2,
                                                    random_state=SEED)
```

**Random Forest Model**

```python
clf=RandomForestClassifier(n_estimators=100, random_state=SEED)
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
```

**Metrics results**

```python
conf_mat = metrics.confusion_matrix(y_test, y_pred)

plt.subplots(figsize=(6,6))
sns.set(font_scale=1.4) # for label size
sns.heatmap(conf_mat, annot=True, fmt=".0f", annot_kws={"size": 16}, cbar=False) # font size
# confusion_matrix puts true classes on rows (y-axis) and predictions on columns (x-axis)
plt.xlabel('Output (predicted) class'); plt.ylabel('Target (true) class'); plt.title('Confusion Matrix')
plt.show()

print(metrics.classification_report(y_test,
                                    y_pred,
                                    target_names=['non-obfuscated', 'obfuscated']))
```

![](/images/2022-06-28-detect-malicious-javascript/1*XwhAZC3XTJVxajOCrrGY8Q.png)

Full code: [GitHub](https://github.com/MariaZork/my-machine-learning-tutorials/blob/master/js-obfuscation-detection)
## Further reading

[1] S. Aebersold et al., [Detecting Obfuscated JavaScripts using Machine Learning](https://www.researchgate.net/publication/321805699_Detecting_Obfuscated_JavaScripts_using_Machine_Learning) (2016), ICIMP 2016: The Eleventh International Conference on Internet Monitoring and Protection

[2] S. Ndichu et al., [A machine learning approach to detection of JavaScript-based attacks using AST features and paragraph vectors](https://www.sciencedirect.com/science/article/pii/S1568494619305022) (2019), Applied Soft Computing

[3] A. Fass et al., [JAST: Fully Syntactic Detection of Malicious (Obfuscated) JavaScript](https://link.springer.com/chapter/10.1007/978-3-319-93411-2_14) (2018), DIMVA

[4] X. Song et al., [Malicious JavaScript Detection Based on Bidirectional LSTM Model](https://www.mdpi.com/2076-3417/10/10/3440) (2020), Applied Sciences

Detect Malicious JavaScript Code using Machine Learning

Top Machine Learning in Cybersecurity Trends to Watch in 2022


Phishing is a fraudulent activity that aims to steal user credentials, credit card and bank account information, or to deploy malicious software on the victim's infrastructure. For instance, fraudsters can send mass emails containing malicious links; when a user clicks one, the attacker can obtain sensitive data.

There are various phishing URL detection techniques: whitelists and blacklists, and heuristic approaches that use content, visual, and URL features. Here we discuss a URL-based method that uses standard machine learning approaches (logistic regression, random forest, SGD classifier, etc.) in combination with NLP-driven features extracted from the URL.

## Introduction

Machine learning is now widely used in different cybersecurity areas. Among them are the following:

- Malware detection and classification
- Domain generation algorithms and botnet detection
- Network intrusion detection
- URL detection
- Spam filtering
- Malicious insider threats
- Cyber-physical systems (CPS) and industrial control systems (ICS)
- Biometric systems (face recognition, speaker verification/recognition, fingerprint systems)
- Anomalous user behaviour analysis

In this tutorial we will consider only one of these numerous cases and will give some insights into how, by combining NLP features and standard (non-deep learning) machine learning algorithms, we can perform the detection of phishing URLs.

**There are different types of URL obfuscation:**

| Type | Sample |
|:---|:---|
| Obfuscation with other domains | `http://school497.ru/222/www.paypal.com/...` |
| Typo-squatting domains | `http://cgi-3.paypal-secure.de/...` |
| Obfuscation with IP address | `http://69.72.130.98/javaseva/https://paypal.com/...` |
| Obfuscation with URL shorteners | `http://goo.gl/HQx5g` |
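
As an illustration of how one of these obfuscation types can become a feature, the sketch below flags URLs whose host is a literal IP address, using only the standard library. The `has_ip_host` helper is a hypothetical name, and the sketch assumes the URL carries a scheme (without `//`, `urlsplit` does not recognize a host).

```python
import ipaddress
from urllib.parse import urlsplit

def has_ip_host(url):
    """Return True when the URL's hostname is a literal IPv4/IPv6 address."""
    host = urlsplit(url).hostname  # None if the URL has no "//host" part
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

print(has_ip_host("http://69.72.130.98/javaseva/"))  # True
print(has_ip_host("http://goo.gl/HQx5g"))            # False
```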

## Dataset

In this tutorial I used the **Kaggle** dataset ["Phishing Site URLs"](https://www.kaggle.com/taruntiwarihp/phishing-site-urls) which contains 549,346 entries of URLs labeled with 2 categories ("Good", "Bad").

## Measuring ML Model Performance

Important metrics for measuring the quality of cybersecurity ML models are **precision** and **recall**:

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

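In cybersecurity, **recall** is critical because it is the share of malicious samples we catch; every false negative (FN) is a malicious sample passed as legitimate, so FN should be as low as possible. **Precision** reflects the friction for legitimate users: the lower it is, the more legitimate samples are flagged as malicious. There is always a trade-off between recall and precision, and the right balance depends on the application.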
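
A small worked example of the two formulas, with made-up counts chosen only for illustration (90 phishing URLs caught, 5 legitimate URLs flagged by mistake, 10 phishing URLs missed):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical confusion counts, not results from this tutorial
precision, recall = precision_recall(tp=90, fp=5, fn=10)
print(f"precision={precision:.3f} recall={recall:.3f}")  # precision=0.947 recall=0.900
```

Here 10% of the malicious URLs slip through (1 - recall), while about 5% of the alerts are false alarms (1 - precision).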

## Feature Engineering

Here is the general structure of a URL: protocol, username and password, hostname, port, path, and query:

```
https://user:password@hostname.com:443/path?query
```
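
These components can be pulled apart with the standard library's `urllib.parse.urlsplit`. This is a minimal sketch; the `url_parts` wrapper is a hypothetical helper, not part of the original notebook.

```python
from urllib.parse import urlsplit

def url_parts(url):
    """Split a URL into the components named above."""
    p = urlsplit(url)
    return {
        "protocol": p.scheme,
        "username": p.username,
        "password": p.password,
        "hostname": p.hostname,
        "port": p.port,
        "path": p.path,
        "query": p.query,
    }

parts = url_parts("https://user:password@hostname.com:443/path?query")
print(parts["hostname"], parts["port"], parts["path"])  # hostname.com 443 /path
```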

The following helper functions extract NLP features from the URL:

```python
import re
import string
import math
from collections import Counter

VOWELS = set("aeiou")
CONSONANTS = set(string.ascii_lowercase) - VOWELS

def url_length(s: str):
    return len(s)

def vowels_pct(s):
    # Guard against empty strings to avoid ZeroDivisionError
    if not s:
        return 0.0
    count = sum(1 for ch in s.lower() if ch in VOWELS)
    return count / len(s)

def count_dots(s):
    return s.count('.')

def count_slash(s):
    return s.count('/')

def count_digits(s):
    return len(re.sub(r"\D", "", s))

def extract_doc(s):
    """Split URL by punctuation and join with spaces for TF-IDF"""
    # re.escape keeps characters such as ], \ and ^ literal inside the character class
    return " ".join(re.split("[" + re.escape(string.punctuation) + "]+", s))
```

After feature extraction, we apply TF-IDF to the `doc` field and combine the result with the scaled numeric features. The vectorizer and scaler are configured as follows:

```python
from scipy.sparse import coo_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import RobustScaler

tf_idf_vec = TfidfVectorizer(
    encoding='utf-8',
    stop_words='english',
    ngram_range=(1, 3),
    max_df=0.8,
    min_df=1000
)
sc = RobustScaler()
```
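
How the two feature groups might be joined is sketched below on three toy rows. The documents and numeric values are invented for illustration, and the sketch keeps the default `min_df` because `min_df=1000` would discard every term on such a small sample.

```python
import numpy as np
from scipy.sparse import coo_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import RobustScaler

# Toy stand-ins for the `doc` column and three numeric features
# (e.g. length, dot count, digit count); values are illustrative only.
docs = ["paypal com login", "school497 ru 222 www paypal com", "goo gl HQx5g"]
numeric = np.array([[16, 1, 0], [31, 4, 6], [12, 1, 1]], dtype=float)

vec = TfidfVectorizer(ngram_range=(1, 3))
X_text = vec.fit_transform(docs)                 # sparse TF-IDF matrix

X_num = RobustScaler().fit_transform(numeric)    # dense, median/IQR scaled

# Stack column-wise: TF-IDF columns first, then the scaled numeric columns
X = hstack([X_text, coo_matrix(X_num)]).tocsr()
print(X.shape)
```

Keeping the result sparse matters at the scale of the full dataset, where a dense TF-IDF matrix would not fit comfortably in memory. In a real pipeline the vectorizer and scaler would be fitted on the training split only.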

## Results

| Algorithm | Mean Precision (5-fold) | Mean Recall (5-fold) |
|:---:|:---:|:---:|
| Logistic Regression | 0.94 | 0.91 |
| SGD Classifier | 0.94 | 0.93 |
| **Random Forest** | **0.96** | **0.98** |
| Linear SVM | 0.95 | 0.92 |

Random Forest showed the best performance with mean recall of **0.982** over 5 folds and no misclassifications on the manually selected test samples.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MariaZork/my-machine-learning-tutorials/blob/master/phishing-url-detection/phishing_url_detection.ipynb)

## Conclusion

URL-based phishing detection using NLP features combined with standard ML classifiers achieves strong performance. The Random Forest classifier is the clear winner here: its ensemble nature handles the high-dimensional TF-IDF feature space well, and it achieves the highest recall, the most critical metric for security applications.

Key takeaways:
- TF-IDF on tokenized URL text is a powerful and simple feature
- Structural URL features (dots, slashes, IP presence, port) add important signal
- Random Forest dominates; Logistic Regression is competitive but has lower recall
- Class imbalance should be handled explicitly via `class_weight='balanced'`
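
For the last point, scikit-learn's `class_weight='balanced'` weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that arithmetic on a made-up label list so the effect is easy to see; `balanced_class_weights` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Illustrative 80/20 split, not the real class balance of the dataset
labels = ["good"] * 8 + ["bad"] * 2
weights = balanced_class_weights(labels)
print(weights)  # {'good': 0.625, 'bad': 2.5}
```

The rarer class gets the larger weight, so each misclassified "bad" URL costs the model four times as much as a misclassified "good" one in this toy case.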


Phishing URL Detection using Standard Machine Learning Methods


In this tutorial we will classify colorectal histology tissue images using the ResNet architecture and the PyTorch framework.

## Introduction

Recently, machine learning (ML) applications have become widespread in the healthcare industry: the omics fields (genomics, transcriptomics, proteomics), drug investigation, radiology, and digital histology. Deep learning based image analysis studies in histopathology cover different tasks (e.g., classification, semantic segmentation, detection, and instance segmentation). The main goal of ML in this field is the automatic detection, grading, and prognosis of cancer.

However, there are several challenges in digital pathology. Histology slides are usually large hematoxylin and eosin (H&E) stained images with color variations and artifacts, and different levels of magnification yield different levels of information. One Whole Slide Image (WSI) is a multi-gigabyte image with a typical resolution of **100,000 × 100,000** pixels.

In a supervised classification scenario, WSIs are divided into patches with some stride, then a CNN architecture extracts feature vectors from patches which can be passed into traditional ML algorithms (SVM, gradient boosting) for further operations.
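
The patch extraction step can be sketched as a sliding window over the image. The function below only computes patch coordinates, so it runs without any imaging library; the name `patch_origins` and the 1000 × 1000 example size are assumptions for illustration.

```python
def patch_origins(width, height, patch, stride):
    """Yield top-left (x, y) corners of all patches that fit fully inside the image."""
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            yield x, y

# Non-overlapping patches: stride equals the patch size
tiles = list(patch_origins(width=1000, height=1000, patch=224, stride=224))
# 50% overlap: stride is half the patch size
overlapping = list(patch_origins(width=1000, height=1000, patch=224, stride=112))
print(len(tiles), len(overlapping))  # 16 49
```

A smaller stride gives more (overlapping) patches per slide at the cost of more computation.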

![Typical steps for ML in digital pathological image analysis.](/images/2020-10-26-colorectal-tissue-classification/introduction-pic.jpg)

In this article we apply the CNN **ResNet** architecture to classify colon tissue types. We won't use transfer learning, on the assumption that ImageNet weights are not related to histology and won't help convergence.

---

## Dataset

The collection of textures in colorectal cancer histology — a "MNIST for biologists". Available at:
- [Zenodo](https://zenodo.org/record/53169#.X5XO59AzbIV)
- [Kaggle](https://www.kaggle.com/kmader/colorectal-histology-mnist)

Two folders:
- **5000 image tiles**: 150 × 150 px each (74 × 74 µm). Eight tissue categories.
- **10 larger images**: 5000 × 5000 px each. Multiple tissue types per image.

All images are RGB, 0.495 µm/pixel, digitized with Aperio ScanScope, magnification 20×. Histological samples are fully anonymized images of formalin-fixed paraffin-embedded human colorectal adenocarcinomas from the University Medical Center Mannheim, Germany.

---

## Colorectal MNIST Classification with ResNet

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MariaZork/my-machine-learning-tutorials/blob/master/colorectal-cancer-classification/colorectal-cancer-classification.ipynb)

### Setup

<CodeCell language="python">
{`import os
import random
import itertools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from PIL import Image
from sklearn.metrics import confusion_matrix, classification_report
import torch
import torch.nn as nn
import torch.utils.data as D
import torch.nn.functional as F
from torchvision import transforms, models
from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore')

torch.cuda.empty_cache()`}
</CodeCell>

<CodeCell language="python">
{`DATA_DIR = '/kaggle/input/colorectal-histology-mnist/'
SMALL_IMG_DATA_DIR = os.path.join(DATA_DIR,
    'kather_texture_2016_image_tiles_5000/Kather_texture_2016_image_tiles_5000')

IMAGE_SIZE = 224
SEED = 2000
BATCH_SIZE = 64
NUM_EPOCHS = 15

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")`}
</CodeCell>

---

### Data Exploration

<CodeCell language="python" output={`['03_COMPLEX', '08_EMPTY', '04_LYMPHO', '01_TUMOR',
 '02_STROMA', '06_MUCOSA', '05_DEBRIS', '07_ADIPOSE']`}>
{`classes = os.listdir(SMALL_IMG_DATA_DIR)
classes`}
</CodeCell>

<CodeCell language="python" output={`03_COMPLEX  625
08_EMPTY    625
04_LYMPHO   625
01_TUMOR    625
02_STROMA   625
06_MUCOSA   625
05_DEBRIS   625
07_ADIPOSE  625`}>
{`for label in classes:
    num_samples = len(os.listdir(os.path.join(SMALL_IMG_DATA_DIR, label)))
    print(label + '\t' + str(num_samples))`}
</CodeCell>

Sample tiles from each class:

![Sample tiles from each tissue class](/images/nb_images/colorectal-tissue-classification_files/colorectal-tissue-classification_9_0.png)
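
The later cells rely on a DataFrame `df` with `img_path`, `label`, and `label_num` columns, and on a `label_num` mapping, whose construction is not shown here. One way they could be built is sketched below; the `build_records` helper and the sorted class-to-index mapping are assumptions, not the notebook's actual code.

```python
import os

def build_records(root_dir):
    """Walk one sub-folder per class and return (records, label_num)."""
    classes = sorted(
        d for d in os.listdir(root_dir)
        if os.path.isdir(os.path.join(root_dir, d))
    )
    # Assumed mapping: class folder name -> integer id, in sorted order
    label_num = {name: idx for idx, name in enumerate(classes)}
    records = []
    for name in classes:
        class_dir = os.path.join(root_dir, name)
        for fname in sorted(os.listdir(class_dir)):
            records.append({
                "img_path": os.path.join(class_dir, fname),
                "label": name,
                "label_num": label_num[name],
            })
    return records, label_num
```

Calling `records, label_num = build_records(SMALL_IMG_DATA_DIR)` and passing `records` to `pd.DataFrame` would give a three-column frame of the shape used in the split further down.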

---

### PyTorch Dataset and DataLoaders

<CodeCell language="python">
{`class HistologyMnistDS(D.Dataset):
    def __init__(self, df, transforms, mode='train'):
        self.records = df.to_records(index=False)
        self.transforms = transforms
        self.mode = mode
        self.len = df.shape[0]

    @staticmethod
    def _load_image_pil(path):
        return Image.open(path)

    def __getitem__(self, index):
        path = self.records[index].img_path
        img = self._load_image_pil(path)
        if self.transforms:
            img = self.transforms(img)
        if self.mode in ['train', 'val', 'test']:
            return img, torch.from_numpy(np.array(self.records[index].label_num))
        return img

    def __len__(self):
        return self.len`}
</CodeCell>

<CodeCell language="python">
{`train_transforms = transforms.Compose([
    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

val_transforms = transforms.Compose([
    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])`}
</CodeCell>

<CodeCell language="python" output={`Train DF shape: (4000, 3)
Valid DF shape: (200, 3)
Test DF shape:  (800, 3)`}>
{`train_df, tmp_df = train_test_split(df, test_size=0.2,
                                    random_state=SEED, stratify=df['label'])
valid_df, test_df = train_test_split(tmp_df, test_size=0.8,
                                     random_state=SEED, stratify=tmp_df['label'])

print("Train DF shape:", train_df.shape)
print("Valid DF shape:", valid_df.shape)
print("Test DF shape:", test_df.shape)`}
</CodeCell>

<CodeCell language="python">
{`ds_train = HistologyMnistDS(train_df, train_transforms)
ds_val   = HistologyMnistDS(valid_df, val_transforms, mode='val')
ds_test  = HistologyMnistDS(test_df,  val_transforms, mode='test')

train_loader = D.DataLoader(ds_train, batch_size=BATCH_SIZE, shuffle=True,  num_workers=4)
val_loader   = D.DataLoader(ds_val,   batch_size=BATCH_SIZE, shuffle=False, num_workers=4)
test_loader  = D.DataLoader(ds_test,  batch_size=BATCH_SIZE, shuffle=False, num_workers=4)`}
</CodeCell>

Example batch image (denormalised):

![Example batch image](/images/nb_images/colorectal-tissue-classification_files/colorectal-tissue-classification_20_0.png)

---

### Train Loop

<CodeCell language="python">
{`import copy

checkpoints_dir = '/kaggle/working/'
history_train_loss, history_val_loss = [], []

def train_model(model, loss, optimizer, scheduler, num_epochs):
    best_model_wts = copy.deepcopy(model.state_dict())
    best_loss = 10e10
    best_acc_score = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}:'.format(epoch, num_epochs - 1), flush=True)

        for phase in ['train', 'val']:
            dataloader = train_loader if phase == 'train' else val_loader
            if phase == 'train':
                model.train()
            else:
                model.eval()

            running_loss = running_acc = 0.

            for inputs, labels in tqdm(dataloader):
                inputs = inputs.to(device)
                labels = labels.to(device)
                optimizer.zero_grad()

                with torch.set_grad_enabled(phase == 'train'):
                    preds = model(inputs)
                    loss_value = loss(preds, labels)
                    preds_class = preds.argmax(dim=1)
                    if phase == 'train':
                        loss_value.backward()
                        optimizer.step()

                running_loss += loss_value.item()
                running_acc  += (preds_class == labels.data).float().mean().item()

            epoch_loss = running_loss / len(dataloader)
            epoch_acc  = running_acc  / len(dataloader)
            print(f'{phase} Loss: {epoch_loss:.4f}  Acc: {epoch_acc:.4f}', flush=True)

            if phase == 'train':
                history_train_loss.append(epoch_loss)
                # Step the LR scheduler once per epoch, after the optimizer updates
                scheduler.step()
            else:
                history_val_loss.append(epoch_loss)
                if epoch_loss < best_loss:
                    best_loss = epoch_loss
                    best_model_wts = copy.deepcopy(model.state_dict())
                    print("Saving model for best loss")
                    os.makedirs(checkpoints_dir, exist_ok=True)
                    torch.save({'state_dict': best_model_wts},
                               checkpoints_dir + 'best_model.pth.tar')
                if epoch_acc > best_acc_score:
                    best_acc_score = epoch_acc
                print(f'Best loss: {best_loss:.4f}  Best acc: {best_acc_score:.4f}')

    return model`}
</CodeCell>

---

### Model Setup and Training

ResNet-50 with the final linear layer replaced to output 8 classes. StepLR reduces the Adam learning rate by 10× every 7 epochs.

<CodeCell language="python">
{`model = models.resnet50(pretrained=False)
model.fc = torch.nn.Linear(model.fc.in_features, len(classes))
model = model.to(device)

loss      = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)`}
</CodeCell>
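
To see what this schedule means over 15 epochs, the arithmetic behind StepLR can be written out directly: the learning rate is the base rate multiplied by `gamma` once for every completed block of `step_size` epochs. The sketch assumes the scheduler is stepped once at the end of each training epoch; `step_lr` is a hypothetical helper, not a PyTorch function.

```python
def step_lr(base_lr, gamma, step_size, epoch):
    """Learning rate in a given epoch under a step decay schedule."""
    return base_lr * gamma ** (epoch // step_size)

# Same settings as above: lr=1e-3, step_size=7, gamma=0.1, 15 epochs
schedule = [step_lr(1e-3, 0.1, 7, epoch) for epoch in range(15)]
for epoch in (0, 6, 7, 13, 14):
    print(epoch, f"{schedule[epoch]:.0e}")
```

Epochs 0 to 6 run at 1e-3, epochs 7 to 13 at 1e-4, and the final epoch at 1e-5.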

<CodeCell language="python" output={`Epoch 0/14:  val Loss: 0.7102  Acc: 0.7578  → Saving
Epoch 2/14:  val Loss: 0.4103  Acc: 0.8477  → Saving
Epoch 6/14:  val Loss: 0.1979  Acc: 0.9414  → Saving
Epoch 9/14:  val Loss: 0.1765  Acc: 0.9414  → Saving`}>
{`train_model(model, loss, optimizer, scheduler, num_epochs=NUM_EPOCHS);`}
</CodeCell>

---

### Results

Train/validation loss curves:

![Train and validation loss curves](/images/nb_images/colorectal-tissue-classification_files/colorectal-tissue-classification_24_1.png)

<CodeCell language="python">
{`model.load_state_dict(
    torch.load(os.path.join(checkpoints_dir, 'best_model.pth.tar'))['state_dict']
)
model.eval()

y_preds = []
for inputs, labels in tqdm(test_loader):
    inputs = inputs.to(device)
    with torch.set_grad_enabled(False):
        preds = model(inputs)
    y_preds.append(preds.argmax(dim=1).data.cpu().numpy())

y_preds = np.concatenate(y_preds)`}
</CodeCell>

<CodeCell language="python" output={`Confusion matrix, without normalization:
[[98  0  1  0  0  1  0  0]
 [ 0 88  5  0  7  0  0  0]
 [ 2 11 83  3  0  1  0  0]
 [ 0  0  5 95  0  0  0  0]
 [ 0  4  2  0 89  1  4  0]
 [ 1  0  2  4  2 91  0  0]
 [ 0  0  0  0  1  0 96  3]
 [ 0  0  0  0  0  0  1 99]]`}>
{`cm = confusion_matrix(test_df.label_num.values, y_preds)
plot_confusion_matrix(cm, label_num)`}
</CodeCell>

![Confusion matrix](/images/nb_images/colorectal-tissue-classification_files/colorectal-tissue-classification_28_1.png)

<CodeCell language="python" output={`              precision  recall  f1-score  support
01_TUMOR           0.97    0.98      0.98      100
02_STROMA          0.85    0.88      0.87      100
03_COMPLEX         0.85    0.83      0.84      100
04_LYMPHO          0.93    0.95      0.94      100
05_DEBRIS          0.90    0.89      0.89      100
06_MUCOSA          0.97    0.91      0.94      100
07_ADIPOSE         0.95    0.96      0.96      100
08_EMPTY           0.97    0.99      0.98      100

accuracy                             0.92      800`}>
{`print(classification_report(
    test_df.label_num.values,
    y_preds,
    target_names=list(label_num.keys())
))`}
</CodeCell>

---

## Conclusion

We trained ResNet-50 for 15 epochs achieving **92% accuracy** on the test set. **Tumor** and **Empty** classes are the most recognisable (F1 = 0.98). The most confusable label is **Complex**, which likely represents combinations of other tissue types.

Maria
Zorkaltseva

Latest Articles

Fine-Tuning LLMs: When Prompting Is Not Enough

Graph Neural Networks for Fraud Detection in Crypto Transactions

Detect Malicious JavaScript Code using Machine Learning

Top Machine Learning in Cybersecurity Trends to Watch in 2022

Phishing URL Detection using Standard Machine Learning Methods

Colorectal Histology MNIST: Images Classification using ResNet Architecture (PyTorch)
