Machine Learning Testing

A46R...b95i
19 Aug 2022

This is part of a series of blog posts on the automated creation of machine learning models and of the datasets used for training neural networks, and on model-agnostic meta-learning. If you are interested in the background of the story, you may scroll to the bottom of the post for links to the previous blog posts. You may also head to the Use SERP Data to Build Machine Learning Models page to get a clear idea of what kinds of automated ML models you can create and how to use them in machine learning systems.

In previous weeks, we created an object in the database and stored machine learning training, testing, and dataset information within it. This week we will take the final step necessary before releasing the source code to the public: unit testing the machine learning tool. The results of these tests will give an idea of which ML models can be created and which cannot. These unit tests should not be confused with testing machine learning models themselves: they check the scalability aspect of the tool, not the individual ML models. I will talk about the source code release next week in the concluding part. In the meantime, you may Register to Claim Free Credits to get to know the SerpApi ecosystem.

We will be using pytest as our testing library. The unit tests will focus on individual endpoints and other aspects of the machine learning process. I will first go over some common meanings of the term machine learning testing to make clear what I am not talking about, and then share examples of the subset of ML testing I am referring to.

What is machine learning testing?

Machine learning testing may refer to the process of software testing through machine learning models. Testing traditional software with a trained model may improve the test performance, or may reduce the need for maintenance.

I share this definition to specify which meaning I am not referring to. I tried a different approach for Rails for this purpose and failed. You may find the details in the five-part blog post series on my profile. Of course, that is not to say it is impossible. In the scope of this writing, we will only be using image classification datasets.

How are machine learning models tested?

Machine learning models are tested through iterations of test data, or real-world data, and scored based on validation of their task. These tests could indicate if ml models are sufficiently trained or if further optimization is needed.

This is actually within the scope of this series. I have touched upon the tool’s automatic model testing capability: fetching random images from the local storage server, testing machine learning models against them, and creating a validation score. But again, this is not the scope of this blog post.
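For orientation only, the kind of validation score described above can be sketched in a few lines. This is a minimal illustration, not the tool's implementation: `validation_accuracy`, the toy `toy_predict` function, and the sample file names are all hypothetical stand-ins.

```python
from typing import Callable, Iterable, Tuple


def validation_accuracy(
    predict: Callable[[str], str],
    samples: Iterable[Tuple[str, str]],
) -> float:
    """Return the fraction of (input, expected_label) pairs predicted correctly."""
    samples = list(samples)
    if not samples:
        raise ValueError("at least one sample is required")
    correct = sum(1 for item, label in samples if predict(item) == label)
    return correct / len(samples)


# Hypothetical stand-in for a trained image classifier: it "classifies"
# a file name instead of real pixel data.
def toy_predict(file_name: str) -> str:
    return "coffee" if "coffee" in file_name else "tea"


samples = [
    ("coffee_01.png", "coffee"),
    ("tea_01.png", "tea"),
    ("espresso_01.png", "coffee"),  # the toy model gets this one wrong
    ("tea_02.png", "tea"),
]
score = validation_accuracy(toy_predict, samples)
```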

What are the benefits of machine learning testing?

We are testing the machine learning tool’s scalability in the context of this writing. To write tests for this, we largely take the same approach as in software development. The aim is to methodically expose the tool to test cases and thereby test its robustness. Model predictions at the end of a training process are outside the scope of this writing. However, the methods employed in this tutorial could be repurposed for minimum functionality tests, which are responsible for testing model behavior. Moreover, since we have a tool that can run workflows at scale and asynchronously, automated minimum functionality tests could reduce the time it takes to pick the right model, and thus the overall time spent on model testing.

This repurposing could also be useful for testing machine learning systems consisting of models with different capabilities. Much like software systems, ML systems can be broken down into pieces and tested for model evaluation. Individual data points connecting the systems could be brought together in a test suite to create coherency. This also allows for more precise debugging of individual ML models, although the tests we write here will focus on debugging scalability.

The aim is to write tests like a software engineer writing unit tests for software systems, and then hand the robust tool to a software engineer who is an ML enthusiast, a data scientist, or a machine learning engineer working in MLOps. It could then be used to create a dataset via SerpApi’s Google Images Scraper API, get the validation metrics for different algorithms, optimize with ease, and build machine learning systems.

Unit Testing Automatic Dataset Creation Endpoints

Here is the unit test for automatically adding a labeled image as training data to the storage server by using a query. This is made possible by SerpApi’s Google Images Scraper API.

def test_add_to_db_behaviour():
	# Get The first link
	query = Query(q="Coffee")
	db = ImagesDataBase()
	serpapi = Download(query, db)
	serpapi.serpapi_search()
	link = serpapi.results[0]

	# Call the endpoint to check if an object with a specific link has been added to database and delete it
	body = json.dumps({"q":"Coffee", "limit": 1})
	response = client.post("/add_to_db", headers = {"Content-Type": "application/json"}, data=body, allow_redirects=True)
	assert response.status_code == 200
	assert db.check_if_it_exists(link, "Coffee") is not None
	db.delete_by_link(link)

Here is another test for calling multiple queries and, if required, narrowing them with a specification (e.g., only apple the fruit, and not the phone). This is made possible by the chips parameter in SerpApi's Google Images Scraper API, which is excellent for preprocessing training data.

def test_multiple_query_behaviour():
	# Get The first link
	chips = "q:coffee,g_1:espresso:AKvUqs_4Mro%3D"
	query = Query(q="Coffee imagesize:500x500", chips=chips)
	db = ImagesDataBase()
	serpapi = Download(query, db)
	serpapi.serpapi_search()
	link = serpapi.results[0]

	# Call the endpoint to check if an object with a specific link has been added to database and delete it
	multiple_queries = MultipleQueries(queries=["Coffee"], desired_chips_name="espresso", limit=1)
	body = json.dumps(multiple_queries.dict())
	response = client.post("/multiple_query", headers = {"Content-Type": "application/json"}, data=body, allow_redirects=True)
	assert response.status_code == 200
	assert db.check_if_it_exists(link, "Coffee imagesize:500x500") is not None
	db.delete_by_link(link)

Of course, there are other ways to improve the quality of the preprocessing of your training data, such as defining the image size within queries (e.g. imagesize:500x500).

These tests create training data objects in the storage server and then check whether such a labeled image object exists. We delete the object afterward to prevent false positives in later runs.
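One caveat with this pattern is that the cleanup line is skipped when an assertion fails, so a leftover object could still cause a false positive on the next run. Here is a minimal sketch of a cleanup that always runs, assuming an in-memory stand-in for the database. `InMemoryImagesDB`, its `insert` method, and `check_add_and_cleanup` are hypothetical and only mimic the calls used in the tests.

```python
from typing import Dict, Optional, Tuple


class InMemoryImagesDB:
    """Hypothetical stand-in for the images database used in the tests."""

    def __init__(self) -> None:
        self._objects: Dict[Tuple[str, str], dict] = {}

    def insert(self, link: str, label: str) -> None:
        self._objects[(link, label)] = {"link": link, "label": label}

    def check_if_it_exists(self, link: str, label: str) -> Optional[dict]:
        return self._objects.get((link, label))

    def delete_by_link(self, link: str) -> None:
        for key in [k for k in self._objects if k[0] == link]:
            del self._objects[key]


def check_add_and_cleanup(db: InMemoryImagesDB) -> None:
    link = "https://example.com/coffee.png"  # placeholder link
    db.insert(link, "Coffee")  # stands in for the POST to /add_to_db
    try:
        assert db.check_if_it_exists(link, "Coffee") is not None
    finally:
        # Runs even if the assertion above fails, so no object is left behind.
        db.delete_by_link(link)


db = InMemoryImagesDB()
check_add_and_cleanup(db)
leftover = db.check_if_it_exists("https://example.com/coffee.png", "Coffee")
```

With pytest, the same guarantee is usually expressed as a fixture that yields the database and deletes the object after the `yield`.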

Unit Testing Optimizers

Here is the unit test for the optimizers used in the machine learning training process. These tests will indicate which optimizers are supported by the tool, and which ones aren’t.

def test_optimizers():
	optimizers = [
		TrainCommands(
			n_epoch = 1,
			batch_size = 2,
			model_name = "unit_test_model",
			optimizer = {
				"name": "Adadelta",
				"lr": 1.0,
				"rho": 0.9,
				"eps": 1e-06,
				"weight_decay": 0.1
			}
		),
		TrainCommands(
			n_epoch = 1,
			batch_size = 2,
			model_name = "unit_test_model",
			optimizer = {
				"name": "Adagrad",
				"lr": 1.0,
				"lr_decay": 0.1,
				"weight_decay": 0.1,
				"initial_accumulator_value": 0.1,
				"eps": 1e-10
			}
		),
		TrainCommands(
			n_epoch = 1,
			batch_size = 2,
			model_name = "unit_test_model",
			optimizer = {
				"name": "Adam",
				"lr": 1.0,
				"betas": (0.9, 0.999),
				"eps": 1e-10,
				"weight_decay": 0.1,
				"amsgrad": True
			}
		),
		...
	]

	for optimizer in optimizers:
		# Train a Model for 1 Epoch
		body = json.dumps(optimizer.dict())
		response = client.post("/train", headers = {"Content-Type": "application/json"}, data=body, allow_redirects = True)
		assert response.status_code == 200, "{}".format(optimizer.optimizer['name'])

		# Find The Training Attempt in the Database
		find_response = client.post("/find_attempt?name=unit_test_model", allow_redirects = True)
		assert find_response.status_code == 200
		assert find_response is not None

		# Delete The Training Attempt in the Database to make the Succeeding Unit Test Valid
		delete_response = client.post("/delete_attempt?name=unit_test_model", allow_redirects = True)
		assert delete_response.status_code == 200

By default, a TrainCommands object has its own valid inputs. By changing only the optimizer, we can check whether the tool accepts it. At each iteration, we train a small model with the minimum batch size of 2 and the minimum of 1 epoch, so the unit tests do not take much time. We also pass different parameters to each optimizer to find out which parameters are supported for that optimizer. Model evaluation is irrelevant at this point. After the training is over, we fetch the Attempt object, which stores the information for the ML model, and delete it to prevent false positives in the next iterations.

These tests also act somewhat like integration tests, since each one exercises the full scope of the tool except for the testing endpoints.
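Because a loop of assertions stops at the first failure, a single unsupported optimizer hides the results for the ones after it. Here is a minimal sketch of collecting a supported or unsupported verdict for every configuration instead. `collect_support` is a hypothetical helper, `fake_train` is a hypothetical stand-in for the call to the /train endpoint, and the set of accepted names is made up for the example.

```python
from typing import Callable, Dict, List


def collect_support(
    configs: List[dict],
    train: Callable[[dict], None],
) -> Dict[str, bool]:
    """Run every configuration and record whether training accepted it."""
    results: Dict[str, bool] = {}
    for config in configs:
        try:
            train(config)
            results[config["name"]] = True
        except Exception:
            # Any error is treated as "not supported" in this sketch.
            results[config["name"]] = False
    return results


# Hypothetical stand-in for posting a one-epoch job to the training endpoint.
def fake_train(optimizer: dict) -> None:
    accepted = {"Adadelta", "Adagrad", "Adam"}  # made up for illustration
    if optimizer["name"] not in accepted:
        raise ValueError("unsupported optimizer: " + optimizer["name"])


configs = [
    {"name": "Adadelta", "lr": 1.0},
    {"name": "Adam", "lr": 1.0},
    {"name": "NotAnOptimizer", "lr": 1.0},
]
support = collect_support(configs, fake_train)
unsupported = [name for name, ok in support.items() if not ok]
```

A test can then assert that `unsupported` is empty and report the whole list when it is not. With pytest, `pytest.mark.parametrize` gives a similar per-case report.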

Unit Testing Loss Functions

We follow the same procedure for the loss functions. We keep everything the same in the TrainCommands and change only the loss function (criterion). As with optimizers, model evaluation is irrelevant at this point. I actually caught some undocumented deprecated parameters while testing loss functions. I will share them in next week’s blog post while showcasing how to use them.

	criterions = [
		TrainCommands(
			n_epoch = 1,
			batch_size = 2,
			model_name = "unit_test_model",
			criterion = {
				"name": "L1Loss",
				#"size_average": True, ## Will be deprecated
				#"reduce": True, ## Will be deprecated
				"reduction": "sum"
			}
		),
		TrainCommands(
			n_epoch = 1,
			batch_size = 2,
			model_name = "unit_test_model",
			criterion = {
				"name": "MSELoss",
				#"size_average": True, ## Will be deprecated
				#"reduce": True, ## Will be deprecated
				"reduction": "sum"
			}
		),
		TrainCommands(
			n_epoch = 1,
			batch_size = 2,
			model_name = "unit_test_model",
			criterion = {
				"name": "CrossEntropyLoss",
				#"weight": Torch.tensor(3), ## Not supported yet
				#"size_average": True, ## Will be deprecated
				#"reduce": True, ## Will be deprecated
				"reduction": "sum",
				#"ignore_index": 1, ## Not supported yet
				"label_smoothing": 0.1
			}
		),
		...
	]
  
	for criterion in criterions:
		# Train a Model for 1 Epoch
		body = json.dumps(criterion.dict())
		response = client.post("/train", headers = {"Content-Type": "application/json"}, data=body, allow_redirects = True)
		assert response.status_code == 200, "{}".format(criterion.criterion['name'])

		# Find The Training Attempt in the Database
		find_response = client.post("/find_attempt?name=unit_test_model", allow_redirects = True)
		assert find_response.status_code == 200
		assert find_response is not None

		# Delete The Training Attempt in the Database to make the Succeeding Unit Test Valid
		delete_response = client.post("/delete_attempt?name=unit_test_model", allow_redirects = True)
		assert delete_response.status_code == 200

Unit Testing Convolutional Layer Algorithms

Testing sequential layers is much easier than testing functions. Previously, we created an object called CustomModel, which takes sequential layers as inputs from a customizable dictionary and creates a custom machine learning model. If anything goes wrong while initializing a custom ML model, it raises an exception. So we just need to test whether the type of the created object is CustomModel.

def test_convolutional_layers():
	layers = [
		TrainCommands(model={"layers":[
			{
				"name":"Conv1d",
				"in_channels":16,
				"out_channels":33,
				"kernel_size":3,
				"stride":2,
				"padding":1,
				"dilation":1,
				"groups": 1,
				"bias": True,
				"padding_mode": "'reflect'",
				"device": None,
				"dtype": torch.float
			}
		]}),
		TrainCommands(model={"layers":[
			{
				"name":"Conv2d",
				"in_channels":16,
				"out_channels":33,
				"kernel_size":3,
				"stride":2,
				"padding":1,
				"dilation":1,
				"groups": 1,
				"bias": True,
				"padding_mode": "'reflect'",
				"device": None,
				"dtype": torch.float
			}
		]}),
		TrainCommands(model={"layers":[
			{
				"name":"Conv3d",
				"in_channels":16,
				"out_channels":33,
				"kernel_size":3,
				"stride":2,
				"padding":1,
				"dilation":1,
				"groups": 1,
				"bias": True,
				"padding_mode": "'reflect'",
				"device": None,
				"dtype": torch.float
			}
		]}),
		...
	]
  
	for layer in layers:
		assert type(CustomModel(tc=layer)) == CustomModel
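The reason a type check is enough can be sketched as follows. This is a simplified illustration, not the actual CustomModel: `LAYER_REGISTRY`, `FakeConv2d`, and `build_layers` are hypothetical names, and the stand-in layer only records its arguments. The point is that an unknown layer name or an unsupported parameter raises during construction, so reaching the assertion at all means the dictionary was accepted.

```python
from typing import Any, Dict, List


class FakeConv2d:
    """Hypothetical stand-in for a convolutional layer class."""

    def __init__(self, in_channels: int, out_channels: int,
                 kernel_size: int, stride: int = 1) -> None:
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride


# Maps the "name" key of a layer dictionary to the class that builds it.
LAYER_REGISTRY: Dict[str, type] = {"Conv2d": FakeConv2d}


def build_layers(layer_dicts: List[Dict[str, Any]]) -> List[Any]:
    """Instantiate each layer; raise on unknown names or parameters."""
    layers = []
    for spec in layer_dicts:
        params = dict(spec)  # copy so the caller's dictionary is untouched
        name = params.pop("name")
        if name not in LAYER_REGISTRY:
            raise KeyError("unknown layer: " + name)
        # Unsupported keyword arguments raise TypeError here.
        layers.append(LAYER_REGISTRY[name](**params))
    return layers


built = build_layers([
    {"name": "Conv2d", "in_channels": 16, "out_channels": 33, "kernel_size": 3}
])

try:
    build_layers([{"name": "Conv2d", "in_channels": 16, "out_channels": 33,
                   "kernel_size": 3, "not_a_parameter": 1}])
    rejected = False
except TypeError:
    rejected = True
```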

Unit Testing Pooling Layer Algorithms

Testing sequential layers for pooling algorithms works the same way as for convolutional layers. Running the tests against different pooling layers is sufficient.

def test_pooling_layers():
	layers = [
		TrainCommands(model={"layers":[
			{
				"name": "MaxPool1d",
				"kernel_size": 3,
				"stride": 2,
				"padding": 1,
				"dilation": 1,
				"return_indices": True,
				"ceil_mode": True
			}
		]}),
		TrainCommands(model={"layers":[
			{
				"name": "MaxPool2d",
				"kernel_size": 3,
				"stride": 2,
				"padding": 1,
				"dilation": 1,
				"return_indices": True,
				"ceil_mode": True
			}
		]}),
		TrainCommands(model={"layers":[
			{
				"name": "MaxPool3d",
				"kernel_size": 3,
				"stride": 2,
				"padding": 1,
				"dilation": 1,
				"return_indices": True,
				"ceil_mode": True
			}
		]}),
 		...
	]
  
	for layer in layers:
		assert type(CustomModel(tc=layer)) == CustomModel

Unit Testing Linear Layer Algorithms

These types of tests for commonly used layers in deep learning models are good for the quality assurance of the tool. For example, the tool could easily be tweaked to perform linear regression in the future.

def test_linear():
	layers = [
		TrainCommands(model={"layers":[
			{
				"name": "Linear",
				"in_features": 5,
				"out_features": 6,
				"bias": True
				#"device" Not Supported Yet
				#"dtype": Not Supported Yet
			}
		]}),
		TrainCommands(model={"layers":[
			{
				"name": "Bilinear",
				"in1_features": 5,
				"in2_features": 6,
				"out_features": 7,
				"bias": True
				#"device" Not Supported Yet
				#"dtype": Not Supported Yet
			}
		]}),
		TrainCommands(model={"layers":[
			{
				"name": "LazyLinear",
				"out_features": 6,
				"bias": True
				#"device" Not Supported Yet
				#"dtype": Not Supported Yet
			}
		]}),
 		...
	]
  
	for layer in layers:
		assert type(CustomModel(tc=layer)) == CustomModel

Unit Testing Utilities from Other Functions

The following Python code tests the utility layers. As you can see, although Flatten works, UnFlatten is not suitable for the tool yet, because we cannot yet pass a variable during training.

def test_utilities():
	layers = [
		TrainCommands(model={"layers":[
			{
				"name": "Flatten",
				"start_dim": 1,
				"end_dim": -1,
			}
		]}),
		#TrainCommands(model={"layers":[
		#	{
		#		"name": "UnFlatten",
		#		"dim": 'features',
		#		"unflattened_size": (('C',2)),
		#	}
		#]}),
	]

	for layer in layers:
		assert type(CustomModel(tc=layer)) == CustomModel

Unit Testing Non-Linear Activation Layer Algorithms

In deep learning models, non-linear activation layers can improve model performance and help with the vanishing gradient problem. The results can be observed in the evaluation metrics. Here are the unit tests for them:

def test_non_linear_activations():
	layers = [
		TrainCommands(model={"layers":[
			{
				"name": "ELU",
				"alpha": 1.1,
				"inplace": True,
			}
		]}),
		TrainCommands(model={"layers":[
			{
				"name": "Hardshrink",
				#"lambda": 0.6
			}
		]}),
		TrainCommands(model={"layers":[
			{
				"name": "Hardsigmoid",
				"inplace": True,
			}
		]}),
 		...
	]
  
	for layer in layers:
		assert type(CustomModel(tc=layer)) == CustomModel

Conclusion

I am grateful to the reader for their attention, and to the Brilliant People of SerpApi for all their support. I think all the difficulties I have gone through in creating this tool were worth it. Next week, I will share a link on the SerpApi Github Page and open source the code.

There are some points I would like to mention. I have removed some nonessential parts of the code because I thought they weren’t good enough. The first is the manual adding of training data to the images database. At the time, I was using system storage for this task, and I would like to update it to a storage server solution in the future. The tool will require a SerpApi Account. You may claim free credits just by registering. But this doesn’t mean you cannot update the dataset without one: you may simply head to your server dashboard and create a new object. So I don’t think it violates the open source criteria of the code. After all, the source code behind the scalability will be on public display.

The second point is that the code automatically updates the dataset only for images. Various tasks for different data distributions, such as NLP and most other artificial intelligence tasks, require other file formats to be accepted as input data. For now, only images will be supported. Also, some tests, like regression tests or invariance tests, may not be possible with the tool. I haven’t tried them out.

Another point I would like to touch on is that the results of the tests showed that some of the optimizers, loss functions, and algorithms used in the machine learning training process are not supported, while some are partially supported with a limited number of parameters. That I haven’t tested something does not necessarily mean it cannot be used in machine learning systems. I have tested the tool just as I would test traditional software. The ones that are definitely not supported will be documented in detail in next week’s blog post. You may try out the others, and maybe write your own tests to contribute to the repository.

The final issue I would like to mention is that I have removed the parts concerning front-end development. This doesn’t mean that you will not have access to a front end where you can easily modify your customizable dictionaries. The /docs endpoint will list a number of endpoints for you to play around with and use the tool for real-world purposes. Again, I am grateful for your attention, and excited to open source the tool.

Previous Blog Posts
