Magento 2 is an enterprise e-commerce platform that handles hundreds of thousands of products. However, its import tools are not up to the task. To deal with this situation I created an import tool for the community.
Nederlandse vertaling van dit artikel.
The code and documentation can be found here
- a product import library for custom imports
- product importers using XML (REST and file based)
- a tool to repair the url_rewrite table
In October 2017 we were struggling with some product imports. The first was a REST-based sync that required several requests per product. It took several days for 40.000 products and 600 categories. The second was a sync from local XML files. It was faster, but we needed to take some shortcuts to get it working at an acceptable speed. A complete update of all products again was almost unworkable.
I talked to Marco, my employer and colleague, about it and we agreed that it shouldn’t have to take this much time to import products. In the end importing products is just a matter of moving data into the database, right? The low level processes of MySQL are fast enough. What if we could cut out the middle man, Magento, and write to MySQL directly? This should result in a great performance gain. A few tests proved this right.
I wanted to do this project for the Magento community because importing products is such a basic task it should be freely available to all. Further more, I didn’t want to be hurried and have the time to do it right. It took me longer than the three months I estimated it would take, but then again it has become much more extended than I anticipated.
Optimization is fun. It gives me a kick to see how fast things can go. Here are some of the techniques I came up with:
Instead of handling a single product at a time, the new library imports products in batches of 1000. This way the overhead time for processing a product is much smaller, and in some cases many inserts can be combined in a single statement.
Writing multiple records to a table in a single statement is 2 to 10 times faster than executing single statements, so I wanted to use this wherever I could. However, MySQL has a maximum query size that must be obeyed. That’s why I first query the server variable “max_allowed_packet” to figure out how large the query can be. Next, I cram as many records into the INSERT as I can. Since I did not want the library to spend time calculating the query size again and again, I pass a constant with the query that gives the maximum size of a single record in sizes of 1 KB, 2 KB, 16 KB or 128 KB. For example, when a record takes up 2 KB and the max query size is 1 MB, 500 inserts can be stuffed into a single query.
I tested “with a clock in my hand”. Every time I made a change to the code, I ran an automated benchmark with 2500 products. I could notice immediately when a change I made impacted the speed of the code. This way I stumbled upon things I hadn’t even considered a lot of times.
I also measured memory usage to make sure that it was kept at a low level and did not accumulate in larger imports.
After I found out that importing images could become a bottleneck I created a file cache and an HTTP cache for images. This way they won’t have to be downloaded over and over again.
How fast is the library? I cannot tell you how fast the library is in your situation. It depends on many factors. Your best bet is to create a small test and find out!
Implementing all features that may be expected of a product import library, was a looooot of work. But next to that I came up with these new ones I think are also needed.
Placeholders. Products are linked to other products in several ways. Configurable, grouped and bundled products are linked to other products. Each product may have related, up-sell, and cross-sell products. Links between products are stored by id. Of course a product can only be stored when the products it links to already exist. Products may even link back to eachother. This problem is often overlooked.
I handled this situation by creating placeholder products for products that are linked to, but don’t exist yet. A dummy placeholder product is created for it, mainly for its id. When the actual product is imported, the placeholder is replaced by the real thing.
Url key generation. Url keys must be unique per store view. They are usually generated based on the name of the product, and often different products have the same name. Since this proves to be a problem I allow you to generate the url key based on the name or the sku. When this url key conflicts with another, you can choose what to do about it: add the sku, or a generated follow-up number.
REST. Magento’s REST implementation is based on single entities. It will not allow you to import multiple products, because its framework cannot handle it. I wanted to use Magento’s webapi authentication system and import multiple products via XML. With a little hack I succeeded in doing so and this allows you to use the library via a REST call (XML only).
You can use this library to create your custom product import. It aims to be the fastest Magento 2 product import library around, it validates input and gives good feedback on errors, it has an extensive API and I documented every part of it.
Note: the library is quite new and used only in a few projects. Many features have not been used in production environments yet. Start using it in safe environment with a good backup of the database. You will inevitably come across some bugs, since there’s a lot of new code. Report them in the issues section of the GitHub repository and I will look into them.
I hope you enjoy this library. I welcome all constructive feedback. Let me know what parts you have used and what can be improved.