Notes on troubleshooting a Redis memory overflow

How it started

One morning I was scrolling on my phone on the way to work when the department group chat suddenly erupted: "The site won't open?!" "Pages are erroring out intermittently" "What's going on?"... I had been half asleep, and the fright woke me right up; I nearly dropped my phone. I was so anxious that I wanted to pull out my laptop and start checking right there on the bus. After ten minutes of panic, I jumped off the bus and rushed to my desk at the office to start investigating.

Locating the problem

I opened the Aliyun console and looked through the errors from the previous half hour. They were almost all the same:

2020-12-10 08:50:19,868 ERROR 168244 [-/ POST /user/site/list] 
nodejs.ReplyError: OOM command not allowed when used memory > 'maxmemory'.

What does that mean? Redis has overflowed! Its current memory use exceeds the maximum memory configured for it (maxmemory).

I logged on to the server right away and got to work.

First, check the Redis memory information:

redis-cli info memory

Whoa: the maximum capacity is only 4.66G, and 4.65G of it is already in use. These are the two values the arrows point to in the screenshot.

Things were too urgent at the time to take a screenshot. The image here is only an example of checking Redis memory usage; the numbers in it are not the ones from the incident.
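The same two numbers can also be turned into a usage percentage straight from the info memory output. This is a minimal sketch, assuming the standard used_memory and maxmemory fields; the function name mem_usage_pct and the sample numbers are made up for illustration and are not from the incident.

```shell
# Sample INFO memory lines (made-up numbers, not from the incident).
sample='used_memory:900000000
maxmemory:1000000000'

# Reads INFO memory output on stdin and prints used/max as a percentage.
# Real redis-cli output ends each line with \r, which tr strips first.
mem_usage_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max > 0) printf "%.1f\n", used * 100 / max
      else print "no maxmemory limit set"
    }'
}

printf '%s\n' "$sample" | mem_usage_pct
# On a live server: redis-cli info memory | mem_usage_pct
```

A maxmemory of 0 means no limit is configured, which is why that case is reported separately instead of dividing by zero.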

So it is confirmed: Redis has overflowed. Next, let's see what is taking up so much space by looking for the keys that use a lot of memory.

redis-cli --bigkeys

The output is as follows (the data shown is not from the incident)

The output of redis-cli --bigkeys shows the largest key of each type (string, set, hash, and so on) along with its size: bytes for strings, and the number of members or fields for collections. However, the largest key was only a few tens of MB, so no single key could be the cause. More likely, something had created millions of keys and filled up the Redis memory.

The summary at the end of the output confirmed my guess:

-------- summary -------
Sampled 11998929 keys in the keyspace! 
Total key length in bytes is 430089742 (avg len 35.84) 
Biggest string found 'xxxx' has 789970 bytes
Biggest set found 'xxxx' has 20841 members 
Biggest hash found 'xxxx' has 3393675 fields

There were more than 10 million keys sitting in Redis! I gasped. More than ten million... how did it ever get to this?

Investigation process

To find out what those ten-million-plus keys looked like, I saved all of them to a text file. For one thing, the analysis would need this data again and again, and repeatedly pulling that much data from Redis would interfere with normal reads. For another, it preserved the "evidence" for analyzing the cause of the failure afterwards.

Save the keys to a text file:

redis-cli keys "*" > /data/redis-key.log

Note that running keys * in a production environment is generally not recommended. It is very expensive, especially on a large Redis instance: Redis executes commands on a single thread, so the server can stall for a while and other reads are affected. In this exceptional case, though, the failure had already happened, so using it was understandable.
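When the server is still serving traffic, the usual gentler alternative is to walk the keyspace with SCAN, which redis-cli exposes through its --scan and --pattern options. The sketch below is not what was run during the incident; the dump_keys function and the REDIS_CLI variable are hypothetical helpers introduced only for illustration.

```shell
# Assumption: REDIS_CLI names the redis-cli binary. The variable exists only
# so the command can be swapped out; it is not a redis-cli feature.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# dump_keys PATTERN OUTFILE
# Iterates with SCAN (via redis-cli --scan) instead of KEYS, so the server
# handles one small batch per call and keeps answering other clients.
dump_keys() {
  "$REDIS_CLI" --scan --pattern "$1" > "$2"
}

# Usage (the output path is only an example):
#   dump_keys '*' /data/redis-key.log
```

SCAN may return the same key more than once while the keyspace is changing, so pass the file through sort -u if exact counts matter.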

Okay, now I had all the keys. Next I needed to find out which kind of key was taking up the memory. First, for each kind of key that my program writes to Redis, I counted how many lines in the key file matched it. For example, to count the keys beginning with sitemap_:

grep -c '^sitemap_' /data/redis-key.log

The results showed that even the most numerous kind of key I checked had only a few tens of thousands of entries.
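Instead of grepping for one known prefix at a time, the whole key file can be tallied by prefix in a single pass. This is a rough sketch that assumes prefixed keys use "_" as the separator (as sitemap_ does); the sample keys and the count_by_prefix name are invented for illustration.

```shell
# Hypothetical sample of the key dump; on the server this would be
# /data/redis-key.log.
keys_file="$(mktemp)"
cat > "$keys_file" <<'EOF'
sitemap_1001
sitemap_1002
user_42
f30a0485-b59f-4939-a41d-3955786b37e0
0b1c2d3e-aaaa-bbbb-cccc-111122223333
9f8e7d6c-dddd-eeee-ffff-444455556666
EOF

# Count keys per prefix, where the prefix is the text before the first "_".
# Keys with no "_" at all are lumped together as "(no prefix)".
count_by_prefix() {
  awk -F_ '{ if (NF > 1) print $1 "_"; else print "(no prefix)" }' "$1" |
    sort | uniq -c | sort -rn
}

count_by_prefix "$keys_file"
```

A very large "(no prefix)" bucket is a hint that the bulk of the keys comes from something other than the application's own named keys.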

So what else could it be?

Next, I took a random sample of 100,000 keys:

shuf -n 100000 /data/redis-key.log > /data/redis-ramdom-key-100000.log

Then I paged through the sample to see what the keys looked like:

less /data/redis-ramdom-key-100000.log

(A side note: Linux has several commands for viewing logs and other text files, such as cat, less, and tail. cat prints the entire file, which suits small files; less shows a screenful at a time and loads more as you scroll, which suits browsing large files; tail is typically used to follow a log in real time or to print the last few lines of a file. It pays to know which command fits which situation.)

Every key in the output had the same format (UUID-style strings such as f30a0485-b59f-4939-a41d-3955786b37e0, the key used in the commands that follow).

Then it dawned on me: these were session IDs created by egg-session. So the problem lay with the sessions created by the middle layer.

I remembered that the session expiration time was set to two days. Could it be that the setting had not taken effect, so session records piled up in Redis until it overflowed? To verify this, I had to check whether any of the existing sessions had a time to live of more than two days.

A key's remaining time to live can be checked with the redis-cli ttl command (see the Redis documentation for details). I picked one session ID from the list and checked whether its TTL was more than two days. Note that ttl reports seconds, so two days is 172800 seconds; it is pttl that reports milliseconds, where two days would be 172800000.

redis-cli ttl 'f30a0485-b59f-4939-a41d-3955786b37e0'

The result showed that it was under two days.

Then I checked all 100,000 randomly sampled keys:

cat /data/redis-ramdom-key-100000.log | xargs -I key redis-cli ttl key > /data/key_ttl.log

What the command above does: cat outputs the 100,000 sampled keys and pipes them to xargs; xargs takes each input line as an argument, substitutes it for the placeholder key, and runs redis-cli ttl on it to get that key's remaining time to live; the results are saved to key_ttl.log. A word of praise here: xargs is powerful and easy to use. If you want elegant shell one-liners, xargs is worth learning.

Next, let's see which TTLs are longer than two days by printing them directly:

cat /data/key_ttl.log | awk '{if($0 > 172800) print $0}'

This filters the TTLs of the 100,000 sampled keys and prints the ones greater than two days (172800 seconds). Note: awk is a text-processing tool and also very powerful; it is worth getting to know.
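A filter that prints nothing is easy to misread, so it can help to tally the TTL file into categories instead. TTL reports seconds and uses two special values: -1 for a key with no expiry and -2 for a key that no longer exists. The sketch below uses made-up TTL values, and summarize_ttl is a name chosen only for illustration.

```shell
# Hypothetical TTL values, one per line, in the same shape as key_ttl.log.
ttl_file="$(mktemp)"
printf '%s\n' 172000 86400 -1 -2 200000 3600 > "$ttl_file"

# TTL reports seconds: -1 = key has no expiry, -2 = key no longer exists.
# Everything else is compared against the two-day limit (172800 seconds).
summarize_ttl() {
  awk -v limit=172800 '
    $1 == -2   { gone++;    next }
    $1 == -1   { forever++; next }
    $1 > limit { over++;    next }
               { ok++ }
    END {
      printf "within_limit=%d over_limit=%d no_expiry=%d missing=%d\n",
             ok, over, forever, gone
    }' "$1"
}

summarize_ttl "$ttl_file"
# On the server: summarize_ttl /data/key_ttl.log
```

Counting the -1 entries separately matters because a key with no expiry at all would never show up in a "greater than two days" filter.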

To my puzzlement, there was no output at all, which meant that every TTL was under two days. Oh no.

Since the expiration setting was fine, the problem had to be in how the sessions were created. I took one session ID and printed its value:

redis-cli get 'f30a0485-b59f-4939-a41d-3955786b37e0'

I won't show the exact structure, but it told me something important: the session held no user information and no business data, only a few dispensable values. That puzzled me. Shouldn't a session store the login information? Why was the user information empty? Could a session be created like this, without anyone logging in? To test the idea, I ran an experiment:

I opened the project locally and cleared the related Redis records beforehand. Then I opened a page without logging in and immediately went back to check Redis. Oh no, it really had created a session for me... I think I knew what was going on: someone must have been hitting our site over and over, so sessions kept being created.

Belatedly, I opened the nginx access log:

tail -f xxxxx.log

Sure enough, the requests just kept pouring in. That shouldn't be possible, I thought. Doesn't the program rate-limit by IP? How could anyone get through? I pulled the IPs of several similar requests from the nginx log and looked them up in the middle layer's rate-limit records. Each IP had only 3 requests recorded, which means whoever was doing this had a very large IP pool: to avoid being blocked, each IP made just 3 requests and was then swapped out. Impressive, I thought. An IP pool that big is really something.
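The "many IPs, a few requests each" pattern can also be read straight from the access log. This is a sketch under the assumption that the log uses nginx's default "combined" format, where the client address is the first field; the log lines below are invented (documentation-range addresses), and requests_per_ip and ip_pool_stats are hypothetical helper names.

```shell
# Hypothetical access-log excerpt in the "combined" format.
log_file="$(mktemp)"
cat > "$log_file" <<'EOF'
192.0.2.10 - - [10/Dec/2020:08:50:01 +0800] "GET / HTTP/1.1" 200 512 "-" "curl/7.68.0"
192.0.2.10 - - [10/Dec/2020:08:50:02 +0800] "GET / HTTP/1.1" 200 512 "-" "curl/7.68.0"
192.0.2.10 - - [10/Dec/2020:08:50:03 +0800] "GET / HTTP/1.1" 200 512 "-" "curl/7.68.0"
198.51.100.7 - - [10/Dec/2020:08:50:03 +0800] "GET / HTTP/1.1" 200 512 "-" "curl/7.68.0"
203.0.113.5 - - [10/Dec/2020:08:50:04 +0800] "GET / HTTP/1.1" 200 512 "-" "curl/7.68.0"
EOF

# Requests per client address, busiest first.
requests_per_ip() {
  awk '{ print $1 }' "$1" | sort | uniq -c | sort -rn
}

# Number of distinct addresses, total requests, and requests per address.
ip_pool_stats() {
  awk '{ seen[$1]++ }
       END {
         n = 0
         for (ip in seen) n++
         printf "ips=%d requests=%d avg=%.1f\n", n, NR, (n ? NR / n : 0)
       }' "$1"
}

requests_per_ip "$log_file"
ip_pool_stats "$log_file"
```

A very high address count combined with a low average per address is the signature of a rotating IP pool, which per-IP rate limiting alone does not catch.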

One question remained: why is a session created when nobody has logged in? That is exactly what gave the traffic-flooders their opening. I had to change it, otherwise it would be far too dangerous!

Finding a solution

After some digging, I found that the sessions were created automatically by egg-session-redis, an egg plug-in. As long as a request does not carry the session key we specified and there is data to write (whatever the content, even if it is empty), a session is created immediately. That is not what I wanted; I wanted a session to be created only after login. After studying the source of egg-session, egg-session-redis, and koa-session, I found a place to intervene:


This is the spot in the egg-session-redis plug-in where the session gets created. All I need to do is add a check before redis.set and then decide whether to write or not.

The catch is that this code lives inside the plug-in. There are two ways to get what I want: 1. Rewrite the plug-in, publish it to npm, and use my own version. 2. Create an app.js in the root directory of the egg.js project. It offers several lifecycle functions, one of which is didLoad (see the egg.js documentation for details); it runs once all configuration files have been loaded, and at that point I can overwrite app.sessionStore.

Considering the time and complexity involved, I went with the second option without hesitation.
