How to Build Better Alerts

Date: 2023-01-25 17:57:33

Background

One of the most common complaints developers bring to DBAs involves alerting, whether from third-party tools or from alerts built by other developers or DBAs in the environment. Building or using alerts for important applications, data layers, or processes within a SQL Server environment benefits everyone, but alerts become noisy if they are architected poorly or their purpose isn’t considered. In this article, we look at considerations for building effective alerts: ones that tell us when something is wrong without teaching us to disregard them. We want to respond when we need to, and not be on high alert when there is no issue.

Discussion

Imagine SQL Server writing to the error log every five seconds, “The database, DatabaseName, is online and good to go!” In one minute, we would have twelve messages telling us that our database was fine. Such an alert adds overhead by its very nature, such as the space to store it, and raises the question of why anything is being written to an error log when everything is running smoothly.

In the popular children’s fable, The Boy Who Cried Wolf, we see a caution about the problem with too many alerts, especially when some of them are false: they may slowly train us to ignore them, so that when something genuinely problematic occurs, we don’t respond. For example, in an OLTP environment, we may run a test transaction that processes identically to our application, but choose a very short timeout to alert us if it doesn’t finish in time. If that timeout is much shorter than our application’s, we may receive too many alerts indicating that the transaction is taking too long, possibly training us to ignore this alert.

Ideas:

In the case of alerts, fewer is better, especially in situations where we have limited time or resources to respond.

Alerts with only one possible next step should be eliminated entirely, and the next step automated. An example is a SQL Server Agent job that fails where the only solution to the failure is “Restart the job” (if the job is built exclusively for this solution). There is no reason to send an alert that this job failed if the only response is to restart it; simply automate the next step, which is restarting the job. If a DBA or developer wants to keep a record of failures, that’s one thing, but alerting when there is only one next step wastes time and resources and adds noise.
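
As a rough illustration of automating the single next step, here is a minimal Python sketch. The function name `run_with_restart`, the `max_attempts` limit, and the `alert` callback are all hypothetical and are not part of SQL Server; the idea is only that failures are recorded quietly, the job is restarted automatically, and an alert goes out only once restarting has stopped being a sufficient answer.

```python
import logging
from typing import Callable

logger = logging.getLogger("job_runner")


def run_with_restart(job: Callable[[], None],
                     max_attempts: int = 3,
                     alert: Callable[[str], None] = print) -> bool:
    """Run `job`, restarting it on failure instead of alerting.

    Every failure is logged so a record exists. An alert is sent only
    when all attempts have failed, because at that point "restart the
    job" is no longer the only next step.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            job()
            return True
        except Exception as exc:  # record the failure, then retry
            logger.warning("attempt %d/%d failed: %s",
                           attempt, max_attempts, exc)
    alert(f"job failed after {max_attempts} attempts; manual review needed")
    return False
```

A job that succeeds on its second or third attempt produces log entries but no alert, which keeps the record without the noise.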

There may be times when we want to log failures or errors, even if there is a known next step, in order to create a baseline. However, tracking baselines differs from alerting because we’re collecting data statistically before we assert that an error exists, or that an error is abnormal. Also, logging an error or issue doesn’t have to generate noise until we diagnose that a problem exists, or that one is about to begin. To put this in question form, suppose I asked DBAs who use replication how many replication errors they get each day on average, and they replied, “I don’t know.” Without that answer, they cannot tell whether two errors in one day is little to no issue or a major one. The type of error matters as well: for instance, a query timeout error may be of little consequence, whereas an error that the row cannot be found at the subscriber indicates a problem. Likewise, if DBCC CHECKDB fails, I want to know immediately.
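
One simple way to turn logged counts into a baseline is sketched below. It assumes a list of daily error counts has already been collected; the function name `is_abnormal` and the choice of "mean plus k standard deviations" as the cutoff are illustrative assumptions, not a method prescribed by the article.

```python
from statistics import mean, stdev
from typing import Sequence


def is_abnormal(history: Sequence[float], today: float, k: float = 3.0) -> bool:
    """Return True if today's error count is well above the baseline.

    `history` holds previously logged daily error counts. The cutoff
    (mean + k * sample standard deviation) is an assumed rule of thumb;
    a real environment may call for a different one.
    """
    if len(history) < 2:
        raise ValueError("need at least two days of history for a baseline")
    threshold = mean(history) + k * stdev(history)
    return today > threshold


# Hypothetical week of daily replication error counts.
history = [2, 3, 1, 2, 4, 3, 2]
print(is_abnormal(history, today=2))   # within the baseline
print(is_abnormal(history, today=12))  # far above the baseline
```

A count-based baseline like this suits errors that occur routinely; errors that always matter, such as a DBCC CHECKDB failure, would bypass it and alert immediately.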

This leads to the point that some alerts are better built around a threshold being crossed, such as a consecutive number of heartbeats failing as opposed to a single failed heartbeat (if the baseline shows that single failures happen normally). We can also apply this to a period of time, such as too many failed logins within a five-minute period instead of over a day, relative to what our baseline shows.
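
Both kinds of threshold can be sketched in a few lines of Python. The class names, the threshold of three heartbeats, and the limit and window length used for logins are hypothetical values chosen for illustration; in practice they would come from the baseline.

```python
from collections import deque


class ConsecutiveFailureAlert:
    """Fire once when `threshold` checks in a row have failed."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.streak = 0

    def record(self, ok: bool) -> bool:
        if ok:
            self.streak = 0  # any success resets the streak
            return False
        self.streak += 1
        # Fire exactly when the streak reaches the threshold, not on
        # every later failure, so one outage yields one alert.
        return self.streak == self.threshold


class WindowedCountAlert:
    """Fire while more than `limit` events fall inside a sliding window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()

    def record(self, timestamp: float) -> bool:
        self.events.append(timestamp)
        # Drop events that are older than the window.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()
        return len(self.events) > self.limit


heartbeat = ConsecutiveFailureAlert(threshold=3)
for ok in [False, True, False, False, False]:
    if heartbeat.record(ok):
        print("heartbeat: three consecutive failures")

logins = WindowedCountAlert(limit=2, window_seconds=300)
for t in [0, 100, 200, 1000]:
    if logins.record(t):
        print(f"failed logins: more than 2 within 300 seconds at t={t}")
```

The single failed heartbeat at the start never alerts, and the isolated failed login at t=1000 is ignored because the earlier events have aged out of the window.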

Some Useful Tips and Questions

Consider alerts that initially include the steps to solve the problem, and automate the solution later. With software, we often release in versions, such as 1.0, 1.1, 1.2, etc., and later versions can automate the solutions that earlier versions described in the email. Putting a summarized solution in the email also allows us to hand the task to junior or mid-level DBAs or developers, since it contains the steps for solving the problem.

Sometimes use noisy alerting to get a feel for an environment. When starting with a client, I will often use noisy alerting for the first month so that I learn the subtleties of the environment. This is my own preference and it allows me to learn faster; after a month, however, I will often build the tools based on what I’ve learned. The exception here is the client: I don’t want to make too much noise for them if they also want to be included on the alert. An example is a replication row count report where I initially see the count difference regardless of whether it’s 0, and later report only when the difference is greater than 0 and shouldn’t be.
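
The row count report described above might look something like the following sketch, where a `verbose` flag switches between the noisy first-month mode and the quieter later mode. The function name, the input shape, and the table names are made up for the example.

```python
from typing import Dict, List, Tuple


def row_count_report(counts: Dict[str, Tuple[int, int]],
                     verbose: bool = False) -> List[str]:
    """Build report lines from {table: (publisher_rows, subscriber_rows)}.

    verbose=True  lists every table, including those with no difference.
    verbose=False lists only tables whose counts differ.
    """
    lines = []
    for table, (pub, sub) in sorted(counts.items()):
        diff = pub - sub
        if verbose or diff != 0:
            lines.append(
                f"{table}: publisher={pub} subscriber={sub} diff={diff}")
    return lines


# Hypothetical counts gathered from a publisher and a subscriber.
counts = {"orders": (100, 100), "customers": (50, 48)}
print(row_count_report(counts, verbose=True))  # noisy: both tables
print(row_count_report(counts))                # quiet: only customers
```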

Consider using behavioral reporting, if you are familiar with behavioral statistics. Environments, servers, and SQL Server instances all have behavioral attributes that can be learned, and reported on when they fall outside of normal behavior. This is one of the most accurate and least noisy ways of reporting, though it does require understanding what threshold falls outside of normal. If you are unfamiliar with this approach, use some of the built-in SQL Server tools, such as restarting a job after it fails and alerting only after a number of failures, and build your tools in a manner that makes retries possible (if applicable) and alerting easy. For example, one of the easiest ways to report failure is to use a logging table that all failures are written to. In this manner, you can use the table to report on backup failures, job failures, checkpoint failures, DBCC failures, etc., and create one solution to report on the log, while using a similar approach for inserting the different failures into the table.

Consider a logging table with useful columns that allow for filtering, such as the part of the application involved, for alerting the right team, and a date and time, for retrieving the latest alerts.
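
A shared logging table of this kind can be sketched as follows. The example uses Python's built-in `sqlite3` module as a stand-in so that it runs anywhere; in a SQL Server environment the table would live in the instance itself. The table name `failure_log`, its columns, and the helper functions are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def create_log(conn: sqlite3.Connection) -> None:
    """Create one table that every kind of failure is written to."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS failure_log (
            id           INTEGER PRIMARY KEY,
            logged_at    TEXT NOT NULL,  -- ISO-8601 timestamp
            application  TEXT NOT NULL,  -- used to route to the right team
            failure_type TEXT NOT NULL,  -- e.g. backup, job, dbcc
            message      TEXT NOT NULL
        )""")


def log_failure(conn, application, failure_type, message, logged_at=None):
    """Insert a failure; the same call serves every failure type."""
    ts = logged_at or datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO failure_log (logged_at, application, failure_type, message) "
        "VALUES (?, ?, ?, ?)",
        (ts, application, failure_type, message))


def failures_since(conn, application, since):
    """Return one application's failures logged at or after `since`.

    ISO-8601 strings in the same format and timezone sort
    chronologically, so a plain string comparison is enough here.
    """
    cur = conn.execute(
        "SELECT logged_at, failure_type, message FROM failure_log "
        "WHERE application = ? AND logged_at >= ? ORDER BY logged_at",
        (application, since))
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
create_log(conn)
log_failure(conn, "billing", "backup", "backup failed", "2023-01-25T10:00:00+00:00")
log_failure(conn, "billing", "job", "nightly job failed", "2023-01-25T12:00:00+00:00")
log_failure(conn, "reporting", "dbcc", "DBCC CHECKDB failed", "2023-01-25T11:00:00+00:00")
print(failures_since(conn, "billing", "2023-01-25T11:00:00+00:00"))
```

Because every failure type goes through the same insert path, a single report query can serve every team, filtered by the application and timestamp columns.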

If there’s only one solution to a problem, such as a log file that grows too much and must be shrunk because the environment offers no alternative, then either automate that solution immediately or take a proactive approach to avoid the problem, such as requiring that developers batch transactions in this example. It makes no sense to report a failure when there’s only one solution to the problem, or when the problem could be prevented easily.

For summaries or assessments, like an ETL audit or a health check, consider a scheduled email or summary that you won’t ignore, sent at a time when you will stop and review it. While I find summaries helpful, and they can sometimes serve as a warning, it’s easy to turn them off if they arrive randomly or at a bad time. If built right, they provide a good overview of what’s happening, and they are important to keep an eye on because they are often useful in detecting problems.