From 300 qps to 3000 qps: Optimizing a Truck-Freight Matching Project for One Million Drivers

2017-08-23

Original URL: https://www.youyong.top/article/11599d0bcd56a

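Background: a truck-freight matching project with an initial estimate of one million drivers. The system must sustain 500 queries per second against the latest freight listings. At business peaks, about 100,000 drivers per minute (including many unregistered ones) query the latest freight orders online, which the plan treats as roughly 1500 qps. There are also many low-frequency operations, such as a few new freight orders per second.
Environment: the tests run on offline development machines, not in production.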

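Database server

CPU: Xeon, 4 cores, single-threaded
RAM: 32 GB
Disk: two 7200 rpm 1 TB SAS drives in RAID 1
OS: CentOS 6.4
NIC: 1000 Mbit/s
IP: 192.168.1.6

Application server

CPU: Xeon, 4 cores, single-threaded
RAM: 32 GB
Disk: two 7200 rpm 1 TB SAS drives in RAID 1
OS: CentOS 7.0
NIC: 1000 Mbit/s
IP: 192.168.1.10

Switch

H3C, 1000 Mbit/s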

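Initial technical plan

Given the company's current technical stack, the components available to us are nginx, apache, php, go, pgbouncer and pg (PostgreSQL). They can be combined into many different architectures. As a rule, as long as the business requirements are met, the simpler the architecture the better: the more complex it is, the more likely something goes wrong.

Clearly, though, the bottleneck of the whole system will be the communication between the various APIs and the database, and that is where optimization matters most.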

I. Component versions in the test environment

PostgreSQL 9.4.1, nginx-1.6.2, apache-2.2.31, php-5.5.38, pgbouncer-1.5.4, golang-1.7.3

1. PostgreSQL 9.4.1

Build and install

wget https://ftp.postgresql.org/pub/source/v9.4.1/postgresql-9.4.1.tar.gz
tar zxf postgresql-9.4.1.tar.gz
cd postgresql-9.4.1
./configure --prefix=/home/ad/pgsql
gmake -j 4
gmake install

Initialize the cluster

su postgres
/home/ad/pgsql/bin/initdb -D /home/ad/data -U postgres -E utf8 --locale=C -W

Configuration parameters (postgresql.conf)

listen_addresses = '*'
port = 9410
max_connections = 2000 
superuser_reserved_connections = 3

shared_buffers = 6048MB
work_mem = 2MB
maintenance_work_mem = 512MB
autovacuum_work_mem = 512MB

synchronous_commit = on
checkpoint_segments = 32
checkpoint_timeout = 5min
checkpoint_completion_target = 0.5
checkpoint_warning = 30s
wal_level = minimal # standalone server, no replication
effective_cache_size = 32GB

log_destination = 'csvlog'
logging_collector = on
log_min_duration_statement = 1000
log_line_prefix = '%a %r %u %d %p'
log_statement = 'ddl'
track_activity_query_size = 4096

autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 3

All other parameters are left at their defaults for now.

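2. apache-2.2.31

Build and install

./configure --prefix=/home/ad/apache-2.2.31 --enable-rewrite --enable-deflate \
--enable-expires --enable-headers --enable-modules=most --with-mpm=worker --enable-so
gmake -j 4
gmake install

The worker MPM is selected. Later commands refer to /home/ad/apache, so that path is assumed to point at this installation (for example through a symlink).

Configuration parameters (httpd.conf)

Listen 8812
# The next line is added automatically when PHP is built
LoadModule php5_module        modules/libphp5.so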
User apache
Group apache
ServerName 127.0.0.1:8812

<Directory />
    Options FollowSymLinks
    AllowOverride None
    Order deny,allow
    allow from all
</Directory>

<IfModule mime_module>
AddType application/x-httpd-php .php
</IfModule>

Include conf/extra/httpd-mpm.conf
Include conf/extra/httpd-default.conf

All other parameters are left at their defaults for now.

3. php-5.5.38

Build and install

wget http://cn2.php.net/get/php-5.5.38.tar.gz/from/this/mirror
mv mirror php-5.5.38.tar.gz
tar zxf php-5.5.38.tar.gz
cd php-5.5.38
./configure --prefix=/home/ad/php --with-apxs2=/home/ad/apache/bin/apxs \
--with-pgsql=/home/ad/pgsql --enable-fpm
gmake -j 4
gmake install

Configuration parameters (php.ini)

error_reporting = E_ALL & ~E_NOTICE
track_errors = Off
max_execution_time = 30
memory_limit = 128M
display_errors = On ; not recommended in production
log_errors = On

4. nginx-1.6.2

Build and install

wget http://nginx.org/download/nginx-1.6.2.tar.gz
tar zxf nginx-1.6.2.tar.gz
cd nginx-1.6.2
./configure --prefix=/home/ad/nginx --with-http_ssl_module --with-http_spdy_module \
--with-http_stub_status_module --with-pcre
gmake -j 4
gmake install

Configuration parameters (nginx.conf)

user nginx nginx;
worker_processes  4;
worker_cpu_affinity 0001 0010 0100 1000;                      
worker_rlimit_nofile 65535;
error_log  logs/error.log;
pid        logs/nginx.pid;

events {
    use epoll;
    worker_connections  65535;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    upstream openapi {
        server 192.168.1.10:8812;
    }
    server {
        listen       6001;
        server_name  192.168.1.10;
        location / {
            proxy_pass http://openapi;           
        }
        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }
    }
}

5. pgbouncer-1.5.4

The pgbouncer connection pool is also installed on the application server.

Build and install

pgbouncer depends on libevent.

wget https://github.com/downloads/libevent/libevent/libevent-2.0.19-stable.tar.gz
tar zxf libevent-2.0.19-stable.tar.gz
cd libevent-2.0.19-stable
./configure --prefix=/home/ad/libevent
gmake -j 4
gmake install

# register the libevent shared libraries
echo '/home/ad/libevent/lib' > /etc/ld.so.conf.d/libevent.conf
ldconfig

wget http://pgfoundry.org/frs/download.php/3393/pgbouncer-1.5.4.tar.gz
tar zxf pgbouncer-1.5.4.tar.gz
cd pgbouncer-1.5.4
./configure --prefix=/home/ad/pgbouncer/ --with-libevent=/home/ad/libevent/
gmake -j 4
gmake install

Configuration parameters

Runtime configuration file /home/ad/pgbouncer/pgbouncer.ini

[databases]
car_goods_matching = host=192.168.1.6 port=9410 dbname=car_goods_matching user=postgres password=pgsql
[pgbouncer]
listen_port = 9410
listen_addr = *
auth_type = md5
auth_file = /home/ad/pgbouncer/user.txt
logfile = /home/ad/pgbouncer/pgbouncer.log
pidfile = /home/ad/pgbouncer/pgbouncer.pid
admin_users = pgb_admin
pool_mode = session
max_client_conn = 65535
default_pool_size = 1024
client_idle_timeout = 60
idle_transaction_timeout = 30

User and password file /home/ad/pgbouncer/user.txt

"pgb_admin" "pgsql"

6. golang-1.7.3

Installed directly from the package repository

yum install golang -y

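II. Test data and scripts

1. Database setup

-- create the test database
create database car_goods_matching;
-- create the users table and generate one million users with randomly distributed registration times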

CREATE TABLE public.users
(
    id serial not null unique,
    nickname text not null,
    mobileno text not null unique,
    createtime timestamp not null default now()
);

COMMENT ON TABLE public.users IS 'users';
COMMENT ON COLUMN public.users.id IS 'id';
COMMENT ON COLUMN public.users.nickname IS 'nickname';
COMMENT ON COLUMN public.users.mobileno IS 'mobile number';
COMMENT ON COLUMN public.users.createtime IS 'registration time';

-- generate the user data

INSERT INTO public.users(nickname,mobileno,createtime)
SELECT t::text, t::text,
       '2015-12-01'::timestamp + (random()*31449600)::integer * interval '1 seconds'
FROM generate_series(13800000001,13801000000) AS t;

-- create the index

CREATE INDEX users_createtime_idx ON users USING BTREE(createtime);

-- create the orders table and generate roughly 10 million orders with randomly distributed order times

CREATE TABLE orders
(
    id serial not null unique,
    createtime timestamp not null default now(),
    users_id integer not null,
    goods_name text not null
);

COMMENT ON TABLE public.orders IS 'freight capacity demand orders';
COMMENT ON COLUMN public.orders.id IS 'id';
COMMENT ON COLUMN public.orders.createtime IS 'order time';
COMMENT ON COLUMN public.orders.users_id IS 'user id';
COMMENT ON COLUMN public.orders.goods_name IS 'freight name';

-- function that generates orders, 30,000 per day

CREATE OR REPLACE FUNCTION orders_make(a_date date) RETURNS TEXT AS
$$
BEGIN
    INSERT INTO orders(users_id,goods_name,createtime)
    SELECT
        users.id,
        md5(random()::Text),
        a_date::timestamp + (random()*86399)::integer * interval '1 seconds'
    FROM 
        users
    WHERE
        users.createtime < a_date
    ORDER BY 
        random()
    LIMIT 30000
    ;
    RETURN 'OK';
END;
$$
LANGUAGE PLPGSQL;

COMMENT ON FUNCTION orders_make(a_date date) IS 'generate order data';

-- call the function to generate the data; this takes a while, so be patient

SELECT orders_make('2015-12-01'::date + t) FROM generate_series(0,365) as t;

-- renumber the generated rows so that ids follow the order time

create sequence order_id_seq_tmp;
copy (select nextval('order_id_seq_tmp'),createtime,users_id,goods_name from orders
order by createtime) to '/home/pg/orders.txt';
truncate table orders;
copy orders from '/home/pg/orders.txt';
SELECT SETVAL('orders_id_seq',(select id FROM orders ORDER BY id DESC LIMIT 1));
DROP sequence order_id_seq_tmp;

-- create the indexes

CREATE INDEX orders_createtime_idx ON orders USING BTREE(createtime);
CREATE INDEX orders_users_id_idx ON orders USING BTREE(users_id);

-- remember to refresh the statistics once the data is generated

vacuum analyze;

2. Test programs

-- hello.php, used to calibrate raw performance

<?php
echo  "hello world!!";
?>

-- goodslist.php, the API endpoint that returns the order list

<?php
// Return TRUE when $val is an integer string. An empty string counts as
// valid only when $empty_is_true is set.
function isint($val, $empty_is_true = FALSE)
{
    $val = strval($val);
    if ($val === "") {
        return $empty_is_true;
    }
    return strval(intval($val)) === $val;
}

// Encode the response as JSON, keeping non-ASCII text readable on PHP >= 5.4
function openapi_json_encode($data)
{
    if (version_compare(PHP_VERSION, '5.4.0', '>=')) {
        return json_encode($data, JSON_UNESCAPED_UNICODE);
    }
    return json_encode($data);
}

// Offset: fall back to 0 unless a non-negative integer was supplied
$offset = isset($_GET['offset']) ? $_GET['offset'] : "";
if (!isint($offset) || intval($offset) < 0) {
    $offset = "0";
}

if (isset($_GET['pool']) && $_GET['pool'] == "1") {
    // Through the pgbouncer connection pool on the application server
    $conn = pg_connect("host=192.168.1.10 port=9410 dbname=car_goods_matching user=pgb_admin password=pgsql");
} else {
    // Direct connection to PostgreSQL
    $conn = pg_connect("host=192.168.1.6 port=9410 dbname=car_goods_matching user=postgres password=pgsql");
}

$sql = "SELECT
          orders.id,orders.createtime,
          orders.goods_name,
          users.nickname,users.mobileno
      FROM
          orders
          INNER JOIN users ON orders.users_id=users.id
      ORDER BY
          orders.id DESC
      OFFSET
          " . $offset . "
      LIMIT 10
      ";
$sql_result = pg_query($conn, $sql);
$data['return_code'] = "FAIL";
$data['data'] = pg_fetch_all($sql_result);
if ($data['data']) {
    $data['return_code'] = "SUCCESS";
}
echo openapi_json_encode($data);
pg_close($conn);
?>

III. Performance baseline

1. apache+php

Short-lived connections

Configure apache with keep-alive turned off

vim /home/ad/apache/conf/extra/httpd-default.conf
KeepAlive Off
MaxKeepAliveRequests 100
KeepAliveTimeout 5

Restart the service

/home/ad/apache/bin/apachectl restart

Run the wrk tool from another machine

wrk -t4 -c8 -d60s http://192.168.1.10:8812/hello.php
Running 1m test @ http://192.168.1.10:8812/hello.php
  4 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   432.30us  768.90us  36.84ms   99.35%
    Req/Sec     3.51k   186.91     4.02k    85.21%
  837934 requests in 1.00m, 135.05MB read
Requests/sec:  13942.51
Transfer/sec:      2.25MB

Server resource usage was checked with atop.

HTTP connections left in TIME_WAIT

netstat -n  | grep -i TIME_WAIT | wc -l
6000

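Keep-alive connections

Partial keep-alive

vim /home/ad/apache/conf/extra/httpd-default.conf
KeepAlive On
# close the connection after it has served 100 requests
MaxKeepAliveRequests 100
# close the connection after it has been idle for 5 seconds
KeepAliveTimeout 5

Run the wrk test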

wrk -t4 -c8 -d60s http://192.168.1.10:8812/hello.php  
Running 1m test @ http://192.168.1.10:8812/hello.php
  4 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.96ms   21.25ms 444.92ms   99.33%
    Req/Sec     5.91k   748.39     7.90k    80.88%
  1405959 requests in 1.00m, 201.38MB read
Requests/sec:  23393.64
Transfer/sec:      3.35MB

Server resource usage was checked with atop.

Full keep-alive

vim /home/ad/apache/conf/extra/httpd-default.conf
KeepAlive On
MaxKeepAliveRequests 1000000
KeepAliveTimeout 60

Run the wrk test

wrk -t4 -c8 -d60s http://192.168.1.10:8812/hello.php
Running 1m test @ http://192.168.1.10:8812/hello.php
  4 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   356.51us  722.91us  37.48ms   99.30%
    Req/Sec     6.34k   315.97     6.91k    93.21%
  1513403 requests in 1.00m, 216.49MB read
Requests/sec:  25213.26
Transfer/sec:      3.61MB

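Server resource usage was checked with atop.

These results show that under high concurrency and heavy load, reusing connections nearly doubles throughput (about 13,900 versus 25,200 requests per second). The short-connection test also left a large number of sockets in TIME_WAIT. A socket enters TIME_WAIT on whichever side closes the connection first and stays there for a fixed period (60 seconds on Linux) before its local port can be used again for the same destination. When connections are opened and closed continuously at a high rate, these sockets pile up and can exhaust the local port range. The following kernel parameters make local ports go further:

net.ipv4.tcp_syncookies = 1 # enable SYN cookies, which protect against small SYN floods

net.ipv4.tcp_tw_reuse = 1 # allow TIME-WAIT sockets to be reused for new outbound TCP connections (default 0)

net.ipv4.tcp_tw_recycle = 1 # enable fast recycling of TIME-WAIT sockets (default 0); it breaks clients behind NAT and was removed in Linux 4.12, so avoid it on current systems

net.ipv4.ip_local_port_range = 1024 65535 # widen the local port range (the default starts at 32768)

These parameters treat the symptom rather than the cause; in the end the API's own processing capacity has to improve. Also, for dynamic API endpoints where clients talk to apache directly, real-world traffic is almost entirely short connections, so about 13,000 requests per second is the practical ceiling for apache+php on this hardware.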

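2. nginx+apache+php

Short connections between nginx and apache

Next, let's see how nginx+apache+php uses up the local ports when nginx talks to apache over short connections. The nginx, apache and php configuration is the one shown above.

Stress-test nginx with wrk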

wrk -t4 -c8 -d60s http://192.168.1.10:6001/hello.php
Running 1m test @ http://192.168.1.10:6001/hello.php
  4 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    13.02ms   89.80ms   1.00s    97.80%
    Req/Sec     2.24k     1.00k    3.20k    82.52%
  112423 requests in 1.00m, 17.06MB read
  Socket errors: connect 0, read 0, write 76, timeout 41
  Non-2xx or 3xx responses: 41
Requests/sec:   1871.76
Transfer/sec:    290.91KB

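Server resource usage was checked with atop.

Halfway through the test, open another terminal and check whether the server can still accept requests. The connection times out, yet top shows almost no load on the system.

curl http://192.168.1.10:6001/
curl: (7) Failed connect to 192.168.1.10:6001; Connection timed out

With only 1871 requests per second and timeouts on top of that, it is tempting to blame nginx. nginx is not the problem. It opens a new connection to apache for every proxied request, far faster than apache running php can work through them, so local ports stay tied up until none are left. One adjustment brings the throughput back up: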

Keep-alive connections between nginx and apache

upstream openapi {
    server 192.168.1.10:8812;
    keepalive 1000000;
}
server {
    listen       6001;
    server_name  192.168.1.10;
    location / {
        proxy_pass http://openapi;           
        proxy_http_version 1.1;    # nginx proxies over HTTP/1.0 by default; upstream keep-alive needs HTTP/1.1
        proxy_set_header Connection "";
    }
    error_page   500 502 503 504  /50x.html;
    location = /50x.html {
        root   html;
    }
}

wrk -t4 -c8 -d60s http://192.168.1.10:6001/hello.php
Running 1m test @ http://192.168.1.10:6001/hello.php
  4 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.07ms    9.61ms 258.20ms   99.47%
    Req/Sec     4.07k   267.78     5.77k    88.93%
  969902 requests in 1.00m, 147.02MB read
Requests/sec:  16138.11
Transfer/sec:      2.45MB

Server resource usage was checked with atop.

Keep-alive between the client and nginx, and between nginx and apache

wrk -t4 -c8 -d60s http://192.168.1.10:6001/hello.php
Running 1m test @ http://192.168.1.10:6001/hello.php
  4 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   627.20us    2.94ms 115.11ms   99.32%
    Req/Sec     4.18k   352.11     6.90k    81.97%
  998513 requests in 1.00m, 151.41MB read
Requests/sec:  16614.20
Transfer/sec:      2.52MB

The numbers barely change, but the TIME_WAIT count drops sharply.

netstat -n | grep -i TIME_WAIT|wc -l
97

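Because nginx and apache run on the same node and compete for CPU, throughput drops. On separate machines the overhead of the nginx proxy is actually very small, as the next test shows.

Keep-alive between nginx and apache (different nodes)

nginx runs on the application server

apache runs on the database server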

wrk -t4 -c8 -d60s http://192.168.1.10:6001/hello.php
Running 1m test @ http://192.168.1.10:6001/hello.php
  4 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   433.26us    1.32ms  54.39ms   99.11%
    Req/Sec     5.92k   561.30     6.53k    90.67%
  1414121 requests in 1.00m, 213.08MB read
Requests/sec:  23567.56
Transfer/sec:      3.55MB

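#	Architecture	Requests/s	sys CPU	sys CPU share	user CPU	user CPU share	TIME_WAIT sockets
1	apache, short connections	12392	99	24.75%	149	37.25%	6000
2	apache, keep-alive	24186	110	27.50%	245	61.25%	2500
3	apache, full keep-alive	25213	110	27.50%	227	56.75%	8
4	nginx+apache, short connections	1871	2	0.50%	1	0.25%	6000
5	nginx+apache, keep-alive	16138	115	28.75%	183	45.75%	4800
6	nginx keep-alive+apache keep-alive	16614	123	30.75%	196	49.00%	97

Based on these figures, scheme 5 is recommended for simple, high-concurrency API endpoints; when web pages and APIs are served together, use scheme 6.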

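IV. Business tests

To get closer to real traffic, these tests use http_load. It runs many parallel fetches from a single process, connects to nginx without keep-alive, and can cap the number of requests started per second. Some notes specific to http_load:

A single process is limited to 1000 requests per second

With -r 334 (just over one third of 1000), it actually reaches 500 requests per second

With -r 501 (just over one half of 1000), the CPU of the test client is saturated

So a 500 requests/s run uses -r 334, a 1000 requests/s run uses two http_load processes with -r 334 each, and so on.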
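One plausible explanation for these observations, offered here as an assumption rather than a statement about http_load's actual source, is that the tool starts one request per timer tick and that the tick period is 1000/rate truncated to whole milliseconds. The hypothetical effectiveRate function below models only that arithmetic.

```go
package main

import "fmt"

// effectiveRate models a scheduler that starts one request per timer tick,
// where the tick period is 1000/requested truncated to whole milliseconds.
// requested must be positive.
func effectiveRate(requested int) int {
	periodMs := 1000 / requested // integer division drops the fraction
	if periodMs < 1 {
		periodMs = 1 // a 1 ms tick is the fastest this model allows
	}
	return 1000 / periodMs
}

func main() {
	for _, r := range []int{100, 334, 500, 501, 1000} {
		fmt.Printf("-r %4d -> about %4d requests/s\n", r, effectiveRate(r))
	}
}
```

In this model -r 334 yields a 2 ms tick (500 requests/s) and -r 501 yields a 1 ms tick (1000 requests/s, the per-process ceiling), which matches the behaviour described above.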

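Simulate 100 requests per second for 300 seconds

http_load -r 100 -s 300 /home/ad/urls.txt

Simulate 500 requests per second for 300 seconds

http_load -r 334 -s 300 /home/ad/urls.txt

The test tool itself uses a fair amount of CPU, so it should run on a separate host for more accurate results.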

1. nginx+apache+php+pgsql

[root@cqs ~]# echo 'http://192.168.1.10:6001/goodslist.php?pool=0&offset=0' > /home/ad/urls.txt

Results

100 requests per second

[root@cqs ~]#  http_load -r 100 -s 300 /home/ad/urls.txt  
29999 fetches, 96 max parallel, 4.42485e+07 bytes, in 300 seconds
1475 mean bytes/connection
99.9967 fetches/sec, 147495 bytes/sec
msecs/connect: 0.471398 mean, 2.057 max, 0.23 min
msecs/first-response: 12.5897 mean, 992.857 max, 7.751 min
HTTP response codes:
  code 200 -- 29999

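The other runs are omitted; the results are compared in the tables below.

Database server load

Requests/s	sys CPU	sys CPU share	user CPU	user CPU share	total CPU	total share
100	37	9.25%	35	8.75%	72	18.00%
200	73	18.25%	70	17.50%	143	35.75%
300	115	28.75%	104	26.00%	219	54.75%
400	148	37.00%	140	35.00%	288	72.00%

Application server load

Requests/s	TIME_WAIT sockets	load average	502 responses	502 share	local ports
100	6000	0.2	0		
200	6000	0.2	0		
300	6000	0.3	0		
400	6000	0.5	20430	17.03%	exhausted

These figures show that at 400 qps the database server becomes unstable and the application server runs out of local ports. Comparing the numbers, sys CPU on the database server is even higher than user CPU. The cause is the same as in the application server baseline tests: under high concurrency, constantly opening and closing connections burns CPU. Once the application server's local ports are used up, HTTP requests fail. The application needs a way to reuse its connections to pg so that the database server's CPU load drops, which is what the pgbouncer connection pool is for.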
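The share columns in the database server table follow from the raw CPU figures: a value of 100 is one fully busy core, and both test machines have 4 cores, so the share is the figure divided by 400. A minimal sketch of that conversion (the name cpuShare is made up for illustration):

```go
package main

import "fmt"

// cpuShare converts an atop-style CPU figure (100 = one fully busy core)
// into a percentage of the whole machine.
func cpuShare(usage float64, cores int) float64 {
	return usage / float64(cores*100) * 100
}

func main() {
	const cores = 4 // both test servers have 4 cores

	// sys and user figures at 400 requests/s, taken from the table above.
	sys, user := 148.0, 140.0
	fmt.Printf("sys   %.2f%%\n", cpuShare(sys, cores))
	fmt.Printf("user  %.2f%%\n", cpuShare(user, cores))
	fmt.Printf("total %.2f%%\n", cpuShare(sys+user, cores))
}
```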

First, how to check whether pg connections are being reused: use tcpdump to capture the local port used to reach the database in two separate requests. If the port is the same, the connection was reused.

On the database server, run the following in a terminal

tcpdump -n -i em1 dst 192.168.1.6

On the test machine, run the following command twice in a row (direct connection, no pool)

curl 'http://192.168.1.10:6001/goodslist.php?pool=0&offset=0'

In the tcpdump terminal you will see output like the following, with different local ports

17:11:01.816414 IP 192.168.1.10.61048 > 192.168.1.6.pyrrho: Flags [.], ack 1505, win 37, length 0
17:11:03.250386 IP 192.168.1.10.61053 > 192.168.1.6.pyrrho: Flags [S], seq 724980759, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 9], length 0
17:11:03.250594 IP 192.168.1.10.61053 > 192.168.1.6.pyrrho: Flags [.], ack 3440127706, win 29, length 0

On the test machine, run the following command twice in a row (through the connection pool)

curl 'http://192.168.1.10:6001/goodslist.php?pool=1&offset=0'

In the tcpdump terminal you will see output like the following, with the same local port

17:14:35.184299 IP 192.168.1.10.61068 > 192.168.1.6.pyrrho: Flags [P.], seq 92:449, ack 332, win 31, length 357
17:14:35.189180 IP 192.168.1.10.61068 > 192.168.1.6.pyrrho: Flags [P.], seq 449:466, ack 1504, win 37, length 17
17:14:35.229312 IP 192.168.1.10.61068 > 192.168.1.6.pyrrho: Flags [.], ack 1584, win 37, length 0

2. nginx+apache+php+pgbouncer+pgsql

[root@cqs ~]# echo 'http://192.168.1.10:6001/goodslist.php?pool=1&offset=0' > /home/ad/urls.txt

100 requests per second

[root@cqs ~]#  http_load -p 100 -r 100 -s 300 /home/ad/urls.txt   
29999 fetches, 13 max parallel, 4.42485e+07 bytes, in 300.001 seconds
1475 mean bytes/connection
99.9962 fetches/sec, 147494 bytes/sec
msecs/connect: 0.546002 mean, 5.094 max, 0.244 min
msecs/first-response: 2.80047 mean, 129.88 max, 2.012 min
HTTP response codes:
  code 200 -- 29999

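The other runs are omitted; the results are compared in the tables below.

Database server load

Requests/s	sys CPU	sys CPU share	user CPU	user CPU share	total CPU	total share
100	1	0.25%	9	2.25%	10	2.50%
200	2	0.50%	20	5.00%	22	5.50%
300	2	0.50%	32	8.00%	34	8.50%
400	3	0.75%	49	12.25%	52	13.00%

Application server load

Requests/s	TIME_WAIT sockets	load average	502 responses	502 share	local ports
100	6000	0.1	0		
200	6000	0.2	0		
300	6000	0.4	0		
400	6000	0.5	18487	15.41%	exhausted

At 400 qps the load on the database server is now very low, but communication between the application and the connection pool is still too inefficient: the application server's local ports are used up and the HTTP service starts refusing connections. With this stack the bottleneck is the link between the application and the pool, so the only option is to scale out with more application server nodes.

Let's see whether two application server nodes improve things.

3. nginx load-balancing two nodes (apache+php)+pgbouncer+pgsql

First install apache, php and pgbouncer on the database node as well.

Then change nginx.conf on the application node as follows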

upstream openapi{    
    server 192.168.1.10:8812 weight=6;
    server 192.168.1.6:8812 weight=4;
    keepalive 1000000;
    }

Reload nginx

/home/ad/nginx/sbin/nginx -s reload

Start the test

[root@cqs ~]# echo 'http://192.168.1.10:6001/goodslist.php?pool=1&offset=0' > /home/ad/urls.txt

100 requests per second

[root@cqs ~]# http_load -r 100 -s 300 /home/ad/urls.txt 
29999 fetches, 38 max parallel, 4.42485e+07 bytes, in 300 seconds
1475 mean bytes/connection
99.9967 fetches/sec, 147495 bytes/sec
msecs/connect: 0.681952 mean, 11.494 max, 0.258 min
msecs/first-response: 3.68105 mean, 403.703 max, 2.131 min
HTTP response codes:
  code 200 -- 29999

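The other runs are omitted; the results are compared in the tables below.

Database server load

Requests/s	sys CPU	sys CPU share	user CPU	user CPU share	total CPU	total share
100	4	1.00%	12	3.00%	16	4.00%
200	6	1.50%	23	5.75%	29	7.25%
300	9	2.25%	34	8.50%	43	10.75%
400	13	3.25%	48	12.00%	61	15.25%
500	15	3.75%	60	15.00%	75	18.75%
600	20	5.00%	78	19.50%	98	24.50%

Application server load

Requests/s	TIME_WAIT sockets	load average	502 responses	502 share	local ports
100	4800	0.2	0		
200	6000	0.2	0		
300	6000	0.3	0		
400	6000	0.4	0		
500	6000	0.6	0		
600	6000	0.6	2725	1.51%	exhausted

After scaling out the application servers, the sustainable request rate went up as well, and the load on the database server stayed within an acceptable range. This meets the initial business requirement but still cannot handle the peak. To handle more concurrency, one option is to keep adding application servers; the other is to make the communication between the application and pg more efficient. Next we implement the service in Go to see whether that helps.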

4. Go-based backend service

The Go server program

car_goods_matching.go 

package main

import (
    "fmt"   
    "github.com/valyala/fasthttp"
    "github.com/jackc/pgx"
    "os"
    "strconv"
    "encoding/json"
)

var pool *pgx.ConnPool


type Row struct {
    Id int `json:"id"`
    Createtime string `json:"createtime"`
    Goods_name string `json:"goods_name"`
    Nickname string `json:"nickname"`
    Mobileno string `json:"mobileno"`
}


type Data struct {
    Return_code string `json:"return_code"`
    Rows []Row `json:"data"`
}


func httpHandle(w *fasthttp.RequestCtx) {
    // Accept only a non-negative integer offset so that nothing else can
    // reach the SQL text; anything invalid falls back to 0.
    offset, err := strconv.Atoi(string(w.QueryArgs().Peek("offset")))
    if err != nil || offset < 0 {
        offset = 0
    }

    sql := `
    SELECT 
          orders.id,orders.createtime::text,
          orders.goods_name,
          users.nickname,users.mobileno 
    FROM
          orders 
          INNER JOIN users ON orders.users_id=users.id
    ORDER BY 
          orders.id DESC
    OFFSET
          ` + strconv.Itoa(offset) + `
    LIMIT 10
    `

    // Borrow a pooled connection; no new database connection is opened here.
    rows, err := pool.Query(sql)
    checkErr(err)
    defer rows.Close()
    w.SetContentType("application/json")

    data := Data{Return_code: "FAIL", Rows: make([]Row, 0)}

    for rows.Next() {
        var row Row
        err = rows.Scan(&row.Id, &row.Createtime, &row.Goods_name, &row.Nickname, &row.Mobileno)
        checkErr(err)
        data.Rows = append(data.Rows, row)
    }

    if len(data.Rows) > 0 {
        data.Return_code = "SUCCESS"
    }

    ret, err := json.Marshal(data)
    checkErr(err)
    w.Write(ret)
}

func main() {
    // The pool size is passed as the first command-line argument.
    if len(os.Args) < 2 {
        fmt.Println("usage: car_goods_matching <pool size>")
        os.Exit(1)
    }
    poolnum, err := strconv.Atoi(os.Args[1])
    checkErr(err)
    connPoolConfig := pgx.ConnPoolConfig{
        ConnConfig: pgx.ConnConfig{
            Host:     "192.168.1.6",
            User:     "postgres",
            Password: "pgsql",
            Database: "car_goods_matching",
            Port:     9410,
        },
        MaxConnections: poolnum,
    }
    
    pool, err = pgx.NewConnPool(connPoolConfig)
    checkErr(err)
    
    if err := fasthttp.ListenAndServe("0.0.0.0:8091", httpHandle); err != nil {
        fmt.Println("start fasthttp fail:", err.Error())
    }
}

func checkErr(err error) {    
    if err != nil {
        panic(err)        
    }
}

Start the service with a pool of 8 connections

go run car_goods_matching.go 8

Start the test

[root@cqs ~]# echo 'http://192.168.1.10:8091/?offset=0' >/home/ad/urls.txt

100 requests per second

[root@cqs ~]# http_load -r 100 -s 300 /home/ad/urls.txt 
29999 fetches, 6 max parallel, 4.57185e+07 bytes, in 300 seconds
1524 mean bytes/connection
99.9966 fetches/sec, 152395 bytes/sec
msecs/connect: 0.601967 mean, 7.028 max, 0.245 min
msecs/first-response: 2.2164 mean, 64.464 max, 1.707 min
HTTP response codes:
  code 200 -- 29999

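The other runs are omitted; the results are compared in the tables below.

Database server load

Requests/s	sys CPU	sys CPU share	user CPU	user CPU share	total CPU	total share
100	1	0.25%	8	2.00%	9	2.25%
200	2	0.50%	18	4.50%	20	5.00%
300	2	0.50%	26	6.50%	28	7.00%
400	2	0.50%	35	8.75%	37	9.25%
500	3	0.75%	42	10.50%	45	11.25%
600	3	0.75%	52	13.00%	55	13.75%
1000	8	2.00%	71	17.75%	79	19.75%
1500	8	2.00%	118	29.50%	126	31.50%
2000	11	2.75%	145	36.25%	156	39.00%
3000	17	4.25%	214	53.50%	231	57.75%

Application server load

Requests/s	TIME_WAIT sockets	load average	502 responses	502 share	sockets
100	5200	0.1	0		
200	6000	0.1	0		
300	6000	0.1	0		
400	6000	0.2	0		
500	6000	0.3	0		
600	6000	0.3	0		
1000	6000	0.3	0		
1500	6000	0.3	0		
2000	6000	0.6	0		
3000	6000	0.6	0		

These figures show that the Go server does not have to hand each request to a separate application process (as nginx and apache do with php), nor open a new database connection per request, so its sys CPU usage is extremely low. As a compiled program it also executes efficiently. At 3000 qps the database CPU load is only just over half. A Go server is therefore a very good fit for high-concurrency services with simple business logic.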
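The idea of a fixed set of long-lived connections shared by all request handlers can be sketched with a buffered channel. This is a simplified model for illustration only: the conn and pool types are hypothetical, no real database is contacted, and pgx's ConnPool is implemented differently.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// conn stands in for a real database connection (hypothetical type).
type conn struct{ id int64 }

// pool hands out a fixed set of long-lived connections.
type pool struct {
	idle   chan *conn
	dialed int64 // how many connections were ever opened
}

func newPool(size int) *pool {
	p := &pool{idle: make(chan *conn, size)}
	for i := 0; i < size; i++ {
		// "Dial" once at startup; a real pool would open a TCP connection here.
		p.idle <- &conn{id: atomic.AddInt64(&p.dialed, 1)}
	}
	return p
}

// acquire blocks until a connection is free.
func (p *pool) acquire() *conn { return <-p.idle }

// release returns the connection so the next request can reuse it.
func (p *pool) release(c *conn) { p.idle <- c }

// serve simulates n concurrent requests sharing the pool and returns how
// many connections were opened in total.
func serve(p *pool, n int) int64 {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c := p.acquire()
			// ... a real handler would run its query on c here ...
			p.release(c)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&p.dialed)
}

func main() {
	p := newPool(8) // same size as the "go run car_goods_matching.go 8" run
	fmt.Println("connections opened for 3000 requests:", serve(p, 3000))
}
```

However many requests arrive, the number of connections ever opened stays at the pool size, so the database never pays the connect and disconnect cost that dominated sys CPU in the direct PHP test.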

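To present a friendlier external interface with a single API entry point, we put nginx in front as the web entry and let it proxy to the various applications. Let's check that Go behind the nginx proxy still performs well enough.

5. nginx+go backend service

nginx configuration; remember to use keep-alive connections between nginx and Go

# add an openapi_goodslist upstream group; the final nginx.conf is as follows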

user nginx nginx;
worker_processes  4;
worker_cpu_affinity 0001 0010 0100 1000;                      
worker_rlimit_nofile 65535;
error_log  logs/error.log;
pid        logs/nginx.pid;

events {
    use epoll;
    worker_connections  65535;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    upstream openapi {
        #ip_hash;
        server 192.168.1.10:8812;
        # other nodes
        keepalive 1000000;
    }
    # goodslist API application
    upstream openapi_goodslist {
        server 192.168.1.10:8091;
        # other nodes
        keepalive 1000000;
    }

    server {
        listen       6001;
        server_name  192.168.1.10;
        # request routing
        location /openapi/goodslist {
            proxy_pass http://openapi_goodslist;
            proxy_http_version 1.1;    
            proxy_set_header Connection "";
        }
        location / {
            proxy_pass http://openapi;           
            proxy_http_version 1.1;    
            proxy_set_header Connection "";
        }
        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }
    }
}

The results are compared in the tables below.

Database server load

Requests/s	sys CPU	sys CPU share	user CPU	user CPU share	total CPU	total share
100	1	0.25%	8	2.00%	9	2.25%
200	2	0.50%	18	4.50%	20	5.00%
300	2	0.50%	26	6.50%	28	7.00%
400	2	0.50%	35	8.75%	37	9.25%
500	3	0.75%	42	10.50%	45	11.25%
600	3	0.75%	52	13.00%	55	13.75%
1000	9	2.25%	68	17.00%	77	19.25%
1500	10	2.50%	112	28.00%	122	30.50%
2000	13	3.25%	141	35.25%	154	38.50%
3000	19	4.75%	210	52.50%	229	57.25%

Application server load

Requests/s	TIME_WAIT sockets	load average	502 responses	502 share	sockets
100	5200	0.1	0		
200	6000	0.1	0		
300	6000	0.1	0		
400	6000	0.2	0		
500	6000	0.3	0		
600	6000	0.4	0		
1000	6000	0.4	0		
1500	6000	0.4	0		
2000	6000	0.8	0		
3000	6000	0.8	0		

Adding the nginx proxy has no effect on database performance; it only raises CPU usage on the application server slightly. That overhead is acceptable, and nginx can also be moved to a different node from the application service.

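V. Comparison of results

#	Architecture	Peak qps	DB server CPU at 300 requests/s
1	nginx+apache+php+pgsql	300	219
2	nginx+apache+php+pgbouncer+pgsql	300	34
3	nginx+apache+php+pgbouncer+pgsql, two nodes	500	43
4	Go-based implementation	3000	28
5	nginx+go	3000	28

Chart: peak qps per architecture (higher is better)

Chart: CPU load at 300 qps per architecture (lower is better)

The above is one way to optimize the communication architecture of high-concurrency endpoints with very simple business logic. Which architecture to build your own application on depends on the specific workload; mix and match as needed, and remember to factor in operational cost.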
