
Journal of High Speed Networks 23 (2017) 49–57
DOI 10.3233/JHS-170556, IOS Press

Performance analysis of clustering algorithm under two kinds of big data architecture

Beibei Li a, Bo Liu a, Weiwei Lin b,* and Ying Zhang a
a School of Computer, South China Normal University, Guangzhou, China
E-mails: 1538339980@qq.com, liugubin@126.com, 2283240387@qq.com
b School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
E-mail: linww@scut.edu.cn

Abstract. To compare the performance of clustering on two data processing architectures, this paper first presents implementations of the k-means clustering algorithm on two big data architectures. We then analyze the theoretical performance of k-means on the two architectures from a mathematical point of view. The analysis shows that the Spark architecture is superior to Hadoop in terms of average execution time and I/O time. Finally, a text data set of social networking users' behaviors is used for algorithm experiments. The results show that, for k-means, Spark requires significantly less execution time and I/O time than MapReduce. The theoretical analysis and the big data algorithm implementation techniques proposed in this paper are a useful reference for the application of big data technology.

Keywords: Hadoop, MapReduce, Spark, clustering algorithm, big data, k-means

1. Introduction

With the coming of the Internet+ era, massive data has been produced in all aspects of social life. How to dig out its hidden value has become a focus of the community, and has also risen to the level of national strategy. In March 2012, the Obama administration announced a plan to invest $200 million to start a "Big Data Research and Development Initiative", which followed another major technological development, the "information superhighway" plan [15] announced in 1993.
A series of figures from McKinsey's 2012 big data report showed that big data had brought $300 billion in revenue to the US health care system annually and €250 billion in revenue to European public administration annually, offered a potential 60% increase in margins for the retail industry, and had reduced product development costs in manufacturing by 50%. Gartner predicted that by 2015 more than 85% of Fortune 500 companies would lose their competitive advantage in the big data competition [11]. The market research firm IDC predicted that the big data technology and services market would rise from $3.2 billion in 2010 to $16.9 billion in 2015, an annual growth rate of about 40% [9]. From these statistics it is easy to see that big data is widely applied and of great value. In terms of the concept and research status of big data, the core force that promotes its development is big data processing technology: whether the enormous scientific and economic value hidden in massive data can be dug out depends on it. Big data technology has therefore become a research hotspot [13]. Traditional data processing models are limited in memory and processing capability and cannot meet actual demands. With the development of science and technology, parallel processing mechanisms such as MPI, PVM and MapReduce have been widely used in the past years. However, with deeper research on machine learning, a large number of applications require iterative algorithms, and the results of processing such applications on traditional data processing architectures are not satisfactory. Spark, an open-source general-purpose parallel cloud computing platform developed by the UC Berkeley AMP Lab, meets these needs [24].

*Corresponding author. E-mail: linww@scut.edu.cn.
0926-6801/17/$35.00 © 2017 – IOS Press and the authors. All rights reserved.
Spark is the latest parallel distributed computing framework on the big data technology chain, mainly based on in-memory computing. Issues related to memory computing have received support from the National Natural Science Foundation and related research has started; memory computing is also backed by many companies in the market, such as Alibaba, Baidu and NetEase. Researchers are particularly concerned with the performance of data processing platforms. At home and abroad, most research has focused on the differences between MapReduce [14] and Spark, the integration of memory computing and data mining algorithms on Spark [20], and improvements to clustering algorithms [19] combined with the Spark platform. The decision tree study on the two architectures in [24] shows that Spark is more suitable for iterative algorithms, but does not investigate the performance differences of the two architectures in depth. The study in [21] likewise points out performance differences between the two architectures in combination with the k-means algorithm. The latest research on the performance differences of the two architectures analyzes results only by experiment; theoretical analysis from a mathematical point of view is rare.

In this paper we first give implementations of the k-means clustering algorithm on MapReduce and Spark. Then we focus on the theoretical performance differences of the two architectures from a mathematical point of view. Finally we use experiments to verify the validity of the theoretical analysis.

2. Two implementations of k-means algorithm

2.1. Overview of k-means algorithm

K-means is a distance-based, unsupervised clustering algorithm. It has been widely used in science, industry, business and elsewhere [17]. Its cluster similarity criterion is the distance between data objects: data in the same cluster is similar, and data in different clusters is dissimilar.
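Before the formal definitions below, the basic procedure — assign every data object to its nearest center, then recompute each center as the mean of its cluster, until the centers stabilize — can be sketched in plain Python (a minimal illustration under our own naming, not the paper's code):

```python
import random

def kmeans(data, k, max_iter=100, tol=1e-9):
    """Minimal serial k-means: assign each object to its nearest center,
    then recompute each center as the mean of its cluster, until stable."""
    centers = random.sample(data, k)  # pick k initial centers from the data set
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in data:
            # index of the nearest center (squared Euclidean distance)
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
            clusters[j].append(x)
        # mean of each cluster; keep the old center if a cluster is empty
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        moved = sum(sum((a - b) ** 2 for a, b in zip(m, n))
                    for m, n in zip(centers, new_centers))
        centers = new_centers
        if moved < tol:  # centers no longer change: converged
            break
    return centers
```

On two well-separated groups of points the sketch recovers the group means, which is exactly the fixed point of the criterion function introduced next.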
The clustering criterion function is the sum of squared deviations, defined as:

G_c = \sum_{j=1}^{c} \sum_{k=1}^{n_j} \| x_k^{(j)} - m_j \|^2

For each data object x_i, the class it belongs to is computed as:

c_i = \arg\min_j \| x_i - m_j \|^2

where m_j is the center of cluster j.

The new center of cluster j is computed as:

m_j^{new} = \sum_{i=1}^{n} w_{ij} x_i / \sum_{i=1}^{n} w_{ij}

where x_i is a data object and w_{ij} indicates whether x_i belongs to class j: w_{ij} = 1 if it does, otherwise w_{ij} = 0.

The k-means algorithm [30] is implemented as follows:

Input: data set D, number of clusters k
Output: k clusters
Select k data objects from D as initial centers
Repeat
    For each data object x_i in D
        compute the distance from x_i to each center
        assign x_i to the nearest cluster
    End For
    Recompute each cluster center as the mean of the data objects in it
Until the cluster centers no longer change [17,30]

From the implementation point of view, the algorithm is inefficient when it requires multiple iterations over massive data, and then cannot meet the needs of practical applications. Parallel implementations of k-means solve this problem [17]. The following sections introduce the parallel implementations of k-means on MapReduce and Spark.

2.2. Parallel implementation of k-means based on MapReduce

As discussed in Section 2.1, the key to parallelizing the algorithm is that different samples can be independently assigned to their nearest clusters. The Map and Reduce operations are the same in each iteration of the parallel k-means algorithm [2]. First, we select k samples as centers at random and store them in HDFS files as a global variable. Each iteration then consists of three parts:

Map Function [25]: <key, value> pairs are input by default.
The 'key' is an offset of the current sample relative to the starting point of the input file. The 'value' is a string consisting of the coordinate values of each dimension of the current sample. First, we parse the coordinate values of the current sample from the value and calculate the distance from the data object to each of the k cluster centers. We obtain the index of the nearest cluster and output <key1, value1>, where key1 is the index of the nearest cluster and value1 is a string consisting of the coordinate values of the current sample.

Combine Function: <key, V> is input. The 'key' is a cluster index. 'V' is a linked list of strings consisting of the coordinate values of the samples whose cluster index is key. First, we obtain the coordinate values of each sample from the list. Second, we add the values dimension by dimension and record the total number of samples in the list. We output <key1, value1>, where key1 is the cluster index and value1 is a string consisting of the sample count and the per-dimension coordinate sums.

Reduce Function: <key, V> is input. First we obtain the intermediate results. Second, we compute the new cluster centers through the related operations and update the HDFS files. The next iteration then continues until the results converge. The implementation process is shown in Fig. 1.

Fig. 1. K-means algorithm implementation based on MapReduce.

2.3. Parallel implementation of k-means based on Spark

The implementation of the k-means algorithm on Spark includes two parts [26]: dividing the data among cluster points, and computing the cluster centers through multiple iterations until the results converge. The implementation is mainly achieved by the Driver, Mapper, Combiner and Reducer classes [29].

Driver: an underlying driver class of the initial program; it handles the data set through the related functions.

Fig. 2.
K-means algorithm implementation based on Spark.

Mapper: a class that determines the initial cluster centers and partitions the initial data set. It calculates the distance from each data object in the RDD to the cluster centers, merges each object into the class at the nearest distance, and finally re-elects the new cluster centers. The intermediate results generated by an iteration are transformed into a new data object RDD [4].

Combiner: a class that combines the intermediate RDD data sets. Because the Map process produces a large number of intermediate RDD results, combining them reduces traffic and avoids network congestion on the Spark platform.

Reducer: a class that reduces the local results produced by the Combiner into global results. It judges the convergence of the cluster centers against a threshold [6]. The implementation process is shown in Fig. 2.

3. Theoretical analysis of algorithm performance on two architectures

As discussed in Section 2, the k-means implementations on both architectures are based on Map and Reduce. The main reason for the performance difference between the two architectures is that Spark [16] computes on in-memory RDDs [4] and does not need to interact with the disk, while Hadoop computes on external memory and must interact with the disk. We now analyze the theoretical performance of the two architectures by execution time, one of the standard metrics of platform performance.

Algorithm execution time consists of computing time, communication time and system execution time. The computing-time complexity of the two architectures is similar. Communication time depends on communication volume and the communication mechanism; Hadoop and Spark are both based on the RPC mechanism, so this difference can be ignored.
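The in-memory versus external-memory distinction can be made concrete with a toy model (all names here are hypothetical, not the APIs of either platform): an external-memory job re-reads its input on every iteration, while an in-memory job reads once and keeps the data set cached, the analogue of Spark's RDD caching.

```python
class DiskSource:
    """Hypothetical data source that counts how many times the data set
    is (re)read from 'disk' (standing in for HDFS)."""
    def __init__(self, records):
        self._records = records
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._records)

def iterate_external_memory(src, step, state, iters):
    """Hadoop-style: each iteration reads its input from disk again."""
    for _ in range(iters):
        state = step(src.read(), state)
    return state

def iterate_in_memory(src, step, state, iters):
    """Spark-style: read once, keep the data set cached across iterations."""
    cached = src.read()
    for _ in range(iters):
        state = step(cached, state)
    return state
```

With any iterative `step` (for k-means, an assign-and-update pass), both variants produce identical results, but the external-memory variant performs one disk read per iteration versus a single read overall — the source of the I/O gap analyzed below.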
In terms of communication volume, Hadoop cannot reuse the data set across iterations, while Spark supports a data set cache policy. Whether the data set is reused directly affects the cost of each iteration; we merge this difference into the execution time. Execution time includes Map, Reduce and I/O operation time. Therefore, the time-consumption difference between the two architectures lies mainly in the system execution time. The specific time analysis is as follows.

The first iteration on both architectures reads data from HDFS. The start and end heartbeat mechanisms of Hadoop are negligible relative to the total time. The second and subsequent iterations follow the same processing scheme; the main difference is I/O time. To analyze the performance differences conveniently, we assume that the cluster is homogeneous, that jobs are evenly distributed to all nodes, and that no node fails during execution. We need the following auxiliary definitions [10]:

Definition 1. Assume that the k-means data processing requires (k + 1) iterations on both architectures, and let Hadoop and Spark each perform a full iteration. The I/O time required for a complete MapReduce pass is T_h on Hadoop and T_s on Spark. The Map time on Hadoop and Spark is t_1 and T_1 respectively; the Reduce time is t_2 and T_2 respectively.

Definition 2. We define the main parameters of a MapReduce pass: input data set S, intermediate output data set S_1, and final output data set S_2. When the data size is x, each Map on Hadoop runs in f(x) and each Reduce in g(x); each Map on Spark runs in F(x) and each Reduce in G(x). All are directly proportional to x, with ratio coefficients α, β, γ and μ respectively.

Definition 3.
The maximum available numbers of Map and Reduce slots in the MapReduce computing system are M and R respectively. During execution the number of Map tasks is X and the number of Reduce tasks started is Y. Data is read from HDFS at rate v_i and written back to disk at rate v_o; data is read from memory at rate V_i and written back to memory at rate V_o. The network transmission rate is v_n. The Map and Reduce initialization overheads on Hadoop are C_1 and C_2; on Spark they are C_3 and C_4. A cluster has N nodes with p CPU cores each, so R = Np and M = 2Np.

Hadoop Map time: this stage reads data from HDFS, executes the Map computation and writes the intermediate results back to disk. Each Map input is S/X, so the time consumed is:

t_1 = S/(X v_i) + f(S/X) + S_1/(X v_o) + C_1    (1)

Hadoop Reduce time: this stage reads the sorted intermediate Map output, executes the Reduce computation and outputs the results. Each Reduce input is S_1/Y:

t_2 = S_1/(Y v_i) + g(S_1/Y) + S_2/(Y v_o) + C_2    (2)

The I/O time of a full MapReduce pass is therefore:

T_h = S/(X v_i) + S_1/(X v_o) + S_1/(Y v_i) + S_2/(Y v_o)    (3)

Similarly, the time consumed by each stage on Spark is:

T_1 = S/(X V_i) + F(S/X) + S_1/(X V_o) + C_3    (4)

T_2 = S_1/(Y V_i) + G(S_1/Y) + S_2/(Y V_o) + C_4    (5)

T_s = S/(X V_i) + S_1/(X V_o) + S_1/(Y V_i) + S_2/(Y V_o)    (6)

For both Hadoop and Spark, we assume each Map transmits S_1/(XY) of data to each Reduce, so the network transmission time of one iteration is:

t_n = S_1/(X Y v_n)    (7)

The derivation above does not consider task scheduling. Scheduling is inevitable in practice, because application data sets are too large.
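The stage-time formulas above can be turned into a small calculator (a sketch with assumed units — data sizes in MB, rates in MB/s, times in seconds; the function and variable names are ours):

```python
def map_time(S, S1, X, v_read, v_write, f, C):
    """Per-Map time, Eq. (1)/(4): read a split of S/X, compute f(S/X),
    write S1/X of intermediate output, plus initialization overhead C."""
    return S / (X * v_read) + f(S / X) + S1 / (X * v_write) + C

def reduce_time(S1, S2, Y, v_read, v_write, g, C):
    """Per-Reduce time, Eq. (2)/(5): read S1/Y, compute g(S1/Y), write S2/Y."""
    return S1 / (Y * v_read) + g(S1 / Y) + S2 / (Y * v_write) + C

def io_time(S, S1, S2, X, Y, v_read, v_write):
    """Pure I/O time of one full Map+Reduce pass, Eq. (3)/(6)."""
    return (S / (X * v_read) + S1 / (X * v_write)
            + S1 / (Y * v_read) + S2 / (Y * v_write))
```

For example, with S = S_1 = S_2, `io_time(S, S, S, X, Y, 100, 100) / io_time(S, S, S, X, Y, 10_000, 10_000)` evaluates to 100 for any S, X and Y: when disk runs at 100 MB/s and memory at 10 000 MB/s, the I/O ratio is exactly the ratio of the rates, reproducing the paper's T_h/T_s = 100.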
The numbers of scheduling waves for Map and Reduce are:

λ_m = ⌈X / M⌉    (8)

λ_r = ⌈Y / R⌉    (9)

When t_1 ≥ M t_n, Reduce execution does not need to wait. In practical applications, the time to complete one iteration on Hadoop and on Spark is t and t' respectively:

t = λ_m t_1 + M t_n + (λ_r − 1) X t_n + λ_r t_2    (10)

t' = λ_m T_1 + M t_n + (λ_r − 1) X t_n + λ_r T_2    (11)

The time required for (k + 1) iterations on the two architectures is:

T_hadoop = k t = k λ_m t_1 + k M t_n + k (λ_r − 1) X t_n + k λ_r t_2    (12)

T_spark = k t' = k λ_m T_1 + k M t_n + k (λ_r − 1) X t_n + k λ_r T_2    (13)

The I/O time of (k + 1) iterations is:

T'_h = k T_h;  T'_s = k T_s    (14)

Therefore, the performance difference between the two architectures can be illustrated by execution time and I/O consumption. To describe the problem more intuitively, we give the parameters specific values. Following common practice, we set Y = 1.75 N p, which gives λ_r = 2. For convenience of calculation and to limit workload imbalance, we assume S = S_1 = S_2, v_i = v_o = 100 Mb/s, V_i = V_o = 10 Gb/s and v_n = 1 Gb/s, which gives T_h / T_s = 100.

Keeping the same rates v_i, v_o, V_i, V_o, v_n, let the other parameters be M = 12, R = 6, N = 3, p = 2, C_1 = C_3 = 0.3 s, C_2 = C_4 = 0.2 s, α = 0.8 s/MB, β = 0.9 s/MB, γ = 0.1 s/MB and μ = 0.2 s/MB; then T_hadoop / T_spark = 18. If instead C_1 = 3 s, C_2 = 2 s, C_3 = 0.2 s, C_4 = 0.1 s, α = 1.8 s/MB, β = 2 s/MB, γ = 0.04 s/MB and μ = 0.05 s/MB, then T_hadoop / T_spark = 41.

These results show that the per-stage overhead and execution rate have a great influence on architecture performance. In practical applications the data volume may reach the TB level or more, making the difference in execution time between the two architectures much more pronounced; bandwidth may also become a bottleneck for both architectures. Overall, the results show that Hadoop takes longer than Spark in both I/O time and total time.
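The per-iteration model of Eqs. (8)–(11) can be scripted end to end (a sketch under assumed units of MB and s/MB; the parameter values follow the paper's first example where stated, but the data size S is our own choice, so only the ordering — Spark faster than Hadoop — is meaningful here, not the exact ratio):

```python
from math import ceil

def iteration_time(S, S1, S2, X, Y, M, R,
                   v_read, v_write, v_net,
                   comp_map, comp_reduce, C_map, C_reduce):
    """One full iteration per Eq. (10)/(11):
    t = λm·t_map + M·tn + (λr − 1)·X·tn + λr·t_reduce,
    with λm = ⌈X/M⌉, λr = ⌈Y/R⌉ and tn = S1/(X·Y·v_net)."""
    lam_m, lam_r = ceil(X / M), ceil(Y / R)
    t_map = S / (X * v_read) + comp_map * (S / X) + S1 / (X * v_write) + C_map
    t_red = (S1 / (Y * v_read) + comp_reduce * (S1 / Y)
             + S2 / (Y * v_write) + C_reduce)
    t_net = S1 / (X * Y * v_net)
    return lam_m * t_map + M * t_net + (lam_r - 1) * X * t_net + lam_r * t_red

# Illustrative parameters loosely following the paper's first example:
# N = 3 nodes, p = 2 cores, so R = N*p = 6, M = 2*N*p = 12, Y = 1.75*N*p;
# 1000 MB of data; disk at 100 MB/s, memory at 10 000 MB/s, network 1000 MB/s.
S, X, Y, M, R = 1000.0, 12, 10.5, 12, 6
t_hadoop = iteration_time(S, S, S, X, Y, M, R, 100.0, 100.0, 1000.0,
                          0.8, 0.9, 0.3, 0.2)
t_spark = iteration_time(S, S, S, X, Y, M, R, 10000.0, 10000.0, 1000.0,
                         0.1, 0.2, 0.3, 0.2)
```

With these values the Hadoop-side iteration is several times slower than the Spark-side one, consistent with the qualitative conclusion above; the exact ratio shifts with S, the rates and the per-MB compute coefficients.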
Using execution time to measure the performance of the two architectures, we conclude that Spark is superior to Hadoop.

4. Experiments and results

There are many clustering algorithms. Their implementation steps differ with the underlying algorithmic ideas, so the resulting clustering effects differ, and the effect of a clustering algorithm varies across practical applications. To illustrate the performance differences between the two architectures, this paper analyzes clustering algorithm implementations on the two architectures from a mathematical point of view, and the experiments use a text data set to compare the two architectures while varying the number of iterations. The comparison is therefore based on execution time rather than clustering effect.

4.1. Experimental environment

The experiments used one server and three virtual hosts created with VMware Workstation. We used CDH5 as the Hadoop and Spark platform and CentOS 6.5 x64 as the node operating system, with Hadoop 2.5.0-cdh5.3.2 as the Hadoop benchmark, Spark 1.2.0 as the Spark benchmark [22] and JDK 1.7 for Java.

4.2. Experimental data

The experimental data is a text data set of social networking users' behaviors [1]. All data is stored in .csv file format and packaged in multiple tar.gz files.

User information format: [user id]\t[user text], for example: 369319 zzzop. User relationship network format: [user id]\t[crawled page count]\t[friend count]\t[friend id list]\t[fans count]\t[fans list], for example: 1.2.3..htm1 14215 6 hamas jkaneko caol_ila manwomanfilm public_design_center Kaminogoya 4 hamas lawmn shamroytkwshnsk.

4.3. Results and analysis

The experiment used the standard plain-text data set for testing the k-means algorithm.
By changing the number of iterations and comparing the average execution time and I/O time of the two architectures, we can illustrate their performance differences. Fig. 3 shows the average execution time of the k-means algorithm on the two architectures. As we can see from Fig. 3, the processing time of MapReduce increases with the number of iterations, while the processing time of Spark remains relatively stable. For the same number of iterations, the processing time of MapReduce is longer than that of Spark, and the average execution time of MapReduce is about 50 times that of Spark. This conclusion is consistent with the theoretical analysis; the experimental environment and the different parameter values used in the theoretical analysis account for the deviation.

Figure 4 shows the I/O time of the k-means algorithm on the two architectures. As we can see from Fig. 4, the ratio of the I/O time of MapReduce to that of Spark increases with the number of iterations. For the same number of iterations, the I/O time of MapReduce is longer than that of Spark: the iterative processing time is mainly I/O time, and the I/O time of MapReduce is about 60 times that of Spark. This conclusion is also consistent with the theoretical analysis; the experimental environment, the different parameter values of the theoretical analysis and the smaller experimental data set account for the deviation.

In a word, the experimental results show that the execution time and I/O time of Spark are significantly less than those of MapReduce, so Spark is superior to MapReduce in terms of time consumption. Moreover, the experimental results are consistent with the theoretical analysis of Section 3, which verifies its validity.

Fig. 3. The average execution time comparison under two kinds of architecture.

Fig. 4.
The I/O time comparison under two kinds of architecture.

5. Conclusions and future work

In this paper we introduced the implementation steps of the k-means algorithm and its implementations on MapReduce and Spark. We then focused on the theoretical performance differences of the two architectures for a clustering algorithm from a mathematical point of view. Finally, the experiments showed that as the number of iterations increases, the execution time of MapReduce increases significantly while that of Spark changes little; that is, the performance of Spark is superior to that of MapReduce.

In future work, we plan to further analyze the performance differences of the two architectures with respect to scalability. MapReduce is based on external-memory computation [7,12] and Spark on in-memory computation, so the memory consumed in processing [18,23] data can also affect architecture performance [3,27,28]. Memory optimization is thus one of the most important directions for future research [5,8].

Acknowledgements

Our sincere appreciation goes to the anonymous reviewers for their helpful comments and suggestions. This work is partially supported by the National Natural Science Foundation of China (Grant No. 61402183), Guangdong Natural Science Foundation (Grant No. S2012030006242), Guangdong Provincial Scientific and Technological Projects (Grant Nos. 2016A010101007, 2016B090918021, 2014B010117001, 2014A010103022, 2014A010103008, 2013B010202001 and 2013B010401021), Guangzhou Civic Science and Technology Project (Grant Nos. 201607010048 and 201604010040) and the Fundamental Research Funds for the Central Universities, SCUT (No. 2015ZZ0098).

References

[1] http://www.datatang.com/data/list.
[2] Apache Hadoop, available at: http://hadoop.apache.org/.
[3] Apache Spark documentation, 2014, available at: https://spark.apache.org/documentation.html.
[4] Apache Spark research, 2014, available at: https://spark.apache.org/research.html.
[5] H.
Byun, A reliable data delivery scheme with delay constraints for wireless sensor networks, Journal of High Speed Networks 21(3) (2015), 195–203. doi:10.3233/JHS-150520.
[6] G. Feng and Y. Ma, A distributed frequent itemset mining algorithm based on Spark, in: Computer Supported Cooperative Work in Design (CSCWD), 2015 IEEE 19th International Conference on, 6–8 May 2015, 2015, pp. 271–275.
[7] L. Feng, Research and implementation of memory optimization in cluster computing engine Spark, Tsinghua University, 2013.
[8] U. Fiore, F. Palmieri, A. Castiglione and A. De Santis, A cluster-based data-centric model for network-aware task scheduling in distributed systems, International Journal of Parallel Programming 42(5) (2014), 755–775. doi:10.1007/s10766-013-0289-y.
[9] J. Gantz and D. Reinsel, 2011 digital universe study: Extracting value from chaos, available at: http://www.b-eye-network.com/blogs/devlin/archives/2011/071.
[10] Y. Gao, W. Zhou and J. Han, An evaluation model on key technologies of large-scale graph data processing, Journal of Computer Research and Development 51(1) (2014), 1–16. doi:10.2190/EC.51.1.a.
[11] F. Gu, X. Liu and C. Zuo, Study on carriers' mobile Internet development strategy in the context of big data, Designing Techniques of Posts and Telecommunications 8 (2012), 21–24.
[12] C. Guo, B. Liu and W. Lin, Research on performance of big data computing and query processing based on Impala, Application Research of Computers 32(5) (2015), 1331–1334.
[13] Y. Huang, A study on the analysis of the research hotspots and development trends of big data overseas, Journal of Intelligence 33(6) (2014), 99–104.
[14] Y. Hui and S. Wu, Sequence-growth: A scalable and effective frequent itemset mining algorithm for big data based on the MapReduce framework, in: Big Data (BigData Congress), 2015 IEEE International Congress on, June 27–July 2, 2015, IEEE, 2015, pp. 393–400.
[15] G.
Li, Scientific value of big data research, Communications of the China Computer Federation 8(9) (2012), 8–15.
[16] W. Li, Research on Spark for big data processing, Modern Computer 3 (2015), 55–60.
[17] Y. Liang, Research on parallelization of data mining algorithms based on the distributed platforms Spark and YARN, Sun Yat-sen University, 2014.
[18] W. Lin, An improved data placement strategy for Hadoop, Journal of South China University of Technology (Natural Science Edition) 40(1) (2012), 153–158.
[19] R. Qiu, The parallel design and application of the CURE algorithm based on the Spark platform, South China University of Technology, 2014.
[20] S. Rathee, M. Kaul and A. Kashyap, R-Apriori: An efficient Apriori based algorithm on Spark, in: PIKM '15: Proceedings of the 8th Workshop on Ph.D. Workshop in Information and Knowledge Management, ACM, New York, NY, USA, 2015, pp. 27–34.
[21] G. Satish and A. Rohan, Comparing Apache Spark and MapReduce with performance analysis using K-means, International Journal of Computer Applications 113(1) (2015), 8–11.
[22] Scala, available at: http://www.scala-lang.org.
[23] X. Tu, B. Liu and W. Lin, Survey of big data, Application Research of Computers 31(6) (2014), 1613–1623.
[24] H. Wang, B. Wu, S. Yang and B. Yang, Research of decision tree on YARN using MapReduce and Spark, in: World Congress in Computer Science, Computer Engineering, and Applied Computing, 2014, available at: http://www.world-academy-of-science.org/.
[25] X. Wang, Clustering in the cloud: Clustering algorithms to Hadoop Map/Reduce framework, Department of Computer Science, Texas State University, 2010.
[26] Z. Yang, The research of recommendation system based on Spark platform, University of Science and Technology of China, 2015.
[27] J. Zhang, T. Yang and C. Zhao, Load balancing and data aggregation tree routing algorithm in wireless sensor networks, Journal of High Speed Networks 21(2) (2015), 121–129. doi:10.3233/JHS-150515.
[28] C. Zhao, C. Xia and C.
Jia, Research and analysis on spatial adaptive strategy of End-hopping system, Journal of High Speed Networks 21(2) (2015), 95–106. doi:10.3233/JHS-150514.
[29] W. Zhao, H. Ma and Y. Fu, Research on parallel k-means algorithm design based on Hadoop platform, Computer Science 38(10) (2011), 166–168.
[30] T. Zhou, J. Zhang and C. Luo, Realization of k-means clustering algorithm based on Hadoop, Computer Technology and Development 23(7) (2013), 17–21.