
Iceberg Cube Construction in Data Mining: The BUC Algorithm and an Implementation

Date: 2023-07-23 20:25:23


1. Iceberg cube background:

waiting...

2. Code implementation:

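Approach: a straightforward simulation from start to finish; my understanding was still incomplete when I wrote it. The first pass computes the set of single items whose frequency is at least min_sup (the code tests with >=). Starting from that set, the code keeps expanding downward, retaining every combination that still meets min_sup, until the maximum number of dimensions is reached.

Principle followed throughout: any item that appears in the data but is not in the set satisfying min_sup is never counted again.

Not implemented: ideally, the most discriminating dimensions should be processed first, that is, dimensions should be processed in order of decreasing cardinality. The higher the cardinality, the smaller each partition and the more partitions there are, which gives BUC more opportunities to prune.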
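The dimension ordering described above is not part of the code in this post. A minimal sketch of how it might look, assuming each record is a list with one value per dimension; the function name order_dims_by_cardinality and the sample rows are made up for illustration:

```python
def order_dims_by_cardinality(rows):
    """Return column indices sorted by number of distinct values, highest first."""
    n_dims = len(rows[0])
    # Cardinality of a dimension = number of distinct values seen in that column.
    cardinality = [len({row[d] for row in rows}) for d in range(n_dims)]
    # sorted() is stable, so dimensions with equal cardinality keep their order.
    return sorted(range(n_dims), key=lambda d: cardinality[d], reverse=True)


# Hypothetical sample: column 0 has 1 distinct value, column 1 has 3, column 2 has 2.
sample_rows = [
    ["a1", "b1", "c1"],
    ["a1", "b2", "c2"],
    ["a1", "b3", "c1"],
]
dim_order = order_dims_by_cardinality(sample_rows)
print(dim_order)
```

The returned index list could then be used to reorder the columns of every record before the cube is built, so that the dimension with the most distinct values is partitioned first.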

Code:

test.csv:

a1,b1,c1,d1
a1,b2,c2,d2
a1,b3,c2,d1
a1,b4,c1,d2
a2,b1,c1,d1
a2,b2,c2,d2
a2,b3,c2,d1
a2,b4,c1,d2
a3,b1,c1,d1
a3,b2,c2,d2
a3,b3,c2,d1
a3,b4,c1,d2
a4,b1,c1,d1
a4,b2,c2,d2
a4,b3,c1,d2
a4,b4,c2,d1

BUC:

with open('test.csv', 'r') as fr:
    data = fr.read().splitlines()

data_count = dict()
for i in range(len(data)):
    data[i] = str(data[i]).split(",")
dims = [x[0] for x in data[0]]  # record representation of each dimension
min_sup = 3  # minimum support value


def BUC(sub, d):
    # reach the max dimension
    if d > len(dims):
        return
    sub_list = list(sub)  # list store like ['a1 b1','a2','b2 c1']
    sub_list = [x.split(" ") for x in sub_list]
    for x in sub_list:
        if len(x) == d - 1:  # expected to expand
            for y in sub_list:
                if len(y) == 1 and y[0] not in x:  # single item and not repeat
                    item_cnt = 0
                    for line in data:
                        minn = 0x3ffff
                        for xx in x:
                            minn = min(minn, line.count(xx))  # count the longer frequency
                        minn = min(minn, line.count(y[0]))  # count the single frequency
                        item_cnt += minn  # total frequency of new item
                    item = " ".join(x + y)  # reform item
                    # filter repeat
                    for key in sub:
                        # filter item like "a1 c1" and "c1 a1"
                        flag = (sorted(item.split(' ')) == sorted(key.split(' ')))
                        item_cnt = -1 if flag is True else item_cnt
                    # add to total sub dict
                    if item_cnt >= min_sup:
                        sub[item] = item_cnt
    print(d, ":", sub)
    BUC(sub, d + 1)


# calculate the single item frequency
for line in data:
    for char in line:
        if char not in data_count:
            data_count[char] = 1
        else:
            data_count[char] += 1
sub = {a: data_count[a] for a in data_count if data_count[a] >= min_sup}
print(1, ":", sub)
# start BUC
BUC(sub, 2)
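For comparison, BUC is usually described as a recursion over partitions rather than a level-by-level join of frequent items. The following is only a rough sketch of that idea, not the code used in this post: it partitions the rows on one dimension at a time and recurses only into partitions with at least min_sup rows. The name buc_partition, the '*' placeholder for an aggregated dimension, and the inlined copy of test.csv are assumptions made for this example.

```python
from collections import defaultdict

# Inlined copy of test.csv so the sketch runs on its own.
CSV_TEXT = """\
a1,b1,c1,d1
a1,b2,c2,d2
a1,b3,c2,d1
a1,b4,c1,d2
a2,b1,c1,d1
a2,b2,c2,d2
a2,b3,c2,d1
a2,b4,c1,d2
a3,b1,c1,d1
a3,b2,c2,d2
a3,b3,c2,d1
a3,b4,c1,d2
a4,b1,c1,d1
a4,b2,c2,d2
a4,b3,c1,d2
a4,b4,c2,d1
"""


def buc_partition(rows, start_dim, cell, min_sup, out):
    """Record the count for `cell`, then expand it on dimensions >= start_dim.

    `rows` must be non-empty; `cell` is a tuple with '*' for aggregated dimensions.
    """
    out[cell] = len(rows)
    for d in range(start_dim, len(cell)):
        # Partition the current rows by their value in dimension d.
        partitions = defaultdict(list)
        for row in rows:
            partitions[row[d]].append(row)
        for value, part in partitions.items():
            # Iceberg pruning: a partition below min_sup cannot contain
            # any more specific cell that reaches min_sup.
            if len(part) >= min_sup:
                child = cell[:d] + (value,) + cell[d + 1:]
                buc_partition(part, d + 1, child, min_sup, out)


rows = [line.split(",") for line in CSV_TEXT.splitlines()]
cells = {}
buc_partition(rows, 0, ("*",) * len(rows[0]), 3, cells)
for cell, count in cells.items():
    print(cell, count)
```

Because each recursive call only sees the rows of its own partition, counts never need to be recomputed by scanning the full data set, and a pruned partition removes all of its descendants in one step.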

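It feels like I haven't written code in a long time... I'm really rusty.

I originally meant to borrow directly from @yanjiaxin1996's version, but it was too long and I didn't read it through, lol.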
